Exploratory Data Analysis Project II Summary

Zhuo Chen
4 min read · May 6, 2021

1. A Glimpse of the data set

GitHub link: https://github.com/Zhuo-Chen-byte

Data set composition part I
Data set composition part II

We use the same data set as in the first project. Each record holds information about a patient (age, gender, whether the patient has specific conditions such as diabetes, and so on) together with the No-show variable, which records whether the patient missed the scheduled appointment ("Yes" means the patient did not show up, "No" means the patient did). Given this information, we explore the correlation between No-show and the remaining variables (Gender, Age, Neighbourhood, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, and SMS_received).
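As a minimal sketch of how one might inspect this composition, here is a tiny hand-made stand-in for the real data (the actual CSV lives in the GitHub repository above; the rows below are illustrative, not the project's data):

```python
import pandas as pd

# Tiny hand-made stand-in for the appointment data; column names match
# those listed in the text, but the values here are made up.
df = pd.DataFrame({
    "Gender":        ["F", "M", "F", "F"],
    "Age":           [62, 56, 8, 30],
    "Neighbourhood": ["JARDIM DA PENHA", "SANTA MARTHA",
                      "JARDIM DA PENHA", "SANTA MARTHA"],
    "Scholarship":   [0, 0, 0, 1],
    "Hipertension":  [1, 0, 0, 0],
    "Diabetes":      [0, 0, 0, 0],
    "Alcoholism":    [0, 0, 0, 0],
    "Handcap":       [0, 0, 0, 0],
    "SMS_received":  [0, 0, 0, 1],
    "No-show":       ["No", "No", "No", "Yes"],
})

# Basic composition checks: shape and target balance.
print(df.shape)
print(df["No-show"].value_counts().to_dict())
```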

2. Logistic Regression

The converted dataframe

The first model we apply is logistic regression. After converting all variables (except Age and the target variable No-show) to dummy variables and splitting the converted data set into training and testing sets at a 70/30 ratio, we set up the model and fit it to the training data. After training and testing, our logistic regression model reached an accuracy of 0.79807 on the training set and 0.79806 on the testing set.

The confusion matrix of our logistic regression model
The cross validation results of our logistic regression model
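A minimal sketch of this pipeline, assuming scikit-learn and a small synthetic stand-in with a reduced column set (the real data set lives in the GitHub repository); the dummy-variable conversion and the 70/30 split follow the description above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a reduced column set (illustrative only).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Gender":       rng.choice(["F", "M"], n),
    "Age":          rng.integers(0, 90, n),
    "SMS_received": rng.integers(0, 2, n),
    "No-show":      rng.choice(["No", "Yes"], n, p=[0.8, 0.2]),
})

# Dummy-encode everything except Age and the target, as described above.
X = pd.get_dummies(df.drop(columns="No-show"), columns=["Gender", "SMS_received"])
y = (df["No-show"] == "Yes").astype(int)  # binary-encode the target

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")
```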

3. Decision Trees

The decision trees model we constructed, with a max depth of 3

The second model we apply is a decision tree. After setting up the model and training and testing it on the same data, our decision tree model reached an accuracy of 0.79807 on the training set and 0.79806 on the testing set. According to this model, the three most important features are whether the patient received an SMS reminder, the patient's age, and whether the patient's neighbourhood is Santa Martha.

The top 3 most important features according to our decision trees model
The confusion matrix of our decision trees model
The cross validation results of our decision trees model
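The tree step above can be sketched like this, again on synthetic stand-in data; the max_depth=3 setting is taken from the figure, and ranking by impurity-based feature_importances_ is one way (assumed here) to obtain the top features:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a reduced column set (illustrative only).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Gender":       rng.choice(["F", "M"], n),
    "Age":          rng.integers(0, 90, n),
    "SMS_received": rng.integers(0, 2, n),
    "No-show":      rng.choice(["No", "Yes"], n, p=[0.8, 0.2]),
})
X = pd.get_dummies(df.drop(columns="No-show"), columns=["Gender", "SMS_received"])
y = (df["No-show"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# max_depth=3 matches the tree shown in the figure.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Rank features by impurity-based importance and keep the top 3.
top3 = sorted(zip(X.columns, tree.feature_importances_),
              key=lambda kv: kv[1], reverse=True)[:3]
print(top3)
```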

4. Bagging

The third model we apply is bagging. After setting up the model and training and testing it on the same data, our bagging model reached an accuracy of 0.79807 on the training set and 0.79806 on the testing set, matching the logistic regression and decision tree results.

The confusion matrix of our bagging model
The cross validation results of our bagging model
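A sketch of the bagging step on the same kind of synthetic stand-in data. The post does not say which base estimator or ensemble size was used, so scikit-learn's default base estimator (a decision tree) and n_estimators=50 are assumptions here:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a reduced column set (illustrative only).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Gender":       rng.choice(["F", "M"], n),
    "Age":          rng.integers(0, 90, n),
    "SMS_received": rng.integers(0, 2, n),
    "No-show":      rng.choice(["No", "Yes"], n, p=[0.8, 0.2]),
})
X = pd.get_dummies(df.drop(columns="No-show"), columns=["Gender", "SMS_received"])
y = (df["No-show"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Bagging: fit many base estimators on bootstrap resamples and vote.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
train_acc = bag.score(X_train, y_train)
test_acc = bag.score(X_test, y_test)
print(f"train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")
```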

5. Random Forests

The fourth model we apply is a random forest. After setting up the model and training and testing it on the same data, our random forest model reached an accuracy of 0.85573 on the training set but only 0.75345 on the testing set, a gap that suggests some overfitting. According to this model, the most important features are the patient's age and whether the patient received an SMS reminder.

The top 3 most important features according to our random forests model
The confusion matrix of our random forests model
The cross validation results of our random forests model
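A sketch of the random forest step on synthetic stand-in data; n_estimators=100 is an assumption, not the project's setting. Unconstrained trees tend to memorize the training set, which is one way the train/test gap reported above can arise:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a reduced column set (illustrative only).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Gender":       rng.choice(["F", "M"], n),
    "Age":          rng.integers(0, 90, n),
    "SMS_received": rng.integers(0, 2, n),
    "No-show":      rng.choice(["No", "Yes"], n, p=[0.8, 0.2]),
})
X = pd.get_dummies(df.drop(columns="No-show"), columns=["Gender", "SMS_received"])
y = (df["No-show"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
train_acc = rf.score(X_train, y_train)
test_acc = rf.score(X_test, y_test)

# The two highest-importance features, by impurity-based importance.
top2 = sorted(zip(X.columns, rf.feature_importances_),
              key=lambda kv: kv[1], reverse=True)[:2]
print(f"train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")
print(top2)
```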

6. Boosting

The fifth model we apply is boosting. After setting up the model and training and testing it on the same data, our boosting model reached an accuracy of 0.79824 on the training set and 0.79806 on the testing set. According to this model, the most important features are again the patient's age and whether the patient received an SMS reminder, the same as for the random forest model.

The top 3 most important features according to our boosting model
The confusion matrix of our boosting model
The cross validation results of our boosting model
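A sketch of the boosting step on synthetic stand-in data. The post does not name the boosting variant, so scikit-learn's GradientBoostingClassifier is used here as one plausible choice:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a reduced column set (illustrative only).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Gender":       rng.choice(["F", "M"], n),
    "Age":          rng.integers(0, 90, n),
    "SMS_received": rng.integers(0, 2, n),
    "No-show":      rng.choice(["No", "Yes"], n, p=[0.8, 0.2]),
})
X = pd.get_dummies(df.drop(columns="No-show"), columns=["Gender", "SMS_received"])
y = (df["No-show"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Boosting: fit shallow trees sequentially, each correcting the last.
boost = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
train_acc = boost.score(X_train, y_train)
test_acc = boost.score(X_test, y_test)
top2 = sorted(zip(X.columns, boost.feature_importances_),
              key=lambda kv: kv[1], reverse=True)[:2]
print(f"train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")
print(top2)
```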

7. Fine Tuning Our Logistic Regression Model

Parameters for fine tuning the logistic regression model

Finally, we return to fine-tune the logistic regression model, which performed the best among all the models we constructed. After being fine-tuned with the given parameters, the logistic regression model reached an accuracy of 0.79811 on the training set and 0.79806 on the testing set: slightly better on the training set than before tuning and the same on the testing set.

The confusion matrix of our fine tuned logistic regression model

In addition, during fine tuning we find that the best results are achieved with C = 10, max_iter = 3, penalty = 'l1', and solver = 'liblinear'.

The best parameters during fine tuning
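A sketch of this kind of grid search on synthetic stand-in data. The project's exact grid is not reproduced here, so the candidate values below are assumptions, chosen so that they include the best parameters reported above (C = 10, max_iter = 3, penalty = 'l1', solver = 'liblinear'):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with a reduced column set (illustrative only).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Gender":       rng.choice(["F", "M"], n),
    "Age":          rng.integers(0, 90, n),
    "SMS_received": rng.integers(0, 2, n),
    "No-show":      rng.choice(["No", "Yes"], n, p=[0.8, 0.2]),
})
X = pd.get_dummies(df.drop(columns="No-show"), columns=["Gender", "SMS_received"])
y = (df["No-show"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Hypothetical grid; liblinear supports both l1 and l2 penalties.
param_grid = {
    "C": [0.1, 1, 10],
    "max_iter": [3, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_)
print(f"best CV accuracy: {grid.best_score_:.4f}")
```

Note that very small max_iter values (such as 3) may trigger convergence warnings from liblinear; the search still completes.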
