1. A Glimpse of the data set
Github link: https://github.com/Zhuo-Chen-byte
We use the same data set as in the first project. It contains information about patients (age, whether they have specific conditions such as diabetes) and whether they showed up for their scheduled appointment, indicated by the No-show variable (Yes for showing up, No for not). Given this information, we explore the correlation between whether a patient showed up (the No-show variable) and the remaining variables (Gender, Age, Neighbourhood, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, and SMS_received).
2. Logistic Regression
The first model we apply is logistic regression. After converting all variables (except Age and the target variable No-show) to dummy variables and splitting the converted data set into training and testing sets at a 70 / 30 ratio, we set up the model and fit it to the training data. After training and testing, our logistic regression model reached an accuracy of 0.798068969082825 on the training set and 0.798057783943543 on the testing set.
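The pipeline above can be sketched as follows. Since the original data file is not included here, the snippet builds a small synthetic stand-in with a few of the report's column names; the exact preprocessing details are assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the appointment data (column names follow the report)
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "Age": rng.integers(0, 100, n),
    "Gender": rng.choice(["F", "M"], n),
    "SMS_received": rng.integers(0, 2, n),
    "No-show": rng.choice(["No", "Yes"], n, p=[0.8, 0.2]),
})

# Dummy-encode every predictor except Age; the target No-show stays as-is
X = pd.get_dummies(df.drop(columns="No-show"), columns=["Gender", "SMS_received"])
y = df["No-show"]

# 70 / 30 train / test split, then fit and score
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```

With `test_size=0.3`, `train_test_split` reproduces the 70 / 30 ratio described above.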
3. Decision Trees
The second model we apply is a decision tree. After setting up, training, and testing the model on the corresponding data sets, our decision tree model reached an accuracy of 0.798068969082825 on the training set and 0.798057783943543 on the testing set. According to this model, the top three most important features are whether the patient received an SMS, the patient's age, and whether the patient's neighbourhood is Santa Martha.
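A minimal sketch of extracting feature importances from a fitted decision tree, again on synthetic stand-in data (the column names are assumptions based on the report):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features and target
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "Age": rng.integers(0, 100, n),
    "SMS_received": rng.integers(0, 2, n),
    "Scholarship": rng.integers(0, 2, n),
})
y = rng.choice(["No", "Yes"], n)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Rank features by impurity-based importance, highest first
ranking = sorted(zip(X.columns, tree.feature_importances_), key=lambda t: -t[1])
for name, score in ranking[:3]:
    print(f"{name}: {score:.3f}")
```

`feature_importances_` sums to 1 across all features, so the printed scores can be read as shares of the tree's total impurity reduction.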
4. Bagging
The third model we apply is bagging. After setting up, training, and testing the model on the corresponding data sets, our bagging model reached an accuracy of 0.798068969082825 on the training set and 0.798057783943543 on the testing set.
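A bagging classifier can be set up as below; by default scikit-learn's `BaggingClassifier` bags decision trees over bootstrap samples. The data here is synthetic, so the numbers will not match the report's.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data with a weak signal in the first column
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 50 bootstrap-trained trees, predictions combined by majority vote
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("train:", bag.score(X_train, y_train))
print("test: ", bag.score(X_test, y_test))
```

As an aside, when several different models report exactly the same accuracy (as logistic regression, the decision tree, and bagging do here), it is worth checking whether they are all predicting the majority class, e.g. by inspecting the value counts of `bag.predict(X_test)`.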
5. Random Forests
The fourth model we apply is random forests. After setting up, training, and testing the model on the corresponding data sets, our random forest model reached an accuracy of 0.8557284665494779 on the training set and 0.7534531636407503 on the testing set. According to our random forest model, the two most important features are the patient's age and whether the patient received an SMS.
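The random forest step can be sketched as below, again on synthetic stand-in data with assumed column names. Note that a large gap between training and testing accuracy, like the one reported above, is a common sign of overfitting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the appointment features
rng = np.random.default_rng(3)
n = 800
X = pd.DataFrame({
    "Age": rng.integers(0, 100, n),
    "SMS_received": rng.integers(0, 2, n),
    "Diabetes": rng.integers(0, 2, n),
})
y = rng.choice(["No", "Yes"], n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("train:", forest.score(X_train, y_train))
print("test: ", forest.score(X_test, y_test))
# Importances are averaged over all trees in the forest
for name, score in sorted(
    zip(X.columns, forest.feature_importances_), key=lambda t: -t[1]
):
    print(name, round(score, 3))
```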
6. Boosting
The fifth model we apply is boosting. After setting up, training, and testing the model on the corresponding data sets, our boosting model reached an accuracy of 0.7982369972081481 on the training set and 0.798057783943543 on the testing set. According to our boosting model, the two most important features are also the patient's age and whether the patient received an SMS, matching the random forest model.
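The report does not say which boosting variant was used; as one plausible choice, a gradient boosting classifier also exposes `feature_importances_`. A minimal sketch on synthetic stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in features (column names follow the report)
rng = np.random.default_rng(4)
n = 800
X = pd.DataFrame({
    "Age": rng.integers(0, 100, n),
    "SMS_received": rng.integers(0, 2, n),
    "Hipertension": rng.integers(0, 2, n),
})
y = rng.choice(["No", "Yes"], n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
boost = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

print("train:", boost.score(X_train, y_train))
print("test: ", boost.score(X_test, y_test))
# Importances accumulated across all boosting stages
for name, score in sorted(
    zip(X.columns, boost.feature_importances_), key=lambda t: -t[1]
):
    print(name, round(score, 3))
```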
7. Fine-Tuning Our Logistic Regression Model
Finally, we return to fine-tune the logistic regression model, which performed best among all the models we constructed. After being fine-tuned over the given parameter grid, the logistic regression model reached an accuracy of 0.7981077448040533 on the training set and 0.798057783943543 on the testing set, slightly better on the training set and unchanged on the testing set.
In addition, the fine-tuning finds that the best results are achieved with C = 10, max_iter = 3, penalty = 'l1', and solver = 'liblinear'.
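A grid search over these hyperparameters can be sketched with `GridSearchCV`; the synthetic data and the exact grid values here are assumptions, but the grid covers the parameters named above. Note that `liblinear` is used because it supports both the l1 and l2 penalties.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Grid over the hyperparameters named in the report
param_grid = {
    "C": [0.1, 1, 10],
    "max_iter": [3, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],  # supports both l1 and l2 penalties
}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```

A very small `max_iter` such as 3 may trigger convergence warnings; `GridSearchCV` still evaluates the resulting model, which is how such a setting can end up selected.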