Retention Predictive Modeling Project Part 2
Due: 12/3/2018 at the Data Mining World Championships
In this part of the project you will partition your project data and build the best predictive model possible. Recall that the purpose of this model is to flag, at the time of application to Miami, applicants who are at high risk of not returning after their first year.
The in-class competition will be decided by whose model identifies the most 1’s (students not retained) in the 2018 data among the 100 students with the highest predicted probabilities from that model.
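The competition score described above can be sketched in a few lines. This is only an illustration in plain Python, not the official scoring code: the function name `top_k_hits` is hypothetical, labels are assumed to be coded 1 for not retained and 0 for retained, and ties in predicted probability are broken by row order, which is an assumption.

```python
def top_k_hits(probs, labels, k=100):
    """Count actual 1's (not retained) among the k highest predicted probabilities."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    # Sort row indices by predicted probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Sum the 0/1 labels of the k top-ranked rows.
    return sum(labels[i] for i in order[:k])


# Toy example with k=3: the three highest scores are 0.9, 0.8 and 0.7.
probs = [0.9, 0.2, 0.8, 0.7, 0.1]
labels = [1, 1, 0, 1, 0]
hits = top_k_hits(probs, labels, k=3)
print(hits)
```

Computing the same count on your validation data gives a rough preview of how a candidate model might fare in the competition.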
Data Partitioning
The first decision you will have to make is how to partition the data. Recall that you are predicting over time, so consider how year should be handled in the partition.
Should the partition be stratified by year (i.e. some of every year in the training data)?
Should you save a holdout year and use that as your validation data?
Use at least 70% of the data for the training dataset and keep 30% or less for the validation or holdout dataset. Make sure the training dataset is large enough, since we will be using cross-validation.
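The two partitioning options above can be compared with a small sketch. It is written in plain Python for illustration only; the column name `year`, the function names, and the toy data are all hypothetical, and neither option is being recommended over the other.

```python
import random
from collections import defaultdict


def holdout_year_split(rows, holdout_year, year_key="year"):
    """Train on every year except holdout_year; validate on holdout_year."""
    train = [r for r in rows if r[year_key] != holdout_year]
    valid = [r for r in rows if r[year_key] == holdout_year]
    return train, valid


def stratified_by_year_split(rows, train_frac=0.7, year_key="year", seed=1):
    """Put train_frac of each year's rows in training, the rest in validation."""
    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    by_year = defaultdict(list)
    for r in rows:
        by_year[r[year_key]].append(r)
    train, valid = [], []
    for year in sorted(by_year):
        group = by_year[year][:]
        rng.shuffle(group)
        cut = round(train_frac * len(group))
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid


# Hypothetical toy data: 10 applicants per year for four years.
rows = [{"id": i, "year": 2014 + i % 4} for i in range(40)]
train_s, valid_s = stratified_by_year_split(rows)
train_h, valid_h = holdout_year_split(rows, holdout_year=2017)
print(len(train_s), len(valid_s), len(train_h), len(valid_h))
```

With a holdout year, the training share depends on how large that year is, so check it against the 70% minimum before committing to that design.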
Oversampling
You have the option to create an oversampled training dataset as we did in the last homework. If you choose this option, please be sure that you have enough data in the training dataset. This is not required, but it could improve model performance.
Please leave the validation set in the original proportions so that you do not have to make any adjustments when comparing models.
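One common way to oversample is to resample the rare class with replacement in the training data only. The sketch below is an assumption about how that could look in plain Python, not necessarily the method used in the homework; the column name `not_retained`, the function name, and the 50% target share are hypothetical.

```python
import random


def oversample_events(train_rows, target_key="not_retained", event_share=0.5, seed=1):
    """Resample events with replacement until they are about event_share of training."""
    if not 0 < event_share < 1:
        raise ValueError("event_share must be strictly between 0 and 1")
    rng = random.Random(seed)
    events = [r for r in train_rows if r[target_key] == 1]
    non_events = [r for r in train_rows if r[target_key] == 0]
    if not events or not non_events:
        raise ValueError("need both events and non-events in the training data")
    # Solve n_events / (n_events + n_non_events) = event_share for n_events.
    n_needed = round(event_share * len(non_events) / (1 - event_share))
    extra = max(0, n_needed - len(events))
    resampled = events + [rng.choice(events) for _ in range(extra)]
    out = non_events + resampled
    rng.shuffle(out)
    return out


# Hypothetical training data: 10 events out of 100 rows.
train = [{"id": i, "not_retained": 1 if i < 10 else 0} for i in range(100)]
balanced = oversample_events(train)
print(len(balanced), sum(r["not_retained"] for r in balanced))
```

Only the training rows are passed to the function; the validation rows are never resampled, so their event rate stays at the original proportion.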
Modeling
You are to define the “event” as a student leaving Miami, so that every predicted probability is the probability that someone leaves. All models and results must be constructed in this manner.
I would like for you to build at least the following models:
1. A logistic regression model.
2. A random forest model.
3. A boosted tree model.
4. A neural network model.
ISA 591 students must include at least one model we did not cover in the course.
This is the minimum number of models to attempt (i.e. building only four models will not get you a perfect grade). You can easily add models by using different sets of predictors or settings.
I will be grading you on proper model settings, not model performance:
1. Did you use the proper type of cross-validation or validation?
2. Did you perform model selection?
3. Did you use recommended settings?
You will be required to provide evidence for why you chose your final model. It is best to use multiple measures, such as lift, ROC and model complexity.
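Two of the measures mentioned above, lift and the area under the ROC curve, can be computed directly from validation predictions. The following is a minimal sketch in plain Python under stated assumptions: the function names are hypothetical, labels are coded 1 for not retained, the validation data contain at least one event and one non-event, and lift is taken at a chosen top fraction of scores.

```python
def lift_at(probs, labels, top_frac=0.1):
    """Event rate among the top_frac highest-scored rows divided by the overall event rate."""
    n = len(probs)
    k = max(1, round(top_frac * n))
    order = sorted(range(n), key=lambda i: probs[i], reverse=True)
    top_rate = sum(labels[i] for i in order[:k]) / k
    base_rate = sum(labels) / n  # what a random sample would be expected to find
    return top_rate / base_rate


def auc(probs, labels):
    """Area under the ROC curve: chance that a random event outscores a random non-event."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    # Pairwise comparison; ties count as half a win. Quadratic, fine for a sketch.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy validation set: 2 events among 10 rows.
probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(lift_at(probs, labels, top_frac=0.2), auc(probs, labels))
```

A lift of 2.5 in the top 20% would mean the model finds two and a half times as many non-returning students there as a random sample of the same size, which is the kind of statement the memo asks for.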
On the day of the competition you will only evaluate the model you chose to be the best using the new 2016 data. We will only model the Domestic students. The data will be in a similar format as the original domestic.csv you downloaded. You can use your code to update any variables necessary.
Deliverables
1. A business memo, written for anyone in Miami’s administration, including:
a. A summary of the predictive accuracy of your chosen model (think about interpreting the lift of your model vs. a random sample).
b. The variables that need to be included in the model.
c. The most important variable for predicting retention.
d. A cutoff (predicted probability) at which Miami should set its systems to flag a student as at risk of not returning after the first year.
e. Information on how the model will perform in practice. In other words, what percent of students will your model flag as “not retained”?
f. A discussion of variables that you think should be collected and that might improve this model, in light of the problem (so no college performance data).
Your memo must have correctly labeled tables and figures that are correctly referred to in the document. Your memo should not tell the “story” of what you did. It should discuss only the outcome, i.e. your final model. I will grade this more stringently in this part of the project.
2. An appendix (this can be more extensive and may be created from R Markdown) describing:
a. How and why the data was partitioned.
b. The type of model used and its settings, so that someone can re-create it.
c. A summary of the model’s performance (lift, ROC, misclassification, etc.) and why you chose this model. Use graphs or constructed tables, not raw R output.
3. Upload your model code to Canvas.
4. Upload your project write-up to Canvas.
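For memo items 1d and 1e, the effect of a candidate cutoff can be summarized from validation predictions. The sketch below is one possible way to do this in plain Python; the function name, the dictionary keys, and the example cutoff of 0.65 are hypothetical and are not a recommended value.

```python
def flag_summary(probs, labels, cutoff):
    """Summarize what happens if every student with probability >= cutoff is flagged."""
    flagged = [y for p, y in zip(probs, labels) if p >= cutoff]
    n_flagged = len(flagged)
    total_events = sum(labels)
    return {
        # Share of all students the system would flag.
        "pct_flagged": 100 * n_flagged / len(probs),
        # Among flagged students, share who actually did not return.
        "pct_flagged_who_leave": 100 * sum(flagged) / n_flagged if n_flagged else 0.0,
        # Among students who did not return, share the flag caught.
        "pct_leavers_caught": 100 * sum(flagged) / total_events if total_events else 0.0,
    }


# Toy validation set: 2 events among 10 rows.
probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
summary = flag_summary(probs, labels, cutoff=0.65)
print(summary)
```

Running this over a range of cutoffs on validation data left in its original proportions shows the trade-off between how many students are flagged and how many of those who leave are caught.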