CMIS2250 – MIS & Data Analysis, FALL (1201) 2020
Final Capstone Project
- Your team must assemble the data yourselves from the downloaded NBA seasons. The general format follows all the other examples: two sheets of data, each in datasheet format (no totals, no missing values, no malformed data or junk, and one header row, not a 2-row header). One sheet already contains the values that we are attempting to predict; it is used to create the models. The other worksheet must be in identical format to the source data except that the target is excluded; this is what we are getting predictions for. And remember, anything that's done to the old data must be done to the new data (e.g. removing, renaming, consolidating or adding variables). These are the two required worksheets for each of the two models – Classification and Prediction.
- You are doing two separate sets of models: the prediction one does wins (numerical), the classification one does playoffs (categorical). These should be done independently from each other.
- Please ensure that wins and playoff status (the two targets) are never used as features for each other's model. Each one is a target only, so exclude wins from the playoffs data and playoff status from the wins data.
- Use Excel as much as possible to clean up your datasets – missing values, malformed values, removing unrelated data, etc.
- Test and score your data using Orange – use 5–6 different algorithms in your first round of training and testing your models.
- From those results, take the top two (or three) algorithms to conduct further testing and “tweaking” in order to improve their performance. Adjust these models using various methods: manipulate the data itself, make adjustments using Orange tools (e.g. impute, feature statistics, discretize, select columns, etc.), or let Orange decide what needs to be changed (e.g. preprocess). Accuracy is a goal, so all the things mentioned to increase accuracy are fair game. Try different things to improve the models. The worst that can happen is that it doesn't help.
- Pick the better of the two in this second round and do one more round with your final model algorithm. Use this round to make some final “tweaks” for possible last-minute improvements.
- This is a realistic simulation, so parts are ambiguous, which is normal. I'll post any common/repeated clarifications to the assignment as time goes by. Please check for updates.
- Pay attention to the details, read the assignment early, and seek clarification on anything that doesn't make sense to your team or that you don’t understand. If something is unequivocally stated in the assignment, that is because it is a required (and graded) part of the assignment.
- Do not spend your time learning about basketball. There are many teams, they play each other, one wins, one loses, and a tremendous amount of statistics are captured and stored – that’s all you need to know about professional basketball to successfully complete this assignment.
- When you write the report, your goal is to clearly and concisely communicate your findings. Structure it as though you were trying to convince me of something (your findings): the clearer you are, the greater the chance I will understand, and the more likely I am to be convinced. Have all team members proof each other’s work and the final draft prior to submission. Once you submit, you cannot take it back. If a reviewer feels they must hunt for information, ask for clarification, or can't see whether or how a question was answered, that is an indication that your team needs to revise it.
- Above all, START EARLY. This project follows the same pattern as all of the examples given in class, but you need to apply that to a novel situation. That can cause confusion and you WILL need to work through challenges – AS A TEAM! I cannot emphasize this enough. No extensions can be given as the semester is over, and the expectation is high as you've already done all the individual skills necessary. This final project just puts them all together, and with the exception of working with raw data from the “wild” (aka the real world), it is the same process as the last DA Assignment.
- Lastly, asking basic questions at the last minute is neither prudent nor efficient. Arranging the data can be done now given the information provided in the last two lectures. Creating basic models can be done now. Starting to make improvements can be done now (and should be done now because this part takes time). Documenting your work can be done throughout the process, beginning now. If you are wondering how to download data or evaluate error rates over the next two weekends, you're unfortunately on your own – another reason for a team-oriented type of assignment. The last minute is for last-minute things: final tweaks to get the last bit of accuracy, the best way to present details, etc.
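The worksheet rules above (target present only in the training sheet, the other model's target excluded, identical feature columns, no empty cells) can be checked mechanically before loading anything into Orange. The following is a minimal sketch, not part of the assignment: the column names `Wins` and `Playoffs` and the function `check_sheets` are assumptions for illustration, and the rows are assumed to be dictionaries keyed by header, such as those produced by Python's `csv.DictReader` from a sheet saved as CSV.

```python
# Hypothetical target column names; substitute the headers from your own worksheets.
TARGETS = {"Wins", "Playoffs"}


def check_sheets(train_rows, score_rows, target):
    """Return a list of problems found; an empty list means the pair looks consistent.

    train_rows and score_rows are lists of dicts keyed by column header.
    """
    problems = []
    train_cols = list(train_rows[0].keys())
    score_cols = list(score_rows[0].keys())

    # The training sheet must contain the target; the scoring sheet must not.
    if target not in train_cols:
        problems.append(f"training sheet is missing target column {target!r}")
    if target in score_cols:
        problems.append(f"scoring sheet still contains target column {target!r}")

    # The other model's target must never be used as a feature.
    for other in sorted(TARGETS - {target}):
        if other in train_cols or other in score_cols:
            problems.append(f"{other!r} is the other model's target; exclude it")

    # Apart from the target, both sheets need the same columns in the same order.
    train_features = [c for c in train_cols if c != target]
    score_features = [c for c in score_cols if c != target]
    if train_features != score_features:
        problems.append("feature columns differ between training and scoring sheets")

    # Datasheet format: no empty cells anywhere.
    for name, rows in (("training", train_rows), ("scoring", score_rows)):
        for i, row in enumerate(rows, start=2):  # row 1 is the header row
            for col, value in row.items():
                if value is None or str(value).strip() == "":
                    problems.append(f"{name} sheet row {i}: {col!r} is empty")
    return problems


# Made-up rows, only to show the call; these are not real NBA figures.
train = [
    {"Team": "A", "PTS": "110.2", "Wins": "50"},
    {"Team": "B", "PTS": "104.7", "Wins": "33"},
]
score = [{"Team": "C", "PTS": "108.9"}]
print(check_sheets(train, score, "Wins"))
```

Running the same check with `"Playoffs"` as the target covers the classification pair. A check like this only confirms the structure; it says nothing about whether the values themselves are sensible.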
Setting Up Your Working File
Once again, the link to the data source is below (note that links to other years are near the top of the page, to move back or forward): https://www.basketball-reference.com/leagues/NBA_2019.html
If you have followed the instructions correctly, you will end up with five 'sets' of data in total. (For some people, this is 5 sheets, but you could also include/exclude data from the same sheet, as we have done in Orange to segregate it.) These data sets are to be used in constructing your data file, ready for uploading to Orange in order to train and test your models (not to predict). This newly constructed dataset is also a required deliverable, along with your Predicted data worksheet; the 5 sets of basketball stats that were downloaded are not. The downloaded datasets are strictly for constructing your one source worksheet for Orange training and testing/scoring, to find the best, most accurate model to use in predicting.
Some hints follow:
- Training (source) data for wins - Wins is known, includes all inputs/features. Test and Score. Evaluate and improve RMSE/R2.
- Scoring (subject) data for wins - Wins is unknown (or deleted, so we pretend we don't know it), includes the same inputs/features. Predictions.
- Training (source) data for playoffs - Playoffs (yes/no) is known, includes all inputs/features. Test and Score. Evaluate and improve CA.
- Scoring (subject) data for playoffs - Playoffs is unknown (or deleted, so we pretend we don't know it), includes the same inputs/features. Predictions.
- Beyond that it is almost the same as your previous assignment. Build models, evaluate accuracy, improve, document.
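For reference, the three measures named in the hints above (RMSE and R2 for wins, CA for playoffs) have simple standard definitions. The sketch below computes them by hand on made-up numbers. It is only meant to show what the scores mean; the function names are illustrative, and the figures Orange reports will also depend on the sampling settings chosen in Test and Score.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error, in wins."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def r_squared(actual, predicted):
    """Share of the variation in the target explained by the predictions."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def classification_accuracy(actual, predicted):
    """Fraction of rows where the predicted class matches the known class."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


# Made-up values, not real NBA results.
known_wins = [50, 40, 30]
predicted_wins = [48, 42, 30]
known_playoffs = ["yes", "no", "no", "yes"]
predicted_playoffs = ["yes", "no", "yes", "yes"]

print("RMSE:", round(rmse(known_wins, predicted_wins), 3))
print("R2:  ", round(r_squared(known_wins, predicted_wins), 3))
print("CA:  ", classification_accuracy(known_playoffs, predicted_playoffs))
```

Lower RMSE is better, while higher R2 and higher CA are better, which is the direction to aim for when improving each model.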
Remember, if you alter your source data variables (columns, not rows), such as by removing, adding, renaming, or consolidating column(s), you MUST do the same to your subject (Prediction) dataset; if you recall, the new dataset (Prediction) must match the structure of your training dataset (prepped data).
Best of luck,