## Classification Problem: To buy, or not to buy, that is the question

### CONTEXT

The world of stock market trading is buzzing with new strategies on how to maximize profits using Machine Learning and Data Analytics. And that is where you, as a Junior Data Scientist, step in!

Companies spend billions of dollars every year on research to try to jump ahead of the competition and maximize profits. Let’s say your team is tasked with identifying which stocks to buy this year and which to avoid. Finding value in stocks is an art that very few have mastered!

Can you be one of them?!

The last column of the dataset represents the class of each stock, where:

• if the value of a stock increases during the year (Jan-Dec), then `class=1`;
• if the value of a stock decreases during the year (Jan-Dec), then `class=0`.

In other words, stocks that belong to class `1` are stocks that one should buy at the start of the year (Jan) and sell at the end of the year (Dec).

The last column, `class`, is therefore a binary label for each stock, where

• `1` identifies a stock that a hypothetical trader should BUY at the start of the year and sell at the end of the year for a profit;
• `0` identifies a stock that a hypothetical trader should NOT BUY, since its value will decrease, meaning a loss of capital.
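Since the label sits in the last column, a natural first step is to separate it from the features by position. The snippet below is a minimal sketch on a toy table; the feature column names (`revenue_growth`, `debt_ratio`) and the values are made up for illustration and are not taken from the project dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; feature names and values are hypothetical.
df = pd.DataFrame(
    {
        "revenue_growth": [0.12, -0.05, 0.30, 0.01],
        "debt_ratio": [0.40, 0.75, 0.20, 0.55],
        "class": [1, 0, 1, 0],
    }
)

# The label is the last column, so split by position rather than by name.
X = df.iloc[:, :-1]  # every column except the last: the features
y = df.iloc[:, -1]   # the last column: 1 = BUY, 0 = NOT BUY

# Inspect the class balance before modelling, since imbalance affects F1.
print(y.value_counts())
```

Splitting by position keeps the code working even if the label column is named differently in your copy of the data.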

At the end of the project, your Machine Learning model must be able to predict whether you must BUY or NOT BUY a particular stock! Isn’t that cool?

---

Some things to be aware of:

1. Some column values may be missing (`nan` cells). If you find yourself in such a scenario, your team can choose the best technique to clean each dataset. (We covered the relevant methods, `dropna` and `fillna`, in the lab.)
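As a reminder of how the two lab methods differ, here is a small sketch on a toy table. The column names `feature_a` and `feature_b` are hypothetical, and filling with the column median is only one possible choice, not a required one.

```python
import numpy as np
import pandas as pd

# Toy table with missing cells; column names are hypothetical.
df = pd.DataFrame(
    {
        "feature_a": [1.0, np.nan, 3.0, 4.0],
        "feature_b": [np.nan, np.nan, 6.0, 8.0],
        "class": [1, 0, 1, 0],
    }
)

# Count the missing cells in each column before deciding what to do.
missing_per_column = df.isna().sum()

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: keep all rows and fill each feature with its column median.
feature_cols = ["feature_a", "feature_b"]
filled = df.copy()
filled[feature_cols] = filled[feature_cols].fillna(filled[feature_cols].median())

print(missing_per_column)
print(f"rows after dropna: {len(dropped)}, rows after fillna: {len(filled)}")
```

Dropping rows is simple but discards data (half of the toy table here), while filling keeps every row at the cost of introducing estimated values. If you fill, consider computing the fill values on the training split only, so that no information from the evaluation data leaks into the model.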

You can use ANY classification approach, such as logistic regression, Naïve Bayes, K-Nearest Neighbours, or other methods.
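One way to choose among these approaches is to compare them with cross-validated F1. The sketch below assumes scikit-learn and uses synthetic data from `make_classification` as a stand-in for the project dataset; the model settings (`max_iter=1000`, `n_neighbors=5`, 5 folds) are illustrative defaults, not tuned or recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with the cleaned project features and labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Logistic regression and KNN are scale-sensitive, so they get a scaler.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "naive_bayes": GaussianNB(),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean F1 over 5 cross-validation folds for each candidate model.
mean_f1 = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    mean_f1[name] = scores.mean()

# Print the candidates from best to worst.
for name, score in sorted(mean_f1.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean F1 = {score:.3f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling is refit inside each fold, which avoids leaking statistics from the validation fold into training.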

Final Jupyter notebook or model used for final submission

Project report in PDF that contains all the details of the major steps of Project #2, such as:

– A detailed description of your solution, covering pre-processing, feature engineering, model building and comparison, hyperparameter setting and tuning, performance evaluation, any novel ideas, and lessons learnt. (Include only the steps you actually used.)

Make your report self-contained, so that after reading it a student in a data science program should be able to replicate your results. Restrict your total page count to no more than 10 pages using reasonable formatting (including the cover and all appendices). Within that limit, provide all important details of your project as clearly and concisely as possible, and do not include the Jupyter notebook in the appendix.

Model performance (F1 score): 5 points

– F1 score >= 0.45 for 5 + 1 bonus points

– F1 score >= 0.40 for 5 points

– F1 score >= 0.38 for 4 points

– F1 score >= 0.35 for 3 points

– F1 score >= 0.30 for 2 points

– F1 score >= 0.25 for 1 point
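To see how a set of predictions translates into these thresholds, the sketch below computes F1 by hand and looks up the matching points. Two things are assumptions rather than part of the brief: that F1 is measured on the positive class (`class=1`), and that a score below 0.25 earns 0 points. The helper names `f1_score_binary` and `rubric_points` are hypothetical.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for label 1: the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no correct BUY predictions: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# (threshold, points) pairs from the rubric, highest first; 6 = 5 + 1 bonus.
RUBRIC = [(0.45, 6), (0.40, 5), (0.38, 4), (0.35, 3), (0.30, 2), (0.25, 1)]


def rubric_points(f1):
    """Return the points for the first threshold the score reaches."""
    for threshold, points in RUBRIC:
        if f1 >= threshold:
            return points
    return 0  # assumption: scores below 0.25 earn no points


# Made-up labels and predictions: 2 true positives, 2 false positives, 2 false negatives.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]

score = f1_score_binary(y_true, y_pred)
print(f"F1 = {score:.2f}, points = {rubric_points(score)}")
```

In this toy case precision and recall are both 0.5, so F1 is 0.5 and the lookup lands in the top band. In practice `sklearn.metrics.f1_score` gives the same positive-class value with its default settings.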

Quality of report: 5 points