By Abhijit Annaldas, Microsoft.
I was planning the agenda for my one hour talk. After a lot of thought, three things made it onto the agenda: conveying the learning paths, setting up the environment, and explaining the important machine learning concepts. I initially considered various other ways of doing this talk, including hands-on Python with linear regression, explaining linear regression in detail, or just sharing the learning journey I have been on for almost 18 months. But I wanted something that leaves the audience with lots of new information and questions to work on, and that creates curiosity and interest in them. I guess I was able to do that to a decent level. Basically, the aim was to get them started with Machine Learning. That’s how this guide ended up being called Getting Started with Machine Learning in one hour.
The notes for the talk were great for an introductory learning path, but were structured only to help me with the talk. Hence I wrote a machine learning getting started guide out of them, and here it is. I’m very happy with the way this ended up taking shape and I’m excited to share it!
There are two main approaches to learning Machine Learning: the Theoretical Machine Learning approach and the Applied Machine Learning approach. I’ve written about them in my earlier blog post.
Theoretical Machine Learning
Below are the subjects that you can start with (in the order I think is appropriate). For the theoretical approach to learning Machine Learning, these subjects should be studied with great rigor and in depth.
- Linear Algebra - MIT, IISc. Bangalore
- Calculus - Basics, Coursera, Advanced, Coursera
- Probability and Statistics - MIT
- Statistical Learning Theory - MIT, Stanford
- Machine Learning - Coursera, Caltech
- Programming language to implement machine learning research ideas.
The way forward could be reading research papers, implementing research work/new algorithms, developing expertise and picking a specialization further along the research path.
Applied Machine Learning
- Good understanding of the basics of above subjects (1 to 4).
- Machine Learning (important concepts explained below): Coursera, Caltech
- Python or R Programming Language, as per your preference.
- Learn to use popular machine learning, data manipulation and visualization libraries in the chosen programming language. I personally use Python programming language, hence I’ll elaborate on that below.
- Must know Python Libraries: numpy, pandas, scikit-learn, matplotlib
- Other popular python libraries: LightGBM, XGBoost, CatBoost
Quick Start Option
If you want to get a taste of what Machine Learning is about and what it could be like, you can start this way to experiment and get quick hands-on experience. It is not an ideal way if you want to get serious about Data Science in the long run.
- Know Machine Learning Concepts Overview (below)
- Learn Python or R
- Understand and learn to use popular libraries in your language of choice
Python Environment setup
- Python.org Download, Learn OR
- Anaconda Download, Learn
- Code Editor / IDE
- Visual Studio Code (Search and install python extension, pick the most downloaded one)
- Jupyter (Installs with Anaconda)
- Installing python packages
- Managing packages with pip, python’s native tool:
pip install <package-name>
- Managing packages with anaconda:
conda install <package-name>
- Managing Python (native) virtual environments (if multiple environments are needed)
- Create virtual environment:
python -m venv c:\path\to\env\folder
- Command help:
python -m venv -h
- Switch environments:
Run the activate.bat script located in the virtual environment folder (under Scripts on Windows)
- Python (native) virtual environments documentation
- Managing Anaconda virtual environments (if multiple environments are needed)
- Default conda environment -
- List available environments -
conda env list
- Create new environment -
conda create --name environment_name
- Switch to environment -
conda activate environment_name (on conda versions before 4.4: source activate environment_name)
- Anaconda virtual environments documentation
Machine Learning Concepts Overview
- Machine Learning: An approach to finding patterns in a large set of data by learning a function f(x) that generalizes to unseen x, so that the trained Machine Learning Model can make the inferences it was trained for on data it has not seen before.
- Dataset: The data that machine learning is applied to in order to find patterns. For supervised machine learning applications, the dataset contains both x (input/attributes/independent variables) and y (target/labels/dependent variables). For unsupervised applications it contains only the input x, and the output is some sort of learned pattern (like clusters, groups, etc.).
- Train set: A subset of Dataset fed to (train) machine learning algorithm to learn patterns
- Evaluation / Validation / Cross Validation Set: Subset of Dataset not in Train set used to evaluate how the machine learning algorithm is doing.
- Test set: The data to make predictions for. For supervised problems, the target/label y is what has to be predicted, so unlike in the train set it isn’t part of the test set. For unsupervised problems, train and test sets can be identical.
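The train/validation/test split described above can be sketched with scikit-learn's `train_test_split`. This is only a minimal illustration under assumptions of my own: the data is synthetic, the 60/20/20 proportions are arbitrary, and the test set here keeps its labels (a common practice for measuring final performance), unlike a competition-style test set where y is withheld.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset: 100 rows, 3 input columns (x), one target (y).
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# First hold out 20% of the rows as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ... then carve a validation set out of what is left (25% of 80 rows = 20 rows).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(X_train.shape, X_val.shape, X_test.shape)
```

The model is fitted on the train rows only; the validation rows are used to compare models and settings, and the test rows are touched once at the end.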
- Supervised: In supervised problems, the historical data includes the labels (target attribute, outcomes) that need to be predicted for future/unseen data. For example, for housing price prediction we have data about the house (area, # of bedrooms, location, etc.) and its price. Here, after training a machine learning model on the given data (X) and prices (Y - labels), the price (Y) can be predicted for new/unseen data (X).
- Unsupervised: In unsupervised learning, there is no label or target attribute. A typical example would be clustering data based on learned patterns. For a dataset of house details (area, location, price, # of bedrooms, # of floors, built date, etc.), the algorithm needs to find out whether there are any hidden patterns. For example, some houses are very expensive while others are of usual price. Some houses are very big while others are of usual size. With these patterns, records are clustered into groups like Luxury Homes, Non-Luxury Homes, Bungalows, Apartments, etc.
- Reinforcement: In Reinforcement Learning, an ‘Agent’ acts in an ‘Environment’ and receives positive or negative feedback. Positive feedback tells the agent that it has done well, and the agent proceeds with a similar plan/action. Negative feedback tells the agent that it has done something wrong and should change its course of action. The agent and the environment are software/programmed implementations. The core of reinforcement learning is building an agent (or the agent’s behaviour in some way) that learns to successfully accomplish a specific task in an environment.
- Popular Algorithms: Linear Regression, Logistic Regression, Support Vector Machines, K-Nearest-Neighbors, Decision Trees, Random Forest, Gradient Boosting, Ensemble Learning
- Preprocessing: In real-world scenarios, data is rarely clean enough for Machine Learning algorithms to be applied to it directly. Preprocessing is the process of cleaning data so that it can be fed to a machine learning algorithm. Some of the common preprocessing steps are…
- Missing Value: When some of the values are missing, they are usually dealt with by filling in median/mean values, deleting the corresponding row, using the value from the previous row, etc. There are many ways of doing this. What exactly needs to be done depends on the kind of data, the problem being solved and the business goals.
- Categorical Variables: Discrete finite set of values. Like ‘car type’, ‘department’, etc. These values are converted either into numbers or vectors. Conversion to vectors is known as One-Hot Encoding. There are numerous ways of doing this in python. Some machine learning algorithms/libraries themselves handle categorical columns by encoding internally. One way of encoding is using sklearn.preprocessing.OneHotEncoder in scikit-learn.
- Scaling: Proportionately reducing values in columns into a common scale like 0 to 1. Having values in all columns in a common range might improve accuracy and training speed to some extent.
- Text: Text needs to be processed using Natural Language Processing techniques (out of scope of this guide). When it isn’t preprocessed, it is usually excluded from the training data that is fed to a machine learning algorithm.
- Imbalanced datasets: The data shouldn’t be biased or skewed. For example, consider a classification task where an algorithm classifies data into 3 different classes - A, B and C. If the dataset has far fewer (or far more) records of one class than of the others, it is said to be biased/imbalanced. Usually the under-represented class is oversampled in such cases, either by repeating existing records or by synthetically generating new records from existing ones. Some machine learning algorithms/libraries allow providing weights or some parameter to balance out the skew internally, without us doing the heavy lifting of fixing a skewed dataset. For example, SVM: Separating hyperplane for unbalanced classes in scikit-learn.
- Outliers: Outliers need to be dealt with on a case by case basis based on the problem and business case.
- Data Transformation: When a column/attribute in a dataset doesn’t show an inherent pattern, it is transformed into something like log(values), sqrt(values), etc., where the transformed values might have an interesting pattern/uniformity that can be learned. This, again, is decided case by case and needs data exploration to find the right fit.
- Feature Engineering: Feature Engineering is the process of deriving hidden insights from existing data. Consider a housing price prediction dataset which has columns ‘plot-width’, ‘plot-length’, ‘number of bedrooms’ and ‘price’. Here a key attribute, the area of the house, is missing but can be calculated from ‘plot-width’ and ‘plot-length’. So a calculated column, ‘area’, is added to the dataset. This is known as feature engineering. Feature Engineering varies in difficulty: sometimes a derived attribute is in plain sight, like here, and sometimes it is really hidden and needs a lot of thinking.
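A few of the preprocessing steps above (missing values, categorical variables, scaling, feature engineering) can be sketched together as follows. This is an illustration under assumptions, not a recipe: the tiny housing table, its column names (including the hypothetical `house_type` column) and the choice of median filling and min-max scaling are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical housing data with one missing value and one categorical column.
houses = pd.DataFrame({
    "plot_width": [10.0, 12.0, np.nan, 20.0],
    "plot_length": [20.0, 25.0, 30.0, 40.0],
    "bedrooms": [2, 3, 3, 5],
    "house_type": ["apartment", "bungalow", "apartment", "bungalow"],
})

# Missing value: fill the gap in plot_width with the column median.
houses["plot_width"] = houses["plot_width"].fillna(houses["plot_width"].median())

# Feature engineering: derive the area from width and length.
houses["area"] = houses["plot_width"] * houses["plot_length"]

numeric_cols = ["plot_width", "plot_length", "bedrooms", "area"]
preprocess = ColumnTransformer(
    [
        # Scaling: bring every numeric column into the 0 to 1 range.
        ("scale", MinMaxScaler(), numeric_cols),
        # Categorical variable: one-hot encode house_type into 0/1 columns.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["house_type"]),
    ],
    sparse_threshold=0,  # always return a dense array, easier to inspect
)
features = preprocess.fit_transform(houses)

print(features.shape)
print(features)
```

The result has four scaled numeric columns followed by one 0/1 column per house type, which is a shape most scikit-learn estimators accept directly.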
- Training: This is a main step where the machine learning algorithm is trained on the given data to find generalized patterns to be applied on unseen data. Below are some important nitty-gritty details of this phase…
- Feature Selection: Not all features/columns contribute to the learning. Some columns contain data that doesn’t affect the outcome, and such features are removed from the dataset. Which features to train on and which to exclude is decided based on the feature importances given by the machine learning algorithm being applied. Most modern algorithms do provide feature importances. If an algorithm doesn’t provide them, scikit-learn has capabilities which can help with feature selection. Features that are highly correlated with each other are also removed.
- Dimensionality Reduction: Dimensionality reduction also aims to keep the most important information in the features while reducing the dimensionality of the data. The main difference from feature selection based on feature importance is that, in Dimensionality Reduction, the result may consist of derived features rather than a plain subset of the original ones. In other words, we may not be able to map the extracted features back to the original features. You can find more about dimensionality reduction in scikit-learn here.
- Feature Selection vs Dimensionality Reduction: In my opinion, one of the two should serve the purpose. If we do both, we should first select features based on feature importances and then introduce dimensionality reduction. It goes without saying that we should evaluate the performance at every step to understand what’s working and what’s not. Feature selection based on feature importance is easy to interpret because the selected features are a subset of the originals, which isn’t the case with dimensionality reduction.
- Evaluation Metric: An evaluation metric is a metric used to evaluate predictions for their correctness. While training, a machine learning algorithm computes a cost (loss) from its predictions and optimizes that cost function. The cost function is not necessarily convex, and it is not always the same as the metric you report. Though each algorithm has a default evaluation metric, it is recommended to specify the exact evaluation metric as per the business case/problem. For instance, some problems can afford false positives but cannot afford any false negatives. Specifying the evaluation metric lets you control these nitty-gritty details of the model.
- Parameter tuning: Though most of today’s state of the art algorithms have sensible default values for their parameters, it always helps to tune the parameters to control the accuracy of a model and improve overall predictions. Parameter tuning can be done on a trial and error basis by repeatedly changing values and assessing the accuracy. Alternatively, a set of parameter values can be provided so that all (or a sample of) the combinations are tried to find the best one. This can be done using helper functions called Hyper-parameter Optimizers in scikit-learn.
- Overfitting (Variance): Overfitting is a state where the machine learning model almost memorizes the training data and predicts almost perfectly on data that’s already in the training set, but fails to generalize and predict on unseen data. This is also known as the model having high variance. Overfitting can be dealt with using Regularization, fixing inappropriately configured hyperparameters, adding more training data, and holding out part of the dataset to use a correct cross validation(1)(2) strategy.
- Underfitting (Bias): Underfitting is a state where the machine learning model’s predictions are poor even on data already in the training set. This is also known as the model having high bias. Underfitting can be dealt with by adding more informative features, trying a different (more flexible) machine learning algorithm, reducing regularization, etc. Adding more data on its own rarely helps here.
- Bias and Variance trade-off (sweet spot): The goal of model training is to find the sweet spot where the cross validation error is at its minimum. Initially both cross validation and train error are high (Underfitting/high bias). As the model trains, both errors keep dropping up to a point where the cross validation error is at its minimum and still close to the train error. This is the optimal spot. If the model keeps reducing the train error beyond this point, it almost memorizes the train set and ends up overfitting (high variance), which means higher error on unseen data.
- Regularization: At some point, when the model is trying to learn further (reducing train error, tending towards overfitting), regularization helps in countering the overfitting effects. Regularization is usually a penalty, controlled by a parameter, that’s added during cost/error calculation. Machine learning algorithms may not always provide a regularization parameter explicitly. In such cases, there are usually other parameters that can be tuned to introduce regularization to the extent required.
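Several of the training ideas above (parameter tuning, an explicit evaluation metric, and spotting under/overfitting by comparing train and validation error) can be seen in one small experiment. The sketch below assumes synthetic data from `make_regression` and a decision tree whose `max_depth` stands in for "another parameter that introduces regularization"; the grid values and the choice of mean absolute error are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only so the example is self-contained.
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    # Shallow trees are heavily constrained; max_depth=None lets the tree
    # grow until it fits the training rows exactly.
    param_grid={"max_depth": [1, 3, 5, 10, None]},
    # Explicit evaluation metric (scikit-learn negates errors so that
    # "higher is better" holds for every scorer).
    scoring="neg_mean_absolute_error",
    cv=5,
    return_train_score=True,
)
search.fit(X, y)

results = search.cv_results_
for depth, train, val in zip(
    results["param_max_depth"],
    results["mean_train_score"],
    results["mean_test_score"],
):
    print(f"max_depth={depth}: train MAE={-train:.1f}, validation MAE={-val:.1f}")

print("best parameters:", search.best_params_)
```

When both errors are high the tree is underfitting; when the train error is near zero but the validation error is much larger, it is overfitting. `best_params_` is the depth with the lowest cross-validated error among the values tried.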
- Prediction: To make predictions with a trained machine learning model, the model’s prediction method is called with the test dataset as a parameter. The test dataset should be preprocessed exactly the way the training dataset was. In other words, it must be in the same format as the training data that was fed to the machine learning model.
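One way to make sure the test data goes through exactly the same preprocessing as the training data is to bundle the preprocessing and the model in a scikit-learn pipeline. A minimal sketch, assuming synthetic classification data and an arbitrary choice of `StandardScaler` plus `LogisticRegression`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data, used only so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler learns its mean/variance from the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# predict() applies the already-fitted scaler to X_test before classifying,
# so train and test data go through identical preprocessing.
predictions = model.predict(X_test)
print(predictions[:10])
```

Because the scaler is fitted inside the pipeline, there is no separate step to forget when new data arrives later.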
- Other terminologies:
- Model Stacking: When a single machine learning algorithm doesn’t do well, multiple machine learning algorithms are used to make predictions and the predictions are combined in different ways, the simplest being a weighted average of the predictions. Sometimes another machine learning model (a meta-model) is used on top of the predictions of the first-level models. This can go to any level of complexity and can have different pipelines.
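Both flavours of combining models mentioned above can be sketched with scikit-learn. The data is synthetic, the two first-level models and the 0.7/0.3 weights are arbitrary choices for illustration, and `StackingRegressor` (available in scikit-learn 0.22 and later) is just one possible way to put a meta-model on top.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simplest combination: a weighted average of two models' predictions.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
knn = KNeighborsRegressor().fit(X_train, y_train)
weighted = 0.7 * forest.predict(X_test) + 0.3 * knn.predict(X_test)

# Stacking: a meta-model (here linear regression) learns how to combine
# the first-level models' cross-validated predictions.
stack = StackingRegressor(
    estimators=[
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("knn", KNeighborsRegressor()),
    ],
    final_estimator=LinearRegression(),
)
stack.fit(X_train, y_train)
stacked = stack.predict(X_test)

print(weighted[:3], stacked[:3])
```

Whether either combination beats the best single model has to be checked on validation data; it is not guaranteed.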
A fun fact: a majority (over 90%, I guess) of all the machine learning problems solved today are solved using just Random Forests, Gradient Boosted Decision Trees, SVM, KNN, Linear Regression and Logistic Regression.
But there is a set of problems that cannot be solved well using the above techniques. Problems like image classification, image recognition, natural language processing, audio processing, etc. are solved using a technique called Deep Learning. Before starting deep learning, I believe it’s essential to master all of the above concepts first.
Good Deep Learning resources…
- Fast.ai – thanks for the suggestion Pranay Tiwari!
- neuralnetworksanddeeplearning.com - an online book, stresses on theory and fundamentals
- Deep Learning Specialization at Coursera by Andrew Ng
- deeplearningbook.org - an online book
If you know deep learning concepts and want to get your hands dirty, some popular Deep Learning Libraries are: Keras, CNTK, Tensorflow, tflearn, sonnet, pytorch, caffe, Theano
Practice is the most important thing, and this guide would have been incomplete without mentioning it. To practice and master your skills further, below are the things you can do…
- Get datasets from various online data sources. One such popular data source is UCI Machine Learning Repository. Additionally, you can search ‘datasets for machine learning’.
- Participate in online machine learning/data science hackathons. Some of the popular platforms are Kaggle, HackerEarth, etc. If you end up starting with something that’s very difficult, try persisting a bit. If it still feels difficult, park it and find another one. There’s no need to be disappointed. Problems on online hackathons usually have a level of difficulty that is not always suitable for beginners.
- Blog about what you learn! It’ll help you solidify your understanding and thoughts about the subject.
- Follow Data Science and Machine Learning topics on Quora. There is a lot of great advice and many questions/answers to learn from.
- Start listening to podcasts (available on link below)
- Check out some useful links on my Data Science Learning Resources page.
If you are considering the field of Machine Learning/Data Science seriously and you are thinking of making a career switch, think about your motivations and why you’d like to do it.
If you are sure, I have one piece of advice for you. Never ever give up or wonder whether it’s all worth it. It’s definitely worth it, and I can say that because I have walked that path for the last 18 months… almost every day, every weekend and every spare hour of my time (except when I was travelling or totally drowned in my day job commitments). The road ahead to master data science isn’t easy. As they say, “Rome was not built in a day!”. You’ll need to learn a lot of subjects and juggle different learning priorities. Even after learning a lot you’ll still find new things that you have never thought or heard about before. New concepts/techniques that you keep discovering might make you feel that you still don’t know a lot of things and there is a lot more ground to cover. This is common. Just stick with it. Set big goals, plan for small tasks and just focus on the task at hand. If something new comes up, just scribble it down in your diary and get back to it later.
If you have been reading all the way till here, I appreciate your effort and the time you have invested. I hope this guide was useful to you and has made it a little easier for you to get started on your own learning adventure. At some later point, if you think this guide has made some difference in your learning adventure, please come back and leave a comment here. Or reach me at avannaldas .at. hotmail .dot. com. I’d love to hear from you. It’ll give me immense satisfaction to know that this has helped you, and that my effort in putting this together was worthwhile.
This was my biggest write-up ever. I have spent many hours writing, editing and reviewing this. If you see any mistakes or things that can be improved, please let me know in the comments or via email. I’ll fix them as early as I can and will attribute the fix to you. This will help everyone who reads this.
All the best!
Bio: Abhijit Annaldas is a Software Engineer and a voracious learner who has acquired Machine Learning knowledge and expertise to a fair extent. He is improving expertise day by day by learning new stuff and relentless practice, and has extensive experience building enterprise scale applications in different Microsoft and Open Source technologies as a Software Engineer at Microsoft, India since June 2012.
Original. Reposted with permission.