Programming Skills, A Complete Roadmap for Learning Data Science — Part 1
In this series of posts I am going to describe a complete program for learning data science from scratch. These are all skills on my own personal roadmap, one I have been following for the last few years and will continue to follow as I progress into more advanced data science.
I have chosen to start with programming skills. Personally, I have found that I learn better by first attempting the practical implementation of data science techniques before learning the theory, maths and statistics behind them. This approach has enabled me to learn in a fast, efficient way. In my experience, programming gives you the foundation to start learning data analysis, machine learning and data engineering.
Below I am going to list the essential programming skills, in the order that I feel works best for learning them. Python is my preferred programming language for data science, so this post is heavily Python-focussed, but the basic techniques are applicable to any other data science language you may be learning.
Basic Skills in Python Programming
As a first step I would suggest taking an introduction to Python course or similar. Codecademy’s intro to Python course was the first programming course that I took. It gives an introduction to basic syntax, conditionals and control flow, functions, loops, and an introduction to classes.
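As a quick taste, the topics such a course covers all fit into a few lines of Python (the function and data here are my own made-up example):

```python
# A minimal taste of the basics an intro course covers:
# variables, a function, a conditional, and a loop.

def describe_scores(scores):
    """Label each score as 'pass' or 'fail' using a simple threshold."""
    labels = []
    for score in scores:          # loop over the list
        if score >= 50:           # conditional
            labels.append("pass")
        else:
            labels.append("fail")
    return labels

print(describe_scores([35, 72, 50]))  # → ['fail', 'pass', 'pass']
```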
Pandas and NumPy
Once I had a grasp of the basics I started using Jupyter Notebooks (although I now use JupyterLab and would highly recommend it instead), and I started to learn the data manipulation libraries: pandas and NumPy. These are the foundation for most data analysis, and for data preparation for machine learning.
For pandas I preferred using the excellent tutorials within the documentation rather than any specific online courses. The tutorials there give a fantastic introduction to all aspects of the library and include datasets so that you can apply them practically: https://pandas.pydata.org/pandas-docs/stable/tutorials.html. Additionally, https://dataanalysispython.readthedocs.io/en/latest/introduction.html is hands down one of the best resources for covering data analysis in Python generally, and it covers pandas and NumPy incredibly well.
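As a small sketch of what these libraries make easy, here is a made-up example that builds a DataFrame, derives a new column, and aggregates by group (the data is invented purely for illustration):

```python
# A sketch of everyday pandas/NumPy data manipulation:
# build a DataFrame, derive a column, and aggregate by group.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["London", "London", "Leeds", "Leeds"],
    "temperature": [21.0, 19.5, 17.0, 18.5],
})

# Derive a Fahrenheit column with vectorised arithmetic.
df["temp_f"] = df["temperature"] * 9 / 5 + 32

# Group by city and take the mean temperature per group.
means = df.groupby("city")["temperature"].mean()
print(means)
print(np.round(means.loc["London"], 2))  # → 20.25
```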
Classes, Objects and Packages
In data science there will often be instances where you need to reuse pieces of code, and it can be useful to create classes to make this easier. For me this was also the first step in learning to program outside JupyterLab. Additionally, when you start to use data science in production, being able to create your own packages is likely to become essential. It took me a while to really “get” object-oriented programming, and I found that most online tutorials use abstract examples, such as creating a “dog” class. I found it hard to bridge the gap between learning from these and actually using classes for a data science application.
This is a wonderful tutorial I found that walks through creating a Python class for the purposes of obtaining data from an API: https://opendatascience.com/an-introduction-to-object-oriented-data-science-in-python/. There are also some accompanying slides here: https://github.com/gizm00/blog_code/blob/master/odsc/intro_oods/pdsg_meetup_nov_2016.pdf. Using this, and spending some time looking through the code of some of the open source data science libraries I use, enabled me to turn some of my previous work into classes and ultimately packages.
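As a minimal illustration of the idea (the class and its methods are hypothetical, not taken from the tutorial above), here is a small class that wraps some reusable statistics, rather than repeating the same snippets across notebooks:

```python
# A hypothetical example of packaging reusable data science code
# into a class instead of copy-pasting it between notebooks.

class DatasetSummary:
    """Hold a list of numeric records and expose reusable statistics."""

    def __init__(self, records):
        self.records = list(records)

    def mean(self):
        """Arithmetic mean of the records."""
        return sum(self.records) / len(self.records)

    def above(self, threshold):
        """Return only the records strictly above a given threshold."""
        return [r for r in self.records if r > threshold]

summary = DatasetSummary([3, 7, 10])
print(summary.above(5))  # → [7, 10]
```

Once logic lives in a class like this, moving it into an importable module (and eventually a package) is a small step.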
Web Scraping
Web scraping is a really useful skill which I am still working on. It is the process of writing Python code that will “crawl” a website and automatically extract structured data. This gives you access to novel datasets, both for practising data science and for use in data science work. The Beautiful Soup library is one of the most common libraries for this. I learnt the basics via the Dataquest data science learning track, but also found walking through this tutorial really useful: https://towardsdatascience.com/byod-build-your-own-dataset-for-free-67133840dc85.
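As a hedged sketch of the technique, the following parses a fixed HTML snippet rather than a live page, so it runs offline; real scraping would first download the page (for example with the requests library) and should respect the site's terms and robots.txt:

```python
# A small Beautiful Soup sketch: pull structured rows out of an
# HTML table. The snippet and its figures are made up so the
# example is self-contained.
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>country</th><th>population</th></tr>
  <tr><td>Iceland</td><td>364134</td></tr>
  <tr><td>Malta</td><td>514564</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text() for td in tr.find_all("td")]
    rows.append({"country": cells[0], "population": int(cells[1])})

print(rows)
```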
APIs
An API, or application programming interface, in the context of data science is an interface developed by websites to enable programmatic access to their data, in particular data that changes regularly. Being able to use these gives you, as a data scientist, access to new data, or to data that provides more context to the information you are already working with.
Again, I learnt this through the Dataquest course, but if you are looking for a free resource they have also published this excellent tutorial: https://www.dataquest.io/blog/python-api-tutorial/.
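A sketch of the typical workflow (request a URL, check the status, parse the JSON body) might look like this; the response below is canned so the example runs without a network connection, and with a real endpoint you would fetch the body with urllib.request.urlopen or the requests library instead:

```python
# A sketch of consuming a JSON API: check the HTTP status,
# then decode the response body. The payload here is made up.
import json

def parse_response(body, status=200):
    """Decode a JSON API response, raising on a non-200 status."""
    if status != 200:
        raise RuntimeError(f"API request failed with status {status}")
    return json.loads(body)

canned = '{"people": [{"name": "Ada"}, {"name": "Grace"}], "number": 2}'
data = parse_response(canned)
names = [p["name"] for p in data["people"]]
print(names)  # → ['Ada', 'Grace']
```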
The Command Line
I am currently working my way through the following book, which is available freely online: https://www.datascienceatthecommandline.com/chapter-1-introduction.html. This covers how you can use the command line for all aspects of data science, from obtaining, cleaning and exploring data, to creating a regression model with SciKit-Learn Laboratory. Learning how to use the command line has made my workflows much more efficient, and as I am starting to work on data science in production it is becoming an essential skill.
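As a flavour of the approach, here is a hedged sketch that creates a small CSV and wrangles it with standard Unix tools (the file name and data are made up):

```shell
# Create a tiny CSV dataset to play with.
printf 'name,score\nann,72\nbob,35\ncara,90\n' > scores.csv

# Peek at the first rows, much like df.head() in pandas.
head -n 2 scores.csv

# Keep only the names with a score of 50 or more, skipping the header.
awk -F, 'NR > 1 && $2 >= 50 {print $1}' scores.csv

# Count the data rows (drop the header line first).
tail -n +2 scores.csv | wc -l
```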
Git and GitHub
Git allows you to track the changes that you make to your code, and to undo them if needed. Together with GitHub it also allows you to work collaboratively on data science projects, and GitHub can be a great place to share and showcase your own work. The essential skills to learn here include:
- Git configuration
- Adding and removing files
- How to undo changes
- How to create and merge branches, and how to handle conflicts
- How to create and clone repositories
- How to push and pull changes to remote repositories
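The steps above can be sketched as a minimal session (the repository and file names are made up, and this assumes Git is installed):

```shell
# Configuration, adding files, committing, branching and merging.
mkdir demo-repo && cd demo-repo
git init
git config user.name "Your Name"       # configuration, local to this repo
git config user.email "you@example.com"

echo "print('hello')" > analysis.py    # create and add a file
git add analysis.py
git commit -m "Add first analysis script"

git checkout -b experiment             # create and switch to a branch
echo "# tweak" >> analysis.py
git commit -am "Try a tweak"

git checkout -                         # back to the original branch
git merge experiment                   # merge the branch in
git log --oneline                      # review the history
```

Pushing and pulling work the same way once a remote is added with `git remote add`, which is easiest to practise against a real GitHub account.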
There are a number of tutorials available online; I particularly liked this one from DataCamp: https://www.datacamp.com/courses/introduction-to-git-for-data-science. However, I found that creating my own account and practising these techniques in a real scenario was essential to fully grasping the concepts.
This post has given an overview of the key programming skills on my roadmap for learning data science. In later posts I am going to cover my roadmaps for data analysis, maths and statistics for data science, machine learning, and data engineering.