Leonardo Ferreira

“Data Science A-Z from Zero to Kaggle Kernels Master”

A brief story of my last year learning Data Science

I’m from Brazil, and many people from all over the world get in touch with me to ask for tips on learning data science or landing a job in the field, so I decided to write this text to have something a bit more “structured” and to contribute as best I can to people who are at the beginning of this journey.

In this first article, I will set the context for the next texts I intend to write, where I will go deeper into some of the topics I touch on here.

I will tell my story in the data field so far, and at the end I will leave some tips for those who are starting out, or who want to enter the field but don’t quite know what to do, where to start or where to go.

My first contact with data science was just over a year and a half ago, on May 25, 2017. I was unemployed and trying to trade on the stock market consistently. Although I had mastered many techniques that worked just fine, trading required immense emotional control, and after a lot of stress I came across an article in Forbes on data science that quoted it as being “the sexiest profession of the 21st century” and explained why. It caught my attention right from the beginning. It talked about the high average salary, the immense demand for this type of professional, and the need to master many things, such as business, math, statistics and programming. In the eyes of many people this is a rare kind of professional, and the stronger and more balanced their abilities are, the more they can be considered a “unicorn”.

I graduated in Accounting and had worked in corporate treasury, traded on the stock market and sold items on the internet. Because of the life I have had, I always had a certain feel for business, entrepreneurship and financial mathematics. In college I studied math and statistics, but my knowledge of programming was close to zero.

I am also a great enthusiast of philosophy and theories of knowledge, and I believe this helped me structure the way I wanted to learn, because I was already coming from a phase of self-learning. I started going after sites, blogs and news that explained what Data Science / ML / Big Data were, and I discovered what was behind each of these concepts. With each new subject I encountered, I tried to go deeper and search for references. At that time I still did not know the concept of MOOCs, which are online courses with a teaching style better suited to the 21st century, where those who have the discipline to study can greatly reduce their learning time, save money and still gain knowledge far ahead of any classroom course. Around then I also bought a great book called “Data Science from Scratch - First Principles with Python”, and in the first few weeks, by appealing to my business profile and showing great interest, some interviews already started to appear, even though I knew practically nothing about the field.

In total, I got 10 interviews through LinkedIn and did not get the job in any of them, but this gave me great strength to keep dedicating myself full-time (about 12 to 14 hours of study a day, Monday to Monday), because I realized that there really was a lot of demand and that if I kept at it, I would undoubtedly get an opportunity. It is a completely new industry; many even say that data is the new oil.

The first MOOC platform I came across was Udemy. At first I found it interesting, and soon the promotions starting at $20.00 appeared. So I got carried away and bought numerous courses, including “Machine Learning A-Z”, “Data Science from Zero to Hero” and some on Tableau, but I soon realized how foolish I had been and ended up requesting a refund for the 3 courses, because my English at the time was horrible and the courses were just videos. In the meantime, I got to know other platforms, and one of them was Udacity, where I bet my chips.

I saw lots of videos talking about its courses, called Nanodegrees, and they caught my eye, especially the modern teaching style and the project reviews done by a “real person” who gives feedback. The first part went very fast; I was totally excited, and while there was no code I was loving the course. But when it got to the coding part, it became harder for me to follow, and at the time many videos still had no Portuguese subtitles.

The first analysis I did on my own in Python was on July 21, 2017, with the Titanic dataset, applying the concepts I had learned in the statistics module. This project clarified a lot for me about how Data Science is applied, even on such a simple dataset. In the next module, however, I was not able to continue with the course because I could not understand the code, and I concluded that without a programming foundation I would just be wasting my time.

Before asking to cancel the course, I tried to build a foundation in programming logic by following various lessons and YouTube courses, such as those by Gustavo Guanabara and several other very good people who dedicate themselves to teaching logic and programming for free on the internet. After more than a month dedicated to developing this Python foundation, I came across Datacamp, the platform that transformed my way of learning and of seeing online teaching.

Datacamp fit me perfectly. My English was still very shaky, and the platform had written instructions and material, which I often ran through Google Translate to “make sure”. It also had an area where you write your code and submit it for testing, with automatic correction, and if you got the solution right you earned points on the platform. I started by testing some courses, and since I found the platform different from the ones I had already tried, I decided to try the paid version and started the “Data Scientist Track with Python”. There, if I could not develop the code or understand the problem, I could click on “HINT” and a hint or instruction on the task would appear, and if I still did not know how to do it, I could hit HINT again and the system would show the complete code. While doing the course I complemented the theory with the book “Data Science from Scratch”, more specific books, texts, blogs, papers, groups, forums, YouTube videos and several other references to better understand each concept and its applications. But when the Machine Learning part started, I began to feel quite insecure because I had the impression that I “was not learning”, which can be normal for anyone who is beginning in any field and has not yet put what they learned into practice.

Dominated by frustration and unsure of what to do, I decided to try the “Data Scientist Track with R”, because I had read that R was a statistical language and because I was frustrated with Python. I dove deep into R, and, maybe because I was by now somewhat familiar with Python, I found the R language very easy. I finished the track in a short time, and I also reinforced most of the concepts and reapplied them to other datasets. After finishing this track, I decided to take another R track focused on the financial area, the “Quantitative Analyst Track with R”, where I learned a lot of interesting new things. Then, in December 2017, I got an internship opportunity, and as I had to use Python at work, I went back to the Python track on Datacamp.

Since I had not worked with Python for some time, I was quite lost when I got back. That was when I had to put my knowledge into practice on my first “real dataset”, with the task of predicting fraud for a company, and this time there was no one to tell me what to do, how to analyze, what to use or which metrics were important. There was no clear right or wrong.

At first I was completely lost, without knowing where to start, because until then I had only seen theory, and this dataset was not a “popular” one (such as iris, mushrooms, NY taxi, breast cancer, etc.), where the variables to investigate are somewhat obvious and very limited.

As the days passed, things made more sense. The project was delivered on time, and in the end we got good results in predicting the frauds. Even though everything went well, it was quite tense for me, because I still had difficulty doing anything that involved code. I had heard about Kaggle, a platform for predictive modeling and analytics competitions in which statisticians and data miners compete to produce the best models to predict and describe datasets uploaded by companies and users, and which had recently been acquired by Google. So I went after a classic dataset that had been recommended to me, German Credit Risk, where the objective was to develop an analysis and predict whether or not credit could be granted to a particular client based on their history. Since I am an accountant, I found it fairly easy to explore and present the data in a way that ultimately earned me many votes on the platform. That motivated me to keep producing more kernels until I was comfortable enough with code and with different methods of analysis across different data types and industries, trying to look at data in a more abstract way while always considering the different nuances of each industry.

I quickly realized that there was a niche for analyses with a more financial / economic angle: for some reason few people focused on this type of analysis, yet it was well received.

My first interaction with Kaggle was on January 8, 2018, and in a month and a half I reached the level of Kaggle Kernels Expert. I was very excited and truly enchanted by everything I was learning, and after only a few months I won my first prize, in Kiva’s Data Science for Good kernels competition, taking a thousand-dollar prize. While that competition was running, in April 2018, I changed companies and went to work at a company that visualizes really big data with JavaScript (and I didn’t know JavaScript yet) on its own platform, where I joined as a data scientist. A little over a month later, on May 20, I became the 21st person in the world and the first Brazilian to achieve the Kaggle Kernels Master status.

After all the analyses and work I have done on Kaggle, the projects I have taken part in over the last few months for big companies, and the long hours of study, I have accumulated enough experience to have a good understanding of data and frequent insights about it, whether for products, for analyses, or often just out of curiosity or for the knowledge itself. This is a very interesting and rich field for anyone who likes to learn, is not afraid of challenges and, especially, likes to solve problems.

From what I have seen so far, there is no single profile or consensus on what a data scientist is, and this is very liberating because it opens space for people from the most varied parts of society. Although many try to impose innumerable constraints, such as your field or academic status, as prerequisites, the important thing is that you can think, understand, explain and, above all, apply the knowledge to any type of dataset.

I wrote a lot here, but I told my story just to give context to the tips I give to friends and to people who contact me on LinkedIn or Kaggle, and to show that it is possible to learn well in a short time, as long as there is dedication and focus.

Lately I have had good conversations about data with people all over the world, and I have mostly been working very closely with the great, award-winning master Anderson L. Amaral on many Machine Learning automation solutions and consulting projects, covering feature engineering, modelling, preprocessing, optimization and even data cleaning.

I’m also looking to relocate to a country where English is the native language, to improve both my English and my data/ML skills. I really love to learn, and a better command of the language will help me improve many of my other skills, so I intend to go soon.

Finally, here are some tips for people who are on this path and want to become a data scientist:

1 — IT TAKES TIME

Just as you cannot lose weight or build muscle in a few weeks or months, do not expect data science to be any different. Although I had plenty of free time to study, the main factors in learning are consistency and the time you spend on tasks. The more you devote yourself, the faster you will learn.

Always keep in mind that learning new things and becoming smarter takes as much effort as lifting weights, running a marathon or any other high-performance activity. All of this requires dedication and discipline.

2 — CLEAR UP YOUR DOUBTS

Do not accept not understanding how something works. At first it will be a bit difficult because there are many concepts related to the field. Over time, however, these concepts will come up again and again and will become intuitive. One good thing to keep in mind is that no one can take your knowledge away from you, so “lose” as much time as you need, but understand the concepts.

One thing I usually do to motivate myself is to look for children, people with limitations, or even elderly people who do what I intend to do. That way I feel more challenged.

3 — THINK THROUGH THE DATA

When you are learning something, it can be interesting to use it as a thinking tool. Look at the things around you and try to think of data-driven solutions, products or metrics that might apply to what you are observing. That way you will already be training your ability to analyze and to apply the concepts intuitively.

When hiring a data scientist, companies aim to increase their profit, reduce some expense or prevent losses / fraud, so be aware of the importance of “business” if you want to work for companies.

4 — APPLY YOUR KNOWLEDGE

After some time accumulating knowledge through lessons, MOOCs, videos and books, it is worth trying to do your own analyses and develop your own “style”. One place I found phenomenal for this is Kaggle, but there are lots of other platforms, like DrivenData and data.world. In the beginning, prefer to apply your knowledge to datasets from a domain you know, and formulate in simple terms the initial questions you will try to answer. This was one of the reasons I chose many financial datasets in the beginning. Besides, just by practicing on your own, know that you are already an outlier.
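
As a minimal sketch of what “simple initial questions” can look like in Python, the snippet below asks two of them about a tiny credit-style table. Everything in it is an assumption made up for illustration: the records, the field names (purpose, amount, risk) and the two helper functions are hypothetical and are not taken from any real dataset or kernel.

```python
from collections import defaultdict

# Hypothetical toy records, invented for illustration only
# (not taken from any real dataset).
LOANS = [
    {"purpose": "car", "amount": 5000, "risk": "good"},
    {"purpose": "car", "amount": 12000, "risk": "bad"},
    {"purpose": "education", "amount": 3000, "risk": "good"},
    {"purpose": "education", "amount": 4000, "risk": "good"},
    {"purpose": "business", "amount": 20000, "risk": "bad"},
    {"purpose": "business", "amount": 15000, "risk": "bad"},
]


def bad_rate_by(records, key):
    """Question 1: what share of loans is 'bad' within each group of `key`?"""
    totals = defaultdict(int)
    bads = defaultdict(int)
    for row in records:
        totals[row[key]] += 1
        if row["risk"] == "bad":
            bads[row[key]] += 1
    return {group: bads[group] / totals[group] for group in totals}


def mean_amount_by_risk(records):
    """Question 2: does the average amount differ between good and bad loans?"""
    amounts = defaultdict(list)
    for row in records:
        amounts[row["risk"]].append(row["amount"])
    return {risk: sum(vals) / len(vals) for risk, vals in amounts.items()}


if __name__ == "__main__":
    print(bad_rate_by(LOANS, "purpose"))
    print(mean_amount_by_risk(LOANS))
```

On a real dataset the same two questions would usually be answered with a dataframe library, but writing them out by hand first makes it clear what is being asked before any tool is involved.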

5 — LET PEOPLE KNOW WHAT YOU DO

This is one of the most interesting fields today, yet many people still complain that they do not get a chance in it. In my view, one of the main reasons is that people do not know what you are doing. Learning to communicate well is essential.

I think that’s enough for an introductory text. What I wrote above is a summary of everything that has happened over a year and a half. I intend to write more texts that go deeper into the topics mentioned above, along with some observations on the different Data Science profiles, techniques and tips for learning fast, methods that help in understanding a dataset and extracting value from data exploration, and also ML automation, which I have been dedicating myself to for some time and which will undoubtedly become more and more a part of the day-to-day of data scientists all over the world.

I will be back soon. Thank you to those who have read this far, and if the text has become very repetitive, I apologize; I’m not used to writing this much. Also, sorry for any mistakes in my English.

Please leave your feedback in the comments and share this article to help other people who are also just beginning. Also, feel free to get in touch with me on LinkedIn, and if you are interested, visit my profiles on Kaggle and GitHub, where I usually post some of the projects I am developing.

EDIT: Datacamp staff read the text and offered a discount coupon for beginners who want to know more about the platform. If you want to try a free week, send an email to sales@datacamp.com with “Leonardo” in the subject and say that you would like to test it.