1. Don’t Complain about It, Fix It: Contribute to Open Source Software (More)
Open source software is only as good as its community and/or developer(s). Developers are human and typically cannot manage all bugs and feature requests themselves. My goal is to routinely contribute back to the community either with new features, or by fixing bugs that I discover. This not only helps the community at large, but also helps me as a software engineer. There is no better way to become an even better engineer than by wading through someone else’s code. While this is something I did all day every day at my $DAYJOB, I do it less while on my sabbatical.
Some of the projects I use the most and that I hope to contribute to are scikit-learn and pandas, particularly the parts aimed at higher performance computing, such as out-of-core processing and batched processing. These “tricks” are critical to working with huge datasets on small machines, particularly for students who may not wish to pay for Amazon EC2, Azure, etc.
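To make the idea of batched, out-of-core processing concrete, here is a minimal sketch using only the Python standard library. It is not how pandas or scikit-learn implement it; the function names `iter_batches` and `streaming_mean` are hypothetical and chosen only for illustration. The point is simply that only one batch of rows, plus a couple of running totals, is ever held in memory. pandas exposes the same pattern through the `chunksize` argument of `read_csv`, and several scikit-learn estimators support incremental training through `partial_fit`.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable (hypothetical helper)."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


def streaming_mean(csv_file, column, batch_size=1000):
    """Mean of one numeric column, reading the file one batch at a time."""
    reader = csv.DictReader(csv_file)
    count = 0
    total = 0.0
    for batch in iter_batches(reader, batch_size):
        # Only this batch is in memory; earlier batches are already discarded.
        values = [float(row[column]) for row in batch]
        count += len(values)
        total += sum(values)
    if count == 0:
        raise ValueError("no rows to average")
    return total / count


if __name__ == "__main__":
    # Tiny in-memory stand-in for a file that would be too large to load at once.
    demo = io.StringIO("x\n1\n2\n3\n4\n5\n")
    print(streaming_mean(demo, "x", batch_size=2))
```

The same loop shape works for any statistic that can be updated incrementally, which is what makes it useful on a small machine.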
2. Focus on Sharing, Not Just Doing
One of the qualities of my Ph.D. advisor that I admire the most is his dedication to sharing pretty much everything that he does, even if it isn’t complete. Anytime there is a new medium for writing and sharing technical content, he adopts it for this work. Through him, I learned about GitBooks and RPubs. There is also GitHub Pages, ways to share Jupyter Notebooks, R Notebooks, etc. It is hard to keep up with all these new ways of sharing work, but the takeaway is that I need to get better at it.
When I am asked, I typically recommend that people only post on their GitHub completed projects that they are proud of. I am thinking of using a secondary GitHub account for exploration. There are often times I start a project, get distracted and never complete it. But, many times I learn some interesting techniques or hacks that somebody else could use and do not have time to blog about it. Right now, all of that knowledge goes to waste. By sharing this work somebody else can find these gems, even if the project itself is not complete(d). In academia there is the mantra publish or perish. While my academic pursuits will end at a Ph.D., some teaching and maybe a conference talk or two, I want to start taking this to heart in the technical world — give talks, contribute to meetups, blog more, participate on StackOverflow, Quora, Gitter, IRC… and maybe Slack...
One other aspect of sharing work I have done is that it encourages accountability. If I post unfinished work, I may be more likely to finish it, and if there is enough content to remind me of what I did and why I was doing it, that would encourage even more success.
This concept of sharing also applies to my personal at-home habit of writing one-off scripts. In software engineering we focus on reusable code. I want to start taking these one-off scripts and turning them into scripts with at least a command-line interface. This of course assumes that the script has some use to someone other than myself. It is my job to try to make it so, all in the name of sharing and contributing. It is not always about the goal. While some of my advisor’s shared manuscripts and code snippets are not useful to me in what they do, I have learned a lot about coding techniques and new algorithms and that makes sharing content worth it.
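As a rough illustration of turning a one-off script into something with a command-line interface, here is a small sketch built on the standard library's `argparse`. The line-counting task and the names `count_lines`, `build_parser`, and `main` are assumptions made up for this example; the structure is what matters: the original logic lives in a plain function, and a thin wrapper handles arguments.

```python
import argparse
import sys


def count_lines(text):
    """The original 'one-off' logic, kept as a plain, reusable function."""
    return len(text.splitlines())


def build_parser():
    """Describe the command-line interface in one place."""
    parser = argparse.ArgumentParser(description="Count lines in a text file.")
    parser.add_argument("path", help="file to read, or '-' for standard input")
    return parser


def main(argv=None):
    """Parse arguments, read the input, and print the result."""
    args = build_parser().parse_args(argv)
    if args.path == "-":
        text = sys.stdin.read()
    else:
        with open(args.path, encoding="utf-8") as handle:
            text = handle.read()
    print(count_lines(text))
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Keeping the logic separate from the argument handling means someone else can import `count_lines` directly, while `--help` documents the script for anyone who just wants to run it.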
3. Create a Usable Web Service, Running on a Real Server (not localhost)
People say that communication is a big part of being a Data Scientist. I believe this depends on the type of Data Scientist role. A Data Science Engineer focuses more of his or her time on accessibility: developing data products or tools that allow people (or machines) to make decisions, or presenting data in a way that a human can easily understand, with interactive graphics and other forms of user interaction. Of course, this is a special form of communication.
I’ve built machine learning systems, but at the time I did not appreciate the full lifecycle of the system. The system needs to sell itself. Not only should it implement a model in a scalable way, it also needs to adapt to new data (online learning and tuning). At the time, I thought this was a pain, but I now realize that this is what makes a system speak to the user: a full feedback loop.
4. Open My Mind (More) to Neural Networks and Deep Learning
I received my copy of Deep Learning by Goodfellow and friends and I intend to read it cover to cover. I never had an interest in computer vision because I was not sure if we could ever solve vision problems on consumer hardware and yet here we are. While deep learning is part of that, I feel that it may be a more natural fit for vision and it would be more accessible to me and others. Of course, I am also very interested in applications to natural language processing.
5. Learn a New Language
Of course, I might also just read Stroustrup’s C++ book and Bruce Eckel’s Java book cover to cover to beef up my C++ and Java, as both of those languages are very important for high performance computing (C++), distributed computing (both, but mostly Java) and systems development.
6. Learn about Electronics and Explore
My parents need a new doorbell, one that has a camera that always runs, has decent motion detection and sends alerts over multiple channels of communication. We have the Ring, which is proprietary and just does not do a great job of this. Wireless performance is terrible, and the bell only rings in one room. With some electronic components and either an Arduino or Raspberry Pi, I am convinced I can do better, at least for our purpose. I can also access all of the video and alerts on my own server rather than having to pay and deal with the cloud. Another thing… my mother has an elaborate Christmas display in the front yard connected to several timers. The timers are never synched properly and half the yard will be dark. I want to create a power bar that can be programmed over Wi-Fi or Bluetooth and that keeps itself synched. Such a device already exists, but I want to do it myself.
My fear of electricity, and of either electrocuting myself or wasting money burning out circuit boards, has kept me from participating in this fascinating field. I plan on going through this book on electronics to get me started, and from there we will see!
As for myself, personally...
I only have one personal resolution. One that is doable and that would give me joy: Travel somewhere new just to mountain bike. Who knows where I will end up in 2017, but if it involves me mountain biking somewhere other than Mammoth, Lake Tahoe or Southern California, I will consider that a success. Some places on my wishlist include Moab, UT, Bend, OR, Ashland, OR, Whistler, BC, Downieville (not really a trip though), Crested Butte, CO, Park City, UT and maybe Brevard, NC... or... Scotland?
Bio: Ryan Rosario is a Data Scientist and Machine Learning Engineer from Los Angeles, CA. His interests include text mining, natural language processing and geospatial analysis. He previously worked at Facebook where he developed NLP systems and is finishing a Ph.D. in Statistics at UCLA.
Original. Reposted with permission.