If you have any working knowledge of the technological world, then you will be familiar with GitHub. The web-based hosting service enables the distribution of version-controlled code, encouraging sharing and collaboration among those who use code regularly. The theory is that, by lowering the barriers to sharing and encouraging the distribution of new ideas, technology spreads more rapidly, as it can easily be accessed and used by others – the principle behind Open Source Software.
Open Source Software in the Data Science World
GitHub has garnered a lot of attention from other disciplines – namely from data scientists. Data science faces many of the same issues as software development: work is often duplicated because of a lack of communication, and data cannot be shared for want of a proper platform. However, several start-ups worth looking out for are planning to change the way that information is shared in the data science world.
Reducing Information Duplication
Imagine the annoyance: you have churned out multiple queries, repeated the process numerous times a day and gathered huge amounts of information – only to find that someone else had already completed this work, and that knowing so could have saved you days of effort. This is the problem that Mode Analytics has set out to solve, by making better use of existing resources. Its approach also benefits from having people with different backgrounds and perspectives read the data, so that different conclusions can be drawn from it.
Harnessing the Power of Version Control
From talking to various data scientists, Domino Data Lab quickly realised that the process of data analysis should be simplified: some said it was too difficult to connect to a shared cloud, and others said that collaborating with other data scientists was too much of an annoyance.
The defining feature of Domino is version control, which allows developers to see changes made to code, and to go back to a previous version if anything goes wrong with the new one. “The software that Domino has created remembers what data is used with certain scripts, always allowing users to reproduce results, and also stores information on which algorithms data scientists have put to work,” explains Antonio Lisenby, a Data Analyst at Writemyx and 1Day2write.
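To make the idea of remembering which data was used with which script more concrete, here is a minimal Python sketch. It is an illustration only, not Domino's actual implementation: the names `fingerprint`, `RunRecord` and `RunHistory` are hypothetical, and it assumes that a content hash is an acceptable stand-in for a "version" of a script or data set.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(content: bytes) -> str:
    """Return a short content hash that changes whenever the bytes change."""
    return hashlib.sha256(content).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    """One analysis run: which script version met which data version (hypothetical)."""
    script_version: str
    data_version: str
    result: str


@dataclass
class RunHistory:
    """Append-only log of runs, so an earlier result can be traced and reproduced."""
    records: list = field(default_factory=list)

    def record(self, script: bytes, data: bytes, result: str) -> RunRecord:
        # Store hashes of the exact script and data that produced this result.
        entry = RunRecord(fingerprint(script), fingerprint(data), result)
        self.records.append(entry)
        return entry

    def find(self, script: bytes, data: bytes):
        """Return the most recent run with this exact script and data, or None."""
        key = (fingerprint(script), fingerprint(data))
        for entry in reversed(self.records):
            if (entry.script_version, entry.data_version) == key:
                return entry
        return None


# Example usage with made-up script and data contents.
history = RunHistory()
history.record(b"mean(sales)", b"1,2,3", "2.0")
history.record(b"median(sales)", b"1,2,3", "2")
previous = history.find(b"mean(sales)", b"1,2,3")  # the first recorded run
```

Because each record is keyed on the contents of both the script and the data, changing either one produces a different fingerprint, which is what lets an earlier result be matched to the exact inputs that produced it.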
Sharing is Caring
The crux of what Sense does is to look at how to get more value from data more quickly, which, for the company, means a cloud-based area in which to share findings, results, and visualizations. Dropbox serves a purpose when it comes to exchanging files, but is far more awkward for sharing results, which is the gap that Sense is aiming to plug.
Keeping it in the Community
Plot.ly boasts the ability to host tools for manipulating and analysing data, but what its founders believe sets it apart is its focus on the community aspect of data science, acting as a social meeting place for those who are interested in data. The platform allows members to follow data, get updates and comment – like GitHub and other, more widely used, social networks. The start-up has gathered a sizeable following in a relatively short space of time, and is in the process of developing on-site software to bring in revenue, aimed at companies that do not wish for sensitive data to be taken off-site.
The Future Looks Bright
Data collaboration is the main weakness in the data science world, and, with many actively trying to improve the way in which data is shared, the future looks hopeful. The main issue facing companies such as Domino and Sense, however, is that there is only a finite number of data scientists, and so their potential audience is not as broad as that of a general social media site. Even so, with data science still in its relative infancy, the next few years could show that these start-ups were miles ahead of the game.