KDnuggets What is Academic Torrents and Where is Data Sharing Going?

Learn more about Academic Torrents, a platform for researchers to share data. It consists of a site where users can search for datasets and a BitTorrent backbone that makes sharing data scalable and fast.

By Joseph Paul Cohen, Founder and Director, Institute for Reproducible Research.

Academic Torrents

Academic Torrents is a platform for researchers to share data. It consists of two pieces: a site where users can search for datasets, and a BitTorrent backbone that makes sharing data scalable and fast. The goal is to facilitate the sharing of datasets among researchers. It was created by the Institute for Reproducible Research (a U.S. 501(c)(3) non-profit).

The site provides access to over 15 TB of data, including popular machine learning datasets such as all of UCI, ImageNet, and Wikipedia. Though some of these datasets are available elsewhere, Academic Torrents stitches multiple hosting locations together, so downloading is much faster and also fault-tolerant. For downloaders there are no sign-up or verification processes in the way, and the collection is more comprehensive than anywhere else. Many datasets whose original hosting location is no longer available, such as the Netflix dataset, remain available through Academic Torrents.

As data gets bigger, peer-to-peer file transfer becomes increasingly attractive, since it is the only form of distribution whose capacity scales with the number of users: every downloader can also serve data to others. Academic Torrents currently facilitates the transfer of over 900 GB per day and serves over 30,000 users per month.

The guiding principle of Academic Torrents is to ensure that the data the community needs is always available and can be obtained quickly. To ensure that data is always available, it needs to be stored in more than one location in case the initial location goes offline. Typically, when a user downloads data from a secondary website, it is unclear whether they found the correct data. BitTorrent allows data to be mirrored transparently in a peer-to-peer fashion while maintaining the correctness and authenticity of the data. A speed increase is gained because a user can download from all the mirrors at once.

How is data mirrored?

Many contributors voluntarily mirror datasets that they like or that they think are important for the community. Anyone can become a mirror by simply downloading data and leaving their BitTorrent client running in seeding mode. Academic Torrents provides "Collections" which allow any user to curate a list of torrents. Each Collection has an RSS feed that can be used by a BitTorrent client to automatically download new items. This allows volunteers to spend minimal time managing their donated resources.

Some examples are:

  • Video Lectures: academictorrents.com/collection/video-lectures
  • Deep Learning: academictorrents.com/collection/deep-learning
  • Spatial Datasets: academictorrents.com/collection/spatial-datasets
  • Computer Vision: academictorrents.com/collection/computer-vision
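To illustrate how a client might use a Collection's RSS feed to pick up new items, here is a minimal Python sketch. The feed content, the example.org links, and the `new_items` helper are all hypothetical; the real feed's fields may differ, and most BitTorrent clients with RSS support do this for you.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content for illustration only.
# The structure of a real Collection feed is an assumption here.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Collection</title>
    <item>
      <title>Dataset A</title>
      <link>https://example.org/dataset-a.torrent</link>
    </item>
    <item>
      <title>Dataset B</title>
      <link>https://example.org/dataset-b.torrent</link>
    </item>
  </channel>
</rss>
"""


def new_items(feed_xml, seen_links):
    """Return (title, link) pairs for feed items whose link is not in seen_links."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if link and link not in seen_links:
            items.append((title, link))
    return items


# A mirror that already seeds Dataset A would only be offered Dataset B.
already_seeding = {"https://example.org/dataset-a.torrent"}
for title, link in new_items(SAMPLE_FEED, already_seeding):
    print("new torrent:", title, link)
```

A volunteer's client would poll the feed periodically and hand each new link to the downloader, which is why mirroring a Collection needs little ongoing attention.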

Standard webservers can also be used as a data source without any special software. Academic Torrents has a unique feature that allows the uploader of a torrent to manage a list of HTTP URLs that the data can be downloaded from dynamically. This allows the data to change physical locations transparently while allowing full control by the uploader.

Repairing corrupt data?

When using data in research, it is important to verify that the data you have is the exact data that was used in previous work. Verifying data is hard if hashes were never taken of the original, or if previously correct data was corrupted in storage. With BitTorrent, verification and repair of a dataset are built in. A BitTorrent client can detect whether a specific part of the dataset has been changed or corrupted. If so, that part is automatically downloaded again and the file is repaired.
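The idea behind this check can be sketched in a few lines of Python. The original BitTorrent protocol stores a SHA-1 hash for each fixed-size piece of the data, so a client can compare hashes piece by piece. The function names and the tiny piece length below are assumptions made for illustration; this is not how any particular client is implemented.

```python
import hashlib

PIECE_LENGTH = 4  # tiny value for illustration; real torrents use far larger pieces


def piece_hashes(data, piece_length=PIECE_LENGTH):
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def corrupted_pieces(data, expected, piece_length=PIECE_LENGTH):
    """Return the indices of pieces whose hash differs from the expected hash."""
    actual = piece_hashes(data, piece_length)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


def repair(data, good_source, bad_indices, piece_length=PIECE_LENGTH):
    """Replace only the corrupted pieces with bytes fetched from a good copy."""
    fixed = bytearray(data)
    for i in bad_indices:
        start = i * piece_length
        fixed[start:start + piece_length] = good_source[start:start + piece_length]
    return bytes(fixed)


original = b"abcdefghij"
expected = piece_hashes(original)   # what the torrent file would record
damaged = b"abcdeXghij"             # one byte flipped in the second piece

bad = corrupted_pieces(damaged, expected)
restored = repair(damaged, original, bad)
print(bad, restored == original)
```

Here `good_source` stands in for the peers and mirrors a real client would fetch the damaged piece from. Only the affected piece is transferred, not the whole dataset.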

How to contribute?

The best way to contribute is to make sure your data (or the data you work with) is listed on Academic Torrents. First, create a torrent file of the data using a client (such as Transmission) and add the announce URL that is specified on the Academic Torrents upload page.

  • If the data is hosted on a website, enter the URL of the data in the Backup URL section of the upload page.
  • If you will host the data via BitTorrent, you need to keep a client (such as Transmission) running.

Fill out a description so people can find your data and then submit this torrent on the upload page.
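For readers curious about what a client writes when it creates a torrent, the sketch below builds a minimal single-file torrent in Python. It assumes the standard bencoded metainfo layout (an `announce` URL plus an `info` dictionary with `name`, `length`, `piece length`, and `pieces`). The helper names and the tracker.example URL are placeholders; in practice you would let a client such as Transmission create the file and use the announce URL shown on the upload page.

```python
import hashlib


def bencode(value):
    """Encode ints, strings, bytes, lists, and dicts in bencoding."""
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode value of type " + type(value).__name__)


def make_single_file_torrent(name, data, announce_url, piece_length=262144):
    """Return the bytes of a minimal single-file torrent for in-memory data."""
    # Concatenated SHA-1 digests, one 20-byte digest per piece.
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {
        "name": name,
        "length": len(data),
        "piece length": piece_length,
        "pieces": pieces,
    }
    return bencode({"announce": announce_url, "info": info})


# Placeholder announce URL; use the one given on the upload page instead.
torrent = make_single_file_torrent(
    "data.bin", b"x" * 10, "http://tracker.example/announce", piece_length=4
)
with open("data.bin.torrent", "wb") as handle:
    handle.write(torrent)
```

This sketch reads the whole dataset into memory, which is fine for a demonstration but not for large datasets; real clients hash files in a streaming fashion.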

Email academictorrents.com with any questions or comments.

Bio: Joseph Paul Cohen is a deep learning and computer vision researcher who is currently a Postdoctoral Fellow at the University of Montreal, and is the Founder and Director of the Institute for Reproducible Research, a 501(c)(3) non-profit which works on projects such as Academic Torrents and Short Science.
