The Parallel Universe of Dark Data and Dark Matter

Scientists estimate that roughly 85 percent of the matter in the universe is dark. Far from consisting only of the “billions and billions of galaxies” popularly associated with Carl Sagan, the universe is made up largely of matter that neither emits nor absorbs light or any other form of electromagnetic radiation. Physicists are nonetheless confident it exists because of the gravitational effects it exerts.

For most organizations, dark data is similar in character. In the past, the enterprise assumed that all of its data could be systematically converged into a data warehouse and then identified, reconciled, rationalized, and generally tidied up and reported on. But estimates now suggest that as much as 90 percent of all data across the enterprise is dark. Often meticulously gathered to meet changing compliance mandates, it is invisible and little used after its creation, taking up terabytes of storage while remaining difficult to access and analyze.

Gartner describes dark data as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.” As with the vast expanse of dark matter in space, we know it exists, and the term itself implies that it holds some hidden value. In practice, however, most of this data is of no real use in its current state.

Dark data can be both structured and unstructured. Examples of structured data include contracts and reports with shippers and suppliers, which become dark with time. Unstructured data can contain bits of personally identifiable information such as birth dates, social security numbers, and billing details. All of it has one thing in common: it is much more useful if it is available for decision-making on matters such as contract renewal or privacy regulation compliance.

Advances in artificial intelligence (AI) mean we now have new ways of unraveling the secrets of dark data, but like any business process or tool, AI can produce the wrong result if it is applied incorrectly. For instance, in 2016 Microsoft had to pull its AI chatbot, “Tay,” after Twitter users taught it to be racist. Talk about learning the wrong things! With all of the data now available, especially given the vastness of dark data in the enterprise, what a system actually learns needs to be targeted, refined, and focused in order to produce actionable results.

By some estimates, only one-half of one percent of all data is managed in any way. But then again, how important are people’s old holiday snaps, log files from obsolete apps, and transcripts of phone calls or SMS messages? The value of all these items is time-dependent: it hinges on the timeframe in question. Much like matter in the early universe, when everything was densely condensed, data from the world of 30 years ago was limited and critical at its time of creation. But as time has progressed, the universe has expanded, much like our data, and both contain a lot of empty space or, as we would call it with data, “noise.”

All of the hype around dark data and AI-driven dark analytics misses the crucial first step, which is to identify what is real and what is useful, and for what application. While it may sound exciting, using AI to automatically process vast amounts of data in ultra-fast operations is not, on its own, going to provide much insight.

We have to classify and remove this noise, which, not unlike the Higgs boson, is an elementary particle of information that cannot be examined using existing knowledge. Only once we get to the “good data,” that is, data that is useful to analytics and thus for gaining insight, can we use targeted applications and methods not only to yield insights but also to guard against potentially wrong conclusions and, consequently, make better decisions.

About the Author

David Gingell is Seal Software’s Chief Marketing Officer. Seal Software is the leading provider of contract discovery, data extraction, and analytics. With Seal’s machine learning and natural language processing technologies, companies can find contracts of any file type across their networks, quickly understand what risks or opportunities are hidden in their contracts, and place them in a centralized repository. Based in San Francisco, Seal empowers enterprises around the world to maximize revenue opportunities, reduce costs, and mitigate risks associated with contractual documents, systems, and processes.