The age of Big Data is upon us. Fuelled by an incendiary mix of overblown claims and dire warnings, the public debate over the handling and exploitation of digital information on an astronomically large scale has been framed in stark terms: on one side are transformative forces that could immeasurably improve the human condition; on the other, powers so subversive and toxic that a catastrophic erosion of fundamental liberties looks inevitable.
The tension between these opposites has marooned the discussion of Big Data. It is stuck somewhere between Bletchley Park — the former Government Communications Headquarters (GCHQ) location where the godfather of the computational universe, Alan Turing, primed today’s Big Data explosion during the Second World War — and the satirical tomfoolery of South Park, which recently portrayed the living core of all data as an incarcerated Father Christmas cruelly wired up to a machine by the US’s National Security Agency (NSA).
We know from Edward Snowden’s widely publicized whistle-blowing revelations that the NSA, in collusion with GCHQ, lifted vast amounts of data from Google and Yahoo under the once-top-secret codename Muscular. At the same time, we’re told that the potential for beneficial insights mined from anonymous, adequately protected data is enormous.
Big Data helps us find things we “might like” to buy on Amazon, for example, but it has also left us vulnerable to surveillance by state and other agencies. Companies such as Google and Facebook are essentially Big Data businesses, whose staggering profitability stems from the application of data analysis to advertising: these “free” services are paid for by personal data surrendered automatically with every click.
In finance, meanwhile, optimists foresee a theoretical end to all stock-market crashes, thanks to insights derived from huge-scale data-crunching, while others predict an automated, algorithmic road to ruin. Similarly, the cost and efficiency of healthcare provision are set to be radically transformed for the better with access to massive amounts of data, as is the development of new drugs and treatments. But what about the mining of medical data without patient consent? So the debate goes on.
One aspect of Big Data, however, is beyond question: it is indeed very big, and it’s getting bigger by the millisecond. An IBM report in September estimated that 2.5 quintillion bytes of data are created every day (that’s 25 followed by 17 zeros, or roughly the capacity of 10 million 250-gigabyte laptop hard drives) and that 90 per cent of the world’s data has been generated in the past two years: everything from geo-tagged phone texts and tweets to credit-card transactions and uploaded videos. By 2020, it’s thought that the number of bytes will be 57 times greater than the number of grains of sand on all the world’s beaches.
So what’s actually going on at the coalface of Big Data, a code-centric world of striping, load-balancing, clustering and massively parallel processing? What do the analysts working with Big Data say it’s going to do for us?
“You get a fuller picture of the phenomenon you’re interested in, with more dimensions, and that lets you derive greater insights,” says Big Data pioneer Doug Cutting, chief architect at enterprise software company Cloudera and founder of the popular open-source Big Data tool Hadoop. Cutting’s work on internet search technology for Yahoo during the mid-2000s provided the ideal proving ground for combining vastly increased computing power with huge and diverse datasets. “And from that we’ve seen a new style of computing emerge.”
The revolutionary effects of this new approach cannot be overstated, especially within the scientific community. For Brad Voytek, professor of computational cognitive science and neuroscience at the University of California San Diego, and “data evangelist” for app-based taxi service Uber, Big Data has had a profound effect on the traditional scientific method. “You can sweep through huge amounts of data and come up with new observations,” he says. “That’s where the power of Big Data comes in. It’s automating the observation process. It’s making everything easier but in a way that few people yet understand. It’s going to dramatically speed up the scientific process and people have been doing some really cool stuff with it.”
Michael Schmidt, founder and chief executive of American “machine-learning” start-up Nutonian, established a Big Data landmark when, in partnership with robotics engineer Hod Lipson at Cornell University, New York, he created Eureqa, a piece of software that deduced Newton’s Second Law of Motion by analysing data from the chaotic movements of a double pendulum. What took Newton years, the Eureqa algorithm accomplished in a matter of hours. With Nutonian, Schmidt is now opening up that Big Data technology beyond the college lab.
“We want to accelerate the process that scientists go through, to help you discover very deep principles from data,” he says. “We want to explain how things work.” The range of Eureqa’s uses couldn’t be more striking, from the construction of better warplanes to helping save the lives of infants. Schmidt is currently working with the United States Air Force, analysing the strength of advanced super-alloys used in engine components. “They are really interested in anticipating failures — knowing when things are going to break, explode or stop working. We were able to show them the most important things that go into a failure of a particular engine part, at a finer resolution than ever before.”