Big Data

Der Spiegel, May 2013

Edited by Andy Ross

Big Data is the next big thing. It promises both total control and the logical management of our future lives. An estimated 2.8 ZB of data was created in 2012, with a predicted volume of 40 ZB by 2020. The volume is growing exponentially, doubling roughly every two years.
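As a rough check on those figures, the short Python sketch below computes the doubling time implied by growth from 2.8 ZB in 2012 to 40 ZB in 2020. It assumes steady exponential growth between the two data points, which the article does not state explicitly. The function name is only illustrative.

```python
import math

def doubling_time_years(start_volume, end_volume, years_elapsed):
    """Return the doubling time implied by steady exponential growth."""
    # Number of doublings needed to get from start_volume to end_volume.
    doublings = math.log2(end_volume / start_volume)
    return years_elapsed / doublings

# Figures quoted in the article: 2.8 ZB (2012) and 40 ZB (2020).
implied = doubling_time_years(2.8, 40.0, 2020 - 2012)
print(f"Implied doubling time: {implied:.2f} years")
```

Under that assumption the result comes out at a little over two years, consistent with the claim that the volume doubles about every two years.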

Google and Facebook are giants of Big Data. But many other organizations are analyzing all this data. Memory is cheap, so new computers can analyze a lot of data fast. Algorithms create order from chaos. They find hidden patterns and offer new insights and business models. Algorithms bring vast power.

Blue Yonder is a small young company. Managing director Uwe Weiss analyzes the data generated by supermarket cash registers, weather services, vacation schedules, and traffic reports. All this data flows into analysis software that learns as it goes and finds new patterns. Blue Yonder has used its data to drive a market research system on buying behavior. Weiss: "Big Data is currently revamping our entire economy, and we're just at the beginning."

Big Data now brings hope for millions of cancer patients. In the Hasso Plattner Institute (HPI), in Potsdam, near Berlin, a €1.5 million SAP HANA analytic engine with a thousand cores has so much memory it can process Big Data thousands of times faster than other machines. SAP co-founder Hasso Plattner sponsors the institute and personally pushed the "Oncolyzer" rig. The HANA in-memory technology has won prizes for innovation and is now the flagship SAP platform.

Researchers at the University of Manchester are working on another Big Data project to help senior citizens who live alone. The device is installed on the floor like an ordinary carpet, with sensors recording footsteps. It can determine whether the person is up and about, and can analyze activities to see how they compare with the person's normal movements. Anomalies can trigger an alarm.
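The article does not say how the Manchester system decides that activity is abnormal. One simple way to compare a day's activity with a person's normal movements is a z-score test, sketched below in Python. The step counts, the threshold, and the function name are all hypothetical and are not taken from the project.

```python
from statistics import mean, stdev

def is_anomalous(history, today, threshold=3.0):
    """Flag today's value if it lies more than `threshold` standard
    deviations away from the mean of the historical values."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # No variation in the history: any different value is unusual.
        return today != baseline
    return abs(today - baseline) / spread > threshold

# Hypothetical footsteps recorded per day during a normal week.
normal_days = [410, 395, 430, 405, 420, 415, 400]

print(is_anomalous(normal_days, 412))  # a typical day
print(is_anomalous(normal_days, 40))   # far fewer steps than usual
```

A real system would presumably use richer features than a daily count, such as gait or time of day, but the principle of comparing new readings against a learned baseline is the same.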

The military and intelligence communities also employ the power of data analysis. Big Data played a key role in the hunt for Osama bin Laden, leading investigators to Abbottabad in Pakistan.

California software company Splunk was named a few weeks ago as one of the five most innovative companies in the world. Governments, agencies, and businesses in almost a hundred countries are customers, as are the Pentagon and the Department of Homeland Security. Splunk apps analyze data supplied by all kinds of machines, including cell phone towers, air-conditioners, web servers, and airplanes.

Hamburg-based startup Kreditech lends money via the Internet. Instead of requiring credit information from its customers, Kreditech determines the probability of default using a social scoring method based on fast data analysis. The company extracts as much data as possible from its users, including personal data from eBay and Facebook profiles and other social networking sites. It even records how long applicants take to fill out the questionnaire, the frequency of errors and deletions, and what kind of computer they use. The more information it has, the higher a customer's potential credit line.

Kreditech is expanding rapidly in eastern Europe and plans to launch soon in Russia. But it terminated its service in Germany when the Federal Financial Supervisory Authority (BaFin) proposed to examine its business model. The model generates revenue not only from microcredit deals and interest but also from renting credit scores to other companies. Despite all this, investors find social scoring very attractive.

Business models like Kreditech's illustrate the sensitivity of the issues that Big Data raises. Users give up their data freely, bit by bit, and everyone adds to this huge new data resource every day. But what happens to a stash of credit profiles if its owners are taken over or go bust?

TomTom, a Dutch manufacturer of GPS navigation equipment, sold its data to the Dutch government, which then passed on the data to the police. They used it to set up speed traps in places where they were most likely to generate revenue from speeding TomTom users. TomTom issued a public apology.

Big Data applications are especially valuable when they generate personalized profiles. This may be appealing to retailers and some consumers, but data privacy advocates see many Big Data concepts as Big Brother scenarios of a completely new dimension.

Many companies say the data they gather, store, and analyze remains anonymous. But our mobility patterns alone can be used to identify almost all of us uniquely. The more data is in circulation and available for analysis, the more likely it is that anonymity becomes algorithmically impossible.

Most people don't want companies to store their personal data or to track their online behavior. A proposed European data protection directive includes a "right to be forgotten" on the web. But this may be utopian. We face an impending tyranny of algorithms.

AR I worked in the SAP HANA development team from 2003 to 2009.

Big Data

MIT Technology Review, May 2013

Edited by Andy Ross

SAP likes Big Data. SAP is working with young companies to help them take advantage of its revolutionary HANA in-memory Big Data platform. Some of the most adroit users of Big Data are small startups. Fortune 1000 CIOs often say they have a lot of data but haven't yet figured out a way to translate it into real results.

SAP HANA was the brainchild of Hasso Plattner and Vishal Sikka. The HANA platform takes advantage of a new generation of columnar databases running on multicore processors. The entire system is in RAM, and users say data queries that used to take days now run in seconds.

AR HANA was the brainchild of all of us in the HANA team too.

Google Brains For Big Data

Wired, May 2013

Edited by Andy Ross

Stanford professor Andrew Ng joined Google's X Lab to build huge AI systems for working on Big Data.

He ended up building the world's largest artificial neural network (ANN). Ng's new brain watched YouTube videos for a week and taught itself all about cats. Then it learned to recognize voices and interpret Google StreetView images. The work moved from X Lab to the Google Knowledge Team. Now "deep learning" could boost Google Glass, Google image search, and even basic web search.

Ng invited AI pioneer Geoffrey Hinton to come to Mountain View and tinker with algorithms. Android Jelly Bean included new algorithms for voice recognition and cut the error rate by a quarter. Ng departed and Hinton joined Google, where he plans to take deep learning to the next level.

Hinton thinks ANN models of documents could boost web search as they did voice recognition. Google's knowledge graph is a database of nearly 600 million entities. When you search for something, it pops up information about it to the right of your search results. Hinton says ANNs could study the graph and then cull the errors and refine new facts for it.

ANN research has boomed as researchers harness the power of graphics processors (GPUs) to build bigger ANNs that can learn fast from Big Data. With unsupervised learning algorithms the machines can learn on their own, but for really big ANNs Google first had to write code that would harness all the machines and still run if some nodes failed. It takes a lot of work to train ANN models. Training the YouTube cat model used 16 000 chip cores. But then it took just 100 cores to spot cats in videos.

Hinton aims to test a teranode ANN soon.

AR I like the idea of using ANNs for document search. It will improve result relevance as much as Google's probability-based translation improved quality over rule-based translation.

Big Data

Mark P. Mills
City Journal, July 2013

Edited by Andy Ross

What makes Big Data useful is software. When the first microprocessor was invented in 1971, software was a $1 billion industry. Software today has grown to a $350 billion industry. Big Data analytics will grow software to a multi-trillion dollar industry.

Image data processing lets Facebook track where and when vacationing is trending. Looking at billions of photos over weeks or years and correlating them with related data sets (vacation bookings, air traffic), tangential information (weather, interest rates, unemployment), or orthogonal information (social or political trends), we can associate massive data sets and unveil all manner of facts.

Isaac Asimov called the idea of using massive data sets to predict human behavior psychohistory. The bigger the data set, he said, the more predictable the future. With Big Data analytics, we can see beyond the apparently random motion of a few thousand molecules of air to see the balloon they are inside, and beyond that to the bunch of party balloons on a windy day. The software world has moved from air molecules to weather patterns.

The new era will involve data collected from just about everything. Until now, given the scale and complexities of commerce, industry, society, and life, you couldn't measure everything, so you approximated by statistical sampling and estimation. That era is almost over. Instead of estimating how many cars are on a road, we will count each and every one in real time as well as hundreds of related facts about each car.

Big data sets can reveal trends that tell us what will happen without the need to know why. With robust correlations, you don't need a theory, you just know. Observational data can yield enormously predictive tools. The why of many things that we observe, from entropy to evolution, has eluded physicists and philosophers. Big data may amplify our ability to make sense of nearly everything in the world.

The Big Data revolution is propelled by the convergence of three technology domains: powerful but cheap information engines, ubiquitous wireless broadband, and smart sensors. Nearly a century ago, the air travel revolution was enabled by the convergent maturation of powerful combustion engines, aluminum metallurgy, and the oil industry.

Business surveys show $3 trillion in global information and communications technology (ICT) infrastructure spending planned for the next decade. This puts Big Data in the same league as Big Oil, projected to spend $5 trillion over the same decade. All this is bullish for the future of the global economy.

Big Data Analysis

By Jennifer Ouellette
Quanta, October 9, 2013

Edited by Andy Ross

Since 2005, computing power has grown largely by using multiple cores and multiple levels of memory. The new architecture is no longer a single CPU plus RAM and a hard drive. Supercomputers are giving way to distributed data centers and cloud computing.

These changes prompt a new approach to big data. Many problems in big data are about managing the movement of data. Increasingly, the data is distributed across multiple computers in a large data center or in the cloud. Big data researchers seek to minimize how much data is moved back and forth from slow memory to fast memory. The new paradigm is to analyze the data in a distributed way, with each node in a network performing a small piece of a computation. The partial solutions are then integrated for the full result.
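The idea of each node computing a small piece and the partial solutions being integrated afterward can be illustrated with a toy example. The Python sketch below computes a global mean from per-node summaries without moving the raw records. The shards and function names are hypothetical, and the "nodes" here are just lists in one process rather than real machines.

```python
def partial_summary(shard):
    """Work done locally on one node: reduce its records to (sum, count)."""
    return (sum(shard), len(shard))

def combine(partials):
    """Integrate the per-node summaries into the global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Hypothetical data, already distributed across three nodes.
shards = [[3, 5, 7], [10, 20], [1, 2, 3, 4]]

# Each node sends back only two numbers instead of all of its records.
partials = [partial_summary(shard) for shard in shards]
global_mean = combine(partials)
print(global_mean)
```

Only the small summaries cross the network, which is the point of the approach: the amount of data moved depends on the number of nodes, not on the number of records.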

MIT physicist Seth Lloyd says quantum computing could assist big data by searching huge unsorted data sets. Whereas a classical computer runs with bits (0 or 1), a quantum computer uses qubits that can be 0 and 1 at the same time, in superpositions. Lloyd has developed a conceptual prototype for quantum RAM (Q-RAM) plus a Q-App — "quapp" — targeted to machine learning. He thinks his system could find patterns within data without actually looking at any individual records, to preserve the quantum superposition.

Caltech physicist Harvey Newman foresees a future for big data that relies on armies of intelligent agents. Each agent records what is happening locally but shares the information widely. Billions of agents would form a vast global distributed intelligent entity.


Evgeny Morozov
MIT Technology Review, October 22, 2013

Edited by Andy Ross

Technology companies and government agencies have a shared interest in the collection and rapid analysis of user data.

The analyzed data can help solve problems like obesity, climate change, and drunk driving by steering our behavior. Devices can ping us whenever we are about to do something stupid, unhealthy, or unsound. This preventive logic is coercive. The technocrats can neutralize politics by replacing the messy stuff with data-driven administration.

Privacy is not an end in itself but a means of realizing an ideal of democratic politics where citizens are trusted to be more than just suppliers of information to technocrats. In the future we are sleepwalking into, everything seems to work but no one knows exactly why or how. Too little privacy can endanger democracy, but so can too much privacy.

Democracies risk falling victim to a legal regime of rights that allow citizens to pursue their own private interests without any reference to the public. When citizens demand their rights but are unaware of their responsibilities, the political questions that have defined democratic life over centuries are subsumed into legal, economic, or administrative domains. A democracy without engaged citizens might not survive.

The balance between privacy and transparency needs adjustment in times of rapid technological change. The balance is a political issue, not to be settled by a combination of theories, markets, and technologies. Computerization increasingly appears as a means to adapt an individual to a predetermined, standardized behavior that aims at maximum compliance with the model patient, consumer, taxpayer, employee, or citizen.

Big data constrains how we mature politically and socially. The invisible barbed wire of big data limits our lives to a comfort zone that we did not choose and that we cannot rebuild or expand. The more information we reveal about ourselves, the denser but more invisible this barbed wire becomes. We gradually lose our understanding of why things happen to us. But we can cut through the barbed wire. Privacy is the resource that allows us to do that.

Think of privacy in economic terms. By turning our data into a marketable asset, we can control who has access to it and we can make money. To ensure a good return on my data portfolio, I need to ensure that my data is not already available elsewhere. But my decision to sell my data will impact other people. People who hide their data will be considered deviants with something to hide. Data sharing should not be delegated to an electronic agent unless we want to cleanse our life of its political dimension.

Reducing the privacy problem to the legal dimension is worthless if the democratic regime needed to implement our answer unravels. We must link the future of privacy with the future of democracy:

1 We must politicize the debate about privacy and information sharing.
2 We must learn how to sabotage the system with information boycotts.
3 We need provocative digital services to reawaken our imaginations.

The digital right to privacy is secondary. The fate of democracy is primary.