Big data in academia

For academia, and for many other sectors, big data looks like a new Eldorado: a fantastic mine of information where we hope to find the key to great discoveries. In the social sciences, for example, questions such as “what is social media data telling us about our societies?” or “how can big data be used to efficiently allocate resources, such as energy or capital?” are becoming more frequent.

Quite naturally, academia is abuzz, thinking about how best to use big data. It is also teeming with competitiveness, as researchers race to make the discoveries waiting for us in those massive data centres.

A cautious approach

That being said, academics, and especially universities, must exercise caution. There are some unique risks that must be taken into account, risks that stem from academic research being (in most cases) made publicly available. All the results obtained, the methods used, and often all the data collected during a research project will eventually be published. This level of transparency is unique to academia. So however magical big data may appear, using it requires careful consideration of both data protection laws and research ethics principles.

Concretely, when using data that was produced by individuals, we must not only take care to respect their privacy when processing the data, but also ensure that the content produced cannot be used against them. Further precautions must be taken when working with data from vulnerable individuals, such as young children and teenagers.

In that respect, the considerable experience that universities have in handling human subject data within the life sciences is of great help, and existing research protocols devised in that context are being adapted to address the new challenges posed by big data.

From big data to big collaboration

Researchers in statistics and mathematics have to be mindful of, and interact closely with, concepts of law and ethics. This is one of many examples of collaborations and interactions across academic fields driven by the use and exploration of big data. The reason is that big data requires a wider range of expertise to be applied simultaneously to answer research questions. In contrast, for the smaller, traditional data sets used not so long ago, knowing both the data and the methods was sufficient to tackle most projects.

Consider an effort to understand the drivers behind memes on Twitter. A meme is an idea, behaviour, or style that spreads from person to person within a culture. A typical research project could, for example, try to understand how memes develop, who they spread to, and why. Such a project would require a software engineer able to produce the database necessary to manage the large data set; a statistician or machine-learner to develop the proper methodology to explore the data and learn its properties; a linguist to help efficiently mine the tweets by identifying the relevant semantic structures; a social scientist to interpret the results and frame the research questions; and so on.

One of the challenges for academia is that there is no culture of collaboration on this scale in the field yet. Finding the right skills and synergies to produce efficient teamwork is the first challenge academics face when tackling a big data project.

The changing nature of data

But it is not the size of big data that requires so many different skills. In that sense, the term ‘big data’ is misleading. What makes big data different is the way the data is collected.

Traditionally, data was collected for a specific purpose, relating to a specific scientific question. Academics sought first to delineate a scientific question, then to build the most appropriate techniques to answer that question, and finally to collect the appropriate data set to apply the technique to. This systematic process led, hopefully, to an answer.

This tradition dates back to Greek Antiquity when, for instance, Eratosthenes measured the angle of the sun at different locations, and used geometry to compute the circumference of the globe.

In this setting, we have a clear research question: what is the circumference of the globe? To address this question, Eratosthenes used geometry to identify where he would find the answer: the angle of the sun at different locations. Then he worked out how to measure that information accurately: using a gnomon.
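The geometric step can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the inputs in the usage note (a difference of 7.2 degrees in the sun's angle between two sites about 5,000 stadia apart) are the figures traditionally attributed to Eratosthenes, not values taken from this article.

```python
import math


def circumference_from_shadow_angle(angle_difference_deg, distance_between_sites):
    """Estimate a sphere's circumference from two noon sun-angle readings.

    Assumes the two sites lie on the same meridian and that the sun's rays
    are parallel, so the difference in sun angle equals the arc of the
    circle separating the sites. The result is in the same unit as
    distance_between_sites.
    """
    if not 0 < angle_difference_deg < 360:
        raise ValueError("angle difference must be between 0 and 360 degrees")
    # The arc between the sites is angle/360 of the full circle.
    return distance_between_sites * 360.0 / angle_difference_deg


# Illustrative inputs only (traditional figures, in stadia).
estimate = circumference_from_shadow_angle(7.2, 5000)
print(round(estimate))  # about 250000 stadia
```

The question, the quantity to measure, and the measurement itself appear here as three separate pieces: the formula, the angle, and the gnomon reading that supplies it.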

The research project presented three clear steps that could be addressed independently. This paradigm is now shifting. The problem big data poses for academics, and data scientists in general, is that extracting value and insight from the data requires doing two things simultaneously: discovering where the informational content lies and how to extract it. The ‘where’ and the ‘how’ must be done at the same time.

This is why the comparison with an Eldorado and a gold rush is appropriate: we know there is valuable information to be extracted from big data, but we do not know exactly where to find it in the data set or how to extract it.

Consider the Twitter memes we introduced above. The research question of interest in this example is the evolution of memes and their drivers. To address this question, we must define where this information can be found and how to extract it.

The starting point for both questions is the intuition that the data set (tweets published online) contains a massive amount of information about memes: which ones are currently trending, how they form and spread, etc. However, it is unclear where the information can be found in this corpus (should one automatically detect a meme? Should a data expert list the memes?). There is also no methodology for extracting this information once located (what measures should we use to describe the evolution of memes? How do these measures relate? What measures are computationally feasible?). Finally, all these questions must be answered simultaneously to address the problem coherently.

The most successful solution to date is to frame very focused research questions and aim to capture only the information you are interested in, and nothing else. Defining the scope of the project too broadly risks corrupting the extracted information or limiting the ability to extract it in a scalable way. In effect, to reduce the confusion caused by addressing both the ‘where’ and the ‘how’ at the same time, research questions must be very precise and well defined.

Using big data to understand networks

The need for a focused approach can be made more concrete when considering network analysis. A network is made up of individual entities and the relationships between them. Examples include human social networks, neural networks, electric networks, and protein interaction networks.

We would expect that methods for analysing one type of network would apply to others. This is true in principle, but the research problem of interest for each network is different. For instance, when considering a social network, we are often interested in how communities form and interact. When considering a protein interaction network, we are interested in the role a particular protein plays in the overall network.

In terms of methodology, simultaneously addressing the global-scale community problem (in social networks) and the local-scale role problem (in protein networks) remains an open question. But by addressing each problem separately, efficient methods can be devised. Therefore, the research question must focus on one of the two aspects: either local or global.

The incompatibility of the two approaches is due to the very different mathematical tools used to address each problem. For community detection, the network is regarded as a large matrix, and tools usually rely on the eigenvectors of that matrix.
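As a minimal sketch of the algebraic view, the snippet below splits a small graph into two communities using the sign of one eigenvector. Several assumptions are made for illustration: the matrix used is the graph Laplacian (one common choice among several), the eigenvector is approximated with plain power iteration rather than a linear algebra library, the graph is assumed to be connected, and the function name and toy graph are hypothetical.

```python
import math


def spectral_two_way_split(adjacency, iterations=500):
    """Split a connected graph in two using the sign of the Fiedler vector.

    adjacency is a symmetric 0/1 matrix given as a list of lists. The
    Fiedler vector (the Laplacian eigenvector with the second-smallest
    eigenvalue) is approximated by power iteration on (shift * I - L),
    restricted to vectors orthogonal to the constant vector.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    # 2 * max degree is an upper bound on the largest Laplacian eigenvalue.
    shift = 2 * max(degrees)

    def apply_shifted(x):
        # Computes (shift * I - L) x with L = D - A.
        return [
            (shift - degrees[i]) * x[i]
            + sum(adjacency[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]

    x = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # project out the constant eigenvector
        x = apply_shifted(x)
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]

    # Label nodes by sign, with node 0 always placed in community 0.
    reference = x[0] >= 0
    return [0 if (v >= 0) == reference else 1 for v in x]


# Toy graph: two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
toy = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(spectral_two_way_split(toy))  # [0, 0, 0, 1, 1, 1]
```

The computation never looks at an individual node's neighbourhood directly: the grouping comes out of a global property of the whole matrix.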

On the other hand, to determine the role of each node, the network is considered as a graph and the occurrences of small shapes around a given node are enumerated. The former being algebraic in nature, and the latter combinatorial, it is not straightforward to combine the two approaches in a single method.
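The combinatorial view can be sketched just as briefly. The example below counts two small shapes around each node, triangles and open two-edge paths centred on the node, which is one simple choice of "small shapes" assumed here for illustration. The function name and the toy graph are hypothetical.

```python
from itertools import combinations


def local_shape_counts(neighbours, node):
    """Count small shapes around one node of an undirected graph.

    neighbours maps each node to the set of nodes it is linked to.
    Returns (triangles, open_wedges): the number of neighbour pairs that
    are themselves linked, and the number that are not.
    """
    triangles = 0
    open_wedges = 0
    for a, b in combinations(sorted(neighbours[node]), 2):
        if b in neighbours[a]:
            triangles += 1
        else:
            open_wedges += 1
    return triangles, open_wedges


# Toy graph: two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
toy_graph = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}
print(local_shape_counts(toy_graph, 0))  # (1, 0): sits inside one triangle
print(local_shape_counts(toy_graph, 2))  # (1, 2): also bridges to node 3
```

Nodes 0 and 2 belong to the same triangle, yet their local counts differ, which is the kind of distinction a role analysis is after.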

One set of tools is needed to identify what community a given agent belongs to, and another to describe the role of a given agent. However, there isn’t a tool that does both. As a consequence, it is necessary to decide in advance which of the two aspects is to be studied.

A threefold challenge

Like other sectors, academia sees the potential of big data. To harness this potential, academics face a threefold challenge: build teams with varied expertise, work on very focused yet meaningful research problems, and take ethical questions into account. Many such efforts are appearing in academia, and we can, hopefully, expect great breakthroughs in the near future.

Dr Pierre-André Maugis is a senior research fellow and head of technological transfer, Big Data Institute, UCL. He is also senior research associate at UCL’s Statistical Science Department. The views expressed in this article are the author’s own.


