The science of analysing (big) data

The term big data is ubiquitous. But what does it mean and how much value does it really add?

Some argue that big data is a fad, a shiny rebranding of the science of analysing data that smacks of more than a hint of snake oil. But the zeitgeist is that big data is the new thing that can deliver insights and value for all.

There is a neat logic to the concept of big data: the more information we collect, the more knowledge we can acquire and put to good use. But it is as yet unclear what value can be unlocked in the masses of data our world is producing.

The 10th Colloquium of the Scottish Financial Risk Academy (SFRA), Big data and its impact on financial services, set out to explore the value of big data for the world of finance. OnRisk Dialogues is a collaboration between the SFRA and Solvency II Wire that brings together a collection of articles from speakers at the colloquia.

Is it BIG data?

The biggest unit listed on Wikipedia today is a yottabyte – 1 with 24 zeros behind it (or 10^24 bytes) or, to put it in the context of home computing, one trillion terabytes. The largest data sets used in big data currently are only of the order of petabytes (1 with 15 zeros behind it).
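To set those orders of magnitude side by side, a short back-of-the-envelope check in Python (an illustration added for context, not part of the Colloquium material):

# Back-of-the-envelope check of the units mentioned above.
terabyte = 10**12    # bytes
petabyte = 10**15    # bytes - roughly the scale of today's largest big data sets
yottabyte = 10**24   # bytes - the largest unit listed on Wikipedia

print(yottabyte // terabyte)   # 1000000000000 -> one trillion terabytes
print(yottabyte // petabyte)   # 1000000000 -> a billion petabytes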

But as speakers at the Colloquium noted, ‘big’ is not necessarily the defining characteristic of big data.

The many v’s of big data

Defining big data can be as perplexing an exercise as extracting value from the data itself. Big data is most commonly defined by a series of v’s. Although there is plenty of debate about which ones really count, speakers highlighted the following seven:

Volume – the size of the data set.

Variety – use of multiple data sources, including structured data (CSV, JSON, XML) and unstructured data (text, audio, video).

Velocity – the speed at which data can be collected and processed.

Veracity – uncertainty about the quality and reliability of the data due to the size and range of data sources.  

Variability – the changing meaning of the data based on context. For example, in sentiment analysis of tweets the word ‘sick’ means different things when used by a doctor or by a teenager (see the short sketch after this list).

Visualisation – the use of interactive visualisation to vary the data sample and analyse the results.

Value – the relative usefulness of the resulting analysis.
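As a rough illustration of the variability point above (a toy sketch with made-up labels, not drawn from the Colloquium), a few lines of Python show how the same word can flip sentiment depending on its context:

# Toy example: the word 'sick' carries opposite sentiment depending on context.
CONTEXT_SENTIMENT = {
    ("sick", "clinical"): "negative",  # a doctor describing a patient
    ("sick", "slang"): "positive",     # a teenager praising something
}

def score(word, context):
    """Return a crude sentiment label for a word in a given context."""
    return CONTEXT_SENTIMENT.get((word.lower(), context), "neutral")

print(score("sick", "clinical"))  # negative
print(score("sick", "slang"))     # positive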

Data lakes

Jeff Richmond, Cloud Enterprise Architecture, Oracle UK, noted that one of the growing trends in the industry is for organisations to co-locate all their data into a central data set called a Data Lake, Data Pool or Data Reservoir. The Data Reservoir allows anyone with access to explore and analyse data sets and their inter-relationships with relative ease. 
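A minimal sketch of the idea (illustrative only – the file names and layout are hypothetical, not Oracle’s implementation): data of different shapes sits in one shared store and can be pulled out and inspected ad hoc.

# Explore whatever happens to be in a shared 'data reservoir' directory.
from pathlib import Path
import csv
import json

reservoir = Path("data_reservoir")  # hypothetical shared location

for f in sorted(reservoir.glob("*")):
    if f.suffix == ".json":
        records = json.loads(f.read_text())        # structured
    elif f.suffix == ".csv":
        records = list(csv.DictReader(f.open()))   # structured
    else:
        records = f.read_text()                    # unstructured text
    print(f.name, type(records).__name__)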

In his article, Data lifecycle in a big data world, he compares the data lifecycle of traditional data warehousing with big data and argues that the added flexibility in big data analytics gives rise to serious problems with data governance.

As he puts it, “In data warehousing there is a single universal data set against which results can be tested (e.g. in economic data or results from medical trials) – that stability of the data source is eroded significantly in big data.”

These questions become even more pertinent in the context of financial risk modelling, as highlighted in Issue 1 of OnRisk Dialogues: The Challenges of Risk Models.

Big data in academia

The rise of big data is a product of rapid advances in storage and computer technology. Although the idea of mass data storage emerged around the early 2000s, recent growth in communication, social media and the Internet of Things (machine-to-machine communication) has led to an explosion of the data universe. This is fuelling a new sort of gold rush.

It seems like everybody is on the bandwagon with their digital picks and shovels, looking to extract value from the ever-expanding data mass. The analogy between panning for gold and big data is not too wide of the mark: trawling through vast amounts of meaningless ‘dirt’ in search of coveted ‘gold nuggets’ in the form of insights and value.

The rush has not been missed by academia. Dr Pierre-André Maugis, senior research fellow at the UCL Big-Data Institute, demonstrates in his article, Big data in Academia, how big data research is driving new multi-stakeholder collaborations in academia, bringing together a wide range of disciplines from statistics, to social sciences and the humanities.

Academics face unique challenges, as their work (including source data) is often made public, while at the same time academic institutions are subject to more stringent data privacy rules.

The article also gives a flavour of the challenges of making use of big data and how it is driving a paradigm shift in approaches to research methodology.

Big data in health insurance

Some of the benefits and challenges of using big data in insurance are explained by Pierre du Toit, head of big data analytics, Vitality (UK). In Using big data to enhance and protect lives, he discusses integrating clinical and lifestyle information, including data from fitness tracking devices, and the benefits it can bring to both the insurer and its customers. These include helping customers improve their health, lowering claims for the insurer and developing a closer relationship through more frequent contact.

Security and privacy

No discussion of big data would be complete without addressing the impact of privacy and security.

Throughout the Colloquium, exchanges between speakers and the audience highlighted the inherent tension in using vast anonymised data sets derived from the personal data of individuals.

Some argued that the extent of personal data collected and stored by firms and governments and its potential impact on individuals are not yet fully realised by the general public, and there is a danger of a backlash. The impact of individuals either refusing to share personal data or demanding payment for their data could be devastating both to scientific learning and the economy that has been built around the use of personal data.

One

We know that everybody is talking about big data and that many are exploring its potential. As a relatively young discipline it faces many challenges, not least the need to prove that it can deliver on its promise.

The contributions to this issue of OnRisk Dialogues bring detailed insights into this often mystifying and at times mind-boggling area of knowledge, but they also highlight the fact that with new technologies come new dangers that must be properly addressed. From concerns about data quality and governance, to security and privacy issues, we are reminded that in big data, one is still the most important number.

Gideon Benari is the editor of Solvency II Wire. The views expressed in this article are the author’s own.
