Tuesday, 16 July 2013

Technology: The Promise and Perils of Big Data


It is common knowledge that we are witnessing an explosion of data, particularly digital data, in every facet of our economy and society. The term ‘Big Data’ has become shorthand for this phenomenon and has steadily gained currency. Today, Big Data is so ubiquitous that almost every business and technology journal, forum and research report discusses it.

What is Big Data? 

Simply put, “Big Data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.

A few concepts have become fundamental to defining big data. Commonly known as the “3V’s of Big Data1”, these are now its universally accepted parameters:
1.     Volume – data at a scale that becomes a problem, overwhelming the standard solutions that exist within the enterprise.
2.     Velocity – high-velocity streaming data, event data and transient data.
3.     Variety – structured and semi-structured data; dynamic schemas and non-relational data.
There is no universally accepted size, velocity or variety threshold that qualifies a dataset as big data. Different industries have different benchmarks, and with time the needle keeps shifting.
A University of California, Berkeley–IDC study (sponsored by EMC) estimated that in 2010 enterprises globally stored 7 exabytes2 of data, while consumers stored 6 exabytes on their PCs and notebooks. IBM reckons that 90% of the world’s data has been created in the last two years alone. Consider this – by 2009, the average US company with 1,000+ employees stored 200 TB of data (twice the size of Wal-Mart’s data warehouse in 1999).



What is the source of all this massive data? 

This volume and detail of information is being fuelled by data captured from scientific measurements and experiments (astronomy, physics, genetics, etc.), peer-to-peer communication (text messaging, chat lines, digital phone calls), broadcasting (news, blogs), social networking (Facebook, Twitter), authorship (digital books, magazines, web pages, images, videos), administrative activities (enterprise or government documents, legal and financial records), business (e-commerce, stock markets, business intelligence, marketing, advertising), and more. Data coming in from multiple sources can be incomplete, ambiguous and even inaccurate. And this is just the tip of the iceberg! Per Cisco estimates3, there were 8.7 billion connected devices in 2012; there will be 15 billion by 2015 and 40 billion by 2020. And as billions of physical devices get connected with networked sensors, the ‘exhaust data’ (i.e., data created as a by-product of other activities) from mobile phones, smart energy meters, automobiles, industrial machines, aeroplanes etc. will generate terabytes of data in the age of the Internet of Things4.

What are the Promises of Big Data?

1.     Savings: Traditional analytics has been a costly affair; licensing fees and costly upgrades have stymied its use and spread. Big data changes all that. The sheer brute force of powerful algorithms yields time savings in organizing, analyzing and presenting data, over and above reduced software licensing and operating costs for ETL tools, data archiving and the like. Intel has already made significant strides in this direction.

2.     Data-as-a-Service: Big data has led to the creation of a new business model/service within enterprises, where data is collected from multiple sources and made available for consumption. The primary focus of this function is defining entities and collecting raw data – one may think of it as a data library. This concept was not possible before. Higher volumes of transactional digital data are helping enterprises produce more accurate and detailed performance information, from exposing and analyzing variability to improving performance through better management decision-making.

3.     A platform for data mix and match: Big data helps find insights across multiple transactional datasets. New types of data – logs, sensor readings, social media feeds etc. – are now being analyzed over and above traditional tabular data. It is this form of analysis that wasn’t possible before.

4.     Customer Segmentation: One great example of big data is differentiating between customers in a more meaningful way, with more precisely tailored products or services. For example, Tesco5, the European retail giant, taps its loyalty program to collect customer purchase intelligence, which it then analyzes to inform a variety of decisions, including the micro-segmentation of its customers and improvements to promotions that ensure 30% fewer gaps on shelves. Segmentation could also help government agencies and politicians deliver higher-quality, customized civic engagement (as was done during Barack Obama’s presidential campaign).
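To make the idea concrete, micro-segmentation of loyalty-card data can be sketched with a simple RFM (recency, frequency, monetary) scoring pass. This is a minimal, hypothetical illustration – all record names, thresholds and segment labels are assumptions, not Tesco’s actual method:

```python
from collections import defaultdict
from datetime import date

# Hypothetical loyalty-card transactions: (customer_id, purchase_date, amount)
transactions = [
    ("C1", date(2013, 7, 1), 45.0),
    ("C1", date(2013, 7, 10), 30.0),
    ("C2", date(2013, 5, 2), 12.5),
    ("C3", date(2013, 7, 12), 80.0),
    ("C3", date(2013, 6, 20), 60.0),
    ("C3", date(2013, 7, 14), 25.0),
]
TODAY = date(2013, 7, 16)

# Aggregate per-customer recency, frequency and monetary value.
stats = defaultdict(lambda: {"last": None, "freq": 0, "spend": 0.0})
for cust, day, amount in transactions:
    s = stats[cust]
    s["last"] = day if s["last"] is None else max(s["last"], day)
    s["freq"] += 1
    s["spend"] += amount

def segment(s):
    # Illustrative thresholds; a real retailer would derive these from data.
    recency = (TODAY - s["last"]).days
    if recency <= 14 and s["freq"] >= 2 and s["spend"] >= 100:
        return "loyal-high-value"   # target with premium promotions
    if recency <= 14:
        return "active"             # cross-sell candidates
    return "lapsing"                # win-back campaigns

segments = {cust: segment(s) for cust, s in stats.items()}
print(segments)
```

At scale the same aggregation runs over billions of transactions, but the logic – roll up behavior per customer, then assign each customer to an actionable segment – is unchanged.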

5.     Replace human decision-making with automated algorithms: Sophisticated analytics can substantially improve decision-making, minimize risks, and unearth valuable insights that would otherwise remain hidden. Example: Tesco5 uses models to understand the effect of discounts on sales. By intelligently timing its discounts (i.e. not marking down too early), Tesco raked in £30 million in ‘pure profit’.

6.     New Products, Services or Business Models: Big data can be used to improve the development of the next generation of products and services. For instance, manufacturers could use data from networked sensors embedded in their products – automobiles being a prime example – to create innovative after-sales offerings such as proactive maintenance, while those in healthcare could accelerate the development of new drugs by using advanced analytics.

7.     Delivering high-performance applications: Big data provides alternative capabilities via NoSQL databases, which are designed around application-specific access patterns rather than being driven by the underlying relational data model.
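The contrast can be illustrated with a toy document store: instead of normalizing an order across several relational tables and joining them at read time, the application stores each order as one document keyed by the lookup it actually performs. A minimal in-memory sketch, with plain Python dicts standing in for a key-value/document database (all names are illustrative):

```python
# A toy in-memory document store: data is laid out around the application's
# access pattern (fetch everything about one order in a single lookup),
# rather than normalized into separate tables joined at read time.
orders = {}  # order_id -> whole order document

def put_order(order_id, customer, items):
    # Denormalized write: the customer snapshot and line items are embedded,
    # so no join is needed when the application reads the order back.
    orders[order_id] = {
        "customer": customer,
        "items": items,
        "total": sum(i["price"] * i["qty"] for i in items),
    }

def get_order(order_id):
    # One key lookup returns the complete document -- the fast read path
    # that NoSQL stores optimize for.
    return orders[order_id]

put_order("o-1001",
          {"id": "c-7", "name": "Asha"},
          [{"sku": "tea", "price": 3.0, "qty": 2},
           {"sku": "milk", "price": 1.5, "qty": 1}])
print(get_order("o-1001")["total"])  # 7.5
```

The trade-off is deliberate: reads and writes are fast and horizontally scalable, but ad hoc queries that cut across documents become the application’s responsibility.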

What are the Perils in Big Data?


1.     Overlooking data veracity: Over the years, enterprises, institutions and governments have built massive datasets. However, much of this accumulation has taken place in an environment replete with ‘silo-ed’ systems, poor processes and inconsistent methods of data input. The result – mounds of incorrect, imprecise, duplicate and, in many cases, uncertain data. According to a recent Experian QAS® study6, 36% of U.S. marketing organizations interact with customers and prospects through 5 or more channels, 94% of businesses suspect their customer and prospect data contain inaccuracies, and on average as much as 17% of the information in cross-channel marketing databases is believed to be wrong. Gartner7 predicts that poor data quality reduces overall labor productivity by as much as 20%. No big data initiative can be fruitful without data veracity. This is the single biggest risk and impediment to big data initiatives.
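A first-pass veracity audit of the kind these figures call for can be automated with simple profiling rules. The sketch below runs duplicate, format and range checks over a hypothetical customer table (field names and rules are assumptions for illustration):

```python
import re

# Hypothetical customer records accumulated from siloed systems.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "a@example.com", "age": 34},   # duplicate of record 1
    {"id": 3, "email": "not-an-email", "age": 29},    # malformed email
    {"id": 4, "email": "d@example.com", "age": -5},   # impossible value
]

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # crude format check

def audit(rows):
    """Count duplicates, malformed emails and out-of-range values."""
    seen, issues = set(), {"duplicate": 0, "bad_email": 0, "bad_age": 0}
    for r in rows:
        key = (r["email"], r["age"])          # crude duplicate key
        if key in seen:
            issues["duplicate"] += 1
        seen.add(key)
        if not EMAIL.match(r["email"]):
            issues["bad_email"] += 1
        if not 0 <= r["age"] <= 120:
            issues["bad_age"] += 1
    return issues

print(audit(records))  # {'duplicate': 1, 'bad_email': 1, 'bad_age': 1}
```

Profiling like this does not fix veracity by itself, but it makes the scale of the problem measurable before any analytics is attempted.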

2.     Using untreated data: Traditional DW/BI architectures have spent considerable resources on ETL/MDM, but given the variety and velocity of big data, automated data preparation and cleansing tools are still immature8. Yet any attempt to use big data platforms as a container to load data without classification, categorization, entity definition etc. will only end in failure. Metadata is also an absolute prerequisite for deriving the desired value from big data.

3.     Implementing big data in silos: Siloed implementations not only address merely ‘local’ problems; they also forgo the potential of data mix ’n’ match and thereby lose the benefits of big data in terms of analytics, data-as-a-service and so on.

4.     Not knowing your problem well beforehand: Expecting that every problem can be solved using big data is folly. Not knowing the problem and yet expecting big data to solve it is setting up for failure. Agreed, data science is supposed to answer unanswered questions, but there are still boundaries. Discovering the unknown without any boundaries takes significant time, because the boundaries need to be defined first. This is why initial implementations of big data are proving time-consuming and costly, and enterprises should be ready for that.

The data-to-information-to-actionable-insight value chain is tougher than one can imagine. The problems of heterogeneity, scale, timeliness and complexity remain. For instance, most online data is unstructured, and little value can be derived unless data items are appropriately ‘linked’. Data analysis, retrieval and modeling are the other foundational challenges; given the scale of data and the underlying algorithms, analysis has hit a bottleneck. Finally, the presentation of results and their interpretation by non-technical experts has proven to be a major impediment to successful big data implementations.

Besides this, when you search for patterns in very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have little or no predictive power. Even worse, the strongest patterns might be:
o    entirely caused by chance (just as a lottery winner wins purely by chance),
o    not replicable,
o    of little predictive power themselves, while obscuring weaker patterns that are ignored yet have strong predictive power.

As a matter of fact, the difficulty in realizing value has perhaps become one of the biggest challenges to big data implementations.

5.     Insufficient focus on addressing the talent gap: Much has been said and published about the looming talent gap in big data. McKinsey has projected a 50–60% shortfall in the US by 2018. Gartner has echoed similar sentiments, stating that only one-third of 4.4 million big data jobs will be filled by 2015. Unlike conventional analytics, big data requires data scientists who bring together a very diverse set of skills: deep business insight, data visualization, statistics, machine learning and computer programming. Policy makers have a significant role to play in mitigating this shortage through education and immigration policy.

6.     Analytical overkill, cognitive bias & human limits: Business decisions – to invest or not, to retain or let go, and so on – are based on human judgment. And judgments are contextual, social and value-led. This is where a computer-driven, analysis-led approach might fail. Not only does data fail to capture context; enormous data can also feed false hypotheses and cognitive biases – whose falsity only grows with more data9. It pays to keep in mind that data analysis can yield ‘false positive’ signals, both because of the choice of statistical algorithms and because the interpretation of the results is often done by those who lack expertise.

7.     Data security and privacy (and the risk of misinterpretation): The digital ‘breadcrumbs’ we leave behind as we go about our everyday lives create a trail of behavior that is not only followed, captured, stored and mined ‘en masse’; it is also discreetly fed into machines and run through seemingly workable algorithms – all with the purpose of identifying correlated phenomena. The issue of privacy (and the associated risk of misinterpretation) is scary and throws open many uncomfortable questions – Who decides if an algorithm is fool-proof? Who determines if a seemingly ‘predicted’ behavior mirrors an actual one? What defines “fair” use of data? Who possesses the legal rights to ‘mine’ your data? Who speaks for us? Who is responsible when an inaccurate piece of data leads to unintended – or worse still, negative – consequences?

Like any emerging technology area, big data too faces its share of systemic challenges. Despite this, sectors like computers, electronics, information technology, finance, insurance and government are already making headway in deriving competitive advantage from big data. Yet there is a long way ahead. Policies related to privacy, security, intellectual property and even liability will need to be addressed by governments and policy-makers. Organizations will need not only to put the right talent and technology in place but also to structure workflows and incentives to optimize the use of big data.

References:

Research Papers:

“Big Data: The next frontier for innovation, competition, and productivity.”
McKinsey Global Institute Research Report. Dated: June 2011

1 Doug Laney, Gartner Analyst. “3D data management: Controlling data volume, variety and velocity.” MetaGroup Research Publication. Feb 2001

2 One exabyte of data is the equivalent of more than 4,000 times the information stored in the US Library of Congress.

3 Dave Evans. “The Internet of Things: How the Next Evolution of the Internet Is Changing Everything.” A Cisco Internet Business Solutions Group (IBSG) Research Paper. April 2011.

George A. Miller, “The magical number seven, plus or minus two: Some limits on our capacity for processing information,” Psychological Review, Volume 63(2), March 1956: 81–97.

Ted Friedman, Michael Smith. “Measuring the Business Value of Data Quality”. Gartner Research Report. October 2011

Cisco's 2012 Visual Networking Index (VNI) Forecast.  

McKinsey & Company’s Business Technology Office report in 2011

Internet Articles:

“Big Data and Government Transparency”. Applied Data Labs (A Data Technology Research and Advisory Lab).

4 “How Many Things Are Currently Connected To The ‘Internet of Things’ (IoT)?”. Rob Soderbery, Cisco Executive.

5 “Tesco uses data for more than just loyalty cards”. Paul Miller, Cloud of Data. October 2012.

6 “Poor quality data can hurt cross-channel marketing efforts”. Erin Haselkorn, Experian QAS® Marketing Services.

7 “Data Veracity”. Michael Walker, Data Science Central. November 28, 2012.

8 “Realizing Big Data Benefits: The Intersection of Science and Customer Segmentation”. Neil Blehn, Wired Insights. June 7, 2013.

9 “What Data Can’t Do”. David Brooks, The New York Times. February 2013.
“The Hidden Biases in Big Data”. Kate Crawford, Harvard B

