We are witnessing an explosion of data, particularly digital data, in every facet of our economy and society. Alongside this phenomenon, the term ‘Big Data’ has gained increasing currency. Today, Big Data has reached a point of ubiquity where almost every business and technology journal, forum and research report is talking about it.
What is Big Data?
Simply put, “Big Data”
refers to datasets whose size is beyond the ability of typical database
software tools to capture, store, manage, and analyze.
A few concepts have become fundamental to defining big data. Commonly known as the “3Vs of Big Data”1, these have become the widely accepted parameters of big data:
1. Volume – The scale at which data becomes a problem, overwhelming the standard solutions already in place within the enterprise.
2. Velocity – The speed at which streaming data, event data and transient data arrive and must be processed.
3. Variety – The mix of structured and semi-structured data, dynamic schemas and non-relational data (see the sketch below).
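To make ‘variety’ concrete, here is a minimal Python sketch, with hypothetical field names, showing how semi-structured records of differing shapes resist a fixed relational schema:

    # Two records from the same ingestion stream, but with different shapes.
    # A fixed relational schema would need NULL-heavy columns to hold both.
    web_event = {"user_id": 42, "action": "click", "page": "/checkout"}
    sensor_event = {"device_id": "m7", "temp_c": 81.4, "readings": [80.9, 81.1, 81.4]}

    events = [web_event, sensor_event]  # a schema-less collection accepts any shape
    for e in events:
        print(sorted(e.keys()))  # each record carries its own implicit schema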
There is no universally accepted size, velocity or variety benchmark that identifies a dataset as big data. Different industries use different thresholds, and the needle keeps shifting with time.
Research from the University of California at Berkeley and IDC (sponsored by EMC) estimated that in 2010 enterprises globally stored 7 exabytes2 of data, while consumers stored 6 exabytes on their PCs and notebooks. By some reckoning, IBM claims that 90% of the world’s data has been created over the last two years alone. Consider this: by 2009, the average US company with 1,000+ employees stored 200 TB of data (twice the size of Wal-Mart’s data warehouse in 1999).
What is the source of all
this massive data?
This
volume and detail of information is being fuelled by data captured from
scientific measurements and experiments (astronomy, physics, genetics, etc.),
peer-to-peer communication (text messaging, chat lines, digital phone calls),
broadcasting (news, blogs), social networking (Facebook, Twitter), authorship
(digital books, magazines, web pages, images, videos), administrative
activities (enterprise or government documents, legal and financial records),
business (e-commerce, stock markets, business intelligence, marketing,
advertising), and more. Data coming in from multiple sources may well be incomplete, ambiguous and even inaccurate. And this is just the tip of the iceberg! As per Cisco estimates3, there were 8.7 billion connected devices in 2012; there will be 15 billion by 2015 and 40 billion by 2020. And as billions of physical devices get connected through networked sensors, the ‘exhaust data’ (i.e., data created as a by-product of other activities) from mobile phones, smart energy meters, automobiles, industrial machines, aeroplanes and the like will throw off terabytes of data in the age of the Internet of Things4.
What are the Promises of Big
Data?
1. Savings: Traditional analytics has been a costly affair; licensing fees and costly upgrades have stymied the use and spread of analytics. Big data changes all that. The sheer brute force of powerful algorithms has yielded time savings in organizing, analyzing and presenting data, over and above reduced software licensing and operating costs for ETL tools, data archiving and the like. Intel has already made significant strides in this direction.
2. Data-as-a-Service: Big data has led to the creation of a new business model/service within enterprises, where data is collected from multiple sources and made available for consumption. The primary focus of this function is defining entities and collecting raw data; one may think of it as something like a data library. This was not practical before. Higher volumes of transactional digital data are giving enterprises more accurate and detailed performance information, from exposing and analyzing variability to improving performance through better management decision-making.
3. A platform for data mix and match: Big data helps find insights across multiple transactional datasets. New types of data such as logs, sensor feeds and social media streams are now being analyzed alongside traditional tabular data. It is this form of analysis that was not possible before.
4. Customer Segmentation: One great application of big data is differentiating between customers in a more meaningful way, enabling more precisely tailored products and services. Tesco5, the European retail giant, for example, taps its loyalty program to collect customer purchase intelligence, which it then analyzes to inform a variety of decisions, including the micro-segmentation of its customers and improved promotions that ensure 30% fewer gaps on shelves. Segmentation could also help government agencies and politicians deliver higher-quality, customized civic engagement (as was demonstrated during Barack Obama’s presidential campaign). A simple sketch of micro-segmentation follows below.
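As an illustration (and not Tesco’s actual method), here is a minimal Python sketch that micro-segments customers by purchase behavior using k-means clustering; the features and segment count are hypothetical:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical loyalty-card features per customer:
    # [monthly spend, store visits per month, share of items bought on promotion]
    rng = np.random.default_rng(0)
    customers = rng.random((500, 3)) * [400.0, 12.0, 1.0]

    # Partition the customers into four behavioral micro-segments
    segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(customers)
    for s in range(4):
        members = customers[segments == s]
        print(f"segment {s}: {len(members)} customers, avg spend {members[:, 0].mean():.0f}")

Promotions can then be targeted per segment rather than per store, which is the mechanism behind the kind of shelf-gap and promotion improvements described above.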
5. Replace human decision-making with automated algorithms: Sophisticated analytics can substantially improve decision-making, minimize risks, and unearth valuable insights that would otherwise remain hidden. For example, Tesco5 uses models to understand the effect of discounts on sales; by discounting prices intelligently (i.e., not too early), Tesco raked in £30 million of ‘pure profit’. A toy version of such a discount-response model is sketched below.
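As a toy illustration (again, not Tesco’s actual model), the sketch below fits a simple linear discount-response model on made-up history to estimate how much an extra point of discount lifts sales:

    import numpy as np

    # Made-up history: discount offered (%) and resulting units sold
    discount = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
    units_sold = np.array([100.0, 112.0, 121.0, 135.0, 148.0, 155.0, 171.0])

    # Ordinary least squares fit: units ~ slope * discount + base
    slope, base = np.polyfit(discount, units_sold, deg=1)
    print(f"each extra point of discount adds ~{slope:.1f} units (baseline {base:.0f})")

    # A pricing engine would discount only when the predicted lift in sales
    # outweighs the margin given away, i.e., not too early and not too deep.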
6. New Products, Services or Business Models: Big data can be used to improve the development of the next generation of products and services. For instance, manufacturers and automobile firms could use data from networked sensors embedded in their products to create innovative after-sales offerings such as proactive maintenance, while those in healthcare could accelerate the development of new drugs by using advanced analytics.
7. Delivering high-performance applications: Big data provides alternative capabilities through NoSQL databases, which are designed around application-specific transactions rather than being driven by a rigid underlying data model, as the sketch below illustrates.
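As one illustration, here is a minimal Python sketch using MongoDB through the pymongo driver; the database, collection and field names are hypothetical, and a MongoDB server is assumed to be running locally:

    from pymongo import MongoClient  # pip install pymongo

    client = MongoClient("localhost", 27017)
    orders = client["shop"]["orders"]  # hypothetical database and collection

    # Store the document exactly as the application consumes it: no up-front
    # schema, and line items nested in place of a normalized multi-table join.
    orders.insert_one({
        "order_id": 1001,
        "customer": "C-42",
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    })

    # A single lookup serves the "show my orders" transaction directly.
    for doc in orders.find({"customer": "C-42"}):
        print(doc["order_id"], len(doc["items"]))

The design choice is that the storage layout mirrors the transaction, rather than the transaction being assembled at query time from a general-purpose relational model.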
What are the Perils of Big Data?
1. Overlooking data veracity: Over the years, enterprises, institutions and governments have built massive datasets. However, much of this development has taken place in an environment replete with siloed systems, poor processes and inconsistent methods of data input. The result: mounds of incorrect, imprecise, duplicate and, in many cases, uncertain data. According to a recent Experian QAS® study6, 36% of U.S. marketing organizations interact with customers and prospects through 5 or more channels, 94% of businesses suspect their customer and prospect data might have inaccuracies, and on average as much as 17% of information in cross-channel marketing databases is believed to be wrong. Gartner7 predicts that poor data quality reduces overall labor productivity by as much as 20%. No big data initiative can be fruitful without data veracity; this is the single biggest risk and impediment to big data initiatives.
2. Using untreated data: Traditional DW/BI architectures have devoted considerable resources to ETL/MDM, but given the variety and velocity of big data, automated data preparation and cleansing tools are still immature8. Yet any attempt to use big data platforms as a container into which data is loaded without any classification, categorization or entity definition will only result in failure. Metadata is likewise an absolute prerequisite for deriving the desired value from big data. A minimal cleansing sketch follows below.
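As a small illustration of the kind of treatment raw data needs before it is loaded, here is a minimal pandas sketch (the columns and rules are hypothetical) that normalizes entity keys and de-duplicates customer records:

    import pandas as pd

    # Hypothetical raw feed: inconsistent casing, stray whitespace, duplicates
    raw = pd.DataFrame({
        "email": [" Ann@X.com", "ann@x.com", "bob@y.com", None],
        "spend": [120, 120, 85, 40],
    })

    clean = raw.dropna(subset=["email"]).copy()              # drop rows missing the entity key
    clean["email"] = clean["email"].str.strip().str.lower()  # normalize the key
    clean = clean.drop_duplicates(subset=["email"])          # one row per customer
    print(clean)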
3. Implementing big data in silos: Siloed implementations not only focus on ‘local’ problems; they also fail to exploit the potential of data mix ’n’ match, thereby losing the benefits of big data in terms of analytics, data-as-a-service and so on.
4. Not knowing your problem well beforehand: Expecting every problem to be solvable with big data is folly. Not knowing the problem and yet expecting big data to solve it is setting yourself up for failure. Agreed, data science is supposed to answer unanswered questions, but there are still boundaries. Exploring the unknown without any boundaries takes significant time, because the boundaries have to be defined first. This is why initial implementations of big data are proving time-consuming and costly, and enterprises should be ready for that.
The data-to-information-to-actionable-insight value chain is tougher than one can imagine. The problems of heterogeneity, scale, timeliness and complexity remain. For instance, most online data is unstructured, and little value can be derived unless data items are appropriately ‘linked’. Data analysis, retrieval and modeling are the other foundational challenges. Given the scale of data and the underlying algorithms, analysis has hit a bottleneck. Finally, the presentation of results and their interpretation by non-technical audiences have proven to be a major impediment to successful big data implementations.
Besides this, when you search for patterns in very large datasets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have little or no predictive power. Even worse, the strongest patterns might be:
o Entirely caused by chance (just as a lottery winner wins purely by chance),
o Not replicable,
o Of little predictive power themselves, while obscuring weaker patterns that are ignored yet have strong predictive power.
The short simulation below illustrates the first point.
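To see how easily chance patterns appear, here is a short Python simulation: thousands of purely random ‘metrics’ are correlated against a random target, and the strongest correlation still looks impressive despite being pure noise (a multiple-comparisons effect):

    import numpy as np

    rng = np.random.default_rng(1)
    target = rng.standard_normal(100)           # the outcome we "want to predict"
    metrics = rng.standard_normal((2000, 100))  # 2,000 unrelated random metrics

    # Correlate every metric with the target and keep the strongest one
    corrs = np.array([np.corrcoef(m, target)[0, 1] for m in metrics])
    best = np.abs(corrs).argmax()
    print(f"strongest |correlation| = {abs(corrs[best]):.2f}, yet every metric is noise")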
In fact, the difficulty of realizing value has perhaps become one of the biggest challenges in big data implementations.
5. Insufficient focus on addressing the talent gap: Much has been said and published about the looming talent gap in the area of big data. McKinsey has projected a 50-60% shortfall in the US by 2018. Gartner has echoed similar sentiments, stating that only one-third of 4.4 million big data jobs will be filled by 2015. Unlike conventional analytics, big data requires data scientists who bring together a very diverse set of skills: deep business insight, data visualization, statistics, machine learning and computer programming. Policy makers have a significant role to play in mitigating this talent shortage through education and immigration policy.
6. Analytical overkill, cognitive bias & human limits: Business decisions – to invest or not, to retain or let go, and so forth – are based on human judgment, and judgments are contextual, social and value-led. This is where a computer-driven, data-analysis-led approach might fail. Not only does data fail to capture context; enormous data can also feed false hypotheses and cognitive biases, whose falsity only grows with more data9. It pays to keep in mind that data analysis can yield ‘false positive’ signals, both because of the choice of statistical algorithms and because the interpretation of the results is often done by those who lack expertise.
7. Data security and privacy (and the risk of misinterpretation): The digital ‘breadcrumbs’ we leave behind as we go about our everyday lives create a trail of behavior that is not only followed, captured, stored and mined en masse, but also discreetly fed into machines and run through seemingly workable algorithms, all with the purpose of identifying correlated phenomena. The issue of privacy (and the associated risk of misinterpretation) is scary and throws open many uncomfortable questions: Who decides if an algorithm is fool-proof? Who determines whether a seemingly ‘predicted’ behavior mirrors an actual one? What defines “fair” use of data? Who possesses the legal rights to ‘mine’ your data? Who speaks for us? Who is responsible when an inaccurate piece of data leads to unintended, or worse still, negative consequences?
Like any emerging technology area, big data faces its share of systemic challenges. Despite this, sectors such as computing, electronics, information technology, finance, insurance and government are already making headway in deriving competitive advantage from big data. Yet there is a long way ahead. Policies related to privacy, security, intellectual property and even liability will need to be addressed by governments and policy-makers. Organizations will need not only to put the right talent and technology in place but also to structure workflows and incentives to optimize the use of big data.
References:
Research Papers:
“Big Data: The next frontier for innovation, competition, and productivity.” McKinsey Global Institute Research Report. June 2011.
1 Doug Laney, Gartner Analyst. “3D Data Management: Controlling Data Volume, Variety and Velocity.” MetaGroup Research Publication. February 2001.
2 One exabyte of data is the
equivalent of more than 4,000 times the information stored in the US Library of
Congress.
3 Dave Evans. “The Internet of Things: How the Next
Evolution of the Internet Is Changing Everything.” A Cisco Internet
Business Solutions Group (IBSG) Research Paper. April 2011.
George A. Miller. “The magical number seven, plus or minus two: Some limits on our capacity for processing information.” Psychological Review, Volume 63(2), March 1956: 81–97.
7 Ted Friedman, Michael Smith. “Measuring the Business Value of Data Quality.” Gartner Research Report. October 2011.
Cisco's 2012 Visual Networking Index (VNI) Forecast.
McKinsey & Company Business Technology Office report. 2011.
Internet Articles:
“Big Data and Government Transparency.” Applied Data Labs (A Data Technology Research and Advisory Lab).
4 “How Many Things Are Currently Connected To The ‘Internet of Things’ (IoT)?” Rob Soderbery, Cisco Executive.
5 “Tesco uses data for more than just loyalty cards.” Paul Miller, Cloud of Data. October 2012.
6 “Poor quality data can hurt cross-channel marketing efforts.” Erin Haselkorn, Experian QAS® Marketing Services.
8 “Realizing Big Data Benefits: The Intersection of Science and Customer Segmentation.” Neil Blehn, Wired Insights. June 7, 2013.
9 “The Hidden Biases in Big Data.” Kate Crawford, Harvard Business Review.

