If you read the book (Cloud Basics – The Paradigm Shift), you may recall I said it was really meant to be a Blog Post, or a Series of Posts, that ended up becoming a book.
As I write my first post on Big Data, I can clearly see the need for a Book, or maybe even a series of Books, to help clear up the mist around Big Data.
Moreover, there is a need to clarify what Big Data means in relation to the Cloud.
We start today on a three-part post to help simplify the concept of Big Data. This first part tries to answer the question: What is Big Data?
In June 2011, McKinsey came out with a 156-page report called “Big data: The next frontier for innovation, competition, and productivity”.
If you have the time, you can download and read that report and spend some time cutting through the web, or else just continue reading this post, for the crunching has already been done!
What is Big Data?
a) Big data is data that exceeds the processing capacity of conventional database systems (1)
b) Large Organizations need to maintain large amounts of Structured and Unstructured Data (2)
c) Real Time Data Storage, Correlation & Analysis (3)
d) Virtual Scale Out for Data Storage & Analytics (4)
Each of the above may be somewhat true or not so true, depending on who is asking the question about Big Data.
Big Data is data that is ‘Too Big’
Yes, the quantum of data is constantly increasing; however, ‘too big’ is a relative term. It can mean anything from a Terabyte (1 TB = 1000 GB) to a Petabyte (1 PB = 1000 TB) to an Exabyte (1 EB = 1000 PB). Per a McKinsey study, in 2009 an average Enterprise had over 100 TB of data and some Enterprises had Petabytes of Data.
We cannot ignore the exponential growth of Data and the rise in revenues of Storage firms from 2009 to 2012 (today). Hence, the average Enterprise clearly has over 100 TB of data today, and that is big compared to an average user’s 500 GB Hard Drive. One would need 200 such Hard Drives to store 100 TB of data!
At the high end of the spectrum, a large securities or banking firm on Wall Street could easily have 4000 TB of Data. One would need 8000 Hard Drives of 500 GB each to store that data. Now that sure is Big Data!
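The hard drive arithmetic above can be checked with a few lines of Python. This is only a rough sketch: the function name `drives_needed` is made up for illustration, and it assumes the decimal units used in this post (1 TB = 1000 GB) and drives that can be filled to their full 500 GB.

```python
# Decimal storage units, as used in this post (1 TB = 1000 GB).
GB_PER_TB = 1000


def drives_needed(data_tb, drive_gb=500):
    """Return how many drives of drive_gb each are needed to hold data_tb terabytes.

    drives_needed is a hypothetical helper written for this post.
    """
    total_gb = data_tb * GB_PER_TB
    # Ceiling division, so that a partially filled drive still counts as one drive.
    return -(-total_gb // drive_gb)


print(drives_needed(100))   # average Enterprise figure from the post: 200
print(drives_needed(4000))  # large Wall Street firm figure from the post: 8000
```

Real deployments would need more drives than this, since the sketch ignores redundancy, backups and file system overhead.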
Big Data is about Unstructured Data
Yes, databases evolved over the past few decades mainly geared to handling Structured (Relational) data. However, we must also recognize that Streaming Data (e.g. videos), Time Series data (e.g. stock tickers) and other forms of non-relational data have existed for at least the past decade.
This streaming or time series data can primarily be thought of as unstructured data. However, this data was ‘effectively stored’ in some database or storage, and it did carry some structure. A few examples of efficient storage and structure are:
1. Streaming Caches used for Caching Real Time Market Data so an Algorithmic Trade can be made.
2. A CDN (Content Delivery Network – a set of streaming servers) that stores video files and can efficiently stream a single video file on demand to millions of users.
3. Time Series Databases that store Tick Data (Stock Last Traded Price data that you see at the bottom of your TV News Channel) which is used for Back Testing of Trading Strategies by Hedge Funds.
Big Data is about Real-Time
Yes, traditional databases and data warehouse models meant data had to be stored for subsequent processing and analysis. The Extract, Transform, Load process, which is how most Data Warehouses are primed with data, is almost always a Batch (as opposed to Real-Time) process. Insights were often drawn from what could be thought of as historical data.
Getting more real time insight really meant moving away from the traditional database model. Business demands in the past decade for Real Time insights led to ‘in stream’ processing database products. These ‘in stream’ databases have been in use on Wall Street for over a decade now.
Therefore, a Real Time Database or Real Time Data Analysis is not something new. Furthermore, since these systems exist on Wall Street, we can assume that the volumes they handle are also ‘Big’.
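To make the Batch versus ‘in stream’ distinction concrete, here is a minimal Python sketch. It is an illustration only and not how any particular database product works: the names `batch_average` and `RunningAverage` and the tick prices are all made up for this post.

```python
def batch_average(prices):
    """Batch style: all prices are stored first, then analysed in one pass."""
    return sum(prices) / len(prices)


class RunningAverage:
    """In-stream style: the result is updated as each tick arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, price):
        self.count += 1
        # Incremental mean: move the old mean toward the new value.
        self.mean += (price - self.mean) / self.count
        return self.mean


ticks = [10.0, 12.0, 11.0, 13.0]  # hypothetical last traded prices

stream = RunningAverage()
for price in ticks:
    latest = stream.update(price)  # an up-to-date answer after every tick

# Both approaches agree once all ticks have been seen.
print(batch_average(ticks), latest)
```

The batch version only has an answer once the whole data set is in place, while the in-stream version has a current answer after every tick and never needs to hold the full history in memory.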
However, with the explosive growth of the Internet and Social Media, the quantum of Data that needs to be processed in Real Time has increased significantly.
Tomorrow’s Enterprise will demand insight not just from all the data within the enterprise but also from the growing Social ecosystem created around Enterprises by Social Destinations like Twitter, Facebook, LinkedIn etc.
Hence, it may be fair to say that the Real Time systems of today do not have the capacity to handle the Big Data Real Time demands of the Social Enterprise of tomorrow.
Big Data is about Virtual Scale-Out
Yes, the traditional in-house model meant using individual databases that could be scaled by using bigger and bigger hardware or implementing a Database Cluster.
However, with the Cloud you no longer need to incur the Capex required for buying hardware, nor the significantly high license costs if you choose an Open Standards based Cloud.
This is a known mantra of the Cloud, and it extends logically to Big Data, given that Database and Storage form key layers within the Cloud.
It is important to note here how the Cloud is changing the concept of Database and Storage as being separate things.
Over time we may see just one layer in the Computing stack called the Data Layer as the Database and Storage layers collapse into one.
This cloud Data layer may very well handle all Structured and Unstructured Data, have real time processing capabilities and might at some point in the future be called the Big Data layer.
The definition of Big Data is best understood if we grasp the key issues behind Data, i.e. how computer programs Process and Store data, and how this problem has been solved technologically over the past few decades.
(We will keep elements like data Retrieval, data I/O, data Moving aside for now and handle them in a more appropriate technical post which outlines mainly the evolution of (Big) Data technology)
We will discuss Issues with Data & The Evolution of Technology around Data in Part 2 of this Post next week.
Until then, I look forward to your comments, and you can think ‘Big’ Data!
Coming Soon in this series: Big Data Simplified Part 3 of 3.