Google, Facebook, Netflix, LinkedIn, Twitter, and other social media platforms clearly qualify as big data technology centers. But when did they know it was time to start worrying about their data? The answer is simple: it depends on the characteristics of big data, and on when data processing starts straining against the 5 Vs.
Let’s see the 5 Vs of Big Data:
- Volume, the amount of data
- Velocity, how often new data is created and needs to be stored
- Variety, how heterogeneous data types are
- Veracity, the “truthiness” or “messiness” of the data
- Value, the significance of data
You’re not really in the big data world unless the volume of data reaches terabytes, petabytes, or more. Big data technology giants like Amazon, Shopify, and other e-commerce platforms ingest real-time structured and unstructured data, ranging from terabytes to petabytes, from millions of customers, especially smartphone users, across the globe. They process that data in near real time, run machine learning algorithms over it, and use the resulting analysis to make decisions that provide the best customer experience.
When do we find Volume as a problem:
A quick web search reveals that a decent 10TB hard drive runs at least $300. To store a petabyte you would need 100 of them, so that’s 100 x $300 USD = $30,000 USD. Maybe you’ll get a discount, but even at 50% off, you’re well over $10,000 USD in storage costs alone. And if you want to keep a redundant copy of the data for disaster recovery, you’d need even more disk space. Hence the volume of data becomes a problem when it grows beyond normal limits, making local storage devices an inefficient and costly way to keep it.
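The arithmetic above can be sketched in a few lines, using the article's illustrative drive size and price as assumptions:

```python
# Back-of-the-envelope storage cost for 1 PB on 10 TB drives.
# The $300-per-drive price is the article's illustrative figure.
DRIVE_CAPACITY_TB = 10
DRIVE_PRICE_USD = 300
PETABYTE_TB = 1000

drives_needed = PETABYTE_TB // DRIVE_CAPACITY_TB   # 100 drives for 1 PB
raw_cost = drives_needed * DRIVE_PRICE_USD         # $30,000 at list price
redundant_cost = raw_cost * 2                      # one extra mirror for disaster recovery

print(drives_needed, raw_cost, redundant_cost)  # 100 30000 60000
```

Doubling the bill for a single redundant copy is the cheapest possible recovery scheme; real deployments often need more replicas than that.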
Amazon Redshift, a managed cloud data warehouse service by AWS, is one of the popular options for storage. It stores data distributed across multiple nodes, which are resilient to disaster and faster for computations than on-premise relational databases like Postgres and MySQL. It is also easy to replicate data from relational databases into Redshift without any downtime.
Imagine a machine learning service that is constantly learning from a stream of data, or a social media platform with billions of users posting and uploading photos 24x7x365. Every second, millions of transactions occur, which means enormous amounts of data, petabytes in aggregate, being transferred from millions of devices to data centers. This rate of high-volume data inflow per second defines the velocity of data.
When do we find Velocity as a problem:
High-velocity data sounds great because velocity x time = volume, volume leads to insights, and insights lead to money. However, this path to growing revenue is not without its costs.
Many questions arise. How do you inspect every packet of data that comes through your firewall for maliciousness? How do you process such high-frequency structured and unstructured data on the fly? Moreover, high-velocity data almost always means large swings in the amount of data processed every second; tweets on Twitter spike during the Super Bowl compared to an average Tuesday. How do you handle that?
Fortunately, “streaming data” solutions have cropped up to the rescue. The Apache Software Foundation offers popular options like Spark and Kafka: Spark handles both batch processing and stream processing, while Kafka runs on a publish/subscribe mechanism. Amazon Kinesis is another solution, with a set of related APIs designed for processing streaming data. Google Cloud Functions (Google Firebase also has a version of this) is another popular serverless function API. All of these are great black-box solutions for managing complex processing of payloads on the fly, but they all require time and effort to build data pipelines.
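To make the publish/subscribe idea concrete, here is a minimal in-memory sketch of the pattern that brokers like Kafka implement at scale. The class, topic name, and payload are all illustrative; a real broker adds partitioning, offsets, replication, and durable storage.

```python
from collections import defaultdict

class MiniBroker:
    """Toy pub/sub broker: producers publish to topics, consumers subscribe."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of consumer callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Fan the message out to every consumer subscribed to this topic.
        for callback in self.subscribers[topic]:
            callback(message)

broker = MiniBroker()
received = []
broker.subscribe("clickstream", received.append)
broker.publish("clickstream", {"user": 42, "page": "/home"})
print(received)  # [{'user': 42, 'page': '/home'}]
```

The key property, also true of Kafka, is that producers and consumers never reference each other directly, only the topic, which is what lets each side scale independently.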
Now, if you don’t want to deal with the time and expense of creating your own data pipeline, that’s where something like FlyData could come in handy. FlyData seamlessly and securely replicates your Postgres, MySQL, or RDS data into Redshift in near real-time.
The real world is messy, with many different types of data, so it makes sense that anyone dealing with exciting challenges must also deal with messy data. Data heterogeneity is often a source of stress when building a data warehouse. Not only videos, photos, and highly interconnected posts and tweets on social platforms, but even basic user information can come in wildly different data types. These heterogeneous data sets pose a big challenge for big data analytics.
When do we find Variety as a problem:
When consuming a high volume of data, it can arrive in many formats (JSON, YAML, XML, and xSV, where x = C(omma), P(ipe), T(ab), etc.) before you can massage it into a uniform type for storage in a data warehouse. Processing becomes even more painful when data columns or keys are not guaranteed to exist forever, as APIs rename keys, introduce new ones, and deprecate support for old ones. So not only are you squeezing a variety of data types into one uniform type, but the types themselves can vary from time to time.
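A hedged sketch of that normalization problem: records arrive as JSON and as pipe-separated values, and an upstream API has renamed a key, so the code tolerates both the old and new names. The field names here are illustrative, not from any real schema.

```python
import csv
import io
import json

def normalize(record):
    """Coerce a heterogeneous record into one uniform dict shape."""
    # Accept either the legacy "username" key or the newer "user_name".
    name = record.get("user_name") or record.get("username")
    return {"user_name": name, "country": record.get("country", "unknown")}

# One record arrives as JSON (with the legacy key)...
json_row = json.loads('{"username": "ada", "country": "UK"}')
# ...and another as pipe-separated values (with the new key).
psv_rows = csv.DictReader(io.StringIO("user_name|country\ngrace|US"), delimiter="|")

uniform = [normalize(json_row)] + [normalize(r) for r in psv_rows]
print(uniform)
# [{'user_name': 'ada', 'country': 'UK'}, {'user_name': 'grace', 'country': 'US'}]
```

In practice this mapping layer is exactly the part of the pipeline that keeps changing as upstream formats drift, which is why it pays to isolate it in one place.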
One way to deal with a variety of data types is to record every transformation milestone applied along the route of your data processing pipeline. First, store the raw data as-is in a data lake (a hyper-flexible repository of data collected and kept in its rawest form, such as Amazon S3 file storage). Then transform the raw data of varying types into an aggregated, refined state, store that in another location inside the data lake, and later load it into a relational database or a data warehouse for data management.
Data in the real world is so dynamic that it is hard to know what is right and what is wrong. Veracity refers to the trustworthiness, or conversely the messiness, of data: the higher the trustworthiness, the lower the messiness, and vice versa. Veracity and Value together define data quality, which can provide great insights to data scientists.
When do we find Veracity as a problem:
Consider the case of tweets on Twitter, full of hashtags, uncommon slang, abbreviations, typos, and colloquial speech. All of this data carries a lot of messiness or noise, and as the volume of data increases, the noise grows with it, sometimes exponentially. The noise reduces overall data quality, affecting data processing and, later on, management of the processed data.
If the data is not sufficiently trustworthy, it becomes important to extract only high-value data; it doesn’t always make sense to collect everything you can, because doing so is expensive and takes more effort. Filter noise out of the data as early as possible in the processing pipeline, during extraction. This leaves only the required, trustworthy data, which can then be transformed and loaded for data analytics.
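A hedged sketch of filtering at extraction time: drop low-value, noisy records before anything flows downstream. The two heuristics here (too few words, mostly hashtags) are purely illustrative, not a production cleaning pipeline.

```python
def is_trustworthy(tweet: str) -> bool:
    """Illustrative noise filter for a stream of tweets."""
    words = tweet.split()
    if len(words) < 3:
        return False  # too short to carry usable signal
    hashtags = sum(1 for w in words if w.startswith("#"))
    return hashtags / len(words) < 0.5  # mostly-hashtag posts are noise

stream = [
    "gr8!!",                                     # too short
    "#win #yolo #blessed",                       # all hashtags
    "The new release fixed our latency issue",   # real signal
]
clean = [t for t in stream if is_trustworthy(t)]
print(clean)  # ['The new release fixed our latency issue']
```

Because the filter runs during extraction, the transform and load stages only ever pay for the records worth keeping.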
Unless the big data we have can be transformed into something valuable, it is useless. It is very important to weigh the cost of the resources and effort invested in big data collection against how much value it provides at the end of data processing. Value matters because it is what runs the business, impacting business decisions and providing a competitive advantage.
Consider the case of Netflix, where user viewing and browsing data is gathered from different sources, then extracted and transformed inside the data processing pipeline to generate only high-value information, such as user interests, that powers useful recommendations. This, in turn, helps Netflix avoid user churn and attract even more users to its platform. The same information would have been of low value had it not satisfied users. Hence, the value of big data shapes many business decisions and provides a competitive advantage over others.
In today's age, constant streams of high-volume real-time data flow from devices like smartphones, IoT devices, and laptops. These streams form big data, and the 5 Vs are its important characteristics (a framework for big data, if you will) that help you identify what to consider as data influx scales. Big data plays an instrumental role in fields like artificial intelligence, business intelligence, data science, and machine learning, where data processing (extraction, transformation, loading) leads to new insights, innovation, and better decision making. It also gives a competitive advantage to those who analyze their data before making decisions over those who run their business on traditional data alone. Solutions like Amazon Redshift will certainly provide an edge over relational databases for data warehousing, while Spark and Kafka are promising solutions for continuously streaming data into data warehouses.