Basics: Deep into the weeds of BigData

Alexandru Chihai
Oct 28, 2021 · 11 min read

Last week, I took a stroll through the dark woods of Big Data basics, attempting to define it at the surface. To avoid barking up the wrong tree, I've decided to lean on IBM's data scientists for a deeper dive into the Big Data concepts. Hopefully, this syrup will whet your BASICs appetite whilst delivering a better understanding of those "under-the-hood" processes. Now, let's stop chewing the fat, stop inflicting violence on tangled clumps of plants, and dig straight in.

Re-defining size and complexity margins (the Big Data V’s)

As we hinted in our last “Big Data” piece, defining this concept is a bit of a conundrum. We agreed that size alone is not the main essence of this term. Instead, our convention focused on a mix of complexity and volume.

Back at IBM (the company behind Watson, one of the most advanced question-answering systems, which combines data sources streamed from encyclopedias, dictionaries, thesauri, newswire articles and literary works), engineers identified the need to define complexity itself. This is how, in addition to data Volume, engineers and enthusiasts alike came to define complexity margins such as Velocity, Variety, Veracity (or Variability), and Value.

Long story short, the three "V"s any data scientist is aware of (volume, velocity and variety) have finally received a wordy increment that delivers more insight and meaning to all that Big Data processing nonsense.

IBM wasn't the first to do it. In fact, if you type the phrase "Vs of Big Data" into your favourite search engine, you'll find all sorts of results, some even claiming there are 10 dimensions of complexity, adding even more confusion to one's leaky cauldron.

The reason for this is Big Data's ambiguity, which leads to a common but unsurprising confusion. Complexity means something different to each of us. Depending on who you are, and more specifically what you do, your understanding will differ. Look at Big Data through the eyes of an engineer versus the perspective of a business executive or any other industry professional: everyone talks the same talk, but the walk is an entirely different journey.

Before these "V"s kicked in the Big Data pub's door, their sole purpose was to shed light on, and more specifically provide a general mechanism for identifying, the various Big Data challenges. Since IBM's data scientists were among the first to coin them (at least in a commercial context), we'll stick to defining Big Data Volume, Velocity, Variety, Veracity (or Variability), and Value, considering the last one a business outlook rather than a strictly technical criterion.

V for Volume

Throwing you back to our last entry: size does matter, especially when your gigabyte measuring tape looks like a doughnut. As with that carb-filled pastry, its volume rises over time, and you'd better have a large enough fryer to taste that analytical goodness.

For instance, if you're running a chain of supermarkets, all those price books and product catalogues, all that data streamed from each of your IoT-enabled refrigerators, and all of those dynamic staff, client, partner and supplier records will make a pretty nifty dent in your private, hybrid or public cloud. It only makes sense to get ready to ingest, sometimes, more than you can chew.

In the greater picture, this particular "V" will inevitably get bigger (unlike my hopes and dreams). Back in 2016, IBM estimated we'd reach the 40-zettabyte mark, and guess what: according to Statista, we've already reached 59 zettabytes. Volumes rise at an unprecedented rate, hence the requirement to drop the "get schwifty" act and rethink storage; otherwise, we might as well start building data centers on the moon.

There's not much more to be said here, as I covered it quite well in my previous post, and "copy/paste" is something I'm not really into.

V for Velocity

This one is a bit easier to understand, particularly because you've heard the term in a variety of contexts: be it a speeding ticket, the time your Uber Eats driver took to deliver your food, or your internet connection speed. Those who lived long enough to use dial-up know how important "velocity" is, especially when attaching a file to an AOL or Excite email.

In the context of Big Data, velocity is defined by two margins: the speed of your data enrichment and the need for real-time analytics. Data enrichment velocity is simply the speed of any incoming info, whilst the "real-time analytics" business has more to do with the way you model, store and process that incoming chunk of data within a reasonable time. The more data you ingest, the more of a challenge providing real-time analytics becomes (though our guys can help you eat your cake and have it too).

Let's go back to that "supermarket" model, just to get an idea of how important this "V" can be. Imagine you're selling the best ice cream in the world, but to guarantee an amazing taste, your recipe demands that the goodness be stored below -7 degrees Celsius. If the temperature rises by just 2 degrees for as little as an hour, the entire batch is trashed. Yeah, this one is more of a Willy Wonka scenario, but bear with me.

You have 100 locations and over 2,000 freezers around the UK, and since you're a smart business-human, all of them are equipped with IoT sensors that stream details about each freezer's general state and temperature to your data warehouse, 24/7.

If any of those freezers fails, you need to take immediate action; hence, the velocity requirement. Calling in a technician and making sure your staff redistribute those ice-cream boxes to other freezers takes time, and you cannot afford to waste more of it on processing that IoT data without real-time bells and whistles.
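
To make that scenario a touch more concrete, here is a minimal sketch of the kind of real-time check such a pipeline might run. It assumes the sensors push timestamped readings into a stream; the thresholds, freezer IDs and the on_reading helper are hypothetical illustrations, not a reference implementation.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds taken from the ice-cream scenario above:
# a -7 °C storage limit plus the tolerated 2-degree drift, for at most 1 hour.
MAX_TEMP_C = -5.0
MAX_BREACH = timedelta(hours=1)

# Remembers when each freezer first crossed the threshold
breach_started: dict[str, datetime] = {}

def on_reading(freezer_id: str, temp_c: float, ts: datetime) -> None:
    """Process one sensor reading and raise an alert if a freezer
    has been too warm for longer than the allowed window."""
    if temp_c <= MAX_TEMP_C:
        breach_started.pop(freezer_id, None)  # back to normal
        return
    started = breach_started.setdefault(freezer_id, ts)
    if ts - started >= MAX_BREACH:
        # In a real pipeline this would page a technician or open a ticket
        print(f"ALERT: {freezer_id} above {MAX_TEMP_C} °C since {started}")

if __name__ == "__main__":
    base = datetime(2021, 10, 28, 9, 0)
    for fid, temp, ts in [
        ("freezer-017", -6.2, base),
        ("freezer-017", -4.1, base + timedelta(minutes=10)),
        ("freezer-017", -3.9, base + timedelta(minutes=75)),  # warm for over an hour
    ]:
        on_reading(fid, temp, ts)
```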

Going back to a more realistic perspective, imagine how crucial the way data is modelled, stored, processed and reported becomes in healthcare. There is a reason doctors and nurses use "STAT" instead of the classic "as soon as possible". When patient monitoring is involved, Big Data velocity might be a matter of life and death, especially under 2020's circumstances (we've already heard all about that).

V for Variety

As with that liquor stash you've invested heavily in during this pandemic, data can and must be varied. After all, everything you do resembles a poly-coloured gradient, especially if you've reached the Valhalla of data centricity. You might have activities that are easily transposed to a relational model, generating data structures driven by metrics such as dates, amounts, times and persona details. But, as you might already know, this is not enough.

If you've spent more than a dime at that Business Development, Sales and Marketing bar, you know that raw data is not much help, especially when you're looking to close a deal. Gone are the days when you could sell anything following the "shoe salesman" or telemarketing model. Today, NPS is everything, and you'd better know your targets better than you know your spouse. To achieve better conversion in sales, as in other areas, structured data (aka those neat pieces of detail that fit your relational data model) must be augmented with unstructured data sets.

Just imagine you could get a neat picture of your prospect's mood before initiating that pitch call. Aggregating the streams of data triggered by social media actions, posts, landing-page visits or whitepaper downloads related to one of your leads is no longer a luxury; it is a requirement, especially in a customer-centric market.

Forget sales. Instead, try walking in a physician's shoes, facing the requirement of delivering an accurate diagnosis for a patient without any CAT scans, MRI imaging, lab results or clinical history. Not even Dr House MD can do that. It is nearly impossible to solve a complex problem without pumping the augmentation iron of data variety.

To get a grasp of this "V", consider any data that is relatively easy to capture and store, but nearly impossible to structure into a reusable or relational format (data scientists shouting in my ear: "it doesn't have a meta-model, dummy!").

Another way of properly understanding unstructured data is by contrast. For instance:

  • money is always defined by numbers with at least 2 decimal places and a currency;
  • your client's e-mail address will always follow a "text"@"text"."text" format;
  • phone numbers will always follow a sequence of codes: country code, network or area code, subscriber number; and
  • dates will always follow a set of ordered formats, be it DD/MM/YYYY or YYYY/MM/DD.

Now, imagine organising a tweet, an MRI scan or a voice recording (even transcribed) into a neat model. Can you define a reusable, replicable structure? If you can, hit us up and let's create the most promising startup there is; we'll cash in better than Apple. That's a guarantee!
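
To put that contrast in code: the structured fields from the list above can be validated against fixed, reusable patterns, while a tweet or a scan simply has no such schema to validate against. The snippet below is only an illustrative sketch; the regular expressions are simplified stand-ins rather than production-grade validators.

```python
import re

# Simplified patterns for the structured fields contrasted above;
# each fits a fixed, reusable format, which is exactly what a tweet,
# an MRI scan or a voice recording does not have.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")              # "text"@"text"."text"
PHONE_RE = re.compile(r"^\+\d{1,3}[ -]?\d{2,4}[ -]?\d{4,8}$")     # country + area + subscriber
DATE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}|\d{4}/\d{2}/\d{2})$")  # DD/MM/YYYY or YYYY/MM/DD

def is_structured(field: str, value: str) -> bool:
    """Return True if the value fits the fixed format for its field."""
    patterns = {"email": EMAIL_RE, "phone": PHONE_RE, "date": DATE_RE}
    return bool(patterns[field].match(value))

print(is_structured("email", "client@example.com"))  # True
print(is_structured("phone", "+44 20 79460000"))      # True
print(is_structured("date", "28/10/2021"))            # True
# A tweet, by contrast, is stored as a blob of text plus metadata,
# and meaning has to be extracted later (NLP, sentiment analysis, etc.).
print(is_structured("date", "sometime next Tuesday"))  # False
```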

Undoubtedly, one of the most complex Big Data challenges is making sense of all this variety nonsense. This is why our voice-recognition efforts are now focused more on determining a caller's tone than on the message itself. This is why we employ AI to learn what sarcasm is. This is why most client satisfaction, NPS and quality-control jobs still need a lousy human to make sense of the non-verbal subtext or "subjective" cues a call, tweet or message might hide.

V for Veracity (and/or Variability)

Bluntly put: this "V" has some serious trust issues. Its "clinical profile" is shaped by data variability and by differing perceptions and perspectives (again, we're back to square one).

Consider the "LOL" acronym. For most of us, it stands for "laughing out loud". No wonder you're gobsmacked when reading the "Condolences. LOL" message sent by one of your grandparents to a mourning friend. For them, "LOL" stands for "lots of love", but how would you know that without proper context? Though a bit morbid, this example perfectly illustrates the challenges outlined by the Veracity (and/or Variability) of Big Data.

Now, moving on to a more professional illustration, let's explore healthcare industry slang. Quite often, medical professionals use acronyms to describe a specific condition. If you visit an Emergency Room and a cardiologist is filling in some clinic hours, they might refer to your chest pain as "CP". Without proper context, a neurologist might interpret that record as "cerebral palsy". Again, data perception here is strongly related to who you are, what you do and (now) where you do it. After all, if you're admitted to the cardiology ward, everyone will understand "CP" as chest pain.

Adding to those issues, there's a more direct cause of data veracity challenges: trust. How certain are you that your precious data is trustworthy? After all, even if you're a data-centricity guru, there's no insurance against inherent discrepancies in the data you collect. Loosely translated: in the veracity context, data interpretation makes sense only when you account for how the data was delivered, cross-check it against auxiliary sources and put proper health checks in place.

Both veracity conundrums (weak comprehension and low quality/trust) can be tackled with advanced tools (analytics, AI algorithms, observable pattern deployment, efficient data cataloguing, etc.), and this is where engineers and data scientists join forces to deliver context to all that meaningless, useless or simply inaccurate data.
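
As a toy illustration of those health checks, here is a minimal sketch that profiles a batch of records and flags fields with too many missing values. The record layout, thresholds and the profile helper are hypothetical; a real pipeline would lean on proper data-quality and cataloguing tooling.

```python
# A minimal data "health check": measure how often required fields are
# missing and flag the ones you probably should not trust yet.
def profile(records: list, required: list, max_missing: float = 0.05) -> dict:
    report = {}
    total = len(records) or 1
    for field in required:
        missing = sum(1 for r in records if not r.get(field))
        rate = missing / total
        report[field] = {"missing_rate": round(rate, 3), "trusted": rate <= max_missing}
    return report

# Hypothetical ER admission records, echoing the "CP" example above
admissions = [
    {"patient_id": "A-101", "complaint": "CP", "ward": "cardiology"},
    {"patient_id": "A-102", "complaint": "CP", "ward": None},        # no ward context
    {"patient_id": "A-103", "complaint": None, "ward": "neurology"},
]

print(profile(admissions, ["patient_id", "complaint", "ward"]))
# Without the "ward" field, "CP" stays ambiguous (chest pain vs cerebral palsy),
# so a high missing rate on it is a veracity red flag, not just a storage gap.
```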

Last, though not least: V for Value

Unlike the others, this "V" is quite easy to comprehend. After all, everything you do around data processing is meant to achieve a goal, now or in the near future. Why bake doughnuts today if you're just leaving them on the counter to go bad? It's simply not worth the effort.

Let's put it this way: if you're collecting data that aids your decision-making, serves as a looking glass into your prospects and helps you achieve predictable results as opposed to "having a hunch", then you're on the right path. A good indicator that you're collecting the right insights is having metrics in place like "Customer Lifetime Value", "Average Revenue per Client", "Churn Projection", etc.
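
For a rough feel of what those metrics look like in practice, here is a back-of-the-envelope sketch using the common simplification that Customer Lifetime Value is roughly the average revenue per client divided by the churn rate. The figures and function names are made up purely for illustration.

```python
# Toy calculations for "Average Revenue per Client" and a simplified
# "Customer Lifetime Value" (CLV ≈ monthly revenue per client / monthly churn).
def average_revenue_per_client(monthly_revenue: float, active_clients: int) -> float:
    return monthly_revenue / active_clients

def customer_lifetime_value(arpc_monthly: float, monthly_churn_rate: float) -> float:
    # 1 / churn rate approximates the expected customer lifetime in months
    return arpc_monthly / monthly_churn_rate

arpc = average_revenue_per_client(monthly_revenue=250_000, active_clients=1_000)  # 250.00 per month
clv = customer_lifetime_value(arpc_monthly=arpc, monthly_churn_rate=0.04)         # 6,250.00
print(f"ARPC: {arpc:,.2f}/month, projected CLV: {clv:,.2f}")
```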

Otherwise, if your data is filled with corrupt values, or if it lacks key elements that power your organisation's processes (like customer reference models and accurate timestamps), you might need to call in some data science "Ghostbusters"; after all, BadData is something of a ghost haunting your lucrative insights.

Leaving those default illustrations aside, we must understand that Value is versatile. We all collect data for a specific purpose, and it is important to avoid getting drunk on plain water: at the very least, add some sparkling juice to that glass. Narrow down the aspects that matter to your organisation and collect data that builds on your daily KPIs or OKRs. If your enrichment does not prove useful now or in the near future, you might not need that extravagant data collection budget to begin with.

Endnotes

Whether you're looking to add "V"s to your Big Data processing research or remove them, one thing seems certain: the worth of data processing is entirely up to you. Outcomes are tightly related to who you are, what you do, who your stakeholders are and what your ultimate goals are. Loosely translated: there is no one-size-fits-all approach.

Still, "read the directions, even if you don't follow them", especially when dealing with Big Data processing. After all, the way we interact, work and simply live is getting more data-driven than ever before. This time tomorrow, you might need to get a grip on why, how and what data you're collecting or will need to collect. IBM's 5 V's of defining Big Data might not be perfect, but they serve as a pretty good kick-off when diving into the data-centricity business. Think of them as a framework designed to power your Big Data strategy and help you spot precious golden nuggets on your data influx journey.


Alexandru Chihai

Business Development, Marketing & Business Analysis guy who knows a thing or two about #Convergence, #Networking, bespoke #SoftwareDevelopment, #CRMs and #DX