You know that Big Data involves lots of data. But have you ever stopped to think about just how much data, exactly, goes into Big Data? In other words, how big is Big Data, actually?
Defining Big Data
Before delving into the question of how big a data set has to be in order to count as Big Data, let’s discuss the difficulty of defining what the term actually means.
There is no official definition of Big Data, of course. What one person considers Big Data may just be a traditional data set in another person’s eyes.
That doesn’t mean that people don’t offer up various definitions for Big Data, however. For example, some would define it as any type of data that is distributed across multiple systems.
In some respects, that’s a good definition. Distributed systems tend to produce much more data than localized ones because distributed systems involve more machines, more services, and more applications, all of which generate more logs containing more data.
On the other hand, you can have a distributed system that doesn’t involve much data. For instance, if I mount my laptop’s 500-gigabyte hard disk over the network so that I can share it with other computers in my house, I would technically be creating a distributed data environment. But most people wouldn’t consider this an example of Big Data.
Another way to try to define Big Data is to contrast it with “little data.” In this definition, Big Data is any data that is processed using advanced analytics tools, while little data is interpreted in less sophisticated ways. The size of the actual data sets doesn’t matter in this definition.
This is also a valid way of thinking about what Big Data means. The big problem with this approach, however, is that there’s no clear line separating advanced analytics tools from basic software scripts. If you define Big Data only as data that is analyzed using Hadoop, Spark or another complex analytics platform, you run the risk of excluding from your definition data sets that are processed using R instead, for instance.
So, there’s no universal definition, but there are multiple ways to think about it. That’s an important point to recognize because it highlights the fact that we can’t define it in quantifiable terms alone.
Examples of Big Data
What we can do, however, is gain a sense of just how much data the average organization has to store and analyze today. Toward that end, here are some metrics that help put hard numbers on the scale of Big Data today:
- Analysts predict that by 2020, there will be 5,200 gigabytes of data on every person in the world.
- On average, people send about 500 million tweets per day.
- The average U.S. customer uses 1.8 gigabytes of data per month on his or her cell phone plan.
- Walmart processes one million customer transactions per hour.
- Amazon sells 600 items per second.
- On average, each person who uses email receives 88 emails per day and sends 34. That adds up to more than 200 billion emails each day.
- MasterCard processes 74 billion transactions per year.
- Commercial airlines make about 5,800 flights per day.
All of the above are examples of sources of Big Data, no matter how you define it. Whether you analyze these types of data using a platform like Hadoop, and regardless of whether the systems that generate and store the data are distributed, it’s a safe bet that data sets like those described above would count as Big Data in most people’s books.
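Because the rates above are quoted in mixed units (per hour, per second, per year), it can be hard to compare them at a glance. As a quick sanity check, here is a back-of-envelope sketch that converts them to a common per-day scale; the figures are taken from the list above, and only the arithmetic is added:

```python
# Convert the article's mixed-unit rates to a common per-day scale.
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
DAYS_PER_YEAR = 365

tweets_per_day = 500_000_000                          # given per day
walmart_per_day = 1_000_000 * 24                      # 1M transactions/hour
amazon_per_day = 600 * SECONDS_PER_DAY                # 600 items/second
mastercard_per_day = 74_000_000_000 // DAYS_PER_YEAR  # 74B transactions/year

for name, n in [
    ("Tweets", tweets_per_day),
    ("Walmart transactions", walmart_per_day),
    ("Amazon items sold", amazon_per_day),
    ("MasterCard transactions", mastercard_per_day),
]:
    print(f"{name}: {n:,} per day")
```

On this scale, Walmart handles roughly 24 million transactions, Amazon sells roughly 52 million items, and MasterCard processes roughly 200 million transactions every single day.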
The Big Data Challenge
It’s also clear that the data sets represented above are huge. Even if your organization doesn’t work with these specific types of data, they provide a sense of just how much data various industries are generating today.
To work with that data effectively, you need a streamlined approach. You need not just powerful analytics tools, but also a way to move data from its source to an analytics platform quickly. With so much data to process, you can’t waste time converting it between different formats or offloading it manually from an environment like a mainframe (where lots of those banking, airline and other transactions take place) into a platform like Hadoop.
That’s where solutions like Syncsort’s come in. Syncsort’s data integration solutions automate the process of accessing data in legacy environments and integrating it with next-generation platforms, preparing it for analysis using modern tools.
But no matter how you define it, Big Data is in a state of evolution. Discover how the new data supply chain impacts how data is moved, manipulated, and cleansed – download the new eBook The New Rules for Your Data Landscape today!