Ask the Expert: What is big data?

Records show that about 720,000 hours of video are uploaded to YouTube every day, Google answers over 8.5 billion search queries per day, and Instagram Stories has 500 million daily active users. All of these applications generate and access enormous amounts of data (big data) every minute in today's social media and e-commerce era.

Assistant Professor of Computer Science Ajit Chavan. Photo by Envisage Studios.

Simply put, datasets that are so big they cannot be stored and processed on a single machine using traditional software are considered big data. 

To store, process, and access such data effectively, we need a collection of computers (known as a cluster), housed in large data centers (there is one near Altoona, Iowa, run by Meta) and equipped with specialized software known as big data technologies.
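To make this concrete, here is a minimal sketch of how one widely used big data framework, Apache Spark, spreads a simple word-counting job across a cluster. The file path and application name are hypothetical; this illustrates the idea rather than any particular platform's setup.

    # A sketch of a word-counting job in Apache Spark. The input path is
    # hypothetical; real data would typically live on a distributed file
    # system such as HDFS, spread across the cluster's machines.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count").getOrCreate()

    counts = (
        spark.sparkContext.textFile("hdfs:///data/posts.txt")  # hypothetical path
        .flatMap(lambda line: line.split())      # split each line into words
        .map(lambda word: (word, 1))             # pair every word with a count of 1
        .reduceByKey(lambda a, b: a + b)         # sum the counts across machines
    )

    print(counts.take(10))  # peek at the first ten (word, count) pairs
    spark.stop()

The framework, not the programmer, decides which machine handles which chunk of the file, which is what makes it practical to process data far too large for any single computer.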

Big data is generally characterized by four Vs: volume, velocity, variety, and veracity. Volume refers to the amount of data generated; for example, Facebook generated 4 petabytes (1 PB = 1 million GB) of data daily in 2020. Velocity is equally important: the Google Search engine receives about 99,000 search requests every second and must process each one in a fraction of a second to return results to the user.
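A quick back-of-the-envelope calculation shows what these figures mean per second; the numbers are the ones quoted above, so treat the results as rough estimates.

    # Rough arithmetic using the figures quoted above.
    PB_IN_GB = 1_000_000            # 1 PB = 1 million GB
    SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

    daily_volume_gb = 4 * PB_IN_GB  # Facebook's roughly 4 PB per day (2020)
    print(daily_volume_gb / SECONDS_PER_DAY)  # about 46.3 GB of new data every second

    searches_per_second = 99_000    # Google's quoted search rate
    print(searches_per_second * SECONDS_PER_DAY)  # 8,553,600,000 -- over 8.5 billion a day

Note how the second figure matches the "over 8.5 billion" daily search total mentioned earlier: velocity and volume are two views of the same flood of data.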

Variety refers to the different types and formats in which data is generated and processed. For example, users post stories, pictures, and videos on Instagram, react to posts, and comment on them. Storing such different types of data (text, images, audio, video, etc.) and retrieving it quickly enough to provide a unified user experience requires massive clusters of computers running fine-tuned software systems. The collected data is also used to gain insights into user preferences, such as identifying the most popular products on Amazon or suggesting the most relevant products based on your previous searches and purchases. In these cases, the fourth V, veracity, is essential: it refers to the quality, accuracy, and credibility of the data. Collected data can have missing pieces or inaccuracies that degrade the quality of the resulting insights, and finding and addressing such inconsistencies is challenging when the data runs to hundreds of millions of records.
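As a small illustration of a veracity check, the sketch below uses the pandas library to flag missing and implausible values in a toy table; the column names and validity rules are hypothetical examples, not any platform's real schema.

    # A toy veracity check with pandas; columns and rules are made up
    # for illustration only.
    import pandas as pd

    df = pd.DataFrame({
        "user_id": [101, 102, None, 104],
        "age":     [34, -5, 27, 41],           # -5 is clearly not a valid age
        "country": ["US", "IN", "BR", None],
    })

    print(df.isna().sum())                          # missing values per column
    print(df[(df["age"] < 0) | (df["age"] > 120)])  # rows with implausible ages

At the scale of hundreds of millions of records, the same checks would run on a cluster framework rather than on a single machine, but the logic is identical.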

In summary, big data refers to a rapidly growing, voluminous collection of data in many formats and structures, which traditional software systems running on a single machine cannot store, access, and process effectively.