Big and Fast Data

What is big data?

An overloaded, fuzzy term

“Data too large to be efficiently processed on a single computer”

“Massive amounts of diverse, unstructured data produced by high-performance applications”

How big is “Big”?

Typical numbers associated with Big Data

How big is “Big”? – Instagram

Instagram

  • >1B monthly active users, clicking around the app
  • 1,074 pictures are uploaded per second, 4 million per hour
  • 95M photos, 500M stories daily
  • Most followed user: 600M followers

How big is “Big”? – Facebook

Facebook

These numbers are from 2014! Today on Facebook:

  • 1.91 billion daily active users
  • 350 million photos per day (≈243k/min)
  • Every minute: 510k comments, 293k status updates

How big is “Big”? – TikTok

TikTok

  • 1 Billion monthly active users
  • Millions of videos uploaded daily
  • More than 1 billion videos viewed every day

The many Vs of Big data

Main Vs, by Doug Laney

  • Volume: large amounts of data
  • Variety: data comes in many different forms from diverse sources
  • Velocity: the content is changing quickly

More Vs

  • Value: data alone is not enough; how can value be derived from it?
  • Validity: ensure that the interpreted data is sound, clean
  • Veracity: can we trust the data? Does it come from reliable sources?
  • Volatility: how long does data need to be kept for? When is it considered irrelevant?
  • Visibility: data from diverse sources need to be stitched together
  • Virality: how quickly data is spread and shared

Volume

We call Big Data big because it is really big:

Data growth rate

Variety

  • Structured data: SQL tables, format is known
  • Semi-structured data: JSON, XML
  • Unstructured data: Text, audio, video

We often need to combine various data sources of different types to come up with a result
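Combining a structured and a semi-structured source can be sketched in plain Python; the user and event records below are made up for illustration:

```python
import csv
import io
import json

# Hypothetical inputs: a structured CSV of users (schema known up front)
# and semi-structured JSON events that reference users by id.
users_csv = "id,name\n1,Ada\n2,Grace\n"
events_json = '[{"user_id": 1, "action": "upload"}, {"user_id": 2, "action": "like"}]'

# Parse the structured source: every row has the same known columns.
users = {int(row["id"]): row["name"] for row in csv.DictReader(io.StringIO(users_csv))}

# Parse the semi-structured source: fields may vary per record.
events = json.loads(events_json)

# Join the two sources on user id to produce one combined result.
combined = [(users[e["user_id"]], e["action"]) for e in events]
print(combined)  # [('Ada', 'upload'), ('Grace', 'like')]
```

At scale the same join happens across machines, but the shape of the problem is identical.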

Velocity

Data is not just big; it is generated and needs to be processed fast. Think of:

  • Data centers writing to log files
  • IoT sensors reporting temperatures around the globe
  • Twitter: >500 million tweets a day (or 6k/sec)
  • Stock Markets: high-frequency trading (latency costs money)
  • Online advertising

Data needs to be processed with soft or hard real-time guarantees

Big Data processing

  • The ETL cycle

    • Extract: Convert raw or semi-structured data into structured data
    • Transform: Convert units, join data sources, cleanup etc
    • Load: Load the data into another system for further processing
  • Big data engineering is concerned with building pipelines

  • Big data analytics is concerned with discovering patterns
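The ETL cycle can be sketched end-to-end in a few lines of Python; the raw temperature log lines and the table schema here are hypothetical:

```python
import sqlite3

# Hypothetical raw log lines: "timestamp,temperature_fahrenheit"
raw_lines = ["2024-01-01T00:00,68.0", "2024-01-01T01:00,71.6"]

# Extract: turn raw text into structured (timestamp, value) pairs.
records = [line.split(",") for line in raw_lines]

# Transform: convert units (Fahrenheit -> Celsius) and clean up types.
transformed = [(ts, round((float(f) - 32) * 5 / 9, 1)) for ts, f in records]

# Load: write the result into another system for further processing.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (ts TEXT, celsius REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?)", transformed)
print(db.execute("SELECT * FROM readings").fetchall())
```

A real pipeline would read from distributed storage and load into a warehouse, but the three stages are the same.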

How to process all this data?

  • Batch processing: all the data already exists in some data store, and a program processes the whole dataset at once
  • Stream processing: data is processed as it arrives in the system
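The difference can be illustrated with a simple mean over made-up sensor readings: the batch version needs the whole dataset up front, while the streaming version keeps running state and has an answer after every arrival:

```python
# Batch: the whole dataset is available; process it at once.
readings = [12.0, 14.0, 13.0, 15.0]
batch_mean = sum(readings) / len(readings)

# Stream: maintain running state and update it as each value arrives,
# so an answer is available at any moment without re-reading everything.
count, total = 0, 0.0
for value in readings:  # stands in for values arriving over time
    count += 1
    total += value
    running_mean = total / count

print(batch_mean, running_mean)  # both 13.5
```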

Two basic approaches to distributing data processing operations across many machines:

  • Divide the data in chunks, apply the same algorithm on all chunks (data-parallelism)
  • Divide the problem in chunks, run it on a cluster of machines (task-parallelism)
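A sketch of the first approach (data-parallelism) in plain Python, with a thread pool standing in for the machines of a real cluster:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    """The same algorithm, applied unchanged to every chunk of the data."""
    return sum(chunk)

data = list(range(100))

# Data-parallelism: divide the data into chunks and apply the same
# function to each chunk concurrently.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(chunk_sum, chunks))

# A final reduce step combines the per-chunk results.
print(partial_sums, sum(partial_sums))  # total is 4950
```

This split-apply-combine shape is exactly what Map/Reduce and Spark generalize to clusters of machines.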

Desired properties of a big data processing system

  • Robustness and fault-tolerance
  • Low latency reads and updates
  • Scalability
  • Generalization
  • Extensibility
  • Ad hoc queries
  • Minimal maintenance
  • Debuggability

Large-scale computing

Not a new discipline:

  • Cray-1 appeared in the late ’70s
  • Physicists used super computers for simulations in the ’80s
  • Shared-memory designs still in large scale use (e.g. TOP500 supercomputers)

What is new?

Large scale processing on distributed, commodity computers, enabled by advanced software using elastic resource allocation.

Software (not HW!) is what drives the Big Data industry.

A brief history of Big Data tech

  • 2003: Google publishes the Google Filesystem paper, a large-scale distributed file system
  • 2004: Google publishes the Map/Reduce paper, a distributed data processing abstraction
  • 2006: Yahoo creates and open sources Hadoop, inspired by the Google papers
  • 2006: Amazon launches its Elastic Compute Cloud, offering cheap, elastic resources
  • 2007: Amazon publishes the Dynamo paper, sketching the blueprint of a cloud-native key-value store
  • 2009 – onwards: The NoSQL movement: schema-less, distributed databases challenge the SQL way of storing data
  • 2010: Matei Zaharia et al. publish the Spark paper, bringing functional programming to distributed in-memory computation
  • 2012: Both Spark Streaming and Apache Flink appear, capable of handling very high-volume stream processing
  • 2012: Alex Krizhevsky et al. publish their deep-learning image classification paper, re-igniting interest in neural networks and solidifying the value of big data

The Big Data Tech Landscape

The big data landscape

Progress is mostly industry-driven

D: Most advancements in Big Data technology came from industry; universities only started contributing later. Why?


Data is the new oil

Typical problems solved with Big Data

  • Modeling: What factors influence particular outcomes/behaviors?
  • Information retrieval: Finding needles in haystacks, aka search engines
  • Collaborative filtering: Recommending items based on items other users with similar tastes have chosen
  • Outlier detection: Discovering outstanding transactions
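Collaborative filtering can be sketched with cosine similarity over a toy ratings matrix (the users, items, and ratings below are invented for illustration):

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 3, "c": 5, "d": 4},
    "carol": {"a": 1, "b": 5, "c": 1, "d": 5},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (math.sqrt(sum(u[i] ** 2 for i in shared)) *
                  math.sqrt(sum(v[i] ** 2 for i in shared)))

# Recommend to alice: items chosen by the most similar other user
# that alice has not seen yet.
others = {name: prefs for name, prefs in ratings.items() if name != "alice"}
best = max(others, key=lambda name: cosine(ratings["alice"], others[name]))
recommendations = set(others[best]) - set(ratings["alice"])
print(best, recommendations)  # bob {'d'}
```

Production recommenders use the same idea over millions of users, which is what makes this a Big Data problem.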

Image credits

  • Data is the new oil picture (c) the Economist

Bibliography

[1] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, 2004, pp. 10–10.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 2003, pp. 29–43.
[3] G. DeCandia et al., “Dynamo: Amazon’s highly available key-value store,” ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 205–220, 2007.
[4] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” HotCloud, vol. 10, no. 10–10, p. 95, 2010.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[6] A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.