Big and Fast Data

What is big data?

An overloaded, fuzzy term

“Data too large to be efficiently processed on a single computer”

“Massive amounts of diverse, unstructured data produced by high-performance applications”

How big is “Big”?

Typical numbers associated with Big Data

How big is “Big”? – Instagram

Instagram

  • >1B monthly active users, clicking around the app
  • 1,074 pictures are uploaded per second, 4 million per hour
  • 95M photos, 500M stories daily
  • Most followed user: 600M followers

How big is “Big”? – Facebook

Facebook

These numbers are from 2014! Today on Facebook:

  • 1.91 billion daily active users
  • 350 million photos per day (≈243k/min)
  • Every minute: 510k comments, 293k status updates

How big is “Big”? – TikTok

TikTok

  • 1 Billion monthly active users
  • Millions of videos uploaded daily
  • More than 1 billion videos viewed every day

The many Vs of Big data

Main Vs, by Doug Laney

  • Volume: large amounts of data
  • Variety: data comes in many different forms from diverse sources
  • Velocity: the content is changing quickly

More Vs

  • Value: data alone is not enough; how can value be derived from it?
  • Validity: ensure that the interpreted data is sound, clean
  • Veracity: can we trust the data? Does it come from reliable sources?
  • Volatility: how long does data need to be kept for? When is it considered irrelevant?
  • Visibility: data from diverse sources need to be stitched together
  • Virality: how quickly data is spread and shared

Volume

We call Big Data big because it is really big:

Data growth rate

Variety

  • Structured data: SQL tables, format is known
  • Semi-structured data: JSON, XML
  • Unstructured data: Text, audio, video

We often need to combine various data sources of different types to come up with a result
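Combining a structured and a semi-structured source can be sketched in plain Python; the user and event records below are made up for illustration:

```python
import csv
import io
import json

# Hypothetical inputs: a structured CSV of users (schema known up front)
# and semi-structured JSON events that reference users by id.
users_csv = "id,name\n1,Ada\n2,Grace\n"
events_json = '[{"user_id": 1, "action": "upload"}, {"user_id": 2, "action": "like"}]'

# Parse the structured source: every row has the same known columns.
users = {int(row["id"]): row["name"] for row in csv.DictReader(io.StringIO(users_csv))}

# Parse the semi-structured source: fields may vary per record.
events = json.loads(events_json)

# Join the two sources on user id to produce one combined result.
combined = [(users[e["user_id"]], e["action"]) for e in events]
print(combined)  # [('Ada', 'upload'), ('Grace', 'like')]
```

At scale the same join happens across machines, but the shape of the problem is identical.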

Velocity

Data is not just big; it is generated and needs to be processed fast. Think of:

  • Data centers writing to log files
  • IoT sensors reporting temperatures around the globe
  • Twitter: >500 million tweets a day (or 6k/sec)
  • Stock Markets: high-frequency trading (latency costs money)
  • Online advertising

Data needs to be processed with soft or hard real-time guarantees

Big Data processing

  • The ETL cycle

    • Extract: Convert raw or semi-structured data into structured data
    • Transform: Convert units, join data sources, cleanup etc
    • Load: Load the data into another system for further processing
  • Big data engineering is concerned with building pipelines

  • Big data analytics is concerned with discovering patterns
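The ETL cycle can be sketched end-to-end in a few lines of Python; the raw temperature log lines and the table schema here are hypothetical:

```python
import sqlite3

# Hypothetical raw log lines: "timestamp,temperature_fahrenheit"
raw_lines = ["2024-01-01T00:00,68.0", "2024-01-01T01:00,71.6"]

# Extract: turn raw text into structured (timestamp, value) pairs.
records = [line.split(",") for line in raw_lines]

# Transform: convert units (Fahrenheit -> Celsius) and clean up types.
transformed = [(ts, round((float(f) - 32) * 5 / 9, 1)) for ts, f in records]

# Load: write the result into another system for further processing.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (ts TEXT, celsius REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?)", transformed)
print(db.execute("SELECT * FROM readings").fetchall())
```

A real pipeline would read from distributed storage and load into a warehouse, but the three stages are the same.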

How to process all this data?

  • Batch processing: all the data already exists in some data store, and a program processes the whole dataset at once
  • Stream processing: data is processed as it arrives in the system
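The difference can be illustrated with a simple mean over made-up sensor readings: the batch version needs the whole dataset up front, while the streaming version keeps running state and has an answer after every arrival:

```python
# Batch: the whole dataset is available; process it at once.
readings = [12.0, 14.0, 13.0, 15.0]
batch_mean = sum(readings) / len(readings)

# Stream: maintain running state and update it as each value arrives,
# so an answer is available at any moment without re-reading everything.
count, total = 0, 0.0
for value in readings:  # stands in for values arriving over time
    count += 1
    total += value
    running_mean = total / count

print(batch_mean, running_mean)  # both 13.5
```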

Two basic approaches to distributing data processing operations across many machines:

  • Divide the data in chunks, apply the same algorithm on all chunks (data-parallelism)
  • Divide the problem in chunks, run it on a cluster of machines (task-parallelism)
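A sketch of the first approach (data-parallelism) in plain Python, with a thread pool standing in for the machines of a real cluster:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    """The same algorithm, applied unchanged to every chunk of the data."""
    return sum(chunk)

data = list(range(100))

# Data-parallelism: divide the data into chunks and apply the same
# function to each chunk concurrently.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(chunk_sum, chunks))

# A final reduce step combines the per-chunk results.
print(partial_sums, sum(partial_sums))  # total is 4950
```

This split-apply-combine shape is exactly what Map/Reduce and Spark generalize to clusters of machines.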

Desired properties of a big data processing system

  • Robustness and fault-tolerance
  • Low latency reads and updates
  • Scalability
  • Generalization
  • Extensibility
  • Ad hoc queries
  • Minimal maintenance
  • Debuggability

Large-scale computing

Not a new discipline:

  • Cray-1 appeared in the late ’70s
  • Physicists used super computers for simulations in the ’80s
  • Shared-memory designs still in large scale use (e.g. TOP500 supercomputers)

What is new?

Large scale processing on distributed, commodity computers, enabled by advanced software using elastic resource allocation.

Software (not HW!) is what drives the Big Data industry.

A brief history of Big Data tech

  • 2003: Google publishes the Google Filesystem paper, a large-scale distributed file system
  • 2004: Google publishes the Map/Reduce paper, a distributed data processing abstraction
  • 2006: Yahoo creates and open sources Hadoop, inspired by the Google papers
  • 2006: Amazon launches its Elastic Compute Cloud, offering cheap, elastic resources
  • 2007: Amazon publishes the Dynamo paper, sketching the blueprint of a cloud-native key-value store
  • 2009 – onwards: The NoSQL movement: schema-less, distributed databases challenge the SQL way of storing data
  • 2010: Matei Zaharia et al. publish the Spark paper, bringing functional programming to distributed in-memory computation
  • 2012: Both Spark Streaming and Apache Flink appear, capable of handling very high-volume stream processing
  • 2012: Alex Krizhevsky et al. publish their deep-learning image classification paper, re-igniting interest in neural networks and solidifying the value of big data

The Big Data Tech Landscape

The big data landscape

Progress is mostly industry-driven

D: Most advancements in Big Data technology came from industry; universities only started contributing later. Why?


Data is the new oil

Typical problems solved with Big Data

  • Modeling: What factors influence particular outcomes/behaviors?
  • Information retrieval: Finding needles in haystacks, aka search engines
  • Collaborative filtering: Recommending items based on items other users with similar tastes have chosen
  • Outlier detection: Discovering outstanding transactions
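Collaborative filtering can be sketched with cosine similarity over a toy ratings matrix (the users, items, and ratings below are invented for illustration):

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 3, "c": 5, "d": 4},
    "carol": {"a": 1, "b": 5, "c": 1, "d": 5},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (math.sqrt(sum(u[i] ** 2 for i in shared)) *
                  math.sqrt(sum(v[i] ** 2 for i in shared)))

# Recommend to alice: items chosen by the most similar other user
# that alice has not seen yet.
others = {name: prefs for name, prefs in ratings.items() if name != "alice"}
best = max(others, key=lambda name: cosine(ratings["alice"], others[name]))
recommendations = set(others[best]) - set(ratings["alice"])
print(best, recommendations)  # bob {'d'}
```

Production recommenders use the same idea over millions of users, which is what makes this a Big Data problem.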

Image credits

  • Data is the new oil picture (c) the Economist

Bibliography

[1] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, 2004, pp. 10–10.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 2003, pp. 29–43.
[3] G. DeCandia et al., “Dynamo: Amazon’s highly available key-value store,” ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 205–220, 2007.
[4] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” HotCloud, vol. 10, no. 10–10, p. 95, 2010.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[6] A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.