πŸ—œοΈ Why Is This Photo Only 500 KB? β€” JPEG, Huffman & the Art of Forgetting

Ages: 11–15  Β·  Duration: 105 minutes  Β·  Topics: Image representation, Information theory, Lossless & lossy compression, Human visual perception, Discrete Cosine Transform


Part 0 β€” Warm-Up: What IS a Picture? (10 min)

The Big Idea

"Your phone takes a photo β€” 12 megapixels, stunning detail. The raw data is 36 million bytes. But the file you share is 2 million bytes. Somehow, 34 million bytes vanished β€” and nobody can tell. Where did they go? Today, we find out β€” and the answer involves physics, psychology, and beautiful mathematics."

Let's start with something basic. What do you see?

You see light. Specifically, electromagnetic radiation β€” waves with wavelengths between about 380 nm (violet) and 700 nm (red). Your eye has three types of cone cells:

Cone typePeak sensitivityColour perceived
S (Short)~420 nmBlue-violet
M (Medium)~530 nmGreen
L (Long)~560 nmRed-yellow

Every colour you've ever seen is your brain's interpretation of how strongly these three cone types fired. A pure yellow light (580 nm) stimulates your L and M cones β€” but so does a mixture of red and green light. Your brain can't tell the difference!

πŸ’‘ "This is why screens only need three colours. They're not showing you real yellow β€” they're tricking your cones with a red–green cocktail."

🎨 Interactive: RGB Colour Mixer

Mix Red, Green, and Blue to make any colour. Each channel is 0–255 (one byte).

255
200
0
RGB(255, 200, 0) = #FFC800 Bytes: 3 (one per channel) Bits: 24 This pixel's storage cost: 3 bytes out of ~36,000,000 in a 12 MP photo.

Quick Practice

QuestionAnswer
How many cone types do humans have?3 (S, M, L)
Can a screen make real yellow light?No β€” it mixes red + green to fake it
Why do TVs use Red, Green, Blue?Those roughly match our 3 cone types
How many distinct colours can 24-bit RGB represent?$256^3 = 16{,}777{,}216$

Key Rule

A digital image is a rectangular grid of tiny coloured squares called pixels. Each pixel stores three numbers β€” one for Red, one for Green, one for Blue β€” each between 0 and 255. That's all. Every photo you've ever taken is just millions of number triplets.

Part 1 β€” The Raw Truth: BMP and the Cost of Honesty (β‰ˆ 15 min)

Setting the Scene

The simplest image format is BMP (Bitmap). It stores every pixel's RGB values with a small header.

Question: A photo from your phone is 4000 Γ— 3000 pixels. At 3 bytes per pixel, how big is it as a raw BMP?
$$4000 \times 3000 \times 3 = 36{,}000{,}000 \text{ bytes} = 36 \text{ MB}$$

πŸ“ Interactive: Image Size Calculator

12,000,000 pixels Γ— 3 bytes = 36,000,000 bytes = 36.00 MB At JPEG quality 85 (~12:1): β‰ˆ 3.00 MB At JPEG quality 50 (~24:1): β‰ˆ 1.50 MB

Anatomy of a BMP

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  BMP Header (14 bytes)                   β”‚
β”‚  File size, offset to pixel data         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  DIB Header (40 bytes)                   β”‚
β”‚  Width, height, bits per pixel, etc.     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Pixel Data (width Γ— height Γ— 3 bytes)   β”‚
β”‚  B₁G₁R₁ Bβ‚‚Gβ‚‚Rβ‚‚ B₃G₃R₃ Bβ‚„Gβ‚„Rβ‚„ ...    β”‚
β”‚  (bottom row first, left to right)       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Fun fact: BMP stores pixels bottom-up and in BGR order (blue first). A quirk from early Windows graphics.

A Tiny Example: A 4Γ—4 Flag

πŸ‡ΈπŸ‡ͺ A 4Γ—4 Swedish Flag

16 pixels Γ— 3 bytes = 48 bytes of pixel data + 54 bytes headers = 102 bytes total Blue (0, 70, 173) appears 12 times = 36 bytes Yellow (254, 204, 0) appears 4 times = 12 bytes That's a lot of repetition!

The Key Insight

BMP is perfectly faithful but catastrophically wasteful. The blue pixel (0, 70, 173) appears 12 times. We wrote those same 3 bytes twelve times. Redundancy is an invitation for compression.

Part 2 β€” The Format Zoo: PNG, GIF, TIFF (β‰ˆ 20 min)

The Problem

A 12-megapixel photo is 36 MB as a BMP. We want it smaller β€” much smaller. But there are two very different philosophies of compression, and each format picks a side.

The Two Philosophies

LosslessLossy
PromiseGet back every single bitGet back something that looks the same
AnalogyZipping a Word documentSummarising a novel
Typical ratio2:1 to 5:110:1 to 50:1
Good forText, logos, medical scansPhotos, video, music
FormatsPNG, GIF, TIFF, ZIPJPEG, MP3, H.264

πŸ’‘ "Lossless compression is like packing a suitcase perfectly β€” everything fits, and when you unpack, nothing's missing. Lossy compression is like leaving out the socks nobody will see."

🎨 GIF β€” The 256-Colour Veteran (1987)

Key idea: Reduce the palette to at most 256 colours, then compress with LZW.

Step 1: Colour quantization. A photo might use millions of colours. GIF picks the best 256 and maps every pixel to the nearest one β€” each pixel is now one byte instead of three.

Step 2: LZW Compression (Lempel-Ziv-Welch). A dictionary-based method β€” the encoder builds a dictionary of recurring sequences on the fly:

Input:    A B A B A B A B ...
Step 1:   A β†’ code 0         (dictionary: {A:0, B:1})
Step 2:   B β†’ code 1
Step 3:   AB not in dict β†’ add AB:2, output A(0)
Step 4:   BA not in dict β†’ add BA:3, output B(1)
Step 5:   AB found! ABA not in dict β†’ add ABA:4, output AB(2)
...

The decoder can rebuild the exact same dictionary from the compressed data! No dictionary needs to be transmitted.

The killer feature: Animation! GIF supports multiple frames, making it the internet's favourite looping format for decades.

πŸ–ΌοΈ PNG β€” The Lossless Champion (1996)

Key idea: Keep all colours, predict each pixel, compress the residuals.

PNG applies a prediction filter to each row. Instead of storing raw values, it stores the difference between each pixel and a prediction:

FilterPredictionDescription
None0Raw value
SubPixel to the leftHorizontal prediction
UpPixel aboveVertical prediction
AverageMean of left and aboveDiagonal compromise
PaethNearest of left, above, upper-leftAdaptive (Alan Paeth)

πŸ’‘ "The filter doesn't compress anything itself. It transforms the data so the actual compressor (DEFLATE) can work more efficiently. Like pre-chewing food for a baby."

The Format Comparison

FormatYearCompressionLossy?ColoursTransparencyBest for
BMP1986None/RLENo16M+NoNothing (obsolete)
GIF1987LZWNo*2561-bitAnimations, logos
TIFF1986VariousOptional16M+YesProfessional/scientific
PNG1996DEFLATENo16M+Full alphaScreenshots, web graphics
JPEG1992DCT+HuffmanYes16MNoPhotographs

*GIF's palette reduction is technically lossy, but the compression step itself is lossless.

Dead-End: Why Lossless Isn't Enough

The fundamental limit of lossless compression:

Shannon's Source Coding Theorem (1948) proves you cannot compress data below its entropy β€” the intrinsic information content. For photographs, the entropy per pixel is high.

No lossless method can compress a typical photo below about 2:1 to 4:1. This is a mathematical law, not a technological limitation.

To reach 10:1 or 50:1, we must lose something. JPEG's genius: it throws away precisely what your brain won't miss.

β˜• Break (5 min)


Part 3 β€” The Science of Not Noticing: Human Visual Perception (β‰ˆ 10 min)

Weber's Law & Contrast Sensitivity

JPEG is built on psychophysics β€” the science of how physical stimuli create perceptual experiences.

Weber's Law (1834): The just-noticeable difference (JND) in a stimulus is proportional to the stimulus magnitude. $$\frac{\Delta I}{I} \approx \text{constant} \approx 0.02 \text{ (for brightness)}$$

πŸ‘οΈ Interactive: Weber's Law Demo

Can you spot the slightly brighter square? It's harder against bright backgrounds!

40
8
Background: 40, Target: 48, Weber fraction Ξ”I/I = 20.0% β€” EASY to see

Spatial Frequency: Your Eye Has a Resolution Limit

Your eye is not equally sensitive to all detail. Spatial frequency measures how rapidly brightness changes:

Low frequencyMedium frequencyHigh frequency
Smooth gradients, sky, skinEdges, textures, hairNoise, fine detail
πŸ‘οΈ Very sensitiveπŸ‘οΈ Most sensitiveπŸ‘οΈ Least sensitive

At high spatial frequencies (fine detail), your sensitivity drops dramatically. JPEG exploits this ruthlessly.

Colour vs. Brightness: The Chroma Trick

Your eye is far more sensitive to brightness (luminance) than to colour (chrominance). JPEG converts from RGB to YCbCr:

ChannelMeaningSensitivity
YLuminance (brightness)β˜…β˜…β˜…β˜…β˜… High
CbBlue-difference chrominanceβ˜…β˜…β˜†β˜†β˜† Low
CrRed-difference chrominanceβ˜…β˜…β˜†β˜†β˜† Low
$$Y = 0.299R + 0.587G + 0.114B$$

Notice green contributes the most (0.587) β€” because your M cones (green-sensitive) are the most numerous!

πŸ”„ Interactive: RGB β†’ YCbCr Converter

Drag the sliders to see how JPEG separates brightness from colour β€” all 7 channels visualised.

255
128
0

After conversion, JPEG downsamples the Cb and Cr channels β€” keeping only one colour sample for every four pixels (4:2:0). This cuts the data by a third, and you can barely tell.

πŸ’‘ "Imagine telling a painter: 'Use detailed brushwork for light and shadow, but slap the colour on roughly.' That's chroma subsampling."


Solution 1 β€” Run-Length Encoding: The Simplest Trick πŸ₯‰

Setup

The simplest compression: if a value repeats, store the value and the count.

Computation

Compress our flag's first row:

Original:  BLUE  BLUE  YELLOW  BLUE
RLE:       2Γ—BLUE  1Γ—YELLOW  1Γ—BLUE
RepresentationBytes
Original (3 bytes Γ— 4 pixels)12
RLE: (2, 0,70,173), (1, 254,204,0), (1, 0,70,173)12

No savings! But the all-yellow row: 4Γ—(254,204,0) = 4 bytes instead of 12 β€” 3:1 ratio.

πŸ“¦ Interactive: Run-Length Encoder

Type a string and see how RLE compresses it:

Result

$$\boxed{\text{RLE: great for simple graphics, useless for photos}}$$

Solution 2 β€” Huffman Coding: Nature's ZIP πŸ₯ˆ

Key Observation

In Morse code, the most common letter (E) gets the shortest code (a single dot). Huffman coding does this optimally: short codes for common symbols, long codes for rare ones.

Shannon's Entropy β€” The Theoretical Floor

Claude Shannon (1948) defined the minimum average bits per symbol:

$$H = -\sum_{i} p_i \log_2 p_i$$

πŸ“Š Interactive: Shannon Entropy Calculator

Enter a string to see its entropy and how close Huffman gets:

Building a Huffman Tree

The algorithm (greedy, bottom-up):

  1. Create a leaf node for each symbol with its frequency.
  2. Repeat: merge the two lowest-frequency nodes into a parent.
  3. When one node remains, that's the root. Left = 0, right = 1.

🌳 Interactive: Huffman Tree Builder

Type a string to build its Huffman tree and see the codes:

The Prefix Property

βœ‹ "If black is '0' and grey is '10', how does the decoder know whether '0' is complete or the start of '10'?"

Huffman codes are prefix-free: no code is a prefix of any other. The decoder reads bits left-to-right and decodes unambiguously β€” no separators needed!

Result

$$\boxed{\text{Huffman coding: average bits/symbol} \le H + 1\text{, optimal among prefix codes}}$$
MetricValue
TypeLossless
OptimalityBest possible prefix-free code
Used byJPEG, PNG (via DEFLATE), ZIP, MP3
LimitationMust assign whole-number bit lengths

Solution 3 β€” The Discrete Cosine Transform: Hearing the Image πŸ₯‡

Setup

The DCT looks at an image the way a musician looks at a chord β€” breaking it into component frequencies.

The Fourier Idea

Joseph Fourier (1807) discovered:

Any periodic function can be written as a sum of sines and cosines.

This works for sound (audio frequencies) and images (spatial frequencies).

The 8Γ—8 Block: JPEG's Canvas

JPEG divides the image into 8Γ—8 blocks (64 pixels each) and transforms each independently. Why 8Γ—8? Small enough to be fast, large enough for meaningful frequency content, and $8 = 2^3$ is friendly for fast algorithms.

The DCT Formula

$$F(u,v) = \frac{1}{4} C(u)\,C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\!\left[\frac{(2x+1)u\pi}{16}\right] \cos\!\left[\frac{(2y+1)v\pi}{16}\right]$$

where $C(0) = 1/\sqrt{2}$ and $C(k) = 1$ for $k > 0$.

πŸ—£οΈ "Don't panic! What matters is what it DOES."

The 64 Basis Patterns

Each coefficient $(u,v)$ corresponds to a cosine "wave pattern." Position $(0,0)$ is the DC coefficient (average brightness). Higher positions = finer detail.

🌊 Interactive: DCT Basis Patterns

Each position (u,v) in the 8Γ—8 frequency grid has a unique cosine wave pattern. Drag to explore individual patterns.

0
0

πŸŽ›οΈ Frequency Mixer β€” Blend 8 DCT Coefficients

Every JPEG 8Γ—8 block is a weighted sum of basis patterns. Drag the amplitudes to build your own block β€” then try the presets!

1024
0
0
0
0
0
0
0
Presets:

The Key Insight

The DCT doesn't compress anything. It reorganises information so that important parts (low frequencies) separate from unimportant parts (high frequencies). This sets the stage for: throw the unimportant parts away.

πŸ’‘ "For natural photographs, most high-frequency DCT coefficients are nearly zero. The energy is concentrated in a few low-frequency terms."


Solution 4 β€” The Full JPEG Pipeline πŸŽ“

The Pipeline

$$\text{RGB} \xrightarrow{1} \text{YCbCr} \xrightarrow{2} \text{Subsample} \xrightarrow{3} \text{DCT (8Γ—8)} \xrightarrow{4} \text{Quantize} \xrightarrow{5} \text{Huffman}$$

Stage 4: Quantization β€” The Art of Forgetting

Each DCT coefficient is divided by a value from a quantization matrix and rounded:

$$F_q(u,v) = \text{round}\!\left(\frac{F(u,v)}{Q(u,v)}\right)$$

The standard JPEG luminance quantization matrix:

Q = β”‚ 16  11  10  16  24  40  51  61 β”‚
    β”‚ 12  12  14  19  26  58  60  55 β”‚
    β”‚ 14  13  16  24  40  57  69  56 β”‚
    β”‚ 14  17  22  29  51  87  80  62 β”‚
    β”‚ 18  22  37  56  68 109 103  77 β”‚
    β”‚ 24  35  55  64  81 104 113  92 β”‚
    β”‚ 49  64  78  87 103 121 120 101 β”‚
    β”‚ 72  92  95  98 112 100 103  99 β”‚

βœ‹ "Top-left values are small (preserve low frequencies). Bottom-right values are large (destroy high frequencies). This IS the psychovisual model."

🎚️ Interactive: JPEG Quality Simulator

See how quantization destroys high-frequency detail. The 8Γ—8 block on the left is the original; the right is after DCT β†’ Quantize β†’ inverse DCT.

50%
Original
After JPEG

Stage 5: Entropy Coding

After quantization, the block is full of zeros. JPEG reads coefficients in zigzag order to group zeros together:

 β†’ 1  2  6  7  15 16 28 29
   3  5  8  14 17 27 30 43
   4  9  13 18 26 31 42 44
   10 12 19 25 32 41 45 54
   11 20 24 33 40 46 53 55
   21 23 34 39 47 52 56 61
   22 35 38 48 51 57 60 62
   36 37 49 50 58 59 63 64

Then it applies RLE on zeros + Huffman coding on the remaining values. The DC coefficient stores only the difference from the previous block.

The Full Pipeline on a Smooth Gradient Block

StageDataSize (bits)
Original64 pixel values Γ— 8 bits512
After DCT64 coefficients (reorganised)512
After quantize44, βˆ’26, 0, βˆ’1, then 60 zeros~64 values mostly zeros
After zigzag + RLE(0,44), (0,βˆ’26), (1,βˆ’1), EOB4 symbols
After HuffmanCompact bit string~30
$$\frac{512}{30} \approx 17{:}1 \text{ compression ratio for this block!}$$

When JPEG Fails

JPEG quality β‰ˆ 5%: Almost every DCT coefficient becomes zero except the DC term (average brightness per block). The image degrades into a mosaic of 8Γ—8 solid-colour tiles.

Rule: Never save text, screenshots, or logos as JPEG. Use PNG for sharp edges, JPEG for photos.

Result

$$\boxed{\text{JPEG}: \text{RGB} \to \text{YCbCr} \to \text{subsample} \to \text{DCT} \to \text{quantize} \to \text{Huffman} \implies 10\text{–}50{:}1}$$

Summary β€” The Complete Picture

The Compression Spectrum

$$\underbrace{\text{BMP}}_{1:1} \longrightarrow \underbrace{\text{RLE}}_{2\text{–}10:1} \longrightarrow \underbrace{\text{Huffman}}_{2\text{–}4:1} \longrightarrow \underbrace{\text{PNG}}_{2\text{–}5:1} \longrightarrow \underbrace{\text{JPEG}}_{10\text{–}50:1}$$

Key Numbers

FactValue
Bits per pixel (RGB)24
12 MP photo as BMP36 MB
12 MP photo as PNG~12 MB
12 MP photo as JPEG (Q85)~3 MB
12 MP photo as JPEG (Q50)~1.5 MB
Human cone types3 (S, M, L)
DCT block size8 Γ— 8 = 64 pixels
Shannon entropy formula$H = -\sum p_i \log_2 p_i$

Why Multiple Approaches?

MethodKey IdeaTypeBest ratioPrerequisites
RLE πŸ₯‰Count repeated valuesLossless~10:1 (graphics)Counting
Huffman πŸ₯ˆShort codes for common symbolsLossless~4:1Probability, trees
DCT + Quantize πŸ₯‡Separate frequencies, discard invisibleLossy~20:1Trig, cosines
Full JPEG πŸŽ“All of the above, orchestratedLossy~50:1All above

The Deeper Message

DisciplineContribution
PhysicsLight as EM waves; trichromaticity
PsychophysicsWeber's Law; contrast sensitivity; chroma insensitivity
Information theoryShannon entropy; limits of lossless compression
MathematicsFourier/DCT transforms; Huffman's theorem
Engineering8Γ—8 blocks; zigzag scan; quantization tables
"JPEG compression is a mathematical theory of what humans don't notice. It's one of the most elegant examples in all of science: understanding perception well enough to exploit it."

Extensions & Challenge Problems 🧩

πŸ₯‰ Bronze

  1. Byte count. A 1920Γ—1080 image has how many pixels? How many bytes as raw BMP? Express in MB.
  2. Huffman drill. Build a Huffman tree for ABRACADABRA. What are the codes? Average bits/char?
  3. RLE practice. Run-length encode AAABBBCCDDDDDDEE. How many bytes RLE vs. original?

πŸ₯ˆ Silver

  1. Shannon entropy. The letters in MISSISSIPPI have frequencies: M=1, I=4, S=4, P=2. Compute $H$. How close does Huffman get?
  2. YCbCr conversion. Convert RGB = (255, 128, 0) to YCbCr. What does the large Y tell you?
  3. Quantization. Given $F(3,5)=45$ and $Q(3,5)=87$: quantized value? Dequantized? What was lost?

πŸ₯‡ Gold

  1. Huffman optimality. Prove that in an optimal prefix-free code, the two least-probable symbols have equal length and differ only in the last bit.
  2. DCT energy. For [100, 102, 101, 103, 100, 101, 102, 100], compute $F(0)$ and the fraction of total energy in the DC term.
  3. Why 8Γ—8? What would happen with 4Γ—4 or 16Γ—16 blocks? Discuss frequency resolution, artifacts, and cost.

πŸŽ“ Diamond

  1. Gibbs phenomenon. Research why truncating a Fourier series causes overshoot near discontinuities. Can you estimate the percentage?
  2. Arithmetic coding. Explain why it beats Huffman and why JPEG 2000 switched to it.
  3. JPEG vs. wavelets. Why does JPEG 2000's DWT eliminate blocking artifacts?

Answer Key (for instructors)

Bronze
  1. $1920 \times 1080 = 2{,}073{,}600$ pixels. At 3 bytes: $6{,}220{,}800$ bytes $= 5.93$ MB.
  2. Frequencies: A=5, B=2, R=2, C=1, D=1. One valid tree: A=0, B=10, R=110, C=1110, D=1111. Average: $(5 \cdot 1 + 2 \cdot 2 + 2 \cdot 3 + 1 \cdot 4 + 1 \cdot 4)/11 = 23/11 \approx 2.09$ bits/char.
  3. Original: 17 chars = 17 bytes. RLE: (3,A)(3,B)(2,C)(6,D)(2,E) = 10 bytes. Savings: 41%.
Silver
  1. $H = -(1/11)\log_2(1/11) - 2(4/11)\log_2(4/11) - (2/11)\log_2(2/11) \approx 1.823$ bits/char. Huffman (I=0, S=10, P=110, M=111) gives $21/11 = 1.909$ bits/char β€” about 5% above entropy.
  2. $Y = 0.299(255) + 0.587(128) + 0.114(0) = 151.3$ (bright). $C_b \approx 42.5$ (low blue). $C_r \approx 201.9$ (high red). Makes sense for orange!
  3. $\text{round}(45/87) = 1$. Dequantized: $1 \times 87 = 87$. Error: $|87-45| = 42$. But at position (3,5), that's a high frequency humans can't see.
Gold
  1. Swap argument: if symbol $a$ is rarer than $b$ but has shorter code, swap their codes β€” average length decreases or stays same, contradicting optimality if it strictly decreases. Applying recursively shows the two rarest must share length and be siblings.
  2. Sum = 809. $F(0) = 809/\sqrt{8} \approx 286.0$. $F(0)^2 = 81{,}796$. Total energy = $\sum f(n)^2 = 81{,}819$. Fraction: $99.97\%$ in DC β€” extremely smooth signal.
  3. 4Γ—4: less frequency resolution, more blocks/overhead, smaller artifacts. 16Γ—16: better resolution but edges within block need many coefficients; larger visible artifacts. 8Γ—8 is the sweet spot.

This lecture was generated for a Math Circle session.