Ages: 11–15 · Duration: 105 minutes · Topics: Image representation, Information theory, Lossless & lossy compression, Human visual perception, Discrete Cosine Transform
"Your phone takes a photo – 12 megapixels, stunning detail. The raw data is 36 million bytes. But the file you share is 2 million bytes. Somehow, 34 million bytes vanished – and nobody can tell. Where did they go? Today, we find out – and the answer involves physics, psychology, and beautiful mathematics."
Let's start with something basic. What do you see?
You see light. Specifically, electromagnetic radiation β waves with wavelengths between about 380 nm (violet) and 700 nm (red). Your eye has three types of cone cells:
| Cone type | Peak sensitivity | Colour perceived |
|---|---|---|
| S (Short) | ~420 nm | Blue-violet |
| M (Medium) | ~530 nm | Green |
| L (Long) | ~560 nm | Red-yellow |
Every colour you've ever seen is your brain's interpretation of how strongly these three cone types fired. A pure yellow light (580 nm) stimulates your L and M cones – but so does a mixture of red and green light. Your brain can't tell the difference!
💡 "This is why screens only need three colours. They're not showing you real yellow – they're tricking your cones with a red–green cocktail."
| Question | Answer |
|---|---|
| How many cone types do humans have? | 3 (S, M, L) |
| Can a screen make real yellow light? | No – it mixes red + green to fake it |
| Why do TVs use Red, Green, Blue? | Those roughly match our 3 cone types |
| How many distinct colours can 24-bit RGB represent? | $256^3 = 16{,}777{,}216$ |
A digital image is a rectangular grid of tiny coloured squares called pixels. Each pixel stores three numbers β one for Red, one for Green, one for Blue β each between 0 and 255. That's all. Every photo you've ever taken is just millions of number triplets.
The simplest image format is BMP (Bitmap). It stores every pixel's RGB values with a small header.
Question: A photo from your phone is 4000 × 3000 pixels. At 3 bytes per pixel, how big is it as a raw BMP?

$$4000 \times 3000 \times 3 = 36{,}000{,}000 \text{ bytes} = 36 \text{ MB}$$
```
┌────────────────────────────────────────────┐
│ BMP Header (14 bytes)                      │
│   File size, offset to pixel data          │
├────────────────────────────────────────────┤
│ DIB Header (40 bytes)                      │
│   Width, height, bits per pixel, etc.      │
├────────────────────────────────────────────┤
│ Pixel Data (width × height × 3 bytes)      │
│   B₀G₀R₀ B₁G₁R₁ B₂G₂R₂ B₃G₃R₃ ...          │
│   (bottom row first, left to right)        │
└────────────────────────────────────────────┘
```
Fun fact: BMP stores pixels bottom-up and in BGR order (blue first). A quirk from early Windows graphics.
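To make the layout concrete, here is a minimal sketch (Python, standard library only) that packs a tiny 2×2 image the way BMP does and then reads the header fields back. `make_bmp` is an illustrative helper, not a full-featured BMP writer:

```python
import struct

# Pack a 2x2 image as a 24-bit BMP: 14-byte file header, 40-byte DIB
# header, then pixel data stored bottom-up, each pixel in B, G, R order.
def make_bmp(width, height, rows_top_down):
    row_size = (width * 3 + 3) // 4 * 4          # rows pad to 4 bytes
    pixel_data = bytearray()
    for row in reversed(rows_top_down):          # bottom row first!
        for (r, g, b) in row:
            pixel_data += bytes([b, g, r])       # BGR order!
        pixel_data += bytes(row_size - width * 3)
    file_size = 14 + 40 + len(pixel_data)
    header = struct.pack("<2sIHHI", b"BM", file_size, 0, 0, 54)
    dib = struct.pack("<IiiHHIIiiII", 40, width, height, 1, 24,
                      0, len(pixel_data), 2835, 2835, 0, 0)
    return header + dib + bytes(pixel_data)

bmp = make_bmp(2, 2, [[(255, 0, 0), (0, 255, 0)],
                      [(0, 0, 255), (255, 255, 255)]])
width, height = struct.unpack_from("<ii", bmp, 18)
bpp, = struct.unpack_from("<H", bmp, 28)
print(width, height, bpp)   # 2 2 24
```

Parsing the file we just built confirms the quirk: the first bytes of pixel data (offset 54) belong to the bottom-left pixel, blue channel first.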
BMP is perfectly faithful but catastrophically wasteful. The blue pixel (0, 70, 173) appears 12 times. We wrote those same 3 bytes twelve times. Redundancy is an invitation for compression.
A 12-megapixel photo is 36 MB as a BMP. We want it smaller β much smaller. But there are two very different philosophies of compression, and each format picks a side.
| | Lossless | Lossy |
|---|---|---|
| Promise | Get back every single bit | Get back something that looks the same |
| Analogy | Zipping a Word document | Summarising a novel |
| Typical ratio | 2:1 to 5:1 | 10:1 to 50:1 |
| Good for | Text, logos, medical scans | Photos, video, music |
| Formats | PNG, GIF, TIFF, ZIP | JPEG, MP3, H.264 |
💡 "Lossless compression is like packing a suitcase perfectly – everything fits, and when you unpack, nothing's missing. Lossy compression is like leaving out the socks nobody will see."
Key idea: Reduce the palette to at most 256 colours, then compress with LZW.
Step 1: Colour quantization. A photo might use millions of colours. GIF picks the best 256 and maps every pixel to the nearest one – each pixel is now one byte instead of three.
Step 2: LZW Compression (Lempel-Ziv-Welch). A dictionary-based method – the encoder builds a dictionary of recurring sequences on the fly:
```
Input: A B A B A B A B ...

Step 1: A → code 0   (initial dictionary: {A:0, B:1})
Step 2: B → code 1
Step 3: AB not in dict → add AB:2, output A (0)
Step 4: BA not in dict → add BA:3, output B (1)
Step 5: AB found! ABA not in dict → add ABA:4, output AB (2)
...
```
The decoder can rebuild the exact same dictionary from the compressed data! No dictionary needs to be transmitted.
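The trace above fits in a few lines of Python. This is a minimal sketch over string symbols (`lzw_encode` and `lzw_decode` are illustrative names; real GIF additionally packs the codes into variable-width bit fields):

```python
# Minimal LZW: the encoder learns new strings as it goes...
def lzw_encode(text):
    # Dictionary starts with every single character in the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    current, output = "", []
    for ch in text:
        if current + ch in dictionary:
            current += ch                               # extend the match
        else:
            output.append(dictionary[current])
            dictionary[current + ch] = len(dictionary)  # learn a new string
            current = ch
    if current:
        output.append(dictionary[current])
    return output

# ...and the decoder relearns exactly the same dictionary.
def lzw_decode(codes, alphabet):
    inverse = {i: ch for i, ch in enumerate(sorted(alphabet))}
    prev = inverse[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # Tricky case: the needed code may be the one about to be defined.
        entry = inverse[code] if code in inverse else prev + prev[0]
        out.append(entry)
        inverse[len(inverse)] = prev + entry[0]         # rebuild dictionary
        prev = entry
    return "".join(out)

codes = lzw_encode("ABABABAB")
print(codes)                            # [0, 1, 2, 4, 1]
print(lzw_decode(codes, set("AB")))     # ABABABAB
```

Note that the decoder is handed only the codes and the starting alphabet, never the learned dictionary, yet it reconstructs the input exactly.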
The killer feature: Animation! GIF supports multiple frames, making it the internet's favourite looping format for decades.
Key idea: Keep all colours, predict each pixel, compress the residuals.
PNG applies a prediction filter to each row. Instead of storing raw values, it stores the difference between each pixel and a prediction:
| Filter | Prediction | Description |
|---|---|---|
| None | 0 | Raw value |
| Sub | Pixel to the left | Horizontal prediction |
| Up | Pixel above | Vertical prediction |
| Average | Mean of left and above | Diagonal compromise |
| Paeth | Nearest of left, above, upper-left | Adaptive (Alan Paeth) |
💡 "The filter doesn't compress anything itself. It transforms the data so the actual compressor (DEFLATE) can work more efficiently. Like pre-chewing food for a baby."
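The cleverest of the filters, the Paeth predictor (PNG filter type 4), estimates p = a + b − c from three neighbours, then returns whichever neighbour is closest to that estimate. A sketch:

```python
# Paeth predictor: a = left, b = above, c = upper-left.
# Ties break in the order left, above, upper-left (as in the PNG spec).
def paeth_predict(a, b, c):
    p = a + b - c                      # initial linear estimate
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    return b if pb <= pc else c

print(paeth_predict(10, 20, 10))   # 20: on a vertical edge, "above" wins
print(paeth_predict(20, 10, 10))   # 20: on a horizontal edge, "left" wins
```

The predictor adapts per pixel: near a vertical edge it behaves like the Up filter, near a horizontal edge like the Sub filter.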
| Format | Year | Compression | Lossy? | Colours | Transparency | Best for |
|---|---|---|---|---|---|---|
| BMP | 1986 | None/RLE | No | 16M+ | No | Nothing (obsolete) |
| GIF | 1987 | LZW | No* | 256 | 1-bit | Animations, logos |
| TIFF | 1986 | Various | Optional | 16M+ | Yes | Professional/scientific |
| PNG | 1996 | DEFLATE | No | 16M+ | Full alpha | Screenshots, web graphics |
| JPEG | 1992 | DCT+Huffman | Yes | 16M | No | Photographs |
*GIF's palette reduction is technically lossy, but the compression step itself is lossless.
Shannon's Source Coding Theorem (1948) proves you cannot compress data below its entropy β the intrinsic information content. For photographs, the entropy per pixel is high.
No lossless method can compress a typical photo below about 2:1 to 4:1. This is a mathematical law, not a technological limitation.
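Shannon's formula can be checked directly. A small Python sketch (`entropy_bits_per_symbol` is an illustrative name) computes H for two toy "images":

```python
from collections import Counter
from math import log2

# Shannon entropy: H = -sum(p_i * log2(p_i)) over symbol frequencies.
def entropy_bits_per_symbol(data):
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A highly repetitive source compresses well...
print(round(entropy_bits_per_symbol("AAAAAAAB"), 3))   # 0.544
# ...but eight equally likely symbols need the full 3 bits each.
print(entropy_bits_per_symbol("ABCDEFGH"))             # 3.0
```

The repetitive string could in principle be squeezed to about half a bit per symbol; the "noisy" one cannot be compressed losslessly at all.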
To reach 10:1 or 50:1, we must lose something. JPEG's genius: it throws away precisely what your brain won't miss.
JPEG is built on psychophysics β the science of how physical stimuli create perceptual experiences.
Weber's Law (1834): The just-noticeable difference (JND) in a stimulus is proportional to the stimulus magnitude. $$\frac{\Delta I}{I} \approx \text{constant} \approx 0.02 \text{ (for brightness)}$$
Your eye is not equally sensitive to all detail. Spatial frequency measures how rapidly brightness changes:
| Low frequency | Medium frequency | High frequency |
|---|---|---|
| Smooth gradients, sky, skin | Edges, textures, hair | Noise, fine detail |
| 👁️ Very sensitive | 👁️ Most sensitive | 👁️ Least sensitive |
At high spatial frequencies (fine detail), your sensitivity drops dramatically. JPEG exploits this ruthlessly.
Your eye is far more sensitive to brightness (luminance) than to colour (chrominance). JPEG converts from RGB to YCbCr:
| Channel | Meaning | Sensitivity |
|---|---|---|
| Y | Luminance (brightness) | ★★★★★ High |
| Cb | Blue-difference chrominance | ★★☆☆☆ Low |
| Cr | Red-difference chrominance | ★★☆☆☆ Low |
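The conversion itself, with the coefficients given in the JFIF specification (R, G, B and the results all in the range 0–255):

$$\begin{aligned} Y &= 0.299\,R + 0.587\,G + 0.114\,B \\ C_b &= 128 - 0.1687\,R - 0.3313\,G + 0.5\,B \\ C_r &= 128 + 0.5\,R - 0.4187\,G - 0.0813\,B \end{aligned}$$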
Notice green contributes the most (0.587) – the eye's overall sensitivity to brightness peaks in the green part of the spectrum, around 555 nm.
After conversion, JPEG downsamples the Cb and Cr channels, keeping only one colour sample for every 2×2 block of pixels (4:2:0). This halves the total data (1.5 bytes per pixel instead of 3), and you can barely tell.
💡 "Imagine telling a painter: 'Use detailed brushwork for light and shadow, but slap the colour on roughly.' That's chroma subsampling."
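A sketch of the 2×2 averaging this implies (a simplification: real encoders may use different downsampling filters):

```python
# 4:2:0 subsampling sketch: keep every Y value, but average each
# 2x2 block of a chroma channel down to a single sample.
def subsample_420(channel):
    h, w = len(channel), len(channel[0])
    return [[(channel[y][x] + channel[y][x + 1] +
              channel[y + 1][x] + channel[y + 1][x + 1]) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

cb = [[100, 102, 200, 198],
      [ 98, 100, 202, 200],
      [ 50,  50,  50,  50],
      [ 50,  50,  50,  50]]
small = subsample_420(cb)
print(small)   # [[100, 200], [50, 50]]
# Per pixel: 1 (Y) + 1/4 (Cb) + 1/4 (Cr) = 1.5 bytes instead of 3.
```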
The simplest compression: if a value repeats, store the value and the count.
Compress our flag's first row:
```
Original: BLUE BLUE YELLOW BLUE
RLE:      2×BLUE 1×YELLOW 1×BLUE
```
| Representation | Bytes |
|---|---|
| Original (3 bytes Γ 4 pixels) | 12 |
| RLE: (2, 0,70,173), (1, 254,204,0), (1, 0,70,173) | 12 |
No savings! But the all-yellow row: 4×(254,204,0) = 4 bytes instead of 12 – a 3:1 ratio.
In Morse code, the most common letter (E) gets the shortest code (a single dot). Huffman coding does this optimally: short codes for common symbols, long codes for rare ones.
Claude Shannon (1948) defined the minimum average bits per symbol:
$$H = -\sum_{i} p_i \log_2 p_i$$

The algorithm (greedy, bottom-up):
1. Create a leaf node for each symbol, weighted by its frequency.
2. Remove the two lowest-weight nodes and join them under a new parent whose weight is their sum.
3. Repeat until one tree remains, then label each left branch 0 and each right branch 1; a symbol's code is the path from the root to its leaf.
❓ "If black is '0' and grey is '10', how does the decoder know whether '0' is complete or the start of '10'?"
Huffman codes are prefix-free: no code is a prefix of any other. The decoder reads bits left-to-right and decodes unambiguously – no separators needed!
| Metric | Value |
|---|---|
| Type | Lossless |
| Optimality | Best possible prefix-free code |
| Used by | JPEG, PNG (via DEFLATE), ZIP, MP3 |
| Limitation | Must assign whole-number bit lengths |
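A compact Huffman sketch using Python's `heapq` (illustrative, not JPEG's exact tables), run on the letter frequencies of MISSISSIPPI:

```python
import heapq

# Build the tree bottom-up: repeatedly merge the two least frequent
# nodes, then read codes off the tree (left branch = 0, right = 1).
def huffman_codes(freqs):
    # Heap entries are (weight, tiebreak, tree); a tree is either a
    # symbol or a (left, right) pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # single-symbol edge case
    walk(heap[0][2], "")
    return codes

freqs = {"M": 1, "I": 4, "S": 4, "P": 2}   # letters of MISSISSIPPI
codes = huffman_codes(freqs)
avg = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
print(codes)
print(round(avg, 2))   # 1.91 bits/char, close to the entropy of ~1.82
```

Common letters (I, S) get 1-2 bit codes; the rare M gets 3 bits, and the average lands just above the entropy bound, as Shannon predicts.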
The DCT looks at an image the way a musician looks at a chord β breaking it into component frequencies.
Joseph Fourier (1807) discovered:
Any periodic function can be written as a sum of sines and cosines.
This works for sound (audio frequencies) and images (spatial frequencies).
JPEG divides the image into 8×8 blocks (64 pixels each) and transforms each independently. Why 8×8? Small enough to be fast, large enough for meaningful frequency content, and $8 = 2^3$ is friendly for fast algorithms.
$$F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)\, \cos\frac{(2x+1)u\pi}{16}\, \cos\frac{(2y+1)v\pi}{16}$$

where $C(0) = 1/\sqrt{2}$ and $C(k) = 1$ for $k > 0$.
🗣️ "Don't panic! What matters is what it DOES."
Each coefficient $(u,v)$ corresponds to a cosine "wave pattern." Position $(0,0)$ is the DC coefficient (average brightness). Higher positions = finer detail.
The DCT doesn't compress anything. It reorganises information so that important parts (low frequencies) separate from unimportant parts (high frequencies). This sets the stage for: throw the unimportant parts away.
💡 "For natural photographs, most high-frequency DCT coefficients are nearly zero. The energy is concentrated in a few low-frequency terms."
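A direct, unoptimised implementation of the 8×8 DCT JPEG uses (real codecs use fast factorisations; JPEG also level-shifts pixels by subtracting 128 first, skipped here for clarity):

```python
from math import cos, pi, sqrt

# Direct (slow) 8x8 DCT-II: each output coefficient is a weighted sum
# of all 64 pixels against a cosine wave pattern.
def dct_8x8(block):
    def C(k):
        return 1 / sqrt(2) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = sum(block[x][y]
                    * cos((2 * x + 1) * u * pi / 16)
                    * cos((2 * y + 1) * v * pi / 16)
                    for x in range(8) for y in range(8))
            out[u][v] = 0.25 * C(u) * C(v) * s
    return out

flat = [[128] * 8 for _ in range(8)]   # a perfectly flat grey block
coeffs = dct_8x8(flat)
print(round(coeffs[0][0]))   # 1024: all the energy sits in the DC term
print(round(coeffs[3][5]))   # 0: no detail means no higher frequencies
```

A flat block produces exactly one nonzero coefficient; every wiggle in the pixels spreads energy into the higher-frequency slots.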
Each DCT coefficient is divided by a value from a quantization matrix and rounded:
$$F_q(u,v) = \text{round}\!\left(\frac{F(u,v)}{Q(u,v)}\right)$$

The standard JPEG luminance quantization matrix:
```
Q = | 16  11  10  16  24  40  51  61 |
    | 12  12  14  19  26  58  60  55 |
    | 14  13  16  24  40  57  69  56 |
    | 14  17  22  29  51  87  80  62 |
    | 18  22  37  56  68 109 103  77 |
    | 24  35  55  64  81 104 113  92 |
    | 49  64  78  87 103 121 120 101 |
    | 72  92  95  98 112 100 103  99 |
```
❗ "Top-left values are small (preserve low frequencies). Bottom-right values are large (destroy high frequencies). This IS the psychovisual model."
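Quantization in code, using the standard luminance matrix (the coefficient values in `F` below are made up for illustration):

```python
# Divide each DCT coefficient by its matrix entry and round: small
# divisors (top-left) keep precision, large divisors (bottom-right)
# round most coefficients to zero.
Q = [[16, 11, 10, 16, 24, 40, 51, 61],
     [12, 12, 14, 19, 26, 58, 60, 55],
     [14, 13, 16, 24, 40, 57, 69, 56],
     [14, 17, 22, 29, 51, 87, 80, 62],
     [18, 22, 37, 56, 68, 109, 103, 77],
     [24, 35, 55, 64, 81, 104, 113, 92],
     [49, 64, 78, 87, 103, 121, 120, 101],
     [72, 92, 95, 98, 112, 100, 103, 99]]

def quantize(F, Q):
    return [[round(F[u][v] / Q[u][v]) for v in range(8)] for u in range(8)]

F = [[0.0] * 8 for _ in range(8)]        # a mostly-empty coefficient block
F[0][0], F[0][1], F[7][7] = 700.0, -290.0, 30.0
Fq = quantize(F, Q)
# Low frequencies survive, the high-frequency 30 is wiped out:
print(Fq[0][0], Fq[0][1], Fq[7][7])   # 44 -26 0
```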
After quantization, the block is full of zeros. JPEG reads coefficients in zigzag order to group zeros together:
```
 1  2  6  7 15 16 28 29
 3  5  8 14 17 27 30 43
 4  9 13 18 26 31 42 44
10 12 19 25 32 41 45 54
11 20 24 33 40 46 53 55
21 23 34 39 47 52 56 61
22 35 38 48 51 57 60 62
36 37 49 50 58 59 63 64
```
Then it applies RLE on zeros + Huffman coding on the remaining values. The DC coefficient stores only the difference from the previous block.
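The zigzag order can be generated by walking the anti-diagonals of the block and alternating direction on each one. A sketch:

```python
# Visit the 8x8 block diagonal by diagonal (u + v = d), reversing
# every other diagonal to produce the zigzag.
def zigzag_order(n=8):
    order = []
    for d in range(2 * n - 1):
        cells = [(u, d - u) for u in range(n) if 0 <= d - u < n]
        order.extend(cells if d % 2 else reversed(cells))
    return order

order = zigzag_order()
print(order[:6])   # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```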
| Stage | Data | Size (bits) |
|---|---|---|
| Original | 64 pixel values Γ 8 bits | 512 |
| After DCT | 64 coefficients (reorganised) | 512 |
| After quantize | 44, −26, 0, −1, then 60 zeros | ~64 values, mostly zeros |
| After zigzag + RLE | (0,44), (0,−26), (1,−1), EOB | 4 symbols |
| After Huffman | Compact bit string | ~30 |
JPEG quality ≈ 5%: Almost every DCT coefficient becomes zero except the DC term (average brightness per block). The image degrades into a mosaic of 8×8 solid-colour tiles.
Rule: Never save text, screenshots, or logos as JPEG. Use PNG for sharp edges, JPEG for photos.
| Fact | Value |
|---|---|
| Bits per pixel (RGB) | 24 |
| 12 MP photo as BMP | 36 MB |
| 12 MP photo as PNG | ~12 MB |
| 12 MP photo as JPEG (Q85) | ~3 MB |
| 12 MP photo as JPEG (Q50) | ~1.5 MB |
| Human cone types | 3 (S, M, L) |
| DCT block size | 8 × 8 = 64 pixels |
| Shannon entropy formula | $H = -\sum p_i \log_2 p_i$ |
| Method | Key Idea | Type | Best ratio | Prerequisites |
|---|---|---|---|---|
| RLE | Count repeated values | Lossless | ~10:1 (graphics) | Counting |
| Huffman | Short codes for common symbols | Lossless | ~4:1 | Probability, trees |
| DCT + Quantize | Separate frequencies, discard invisible | Lossy | ~20:1 | Trig, cosines |
| Full JPEG | All of the above, orchestrated | Lossy | ~50:1 | All above |
| Discipline | Contribution |
|---|---|
| Physics | Light as EM waves; trichromaticity |
| Psychophysics | Weber's Law; contrast sensitivity; chroma insensitivity |
| Information theory | Shannon entropy; limits of lossless compression |
| Mathematics | Fourier/DCT transforms; Huffman's theorem |
| Engineering | 8×8 blocks; zigzag scan; quantization tables |
"JPEG compression is a mathematical theory of what humans don't notice. It's one of the most elegant examples in all of science: understanding perception well enough to exploit it."
1. Build a Huffman code for ABRACADABRA. What are the codes? Average bits/char?
2. Compress AAABBBCCDDDDDDEE with RLE. How many bytes RLE vs. original?
3. The letters of MISSISSIPPI have frequencies: M=1, I=4, S=4, P=2. Compute $H$. How close does Huffman get?

This lecture was generated for a Math Circle session.