Proteins perform nearly every function in living cells:
Catalyze reactions
Transport molecules
Enable movement
Sense signals
Fight infection
On the order of 10,000 different proteins in a human cell
Sequence → Structure → Function
The amino acid sequence determines the 3D structure, which determines the function
ATP Synthase — The World's Smallest Rotary Motor
Converts ADP + Pi → ATP using a proton gradient
Rotates at ~9,000 RPM
Produces ~100 kg of ATP per person per day
Image: David S. Goodsell, RCSB PDB
Viral Fusion Machinery
The SARS-CoV-2 spike protein binds ACE2 receptors and fuses viral and cell membranes
Understanding protein structure → vaccine design
(mRNA vaccines encode this protein)
Image: David S. Goodsell, RCSB PDB
Green Fluorescent Protein (GFP)
A protein that fluoresces: it absorbs UV or blue light and emits green light
Chromophore forms spontaneously from 3 amino acids
Nobel Prize in Chemistry 2008
Revolutionized biology: fuse GFP to any protein to track it in living cells
Amino Acids: The Building Blocks of Proteins
20 amino acids, each with a different R group — same backbone, different chemistry
Primary Structure — The Amino Acid Sequence
20 standard amino acids, each with a unique side chain
Typical protein: 300–500 amino acids
Human genome encodes ~20,000 proteins
Isoforms and post-translational modifications add many more
Amino Acids & the Protein Backbone
All 20 amino acids share the same N–Cα–C backbone
The R group (side chain) determines each amino acid's properties
Amino acids link via peptide bonds
Secondary Structure
Formed by hydrogen bonds between backbone atoms
α-helices: coiled spring shape (~3.6 residues per turn)
β-sheets: extended strands side by side (parallel or antiparallel)
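The 3.6 residues per turn quoted above implies a rotation of 100° per residue. As a rough illustration, the sketch below places Cα atoms on an idealized helix. The radius (2.3 Å) and rise per residue (1.5 Å) are commonly quoted textbook values, not numbers taken from these slides, and the function name is hypothetical.

```python
import math

def ideal_helix_ca(n_residues, radius=2.3, rise=1.5, residues_per_turn=3.6):
    """Return C-alpha coordinates (in Angstroms) of an idealized helix along z.

    radius and rise are assumed textbook values, not taken from the slides.
    """
    step = 2 * math.pi / residues_per_turn  # 100 degrees per residue
    return [
        (radius * math.cos(i * step), radius * math.sin(i * step), i * rise)
        for i in range(n_residues)
    ]

coords = ideal_helix_ca(10)
# Distance between consecutive C-alpha atoms
spacing = math.dist(coords[0], coords[1])
# Pitch: distance travelled along the axis in one full turn
pitch = 3.6 * 1.5
print(f"CA-CA spacing: {spacing:.2f} A, pitch: {pitch:.1f} A")
```

With these assumed parameters the spacing between neighboring Cα atoms comes out close to 3.8 Å, which is a useful sanity check on the geometry.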
Tertiary & Quaternary Structure
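Tertiary: the overall 3D fold of a single polypeptide chain
Quaternary: the arrangement of multiple folded chains into one complex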
The Protein Data Bank
Founded 1971 with 7 structures — free, open access — key to every ML advance in structural biology
Without the PDB, there is no AlphaFold
How Are Protein Structures Determined?
X-ray Crystallography
~85% of PDB structures
Grow protein crystals, shoot X-rays, measure diffraction
Resolution: 1–3 Å
~25 Nobel Prizes
NMR Spectroscopy
~7% of PDB
Measures distances between atoms in solution
Limited to smaller proteins (<40 kDa)
7–8 Nobel Prizes
Cryo-Electron Microscopy
Revolution in the last decade
Flash-freeze proteins, image with electron beam
No crystals needed → larger complexes
1 Nobel Prize
Computational
AlphaFold & MD
2 Nobel Prizes
The PDB File Format
ATOM      1  N   ALA A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  ALA A   1      26.266  25.413   2.842  1.00 10.38           C
ATOM      3  C   ALA A   1      26.913  26.639   3.531  1.00  9.62           C
ATOM      4  O   ALA A   1      27.886  26.463   4.263  1.00  9.62           O
ATOM      5  CB  ALA A   1      25.112  24.880   3.649  1.00 13.77           C
Record  Serial  Name  Residue  Chain  ResSeq  X       Y       Z      Occ.  B-factor  Element
ATOM    1       N     ALA      A      1       27.340  24.430  2.614  1.00  9.67      N
Every atom gets x, y, z coordinates in Ångströms (1 Å = 10⁻¹⁰ m)
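PDB ATOM records are fixed-width, so fields are read by column position rather than by splitting on whitespace. The sketch below slices one record using the column ranges of the PDB format. The function name and dictionary keys are illustrative choices, and real files call for a maintained parser that also handles HETATM, alternate locations, and multi-model files.

```python
def parse_atom_line(line):
    """Slice one fixed-width PDB ATOM record into a dict (0-based slices)."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38, Angstroms
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
        "element": line[76:78].strip(),   # columns 77-78
    }

# First record from the slide, written with its fixed-width spacing.
line = "ATOM      1  N   ALA A   1      27.340  24.430   2.614  1.00  9.67           N"
atom = parse_atom_line(line)
print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```

Slicing by column matters because adjacent fields can run together with no space between them, for example when a coordinate uses its full eight characters.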
Predicting Protein Structure
Can we predict 3D structure from sequence alone?
CASP — Critical Assessment of Protein Structure Prediction
At each layer, each atom aggregates information from its neighbors
Identical to ECFP with large random weights — but now we can optimize
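The neighbor aggregation described above can be sketched as a single layer that sums each atom's features with those of its bonded neighbors, then applies a weight matrix and a smooth nonlinearity. This is a minimal illustration, not the exact method behind the slide. The molecule, the one-hot features, and the weight values are all hypothetical, and in practice the weights would be learned by gradient descent.

```python
import math

def aggregate_layer(features, neighbors, weights):
    """One round of neighbor aggregation: tanh(W . (self + sum of neighbors))."""
    updated = []
    for atom, own in enumerate(features):
        # Pool the atom's own features with those of its bonded neighbors.
        pooled = list(own)
        for nb in neighbors[atom]:
            pooled = [p + q for p, q in zip(pooled, features[nb])]
        # Linear map followed by a smooth nonlinearity.
        updated.append(
            [math.tanh(sum(w * p for w, p in zip(row, pooled))) for row in weights]
        )
    return updated

# Heavy atoms of ethanol, C0 - C1 - O2, with one-hot features [is_C, is_O].
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
weights = [[0.5, -0.5], [0.25, 1.0]]  # hypothetical values; normally learned

hidden = aggregate_layer(features, neighbors, weights)
for atom, vec in enumerate(hidden):
    print(atom, [round(v, 3) for v in vec])
```

Stacking several such layers lets information travel further across the molecular graph, one bond per layer.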
Interpretable Learned Features
Feature predictive of solubility
Pro-solubility features activate on hydrophilic OH groups
Feature predictive of toxicity
Pro-toxicity features activate on aromatic ring systems (known carcinogens)
Unlike ECFP, learned features can be activated by similar but distinct fragments
The Recipe for a Breakthrough
1
A large, high-quality dataset
The Protein Data Bank (50 years of data)
2
A standardized metric
GDT_TS score (Global Distance Test, Total Score)
3
A truly prospective competition
CASP (no data leakage possible)
This combination is rare in biology — and it enabled one of the most important ML breakthroughs in history
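To make the metric concrete: GDT_TS averages the fraction of Cα atoms that fall within 1, 2, 4, and 8 Å of their positions in the experimental structure. The sketch below is simplified. It assumes the per-residue deviations come from a single superposition that has already been computed, whereas the full procedure searches over many superpositions to maximize each fraction. The function name and the deviation values are made up for illustration.

```python
def gdt_ts(distances):
    """GDT_TS (in percent) from per-residue C-alpha deviations in Angstroms.

    Simplification: assumes one fixed superposition rather than a search
    for the best superposition at each cutoff.
    """
    if not distances:
        raise ValueError("need at least one residue")
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    fractions = [
        sum(d <= c for d in distances) / len(distances) for c in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)

# Made-up deviations for a six-residue toy example.
deviations = [0.5, 0.9, 1.8, 3.5, 7.0, 12.0]
score = gdt_ts(deviations)
print(f"GDT_TS = {score:.1f}")
```

Because the score averages over several cutoffs, it rewards models that are roughly right everywhere as well as models that are very accurate in part of the structure.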
AlphaFold 3 — Full Architecture
AlphaFold 2 — Architecture Overview
Why AlphaFold 2 Worked
Multiple Sequence Alignments (MSAs) — evolutionary information from related proteins
Evoformer — attention mechanism that reasons about residue pairs
Structure Module — directly predicts 3D coordinates using equivariant transformations
End-to-end differentiable — trained on PDB structures with FAPE loss
Recycling — iteratively refines its own predictions
Maximum Achievable R² Under Label Noise
Brown, Muchmore & Hajduk, Drug Discovery Today 2009
Two independent measurements of the true value \(T\): \(X_1 = T + \varepsilon_1,\; X_2 = T + \varepsilon_2,\; \varepsilon_1, \varepsilon_2 \sim \mathcal{N}(0, \sigma^2_{\text{exp}})\)
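One way to see the ceiling this setup implies, under two added assumptions: \(T\) varies across compounds with variance \(\sigma^2_T\), and the noise is independent of \(T\). Then \(\mathrm{Cov}(X_1, X_2) = \sigma^2_T\) and \(\mathrm{Var}(X_i) = \sigma^2_T + \sigma^2_{\text{exp}}\), so the correlation between replicate measurements is \(\sigma^2_T / (\sigma^2_T + \sigma^2_{\text{exp}})\). A perfect model that outputs \(T\) exactly reaches a squared correlation of that same ratio against a single noisy measurement. The sketch below evaluates both quantities. It uses R² to mean squared Pearson correlation, and it is not necessarily the exact formulation used in the cited paper.

```python
def r2_ceilings(var_true, sigma_exp):
    """Squared-correlation ceilings under additive, independent label noise.

    var_true  : variance of the true values T across the dataset (assumed known)
    sigma_exp : standard deviation of the experimental noise
    """
    noise_var = sigma_exp ** 2
    ratio = var_true / (var_true + noise_var)
    return {
        # Perfect model (predicts T exactly) scored against one noisy measurement
        "perfect_model_vs_measurement": ratio,
        # One noisy measurement scored against an independent replicate
        "measurement_vs_measurement": ratio ** 2,
    }

ceilings = r2_ceilings(var_true=1.0, sigma_exp=0.5)
print(ceilings)
```

With these illustrative numbers, a noise standard deviation half as large as the spread of true values already caps a perfect model at a squared correlation of about 0.8.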