Train & Test
The network starts with random weights. Click Train and watch it learn to recognize handwritten digits from scratch. Draw a digit to test its progress.
[Interactive demo panels: Draw here · Network Input · Training Progress · First Layer Weights · Output Probabilities]
How It Works
This is a multilayer perceptron (MLP) that learns to recognize handwritten digits from the MNIST dataset. Both training and inference run entirely in your browser: no server and no framework, just plain JavaScript implementing forward propagation, backpropagation, and stochastic gradient descent.
The network starts with random weights and learns by repeatedly seeing training examples, computing how wrong its predictions are (the loss), and adjusting each weight to reduce that error. Backpropagation supplies the adjustments: gradients flow backward through the network, telling each weight which direction to move, and gradient descent applies the change.
Your drawing is downsampled to 28×28 pixels, flattened to 784 values, and fed through three layers (two hidden, one output). The first-layer weight visualization shows what patterns each neuron has learned to detect - watch them evolve from random noise into edge and stroke detectors.
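The downsample-and-flatten step can be sketched as follows. This is a minimal illustration, not the page's actual preprocessing: the helper name `downsampleTo28`, the average-pooling strategy, and the requirement that the canvas side be a multiple of 28 are all assumptions.

```javascript
// Hypothetical helper: average-pool a square grayscale image down to 28x28.
// `pixels` is a row-major array of length size*size with values in [0, 1].
// Assumption: `size` is a multiple of 28 (e.g. a 280x280 drawing canvas).
function downsampleTo28(pixels, size) {
  const OUT = 28;
  const block = size / OUT;
  if (!Number.isInteger(block)) {
    throw new Error("size must be a multiple of 28");
  }
  // The output is stored flat, so it is already the 784-value input vector.
  const out = new Float32Array(OUT * OUT);
  for (let oy = 0; oy < OUT; oy++) {
    for (let ox = 0; ox < OUT; ox++) {
      let sum = 0;
      for (let dy = 0; dy < block; dy++) {
        for (let dx = 0; dx < block; dx++) {
          sum += pixels[(oy * block + dy) * size + (ox * block + dx)];
        }
      }
      out[oy * OUT + ox] = sum / (block * block);
    }
  }
  return out;
}
```

A real pipeline would probably also crop and center the digit the way MNIST images are centered, which this sketch leaves out.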
The Architecture
Input Layer · 784 neurons
Each pixel of the 28×28 image becomes one input value, normalized to match the training data distribution.
Hidden Layer 1 · 128 neurons · ReLU
Each neuron computes a weighted sum of all 784 inputs plus a bias, then applies ReLU (zeroing out negatives). This layer learns to detect simple features like edges and curves.
Hidden Layer 2 · 64 neurons · ReLU
Combines features from the first hidden layer into more abstract representations: loops, intersections, and stroke patterns that distinguish digit classes.
Output Layer · 10 neurons · Softmax
One neuron per digit class. Softmax converts the raw scores into a probability distribution that sums to 1. The highest probability is the network's prediction.
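The four layers above can be sketched as a forward pass in plain JavaScript. The function names, the `net` object layout, and the row-major weight storage are assumptions for illustration; the page's own code may be organized differently.

```javascript
// Dense layer: z = W · x + b, with W stored row-major as (outSize x inSize).
function dense(W, b, x, outSize, inSize) {
  const z = new Float32Array(outSize);
  for (let i = 0; i < outSize; i++) {
    let sum = b[i];
    for (let j = 0; j < inSize; j++) {
      sum += W[i * inSize + j] * x[j];
    }
    z[i] = sum;
  }
  return z;
}

// ReLU: zero out negatives.
function relu(z) {
  return z.map((v) => Math.max(0, v));
}

// Softmax, subtracting the max first for numerical stability.
function softmax(z) {
  let max = -Infinity;
  for (const v of z) max = Math.max(max, v);
  const e = Array.from(z, (v) => Math.exp(v - max));
  const total = e.reduce((acc, v) => acc + v, 0);
  return e.map((v) => v / total);
}

// Forward pass through 784 -> 128 -> 64 -> 10.
// `net` is a hypothetical container: { W1, b1, W2, b2, W3, b3 }.
function forward(net, x) {
  const h1 = relu(dense(net.W1, net.b1, x, 128, 784));
  const h2 = relu(dense(net.W2, net.b2, h1, 64, 128));
  return softmax(dense(net.W3, net.b3, h2, 10, 64));
}
```

The prediction is the index of the largest entry in the array returned by `forward`.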
Trainable parameters: 109,386
Test accuracy on MNIST: ~97%
Inference time in browser: <1 ms
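The parameter count follows directly from the layer sizes: each fully connected layer has one weight per input-output pair plus one bias per output. A small check, with `countParams` as a hypothetical helper name:

```javascript
// Count weights and biases of a fully connected network from its layer sizes.
function countParams(layerSizes) {
  let total = 0;
  for (let i = 1; i < layerSizes.length; i++) {
    const inputs = layerSizes[i - 1];
    const outputs = layerSizes[i];
    total += inputs * outputs + outputs; // weights + biases
  }
  return total;
}

// 784*128 + 128 = 100,480
// 128*64  + 64  =   8,256
// 64*10   + 10  =     650
console.log(countParams([784, 128, 64, 10])); // 109386
```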
Under the Hood
Everything is implemented in pure JavaScript - forward pass, backward pass, and SGD weight updates. No TensorFlow, no ONNX, no WebAssembly.
// Forward pass (each layer):
z = W · x + b // matrix-vector multiply
a = max(0, z) // ReLU activation
// Backward pass (gradients):
dz = p − y // softmax + cross-entropy
dW = dz ⊗ aᵀ // outer product → weight gradients
// SGD update:
W −= lr · dW // gradient descent step
Training uses 2,000 MNIST images (200 per digit class) with mini-batch SGD. For comparison, the "Load Pre-trained" option loads weights that were trained with PyTorch on the full 60,000-image MNIST training set.
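The update rules above can be made concrete for the output layer. The sketch below performs one SGD step on a single softmax layer for a single example (batch size 1, a simplification of the mini-batch training described here); `sgdStep` and its argument layout are assumptions, not the page's actual code.

```javascript
// Softmax, subtracting the max first for numerical stability.
function softmax(z) {
  const max = Math.max(...z);
  const e = z.map((v) => Math.exp(v - max));
  const total = e.reduce((acc, v) => acc + v, 0);
  return e.map((v) => v / total);
}

// One SGD step for a softmax layer with cross-entropy loss.
// W: row-major (numClasses x inSize), b: biases, a: input activations,
// label: integer class index, lr: learning rate.
// Updates W and b in place and returns the loss before the update.
function sgdStep(W, b, a, label, lr) {
  const numClasses = b.length;
  const inSize = a.length;

  // Forward: z = W · a + b, p = softmax(z)
  const z = new Array(numClasses);
  for (let i = 0; i < numClasses; i++) {
    let sum = b[i];
    for (let j = 0; j < inSize; j++) sum += W[i * inSize + j] * a[j];
    z[i] = sum;
  }
  const p = softmax(z);
  const loss = -Math.log(p[label]);

  // Backward: dz = p − y (y is the one-hot label), dW = dz ⊗ aᵀ, db = dz
  for (let i = 0; i < numClasses; i++) {
    const dz = p[i] - (i === label ? 1 : 0);
    for (let j = 0; j < inSize; j++) {
      W[i * inSize + j] -= lr * dz * a[j]; // W −= lr · dW
    }
    b[i] -= lr * dz;
  }
  return loss;
}
```

Calling `sgdStep` repeatedly on the same example should drive the returned loss down. For the hidden layers, the gradient would additionally be propagated back through the weights and the ReLU, which this sketch omits.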
Quantization
Neural networks typically store weights as 32-bit floating-point numbers. Quantization reduces this precision, representing each weight with fewer bits. This saves memory and compute, which matters when deploying models on phones and microcontrollers.
Toggle Quantize above and drag the bit slider to see the effect in real time. At 8 bits, accuracy barely changes. At 4 bits the weights take 8× less memory than 32-bit floats, but predictions start to degrade. At 2 bits, each weight can take at most 4 distinct values - watch the weight visualization collapse into distinct bands and predictions fall apart.
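The memory savings are simple arithmetic on the 109,386 parameters of this network. The sketch below assumes every parameter is stored at the same bit-width and ignores the few bytes needed for scale factors; `weightBytes` and `compressionVsFp32` are hypothetical helper names.

```javascript
const NUM_PARAMS = 109386; // trainable parameters of the 784-128-64-10 MLP

// Bytes needed to store `numParams` values at `bits` bits each.
function weightBytes(numParams, bits) {
  return Math.ceil((numParams * bits) / 8);
}

// Compression factor relative to 32-bit floats.
function compressionVsFp32(bits) {
  return 32 / bits;
}

console.log(weightBytes(NUM_PARAMS, 32)); // 437544 bytes at FP32
console.log(weightBytes(NUM_PARAMS, 8));  // 109386 bytes, 4x smaller
console.log(weightBytes(NUM_PARAMS, 4));  // 54693 bytes, 8x smaller
console.log(weightBytes(NUM_PARAMS, 2));  // 27347 bytes, 16x smaller
```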
// Symmetric quantization:
scale = max(|w|) / (2^(bits−1) − 1)
q = round(w / scale) // integer representation
ŵ = q × scale // dequantized (lossy)
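The three lines above translate almost directly into JavaScript. This is a minimal per-tensor sketch; the function name `quantizeSymmetric`, the guard for `bits < 2`, and the handling of an all-zero tensor are assumptions added to make it runnable.

```javascript
// Symmetric per-tensor quantization of an array of weights.
// Returns the integer codes, the scale, and the dequantized (lossy) weights.
function quantizeSymmetric(weights, bits) {
  if (bits < 2) throw new Error("need at least 2 bits for a signed symmetric range");
  const qmax = 2 ** (bits - 1) - 1; // e.g. 127 for 8 bits, 1 for 2 bits

  let maxAbs = 0;
  for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));

  // Avoid dividing by zero when every weight is 0.
  const scale = maxAbs === 0 ? 1 : maxAbs / qmax;

  // Note: Math.round rounds exact halves toward +Infinity.
  const q = Array.from(weights, (w) => Math.round(w / scale));
  const dequantized = q.map((v) => v * scale);
  return { q, scale, dequantized };
}
```

Because the scale is derived from the largest magnitude, every code lands in [−qmax, qmax] without extra clamping, and the rounding error per weight is at most half a step (`scale / 2`).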
While simple layers quantize well, the softmax function resists compression because of its exponential. Most "integer-only" models cheat by switching back to floating point for softmax. This playground explores several methods to avoid that bottleneck, including I-BERT (polynomial approximation), Softermax (base-2 simplification), and the shift-invariance trick.
For the Vision Transformer and GPT, attention scores are quantized before softmax. The Shift toggle exploits softmax's shift invariance: softmax(z + c) = softmax(z). Subtracting the mean centers values around zero, yielding a much tighter fit for symmetric quantization. Toggle it off to see the difference - especially visible in the GPT's generated text at low bit-widths.
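Both halves of the shift trick can be checked numerically: subtracting the mean leaves the softmax output unchanged, and it shrinks max(|z|), which sets the step size of symmetric quantization. The example scores and helper names below are made up for illustration.

```javascript
// Softmax, subtracting the max first for numerical stability.
function softmax(z) {
  const max = Math.max(...z);
  const e = z.map((v) => Math.exp(v - max));
  const total = e.reduce((acc, v) => acc + v, 0);
  return e.map((v) => v / total);
}

// Shift the scores so their mean is zero.
function subtractMean(z) {
  const mean = z.reduce((acc, v) => acc + v, 0) / z.length;
  return z.map((v) => v - mean);
}

// Largest magnitude, which determines the symmetric quantization scale.
function maxAbs(z) {
  return z.reduce((acc, v) => Math.max(acc, Math.abs(v)), 0);
}

const scores = [10, 11, 12, 14];       // hypothetical attention scores
const shifted = subtractMean(scores);  // [-1.75, -0.75, 0.25, 2.25]

console.log(maxAbs(scores));  // 14
console.log(maxAbs(shifted)); // 2.25
console.log(softmax(scores), softmax(shifted)); // same distribution, up to rounding
```

With the same number of bits, the quantization step for the shifted scores is smaller by the ratio of the two maxima, here roughly a factor of six.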
ETH Zürich · Integrated Systems Laboratory · Semester Project
Deep Dive: Softmax Quantization
A semester project at ETH Zürich under Prof. Dr. Luca Benini. We optimized the softmax operation for MobileBERT to enable integer-only inference, maintaining accuracy down to 4-bit precision.
1. The Exponential Bottleneck
Exponentials are expensive in fixed-point hardware. We approximate them using second-order polynomials.
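As an illustration of the idea, the sketch below approximates exp(x) for x ≤ 0 (the range softmax sees after subtracting the maximum) in the style of I-BERT: split x into a multiple of ln 2 plus a small remainder p, approximate exp(p) with a second-order polynomial, and apply the power of two as a shift. The coefficients are the ones commonly quoted from the I-BERT paper and should be treated as an assumption here; the code uses floating point for readability, whereas an integer-only version would fold the scale factors into integer arithmetic.

```javascript
const LN2 = Math.LN2;

// Approximate exp(x) for x <= 0.
// Write x = -z*ln2 + p with integer z >= 0 and p in (-ln2, 0], so that
// exp(x) = exp(p) * 2^-z. In hardware, multiplying by 2^-z is a right shift.
function expPoly(x) {
  if (x > 0) throw new Error("expects x <= 0 (scores after subtracting the max)");
  const z = Math.floor(-x / LN2);
  const p = x + z * LN2;
  // Second-order polynomial fit of exp(p) on (-ln2, 0] (assumed coefficients).
  const poly = 0.3585 * (p + 1.353) ** 2 + 0.344;
  return poly * 2 ** -z;
}
```

On the interval [−10, 0] the absolute error of this approximation stays within a few thousandths of `Math.exp`.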
2. The Staircase Effect
Reducing bits turns the continuous signal into a staircase. Drag the slider to see the L1 error increase as precision drops.
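The staircase effect can be reproduced offline by quantizing a sampled sine wave at several bit-widths and measuring the mean L1 error. The choice of a unit-amplitude sine as the test signal and the helper names are assumptions; the interactive demo may use a different signal.

```javascript
// Snap a value to the nearest multiple of `scale`.
function quantizeValue(v, scale) {
  return Math.round(v / scale) * scale;
}

// Mean absolute (L1) error when a unit-amplitude sine wave is quantized
// symmetrically with the given number of bits.
function meanL1Error(bits, samples = 1000) {
  const qmax = 2 ** (bits - 1) - 1;
  const scale = 1 / qmax; // max(|signal|) is 1
  let total = 0;
  for (let i = 0; i < samples; i++) {
    const x = Math.sin((2 * Math.PI * i) / samples);
    total += Math.abs(x - quantizeValue(x, scale));
  }
  return total / samples;
}

for (const bits of [8, 6, 4, 3, 2]) {
  console.log(bits, meanL1Error(bits).toFixed(4));
}
```

The printed error grows as the bit-width shrinks, because the step between adjacent levels roughly doubles with every bit removed.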
3. Visualizing Attention Degradation
Low-bit quantization distorts transformer attention patterns. The "Shift trick" centers logits around zero, allowing for tighter clipping bounds and higher accuracy at 4-bit and 5-bit precision.
Full Precision (FP32)
Quantized
Quantization Methods
I-BERT
Approximates exponentials using integer-only polynomials and power-of-2 shifts.
Softermax
Replaces the natural exponential with base 2 and uses online normalization to avoid a separate pass over the data for the maximum.
ITAmax
Achieves optimal scaling factors by focusing on narrow input ranges.
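To illustrate the online normalization that Softermax relies on, the sketch below computes a base-2 softmax while tracking the running maximum and the running sum in the same loop, then compares it with a conventional version that finds the maximum in a separate pass. It is a floating-point illustration of the idea only, not the Softermax hardware algorithm, and the function names are hypothetical.

```javascript
// Reference: base-2 softmax with a separate pass to find the maximum.
function softmaxBase2(x) {
  const m = Math.max(...x);
  const e = x.map((v) => 2 ** (v - m));
  const d = e.reduce((acc, v) => acc + v, 0);
  return e.map((v) => v / d);
}

// Online normalization: the running max `m` and running sum `d` are updated
// together. When the max grows, the sum so far is rescaled by 2^(mOld - mNew).
function onlineSoftmaxBase2(x) {
  let m = -Infinity;
  let d = 0;
  for (const v of x) {
    const mNew = Math.max(m, v);
    d = d * 2 ** (m - mNew) + 2 ** (v - mNew);
    m = mNew;
  }
  return x.map((v) => 2 ** (v - m) / d);
}

const logits = [1.5, -0.5, 3.0, 2.0]; // made-up example scores
console.log(softmaxBase2(logits));
console.log(onlineSoftmaxBase2(logits)); // same values, up to rounding
```

A final pass is still needed to divide by the sum, but the maximum no longer requires its own sweep over the data. Note that a base-2 softmax matches the usual base-e softmax only if the inputs are pre-scaled by 1/ln 2.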