Neural Network Playground | Hannes Stählin

Train & Test

Weights start random. Hit Train, then draw a digit to test it.

Controls: LR (0.030) · Speed · Model · Quantize · H1 · H2

Epoch: 0/30 · Loss: - · Accuracy: -

Panels: Draw here · Network Input · Training Progress (Loss, Test Accuracy) · First Layer Weights (Positive, Negative) · Output Probabilities

Network: 784 → 128 → 64 → 10. Click Train to start learning.

How It Works

A multilayer perceptron (MLP) trained on the MNIST dataset. Training and inference run entirely in the browser in plain JavaScript: forward pass, backpropagation, and SGD, no framework.

Training adjusts each weight to reduce the prediction error, with gradients flowing backward through the network.

Your drawing is downsampled to 28×28, flattened to 784 values, and run through three layers. The first-layer visualization shows what each neuron learns to detect.

The Architecture

Input Layer 784 neurons

Each pixel of the 28×28 image is one input value, normalized to the training distribution.

Hidden Layer 1 128 neurons · ReLU

A weighted sum of all 784 inputs plus bias, passed through ReLU. Detects simple features like edges and curves.
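A minimal sketch of one such layer, assuming weights are stored as one row per output neuron. The name `denseRelu` and the storage layout are illustrative assumptions, not the demo's actual code.

```javascript
// Dense layer followed by ReLU: a = max(0, W · x + b).
// Assumed layout: W[i] holds the weights of output neuron i.
function denseRelu(W, b, x) {
  const out = new Float32Array(b.length);
  for (let i = 0; i < b.length; i++) {
    let z = b[i]; // start from the bias
    const row = W[i];
    for (let j = 0; j < x.length; j++) z += row[j] * x[j];
    out[i] = Math.max(0, z); // ReLU clips negative sums to zero
  }
  return out;
}

// Tiny example: 2 neurons, 3 inputs.
const demoW = [[1, 0, -1], [0.5, 0.5, 0.5]];
const demoB = [0, -2];
const demoOut = denseRelu(demoW, demoB, [1, 2, 3]);
console.log(Array.from(demoOut)); // [0, 1]: first sum is -2 (clipped), second is 1
```

For the first hidden layer the same function would take 128 rows of 784 weights each.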

Hidden Layer 2 64 neurons · ReLU

Combines those features into more abstract patterns: loops, intersections, and strokes that distinguish digits.

Output Layer 10 neurons · Softmax

One neuron per digit. Softmax turns the scores into probabilities; the highest one is the prediction.
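The softmax-then-argmax step can be sketched as follows. Subtracting the maximum score before exponentiating is a standard numerical-stability measure; the function names are illustrative assumptions.

```javascript
// Turn raw output scores into probabilities that sum to 1.
function softmax(scores) {
  const max = Math.max(...scores); // subtract the max to avoid overflow in exp
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((acc, v) => acc + v, 0);
  return exps.map((e) => e / sum);
}

// Pick the digit whose probability is highest.
function predict(scores) {
  const probs = softmax(scores);
  let best = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i;
  }
  return { digit: best, probs };
}

console.log(predict([1, 3, 2]).digit); // 1
```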

109,386

Trainable parameters

~97%

Test accuracy on MNIST

<1ms

Inference time in browser
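The parameter count quoted above follows directly from the layer sizes: each dense layer contributes inputs × outputs weights plus one bias per output. A short check, with `countParameters` as a hypothetical helper name:

```javascript
// Count weights and biases of a fully connected network.
function countParameters(layerSizes) {
  let total = 0;
  for (let i = 1; i < layerSizes.length; i++) {
    const weights = layerSizes[i - 1] * layerSizes[i];
    const biases = layerSizes[i];
    total += weights + biases;
  }
  return total;
}

// 784*128 + 128 = 100480, 128*64 + 64 = 8256, 64*10 + 10 = 650
console.log(countParameters([784, 128, 64, 10])); // 109386
```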

Under the Hood

All in plain JavaScript: forward pass, backward pass, SGD updates. No TensorFlow, ONNX, or WebAssembly.

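// Forward pass (each hidden layer):
z = W · x + b // matrix-vector multiply
a = max(0, z) // ReLU activation (the output layer applies softmax instead)

// Backward pass (gradients at the output layer):
dz = p − y // softmax + cross-entropy
dW = dz ⊗ a_prev // outer product with the previous layer's activations → weight gradients
db = dz // bias gradients

// SGD update:
W −= lr · dW // gradient descent step
b −= lr · db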

Training uses 2,000 MNIST images (200 per class) with mini-batch SGD. "Load Pre-trained" swaps in weights trained on the full 60,000-image set in PyTorch.
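To make the update rule concrete, here is a sketch of one training step on a single sample. It is an illustration under stated assumptions, not the demo's code: `trainStep` and the `layers` structure are hypothetical, and a mini-batch version would average the gradients over the batch before updating.

```javascript
function softmax(z) {
  const max = Math.max(...z);
  const exps = z.map((v) => Math.exp(v - max));
  const sum = exps.reduce((acc, v) => acc + v, 0);
  return exps.map((e) => e / sum);
}

// layers: [{ W: number[][], b: number[] }, ...]; label: index of the true class.
// Returns the cross-entropy loss measured before the update.
function trainStep(layers, x, label, lr) {
  // Forward pass, keeping every activation for the backward pass.
  const acts = [x];
  for (let l = 0; l < layers.length; l++) {
    const { W, b } = layers[l];
    const prev = acts[l];
    const z = b.map((bi, i) => bi + W[i].reduce((s, w, j) => s + w * prev[j], 0));
    const isOutput = l === layers.length - 1;
    acts.push(isOutput ? softmax(z) : z.map((v) => Math.max(0, v)));
  }
  const p = acts[acts.length - 1];
  const loss = -Math.log(p[label] + 1e-12);

  // Backward pass: dz = p - y for softmax + cross-entropy.
  let dz = p.map((pi, i) => pi - (i === label ? 1 : 0));
  for (let l = layers.length - 1; l >= 0; l--) {
    const { W, b } = layers[l];
    const prev = acts[l];
    // Gradient flowing to the previous layer, computed before W is changed.
    const dPrev = prev.map((_, j) => dz.reduce((s, d, i) => s + W[i][j] * d, 0));
    for (let i = 0; i < W.length; i++) {
      for (let j = 0; j < prev.length; j++) W[i][j] -= lr * dz[i] * prev[j]; // dW = dz ⊗ a_prev
      b[i] -= lr * dz[i]; // db = dz
    }
    // ReLU derivative: the gradient passes only where the activation was positive.
    if (l > 0) dz = dPrev.map((d, j) => (prev[j] > 0 ? d : 0));
  }
  return loss;
}

// Toy 2 → 3 → 2 network with fixed weights, trained repeatedly on one sample.
const net = [
  { W: [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]], b: [0, 0, 0] },
  { W: [[0.2, -0.1, 0.3], [-0.3, 0.2, 0.1]], b: [0, 0] },
];
const firstLoss = trainStep(net, [1, 0.5], 1, 0.1);
let lastLoss = firstLoss;
for (let i = 0; i < 50; i++) lastLoss = trainStep(net, [1, 0.5], 1, 0.1);
console.log(firstLoss > lastLoss); // true: the loss on this sample goes down
```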

Quantization

Weights are usually 32-bit floats. Quantization stores each with fewer bits, saving memory and compute, which matters on phones and microcontrollers.

Toggle Quantize and drag the bit slider. At 8 bits, accuracy holds; at 4 bits, the model is 8× smaller than in 32-bit floats but starts to degrade; at 2 bits, each weight takes one of at most 4 values and predictions fall apart.

// Symmetric quantization:

scale = max(|w|) / (2^(bits−1) − 1)

q = round(w / scale) // integer representation

ŵ = q × scale // dequantized (lossy)
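The three lines above translate almost directly into JavaScript. This is a sketch assuming one scale per weight array; the function name and the all-zero guard are assumptions added for illustration.

```javascript
// Symmetric quantization: map weights to integers in [-qmax, qmax] and back.
function quantizeSymmetric(weights, bits) {
  if (bits < 2) throw new RangeError("bits must be at least 2");
  const qmax = 2 ** (bits - 1) - 1; // 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
  let maxAbs = 0;
  for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
  const scale = maxAbs > 0 ? maxAbs / qmax : 1; // guard for all-zero input (assumption)
  const q = Array.from(weights, (w) => Math.round(w / scale)); // integer representation
  const dequantized = q.map((v) => v * scale); // lossy reconstruction
  return { q, scale, dequantized };
}

const quantDemo = quantizeSymmetric([0.6, -1.0, 0.2, 0], 4);
console.log(quantDemo.q); // [4, -7, 1, 0]
```

With this formula, 2-bit quantization gives qmax = 1, so every weight collapses to −scale, 0, or +scale.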

Softmax resists quantization because of its exponentials, so most "integer-only" models fall back to floating-point for it. This demo tries a few ways around that: I-BERT (polynomial approximation), Softermax (base-2), and the shift trick.

For the Vision Transformer and GPT, attention scores are quantized before softmax. The Shift toggle uses softmax's shift invariance, softmax(z + c) = softmax(z): subtracting the mean centers values around zero for a tighter fit. Toggle it off to see the difference, clearest in the GPT's text at low bit-widths.
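A small sketch of why the shift helps, assuming symmetric quantization with one scale per score vector (the demo's exact scheme may differ, and all names here are illustrative). Centering the scores shrinks the largest magnitude, so the same number of integer levels covers a narrower range.

```javascript
function softmax(z) {
  const max = Math.max(...z);
  const exps = z.map((v) => Math.exp(v - max));
  const sum = exps.reduce((acc, v) => acc + v, 0);
  return exps.map((e) => e / sum);
}

// Quantize to `bits` and immediately dequantize (symmetric, one scale per vector).
function fakeQuantize(values, bits) {
  const qmax = 2 ** (bits - 1) - 1;
  const maxAbs = Math.max(...values.map((v) => Math.abs(v)));
  const scale = maxAbs > 0 ? maxAbs / qmax : 1;
  return values.map((v) => Math.round(v / scale) * scale);
}

// Optionally subtract the mean before quantizing; softmax output is unchanged by the shift.
function quantizedSoftmax(scores, bits, shift) {
  const mean = scores.reduce((s, v) => s + v, 0) / scores.length;
  const input = shift ? scores.map((v) => v - mean) : scores;
  return softmax(fakeQuantize(input, bits));
}

function l1Distance(a, b) {
  return a.reduce((s, v, i) => s + Math.abs(v - b[i]), 0);
}

const scores = [10, 11, 12, 13];
const exact = softmax(scores);
const errNoShift = l1Distance(exact, quantizedSoftmax(scores, 4, false));
const errShift = l1Distance(exact, quantizedSoftmax(scores, 4, true));
console.log(errShift < errNoShift); // true for this example
```

Without the shift the scale is 13/7 per step; with it, 1.5/7, which is why the shifted result tracks the exact softmax more closely here.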

ETH Zürich · Integrated Systems Laboratory · Semester Project

Deep Dive: Softmax Quantization

A semester project at ETH Zürich under Prof. Dr. Luca Benini on optimizing softmax in MobileBERT for integer-only inference while maintaining accuracy down to 4-bit precision.

1. The Exponential Bottleneck

Exponentials are expensive in fixed-point hardware. We approximate them using second-order polynomials.
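A floating-point sketch of the idea, in the style of I-BERT's exponential: split a non-positive input into a whole number of halvings plus a small remainder, and fit the remainder with a second-order polynomial. The coefficients below are the ones commonly quoted for I-BERT; treat them, and the function name, as assumptions to verify. A real integer-only version would use fixed-point arithmetic and a bit shift instead of the division.

```javascript
// Assumed coefficients: exp(p) ≈ A * (p + B)^2 + C for p in (-ln 2, 0].
const A = 0.3585;
const B = 1.353;
const C = 0.344;

// Approximate exp(x) for x <= 0 (softmax inputs after subtracting the maximum).
function polyExp(x) {
  const halvings = Math.floor(-x / Math.LN2); // x = p - halvings * ln 2
  const p = x + halvings * Math.LN2; // remainder in (-ln 2, 0]
  const poly = A * (p + B) ** 2 + C; // second-order polynomial
  return poly / 2 ** halvings; // a right shift in integer hardware
}

for (const x of [0, -0.5, -1, -3]) {
  console.log(x, polyExp(x), Math.exp(x)); // the two columns should agree closely
}
```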

2. The Staircase Effect

Reducing bits turns the continuous signal into a staircase. Drag the slider to see the L1 error increase as precision drops.

8 bit: 256 levels · L1 error: 0.0000
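The staircase effect can be reproduced numerically. This sketch quantizes one period of a sine wave to 2^bits uniform levels and reports the mean absolute (L1) error; the test signal and the helper name are assumptions, not necessarily what the slider uses.

```javascript
// Mean absolute error between a sine wave and its uniformly quantized version.
function staircaseError(bits, samples = 1000) {
  const levels = 2 ** bits;
  const step = 2 / (levels - 1); // spacing of levels across [-1, 1]
  let total = 0;
  for (let i = 0; i < samples; i++) {
    const t = i / (samples - 1);
    const signal = Math.sin(2 * Math.PI * t);
    const quantized = Math.round((signal + 1) / step) * step - 1; // snap to nearest level
    total += Math.abs(signal - quantized);
  }
  return total / samples;
}

for (const bits of [2, 4, 8]) {
  console.log(bits, "bit:", 2 ** bits, "levels, L1 error", staircaseError(bits).toFixed(4));
}
```

Each extra bit roughly halves the step size, so the error shrinks quickly as precision rises.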

3. Visualizing Attention Degradation

Low-bit quantization distorts transformer attention patterns. The "Shift trick" centers logits around zero, allowing for tighter clipping bounds and higher accuracy at 4-bit and 5-bit precision.

Controls: Precision (8 bit) · Shift Trick

Panels: Full Precision (FP32) · Quantized

Quantization Methods

I-BERT

Approximates exponentials using integer-only polynomials and power-of-2 shifts.

Softermax

Uses base-2 exponentials and online normalization to avoid multiple passes over the data.

ITAmax

Achieves optimal scaling factors by focusing on narrow input ranges.
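As an illustration of the online normalization mentioned for Softermax, the sketch below computes a base-2 softmax while tracking the running maximum and sum in a single pass. It is a floating-point simplification under assumed names; the published method additionally uses low-precision integer arithmetic.

```javascript
// Base-2 softmax with online normalization: one pass finds both max and sum.
function onlineSoftmaxBase2(scores) {
  let runningMax = -Infinity;
  let runningSum = 0;
  for (const s of scores) {
    const newMax = Math.max(runningMax, s);
    // Rescale the sum accumulated so far to the new maximum, then add this term.
    runningSum = runningSum * 2 ** (runningMax - newMax) + 2 ** (s - newMax);
    runningMax = newMax;
  }
  // Final normalization of each element.
  return scores.map((s) => 2 ** (s - runningMax) / runningSum);
}

console.log(onlineSoftmaxBase2([1, 3, 2])); // probabilities summing to 1
```

Base-2 softmax equals the usual base-e softmax applied to scores multiplied by ln 2, so a model can absorb the difference into its scaling.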

Results (SST-2 Accuracy): Original I-BERT vs. Optimized (Shift)