Interactive Demo

Neural Network Playground

Train neural networks from scratch in your browser. Draw digits, generate text, and explore quantization.

JavaScript Canvas MNIST Neural Network Quantization GPT ETH Research: Softmax Quantization

Train & Test

The network starts with random weights. Click Train and watch it learn to recognize handwritten digits from scratch. Draw a digit to test its progress.

0.030
Epoch: 0/30 Loss: - Accuracy: -

Draw here

Network Input

Training Progress

Loss Test Accuracy

First Layer Weights

Positive Negative

Output Probabilities

Click Train to start learning 784 → 128 → 64 → 10

How It Works

This is a multilayer perceptron (MLP) that learns to recognize handwritten digits from the MNIST dataset. Both training and inference run entirely in your browser - no server, no framework, just pure JavaScript implementing forward propagation, backpropagation, and stochastic gradient descent.

The network starts with random weights and learns by repeatedly seeing training examples, computing how wrong its predictions are (the loss), and adjusting each weight to reduce that error. This is backpropagation - gradients flow backward through the network, telling each neuron how to change.

Your drawing is downsampled to 28x28 pixels, flattened to 784 values, and fed through three layers. The first-layer weight visualization shows what patterns each neuron has learned to detect - watch them evolve from random noise into edge and stroke detectors.

The Architecture

Input Layer 784 neurons

Each pixel of the 28×28 image becomes one input value, normalized to match the training data distribution.

Hidden Layer 1 128 neurons · ReLU

Each neuron computes a weighted sum of all 784 inputs plus a bias, then applies ReLU (zeroing out negatives). This layer learns to detect simple features like edges and curves.

Hidden Layer 2 64 neurons · ReLU

Combines features from the first hidden layer into more abstract representations: loops, intersections, and stroke patterns that distinguish digit classes.

Output Layer 10 neurons · Softmax

One neuron per digit class. Softmax converts the raw scores into a probability distribution that sums to 1. The highest probability is the network's prediction.

109,386

Trainable parameters

~97%

Test accuracy on MNIST

<1ms

Inference time in browser

Under the Hood

Everything is implemented in pure JavaScript - forward pass, backward pass, and SGD weight updates. No TensorFlow, no ONNX, no WebAssembly.

// Forward pass (each layer):

z = W · x + b // matrix-vector multiply

a = max(0, z) // ReLU activation

// Backward pass (gradients):

dz = py // softmax + cross-entropy

dW = dzaT // outer product → weight gradients

// SGD update:

W −= lr · dW // gradient descent step

Training uses 2,000 MNIST images (200 per digit class) with mini-batch SGD. The "Load Pre-trained" option loads weights trained on the full 60,000 image dataset using PyTorch for comparison.

Quantization

Neural networks typically store weights as 32-bit floating-point numbers. Quantization reduces this precision, representing each weight with fewer bits. This saves memory and compute, which matters when deploying models on phones and microcontrollers.

Toggle Quantize above and drag the bit slider to see the effect in real time. At 8 bits, accuracy barely changes. At 4 bits the model uses 8× less memory but predictions start to degrade. At 2 bits, each weight can only be one of 4 values - watch the weight visualization collapse into distinct bands and predictions fall apart.

// Symmetric quantization:

scale = max(|w|) / (2bits−1 − 1)

q = round(w / scale) // integer representation

= q × scale // dequantized (lossy)

While simple layers quantize well, the softmax function resists compression due to its exponential nature. Most "integer-only" models cheat by switching back to floating-point for softmax. This playground explores several methods to avoid that bottleneck, including I-BERT (polynomial approximation), Softermax (base-2 simplification), and the Shift invariant trick.

For the Vision Transformer and GPT, attention scores are quantized before softmax. The Shift toggle exploits softmax's shift invariance: softmax(z + c) = softmax(z). Subtracting the mean centers values around zero, yielding a much tighter fit for symmetric quantization. Toggle it off to see the difference - especially visible in the GPT's generated text at low bit-widths.

ETH Zürich · Integrated Systems Laboratory · Semester Project

Deep Dive: Softmax Quantization

A semester project at ETH Zürich under Prof. Dr. Luca Benini. We optimized the softmax operation for MobileBERT to enable integer-only inference, maintaining accuracy down to 4-bit precision.

1. The Exponential Bottleneck

Exponentials are expensive in fixed-point hardware. We approximate them using second-order polynomials.

2. The Staircase Effect

Reducing bits turns the continuous signal into a staircase. Drag the slider to see the L1 error increase as precision drops.

8 bit
256 levels L1 error: 0.0000

3. Visualizing Attention Degradation

Low-bit quantization distorts transformer attention patterns. The "Shift trick" centers logits around zero, allowing for tighter clipping bounds and higher accuracy at 4-bit and 5-bit precision.

8 bit

Full Precision (FP32)

Quantized

Quantization Methods

I-BERT

Approximates exponentials using integer-only polynomials and power-of-2 shifts.

Softermax

Uses base-2 and online normalization to eliminate multiple passes over data.

ITAmax

Achieves optimal scaling factors by focusing on narrow input ranges.

Results (SST-2 Accuracy)

Original I-BERT
Optimized (Shift)