ETH Zürich · Integrated Systems Laboratory · Spring 2024

Investigation of Softmax Quantization for Transformer Networks

Enabling efficient integer-only softmax on resource-constrained hardware, down to 4‑bit precision.

Semester Project · Prof. Dr. Luca Benini · MobileBERT · Code

The Problem

Transformer models like BERT and GPT are incredibly powerful but also extremely resource-hungry. To run them on mobile phones or small edge devices, we need to shrink the numbers they compute with, a technique called quantization.

Instead of using 32‑bit floating-point numbers, we can use 8‑bit or even 4‑bit integers, drastically cutting memory use and speeding up inference. However, one critical operation resists this compression: the softmax function.

Softmax converts raw scores into probabilities and sits at the heart of every attention layer. Its exponential and normalization operations make it extremely difficult to compute with only integers. Most implementations cheat by switching back to floating-point for softmax, creating a major performance bottleneck.

What Does Softmax Do?

Softmax takes a vector of raw scores and turns them into a probability distribution: values between 0 and 1 that sum to 1. The larger a score is relative to the others, the more probability it receives. This is how a transformer decides which words to "pay attention" to.

softmax(z_i) = exp(z_i) / Σ_j exp(z_j) · the exponential amplifies differences, the normalization creates probabilities
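In code this is just a few lines. A minimal NumPy sketch (subtracting the maximum first is standard practice for numerical stability, and foreshadows the shift-invariance exploited later):

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability (softmax is shift-invariant).
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))   # -> [0.659, 0.242, 0.099], sums to 1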

Quantization: Fewer Bits, Less Precision

Quantization maps continuous values to a small set of discrete levels. Drag the slider to see how reducing the bit‑width affects the representation. With fewer bits, the "staircase" gets coarser and fine details are lost. This is the core tradeoff.

[Interactive demo: bit-width slider · 8 bit → 256 levels, L1 error 0.00]
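The staircase can be reproduced in a few lines of NumPy. This is a generic sketch of uniform symmetric quantization with an assumed clipping bound of 1.0, not the project's exact quantizer:

import numpy as np

def fake_quant(x, bits, alpha=1.0):
    # Uniform symmetric quantization: clip to [-alpha, alpha], snap to the grid, dequantize.
    levels = 2 ** bits                   # e.g. 8 bit -> 256 levels
    scale = 2 * alpha / (levels - 1)     # step size of the "staircase"
    return np.round(np.clip(x, -alpha, alpha) / scale) * scale

x = np.linspace(-1, 1, 1000)
for bits in (8, 6, 4):
    err = np.abs(x - fake_quant(x, bits)).mean()
    print(f"{bits} bit: {2**bits:4d} levels, L1 error {err:.4f}")   # coarser staircase -> larger error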

Softmax in Action: Attention Weights

In a transformer, softmax produces an attention matrix, a grid showing how much each word "attends to" every other word. Below is a comparison of attention weights at full precision versus quantized. Notice how low‑bit quantization can distort the pattern.

Use the slider to reduce precision and watch the attention map degrade. The shift operation I introduced helps preserve the pattern at low precision.

[Interactive demo: bit-width slider and a "shift the input distribution for tighter clipping bounds" toggle · panels: Full Precision (FP32) vs. Quantized · simulated 8×8 attention matrix, brighter = higher attention weight]
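The same degradation can be simulated offline. The sketch below builds a random 8×8 attention map and fake-quantizes only the softmax output, a simplification of what the integer-only kernels actually do; the random Q/K matrices and bit-widths are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
d = 16
Q = rng.normal(size=(8, d))
K = rng.normal(size=(8, d))
scores = Q @ K.T / np.sqrt(d)            # raw 8x8 attention scores

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

attn_fp32 = softmax(scores)              # the "Full Precision" panel

# Fake-quantize the attention weights to b bits on [0, 1] and measure the damage.
for bits in (8, 6, 4):
    step = 1.0 / (2 ** bits - 1)
    attn_q = np.round(attn_fp32 / step) * step
    print(f"{bits}-bit  mean L1 error: {np.abs(attn_fp32 - attn_q).mean():.4f}")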

Key Insight: Shift Before Softmax

A central contribution of this work was introducing a shift operation before the quantized softmax. Softmax is mathematically shift-invariant: adding a constant to all inputs doesn't change the output probabilities.

softmax(z + c) = softmax(z)

Shifting the input by any constant c doesn't change the result.

This property is powerful: by shifting the data to be centered around zero, we can use tighter clipping bounds for symmetric quantization. Tighter bounds mean a better scaling factor, which means more precision where it matters.
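To make the benefit concrete, here is a minimal sketch using the same kind of uniform symmetric quantizer as above (illustrative data, not the project's integer softmax kernel): centering the scores shrinks the clipping bound, which shrinks the step size, which shrinks the error.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fake_quant(x, bits, alpha):
    # Uniform symmetric quantization with clipping bound alpha, then dequantize.
    scale = 2 * alpha / (2 ** bits - 1)
    return np.round(np.clip(x, -alpha, alpha) / scale) * scale

rng = np.random.default_rng(0)
scores = rng.normal(loc=6.0, scale=2.0, size=64)    # raw scores sitting far from zero

ref = softmax(scores)
bits = 4

# Without the shift: the symmetric range must reach all the way out to the offset values.
naive = softmax(fake_quant(scores, bits, alpha=np.abs(scores).max()))

# With the shift: centre the scores first (the softmax output is unchanged),
# then clip to a much tighter, better-utilised range.
shifted = scores - scores.mean()
better = softmax(fake_quant(shifted, bits, alpha=np.abs(shifted).max()))

print("L1 error without shift:", np.abs(ref - naive).mean())
print("L1 error with shift:   ", np.abs(ref - better).mean())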

This simple trick reduced the L1 error between quantized and original softmax output by 34.6%, a major accuracy boost, especially at 4 and 5‑bit precision.

Results: Performance at Low Precision

After fixing scaling-factor issues in the I-BERT implementation and adding the shift, performance at 4 and 5‑bit quantization improved dramatically. The bars show accuracy on the SST-2 sentiment analysis benchmark (MobileBERT model).

[Chart: SST-2 accuracy by bit-width · bars: Original I-BERT vs. Optimized (Shift + Fixed Scaling) · dashed line: original unquantized MobileBERT, 90.4% accuracy on SST-2]

Quantization Methods Investigated

I-BERT

Approximates the softmax exponential with integer-only polynomial arithmetic. Exploiting the fact that multiplying by a power of 2 is a simple bit shift in hardware, the exponent's argument is range-reduced to a narrow interval and the remainder is approximated with a second-order polynomial.
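A floating-point sketch of that recipe, using the range reduction and the second-order polynomial coefficients reported in the I-BERT paper; the real kernel performs every step with integers and scaling factors, which this sketch omits:

import numpy as np

LN2 = np.log(2.0)

def iexp_approx(x):
    # Sketch of I-BERT-style i-exp: range-reduce, then a 2nd-order polynomial.
    x = np.minimum(x, 0.0)                     # softmax inputs have the max subtracted, so x <= 0
    z = np.floor(-x / LN2)                     # exp(x) = 2^(-z) * exp(p), with p in (-ln2, 0]
    p = x + z * LN2
    poly = 0.3585 * (p + 1.353) ** 2 + 0.344   # polynomial approximation of exp(p)
    return poly * 2.0 ** (-z)                  # power of two -> just a bit shift in hardware

x = np.linspace(-8, 0, 5)
print(np.round(iexp_approx(x), 4))
print(np.round(np.exp(x), 4))                  # close match over the whole range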

Softermax

Replaces the natural exponential base e with base‑2, making hardware implementation more efficient. Uses an online normalization scheme to eliminate the extra pass over inputs needed for numerical stability.
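One way to write down that idea (floating-point, for clarity; the actual Softermax design operates on integers and fixed-point values):

import numpy as np

def softermax_style(scores):
    # Base-2 softmax with online normalization: a single pass keeps a running
    # maximum and a running sum, rescaling the sum whenever the maximum grows.
    m = -np.inf      # running maximum
    s = 0.0          # running sum of 2^(x - m)
    for x in scores:
        m_new = max(m, x)
        s = s * 2.0 ** (m - m_new) + 2.0 ** (x - m_new)
        m = m_new
    return 2.0 ** (scores - m) / s

z = np.array([3.0, 1.0, -2.0, 0.5])
p = softermax_style(z)
print(p, p.sum())    # a valid probability distribution, computed with base 2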

ITAmax

Similar to Softermax but additionally exploits the observation that only a narrow range of input values produces non-zero softmax outputs, allowing a more optimal scaling factor for quantization.
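The underlying observation is easy to check numerically. The sketch below is a generic illustration, not ITAmax's actual scaling-factor derivation: with b-bit outputs, any input more than about (b+1)·ln 2 below the row maximum produces an output smaller than half a quantization step and therefore rounds to zero.

import numpy as np

bits = 4
step = 1.0 / (2 ** bits - 1)              # output quantization step (outputs live in [0, 1])
window = (bits + 1) * np.log(2.0)         # inputs more than this below the max round to 0

rng = np.random.default_rng(0)
x = rng.normal(scale=4.0, size=64)        # one row of attention scores
p = np.exp(x - x.max()); p /= p.sum()

negligible = x < x.max() - window
print("inputs that can still matter:", (~negligible).sum(), "of", x.size)
print("largest 'negligible' output :", p[negligible].max(), "< half a step:", step / 2)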

Key Takeaways

+37%

Average accuracy improvement at 4‑bit over the original I-BERT implementation

4‑bit

Lowest usable precision achieved, down from 6‑bit with the original implementation

−34.6%

Reduction in L1 error from the shift operation alone

Context

This was a semester project at the Integrated Systems Laboratory (IIS), ETH Zürich, under Prof. Dr. Luca Benini. The model used for evaluation was MobileBERT, a compact version of BERT designed for mobile deployment.

Performance was evaluated on the GLUE benchmark, a standard suite of NLP tasks including sentiment analysis (SST-2), linguistic acceptability (CoLA), textual entailment (RTE), and paraphrase detection (MRPC).