WebGL Shader Optimization

Ship an unoptimized fragment shader and a 100k-point scatter plot that should render in 2ms instead blows the entire 16.67ms frame budget on ALU pressure and driver stalls.

Concept overview

WebGL shader optimization is the practice of minimizing per-pixel ALU work, eliminating CPU-GPU synchronization stalls, and keeping vertex data resident in VRAM so draw calls never wait on JavaScript. It sits at the GPU end of the high-performance animation and GPU acceleration pipeline: once data lands on the GPU, shader cost and draw-call batching determine whether you hit frame budget. The pipeline has two programmable stages — the vertex shader runs once per point (spatial mapping, sizing) and the fragment shader runs once per pixel (color, alpha, edges), so fragment cost dominates at high pixel coverage.

Most stalls are not raw compute — they are synchronization. Every gl.drawArrays queues commands; if the driver must wait for a JavaScript upload to finish before executing, the budget collapses. The goal is data residency before the render loop, branchless math inside shaders, and minimal state changes between draws.

It is worth being precise about where a 16.67ms budget actually goes in a WebGL chart. CPU-side layout and JavaScript typically consume 4–6ms before a single draw call is even submitted, which leaves roughly 10ms for the GPU to transform vertices, rasterize, and shade fragments. Within that GPU window, the fragment stage dominates whenever points overlap or cover many pixels, because it runs once per covered pixel rather than once per data point. This is why the optimization order is fragment-first: shave per-pixel ALU work and overdraw before touching the vertex stage, then keep the CPU side lean so it does not eat the GPU’s share of the frame. A chart can be GPU-bound, CPU-bound, or bandwidth-bound at different zoom levels, so profile each regime rather than assuming one bottleneck.

The synchronization point deserves emphasis because it is the least intuitive. performance.now() around a draw call measures only the time to submit commands to the driver, not the time the GPU spends executing them, because the GPU runs asynchronously. Anything that forces the CPU to wait for the GPU to finish (gl.finish(), a synchronous gl.readPixels(), or reading back a buffer you just wrote) collapses that asynchrony into a stall, and the stall shows up as a mysterious gap in the trace rather than as time attributed to the offending call. Measuring true GPU execution time requires a timer query: EXT_disjoint_timer_query in WebGL 1 or EXT_disjoint_timer_query_webgl2 in WebGL 2. Where the browser exposes it, it is the only reliable way to attribute frame time to GPU work; not every browser does, so feature-detect it and treat it as a development tool.
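The timer-query flow described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: GpuTimer, GpuTimerGL, and TimerExt are hypothetical names introduced here, and the narrow interfaces exist only so the example compiles without DOM typings (a WebGL2RenderingContext is expected to satisfy GpuTimerGL). The result is polled on a later frame so that reading it never blocks the CPU.

```typescript
// Hypothetical narrow types; a real WebGL2RenderingContext should satisfy GpuTimerGL.
interface TimerExt {
  readonly TIME_ELAPSED_EXT: number;
  readonly GPU_DISJOINT_EXT: number;
}
interface GpuTimerGL {
  readonly QUERY_RESULT: number;
  readonly QUERY_RESULT_AVAILABLE: number;
  getExtension(name: string): unknown;
  getParameter(pname: number): unknown;
  createQuery(): object | null;
  beginQuery(target: number, query: object): void;
  endQuery(target: number): void;
  getQueryParameter(query: object, pname: number): unknown;
  deleteQuery(query: object | null): void;
}

class GpuTimer {
  private readonly gl: GpuTimerGL;
  private readonly ext: TimerExt | null;
  private pending: object | null = null;

  constructor(gl: GpuTimerGL) {
    this.gl = gl;
    // Returns null when the browser does not expose the extension.
    this.ext = gl.getExtension('EXT_disjoint_timer_query_webgl2') as TimerExt | null;
  }

  /** Runs draw(), wrapped in a TIME_ELAPSED query when one is available. */
  measure(draw: () => void): void {
    const query = this.ext && !this.pending ? this.gl.createQuery() : null;
    if (!this.ext || !query) { draw(); return; } // Unsupported, or a query is still in flight.
    this.gl.beginQuery(this.ext.TIME_ELAPSED_EXT, query);
    draw();
    this.gl.endQuery(this.ext.TIME_ELAPSED_EXT);
    this.pending = query;
  }

  /** Call on a later frame. Returns GPU milliseconds, or null if not ready or disjoint. */
  poll(): number | null {
    if (!this.ext || !this.pending) return null;
    const gl = this.gl;
    if (!(gl.getQueryParameter(this.pending, gl.QUERY_RESULT_AVAILABLE) as boolean)) return null;
    const disjoint = gl.getParameter(this.ext.GPU_DISJOINT_EXT) as boolean;
    const nanoseconds = gl.getQueryParameter(this.pending, gl.QUERY_RESULT) as number;
    gl.deleteQuery(this.pending);
    this.pending = null;
    return disjoint ? null : nanoseconds / 1e6; // Discard samples taken across a GPU disjoint event.
  }
}
```

A render loop would call poll() at the top of each frame and measure() around its draw calls, logging any non-null result alongside the performance.now() submission time.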

Figure: WebGL pipeline cost centers. Data flows from a pre-allocated VBO in VRAM (updated with bufferSubData) through the vertex shader (per point) and the rasterizer (point sprites) into the fragment shader (per pixel, the hot stage). Per-pixel fragment work dominates; keep vertex data resident in VRAM and use branchless math to stay in budget.

Optimization technique decision table

| Technique | Targets | Mechanism | Gain |
|---|---|---|---|
| Branchless math (mix, step, smoothstep) | Fragment ALU | Avoids warp divergence | Uniform SIMD execution |
| Precision qualifiers (lowp/mediump) | Mobile throughput | Smaller registers | Faster reduced-precision paths on many mobile GPUs |
| bufferSubData on pre-allocated VBO | CPU-GPU sync | In-place update | No per-frame reallocation |
| Uniform caching | Driver overhead | Skip unchanged gl.uniform* | Fewer validation calls |
| Instanced rendering | Draw-call count | One call, many instances | O(1) draws for repeated marks |
| GPU picking | Hit-test cost | Render IDs, read one pixel | O(1) vs O(n) CPU scan |

For collapsing thousands of identical markers into a single draw call, see the companion guide on reducing draw calls with instanced rendering.

Each technique targets a distinct cost center, and they stack. Branchless math and precision qualifiers attack fragment ALU throughput; buffer streaming and uniform caching attack CPU-GPU synchronization; instanced rendering attacks draw-call count; GPU picking attacks the cost of interaction. A common mistake is to micro-optimize a shader that is already fast while leaving a per-frame bufferData reallocation in place — the reallocation stall dwarfs any ALU savings. Profile to find the dominant cost center first, fix it, then re-profile, because optimizing a non-bottleneck stage produces no measurable improvement no matter how clever the change.
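The GPU picking row in the table relies on encoding each point's index as a color in an offscreen pass and decoding the one pixel under the cursor. A minimal sketch of that encoding follows; idToRgba and rgbaToId are hypothetical helper names, and the scheme assumes a 24-bit ID in RGB, alpha 1 for hits, a background cleared to alpha 0, and a picking pass drawn with blending and antialiasing disabled.

```typescript
/** Encodes a point index as normalized RGBA for the picking pass (24-bit ID in RGB). */
function idToRgba(id: number): [number, number, number, number] {
  if (!Number.isInteger(id) || id < 0 || id > 0xffffff) {
    throw new RangeError('id must be an integer that fits in 24 bits');
  }
  return [
    (id & 0xff) / 255,         // Low byte in red.
    ((id >> 8) & 0xff) / 255,  // Middle byte in green.
    ((id >> 16) & 0xff) / 255, // High byte in blue.
    1,                         // Alpha 1 marks "a point was drawn here".
  ];
}

/** Decodes the RGBA bytes read back from the picking framebuffer; null means background. */
function rgbaToId(pixel: Uint8Array): number | null {
  if (pixel[3] === 0) return null; // Background was cleared to alpha 0.
  return pixel[0] | (pixel[1] << 8) | (pixel[2] << 16);
}
```

Because the hit test is a single pixel read, its cost does not grow with the number of points, which is the O(1) versus O(n) trade the table refers to.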

Reference spec

// Streaming VBO: pre-allocate once, update in place — no per-frame reallocation.
class StreamingDataBuffer {
  private readonly buffer: WebGLBuffer;
  private readonly view: Float32Array;

  constructor(private readonly gl: WebGL2RenderingContext, maxPoints: number) {
    this.view = new Float32Array(maxPoints * 2); // x, y per point.
    this.buffer = gl.createBuffer()!;
    gl.bindBuffer(gl.ARRAY_BUFFER, this.buffer);
    // PERF: DYNAMIC_DRAW is a usage hint for frequent updates; the driver may pick memory that is cheap to write.
    gl.bufferData(gl.ARRAY_BUFFER, this.view, gl.DYNAMIC_DRAW);
    gl.bindBuffer(gl.ARRAY_BUFFER, null);
  }

  updateBatch(data: Float32Array, startFloat: number): void {
    if (startFloat + data.length > this.view.length) throw new RangeError('VBO overflow');
    this.view.set(data, startFloat);
    this.gl.bindBuffer(this.gl.ARRAY_BUFFER, this.buffer);
    // PERF: bufferSubData uploads only the changed range, avoiding driver reallocation stalls.
    this.gl.bufferSubData(this.gl.ARRAY_BUFFER, startFloat * 4, data);
    this.gl.bindBuffer(this.gl.ARRAY_BUFFER, null);
  }
}

// Fragment shader (GLSL ES 1.00).
// PERF: branchless fragment shader; no if/else means no warp divergence.
precision mediump float; // PERF: mediump runs faster than highp on many mobile GPUs.

varying vec2 v_uv;
uniform vec4 u_baseColor;
uniform vec4 u_highlightColor;
uniform float u_threshold;

void main() {
  // step() returns 1.0 when u_threshold <= v_uv.x, else 0.0 — no branch.
  float mask = step(u_threshold, v_uv.x);
  vec4 color = mix(u_baseColor, u_highlightColor, mask); // Interpolate without branching.
  color.a = clamp(color.a, 0.0, 1.0); // Keep alpha in [0, 1] so blending stays predictable.
  gl_FragColor = color;
}
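To check that the step/mix formulation above selects the same color as an explicit if/else, the shader logic can be mirrored on the CPU. This is an illustrative sketch only: step, mix, clamp01, shadeBranchless, and shadeBranching are hypothetical TypeScript helpers that follow the GLSL definitions, useful for unit-testing shader math off the GPU rather than for rendering.

```typescript
type Vec4 = [number, number, number, number];

// CPU mirrors of the GLSL built-ins (the ternary here is fine; only the GPU version must be branchless).
const step = (edge: number, x: number): number => (x < edge ? 0 : 1);
const mix = (a: number, b: number, t: number): number => a * (1 - t) + b * t;
const clamp01 = (v: number): number => Math.min(1, Math.max(0, v));

/** Mirrors the shader's main(): build a 0/1 mask, blend by it, then bound alpha. */
function shadeBranchless(base: Vec4, highlight: Vec4, threshold: number, u: number): Vec4 {
  const mask = step(threshold, u);
  return [
    mix(base[0], highlight[0], mask),
    mix(base[1], highlight[1], mask),
    mix(base[2], highlight[2], mask),
    clamp01(mix(base[3], highlight[3], mask)),
  ];
}

/** The if/else formulation that the shader avoids. */
function shadeBranching(base: Vec4, highlight: Vec4, threshold: number, u: number): Vec4 {
  const c = u >= threshold ? highlight : base;
  return [c[0], c[1], c[2], clamp01(c[3])];
}
```

Because the mask is exactly 0.0 or 1.0, mix degenerates to a pure selection, so both functions agree for every input while the GPU version keeps all fragments on one instruction path.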

Step-by-step implementation

// PERF: uniform cache — resolve locations once, skip redundant uploads.
class UniformCache {
  private readonly locations = new Map<string, WebGLUniformLocation>();
  private readonly cache = new Map<string, number>();

  constructor(private readonly gl: WebGL2RenderingContext, program: WebGLProgram) {
    gl.useProgram(program);
    const count = gl.getProgramParameter(program, gl.ACTIVE_UNIFORMS) as number;
    for (let i = 0; i < count; i++) {
      const info = gl.getActiveUniform(program, i)!;
      this.locations.set(info.name, gl.getUniformLocation(program, info.name)!);
    }
  }

  setFloat(name: string, value: number): void {
    if (this.cache.get(name) === value) return; // PERF: skip unchanged uniform — no driver call.
    this.cache.set(name, value);
    this.gl.uniform1f(this.locations.get(name)!, value);
  }
}
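The shader above also takes vec4 uniforms (u_baseColor, u_highlightColor), which the float-only cache does not cover. One way the same idea might extend to vec4 values is sketched below; Vec4UniformCache and Uniform4GL are hypothetical names, the location map is assumed to have been resolved once at program link time, and the program is assumed to be current (gl.useProgram) whenever setVec4 uploads.

```typescript
// Hypothetical narrow type; L stands in for WebGLUniformLocation.
interface Uniform4GL<L> {
  uniform4f(location: L | null, x: number, y: number, z: number, w: number): void;
}

class Vec4UniformCache<L> {
  private readonly gl: Uniform4GL<L>;
  private readonly locations: ReadonlyMap<string, L>;
  // Plain numbers, not Float32Array: float32 rounding would make 0.1 compare unequal to itself.
  private readonly last = new Map<string, readonly [number, number, number, number]>();

  constructor(gl: Uniform4GL<L>, locations: ReadonlyMap<string, L>) {
    this.gl = gl;
    this.locations = locations;
  }

  /** Uploads only when the value changed. Returns true if a driver call was made. */
  setVec4(name: string, x: number, y: number, z: number, w: number): boolean {
    const location = this.locations.get(name);
    if (location === undefined) return false; // Unknown or optimized-out uniform.
    const prev = this.last.get(name);
    if (prev && prev[0] === x && prev[1] === y && prev[2] === z && prev[3] === w) {
      return false; // PERF: unchanged, so skip the gl.uniform4f call entirely.
    }
    this.last.set(name, [x, y, z, w]);
    this.gl.uniform4f(location, x, y, z, w);
    return true;
  }
}
```

The cache is only valid for the program whose locations it holds; a separate instance per program avoids stale comparisons after a program switch.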

Performance & memory notes

A fragment shader runs once per covered pixel, which can mean millions of invocations per frame, so a single avoidable branch or texture fetch multiplies into measurable ALU pressure. Branchless math keeps every lane of a warp on the same instruction stream; a divergent branch forces the GPU to execute both paths and mask the unused results, doubling cost in the worst case. On memory, calling bufferData every frame reallocates the buffer's storage, whereas bufferSubData against a pre-sized buffer costs O(changed bytes) with no reallocation. Never create textures, buffers, or programs inside the render loop: objects that are never deleted accumulate in VRAM until the browser drops the context and fires a webglcontextlost event. Always call gl.deleteBuffer, gl.deleteTexture, and gl.deleteProgram on teardown.
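The create-once, delete-on-teardown rule is easier to follow when every GPU object is created through one owner. The following is a minimal sketch under that assumption; GpuResources and ResourceGL are hypothetical names, and the generic parameters stand in for WebGLBuffer and WebGLTexture so the example compiles without DOM typings.

```typescript
// Hypothetical narrow type; B and T stand in for WebGLBuffer and WebGLTexture.
interface ResourceGL<B, T> {
  createBuffer(): B | null;
  createTexture(): T | null;
  deleteBuffer(buffer: B | null): void;
  deleteTexture(texture: T | null): void;
}

class GpuResources<B, T> {
  private readonly gl: ResourceGL<B, T>;
  private readonly buffers: B[] = [];
  private readonly textures: T[] = [];

  constructor(gl: ResourceGL<B, T>) {
    this.gl = gl;
  }

  /** Call during setup, never inside the render loop. */
  createBuffer(): B {
    const buffer = this.gl.createBuffer();
    if (buffer === null) throw new Error('createBuffer failed (context lost?)');
    this.buffers.push(buffer);
    return buffer;
  }

  createTexture(): T {
    const texture = this.gl.createTexture();
    if (texture === null) throw new Error('createTexture failed (context lost?)');
    this.textures.push(texture);
    return texture;
  }

  /** Number of objects this owner still holds; a value that grows per frame signals a leak. */
  get liveCount(): number {
    return this.buffers.length + this.textures.length;
  }

  /** Releases everything on teardown so VRAM is returned promptly. */
  dispose(): void {
    for (const buffer of this.buffers) this.gl.deleteBuffer(buffer);
    for (const texture of this.textures) this.gl.deleteTexture(texture);
    this.buffers.length = 0;
    this.textures.length = 0;
  }
}
```

Asserting in development that liveCount stays constant across frames is a cheap way to catch accidental allocation inside the render loop.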

For per-point styling, resist the urge to set a uniform per point, because that serializes into thousands of driver calls. Instead, pack styling metadata (color index, size, category) into a vertex attribute alongside position, or into a sampler2D texture atlas the shader reads by index. Both keep the fragment shader stateless and the draw loop free of per-point API calls. Group related uniforms into uniform blocks (WebGL2 UBOs) so a single bufferSubData updates many values at once, and update them only when they actually change. Texture state is the other batching axis: every texture rebind is a state change that costs driver validation and splits a batch, so group draws by texture binding and prefer a single atlas over many small textures.
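Packing style metadata next to position might look like the sketch below. The layout (x, y, size, colorIndex as four floats per point) and the names StyledPoint, packPoints, STRIDE_BYTES, and OFFSETS are assumptions made for illustration, not a layout the guide prescribes.

```typescript
interface StyledPoint {
  x: number;
  y: number;
  size: number;
  colorIndex: number; // Index into a palette; exact as a float for integers up to 2^24.
}

const FLOATS_PER_POINT = 4; // x, y, size, colorIndex
const STRIDE_BYTES = FLOATS_PER_POINT * Float32Array.BYTES_PER_ELEMENT; // 16 bytes per vertex
const OFFSETS = { position: 0, size: 8, colorIndex: 12 } as const; // Byte offsets within one vertex.

/** Interleaves position and style into one array so a single VBO carries everything. */
function packPoints(points: readonly StyledPoint[]): Float32Array {
  const out = new Float32Array(points.length * FLOATS_PER_POINT);
  points.forEach((p, i) => {
    out.set([p.x, p.y, p.size, p.colorIndex], i * FLOATS_PER_POINT);
  });
  return out;
}

// Binding the layout (for reference; needs a live context and attribute locations):
//   gl.vertexAttribPointer(positionLoc,   2, gl.FLOAT, false, STRIDE_BYTES, OFFSETS.position);
//   gl.vertexAttribPointer(sizeLoc,       1, gl.FLOAT, false, STRIDE_BYTES, OFFSETS.size);
//   gl.vertexAttribPointer(colorIndexLoc, 1, gl.FLOAT, false, STRIDE_BYTES, OFFSETS.colorIndex);
```

With this layout the whole scatter plot draws in one call, and a style change for a range of points is a single bufferSubData over the affected floats.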

When preprocessing is heavy — spatial binning, aggregation, or matrix math over a large dataset — keep it off the main thread. The compute can run in a worker that prepares the typed arrays, which the main thread then streams into the resident VBO with bufferSubData. This keeps the JavaScript share of the frame budget small so the GPU gets its full slice, and it is exactly the division of labor the offscreen-rendering and frame-stabilization companion guides describe from their own angles.
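As an example of the kind of preprocessing that belongs in a worker, the sketch below bins interleaved points into a grid of counts. binPoints is a hypothetical function, and it assumes positions are already normalized to the 0 to 1 range; it is pure, so it runs unchanged in a worker or on the main thread.

```typescript
/**
 * Counts points per grid cell. `positions` is interleaved [x0, y0, x1, y1, ...]
 * with coordinates assumed to be normalized to [0, 1].
 */
function binPoints(positions: Float32Array, cols: number, rows: number): Uint32Array {
  const counts = new Uint32Array(cols * rows);
  for (let i = 0; i + 1 < positions.length; i += 2) {
    const x = positions[i];
    const y = positions[i + 1];
    if (!Number.isFinite(x) || !Number.isFinite(y)) continue; // Skip NaN and infinities.
    // Clamp so that x === 1 or y === 1 lands in the last cell rather than out of range.
    const col = Math.min(cols - 1, Math.max(0, Math.floor(x * cols)));
    const row = Math.min(rows - 1, Math.max(0, Math.floor(y * rows)));
    counts[row * cols + col]++;
  }
  return counts;
}
```

Inside a worker, the result can be handed back with postMessage(counts, [counts.buffer]), which transfers the underlying buffer instead of copying it; the main thread then uploads it with bufferSubData or as a texture.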

Profiling shader performance

Shader work is opaque to ordinary JavaScript profilers, so use GPU-aware tooling. Spector.js captures the full WebGL command stream for a frame, letting you inspect every state change, draw call, shader compilation log, and buffer binding, which is invaluable for spotting redundant state switches and accidental per-frame buffer recreation. The Chrome DevTools Memory panel shows the JavaScript heap rather than VRAM, but heap snapshots still expose retained typed arrays and WebGL object wrappers that were never deleted, which is often where texture thrashing and orphaned buffers originate. For frame timing, EXT_disjoint_timer_query (EXT_disjoint_timer_query_webgl2 in WebGL 2) measures actual GPU execution rather than submission time, and you can cross-reference it with the DevTools GPU track to separate driver overhead from shader cost.

Bake this into CI rather than checking it by hand. Record frame times and GPU memory usage during an automated interaction and alert when the 95th-percentile frame time exceeds the 16.67ms budget or VRAM crosses a ceiling; performance regressions in shaders are easy to introduce and hard to notice until a user on weaker hardware complains. Watch gl.getError() for gl.OUT_OF_MEMORY and gl.INVALID_FRAMEBUFFER_OPERATION during development, but keep it out of the production render loop, because getError itself forces a synchronous round trip to the GPU process. Both errors point at resource lifetime bugs that will eventually surface as context loss in production.
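The 95th-percentile gate could be implemented along these lines. This is a sketch assuming the nearest-rank percentile definition and a 60 Hz budget; percentile and checkFrameBudget are hypothetical helper names, and how frame times are collected is left to the test harness.

```typescript
/** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new RangeError('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

/** Compares the chosen percentile of recorded frame times against the frame budget. */
function checkFrameBudget(
  frameTimesMs: readonly number[],
  budgetMs: number = 1000 / 60, // About 16.67 ms at 60 Hz.
  p: number = 95,
): { percentileMs: number; pass: boolean } {
  const percentileMs = percentile(frameTimesMs, p);
  return { percentileMs, pass: percentileMs <= budgetMs };
}
```

A CI job would fail the build when pass is false, and print percentileMs so the regression size is visible in the log.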

Accessibility checklist

Troubleshooting

| Symptom | Root cause | Fix |
|---|---|---|
| Frame drops on mobile only | highp used throughout the fragment shader | Declare mediump/lowp precision explicitly |
| Micro-stutters during streaming | bufferData reallocating each frame | Pre-allocate and use bufferSubData |
| GPU stalls on fast updates | Redundant gl.uniform* calls | Add a uniform cache; skip unchanged values |
| webglcontextlost after minutes | Objects created in the render loop | Create resources once; delete on teardown |
| Shader compiles but renders black | Undefined uniform/attrib or precision mismatch | Check info logs; validate locations are non-null |

Frequently Asked Questions

Why are branches so expensive in fragment shaders?

GPUs execute fragments in lockstep groups called warps. When a branch sends some fragments down one path and the rest down another, the hardware cannot run both simultaneously, so it executes both paths for the whole warp and masks the unused results. That serializes divergent work and can double shader cost. Branchless functions like step and mix keep every fragment on the same path, preserving full SIMD throughput.

When should I use lowp vs mediump vs highp?

Use lowp for values in roughly the −2 to 2 range such as colors and alpha, mediump for normalized device coordinates and most interpolated varyings, and highp only where precision genuinely matters: depth values, large world coordinates, or accumulation. Desktop GPUs typically treat every qualifier as highp, so the cost only shows up on mobile, where many GPUs run mediump at a higher rate with less register pressure and where WebGL 1 does not even guarantee highp in fragment shaders. Over-declaring highp is therefore a common cause of mobile-only frame drops.
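Whether mediump is actually narrower than highp on a given device can be probed at runtime with gl.getShaderPrecisionFormat. The sketch below is one way to do that; probeFragmentPrecision and the narrow PrecisionGL interface are hypothetical names, and a reported precision of 0 bits is taken to mean the format is unsupported.

```typescript
// Hypothetical narrow types; a WebGL rendering context should satisfy PrecisionGL.
interface PrecisionFormat {
  readonly rangeMin: number;
  readonly rangeMax: number;
  readonly precision: number; // Mantissa bits; 0 means the format is not supported.
}
interface PrecisionGL {
  readonly FRAGMENT_SHADER: number;
  readonly HIGH_FLOAT: number;
  readonly MEDIUM_FLOAT: number;
  getShaderPrecisionFormat(shaderType: number, precisionType: number): PrecisionFormat | null;
}

function probeFragmentPrecision(gl: PrecisionGL): { highpSupported: boolean; mediumpIsNarrower: boolean } {
  const high = gl.getShaderPrecisionFormat(gl.FRAGMENT_SHADER, gl.HIGH_FLOAT);
  const medium = gl.getShaderPrecisionFormat(gl.FRAGMENT_SHADER, gl.MEDIUM_FLOAT);
  const highpSupported = high !== null && high.precision > 0;
  // On desktop both formats usually report the same bits, so mediump changes nothing there.
  const mediumpIsNarrower =
    high !== null && medium !== null && high.precision > 0 && medium.precision < high.precision;
  return { highpSupported, mediumpIsNarrower };
}
```

When mediumpIsNarrower is true, precision qualifiers have a real effect on that device, which also means mediump math should be checked for visible banding or jitter there.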

How do I avoid CPU-GPU synchronization stalls?

Keep vertex data resident: pre-allocate the VBO and update it with bufferSubData rather than reallocating with bufferData. Avoid gl.finish() and synchronous gl.readPixels() into a typed array inside the render loop, because both force the CPU to wait for the GPU to drain. For picking, read a single pixel at most once per frame, or in WebGL2 read into a pixel buffer object and fetch the result on a later frame for asynchronous readback.
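The pixel buffer object route can be sketched as follows: readPixels targets a PIXEL_PACK_BUFFER, a fence marks the point where the copy completes, and the bytes are fetched only once the fence has signaled. AsyncPixelReader and ReadbackGL are hypothetical names; the interface lists only the WebGL2 calls the sketch uses, and the generics stand in for WebGLBuffer and WebGLSync.

```typescript
// Hypothetical narrow type; a WebGL2RenderingContext should satisfy it.
interface ReadbackGL<B, S> {
  readonly PIXEL_PACK_BUFFER: number; readonly STREAM_READ: number;
  readonly RGBA: number; readonly UNSIGNED_BYTE: number;
  readonly SYNC_GPU_COMMANDS_COMPLETE: number;
  readonly ALREADY_SIGNALED: number; readonly CONDITION_SATISFIED: number; readonly WAIT_FAILED: number;
  createBuffer(): B | null;
  bindBuffer(target: number, buffer: B | null): void;
  bufferData(target: number, size: number, usage: number): void;
  readPixels(x: number, y: number, w: number, h: number, format: number, type: number, offset: number): void;
  fenceSync(condition: number, flags: number): S | null;
  clientWaitSync(sync: S, flags: number, timeout: number): number;
  deleteSync(sync: S | null): void;
  getBufferSubData(target: number, srcByteOffset: number, dst: ArrayBufferView): void;
  flush(): void;
}

class AsyncPixelReader<B, S> {
  private readonly gl: ReadbackGL<B, S>;
  private readonly pbo: B;
  private readonly pixel = new Uint8Array(4);
  private sync: S | null = null;

  constructor(gl: ReadbackGL<B, S>) {
    const pbo = gl.createBuffer();
    if (pbo === null) throw new Error('createBuffer failed');
    this.gl = gl;
    this.pbo = pbo;
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, pbo);
    gl.bufferData(gl.PIXEL_PACK_BUFFER, 4, gl.STREAM_READ); // Room for one RGBA pixel.
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
  }

  /** Queues a one-pixel read. Returns false if a previous read is still in flight. */
  request(x: number, y: number): boolean {
    if (this.sync !== null) return false;
    const gl = this.gl;
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, this.pbo);
    gl.readPixels(x, y, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, 0); // 0 is a byte offset into the PBO.
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
    this.sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
    gl.flush(); // Make sure the fence is actually submitted.
    return this.sync !== null;
  }

  /** Call on later frames. Returns the RGBA bytes once the GPU is done, else null. */
  poll(): Uint8Array | null {
    if (this.sync === null) return null;
    const gl = this.gl;
    const status = gl.clientWaitSync(this.sync, 0, 0); // Timeout 0: never block.
    const done = status === gl.ALREADY_SIGNALED || status === gl.CONDITION_SATISFIED;
    if (!done && status !== gl.WAIT_FAILED) return null; // Still pending.
    gl.deleteSync(this.sync);
    this.sync = null;
    if (!done) return null; // Wait failed: drop this request.
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, this.pbo);
    gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, this.pixel);
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
    return this.pixel;
  }
}
```

The picking result arrives a frame or two late, which is usually imperceptible for hover and tooltip interactions and removes the stall from the render loop.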

Should I optimize the vertex or fragment shader first?

Profile, but the fragment shader is usually the hotter stage because it runs per pixel rather than per point, and dense overlapping marks cause heavy overdraw. Reduce fragment cost first with branchless math, lower precision, and less overdraw (smaller or fewer overlapping sprites), then attack the vertex stage with early culling of off-screen points. See the companion guide on writing custom GLSL shaders for scatter plots for the full point-sprite pipeline.