Data Joins & Key Functions

Get the key function wrong and every other layer fails: transitions snap to the wrong elements, event listeners detach silently, and the heap fills with orphaned nodes that never garbage-collect.

Concept overview

A data join is the reconciliation layer between a raw array and rendered primitives. selection.data(values, key) compares the incoming array against the elements already bound to the selection and partitions the result into enter (no matching node), update (matched node), and exit (node with no matching datum). This mechanism is the spine of the wider D3.js data binding and layout architecture; scales, axes, and transitions all run downstream of how the join resolves identity.

By default D3 matches by array index. Index binding is cheap but fails the moment data is sorted, filtered, or streamed out of order — element 0 always maps to datum 0 regardless of what that datum now represents. A key function overrides this with a stable identity contract:

(datum: Datum, index: number, nodes: Array<Element | null>) => string | number

When a key matches an existing node, D3 reuses that node and keeps its bound listeners, in-flight transitions, and focus state intact. That continuity is what makes the enter-update-exit pattern predictable instead of a source of flicker.

Mechanically, the key function runs in two passes during a single .data() call. The first pass iterates the elements already in the selection and invokes the key callback with each node’s currently bound datum (read from the node’s __data__ property), building an internal map from key string to existing node. The second pass iterates the incoming array and invokes the key callback with each new datum: if that key is found in the map, the new datum is paired with the existing node (update) and the key is removed from the map; if not, the datum becomes an enter placeholder. Whatever remains in the map after the second pass, meaning nodes whose key no incoming datum claimed, becomes the exit selection.

This is a set difference computed by string equality, which has two immediate consequences. First, key collisions are silent: if two incoming records produce the same key, only the first is paired with the existing node and the second falls through to enter, while an existing node whose key duplicates an earlier node’s key is sent to exit, so uniqueness is not optional. Second, type matters, because D3 coerces keys to strings: the number 1 and the string "1" produce the same key, but a Date object and its getTime() value do not, so normalize key types at the source.
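The two passes described above can be modeled in a few lines of plain TypeScript. This is a simplified sketch of the bookkeeping, not D3's source: BoundNode, JoinResult, and keyedJoin are hypothetical names, and a real selection also tracks parent groups and placeholder nodes.

```typescript
// Hypothetical stand-in for a DOM element that carries a bound datum.
interface BoundNode<T> { tag: string; datum: T; }

interface JoinResult<T> {
  enter: T[];                                        // incoming data with no matching node
  update: Array<{ node: BoundNode<T>; datum: T }>;   // matched node/datum pairs
  exit: BoundNode<T>[];                              // nodes no incoming datum claimed
}

function keyedJoin<T>(
  nodes: ReadonlyArray<BoundNode<T>>,
  incoming: ReadonlyArray<T>,
  key: (datum: T) => string | number,
): JoinResult<T> {
  const byKey = new Map<string, BoundNode<T>>();
  const exit: BoundNode<T>[] = [];

  // Pass 1: index existing nodes by the key of their currently bound datum.
  for (const node of nodes) {
    const k = String(key(node.datum));
    if (byKey.has(k)) exit.push(node);               // duplicate key among existing nodes
    else byKey.set(k, node);
  }

  // Pass 2: let incoming data claim nodes; unclaimed data become enter.
  const enter: T[] = [];
  const update: Array<{ node: BoundNode<T>; datum: T }> = [];
  for (const datum of incoming) {
    const k = String(key(datum));
    const node = byKey.get(k);
    if (node) {
      update.push({ node, datum });
      byKey.delete(k);                               // a node can be claimed only once
    } else {
      enter.push(datum);
    }
  }

  // Whatever is still in the map was never claimed, so it exits.
  exit.push(...byKey.values());
  return { enter, update, exit };
}

const existingNodes = [
  { tag: 'rect', datum: { id: 'a', v: 1 } },
  { tag: 'rect', datum: { id: 'b', v: 2 } },
];
const joined = keyedJoin(existingNodes, [{ id: 'b', v: 5 }, { id: 'c', v: 7 }], (d) => d.id);
// joined.update pairs node "b" with its new datum, "c" enters, and node "a" exits.
```

Because the map is keyed by the coerced string, the sketch also shows why a duplicate incoming key ends up in enter: the first claim deletes the entry, so the second lookup finds nothing.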

The cost of this machinery is a single Map allocation and two linear passes, which is why an explicit key adds O(n) work over index binding’s pure positional pairing. For ten thousand nodes that overhead is on the order of a millisecond or two as long as the key callback itself does no string concatenation or property walking — which is the practical reason to precompute a stable id field during ingestion and have the callback simply return d.id rather than build a composite key on every bind.
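As a rough illustration of precomputing the id during ingestion, the sketch below builds a composite id once per record so the key callback only reads a field. RawReading, KeyedReading, ingestReadings, and readingKey are hypothetical names, and the region and metric fields are assumed to be immutable for a given entity.

```typescript
interface RawReading { region: string; metric: string; value: number; }
interface KeyedReading extends RawReading { readonly id: string; }

// Build the composite id once per record, at ingestion, from fields that never change.
function ingestReadings(raw: ReadonlyArray<RawReading>): KeyedReading[] {
  return raw.map((r) => ({ ...r, id: `${r.region}:${r.metric}` }));
}

// The key callback then only returns an existing field; it builds no string per bind.
const readingKey = (d: KeyedReading): string => d.id;

const readings = ingestReadings([
  { region: 'emea', metric: 'latency', value: 120 },
  { region: 'apac', metric: 'latency', value: 95 },
]);
const readingKeys = readings.map(readingKey); // ['emea:latency', 'apac:latency']
```

The callback would then be passed as the second argument to .data(), for example .data(readings, readingKey).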

[Figure: Index binding versus key binding on a reordered array. After sorting, index binding remaps every node to a different datum while a stable key keeps each node attached to its original datum. Panels: index binding (A B C → C A B, node reused for a different datum → state drift); key binding, d => d.id (A B C → C A B, node follows its datum → state preserved).]
With a stable key, each DOM node tracks its logical datum across a reorder; index binding instead reassigns nodes by position and corrupts bound state.
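The same reorder can be traced in code. The sketch below is a minimal model, assuming a hypothetical StatefulNode that stands in for a DOM node holding UI state (a selected flag); rebindByIndex and rebindByKey are illustrative helpers, not D3 functions.

```typescript
interface Entity { id: string; value: number; }
interface StatefulNode { datum: Entity; selected: boolean; } // hypothetical stand-in for a DOM node

function rebindByIndex(nodes: StatefulNode[], data: ReadonlyArray<Entity>): void {
  nodes.forEach((node, i) => {
    const next = data[i];
    if (next) node.datum = next;                 // position decides which datum a node gets
  });
}

function rebindByKey(nodes: StatefulNode[], data: ReadonlyArray<Entity>): void {
  const byId = new Map<string, Entity>();
  for (const d of data) byId.set(d.id, d);
  for (const node of nodes) {
    const next = byId.get(node.datum.id);
    if (next) node.datum = next;                 // identity decides which datum a node gets
  }
}

const entityA: Entity = { id: 'A', value: 3 };
const entityB: Entity = { id: 'B', value: 1 };
const entityC: Entity = { id: 'C', value: 9 };
const originalOrder = [entityA, entityB, entityC];
const reordered = [entityC, entityA, entityB];   // the array after a sort

// Build fresh nodes with entity A selected, rebind, and report which entity the selected node shows.
function selectedIdAfter(rebind: (nodes: StatefulNode[], data: ReadonlyArray<Entity>) => void): string | undefined {
  const nodes = originalOrder.map((d) => ({ datum: d, selected: d.id === 'A' }));
  rebind(nodes, reordered);
  return nodes.find((n) => n.selected)?.datum.id;
}

const driftedId = selectedIdAfter(rebindByIndex); // 'C': the selected node now shows another entity
const stableId = selectedIdAfter(rebindByKey);    // 'A': the selected node still shows entity A
```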

Decision table: index binding vs explicit keys

| Axis | Index binding (default) | Explicit key (d => d.id) |
| --- | --- | --- |
| Cost | O(n), no map | O(n) build of an internal key map |
| Reorder / sort | Breaks: nodes reused by position | Stable: nodes follow data |
| Streaming inserts | Spurious enter/exit churn | Minimal, targeted enter/exit |
| Transition continuity | Lost on any reshuffle | Preserved per logical entity |
| When to use | Static, append-only arrays | Any sorted, filtered, or streamed data |

The choice is not really about node count; it is about whether array position is a reliable proxy for entity identity. For a fixed list rendered once and never reordered, position is identity, and index binding is both correct and marginally cheaper. The instant the array can be sorted, filtered, paginated, or have items inserted in the middle, position decouples from identity and index binding starts reusing the wrong node for the wrong datum. Because that decoupling usually arrives later — when a sort feature or a live feed is added months after the chart shipped — the safe default for any data that is not provably immutable is an explicit key. The marginal cost is a millisecond; the cost of the bug is a chart that animates the wrong bars to the wrong heights and detaches event listeners from under the user.

There is a third option the table omits because it is an anti-pattern: a “key” derived from a mutable field, such as the datum’s current value or its current sort rank. This looks stable in a quick test and fails the moment the underlying field changes, because the same logical entity now hashes to a new key and D3 treats it as a simultaneous exit and enter — the node is destroyed and recreated, losing its transition, listeners, and focus. A key must be an identity, not a property. If your records have no natural primary key, mint a stable UUID at ingestion and carry it through; never synthesize the key inside the callback from fields that change.
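To make the anti-pattern concrete, the sketch below counts how many keys survive from one snapshot to the next under two key choices. Quote and survivingKeys are hypothetical names; every key that fails to survive corresponds to one node being destroyed and recreated.

```typescript
interface Quote { id: string; price: number; }

// Count how many keys from the previous snapshot are still present in the next one.
function survivingKeys<T>(
  prev: ReadonlyArray<T>,
  next: ReadonlyArray<T>,
  key: (d: T) => string | number,
): number {
  const prevKeys = new Set(prev.map((d) => String(key(d))));
  return next.filter((d) => prevKeys.has(String(key(d)))).length;
}

const tick1: Quote[] = [{ id: 'AAA', price: 10 }, { id: 'BBB', price: 20 }];
const tick2: Quote[] = [{ id: 'AAA', price: 11 }, { id: 'BBB', price: 21 }];

const byIdentity = survivingKeys(tick1, tick2, (d) => d.id);    // 2: both nodes are reused
const byProperty = survivingKeys(tick1, tick2, (d) => d.price); // 0: both nodes exit and re-enter
```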

Reference spec

// d3-selection v7 — relevant signatures
selection.data<NewDatum>(
  data: NewDatum[],
  key?: (this: Element, datum: NewDatum, index: number, groups: Array<Element>) => string | number,
): Selection<GElement, NewDatum, PElement, PDatum>;

selection.enter(): Selection<EnterElement, Datum, PElement, PDatum>;
selection.exit(): Selection<GElement, Datum, PElement, PDatum>;
selection.join(
  enter: string | ((enter: Selection<...>) => Selection<...>),
  update?: (update: Selection<...>) => Selection<...> | undefined,
  exit?: (exit: Selection<...>) => void,
): Selection<...>;

The key callback must return a primitive. Returning undefined, null, or a non-stable value (a fresh object, Date.now(), Math.random()) collapses identity and forces full re-renders.

A few contract details are easy to miss and expensive to discover. The key callback receives three arguments — (datum, index, group) — and is invoked once per existing node and once per incoming datum, so it must be a pure function of the datum alone; reading index to build the key reintroduces the exact positional fragility you were trying to escape. The return value is coerced with String(), so 0, false, and "" are all valid keys but 0 and "0" collide, as do false and "false". The selection.enter() result is a selection of placeholder nodes, not real elements, which is why you must .append() to it before setting attributes; the selection.exit() result is real, still-attached nodes that you are responsible for removing. Finally, selection.join(enterFn, updateFn, exitFn) is pure sugar over enter/merge/exit — it does not supply a key. The key is always the second argument to the preceding .data() call, and forgetting it while using .join() is a common way to get index binding without realizing it.

// The key is on .data(), not on .join() — this is index binding despite the join:
selection.data(rows).join('rect');                 // ❌ positional
// Correct: key on .data(), then join branches:
selection.data(rows, (d: Row) => d.id).join('rect'); // ✅ identity
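The coercion rule described above is easy to check in isolation. The sketch below models it with String(); toKey and normalizeId are hypothetical helpers, and the normalizer shows one possible canonical form rather than a required one.

```typescript
// Model of the string coercion applied to key return values.
const toKey = (value: unknown): string => String(value);

const epoch = new Date(0);
const sameKey = {
  zeroAndString: toKey(0) === toKey('0'),                 // true: these collide
  falseAndString: toKey(false) === toKey('false'),        // true: these collide
  dateAndMillis: toKey(epoch) === toKey(epoch.getTime()), // false: these do not match
};

// Hypothetical normalizer: pick one canonical form per id type before binding.
function normalizeId(id: string | number | Date): string {
  return id instanceof Date ? String(id.getTime()) : String(id);
}
```

Running ids through a normalizer like this at ingestion keeps a Date and its millisecond value from ever reaching the join as two different keys.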

Step-by-step implementation

The pipeline below assumes ids are assigned upstream and the dataset arrives as an immutable snapshot per update. Treating each incoming array as immutable is what makes the join deterministic: D3 diffs the new snapshot against the datum objects already bound to the nodes, so if you mutate those previously bound objects in place you corrupt the baseline it compares against.

import { select } from 'd3-selection';
import { scaleBand, scaleLinear } from 'd3-scale';

interface Bar { id: string; category: string; value: number; }

const xScale = scaleBand<string>().range([0, 800]).padding(0.1);
const yScale = scaleLinear().range([300, 0]);

function renderBars(svg: SVGGElement, dataset: Bar[]): void {
  xScale.domain(dataset.map((d) => d.category));
  yScale.domain([0, Math.max(...dataset.map((d) => d.value))]);

  select(svg)
    .selectAll<SVGRectElement, Bar>('rect')
    .data(dataset, (d) => d.id) // PERF: precomputed id keeps the callback O(1)
    .join(
      (enter) => enter.append('rect').attr('width', xScale.bandwidth()),
      (update) => update,
      (exit) => exit.remove(),
    )
    .attr('x', (d) => xScale(d.category)!)
    .attr('y', (d) => yScale(d.value))
    .attr('width', xScale.bandwidth())
    .attr('height', (d) => 300 - yScale(d.value)); // A11Y: pair with role="img" + aria-label on <svg>
}
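The immutable-snapshot assumption stated before the pipeline can be kept with small helpers that always return a new array. This is a minimal sketch; BarRow mirrors the Bar shape used above, and sortedByValue and withValue are hypothetical names.

```typescript
interface BarRow { id: string; category: string; value: number; }

// Return a new sorted array; calling sort on the input would reorder the caller's array in place.
function sortedByValue(rows: ReadonlyArray<BarRow>): BarRow[] {
  return [...rows].sort((a, b) => b.value - a.value);
}

// Return a new array with one replaced record; untouched records keep their object identity.
function withValue(rows: ReadonlyArray<BarRow>, id: string, value: number): BarRow[] {
  return rows.map((row) => (row.id === id ? { ...row, value } : row));
}

const snapshot1: BarRow[] = [
  { id: 'a', category: 'Alpha', value: 4 },
  { id: 'b', category: 'Beta', value: 9 },
];
const snapshot2 = sortedByValue(withValue(snapshot1, 'a', 12));
// snapshot1 is unchanged; snapshot2 is a fresh array that a render function can bind by id.
```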

Performance & memory notes

  • Join cost: key resolution builds an internal map, so an explicit key adds O(n) work versus index binding’s pure positional pass. For 10k nodes this stays under ~2ms if the key callback does no string work.
  • Frame budget: defer scale recalculation until after the join completes, batch DOM writes inside one requestAnimationFrame, and prefer transform over x/y for animated moves to keep changes on the compositor.
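  • GC pressure: exit nodes that are never removed stay attached and accumulate in the document, and a removed node that a listener or cache still references stays on the heap as a detached subtree together with the datum on its __data__ property. Over a long session this accumulates into progressive GC pauses.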
  • Allocation in the callback: a key callback that does `${d.region}-${d.metric}` allocates a string per element per pass — two allocations per node per bind. At 10k nodes and 60 binds/second that is 1.2M short-lived strings/second, a visible sawtooth in the memory timeline. Precompute the composite id once at ingestion so the callback returns an existing reference.
  • Compositor-friendly updates: when the join’s update phase repositions nodes, animate transform rather than cx/cy so the move stays on the GPU compositor and skips layout, per transition and animation sequences.
  • Canvas/WebGL: there is no real DOM to bind, so simulate the join over a Map<string, RenderState>, dirty-check incoming keys, and push only changed state to buffers — one draw call per frame.

The virtual-join pattern is worth spelling out, because it is how you keep the conceptual benefits of keyed identity after you have outgrown the DOM. You maintain a Map<string, RenderState> keyed by the same stable id you would have used in SVG. On each frame, diff the incoming key set against the map’s key set: keys present in the incoming data but absent from the map are enter and get a new slot in a pre-allocated typed array; keys in both are update and have their slot rewritten in place; keys in the map but absent from the data are exit and have their slot freed (or compacted). The crucial performance property is that you never reallocate the buffer per frame — you reuse fixed-capacity Float32Array storage and call gl.bufferSubData only on the ranges that changed, then issue one drawArrays. This gives you O(changed) GPU upload cost rather than O(n), the same dirty-checking win that a keyed SVG join gets from reusing nodes.

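interface RenderState { x: number; y: number; slot: number; }

class VirtualJoin {
  private readonly state = new Map<string, RenderState>();
  private readonly ids: string[] = [];         // slot -> id; live slots stay dense in [0, size)
  private readonly buffer: Float32Array;       // pre-allocated, never grows in the loop
  constructor(private readonly capacity: number) {
    this.buffer = new Float32Array(capacity * 2);
  }
  reconcile(rows: ReadonlyArray<{ id: string; x: number; y: number }>): number {
    const seen = new Set<string>();
    for (const row of rows) seen.add(row.id);
    for (const [id, st] of this.state) {        // EXIT first: swap-remove so freed slots are reused
      if (seen.has(id)) continue;
      const lastId = this.ids.pop()!;
      this.state.delete(id);
      if (lastId !== id) {                      // move the last live entry into the freed slot
        const moved = this.state.get(lastId)!;
        moved.slot = st.slot;
        this.ids[moved.slot] = lastId;
        this.write(moved);
      }
    }
    for (const row of rows) {
      let st = this.state.get(row.id);
      if (!st) {                                // ENTER: claim the next free slot
        if (this.ids.length >= this.capacity) throw new RangeError('VirtualJoin capacity exceeded');
        st = { x: row.x, y: row.y, slot: this.ids.length };
        this.state.set(row.id, st);
        this.ids.push(row.id);
      } else {                                  // UPDATE: rewrite in place
        st.x = row.x; st.y = row.y;
      }
      this.write(st);
    }
    // PERF: caller uploads only buffer.subarray(0, this.state.size * 2) once per frame.
    return this.state.size;
  }
  private write(st: RenderState): void {        // copy one entry into its slot
    this.buffer[st.slot * 2] = st.x;
    this.buffer[st.slot * 2 + 1] = st.y;
  }
}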

Accessibility checklist

Troubleshooting

  • Symptom: rebinding new data renders nothing. Root cause: every incoming key matches a node that is already bound (the expected outcome when you replace the whole array with fresh objects that carry the same ids), so D3 classifies everything as update and your enter-only attribute writes never run. Fix: confirm keys are unique and stable, and verify that changed attributes are applied on the update branch, not only on enter; see fixing a D3 data join not updating on rebind.
  • Symptom: duplicate elements appear after a sort. Root cause: index binding plus a reorder, so shifted items look new and .enter() fires for elements that already exist while the originals are stranded. Fix: add (d) => d.id and use .join() so reorders resolve to update rather than enter.
  • Symptom: flicker on streaming updates. Root cause: non-deterministic keys (Date.now(), random salts, or a key built from a value that changes each tick), making every record a simultaneous exit and enter. Fix: namespace keys from immutable fields only; mint a UUID at ingestion if no natural id exists.
  • Symptom: heap climbs over time. Root cause: missing exit().remove(), or a listener closure or Map entry that still references a removed node and so keeps its subtree, and the datum on its __data__ property, reachable. Fix: always remove exits, detach listeners before removal, and key node caches with WeakMap; verify detached-node count returns to zero in a comparison heap snapshot.
  • Symptom: the same entity appears twice after merging two feeds. Root cause: key collision across sources — both feeds use a per-source counter as the id. Fix: namespace the key with the source (`${source}:${row.id}`) so identities stay disjoint, and assert key uniqueness with a Set size check before binding.
  • Symptom: numeric and string ids bind inconsistently. Root cause: D3 coerces 1 and "1" to the same key string, so two logically distinct ids collide in a DOM join; a hand-rolled Map does no coercion, so there the same record double-binds when it arrives under both forms. Fix: normalize key types at ingestion so the join never sees both forms.
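The namespacing and uniqueness fixes from the list above can be sketched as a pre-bind step. FeedRow, SourcedFeed, MergedRow, mergeFeeds, and assertUniqueKeys are hypothetical names, and the feed labels are made up for illustration.

```typescript
interface FeedRow { id: number | string; value: number; }
interface SourcedFeed { source: string; rows: ReadonlyArray<FeedRow>; }
interface MergedRow { key: string; value: number; }

// Namespace every id with its source so per-source counters cannot collide.
function mergeFeeds(feeds: ReadonlyArray<SourcedFeed>): MergedRow[] {
  const merged: MergedRow[] = [];
  for (const feed of feeds) {
    for (const row of feed.rows) {
      merged.push({ key: `${feed.source}:${String(row.id)}`, value: row.value });
    }
  }
  return merged;
}

// Set size check: fewer distinct keys than rows means at least one collision.
function assertUniqueKeys<T>(rows: ReadonlyArray<T>, key: (d: T) => string): void {
  const distinct = new Set(rows.map(key));
  if (distinct.size !== rows.length) {
    throw new Error(`key collision: ${rows.length - distinct.size} duplicate key(s)`);
  }
}

const mergedRows = mergeFeeds([
  { source: 'feedA', rows: [{ id: 1, value: 10 }] },
  { source: 'feedB', rows: [{ id: '1', value: 20 }] },
]);
assertUniqueKeys(mergedRows, (d) => d.key); // passes: 'feedA:1' and 'feedB:1' are disjoint
```

Running the assertion before every bind in development builds turns a silent collision into an immediate, attributable error.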

Frequently Asked Questions

When is index binding actually acceptable?

Only for static or strictly append-only data that is never sorted, filtered, or reordered. The moment array position can change relative to the logical entity it represents, index binding will reuse the wrong DOM node and corrupt bound state, so prefer an explicit key whenever there is any doubt.

What makes a good key?

A primitive that is unique per logical entity and stable across updates — typically a database primary key or UUID. Avoid keys derived from mutable fields, timestamps, or random values, since any change in the key forces D3 to treat the same entity as both an exit and an enter.

Does selection.join() replace the need for a key?

No. join() only sugars the enter/update/exit branching; it still binds by index unless you pass a key to the preceding .data() call. Always supply the key to .data() and then chain .join().

How do I key data on Canvas where there is no DOM?

Maintain a Map keyed by your stable identifier and diff incoming keys against it each frame to derive enter, update, and exit sets manually. Push only the changed entries to your typed-array buffers and issue a single draw call per frame.