Cumulative Pixel-Mask Batching
The single most clever idea in BBC line drawing. Invented (so far as the line-test analysis records) by “NJ” in nj-linedraw4.asm. Drops shallow-line cost from ~34c/pixel to ~5c/pixel on the no-y-step common path, and to ~8.75c/pixel amortised on horizontal lines.
See bresenham-line for the broader implementation comparison.
The problem it solves
In MODE 4 (8 pixels/byte), a near-horizontal Bresenham line will plot several pixels into the same screen byte before y-steps onto the next row. The naive approach reads the byte, EORs in one pixel, writes the byte — for every pixel. That’s LDA (scr),Y / EOR #imm / STA (scr),Y = 13c per pixel, plus error update and counting. The screen byte gets read 8 times and written 8 times to draw 8 pixels.
RTW improves on this with screen-byte caching: hold the byte in A across pixels in the same byte, write it out only on y-step or column change. Saves ~5c/pixel for horizontal runs.
NJ goes further.
The idea
Each pixel column within a byte has its own loop entry point (a00, a10, …, a70). At the entry, X is loaded with a cumulative mask covering this column and all columns to its right:
a00: LDX #$FF ; columns 0-7 traversed since last y-step
a10: LDX #$7F ; columns 1-7
a20: LDX #$3F ; columns 2-7
a30: LDX #$1F ; columns 3-7
a40: LDX #$0F ; columns 4-7
a50: LDX #$07 ; columns 5-7
a60: LDX #$03 ; columns 6-7
a70: LDX #$01 ; column 7 only
Key insight: when no y-step occurs, control falls through to the next column’s SBC directly, skipping its LDX #imm. X retains the wider mask from the original entry point — accumulating coverage column by column. The actual screen write happens only on y-step, when one EOR (scr),Y / STA (scr),Y commits all accumulated columns in a single RMW.
The hot path
For each pixel column, no-y-step case:
; (X already holds cumulative mask from earlier entry)
b50: SBC #dy ; 2c — dy is self-modified immediate
BCS b60 ; 3c — fall through to next column on no-y-step5 cycles per pixel on the pass-through path. No screen access. No counter decrement. No carry restoration.
The y-step batch write
When the BCS falls through (y-step needed), the accumulated mask gets committed:
ADC #dx ; 2c restore error
STA err ; 3c
TXA ; 2c recover cumulative mask from X
AND #&XX ; 2c trim to columns actually traversed
EOR (scr),Y ; 5c batch-write all pixels at once
STA (scr),Y ; 6c
DEC cnt ; 5c counter fires only on y-step
BEQ done ; 2c
DEY ; 2c
BPL nextrow ; 3c
; = 32 cycles for the batchFor 8 pixels in a byte: 7 × 5c + 32c + 3c = 70 cycles / 8 pixels = 8.75c/pixel. The screen is read once, written once. Approaching the theoretical floor.
For a 45° diagonal (y-step every pixel): no batching gain — degenerates to ~37c/pixel.
Why it works
Three properties combine:
- Falling through column entry points — code is unrolled per pixel column, but consecutive columns are physically adjacent in memory. The
BCSto the next column lands on itsSBC, past itsLDX #imm. - X carries state across loop iterations — Bresenham’s other implementations clobber X for indexing or counting. NJ reserves X for the cumulative mask.
- The mask values are monotonic —
$FF, $7F, $3F, ...— successive masks are strict subsets of earlier ones, so trimming at write time (TXA / AND #trim) yields exactly the pixels traversed since entry.
Costs
- Code size: 32 unrolled column bodies (8 columns × 4 octants) plus the steep loops. NJ totals 2365 bytes.
- Setup: ~100-120c. 16 SBC immediate operands need patching with dx/dy on every line.
- SMC: Cannot run from ROM. Self-modifying both setup operands and (implicitly) operates through
JMP (x1)indirect-through-table dispatch. - Complexity: the
ls(loop-segment) anderrsvariables handle the final partial column. Easy to get off-by-one wrong. The source is dense and minimally commented.
When to use it
- Long, near-horizontal lines (e.g. wireframe horizons, vector-mode terrain). Massive win.
- Game logic where line drawing is a clear hot path and 2 KB of code is acceptable.
When not to use it
- Short lines (<5 pixels). Setup cost dominates — see bresenham-line for break-even table.
- 45°-ish lines. Y-step every pixel → no batching, no win over Tricky.
- ROM code. Can’t SMC.
- Memory-tight projects. RTW at 297 B is 1/8 the size with comparable horizontal-line performance via byte-caching.
Adapting to other modes
- MODE 5/2 (colour): harder. The batched RMW needs both AND-mask (clear) and ORA-value (set) per pixel. Could be done with two accumulator registers, or by pre-composing the colour into the cumulative mask. Adds ~3c/pixel and complicates the trim mask.
- MODE 2: only 2 pixels per byte → batching gain is at most 1 extra pixel per write. Probably not worth the architecture change vs Raster’s pixel-pair approach.
- MODE 0/3: 8 pixels/byte like MODE 4 — directly applicable.
Adapting to other implementations
Tricky’s dispatch-table architecture is close enough to NJ’s that the cumulative-mask trick could be retrofitted. The doc (§8.4) explicitly flags this as the highest-impact improvement to Tricky: could reduce shallow per-pixel cost from ~34c to ~10c.
Cross-references
- bresenham-line — the comparison page
- deferred-counting — NJ’s other key innovation
- carry-chain-invariant — Raster’s analogous-but-different shallow trick
- 6502-isa — SBC imm vs SBC zp cycle cost
- line-drawing-implementations — full citation
This wiki is curated by Claude following the LLM-Wiki methodology — a human curates source documents, the LLM compiles structured cross-linked markdown. Content may contain errors, omissions, or stale claims. For authoritative information refer to the original source documents in the bbc-documents GitHub archive.