Vello: GPU Compute 2D Renderer

Processing · Reading Notes · Created Jan 4, 2025

Source

Vello GitHub — Linebender (Article)
Project: web-graphics-research
Tags: graphics, rust, gpu, rendering

Vello is a GPU compute-centric 2D renderer written in Rust. It is positioned as a successor to CPU-heavy renderers such as Skia and Cairo for next-generation graphics applications, though it is still alpha-quality.

What Makes It Different

Traditional renderers (Skia, Cairo) do significant work on CPU:

  • Path sorting
  • Clipping calculations
  • Tile command generation

Vello moves all of this to GPU compute shaders using parallel prefix-sum algorithms.

Result: Minimal temporary buffers, massive parallelism.

The Core Innovation: Prefix Sum

The key insight is that many sequential operations can be parallelized using prefix sum (scan).

The Problem

Consider path rendering. Each path produces a variable number of line segments. To write output, you need to know where each path’s output starts — which depends on how many segments all previous paths produced.

Sequential approach:

offset[0] = 0
offset[1] = offset[0] + segments[0]
offset[2] = offset[1] + segments[1]
...

This is inherently serial. O(n) steps.

Prefix sum approach:

// Pass 1: Compute segment counts (parallel)
segments = [3, 2, 5, 1, 4, ...]

// Pass 2: Prefix sum (parallel!)
offsets = prefix_sum(segments) = [0, 3, 5, 10, 11, 15, ...]

// Pass 3: Write output at offsets (parallel)

Prefix sum can be computed in O(log n) parallel steps using tree reduction.
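To make the O(log n) claim concrete, here is a minimal CPU sketch of an exclusive scan. It uses the Hillis-Steele doubling scheme rather than a work-efficient tree scan, simply because it is shorter; the function name `exclusive_scan` is made up for this note and is not Vello code. Each pass of the `while` loop stands in for one parallel GPU step.

```rust
/// Exclusive prefix sum via Hillis-Steele doubling (illustrative only).
/// Returns the offsets and the number of doubling passes that were needed.
fn exclusive_scan(counts: &[u32]) -> (Vec<u32>, u32) {
    let n = counts.len();
    // Shift the input right by one so the inclusive scan below
    // produces exclusive offsets: [3, 2, 5] becomes [0, 3, 2].
    let mut cur: Vec<u32> = std::iter::once(0)
        .chain(counts.iter().copied())
        .take(n)
        .collect();

    let mut passes = 0;
    let mut stride = 1;
    while stride < n {
        // On a GPU, every element of this pass would be handled by its own thread.
        let next: Vec<u32> = (0..n)
            .map(|i| if i >= stride { cur[i] + cur[i - stride] } else { cur[i] })
            .collect();
        cur = next;
        stride *= 2;
        passes += 1;
    }
    (cur, passes)
}
```

For the five segment counts `[3, 2, 5, 1, 4]` this returns the offsets `[0, 3, 5, 10, 11]` after 3 passes, versus 4 dependent additions in the sequential version. The gap widens quickly: a million elements need only 20 passes.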

Decoupled Look-Back Algorithm

Vello uses the decoupled look-back algorithm, which achieves near-memcpy speeds (~46G elements/s on AMD 5700 XT).

The key insight: Instead of synchronizing globally, each workgroup:

  1. Computes its local sum
  2. Writes a “partial” flag to global memory (a storage buffer visible to all workgroups)
  3. Looks back at previous workgroups
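  4. When all predecessors are done, writes “complete” flag with final sum

// Simplified look-back
const FLAG_NONE: u32 = 0u;      // nothing published yet
const FLAG_PARTIAL: u32 = 1u;   // partials[i] holds the workgroup's local sum
const FLAG_COMPLETE: u32 = 2u;  // partials[i] holds the sum up to and including workgroup i

@group(0) @binding(0) var<storage, read_write> flags: array<atomic<u32>>;
@group(0) @binding(1) var<storage, read_write> partials: array<u32>;

// Returns the sum of all workgroups before `workgroup_id`.
fn look_back(workgroup_id: u32) -> u32 {
    var sum = 0u;
    var i = workgroup_id;

    // Walk backwards. Workgroup 0 has no predecessors, so it skips the loop
    // instead of underflowing the index.
    while i > 0u {
        let flag = atomicLoad(&flags[i - 1u]);
        if flag == FLAG_COMPLETE {
            return sum + partials[i - 1u];  // Done!
        } else if flag == FLAG_PARTIAL {
            sum += partials[i - 1u];
            i -= 1u;
        }
        // Spin if FLAG_NONE
    }
    return sum;
}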

This achieves a 1.5x speedup over tree reduction because workgroups never wait at a global barrier.
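The look-back idea can also be tried on the CPU. The sketch below is an illustration, not Vello's implementation: it assumes one OS thread per chunk in place of a GPU workgroup, and it packs the flag and the value into a single `AtomicU64` so a reader can never see a flag without its matching value. The names `look_back_scan`, `pack`, and `unpack` are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

const FLAG_NONE: u64 = 0; // nothing published yet
const FLAG_PARTIAL: u64 = 1; // value = this chunk's local sum
const FLAG_COMPLETE: u64 = 2; // value = sum of this chunk and all earlier ones

fn pack(flag: u64, value: u32) -> u64 {
    (flag << 32) | value as u64
}

fn unpack(state: u64) -> (u64, u32) {
    (state >> 32, state as u32)
}

/// Exclusive prefix sum of `data`, one thread per `chunk`-sized slice.
/// `chunk` must be non-zero.
fn look_back_scan(data: &[u32], chunk: usize) -> Vec<u32> {
    let n_chunks = data.len().div_ceil(chunk);
    let states: Vec<AtomicU64> = (0..n_chunks)
        .map(|_| AtomicU64::new(pack(FLAG_NONE, 0)))
        .collect();
    let mut out = vec![0u32; data.len()];

    thread::scope(|s| {
        let chunks = data.chunks(chunk).zip(out.chunks_mut(chunk));
        for (id, (input, output)) in chunks.enumerate() {
            let states = &states;
            s.spawn(move || {
                // Steps 1 and 2: compute the local sum and publish it as "partial".
                let local: u32 = input.iter().sum();
                states[id].store(pack(FLAG_PARTIAL, local), Ordering::Release);

                // Step 3: look back until a "complete" predecessor is found.
                let mut prefix = 0u32;
                let mut i = id;
                while i > 0 {
                    let (flag, value) = unpack(states[i - 1].load(Ordering::Acquire));
                    match flag {
                        FLAG_COMPLETE => {
                            prefix += value;
                            break;
                        }
                        FLAG_PARTIAL => {
                            prefix += value;
                            i -= 1;
                        }
                        _ => thread::yield_now(), // predecessor not started yet
                    }
                }

                // Step 4: publish the inclusive total, then write local offsets.
                states[id].store(pack(FLAG_COMPLETE, prefix + local), Ordering::Release);
                let mut running = prefix;
                for (o, x) in output.iter_mut().zip(input) {
                    *o = running;
                    running += x;
                }
            });
        }
    });
    out
}
```

Calling `look_back_scan(&[3, 2, 5, 1, 4], 2)` yields `[0, 3, 5, 10, 11]` regardless of how the threads are scheduled. No thread ever waits for all the others; it only waits for the nearest predecessor that has not yet published anything.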

Performance

177 fps on the paris-30k test scene (M1 Max, 1600px square).

For context, this is a complex scene with 30,000 path elements, rendered in about 5.6 ms per frame, which is well above interactive rates.

Rendering Pipeline

The pipeline has four compute shader stages. The first three use prefix sum to size their output; the last one rasterizes pixels:

Scene Graph (CPU)
    ↓ encoding (scene → GPU buffers)
┌─────────────────────────────────────────┐
│ flatten.wgsl                            │
│   • Bézier curves → line segments       │
│   • Adaptive subdivision based on zoom  │
│   • Uses prefix sum for segment counts  │
└─────────────────────┬───────────────────┘

┌─────────────────────┴───────────────────┐
│ binning.wgsl                            │
│   • Spatial sorting into tiles          │
│   • Each tile: list of overlapping paths│
│   • Prefix sum for bin offsets          │
└─────────────────────┬───────────────────┘

┌─────────────────────┴───────────────────┐
│ coarse.wgsl                             │
│   • Per-Tile Command List (PTCL)        │
│   • Compacts draw commands per tile     │
│   • Prefix sum for command offsets      │
└─────────────────────┬───────────────────┘

┌─────────────────────┴───────────────────┐
│ fine.wgsl                               │
│   • Actual pixel rasterization          │
│   • Configurable AA: Area/MSAA8/MSAA16  │
│   • Reads PTCL, writes to framebuffer   │
└─────────────────────────────────────────┘
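To illustrate why flattening produces a variable, zoom-dependent number of segments, here is a sketch of the textbook error bound for a quadratic Bézier: splitting it into n equal-parameter pieces leaves a chord error of at most |p0 - 2·p1 + p2| / (4n²). This is a generic technique chosen for brevity and should not be read as the algorithm in flatten.wgsl; `Point`, `quad_segment_count`, and `flatten_quad` are names invented for this note.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

/// Number of equal-parameter line segments needed so the polyline stays
/// within `tolerance` output pixels of the curve at the given zoom `scale`.
fn quad_segment_count(p0: Point, p1: Point, p2: Point, scale: f64, tolerance: f64) -> u32 {
    // Second difference of the control points, measured in output pixels.
    let ddx = (p0.x - 2.0 * p1.x + p2.x) * scale;
    let ddy = (p0.y - 2.0 * p1.y + p2.y) * scale;
    let dd = ddx.hypot(ddy);
    // Solve dd / (4 n^2) <= tolerance for n; a straight line still needs one segment.
    ((dd / (4.0 * tolerance)).sqrt().ceil() as u32).max(1)
}

/// Evaluates the curve at n + 1 evenly spaced parameters (n must be >= 1).
fn flatten_quad(p0: Point, p1: Point, p2: Point, n: u32) -> Vec<Point> {
    (0..=n)
        .map(|i| {
            let t = i as f64 / n as f64;
            let mt = 1.0 - t;
            Point {
                x: mt * mt * p0.x + 2.0 * mt * t * p1.x + t * t * p2.x,
                y: mt * mt * p0.y + 2.0 * mt * t * p1.y + t * t * p2.y,
            }
        })
        .collect()
}
```

With control points (0, 0), (50, 100), (100, 0) and a 0.25 px tolerance, the count is 15 at scale 1 and 29 at scale 4. Those per-curve counts are exactly the kind of values that get fed into a prefix sum to decide where each curve writes its segments.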

Why Four Stages?

Each stage transforms data in a way that produces variable-length output. Prefix sum allows computing output offsets without CPU round-trips.

Key insight: The CPU never needs to know intermediate sizes. The GPU figures it out entirely on its own.
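The count, scan, write pattern behind each stage can be shown in a few lines. The sketch below is a CPU stand-in under simple assumptions: each input item is a hypothetical `(value, count)` pair that expands into `count` outputs, and `expand_runs` is a name made up for this note. The write pass deliberately runs in reverse order to show that, once the offsets exist, items can write in any order (and therefore in parallel) without overlapping.

```rust
/// Expands (value, count) pairs into one flat buffer using count -> scan -> write.
fn expand_runs(runs: &[(char, usize)]) -> Vec<char> {
    // Pass 1: each item reports how many outputs it will produce.
    let counts: Vec<usize> = runs.iter().map(|&(_, count)| count).collect();

    // Pass 2: exclusive scan turns counts into start offsets.
    // (Sequential here; this is the step a GPU runs as a parallel scan.)
    let mut offsets = Vec::with_capacity(counts.len());
    let mut total = 0;
    for &count in &counts {
        offsets.push(total);
        total += count;
    }

    // Pass 3: every item writes into its own disjoint range.
    // Reverse order demonstrates that the writes are order-independent.
    let mut out = vec![' '; total];
    for (i, &(value, count)) in runs.iter().enumerate().rev() {
        out[offsets[i]..offsets[i] + count].fill(value);
    }
    out
}
```

For example, `expand_runs(&[('a', 3), ('b', 0), ('c', 2)])` produces `aaacc`. The total output size falls out of the scan as a by-product, which is why nothing has to be read back to the CPU between stages.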

Tile-Based Rendering

Vello divides the canvas into 16×16 pixel tiles. Benefits:

  • Only process tiles that changed (partial refresh)
  • Bounded memory per tile
  • Efficient cache usage
  • Natural parallelization (one workgroup per tile)
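As a rough illustration of the tile grid, the sketch below maps a pixel-space bounding box to the range of 16×16 tiles it touches, clamped to the canvas. The function name `tile_range` and the tuple layout are assumptions made for this note, not Vello's binning code.

```rust
const TILE_SIZE: u32 = 16;

/// Half-open tile range (x0, y0, x1, y1) covered by a pixel-space bounding
/// box (min_x, min_y, max_x, max_y), clamped to a width x height canvas.
fn tile_range(bbox: (f32, f32, f32, f32), width: u32, height: u32) -> (u32, u32, u32, u32) {
    // Number of tiles per axis, rounding up for a partial last tile.
    let tiles_x = width.div_ceil(TILE_SIZE);
    let tiles_y = height.div_ceil(TILE_SIZE);
    let (min_x, min_y, max_x, max_y) = bbox;

    // Lower edges round down, upper edges round up; both clamp to the grid.
    let lo = |v: f32, limit: u32| ((v / TILE_SIZE as f32).floor().max(0.0) as u32).min(limit);
    let hi = |v: f32, limit: u32| ((v / TILE_SIZE as f32).ceil().max(0.0) as u32).min(limit);

    (lo(min_x, tiles_x), lo(min_y, tiles_y), hi(max_x, tiles_x), hi(max_y, tiles_y))
}

/// Number of tiles in a half-open range.
fn tile_count(range: (u32, u32, u32, u32)) -> u32 {
    let (x0, y0, x1, y1) = range;
    (x1 - x0) * (y1 - y0)
}
```

On a 1600×1600 canvas the grid is 100×100 tiles. A shape with bounding box (50, 50) to (150, 150) maps to the range (3, 3, 10, 10), so it touches only 49 of the 10,000 tiles.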

Supported Features

  • Shapes (paths, fills, strokes)
  • Images
  • Gradients (linear, radial)
  • Text (via skrifa/cosmic-text)
  • Blend modes
  • Clipping/masking

PostScript-inspired API, SVG-compatible.

Code Example

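use vello::kurbo::{Affine, Circle};
use vello::peniko::{Brush, Color, Fill};
use vello::Scene;

let mut scene = Scene::new();

// A circle of radius 50 centered at (100, 100).
let circle = Circle::new((100.0, 100.0), 50.0);
// Opaque red. Depending on the peniko version, this constructor
// may be named `Color::from_rgb8` instead.
let brush = Brush::Solid(Color::rgb8(255, 0, 0));

scene.fill(
    Fill::NonZero,     // fill rule
    Affine::IDENTITY,  // shape transform
    &brush,
    None,              // no separate brush transform
    &circle,
);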

Backends

Backend        Use Case
vello          Primary GPU-accelerated
vello_cpu      Software fallback
vello_hybrid   Mixed GPU/CPU

Integrations

  • Xilem: Linebender’s Rust GUI toolkit
  • Bevy: via bevy_vello crate
  • SVG: via vello_svg
  • Lottie: via velato

Comparison

Renderer     Approach            Best For
Skia         CPU + GPU hybrid    Mature, broad support
Cairo        Mostly CPU          Legacy, stable
Vello        GPU compute         Performance, modern apps
Pathfinder   GPU rasterization   Research, specific use cases

Vello is experimental but represents the performance frontier.

Current Limitations (Alpha)

  • Blur/filter effects in progress
  • Some conflation artifacts
  • Requires compute shader support (no fallback to old GPUs)

When to Use

  ✓ New Rust graphics project targeting modern hardware
  ✓ Performance-critical 2D rendering
  ✓ Targeting both native and WebGPU

  ✗ Need mature ecosystem today
  ✗ Must support old GPUs without compute shaders
  ✗ Already invested in Skia
