Vello: GPU Compute 2D Renderer | Research

Vello is a GPU compute-centric 2D renderer written in Rust. It is positioned as a potential replacement for Skia or Cairo in next-generation graphics applications.

What Makes It Different

Traditional renderers (Skia, Cairo) do significant work on CPU:

Path sorting
Clipping calculations
Tile command generation

Vello moves all of this to GPU compute shaders using parallel prefix-sum algorithms.

Result: Minimal temporary buffers, massive parallelism.

The Core Innovation: Prefix Sum

The key insight is that many sequential operations can be parallelized using prefix sum (scan).

The Problem

Consider path rendering. Each path produces a variable number of line segments. To write output, you need to know where each path’s output starts — which depends on how many segments all previous paths produced.

Sequential approach:

offset[0] = 0
offset[1] = offset[0] + segments[0]
offset[2] = offset[1] + segments[1]
...

This is inherently serial: each offset depends on the one before it, so the loop takes O(n) steps.

Prefix sum approach:

// Pass 1: Compute segment counts (parallel)
segments = [3, 2, 5, 1, 4, ...]

// Pass 2: Prefix sum (parallel!)
offsets = prefix_sum(segments) = [0, 3, 5, 10, 11, 15, ...]

// Pass 3: Write output at offsets (parallel)

A prefix sum can be computed in O(log n) parallel steps using tree reduction, given enough parallel lanes.
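
To make the logarithmic step count concrete, here is a minimal CPU sketch in Rust. It uses the Hillis-Steele formulation rather than the tree (up-sweep/down-sweep) form, because the rounds are easier to see; it does more total work than a tree scan, and it is not Vello's shader code. The function name `exclusive_scan_rounds` is made up for this illustration.

```rust
/// Exclusive prefix sum computed in data-parallel rounds (Hillis-Steele style).
/// Returns the offsets and the number of rounds that were needed.
fn exclusive_scan_rounds(counts: &[u32]) -> (Vec<u32>, u32) {
    let n = counts.len();
    let mut current = counts.to_vec();
    let mut rounds = 0;
    let mut stride = 1;
    while stride < n {
        // Every element of `next` depends only on `current`, so on a GPU
        // all n updates of one round could run at the same time.
        let next: Vec<u32> = (0..n)
            .map(|i| {
                if i >= stride {
                    current[i] + current[i - stride]
                } else {
                    current[i]
                }
            })
            .collect();
        current = next;
        stride *= 2;
        rounds += 1;
    }
    // `current` now holds the inclusive scan; shift right by one element
    // to get the exclusive scan, which is what output offsets need.
    let mut offsets = vec![0; n];
    for i in 1..n {
        offsets[i] = current[i - 1];
    }
    (offsets, rounds)
}
```

For the segment counts `[3, 2, 5, 1, 4]` used above, this returns the offsets `[0, 3, 5, 10, 11]` after 3 rounds (strides 1, 2 and 4), compared with 5 dependent steps for the sequential loop.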

Decoupled Look-Back Algorithm

Vello uses the decoupled look-back algorithm, which achieves near-memcpy speeds (~46G elements/s on AMD 5700 XT).

The key insight: Instead of synchronizing globally, each workgroup:

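Computes its local sum
Publishes that sum with a “partial” flag in global (storage buffer) memory, where other workgroups can read it
Looks back at preceding workgroups, adding up their partial sums until it reaches one marked “complete”
Publishes its own inclusive sum with a “complete” flag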

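// Simplified look-back. The FLAG_* values are illustrative.
const FLAG_NONE: u32 = 0u;      // nothing published yet
const FLAG_PARTIAL: u32 = 1u;   // partials[i] holds workgroup i's local sum
const FLAG_COMPLETE: u32 = 2u;  // partials[i] holds the inclusive sum through workgroup i

@group(0) @binding(0) var<storage, read_write> flags: array<atomic<u32>>;
@group(0) @binding(1) var<storage, read_write> partials: array<u32>;

// Returns the sum of everything that precedes this workgroup.
fn look_back(workgroup_id: u32) -> u32 {
    // Workgroup 0 has no predecessors; without this guard the index underflows.
    if workgroup_id == 0u {
        return 0u;
    }
    var sum = 0u;
    var i = workgroup_id - 1u;

    loop {
        let flag = atomicLoad(&flags[i]);
        if flag == FLAG_COMPLETE {
            return sum + partials[i];  // Done!
        } else if flag == FLAG_PARTIAL {
            // Assumes workgroup 0 publishes FLAG_COMPLETE directly, so i never passes 0.
            sum += partials[i];
            i -= 1u;
        }
        // Spin if FLAG_NONE. A production version must publish flag and value
        // together (or order the two accesses), and the spin relies on
        // forward-progress guarantees that WGSL does not promise.
    }
}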

This yields roughly a 1.5x speedup over tree reduction: workgroups never wait on a global barrier, and the input is read once rather than twice (about 2n memory operations instead of 3n).
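
To see the protocol run end to end, here is a CPU sketch in Rust in which each "workgroup" is an OS thread handling one chunk of the input. It is an analogy under stated assumptions, not Vello's implementation: the names `pack` and `chunk_prefixes` are invented for this example, and the flag and value are packed into a single 64-bit atomic so that a reader can never observe one without the other.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

const FLAG_NONE: u64 = 0; // nothing published yet
const FLAG_PARTIAL: u64 = 1; // value = this chunk's own sum
const FLAG_COMPLETE: u64 = 2; // value = sum of this chunk and all earlier ones

/// Packs flag and value into one word so both are published atomically.
fn pack(flag: u64, value: u32) -> u64 {
    (flag << 32) | value as u64
}

/// Returns, for each chunk, the sum of all elements in earlier chunks.
/// Panics if `chunk_len` is 0 (a property of `slice::chunks`).
fn chunk_prefixes(data: &[u32], chunk_len: usize) -> Vec<u32> {
    let chunks: Vec<&[u32]> = data.chunks(chunk_len).collect();
    let state: Vec<AtomicU64> = (0..chunks.len())
        .map(|_| AtomicU64::new(pack(FLAG_NONE, 0)))
        .collect();
    let state = &state;
    thread::scope(|s| {
        // Collect eagerly so every thread is spawned before any is joined.
        let handles: Vec<_> = chunks
            .iter()
            .enumerate()
            .map(|(id, chunk)| {
                s.spawn(move || {
                    let local: u32 = chunk.iter().sum();
                    if id == 0 {
                        // The first chunk's local sum is already its inclusive sum.
                        state[0].store(pack(FLAG_COMPLETE, local), Ordering::Release);
                        return 0;
                    }
                    state[id].store(pack(FLAG_PARTIAL, local), Ordering::Release);
                    let mut exclusive = 0u32;
                    let mut i = id - 1;
                    loop {
                        let word = state[i].load(Ordering::Acquire);
                        let (flag, value) = (word >> 32, word as u32);
                        if flag == FLAG_COMPLETE {
                            exclusive += value;
                            break;
                        } else if flag == FLAG_PARTIAL {
                            exclusive += value;
                            i -= 1; // chunk 0 is never PARTIAL, so no underflow
                        } else {
                            std::hint::spin_loop(); // predecessor has not published yet
                        }
                    }
                    state[id].store(pack(FLAG_COMPLETE, exclusive + local), Ordering::Release);
                    exclusive
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Calling `chunk_prefixes` on the numbers 1 through 20 with a chunk length of 4 gives `[0, 10, 36, 78, 136]`. OS threads are always eventually scheduled, which is what makes the spin loop safe here; on a GPU the same loop depends on the driver scheduling the predecessor workgroup.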

Performance

177 fps on the paris-30k test scene (M1 Max, 1600px square).

For context, this is a complex scene with 30,000 path elements, rendered well above interactive frame rates.

Rendering Pipeline

The pipeline has four main compute shader stages. The first three use prefix sums to place their variable-length output:

Scene Graph (CPU)
    ↓ encoding (scene → GPU buffers)
┌─────────────────────────────────────────┐
│ flatten.wgsl                            │
│   • Bézier curves → line segments       │
│   • Adaptive subdivision based on zoom  │
│   • Uses prefix sum for segment counts  │
└─────────────────────┬───────────────────┘
                      ↓
┌─────────────────────┴───────────────────┐
│ binning.wgsl                            │
│   • Spatial sorting into tiles          │
│   • Each tile: list of overlapping paths│
│   • Prefix sum for bin offsets          │
└─────────────────────┬───────────────────┘
                      ↓
┌─────────────────────┴───────────────────┐
│ coarse.wgsl                             │
│   • Per-Tile Command List (PTCL)        │
│   • Compacts draw commands per tile     │
│   • Prefix sum for command offsets      │
└─────────────────────┬───────────────────┘
                      ↓
┌─────────────────────┴───────────────────┐
│ fine.wgsl                               │
│   • Actual pixel rasterization          │
│   • Configurable AA: Area/MSAA8/MSAA16  │
│   • Reads PTCL, writes to framebuffer   │
└─────────────────────────────────────────┘
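
The flatten stage is where the variable segment counts come from. As a rough illustration of why zoom changes the count, the Rust sketch below estimates how many line segments a quadratic Bézier needs for a given tolerance, using the standard chord-error bound for uniform subdivision. It is not taken from flatten.wgsl, which uses its own subdivision scheme; the function names and the `scale` parameter are assumptions made for this example.

```rust
type Point = (f64, f64);

/// Number of equal parameter steps needed so that each chord stays within
/// `tolerance` of the quadratic Bézier p0, p1, p2 after scaling by `scale`.
fn quad_segment_count(p0: Point, p1: Point, p2: Point, scale: f64, tolerance: f64) -> u32 {
    // The second difference p0 - 2*p1 + p2 measures how strongly the curve bends.
    let ddx = (p0.0 - 2.0 * p1.0 + p2.0) * scale;
    let ddy = (p0.1 - 2.0 * p1.1 + p2.1) * scale;
    let bend = ddx.hypot(ddy);
    // With n steps the chord error is at most bend / (4 * n^2); solve for n.
    let n = (bend / (4.0 * tolerance)).sqrt().ceil();
    n.max(1.0) as u32
}

/// Evaluates the curve at n + 1 evenly spaced parameters (n must be >= 1).
fn flatten_quad(p0: Point, p1: Point, p2: Point, n: u32) -> Vec<Point> {
    (0..=n)
        .map(|i| {
            let t = i as f64 / n as f64;
            let mt = 1.0 - t;
            (
                mt * mt * p0.0 + 2.0 * mt * t * p1.0 + t * t * p2.0,
                mt * mt * p0.1 + 2.0 * mt * t * p1.1 + t * t * p2.1,
            )
        })
        .collect()
}
```

For the curve (0, 0), (50, 100), (100, 0) with a tolerance of 0.25 px, this gives 15 segments at scale 1 and 29 segments at scale 4. Counts like these are what the prefix sum turns into output offsets.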

Why Four Stages?

Each of the first three stages transforms data in a way that produces variable-length output. Prefix sums let the GPU compute the output offsets for the next stage without CPU round-trips.

Key insight: The CPU never needs to know intermediate sizes. The GPU figures it out entirely on its own.
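
The same count, scan, write pattern repeats at each stage. As a hedged illustration, the Rust sketch below applies it to binning on the CPU: it groups (item, tile) pairs into one compact array with a start offset per tile. The function `bin_by_tile` and its data layout are assumptions made for this example and do not mirror the buffers used by binning.wgsl.

```rust
/// Groups (item, tile) pairs into one compact array.
/// Returns (offsets, items): tile t owns items[offsets[t]..offsets[t + 1]].
/// Every tile index in `pairs` must be less than `tile_count`.
fn bin_by_tile(pairs: &[(u32, usize)], tile_count: usize) -> (Vec<usize>, Vec<u32>) {
    // Pass 1: count how many entries land in each tile.
    let mut counts = vec![0usize; tile_count];
    for &(_, tile) in pairs {
        counts[tile] += 1;
    }

    // Pass 2: an exclusive prefix sum turns counts into start offsets.
    // The final running total is the exact size of the output buffer.
    let mut offsets = Vec::with_capacity(tile_count + 1);
    let mut running = 0;
    for &count in &counts {
        offsets.push(running);
        running += count;
    }
    offsets.push(running);

    // Pass 3: scatter each item into the next free slot of its tile.
    let mut cursor = offsets[..tile_count].to_vec();
    let mut items = vec![0u32; running];
    for &(item, tile) in pairs {
        items[cursor[tile]] = item;
        cursor[tile] += 1;
    }
    (offsets, items)
}
```

With the pairs `[(0, 2), (1, 0), (2, 2), (3, 1), (4, 2)]` and 4 tiles, the offsets are `[0, 1, 2, 5, 5]` and the items are `[1, 3, 0, 2, 4]`, so tile 2 owns items 0, 2 and 4 and tile 3 is empty. On a GPU, passes 1 and 3 would run one thread per pair with atomic counters, and pass 2 is the prefix sum.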

Tile-Based Rendering

Vello divides the canvas into 16×16 pixel tiles. Benefits:

Only process tiles that changed (partial refresh)
Bounded memory per tile
Efficient cache usage
Natural parallelization (one workgroup per tile)
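
The tile arithmetic itself is simple. The Rust sketch below shows how a canvas size maps to a tile grid and which tiles a pixel-space bounding box touches. The helper names, the bounding-box convention and the clamping behaviour are assumptions made for this example; only the 16-pixel tile size comes from the text above.

```rust
use std::ops::Range;

const TILE_SIZE: u32 = 16;

/// Number of tile columns and rows needed to cover the canvas,
/// rounding partial tiles up.
fn tile_grid(width: u32, height: u32) -> (u32, u32) {
    (
        (width + TILE_SIZE - 1) / TILE_SIZE,
        (height + TILE_SIZE - 1) / TILE_SIZE,
    )
}

/// Tile column and row ranges touched by a bounding box given in pixels as
/// (x0, y0, x1, y1), clamped to the grid. Empty ranges mean "off canvas".
fn tiles_for_bbox(bbox: (f32, f32, f32, f32), grid: (u32, u32)) -> (Range<u32>, Range<u32>) {
    let ts = TILE_SIZE as f32;
    // Float-to-int casts saturate in Rust, so negative values become 0.
    let lo = |v: f32, limit: u32| ((v / ts).floor().max(0.0) as u32).min(limit);
    let hi = |v: f32, limit: u32| ((v / ts).ceil().max(0.0) as u32).min(limit);
    let (x0, y0, x1, y1) = bbox;
    (lo(x0, grid.0)..hi(x1, grid.0), lo(y0, grid.1)..hi(y1, grid.1))
}
```

A 1600×1600 canvas becomes a 100×100 grid. A circle of radius 50 centred at (100, 100) has the bounding box (50, 50, 150, 150), which touches tile columns 3 to 9 and rows 3 to 9, so only 49 of the 10,000 tiles need to consider it.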

Supported Features

Shapes (paths, fills, strokes)
Images
Gradients (linear, radial)
Text (via skrifa/cosmic-text)
Blend modes
Clipping/masking

PostScript-inspired API, SVG-compatible.

Code Example

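use vello::kurbo::{Affine, Circle};
use vello::peniko::{Brush, Color, Fill};
use vello::Scene;

let mut scene = Scene::new();

let circle = Circle::new((100.0, 100.0), 50.0);
// Newer peniko releases spell this constructor Color::from_rgb8.
let brush = Brush::Solid(Color::rgb8(255, 0, 0));

// Fill the circle with solid red.
scene.fill(
    Fill::NonZero,     // fill rule
    Affine::IDENTITY,  // transform applied to the shape
    &brush,
    None,              // no separate brush transform
    &circle,
);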

Backends

Backend	Use Case
`vello`	Primary GPU-accelerated
`vello_cpu`	Software fallback
`vello_hybrid`	Mixed GPU/CPU

Integrations

Xilem: Linebender’s Rust GUI toolkit
Bevy: via bevy_vello crate
SVG: via vello_svg
Lottie: via velato

Comparison

Renderer	Approach	Best For
Skia	CPU + GPU hybrid	Mature, broad support
Cairo	Mostly CPU	Legacy, stable
Vello	GPU compute	Performance, modern apps
Pathfinder	GPU rasterization	Research, specific use cases

Vello is still experimental, but it pushes the performance frontier for 2D rendering.

Current Limitations (Alpha)

Blur/filter effects in progress
Some conflation artifacts
Requires compute shader support (no fallback to old GPUs)

When to Use

✓ New Rust graphics project targeting modern hardware
✓ Performance-critical 2D rendering
✓ Targeting both native and WebGPU

✗ Need mature ecosystem today
✗ Must support old GPUs without compute shaders
✗ Already invested in Skia

Sources:

Vello GitHub
Vello Docs
Prefix Sum on Vulkan — Raph Levien’s deep dive
Prefix Sum on Portable Compute — Cross-platform implementation
GPUPrefixSums — Algorithm collection
Using Vello for Games