Vello: GPU Compute 2D Renderer
Vello is a GPU compute-centric 2D renderer written in Rust. It aims to fill the same role as Skia or Cairo in next-generation graphics applications.
What Makes It Different
Traditional renderers (Skia, Cairo) do significant work on CPU:
- Path sorting
- Clipping calculations
- Tile command generation
Vello moves all of this to GPU compute shaders using parallel prefix-sum algorithms.
Result: Minimal temporary buffers, massive parallelism.
The Core Innovation: Prefix Sum
The key insight is that many sequential operations can be parallelized using prefix sum (scan).
The Problem
Consider path rendering. Each path produces a variable number of line segments. To write output, you need to know where each path’s output starts — which depends on how many segments all previous paths produced.
Sequential approach:
offset[0] = 0
offset[1] = offset[0] + segments[0]
offset[2] = offset[1] + segments[1]
...
This is inherently serial. O(n) steps.
Prefix sum approach:
// Pass 1: Compute segment counts (parallel)
segments = [3, 2, 5, 1, 4, ...]
// Pass 2: Prefix sum (parallel!)
offsets = prefix_sum(segments) = [0, 3, 5, 10, 11, 15, ...]
// Pass 3: Write output at offsets (parallel)
Prefix sum can be computed in O(log n) parallel steps using tree reduction.
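To make the step count concrete, here is a minimal CPU sketch in Rust. It is not Vello's shader code: `exclusive_scan` and `hillis_steele_inclusive` are hypothetical names, and the parallel version uses the Hillis-Steele formulation, which is simpler than a work-efficient tree scan but takes the same O(log n) number of steps. Each iteration of the outer loop stands in for one parallel step in which every element could be handled by its own GPU thread.

```rust
/// Sequential exclusive prefix sum: offsets[i] = sum of counts[..i].
fn exclusive_scan(counts: &[u32]) -> Vec<u32> {
    let mut offsets = Vec::with_capacity(counts.len());
    let mut running = 0u32;
    for &c in counts {
        offsets.push(running);
        running += c;
    }
    offsets
}

/// Hillis-Steele inclusive scan, simulated on the CPU.
/// Returns the scanned values and the number of parallel steps used.
fn hillis_steele_inclusive(input: &[u32]) -> (Vec<u32>, u32) {
    let mut cur = input.to_vec();
    let mut steps = 0;
    let mut stride = 1;
    while stride < cur.len() {
        // Double-buffered: read only from `cur`, write only to `next`,
        // so all elements of one step are independent of each other.
        let next: Vec<u32> = (0..cur.len())
            .map(|i| if i >= stride { cur[i] + cur[i - stride] } else { cur[i] })
            .collect();
        cur = next;
        stride *= 2;
        steps += 1;
    }
    (cur, steps)
}
```

For the five segment counts above, `exclusive_scan(&[3, 2, 5, 1, 4])` gives `[0, 3, 5, 10, 11]`, and the Hillis-Steele version finishes in 3 steps, that is ceil(log2 5), rather than 5 serial additions.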
Decoupled Look-Back Algorithm
Vello uses the decoupled look-back algorithm, which achieves near-memcpy speeds (~46G elements/s on AMD 5700 XT).
The key insight: instead of synchronizing globally, each workgroup:
- Computes its local sum
- Publishes that sum to global memory (a storage buffer) with a “partial” flag
- Looks back at previous workgroups, accumulating the sums they have published
- Once it reaches a predecessor marked “complete”, writes its own “complete” flag with the final inclusive sum
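// Simplified look-back (sketch; flag values are illustrative)
const FLAG_NONE: u32 = 0u;     // nothing published yet
const FLAG_PARTIAL: u32 = 1u;  // partials[i] = local sum of workgroup i
const FLAG_COMPLETE: u32 = 2u; // prefixes[i] = inclusive prefix through workgroup i

@group(0) @binding(0) var<storage, read_write> flags: array<atomic<u32>>;
@group(0) @binding(1) var<storage, read_write> partials: array<u32>;
@group(0) @binding(2) var<storage, read_write> prefixes: array<u32>;

// Returns the exclusive prefix for `workgroup_id`.
fn look_back(workgroup_id: u32) -> u32 {
    if workgroup_id == 0u {
        return 0u; // No predecessors; also avoids u32 underflow below.
    }
    var sum = 0u;
    var i = workgroup_id - 1u;
    loop {
        let flag = atomicLoad(&flags[i]);
        if flag == FLAG_COMPLETE {
            return sum + prefixes[i]; // Done!
        } else if flag == FLAG_PARTIAL {
            sum += partials[i];
            i -= 1u; // Safe: workgroup 0 only ever publishes FLAG_COMPLETE.
        }
        // Otherwise FLAG_NONE: spin until workgroup i publishes.
        // A real shader also needs memory barriers and a forward-progress guarantee.
    }
}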
This is reported to be roughly 1.5x faster than tree reduction, because the scan completes in a single pass over the data and workgroups never wait on a global barrier.
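The protocol can be tried out on the CPU. The following Rust sketch is an illustration under stated assumptions, not Vello's implementation: each "workgroup" is an OS thread, its local sum is given up front, and the flag and value are packed into one `AtomicU64` so a reader always sees a matching pair (the packing scheme and the name `decoupled_look_back` are choices made for this sketch).

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

const FLAG_NONE: u64 = 0; // nothing published yet
const FLAG_PARTIAL: u64 = 1; // value = this block's local sum
const FLAG_COMPLETE: u64 = 2; // value = inclusive prefix through this block

fn pack(flag: u64, value: u32) -> u64 {
    (flag << 32) | value as u64
}

fn unpack(word: u64) -> (u64, u32) {
    (word >> 32, word as u32)
}

/// Returns the exclusive prefix of each block's local sum.
/// One thread per block stands in for one GPU workgroup.
fn decoupled_look_back(block_sums: &[u32]) -> Vec<u32> {
    let state: Vec<AtomicU64> = (0..block_sums.len())
        .map(|_| AtomicU64::new(pack(FLAG_NONE, 0)))
        .collect();
    let mut exclusive = vec![0u32; block_sums.len()];

    thread::scope(|s| {
        for (id, out) in exclusive.iter_mut().enumerate() {
            let state = &state;
            s.spawn(move || {
                let local = block_sums[id];
                if id == 0 {
                    // Block 0 has no predecessors: its sum is already final.
                    state[0].store(pack(FLAG_COMPLETE, local), Ordering::Release);
                    return;
                }
                // Publish the local sum so successors need not wait for us.
                state[id].store(pack(FLAG_PARTIAL, local), Ordering::Release);

                let mut sum = 0u32;
                let mut i = id - 1;
                loop {
                    let (flag, value) = unpack(state[i].load(Ordering::Acquire));
                    match flag {
                        FLAG_COMPLETE => {
                            sum += value; // inclusive prefix: stop here
                            break;
                        }
                        FLAG_PARTIAL => {
                            sum += value; // local sum only: keep looking back
                            i -= 1; // block 0 never reports PARTIAL, so no underflow
                        }
                        _ => std::hint::spin_loop(), // predecessor not published yet
                    }
                }
                *out = sum;
                state[id].store(pack(FLAG_COMPLETE, sum + local), Ordering::Release);
            });
        }
    });
    exclusive
}
```

With local sums `[3, 2, 5, 1, 4]` the function returns `[0, 3, 5, 10, 11]` regardless of the order in which the threads happen to run. On the CPU the spin loop is safe because the OS eventually schedules every thread; on a GPU the same loop depends on the hardware continuing to make progress on earlier workgroups.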
Performance
177 fps on the paris-30k test scene (M1 Max, 1600px square).
For context, this is a complex scene with roughly 30,000 path elements, and 177 fps works out to about 5.6 ms per frame, comfortably above interactive rates.
Rendering Pipeline
The pipeline has four main compute shader stages. The first three rely on prefix sums to place their variable-length output; the fine stage consumes the result:
              Scene Graph (CPU)
                      ↓ encoding (scene → GPU buffers)
┌─────────────────────────────────────────┐
│ flatten.wgsl                            │
│  • Bézier curves → line segments        │
│  • Adaptive subdivision based on zoom   │
│  • Uses prefix sum for segment counts   │
└─────────────────────┬───────────────────┘
                      ↓
┌─────────────────────┴───────────────────┐
│ binning.wgsl                            │
│  • Spatial sorting into tiles           │
│  • Each tile: list of overlapping paths │
│  • Prefix sum for bin offsets           │
└─────────────────────┬───────────────────┘
                      ↓
┌─────────────────────┴───────────────────┐
│ coarse.wgsl                             │
│  • Per-Tile Command List (PTCL)         │
│  • Compacts draw commands per tile      │
│  • Prefix sum for command offsets       │
└─────────────────────┬───────────────────┘
                      ↓
┌─────────────────────┴───────────────────┐
│ fine.wgsl                               │
│  • Actual pixel rasterization           │
│  • Configurable AA: Area/MSAA8/MSAA16   │
│  • Reads PTCL, writes to framebuffer    │
└─────────────────────────────────────────┘
Why Four Stages?
Each stage transforms data in a way that produces variable-length output. Prefix sum allows computing output offsets without CPU round-trips.
Key insight: The CPU never needs to know intermediate sizes. The GPU figures it out entirely on its own.
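The count, scan, write pattern behind these stages can be sketched on the CPU for the flatten step. This Rust example is an assumption-laden illustration rather than Vello's algorithm: it handles only quadratic Béziers, subdivides uniformly in the curve parameter, and picks the segment count from the standard error bound |p0 - 2·p1 + p2| / (4n²) ≤ tol. The types `Point` and `Quad` and the function `flatten` are hypothetical names. Zoom enters through `tol`, which is measured in the same units as the already-transformed control points.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

#[derive(Clone, Copy)]
struct Quad {
    p0: Point,
    p1: Point,
    p2: Point,
}

impl Quad {
    /// Evaluates the quadratic Bézier at parameter t in [0, 1].
    fn eval(&self, t: f64) -> Point {
        let mt = 1.0 - t;
        Point {
            x: mt * mt * self.p0.x + 2.0 * mt * t * self.p1.x + t * t * self.p2.x,
            y: mt * mt * self.p0.y + 2.0 * mt * t * self.p1.y + t * t * self.p2.y,
        }
    }

    /// Pass 1: how many segments keep uniform subdivision within `tol`.
    fn segment_count(&self, tol: f64) -> u32 {
        let ddx = self.p0.x - 2.0 * self.p1.x + self.p2.x;
        let ddy = self.p0.y - 2.0 * self.p1.y + self.p2.y;
        let dd = ddx.hypot(ddy);
        ((dd / (4.0 * tol)).sqrt().ceil() as u32).max(1)
    }
}

/// Returns each curve's output offset and the flattened line segments.
fn flatten(curves: &[Quad], tol: f64) -> (Vec<u32>, Vec<(Point, Point)>) {
    // Pass 1: per-curve counts (independent, so parallel on a GPU).
    let counts: Vec<u32> = curves.iter().map(|q| q.segment_count(tol)).collect();

    // Pass 2: exclusive prefix sum turns counts into output offsets.
    let mut offsets = Vec::with_capacity(counts.len());
    let mut total = 0u32;
    for &c in &counts {
        offsets.push(total);
        total += c;
    }

    // Pass 3: every curve writes into its own disjoint slice of the output.
    let zero = Point { x: 0.0, y: 0.0 };
    let mut out = vec![(zero, zero); total as usize];
    for (k, q) in curves.iter().enumerate() {
        let n = counts[k];
        let base = offsets[k] as usize;
        for j in 0..n {
            let t0 = j as f64 / n as f64;
            let t1 = (j + 1) as f64 / n as f64;
            out[base + j as usize] = (q.eval(t0), q.eval(t1));
        }
    }
    (offsets, out)
}
```

Because pass 3 only needs `offsets`, no curve has to wait for another curve's segments to be written, and the total buffer size falls out of the scan without the caller guessing it in advance.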
Tile-Based Rendering
Vello divides the canvas into 16×16 pixel tiles. Benefits:
- Only process tiles that changed (partial refresh)
- Bounded memory per tile
- Efficient cache usage
- Natural parallelization (one workgroup per tile)
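As a rough illustration of the tile arithmetic, the Rust sketch below maps a canvas size to a tile grid and a bounding box to the tiles it overlaps. The function names `tile_grid` and `tiles_for_bbox` and the row-major tile indexing are assumptions made for this example, not Vello's internal layout.

```rust
const TILE_SIZE: u32 = 16;

/// Number of tiles along each axis, rounding partial tiles up.
fn tile_grid(width_px: u32, height_px: u32) -> (u32, u32) {
    (
        (width_px + TILE_SIZE - 1) / TILE_SIZE,
        (height_px + TILE_SIZE - 1) / TILE_SIZE,
    )
}

/// Row-major indices of the tiles overlapped by an axis-aligned bounding box
/// given in pixels as (x0, y0, x1, y1), clamped to the tile grid.
fn tiles_for_bbox(bbox: (f32, f32, f32, f32), grid: (u32, u32)) -> Vec<u32> {
    let (x0, y0, x1, y1) = bbox;
    let (tiles_x, tiles_y) = grid;
    let ts = TILE_SIZE as f32;

    // Floor the minimum edge and ceil the maximum edge so that partially
    // covered tiles are included; clamp to the canvas.
    let tx0 = (x0 / ts).floor().max(0.0) as u32;
    let ty0 = (y0 / ts).floor().max(0.0) as u32;
    let tx1 = ((x1 / ts).ceil().max(0.0) as u32).min(tiles_x);
    let ty1 = ((y1 / ts).ceil().max(0.0) as u32).min(tiles_y);

    let mut tiles = Vec::new();
    for ty in ty0..ty1 {
        for tx in tx0..tx1 {
            tiles.push(ty * tiles_x + tx);
        }
    }
    tiles
}
```

A 1600×1600 canvas becomes a 100×100 grid of 10,000 tiles. A circle of radius 50 centred at (100, 100) has the bounding box (50, 50, 150, 150), which touches a 7×7 block of 49 tiles, so the remaining tiles never see that shape.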
Supported Features
- Shapes (paths, fills, strokes)
- Images
- Gradients (linear, radial)
- Text (via skrifa/cosmic-text)
- Blend modes
- Clipping/masking
PostScript-inspired API, SVG-compatible.
Code Example
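use vello::kurbo::{Affine, Circle};
use vello::peniko::{Brush, Color, Fill};
use vello::Scene;

// Builds a scene containing one red circle.
fn build_scene() -> Scene {
    let mut scene = Scene::new();
    let circle = Circle::new((100.0, 100.0), 50.0);
    // `Color::rgb8` matches older peniko releases; newer ones name it `Color::from_rgb8`.
    let brush = Brush::Solid(Color::rgb8(255, 0, 0));
    scene.fill(
        Fill::NonZero,    // fill rule
        Affine::IDENTITY, // transform applied to the shape
        &brush,
        None,             // no separate brush transform
        &circle,
    );
    scene
}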
Backends
| Backend | Use Case |
|---|---|
| vello | Primary GPU-accelerated |
| vello_cpu | Software fallback |
| vello_hybrid | Mixed GPU/CPU |
Integrations
- Xilem: Linebender’s Rust GUI toolkit
- Bevy: via the bevy_vello crate
- SVG: via vello_svg
- Lottie: via velato
Comparison
| Renderer | Approach | Best For |
|---|---|---|
| Skia | CPU + GPU hybrid | Mature, broad support |
| Cairo | Mostly CPU | Legacy, stable |
| Vello | GPU compute | Performance, modern apps |
| Pathfinder | GPU rasterization | Research, specific use cases |
Vello is experimental but represents the performance frontier.
Current Limitations (Alpha)
- Blur/filter effects in progress
- Some conflation artifacts
- Requires compute shader support (no fallback to old GPUs)
When to Use
✓ New Rust graphics project targeting modern hardware
✓ Performance-critical 2D rendering
✓ Targeting both native and WebGPU
✗ Need mature ecosystem today
✗ Must support old GPUs without compute shaders
✗ Already invested in Skia
Sources:
- Vello GitHub
- Vello Docs
- Prefix Sum on Vulkan — Raph Levien’s deep dive
- Prefix Sum on Portable Compute — Cross-platform implementation
- GPUPrefixSums — Algorithm collection
- Using Vello for Games