
Building Procreate for the Web

Processing · Literature Review · Created Jan 24, 2025
Project: web-graphics-research
Tags: graphics, architecture, painting, webgpu, synthesis

Building a Procreate-class painting application for the web is now technically feasible. This document synthesizes research into an optimal Chrome-first architecture.

The Core Challenge

Procreate is a raster-first painting application, fundamentally different from vector tools like Figma. Key requirements:

| Requirement | Procreate Approach | Web Challenge |
|---|---|---|
| High-resolution canvas | 16K × 4K on iPad Pro | GPU memory limits, tile streaming |
| Low-latency strokes | <10 ms touch-to-pixel | Main-thread blocking, GC pauses |
| Pressure sensitivity | Apple Pencil integration | Pointer Events API |
| 100+ blend modes | Metal shaders | WebGPU compute |
| Unlimited undo | Efficient diff storage | Memory management |
| Natural media feel | Stamp-based brushes | GPU stroke rendering |

Proposed Architecture

┌─────────────────────────────────────────────────────────────┐
│                      UI Layer                                │
│               TypeScript + Svelte/React                      │
│     (toolbar, layer panel, brush settings — NOT canvas)      │
└─────────────────────────┬───────────────────────────────────┘
                          │ wasm-bindgen / postMessage
┌─────────────────────────▼───────────────────────────────────┐
│                    Core Engine                               │
│                    Rust → WASM                               │
│  • Document model (layers, masks, groups)                    │
│  • Brush engine (stamp interpolation, dynamics)              │
│  • Undo/redo (tile-based diff storage)                       │
│  • Import/export (PSD, PNG, custom format)                   │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────┐
│                  Rendering Engine                            │
│                   wgpu + custom                              │
│  • Tile-based compositing (dirty rect optimization)          │
│  • GPU stroke rasterization                                  │
│  • Blend mode compute shaders                                │
│  • Filter pipeline (blur, color adjust)                      │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────┐
│                    Graphics API                              │
│                      WebGPU                                  │
│            (Chrome 113+, optimized path)                     │
└─────────────────────────────────────────────────────────────┘

Brush Engine Architecture

The brush engine is the heart of any painting app. Three fundamental approaches stand out:

1. Stamp-Based Rendering (Traditional)

Procreate uses a stamp brush model where strokes are formed by repeatedly “stamping” a brush shape along a path.

From Procreate’s Brush Studio:

“A stroke forms by ‘stamping’ the brush shape over and over again along a path.”

Key parameters:

  • Shape: The stamp texture (tip)
  • Grain: Texture applied inside the stamp
  • Spacing: Gap between stamps (0% = fluid stroke, 100% = dots)
  • Jitter: Randomization of position, rotation, size

Traditional GPU implementation:

for each point on stroke path:
    1. Calculate stamp position (with spacing)
    2. Apply jitter (lateral, linear, rotation)
    3. Map pressure → size, opacity
    4. Blend stamp texture onto canvas layer

Problem: At high DPI with small spacing, this creates thousands of overlapping alpha-blended quads per stroke, which is expensive and produces heavy overdraw.
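
To make the overdraw concrete, here is a minimal TypeScript sketch of the stamping loop for one stroke segment. The types and the emitQuad callback are illustrative assumptions; every loop iteration emits one alpha-blended quad.

interface StrokePoint { x: number; y: number; pressure: number; }

// Walk the segment at `spacing` (a fraction of stamp size) and emit one
// textured quad per stamp; emitQuad stands in for the actual GPU draw.
function stampSegment(
  a: StrokePoint, b: StrokePoint, baseSize: number, spacing: number,
  emitQuad: (x: number, y: number, size: number, opacity: number) => void,
): void {
  const dx = b.x - a.x, dy = b.y - a.y;
  const dist = Math.hypot(dx, dy);
  if (dist === 0) return;
  const step = Math.max(baseSize * spacing, 0.5); // never step by zero
  for (let d = 0; d <= dist; d += step) {
    const t = d / dist;
    const pressure = a.pressure + (b.pressure - a.pressure) * t;
    // One quad per stamp: at 1% spacing this is ~100 quads per stamp-width
    emitQuad(a.x + dx * t, a.y + dy * t, baseSize * pressure, pressure);
  }
}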

2. Continuous Stroke Integration (Modern)

Apoorva Joshi’s research replaces discrete stamps with mathematical integration:

“Rather than repeatedly stamping at discrete positions, treat the brush as continuously slid across the stroke axis.”

For any pixel (X, Y), the intensity is:

α(X,Y) = ∫[X₁ to X₂] f(x, X, Y) dx

Where X₁, X₂ are the leftmost/rightmost stamp centers affecting that pixel.

Advantage: Single quad per stroke, no overdraw, computed entirely in fragment shader.

3. GPU-Accelerated Vector Strokes (Ciallo)

Ciallo (SIGGRAPH 2024) introduces a hybrid vector-raster approach:

“GPU-based rendering techniques for digital painting that bridge the gap between raster and vector stroke representations.”

Three brush types:

  1. Vanilla strokes: Variable-width polylines, geometry shader tessellation
  2. Stamp brushes: GPU-computed stamp positions via prefix-sum in compute shader
  3. Airbrush: Resolution-independent opacity falloff

Key insight from the paper:

“A compute shader can calculate the prefix sum of edge length in parallel. By passing the values into the fragment shader, stamp positions on an edge can be calculated.”

This allows stamp density to vary based on stroke curvature while maintaining vector editability.
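
A sequential TypeScript analogue illustrates the idea; Ciallo computes the prefix sum in a parallel compute shader and reads it in the fragment shader, so this CPU version is only a sketch.

// Cumulative arc length (prefix sum of segment lengths) lets stamps be
// placed at even arc-length intervals regardless of vertex clustering.
function stampPositions(points: [number, number][], spacing: number): [number, number][] {
  if (points.length < 2) return [];
  const cum = [0];
  for (let i = 1; i < points.length; i++) {
    const [x0, y0] = points[i - 1], [x1, y1] = points[i];
    cum.push(cum[i - 1] + Math.hypot(x1 - x0, y1 - y0));
  }
  const out: [number, number][] = [];
  let seg = 1;
  for (let d = 0; d <= cum[cum.length - 1]; d += spacing) {
    while (seg < cum.length - 1 && cum[seg] < d) seg++; // segment containing d
    const t = (d - cum[seg - 1]) / (cum[seg] - cum[seg - 1] || 1);
    const [x0, y0] = points[seg - 1], [x1, y1] = points[seg];
    out.push([x0 + (x1 - x0) * t, y0 + (y1 - y0) * t]);
  }
  return out;
}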


Stroke Geometry

Tessellation Approach

For variable-width strokes, the GPU needs triangulated geometry:

Stroke polyline:  P₀ ─── P₁ ─── P₂ ─── P₃
                   │     │     │     │
                   w₀    w₁    w₂    w₃  (widths from pressure)

Tessellated mesh:
    ╱‾‾‾‾‾╲
   ╱       ╲      Expanded to quads perpendicular to stroke direction
  ╱_________╲     Miter or bevel joins at corners

Two GPU approaches:

  1. Geometry shader: Creates quads from line segments (desktop only)
  2. Instanced rendering: Pre-tessellated quad instances (WebGPU compatible)

From Ciallo:

“Both geometry shader and instanced rendering can be used—geometry shader for desktop programs on Windows and instanced rendering for the Web.”
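
A CPU-side TypeScript sketch of the expansion (an instanced or compute-shader version would do the same math on-GPU; the vertex layout here is an assumption):

// Expand a pressure-width polyline into a triangle-strip vertex buffer:
// one left/right vertex pair per point, offset along the local normal.
function tessellateStroke(points: { x: number; y: number; width: number }[]): Float32Array {
  const verts: number[] = [];
  for (let i = 0; i < points.length; i++) {
    const prev = points[Math.max(i - 1, 0)];
    const next = points[Math.min(i + 1, points.length - 1)];
    const dx = next.x - prev.x, dy = next.y - prev.y;   // stroke direction
    const len = Math.hypot(dx, dy) || 1;
    const nx = -dy / len, ny = dx / len;                // unit normal
    const h = points[i].width / 2;
    verts.push(points[i].x + nx * h, points[i].y + ny * h); // left edge
    verts.push(points[i].x - nx * h, points[i].y - ny * h); // right edge
  }
  return new Float32Array(verts); // draw as a triangle strip
}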

Stroke Smoothing

Raw input points are noisy. Catmull-Rom splines provide smooth interpolation:

“The beauty of Catmull-Rom splines is that the curve passes through all control points. Simply choose points in space and the path will pass through them smoothly.”

Use centripetal variant (α = 0.5) to prevent loops and cusps:

fn catmull_rom(p0: Vec2, p1: Vec2, p2: Vec2, p3: Vec2, t: f32, alpha: f32) -> Vec2 {
    // Knot intervals from distance^alpha; alpha = 0.5 is the centripetal variant.
    // (Assumes distinct control points; coincident points give zero intervals.)
    let t0 = 0.0;
    let t1 = t0 + (p1 - p0).length().powf(alpha);
    let t2 = t1 + (p2 - p1).length().powf(alpha);
    let t3 = t2 + (p3 - p2).length().powf(alpha);
    // Barry-Goldman pyramid: map t from [0, 1] onto the [t1, t2] knot span
    let t = t1 + t * (t2 - t1);
    let a1 = p0 * ((t1 - t) / (t1 - t0)) + p1 * ((t - t0) / (t1 - t0));
    let a2 = p1 * ((t2 - t) / (t2 - t1)) + p2 * ((t - t1) / (t2 - t1));
    let a3 = p2 * ((t3 - t) / (t3 - t2)) + p3 * ((t - t2) / (t3 - t2));
    let b1 = a1 * ((t2 - t) / (t2 - t0)) + a2 * ((t - t0) / (t2 - t0));
    let b2 = a2 * ((t3 - t) / (t3 - t1)) + a3 * ((t - t1) / (t3 - t1));
    b1 * ((t2 - t) / (t2 - t1)) + b2 * ((t - t1) / (t2 - t1))
}

Input Handling

Pointer Events API

The Pointer Events API provides unified stylus support:

canvas.addEventListener('pointermove', (e: PointerEvent) => {
  const point = {
    x: e.clientX,
    y: e.clientY,
    pressure: e.pressure,      // 0.0 - 1.0
    tiltX: e.tiltX,            // -90° to 90°
    tiltY: e.tiltY,            // -90° to 90°
    twist: e.twist,            // 0° to 359° (rotation)
    pointerType: e.pointerType // "pen" | "touch" | "mouse"
  };

  // Coalesce events for smooth strokes
  const coalesced = e.getCoalescedEvents();
  for (const ce of coalesced) {
    strokeEngine.addPoint(ce);
  }
});

Key properties:

  • pressure: Normalized 0-1, maps to brush size/opacity
  • tiltX/tiltY: Stylus angle, affects brush shape
  • twist: Barrel rotation for calligraphy brushes
  • tangentialPressure: Barrel pressure (-1 to 1)

Important: Use getCoalescedEvents() to capture high-frequency input that browsers batch for performance.

Prediction

For ultra-low latency, use getPredictedEvents() to draw ahead of the stylus:

const predicted = e.getPredictedEvents();
// Draw predicted points with lower opacity,
// replace when actual events arrive
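
A sketch of the replace-on-arrival pattern (redrawStroke is a hypothetical renderer entry point):

const committed: PointerEvent[] = [];      // real input, kept
let preview: readonly PointerEvent[] = []; // speculative, rebuilt every event

canvas.addEventListener('pointermove', (e: PointerEvent) => {
  committed.push(...e.getCoalescedEvents()); // actual samples
  preview = e.getPredictedEvents();          // old predictions are discarded
  redrawStroke(committed, preview);          // draw preview at lower opacity
});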

Layer Compositing

Blend Mode Implementation

WebGPU supports hardware blend modes, but not all Photoshop blend modes are built-in. Complex modes require compute shaders.

From WebGPU Fundamentals:

“We can set the blending mode, the primitive topology, and the depth/stencil state.”

Standard modes (hardware accelerated):

// Fragment output with premultiplied alpha
@fragment
fn fs_main(@location(0) color: vec4<f32>) -> @location(0) vec4<f32> {
    return vec4(color.rgb * color.a, color.a);
}

Pipeline blend state:

blend: {
  color: {
    srcFactor: 'one',          // Premultiplied source
    dstFactor: 'one-minus-src-alpha',
    operation: 'add'
  },
  alpha: {
    srcFactor: 'one',
    dstFactor: 'one-minus-src-alpha',
    operation: 'add'
  }
}

Complex modes (compute shader required):

  • Multiply, Screen, Overlay, Soft Light
  • Color Dodge, Color Burn
  • Difference, Exclusion
  • Hue, Saturation, Color, Luminosity
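
As a sketch of the compute path, here is a screen blend written as WGSL embedded in a TypeScript string. The binding layout is an assumption, the inputs are taken as straight (non-premultiplied) alpha for clarity, and the final mix is simplified source-over rather than the full W3C compositing equation.

const screenBlendWGSL = /* wgsl */ `
@group(0) @binding(0) var topTex: texture_2d<f32>;   // top layer, straight alpha
@group(0) @binding(1) var backTex: texture_2d<f32>;  // backdrop, straight alpha
@group(0) @binding(2) var outTex: texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(8, 8)
fn blend(@builtin(global_invocation_id) id: vec3<u32>) {
  let size = textureDimensions(topTex);
  if (id.x >= size.x || id.y >= size.y) { return; }
  let s = textureLoad(topTex, vec2<i32>(id.xy), 0);
  let b = textureLoad(backTex, vec2<i32>(id.xy), 0);
  // Screen: 1 - (1 - s)(1 - b), applied per channel
  let blended = vec3(1.0) - (vec3(1.0) - s.rgb) * (vec3(1.0) - b.rgb);
  let rgb = mix(b.rgb, blended, s.a);      // simplified source-over weighting
  let a = s.a + b.a * (1.0 - s.a);
  textureStore(outTex, vec2<i32>(id.xy), vec4(rgb, a));
}`;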

Premultiplied Alpha

Critical for correct compositing. From Limnu’s analysis:

“The browser defaults to compositing a WebGL canvas using premultiplied alpha because colors come out of the renderer in premultiplied form.”

Always work in premultiplied alpha internally:

premultiplied.rgb = straight.rgb * straight.a
premultiplied.a = straight.a
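
A pair of one-pixel TypeScript helpers (channels as 0..1 floats) shows the conversion and why the reverse direction needs a zero-alpha guard:

function premultiply(r: number, g: number, b: number, a: number): number[] {
  return [r * a, g * a, b * a, a];
}
function unpremultiply(r: number, g: number, b: number, a: number): number[] {
  return a === 0 ? [0, 0, 0, 0] : [r / a, g / a, b / a, a]; // avoid 0/0
}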

Tile-Based Rendering

Why Tiles?

For 16K × 16K canvases:

  • Raw size: 16384 × 16384 × 4 bytes = 1 GB per layer
  • 10 layers = 10 GB (impossible in browser)

Solution: Tile-based streaming with sparse allocation.

From Polycount discussion:

“Instead of storing full canvas textures, break the canvas into a grid and only store tiles that are modified. GPUs work better with fewer texture bindings, so use large atlases of tiles.”

Tile Architecture

Canvas Grid (16K × 16K, 256px tiles = 64×64 = 4096 tiles)
┌───┬───┬───┬───┬───┐
│   │ ▓ │ ▓ │   │   │   ▓ = dirty (needs re-render)
├───┼───┼───┼───┼───┤   (blank) = untouched (not allocated)
│   │ ▓ │ ▓ │ ▓ │   │   █ = clean (cached in GPU)
├───┼───┼───┼───┼───┤
│   │   │ █ │ █ │   │
└───┴───┴───┴───┴───┘

Only allocate tiles that contain paint. Untouched regions consume zero memory.
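
A sparse tile store can be as simple as a map keyed by tile coordinate; this TypeScript sketch assumes 256 px tiles and uses ImageData as a stand-in for GPU tile storage:

const TILE = 256;
const tiles = new Map<string, ImageData>(); // absent key = untouched, zero cost

// Allocate lazily: only when a stroke actually touches the tile
function tileFor(x: number, y: number): ImageData {
  const key = `${Math.floor(x / TILE)},${Math.floor(y / TILE)}`;
  let tile = tiles.get(key);
  if (!tile) {
    tile = new ImageData(TILE, TILE); // zero-initialized RGBA
    tiles.set(key, tile);
  }
  return tile;
}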

Tile Atlas

Pack allocated tiles into GPU texture atlases for efficient rendering:

struct TileAtlas {
    texture: wgpu::Texture,        // 4096×4096 atlas
    allocator: AtlasAllocator,     // Tracks free slots
    tile_map: HashMap<TileCoord, AtlasSlot>,
}

impl TileAtlas {
    fn get_or_allocate(&mut self, coord: TileCoord) -> AtlasSlot {
        self.tile_map.get(&coord).cloned()
            .unwrap_or_else(|| {
                let slot = self.allocator.alloc();
                self.tile_map.insert(coord, slot);
                slot
            })
    }
}

Performance target from Polycount:

“Running at 1000fps with a 4K × 4K canvas using half floats, the buffer is only 128MB of VRAM, allowing for 15 layers with just 2GB.”


Undo/Redo System

The Problem

Naive approach: Store full canvas snapshot per operation.

  • 4K × 4K × 4 bytes = 64 MB per snapshot
  • 100 undo levels = 6.4 GB

Tile-Based Diffs

Only store tiles that changed:

struct UndoOperation {
    affected_tiles: Vec<TileCoord>,
    old_data: HashMap<TileCoord, TileData>,  // Only changed tiles
    timestamp: Instant,
}

fn record_stroke(&mut self, stroke: &Stroke) {
    let affected = self.get_affected_tiles(stroke.bounds());
    let old_data = affected.iter()
        .map(|coord| (*coord, self.read_tile(*coord)))
        .collect();

    self.undo_stack.push(UndoOperation {
        affected_tiles: affected,
        old_data,
        timestamp: Instant::now(),
    });
}

From Pixelitor’s approach:

“Pixelitor only stores bitmaps for the regions affected by each operation.”

Memory Pressure Handling

When memory is constrained:

  1. Compress older undo entries (LZ4/zstd)
  2. Spill to IndexedDB
  3. Discard oldest entries

const UNDO_MEMORY_LIMIT = 512 * 1024 * 1024; // 512 MB

function trimUndoStack() {
    let totalSize = undoStack.reduce((sum, op) => sum + op.byteSize, 0);
    while (totalSize > UNDO_MEMORY_LIMIT && undoStack.length > 10) {
        const oldest = undoStack.shift();
        totalSize -= oldest.byteSize;
        oldest.dispose();
    }
}
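
Step 2, spilling to IndexedDB, could look like the following sketch; db is assumed to be an already-open IDBDatabase with an "undo" object store, and releaseInMemoryCopy is a hypothetical helper that frees the in-memory copy once the write commits:

function spillToIndexedDB(db: IDBDatabase, op: { id: number; bytes: ArrayBuffer }) {
  const tx = db.transaction('undo', 'readwrite');
  tx.objectStore('undo').put(op.bytes, op.id);      // key = operation id
  tx.oncomplete = () => releaseInMemoryCopy(op.id); // hypothetical helper
}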

Performance Architecture

OffscreenCanvas + Worker

Move rendering off the main thread:

// main.ts
const offscreen = canvas.transferControlToOffscreen();
const worker = new Worker('render-worker.js');
worker.postMessage({ type: 'init', canvas: offscreen }, [offscreen]);

// render-worker.js
self.onmessage = async (e) => {
    if (e.data.type === 'init') {
        const adapter = await navigator.gpu.requestAdapter();
        const device = await adapter.requestDevice();
        const context = e.data.canvas.getContext('webgpu');
        // Render loop runs entirely in worker
    }
};

From Chrome’s OffscreenCanvas documentation:

“Making canvas rendering contexts available to workers increases parallelism and makes better use of multi-core systems.”

Benefits:

  • UI remains responsive during heavy rendering
  • Input handling on main thread, rendering on worker
  • requestAnimationFrame() works in workers

SharedArrayBuffer for Zero-Copy

Share stroke data between threads without copying:

// Shared buffers: one for stroke points, one for the atomic write counter
const strokeBuffer = new SharedArrayBuffer(1024 * 1024);
const strokeView = new Float32Array(strokeBuffer);
const strokeCount = new Int32Array(new SharedArrayBuffer(4));
let writeIndex = 0;

// Main thread writes input
strokeView[writeIndex++] = point.x;
strokeView[writeIndex++] = point.y;
strokeView[writeIndex++] = point.pressure;
Atomics.store(strokeCount, 0, writeIndex / 3);

// Worker reads and renders new points since the last frame
let lastProcessed = 0;
const count = Atomics.load(strokeCount, 0);
for (let i = lastProcessed; i < count; i++) {
    const x = strokeView[i * 3];
    const y = strokeView[i * 3 + 1];
    const pressure = strokeView[i * 3 + 2];
    renderPoint(x, y, pressure);
}
lastProcessed = count;

Note: SharedArrayBuffer requires cross-origin isolation: serve the page with the Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers.


Compute Shaders for Effects

Gaussian Blur

From WebGPU Fundamentals:

“2D image processing is an excellent use case for WebGPU.”

Separable blur for efficiency (two 1D passes instead of one 2D):

@group(0) @binding(0) var inputTex: texture_2d<f32>;
@group(0) @binding(1) var outputTex: texture_storage_2d<rgba8unorm, write>;
@group(0) @binding(2) var<storage> weights: array<f32>;

@compute @workgroup_size(64, 1)
fn blur_horizontal(@builtin(global_invocation_id) id: vec3<u32>) {
    let size = textureDimensions(inputTex);
    if (id.x >= size.x || id.y >= size.y) { return; }

    var sum = vec4<f32>(0.0);
    let radius = arrayLength(&weights) / 2u;

    for (var i = 0u; i < arrayLength(&weights); i++) {
        let offset = i32(i) - i32(radius);
        let coord = vec2<i32>(i32(id.x) + offset, i32(id.y));
        let clamped = clamp(coord, vec2(0), vec2<i32>(size) - 1);
        sum += textureLoad(inputTex, clamped, 0) * weights[i];
    }

    textureStore(outputTex, vec2<i32>(id.xy), sum);
}
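
Dispatching the horizontal pass from TypeScript might look like this; the pipeline, bind group, and dimensions are assumed to exist already:

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(blurHorizontalPipeline); // compiled from the WGSL above
pass.setBindGroup(0, blurBindGroup);
// workgroup_size is (64, 1): cover the width in x, one row per workgroup in y
pass.dispatchWorkgroups(Math.ceil(width / 64), height);
pass.end();
device.queue.submit([encoder.finish()]);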

Workgroup Optimization

From Codrops tutorial:

“A general advice for WebGPU is to choose a workgroup size of 64.”

Use tile-based processing with shared memory:

var<workgroup> tile: array<vec4<f32>, 272>; // 16×17 with halo

@compute @workgroup_size(16, 16)
fn process_tile(@builtin(local_invocation_id) local_id: vec3<u32>,
                @builtin(workgroup_id) group_id: vec3<u32>) {
    // Load tile + halo into shared memory
    // Process with fast local memory access
    // Write results
}

Color Management

Display P3 Support

Modern displays support wide gamut. From WICG proposal:

const ctx = canvas.getContext('2d', { colorSpace: 'display-p3' });
ctx.fillStyle = 'color(display-p3 1 0.5 0)'; // Vivid orange outside sRGB

For WebGPU:

context.configure({
    device,
    format: navigator.gpu.getPreferredCanvasFormat(),
    colorSpace: 'display-p3', // If supported
    alphaMode: 'premultiplied'
});

ICC Profile Handling

For import/export, use jsColorEngine:

import { ColorEngine, Profile } from 'js-color-engine';

const engine = new ColorEngine();
const srgb = await Profile.fromURL('/profiles/sRGB.icc');
const p3 = await Profile.fromURL('/profiles/DisplayP3.icc');

const transform = engine.createTransform(srgb, p3);
const converted = transform.apply(imageData);

File Format Support

PSD Import/Export

@webtoon/psd is the modern choice:

  • Zero dependencies
  • WebAssembly acceleration
  • ~100 KB minified (vs 443 KB for PSD.js)

import Psd from '@webtoon/psd';

const psd = Psd.parse(arrayBuffer);
for (const layer of psd.layers) {
    console.log(layer.name, layer.opacity, layer.blendMode);
    const imageData = await layer.composite();
}

Limitations:

  • No CMYK support (converts to RGB)
  • Some adjustment layers not supported
  • Smart objects require special handling

Custom Format

For optimal performance, design a custom format:

Header:
  - Magic: "GPAINT"
  - Version: u32
  - Canvas size: u32 × u32
  - Layer count: u32

Layer Table:
  - Name: string
  - Blend mode: u8
  - Opacity: f32
  - Bounds: i32 × 4
  - Tile count: u32
  - Tile offsets: [u64]

Tile Data:
  - Compressed RGBA (LZ4 or zstd)
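
Since the format is this document's own proposal, the byte layout is free to choose; a minimal TypeScript sketch of writing the header (little-endian assumed):

function writeHeader(width: number, height: number, layerCount: number): ArrayBuffer {
  const buf = new ArrayBuffer(22);                                // 6-byte magic + 4 × u32
  new Uint8Array(buf).set(new TextEncoder().encode('GPAINT'), 0); // magic
  const view = new DataView(buf);
  view.setUint32(6, 1, true);            // version
  view.setUint32(10, width, true);       // canvas width
  view.setUint32(14, height, true);      // canvas height
  view.setUint32(18, layerCount, true);  // layer count
  return buf;
}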

Bill of Materials

| Component | Recommendation | Alternative |
|---|---|---|
| Core language | Rust | C++ via Emscripten |
| WASM bindings | wasm-bindgen | wasm-pack |
| GPU API | wgpu | Raw WebGPU |
| Brush math | Custom (Ciallo-inspired) | — |
| Spline smoothing | kurbo (Catmull-Rom) | Custom |
| PSD support | @webtoon/psd | ag-psd |
| Color engine | jsColorEngine | Custom |
| Compression | lz4_flex, zstd | — |
| UI framework | Svelte | React |
| Build | Vite + wasm-pack | — |

Performance Targets

| Metric | Target | Notes |
|---|---|---|
| Stroke latency | <16 ms | Touch to visible pixel |
| 60 fps compositing | 50 layers | With blend modes |
| Canvas size | 16K × 16K | Sparse tile allocation |
| Undo depth | 100+ | Tile-based diffs |
| Initial load | <3 s | Lazy load brushes |
| Memory usage | <2 GB | With 4K canvas, 20 layers |

Implementation Phases

Phase 1: Core Canvas

  1. WebGPU context setup
  2. Tile-based layer system
  3. Basic brush (pressure → size)
  4. Pan/zoom with gestures

Phase 2: Brush Engine

  1. Stamp-based rendering
  2. Stroke smoothing (Catmull-Rom)
  3. Brush dynamics (pressure, tilt)
  4. Basic brush library

Phase 3: Compositing

  1. Blend mode compute shaders
  2. Layer masks
  3. Clipping groups
  4. Adjustment layers

Phase 4: Performance

  1. OffscreenCanvas + Worker
  2. Tile-based undo
  3. Memory management
  4. IndexedDB persistence

Phase 5: Polish

  1. PSD import/export
  2. Color management
  3. Advanced brushes
  4. Selection tools

Key Research Sources

Brush Rendering: Procreate Brush Studio documentation; Apoorva Joshi's continuous-stroke research; Ciallo (SIGGRAPH 2024)

Graphics Architecture: Polycount discussion on tiled canvas rendering; Pixelitor's region-based undo storage

Web APIs: Pointer Events API; OffscreenCanvas (Chrome documentation); WICG canvas color space proposal; Limnu on premultiplied alpha

WebGPU: WebGPU Fundamentals; Codrops WebGPU compute-shader tutorial

Related: building figma today, vello gpu vector graphics, webgpu vs webgl, rust wasm graphics, webgpu future roadmap