
Building Procreate for the Web

Processing · Literature Review · Created Jan 24, 2025
Project: web-graphics-research
Tags: graphics, architecture, painting, webgpu, synthesis

Building a Procreate-class painting application for the web is now technically feasible. This document synthesizes research into an optimal Chrome-first architecture.

The Core Challenge

Procreate is a raster-first painting application, fundamentally different from vector tools like Figma. Key requirements:

| Requirement | Procreate Approach | Web Challenge |
|---|---|---|
| High-resolution canvas | 16K × 4K on iPad Pro | GPU memory limits, tile streaming |
| Low-latency strokes | <10 ms touch-to-pixel | Main-thread blocking, GC pauses |
| Pressure sensitivity | Apple Pencil integration | Pointer Events API |
| 100+ blend modes | Metal shaders | WebGPU compute |
| Unlimited undo | Efficient diff storage | Memory management |
| Natural media feel | Stamp-based brushes | GPU stroke rendering |

Proposed Architecture

┌─────────────────────────────────────────────────────────────┐
│                      UI Layer                                │
│               TypeScript + Svelte/React                      │
│     (toolbar, layer panel, brush settings — NOT canvas)      │
└─────────────────────────┬───────────────────────────────────┘
                          │ wasm-bindgen / postMessage
┌─────────────────────────▼───────────────────────────────────┐
│                    Core Engine                               │
│                    Rust → WASM                               │
│  • Document model (layers, masks, groups)                    │
│  • Brush engine (stamp interpolation, dynamics)              │
│  • Undo/redo (tile-based diff storage)                       │
│  • Import/export (PSD, PNG, custom format)                   │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────┐
│                  Rendering Engine                            │
│                   wgpu + custom                              │
│  • Tile-based compositing (dirty rect optimization)          │
│  • GPU stroke rasterization                                  │
│  • Blend mode compute shaders                                │
│  • Filter pipeline (blur, color adjust)                      │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────┐
│                    Graphics API                              │
│                      WebGPU                                  │
│            (Chrome 113+, optimized path)                     │
└─────────────────────────────────────────────────────────────┘

Brush Engine Architecture

The brush engine is the heart of any painting app. Three fundamental approaches stand out:

1. Stamp-Based Rendering (Traditional)

Procreate uses a stamp brush model where strokes are formed by repeatedly “stamping” a brush shape along a path.

From Procreate’s Brush Studio:

“A stroke forms by ‘stamping’ the brush shape over and over again along a path.”

Key parameters:

  • Shape: The stamp texture (tip)
  • Grain: Texture applied inside the stamp
  • Spacing: Gap between stamps (0% = fluid stroke, 100% = dots)
  • Jitter: Randomization of position, rotation, size

Traditional GPU implementation:

for each point on stroke path:
    1. Calculate stamp position (with spacing)
    2. Apply jitter (lateral, linear, rotation)
    3. Map pressure → size, opacity
    4. Blend stamp texture onto canvas layer

Problem: At high DPI with small spacing, this creates thousands of overlapping alpha-blended quads per stroke, which is expensive and produces heavy overdraw.
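
To make the overdraw concrete, here is a minimal TypeScript sketch of the stamping loop for one stroke segment. The types and the emitQuad callback are illustrative assumptions; every loop iteration emits one alpha-blended quad.

interface StrokePoint { x: number; y: number; pressure: number; }

// Walk the segment at `spacing` (a fraction of stamp size) and emit one
// textured quad per stamp; emitQuad stands in for the actual GPU draw.
function stampSegment(
  a: StrokePoint, b: StrokePoint, baseSize: number, spacing: number,
  emitQuad: (x: number, y: number, size: number, opacity: number) => void,
): void {
  const dx = b.x - a.x, dy = b.y - a.y;
  const dist = Math.hypot(dx, dy);
  if (dist === 0) return;
  const step = Math.max(baseSize * spacing, 0.5); // never step by zero
  for (let d = 0; d <= dist; d += step) {
    const t = d / dist;
    const pressure = a.pressure + (b.pressure - a.pressure) * t;
    // One quad per stamp: at 1% spacing this is ~100 quads per stamp-width
    emitQuad(a.x + dx * t, a.y + dy * t, baseSize * pressure, pressure);
  }
}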

2. Continuous Stroke Integration (Modern)

Apoorva Joshi’s research replaces discrete stamps with mathematical integration:

“Rather than repeatedly stamping at discrete positions, treat the brush as continuously slid across the stroke axis.”

For any pixel (X, Y), the intensity is:

α(X,Y) = ∫[X₁ to X₂] f(x, X, Y) dx

Where X₁, X₂ are the leftmost/rightmost stamp centers affecting that pixel.

Advantage: Single quad per stroke, no overdraw, computed entirely in fragment shader.

3. GPU-Accelerated Vector Strokes (Ciallo)

Ciallo (SIGGRAPH 2024) introduces a hybrid vector-raster approach:

“GPU-based rendering techniques for digital painting that bridge the gap between raster and vector stroke representations.”

Three brush types:

  1. Vanilla strokes: Variable-width polylines, geometry shader tessellation
  2. Stamp brushes: GPU-computed stamp positions via prefix-sum in compute shader
  3. Airbrush: Resolution-independent opacity falloff

Key insight from the paper:

“A compute shader can calculate the prefix sum of edge length in parallel. By passing the values into the fragment shader, stamp positions on an edge can be calculated.”

This allows stamp density to vary based on stroke curvature while maintaining vector editability.
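
A sequential TypeScript analogue illustrates the idea; Ciallo computes the prefix sum in a parallel compute shader and reads it in the fragment shader, so this CPU version is only a sketch.

// Cumulative arc length (prefix sum of segment lengths) lets stamps be
// placed at even arc-length intervals regardless of vertex clustering.
function stampPositions(points: [number, number][], spacing: number): [number, number][] {
  if (points.length < 2) return [];
  const cum = [0];
  for (let i = 1; i < points.length; i++) {
    const [x0, y0] = points[i - 1], [x1, y1] = points[i];
    cum.push(cum[i - 1] + Math.hypot(x1 - x0, y1 - y0));
  }
  const out: [number, number][] = [];
  let seg = 1;
  for (let d = 0; d <= cum[cum.length - 1]; d += spacing) {
    while (seg < cum.length - 1 && cum[seg] < d) seg++; // segment containing d
    const t = (d - cum[seg - 1]) / (cum[seg] - cum[seg - 1] || 1);
    const [x0, y0] = points[seg - 1], [x1, y1] = points[seg];
    out.push([x0 + (x1 - x0) * t, y0 + (y1 - y0) * t]);
  }
  return out;
}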


Stroke Geometry

Tessellation Approach

For variable-width strokes, the GPU needs triangulated geometry:

Stroke polyline:  P₀ ─── P₁ ─── P₂ ─── P₃
                   │     │     │     │
                   w₀    w₁    w₂    w₃  (widths from pressure)

Tessellated mesh:
    ╱‾‾‾‾‾╲
   ╱       ╲      Expanded to quads perpendicular to stroke direction
  ╱_________╲     Miter or bevel joins at corners

Two GPU approaches:

  1. Geometry shader: Creates quads from line segments (desktop only)
  2. Instanced rendering: Pre-tessellated quad instances (WebGPU compatible)

From Ciallo:

“Both geometry shader and instanced rendering can be used—geometry shader for desktop programs on Windows and instanced rendering for the Web.”
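
A CPU-side TypeScript sketch of the expansion (an instanced or compute-shader version would do the same math on-GPU; the vertex layout here is an assumption):

// Expand a pressure-width polyline into a triangle-strip vertex buffer:
// one left/right vertex pair per point, offset along the local normal.
function tessellateStroke(points: { x: number; y: number; width: number }[]): Float32Array {
  const verts: number[] = [];
  for (let i = 0; i < points.length; i++) {
    const prev = points[Math.max(i - 1, 0)];
    const next = points[Math.min(i + 1, points.length - 1)];
    const dx = next.x - prev.x, dy = next.y - prev.y;   // stroke direction
    const len = Math.hypot(dx, dy) || 1;
    const nx = -dy / len, ny = dx / len;                // unit normal
    const h = points[i].width / 2;
    verts.push(points[i].x + nx * h, points[i].y + ny * h); // left edge
    verts.push(points[i].x - nx * h, points[i].y - ny * h); // right edge
  }
  return new Float32Array(verts); // draw as a triangle strip
}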

Stroke Smoothing

Raw input points are noisy. Catmull-Rom splines provide smooth interpolation:

“The beauty of Catmull-Rom splines is that the curve passes through all control points. Simply choose points in space and the path will pass through them smoothly.”

Use centripetal variant (α = 0.5) to prevent loops and cusps:

fn catmull_rom(p0: Vec2, p1: Vec2, p2: Vec2, p3: Vec2, t: f32, alpha: f32) -> Vec2 {
    // Knot intervals from distance^alpha; alpha = 0.5 is the centripetal variant.
    // (Assumes distinct control points; coincident points give zero intervals.)
    let t0 = 0.0;
    let t1 = t0 + (p1 - p0).length().powf(alpha);
    let t2 = t1 + (p2 - p1).length().powf(alpha);
    let t3 = t2 + (p3 - p2).length().powf(alpha);
    // Barry-Goldman pyramid: map t from [0, 1] onto the [t1, t2] knot span
    let t = t1 + t * (t2 - t1);
    let a1 = p0 * ((t1 - t) / (t1 - t0)) + p1 * ((t - t0) / (t1 - t0));
    let a2 = p1 * ((t2 - t) / (t2 - t1)) + p2 * ((t - t1) / (t2 - t1));
    let a3 = p2 * ((t3 - t) / (t3 - t2)) + p3 * ((t - t2) / (t3 - t2));
    let b1 = a1 * ((t2 - t) / (t2 - t0)) + a2 * ((t - t0) / (t2 - t0));
    let b2 = a2 * ((t3 - t) / (t3 - t1)) + a3 * ((t - t1) / (t3 - t1));
    b1 * ((t2 - t) / (t2 - t1)) + b2 * ((t - t1) / (t2 - t1))
}

Input Handling

Pointer Events API

The Pointer Events API provides unified stylus support:

canvas.addEventListener('pointermove', (e: PointerEvent) => {
  const point = {
    x: e.clientX,
    y: e.clientY,
    pressure: e.pressure,      // 0.0 - 1.0
    tiltX: e.tiltX,            // -90° to 90°
    tiltY: e.tiltY,            // -90° to 90°
    twist: e.twist,            // 0° to 359° (rotation)
    pointerType: e.pointerType // "pen" | "touch" | "mouse"
  };

  // Coalesce events for smooth strokes
  const coalesced = e.getCoalescedEvents();
  for (const ce of coalesced) {
    strokeEngine.addPoint(ce);
  }
});

Key properties:

  • pressure: Normalized 0-1, maps to brush size/opacity
  • tiltX/tiltY: Stylus angle, affects brush shape
  • twist: Barrel rotation for calligraphy brushes
  • tangentialPressure: Barrel pressure (-1 to 1)

Important: Use getCoalescedEvents() to capture high-frequency input that browsers batch for performance.

Prediction

For ultra-low latency, use getPredictedEvents() to draw ahead of the stylus:

const predicted = e.getPredictedEvents();
// Draw predicted points with lower opacity,
// replace when actual events arrive
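
A sketch of the replace-on-arrival pattern (redrawStroke is a hypothetical renderer entry point):

const committed: PointerEvent[] = [];      // real input, kept
let preview: readonly PointerEvent[] = []; // speculative, rebuilt every event

canvas.addEventListener('pointermove', (e: PointerEvent) => {
  committed.push(...e.getCoalescedEvents()); // actual samples
  preview = e.getPredictedEvents();          // old predictions are discarded
  redrawStroke(committed, preview);          // draw preview at lower opacity
});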

Layer Compositing

Blend Mode Implementation

WebGPU supports hardware blend modes, but not all Photoshop blend modes are built-in. Complex modes require compute shaders.

From WebGPU Fundamentals:

“We can set the blending mode, the primitive topology, and the depth/stencil state.”

Standard modes (hardware accelerated):

// Fragment output with premultiplied alpha
@fragment
fn fs_main(@location(0) color: vec4<f32>) -> @location(0) vec4<f32> {
    return vec4(color.rgb * color.a, color.a);
}

Pipeline blend state:

blend: {
  color: {
    srcFactor: 'one',          // Premultiplied source
    dstFactor: 'one-minus-src-alpha',
    operation: 'add'
  },
  alpha: {
    srcFactor: 'one',
    dstFactor: 'one-minus-src-alpha',
    operation: 'add'
  }
}

Complex modes (compute shader required):

  • Multiply, Screen, Overlay, Soft Light
  • Color Dodge, Color Burn
  • Difference, Exclusion
  • Hue, Saturation, Color, Luminosity
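
As a sketch of the compute path, here is a screen blend written as WGSL embedded in a TypeScript string. The binding layout is an assumption, the inputs are taken as straight (non-premultiplied) alpha for clarity, and the final mix is simplified source-over rather than the full W3C compositing equation.

const screenBlendWGSL = /* wgsl */ `
@group(0) @binding(0) var topTex: texture_2d<f32>;   // top layer, straight alpha
@group(0) @binding(1) var backTex: texture_2d<f32>;  // backdrop, straight alpha
@group(0) @binding(2) var outTex: texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(8, 8)
fn blend(@builtin(global_invocation_id) id: vec3<u32>) {
  let size = textureDimensions(topTex);
  if (id.x >= size.x || id.y >= size.y) { return; }
  let s = textureLoad(topTex, vec2<i32>(id.xy), 0);
  let b = textureLoad(backTex, vec2<i32>(id.xy), 0);
  // Screen: 1 - (1 - s)(1 - b), applied per channel
  let blended = vec3(1.0) - (vec3(1.0) - s.rgb) * (vec3(1.0) - b.rgb);
  let rgb = mix(b.rgb, blended, s.a);      // simplified source-over weighting
  let a = s.a + b.a * (1.0 - s.a);
  textureStore(outTex, vec2<i32>(id.xy), vec4(rgb, a));
}`;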

Premultiplied Alpha

Critical for correct compositing. From Limnu’s analysis:

“The browser defaults to compositing a WebGL canvas using premultiplied alpha because colors come out of the renderer in premultiplied form.”

Always work in premultiplied alpha internally:

premultiplied.rgb = straight.rgb * straight.a
premultiplied.a = straight.a
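
A pair of one-pixel TypeScript helpers (channels as 0..1 floats) shows the conversion and why the reverse direction needs a zero-alpha guard:

function premultiply(r: number, g: number, b: number, a: number): number[] {
  return [r * a, g * a, b * a, a];
}
function unpremultiply(r: number, g: number, b: number, a: number): number[] {
  return a === 0 ? [0, 0, 0, 0] : [r / a, g / a, b / a, a]; // avoid 0/0
}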

Tile-Based Rendering

Why Tiles?

For 16K × 16K canvases:

  • Raw size: 16384 × 16384 × 4 bytes = 1 GB per layer
  • 10 layers = 10 GB (impossible in browser)

Solution: Tile-based streaming with sparse allocation.

From Polycount discussion:

“Instead of storing full canvas textures, break the canvas into a grid and only store tiles that are modified. GPUs work better with fewer texture bindings, so use large atlases of tiles.”

Tile Architecture

Canvas Grid (16K × 16K, 256px tiles = 64×64 = 4096 tiles)
┌───┬───┬───┬───┬───┐
│   │ ▓ │ ▓ │   │   │   ▓ = dirty (needs re-render)
├───┼───┼───┼───┼───┤   (blank) = untouched (not allocated)
│   │ ▓ │ ▓ │ ▓ │   │   █ = clean (cached in GPU)
├───┼───┼───┼───┼───┤
│   │   │ █ │ █ │   │
└───┴───┴───┴───┴───┘

Only allocate tiles that contain paint. Untouched regions consume zero memory.
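
A sparse tile store can be as simple as a map keyed by tile coordinate; this TypeScript sketch assumes 256 px tiles and uses ImageData as a stand-in for GPU tile storage:

const TILE = 256;
const tiles = new Map<string, ImageData>(); // absent key = untouched, zero cost

// Allocate lazily: only when a stroke actually touches the tile
function tileFor(x: number, y: number): ImageData {
  const key = `${Math.floor(x / TILE)},${Math.floor(y / TILE)}`;
  let tile = tiles.get(key);
  if (!tile) {
    tile = new ImageData(TILE, TILE); // zero-initialized RGBA
    tiles.set(key, tile);
  }
  return tile;
}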

Tile Atlas

Pack allocated tiles into GPU texture atlases for efficient rendering:

struct TileAtlas {
    texture: wgpu::Texture,        // 4096×4096 atlas
    allocator: AtlasAllocator,     // Tracks free slots
    tile_map: HashMap<TileCoord, AtlasSlot>,
}

impl TileAtlas {
    fn get_or_allocate(&mut self, coord: TileCoord) -> AtlasSlot {
        self.tile_map.get(&coord).cloned()
            .unwrap_or_else(|| {
                let slot = self.allocator.alloc();
                self.tile_map.insert(coord, slot);
                slot
            })
    }
}

Performance target from Polycount:

“Running at 1000fps with a 4K × 4K canvas using half floats, the buffer is only 128MB of VRAM, allowing for 15 layers with just 2GB.”


Undo/Redo System

The Problem

Naive approach: Store full canvas snapshot per operation.

  • 4K × 4K × 4 bytes = 64 MB per snapshot
  • 100 undo levels = 6.4 GB

Tile-Based Diffs

Only store tiles that changed:

struct UndoOperation {
    affected_tiles: Vec<TileCoord>,
    old_data: HashMap<TileCoord, TileData>,  // Only changed tiles
    timestamp: Instant,
}

fn record_stroke(&mut self, stroke: &Stroke) {
    let affected = self.get_affected_tiles(stroke.bounds());
    let old_data = affected.iter()
        .map(|coord| (*coord, self.read_tile(*coord)))
        .collect();

    self.undo_stack.push(UndoOperation {
        affected_tiles: affected,
        old_data,
        timestamp: Instant::now(),
    });
}

From Pixelitor’s approach:

“Pixelitor only stores bitmaps for the regions affected by each operation.”

Memory Pressure Handling

When memory is constrained:

  1. Compress older undo entries (LZ4/zstd)
  2. Spill to IndexedDB
  3. Discard oldest entries

const UNDO_MEMORY_LIMIT = 512 * 1024 * 1024; // 512 MB

function trimUndoStack() {
    let totalSize = undoStack.reduce((sum, op) => sum + op.byteSize, 0);
    while (totalSize > UNDO_MEMORY_LIMIT && undoStack.length > 10) {
        const oldest = undoStack.shift();
        totalSize -= oldest.byteSize;
        oldest.dispose();
    }
}
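
Step 2, spilling to IndexedDB, could look like the following sketch; db is assumed to be an already-open IDBDatabase with an "undo" object store, and releaseInMemoryCopy is a hypothetical helper that frees the in-memory copy once the write commits:

function spillToIndexedDB(db: IDBDatabase, op: { id: number; bytes: ArrayBuffer }) {
  const tx = db.transaction('undo', 'readwrite');
  tx.objectStore('undo').put(op.bytes, op.id);      // key = operation id
  tx.oncomplete = () => releaseInMemoryCopy(op.id); // hypothetical helper
}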

Performance Architecture

OffscreenCanvas + Worker

Move rendering off the main thread:

// main.ts
const offscreen = canvas.transferControlToOffscreen();
const worker = new Worker('render-worker.js');
worker.postMessage({ type: 'init', canvas: offscreen }, [offscreen]);

// render-worker.js
self.onmessage = async (e) => {
    if (e.data.type === 'init') {
        const adapter = await navigator.gpu.requestAdapter();
        const device = await adapter.requestDevice();
        const context = e.data.canvas.getContext('webgpu');
        // Render loop runs entirely in worker
    }
};

From Chrome’s OffscreenCanvas documentation:

“Making canvas rendering contexts available to workers increases parallelism and makes better use of multi-core systems.”

Benefits:

  • UI remains responsive during heavy rendering
  • Input handling on main thread, rendering on worker
  • requestAnimationFrame() works in workers

SharedArrayBuffer for Zero-Copy

Share stroke data between threads without copying:

// Shared buffers: one for stroke points, one for the atomic write counter
const strokeBuffer = new SharedArrayBuffer(1024 * 1024);
const strokeView = new Float32Array(strokeBuffer);
const strokeCount = new Int32Array(new SharedArrayBuffer(4));
let writeIndex = 0;

// Main thread writes input
strokeView[writeIndex++] = point.x;
strokeView[writeIndex++] = point.y;
strokeView[writeIndex++] = point.pressure;
Atomics.store(strokeCount, 0, writeIndex / 3);

// Worker reads and renders new points since the last frame
let lastProcessed = 0;
const count = Atomics.load(strokeCount, 0);
for (let i = lastProcessed; i < count; i++) {
    const x = strokeView[i * 3];
    const y = strokeView[i * 3 + 1];
    const pressure = strokeView[i * 3 + 2];
    renderPoint(x, y, pressure);
}
lastProcessed = count;

Note: SharedArrayBuffer requires cross-origin isolation: serve the page with the Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers.


Compute Shaders for Effects

Gaussian Blur

From WebGPU Fundamentals:

“2D image processing is an excellent use case for WebGPU.”

Separable blur for efficiency (two 1D passes instead of one 2D):

@group(0) @binding(0) var inputTex: texture_2d<f32>;
@group(0) @binding(1) var outputTex: texture_storage_2d<rgba8unorm, write>;
@group(0) @binding(2) var<storage> weights: array<f32>;

@compute @workgroup_size(64, 1)
fn blur_horizontal(@builtin(global_invocation_id) id: vec3<u32>) {
    let size = textureDimensions(inputTex);
    if (id.x >= size.x || id.y >= size.y) { return; }

    var sum = vec4<f32>(0.0);
    let radius = arrayLength(&weights) / 2u;

    for (var i = 0u; i < arrayLength(&weights); i++) {
        let offset = i32(i) - i32(radius);
        let coord = vec2<i32>(i32(id.x) + offset, i32(id.y));
        let clamped = clamp(coord, vec2(0), vec2<i32>(size) - 1);
        sum += textureLoad(inputTex, clamped, 0) * weights[i];
    }

    textureStore(outputTex, vec2<i32>(id.xy), sum);
}
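
Dispatching the horizontal pass from TypeScript might look like this; the pipeline, bind group, and dimensions are assumed to exist already:

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(blurHorizontalPipeline); // compiled from the WGSL above
pass.setBindGroup(0, blurBindGroup);
// workgroup_size is (64, 1): cover the width in x, one row per workgroup in y
pass.dispatchWorkgroups(Math.ceil(width / 64), height);
pass.end();
device.queue.submit([encoder.finish()]);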

Workgroup Optimization

From Codrops tutorial:

“A general advice for WebGPU is to choose a workgroup size of 64.”

Use tile-based processing with shared memory:

var<workgroup> tile: array<vec4<f32>, 272>; // 16×17 with halo

@compute @workgroup_size(16, 16)
fn process_tile(@builtin(local_invocation_id) local_id: vec3<u32>,
                @builtin(workgroup_id) group_id: vec3<u32>) {
    // Load tile + halo into shared memory
    // Process with fast local memory access
    // Write results
}

Color Management

Display P3 Support

Modern displays support wide gamut. From WICG proposal:

const ctx = canvas.getContext('2d', { colorSpace: 'display-p3' });
ctx.fillStyle = 'color(display-p3 1 0.5 0)'; // Vivid orange outside sRGB

For WebGPU:

context.configure({
    device,
    format: navigator.gpu.getPreferredCanvasFormat(),
    colorSpace: 'display-p3', // If supported
    alphaMode: 'premultiplied'
});

ICC Profile Handling

For import/export, use jsColorEngine:

import { ColorEngine, Profile } from 'js-color-engine';

const engine = new ColorEngine();
const srgb = await Profile.fromURL('/profiles/sRGB.icc');
const p3 = await Profile.fromURL('/profiles/DisplayP3.icc');

const transform = engine.createTransform(srgb, p3);
const converted = transform.apply(imageData);

File Format Support

PSD Import/Export

@webtoon/psd is the modern choice:

  • Zero dependencies
  • WebAssembly acceleration
  • ~100 KB minified (vs 443 KB for PSD.js)

import Psd from '@webtoon/psd';

const psd = Psd.parse(arrayBuffer);
for (const layer of psd.layers) {
    console.log(layer.name, layer.opacity, layer.blendMode);
    const imageData = await layer.composite();
}

Limitations:

  • No CMYK support (converts to RGB)
  • Some adjustment layers not supported
  • Smart objects require special handling

Custom Format

For optimal performance, design a custom format:

Header:
  - Magic: "GPAINT"
  - Version: u32
  - Canvas size: u32 × u32
  - Layer count: u32

Layer Table:
  - Name: string
  - Blend mode: u8
  - Opacity: f32
  - Bounds: i32 × 4
  - Tile count: u32
  - Tile offsets: [u64]

Tile Data:
  - Compressed RGBA (LZ4 or zstd)
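
Since the format is this document's own proposal, the byte layout is free to choose; a minimal TypeScript sketch of writing the header (little-endian assumed):

function writeHeader(width: number, height: number, layerCount: number): ArrayBuffer {
  const buf = new ArrayBuffer(22);                                // 6-byte magic + 4 × u32
  new Uint8Array(buf).set(new TextEncoder().encode('GPAINT'), 0); // magic
  const view = new DataView(buf);
  view.setUint32(6, 1, true);            // version
  view.setUint32(10, width, true);       // canvas width
  view.setUint32(14, height, true);      // canvas height
  view.setUint32(18, layerCount, true);  // layer count
  return buf;
}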

Bill of Materials

| Component | Recommendation | Alternative |
|---|---|---|
| Core language | Rust | C++ via Emscripten |
| WASM bindings | wasm-bindgen | wasm-pack |
| GPU API | wgpu | Raw WebGPU |
| Brush math | Custom (Ciallo-inspired) | — |
| Spline smoothing | kurbo (Catmull-Rom) | Custom |
| PSD support | @webtoon/psd | ag-psd |
| Color engine | jsColorEngine | Custom |
| Compression | lz4_flex, zstd | — |
| UI framework | Svelte | React |
| Build | Vite + wasm-pack | — |

Performance Targets

| Metric | Target | Notes |
|---|---|---|
| Stroke latency | <16 ms | Touch to visible pixel |
| 60 fps compositing | 50 layers | With blend modes |
| Canvas size | 16K × 16K | Sparse tile allocation |
| Undo depth | 100+ | Tile-based diffs |
| Initial load | <3 s | Lazy load brushes |
| Memory usage | <2 GB | With 4K canvas, 20 layers |

Implementation Phases

Phase 1: Core Canvas

  1. WebGPU context setup
  2. Tile-based layer system
  3. Basic brush (pressure → size)
  4. Pan/zoom with gestures

Phase 2: Brush Engine

  1. Stamp-based rendering
  2. Stroke smoothing (Catmull-Rom)
  3. Brush dynamics (pressure, tilt)
  4. Basic brush library

Phase 3: Compositing

  1. Blend mode compute shaders
  2. Layer masks
  3. Clipping groups
  4. Adjustment layers

Phase 4: Performance

  1. OffscreenCanvas + Worker
  2. Tile-based undo
  3. Memory management
  4. IndexedDB persistence

Phase 5: Polish

  1. PSD import/export
  2. Color management
  3. Advanced brushes
  4. Selection tools

Key Research Sources

Brush Rendering: Procreate Brush Studio documentation; Apoorva Joshi's continuous-stroke research; Ciallo (SIGGRAPH 2024)

Graphics Architecture: Polycount discussion on tiled canvas rendering; Pixelitor's region-based undo storage

Web APIs: Pointer Events API; OffscreenCanvas (Chrome documentation); WICG canvas color space proposal; Limnu on premultiplied alpha

WebGPU: WebGPU Fundamentals; Codrops WebGPU compute-shader tutorial

Related: building figma today, vello gpu vector graphics, webgpu vs webgl, rust wasm graphics, webgpu future roadmap