Building Procreate for the Web
Building a Procreate-class painting application for the web is now technically feasible. This document synthesizes the relevant research into a recommended Chrome-first architecture.
The Core Challenge
Procreate is a raster-first painting application, fundamentally different from vector tools like Figma. Key requirements:
| Requirement | Procreate Approach | Web Challenge |
|---|---|---|
| High-resolution canvas | 16K × 4K on iPad Pro | GPU memory limits, tile streaming |
| Low-latency strokes | <10ms touch-to-pixel | Main thread blocking, GC pauses |
| Pressure sensitivity | Apple Pencil integration | Pointer Events API |
| 100+ blend modes | Metal shaders | WebGPU compute |
| Unlimited undo | Efficient diff storage | Memory management |
| Natural media feel | Stamp-based brushes | GPU stroke rendering |
Recommended Stack
┌─────────────────────────────────────────────────────────────┐
│ UI Layer │
│ TypeScript + Svelte/React │
│ (toolbar, layer panel, brush settings — NOT canvas) │
└─────────────────────────┬───────────────────────────────────┘
│ wasm-bindgen / postMessage
┌─────────────────────────▼───────────────────────────────────┐
│ Core Engine │
│ Rust → WASM │
│ • Document model (layers, masks, groups) │
│ • Brush engine (stamp interpolation, dynamics) │
│ • Undo/redo (tile-based diff storage) │
│ • Import/export (PSD, PNG, custom format) │
└─────────────────────────┬───────────────────────────────────┘
│
┌─────────────────────────▼───────────────────────────────────┐
│ Rendering Engine │
│ wgpu + custom │
│ • Tile-based compositing (dirty rect optimization) │
│ • GPU stroke rasterization │
│ • Blend mode compute shaders │
│ • Filter pipeline (blur, color adjust) │
└─────────────────────────┬───────────────────────────────────┘
│
┌─────────────────────────▼───────────────────────────────────┐
│ Graphics API │
│ WebGPU │
│ (Chrome 113+, optimized path) │
└─────────────────────────────────────────────────────────────┘
Brush Engine Architecture
The brush engine is the heart of any painting app. Three fundamental approaches exist:
1. Stamp-Based Rendering (Traditional)
Procreate uses a stamp brush model where strokes are formed by repeatedly “stamping” a brush shape along a path.
From Procreate’s Brush Studio:
“A stroke forms by ‘stamping’ the brush shape over and over again along a path.”
Key parameters:
- Shape: The stamp texture (tip)
- Grain: Texture applied inside the stamp
- Spacing: Gap between stamps (0% = fluid stroke, 100% = dots)
- Jitter: Randomization of position, rotation, size
Traditional GPU implementation:
for each point on stroke path:
1. Calculate stamp position (with spacing)
2. Apply jitter (lateral, linear, rotation)
3. Map pressure → size, opacity
4. Blend stamp texture onto canvas layer
Problem: At high DPI with small spacing, this creates thousands of overlapping alpha-blended quads per stroke, which is expensive and causes heavy overdraw.
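For concreteness, a minimal CPU sketch of the stamping loop above. The stampQuad() draw call is hypothetical, standing in for the real GPU blend of the stamp texture:

declare function stampQuad(x: number, y: number, size: number, opacity: number): void;

interface StrokePoint { x: number; y: number; pressure: number; }

// Walk the polyline, emitting a stamp every `spacing * size` pixels.
function stampStroke(points: StrokePoint[], baseSize: number, spacing: number) {
  let carry = 0; // distance left over from the previous segment
  for (let i = 1; i < points.length; i++) {
    const [a, b] = [points[i - 1], points[i]];
    const dx = b.x - a.x, dy = b.y - a.y;
    const len = Math.hypot(dx, dy);
    let d = carry;
    while (d < len) {
      const t = d / len;
      const pressure = a.pressure + t * (b.pressure - a.pressure);
      const size = baseSize * pressure;        // pressure → size
      stampQuad(a.x + t * dx, a.y + t * dy, size, pressure /* → opacity */);
      d += Math.max(1, spacing * size);        // stamp interval
    }
    carry = d - len;
  }
}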
2. Continuous Integration (Modern)
Apoorva Joshi’s research replaces discrete stamps with mathematical integration:
“Rather than repeatedly stamping at discrete positions, treat the brush as continuously slid across the stroke axis.”
For any pixel (X, Y), the intensity is:
α(X,Y) = ∫[X₁ to X₂] f(x, X, Y) dx
Where X₁, X₂ are the leftmost/rightmost stamp centers affecting that pixel.
Advantage: Single quad per stroke, no overdraw, computed entirely in fragment shader.
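As a sanity check of the math (not of the actual fragment shader), here is a CPU sketch that approximates α for one pixel by discretizing the integral. The Gaussian footprint f is an illustrative assumption, not the paper's choice; the stroke is assumed to run along the x-axis:

// Footprint of a stamp centered at (x, 0), evaluated at pixel (px, py).
function f(x: number, px: number, py: number, radius: number): number {
  const d2 = (px - x) ** 2 + py ** 2;
  return Math.exp(-d2 / (2 * radius * radius));
}

// α(px, py) ≈ Σ f(x, px, py) · dx over stamp centers in [x1, x2]
function alpha(px: number, py: number, x1: number, x2: number, radius: number): number {
  const dx = 0.25; // integration step in pixels
  let sum = 0;
  for (let x = x1; x <= x2; x += dx) sum += f(x, px, py, radius) * dx;
  return Math.min(1, sum);
}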
3. GPU-Accelerated Vector Strokes (Ciallo)
Ciallo (SIGGRAPH 2024) introduces a hybrid vector-raster approach:
“GPU-based rendering techniques for digital painting that bridge the gap between raster and vector stroke representations.”
Three brush types:
- Vanilla strokes: Variable-width polylines, geometry shader tessellation
- Stamp brushes: GPU-computed stamp positions via prefix-sum in compute shader
- Airbrush: Resolution-independent opacity falloff
Key insight from the paper:
“A compute shader can calculate the prefix sum of edge length in parallel. By passing the values into the fragment shader, stamp positions on an edge can be calculated.”
This allows stamp density to vary based on stroke curvature while maintaining vector editability.
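A CPU rendition of the idea (the paper does the prefix sum in a compute shader, in parallel): cumulative segment lengths let each edge recover its own stamp positions independently, with no serial walk along the stroke. Uniform spacing is assumed here for simplicity:

// Prefix sum of polyline segment lengths: cum[i] = arc length up to vertex i.
function prefixLengths(pts: { x: number; y: number }[]): number[] {
  const cum = [0];
  for (let i = 1; i < pts.length; i++)
    cum.push(cum[i - 1] + Math.hypot(pts[i].x - pts[i - 1].x, pts[i].y - pts[i - 1].y));
  return cum;
}

// Stamp centers on segment i fall at arc lengths k * interval inside
// [cum[i], cum[i + 1]], so each segment is computable independently.
function stampArcLengthsOnSegment(cum: number[], i: number, interval: number): number[] {
  const first = Math.ceil(cum[i] / interval);
  const last = Math.floor(cum[i + 1] / interval);
  const out: number[] = [];
  for (let k = first; k <= last; k++) out.push(k * interval);
  return out;
}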
Stroke Geometry
Tessellation Approach
For variable-width strokes, the GPU needs triangulated geometry:
Stroke polyline: P₀ ─── P₁ ─── P₂ ─── P₃
│ │ │ │
w₀ w₁ w₂ w₃ (widths from pressure)
Tessellated mesh:
╱‾‾‾‾‾╲
╱ ╲ Expanded to quads perpendicular to stroke direction
╱_________╲ Miter or bevel joins at corners
Two GPU approaches:
- Geometry shader: Creates quads from line segments (desktop only)
- Instanced rendering: Pre-tessellated quad instances (WebGPU compatible)
From Ciallo:
“Both geometry shader and instanced rendering can be used—geometry shader for desktop programs on Windows and instanced rendering for the Web.”
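A sketch of the instanced path: pack one instance record per segment and let the vertex shader expand a unit quad from it. The six-float layout is an assumption for illustration, not Ciallo's exact buffer format:

// Six floats per segment: both endpoints and per-endpoint widths.
function buildInstanceData(pts: { x: number; y: number; w: number }[]): Float32Array {
  const data = new Float32Array((pts.length - 1) * 6);
  for (let i = 0; i < pts.length - 1; i++) {
    const [a, b] = [pts[i], pts[i + 1]];
    data.set([a.x, a.y, b.x, b.y, a.w, b.w], i * 6);
  }
  return data; // upload via device.queue.writeBuffer(); step mode "instance"
}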
Stroke Smoothing
Raw input points are noisy. Catmull-Rom splines provide smooth interpolation:
“The beauty of Catmull-Rom splines is that the curve passes through all control points. Simply choose points in space and the path will pass through them smoothly.”
Use the centripetal variant (α = 0.5) to prevent loops and cusps:
fn catmull_rom(p0: Vec2, p1: Vec2, p2: Vec2, p3: Vec2, t: f32, alpha: f32) -> Vec2 {
    // Knot intervals from distance^alpha (centripetal when alpha = 0.5)
    let t0 = 0.0;
    let t1 = t0 + (p1 - p0).length().powf(alpha);
    let t2 = t1 + (p2 - p1).length().powf(alpha);
    let t3 = t2 + (p3 - p2).length().powf(alpha);
    // Map t in [0, 1] onto the middle segment, then apply Barry-Goldman
    let t = t1 + t * (t2 - t1);
    let a1 = p0 * ((t1 - t) / (t1 - t0)) + p1 * ((t - t0) / (t1 - t0));
    let a2 = p1 * ((t2 - t) / (t2 - t1)) + p2 * ((t - t1) / (t2 - t1));
    let a3 = p2 * ((t3 - t) / (t3 - t2)) + p3 * ((t - t2) / (t3 - t2));
    let b1 = a1 * ((t2 - t) / (t2 - t0)) + a2 * ((t - t0) / (t2 - t0));
    let b2 = a2 * ((t3 - t) / (t3 - t1)) + a3 * ((t - t1) / (t3 - t1));
    b1 * ((t2 - t) / (t2 - t1)) + b2 * ((t - t1) / (t2 - t1))
}
Input Handling
Pointer Events API
The Pointer Events API provides unified stylus support:
canvas.addEventListener('pointermove', (e: PointerEvent) => {
const point = {
x: e.clientX,
y: e.clientY,
pressure: e.pressure, // 0.0 - 1.0
tiltX: e.tiltX, // -90° to 90°
tiltY: e.tiltY, // -90° to 90°
twist: e.twist, // 0° to 359° (rotation)
pointerType: e.pointerType // "pen" | "touch" | "mouse"
};
// Coalesce events for smooth strokes
const coalesced = e.getCoalescedEvents();
for (const ce of coalesced) {
strokeEngine.addPoint(ce);
}
});
Key properties:
- pressure: Normalized 0-1, maps to brush size/opacity
- tiltX/tiltY: Stylus angle, affects brush shape
- twist: Barrel rotation for calligraphy brushes
- tangentialPressure: Barrel pressure (-1 to 1)
Important: Use getCoalescedEvents() to capture high-frequency input that browsers batch for performance.
Prediction
For ultra-low latency, use getPredictedEvents() to draw ahead of the stylus:
const predicted = e.getPredictedEvents();
// Draw predicted points with lower opacity,
// replace when actual events arrive
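One workable pattern, sketched under the assumption that the stroke tail is redrawn every frame: keep confirmed points permanently, and hold the predicted points in a disposable tail that each new event replaces.

const confirmed: PointerEvent[] = [];
let predictedTail: PointerEvent[] = [];

canvas.addEventListener('pointermove', (e: PointerEvent) => {
  confirmed.push(...e.getCoalescedEvents()); // real samples are permanent
  predictedTail = e.getPredictedEvents();    // tail is replaced on every event
  // Render pass: draw `confirmed` normally, then `predictedTail` at
  // reduced opacity; the tail is discarded when the next event arrives.
});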
Layer Compositing
Blend Mode Implementation
WebGPU supports hardware blend modes, but not all Photoshop blend modes are built-in. Complex modes require compute shaders.
From WebGPU Fundamentals:
“We can set the blending mode, the primitive topology, and the depth/stencil state.”
Standard modes (hardware accelerated):
// Fragment output with premultiplied alpha
@fragment
fn fs_main(@location(0) color: vec4<f32>) -> @location(0) vec4<f32> {
return vec4(color.rgb * color.a, color.a);
}
Pipeline blend state:
blend: {
color: {
srcFactor: 'one', // Premultiplied source
dstFactor: 'one-minus-src-alpha',
operation: 'add'
},
alpha: {
srcFactor: 'one',
dstFactor: 'one-minus-src-alpha',
operation: 'add'
}
}
Complex modes (compute shader required):
- Multiply, Screen, Overlay, Soft Light
- Color Dodge, Color Burn
- Difference, Exclusion
- Hue, Saturation, Color, Luminosity
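The per-channel formulas for the separable modes are fixed by the W3C compositing spec; a CPU reference for three of them, operating on straight (non-premultiplied) values in [0, 1]:

// cb = backdrop channel, cs = source channel, both in [0, 1].
const multiply = (cb: number, cs: number) => cb * cs;
const screen   = (cb: number, cs: number) => cb + cs - cb * cs;
const overlay  = (cb: number, cs: number) =>
  cb <= 0.5 ? 2 * cb * cs : 1 - 2 * (1 - cb) * (1 - cs);
// In the compute shader, apply the formula per channel, then composite
// the blended color with the usual source-over equation.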
Premultiplied Alpha
Critical for correct compositing. From Limnu’s analysis:
“The browser defaults to compositing a WebGL canvas using premultiplied alpha because colors come out of the renderer in premultiplied form.”
Always work in premultiplied alpha internally:
premultiplied.rgb = straight.rgb * straight.a
premultiplied.a = straight.a
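The two conversions as code; note that unpremultiplying needs a zero-alpha guard:

type RGBA = { r: number; g: number; b: number; a: number };

const premultiply = (c: RGBA): RGBA =>
  ({ r: c.r * c.a, g: c.g * c.a, b: c.b * c.a, a: c.a });

const unpremultiply = (c: RGBA): RGBA =>
  c.a === 0 ? { r: 0, g: 0, b: 0, a: 0 }
            : { r: c.r / c.a, g: c.g / c.a, b: c.b / c.a, a: c.a };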
Tile-Based Rendering
Why Tiles?
For 16K × 16K canvases:
- Raw size: 16384 × 16384 × 4 bytes = 1 GB per layer
- 10 layers = 10 GB (impossible in browser)
Solution: Tile-based streaming with sparse allocation.
From Polycount discussion:
“Instead of storing full canvas textures, break the canvas into a grid and only store tiles that are modified. GPUs work better with fewer texture bindings, so use large atlases of tiles.”
Tile Architecture
Canvas Grid (16K × 16K, 256px tiles = 64×64 = 4096 tiles)
┌───┬───┬───┬───┬───┐
│ │ ▓ │ ▓ │ │ │ ▓ = dirty (needs re-render)
├───┼───┼───┼───┼───┤ (blank) = untouched (not allocated)
│ │ ▓ │ ▓ │ ▓ │ │ █ = clean (cached in GPU)
├───┼───┼───┼───┼───┤
│ │ │ █ │ █ │ │
└───┴───┴───┴───┴───┘
Only allocate tiles that contain paint. Untouched regions consume zero memory.
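The bookkeeping is simple; a sketch of mapping a dirty rectangle to the tiles it touches, assuming 256 px tiles:

const TILE_SIZE = 256;

// All tile coordinates overlapped by an axis-aligned dirty rect.
function tilesInRect(x0: number, y0: number, x1: number, y1: number): [number, number][] {
  const tiles: [number, number][] = [];
  for (let ty = Math.floor(y0 / TILE_SIZE); ty <= Math.floor(y1 / TILE_SIZE); ty++)
    for (let tx = Math.floor(x0 / TILE_SIZE); tx <= Math.floor(x1 / TILE_SIZE); tx++)
      tiles.push([tx, ty]);
  return tiles; // allocate lazily: only these keys ever enter the tile map
}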
Tile Atlas
Pack allocated tiles into GPU texture atlases for efficient rendering:
use std::collections::HashMap;

struct TileAtlas {
texture: wgpu::Texture, // 4096×4096 atlas
allocator: AtlasAllocator, // Tracks free slots
tile_map: HashMap<TileCoord, AtlasSlot>,
}
impl TileAtlas {
fn get_or_allocate(&mut self, coord: TileCoord) -> AtlasSlot {
self.tile_map.get(&coord).cloned()
.unwrap_or_else(|| {
let slot = self.allocator.alloc();
self.tile_map.insert(coord, slot);
slot
})
}
}
Performance target from Polycount:
“Running at 1000fps with a 4K × 4K canvas using half floats, the buffer is only 128MB of VRAM, allowing for 15 layers with just 2GB.”
Undo/Redo System
The Problem
Naive approach: Store full canvas snapshot per operation.
- 4K × 4K × 4 bytes = 64 MB per snapshot
- 100 undo levels = 6.4 GB
Tile-Based Diffs
Only store tiles that changed:
use std::collections::HashMap;
use std::time::Instant;

struct UndoOperation {
affected_tiles: Vec<TileCoord>,
old_data: HashMap<TileCoord, TileData>, // Only changed tiles
timestamp: Instant,
}
fn record_stroke(&mut self, stroke: &Stroke) {
let affected = self.get_affected_tiles(stroke.bounds());
let old_data = affected.iter()
.map(|coord| (*coord, self.read_tile(*coord)))
.collect();
self.undo_stack.push(UndoOperation {
affected_tiles: affected,
old_data,
timestamp: Instant::now(),
});
}
From Pixelitor’s approach:
“Pixelitor only stores bitmaps for the regions affected by each operation.”
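Applying an entry is the mirror image of recording it. A sketch assuming hypothetical readTile/writeTile helpers and a redo stack alongside the undo stack:

function undo() {
  const op = undoStack.pop();
  if (!op) return;
  // Snapshot the current contents of the affected tiles for redo.
  const redoData = new Map(
    op.affectedTiles.map((coord) => [coord, readTile(coord)])
  );
  redoStack.push({ ...op, oldData: redoData });
  // Restore the pre-operation contents of every affected tile.
  for (const [coord, data] of op.oldData) writeTile(coord, data);
}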
Memory Pressure Handling
When memory is constrained:
- Compress older undo entries (LZ4/zstd)
- Spill to IndexedDB
- Discard oldest entries
const UNDO_MEMORY_LIMIT = 512 * 1024 * 1024; // 512 MB
function trimUndoStack() {
let totalSize = undoStack.reduce((sum, op) => sum + op.byteSize, 0);
while (totalSize > UNDO_MEMORY_LIMIT && undoStack.length > 10) {
const oldest = undoStack.shift();
totalSize -= oldest.byteSize;
oldest.dispose();
}
}
Performance Architecture
OffscreenCanvas + Worker
Move rendering off the main thread:
// main.ts
const offscreen = canvas.transferControlToOffscreen();
const worker = new Worker('render-worker.js');
worker.postMessage({ type: 'init', canvas: offscreen }, [offscreen]);
// render-worker.js
self.onmessage = async (e) => {
if (e.data.type === 'init') {
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
const context = e.data.canvas.getContext('webgpu');
// Render loop runs entirely in worker
}
};
From Chrome’s OffscreenCanvas documentation:
“Making canvas rendering contexts available to workers increases parallelism and makes better use of multi-core systems.”
Benefits:
- UI remains responsive during heavy rendering
- Input handling on main thread, rendering on worker
- requestAnimationFrame() works in workers
SharedArrayBuffer for Zero-Copy
Share stroke data between threads without copying:
// Shared stroke buffer, plus a shared counter for the point count
const strokeBuffer = new SharedArrayBuffer(1024 * 1024);
const strokeView = new Float32Array(strokeBuffer);
const strokeCount = new Int32Array(new SharedArrayBuffer(4));
let writeIndex = 0;
// Main thread writes input
strokeView[writeIndex++] = point.x;
strokeView[writeIndex++] = point.y;
strokeView[writeIndex++] = point.pressure;
Atomics.store(strokeCount, 0, writeIndex / 3);
// Worker reads and renders (it receives both buffers via postMessage
// and builds identical views over them)
let lastProcessed = 0; // persists across frames
const count = Atomics.load(strokeCount, 0);
for (let i = lastProcessed; i < count; i++) {
  const x = strokeView[i * 3];
  const y = strokeView[i * 3 + 1];
  const pressure = strokeView[i * 3 + 2];
  renderPoint(x, y, pressure);
}
lastProcessed = count;
Note: Requires COOP/COEP headers for SharedArrayBuffer.
Compute Shaders for Effects
Gaussian Blur
From WebGPU Fundamentals:
“2D image processing is an excellent use case for WebGPU.”
Separable blur for efficiency (two 1D passes instead of one 2D):
@group(0) @binding(0) var inputTex: texture_2d<f32>;
@group(0) @binding(1) var outputTex: texture_storage_2d<rgba8unorm, write>;
@group(0) @binding(2) var<storage> weights: array<f32>;
@compute @workgroup_size(64, 1)
fn blur_horizontal(@builtin(global_invocation_id) id: vec3<u32>) {
let size = textureDimensions(inputTex);
if (id.x >= size.x || id.y >= size.y) { return; }
var sum = vec4<f32>(0.0);
let radius = arrayLength(&weights) / 2u;
for (var i = 0u; i < arrayLength(&weights); i++) {
let offset = i32(i) - i32(radius);
let coord = vec2<i32>(i32(id.x) + offset, i32(id.y));
let clamped = clamp(coord, vec2(0), vec2<i32>(size) - 1);
sum += textureLoad(inputTex, clamped, 0) * weights[i];
}
textureStore(outputTex, vec2<i32>(id.xy), sum);
}
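Dispatching the horizontal pass from TypeScript, assuming device, width, height, and the pipeline/bind group (blurHorizontalPipeline, blurBindGroup) were created earlier:

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(blurHorizontalPipeline);
pass.setBindGroup(0, blurBindGroup);
// One invocation per pixel: workgroups of 64×1 along x, one row per y.
pass.dispatchWorkgroups(Math.ceil(width / 64), height);
pass.end();
device.queue.submit([encoder.finish()]);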
Workgroup Optimization
From Codrops tutorial:
“A general advice for WebGPU is to choose a workgroup size of 64.”
Use tile-based processing with shared memory:
var<workgroup> tile: array<vec4<f32>, 272>; // 16×17 with halo
@compute @workgroup_size(16, 16)
fn process_tile(@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(workgroup_id) group_id: vec3<u32>) {
// Load tile + halo into shared memory
// Process with fast local memory access
// Write results
}
Color Management
Display P3 Support
Modern displays support wide gamut. From WICG proposal:
const ctx = canvas.getContext('2d', { colorSpace: 'display-p3' });
ctx.fillStyle = 'color(display-p3 1 0.5 0)'; // Vivid orange outside sRGB
For WebGPU:
context.configure({
device,
format: navigator.gpu.getPreferredCanvasFormat(),
colorSpace: 'display-p3', // If supported
alphaMode: 'premultiplied'
});
ICC Profile Handling
For import/export, use jsColorEngine:
import { ColorEngine, Profile } from 'js-color-engine';
const engine = new ColorEngine();
const srgb = await Profile.fromURL('/profiles/sRGB.icc');
const p3 = await Profile.fromURL('/profiles/DisplayP3.icc');
const transform = engine.createTransform(srgb, p3);
const converted = transform.apply(imageData);
File Format Support
PSD Import/Export
@webtoon/psd is the modern choice:
- Zero dependencies
- WebAssembly acceleration
- ~100 KB minified (vs 443 KB for PSD.js)
import Psd from '@webtoon/psd';
const psd = Psd.parse(arrayBuffer);
for (const layer of psd.layers) {
console.log(layer.name, layer.opacity, layer.blendMode);
const imageData = await layer.composite();
}
Limitations:
- No CMYK support (converts to RGB)
- Some adjustment layers not supported
- Smart objects require special handling
Custom Format
For optimal performance, design a custom format:
Header:
- Magic: "GPAINT"
- Version: u32
- Canvas size: u32 × u32
- Layer count: u32
Layer Table:
- Name: string
- Blend mode: u8
- Opacity: f32
- Bounds: i32 × 4
- Tile count: u32
- Tile offsets: [u64]
Tile Data:
- Compressed RGBA (LZ4 or zstd)
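A sketch of serializing the header with a DataView. Since the format above is this document's own, the byte layout (little-endian, fixed offsets) is our assumption:

function writeHeader(width: number, height: number, layerCount: number): ArrayBuffer {
  const buf = new ArrayBuffer(6 + 4 * 4); // magic + version + size + layer count
  const bytes = new Uint8Array(buf);
  const view = new DataView(buf);
  bytes.set(new TextEncoder().encode('GPAINT'), 0); // magic
  view.setUint32(6, 1, true);           // version
  view.setUint32(10, width, true);      // canvas width
  view.setUint32(14, height, true);     // canvas height
  view.setUint32(18, layerCount, true); // layer count
  return buf;
}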
Bill of Materials
| Component | Recommendation | Alternative |
|---|---|---|
| Core language | Rust | C++ via Emscripten |
| WASM bindings | wasm-bindgen | wasm-pack |
| GPU API | wgpu | Raw WebGPU |
| Brush math | Custom (Ciallo-inspired) | — |
| Spline smoothing | kurbo (Catmull-Rom) | Custom |
| PSD support | @webtoon/psd | ag-psd |
| Color engine | jsColorEngine | Custom |
| Compression | lz4_flex, zstd | — |
| UI framework | Svelte | React |
| Build | Vite + wasm-pack | — |
Performance Targets
| Metric | Target | Notes |
|---|---|---|
| Stroke latency | <16ms | Touch to visible pixel |
| 60fps compositing | 50 layers | With blend modes |
| Canvas size | 16K × 16K | Sparse tile allocation |
| Undo depth | 100+ | Tile-based diffs |
| Initial load | <3s | Lazy load brushes |
| Memory usage | <2GB | With 4K canvas, 20 layers |
Implementation Phases
Phase 1: Core Canvas
- WebGPU context setup
- Tile-based layer system
- Basic brush (pressure → size)
- Pan/zoom with gestures
Phase 2: Brush Engine
- Stamp-based rendering
- Stroke smoothing (Catmull-Rom)
- Brush dynamics (pressure, tilt)
- Basic brush library
Phase 3: Compositing
- Blend mode compute shaders
- Layer masks
- Clipping groups
- Adjustment layers
Phase 4: Performance
- OffscreenCanvas + Worker
- Tile-based undo
- Memory management
- IndexedDB persistence
Phase 5: Polish
- PSD import/export
- Color management
- Advanced brushes
- Selection tools
Key Research Sources
Brush Rendering
- Ciallo: GPU-Accelerated Rendering of Vector Brush Strokes (SIGGRAPH 2024)
- Efficient Rendering of Linear Brush Strokes (JCGT)
- Brush Rendering Tutorial
Graphics Architecture
- Polycount: tile-based canvas rendering discussion
- Pixelitor: region-limited undo storage
- Limnu: premultiplied alpha and canvas compositing
Web APIs
- Pointer Events (coalesced and predicted events)
- OffscreenCanvas (Chrome documentation)
- WICG canvas color space proposal
WebGPU
- WebGPU Fundamentals (blending, image processing)
- Codrops: WebGPU compute shader tutorial