WebLLM — A Browser-Based Vision-Language Model

The challenge

Almost every AI assistant that can “see your screen” works by shipping whatever you’re looking at to a remote server. For anything sensitive on screen, that’s a real privacy problem, and it also makes the tool useless offline. I wanted to answer a concrete question: can a modern vision-language model run entirely client-side, inside a browser extension and on the user’s own GPU, so that screenshots and prompts never leave the machine at all?

My contribution

I built a Chrome extension that runs a vision-language model (Gemma) fully client-side using WebGPU and the ONNX Runtime, with no backend server. It supports on-screen visual Q&A — you ask a question about what’s currently on your screen — and plain text chat. I architected it as a Manifest V3 extension with three moving parts working together: an offscreen-document runtime hosting the model, adaptive image compression and resolution tiering to fit screenshots through the model efficiently, and token-by-token streaming inference so answers appear as they generate. The whole thing is built for local privacy — no telemetry, and no persisted prompts or screenshots.

Key decisions

Hosting the model in an offscreen document was the decision that made Manifest V3 (MV3) workable at all. MV3 service workers are too short-lived and constrained to hold a model in memory: Chrome shuts them down after a short idle period, and anything they held in memory goes with them. So I moved the model into an offscreen document as the persistent runtime and let the service worker act as the coordinator. Without that split, the extension simply couldn’t exist under current Chrome rules.
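To make the split concrete, here is a minimal sketch of the coordinator side: a guard that creates the offscreen document at most once, even when several requests arrive together. The OffscreenApi interface and the makeEnsureOffscreen name are my own illustration, not the extension’s actual code; in a real service worker the two methods would presumably wrap chrome.runtime.getContexts and chrome.offscreen.createDocument.

```typescript
// Hypothetical abstraction over the extension APIs, so the logic can be read
// and exercised outside Chrome. In a real service worker these would likely
// wrap chrome.runtime.getContexts and chrome.offscreen.createDocument.
interface OffscreenApi {
  hasDocument(): Promise<boolean>;
  createDocument(url: string): Promise<void>;
}

// Returns a function that ensures the offscreen document exists.
// Concurrent callers share one in-flight promise, so the document is
// never created twice by overlapping requests.
function makeEnsureOffscreen(
  api: OffscreenApi,
  url: string
): () => Promise<void> {
  let pending: Promise<void> | null = null;

  async function createIfMissing(): Promise<void> {
    if (!(await api.hasDocument())) {
      await api.createDocument(url);
    }
  }

  return () => {
    if (pending !== null) {
      return pending; // a check or creation is already in flight
    }
    const inFlight = createIfMissing();
    pending = inFlight;
    // Clear the cache once settled so a later call re-checks, in case the
    // document was closed in the meantime.
    const clear = () => {
      pending = null;
    };
    inFlight.then(clear, clear);
    return inFlight;
  };
}
```

The service worker would call the returned function before forwarding any request to the model runtime. Because the worker can be restarted at any time, the guard always re-checks for an existing document instead of trusting its own earlier state.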

Adaptive image compression and resolution tiering were the second necessity. Raw screenshots are large; fed at full resolution, they either blow past the model’s input limits or slow inference to a crawl. Stepping resolution down adaptively let me trade detail for speed where the detail wasn’t needed.
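Resolution tiering can be sketched as a small pure function: try each tier from largest to smallest and keep the first one whose pixel count fits a budget. The tier values, the pixel budget, and the function names below are assumptions for illustration, not the numbers the extension uses.

```typescript
interface Size {
  width: number;
  height: number;
}

// Hypothetical tiers: maximum long-edge length in pixels, largest first.
const TIERS: readonly number[] = [1536, 1024, 768, 512];

// Scale a size so its long edge fits maxEdge, preserving aspect ratio.
// Images that already fit are returned unchanged (no upscaling).
function fitLongEdge(size: Size, maxEdge: number): Size {
  const longEdge = Math.max(size.width, size.height);
  if (longEdge <= maxEdge) {
    return { width: size.width, height: size.height };
  }
  const scale = maxEdge / longEdge;
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}

// Pick the largest tier whose resulting pixel count stays within maxPixels.
// If none fits, fall back to the smallest tier.
function pickTier(
  size: Size,
  maxPixels: number,
  tiers: readonly number[] = TIERS
): Size {
  for (const edge of tiers) {
    const fitted = fitLongEdge(size, edge);
    if (fitted.width * fitted.height <= maxPixels) {
      return fitted;
    }
  }
  return fitLongEdge(size, tiers[tiers.length - 1]);
}
```

With these assumed tiers, a 1920×1080 screenshot under a one-megapixel budget comes out at 1024×576. In the browser, the chosen size could then be drawn onto a canvas and re-encoded, for example as JPEG at a reduced quality setting, which is where the compression half of the trade-off would come in.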

Third, I chose streaming inference deliberately. A local model is slower than a datacenter one (that’s the price of privacy), and token-by-token output makes the wait feel responsive instead of frozen. It’s a UX decision as much as a technical one.
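On the receiving side, token-by-token streaming can be as simple as appending each token and re-rendering right away. The sketch below assumes the model runtime sends one message per decoded token; the message shapes and the makeStreamHandler name are hypothetical.

```typescript
// Hypothetical message shapes sent from the model runtime to the UI.
type StreamMessage =
  | { kind: "token"; text: string }
  | { kind: "done" }
  | { kind: "error"; reason: string };

// Called after every message with the answer so far.
type Render = (textSoFar: string, finished: boolean) => void;

// Builds a handler that accumulates tokens and re-renders after each one,
// so the user sees the answer grow instead of waiting for the full reply.
function makeStreamHandler(render: Render): (msg: StreamMessage) => void {
  let text = "";
  let finished = false;

  return (msg: StreamMessage): void => {
    if (finished) {
      return; // ignore anything that arrives after the stream closed
    }
    switch (msg.kind) {
      case "token":
        text += msg.text;
        break;
      case "done":
        finished = true;
        break;
      case "error":
        text += "\n[generation stopped: " + msg.reason + "]";
        finished = true;
        break;
    }
    render(text, finished);
  };
}
```

A popup or side panel could register the returned handler as its message listener and pass a render callback that updates the answer element. Keeping the accumulated text inside the handler means a partial answer stays visible even if generation stops with an error.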

The outcome

The extension runs a vision-language model end to end in the browser, with zero server calls — a working demonstration that private, local, on-screen visual Q&A is possible on ordinary consumer hardware.

Building it taught me the real constraints of on-device inference — memory budgets, quantization, the WebGPU and ONNX toolchain, and the Manifest V3 sandbox. That’s a different and harder skill set than calling a hosted API, and it’s the part I’m proudest of.