Training a vision model to count objects in real time — running entirely in the browser

TL;DR: I trained an object-detection model to count objects in real time, then ran it directly in the browser (webcam → count → boxes) at ~60 FPS, with images never leaving the device. Inference is fully client-side via WebGPU. This post is about the product decisions along the way — try the live demo at the end.

Context — JTBD

When: I need a tool to count objects in real time through a camera (e.g. counting items on a tray) — accurate, fast, and private (images never leave the user’s machine).

I want: point the camera → see the count update instantly, run it on the plain web, no heavy app or GPU rig.

So that: “count by hand, eye-straining, error-prone” becomes “the machine counts, a human confirms”, i.e. proper human-in-the-loop.

The real problem: a stock model doesn’t know your object

Everyone’s first instinct: “just call an image-recognition API”. But a general model (the 80-class COCO kind) only knows people, cars, bottles, laptops and the like; it has no idea what your custom object is. Point it at your thing and it either misses it or mislabels it as the nearest known class.

→ Lesson #1: pretrained ≠ a solution for custom objects. To count your thing correctly, you almost always have to fine-tune on labelled images of it.

And the real “hard part” isn’t “can it detect at all” (the happy-path demo is easy) — it’s overlapping / touching objects (occlusion), the #1 source of counting errors. A pretty number on a demo ≠ accuracy in the wild.

Options, and why I dropped two of them

Cloud recognition API → doesn’t know your custom object + sends images off-device (privacy loss).
Train from scratch → needs lots of data and compute, overkill for one narrow need.
Fine-tune a small detector (YOLO nano) from existing weights ✅ → light, fast, good enough. This is the path I took.

Product decisions along the way

1. Detection or segmentation? Segmentation (pixel masks) separates touching objects better, but it’s heavier and harder to run in real time. For live counting through a camera, I chose detection (bounding boxes) because it is fast enough to hold a usable frame rate. When precision matters, a hybrid “freeze the frame → process once” step beats forcing segmentation onto every frame.
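
To make the detection route concrete, here is a minimal sketch of how a count can be derived from bounding boxes: drop low-confidence boxes, then apply greedy non-maximum suppression (NMS) so the same object is not counted twice. The `Box` shape, the function names, and the 0.5 / 0.45 thresholds are my illustrative assumptions, not the values used in the demo.

```typescript
// Hypothetical box shape: corner coordinates in pixels plus a confidence score.
interface Box {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
  score: number;
}

// Intersection-over-union of two axis-aligned boxes, in [0, 1].
function iou(a: Box, b: Box): number {
  const w = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
  const h = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
  const inter = w * h;
  const areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
  const areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
  const union = areaA + areaB - inter;
  return union > 0 ? inter / union : 0;
}

// Greedy NMS: visit boxes from highest to lowest score and keep a box only if
// it does not overlap an already-kept box by more than maxIou.
function countObjects(boxes: Box[], minScore = 0.5, maxIou = 0.45): number {
  const candidates = boxes
    .filter((b) => b.score >= minScore)
    .sort((a, b) => b.score - a.score);
  const kept: Box[] = [];
  for (const box of candidates) {
    if (kept.every((k) => iou(k, box) <= maxIou)) {
      kept.push(box);
    }
  }
  return kept.length;
}
```

This also shows why touching objects hurt: two real objects whose boxes overlap by more than `maxIou` collapse into one, so the count comes out too low.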

2. Measure FPS before investing. Before training anything, I built a minimal spike running a nano model live in the browser and measured real FPS (~60 with WebGPU vs ~4–5 falling back to CPU). This is the decision gate: if real-time isn’t smooth, the whole approach is wrong — better to know early, not after training.
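
A minimal sketch of what such an FPS gate could look like: collect one timestamp per processed frame, compute the average rate, and compare it to a floor. The function names and the 30 FPS floor are assumptions for illustration; the post does not state the exact threshold I used.

```typescript
// Average FPS over a list of frame timestamps in milliseconds,
// e.g. one performance.now() reading taken after each inference finishes.
function averageFps(timestampsMs: number[]): number {
  if (timestampsMs.length < 2) return 0;
  const first = timestampsMs[0] ?? 0;
  const last = timestampsMs[timestampsMs.length - 1] ?? 0;
  const elapsedMs = last - first;
  if (elapsedMs <= 0) return 0;
  // N timestamps span N - 1 frame intervals.
  return ((timestampsMs.length - 1) * 1000) / elapsedMs;
}

// Hypothetical go/no-go gate; minFps = 30 is an assumed floor.
function passesFpsGate(timestampsMs: number[], minFps = 30): boolean {
  return averageFps(timestampsMs) >= minFps;
}
```

Take the timestamps after inference completes rather than from the render loop alone, otherwise the number reflects how fast the page repaints, not how fast the model runs.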

3. One GPU is enough. Fine-tuning a nano model on ~8,000 images took tens of minutes on a single GPU. No cluster needed. A speed tip: cache the images and let the framework auto-pick the largest batch size that fits in memory; with a small model the bottleneck is usually data loading, not the GPU.

4. Run on-device, not on a server. Inference runs client-side with onnxruntime-web + WebGPU → images stay on the device (privacy), no server hop, still ~60 FPS. For data-sensitive use cases this is a big difference from “upload to an API”.
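
As a rough sketch of the client-side setup, the snippet below picks execution providers for onnxruntime-web: WebGPU first when the browser exposes `navigator.gpu`, with WASM (CPU) as the fallback. The helper name, the `NavigatorLike` type, and `modelUrl` are hypothetical placeholders, not code from the demo.

```typescript
// Minimal structural type so the sketch compiles without WebGPU type definitions.
interface NavigatorLike {
  gpu?: unknown;
}

// Prefer WebGPU when the browser exposes navigator.gpu;
// always keep WASM (CPU) as the fallback provider.
function pickExecutionProviders(nav: NavigatorLike): string[] {
  return nav.gpu !== undefined ? ["webgpu", "wasm"] : ["wasm"];
}

// In the browser, the result would feed onnxruntime-web roughly like this
// (modelUrl is a placeholder for wherever the exported .onnx file is hosted):
//
//   const session = await ort.InferenceSession.create(modelUrl, {
//     executionProviders: pickExecutionProviders(navigator),
//   });
```

The presence of `navigator.gpu` does not guarantee a usable adapter, so keeping the WASM entry in the list matters: it is the slow path, but it keeps the page working.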

5. Don’t automate what you can’t control. Because occlusion is still the weak spot, “near-perfect accuracy” does not come from the model alone — it comes from controlling the input (spread objects in a single layer, contrasting background) + a human confirming. The model handles the easy part; capture rules + human-in-the-loop handle the hard part.
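
One way to wire the human-in-the-loop idea into code is to attach a "needs a closer look" flag to each count, raised when boxes overlap or a detection is low-confidence. This is a sketch under my own assumptions: the `Detection` shape, the function names, and both thresholds (0.6 and 0.1) are illustrative, not taken from the demo.

```typescript
// Hypothetical detection shape; field names are illustrative.
interface Detection {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
  score: number;
}

interface CountResult {
  count: number;
  needsReview: boolean;
  reasons: string[];
}

// Intersection-over-union of two axis-aligned boxes, in [0, 1].
function overlap(a: Detection, b: Detection): number {
  const w = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
  const h = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
  const inter = w * h;
  const union =
    (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return union > 0 ? inter / union : 0;
}

// Report the count, and flag the frame when any detection is uncertain
// or any two boxes overlap (a hint that objects are touching or stacked).
function reviewCount(
  dets: Detection[],
  uncertainBelow = 0.6,
  touchingAbove = 0.1,
): CountResult {
  const reasons: string[] = [];
  if (dets.some((d) => d.score < uncertainBelow)) {
    reasons.push("low-confidence detection");
  }
  const crowded = dets.some((a, i) =>
    dets.slice(i + 1).some((b) => overlap(a, b) > touchingAbove),
  );
  if (crowded) {
    reasons.push("overlapping boxes");
  }
  return { count: dets.length, needsReview: reasons.length > 0, reasons };
}
```

A flagged frame is a natural place to prompt the user to spread the objects out and recount, which is the capture rule doing the work the model cannot.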

5 takeaways (PM lens)

A pretrained model doesn’t know your object → fine-tuning is required, but fast (one GPU, tens of minutes).
Real-time ⇒ pick a light model (detection) and make FPS an early gate; don’t commit to segmentation and then discover it lags.
On-device (WebGPU) = privacy + no server hop, and a nano model hits ~60 FPS — you don’t always need a backend.
A demo ≠ real accuracy: occlusion and domain gap (real-world images that differ from the training set) are where it breaks → test beyond the happy path and keep a human verifying the count.
Lean first: one model + one working demo, then expand. Don’t build a “platform” before the core runs.

Try it

I put the model on the web so you can play with it directly — point your webcam at an object and watch it count in real time (everything runs in the browser, images are never sent anywhere):

👉 Try the real-time counter

Open in Chrome/Edge for WebGPU; the first load fetches the model (a few seconds), then it’s smooth.