
How AI calorie counting actually works (and where it breaks)

A look under the hood of photo-based macro estimation — why whole-plate vision drifts ±25%, why per-component decomposition tightens that to ±10–15%, and the failure modes that no AI calorie counter has solved yet.

By Elie de Rougemont, Founder of Forky AI · 9 min read

If you ask GPT-4o 'how many calories in this photo of a plate?', it will guess. The guess will be within ±25% on a good day and ±60% on a bad one. AI calorie counting only works when you stop asking the question that way.

The whole-plate question is wrong

The naive prompt for an AI calorie counter is: "How many calories are in this plate?" Every consumer-facing app that shipped on this prompt — and most of them did — discovered the same failure mode within a week of TestFlight: the model is plausibly fluent but quantitatively unreliable. It sees a bowl of pasta with cream sauce and writes "around 600 calories". The bowl actually contained 950. The cream-and-cheese topping was visible in the photo. The model averaged.

Whole-plate estimation drifts because the model is being asked to do two things at once: identify every component AND compute the total. It compresses the identification step. The components it acknowledges in its written answer match the photo; the components it actually prices into the calorie number do not.

The fix is decomposition

Forky AI's pipeline asks the model three separate questions in one structured prompt:

1. What distinct components are on the plate?
2. What is layered on top of the components just listed (sauces, toppings)?
3. How many grams of each component are visible?

We then look up per-100g values for each component, compute per-component macros as grams × per_100g ÷ 100, and sum across components. The model never sees the final number; we compute it.
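A minimal sketch of that arithmetic, assuming the vision passes return a component list with estimated grams and the per-100g values come from a lookup table. The component names, numbers, and data shapes here are illustrative, not Forky AI's actual schema:

```python
# Illustrative sketch only: names, numbers, and shapes are assumptions,
# not Forky AI's production schema.

# Per-100g macro values, as they might come from a USDA-style lookup.
PER_100G = {
    "penne, cooked":    {"kcal": 157, "protein": 5.8,  "carbs": 30.9, "fat": 0.9},
    "cream sauce":      {"kcal": 400, "protein": 3.0,  "carbs": 5.0,  "fat": 41.0},
    "parmesan, grated": {"kcal": 420, "protein": 38.0, "carbs": 4.0,  "fat": 28.0},
}

# What the vision passes might return: component name plus estimated grams.
components = [
    {"name": "penne, cooked",    "grams": 250},
    {"name": "cream sauce",      "grams": 100},
    {"name": "parmesan, grated", "grams": 30},
]

def macros_for(component: dict) -> dict:
    """grams × per_100g ÷ 100, for every macro of one component."""
    per_100g = PER_100G[component["name"]]
    factor = component["grams"] / 100
    return {key: value * factor for key, value in per_100g.items()}

# Sum across components. The model never produces this number; we do.
totals = {"kcal": 0.0, "protein": 0.0, "carbs": 0.0, "fat": 0.0}
for component in components:
    for key, value in macros_for(component).items():
        totals[key] += value

print(totals)
```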

Why the missing-topping bug is the single biggest accuracy hit

Empirically, the most common reason a calorie-counter underestimates is not that it gets the portion wrong — it's that it doesn't acknowledge a topping that exists in the photo. A cream sauce on pasta is 100g × ~400 kcal/100g = 400 calories on its own, often more than the pasta itself. Missing it means the estimate halves.

Pass 2 is the entire point of the 3-pass prompt. It exists to make the model look at the same image a second time with a different question. The naive prompt asks "what's on the plate"; the toppings prompt asks "what is layered on top of what you just listed". The model catches things in pass 2 that it missed in pass 1, every single time.
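For concreteness, here is roughly what the three passes could look like when phrased as one structured prompt. This is a paraphrase of the passes described above, not the production prompt, and the output schema is an assumption:

```python
# A paraphrase of the 3-pass structure, not Forky AI's production prompt.
THREE_PASS_PROMPT = """
Pass 1: List every distinct food component visible on the plate.
Pass 2: For each component from pass 1, list anything layered on top of it
        (sauces, dressings, melted cheese, cooking oil) that you have not
        already listed.
Pass 3: For every item from passes 1 and 2, estimate the visible portion
        size in grams.

Return only JSON: [{"name": "...", "grams": 0}, ...]. Do not estimate calories.
"""
```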

How accurate is this, really

On standard plated meals where every component is visible and identifiable, the per-component approach lands in the ±10–15% range. That's about as good as a trained human nutritionist eyeballing the same photo would do. Below that, you need an actual food scale.

Above that — meaning worse than ±15% — sit the mixed dishes: soups, stews, and curries, where most of the ingredients are hidden below the surface and there is little for the model to decompose.

What makes the per-component approach better than single-photo apps

The single-photo apps in the category (Cal AI, Lose It's Snap It, MyFitnessPal's Meal Scan) all ship some variant of the whole-plate prompt. Their vendor-reported accuracy hovers around ±20%. The difference is structural — they ask the model for one number; we ask it for a structured list and compute the number ourselves.

The user-visible consequence is small but real: in Forky AI, every macro estimate is editable per-component. Tap "cheese, 30g", change to 50g, watch the total recompute. The component breakdown is what gives the user a handle on the AI's guess. Without it, the user is staring at a single number with no idea what to argue with.
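A small sketch of why that editability falls out of the breakdown for free: once the total is a sum over components, a per-component edit is just a recompute. Names and values below are illustrative only:

```python
# Illustrative only: an edited component simply re-enters the same sum.
PER_100G_KCAL = {"pasta, cooked": 157, "cream sauce": 400, "cheese, grated": 420}

components = [
    {"name": "pasta, cooked",  "grams": 250},
    {"name": "cream sauce",    "grams": 100},
    {"name": "cheese, grated", "grams": 30},
]

def total_kcal(components: list) -> float:
    return sum(c["grams"] * PER_100G_KCAL[c["name"]] / 100 for c in components)

print(total_kcal(components))    # estimate shown with "cheese, 30g"
components[2]["grams"] = 50      # user taps the cheese row and edits 30g -> 50g
print(total_kcal(components))    # total recomputes from the edited breakdown
```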

"The number isn't the answer. The breakdown is the answer. The number is just a sum."

The honest accuracy claim

Forky AI's current shipping pipeline runs GPT-4o with the 3-pass component prompt and USDA-style per-100g lookups. On photos of standard plated meals — single plate, every component visible — accuracy lands at ±10–15% on our internal sanity checks. On mixed dishes (soups, stews, curries), accuracy widens to ±25%, and we surface a confidence dot in the UI to flag it.

This is in the same neighbourhood as a human nutritionist eyeballing a photo. It is not in the same neighbourhood as a kitchen scale. For most users, that's the right trade — the kitchen scale is too high friction for daily logging, and "±15% three meals a day" is plenty for hitting weekly macro targets.
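On the confidence dot mentioned above: a minimal sketch of how a flag like that could be derived from the dish classification. The keywords and bands here are assumptions, not the shipping heuristic:

```python
# Assumption-only sketch of a confidence flag; not the shipping heuristic.
MIXED_DISH_KEYWORDS = ("soup", "stew", "curry", "chili", "casserole")

def confidence_for(dish_name: str) -> str:
    """'high' ~ ±10–15% (plated, components visible); 'low' ~ ±25% (mixed dish)."""
    name = dish_name.lower()
    if any(keyword in name for keyword in MIXED_DISH_KEYWORDS):
        return "low"   # ingredients hidden below the surface
    return "high"      # every component visible and priced individually

print(confidence_for("chicken katsu curry"))              # "low"
print(confidence_for("grilled salmon, rice, broccoli"))   # "high"
```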

What's next

We're working on two extensions: (1) using the user's editing history as a personal prior — if you systematically bump every "cheese, 30g" to 60g, the model should learn — and (2) cross-checking the component list against a depth map from the iPhone's LiDAR sensor on Pro models to get a real portion-size signal instead of relying on a visual prior.
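Neither exists yet, so the following is only a thought sketch of the first idea: mine the edit history for a per-component scale factor and apply it before showing the estimate. Every name and number here is hypothetical:

```python
# Hypothetical sketch of the "personal prior" idea; nothing like this ships today.
from collections import defaultdict
from statistics import median

# Edit history entries: (component name, AI-estimated grams, user-corrected grams).
edit_history = [
    ("cheese, grated", 30, 60),
    ("cheese, grated", 30, 55),
    ("cheese, grated", 25, 50),
]

def personal_scale_factors(history: list) -> dict:
    """Median ratio of corrected grams to estimated grams, per component."""
    ratios = defaultdict(list)
    for name, estimated, corrected in history:
        ratios[name].append(corrected / estimated)
    return {name: median(values) for name, values in ratios.items()}

priors = personal_scale_factors(edit_history)
print(priors)  # {'cheese, grated': 2.0} -> pre-scale future cheese estimates
```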

Neither is shipping yet. We'll write about both when they are.