
How AI calorie counting actually works (and where it breaks)

A look under the hood of photo-based macro estimation — why whole-plate vision drifts ±25%, why per-component decomposition tightens that to ±10–15%, and the failure modes that no AI calorie counter has solved yet.

By Elie de Rougemont, Founder of Forky AI · 9 min read

If you ask GPT-4o 'how many calories in this photo of a plate?', it will guess. The guess will be within ±25% on a good day and ±60% on a bad one. AI calorie counting only works when you stop asking the question that way.

The whole-plate question is wrong

The naive prompt for an AI calorie counter is: "How many calories are in this plate?" Every consumer-facing app that shipped on this prompt — and most of them did — discovered the same failure mode within a week of TestFlight: the model is plausibly fluent but quantitatively unreliable. It sees a bowl of pasta with cream sauce and writes "around 600 calories". The bowl actually contained 950. The cream-and-cheese topping was visible in the photo. The model averaged.

Whole-plate estimation drifts because the model is being asked to do two things at once: identify every component AND compute the total. It compresses the identification step. The components it acknowledges in its written answer match the photo; the components it actually prices into the calorie number do not.

The fix is decomposition

Forky AI's pipeline asks the model three separate questions in one structured prompt:

1. What distinct components are on the plate?
2. What is layered on top of the components just listed (sauces, toppings)?
3. How many grams of each component are visible?

We then look up per-100g values for each component, compute per-component macros as grams × per_100g ÷ 100, and sum across components. The model never sees the final number; we compute it.
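A minimal sketch of that arithmetic, assuming the vision passes return a component list with estimated grams and the per-100g values come from a lookup table. The component names, numbers, and data shapes here are illustrative, not Forky AI's actual schema:

```python
# Illustrative sketch only: names, numbers, and shapes are assumptions,
# not Forky AI's production schema.

# Per-100g macro values, as they might come from a USDA-style lookup.
PER_100G = {
    "penne, cooked":    {"kcal": 157, "protein": 5.8,  "carbs": 30.9, "fat": 0.9},
    "cream sauce":      {"kcal": 400, "protein": 3.0,  "carbs": 5.0,  "fat": 41.0},
    "parmesan, grated": {"kcal": 420, "protein": 38.0, "carbs": 4.0,  "fat": 28.0},
}

# What the vision passes might return: component name plus estimated grams.
components = [
    {"name": "penne, cooked",    "grams": 250},
    {"name": "cream sauce",      "grams": 100},
    {"name": "parmesan, grated", "grams": 30},
]

def macros_for(component: dict) -> dict:
    """grams × per_100g ÷ 100, for every macro of one component."""
    per_100g = PER_100G[component["name"]]
    factor = component["grams"] / 100
    return {key: value * factor for key, value in per_100g.items()}

# Sum across components. The model never produces this number; we do.
totals = {"kcal": 0.0, "protein": 0.0, "carbs": 0.0, "fat": 0.0}
for component in components:
    for key, value in macros_for(component).items():
        totals[key] += value

print(totals)
```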

Why the missing-topping bug is the single biggest accuracy hit

Empirically, the most common reason a calorie-counter underestimates is not that it gets the portion wrong — it's that it doesn't acknowledge a topping that exists in the photo. A cream sauce on pasta is 100g × ~400 kcal/100g = 400 calories on its own, often more than the pasta itself. Missing it means the estimate halves.

Pass 2 is the entire point of the 3-pass prompt. It exists to make the model look at the same image a second time with a different question. The naive prompt asks "what's on the plate"; the toppings prompt asks "what is layered on top of what you just listed". The model catches things in pass 2 that it missed in pass 1, every single time.
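For concreteness, here is roughly what the three passes could look like when phrased as one structured prompt. This is a paraphrase of the passes described above, not the production prompt, and the output schema is an assumption:

```python
# A paraphrase of the 3-pass structure, not Forky AI's production prompt.
THREE_PASS_PROMPT = """
Pass 1: List every distinct food component visible on the plate.
Pass 2: For each component from pass 1, list anything layered on top of it
        (sauces, dressings, melted cheese, cooking oil) that you have not
        already listed.
Pass 3: For every item from passes 1 and 2, estimate the visible portion
        size in grams.

Return only JSON: [{"name": "...", "grams": 0}, ...]. Do not estimate calories.
"""
```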

How accurate is this, really

On standard plated meals where every component is visible and identifiable, the per-component approach lands in the ±10–15% range. That's about as good as a trained human nutritionist eyeballing the same photo would do. Below that, you need an actual food scale.

Above that — meaning worse than ±15% — sit the mixed dishes: soups, stews, and curries, where most of the ingredients are hidden below the surface and there is little for the model to decompose.

What makes the per-component approach better than single-photo apps

The single-photo apps in the category (Cal AI, Lose It's Snap It, MyFitnessPal's Meal Scan) all ship some variant of the whole-plate prompt. Their vendor-reported accuracy hovers around ±20%. The difference is structural — they ask the model for one number; we ask it for a structured list and compute the number ourselves.

The user-visible consequence is small but real: in Forky AI, every macro estimate is editable per-component. Tap "cheese, 30g", change to 50g, watch the total recompute. The component breakdown is what gives the user a handle on the AI's guess. Without it, the user is staring at a single number with no idea what to argue with.
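A small sketch of why that editability falls out of the breakdown for free: once the total is a sum over components, a per-component edit is just a recompute. Names and values below are illustrative only:

```python
# Illustrative only: an edited component simply re-enters the same sum.
PER_100G_KCAL = {"pasta, cooked": 157, "cream sauce": 400, "cheese, grated": 420}

components = [
    {"name": "pasta, cooked",  "grams": 250},
    {"name": "cream sauce",    "grams": 100},
    {"name": "cheese, grated", "grams": 30},
]

def total_kcal(components: list) -> float:
    return sum(c["grams"] * PER_100G_KCAL[c["name"]] / 100 for c in components)

print(total_kcal(components))    # estimate shown with "cheese, 30g"
components[2]["grams"] = 50      # user taps the cheese row and edits 30g -> 50g
print(total_kcal(components))    # total recomputes from the edited breakdown
```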

"The number isn't the answer. The breakdown is the answer. The number is just a sum."

The honest accuracy claim

Forky AI's current shipping pipeline runs GPT-4o with the 3-pass component prompt and USDA-style per-100g lookups. On photos of standard plated meals — single plate, every component visible — accuracy lands at ±10–15% on our internal sanity checks. On mixed dishes (soups, stews, curries), accuracy widens to ±25%, and we surface a confidence dot in the UI to flag it.

This is in the same neighbourhood as a human nutritionist eyeballing a photo. It is not in the same neighbourhood as a kitchen scale. For most users, that's the right trade — the kitchen scale is too high friction for daily logging, and "±15% three meals a day" is plenty for hitting weekly macro targets.
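On the confidence dot mentioned above: a minimal sketch of how a flag like that could be derived from the dish classification. The keywords and bands here are assumptions, not the shipping heuristic:

```python
# Assumption-only sketch of a confidence flag; not the shipping heuristic.
MIXED_DISH_KEYWORDS = ("soup", "stew", "curry", "chili", "casserole")

def confidence_for(dish_name: str) -> str:
    """'high' ~ ±10–15% (plated, components visible); 'low' ~ ±25% (mixed dish)."""
    name = dish_name.lower()
    if any(keyword in name for keyword in MIXED_DISH_KEYWORDS):
        return "low"   # ingredients hidden below the surface
    return "high"      # every component visible and priced individually

print(confidence_for("chicken katsu curry"))              # "low"
print(confidence_for("grilled salmon, rice, broccoli"))   # "high"
```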

What's next

We're working on two extensions: (1) using the user's editing history as a personal prior — if you systematically bump every "cheese, 30g" to 60g, the model should learn — and (2) cross-checking the component list against a depth map from the iPhone's LiDAR sensor on Pro models to get a real portion-size signal instead of relying on a visual prior.
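Neither exists yet, so the following is only a thought sketch of the first idea: mine the edit history for a per-component scale factor and apply it before showing the estimate. Every name and number here is hypothetical:

```python
# Hypothetical sketch of the "personal prior" idea; nothing like this ships today.
from collections import defaultdict
from statistics import median

# Edit history entries: (component name, AI-estimated grams, user-corrected grams).
edit_history = [
    ("cheese, grated", 30, 60),
    ("cheese, grated", 30, 55),
    ("cheese, grated", 25, 50),
]

def personal_scale_factors(history: list) -> dict:
    """Median ratio of corrected grams to estimated grams, per component."""
    ratios = defaultdict(list)
    for name, estimated, corrected in history:
        ratios[name].append(corrected / estimated)
    return {name: median(values) for name, values in ratios.items()}

priors = personal_scale_factors(edit_history)
print(priors)  # {'cheese, grated': 2.0} -> pre-scale future cheese estimates
```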

Neither is shipping yet. We'll write about both when they are.