AI calorie counter accuracy: 30 plates, three apps, USDA ground truth
We photographed 30 standard plated meals, scored each one with Forky AI, Cal AI, and Snap Calorie, and compared the results against USDA-derived ground-truth values. Per-component vision wins on plates with toppings; single-photo estimators tie on simple foods. The raw spreadsheet is available on request (details at the end of this post).
There is no public benchmark for how accurate AI calorie counters actually are. Every vendor reports its own number on its own dataset, and the numbers are almost certainly cherry-picked. We built our own dataset, ran the same photos through three apps under the same conditions, and made the spreadsheet available. The headline: single-photo vision drifts ±20–25% on plated meals; per-component decomposition lands ±10–15%; on simple foods (a single apple, a glass of milk) all three apps perform equivalently.
Why this benchmark exists
Every AI calorie counter on the market makes accuracy claims that sound the same and prove nothing. "±20% on standard meals." "Within 10% of USDA values." "Industry-leading vision accuracy." None of these come with a methodology, a test set, or a re-runnable harness. As the team shipping one of those apps, we wanted to stop bluffing and write down what we actually see.
So we built a 30-plate test set and photographed each plate twice (overhead and 45°) on the same iPhone 15 Pro under the same kitchen light. We fed each photo to Forky AI, Cal AI, and Snap Calorie inside their consumer apps, then compared the returned calorie and macro values to a USDA-derived ground truth, which we computed by weighing every component on a 0.1 g kitchen scale before plating.
Method, in 10 bullets
- 30 plates. Sampled across breakfast (eggs, oatmeal bowls), lunch (salads, sandwiches, grain bowls), and dinner (pasta, stir-fries, plated proteins). Skewed toward home cooking — not restaurant plating, not packaged food.
- Three apps. Forky AI (build 1.0.4), Cal AI (build 5.2.1), Snap Calorie (build 2.8). All three on iOS, same iPhone 15 Pro, same network conditions.
- Photos. Two per plate — top-down (90°) and three-quarter (45°). Same north-facing kitchen window light. White ceramic plate background. No food styling.
- Ground truth. Each component was weighed on a 0.1g-precision scale (Brifit BS01) before assembly. Macros were computed from USDA SR-Legacy and FoodData Central via the foodb-pythoneer library — no human judgement on the macro side.
- Logging path. Each app was used on the photo-first flow only. We did not fall back to barcode, search, or recipe import — the test is specifically about photo vision accuracy.
- Best estimate. When an app offered a per-component breakdown (Forky), we accepted its initial estimate without editing. The point is to test the model's call, not our ability to correct it.
- Metric. Absolute percentage error (APE) vs ground truth, computed per plate for calories and for each macro (protein, carbs, fat). Reported as median and 95th percentile.
- Order randomised. Each plate was scanned by all three apps in a randomly shuffled order to remove battery or thermal-throttling effects.
- Re-run on day 7. Every test was repeated a week later from the same photo file to check run-to-run variance. Reported below.
- Spreadsheet available. Raw plate-by-plate results are in a CSV; see the end of this post for how to get a copy. You can re-run the analysis, fork the dataset, or call out methodology issues. Please do.
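The ground-truth and metric steps above can be sketched in a few lines of Python. This is a minimal illustration, not the harness used for the benchmark: the `Component` class, the function names, and the per-100 g values are all hypothetical placeholders, and the app estimate is made up.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One weighed ingredient. Per-100 g values here are illustrative placeholders."""
    name: str
    grams: float
    kcal_per_100g: float
    protein_per_100g: float
    carbs_per_100g: float
    fat_per_100g: float


def plate_totals(components):
    """Scale each per-100 g value by the weighed grams and sum over the plate."""
    totals = {"calories": 0.0, "protein": 0.0, "carbs": 0.0, "fat": 0.0}
    for c in components:
        factor = c.grams / 100.0
        totals["calories"] += c.kcal_per_100g * factor
        totals["protein"] += c.protein_per_100g * factor
        totals["carbs"] += c.carbs_per_100g * factor
        totals["fat"] += c.fat_per_100g * factor
    return totals


def ape(estimate, truth_value):
    """Absolute percentage error of one estimate against its ground-truth value."""
    if truth_value == 0:
        raise ValueError("APE is undefined when the ground-truth value is zero")
    return abs(estimate - truth_value) / truth_value * 100.0


# Hypothetical two-component plate (weights and nutrient values are made up).
plate = [
    Component("chicken breast", 150.0, 165.0, 31.0, 0.0, 3.6),
    Component("cooked rice", 200.0, 130.0, 2.7, 28.0, 0.3),
]
truth = plate_totals(plate)

# Hypothetical values returned by an app for the same photo.
app_estimate = {"calories": 450.0, "protein": 45.0, "carbs": 50.0, "fat": 5.0}
errors = {key: ape(app_estimate[key], truth[key]) for key in truth}

for key, value in errors.items():
    print(f"{key}: {value:.1f}% APE")
```

Note that APE is undefined when a ground-truth value is zero (for example, carbs on a plain protein), so a real harness would need a rule for those cases; the sketch simply raises an error.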
Headline numbers
Across 30 plates, median absolute percentage error on calories was:
| App | Median APE | P95 APE | Plates ±20% |
|---|---|---|---|
| Forky AI | 12.8% | 27.4% | 26 / 30 |
| Cal AI | 22.1% | 48.0% | 18 / 30 |
| Snap Calorie | 24.3% | 52.6% | 16 / 30 |
"APE" = absolute percentage error vs USDA ground truth. "Plates ±20%" = number of plates where the app's calorie estimate landed within ±20% of ground truth, the threshold most AI macro trackers use for "good enough" daily tracking.
What the per-plate data actually says
Simple plates: everyone wins
On the 8 plates we labelled "simple" — a single piece of fruit, a glass of milk, a plain omelette, a bowl of plain oatmeal — all three apps landed within ±10% of ground truth. The models recognise the food, recall its per-100g macros from training data, and estimate the portion within reason. There is no meaningful product differentiation here.
Composed plates with hidden components: per-component wins
On the 16 plates we labelled "composed" — pasta with cream sauce, grain bowls with multiple toppings, sandwiches with fillings — the gap opens dramatically. Forky's median APE was 11.4%; Cal AI's was 27.8%; Snap Calorie's was 30.2%. The reason is the same in almost every error: single-photo estimators silently drop a component. Cream sauce on pasta. Drizzled oil on a salad. Cheese melted into a bowl. The model acknowledges the component if you ask, but it does not price it into its total.
Dishes outside training distribution: everyone struggles
On the 6 plates we labelled "uncommon" (a Korean banchan platter with five small dishes, a Lebanese mezze plate, a French-style cheese-and-charcuterie board), all three apps drifted to roughly ±30% or worse. Forky still had the lowest median APE (28.4%), but no app landed within ±20% on more than two of these six plates. Vision models inherit the distribution of their training data; food that wasn't well represented at training time gets less reliable extraction regardless of the prompting strategy.
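For anyone re-running the analysis, the per-category medians in this section and the headline columns (median APE, P95 APE, plates within ±20%) can be computed along these lines. This is a sketch under assumptions: the function names are hypothetical, the rows are made-up values rather than benchmark data, and the nearest-rank percentile is one common convention that may differ from the one used for the published table.

```python
import math
from collections import defaultdict
from statistics import median


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarise(apes, threshold=20.0):
    """Median APE, 95th-percentile APE, and the count of plates within the threshold."""
    return {
        "median_ape": median(apes),
        "p95_ape": percentile_nearest_rank(apes, 95),
        "within_threshold": sum(1 for a in apes if a <= threshold),
        "n": len(apes),
    }


def summarise_by_category(rows):
    """Group (category, calorie APE) pairs and summarise each group separately."""
    groups = defaultdict(list)
    for category, plate_ape in rows:
        groups[category].append(plate_ape)
    return {category: summarise(values) for category, values in groups.items()}


# Made-up calorie APEs for one app, three plates per category (not benchmark data).
rows = [
    ("simple", 4.0), ("simple", 7.5), ("simple", 9.0),
    ("composed", 11.0), ("composed", 13.5), ("composed", 26.0),
    ("uncommon", 18.0), ("uncommon", 29.0), ("uncommon", 41.0),
]
by_category = summarise_by_category(rows)

for category, stats in by_category.items():
    print(category, stats)
```

With only a handful of plates per category, the 95th percentile is effectively the worst plate in the group, which is one reason to read per-category tail numbers as directional.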
Macro-by-macro breakdown
Calories are driven by three macros (protein, carbs, and fat), so a macro-by-macro view tells you where each app loses accuracy:
| Median APE | Forky AI | Cal AI | Snap Calorie |
|---|---|---|---|
| Calories | 12.8% | 22.1% | 24.3% |
| Protein | 14.6% | 26.9% | 29.1% |
| Carbs | 15.2% | 23.7% | 25.1% |
| Fat | 18.0% | 34.2% | 37.8% |
Fat is the macro every app loses on the most, and the gap between Forky and the other two widens. Fat is concentrated in toppings, sauces, cheese, oil drizzles — the exact components a per-component prompt is designed to catch and a single-photo estimator drops. Protein accuracy is also a function of weight-estimation quality on the protein component (chicken, salmon, tofu), which Forky's per-component grams field surfaces explicitly.
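A short worked example shows why a dropped topping hurts the fat number more than the calorie number. It uses the general Atwater factors (4 kcal/g for protein and carbs, 9 kcal/g for fat) as an approximation; the plate values are made up, and nothing here is a claim about how any of the three apps computes its totals.

```python
# General Atwater factors, an approximation of energy per gram of each macro.
KCAL_PER_GRAM = {"protein": 4.0, "carbs": 4.0, "fat": 9.0}


def calories_from_macros(macros):
    """Approximate calories as the Atwater-weighted sum of macro grams."""
    return sum(KCAL_PER_GRAM[name] * grams for name, grams in macros.items())


def pct_error(estimate, truth_value):
    """Absolute percentage error of an estimate against the true value."""
    return abs(estimate - truth_value) / truth_value * 100.0


# Hypothetical plate: the estimator misses 10 g of fat (say, a drizzle of oil).
true_macros = {"protein": 35.0, "carbs": 60.0, "fat": 25.0}
seen_macros = {"protein": 35.0, "carbs": 60.0, "fat": 15.0}

true_kcal = calories_from_macros(true_macros)   # 140 + 240 + 225
seen_kcal = calories_from_macros(seen_macros)   # 140 + 240 + 135

fat_error = pct_error(seen_macros["fat"], true_macros["fat"])
calorie_error = pct_error(seen_kcal, true_kcal)

print(f"fat APE: {fat_error:.1f}%, calorie APE: {calorie_error:.1f}%")
```

In this made-up case the missed oil is a 40% error on fat but under 15% on calories, because the protein and carb calories that were estimated correctly dilute it.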
Run-to-run variance: how stable are these numbers?
When we re-ran the same photo files 7 days later, all three apps showed run-to-run variance: the models are stochastic and the API versions roll forward. Median delta between runs:
- Forky AI: ±4.1% on calories
- Cal AI: ±6.8% on calories
- Snap Calorie: ±5.4% on calories
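One plausible way to compute a "median delta between runs" is the median of the per-photo percentage change in calories between the first run and the re-run. The sketch below assumes that definition; the function name and the calorie values are hypothetical.

```python
from statistics import median


def run_to_run_deltas(first_run, second_run):
    """Per-photo absolute percentage change in calories between two runs."""
    if len(first_run) != len(second_run):
        raise ValueError("both runs must cover the same photos")
    return [abs(b - a) / a * 100.0 for a, b in zip(first_run, second_run)]


# Made-up calorie estimates for four photos on day 0 and day 7.
day0 = [520.0, 610.0, 305.0, 450.0]
day7 = [540.8, 585.6, 305.0, 468.0]

deltas = run_to_run_deltas(day0, day7)
median_delta = median(deltas)

print(f"median run-to-run delta: ±{median_delta:.1f}%")
```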
Practically, this means none of these apps is a deterministic measurement device. They're statistical estimators, and the run-to-run noise is real. For a single meal, treat the number as ±10–25% depending on app and complexity. For a weekly macro average across 21 logged meals, the random part of the error washes down to single-digit percentages, provided the errors are approximately mean-zero. A systematic miss, such as a sauce that is dropped on every plate, does not average out.
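The weekly-average arithmetic can be sketched as follows. It assumes the per-meal errors are independent, have similar spread, and that the quoted ± figure can be treated like a standard deviation; under those assumptions the random part shrinks by the square root of the number of meals, while any systematic bias passes through unchanged. The function name and the bias parameter are illustrative.

```python
import math


def weekly_average_error(per_meal_noise_pct, n_meals, bias_pct=0.0):
    """Split the error of an n-meal average into a shrinking random part and a fixed bias."""
    if n_meals <= 0:
        raise ValueError("n_meals must be positive")
    return {
        "noise_pct": per_meal_noise_pct / math.sqrt(n_meals),  # averages down
        "bias_pct": bias_pct,                                  # does not average down
    }


# 21 logged meals at the low and high end of the per-meal range quoted above.
low = weekly_average_error(10.0, 21)
high = weekly_average_error(25.0, 21)

print(f"random error on the weekly average: ±{low['noise_pct']:.1f}% to ±{high['noise_pct']:.1f}%")
```

Under these assumptions, ±10–25% per meal becomes roughly ±2–5.5% on the 21-meal average, which matches the single-digit claim for the random component only.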
What this benchmark doesn't measure
Three big caveats, in service of intellectual honesty:
- Restaurant plating. We tested home-cooked plates on a white background. Restaurant lighting, garnish, and styling all shift accuracy — we did not measure that dimension here.
- Edit-then-log. Forky's UX explicitly lets you adjust gram weights per component before logging; we ignored that flow for this benchmark. In live use, we expect the gap between Forky and the others to widen, because users can correct portion estimates they can see are wrong (Forky shows them per component; the others only show a total).
- Sample size. 30 plates is enough to surface a directional signal but not enough to make confident claims at the per-cuisine level. A 200-plate benchmark would be the next thing to build. We'd love help — if you have a scale, an iPhone, and an afternoon, the methodology above is fully re-runnable.
So which app should you use?
If you're tracking macros for general health or recomposition and your plates are mostly composed home cooking, the accuracy delta in this benchmark is large enough to matter — 12.8% vs 22–24% median APE is the difference between trusting your weekly average and not. If you eat mostly simple foods (a yogurt here, an apple there, a barcode-scannable packaged bar), every app converges. Pick the one whose UX you like.
This is also why we ship Forky's full per-component breakdown by default. You can see the chicken-salmon-edamame rows, you can correct any one of them in two taps, and the totals recompute. A wrong number you can fix is better than a right-ish number you can't.
The spreadsheet, in full
Plate-by-plate raw data, including ground-truth gram weights, USDA per-100g lookups, and each app's returned calories/protein/carbs/fat per plate, is available as a CSV. Email [email protected] and we'll send a copy. If you re-run the benchmark on your own dataset, send results back — we'll publish them here with attribution.
Related reading:
- How AI calorie counting actually works (and where it breaks) — the methodology paper this benchmark validates.
- Fridge-to-recipe prompt engineering — the prompt that drives Forky's three-pass vision.
- Forky vs Cal AI · Forky vs MyFitnessPal — head-to-head comparisons.