
Fridge-to-recipe prompt engineering: how Forky's 3-pass vision works

The actual prompt structure Forky AI uses to go from a fridge photo to a structured ingredient list to a per-component recipe. Three passes, why each exists, the failure modes we hit, and the JSON schema we hand back to the app. Builder-focused, with code.

By Elie de Rougemont, Founder of Forky AI · 13 min read

A fridge photo is a worse problem than a plate photo. The lighting is bad, items are stacked behind other items, packaging hides contents, and a single image needs to surface 10–20 distinct things instead of a single composed dish. Here is the prompt and post-processing pipeline we landed on after eight months of iteration, including the bits that didn't work.

The problem in one paragraph

A user opens the fridge, takes a photo, and expects Forky to know what they have so it can suggest a dinner that uses what's already there. From the model's perspective, that photo is a cluttered scene with poor lighting, multiple occluded shelves, opaque packaging, and items that may or may not be edible (a bottle of tonic water vs a bottle of milk vs a bottle of cleaning product on the door). The output has to be structured — a list of ingredients with gram estimates and freshness labels, not a paragraph of prose — so the rest of the app can compute macros and surface "expiring soon" suggestions.

Why one pass doesn't work

We started with the naive prompt: "List every food item visible in this fridge photo as JSON." Two failure modes appeared within the first 50 user photos:

The fix is the same shape as the plate-vision fix in our previous post: give the model multiple structured passes instead of one open-ended question. Decompose the problem.

The 3-pass prompt

Pass 1 — coarse zonal scan

Force the model to look at the fridge in zones, not as a single image. We bound the zones in the prompt and require an item count per zone before any item names are listed:

SYSTEM: You are a fridge inventory assistant. You will be shown a single
photo of an opened refrigerator. Look at the image in five separate zones
(top shelf, middle shelf, bottom shelf, drawer, door bins). For EACH zone,
first state how many distinct edible items you see, then list them.
Return JSON only — no prose.

USER: [fridge photo]
Output schema:
{ "top_shelf":    { "count": int, "items": [str] },
  "middle_shelf": { "count": int, "items": [str] },
  "bottom_shelf": { "count": int, "items": [str] },
  "drawer":       { "count": int, "items": [str] },
  "door_bins":    { "count": int, "items": [str] } }

The count field exists to force the model to look before it speaks. Without an up-front count, the model would describe a zone as holding four items and then list only two. With it, the model almost always emits as many items as its count claimed. This is self-consistency through structured forcing: the same trick as chain-of-thought, applied to a list.
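Since the count is only "almost always" consistent with the list, it is worth checking on the backend. A minimal sketch of such a check, assuming the Pass 1 response arrives as a JSON string in the schema above; the function name check_pass1 and the problem-message format are illustrative, not Forky's actual code:

```python
import json

# Zone keys taken from the Pass 1 output schema.
ZONES = ("top_shelf", "middle_shelf", "bottom_shelf", "drawer", "door_bins")


def check_pass1(raw: str) -> tuple[dict, list[str]]:
    """Parse Pass 1 JSON and report zones whose count disagrees with the listed items."""
    data = json.loads(raw)
    problems = []
    for zone in ZONES:
        entry = data.get(zone)
        if not isinstance(entry, dict):
            problems.append(f"{zone}: missing")
            continue
        count = entry.get("count")
        items = entry.get("items")
        if not isinstance(count, int) or not isinstance(items, list):
            problems.append(f"{zone}: malformed")
        elif count != len(items):
            problems.append(f"{zone}: count={count} but {len(items)} items listed")
    return data, problems
```

A non-empty problems list is a cheap signal to retry the pass or flag the scan, rather than silently persisting a short inventory.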

Pass 2 — content vs container disambiguation

For every item from Pass 1 that the model labelled with a container word (tupperware, jar, bowl, glass, can), Pass 2 asks the model to look at the same photo again and identify the likely contents. Container words are a static list in our code; we route only those items into Pass 2, not the whole list.

SYSTEM: Look at the fridge photo again. For the following items, the
previous pass identified the container but not the contents. Identify
the most likely contents based on color, density, and any visible
labels. If contents are not identifiable, return null.

USER: [fridge photo]
Items to disambiguate: ["tupperware (top shelf)", "glass jar (door)"]
Output schema:
{ "tupperware (top shelf)": "cooked pasta" | null,
  "glass jar (door)":       "homemade jam" | null }

About 60% of container-pass items get successfully disambiguated; the rest get a null and end up in the app as "unknown container — tap to label". We do not show the user the unknown items in the suggested-recipe flow because their macros would be a guess on a guess.
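The routing into Pass 2 and the handling of null results can be sketched roughly as follows. The container words are the ones named above; the whole-word regex matching, the function names, and the resolved/unknown split are assumptions made for illustration:

```python
import re

# Container words named in the post; the production list may differ.
CONTAINER_WORDS = ("tupperware", "jar", "bowl", "glass", "can")

# Whole-word match (optional plural "s") so "cantaloupe" does not match "can".
_CONTAINER_RE = re.compile(
    r"\b(" + "|".join(CONTAINER_WORDS) + r")s?\b", re.IGNORECASE
)


def needs_pass2(item: str) -> bool:
    """True if the Pass 1 label names a container rather than its contents."""
    return _CONTAINER_RE.search(item) is not None


def route_items(items: list[str]) -> tuple[list[str], list[str]]:
    """Split Pass 1 items into (already identified, needs disambiguation)."""
    known = [i for i in items if not needs_pass2(i)]
    containers = [i for i in items if needs_pass2(i)]
    return known, containers


def split_pass2(result: dict) -> tuple[dict, list[str]]:
    """Split a Pass 2 response into resolved contents and still-unknown containers."""
    resolved = {k: v for k, v in result.items() if v is not None}
    unknown = [k for k, v in result.items() if v is None]
    return resolved, unknown
```

Only the containers list would be sent to Pass 2; the unknown list is what would surface as "unknown container — tap to label" and stay out of recipe suggestions.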

Pass 3 — per-item quantity + freshness anchor

Now we have a clean list of edible items. Pass 3 asks the model to estimate gram weight and visible freshness for each one. Freshness anchors the expiry-date calculation that drives the "expiring soon" feature.

SYSTEM: For each edible item, estimate its current weight in grams and
its visible freshness on a 0-3 scale (0=spoiled, 1=use today, 2=fresh,
3=just opened or pristine). Use category-typical weights when items are
fully wrapped/packaged.

USER: [fridge photo]
Items: ["6 eggs (top shelf)", "Greek yogurt 500g tub (top shelf)", ...]
Output schema:
[
  { "name": "eggs", "count": 6, "grams_each": 50,
    "grams_total": 300, "freshness": 2 },
  { "name": "Greek yogurt", "grams_total": 470, "freshness": 3 },
  ...
]

The grams_each and count fields for countable items (eggs, apples, lemons, peppers) give users a tap-friendly stepper in the UI: when they cook with 2 eggs out of 6, the inventory drops by grams_each × 2 instead of forcing a gram-level mental conversion.
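The stepper update is simple arithmetic on the Pass 3 record. A small sketch, assuming items are plain dicts in the Pass 3 schema; use_countable is a hypothetical helper name:

```python
def use_countable(item: dict, used: int) -> dict:
    """Return a copy of a countable Pass 3 item after `used` units are consumed."""
    if "count" not in item or "grams_each" not in item:
        raise ValueError("item is not countable (no count/grams_each)")
    if not 0 <= used <= item["count"]:
        raise ValueError(f"cannot use {used} of {item['count']}")
    # Subtract grams_each per unit used; never let the total go negative.
    grams_left = max(0, item["grams_total"] - used * item["grams_each"])
    return {**item, "count": item["count"] - used, "grams_total": grams_left}
```

With the egg record from the schema above (6 eggs, 50 g each, 300 g total), using 2 leaves a count of 4 and a grams_total of 200.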

Post-processing: where the AI stops and Python takes over

The model returns JSON; the backend does five things to it before persisting:

Things we tried and rolled back

One-pass with structured output (failed)

We tried collapsing the three passes into a single prompt with a richer JSON schema — asking the model to return zones AND containers AND quantities in one shot. It worked on clean fridges and fell over on cluttered ones. The model would skip steps to fit response length: it would either skip the zone breakdown OR the container disambiguation OR the quantity estimation, and never tell us which. Splitting passes gave us explicit failure modes per step.
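One way to make "explicit failure modes per step" concrete is to run the passes through a small driver that tags any failure with the pass it came from. This is a sketch under assumptions: the step callables stand in for the real model calls, and PassError and run_passes are hypothetical names.

```python
class PassError(Exception):
    """Raised when one pass fails, recording which pass it was."""

    def __init__(self, pass_name: str, cause: Exception):
        super().__init__(f"{pass_name} failed: {cause}")
        self.pass_name = pass_name
        self.cause = cause


def run_passes(photo, passes):
    """Run (name, step) pairs in order; each step receives the photo and the previous output."""
    output = None
    for name, step in passes:
        try:
            output = step(photo, output)
        except Exception as exc:
            # Wrap the error so callers know exactly which pass broke.
            raise PassError(name, exc) from exc
    return output
```

A caller can then retry just the failed pass, or log per-pass failure rates, instead of discarding a whole combined response.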

YOLO + classification head (failed, for now)

We prototyped a classical CV pipeline — YOLO for object detection, then a fine-tuned classification head for "food vs container vs other" — to bypass the vision-LLM cost. It worked on clean fridges in our test set and collapsed on real user fridges (different camera angles, different lighting, different fridge brands). The vision-LLM generalises across that variability for free; the classical CV pipeline would need to be re-tuned per deployment region. We may revisit if cost forces our hand, but right now the model wins.

Per-shelf cropping (mixed)

We tried cropping the original photo into three shelf-shaped tiles and running the prompt once per tile. Latency tripled, accuracy went up ~8% on cluttered fridges, but quality on spanning items (a tall water bottle that crosses two shelves) dropped because the bottle got cut between tiles. We ship the un-cropped path; per-shelf cropping is a flag in the backend we may enable for "I think I missed something" retry flows.
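For reference, the tiling itself is a few lines. The sketch below computes horizontal crop boxes as (left, upper, right, lower) tuples, the same box layout Pillow's Image.crop accepts. The overlap parameter is an untested idea for keeping spanning items whole in at least one tile, not something described above as having been tried; shelf_tiles is a hypothetical name.

```python
def shelf_tiles(width: int, height: int, n_tiles: int = 3, overlap: float = 0.0):
    """Return n_tiles horizontal crop boxes as (left, upper, right, lower) tuples.

    overlap is the fraction of one tile's height by which each tile extends
    into its neighbours (0.0 means hard cuts, as in the experiment described).
    """
    if n_tiles < 1 or not 0.0 <= overlap < 1.0:
        raise ValueError("n_tiles must be >= 1 and overlap in [0, 1)")
    band = height / n_tiles
    pad = overlap * band
    boxes = []
    for i in range(n_tiles):
        upper = max(0, round(i * band - pad))
        lower = min(height, round((i + 1) * band + pad))
        boxes.append((0, upper, width, lower))
    return boxes
```

Each box costs one extra model call, which is where the tripled latency comes from with three tiles.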

What we want to ship next

Three things on the prompt roadmap that we don't have yet:

The cost of running this

Per fridge scan, on GPT-4o:

Total: roughly $0.03–$0.04 per scan at current pricing. At a 5-day-trial conversion of 8% and an average of 18 scans per user per month for retained Pro users, the per-user marginal cost is manageable inside our $80/year price point, but it's the single biggest line item on the ops budget. Caching identical photos within the same user session is the easiest win we haven't yet shipped.
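As a back-of-envelope check: 18 scans a month is 216 scans a year, so $0.03–$0.04 per scan comes to about $6.48–$8.64 per retained Pro user per year, or roughly 8–11% of the $80 price. That assumes a full year of retention and ignores scans by trial users who never convert. The sketch below reproduces the arithmetic and shows one possible shape for the session cache, keyed on a SHA-256 hash of the photo bytes; the class and function names are hypothetical, and byte-identical matching is an assumption about what "identical photos" means.

```python
import hashlib


def annual_scan_cost(cost_per_scan: float, scans_per_month: int = 18) -> float:
    """Yearly model cost in dollars for one retained user at a steady scan rate."""
    return cost_per_scan * scans_per_month * 12


class SessionScanCache:
    """Per-session cache: a byte-identical photo is scanned only once."""

    def __init__(self):
        self._results = {}

    def get_or_scan(self, photo_bytes: bytes, scan):
        """Return the cached result for this photo, calling scan() only on a miss."""
        key = hashlib.sha256(photo_bytes).hexdigest()
        if key not in self._results:
            self._results[key] = scan(photo_bytes)
        return self._results[key]
```

Hashing raw bytes only catches exact re-uploads; a re-taken photo of the same fridge would still trigger a fresh scan.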

Why post this publicly

Two reasons. First, this prompt isn't the moat — the moat is the iteration loop, the canonical dictionary, the shelf-life data, and the UX that makes the user trust the result enough to actually use it. Publishing the prompt costs us nothing and gives builders a starting point that's better than "ask GPT what's in the photo."

Second, we want feedback. If you've built a similar pipeline, the failure modes you're seeing probably overlap with ours, and the things you've tried that worked are things we should try. Email [email protected] — or DM the Forky AI account on X. We read everything.
