Importing a recipe from any URL, photo, or PDF — the pipeline
Most recipe-import features hand-wave 'AI' and ship a Spoonacular wrapper. Forky AI's pipeline does four things differently: scrapes structured data first, falls back to vision OCR, generates a hero photo, and vision-verifies that photo before saving.
The Forky AI import button accepts four sources: a URL, a photo, a PDF, or pasted text. All four land at the same endpoint and run through the same downstream pipeline. The first step — getting from 'pile of source text' to 'structured recipe with macros' — is where most apps in the category cut corners. Here's how Forky AI doesn't.
Four sources, one normalised shape
Recipes arrive in any of these formats:
- URL. The user pastes a link from any recipe blog, food magazine, or newspaper site. We fetch the page server-side with a 5-second timeout, a 2 MB cap, and an SSRF guard that blocks private IP ranges (a sketch of this guarded fetch follows the list).
- Photo. The user takes a picture of a cookbook page or a screenshot of a recipe. Vision OCR extracts the text.
- PDF. The user uploads a cookbook page exported to PDF (or a meal-plan PDF from a coach). We parse the embedded text layer when present, fall back to OCR when it's a scanned image.
- Pasted text. The user copy-pastes a recipe from Notes or WhatsApp.
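A minimal sketch of the guarded fetch from the URL bullet, assuming aiohttp and the stdlib ipaddress module. The 5-second timeout, 2 MB cap, and private-range block come from the description above; the names and error handling are illustrative, not Forky's actual code:

```python
import ipaddress
import socket
from urllib.parse import urlparse

import aiohttp

MAX_BYTES = 2 * 1024 * 1024  # 2 MB cap

def _is_private(host: str) -> bool:
    # Resolve the host and reject loopback/private/link-local targets.
    # (A production SSRF guard would also pin the resolved IP for the
    # actual connection; this standalone check is race-prone.)
    return any(
        not ipaddress.ip_address(info[4][0]).is_global
        for info in socket.getaddrinfo(host, None)
    )

async def fetch_page(url: str) -> str:
    host = urlparse(url).hostname or ""
    if _is_private(host):
        raise ValueError("blocked: private or non-global address")
    timeout = aiohttp.ClientTimeout(total=5)  # 5-second timeout
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as resp:
            body = await resp.content.read(MAX_BYTES + 1)
            if len(body) > MAX_BYTES:
                raise ValueError("blocked: response over 2 MB")
            return body.decode(resp.charset or "utf-8", errors="replace")
```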
All four converge on the same intermediate representation: a string of recipe text plus an optional photo source. The rest of the pipeline doesn't care which source it came from.
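In code, the convergence point can be as small as this. The field names here are hypothetical, but the shape matches the description above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ImportPayload:
    # What every source (URL, photo, PDF, pasted text) reduces to.
    text: str                           # the raw recipe text
    photo_source: Optional[str] = None  # e.g. an Open Graph image URL, if any
    origin: str = "url"                 # "url" | "photo" | "pdf" | "paste"
```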
Structured data first, vision second
For URL imports, we don't immediately ask the LLM to parse the page. We first look for application/ld+json Recipe schema, which most major recipe sites ship. When it's present, we trust it. The schema gives us a clean ingredients list, instructions, prep time, yield, and often nutrition facts — all without an LLM call and without the risk of the LLM hallucinating an ingredient that wasn't on the page.
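Finding the Recipe node is plain parsing, no model involved. A stdlib-only sketch (the class and function names are ours, not Forky's):

```python
import json
from html.parser import HTMLParser

class JsonLdScripts(HTMLParser):
    """Collects the bodies of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buf))
            self._buf, self._in_jsonld = [], False

def find_recipe_schema(html: str) -> dict | None:
    parser = JsonLdScripts()
    parser.feed(html)
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        # JSON-LD may be one object, a list, or wrapped in @graph.
        if isinstance(data, list):
            nodes = data
        elif isinstance(data, dict):
            nodes = data.get("@graph", [data])
        else:
            continue
        for node in nodes:
            if isinstance(node, dict) and "Recipe" in str(node.get("@type", "")):
                return node  # ingredients, instructions, times, yield, nutrition
    return None
```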
Only when the page has no Recipe schema (which is more common on blog posts written in 2014 than you'd think) do we hand the page text to the LLM with a structured-output prompt and ask it to extract:
```json
{
  "name": "...",
  "ingredients": [
    { "name": "...", "quantity_text": "...", "grams": N }
  ],
  "steps": ["..."],
  "time_min": N,
  "servings": N,
  "tags": ["..."]
}
```

The structured-output mode is what makes this reliable. We constrain the model to return exactly this JSON shape, and we validate every field. If the model returns "quantity_text": "a handful" with no grams, we run a second pass against a per-ingredient gram-weight dictionary (peanuts ≈ 30 g/handful, spinach ≈ 60 g/handful, etc.) so the macros downstream are computed against real numbers, not text.
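That second pass is a dictionary lookup, not another model call. A sketch, with a deliberately tiny table (the two values are the ones quoted above; the real dictionary covers far more ingredients and quantity words):

```python
# Per-ingredient gram weights for fuzzy quantities. Illustrative subset.
GRAMS_PER_HANDFUL = {"peanuts": 30, "spinach": 60}

def resolve_grams(ing: dict) -> dict:
    """Fill in ing["grams"] when the model only returned quantity text."""
    if ing.get("grams"):
        return ing
    qty = (ing.get("quantity_text") or "").lower()
    if "handful" in qty:
        per = GRAMS_PER_HANDFUL.get(ing["name"].lower())
        if per:
            ing["grams"] = per  # macros downstream now see a real number
    return ing
```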
The per-ingredient macro lookup
Once we have ingredients-with-grams, the macros are mechanical. For each ingredient, we look up the per-100g calories/protein/carbs/fat from a curated database (~530 entries, USDA-style values). For ingredients not in the database, we ask the LLM for the per-100g values only — not the totals. Same trick as the meal-photo pipeline: let the model do the lookup, let our code do the arithmetic.
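The division of labour looks roughly like this. PER_100G stands in for the curated ~530-entry table and llm_per_100g() for the structured-output fallback call; both names are hypothetical:

```python
# Illustrative slice of the curated table: per-100g values, USDA-style.
PER_100G = {
    "chicken breast": {"kcal": 165, "protein": 31.0, "carbs": 0.0, "fat": 3.6},
}

def llm_per_100g(name: str) -> dict:
    """Hypothetical structured-output call: per-100g values only, no totals."""
    raise NotImplementedError

def recipe_macros(ingredients: list[dict], servings: int) -> dict:
    totals = {"kcal": 0.0, "protein": 0.0, "carbs": 0.0, "fat": 0.0}
    for ing in ingredients:
        per = PER_100G.get(ing["name"].lower()) or llm_per_100g(ing["name"])
        factor = ing["grams"] / 100.0
        for k in totals:
            totals[k] += per[k] * factor  # the model looks up, our code multiplies
    # Per-portion numbers, so they line up with a photographed plate.
    return {k: round(v / servings, 1) for k, v in totals.items()}
```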
This means the recipe arrives in your library with per-portion macros that match what would happen if you cooked it and photographed the plate. No mystery numbers, no drift.
The AI hero photo problem
Every imported recipe needs a hero photo. The source URL sometimes has one (extracted from Open Graph or the Recipe schema's image field), but often the photo is broken, watermarked, or doesn't actually show the recipe.
Our first version of this pinned an Unsplash stock photo whose URL had been hand-mapped to a keyword in the dish name. "Frittata" mapped to photo-1612240498936. That worked great until we discovered, six months later, that the URL we'd labelled "frittata" was in fact a stock photo of three glazed donuts. The "eggs" URL was a raspberry layer cake. The list had never been visually audited.
We now generate every hero photo per-recipe via Gemini 3.1's image model with a prompt that includes all of the following (assembled as in the sketch after the list):
- The dish name as the primary signal.
- The top 8 ingredients (so the photo looks like the actual recipe, not a generic version).
- A per-keyword visual identity hint: "frittata = thick baked egg cake sliced into wedges, golden top with visible vegetables embedded inside" vs "pizza = round flat bread with melted cheese and toppings". This keeps the model from producing donuts for a frittata.
- Style constraints: "restaurant-quality plating, soft natural daylight, shallow depth of field, no text, no menu, no watermark, no people, no packaged products".
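Assembled, the prompt looks roughly like this. The hint table and style string are lifted from the bullets above; the function itself is an illustration, not Forky's code:

```python
IDENTITY_HINTS = {
    "frittata": ("thick baked egg cake sliced into wedges, golden top with "
                 "visible vegetables embedded inside"),
    "pizza": "round flat bread with melted cheese and toppings",
}
STYLE = ("restaurant-quality plating, soft natural daylight, shallow depth "
         "of field, no text, no menu, no watermark, no people, no packaged products")

def hero_photo_prompt(name: str, ingredients: list[str]) -> str:
    hint = next((h for kw, h in IDENTITY_HINTS.items() if kw in name.lower()), None)
    parts = [
        f"A photo of {name}",                        # dish name: primary signal
        "made with " + ", ".join(ingredients[:8]),   # top 8 ingredients
    ]
    if hint:
        parts.append(hint)                           # per-keyword visual identity
    parts.append(STYLE)                              # style constraints
    return ". ".join(parts)
```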
The vision-verification step
The image-gen model still hallucinates. Not often, but often enough that we couldn't ship the photo straight to the recipe doc. We added a verification step before persisting:
```python
async def _ai_photo_matches_dish(b64_data, mime, name):
    # A small vision model acts as a yes/no gate on the generated image.
    chat = LlmChat(model="gpt-4o-mini", system=VERIFIER_PROMPT)
    msg = UserMessage(
        text=f'Intended dish: "{name}". Does this image plausibly show that dish? YES or NO.',
        file_contents=[ImageContent(image_base64=b64_data)],
    )
    resp = await chat.send_message(msg)
    # Anything other than an explicit YES fails the gate.
    return resp.strip().upper().startswith("YES")
```

GPT-4o-mini vision sees the generated image and the intended dish name. It replies YES or NO. We only cache and persist the photo when the verifier says YES. If it says NO, we fall back to a placeholder card and the next read triggers another generation. Cost per recipe: roughly $0.04 for Gemini plus $0.001 for the verifier. Cost per recipe served to a different user (cache hit): $0. The global cache is keyed on the normalised dish name.
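Putting that gate in front of the cache looks roughly like this. generate_hero_photo(), upload_photo(), and photo_cache are hypothetical stand-ins; the flow is the one just described:

```python
async def hero_photo_for(name: str, ingredients: list[str]) -> str | None:
    key = name.strip().lower()               # global cache, keyed on dish name
    if cached := await photo_cache.get(key):
        return cached                        # cache hit: $0 for every later user
    b64, mime = await generate_hero_photo(hero_photo_prompt(name, ingredients))
    if not await _ai_photo_matches_dish(b64, mime, name):
        return None                          # placeholder card; next read retries
    url = await upload_photo(b64, mime)
    await photo_cache.set(key, url)          # only verified photos enter the cache
    return url
```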
Self-healing on read
Recipes that landed in the library before the verification step shipped — some with bad AI photos, some with the donut-instead-of-frittata stock pic — would otherwise be stuck with the wrong photo forever. So the GET /recipes/<id> handler self-heals: if a row's photo_url is empty, it triggers a background regen and marks the row ai_in_flight to avoid stacking concurrent gens on rapid pull-to-refresh. The next read returns the verified photo.
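In FastAPI-ish pseudocode — the route and the photo_url / ai_in_flight fields are from the description above; recipes and regen_photo are hypothetical stand-ins:

```python
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()

@app.get("/recipes/{recipe_id}")
async def get_recipe(recipe_id: str, background: BackgroundTasks):
    # recipes: hypothetical async collection handle (e.g. Motor).
    row = await recipes.find_one({"_id": recipe_id})
    if not row.get("photo_url") and not row.get("ai_in_flight"):
        # Mark in-flight first so rapid pull-to-refresh can't stack regens.
        await recipes.update_one({"_id": recipe_id},
                                 {"$set": {"ai_in_flight": True}})
        background.add_task(regen_photo, recipe_id)  # lazy, one row at a time
    return row
```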
"The verifier turned a probabilistic source into a deterministic one. We can cache the output globally because we know it passed an explicit 'does this image show the dish' gate before being written."
What we learned
Three things that generalise beyond this pipeline:
- Static reference tables decay silently. The keyword photo list "worked" for six months because no one visually audited the URLs. When you ship a feature that depends on a static table, ship a validator script alongside it (one possible shape is sketched after this list).
- LLM output is probabilistic; cached output is deterministic. A bad LLM generation cached globally turns one user's bug into every user's bug. Verification is the gate that lets you cache safely.
- The self-heal-on-read pattern lets you ship migrations cheaply. Instead of a batch backfill that hits the DB once for every legacy row, the GET handler does the work lazily the next time someone reads the row. The migration runs over weeks instead of one big window, and only touches rows that are actually being used.
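For the photo table, that validator can reuse the same vision gate the pipeline already has. A sketch; fetch_image_b64() is a hypothetical helper that downloads and base64-encodes each URL:

```python
async def audit_photo_table(table: dict[str, str]) -> list[str]:
    """Run every keyword -> photo-URL entry through the vision verifier."""
    failures = []
    for keyword, url in table.items():
        b64, mime = await fetch_image_b64(url)
        if not await _ai_photo_matches_dish(b64, mime, keyword):
            failures.append(f"{keyword}: {url} does not show {keyword}")
    return failures  # run on a schedule; fail the build on any mismatch
```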