
Journalistic Accuracy Scoring

A pipeline that tracks and scores journalistic reliability, running a LangGraph LLM stack with tiered evaluation — mocked LLMs for fast CI, real cloud models on a nightly cadence.

Python, FastAPI, Anthropic (Claude), OpenAI, LangGraph, Next.js, Playwright

Problem

Reporter reliability is asserted, not measured — there is no systematic, repeatable way to track and score accuracy over time.

Solution

Score reliability with a two-tier evaluation pipeline — mocked LLMs for CI speed and real cloud models nightly for behavioral fidelity — and track each source's score over time.

Results

  • Cloud-LLM reliability scoring
  • Two-tier eval (mocked CI + real nightly)
  • Tracks reliability over time

An LLM evaluation pipeline that measures reporter reliability instead of asserting it. A two-tier test strategy keeps CI fast with mocked models while validating real behavior nightly against cloud LLMs (Anthropic and OpenAI). LangGraph orchestrates the scoring runs, with Postgres-backed checkpoints.
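The two-tier idea can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: `ClaimJudge`, `MockJudge`, `pick_judge`, `reliability_score`, and the `EVAL_TIER` variable are all hypothetical names, and the averaging rule is an assumed stand-in for whatever scoring the real pipeline uses. The point is that the scoring logic depends only on a small interface, so CI can swap in a deterministic mock while the nightly run plugs in a real cloud model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Mapping, Protocol, Sequence


class ClaimJudge(Protocol):
    """Hypothetical interface: rate one claim's accuracy from 0.0 to 1.0."""

    def rate(self, claim: str) -> float: ...


@dataclass
class MockJudge:
    """Deterministic stand-in for an LLM, intended for the fast CI tier."""

    canned: Dict[str, float] = field(default_factory=dict)
    default: float = 0.5  # returned for claims without a canned rating

    def rate(self, claim: str) -> float:
        return self.canned.get(claim, self.default)


def pick_judge(
    env: Mapping[str, str],
    mock: ClaimJudge,
    real_factory: Callable[[], ClaimJudge],
) -> ClaimJudge:
    """Choose the tier from a hypothetical EVAL_TIER setting.

    The real judge is built lazily, so CI never needs API credentials.
    """
    if env.get("EVAL_TIER", "ci") == "nightly":
        return real_factory()
    return mock


def reliability_score(judge: ClaimJudge, claims: Sequence[str]) -> float:
    """Assumed scoring rule: the mean of the per-claim ratings."""
    if not claims:
        raise ValueError("at least one claim is required")
    return sum(judge.rate(claim) for claim in claims) / len(claims)
```

In CI this might be called as `reliability_score(pick_judge(os.environ, mock, make_real_judge), claims)`, where `make_real_judge` is a hypothetical factory wrapping an Anthropic or OpenAI client. Because the factory is only invoked on the nightly tier, the mocked run stays fast and offline.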
