Journalistic Accuracy Scoring
A pipeline that tracks and scores journalistic reliability, running a LangGraph LLM stack with tiered evaluation — mocked LLMs for fast CI, real cloud models on a nightly cadence.
Problem
Reporter reliability is asserted, not measured — there is no systematic, repeatable way to track and score accuracy over time.
Solution
Score reliability with a two-tier evaluation pipeline — mocked LLMs for CI speed and real cloud models nightly for behavioral fidelity — and track each source's score over time.
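A minimal sketch of how the two tiers could share one scoring path, assuming a single judge interface with a mocked and a real implementation. The names `ClaimJudge`, `MockJudge`, `CloudJudge`, `make_judge`, `score_article`, and the `EVAL_TIER` environment variable are all hypothetical and not taken from the project; the cloud call is left as a stub.

```python
import os
from dataclasses import dataclass, field
from typing import Optional, Protocol


class ClaimJudge(Protocol):
    """Anything that rates one claim against its evidence on a 0.0-1.0 scale."""

    def judge(self, claim: str, evidence: str) -> float: ...


@dataclass
class MockJudge:
    """CI tier: deterministic canned verdicts, no network calls."""

    verdicts: dict = field(default_factory=dict)
    default: float = 0.5

    def judge(self, claim: str, evidence: str) -> float:
        return self.verdicts.get(claim, self.default)


class CloudJudge:
    """Nightly tier: stub where a real cloud-model client would be called."""

    def judge(self, claim: str, evidence: str) -> float:
        raise NotImplementedError("wire up a real provider client here")


def make_judge(tier: Optional[str] = None) -> ClaimJudge:
    """Select the tier from EVAL_TIER (assumed name); default to the mocked CI tier."""
    tier = tier or os.environ.get("EVAL_TIER", "ci")
    if tier == "nightly":
        return CloudJudge()
    if tier == "ci":
        return MockJudge()
    raise ValueError(f"unknown evaluation tier: {tier!r}")


def score_article(judge: ClaimJudge, claims: list) -> float:
    """Mean verdict over an article's (claim, evidence) pairs."""
    if not claims:
        raise ValueError("no claims to score")
    return sum(judge.judge(c, e) for c, e in claims) / len(claims)
```

Because both tiers go through `score_article`, the CI run exercises the same aggregation logic as the nightly run; only the judge behind it changes.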
Results
- Cloud-LLM reliability scoring
- Two-tier eval (mocked CI + real nightly)
- Tracks reliability over time
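One way tracking over time could work is an exponentially weighted running score per reporter, so recent articles count more than old ones. This is an illustrative assumption, not the project's documented method; `ReliabilityHistory`, `record`, and the `alpha` value are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ReliabilityHistory:
    """Per-reporter history of an exponentially weighted reliability score."""

    alpha: float = 0.3  # weight given to the newest article (illustrative value)
    points: list = field(default_factory=list)  # (date, running score) pairs

    def record(self, day: date, article_score: float) -> float:
        """Fold one article's 0.0-1.0 score into the running score and store it."""
        if not 0.0 <= article_score <= 1.0:
            raise ValueError("article_score must be between 0.0 and 1.0")
        # The first article seeds the running score; later ones are blended in.
        previous = self.points[-1][1] if self.points else article_score
        running = self.alpha * article_score + (1 - self.alpha) * previous
        self.points.append((day, running))
        return running
```

Keeping every `(date, score)` point, rather than only the latest value, is what makes a trend line per source possible.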
An LLM evaluation pipeline that measures reporter reliability instead of asserting it. A two-tier test strategy keeps CI fast with mocked models while validating real behavior nightly against cloud LLMs (Anthropic and OpenAI). LangGraph orchestrates the scoring runs, with checkpoints persisted in Postgres.
Selected work
- AI Operations Platform: Production AI platform covering a WhatsApp AI assistant with RAG, commerce infrastructure, content pipelines, LLM evaluation, and custom MCP servers. (B2B)
- Trading AI Analyst: Multi-agent trading platform built with LangGraph, covering market analysis, signal generation, risk modeling, portfolio management, and real-time WebSocket streaming. (B2C)
- OrchestKit: Open-source agent framework for Claude Code with 111 skills, 37 agents, and 211 hooks. (Open Source)
- SkillForge: 3-tier multi-agent platform (20+ agents) that transforms URLs into implementation-ready artifacts for AI coding assistants. (Private)
- Voice-to-Chart Clinical AI: AI sidecar for veterinary clinics that transcribes Hebrew visit audio, extracts structured medical findings, generates copy-paste-ready chart text, and WhatsApps a summary to the pet owner. (B2B)
- Real-Estate Connector CRM: A personal CRM for a cross-border real-estate connector, built around a unified WhatsApp and email inbox where AI drafts replies for human approval, with voice transcription, three-language translation, AI summaries, semantic search, and a deal pipeline. (B2B)
- Multilingual Document Intelligence: An AI document platform offering on-demand Portuguese→English/Hebrew translation, multilingual semantic search, and RAG-powered Q&A over property documents, with an adaptive UI that picks one of five panel layouts based on context. (Internal)
- Network-Powered Job Discovery: A network-powered job-discovery engine for junior tech professionals that surfaces junior roles and shows community members' warm paths into the hiring organizations. (B2C)







