
Invoice Processing Pipeline

Financial Services Firm · Financial Services
Duration 12 weeks
Team Size 2 people
Completed November 2025

Measurable Outcomes

85% Automation Rate from 0%
45 sec Review Time from 8 min
3x Volume Capacity same headcount

The Challenge

Manual review of ~2,000 invoices/month consumed 3 FTEs dedicated to data entry and validation. Error rates were 8-12% on key fields, causing downstream reconciliation issues.

The Outcome

85% of invoices now process without human touch. Review time per invoice dropped from ~8 min to ~45 sec for flagged items. Team handles 3x the volume with the same headcount.

The Problem

A mid-size financial services firm was drowning in invoice processing. Their accounts payable team spent the majority of their time on data entry—manually keying vendor names, amounts, dates, and line items from PDFs into their ERP system.

The numbers painted a clear picture:

  • ~2,000 invoices/month requiring manual processing
  • 3 full-time employees dedicated to data entry and validation
  • 8-12% error rate on key fields
  • 8 minutes average per invoice, including verification
  • Downstream problems: Reconciliation issues, delayed payments, vendor relationship strain

They’d looked at traditional OCR solutions, but accuracy on their diverse invoice formats (50+ vendors, each with different layouts) was too low to be useful.

What We Built

Core Architecture

We designed a document extraction pipeline with Claude at the center, but the key wasn’t just “use AI for OCR.” The architecture decisions were what made it production-ready:

┌─────────────────────────────────────────────────────┐
│                 Invoice Upload                       │
│              (API / Email / SFTP)                    │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│              Document Preprocessing                  │
│  • PDF → Image conversion                           │
│  • Multi-page detection                             │
│  • Quality assessment                               │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│              Claude Extraction                       │
│  • Structured output (JSON schema)                  │
│  • Per-field confidence scores                      │
│  • Reasoning traces for debugging                   │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│              Confidence Routing                      │
│  • High (>0.90) → Auto-approve                      │
│  • Medium (0.85-0.90) → Spot-check queue            │
│  • Low (<0.85) → Human review required              │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│              ERP Integration                         │
│  • Batch submission                                 │
│  • Error handling & retry                           │
│  • Audit trail                                      │
└─────────────────────────────────────────────────────┘

Key Technical Decisions

1. Confidence Scoring Per Field

Not all fields are equally hard to extract. Vendor name is usually clear; line item descriptions can be ambiguous. We designed the system to output confidence scores per field, not just per document.

This allowed nuanced routing: a document might auto-process even with one low-confidence field if that field is non-critical.
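The routing described above can be sketched as a small function. The thresholds come from the confidence-routing stage of the architecture; the field names and the choice of which fields count as critical are illustrative assumptions, not the firm's actual configuration.

```python
# Sketch of per-field confidence routing, using the thresholds from the
# architecture diagram. Field names and criticality are illustrative.

CRITICAL_FIELDS = {"vendor", "invoice_number", "total_amount"}
AUTO_APPROVE = 0.90   # >0.90 -> auto-approve
SPOT_CHECK = 0.85     # 0.85-0.90 -> spot-check queue


def route(extraction: dict) -> str:
    """Return 'auto', 'spot_check', or 'review' for one extracted invoice.

    `extraction` maps field name -> {"value": ..., "confidence": float}.
    Only critical fields gate the decision, so a low-confidence
    non-critical field alone does not force human review.
    """
    worst_critical = min(
        (f["confidence"] for name, f in extraction.items()
         if name in CRITICAL_FIELDS),
        default=0.0,  # no critical fields extracted -> always review
    )
    if worst_critical >= AUTO_APPROVE:
        return "auto"
    if worst_critical >= SPOT_CHECK:
        return "spot_check"
    return "review"
```

Keying the decision to the worst critical field is what lets a document with one shaky line-item description still auto-process.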

2. Human-in-the-Loop by Default

We didn’t try to eliminate humans—we optimized their time. The review interface shows:

  • Original document alongside extracted data
  • Fields flagged for attention highlighted
  • One-click correction with keyboard shortcuts

Average review time for flagged items: 45 seconds. The system learns from corrections to improve future extractions on similar formats.
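One way the "learns from corrections" loop can work is to bank each human correction per vendor and replay recent ones as few-shot examples in later extraction prompts. This is a hypothetical sketch of that mechanism; the class and method names are invented for illustration.

```python
# Hypothetical sketch: bank human corrections per vendor so recent ones
# can be replayed as few-shot examples in later extraction prompts.
from collections import defaultdict


class CorrectionStore:
    def __init__(self, max_per_vendor: int = 3):
        self.max_per_vendor = max_per_vendor
        # vendor -> list of (extracted, corrected) pairs, oldest first
        self._examples = defaultdict(list)

    def record(self, vendor: str, extracted: dict, corrected: dict) -> None:
        """Keep only the most recent corrections for each vendor."""
        bucket = self._examples[vendor]
        bucket.append((extracted, corrected))
        del bucket[:-self.max_per_vendor]  # trim to the newest N

    def few_shot(self, vendor: str) -> list:
        """Examples to prepend to the extraction prompt for this vendor."""
        return list(self._examples[vendor])
```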

3. Temporal for Workflow Orchestration

Invoice processing isn’t a single API call—it’s a workflow with multiple steps that can fail independently. We used Temporal to handle:

  • Retry logic with exponential backoff
  • Human task assignment and tracking
  • Timeout handling for stalled workflows
  • Audit logging for compliance
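Temporal expresses retries declaratively (initial interval, backoff coefficient, maximum attempts), so you rarely write this loop yourself. As a plain-Python illustration of the semantics, the first bullet above amounts to:

```python
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fn(), retrying on exception with exponential backoff.

    Plain-Python illustration of the retry semantics Temporal provides
    declaratively; parameter names here are illustrative, not Temporal's.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            # 1s, 2s, 4s, ... capped at max_delay
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
```

The difference in production is that Temporal persists this state, so a crashed worker resumes the workflow instead of restarting the invoice from scratch.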

4. Structured Outputs, Not Free Text

We used Claude’s structured output capabilities to enforce a strict JSON schema. This eliminated parsing errors and ensured the ERP system always received correctly formatted data.

{
  "vendor": {
    "name": "Acme Supplies Inc.",
    "confidence": 0.95
  },
  "invoice_number": {
    "value": "INV-2025-1847",
    "confidence": 0.98
  },
  "total_amount": {
    "value": 1547.50,
    "currency": "USD",
    "confidence": 0.97
  },
  "line_items": [...]
}
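Even with schema-enforced output, it is cheap to validate the payload defensively before ERP submission. A minimal stdlib-only sketch, using the field names from the JSON above (the function name and checks are illustrative):

```python
# Minimal defensive validation of an extraction payload before ERP
# submission. Field names mirror the JSON example above; the specific
# checks are illustrative.

REQUIRED = {"vendor", "invoice_number", "total_amount"}


def validate_extraction(doc: dict) -> list:
    """Return a list of problems; an empty list means the payload is usable."""
    problems = [f"missing field: {f}" for f in REQUIRED - doc.keys()]

    amount = doc.get("total_amount", {})
    if not isinstance(amount.get("value"), (int, float)):
        problems.append("total_amount.value is not numeric")

    for name in REQUIRED & doc.keys():
        conf = doc[name].get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            problems.append(f"{name}: confidence out of range")

    return problems
```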

Results

After 12 weeks of development and phased rollout:

Metric              | Before | After                 | Change
--------------------|--------|-----------------------|-------
Automation rate     | 0%     | 85%                   | +85%
Time per invoice    | 8 min  | 45 sec (flagged only) | -91%
Error rate          | 8-12%  | <2%                   | -80%
Monthly capacity    | 2,000  | 6,000+                | +200%
FTEs on data entry  | 3      | 1 (reviewer role)     | -67%

The two FTEs freed from data entry were reassigned to vendor relationship management and process improvement—work that actually benefits from human judgment.

What We Learned

Start with the human workflow, not the AI. We spent the first week shadowing the AP team. Understanding why certain invoices took longer (multi-page, handwritten notes, unusual formats) informed our confidence thresholds and routing logic.

Confidence calibration matters. Our initial thresholds were too conservative—too many documents went to human review. We tuned based on actual error rates in production, finding the sweet spot where auto-approved documents had <1% error rate.

The review interface is the product. Half our development time went into the human review UI. Making corrections fast and frustration-free was critical to adoption. The team actually prefers the new system because when they do intervene, their time is well-spent.

Technical Details

For teams building similar systems, here’s what worked:

  • Document preprocessing: Used PyMuPDF for PDF handling, Pillow for image processing. Quality assessment catches faxes and poor scans before they waste API calls.
  • Prompt engineering: Few-shot examples in the prompt significantly improved accuracy on unusual formats. We maintain a library of “hard” examples.
  • Cost management: At ~$0.03 per invoice for extraction, the system pays for itself many times over compared to human data entry costs.
  • Monitoring: Datadog for infrastructure, custom dashboards for extraction quality metrics. Alert on confidence score drift—it’s an early warning for format changes.
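The cost-management claim is easy to sanity-check with back-of-envelope arithmetic using the figures in this case study (2,000 invoices/month, ~$0.03/extraction, ~8 min manual before, 15% flagged at ~45 sec after). The 15% flagged share follows from the 85% automation rate; everything else is quoted above.

```python
# Back-of-envelope monthly economics at the original 2,000-invoice volume,
# using the figures quoted in this case study.
invoices_per_month = 2_000
api_cost_per_invoice = 0.03   # ~$0.03 extraction cost
manual_minutes = 8            # before: ~8 min per invoice
flagged_share = 0.15          # 85% automated -> 15% flagged
review_seconds = 45           # after: ~45 sec per flagged invoice

api_cost = invoices_per_month * api_cost_per_invoice
manual_hours = invoices_per_month * manual_minutes / 60
review_hours = invoices_per_month * flagged_share * review_seconds / 3600

print(f"API spend:      ${api_cost:.0f}/month")
print(f"Review before:  {manual_hours:.0f} h/month")
print(f"Review after:   {review_hours:.2f} h/month")
```

Roughly $60/month in API spend replaces ~267 hours of manual review with under 4 hours of flagged-item review, which is why the comparison to data-entry labor is not close.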

Have a similar challenge?

Let's talk about what we can build together.

Book a call