Invoice Processing Pipeline
Measurable Outcomes
The Challenge
Manual review of ~2,000 invoices/month consumed 3 FTEs dedicated to data entry and validation. Error rates were 8-12% on key fields, causing downstream reconciliation issues.
The Outcome
85% of invoices now process without human touch. Review time per invoice dropped from ~8 min to ~45 sec for flagged items. Team handles 3x the volume with the same headcount.
The Problem
A mid-size financial services firm was drowning in invoice processing. Their accounts payable team spent the majority of their time on data entry—manually keying vendor names, amounts, dates, and line items from PDFs into their ERP system.
The numbers painted a clear picture:
- ~2,000 invoices/month requiring manual processing
- 3 full-time employees dedicated to data entry and validation
- 8-12% error rate on key fields
- 8 minutes average per invoice, including verification
- Downstream problems: Reconciliation issues, delayed payments, vendor relationship strain
They’d looked at traditional OCR solutions, but accuracy on their diverse invoice formats (50+ vendors, each with different layouts) was too low to be useful.
What We Built
Core Architecture
We designed a document extraction pipeline with Claude at the center, but the key wasn’t just “use AI for OCR.” The architecture decisions were what made it production-ready:
```
┌─────────────────────────────────────────────────────┐
│                   Invoice Upload                    │
│                (API / Email / SFTP)                 │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│               Document Preprocessing                │
│   • PDF → Image conversion                          │
│   • Multi-page detection                            │
│   • Quality assessment                              │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                  Claude Extraction                  │
│   • Structured output (JSON schema)                 │
│   • Per-field confidence scores                     │
│   • Reasoning traces for debugging                  │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                 Confidence Routing                  │
│   • High (>0.90) → Auto-approve                     │
│   • Medium (0.85-0.90) → Spot-check queue           │
│   • Low (<0.85) → Human review required             │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                   ERP Integration                   │
│   • Batch submission                                │
│   • Error handling & retry                          │
│   • Audit trail                                     │
└─────────────────────────────────────────────────────┘
```
Key Technical Decisions
1. Confidence Scoring Per Field
Not all fields are equally hard to extract. Vendor name is usually clear; line item descriptions can be ambiguous. We designed the system to output confidence scores per field, not just per document.
This allowed nuanced routing: a document might auto-process even with one low-confidence field if that field is non-critical.
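As a concrete illustration, routing along these lines fits in a few lines of Python. The thresholds mirror the diagram above, but the field names and the critical-field set are hypothetical stand-ins, not the production values:

```python
# Sketch: per-field confidence routing (illustrative names and thresholds).
from enum import Enum

class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    SPOT_CHECK = "spot_check"
    HUMAN_REVIEW = "human_review"

# Hypothetical: fields whose confidence alone decides routing.
CRITICAL_FIELDS = {"vendor", "invoice_number", "total_amount"}

def route_document(confidences: dict[str, float]) -> Route:
    """Route a document from its {field_name: confidence} map.

    Non-critical fields never block auto-approval on their own, so a
    document with one shaky line-item description can still auto-process.
    """
    critical = [c for name, c in confidences.items() if name in CRITICAL_FIELDS]
    worst = min(critical, default=0.0)  # missing critical fields force review
    if worst > 0.90:
        return Route.AUTO_APPROVE
    if worst >= 0.85:
        return Route.SPOT_CHECK
    return Route.HUMAN_REVIEW
```

Under this sketch, a document scoring 0.95+ on every critical field auto-approves even if `line_items` comes back at 0.62.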
2. Human-in-the-Loop by Default
We didn’t try to eliminate humans—we optimized their time. The review interface shows:
- Original document alongside extracted data
- Fields flagged for attention highlighted
- One-click correction with keyboard shortcuts
Average review time for flagged items: 45 seconds. The system learns from corrections to improve future extractions on similar formats.
3. Temporal for Workflow Orchestration
Invoice processing isn’t a single API call—it’s a workflow with multiple steps that can fail independently. We used Temporal to handle:
- Retry logic with exponential backoff
- Human task assignment and tracking
- Timeout handling for stalled workflows
- Audit logging for compliance
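A minimal sketch of what this shape looks like in Temporal's Python SDK. The activity names, timeouts, and retry settings below are illustrative assumptions, not the production configuration:

```python
# Illustrative Temporal workflow: each step is an activity that retries
# independently. Activity names, timeouts, and retry settings are hypothetical.
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

RETRY = RetryPolicy(
    initial_interval=timedelta(seconds=2),
    backoff_coefficient=2.0,  # exponential backoff between attempts
    maximum_attempts=5,
)

@workflow.defn
class InvoiceWorkflow:
    def __init__(self) -> None:
        self.review_complete = False

    @workflow.signal
    def complete_review(self) -> None:
        # Sent by the review UI when a human finishes a flagged invoice.
        self.review_complete = True

    @workflow.run
    async def run(self, invoice_id: str) -> str:
        pages = await workflow.execute_activity(
            "preprocess_document", invoice_id,
            start_to_close_timeout=timedelta(minutes=2), retry_policy=RETRY,
        )
        extraction = await workflow.execute_activity(
            "extract_with_claude", pages,
            start_to_close_timeout=timedelta(minutes=5), retry_policy=RETRY,
        )
        route = await workflow.execute_activity(
            "route_by_confidence", extraction,
            start_to_close_timeout=timedelta(seconds=30), retry_policy=RETRY,
        )
        if route != "auto_approve":
            # Park until the reviewer signals; the timeout surfaces
            # stalled reviews instead of letting them hang forever.
            await workflow.wait_condition(
                lambda: self.review_complete, timeout=timedelta(days=2)
            )
        return await workflow.execute_activity(
            "submit_to_erp", extraction,
            start_to_close_timeout=timedelta(minutes=2), retry_policy=RETRY,
        )
```

Because Temporal persists workflow state, a crash mid-pipeline resumes at the failed step rather than reprocessing the invoice, and the event history doubles as a compliance audit trail.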
4. Structured Outputs, Not Free Text
We used Claude’s structured output capabilities to enforce a strict JSON schema. This eliminated parsing errors and ensured the ERP system always received correctly formatted data.
```json
{
  "vendor": {
    "name": "Acme Supplies Inc.",
    "confidence": 0.95
  },
  "invoice_number": {
    "value": "INV-2025-1847",
    "confidence": 0.98
  },
  "total_amount": {
    "value": 1547.50,
    "currency": "USD",
    "confidence": 0.97
  },
  "line_items": [...]
}
```
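One way to enforce a schema like this with the Anthropic API (not necessarily the exact mechanism used here) is forced tool use: the tool's `input_schema` carries the invoice schema, and `tool_choice` compels the model to return a conforming tool call. The model name, the abbreviated schema, and the `page_png_b64` variable below are illustrative:

```python
# Sketch: schema-conforming extraction via forced tool use
# (schema abbreviated; model name illustrative).
import anthropic

client = anthropic.Anthropic()

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "object",
            "properties": {
                "value": {"type": "string"},
                "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            },
            "required": ["value", "confidence"],
        },
        # ... vendor, total_amount, line_items elided for brevity
    },
    "required": ["invoice_number"],
}

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    tools=[{
        "name": "record_invoice",
        "description": "Record extracted invoice fields with per-field confidence.",
        "input_schema": invoice_schema,
    }],
    # Forcing this tool guarantees the reply is a schema-shaped tool call.
    tool_choice={"type": "tool", "name": "record_invoice"},
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": page_png_b64,  # assumed: base64 of a preprocessed page
        }},
        {"type": "text", "text": "Extract the invoice fields."},
    ]}],
)
extracted = response.content[0].input  # dict matching invoice_schema
```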
Results
After 12 weeks of development and phased rollout:
| Metric | Before | After | Change |
|---|---|---|---|
| Automation rate | 0% | 85% | +85% |
| Time per invoice | 8 min | 45 sec (flagged only) | -91% |
| Error rate | 8-12% | <2% | -80% |
| Monthly capacity | 2,000 | 6,000+ | +200% |
| FTEs on data entry | 3 | 1 (reviewer role) | -67% |
The two FTEs freed from data entry were reassigned to vendor relationship management and process improvement—work that actually benefits from human judgment.
What We Learned
Start with the human workflow, not the AI. We spent the first week shadowing the AP team. Understanding why certain invoices took longer (multi-page, handwritten notes, unusual formats) informed our confidence thresholds and routing logic.
Confidence calibration matters. Our initial thresholds were too conservative—too many documents went to human review. We tuned based on actual error rates in production, finding the sweet spot where auto-approved documents had <1% error rate.
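The sweep itself is simple to express. A sketch, assuming a production log of (lowest field confidence, was the extraction correct) pairs gathered from human review; the threshold grid is illustrative:

```python
# Sketch: sweep auto-approve thresholds against reviewed extractions.
# Each record is (min_field_confidence, was_correct) from production logs.
def sweep_thresholds(records: list[tuple[float, bool]]) -> None:
    for threshold in (0.80, 0.85, 0.90, 0.95):
        approved = [ok for conf, ok in records if conf > threshold]
        if not approved:
            continue
        auto_rate = len(approved) / len(records)
        error_rate = 1 - sum(approved) / len(approved)
        print(f"t={threshold:.2f}  auto={auto_rate:.1%}  err={error_rate:.2%}")

# Pick the lowest threshold whose auto-approved error rate stays under 1%.
```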
The review interface is the product. Half our development time went into the human review UI. Making corrections fast and frustration-free was critical to adoption. The team actually prefers the new system because when they do intervene, their time is well-spent.
Technical Details
For teams building similar systems, here’s what worked:
- Document preprocessing: Used PyMuPDF for PDF handling, Pillow for image processing. Quality assessment catches faxes and poor scans before they waste API calls (see the sketch after this list).
- Prompt engineering: Few-shot examples in the prompt significantly improved accuracy on unusual formats. We maintain a library of “hard” examples.
- Cost management: At ~$0.03 per invoice for extraction, the system pays for itself many times over compared to human data entry costs.
- Monitoring: Datadog for infrastructure, custom dashboards for extraction quality metrics. Alert on confidence score drift—it’s an early warning for format changes.
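For teams wiring up the same stack, here is a condensed sketch of that preprocessing step, assuming PyMuPDF (`fitz`) and Pillow; the DPI and quality thresholds are illustrative, not tuned values:

```python
# Sketch: PDF → page images plus a cheap quality gate
# (thresholds are illustrative, not the production values).
import io

import fitz  # PyMuPDF
from PIL import Image, ImageStat

def pdf_to_pages(path: str, dpi: int = 200) -> list[Image.Image]:
    """Render each PDF page to a PIL image; multi-page PDFs yield one per page."""
    doc = fitz.open(path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)
        pages.append(Image.open(io.BytesIO(pix.tobytes("png"))))
    return pages

def looks_usable(img: Image.Image) -> bool:
    """Reject obviously bad scans and faxes before spending an API call."""
    gray = img.convert("L")
    stat = ImageStat.Stat(gray)
    # Near-zero contrast usually means a blank or washed-out scan.
    if stat.stddev[0] < 10:
        return False
    # Tiny renders won't carry legible line items.
    if min(img.size) < 500:
        return False
    return True
```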