
Invoice Processing Pipeline

Financial Services Firm · Financial Services
Duration 12 weeks
Team Size 2 people
Completed November 2025

Measurable Outcomes

85% Automation Rate from 0%
45 sec Review Time from 8 min
3x Volume Capacity same headcount

The Challenge

Manual review of ~2,000 invoices/month consumed 3 FTEs dedicated to data entry and validation. Error rates were 8-12% on key fields, causing downstream reconciliation issues.

The Outcome

85% of invoices now process without human touch. Review time per invoice dropped from ~8 min to ~45 sec for flagged items. Team handles 3x the volume with the same headcount.

The Problem

A mid-size financial services firm was drowning in invoice processing. Their accounts payable team spent the majority of their time on data entry—manually keying vendor names, amounts, dates, and line items from PDFs into their ERP system.

The numbers painted a clear picture:

  • ~2,000 invoices/month requiring manual processing
  • 3 full-time employees dedicated to data entry and validation
  • 8-12% error rate on key fields
  • 8 minutes average per invoice, including verification
  • Downstream problems: Reconciliation issues, delayed payments, vendor relationship strain

They’d looked at traditional OCR solutions, but accuracy on their diverse invoice formats (50+ vendors, each with different layouts) was too low to be useful.

What We Built

Core Architecture

We designed a document extraction pipeline with Claude at the center, but the key wasn’t just “use AI for OCR.” The architecture decisions were what made it production-ready:

┌─────────────────────────────────────────────────────┐
│                 Invoice Upload                       │
│              (API / Email / SFTP)                    │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│              Document Preprocessing                  │
│  • PDF → Image conversion                           │
│  • Multi-page detection                             │
│  • Quality assessment                               │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│              Claude Extraction                       │
│  • Structured output (JSON schema)                  │
│  • Per-field confidence scores                      │
│  • Reasoning traces for debugging                   │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│              Confidence Routing                      │
│  • High (>0.90) → Auto-approve                      │
│  • Medium (0.85-0.90) → Spot-check queue            │
│  • Low (<0.85) → Human review required              │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│              ERP Integration                         │
│  • Batch submission                                 │
│  • Error handling & retry                           │
│  • Audit trail                                      │
└─────────────────────────────────────────────────────┘

Key Technical Decisions

1. Confidence Scoring Per Field

Not all fields are equally hard to extract. Vendor name is usually clear; line item descriptions can be ambiguous. We designed the system to output confidence scores per field, not just per document.

This allowed nuanced routing: a document might auto-process even with one low-confidence field if that field is non-critical.
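The routing described above can be sketched as a small function. The thresholds come from the confidence-routing stage of the architecture; the field names and the choice of which fields count as critical are illustrative assumptions, not the firm's actual configuration.

```python
# Sketch of per-field confidence routing, using the thresholds from the
# architecture diagram. Field names and criticality are illustrative.

CRITICAL_FIELDS = {"vendor", "invoice_number", "total_amount"}
AUTO_APPROVE = 0.90   # >0.90 -> auto-approve
SPOT_CHECK = 0.85     # 0.85-0.90 -> spot-check queue


def route(extraction: dict) -> str:
    """Return 'auto', 'spot_check', or 'review' for one extracted invoice.

    `extraction` maps field name -> {"value": ..., "confidence": float}.
    Only critical fields gate the decision, so a low-confidence
    non-critical field alone does not force human review.
    """
    worst_critical = min(
        (f["confidence"] for name, f in extraction.items()
         if name in CRITICAL_FIELDS),
        default=0.0,  # no critical fields extracted -> always review
    )
    if worst_critical >= AUTO_APPROVE:
        return "auto"
    if worst_critical >= SPOT_CHECK:
        return "spot_check"
    return "review"
```

Keying the decision to the worst critical field is what lets a document with one shaky line-item description still auto-process.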

2. Human-in-the-Loop by Default

We didn’t try to eliminate humans—we optimized their time. The review interface shows:

  • Original document alongside extracted data
  • Fields flagged for attention highlighted
  • One-click correction with keyboard shortcuts

Average review time for flagged items: 45 seconds. The system learns from corrections to improve future extractions on similar formats.
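One way the "learns from corrections" loop can work is to bank each human correction per vendor and replay recent ones as few-shot examples in later extraction prompts. This is a hypothetical sketch of that mechanism; the class and method names are invented for illustration.

```python
# Hypothetical sketch: bank human corrections per vendor so recent ones
# can be replayed as few-shot examples in later extraction prompts.
from collections import defaultdict


class CorrectionStore:
    def __init__(self, max_per_vendor: int = 3):
        self.max_per_vendor = max_per_vendor
        # vendor -> list of (extracted, corrected) pairs, oldest first
        self._examples = defaultdict(list)

    def record(self, vendor: str, extracted: dict, corrected: dict) -> None:
        """Keep only the most recent corrections for each vendor."""
        bucket = self._examples[vendor]
        bucket.append((extracted, corrected))
        del bucket[:-self.max_per_vendor]  # trim to the newest N

    def few_shot(self, vendor: str) -> list:
        """Examples to prepend to the extraction prompt for this vendor."""
        return list(self._examples[vendor])
```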

3. Temporal for Workflow Orchestration

Invoice processing isn’t a single API call—it’s a workflow with multiple steps that can fail independently. We used Temporal to handle:

  • Retry logic with exponential backoff
  • Human task assignment and tracking
  • Timeout handling for stalled workflows
  • Audit logging for compliance
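Temporal expresses retries declaratively (initial interval, backoff coefficient, maximum attempts), so you rarely write this loop yourself. As a plain-Python illustration of the semantics, the first bullet above amounts to:

```python
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fn(), retrying on exception with exponential backoff.

    Plain-Python illustration of the retry semantics Temporal provides
    declaratively; parameter names here are illustrative, not Temporal's.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            # 1s, 2s, 4s, ... capped at max_delay
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
```

The difference in production is that Temporal persists this state, so a crashed worker resumes the workflow instead of restarting the invoice from scratch.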

4. Structured Outputs, Not Free Text

We used Claude’s structured output capabilities to enforce a strict JSON schema. This eliminated parsing errors and ensured the ERP system always received correctly formatted data.

{
  "vendor": {
    "name": "Acme Supplies Inc.",
    "confidence": 0.95
  },
  "invoice_number": {
    "value": "INV-2025-1847",
    "confidence": 0.98
  },
  "total_amount": {
    "value": 1547.50,
    "currency": "USD",
    "confidence": 0.97
  },
  "line_items": [...]
}
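Even with schema-enforced output, it is cheap to validate the payload defensively before ERP submission. A minimal stdlib-only sketch, using the field names from the JSON above (the function name and checks are illustrative):

```python
# Minimal defensive validation of an extraction payload before ERP
# submission. Field names mirror the JSON example above; the specific
# checks are illustrative.

REQUIRED = {"vendor", "invoice_number", "total_amount"}


def validate_extraction(doc: dict) -> list:
    """Return a list of problems; an empty list means the payload is usable."""
    problems = [f"missing field: {f}" for f in REQUIRED - doc.keys()]

    amount = doc.get("total_amount", {})
    if not isinstance(amount.get("value"), (int, float)):
        problems.append("total_amount.value is not numeric")

    for name in REQUIRED & doc.keys():
        conf = doc[name].get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            problems.append(f"{name}: confidence out of range")

    return problems
```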

Results

After 12 weeks of development and phased rollout:

Metric              | Before | After                 | Change
--------------------|--------|-----------------------|-------
Automation rate     | 0%     | 85%                   | +85%
Time per invoice    | 8 min  | 45 sec (flagged only) | -91%
Error rate          | 8-12%  | <2%                   | -80%
Monthly capacity    | 2,000  | 6,000+                | +200%
FTEs on data entry  | 3      | 1 (reviewer role)     | -67%

The two FTEs freed from data entry were reassigned to vendor relationship management and process improvement—work that actually benefits from human judgment.

What We Learned

Start with the human workflow, not the AI. We spent the first week shadowing the AP team. Understanding why certain invoices took longer (multi-page, handwritten notes, unusual formats) informed our confidence thresholds and routing logic.

Confidence calibration matters. Our initial thresholds were too conservative—too many documents went to human review. We tuned based on actual error rates in production, finding the sweet spot where auto-approved documents had <1% error rate.

The review interface is the product. Half our development time went into the human review UI. Making corrections fast and frustration-free was critical to adoption. The team actually prefers the new system because when they do intervene, their time is well-spent.

Technical Details

For teams building similar systems, here’s what worked:

  • Document preprocessing: Used PyMuPDF for PDF handling, Pillow for image processing. Quality assessment catches faxes and poor scans before they waste API calls.
  • Prompt engineering: Few-shot examples in the prompt significantly improved accuracy on unusual formats. We maintain a library of “hard” examples.
  • Cost management: At ~$0.03 per invoice for extraction, the system pays for itself many times over compared to human data entry costs.
  • Monitoring: Datadog for infrastructure, custom dashboards for extraction quality metrics. Alert on confidence score drift—it’s an early warning for format changes.
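The cost-management claim is easy to sanity-check with back-of-envelope arithmetic using the figures in this case study (2,000 invoices/month, ~$0.03/extraction, ~8 min manual before, 15% flagged at ~45 sec after). The 15% flagged share follows from the 85% automation rate; everything else is quoted above.

```python
# Back-of-envelope monthly economics at the original 2,000-invoice volume,
# using the figures quoted in this case study.
invoices_per_month = 2_000
api_cost_per_invoice = 0.03   # ~$0.03 extraction cost
manual_minutes = 8            # before: ~8 min per invoice
flagged_share = 0.15          # 85% automated -> 15% flagged
review_seconds = 45           # after: ~45 sec per flagged invoice

api_cost = invoices_per_month * api_cost_per_invoice
manual_hours = invoices_per_month * manual_minutes / 60
review_hours = invoices_per_month * flagged_share * review_seconds / 3600

print(f"API spend:      ${api_cost:.0f}/month")
print(f"Review before:  {manual_hours:.0f} h/month")
print(f"Review after:   {review_hours:.2f} h/month")
```

Roughly $60/month in API spend replaces ~267 hours of manual review with under 4 hours of flagged-item review, which is why the comparison to data-entry labor is not close.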

Have a similar challenge?

Let's talk about what we can build together.

Book a call