OCR engine
Tesseract 5 with the standard English model. Open source, Apache 2.0, no API keys.
Drag-drop a real invoice or receipt — or pick one of the two synthetic samples. Tesseract reads the printed text; pattern-based extraction pulls invoice number, date, total, tax, and all detected money amounts into JSON.
1 · Pick a sample
2 · Or upload your own
Drag & drop or click below. JPEG / PNG / WebP, max 5 MB.
Result
OCR engine
Tesseract 5 with the standard English model. Open source, Apache 2.0, no API keys.
Pre-processing
Image is fed to Tesseract directly for synthetic samples; real photos benefit from deskewing and thresholding (added when your data needs it).
Field patterns
Regex patterns extract invoice number, date, total, tax, and all currency amounts.
Structured output
Returned as JSON ready for downstream automation — ERP push, payment system, audit trail.
Production swap
For complex layouts we move to LayoutLMv3 / Donut for layout-aware extraction with same JSON output. Tesseract remains the OCR fallback.
Real-world tuning
Per-supplier templates lift accuracy materially. Production systems combine layout extraction with structured templates per vendor.
We build production IDP systems with your document layouts, schema, accuracy SLAs, and ERP integration — typically processing thousands of documents a day at over 95% straight-through-processing rates.
Book a 30-minute consultation. We will walk through the use case, sketch the value case, and tell you honestly whether we can help.