project seeking funding

OCR + AI
Book Restoration

Digitizing the complete original editions of Srila Prabhupada's books using cloud-based OCR and AI verification. 24,000 pages. Pixel-perfect. Verified against multiple sources.

24K
pages to scan
67
original volumes
~$150
AI extraction cost
99%+
target accuracy

Project Funding

$420 / $5,000

Covers: high-quality scanning, cloud OCR processing, AI verification, human review, and digital publishing.

Support This Project

Why This Matters

Preservation

The original print editions from 1968-1977 are deteriorating. Physical copies are rare and fragile. A verified digital copy ensures these editions survive for generations.

Authenticity

Post-1978 editions contain thousands of unauthorized changes. By scanning the originals, we create an unalterable reference that can be compared against any other edition.

Accessibility

Digital text can be searched, translated, printed on demand, read on any device, and built into apps and websites. Scanned images alone cannot do this.

Print Restoration

With verified digital text matched to the original layout, new print runs of the original editions become possible — faithful to the last comma.

The Pipeline

Seven stages from physical book to verified digital text. Each stage has quality gates to ensure nothing is lost or changed.

1

High-Quality Scanning

Source the best available scans of the original editions. Ideally 300-600 DPI, full-color for plates, grayscale for text pages. Each page must be clean, flat, and properly aligned.

Technical details

Preferred sources: Internet Archive scans, BBT originals, private collector high-res scans. The 1972 Macmillan Bhagavad-gita is already available at high resolution from Internet Archive.

Scanning specs: 300 DPI minimum for OCR accuracy. 600 DPI preferred for print restoration. TIFF or lossless PNG for archival, JPEG for processing. Color depth: 24-bit for color plates, 8-bit grayscale for text.

Pre-processing: deskew, crop margins, normalize contrast. Tools: ScanTailor, ImageMagick, or custom Python scripts with OpenCV.

2

AI Vision Text Extraction

Each page image is sent to a vision-language model (Gemini 2.5 Flash or Claude Sonnet) that doesn't just pattern-match pixels — it understands the text it's reading. This is critical for Sanskrit diacriticals: the AI knows the word should be "Kṛṣṇa" not "Krsna" because it understands the language, not just the shapes.

Technical details

Why vision LLMs, not traditional OCR: Traditional OCR (Google Vision, Textract) achieves ~97% accuracy on general text but only ~90-93% on diacritical marks specifically. The dots below (ṣ, ṇ, ḍ, ṭ), macrons (ā, ī, ū), and special characters (ś, ṁ, ñ) are where traditional OCR fails most. Vision-language models understand the context — they know Sanskrit vocabulary, so they produce the right diacriticals from the start.

Primary engine: Gemini 2.5 Flash — best cost/quality ratio for vision tasks. Batch processing mode. Prompt instructs: extract text exactly as printed, preserve all IAST diacritical marks, maintain paragraph structure.

Fallback/verification: Claude Sonnet 4.6 via batch API (50% discount) for pages where Gemini confidence is low or results need a second opinion.
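A per-page extraction request might be assembled along these lines. This is a sketch of a Gemini-style `generateContent` body (inline image plus instruction text); the exact prompt wording and field layout are illustrative assumptions:

```python
import base64
from pathlib import Path

# Prompt per the plan above: exact text, IAST preserved, structure kept.
EXTRACTION_PROMPT = (
    "Extract the text on this page exactly as printed. "
    "Preserve all IAST diacritical marks (ā ī ū ṛ ṭ ḍ ṇ ś ṣ ṁ ḥ ñ). "
    "Maintain paragraph structure, verse numbers, synonyms, translation, "
    "and purport as Markdown."
)

def build_page_request(image_path: str) -> dict:
    """Build one request body: base64-encoded page image + extraction prompt."""
    data = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "contents": [{
            "parts": [
                {"inline_data": {"mime_type": "image/png", "data": data}},
                {"text": EXTRACTION_PROMPT},
            ],
        }],
    }
```

Batch mode would queue one such body per page; the same structure works for the Claude fallback with that API's equivalent image-block format.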

Cost comparison for 24,000 pages:

Approach | Diacritical accuracy | Cost
Google Cloud Vision (traditional OCR) | ~90-93% | $35
Gemini 2.5 Flash (vision LLM) | ~97-99% | $80-150
Claude Sonnet batch (vision LLM) | ~97-99% | $150-250
Gemini + dictionary correction (our approach) | 99%+ | ~$150

The key insight: Traditional OCR is 3-4x cheaper but produces 5-10x more diacritical errors that need manual correction. Vision LLMs cost more upfront but save enormous human review time. For a sacred text project where every diacritical matters, the extra $100 is easily justified.

Output: Clean text per page with verse structure preserved. Markdown format with headings, verse numbers, synonyms, translations, and purports correctly identified.

3

Automated Diacritical Correction

Sanskrit transliteration uses special characters (ā, ī, ū, ṛ, ṣ, ṇ, ś, ṁ, etc.) that OCR engines often confuse with their plain equivalents (a, i, u, r, s, n). A post-processing script cross-references every word against a dictionary of 50,000+ known Sanskrit terms from our existing archive.

Technical details

The problem: OCR commonly confuses ṣ with s or ś, ṇ with n or ñ, ṛ with r, ā with a, and ṁ with m. These are IAST (International Alphabet of Sanskrit Transliteration) characters, encoded mostly in the Unicode Latin Extended Additional block.

Dictionary source: Our existing 21,000+ markdown files contain every Sanskrit term Prabhupada used, already in correct IAST. This gives us a massive verified vocabulary.

Process: For each word containing Latin characters that could be IAST, fuzzy-match against the dictionary. If a close match exists with different diacriticals, flag for automatic correction or human review.

Tool: Custom Node.js script using Levenshtein distance with diacritical-aware weighting. A substitution of ṣ to s costs less than s to t.
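The plan calls for a Node.js script; as an illustration of the weighting idea, here is a Python sketch in which a diacritical-only substitution (same base letter, different combining marks) costs a fraction of a full edit. The 0.2 weight and 0.5 match threshold are illustrative, not tuned values:

```python
import unicodedata

def strip_marks(word):
    """Remove combining marks: 'Kṛṣṇa' -> 'Krsna'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def weighted_distance(a, b):
    """Levenshtein distance where a diacritical-only substitution costs 0.2."""
    prev = [i * 1.0 for i in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cost = 0.0
            elif strip_marks(ca) == strip_marks(cb):
                cost = 0.2   # e.g. 'ṣ' vs 's': cheap diacritical fix
            else:
                cost = 1.0   # e.g. 's' vs 't': full substitution
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_dist=0.5):
    """Return the closest verified IAST term within max_dist, else None."""
    best = min(dictionary, key=lambda term: weighted_distance(word, term))
    return best if weighted_distance(word, best) <= max_dist else None
```

Because diacritical-only edits are cheap, "Krsna" sits within easy reach of "Kṛṣṇa" while genuinely different words stay far away, which is exactly the flag/auto-correct split described above.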

4

Digital Comparison Source

Independently, we prepare a second digital copy of the same text from another source — the original VedaBase export, the existing VANIWIKI markdown, or another verified digital edition. This gives us two independent digital representations of the same book to compare.

Technical details

Sources: (1) Original VedaBase Folio export (pre-edit versions), (2) VANIWIKI content/ markdown files (21,000+ files already digitized), (3) iskconpress/books GitHub repo.

Alignment: Both texts need to be aligned at the verse level. Use verse reference numbers (BG 1.1, SB 1.1.1, etc.) as anchors. Synonyms, translations, and purports are aligned separately.

Normalization: Strip formatting (bold, italic, headings), normalize whitespace, normalize Unicode (NFC form). The goal is pure text comparison.
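The alignment and normalization steps could be sketched like this in Python. The regex patterns and the BG/SB/CC anchor format are assumptions based on the verse references mentioned above:

```python
import re
import unicodedata

def normalize_for_diff(text):
    """Strip markdown formatting, collapse whitespace, normalize to NFC."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"[*_]{1,3}(.+?)[*_]{1,3}", r"\1", text)      # bold/italic
    text = re.sub(r"\s+", " ", text)                             # whitespace
    return text.strip()

# Verse references like 'BG 1.1' or 'SB 1.1.1' serve as alignment anchors.
VERSE_REF = re.compile(r"\b(BG|SB|CC)\s+(\d+(?:\.\d+)+)\b")

def verse_anchors(text):
    """List the verse references found in reading order."""
    return [f"{m.group(1)} {m.group(2)}" for m in VERSE_REF.finditer(text)]
```

Matching anchor sequences in both texts partitions each book into verse-sized chunks, so the diff in the next stage never has to align more than one verse at a time.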

5

Diff Generation

A word-level diff is computed between the OCR output (from the scan) and the digital comparison text. Every difference is logged: added words, missing words, changed words, diacritical differences, punctuation changes. This produces a "difference document" — the most critical artifact in the pipeline.

Technical details

Algorithm: Word-level diff using Myers' algorithm (same as git diff). Each difference is categorized: (a) OCR error — the scan reading is wrong, (b) Editorial change — the digital source differs from the original print, (c) Ambiguous — needs human review.

Auto-classification: If the OCR confidence for a word is below 90% and the comparison text has a plausible alternative, classify as probable OCR error. If OCR confidence is high and the texts differ, flag as potential editorial change.

Output: HTML diff document with side-by-side comparison, color-coded changes, links to the scan image for each page, and per-difference classification. Also: JSON export for programmatic processing.
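A minimal word-level diff can be sketched with Python's difflib, which uses its own matching algorithm rather than Myers' but illustrates the comparison and the diacritical-vs-review categorization. The category names are illustrative:

```python
import difflib
import unicodedata

def strip_marks(word):
    """Remove combining marks so 'Kṛṣṇa' and 'Krsna' compare equal."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def word_diff(ocr_text, reference_text):
    """Return one entry per differing span, tagged 'diacritical' or 'review'."""
    a, b = ocr_text.split(), reference_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        entry = {"op": tag, "ocr": a[i1:i2], "ref": b[j1:j2]}
        # Same base letters, different marks -> candidate for auto-correction.
        if tag == "replace" and [strip_marks(w) for w in a[i1:i2]] == \
                                [strip_marks(w) for w in b[j1:j2]]:
            entry["category"] = "diacritical"
        else:
            entry["category"] = "review"
        differences.append(entry)
    return differences
```

Each entry would then be joined with the OCR confidence score to produce the (a)/(b)/(c) classification described above.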

6

AI + Human Verification

Every flagged difference is resolved. For each one, we look at the original scan image and determine which reading is correct. AI (Claude) assists by analyzing the scan image alongside both text variants, but a human reviewer has the final say. The scan is always the ultimate authority.

Technical details

AI reviewer: Claude Haiku (batch API, ~$0.80/1,000 pages) receives: the cropped scan image of the disputed area + the OCR reading + the comparison text reading. It returns a verdict with confidence level and reasoning.

Human reviewer: A web-based review interface shows the scan side-by-side with both readings. The reviewer clicks to accept one reading or type a correction. Results are logged with reviewer identity for accountability.

Priority triage: Diacritical differences are most common (thousands). Pure OCR errors are second. Actual editorial differences between editions are rarest and most valuable — these are the changes made after 1978.

Estimated review volume: roughly 3% of the 24,000 pages (~720) are expected to still carry at least one flagged word after extraction. Many of those flags will be auto-resolved by the dictionary step, leaving an estimated 200-400 pages requiring human eyes.

7

Publishing & Distribution

The verified text is published in multiple formats: structured markdown (for the archive and apps), PDF (for print and download), EPUB (for e-readers), and HTML (for the web). Every file carries a verification hash and provenance chain back to the original scan.

Technical details

Output formats: (1) Markdown with YAML frontmatter — feeds into prabhupada.io and VaniReader, (2) PDF via WeasyPrint — print-ready with original typography, (3) EPUB — for Kindle, Apple Books, etc., (4) Plain text — for search indexing and translation.

Provenance: Each output file includes metadata: source scan identifier, OCR engine version, diff status (matched/corrected/reviewed), reviewer identity if human-reviewed, SHA-256 hash of the verified text.
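Stamping provenance onto an output file could look like this; the YAML field names below are assumptions for illustration, not a fixed schema:

```python
import hashlib

def provenance_header(text, scan_id, diff_status, reviewer=None):
    """Prepend YAML frontmatter carrying the provenance chain and text hash."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    lines = [
        "---",
        f"source_scan: {scan_id}",
        f"diff_status: {diff_status}",     # matched / corrected / reviewed
        f"sha256: {digest}",
    ]
    if reviewer:
        lines.append(f"reviewer: {reviewer}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + text
```

Because the hash covers only the verified text (not the frontmatter), anyone can strip the header, re-hash the body, and confirm the file is unaltered.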

Distribution: Published to GitHub (iskconpress/books), prabhupada.io (Obsidian Publish), and available for download at iskcon.io. All formats freely available under open license.

What We're Scanning

Work | Edition | Volumes | Pages
Bhagavad-gita As It Is | 1972 Macmillan | 1 | ~1,050
Srimad-Bhagavatam | 1970-77 originals | 30 | ~12,000
Sri Caitanya-caritamrta | 1973-75 originals | 17 | ~7,200
Other books (20+ titles) | First editions | ~20 | ~3,500
Total | | 67 | ~24,000

Choosing the Right Tool

Not all OCR is equal. For sacred texts where every diacritical mark matters, the choice of extraction engine determines whether you spend $150 on AI or 40+ hours on manual correction.

Engine | Type | General accuracy | Diacritical accuracy | Cost / 24K pages
Gemini 2.5 Flash (Google, cloud API) | Vision LLM | ~98% | ~97-99% | $80-150
Claude Sonnet 4.6 (Anthropic, cloud API, batch 50% off) | Vision LLM | ~98% | ~97-99% | $150-250
GPT-4o (OpenAI, cloud API) | Vision LLM | ~97% | ~96-98% | $120-200
Google Cloud Vision (DOCUMENT_TEXT_DETECTION, cloud API) | Cloud OCR | ~97% | ~90-93% | $35
AWS Textract (Amazon, cloud API) | Cloud OCR | ~97% | ~90-93% | $35
Tesseract 5 (open source, runs locally, free) | Local OCR | ~93% | ~78-85% | $0
Tesseract 5 + Sanskrit training (custom-trained model, local) | Local OCR | ~95% | ~88-92% | $0 (+ training time)
EasyOCR (open source, Python, local GPU) | Local OCR | ~90% | ~70-80% | $0
Surya OCR (open source, transformer-based, local GPU) | Local AI | ~96% | ~90-94% | $0 (needs GPU)
Understanding the accuracy numbers — what do they really mean?

General accuracy measures how well the engine reads standard English text. Most engines do well here (93-98%). This is not where the real challenge lies.

Diacritical accuracy is what matters for this project. Srila Prabhupada's books use IAST transliteration — Latin characters with marks that completely change the meaning. Consider:

Correct: Kṛṣṇa (Krishna — the Supreme Lord)
OCR error: Krsna (missing dots — loses meaning)
Correct: dharmā (religious duty, with macron)
OCR error: dharma (plain Latin — ambiguous)

At 90% diacritical accuracy (traditional OCR), roughly 1 in 10 Sanskrit words has at least one wrong diacritical. Across 24,000 pages with Sanskrit on nearly every page, that's thousands of errors requiring manual correction.

At 98% accuracy (vision LLM), it's closer to 1 in 50 words — still hundreds of errors, but manageable with dictionary correction and spot-checking.

The cost tradeoff: Tesseract is free but produces ~5,000+ diacritical errors. If each takes 30 seconds to find and fix, that's 40+ hours of human labor. Gemini costs ~$150 but produces ~500 errors that our dictionary auto-corrects most of, leaving perhaps 50-100 for human review. The $150 buys back weeks of volunteer time.

Our strategy: Use the best available AI (Gemini 2.5 Flash) as the primary extractor, run dictionary correction for diacriticals, then use Claude as a second opinion on flagged passages. We can also run Tesseract as a free third source — where all three agree, we have very high confidence. Where they disagree, we look at the scan.
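The three-engine agreement check might look like the sketch below, which assumes the three readings have already been aligned word-for-word (stage 4's verse anchors make that alignment feasible):

```python
from collections import Counter

def vote(readings):
    """readings: dict of engine name -> aligned word list.
    Returns [(word, needs_review)]: majority word wins; any disagreement
    flags the word for inspection against the scan."""
    engines = list(readings.values())
    assert len({len(words) for words in engines}) == 1, "align words before voting"
    verdicts = []
    for words in zip(*engines):
        best, count = Counter(words).most_common(1)[0]
        # Unanimous -> very high confidence; otherwise look at the scan.
        verdicts.append((best, count < len(engines)))
    return verdicts
```

In practice a 2-of-3 majority still needs review here, because the design treats the scan image (not any engine) as the final authority wherever engines disagree.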

Cost Breakdown

Item | Description | Cost
High-quality scan sourcing | Internet Archive, private collectors, BBT originals | $500
AI vision extraction (Gemini 2.5 Flash) | 24,000 pages with diacritical-aware extraction | $150
Secondary AI verification (Claude batch) | Ambiguous pages reviewed by Claude Sonnet | $30
Cloud storage & compute | GCS, processing scripts, hosting | $50
Human review tooling | Web-based diff review interface | $200
PDF/EPUB generation & publishing | Print-ready output, distribution setup | $200
Contingency & ongoing maintenance | Re-processing, corrections, updates | $500
Maha-prasadam feast | Celebration for completion of the project | $50
Total Project Budget | | $1,680

The $5,000 fundraising goal includes a buffer for unexpected costs, scaling to additional source editions, and building out the review interface as a reusable tool for future projects.

Frequently Asked Questions

Don't we already have digital copies of these books?
Yes — but they come from VedaBase exports and manual transcription, not from the physical originals. These digital copies may contain transcription errors, and they cannot definitively prove what was in the original print. OCR from the actual pages creates an independent verification source tied directly to the physical book.
Can AI accurately read Sanskrit diacritical marks?
Traditional OCR (Google Vision, Tesseract) achieves only ~90-93% accuracy on diacritical marks specifically — the dots, macrons, and special characters that distinguish Sanskrit transliteration. That's why we use vision-language models (Gemini, Claude) instead: they understand that the word is "Kṛṣṇa" not "Krsna" because they know Sanskrit vocabulary, not just pixel patterns. Combined with our 50,000+ word dictionary cross-reference, we target 99%+ final accuracy on diacriticals.
Why not use free/local OCR software like Tesseract?

We evaluated local options carefully. Here's the honest comparison:

Tesseract (open source, free, runs locally): Achieves only ~78% accuracy on Sanskrit text out of the box. Even with specialized training data, character error rate is 2-3%. It consistently fails on diacritical marks — missing dots below letters, confusing macrons with tildes. For 24,000 pages, that means thousands of errors requiring manual correction.

EasyOCR (open source, free): Even worse for Sanskrit — fails to form conjunct characters correctly, weak on Devanagari morphology. Not suitable.

The math: Tesseract is free but produces ~10x more errors on diacriticals. If each error takes 30 seconds to manually find and fix, and there are 5,000+ diacritical errors across 24,000 pages, that's 40+ hours of human correction. Spending ~$150 on AI vision extraction that gets it right the first time saves enormous volunteer time and produces more reliable results.

We may still use Tesseract as a free third-opinion source for the diff comparison — three independent readings of the same page give even higher confidence than two.

How is this different from just scanning the books?
Scanning produces images — pictures of pages. You can look at them, but you can't search, copy, translate, or reformat the text. OCR extracts the actual text from those images, making it searchable, editable, and usable in apps, websites, and new print editions. We do both: preserve the scan images as archival proof, and extract the text for practical use.
What happens with the diff document — the differences found?
This may be the most valuable output of the entire project. The diff document catalogues every place where existing digital editions differ from the original print. Some differences will be OCR errors (which we fix). But others may reveal unauthorized editorial changes made after Prabhupada's departure. The diff becomes a permanent, verifiable record of what changed and when.
Will this be open source?
Yes — completely. The pipeline code, the OCR results, the diff documents, the verified text, and all output formats will be published on GitHub. Anyone can verify our work, run the pipeline themselves, or build upon the results.

Support This Project

Your contribution directly funds the digitization and verification of Srila Prabhupada's original books. Every dollar goes to preservation.

100% volunteer project. Zero overhead. All funds go directly to processing costs.