Personal Project

AIHub

An intelligent document processing platform that combines OCR, PDF manipulation, and AI-powered text extraction to transform how teams handle documents.

Year: 2024
Role: Lead Developer
Technologies: Nuxt 3, Tesseract.js, pdf-lib, Sharp, Azure MSAL, IndexedDB, Tailwind CSS

Overview

AIHub is a document intelligence platform that brings AI-powered processing to the browser. Teams can extract text from images and PDFs, manipulate documents, and process content in batches, all without uploading sensitive documents to external servers.

The Challenge

Document processing in enterprise environments faces several constraints:

  • Sensitive documents can't be uploaded to third-party services
  • Manual data extraction is slow and error-prone
  • Existing OCR solutions require complex server infrastructure
  • Teams need to process documents in various formats
  • Integration with existing authentication systems is required

Solution

A browser-based platform that:

  • Runs OCR entirely in the browser using Tesseract.js
  • Manipulates PDFs client-side with pdf-lib
  • Processes images with Sharp for optimal OCR input
  • Stores data locally with IndexedDB for privacy
  • Authenticates via Azure AD for enterprise deployment

Technical Architecture

Client-Side OCR

Tesseract.js enables accurate text recognition without server round-trips:

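import { createWorker } from 'tesseract.js'

// Initialize a Tesseract worker with the English language model
const worker = await createWorker('eng')

// Recognize the image. Tesseract.js 6 returns only plain text by default,
// so block-level output has to be requested explicitly.
const { data } = await worker.recognize(imageData, {}, { blocks: true })

// Extract text with confidence and position data for each block
const blocks = (data.blocks ?? []).map(block => ({
  text: block.text,
  confidence: block.confidence,
  bbox: block.bbox
}))

// Release the worker's memory once processing is done
await worker.terminate()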

PDF Processing Pipeline

Multi-stage document handling:

  1. Ingestion - Parse PDF structure with pdfjs-dist
  2. Extraction - Render pages to canvas for OCR
  3. Enhancement - Pre-process images with Sharp
  4. Recognition - Run Tesseract on enhanced images
  5. Assembly - Combine results with pdf-lib
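The five stages above can be sketched as a small orchestrator. This is a minimal illustration, not the project's actual code: the `PipelineStages` interface, the `PageImage` and `PageText` shapes, and `processDocument` are hypothetical names, and the real stages would wrap pdfjs-dist, Sharp, Tesseract.js, and pdf-lib.

```typescript
// Hypothetical data shapes passed between stages
interface PageImage { pageNumber: number; pixels: Uint8ClampedArray; width: number; height: number }
interface PageText { pageNumber: number; text: string; confidence: number }

// Each stage is injected, so the orchestrator does not depend on a specific library
interface PipelineStages {
  renderPages: (pdfBytes: Uint8Array) => Promise<PageImage[]> // stages 1-2: ingestion and extraction
  enhance: (page: PageImage) => Promise<PageImage>            // stage 3: enhancement
  recognize: (page: PageImage) => Promise<PageText>           // stage 4: recognition
  assemble: (pages: PageText[]) => Promise<Uint8Array>        // stage 5: assembly
}

async function processDocument(
  pdfBytes: Uint8Array,
  stages: PipelineStages,
): Promise<{ pages: PageText[]; output: Uint8Array }> {
  const rendered = await stages.renderPages(pdfBytes)
  const pages: PageText[] = []
  // Pages are enhanced and recognized sequentially, in document order
  for (const page of rendered) {
    const enhanced = await stages.enhance(page)
    pages.push(await stages.recognize(enhanced))
  }
  const output = await stages.assemble(pages)
  return { pages, output }
}
```

Keeping the stages behind an interface also makes the pipeline easy to exercise with fake stages in tests.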

Image Pre-processing

Sharp optimizes images before OCR:

  • Convert to grayscale for better recognition
  • Apply adaptive thresholding
  • Deskew rotated scans
  • Remove noise and artifacts
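To make the first two steps concrete, here is a rough sketch of grayscale conversion and adaptive mean thresholding on a raw RGBA buffer (the layout used by Canvas `ImageData`). It deliberately does not use the Sharp API. The function names, the window `radius`, and the `offset` are illustrative assumptions rather than the project's settings.

```typescript
// Convert an RGBA buffer to one luminance byte per pixel
function toGrayscale(rgba: Uint8ClampedArray, width: number, height: number): Uint8Array {
  const gray = new Uint8Array(width * height)
  for (let i = 0; i < width * height; i++) {
    const r = rgba[i * 4], g = rgba[i * 4 + 1], b = rgba[i * 4 + 2]
    // ITU-R BT.601 luma weights
    gray[i] = Math.round(0.299 * r + 0.587 * g + 0.114 * b)
  }
  return gray
}

// Adaptive mean thresholding: a pixel becomes black (0) when it is darker than
// the mean of its surrounding window minus `offset`, otherwise white (255).
function adaptiveThreshold(
  gray: Uint8Array, width: number, height: number, radius = 8, offset = 10,
): Uint8Array {
  // Integral image with a zero border, so any window sum costs four lookups
  const stride = width + 1
  const integral = new Float64Array(stride * (height + 1))
  for (let y = 0; y < height; y++) {
    let rowSum = 0
    for (let x = 0; x < width; x++) {
      rowSum += gray[y * width + x]
      integral[(y + 1) * stride + (x + 1)] = integral[y * stride + (x + 1)] + rowSum
    }
  }
  const out = new Uint8Array(width * height)
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      // Clamp the window to the image bounds
      const x0 = Math.max(0, x - radius), y0 = Math.max(0, y - radius)
      const x1 = Math.min(width - 1, x + radius), y1 = Math.min(height - 1, y + radius)
      const area = (x1 - x0 + 1) * (y1 - y0 + 1)
      const sum = integral[(y1 + 1) * stride + (x1 + 1)] - integral[y0 * stride + (x1 + 1)]
        - integral[(y1 + 1) * stride + x0] + integral[y0 * stride + x0]
      out[y * width + x] = gray[y * width + x] < sum / area - offset ? 0 : 255
    }
  }
  return out
}
```

Comparing each pixel against its local neighbourhood, rather than one global cutoff, is what keeps text legible on scans with uneven lighting.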

Local Storage Strategy

IndexedDB provides persistent, private storage:

  • Processed documents cached locally
  • User preferences and settings
  • Offline capability for previously processed files

Key Features

Smart Text Extraction

AI-powered OCR that handles:

  • Scanned documents and photos
  • Multi-column layouts
  • Tables and structured data
  • Handwritten text (with reduced accuracy)

PDF Manipulation

Client-side operations:

  • Merge multiple PDFs
  • Split documents by page
  • Extract specific pages
  • Add watermarks and annotations
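Splitting and extracting pages both start from a page selection. As a sketch, the hypothetical helper below turns a 1-based selection string such as "1-3,5" into zero-based page indices, which is the form pdf-lib's `PDFDocument.copyPages(srcDoc, indices)` accepts. The selection syntax is an assumption for illustration, not a documented AIHub feature.

```typescript
// Parse a 1-based page selection such as "1-3,5" into sorted, de-duplicated
// zero-based page indices, validated against the document's page count.
function parsePageSelection(spec: string, pageCount: number): number[] {
  const indices = new Set<number>()
  for (const part of spec.split(',')) {
    const trimmed = part.trim()
    if (trimmed === '') continue
    // Accept either a single page ("5") or an inclusive range ("1-3")
    const match = /^(\d+)(?:\s*-\s*(\d+))?$/.exec(trimmed)
    if (!match) throw new Error(`Invalid page selection: "${trimmed}"`)
    const start = Number(match[1])
    const end = match[2] !== undefined ? Number(match[2]) : start
    if (start < 1 || end < start || end > pageCount) {
      throw new Error(`Page range out of bounds: "${trimmed}"`)
    }
    for (let page = start; page <= end; page++) indices.add(page - 1)
  }
  return Array.from(indices).sort((a, b) => a - b)
}
```

Validating the selection before touching the PDF means a bad range fails fast instead of partway through a copy.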

Batch Processing

Queue multiple documents for processing:

  • Progress tracking per document
  • Parallel processing with Web Workers
  • Resume interrupted batches
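One way to structure such a queue is a fixed number of "lanes" that pull documents from a shared list. The sketch below assumes a hypothetical `processOne` callback (in the app this would hand the document to a Web Worker) and a hypothetical `onProgress` callback; it illustrates the pattern and is not the project's implementation.

```typescript
type JobStatus = 'pending' | 'running' | 'done' | 'failed'

interface BatchProgress { id: string; status: JobStatus }

// Run `processOne` over all documents with at most `concurrency` in flight,
// reporting every status change for each document through `onProgress`.
async function runBatch<T>(
  docs: { id: string; payload: T }[],
  processOne: (payload: T) => Promise<void>,
  concurrency: number,
  onProgress: (update: BatchProgress) => void,
): Promise<BatchProgress[]> {
  const results: BatchProgress[] = docs.map(d => ({ id: d.id, status: 'pending' as JobStatus }))
  let next = 0
  async function lane(): Promise<void> {
    while (next < docs.length) {
      const index = next++ // no race: this runs synchronously between awaits
      results[index].status = 'running'
      onProgress({ ...results[index] })
      try {
        await processOne(docs[index].payload)
        results[index].status = 'done'
      } catch {
        // A failed document does not stop the rest of the batch
        results[index].status = 'failed'
      }
      onProgress({ ...results[index] })
    }
  }
  const laneCount = Math.max(1, Math.min(concurrency, docs.length))
  await Promise.all(Array.from({ length: laneCount }, () => lane()))
  return results
}
```

Because each document ends with an explicit status, a resumed batch could simply re-queue everything that is not marked `done`.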

Enterprise Authentication

Azure MSAL integration:

  • Single sign-on with company credentials
  • Token-based session management
  • Automatic token refresh
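Token refresh with MSAL usually follows a "silent first, interactive fallback" pattern. The sketch below shows that flow against a hypothetical `TokenClient` interface whose method names mirror `acquireTokenSilent` and `acquireTokenPopup` from `@azure/msal-browser`. It is a simplification: the real request also carries account information, and production code would normally fall back only on interaction-required errors, whereas this version falls back on any silent failure.

```typescript
interface TokenResult { accessToken: string }
interface TokenRequest { scopes: string[] }

// Hypothetical structural subset of a token client, so the flow can be
// exercised without the MSAL library or a browser.
interface TokenClient {
  acquireTokenSilent(request: TokenRequest): Promise<TokenResult>
  acquireTokenPopup(request: TokenRequest): Promise<TokenResult>
}

async function getAccessToken(client: TokenClient, scopes: string[]): Promise<string> {
  const request: TokenRequest = { scopes }
  try {
    // Served from cache, or refreshed in the background when possible
    const result = await client.acquireTokenSilent(request)
    return result.accessToken
  } catch {
    // Assumption: any silent failure triggers an interactive sign-in
    const result = await client.acquireTokenPopup(request)
    return result.accessToken
  }
}
```

Callers ask for a token right before each protected request, so refresh happens transparently instead of on a timer.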

Technical Stack

  • Nuxt 3 for the application framework
  • Tesseract.js 6 for browser-based OCR
  • pdf-lib for PDF creation and manipulation
  • pdfjs-dist for PDF rendering and parsing
  • Sharp for image processing
  • Canvas API for image manipulation
  • IndexedDB (idb) for local storage
  • Azure MSAL for authentication
  • Playwright for E2E testing

Performance Optimizations

Document processing is resource-intensive. Key optimizations:

  • Web Workers - OCR runs off the main thread
  • Progressive loading - Process visible pages first
  • Caching - Store intermediate results
  • Lazy initialization - Load Tesseract only when needed
  • Memory management - Release resources after processing
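Lazy initialization in particular is easy to illustrate. The hypothetical `lazyAsync` helper below wraps any async factory so that it runs on first use only and concurrent callers share one in-flight promise. It is a sketch of the pattern, not the project's code.

```typescript
// Wrap an async factory so it runs at most once per successful initialization.
// Concurrent callers share the same in-flight promise; a failed initialization
// is cleared so a later call can retry.
function lazyAsync<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined
  return () => {
    if (!cached) {
      cached = factory().catch((error) => {
        cached = undefined
        throw error
      })
    }
    return cached
  }
}
```

With Tesseract.js this could look like `const getWorker = lazyAsync(() => createWorker('eng'))`, so the OCR engine and language data are fetched only when the first document is actually processed.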

Privacy by Design

All processing happens in the browser:

  • Documents never leave the user's device
  • No server-side storage of content
  • Authentication tokens stored securely
  • Clear data option for sensitive sessions

Results

  • Processes 50+ page documents in under a minute
  • 95%+ accuracy on clean printed text
  • Zero server infrastructure for document processing
  • Deployed to enterprise teams with strict data policies

Lessons Learned

  1. Browser capabilities are impressive - Modern browsers can handle serious workloads
  2. OCR quality depends on input - Image pre-processing is crucial
  3. Memory limits are real - Large documents need careful chunking
  4. UX during processing matters - Users need feedback for long operations
  5. Offline-first wins - Local storage makes the app feel instant