PDF Processing in the Browser with pdf-lib and Tesseract.js
Build powerful PDF manipulation and OCR capabilities entirely in the browser, keeping sensitive documents private and eliminating server dependencies.
Robert Fridzema
Fullstack Developer

Processing PDFs traditionally required server-side tools. But modern browsers are surprisingly capable - you can parse, manipulate, and even OCR PDFs entirely client-side. Here's how to build a complete PDF processing pipeline in the browser.
Why Browser-Based Processing?
- Privacy - Documents never leave the user's device
- No server costs - Processing happens on user's hardware
- Offline capable - Works without internet after initial load
- Instant feedback - No upload/download delays
The tradeoff is processing speed and memory limits, but for most use cases, it works well.
The Toolkit
| Library | Purpose | Size |
|---|---|---|
| pdf-lib | Create and modify PDFs | 300KB |
| pdfjs-dist | Render and parse PDFs | 500KB |
| Tesseract.js | OCR (text recognition) | 2MB + language data |
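Given those bundle sizes, you probably don't want all three libraries in your initial bundle. One approach (a sketch, not something any of the libraries require) is to pull them in with dynamic imports only when the user actually opens a PDF:

// Load the heavy libraries on demand instead of in the main bundle
async function loadPdfTools() {
  const [pdfjsLib, pdfLib, tesseract] = await Promise.all([
    import('pdfjs-dist'),
    import('pdf-lib'),
    import('tesseract.js'),
  ])
  return { pdfjsLib, pdfLib, tesseract }
}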
Rendering PDFs with PDF.js
PDF.js (from Mozilla) renders PDFs to canvas:
import * as pdfjsLib from 'pdfjs-dist'

// Set worker source (required)
pdfjsLib.GlobalWorkerOptions.workerSrc = '/pdf.worker.min.js'

async function renderPdfPage(
  file: File,
  pageNumber: number,
  scale: number = 1.5
): Promise<HTMLCanvasElement> {
  // Load the PDF
  const arrayBuffer = await file.arrayBuffer()
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise

  // Get the page
  const page = await pdf.getPage(pageNumber)

  // Calculate dimensions
  const viewport = page.getViewport({ scale })

  // Create canvas
  const canvas = document.createElement('canvas')
  canvas.width = viewport.width
  canvas.height = viewport.height

  // Render
  await page.render({
    canvasContext: canvas.getContext('2d')!,
    viewport,
  }).promise

  return canvas
}
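To see it in action, here is a minimal wiring sketch; the #pdf-input and #preview element IDs are assumptions for illustration, not part of the library:

// Hypothetical wiring: the element IDs are placeholders for your own markup
const input = document.querySelector<HTMLInputElement>('#pdf-input')!
input.addEventListener('change', async () => {
  const file = input.files?.[0]
  if (!file) return
  const canvas = await renderPdfPage(file, 1) // first page at default scale
  document.querySelector('#preview')!.replaceChildren(canvas)
})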
Extracting Text
PDF.js can also extract text content:
async function extractText(file: File): Promise<string> {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise

  const textParts: string[] = []

  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i)
    const textContent = await page.getTextContent()

    const pageText = textContent.items
      .map((item: any) => item.str)
      .join(' ')

    textParts.push(pageText)
  }

  return textParts.join('\n\n')
}
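A handy companion: if extraction comes back (nearly) empty, the PDF is probably a scan and needs the OCR path covered next. A rough heuristic sketch, where the 20-character threshold is an arbitrary assumption you should tune for your documents:

async function needsOcr(file: File): Promise<boolean> {
  const text = await extractText(file)
  // Scanned PDFs typically yield no (or only whitespace) text content.
  return text.replace(/\s/g, '').length < 20
}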
OCR with Tesseract.js
When PDFs are scanned images (no selectable text), you need OCR:
import Tesseract from 'tesseract.js'

async function performOCR(
  canvas: HTMLCanvasElement,
  language: string = 'eng'
): Promise<string> {
  const result = await Tesseract.recognize(canvas, language, {
    logger: (m) => console.log(m), // Progress updates
  })

  return result.data.text
}
Multi-Language Support
Tesseract supports 100+ languages. Load them on demand:
async function ocrWithLanguage(
  canvas: HTMLCanvasElement,
  languages: string[]
): Promise<string> {
  // Note: this is the Tesseract.js v4 worker API; since v5 the languages
  // are passed directly to createWorker() instead of loadLanguage/initialize.
  const worker = await Tesseract.createWorker()

  // Load multiple languages
  for (const lang of languages) {
    await worker.loadLanguage(lang)
  }

  await worker.initialize(languages.join('+'))

  const { data } = await worker.recognize(canvas)
  await worker.terminate()

  return data.text
}

// Usage: Dutch and English
const text = await ocrWithLanguage(canvas, ['nld', 'eng'])
Improving OCR Accuracy
Pre-process images for better results:
function preprocessForOCR(canvas: HTMLCanvasElement): HTMLCanvasElement {
  const ctx = canvas.getContext('2d')!
  const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
  const data = imageData.data

  // Convert to grayscale
  for (let i = 0; i < data.length; i += 4) {
    const avg = (data[i] + data[i + 1] + data[i + 2]) / 3
    data[i] = avg     // R
    data[i + 1] = avg // G
    data[i + 2] = avg // B
  }

  // Increase contrast
  const factor = 1.5
  for (let i = 0; i < data.length; i += 4) {
    data[i] = clamp((data[i] - 128) * factor + 128)
    data[i + 1] = clamp((data[i + 1] - 128) * factor + 128)
    data[i + 2] = clamp((data[i + 2] - 128) * factor + 128)
  }

  // Apply threshold (binarize)
  const threshold = 128
  for (let i = 0; i < data.length; i += 4) {
    const value = data[i] > threshold ? 255 : 0
    data[i] = value
    data[i + 1] = value
    data[i + 2] = value
  }

  ctx.putImageData(imageData, 0, 0)
  return canvas
}

function clamp(value: number): number {
  return Math.max(0, Math.min(255, value))
}
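Put together with the earlier helpers, a typical call chain looks roughly like this (page 1 and English only are example choices, not requirements):

// Sketch: render the first page, clean it up, then OCR it
const pageCanvas = await renderPdfPage(file, 1, 2) // a higher scale helps OCR
const cleaned = preprocessForOCR(pageCanvas)
const recognizedText = await performOCR(cleaned, 'eng')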
Creating and Modifying PDFs with pdf-lib
pdf-lib creates and modifies PDFs without a server:
Create a New PDF
import { PDFDocument, rgb, StandardFonts } from 'pdf-lib'

async function createPdf(): Promise<Uint8Array> {
  const doc = await PDFDocument.create()

  // Add a page
  const page = doc.addPage([595.28, 841.89]) // A4 size

  // Embed font
  const font = await doc.embedFont(StandardFonts.Helvetica)

  // Add text
  page.drawText('Hello, PDF!', {
    x: 50,
    y: 800,
    size: 24,
    font,
    color: rgb(0, 0, 0),
  })

  // Add rectangle
  page.drawRectangle({
    x: 50,
    y: 700,
    width: 200,
    height: 50,
    borderColor: rgb(0.5, 0.5, 0.5),
    borderWidth: 1,
  })

  // Save
  return doc.save()
}
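The pdf-lib examples here all resolve to a Uint8Array, and in the browser you will usually hand that to the user as a download. A small helper sketch using only standard DOM APIs:

function downloadPdf(bytes: Uint8Array, filename: string): void {
  // Wrap the bytes in a Blob and trigger a download via a temporary link
  const blob = new Blob([bytes], { type: 'application/pdf' })
  const url = URL.createObjectURL(blob)

  const link = document.createElement('a')
  link.href = url
  link.download = filename
  link.click()

  URL.revokeObjectURL(url)
}

// Usage
downloadPdf(await createPdf(), 'hello.pdf')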
Merge PDFs
async function mergePdfs(files: File[]): Promise<Uint8Array> {
  const mergedDoc = await PDFDocument.create()

  for (const file of files) {
    const arrayBuffer = await file.arrayBuffer()
    const pdf = await PDFDocument.load(arrayBuffer)

    const pages = await mergedDoc.copyPages(pdf, pdf.getPageIndices())
    pages.forEach((page) => mergedDoc.addPage(page))
  }

  return mergedDoc.save()
}
Split PDF
async function splitPdf(
  file: File,
  ranges: Array<{ start: number; end: number }>
): Promise<Uint8Array[]> {
  const arrayBuffer = await file.arrayBuffer()
  const sourcePdf = await PDFDocument.load(arrayBuffer)

  const results: Uint8Array[] = []

  for (const range of ranges) {
    const newDoc = await PDFDocument.create()

    const pageIndices = Array.from(
      { length: range.end - range.start + 1 },
      (_, i) => range.start + i - 1 // Convert to 0-indexed
    )

    const pages = await newDoc.copyPages(sourcePdf, pageIndices)
    pages.forEach((page) => newDoc.addPage(page))

    results.push(await newDoc.save())
  }

  return results
}
Add Watermark
async function addWatermark(
  file: File,
  text: string
): Promise<Uint8Array> {
  const arrayBuffer = await file.arrayBuffer()
  const doc = await PDFDocument.load(arrayBuffer)

  const font = await doc.embedFont(StandardFonts.HelveticaBold)
  const pages = doc.getPages()

  for (const page of pages) {
    const { width, height } = page.getSize()

    page.drawText(text, {
      x: width / 2 - 100,
      y: height / 2,
      size: 50,
      font,
      color: rgb(0.9, 0.9, 0.9),
      opacity: 0.3,
      rotate: degrees(45), // degrees() is exported by pdf-lib
    })
  }

  return doc.save()
}
Complete Pipeline: PDF to Searchable PDF
Combine everything to convert scanned PDFs to searchable ones:
import * as pdfjsLib from 'pdfjs-dist'
import { PDFDocument, StandardFonts } from 'pdf-lib'
import Tesseract from 'tesseract.js'

// Shape of the OCR data we use below
interface OcrResult {
  text: string
  words: Array<{
    text: string
    bbox: { x0: number; y0: number; x1: number; y1: number }
  }>
}

async function makeSearchablePdf(file: File): Promise<Uint8Array> {
  // Load original PDF
  const arrayBuffer = await file.arrayBuffer()
  const sourcePdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise

  // Create new PDF
  const newDoc = await PDFDocument.create()

  // Process each page
  for (let i = 1; i <= sourcePdf.numPages; i++) {
    const page = await sourcePdf.getPage(i)
    const viewport = page.getViewport({ scale: 2 }) // Higher scale for better OCR

    // Render to canvas
    const canvas = document.createElement('canvas')
    canvas.width = viewport.width
    canvas.height = viewport.height

    await page.render({
      canvasContext: canvas.getContext('2d')!,
      viewport,
    }).promise

    // Perform OCR
    const ocrResult = await Tesseract.recognize(canvas, 'eng')

    // Add page to new PDF
    const newPage = newDoc.addPage([
      viewport.width / 2, // Scale back to original size
      viewport.height / 2,
    ])

    // Embed the image
    const imageBytes = canvas.toDataURL('image/jpeg', 0.9)
    const image = await newDoc.embedJpg(
      await fetch(imageBytes).then((r) => r.arrayBuffer())
    )

    newPage.drawImage(image, {
      x: 0,
      y: 0,
      width: newPage.getWidth(),
      height: newPage.getHeight(),
    })

    // Add invisible text layer
    const font = await newDoc.embedFont(StandardFonts.Helvetica)

    for (const word of ocrResult.data.words) {
      const { x0, y0, x1, y1 } = word.bbox
      const scaleFactor = 0.5 // Match the scale-down

      newPage.drawText(word.text, {
        x: x0 * scaleFactor,
        y: newPage.getHeight() - y1 * scaleFactor,
        size: (y1 - y0) * scaleFactor * 0.8,
        font,
        opacity: 0, // Invisible but searchable
      })
    }
  }

  return newDoc.save()
}
Performance Optimization
Web Workers
Move heavy processing off the main thread:
// pdf-worker.ts
self.onmessage = async (e) => {
  const { type, payload } = e.data

  switch (type) {
    case 'ocr': {
      // performOCR would be imported or bundled into the worker
      const result = await performOCR(payload.imageData)
      self.postMessage({ type: 'ocr-complete', result })
      break
    }
    case 'merge': {
      // mergePdfs would be imported or bundled into the worker
      const merged = await mergePdfs(payload.files)
      self.postMessage({ type: 'merge-complete', result: merged })
      break
    }
  }
}

// Main thread
const worker = new Worker('/pdf-worker.js')

worker.postMessage({ type: 'ocr', payload: { imageData } })

worker.onmessage = (e) => {
  if (e.data.type === 'ocr-complete') {
    console.log('OCR result:', e.data.result)
  }
}
Progressive Loading
Process pages one at a time for large PDFs:
async function* processPages(file: File) {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise

  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i)

    // Process the page here (render, extract text, OCR, ...)
    const pageResult = await page.getTextContent() // placeholder processing

    yield { page: i, total: pdf.numPages, result: pageResult }
  }
}

// Usage with progress (updateProgress is your own UI callback)
for await (const progress of processPages(file)) {
  updateProgress(`Processing page ${progress.page} of ${progress.total}`)
}
Memory Management
Large PDFs can exhaust memory. Clean up aggressively:
async function processLargePdf(file: File): Promise<void> {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise

  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i)

    // Process page... (renderPage / processCanvas stand in for your own logic)
    const canvas = await renderPage(page)
    await processCanvas(canvas)

    // Clean up
    canvas.width = 0
    canvas.height = 0
    page.cleanup()

    // Force garbage collection opportunity
    await new Promise((r) => setTimeout(r, 0))
  }

  pdf.destroy()
}
Error Handling
PDF processing can fail in many ways:
import type { PDFDocumentProxy } from 'pdfjs-dist'

async function safePdfLoad(file: File): Promise<PDFDocumentProxy | null> {
  try {
    const arrayBuffer = await file.arrayBuffer()
    return await pdfjsLib.getDocument({ data: arrayBuffer }).promise
  } catch (error) {
    if (error instanceof Error) {
      if (error.message.includes('Invalid PDF')) {
        console.error('File is not a valid PDF')
      } else if (error.message.includes('password')) {
        console.error('PDF is password protected')
      } else {
        console.error('Failed to load PDF:', error.message)
      }
    }
    return null
  }
}
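If you want to support protected files rather than just report the error, PDF.js accepts a password alongside the document data. A rough sketch; how you collect the password from the user is up to you:

async function loadWithPassword(
  file: File,
  password: string
): Promise<PDFDocumentProxy> {
  const arrayBuffer = await file.arrayBuffer()
  // getDocument accepts a password in its parameters
  return pdfjsLib.getDocument({ data: arrayBuffer, password }).promise
}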
Browser Support
| Feature | Chrome | Firefox | Safari | Edge |
|---|---|---|---|---|
| PDF.js | Yes | Yes | Yes | Yes |
| pdf-lib | Yes | Yes | Yes | Yes |
| Tesseract.js | Yes | Yes | Yes | Yes |
| Web Workers | Yes | Yes | Yes | Yes |
| SharedArrayBuffer | Yes* | Yes* | Yes* | Yes* |
*Requires Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy headers for optimal Tesseract performance.
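How you set those headers depends on your hosting. As one example (assuming a plain Node server, purely for illustration), enabling cross-origin isolation looks roughly like this:

import { createServer } from 'node:http'

// Serve the app with cross-origin isolation enabled so Tesseract.js
// can use SharedArrayBuffer-backed multithreading.
const server = createServer((req, res) => {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin')
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp')
  // ...serve your static files here...
  res.end()
})

server.listen(8080)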
Key Takeaways
- pdf-lib for creation/modification - Pure JavaScript, no native deps
- PDF.js for rendering/parsing - Battle-tested, Mozilla-backed
- Tesseract.js for OCR - Accurate but memory-intensive
- Web Workers are essential - Keep the UI responsive
- Clean up resources - Memory limits are real in browsers
- Pre-process for OCR - Grayscale and contrast help accuracy
Browser-based PDF processing is mature enough for production use. The privacy and simplicity benefits often outweigh the performance tradeoffs.
Building document processing features? Get in touch - I've shipped several browser-based document tools and am happy to share more.