deep-dive · January 2, 2025 · 13 min read

PDF Processing in the Browser with pdf-lib and Tesseract.js

Build powerful PDF manipulation and OCR capabilities entirely in the browser, keeping sensitive documents private and eliminating server dependencies.

Robert Fridzema

Fullstack Developer

Processing PDFs traditionally required server-side tools. But modern browsers are surprisingly capable - you can parse, manipulate, and even OCR PDFs entirely client-side. Here's how to build a complete PDF processing pipeline in the browser.

Why Browser-Based Processing?

  • Privacy - Documents never leave the user's device
  • No server costs - Processing happens on user's hardware
  • Offline capable - Works without internet after initial load
  • Instant feedback - No upload/download delays

The tradeoff is slower processing and tighter memory limits than a server would give you, but for most use cases it works well.

The Toolkit

Library        Purpose                  Size
pdf-lib        Create and modify PDFs   300KB
pdfjs-dist     Render and parse PDFs    500KB
Tesseract.js   OCR (text recognition)   2MB + language data

Rendering PDFs with PDF.js

PDF.js (from Mozilla) renders PDFs to canvas:

import * as pdfjsLib from 'pdfjs-dist'

// Set worker source (required)
pdfjsLib.GlobalWorkerOptions.workerSrc = '/pdf.worker.min.js'

async function renderPdfPage(
  file: File,
  pageNumber: number,
  scale: number = 1.5
): Promise<HTMLCanvasElement> {
  // Load the PDF
  const arrayBuffer = await file.arrayBuffer()
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise

  // Get the page
  const page = await pdf.getPage(pageNumber)

  // Calculate dimensions
  const viewport = page.getViewport({ scale })

  // Create canvas
  const canvas = document.createElement('canvas')
  canvas.width = viewport.width
  canvas.height = viewport.height

  // Render
  await page.render({
    canvasContext: canvas.getContext('2d')!,
    viewport,
  }).promise

  return canvas
}
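
The `scale` argument is relative to the page's own size, so a fixed 1.5 produces different pixel dimensions for different page formats. A minimal sketch of deriving the scale from a target pixel width instead. `scaleForWidth` is a hypothetical helper (not part of PDF.js), and `pageWidth` is assumed to come from `page.getViewport({ scale: 1 }).width`:

```typescript
// Hypothetical helper: choose a PDF.js scale so the rendered page ends up
// `targetWidthPx` CSS pixels wide. `pageWidth` is the viewport width at scale 1.
function scaleForWidth(
  pageWidth: number,
  targetWidthPx: number,
  devicePixelRatio: number = 1
): number {
  if (pageWidth <= 0) {
    throw new RangeError('pageWidth must be positive')
  }
  // Multiply by the device pixel ratio to keep text sharp on HiDPI screens
  return (targetWidthPx * devicePixelRatio) / pageWidth
}
```

The result can be passed straight to `renderPdfPage` as its `scale` argument. On a HiDPI screen, the canvas would then also need its CSS width set to `targetWidthPx`.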

Extracting Text

PDF.js can also extract text content:

async function extractText(file: File): Promise<string> {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise

  const textParts: string[] = []

  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i)
    const textContent = await page.getTextContent()

    // Items can also be marked-content markers without a `str`; skip those
    const pageText = textContent.items
      .map((item) => ('str' in item ? item.str : ''))
      .join(' ')

    textParts.push(pageText)
  }

  return textParts.join('\n\n')
}
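
Joining every item with a space flattens each page into a single line. A rough sketch of keeping line breaks, assuming a PDF.js version whose text items carry a `hasEOL` flag. `TextItemLike` and `joinTextItems` are hypothetical names used only for this illustration:

```typescript
// Minimal shape of the PDF.js text items this sketch relies on.
interface TextItemLike {
  str: string
  hasEOL?: boolean
}

// Join items into text, starting a new line wherever an item is flagged
// as ending a line and separating all other items with a space.
function joinTextItems(items: TextItemLike[]): string {
  let out = ''
  for (const item of items) {
    out += item.str
    out += item.hasEOL ? '\n' : ' '
  }
  // Collapse repeated spaces and trim each line
  return out
    .split('\n')
    .map((line) => line.replace(/ +/g, ' ').trim())
    .join('\n')
    .trim()
}
```

In `extractText`, this would replace the `map`/`join` step after filtering out items that have no `str`.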

OCR with Tesseract.js

When PDFs are scanned images (no selectable text), you need OCR:

import Tesseract from 'tesseract.js'

async function performOCR(
  canvas: HTMLCanvasElement,
  language: string = 'eng'
): Promise<string> {
  const result = await Tesseract.recognize(canvas, language, {
    logger: (m) => console.log(m), // Progress updates
  })

  return result.data.text
}
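
OCR is slow, so it helps to run it only on pages that actually lack a text layer. One possible heuristic, sketched under the assumption that a page with fewer than about 20 letters or digits of extracted text is a scan. `needsOcr` and the `minChars` default are illustrative choices, not something either library provides:

```typescript
// Hypothetical heuristic: treat a page as scanned when PDF.js text
// extraction returns almost no letters or digits.
function needsOcr(extractedText: string, minChars: number = 20): boolean {
  // Count Unicode letters and digits only; whitespace and punctuation
  // from stray artifacts should not count as real text
  const meaningful = extractedText.match(/[\p{L}\p{N}]/gu)
  return (meaningful?.length ?? 0) < minChars
}
```

A caller could run `extractText`-style extraction per page first and fall back to `performOCR` only when `needsOcr` returns true.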

Multi-Language Support

Tesseract supports 100+ languages. Load them on demand:

async function ocrWithLanguage(
  canvas: HTMLCanvasElement,
  languages: string[]
): Promise<string> {
  // Tesseract.js v5+: createWorker loads and initializes the languages itself.
  // (v2-v4 used separate worker.loadLanguage() and worker.initialize() calls.)
  const worker = await Tesseract.createWorker(languages)

  try {
    const { data } = await worker.recognize(canvas)
    return data.text
  } finally {
    // Always release the worker, even if recognition fails
    await worker.terminate()
  }
}

// Usage: Dutch and English
const text = await ocrWithLanguage(canvas, ['nld', 'eng'])

Improving OCR Accuracy

Pre-process images for better results:

function preprocessForOCR(canvas: HTMLCanvasElement): HTMLCanvasElement {
  const ctx = canvas.getContext('2d')!
  const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
  const data = imageData.data

  // Convert to grayscale
  for (let i = 0; i < data.length; i += 4) {
    const avg = (data[i] + data[i + 1] + data[i + 2]) / 3
    data[i] = avg     // R
    data[i + 1] = avg // G
    data[i + 2] = avg // B
  }

  // Increase contrast
  const factor = 1.5
  for (let i = 0; i < data.length; i += 4) {
    data[i] = clamp((data[i] - 128) * factor + 128)
    data[i + 1] = clamp((data[i + 1] - 128) * factor + 128)
    data[i + 2] = clamp((data[i + 2] - 128) * factor + 128)
  }

  // Apply threshold (binarize)
  const threshold = 128
  for (let i = 0; i < data.length; i += 4) {
    const value = data[i] > threshold ? 255 : 0
    data[i] = value
    data[i + 1] = value
    data[i + 2] = value
  }

  ctx.putImageData(imageData, 0, 0)
  return canvas
}

function clamp(value: number): number {
  return Math.max(0, Math.min(255, value))
}
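
The fixed threshold of 128 suits clean, evenly lit scans but can wash out faded or dark pages. A common alternative, which is not what the code above does, is Otsu's method: pick the threshold that best separates dark and light pixels for each image. A minimal sketch, assuming the input is an array of integer grayscale values from 0 to 255 (`otsuThreshold` is a hypothetical helper name):

```typescript
// Otsu's method: choose the threshold that maximizes the between-class
// variance of the "dark" (<= t) and "light" (> t) pixel groups.
function otsuThreshold(gray: ArrayLike<number>): number {
  // Build a 256-bin histogram of the gray levels
  const hist = new Array<number>(256).fill(0)
  for (let i = 0; i < gray.length; i++) {
    hist[gray[i]] += 1
  }

  const total = gray.length
  let sumAll = 0
  for (let t = 0; t < 256; t++) sumAll += t * hist[t]

  let sumDark = 0
  let countDark = 0
  let bestVariance = -1
  let bestThreshold = 0

  for (let t = 0; t < 256; t++) {
    countDark += hist[t]
    if (countDark === 0) continue // no dark pixels yet
    const countLight = total - countDark
    if (countLight === 0) break // everything is already in the dark class

    sumDark += t * hist[t]
    const meanDark = sumDark / countDark
    const meanLight = (sumAll - sumDark) / countLight
    const variance = countDark * countLight * (meanDark - meanLight) ** 2

    if (variance > bestVariance) {
      bestVariance = variance
      bestThreshold = t
    }
  }
  return bestThreshold
}
```

In `preprocessForOCR`, the grayscale values could be collected from every fourth entry of `data` after the grayscale pass, and the result used in place of the constant `threshold`.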

Creating and Modifying PDFs with pdf-lib

pdf-lib creates and modifies PDFs without a server:

Create a New PDF

import { PDFDocument, rgb, StandardFonts } from 'pdf-lib'

async function createPdf(): Promise<Uint8Array> {
  const doc = await PDFDocument.create()

  // Add a page
  const page = doc.addPage([595.28, 841.89]) // A4 size

  // Embed font
  const font = await doc.embedFont(StandardFonts.Helvetica)

  // Add text
  page.drawText('Hello, PDF!', {
    x: 50,
    y: 800,
    size: 24,
    font,
    color: rgb(0, 0, 0),
  })

  // Add rectangle
  page.drawRectangle({
    x: 50,
    y: 700,
    width: 200,
    height: 50,
    borderColor: rgb(0.5, 0.5, 0.5),
    borderWidth: 1,
  })

  // Save
  return doc.save()
}
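
The A4 dimensions above are in PDF points, where one point is 1/72 of an inch. A small sketch of where numbers like 595.28 and 841.89 come from, using a hypothetical `mmToPoints` helper:

```typescript
const POINTS_PER_INCH = 72
const MM_PER_INCH = 25.4

// Convert millimetres to PDF points (1 pt = 1/72 inch)
function mmToPoints(mm: number): number {
  return (mm * POINTS_PER_INCH) / MM_PER_INCH
}

// A4 is 210 x 297 mm
const a4InPoints: [number, number] = [mmToPoints(210), mmToPoints(297)]
```

Rounded to two decimals, this gives the 595.28 by 841.89 used in `createPdf`. pdf-lib also exports a `PageSizes` constant with common formats such as `PageSizes.A4`, which avoids hard-coding the numbers.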

Merge PDFs

async function mergePdfs(files: File[]): Promise<Uint8Array> {
  const mergedDoc = await PDFDocument.create()

  for (const file of files) {
    const arrayBuffer = await file.arrayBuffer()
    const pdf = await PDFDocument.load(arrayBuffer)

    const pages = await mergedDoc.copyPages(pdf, pdf.getPageIndices())
    pages.forEach((page) => mergedDoc.addPage(page))
  }

  return mergedDoc.save()
}

Split PDF

async function splitPdf(
  file: File,
  ranges: Array<{ start: number; end: number }>
): Promise<Uint8Array[]> {
  const arrayBuffer = await file.arrayBuffer()
  const sourcePdf = await PDFDocument.load(arrayBuffer)

  const results: Uint8Array[] = []

  for (const range of ranges) {
    const newDoc = await PDFDocument.create()
    const pageIndices = Array.from(
      { length: range.end - range.start + 1 },
      (_, i) => range.start + i - 1 // Convert to 0-indexed
    )

    const pages = await newDoc.copyPages(sourcePdf, pageIndices)
    pages.forEach((page) => newDoc.addPage(page))

    results.push(await newDoc.save())
  }

  return results
}
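
`splitPdf` trusts its `ranges` argument, so out-of-range or reversed ranges would only fail later inside `copyPages`. A sketch of turning user input such as "1-3, 5" into validated ranges first. `parsePageRanges` and its input syntax are assumptions made for this example:

```typescript
interface PageRange {
  start: number
  end: number
}

// Parse a spec like "1-3, 5" into 1-indexed, inclusive page ranges and
// reject anything outside 1..pageCount.
function parsePageRanges(spec: string, pageCount: number): PageRange[] {
  return spec.split(',').map((part) => {
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part)
    if (!match) {
      throw new Error(`Invalid page range: "${part.trim()}"`)
    }
    const start = Number(match[1])
    // A single number like "5" is the range 5-5
    const end = match[2] !== undefined ? Number(match[2]) : start
    if (start < 1 || end < start || end > pageCount) {
      throw new RangeError(`Page range ${start}-${end} is outside 1-${pageCount}`)
    }
    return { start, end }
  })
}
```

The page count would come from `sourcePdf.getPageCount()` in pdf-lib, and the returned array matches the `ranges` parameter of `splitPdf`.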

Add Watermark

import { PDFDocument, StandardFonts, degrees, rgb } from 'pdf-lib'

async function addWatermark(
  file: File,
  text: string
): Promise<Uint8Array> {
  const arrayBuffer = await file.arrayBuffer()
  const doc = await PDFDocument.load(arrayBuffer)

  const font = await doc.embedFont(StandardFonts.HelveticaBold)
  const pages = doc.getPages()

  for (const page of pages) {
    const { width, height } = page.getSize()

    page.drawText(text, {
      x: width / 2 - 100, // Rough horizontal centering
      y: height / 2,
      size: 50,
      font,
      color: rgb(0.9, 0.9, 0.9),
      opacity: 0.3,
      // degrees() builds the rotation object that pdf-lib's types expect
      rotate: degrees(45),
    })
  }

  return doc.save()
}
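
The fixed `width / 2 - 100` offset only looks centered for one particular text length. A sketch of computing the start point from the measured text width, assuming pdf-lib rotates text counter-clockwise about its starting point (my understanding of `drawText`, worth verifying on a sample page). `centeredTextOrigin` is a hypothetical helper:

```typescript
// Find the start point that puts the middle of a rotated text baseline at
// the center of the page. Angles are counter-clockwise, in degrees.
function centeredTextOrigin(
  pageWidth: number,
  pageHeight: number,
  textWidth: number,
  angleDegrees: number
): { x: number; y: number } {
  const theta = (angleDegrees * Math.PI) / 180
  return {
    // Step back half the text width along the rotated baseline
    x: pageWidth / 2 - (textWidth / 2) * Math.cos(theta),
    y: pageHeight / 2 - (textWidth / 2) * Math.sin(theta),
  }
}
```

The text width can be measured with pdf-lib's `font.widthOfTextAtSize(text, size)`. The sketch ignores the height of the glyphs, so the text sits slightly above the exact center.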

Complete Pipeline: PDF to Searchable PDF

Combine everything to convert scanned PDFs to searchable ones:

import * as pdfjsLib from 'pdfjs-dist'
import { PDFDocument, rgb } from 'pdf-lib'
import Tesseract from 'tesseract.js'

interface OcrResult {
  text: string
  words: Array<{
    text: string
    bbox: { x0: number; y0: number; x1: number; y1: number }
  }>
}

async function makeSearchablePdf(file: File): Promise<Uint8Array> {
  const renderScale = 2 // Render at 2x for better OCR accuracy

  // Load original PDF
  const arrayBuffer = await file.arrayBuffer()
  const sourcePdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise

  // Create new PDF and embed the font once, not once per page.
  // Standard fonts only cover WinAnsi (Latin) characters; other scripts
  // need an embedded custom font.
  const newDoc = await PDFDocument.create()
  const font = await newDoc.embedFont('Helvetica')

  // Process each page
  for (let i = 1; i <= sourcePdf.numPages; i++) {
    const page = await sourcePdf.getPage(i)
    const viewport = page.getViewport({ scale: renderScale })

    // Render to canvas
    const canvas = document.createElement('canvas')
    canvas.width = viewport.width
    canvas.height = viewport.height

    await page.render({
      canvasContext: canvas.getContext('2d')!,
      viewport,
    }).promise

    // Perform OCR (bounding boxes are in canvas pixels, origin top-left)
    const ocrResult = await Tesseract.recognize(canvas, 'eng')

    // Add page to new PDF, scaled back to the original size
    const newPage = newDoc.addPage([
      viewport.width / renderScale,
      viewport.height / renderScale,
    ])

    // Embed the rendered page; embedJpg accepts a data URI directly
    const dataUrl = canvas.toDataURL('image/jpeg', 0.9)
    const image = await newDoc.embedJpg(dataUrl)

    newPage.drawImage(image, {
      x: 0,
      y: 0,
      width: newPage.getWidth(),
      height: newPage.getHeight(),
    })

    // Add invisible text layer
    const scaleFactor = 1 / renderScale // Match the scale-down

    for (const word of ocrResult.data.words) {
      const { x0, y0, x1, y1 } = word.bbox

      newPage.drawText(word.text, {
        x: x0 * scaleFactor,
        // PDF coordinates start at the bottom-left, so flip the y axis
        y: newPage.getHeight() - y1 * scaleFactor,
        size: (y1 - y0) * scaleFactor * 0.8,
        font,
        opacity: 0, // Invisible but searchable
      })
    }
  }

  return newDoc.save()
}
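
The coordinate handling in the text layer is the part most likely to go wrong, because OCR boxes are measured in canvas pixels from the top-left while PDF text is placed in points from the bottom-left. The same conversion as in the pipeline, pulled into a standalone sketch (`wordBoxToPdf` is a hypothetical helper):

```typescript
interface Bbox {
  x0: number
  y0: number
  x1: number
  y1: number
}

// Convert an OCR word box (canvas pixels, origin top-left) into the
// position and height of the matching text in PDF points (origin bottom-left).
function wordBoxToPdf(
  bbox: Bbox,
  pageHeightPt: number,
  renderScale: number
): { x: number; y: number; height: number } {
  const toPoints = 1 / renderScale
  return {
    x: bbox.x0 * toPoints,
    // y1 is the bottom edge of the box on the canvas; flip it for PDF
    y: pageHeightPt - bbox.y1 * toPoints,
    height: (bbox.y1 - bbox.y0) * toPoints,
  }
}
```

For a word box from (100, 200) to (300, 240) on a page rendered at scale 2 with a height of 842 points, the text starts at x = 50, y = 722 and is 20 points tall.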

Performance Optimization

Web Workers

Move heavy processing off the main thread:

// pdf-worker.ts
// Assumes performOCR and mergePdfs (defined above) are imported into this module.
// Workers have no DOM, so performOCR must accept a worker-safe image here
// (for example a Blob) rather than an HTMLCanvasElement.
self.onmessage = async (e) => {
  const { type, payload } = e.data

  switch (type) {
    case 'ocr': {
      const result = await performOCR(payload.image)
      self.postMessage({ type: 'ocr-complete', result })
      break
    }

    case 'merge': {
      const merged = await mergePdfs(payload.files)
      self.postMessage({ type: 'merge-complete', result: merged })
      break
    }
  }
}

// Main thread
const worker = new Worker('/pdf-worker.js')

// Register the handler before sending work
worker.onmessage = (e) => {
  if (e.data.type === 'ocr-complete') {
    console.log('OCR result:', e.data.result)
  }
}
worker.postMessage({ type: 'ocr', payload: { image } })
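
With a single `onmessage` handler, replies become hard to match to requests once several jobs are in flight. One way to handle that is a small promise wrapper that tags each request with an id. This is a sketch that assumes the worker copies the `id` back into its reply, which the worker code above does not do yet. `WorkerLike` and `createWorkerClient` are hypothetical names:

```typescript
// The minimal surface of a Worker that this sketch needs.
interface WorkerLike {
  postMessage(message: unknown): void
  onmessage: ((event: any) => void) | null
}

// Wrap a worker so each request returns a promise that resolves with the
// reply carrying the same id.
function createWorkerClient(worker: WorkerLike) {
  let nextId = 0
  const pending = new Map<number, (result: unknown) => void>()

  worker.onmessage = (event: { data: { id: number; result: unknown } }) => {
    const { id, result } = event.data
    const resolve = pending.get(id)
    if (resolve) {
      pending.delete(id)
      resolve(result)
    }
  }

  return function call(type: string, payload: unknown): Promise<unknown> {
    const id = nextId++
    return new Promise((resolve) => {
      // Store the resolver before posting so a fast reply is not lost
      pending.set(id, resolve)
      worker.postMessage({ id, type, payload })
    })
  }
}
```

With this in place, the main thread could write `const text = await call('ocr', { image })` instead of branching on message types. Error replies and timeouts are left out to keep the sketch short.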

Progressive Loading

Process pages one at a time for large PDFs:

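import type { PDFPageProxy } from 'pdfjs-dist'

async function* processPages<T>(
  file: File,
  processPage: (page: PDFPageProxy) => Promise<T>
) {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise

  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i)
    // Run the caller's per-page work, then report progress
    const result = await processPage(page)
    yield { page: i, total: pdf.numPages, result }
  }
}

// Usage with progress (updateProgress is your own UI callback)
for await (const progress of processPages(file, (page) => page.getTextContent())) {
  updateProgress(`Processing page ${progress.page} of ${progress.total}`)
}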

Memory Management

Large PDFs can exhaust memory. Clean up aggressively:

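// renderPage and processCanvas stand in for your own rendering and
// processing steps (for example canvas rendering followed by OCR).
async function processLargePdf(file: File): Promise<void> {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise

  try {
    for (let i = 1; i <= pdf.numPages; i++) {
      const page = await pdf.getPage(i)

      // Process page...
      const canvas = await renderPage(page)
      await processCanvas(canvas)

      // Release the canvas backing store and the page's cached resources
      canvas.width = 0
      canvas.height = 0
      page.cleanup()

      // Yield to the event loop so the browser gets a chance to
      // collect garbage and keep the UI responsive
      await new Promise((r) => setTimeout(r, 0))
    }
  } finally {
    // Free the document even if a page fails
    await pdf.destroy()
  }
}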
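
Most of that memory goes to the canvas, which needs roughly four bytes per pixel, so doubling the scale quadruples the cost. A back-of-the-envelope sketch for checking a render against a budget before creating the canvas. Both helpers are hypothetical, and real browsers add overhead on top of this estimate:

```typescript
const BYTES_PER_PIXEL = 4 // RGBA, one byte per channel

// Approximate canvas memory for a page of the given size (at scale 1)
function canvasBytes(pageWidth: number, pageHeight: number, scale: number): number {
  // Canvas dimensions are whole pixels
  return (
    Math.floor(pageWidth * scale) *
    Math.floor(pageHeight * scale) *
    BYTES_PER_PIXEL
  )
}

// Largest scale, capped at `preferredScale`, whose canvas fits the budget
function maxScaleWithinBudget(
  pageWidth: number,
  pageHeight: number,
  budgetBytes: number,
  preferredScale: number
): number {
  const bytesAtScaleOne = pageWidth * pageHeight * BYTES_PER_PIXEL
  // Memory grows with the square of the scale
  const limit = Math.sqrt(budgetBytes / bytesAtScaleOne)
  return Math.min(preferredScale, limit)
}
```

An A4 page at scale 2 comes to about 8 MB by this estimate, which is why releasing each canvas before rendering the next page matters for long documents.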

Error Handling

PDF processing can fail in many ways:

import type { PDFDocumentProxy } from 'pdfjs-dist'

async function safePdfLoad(file: File): Promise<PDFDocumentProxy | null> {
  try {
    const arrayBuffer = await file.arrayBuffer()
    return await pdfjsLib.getDocument({ data: arrayBuffer }).promise
  } catch (error) {
    if (error instanceof Error) {
      // PDF.js sets error.name to the exception type, which is more
      // reliable than matching on the message text
      if (error.name === 'InvalidPDFException') {
        console.error('File is not a valid PDF')
      } else if (error.name === 'PasswordException') {
        console.error('PDF is password protected')
      } else {
        console.error('Failed to load PDF:', error.message)
      }
    }
    return null
  }
}

Browser Support

Feature             Chrome   Firefox   Safari   Edge
PDF.js              Yes      Yes       Yes      Yes
pdf-lib             Yes      Yes       Yes      Yes
Tesseract.js        Yes      Yes       Yes      Yes
Web Workers         Yes      Yes       Yes      Yes
SharedArrayBuffer   Yes*     Yes*      Yes*     Yes*

*Requires cross-origin isolation (for example, the Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers); enable it for optimal Tesseract performance.

Key Takeaways

  1. pdf-lib for creation/modification - Pure JavaScript, no native deps
  2. PDF.js for rendering/parsing - Battle-tested, Mozilla-backed
  3. Tesseract.js for OCR - Accurate but memory-intensive
  4. Web Workers are essential - Keep the UI responsive
  5. Clean up resources - Memory limits are real in browsers
  6. Pre-process for OCR - Grayscale and contrast help accuracy

Browser-based PDF processing is mature enough for production use. The privacy and simplicity benefits often outweigh the performance tradeoffs.


Building document processing features? Get in touch - I've shipped several browser-based document tools and am happy to share more.

#JavaScript #PDF #OCR #Tesseract #Browser APIs