Using pdf-parse on Vercel Is Wrong — Here's What Actually Works

Why pdf-parse fails on Vercel serverless and how unpdf solves it: a debugging story, a zero-native-dependency fix, and 3-5 second processing times.

Chudi Nnorukam
Dec 28, 2025 · Updated Feb 16, 2026 · 7 min read

It was 2 AM. StatementSync was ready to deploy. I pushed to Vercel and watched the build fail.

Error: Cannot find module 'canvas'
    at Function.Module._resolveFilename

Canvas? I’m processing PDFs, not drawing graphics. Three hours later, I learned why pdf-parse breaks on serverless.

pdf-parse depends on the canvas module, which requires native bindings unavailable in Lambda and Edge environments. The fix is unpdf—a pure JavaScript PDF parser with zero native dependencies that works on Vercel serverless functions, AWS Lambda, and Cloudflare Workers. Same extraction quality, no build failures.

Why Does pdf-parse Fail on Vercel Serverless?

pdf-parse depends on pdfjs-dist, which has optional native dependencies including the canvas module. Canvas requires Python, node-gyp, and C++ build tools that Vercel’s serverless runtime cannot compile. The result is either a build-time failure or a silent runtime segfault when the function first processes a PDF in production.

The Problem

pdf-parse is the go-to library for PDF text extraction in Node.js:

import fs from 'fs';
import pdf from 'pdf-parse';

const dataBuffer = fs.readFileSync('statement.pdf');
const data = await pdf(dataBuffer);
console.log(data.text);

Works perfectly locally. Crashes spectacularly on Vercel.

Why It Fails

pdf-parse depends on pdfjs-dist, Mozilla’s PDF.js port for Node. pdfjs-dist has optional dependencies:

{
  "optionalDependencies": {
    "canvas": "^2.x",
    "node-fetch": "^2.x"
  }
}

Canvas is a native module that requires:

  • Python
  • node-gyp
  • C++ build tools

Vercel’s serverless runtime doesn’t have these. The build either:

  1. Fails outright with missing module errors
  2. Succeeds but crashes at runtime with segfaults

The Debugging Journey

Attempt 1: Exclude Canvas

“Just mark canvas as external,” Stack Overflow said.

// next.config.js
module.exports = {
  webpack: (config) => {
    config.externals = [...(config.externals || []), 'canvas'];
    return config;
  },
};

Result: Different error.

Error: Could not load the "canvas" module

pdfjs-dist tries to load canvas at runtime, not just build time.

Attempt 2: Legacy Build

“Use pdf-parse legacy mode,” another answer suggested.

const pdf = require('pdf-parse/lib/pdf-parse');

Result: Still fails. The dependency chain remains.

Attempt 3: pdfjs-dist Directly

“Skip pdf-parse, use pdfjs-dist with worker disabled.”

import * as pdfjsLib from 'pdfjs-dist';
pdfjsLib.GlobalWorkerOptions.workerSrc = '';

const pdf = await pdfjsLib.getDocument({ data: buffer }).promise;

Result: Works locally, memory errors on Vercel.

Vercel functions on the free tier have a 1GB memory limit. pdfjs-dist's memory usage is unpredictable with large PDFs.
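If you suspect memory pressure, you can measure heap growth around the parse call before deploying. A minimal sketch, where `withMemoryReport` and the `task` callback are illustrative names standing in for whatever parser you are testing, not part of any library:

```typescript
// Sketch: report heap growth around an async task.
// `task` is a placeholder for the actual PDF parse call.
async function withMemoryReport<T>(label: string, task: () => Promise<T>): Promise<T> {
  // With --expose-gc, a forced collection makes the numbers less noisy.
  (globalThis as { gc?: () => void }).gc?.();
  const before = process.memoryUsage().heapUsed;
  const result = await task();
  const grownMb = (process.memoryUsage().heapUsed - before) / 1024 / 1024;
  console.log(`${label}: heap grew ~${grownMb.toFixed(1)}MB`);
  return result;
}

// usage: const data = await withMemoryReport('statement.pdf', () => pdf(dataBuffer));
```

Run your worst-case PDF through this locally with the same memory cap as your target runtime before trusting a library in production.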

The Solution: unpdf

After three hours, I found unpdf:

import { extractText, getDocumentProxy } from 'unpdf';

const pdf = await getDocumentProxy(new Uint8Array(buffer));
const { text } = await extractText(pdf, { mergePages: true });

Result: Works. First try.

Why Does unpdf Work Where pdf-parse Fails?

unpdf is a pure JavaScript PDF parser with zero native dependencies. It requires no compilation step, works on Vercel serverless functions, AWS Lambda, and Cloudflare Workers, and keeps memory usage predictable. The same getDocumentProxy and extractText calls work identically across local development and production without any environment-specific configuration.

Why unpdf Works

unpdf is built specifically for serverless:

| Feature | pdf-parse | unpdf |
| --- | --- | --- |
| Native deps | Yes (canvas) | No |
| Vercel compatible | No | Yes |
| Edge runtime | No | Yes |
| Bundle size | Large | Small |
| Memory usage | Unpredictable | Controlled |

The library uses a pure JavaScript PDF parser without native modules. No build-time compilation, no runtime loading issues.

Implementation

Here’s the complete pattern for serverless PDF processing:

import { extractText, getDocumentProxy } from 'unpdf';

interface Transaction {
  date: string;
  description: string;
  amount: number;
  type: 'debit' | 'credit';
}

async function processPdf(buffer: Buffer): Promise<Transaction[]> {
  // Load PDF (getDocumentProxy takes raw bytes)
  const pdf = await getDocumentProxy(new Uint8Array(buffer));

  // Extract text; mergePages joins all pages into one string
  const { text } = await extractText(pdf, { mergePages: true });

  // Parse transactions (pattern-based for bank statements)
  const transactions = parseTransactions(text);

  // Cleanup
  pdf.destroy();

  return transactions;
}

function parseTransactions(text: string): Transaction[] {
  // Bank-specific parsing patterns
  const lines = text.split('\n');
  const transactions: Transaction[] = [];

  for (const line of lines) {
    // MM/DD date, description, then a signed dollar amount
    const match = line.match(/(\d{2}\/\d{2})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})/);
    if (match) {
      transactions.push({
        date: match[1],
        description: match[2].trim(),
        amount: parseFloat(match[3].replace(/[$,]/g, '')),
        type: match[3].startsWith('-') ? 'debit' : 'credit'
      });
    }
  }

  return transactions;
}
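The regex is the fragile part of this pipeline, so it is worth sanity-checking against sample lines before trusting it with real statements. A standalone sketch (the sample lines are invented; the pattern is MM/DD date, description, signed dollar amount):

```typescript
// Statement-line pattern: MM/DD date, description, then a signed dollar amount.
const LINE = /(\d{2}\/\d{2})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})/;

const samples = [
  '01/15  COFFEE SHOP        $4.50',
  '01/16  RENT PAYMENT       -$1,200.00',
  'BEGINNING BALANCE', // header line: should not match
];

for (const line of samples) {
  const m = line.match(LINE);
  if (!m) {
    console.log(`skipped: ${line}`);
    continue;
  }
  const amount = parseFloat(m[3].replace(/[$,]/g, ''));
  const type = m[3].startsWith('-') ? 'debit' : 'credit';
  console.log(`${m[1]} | ${m[2].trim()} | ${amount} | ${type}`);
}
```

Every bank formats lines differently, so keep a small corpus of real (anonymized) lines and rerun this check whenever you add a new bank.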

Performance

On Vercel’s free tier (1GB memory, 10s timeout):

| PDF Size | Processing Time | Memory Used |
| --- | --- | --- |
| 1 page | 1-2 seconds | ~100MB |
| 5 pages | 3-4 seconds | ~200MB |
| 10 pages | 5-6 seconds | ~350MB |
| 20 pages | 8-9 seconds | ~500MB |

Comfortable margins for typical bank statements (1-5 pages).

When Should You Use Pattern-Based Extraction Instead of LLM?

For structured documents like bank statements and invoices, pattern-based extraction achieves 99% accuracy at zero runtime cost versus $0.01–$0.05 per LLM call. At 1,000 statements per month that difference is $10–$50. For flat-rate SaaS products where marginal cost must stay near zero, pattern-based parsing is the only pricing-sustainable choice.
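The break-even arithmetic above can be sketched directly (the per-document rates are the assumed figures from this comparison, not measured values):

```typescript
// Monthly extraction cost at a given volume and per-document rate.
function monthlyCost(docsPerMonth: number, costPerDoc: number): number {
  return docsPerMonth * costPerDoc;
}

console.log(monthlyCost(1000, 0));    // pattern-based: $0
console.log(monthlyCost(1000, 0.01)); // LLM, low end:  ~$10
console.log(monthlyCost(1000, 0.05)); // LLM, high end: ~$50
```

At flat-rate pricing the per-document rate is the term that must stay at zero, because volume is outside your control.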

Pattern-Based vs LLM Extraction

For structured documents like bank statements, pattern-based extraction beats LLM:

| Approach | Accuracy | Cost | Speed |
| --- | --- | --- | --- |
| Pattern-based | 99% | $0 | 3-5s |
| LLM (GPT-5) | 99.5% | $0.01-0.05 | 10-30s |
| OCR + LLM | 95% | $0.02-0.08 | 15-45s |

For StatementSync processing 1000 statements/month:

  • Pattern-based: $0
  • LLM: $10-50/month

The 0.5% accuracy difference doesn’t justify the cost for this use case. This cost analysis was a key input to the flat-rate vs per-file pricing decision for StatementSync.

When to Use What

Use unpdf when:

  • Deploying to Vercel, Netlify, or Cloudflare
  • Processing structured documents (statements, invoices)
  • Need low memory footprint
  • Running on edge runtimes

Use pdf-parse when:

  • Running on traditional servers (EC2, DigitalOcean)
  • Need advanced PDF features (annotations, forms)
  • Have native build tools available

Use LLM extraction when:

  • Documents are unstructured or variable
  • Accuracy is more important than cost
  • Processing low volumes

How Do You Set Up unpdf in a Next.js App Router Project?

Install unpdf with npm install unpdf, then create a route handler under app/api/. Set export const runtime = 'nodejs' rather than 'edge' when the handler relies on Node APIs; unpdf itself also runs on edge runtimes. Use formData() to receive the file, convert it to bytes with arrayBuffer(), call getDocumentProxy and extractText, then call pdf.destroy() to release memory.

Setting Up in Next.js

Installing unpdf and configuring Next.js for serverless PDF processing:

npm install unpdf

No additional configuration needed for standard Vercel deployments. If you’re using Next.js 14+ with the App Router, create your route handler:

// app/api/process/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { extractText, getDocumentProxy } from 'unpdf';

export const runtime = 'nodejs'; // this handler assumes the Node runtime

export async function POST(req: NextRequest) {
  const formData = await req.formData();
  const file = formData.get('pdf') as File;

  if (!file || file.type !== 'application/pdf') {
    return NextResponse.json({ error: 'Invalid file' }, { status: 400 });
  }

  const data = new Uint8Array(await file.arrayBuffer());
  const pdf = await getDocumentProxy(data);
  const { text } = await extractText(pdf, { mergePages: true });
  pdf.destroy();

  return NextResponse.json({ text });
}

One gotcha: set runtime = 'nodejs', not 'edge', for this handler. unpdf itself runs on edge runtimes, but the Edge runtime has stricter module constraints and rejects Node-specific APIs, so verify every dependency in your handler against it before switching.

Handling Edge Cases

Password-Protected PDFs

try {
  const pdf = await getDocumentProxy(new Uint8Array(buffer), {
    password: userProvidedPassword // optional
  });
} catch (err) {
  if (err.name === 'PasswordException') {
    return { error: 'PDF is password protected' };
  }
  throw err;
}

PasswordException is thrown immediately, before any processing. Always catch it explicitly or you’ll get an unhandled rejection in production.
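In strict TypeScript, err is typed unknown inside a catch block, so a small narrowing helper keeps the name check type-safe. A sketch (isPdfError is an illustrative name, not a pdf.js or unpdf API):

```typescript
// Narrow an unknown catch value to a pdf.js-style error with a given name.
function isPdfError(err: unknown, name: string): boolean {
  return typeof err === 'object' && err !== null &&
    (err as { name?: unknown }).name === name;
}

// usage inside a catch block:
//   if (isPdfError(err, 'PasswordException')) { /* prompt for a password */ }
console.log(isPdfError({ name: 'PasswordException' }, 'PasswordException')); // true
```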

Corrupted or Invalid Files

async function safePdfExtract(buffer: Buffer): Promise<string | null> {
  try {
    const pdf = await getDocumentProxy(new Uint8Array(buffer));
    const { text } = await extractText(pdf, { mergePages: true });
    pdf.destroy();
    return text;
  } catch (err) {
    // InvalidPDFException for malformed files
    // MissingPDFException for empty or non-PDF data
    console.error('PDF extraction failed:', err.name, err.message);
    return null;
  }
}

Return null instead of throwing to let the caller decide whether a failed extraction is a hard error or a skippable item.
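In a batch job, for example, the null becomes a skip rather than an abort. A sketch with safePdfExtract stubbed out (the stub treats empty input as a corrupt file purely for illustration):

```typescript
// Stub standing in for safePdfExtract: empty input simulates a corrupt file.
async function safePdfExtract(buffer: Uint8Array): Promise<string | null> {
  return buffer.length > 0 ? 'extracted text' : null;
}

// One bad file lowers `extracted` and bumps `skipped`; nothing throws.
async function processBatch(buffers: Uint8Array[]): Promise<{ extracted: string[]; skipped: number }> {
  const results = await Promise.all(buffers.map(safePdfExtract));
  const extracted = results.filter((t): t is string => t !== null);
  return { extracted, skipped: results.length - extracted.length };
}
```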

Scanned PDFs (Image-Based)

unpdf extracts embedded text. Scanned documents—where each page is a JPEG embedded in a PDF—return empty strings. Before assuming extraction succeeded, check the output:

const { text } = await extractText(pdf, { mergePages: true });
if (text.trim().length < 50) {
  // Likely a scanned document
  return { error: 'Document appears to be scanned. Text extraction not supported.' };
}

For scanned documents, you’d need OCR (Tesseract.js or an external API like AWS Textract). That’s out of scope for most bank statements—major US banks generate text-based PDFs—but worth detecting gracefully.

Testing the Pipeline

Two tests that catch 90% of production issues:

// __tests__/pdf-processing.test.ts
import { processPdf } from '../lib/pdf';
import fs from 'fs';

describe('PDF processing', () => {
  it('extracts transactions from Chase statement', async () => {
    const buffer = fs.readFileSync('__tests__/fixtures/chase-sample.pdf');
    const transactions = await processPdf(buffer);
    expect(transactions.length).toBeGreaterThan(0);
    expect(transactions[0]).toMatchObject({
      date: expect.stringMatching(/^\d{2}\/\d{2}$/),
      amount: expect.any(Number),
    });
  });

  it('returns no transactions for a scanned PDF', async () => {
    const buffer = fs.readFileSync('__tests__/fixtures/scanned.pdf');
    const transactions = await processPdf(buffer);
    expect(transactions).toEqual([]);
  });
});

The fixture files are real PDFs (anonymized). Testing against actual bank statement formats catches the edge cases in date parsing and amount formatting before they hit production.

Verifying Your Setup in Production

Deployment to Vercel can surface issues that local testing misses. Before handling real user data, run three checks.

Memory ceiling: The performance table above shows 20-page PDFs using ~500MB. Vercel’s free tier allows 1,024MB per function. Test your worst-case PDF during staging, not production. If you’re regularly processing PDFs over 15 pages, bump to Vercel’s Pro tier where you can configure function memory up to 3,008MB.

Cold start behavior: Vercel’s serverless functions spin down after inactivity. The first PDF request after a cold start takes 2-4x longer than subsequent requests. If your users frequently trigger that first cold request, consider Vercel’s Fluid Compute option that keeps functions warm between invocations.

File size limits: Vercel caps serverless function request bodies at 4.5MB, and that cap is a platform limit that no Next.js setting raises. A 20-page bank statement PDF typically sits well under 1MB, but if your use case involves scanned PDFs or combined multi-month statements, verify your largest expected file size against this limit before launch. The bodyParser.sizeLimit option sometimes suggested for this applies only to Pages Router API routes, exported from the route file itself, and it cannot lift Vercel's cap:

// pages/api/process.ts (Pages Router only; App Router route handlers have no bodyParser config)
export const config = {
  api: {
    bodyParser: {
      sizeLimit: '4mb',
    },
  },
};

For larger files, upload directly to object storage and pass the function a URL instead of the bytes.

Running these three checks in staging eliminates the category of production failures that are unrelated to code—infrastructure surprises that happen on first real traffic.

The Lesson

The right library matters more than clever workarounds. I spent 3 hours trying to make pdf-parse work on serverless. unpdf worked in 10 minutes.

If you’re building PDF processing for serverless, start with unpdf. Save yourself the 2 AM debugging.


Written by Chudi Nnorukam

I design and deploy agent-based AI automation systems that eliminate manual workflows, scale content, and power recursive learning. Specializing in micro-SaaS tools, content automation, and high-performance web applications.

Related: From Pain Point to MVP: StatementSync in One Week | Portfolio: StatementSync

FAQ

Why doesn't pdf-parse work on Vercel?

pdf-parse depends on pdfjs-dist which has optional native dependencies (canvas). Vercel's serverless runtime can't compile native modules, causing silent failures or build errors.

What is unpdf?

unpdf is a serverless-first PDF processing library. No native dependencies, works on edge runtimes, and provides text extraction and parsing capabilities for modern JavaScript environments.

How accurate is unpdf text extraction?

For structured documents (bank statements, invoices), accuracy is 99%+. For complex layouts or scanned documents, you may need additional OCR processing.

Can I use pdf-parse locally but unpdf in production?

Technically yes, but this creates inconsistency between environments. Better to use unpdf everywhere for predictable behavior.

Does unpdf work with Cloudflare Workers?

Yes, unpdf is specifically designed for edge and serverless runtimes including Cloudflare Workers, Vercel Edge, and Netlify Functions.

