Serverless PDF Processing: Why unpdf Beats pdf-parse on Vercel
The technical story of debugging PDF processing failures on Vercel and why unpdf is the serverless-compatible solution that actually works.

It was 2 AM. StatementSync was ready to deploy. I pushed to Vercel and watched the build fail.
Error: Cannot find module 'canvas'
    at Function.Module._resolveFilename
Canvas? I’m processing PDFs, not drawing graphics. Three hours later, I learned why pdf-parse breaks on serverless.
The Problem
pdf-parse is the go-to library for PDF text extraction in Node.js:
import fs from 'node:fs';
import pdf from 'pdf-parse';
const dataBuffer = fs.readFileSync('statement.pdf');
const data = await pdf(dataBuffer);
console.log(data.text);
Works perfectly locally. Crashes spectacularly on Vercel.
Why It Fails
pdf-parse depends on pdfjs-dist, Mozilla’s PDF.js port for Node. pdfjs-dist has optional dependencies:
{
  "optionalDependencies": {
    "canvas": "^2.x",
    "node-fetch": "^2.x"
  }
}
Canvas is a native module that requires:
- Python
- node-gyp
- C++ build tools
Vercel’s serverless runtime doesn’t have these. The build either:
- Fails outright with missing module errors
- Succeeds but crashes at runtime with segfaults
The Debugging Journey
Attempt 1: Exclude Canvas
“Just mark canvas as external,” Stack Overflow said.
// next.config.js
module.exports = {
  webpack: (config) => {
    // Tell webpack not to bundle canvas
    config.externals = [...(config.externals || []), 'canvas'];
    return config;
  },
};
Result: Different error.
Error: Could not load the "canvas" module
pdfjs-dist tries to load canvas at runtime, not just build time.
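Why externalizing only moves the failure can be sketched with a lazy optional-dependency loader. This is a simplified illustration of the general pattern, not pdfjs-dist's actual source; `loadOptional`, `missingCanvas`, and `renderPage` are hypothetical names, and the loader stands in for something like `() => require('canvas')`.

```typescript
// Hypothetical helper: try to load an optional dependency at call time.
function loadOptional<T>(name: string, loader: () => T): T | null {
  try {
    return loader();
  } catch (err) {
    // The failure surfaces here, at runtime, after the build has already succeeded.
    console.warn(`Optional dependency "${name}" is unavailable: ${(err as Error).message}`);
    return null;
  }
}

// Simulates an environment where the native module was never installed.
const missingCanvas = (): never => {
  throw new Error("Cannot find module 'canvas'");
};

// A code path that needs the optional dependency and cannot degrade gracefully.
function renderPage(canvasModule: unknown): string {
  if (canvasModule === null) {
    throw new Error('Could not load the "canvas" module');
  }
  return 'rendered';
}

const canvas: unknown = loadOptional('canvas', missingCanvas);
```

The bundler config only controls whether the module is bundled. The load attempt still runs inside the function, so any code path that reaches `renderPage(canvas)` fails there instead of at build time.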
Attempt 2: Legacy Build
“Use pdf-parse legacy mode,” another answer suggested.
const pdf = require('pdf-parse/lib/pdf-parse');
Result: Still fails. The dependency chain remains.
Attempt 3: pdfjs-dist Directly
“Skip pdf-parse, use pdfjs-dist with worker disabled.”
import * as pdfjsLib from 'pdfjs-dist';
pdfjsLib.GlobalWorkerOptions.workerSrc = '';
const pdf = await pdfjsLib.getDocument({ data: buffer }).promise;
Result: Works locally, memory errors on Vercel.
Vercel functions have a 1GB memory limit. pdfjs-dist’s memory usage is unpredictable with large PDFs.
The Solution: unpdf
After three hours, I found unpdf:
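import { getDocumentProxy, extractText } from 'unpdf';
// PDF.js expects a Uint8Array rather than a Node Buffer
const pdf = await getDocumentProxy(new Uint8Array(buffer));
// extractText resolves to an object; `text` holds the extracted page text
const { text } = await extractText(pdf);
Result: Works. First try.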
Why unpdf Works
unpdf is built specifically for serverless:
| Feature | pdf-parse | unpdf |
|---|---|---|
| Native deps | Yes (canvas) | No |
| Vercel compatible | No | Yes |
| Edge runtime | No | Yes |
| Bundle size | Large | Small |
| Memory usage | Unpredictable | Controlled |
The library uses a pure JavaScript PDF parser without native modules. No build-time compilation, no runtime loading issues.
Implementation
Here’s the complete pattern for serverless PDF processing:
import { getDocumentProxy, extractText } from 'unpdf';

interface Transaction {
  date: string;
  description: string;
  amount: number;
  type: 'debit' | 'credit';
}

async function processPdf(buffer: Buffer): Promise<Transaction[]> {
  // Load the PDF. PDF.js expects a Uint8Array rather than a Node Buffer.
  const pdf = await getDocumentProxy(new Uint8Array(buffer));
  try {
    // Extract text; join per-page results so line breaks survive
    const { text } = await extractText(pdf);
    const fullText = Array.isArray(text) ? text.join('\n') : text;
    // Parse transactions (pattern-based for bank statements)
    return parseTransactions(fullText);
  } finally {
    // Cleanup, even if parsing throws
    await pdf.destroy();
  }
}
function parseTransactions(text: string): Transaction[] {
  // Bank-specific parsing patterns
  const lines = text.split('\n');
  const transactions: Transaction[] = [];
  // MM/DD date, description, then an amount such as -$1,234.56
  const linePattern = /(\d{2}\/\d{2})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})/;
  for (const line of lines) {
    const match = line.match(linePattern);
    if (match) {
      transactions.push({
        date: match[1],
        description: match[2].trim(),
        amount: parseFloat(match[3].replace(/[$,]/g, '')),
        type: match[3].startsWith('-') ? 'debit' : 'credit'
      });
    }
  }
  return transactions;
}
Performance
On Vercel’s free tier (1GB memory, 10s timeout):
| PDF Size | Processing Time | Memory Used |
|---|---|---|
| 1 page | 1-2 seconds | ~100MB |
| 5 pages | 3-4 seconds | ~200MB |
| 10 pages | 5-6 seconds | ~350MB |
| 20 pages | 8-9 seconds | ~500MB |
Comfortable margins for typical bank statements (1-5 pages).
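One way to act on these numbers is to reject oversized PDFs before processing them. The sketch below is an assumption-laden illustration: `estimateSeconds` and `fitsTimeBudget` are hypothetical helpers, the data points are the upper bounds from the table above, and linear interpolation between them is a guess about how time scales, not a measured model.

```typescript
// Upper-bound timings copied from the table above: [pages, seconds].
const MEASURED: ReadonlyArray<readonly [number, number]> = [
  [1, 2],
  [5, 4],
  [10, 6],
  [20, 9],
];

// Hypothetical helper: linear interpolation between measured points,
// extrapolating beyond 20 pages with the slope of the last segment.
function estimateSeconds(pages: number): number {
  if (pages <= MEASURED[0][0]) return MEASURED[0][1];
  let i = 1;
  while (i < MEASURED.length - 1 && pages > MEASURED[i][0]) i++;
  const [p0, s0] = MEASURED[i - 1];
  const [p1, s1] = MEASURED[i];
  return s0 + ((pages - p0) * (s1 - s0)) / (p1 - p0);
}

// Hypothetical guard: keep a safety margin below the function timeout.
function fitsTimeBudget(pages: number, timeoutSeconds = 10, safetyFactor = 0.8): boolean {
  return estimateSeconds(pages) <= timeoutSeconds * safetyFactor;
}
```

With the defaults, a 5-page statement passes (4s estimated against an 8s budget) and a 20-page one does not (9s). The page count would have to come from the loaded document before text extraction starts.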
Pattern-Based vs LLM Extraction
For structured documents like bank statements, pattern-based extraction beats LLM extraction on cost and speed while nearly matching its accuracy:
| Approach | Accuracy | Cost | Speed |
|---|---|---|---|
| Pattern-based | 99% | $0 | 3-5s |
| LLM (GPT-4) | 99.5% | $0.01-0.05 | 10-30s |
| OCR + LLM | 95% | $0.02-0.08 | 15-45s |
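To make the pattern-based row concrete, here is a minimal sketch of matching a single statement line with one regular expression. The line layout (MM/DD date, description, dollar amount) and the sample lines are made up for illustration; real statements vary by bank, and `parseLine` is a hypothetical helper.

```typescript
interface ParsedLine {
  date: string;
  description: string;
  amount: number;
}

// MM/DD date, free-text description, then an amount such as $1,234.56 or -$42.50.
const LINE_PATTERN = /^(\d{2}\/\d{2})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})$/;

function parseLine(line: string): ParsedLine | null {
  const match = line.trim().match(LINE_PATTERN);
  if (!match) return null; // headers, balances, and other non-transaction lines
  return {
    date: match[1],
    description: match[2],
    amount: Number(match[3].replace(/[$,]/g, '')),
  };
}

// Made-up sample lines, not real statement data.
const sample = [
  '01/15  COFFEE SHOP #123      -$4.50',
  '01/16  PAYROLL DEPOSIT    $2,150.00',
  'Beginning balance as of 01/01',
];
const parsed = sample.map(parseLine);
```

The first two lines parse into transactions with amounts -4.5 and 2150. The third does not match and comes back as `null`, which is how non-transaction lines get skipped at no per-document cost.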
For StatementSync processing 1000 statements/month:
- Pattern-based: $0
- LLM: $10-50/month
The 0.5% accuracy difference doesn’t justify the cost for this use case.
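The monthly figures follow directly from the per-document costs in the table. As a small worked sketch (the helper name `monthlyCostRange` is hypothetical, and costs are passed in whole cents to avoid floating-point drift):

```typescript
// Hypothetical helper: monthly cost range in dollars from per-document cost in cents.
function monthlyCostRange(
  statementsPerMonth: number,
  centsPerDocLow: number,
  centsPerDocHigh: number,
): { lowUsd: number; highUsd: number } {
  return {
    lowUsd: (statementsPerMonth * centsPerDocLow) / 100,
    highUsd: (statementsPerMonth * centsPerDocHigh) / 100,
  };
}

// LLM (GPT-4) row: $0.01-0.05 per document
const llm = monthlyCostRange(1000, 1, 5);
// OCR + LLM row: $0.02-0.08 per document
const ocrLlm = monthlyCostRange(1000, 2, 8);
```

At 1000 statements a month this gives $10-50 for the LLM row and $20-80 for OCR + LLM. The gap scales linearly with volume.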
When to Use What
Use unpdf when:
- Deploying to Vercel, Netlify, or Cloudflare
- Processing structured documents (statements, invoices)
- Needing a low memory footprint
- Running on edge runtimes
Use pdf-parse when:
- Running on traditional servers (EC2, DigitalOcean)
- Needing advanced PDF features (annotations, forms)
- Working with native build tools available
Use LLM extraction when:
- Documents are unstructured or variable
- Accuracy is more important than cost
- Processing low volumes
The Lesson
The right library matters more than clever workarounds. I spent 3 hours trying to make pdf-parse work on serverless. unpdf worked in 10 minutes.
If you’re building PDF processing for serverless, start with unpdf. Save yourself the 2 AM debugging.
FAQ
Why doesn't pdf-parse work on Vercel?
pdf-parse depends on pdfjs-dist, which has optional native dependencies (canvas). Vercel's serverless runtime can't compile native modules, so the build fails with missing module errors or the function crashes at runtime.
What is unpdf?
unpdf is a serverless-first PDF processing library. No native dependencies, works on edge runtimes, and provides text extraction and parsing capabilities for modern JavaScript environments.
How accurate is unpdf text extraction?
For structured documents (bank statements, invoices), accuracy is 99%+. For complex layouts or scanned documents, you may need additional OCR processing.
Can I use pdf-parse locally but unpdf in production?
Technically yes, but this creates inconsistency between environments. Better to use unpdf everywhere for predictable behavior.
Does unpdf work with Cloudflare Workers?
Yes, unpdf is specifically designed for edge and serverless runtimes including Cloudflare Workers, Vercel Edge, and Netlify Functions.