Why Scanned PDFs Are Large

Understanding scan file sizes and effective compression strategies

Quick Answer
Scanned PDFs are large because each page is stored as a high-resolution image rather than as text. A single color page scanned at 300 DPI is typically 8-10 MB, while a text-based PDF is usually under 100 KB per page. DPI, color mode, and compression settings all have a large effect on file size.

Image Data vs. Text Data: The Fundamental Difference

The primary reason scanned PDFs are massive compared to regular PDFs is that they store image data instead of text data. When you scan a document, the scanner captures a photograph of each page, pixel by pixel.

Text-based PDFs store text as character codes that reference fonts, plus vector drawing instructions, which is extremely space-efficient. The word "document" in a text PDF takes only a few bytes. The same word in a scanned PDF requires thousands of pixels to represent, each storing color information.
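To make the contrast concrete, here is a rough sketch in Python. The printed size of the word (0.7 in × 0.15 in, roughly 12-point type) is an assumption chosen for illustration, not a measurement, and the pixel figure is the raw size before any compression.

```python
# Compare the raw storage of one word as text and as scanned pixels.
WORD = "document"
DPI = 300
BYTES_PER_PIXEL = 3  # 24-bit RGB, uncompressed

# Assumed (hypothetical) size of the printed word, in inches.
WORD_WIDTH_IN = 0.7
WORD_HEIGHT_IN = 0.15

text_bytes = len(WORD.encode("utf-8"))
pixel_count = round(WORD_WIDTH_IN * DPI) * round(WORD_HEIGHT_IN * DPI)
image_bytes = pixel_count * BYTES_PER_PIXEL

print(f"as text:   {text_bytes} bytes")
print(f"as pixels: {pixel_count:,} pixels, {image_bytes:,} bytes uncompressed")
```

Under these assumptions the scanned word needs several thousand times more raw storage than its text form, which is the gap that compression then has to close.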

| Format Type | Data Storage Method | Typical Size (per page) |
| --- | --- | --- |
| Text PDF | Characters, fonts, vectors | 20-100 KB |
| Scanned PDF (300 DPI, B&W) | Image pixels (bitmap) | 200-500 KB |
| Scanned PDF (300 DPI, Color) | Image pixels with RGB data | 8-10 MB |
| Scanned PDF (600 DPI, Color) | High-res image pixels | 30-35 MB |

DPI (Dots Per Inch) Impact on File Size

DPI determines how many pixels are captured per inch of the scanned document. Higher DPI means more detail but much larger files. The relationship is quadratic, not linear: because resolution increases in both width and height, doubling the DPI quadruples the pixel count and, roughly, the file size.

Understanding DPI Calculations

For a standard US Letter page (8.5" × 11"):

| DPI Setting | Pixel Dimensions | Total Pixels | Approximate Size (Color) |
| --- | --- | --- | --- |
| 150 DPI | 1,275 × 1,650 | 2.1 million | 2-3 MB |
| 300 DPI | 2,550 × 3,300 | 8.4 million | 8-10 MB |
| 600 DPI | 5,100 × 6,600 | 33.7 million | 30-35 MB |
| 1200 DPI | 10,200 × 13,200 | 134.6 million | 120-140 MB |

DPI Recommendation:
• Reading text documents: 150-200 DPI is sufficient
• General office documents: 300 DPI is the sweet spot
• Archival quality: 400-600 DPI for preservation
• Photos or detailed graphics: 600+ DPI only when necessary
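
The pixel dimensions in the DPI table follow directly from page size × DPI. The sketch below reproduces them in Python and shows the quadratic scaling; the function names are illustrative, and the default page size assumes US Letter as in the table.

```python
def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> tuple[int, int, int]:
    """Return (width_px, height_px, total_px) for a page scanned at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px


def reduction_from_dpi_change(old_dpi: int, new_dpi: int) -> float:
    """Fraction of pixels removed when dropping from old_dpi to new_dpi."""
    return 1 - (new_dpi / old_dpi) ** 2


for dpi in (150, 300, 600, 1200):
    w, h, total = page_pixels(dpi)
    print(f"{dpi:>5} DPI: {w:,} x {h:,} = {total / 1e6:.1f} million pixels")

print(f"600 -> 300 DPI removes {reduction_from_dpi_change(600, 300):.0%} of the pixels")
```

Pixel count is only a proxy for file size, since compression changes the final number, but it explains why each doubling of DPI costs roughly four times the storage.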

Color vs. Grayscale vs. Black & White

The color mode dramatically affects file size because it determines how much data is stored for each pixel.

Black & White (1-bit)

1 bit per pixel: Each pixel is either black or white. Smallest size, best for text-only documents. ~200-500 KB per page at 300 DPI.

Grayscale (8-bit)

8 bits per pixel: 256 shades of gray. Good for documents with photos or shading. ~1-2 MB per page at 300 DPI.

Color (24-bit RGB)

24 bits per pixel: 16.7 million colors (8 bits each for Red, Green, Blue). Largest size. ~8-10 MB per page at 300 DPI.

Key insight: Before compression, color images are 3× larger than grayscale, and grayscale is 8× larger than black & white. For text documents, black & white mode can reduce file sizes by about 95% while keeping clean printed text fully legible.
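
The 3× and 8× ratios come straight from the bits stored per pixel. As a minimal sketch, the Python below computes the raw (uncompressed) size of a US Letter page in each mode. These raw figures are upper bounds and come out higher than the typical per-page sizes quoted above, which presumably already include some compression applied by the scanning software; the ratios, not the absolute numbers, are the point.

```python
BITS_PER_PIXEL = {"bw": 1, "grayscale": 8, "color": 24}


def raw_page_bytes(dpi: int, mode: str, width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Uncompressed size in bytes of one scanned page (ignores row padding and PDF overhead)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8


for mode in ("bw", "grayscale", "color"):
    print(f"{mode:>9}: {raw_page_bytes(300, mode) / 1e6:.1f} MB raw at 300 DPI")
```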

Compression: Uncompressed vs. JPEG vs. CCITT

Compression algorithms reduce file size by encoding image data more efficiently. Different compression types work better for different content.

Common PDF Image Compression Methods

| Compression Type | Best For | Quality | Size Reduction |
| --- | --- | --- | --- |
| Uncompressed | Archival/maximum quality | Perfect (lossless) | None (baseline) |
| CCITT Group 4 | Black & white text | Perfect (lossless) | 90-95% reduction |
| ZIP/Flate | Grayscale documents | Perfect (lossless) | 30-60% reduction |
| JPEG (High Quality) | Color photos | Slight loss | 70-80% reduction |
| JPEG (Medium Quality) | Color documents | Noticeable loss | 85-90% reduction |
| JPEG (Low Quality) | Web preview only | Significant loss | 95%+ reduction |

Many scanning programs default to uncompressed or lightly compressed output, resulting in unnecessarily large files. Applying JPEG compression at 80-85% quality can shrink color scans by about 80% with minimal visible difference.
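
To illustrate what "lossless" means for ZIP/Flate, the sketch below uses Python's standard zlib module, which implements the same DEFLATE algorithm. The page generator is a hypothetical stand-in: a clean synthetic page with no scanner noise, so it compresses far better than a real scan would. Treat the ratio it prints as a best case, not as a typical result.

```python
import zlib


def synthetic_page(width: int = 1275, height: int = 1650) -> bytes:
    """Fake 8-bit grayscale page (150 DPI Letter): white paper with dark bars standing in for text lines."""
    white_row = b"\xff" * width
    margin = width // 10
    # Alternating dark and white runs between the left and right margins.
    runs = (b"\x00" * 6 + b"\xff" * 4) * ((width - 2 * margin) // 10)
    text_row = (b"\xff" * margin + runs).ljust(width, b"\xff")
    # Each 40-row band holds 12 rows of "text" followed by 28 blank rows.
    return b"".join(text_row if y % 40 < 12 else white_row for y in range(height))


raw = synthetic_page()
packed = zlib.compress(raw, 6)
restored = zlib.decompress(packed)

# Lossless: decompression recovers every pixel exactly.
print(f"raw: {len(raw):,} bytes, Flate: {len(packed):,} bytes, identical: {restored == raw}")
```

JPEG, by contrast, discards detail to reach its higher reduction figures, which is why it suits photos better than sharp-edged text.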

Multiple Pages Accumulate Size

Unlike text PDFs, where each page adds minimal size, scanned PDFs accumulate megabytes per page. At 8-10 MB per page, a 100-page document scanned in color at 300 DPI comes to 800 MB or more.

Quick File Size Calculator:

Estimated size = Pages × Per-Page Size

Examples:
• 50 pages × 8 MB (color, 300 DPI) = 400 MB
• 50 pages × 2 MB (grayscale, 300 DPI) = 100 MB
• 50 pages × 500 KB (B&W, 300 DPI) = 25 MB
• 50 pages × 100 KB (B&W compressed) = 5 MB

This is why scanning settings matter so much for multi-page documents. In the examples above, the same 50 pages range from 5 MB to 400 MB, an 80× difference that comes from settings alone.
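
The calculator above is simple enough to express in a few lines of Python. This is only a sketch: the per-page figures are the article's example values, and the function and dictionary names are illustrative.

```python
def estimated_size_mb(pages: int, per_page_kb: float) -> float:
    """Estimated document size in MB (1 MB = 1,000 KB, matching the examples above)."""
    return pages * per_page_kb / 1000


# Example per-page sizes in KB, taken from the list above.
PER_PAGE_KB = {
    "color, 300 DPI": 8000,
    "grayscale, 300 DPI": 2000,
    "B&W, 300 DPI": 500,
    "B&W compressed": 100,
}

for label, kb in PER_PAGE_KB.items():
    print(f"50 pages, {label}: {estimated_size_mb(50, kb):.0f} MB")
```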

How to Reduce Scanned PDF File Size

Effective Size Reduction Strategies:

1. Lower the DPI
• Change from 600 DPI to 300 DPI (75% reduction)
• For text-only: use 150-200 DPI (about 90% reduction compared with 600 DPI)

2. Switch to Grayscale or B&W
• Color to grayscale: about 67% reduction (24 bits per pixel down to 8)
• Color to black & white: 95% reduction
• Use B&W for text documents

3. Apply JPEG Compression
• Use 80-85% quality for color/grayscale
• 70-80% reduction with minimal quality loss
• Use CCITT compression for B&W (automatic)

4. Add OCR with Text Layer
• OCR adds a small, searchable text layer on top of the page image
• Can reduce image resolution after OCR
• Makes content searchable and copyable

5. Use PDF Compression Tools
• Adobe Acrobat: "Reduce File Size" feature
• Online tools: Smallpdf, iLovePDF, Reformatly
• Ghostscript: command-line compression
• Preview (Mac): Export with "Reduce File Size"
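
For the Ghostscript option, a commonly used invocation rewrites the PDF through the pdfwrite device with one of its built-in quality presets. The sketch below builds that command in Python. The file names are placeholders, the executable is assumed to be installed as "gs" (on Windows it usually has a different name, such as gswin64c), and the right preset depends on the document, so treat this as a starting point rather than a recommended configuration.

```python
import subprocess


def ghostscript_compress_cmd(src: str, dst: str, preset: str = "/ebook") -> list[str]:
    """Build a Ghostscript command that rewrites src as a smaller PDF at dst."""
    return [
        "gs",                        # assumed executable name
        "-sDEVICE=pdfwrite",
        f"-dPDFSETTINGS={preset}",   # /screen, /ebook, /printer, or /prepress
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        f"-sOutputFile={dst}",
        src,
    ]


def compress_pdf(src: str, dst: str, preset: str = "/ebook") -> None:
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(ghostscript_compress_cmd(src, dst, preset), check=True)


# Placeholder file names; printing the command lets you inspect it before running.
cmd = ghostscript_compress_cmd("scan.pdf", "scan-small.pdf")
print(" ".join(cmd))
```

The /screen preset downsamples images most aggressively and /prepress least, so /ebook is a reasonable middle ground to try first on scanned office documents.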

Optimal Settings by Document Type

| Document Type | Recommended DPI | Color Mode | Compression |
| --- | --- | --- | --- |
| Text documents | 150-200 DPI | Black & White | CCITT Group 4 |
| Business documents | 300 DPI | Grayscale or B&W | JPEG 85% or CCITT |
| Documents with color logos | 300 DPI | Color | JPEG 80-85% |
| Photos or magazines | 300-400 DPI | Color | JPEG 85-90% |
| Archival/legal documents | 300-600 DPI | Grayscale or Color | Lossless (ZIP/Flate) |

OCR and Hybrid PDFs

Adding OCR (Optical Character Recognition) to scanned PDFs creates a hybrid document with both the scanned image and an invisible text layer. This makes PDFs searchable and, because the image no longer has to carry all the detail, can allow a smaller file in some cases.

Benefits of OCR for File Size

After running OCR, you can reduce the image quality/resolution while maintaining readability through the text layer. The text layer adds minimal size (similar to native PDFs), but allows you to use lower quality images as a "background."

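For example, a 300 DPI color scan might be 8 MB per page. After OCR, you could reduce the image to 150 DPI at 70% JPEG quality. Halving the DPI alone cuts the pixel count to a quarter (about 2 MB), and the lower JPEG quality brings the page to roughly 1 MB, while the invisible text layer keeps the document searchable and copyable to the extent that the OCR was accurate.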

Frequently Asked Questions

Why is my 10-page scan 50 MB?

A 50 MB file with 10 pages works out to about 5 MB per page, which points to color scanning at 300 DPI or higher with little or no effective compression. (At 600 DPI in color, pages typically run 30-35 MB each, so ten of them would be well over 50 MB.) Switch to 300 DPI grayscale with JPEG compression to get under 1 MB per page.

Does reducing PDF size reduce quality?

It depends on the method. Lowering DPI or using lossy JPEG compression does reduce quality, but done carefully, the loss is imperceptible for most uses. Going from 600 to 300 DPI is virtually unnoticeable on screen, and JPEG at 80-85% quality is hard to tell apart from the original at normal viewing sizes. Lossless compression (CCITT for B&W, ZIP/Flate for grayscale) reduces size without any quality loss.

Can I compress a PDF without losing text readability?

Absolutely. For text documents, scan in black & white mode at 300 DPI with CCITT compression. This keeps text crisp while holding file size to 200-500 KB per page. CCITT compression is lossless, so the compression step itself causes no degradation; the only information discarded is the color and shading dropped when the page is converted to black & white.

What's the best free tool to compress scanned PDFs?

For online tools, Reformatly, Smallpdf, and iLovePDF offer free PDF compression. On desktop, Adobe Acrobat's "Reduce File Size" feature belongs to the paid Acrobat product; the free Acrobat Reader does not include it. For Mac users, Preview's Export function with "Reduce File Size" works well. For advanced users, Ghostscript (free, command-line) provides the most control.

Should I scan documents at maximum DPI for future-proofing?

Only for archival/preservation purposes. For everyday business documents, 300 DPI is plenty for future needs. Ultra-high DPI (1200+) creates enormous files (100+ MB per page) that are difficult to store and share. Even professional archives typically use 400-600 DPI as the maximum. Remember: as long as you keep the paper original, you can rescan if higher quality is needed later, but massive files are hard to manage now.