Paperdoc Library
A zero-dependency PHP library for generating, parsing, and converting documents — PDF, HTML, CSV, DOCX, XLSX, PPTX, Markdown and more. One API for create, read, and convert.
Features
- Generate documents from scratch (PDF, HTML, CSV, DOCX, XLSX, PPTX, Markdown)
- Parse existing files into a unified in-memory model
- Convert between any supported format in one call
- Batch processing — open and process multiple files at once
- Thumbnails — generate preview images from documents or any supported file (images, PDF, Office, HTML, etc.); LibreOffice required for Office/CSV, Imagick or Ghostscript for PDF
- Laravel integration — ServiceProvider, Facade, and Artisan commands
- AI-powered — OCR (Tesseract) and LLM extraction via Neuron AI
- Core library: PHP + extensions only; high-quality thumbnails additionally need LibreOffice and/or Imagick/Ghostscript (a native PHP fallback is used otherwise)
Requirements
Paperdoc requires PHP 8.2+ and the following extensions (no external PHP packages):
| Dependency | Version |
|---|---|
| PHP | ^8.2 |
| ext-dom | * |
| ext-mbstring | * |
| ext-zip | * |
| ext-zlib | * |
Optional (Laravel): illuminate/support ^11.0 or ^12.0 for the Facade and ServiceProvider.
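A quick preflight check can confirm that the extensions in the table above are loaded before the library is used. The following is a minimal sketch built on PHP's own extension_loaded(); the helper name missingExtensions is hypothetical and not part of Paperdoc.

```php
<?php
declare(strict_types=1);

/**
 * Return the names from $required that are not loaded in this PHP runtime.
 * Hypothetical helper, not part of Paperdoc.
 *
 * @param list<string> $required
 * @return list<string>
 */
function missingExtensions(array $required): array
{
    return array_values(array_filter(
        $required,
        static fn (string $name): bool => !extension_loaded($name)
    ));
}

// Extensions listed in the requirements table.
$missing = missingExtensions(['dom', 'mbstring', 'zip', 'zlib']);
if ($missing !== []) {
    echo 'Missing PHP extensions: ' . implode(', ', $missing) . PHP_EOL;
}
```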
Thumbnails — required system dependencies
For correct, high-quality thumbnails (fonts and layout):
- LibreOffice is required for DOCX, XLSX, PPTX, and CSV (headless: libreoffice or soffice in PATH).
- Imagick or Ghostscript is required for PDF thumbnails with correct rendering (otherwise a fallback text/image preview is used).
Without these, thumbnails fall back to native PHP previews (limited fonts, no real layout).
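To see in advance whether thumbnails will use the native fallback, you can check whether the system binaries are reachable. This is a sketch in plain PHP and not how Paperdoc itself detects them: findBinary is a hypothetical helper, and gs as the Ghostscript binary name is an assumption based on its usual Unix name.

```php
<?php
declare(strict_types=1);

/**
 * Return the full path of the first candidate binary found in $path, or null.
 * Hypothetical helper, not part of Paperdoc.
 *
 * @param list<string> $candidates Binary names to try, in order of preference.
 * @param string|null  $path       PATH-style directory list; defaults to the PATH environment variable.
 */
function findBinary(array $candidates, ?string $path = null): ?string
{
    $path ??= (string) getenv('PATH');
    $dirs = array_filter(
        explode(PATH_SEPARATOR, $path),
        static fn (string $dir): bool => $dir !== ''
    );

    foreach ($candidates as $name) {
        foreach ($dirs as $dir) {
            $full = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;
            if (is_file($full) && is_executable($full)) {
                return $full;
            }
        }
    }

    return null;
}

$office = findBinary(['libreoffice', 'soffice']); // names given in the requirements above
$gs     = findBinary(['gs']);                      // assumption: Ghostscript's usual Unix binary name

echo 'LibreOffice: ' . ($office ?? 'not found') . PHP_EOL;
echo 'Ghostscript: ' . ($gs ?? 'not found') . PHP_EOL;
```

Imagick is a PHP extension rather than a binary, so extension_loaded('imagick') is the equivalent check for it.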
Installation
Install the package via Composer:
composer require paperdoc-dev/paperdoc-lib
Laravel auto-discovery
The PaperdocServiceProvider and Paperdoc facade are registered automatically. No manual registration needed.
Quick Start
Standalone PHP
Create a document, add content, and save to a file:
use Paperdoc\Support\DocumentManager;
$manager = new DocumentManager();
// Create a PDF document
$doc = $manager->create('pdf', 'My Report');
$doc->addSection()
->addParagraph('Hello, Paperdoc!')
->setBold(true);
$manager->save($doc, 'output/report.pdf');
Laravel (Facade)
Use the Paperdoc facade to create, parse, convert, or render:
use Paperdoc\Facades\Paperdoc;
// Create and save
$doc = Paperdoc::create('docx', 'Invoice #1042');
$doc->addSection()->addParagraph('Amount due: $500');
Paperdoc::save($doc, storage_path('invoices/1042.docx'));
// Parse an existing file
$doc = Paperdoc::open('uploads/report.xlsx');
// Convert directly (file to file)
Paperdoc::convert('report.docx', 'report.pdf', 'pdf');
// Render document to string (e.g. HTML)
$html = Paperdoc::renderAs($doc, 'html');
// Batch open multiple files
$docs = Paperdoc::openBatch([
'file1.pdf',
'file2.docx',
'file3.xlsx',
]);
Conversion & rendering
Convert a file to another format in one call, or render a document to a string (e.g. HTML or Markdown) without writing to disk.
File-to-file conversion
// Standalone
DocumentManager::convert('input.docx', 'output.pdf', 'pdf');
// Laravel
Paperdoc::convert('reports/data.xlsx', storage_path('reports/data.pdf'), 'pdf');
Render to string
Useful for web preview, APIs, or further processing:
$doc = Paperdoc::open('document.pdf');
$html = Paperdoc::renderAs($doc, 'html');
$markdown = Paperdoc::renderAs($doc, 'md');
Supported Formats
Parse existing files or generate new ones for each format. Legacy Office formats (DOC, XLS, PPT) are parse-only.
| Format | Parse | Render / Generate |
|---|---|---|
| PDF | ✓ | ✓ |
| HTML | ✓ | ✓ |
| DOCX | ✓ | ✓ |
| XLSX | ✓ | ✓ |
| PPTX | ✓ | ✓ |
| CSV | ✓ | ✓ |
| Markdown | ✓ | ✓ |
| DOC | ✓ | — |
| XLS | ✓ | — |
| PPT | ✓ | — |
Document Model
Every format uses the same in-memory structure, so you can parse a PDF and render to DOCX without format-specific code:
Document
└── Section[]
├── Paragraph (TextRun[], bold, italic, font…)
├── Table → TableRow[] → TableCell[]
├── Image
└── PageBreak
Styles live in Document/Style/ and can be applied at paragraph, run, or section level.
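The idea of one tree feeding several renderers can be illustrated without the library. The classes below are hypothetical stand-ins (prefixed Sketch) and are not Paperdoc's real classes or API; they only show how a single Document → Section → element structure can be walked by two independent renderers.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins for the model; NOT Paperdoc's real classes.
final class SketchParagraph
{
    public function __construct(
        public readonly string $text,
        public readonly bool $bold = false,
    ) {}
}

final class SketchTable
{
    /** @param list<list<string>> $rows Each row is a list of cell texts. */
    public function __construct(public readonly array $rows) {}
}

final class SketchSection
{
    /** @param list<SketchParagraph|SketchTable> $elements */
    public function __construct(public readonly array $elements) {}
}

final class SketchDocument
{
    /** @param list<SketchSection> $sections */
    public function __construct(
        public readonly string $title,
        public readonly array $sections,
    ) {}
}

/** Render the tree as plain text: one line per paragraph, tab-separated table rows. */
function renderText(SketchDocument $doc): string
{
    $lines = [$doc->title];
    foreach ($doc->sections as $section) {
        foreach ($section->elements as $element) {
            if ($element instanceof SketchParagraph) {
                $lines[] = $element->text;
            } elseif ($element instanceof SketchTable) {
                foreach ($element->rows as $row) {
                    $lines[] = implode("\t", $row);
                }
            }
        }
    }
    return implode("\n", $lines);
}

/** Render the same tree as HTML, escaping all text content. */
function renderHtml(SketchDocument $doc): string
{
    $html = '<h1>' . htmlspecialchars($doc->title) . '</h1>';
    foreach ($doc->sections as $section) {
        foreach ($section->elements as $element) {
            if ($element instanceof SketchParagraph) {
                $text = htmlspecialchars($element->text);
                $html .= '<p>' . ($element->bold ? "<strong>{$text}</strong>" : $text) . '</p>';
            } elseif ($element instanceof SketchTable) {
                $html .= '<table>';
                foreach ($element->rows as $row) {
                    $cells = array_map(static fn (string $c): string => '<td>' . htmlspecialchars($c) . '</td>', $row);
                    $html .= '<tr>' . implode('', $cells) . '</tr>';
                }
                $html .= '</table>';
            }
        }
    }
    return $html;
}
```

Because both renderers depend only on the tree, adding an output format means adding one more function; the code that builds or parses the document stays unchanged.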
Generate thumbnails
Paperdoc can generate thumbnails from documents or from any supported file. Thumbnails are computed on the fly (no file is written unless you do it yourself). For in-memory documents, the thumbnail reflects the first image in the document, or falls back to the first page of the source file if the document was opened from disk.
LibreOffice is required for DOCX, XLSX, PPTX, and CSV to get thumbnails with correct fonts and layout. For PDF, Imagick or Ghostscript is required for proper rendering. See Requirements.
From a document (in-memory)
Use getThumbnail() for an array (width, height, mimeType, data) or getThumbnailDataUri() for a data:image/…;base64,… string suitable for <img src="…">:
$doc = Paperdoc::open('report.pdf');
// Array: ['data' => '…', 'mimeType' => 'image/jpeg', 'width' => 300, 'height' => 200]
$thumb = $doc->getThumbnail(300, 200, 85);
if ($thumb) {
file_put_contents('preview.jpg', base64_decode($thumb['data']));
}
// Data URI — use directly in HTML
$dataUri = $doc->getThumbnailDataUri(300, 200);
// <img src="<?php echo $dataUri; ?>" alt="Preview">
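For reference, the data:image/…;base64,… format itself is simple to build and to take apart. The sketch below uses only core PHP functions; toDataUri and fromDataUri are hypothetical helpers, not Paperdoc methods, and they assume you hold raw (not yet base64-encoded) image bytes.

```php
<?php
declare(strict_types=1);

/**
 * Build a data URI from raw (not yet base64-encoded) image bytes.
 * Hypothetical helper, not part of Paperdoc.
 */
function toDataUri(string $bytes, string $mimeType): string
{
    return 'data:' . $mimeType . ';base64,' . base64_encode($bytes);
}

/**
 * Split a base64 data URI back into its MIME type and raw bytes.
 * Returns null when the string is not a well-formed base64 data URI.
 *
 * @return array{mimeType: string, bytes: string}|null
 */
function fromDataUri(string $uri): ?array
{
    if (preg_match('#^data:([\w.+-]+/[\w.+-]+);base64,(.*)$#s', $uri, $m) !== 1) {
        return null;
    }

    // Strict mode rejects characters outside the base64 alphabet.
    $bytes = base64_decode($m[2], true);

    return $bytes === false ? null : ['mimeType' => $m[1], 'bytes' => $bytes];
}
```

A helper like fromDataUri can be handy when a data URI needs to be written back to disk as an image file.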
Via DocumentManager or Facade
Same API without holding the document instance:
// Standalone
$thumb = DocumentManager::thumbnail($document, 300, 300, 85);
$dataUri = DocumentManager::thumbnailDataUri($document, 300, 300);
// Laravel
$dataUri = Paperdoc::thumbnailDataUri($doc, 200, 200);
From a file path (any format)
ThumbnailGenerator can create a thumbnail directly from a file. Images use GD; PDF uses Imagick or Ghostscript (required for correct rendering); Office and CSV use LibreOffice headless (required) → PDF → first page. Without LibreOffice/Imagick/Ghostscript, a fallback text or grid preview is used.
use Paperdoc\Support\ThumbnailGenerator;
// Array or data URI
$thumb = ThumbnailGenerator::fromFile('document.docx', 300, 300, 85);
$dataUri = ThumbnailGenerator::fromFileDataUri('report.pdf', 400, 300, 90, 0); // page 0 = first page
Defaults: ThumbnailGenerator::DEFAULT_WIDTH / DEFAULT_HEIGHT = 300, DEFAULT_QUALITY = 85. For PDF and Office files, the optional $page argument (0-based) selects which page to thumbnail.
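The width and height arguments read as a bounding box (the API table names them $maxWidth and $maxHeight). How Paperdoc computes the final size is not documented here, so the following is only a sketch of the common "fit within" calculation, assuming the aspect ratio is preserved and small images are not upscaled. fitWithin is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Scale ($width x $height) down to fit inside ($maxWidth x $maxHeight),
 * preserving aspect ratio and never upscaling.
 * Hypothetical helper; Paperdoc's actual sizing rules may differ.
 *
 * @return array{width: int, height: int}
 */
function fitWithin(int $width, int $height, int $maxWidth, int $maxHeight): array
{
    if ($width <= 0 || $height <= 0 || $maxWidth <= 0 || $maxHeight <= 0) {
        throw new InvalidArgumentException('All dimensions must be positive.');
    }

    // The tighter of the two ratios decides the scale; cap at 1.0 to avoid upscaling.
    $scale = min($maxWidth / $width, $maxHeight / $height, 1.0);

    return [
        'width'  => max(1, (int) round($width * $scale)),
        'height' => max(1, (int) round($height * $scale)),
    ];
}

// A 1200x800 page fitted into the default 300x300 box.
$size = fitWithin(1200, 800, 300, 300);
echo $size['width'] . 'x' . $size['height'] . PHP_EOL;
```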
Opening options (OCR, LLM)
When opening a file with open() or openBatch(), you can pass options to enable OCR and/or LLM augmentation:
// Force OCR on a scanned PDF
$doc = Paperdoc::open('scan.pdf', ['ocr' => true]);
// Skip OCR even if auto-detect would run it
$doc = Paperdoc::open('mixed.pdf', ['ocr' => false]);
// Enable LLM augmentation (summaries, structure, correction)
$doc = Paperdoc::open('scan.pdf', ['ocr' => true, 'llm' => true]);
// OCR language (e.g. 'fra', 'eng')
$doc = Paperdoc::open('document.pdf', ['language' => 'fra']);
Options: ocr (true / false / 'auto'), llm (bool), language (OCR language code). Defaults come from config/paperdoc.php.
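The interplay between per-call options and configured defaults can be pictured as a merge followed by validation. This is a sketch of that idea, not Paperdoc's internal code: resolveOpenOptions is hypothetical, the default values shown are placeholders, and silently dropping unknown keys is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Merge per-call options over defaults and validate the result.
 * Hypothetical helper; $defaults must contain the keys ocr, llm and language.
 *
 * @param array<string, mixed> $options  Options passed to a single open() call.
 * @param array<string, mixed> $defaults Values that would come from configuration.
 * @return array<string, mixed>
 */
function resolveOpenOptions(array $options, array $defaults): array
{
    // Assumption: keys that are not known options are ignored rather than rejected.
    $resolved = array_merge($defaults, array_intersect_key($options, $defaults));

    if (!in_array($resolved['ocr'], [true, false, 'auto'], true)) {
        throw new InvalidArgumentException("ocr must be true, false, or 'auto'.");
    }
    if (!is_bool($resolved['llm'])) {
        throw new InvalidArgumentException('llm must be a boolean.');
    }
    if (!is_string($resolved['language']) || $resolved['language'] === '') {
        throw new InvalidArgumentException('language must be a non-empty string.');
    }

    return $resolved;
}

// Placeholder defaults; the real ones come from config/paperdoc.php.
$defaults = ['ocr' => 'auto', 'llm' => false, 'language' => 'eng'];

// A call that forces OCR keeps the default llm and language values.
$resolved = resolveOpenOptions(['ocr' => true], $defaults);
```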
OCR
Paperdoc uses Tesseract for text extraction from scanned documents and images. OCR can run automatically when opening a PDF (auto-detect) or be forced/skipped via open($path, ['ocr' => true|false]).
- Post-processing — character substitution (0→O, rn→m), optional spell correction, n-gram correction, pattern recognition (dates, amounts), structure detection (headings, lists).
- Parallel processing — multiple pages processed in parallel (pool size configurable).
- Laravel Artisan: paperdoc:build-dictionary to build a spell-check dictionary; paperdoc:train-ngram to train an n-gram model for better correction.
Config: config/paperdoc.php → ocr (enabled, driver, language, pool_size, tesseract binary) and ocr.post_processing.
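To make the character-substitution step concrete, here is a small sketch of the two substitutions named above (0→O and rn→m). It is not Paperdoc's algorithm: correctToken is hypothetical, and the rule of accepting rn→m only when the result is a dictionary word is an assumption chosen to avoid corrupting valid words such as "modern".

```php
<?php
declare(strict_types=1);

/**
 * Apply two typical OCR fixes to a single token.
 * Hypothetical helper, not Paperdoc's implementation.
 *
 * @param array<string, true> $dictionary Known lowercase words, used as a lookup set.
 */
function correctToken(string $token, array $dictionary): string
{
    // A zero sitting between two uppercase letters is most likely a misread "O".
    $token = preg_replace('/(?<=[A-Z])0(?=[A-Z])/', 'O', $token) ?? $token;

    $lower = strtolower($token);
    if (isset($dictionary[$lower])) {
        return $token; // already a known word, leave it alone
    }

    // Try "rn" -> "m" and keep the result only if it produces a known word.
    // Note: the corrected word is returned in lowercase.
    $candidate = str_replace('rn', 'm', $lower);

    return isset($dictionary[$candidate]) ? $candidate : $token;
}

$dictionary = ['document' => true, 'modern' => true];

echo correctToken('docurnent', $dictionary) . PHP_EOL;
echo correctToken('modern', $dictionary) . PHP_EOL;
echo correctToken('C0DE', $dictionary) . PHP_EOL;
```

A dictionary built from domain text matters here: without one, the rn→m substitution has no way to tell a misread from a legitimate word.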
LLM / AI
LLM augmentation improves OCR output and enables summaries, translations, and structured extraction. Paperdoc uses Neuron AI to connect to multiple providers.
- LlmAugmenter — post-process OCR text (correction, structure).
- PaperdocAgent — document Q&A, summaries, and structured data extraction.
- Providers — OpenAI, Anthropic, Gemini, Ollama (and others supported by Neuron AI).
Enable via open($path, ['llm' => true]) or in config: llm.enabled, llm.provider, llm.model, llm.api_key (or PAPERDOC_LLM_* env vars).
Laravel
Use the Paperdoc facade for the same API as DocumentManager: Paperdoc::create(), Paperdoc::open(), Paperdoc::save(), Paperdoc::convert(), Paperdoc::renderAs(), Paperdoc::openBatch().
Artisan commands (when the package is installed in a Laravel app):
- php artisan paperdoc:build-dictionary <path>: build a dictionary from text files for OCR spell correction.
- php artisan paperdoc:train-ngram <path>: train an n-gram model from text files for OCR post-processing.
Configuration
Laravel only. Publish the config file:
php artisan vendor:publish --tag=paperdoc-config
This creates config/paperdoc.php. Main options:
- Default format — default output when creating documents
- Typography — fonts and sizes applied to documents
- Storage paths — where to read/write files
- OCR — Tesseract and post-processing settings
- LLM — AI extraction and augmentation (Neuron AI)
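Dotted names such as ocr.pool_size or llm.provider refer to nested array keys in the config file; in a Laravel app they would typically be read with the config() helper under the paperdoc prefix. The sketch below shows the lookup idea in plain PHP. configGet is a hypothetical helper, and the array is an illustrative shape only: key names follow this README, values are placeholders, and the published config file remains the authority.

```php
<?php
declare(strict_types=1);

/**
 * Read a nested value using dot notation, e.g. "ocr.pool_size".
 * Hypothetical helper mirroring how dotted config keys map to nested arrays.
 */
function configGet(array $config, string $key, mixed $default = null): mixed
{
    $node = $config;
    foreach (explode('.', $key) as $segment) {
        if (!is_array($node) || !array_key_exists($segment, $node)) {
            return $default;
        }
        $node = $node[$segment];
    }

    return $node;
}

// Illustrative shape only: key names follow this README, values are placeholders.
$config = [
    'ocr' => ['enabled' => true, 'language' => 'eng', 'pool_size' => 4],
    'llm' => ['enabled' => false, 'provider' => null, 'model' => null, 'api_key' => null],
];

$poolSize = configGet($config, 'ocr.pool_size', 1);
$llmOn    = configGet($config, 'llm.enabled', false);
```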
API reference
Main entry point: DocumentManager (standalone) or Paperdoc facade (Laravel).
| Method | Description |
|---|---|
| create($format, $title = '') | Create a new empty document (format: pdf, docx, html, csv, etc.). |
| open($filename, $options = []) | Parse a file. Options: ocr, llm, language. |
| save($document, $path) | Write document to a file. |
| renderAs($document, $format) | Render document to string (e.g. 'html', 'md'). |
| convert($source, $destination, $format) | Open source file and save as destination in given format. |
| openBatch($filenames, $options = []) | Open multiple files; returns array of documents. Same options as open(). |
| thumbnail($document, $maxWidth, $maxHeight, $quality) | Get thumbnail array (data, mimeType, width, height) for a document. See Thumbnails. |
| thumbnailDataUri($document, $maxWidth, $maxHeight, $quality) | Get thumbnail as a data:image/…;base64,… string for <img src="…">. |
Testing
Run the test suite from the library directory:
composer test
# or
./vendor/bin/phpunit
Integration tests are in tests/Integration/, unit tests in tests/Unit/.
Architecture
High-level layout of the package source:
src/
├── Concerns/ # Shared traits
├── Console/ # Artisan commands
├── Contracts/ # DocumentInterface, ParserInterface…
├── Document/ # Core model (Document, Section, Paragraph…)
├── Enum/ # Format enums
├── Facades/ # Laravel Facade
├── Factory/ # Document/Parser factories
├── Llm/ # AI/LLM integration (Neuron AI)
├── Ocr/ # OCR integration
├── Parsers/ # Format-specific parsers
├── Renderers/ # Format-specific renderers
├── Support/ # DocumentManager and helpers
└── PaperdocServiceProvider.php