Paperdoc Library
A zero-dependency PHP library for generating, parsing, and converting documents — PDF, HTML, CSV, DOCX, XLSX, PPTX, Markdown and more. One API for create, read, and convert.
Features
- Generate documents from scratch (PDF, HTML, CSV, DOCX, XLSX, PPTX, Markdown)
- Parse existing files into a unified in-memory model
- Convert between supported formats in one call
- Batch processing — open and process multiple files at once
- Laravel integration — ServiceProvider, Facade, and Artisan commands
- AI-powered — OCR (Tesseract) and LLM extraction via Neuron AI
- Zero native binary dependencies for core features: pure PHP (the optional OCR feature calls the Tesseract binary)
Requirements
Paperdoc requires PHP 8.2+ and the following extensions (no external PHP packages):
| Dependency | Version |
|---|---|
| PHP | ^8.2 |
| ext-dom | * |
| ext-mbstring | * |
| ext-zip | * |
| ext-zlib | * |
Optional (Laravel): illuminate/support ^11.0 or ^12.0 for the Facade and ServiceProvider.
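Before installing, it can help to confirm that the runtime meets the requirements listed above. The following is a minimal sketch using only PHP built-ins; `paperdocMissingRequirements()` is a hypothetical helper name and is not part of Paperdoc.

```php
<?php
// Sketch: report which documented requirements the current runtime is missing.
// paperdocMissingRequirements() is a hypothetical helper, not part of Paperdoc.
function paperdocMissingRequirements(): array
{
    $missing = [];

    // PHP ^8.2, per the requirements table.
    if (version_compare(PHP_VERSION, '8.2.0', '<')) {
        $missing[] = 'php >= 8.2 (running ' . PHP_VERSION . ')';
    }

    // Extensions listed in the requirements table.
    foreach (['dom', 'mbstring', 'zip', 'zlib'] as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = 'ext-' . $extension;
        }
    }

    return $missing;
}

$missing = paperdocMissingRequirements();
echo $missing === []
    ? "All Paperdoc requirements are met.\n"
    : 'Missing: ' . implode(', ', $missing) . "\n";
```

Composer performs the same checks during `composer require`, so this is mainly useful for diagnosing a deployment target ahead of time.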
Installation
Install the package via Composer:
composer require paperdoc-dev/paperdoc-lib
Laravel auto-discovery
The PaperdocServiceProvider and Paperdoc facade are registered automatically. No manual registration needed.
Quick Start
Standalone PHP
Create a document, add content, and save to a file:
use Paperdoc\Support\DocumentManager;
$manager = new DocumentManager();
// Create a PDF document
$doc = $manager->create('pdf', 'My Report');
$doc->addSection()
->addParagraph('Hello, Paperdoc!')
->setBold(true);
$manager->save($doc, 'output/report.pdf');
Laravel (Facade)
Use the Paperdoc facade to create, parse, convert, or render:
use Paperdoc\Facades\Paperdoc;
// Create and save
$doc = Paperdoc::create('docx', 'Invoice #1042');
$doc->addSection()->addParagraph('Amount due: $500');
Paperdoc::save($doc, storage_path('invoices/1042.docx'));
// Parse an existing file
$doc = Paperdoc::open('uploads/report.xlsx');
// Convert directly (file to file)
Paperdoc::convert('report.docx', 'report.pdf', 'pdf');
// Render document to string (e.g. HTML)
$html = Paperdoc::renderAs($doc, 'html');
// Batch open multiple files
$docs = Paperdoc::openBatch([
'file1.pdf',
'file2.docx',
'file3.xlsx',
]);
Conversion & rendering
Convert a file to another format in one call, or render a document to a string (e.g. HTML or Markdown) without writing to disk.
File-to-file conversion
// Standalone ($manager is a DocumentManager instance, as in Quick Start)
$manager->convert('input.docx', 'output.pdf', 'pdf');
// Laravel
Paperdoc::convert('reports/data.xlsx', storage_path('reports/data.pdf'), 'pdf');
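When converting several files, the destination path and the format argument can be computed up front. The sketch below assumes that format identifiers match lower-cased file extensions, which holds for the 'pdf', 'html' and 'md' identifiers used in this README but is not verified for every format. `formatFromPath()` and `conversionPlan()` are hypothetical helpers, not Paperdoc functions.

```php
<?php
// Sketch: derive the $format argument of convert() from a destination path.
// Assumption: the format identifier equals the lower-cased file extension.
function formatFromPath(string $path): string
{
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($extension === '') {
        throw new InvalidArgumentException("Cannot infer a format from '{$path}'.");
    }
    return $extension;
}

// Build [source, destination, format] triples for a list of source files.
function conversionPlan(array $sources, string $targetFormat, string $outputDir): array
{
    $plan = [];
    foreach ($sources as $source) {
        $name = pathinfo($source, PATHINFO_FILENAME);
        $destination = rtrim($outputDir, '/') . '/' . $name . '.' . $targetFormat;
        $plan[] = [$source, $destination, formatFromPath($destination)];
    }
    return $plan;
}
```

Each triple can then be passed to convert($source, $destination, $format) in a loop.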
Render to string
Useful for web preview, APIs, or further processing:
$doc = Paperdoc::open('document.pdf');
$html = Paperdoc::renderAs($doc, 'html');
$markdown = Paperdoc::renderAs($doc, 'md');
Supported Formats
Most formats can be both parsed and generated. Legacy Office formats (DOC, XLS, PPT) are parse-only.
| Format | Parse | Render / Generate |
|---|---|---|
| PDF | ✓ | ✓ |
| HTML | ✓ | ✓ |
| DOCX | ✓ | ✓ |
| XLSX | ✓ | ✓ |
| PPTX | ✓ | ✓ |
| CSV | ✓ | ✓ |
| Markdown | ✓ | ✓ |
| DOC | ✓ | — |
| XLS | ✓ | — |
| PPT | ✓ | — |
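The support matrix can be expressed as a lookup so that an unsupported conversion is rejected before any file is opened. This is a sketch: the constant name, the function name, and the lower-case format keys are assumptions, and PDF is taken to support both directions, as the feature list and examples suggest.

```php
<?php
// Capability map transcribed from the support table (hypothetical constant name).
const PAPERDOC_FORMAT_SUPPORT = [
    'pdf'  => ['parse' => true, 'render' => true],
    'html' => ['parse' => true, 'render' => true],
    'docx' => ['parse' => true, 'render' => true],
    'xlsx' => ['parse' => true, 'render' => true],
    'pptx' => ['parse' => true, 'render' => true],
    'csv'  => ['parse' => true, 'render' => true],
    'md'   => ['parse' => true, 'render' => true],
    // Legacy Office formats are parse-only.
    'doc'  => ['parse' => true, 'render' => false],
    'xls'  => ['parse' => true, 'render' => false],
    'ppt'  => ['parse' => true, 'render' => false],
];

// A conversion is possible when the source can be parsed and the target rendered.
function canConvert(string $from, string $to): bool
{
    $source = PAPERDOC_FORMAT_SUPPORT[strtolower($from)] ?? null;
    $target = PAPERDOC_FORMAT_SUPPORT[strtolower($to)] ?? null;

    return $source !== null
        && $target !== null
        && $source['parse']
        && $target['render'];
}
```

For example, canConvert('doc', 'pdf') is true, while canConvert('pdf', 'doc') is false because DOC cannot be generated.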
Document Model
Every format uses the same in-memory structure, so you can parse a PDF and render to DOCX without format-specific code:
Document
└── Section[]
├── Paragraph (TextRun[], bold, italic, font…)
├── Table → TableRow[] → TableCell[]
├── Image
└── PageBreak
Styles live in Document/Style/ and can be applied at paragraph, run, or section level.
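To illustrate why a shared tree removes the need for format-specific code, here is a minimal stand-in model with a single tree walker. The Sketch* classes and their public properties are assumptions made for this example; they are not Paperdoc's actual classes, and images, page breaks, and styles are left out.

```php
<?php
// Illustrative stand-ins for the tree above. These are NOT Paperdoc's classes.
final class SketchParagraph
{
    /** @var string[] text runs */
    public array $runs;
    public bool $bold;

    public function __construct(array $runs, bool $bold = false)
    {
        $this->runs = $runs;
        $this->bold = $bold;
    }
}

final class SketchTable
{
    /** @var string[][] rows of cell texts */
    public array $rows;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }
}

final class SketchDocument
{
    public string $title;
    /** @var array[] sections, each a list of paragraphs and tables */
    public array $sections;

    public function __construct(string $title, array $sections)
    {
        $this->title = $title;
        $this->sections = $sections;
    }
}

// A renderer only walks the tree; it never needs to know the source format.
function sketchToPlainText(SketchDocument $doc): string
{
    $lines = [$doc->title];
    foreach ($doc->sections as $elements) {
        foreach ($elements as $element) {
            if ($element instanceof SketchParagraph) {
                $lines[] = implode('', $element->runs);
            } elseif ($element instanceof SketchTable) {
                foreach ($element->rows as $cells) {
                    $lines[] = implode("\t", $cells);
                }
            }
        }
    }
    return implode("\n", $lines);
}
```

Adding another output format in this sketch means writing one more walker over the same tree, which mirrors the parser and renderer split described in this README.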
Opening options (OCR, LLM)
When opening a file with open() or openBatch(), you can pass options to enable OCR and/or LLM augmentation:
// Force OCR on a scanned PDF
$doc = Paperdoc::open('scan.pdf', ['ocr' => true]);
// Skip OCR even if auto-detect would run it
$doc = Paperdoc::open('mixed.pdf', ['ocr' => false]);
// Enable LLM augmentation (summaries, structure, correction)
$doc = Paperdoc::open('scan.pdf', ['ocr' => true, 'llm' => true]);
// OCR language (e.g. 'fra', 'eng')
$doc = Paperdoc::open('document.pdf', ['language' => 'fra']);
Options: ocr (true / false / 'auto'), llm (bool), language (OCR language code). Defaults come from config/paperdoc.php.
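The way per-call options might layer over configured defaults can be sketched as a merge followed by validation. `resolveOpenOptions()` is a hypothetical helper, and the default values shown are placeholders rather than Paperdoc's real defaults.

```php
<?php
// Hypothetical helper: merge per-call options over defaults, then validate.
// The defaults below are placeholders, not values taken from config/paperdoc.php.
function resolveOpenOptions(
    array $options,
    array $defaults = ['ocr' => 'auto', 'llm' => false, 'language' => 'eng']
): array {
    // Reject keys that the README does not document.
    $unknown = array_diff(array_keys($options), ['ocr', 'llm', 'language']);
    if ($unknown !== []) {
        throw new InvalidArgumentException('Unknown option(s): ' . implode(', ', $unknown));
    }

    // Per-call values win over defaults.
    $resolved = array_merge($defaults, $options);

    if (!in_array($resolved['ocr'], [true, false, 'auto'], true)) {
        throw new InvalidArgumentException("ocr must be true, false, or 'auto'.");
    }
    if (!is_bool($resolved['llm'])) {
        throw new InvalidArgumentException('llm must be a boolean.');
    }
    if (!is_string($resolved['language']) || $resolved['language'] === '') {
        throw new InvalidArgumentException('language must be a non-empty OCR language code.');
    }

    return $resolved;
}
```

The strict in_array() check matters here: without it, a loose comparison could accept values other than true, false, and 'auto'.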
OCR
Paperdoc uses Tesseract for text extraction from scanned documents and images. OCR can run automatically when opening a PDF (auto-detect) or be forced/skipped via open($path, ['ocr' => true|false]).
- Post-processing — character substitution (0→O, rn→m), optional spell correction, n-gram correction, pattern recognition (dates, amounts), structure detection (headings, lists).
- Parallel processing — multiple pages processed in parallel (pool size configurable).
- Laravel Artisan commands: paperdoc:build-dictionary builds a spell-check dictionary; paperdoc:train-ngram trains an n-gram model for better correction.
Config: config/paperdoc.php → ocr (enabled, driver, language, pool_size, tesseract binary) and ocr.post_processing.
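The character-substitution step can be pictured as a dictionary-guided correction: a substitution such as 0→O or rn→m is accepted only when it turns an unknown word into a known one. The functions below are a simplified sketch under that assumption, not Paperdoc's post-processor; punctuation attached to a word is not handled.

```php
<?php
// Sketch of dictionary-guided character substitution for OCR output.
// correctOcrWord() and correctOcrText() are hypothetical names.
function correctOcrWord(string $word, array $dictionary): string
{
    $known = array_flip(array_map('strtolower', $dictionary));
    if (isset($known[strtolower($word)])) {
        return $word; // already a known word, e.g. "modern" keeps its "rn"
    }

    // Match the case of the replacement letter to the rest of the word.
    $zeroReplacement = (strtoupper($word) === $word) ? 'O' : 'o';

    // Common OCR confusions: digit zero for letter O, "rn" for "m".
    $substitutions = [['0', $zeroReplacement], ['rn', 'm']];
    foreach ($substitutions as [$wrong, $right]) {
        $candidate = str_replace($wrong, $right, $word);
        if ($candidate !== $word && isset($known[strtolower($candidate)])) {
            return $candidate;
        }
    }

    return $word; // no confident correction, leave the word untouched
}

function correctOcrText(string $text, array $dictionary): string
{
    // Split on whitespace but keep the separators so spacing is preserved.
    $tokens = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($tokens as $i => $token) {
        if (trim($token) !== '') {
            $tokens[$i] = correctOcrWord($token, $dictionary);
        }
    }
    return implode('', $tokens);
}
```

Checking candidates against a dictionary is what keeps legitimate strings such as invoice numbers or the word "modern" from being rewritten.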
LLM / AI
LLM augmentation improves OCR output and enables summaries, translations, and structured extraction. Paperdoc uses Neuron AI to connect to multiple providers.
- LlmAugmenter — post-process OCR text (correction, structure).
- PaperdocAgent — document Q&A, summaries, and structured data extraction.
- Providers — OpenAI, Anthropic, Gemini, Ollama (and others supported by Neuron AI).
Enable via open($path, ['llm' => true]) or in config: llm.enabled, llm.provider, llm.model, llm.api_key (or PAPERDOC_LLM_* env vars).
Laravel
Use the Paperdoc facade for the same API as DocumentManager: Paperdoc::create(), Paperdoc::open(), Paperdoc::save(), Paperdoc::convert(), Paperdoc::renderAs(), Paperdoc::openBatch().
Artisan commands (when the package is installed in a Laravel app):
- php artisan paperdoc:build-dictionary <path>: build a dictionary from text files for OCR spell correction.
- php artisan paperdoc:train-ngram <path>: train an n-gram model from text files for OCR post-processing.
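As background on what an n-gram model contributes to correction, the sketch below counts word bigrams in a training text and uses them to choose between candidate words based on the preceding word. It is a conceptual illustration only: `trainBigrams()` and `pickByContext()` are hypothetical names, and the model that paperdoc:train-ngram actually produces may be structured differently.

```php
<?php
// Sketch: count word bigrams in a training text.
function trainBigrams(string $corpus): array
{
    // Split on anything that is not a letter or digit; lower-case ASCII letters.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($corpus), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    for ($i = 0; $i + 1 < count($words); $i++) {
        $key = $words[$i] . ' ' . $words[$i + 1];
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    return $counts;
}

// Pick the candidate that most often followed $previous in the training text.
// Ties keep the earliest candidate.
function pickByContext(string $previous, array $candidates, array $bigrams): string
{
    $best = $candidates[0];
    $bestCount = -1;
    foreach ($candidates as $candidate) {
        $count = $bigrams[strtolower($previous . ' ' . $candidate)] ?? 0;
        if ($count > $bestCount) {
            $best = $candidate;
            $bestCount = $count;
        }
    }
    return $best;
}
```

Training on text from the same domain as the scanned documents is what makes such counts useful, which is presumably why the command takes a path to your own text files.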
Configuration
Laravel only. Publish the config file:
php artisan vendor:publish --tag=paperdoc-config
This creates config/paperdoc.php. Main options:
- Default format — default output when creating documents
- Typography — fonts and sizes applied to documents
- Storage paths — where to read/write files
- OCR — Tesseract and post-processing settings
- LLM — AI extraction and augmentation (Neuron AI)
API reference
Main entry point: DocumentManager (standalone) or Paperdoc facade (Laravel).
| Method | Description |
|---|---|
| create($format, $title = '') | Create a new empty document (format: pdf, docx, html, csv, etc.). |
| open($filename, $options = []) | Parse a file. Options: ocr, llm, language. |
| save($document, $path) | Write document to a file. |
| renderAs($document, $format) | Render document to string (e.g. 'html', 'md'). |
| convert($source, $destination, $format) | Open source file and save as destination in given format. |
| openBatch($filenames, $options = []) | Open multiple files; returns array of documents. Same options as open(). |
Testing
Run the test suite from the library directory:
composer test
# or
./vendor/bin/phpunit
Integration tests are in tests/Integration/, unit tests in tests/Unit/.
Architecture
High-level layout of the package source:
src/
├── Concerns/ # Shared traits
├── Console/ # Artisan commands
├── Contracts/ # DocumentInterface, ParserInterface…
├── Document/ # Core model (Document, Section, Paragraph…)
├── Enum/ # Format enums
├── Facades/ # Laravel Facade
├── Factory/ # Document/Parser factories
├── Llm/ # AI/LLM integration (Neuron AI)
├── Ocr/ # OCR integration
├── Parsers/ # Format-specific parsers
├── Renderers/ # Format-specific renderers
├── Support/ # DocumentManager and helpers
└── PaperdocServiceProvider.php