Paperdoc Library

A zero-dependency PHP library for generating, parsing, and converting documents: PDF, HTML, CSV, DOCX, XLSX, PPTX, Markdown, and more. One API to create, read, and convert.

Features

Generate documents from scratch (PDF, HTML, CSV, DOCX, XLSX, PPTX, Markdown)
Parse existing files into a unified in-memory model
Convert between supported formats in one call
Batch processing — open and process multiple files at once
Thumbnails — generate preview images from documents or any supported file (images, PDF, Office, HTML, etc.); LibreOffice required for Office/CSV, Imagick or Ghostscript for PDF
Laravel integration — ServiceProvider, Facade, and Artisan commands
AI-powered — OCR (Tesseract) and LLM extraction via Neuron AI
Core library: PHP + extensions only; thumbnails require LibreOffice and/or Imagick/Ghostscript

Requirements

Paperdoc requires PHP 8.2+ and the following extensions (no external PHP packages):

Dependency	Version
PHP	^8.2
ext-dom	*
ext-mbstring	*
ext-zip	*
ext-zlib	*
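
The table above can be checked before installing. The following is a minimal sketch using only PHP built-ins; `missingRequirements()` is a hypothetical helper written for this README, not part of Paperdoc.

```php
<?php
declare(strict_types=1);

/**
 * Compare a PHP version and a list of loaded extensions against the
 * requirements table. Hypothetical helper, not shipped with Paperdoc.
 *
 * @param string[] $loadedExtensions lowercase extension names
 * @return string[] human-readable list of unmet requirements
 */
function missingRequirements(string $phpVersion, array $loadedExtensions): array
{
    $missing = [];
    if (version_compare($phpVersion, '8.2.0', '<')) {
        $missing[] = 'php >= 8.2 (found ' . $phpVersion . ')';
    }
    foreach (['dom', 'mbstring', 'zip', 'zlib'] as $extension) {
        if (!in_array($extension, $loadedExtensions, true)) {
            $missing[] = 'ext-' . $extension;
        }
    }
    return $missing;
}

// Check the interpreter that is running this script.
$missing = missingRequirements(
    PHP_VERSION,
    array_map('strtolower', get_loaded_extensions())
);
echo $missing === []
    ? "All requirements met\n"
    : 'Missing: ' . implode(', ', $missing) . "\n";
```

Composer performs the same check at install time; the script is only useful for diagnosing a server ahead of deployment.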

Optional (Laravel): illuminate/support ^11.0 or ^12.0 for the Facade and ServiceProvider.

Thumbnails — required system dependencies

For correct, high-quality thumbnails (fonts and layout):

LibreOffice is required for DOCX, XLSX, PPTX, and CSV (headless: libreoffice or soffice in PATH).
Imagick or Ghostscript is required for PDF thumbnails with correct rendering (otherwise a fallback text/image preview is used).

Without these, thumbnails fall back to native PHP previews (limited fonts, no real layout).
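
To see which rendering path a machine would take, you can probe for the tools yourself. This is a sketch under stated assumptions: `findOnPath()` is a hypothetical helper, the binary names (`libreoffice`, `soffice`, `gs`) are the usual ones on Unix-like systems, and Paperdoc's own detection logic may differ.

```php
<?php
declare(strict_types=1);

/**
 * Return the first executable from $names found in a PATH-style string,
 * or null if none is present. Hypothetical helper, not part of Paperdoc.
 *
 * @param string[] $names candidate binary names
 */
function findOnPath(array $names, string $path): ?string
{
    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }
        foreach ($names as $name) {
            $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;
            if (is_file($candidate) && is_executable($candidate)) {
                return $candidate;
            }
        }
    }
    return null;
}

$path = (string) getenv('PATH');
$report = [
    'LibreOffice (Office/CSV thumbnails)' => findOnPath(['libreoffice', 'soffice'], $path) !== null,
    'Imagick extension (PDF thumbnails)'  => extension_loaded('imagick'),
    'Ghostscript (PDF thumbnails)'        => findOnPath(['gs'], $path) !== null,
];
foreach ($report as $label => $available) {
    printf("%-40s %s\n", $label, $available ? 'found' : 'missing');
}
```

On Windows the executables carry different names (for example an `.exe` suffix), so the name lists would need adjusting.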

Installation

Install the package via Composer:

composer require paperdoc-dev/paperdoc-lib

Laravel auto-discovery

The PaperdocServiceProvider and Paperdoc facade are registered automatically. No manual registration needed.

Quick Start

Standalone PHP

Create a document, add content, and save to a file:

use Paperdoc\Support\DocumentManager;

$manager = new DocumentManager();

// Create a PDF document
$doc = $manager->create('pdf', 'My Report');
$doc->addSection()
    ->addParagraph('Hello, Paperdoc!')
    ->setBold(true);

$manager->save($doc, 'output/report.pdf');

Laravel (Facade)

Use the Paperdoc facade to create, parse, convert, or render:

use Paperdoc\Facades\Paperdoc;

// Create and save
$doc = Paperdoc::create('docx', 'Invoice #1042');
$doc->addSection()->addParagraph('Amount due: $500');
Paperdoc::save($doc, storage_path('invoices/1042.docx'));

// Parse an existing file
$doc = Paperdoc::open('uploads/report.xlsx');

// Convert directly (file to file)
Paperdoc::convert('report.docx', 'report.pdf', 'pdf');

// Render document to string (e.g. HTML)
$html = Paperdoc::renderAs($doc, 'html');

// Batch open multiple files
$docs = Paperdoc::openBatch([
    'file1.pdf',
    'file2.docx',
    'file3.xlsx',
]);

Conversion & rendering

Convert a file to another format in one call, or render a document to a string (e.g. HTML or Markdown) without writing to disk.

File-to-file conversion

// Standalone
DocumentManager::convert('input.docx', 'output.pdf', 'pdf');

// Laravel
Paperdoc::convert('reports/data.xlsx', storage_path('reports/data.pdf'), 'pdf');

Render to string

Useful for web preview, APIs, or further processing:

$doc = Paperdoc::open('document.pdf');
$html = Paperdoc::renderAs($doc, 'html');
$markdown = Paperdoc::renderAs($doc, 'md');

Supported Formats

Modern formats can be both parsed and generated. Legacy Office formats (DOC, XLS, PPT) are parse-only, so they can be a conversion source but not a target.

Format	Parse	Render / Generate
PDF	✓	✓
HTML	✓	✓
DOCX	✓	✓
XLSX	✓	✓
PPTX	✓	✓
CSV	✓	✓
Markdown	✓	✓
DOC	✓	—
XLS	✓	—
PPT	✓	—
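
The table doubles as a rule for which conversions are possible: the source must be parseable and the target must be renderable. A small sketch of that rule follows. The `FORMAT_SUPPORT` constant and both functions are hypothetical, and the lowercase format keys for the legacy formats are an assumption.

```php
<?php
declare(strict_types=1);

// Capability matrix copied from the table above: [canParse, canRender].
const FORMAT_SUPPORT = [
    'pdf'  => [true, true],
    'html' => [true, true],
    'docx' => [true, true],
    'xlsx' => [true, true],
    'pptx' => [true, true],
    'csv'  => [true, true],
    'md'   => [true, true],
    'doc'  => [true, false],
    'xls'  => [true, false],
    'ppt'  => [true, false],
];

/** Guess a format key from a file name's extension. */
function formatFromPath(string $path): string
{
    return strtolower(pathinfo($path, PATHINFO_EXTENSION));
}

/** A conversion is possible when the source parses and the target renders. */
function canConvert(string $from, string $to): bool
{
    $canParse  = FORMAT_SUPPORT[strtolower($from)][0] ?? false;
    $canRender = FORMAT_SUPPORT[strtolower($to)][1] ?? false;
    return $canParse && $canRender;
}

var_dump(canConvert(formatFromPath('legacy/report.doc'), 'pdf')); // legacy source, modern target
var_dump(canConvert('pdf', 'doc'));                               // legacy target is not renderable
```

A guard like this lets a batch job reject unsupported targets before any file is opened.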

Document Model

Every format uses the same in-memory structure, so you can parse a PDF and render to DOCX without format-specific code:

Document
└── Section[]
    ├── Paragraph (TextRun[], bold, italic, font…)
    ├── Table → TableRow[] → TableCell[]
    ├── Image
    └── PageBreak

Styles live in Document/Style/ and can be applied at paragraph, run, or section level.
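
To make the "one model, many formats" idea concrete, here is a self-contained sketch. The `Sketch*` classes are simplified, hypothetical stand-ins for the tree above, not Paperdoc's real classes, and the two functions only illustrate how different renderers can walk the same structure.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins for the model above; these are NOT Paperdoc's classes.
final class SketchRun
{
    public function __construct(public readonly string $text, public readonly bool $bold = false) {}
}

final class SketchParagraph
{
    /** @param SketchRun[] $runs */
    public function __construct(public readonly array $runs) {}
}

final class SketchTable
{
    /** @param string[][] $rows first row is treated as the header in Markdown */
    public function __construct(public readonly array $rows) {}
}

final class SketchDocument
{
    /** @param array<int, array<int, SketchParagraph|SketchTable>> $sections */
    public function __construct(public readonly string $title, public readonly array $sections) {}
}

/** Walk the tree once and emit Markdown. */
function sketchToMarkdown(SketchDocument $doc): string
{
    $blocks = ['# ' . $doc->title];
    foreach ($doc->sections as $section) {
        foreach ($section as $element) {
            if ($element instanceof SketchParagraph) {
                $blocks[] = implode('', array_map(
                    fn (SketchRun $run): string => $run->bold ? "**{$run->text}**" : $run->text,
                    $element->runs
                ));
            } elseif ($element instanceof SketchTable) {
                $lines = [];
                foreach ($element->rows as $i => $row) {
                    $lines[] = '| ' . implode(' | ', $row) . ' |';
                    if ($i === 0) {
                        $lines[] = '|' . str_repeat(' --- |', count($row));
                    }
                }
                $blocks[] = implode("\n", $lines);
            }
        }
    }
    return implode("\n\n", $blocks) . "\n";
}

/** Walk the same tree and emit HTML instead. */
function sketchToHtml(SketchDocument $doc): string
{
    $esc = fn (string $s): string => htmlspecialchars($s, ENT_QUOTES);
    $html = '<h1>' . $esc($doc->title) . '</h1>';
    foreach ($doc->sections as $section) {
        $html .= '<section>';
        foreach ($section as $element) {
            if ($element instanceof SketchParagraph) {
                $html .= '<p>';
                foreach ($element->runs as $run) {
                    $html .= $run->bold ? '<strong>' . $esc($run->text) . '</strong>' : $esc($run->text);
                }
                $html .= '</p>';
            } elseif ($element instanceof SketchTable) {
                $html .= '<table>';
                foreach ($element->rows as $row) {
                    $html .= '<tr><td>' . implode('</td><td>', array_map($esc, $row)) . '</td></tr>';
                }
                $html .= '</table>';
            }
        }
        $html .= '</section>';
    }
    return $html;
}

$doc = new SketchDocument('Report', [[
    new SketchParagraph([new SketchRun('Total: '), new SketchRun('42', bold: true)]),
    new SketchTable([['Item', 'Qty'], ['Paper', '3']]),
]]);
echo sketchToMarkdown($doc);
echo sketchToHtml($doc), "\n";
```

Because neither renderer knows where the tree came from, a parser for any input format can feed either of them. That is the property the unified model is meant to provide.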

Generate thumbnails

Paperdoc can generate thumbnails from documents or from any supported file. Thumbnails are computed on the fly; nothing is written to disk unless you save the result yourself. For in-memory documents, the thumbnail reflects the first image in the document, or falls back to the first page of the source file if the document was opened from disk.

LibreOffice is required for DOCX, XLSX, PPTX, and CSV to get thumbnails with correct fonts and layout. For PDF, Imagick or Ghostscript is required for proper rendering. See Requirements.

From a document (in-memory)

Use getThumbnail() for an array of image data (data, mimeType, width, height; data is base64-encoded, as the base64_decode() call below shows) or getThumbnailDataUri() for a data:image/…;base64,… string suitable for <img src="…">:

$doc = Paperdoc::open('report.pdf');

// Array: ['data' => '…', 'mimeType' => 'image/jpeg', 'width' => 300, 'height' => 200]
$thumb = $doc->getThumbnail(300, 200, 85);
if ($thumb) {
    file_put_contents('preview.jpg', base64_decode($thumb['data']));
}

// Data URI — use directly in HTML
$dataUri = $doc->getThumbnailDataUri(300, 200);
// <img src="<?php echo $dataUri; ?>" alt="Preview">
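
The two return shapes are closely related: a data URI is the MIME type and the base64 payload joined in a fixed pattern. The sketch below shows that relationship with hypothetical helpers (`thumbnailToDataUri()` and `thumbnailToImgTag()` are not Paperdoc functions). It assumes the `data` key is already base64-encoded, as the example above implies.

```php
<?php
declare(strict_types=1);

/**
 * Build a data URI from a thumbnail array shaped like the one shown above.
 * Assumes 'data' is already base64-encoded. Hypothetical helper.
 */
function thumbnailToDataUri(array $thumb): string
{
    foreach (['data', 'mimeType'] as $key) {
        if (!isset($thumb[$key]) || !is_string($thumb[$key])) {
            throw new InvalidArgumentException("Thumbnail array is missing '$key'");
        }
    }
    return 'data:' . $thumb['mimeType'] . ';base64,' . $thumb['data'];
}

/** Wrap the data URI in an <img> tag, escaping every attribute value. */
function thumbnailToImgTag(array $thumb, string $alt): string
{
    return sprintf(
        '<img src="%s" width="%d" height="%d" alt="%s">',
        htmlspecialchars(thumbnailToDataUri($thumb), ENT_QUOTES),
        (int) ($thumb['width'] ?? 0),
        (int) ($thumb['height'] ?? 0),
        htmlspecialchars($alt, ENT_QUOTES)
    );
}

// Stand-in payload; a real thumbnail would carry encoded JPEG bytes.
$sample = ['data' => base64_encode('abc'), 'mimeType' => 'image/jpeg', 'width' => 300, 'height' => 200];
echo thumbnailToImgTag($sample, 'Preview'), "\n";
```

Escaping the attribute values matters when the alt text comes from user input, such as an uploaded file name.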

Via DocumentManager or Facade

Same API without holding the document instance:

// Standalone
$thumb = DocumentManager::thumbnail($document, 300, 300, 85);
$dataUri = DocumentManager::thumbnailDataUri($document, 300, 300);

// Laravel
$dataUri = Paperdoc::thumbnailDataUri($doc, 200, 200);

From a file path (any format)

ThumbnailGenerator can create a thumbnail directly from a file. Images are handled by GD. PDFs are rendered with Imagick or Ghostscript (required for correct rendering). Office and CSV files are converted to PDF by headless LibreOffice (required), and the first page is rendered. Without LibreOffice, Imagick, or Ghostscript, a fallback text or grid preview is used.

use Paperdoc\Support\ThumbnailGenerator;

// Array or data URI
$thumb = ThumbnailGenerator::fromFile('document.docx', 300, 300, 85);
$dataUri = ThumbnailGenerator::fromFileDataUri('report.pdf', 400, 300, 90, 0); // page 0 = first page

Defaults: ThumbnailGenerator::DEFAULT_WIDTH / DEFAULT_HEIGHT = 300, DEFAULT_QUALITY = 85. For PDF and Office files, the optional $page argument (0-based) selects which page to thumbnail.
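
The width and height arguments read as a bounding box rather than an exact size. The arithmetic for fitting an image inside such a box is sketched below. It assumes the aspect ratio is preserved and smaller images are not upscaled; neither behavior is confirmed by this README. `fitWithin()` is a hypothetical helper, not a Paperdoc function.

```php
<?php
declare(strict_types=1);

/**
 * Scale a source size to fit inside a bounding box, keeping the aspect ratio.
 * Assumption: images smaller than the box are left at their original size.
 *
 * @return array{0: int, 1: int} [width, height]
 */
function fitWithin(int $srcWidth, int $srcHeight, int $maxWidth = 300, int $maxHeight = 300): array
{
    if ($srcWidth <= 0 || $srcHeight <= 0 || $maxWidth <= 0 || $maxHeight <= 0) {
        throw new InvalidArgumentException('All dimensions must be positive');
    }
    // The tighter of the two ratios wins; 1.0 caps the scale so we never upscale.
    $scale = min($maxWidth / $srcWidth, $maxHeight / $srcHeight, 1.0);
    return [
        max(1, (int) round($srcWidth * $scale)),
        max(1, (int) round($srcHeight * $scale)),
    ];
}

[$width, $height] = fitWithin(1200, 800, 300, 300);
echo "{$width}x{$height}\n"; // prints "300x200"
```

A landscape source is limited by the box width and a portrait source by the box height, which is why the returned size is often smaller than the box on one axis.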

Opening options (OCR, LLM)

When opening a file with open() or openBatch(), you can pass options to enable OCR and/or LLM augmentation:

// Force OCR on a scanned PDF
$doc = Paperdoc::open('scan.pdf', ['ocr' => true]);

// Skip OCR even if auto-detect would run it
$doc = Paperdoc::open('mixed.pdf', ['ocr' => false]);

// Enable LLM augmentation (summaries, structure, correction)
$doc = Paperdoc::open('scan.pdf', ['ocr' => true, 'llm' => true]);

// OCR language (e.g. 'fra', 'eng')
$doc = Paperdoc::open('document.pdf', ['language' => 'fra']);

Options: ocr (true / false / 'auto'), llm (bool), language (OCR language code). Defaults come from config/paperdoc.php.
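
The merge of per-call options over configured defaults can be pictured as follows. This is a sketch only: `normalizeOpenOptions()` is hypothetical, and the default values used here (`'auto'`, `false`, `'eng'`) are placeholders, not the values shipped in config/paperdoc.php.

```php
<?php
declare(strict_types=1);

/**
 * Merge caller options over defaults and validate them against the
 * documented option types. Hypothetical helper with placeholder defaults.
 */
function normalizeOpenOptions(
    array $options,
    array $defaults = ['ocr' => 'auto', 'llm' => false, 'language' => 'eng']
): array {
    $unknown = array_diff(array_keys($options), array_keys($defaults));
    if ($unknown !== []) {
        throw new InvalidArgumentException('Unknown option(s): ' . implode(', ', $unknown));
    }

    $merged = array_merge($defaults, $options);

    if (!in_array($merged['ocr'], [true, false, 'auto'], true)) {
        throw new InvalidArgumentException("'ocr' must be true, false, or 'auto'");
    }
    if (!is_bool($merged['llm'])) {
        throw new InvalidArgumentException("'llm' must be a boolean");
    }
    if (!is_string($merged['language']) || $merged['language'] === '') {
        throw new InvalidArgumentException("'language' must be a non-empty string");
    }
    return $merged;
}

print_r(normalizeOpenOptions(['ocr' => true, 'language' => 'fra']));
```

Rejecting unknown keys up front catches typos such as `'lang'` that would otherwise be ignored without warning.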

OCR

Paperdoc uses Tesseract for text extraction from scanned documents and images. OCR can run automatically when opening a PDF (auto-detect) or be forced/skipped via open($path, ['ocr' => true|false]).

Post-processing — character substitution (0→O, rn→m), optional spell correction, n-gram correction, pattern recognition (dates, amounts), structure detection (headings, lists).
Parallel processing — multiple pages processed in parallel (pool size configurable).
Laravel Artisan — paperdoc:build-dictionary to build a spell-check dictionary; paperdoc:train-ngram to train an n-gram model for better correction.

Config: config/paperdoc.php → ocr (enabled, driver, language, pool_size, tesseract binary) and ocr.post_processing.
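
The character-substitution step can be illustrated in a few lines. The two rules (0→O, rn→m) come from the list above. Accepting a fix only when it produces a dictionary word is an assumption made for this sketch, and `correctOcrToken()` is a hypothetical helper rather than Paperdoc's implementation.

```php
<?php
declare(strict_types=1);

/**
 * Try common OCR confusion fixes on a token and keep a fix only if the
 * result is a dictionary word. Hypothetical helper; the dictionary gate
 * is an assumption of this sketch.
 *
 * @param string[] $dictionary known-good words
 */
function correctOcrToken(string $token, array $dictionary): string
{
    $known = array_flip(array_map('strtolower', $dictionary));
    if (isset($known[strtolower($token)])) {
        return $token; // already a word: "modern" must not become "modem"
    }

    // Each rule is [misread, intended].
    $rules = [['0', 'O'], ['rn', 'm']];

    // Build every combination of applying or skipping each rule.
    $candidates = [$token];
    foreach ($rules as [$from, $to]) {
        foreach ($candidates as $candidate) {
            $candidates[] = str_replace($from, $to, $candidate);
        }
    }

    foreach ($candidates as $candidate) {
        if (isset($known[strtolower($candidate)])) {
            return $candidate;
        }
    }
    return $token; // no rule produced a known word; leave the token alone
}

$dictionary = ['document', 'modern', 'modem', 'computer'];
foreach (['docurnent', 'modern', 'C0MPUTER', '2024'] as $token) {
    echo $token, ' -> ', correctOcrToken($token, $dictionary), "\n";
}
```

The dictionary check is what keeps the rules from damaging correct text: without it, every "rn" would be rewritten and numbers such as 2024 would lose their zeros.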

LLM / AI

LLM augmentation improves OCR output and enables summaries, translations, and structured extraction. Paperdoc uses Neuron AI to connect to multiple providers.

LlmAugmenter — post-process OCR text (correction, structure).
PaperdocAgent — document Q&A, summaries, and structured data extraction.
Providers — OpenAI, Anthropic, Gemini, Ollama (and others supported by Neuron AI).

Enable via open($path, ['llm' => true]) or in config: llm.enabled, llm.provider, llm.model, llm.api_key (or PAPERDOC_LLM_* env vars).

Laravel

Use the Paperdoc facade for the same API as DocumentManager: Paperdoc::create(), Paperdoc::open(), Paperdoc::save(), Paperdoc::convert(), Paperdoc::renderAs(), Paperdoc::openBatch().

Artisan commands (when the package is installed in a Laravel app):

php artisan paperdoc:build-dictionary <path> — build a dictionary from text files for OCR spell correction.
php artisan paperdoc:train-ngram <path> — train an n-gram model from text files for OCR post-processing.

Configuration

Laravel only. Publish the config file:

php artisan vendor:publish --tag=paperdoc-config

This creates config/paperdoc.php. Main options:

Default format — default output when creating documents
Typography — fonts and sizes applied to documents
Storage paths — where to read/write files
OCR — Tesseract and post-processing settings
LLM — AI extraction and augmentation (Neuron AI)
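
For orientation, the OCR and LLM parts of the file might be shaped roughly like this. The key names are the ones mentioned in the OCR and LLM sections of this README. The nesting, every value, and the omission of the other groups are assumptions, so treat the published file as authoritative.

```php
<?php
// config/paperdoc.php: hypothetical sketch of the OCR and LLM groups only.
// Key names are taken from this README; all values are placeholders.
return [
    'ocr' => [
        'enabled'   => true,
        'driver'    => 'tesseract', // placeholder value
        'language'  => 'eng',       // OCR language code, e.g. 'eng' or 'fra'
        'pool_size' => 4,           // pages processed in parallel (placeholder)
        'post_processing' => [
            // Post-processing switches live here; their key names
            // are not listed in this README.
        ],
    ],
    'llm' => [
        'enabled'  => false,
        'provider' => 'openai',     // placeholder; any provider supported by Neuron AI
        'model'    => null,         // set to a model offered by your provider
        'api_key'  => null,         // usually supplied through a PAPERDOC_LLM_* env var
    ],
];
```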

API reference

Main entry point: DocumentManager (standalone) or Paperdoc facade (Laravel).

Method	Description
create($format, $title = '')	Create a new empty document (format: pdf, docx, html, csv, etc.).
open($filename, $options = [])	Parse a file. Options: `ocr`, `llm`, `language`.
save($document, $path)	Write document to a file.
renderAs($document, $format)	Render document to string (e.g. 'html', 'md').
convert($source, $destination, $format)	Open source file and save as destination in given format.
openBatch($filenames, $options = [])	Open multiple files; returns array of documents. Same options as `open()`.
thumbnail($document, $maxWidth, $maxHeight, $quality)	Get thumbnail array (data, mimeType, width, height) for a document. See Generate thumbnails.
thumbnailDataUri($document, $maxWidth, $maxHeight, $quality)	Get thumbnail as a `data:image/…;base64,…` string for `<img src="…">`.

Testing

Run the test suite from the library directory:

composer test
# or
./vendor/bin/phpunit

Integration tests are in tests/Integration/, unit tests in tests/Unit/.

Architecture

High-level layout of the package source:

src/
├── Concerns/          # Shared traits
├── Console/           # Artisan commands
├── Contracts/         # DocumentInterface, ParserInterface…
├── Document/          # Core model (Document, Section, Paragraph…)
├── Enum/              # Format enums
├── Facades/           # Laravel Facade
├── Factory/           # Document/Parser factories
├── Llm/               # AI/LLM integration (Neuron AI)
├── Ocr/               # OCR integration
├── Parsers/           # Format-specific parsers
├── Renderers/         # Format-specific renderers
├── Support/           # DocumentManager and helpers
└── PaperdocServiceProvider.php