Paperdoc Library

A zero-dependency PHP library for generating, parsing, and converting documents — PDF, HTML, CSV, DOCX, XLSX, PPTX, Markdown and more. One API for create, read, and convert.

Features

  • Generate documents from scratch (PDF, HTML, CSV, DOCX, XLSX, PPTX, Markdown)
  • Parse existing files into a unified in-memory model
  • Convert between any supported format in one call
  • Batch processing — open and process multiple files at once
  • Laravel integration — ServiceProvider, Facade, and Artisan commands
  • AI-powered — OCR (Tesseract) and LLM extraction via Neuron AI
  • No native binary dependencies for core features (pure PHP); OCR additionally needs the Tesseract binary

Requirements

Paperdoc requires PHP 8.2+ and the following extensions (no external PHP packages):

Dependency    Version
PHP           ^8.2
ext-dom       *
ext-mbstring  *
ext-zip       *
ext-zlib      *

Optional (Laravel): illuminate/support ^11.0 or ^12.0 for the Facade and ServiceProvider.
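
A quick preflight check can confirm these requirements before installing. The following is a minimal sketch in plain PHP, not part of Paperdoc; the function name paperdocMissingRequirements and its injectable parameters are hypothetical and exist only to make the check testable.

```php
<?php
declare(strict_types=1);

/**
 * Returns the unmet requirements (empty when everything is satisfied).
 * The minimum version and extension names come from the table above;
 * the function itself is a hypothetical helper, not a Paperdoc API.
 */
function paperdocMissingRequirements(
    string $phpVersion = PHP_VERSION,
    ?callable $isLoaded = null
): array {
    $isLoaded ??= 'extension_loaded';
    $missing = [];

    // Only the lower bound of ^8.2 is checked here.
    if (version_compare($phpVersion, '8.2.0', '<')) {
        $missing[] = "PHP >= 8.2 (found {$phpVersion})";
    }
    foreach (['dom', 'mbstring', 'zip', 'zlib'] as $extension) {
        if (!$isLoaded($extension)) {
            $missing[] = "ext-{$extension}";
        }
    }
    return $missing;
}

foreach (paperdocMissingRequirements() as $item) {
    echo "Missing requirement: {$item}", PHP_EOL;
}
```

Run with no arguments, it inspects the current PHP runtime and prints nothing when all requirements are met.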

Installation

Install the package via Composer:

composer require paperdoc-dev/paperdoc-lib

Laravel auto-discovery

The PaperdocServiceProvider and Paperdoc facade are registered automatically. No manual registration needed.

Quick Start

Standalone PHP

Create a document, add content, and save to a file:

require __DIR__ . '/vendor/autoload.php'; // Composer autoloader

use Paperdoc\Support\DocumentManager;

$manager = new DocumentManager();

// Create a PDF document
$doc = $manager->create('pdf', 'My Report');
$doc->addSection()
    ->addParagraph('Hello, Paperdoc!')
    ->setBold(true);

$manager->save($doc, 'output/report.pdf');

Laravel (Facade)

Use the Paperdoc facade to create, parse, convert, or render:

use Paperdoc\Facades\Paperdoc;

// Create and save
$doc = Paperdoc::create('docx', 'Invoice #1042');
$doc->addSection()->addParagraph('Amount due: $500');
Paperdoc::save($doc, storage_path('invoices/1042.docx'));

// Parse an existing file
$doc = Paperdoc::open('uploads/report.xlsx');

// Convert directly (file to file)
Paperdoc::convert('report.docx', 'report.pdf', 'pdf');

// Render document to string (e.g. HTML)
$html = Paperdoc::renderAs($doc, 'html');

// Batch open multiple files
$docs = Paperdoc::openBatch([
    'file1.pdf',
    'file2.docx',
    'file3.xlsx',
]);

Conversion & rendering

Convert a file to another format in one call, or render a document to a string (e.g. HTML or Markdown) without writing to disk.

File-to-file conversion

// Standalone (instance call, matching the Quick Start)
$manager = new DocumentManager();
$manager->convert('input.docx', 'output.pdf', 'pdf');

// Laravel
Paperdoc::convert('reports/data.xlsx', storage_path('reports/data.pdf'), 'pdf');
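
To convert several files in one pass, a small wrapper can derive each destination path and delegate to convert(). This is a sketch under stated assumptions: convertAll is a hypothetical helper, not a Paperdoc function, and it assumes the target format string doubles as the file extension. The converter is passed in as a callable, so the block runs without the library installed.

```php
<?php
declare(strict_types=1);

/**
 * Converts each source file to $format, writing into $outputDir.
 * $convert receives (source, destination, format), the argument order
 * used by convert() in this document. Returns a source => destination map.
 */
function convertAll(array $sources, string $outputDir, string $format, callable $convert): array
{
    $written = [];
    foreach ($sources as $source) {
        // Keep the base name and swap the extension for the target format.
        $name = pathinfo($source, PATHINFO_FILENAME);
        $destination = rtrim($outputDir, '/') . '/' . $name . '.' . $format;
        $convert($source, $destination, $format);
        $written[$source] = $destination;
    }
    return $written;
}

// Stand-in converter that only records the calls it receives.
$log = [];
$result = convertAll(
    ['reports/q1.docx', 'reports/data.xlsx'],
    'out/',
    'pdf',
    function (string $src, string $dst, string $fmt) use (&$log): void {
        $log[] = "{$src} -> {$dst} ({$fmt})";
    }
);
print_r($result);
```

In a Laravel app the stand-in could presumably be replaced with fn ($s, $d, $f) => Paperdoc::convert($s, $d, $f).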

Render to string

Useful for web preview, APIs, or further processing:

$doc = Paperdoc::open('document.pdf');
$html = Paperdoc::renderAs($doc, 'html');
$markdown = Paperdoc::renderAs($doc, 'md');

Supported Formats

Parse existing files or generate new ones for each format. Legacy Office formats (DOC, XLS, PPT) are parse-only.

Format    Parse  Render / Generate
PDF       yes    yes
HTML      yes    yes
DOCX      yes    yes
XLSX      yes    yes
PPTX      yes    yes
CSV       yes    yes
Markdown  yes    yes
DOC       yes    no
XLS       yes    no
PPT       yes    no
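
The parse-only rule for legacy formats can be checked before attempting a conversion. The sketch below is plain PHP, not a Paperdoc API: the FORMAT_CAPABILITIES map and canConvert function are hypothetical, the map is transcribed from the description above, and the format keys (including 'md' for Markdown) are assumed to match file extensions.

```php
<?php
declare(strict_types=1);

// Hypothetical capability map: legacy Office formats are parse-only.
const FORMAT_CAPABILITIES = [
    'pdf'  => ['parse' => true, 'render' => true],
    'html' => ['parse' => true, 'render' => true],
    'docx' => ['parse' => true, 'render' => true],
    'xlsx' => ['parse' => true, 'render' => true],
    'pptx' => ['parse' => true, 'render' => true],
    'csv'  => ['parse' => true, 'render' => true],
    'md'   => ['parse' => true, 'render' => true],
    'doc'  => ['parse' => true, 'render' => false],
    'xls'  => ['parse' => true, 'render' => false],
    'ppt'  => ['parse' => true, 'render' => false],
];

/** True when the source extension can be parsed and the target can be rendered. */
function canConvert(string $sourcePath, string $targetFormat): bool
{
    $sourceFormat = strtolower(pathinfo($sourcePath, PATHINFO_EXTENSION));
    $targetFormat = strtolower($targetFormat);

    // Unknown formats fall through to false on either side.
    return (FORMAT_CAPABILITIES[$sourceFormat]['parse'] ?? false)
        && (FORMAT_CAPABILITIES[$targetFormat]['render'] ?? false);
}

var_dump(canConvert('archive/old.doc', 'pdf'));  // legacy source, modern target
var_dump(canConvert('report.pdf', 'doc'));       // legacy target is not renderable
```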

Document Model

Every format uses the same in-memory structure, so you can parse a PDF and render to DOCX without format-specific code:

Document
└── Section[]
    ├── Paragraph (TextRun[], bold, italic, font…)
    ├── Table → TableRow[] → TableCell[]
    ├── Image
    └── PageBreak

Styles live in Document/Style/ and can be applied at paragraph, run, or section level.
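
To illustrate why a shared model removes format-specific code, the sketch below builds a tiny tree and walks it with two independent renderers. The classes are simplified, hypothetical stand-ins written for this example; they are not Paperdoc's real classes and cover only paragraphs and tables.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins for the model above, not Paperdoc's classes.
final class Paragraph
{
    public function __construct(public string $text, public bool $bold = false) {}
}

final class Table
{
    /** @param list<list<string>> $rows each row is a list of cell texts */
    public function __construct(public array $rows) {}
}

final class Section
{
    /** @param list<Paragraph|Table> $elements */
    public function __construct(public array $elements = []) {}
}

final class Document
{
    /** @param list<Section> $sections */
    public function __construct(public string $title, public array $sections = []) {}
}

/** Walks the tree and emits Markdown (first table row is treated as the header). */
function renderMarkdown(Document $doc): string
{
    $lines = ['# ' . $doc->title, ''];
    foreach ($doc->sections as $section) {
        foreach ($section->elements as $element) {
            if ($element instanceof Paragraph) {
                $lines[] = $element->bold ? "**{$element->text}**" : $element->text;
            } elseif ($element instanceof Table) {
                foreach ($element->rows as $i => $row) {
                    $lines[] = '| ' . implode(' | ', $row) . ' |';
                    if ($i === 0) {
                        $lines[] = '|' . str_repeat(' --- |', count($row));
                    }
                }
            }
            $lines[] = '';
        }
    }
    return implode("\n", $lines);
}

/** Walks the same tree and emits HTML. */
function renderHtml(Document $doc): string
{
    $html = '<h1>' . htmlspecialchars($doc->title) . '</h1>';
    foreach ($doc->sections as $section) {
        $html .= '<section>';
        foreach ($section->elements as $element) {
            if ($element instanceof Paragraph) {
                $text = htmlspecialchars($element->text);
                $html .= '<p>' . ($element->bold ? "<strong>{$text}</strong>" : $text) . '</p>';
            } elseif ($element instanceof Table) {
                $html .= '<table>';
                foreach ($element->rows as $row) {
                    $cells = array_map(
                        fn (string $c): string => '<td>' . htmlspecialchars($c) . '</td>',
                        $row
                    );
                    $html .= '<tr>' . implode('', $cells) . '</tr>';
                }
                $html .= '</table>';
            }
        }
        $html .= '</section>';
    }
    return $html;
}

$doc = new Document('Report', [
    new Section([
        new Paragraph('Totals', bold: true),
        new Table([['Item', 'Qty'], ['Pens', '3']]),
    ]),
]);

echo renderMarkdown($doc), PHP_EOL;
echo renderHtml($doc), PHP_EOL;
```

Adding a format in this design means adding one more walker over the same tree; the parsers and the other renderers stay untouched.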

Opening options (OCR, LLM)

When opening a file with open() or openBatch(), you can pass options to enable OCR and/or LLM augmentation:

// Force OCR on a scanned PDF
$doc = Paperdoc::open('scan.pdf', ['ocr' => true]);

// Skip OCR even if auto-detect would run it
$doc = Paperdoc::open('mixed.pdf', ['ocr' => false]);

// Enable LLM augmentation (summaries, structure, correction)
$doc = Paperdoc::open('scan.pdf', ['ocr' => true, 'llm' => true]);

// OCR language (e.g. 'fra', 'eng')
$doc = Paperdoc::open('document.pdf', ['language' => 'fra']);

Options: ocr (true / false / 'auto'), llm (bool), language (OCR language code). Defaults come from config/paperdoc.php.
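
One plausible way per-call options combine with configured defaults is a merge followed by validation. The sketch below is an illustration only: resolveOpenOptions is a hypothetical function, and the default values shown are placeholders rather than the values shipped in config/paperdoc.php.

```php
<?php
declare(strict_types=1);

/**
 * Merges per-call options over defaults and validates the result.
 * Hypothetical helper; the defaults are placeholders, not Paperdoc's.
 */
function resolveOpenOptions(
    array $options,
    array $defaults = ['ocr' => 'auto', 'llm' => false, 'language' => 'eng']
): array {
    // Reject misspelled keys early instead of silently ignoring them.
    $unknown = array_diff(array_keys($options), array_keys($defaults));
    if ($unknown !== []) {
        throw new InvalidArgumentException('Unknown option(s): ' . implode(', ', $unknown));
    }

    // Per-call values win over defaults.
    $resolved = array_merge($defaults, $options);

    if (!in_array($resolved['ocr'], [true, false, 'auto'], true)) {
        throw new InvalidArgumentException("ocr must be true, false, or 'auto'");
    }
    if (!is_bool($resolved['llm'])) {
        throw new InvalidArgumentException('llm must be a boolean');
    }
    if (!is_string($resolved['language']) || $resolved['language'] === '') {
        throw new InvalidArgumentException('language must be a non-empty string');
    }
    return $resolved;
}

var_dump(resolveOpenOptions(['ocr' => true, 'language' => 'fra']));
```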

OCR

Paperdoc uses Tesseract for text extraction from scanned documents and images. OCR can run automatically when opening a PDF (auto-detect) or be forced/skipped via open($path, ['ocr' => true|false]).

  • Post-processing — character substitution (0→O, rn→m), optional spell correction, n-gram correction, pattern recognition (dates, amounts), structure detection (headings, lists).
  • Parallel processing — multiple pages processed in parallel (pool size configurable).
  • Laravel Artisan: paperdoc:build-dictionary builds a spell-check dictionary; paperdoc:train-ngram trains an n-gram model for better correction.

Config: config/paperdoc.php, under ocr (enabled, driver, language, pool_size, tesseract binary) and ocr.post_processing.
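
The character-substitution step can be pictured as a dictionary-guided correction: a token is changed only if one substitution turns an unknown word into a known one. The sketch below is an assumption about how such a step might work, not Paperdoc's implementation; the function names and the toy word list are hypothetical, and punctuation attached to words is not handled.

```php
<?php
declare(strict_types=1);

/** Returns variants of $token with exactly one occurrence of a pair replaced. */
function substitutionCandidates(string $token, array $pairs): array
{
    $candidates = [];
    foreach ($pairs as [$from, $to]) {
        $offset = 0;
        // Replace one occurrence at a time so "rnodern" can become
        // "modern" without the trailing "rn" also turning into "m".
        while (($pos = strpos($token, $from, $offset)) !== false) {
            $candidates[] = substr_replace($token, $to, $pos, strlen($from));
            $offset = $pos + 1;
        }
    }
    return $candidates;
}

/** Corrects a token only when it is unknown and a candidate is a known word. */
function correctOcrToken(string $token, array $dictionary): string
{
    $known = static fn (string $w): bool => isset($dictionary[strtolower($w)]);
    if ($known($token)) {
        return $token;
    }
    // Substitution pairs from the examples in the text: 0→O and rn→m.
    foreach (substitutionCandidates($token, [['0', 'O'], ['rn', 'm']]) as $candidate) {
        if ($known($candidate)) {
            return $candidate;
        }
    }
    return $token;
}

function correctOcrText(string $text, array $dictionary): string
{
    // Split on whitespace but keep the separators so spacing is preserved.
    $parts = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    return implode('', array_map(
        fn (string $part): string => correctOcrToken($part, $dictionary),
        $parts
    ));
}

// Toy dictionary keyed by lowercase word.
$dictionary = array_flip(['modern', 'october', 'document']);
echo correctOcrText('rnodern docurnent 0ctober 2024', $dictionary), PHP_EOL;
// prints "modern document October 2024"
```

Gating every substitution on a dictionary hit is what keeps legitimate digits such as 2024 and correct words ending in "rn" untouched.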

LLM / AI

LLM augmentation improves OCR output and enables summaries, translations, and structured extraction. Paperdoc uses Neuron AI to connect to multiple providers.

  • LlmAugmenter — post-process OCR text (correction, structure).
  • PaperdocAgent — document Q&A, summaries, and structured data extraction.
  • Providers — OpenAI, Anthropic, Gemini, Ollama (and others supported by Neuron AI).

Enable via open($path, ['llm' => true]) or in config: llm.enabled, llm.provider, llm.model, llm.api_key (or PAPERDOC_LLM_* env vars).

Laravel

Use the Paperdoc facade for the same API as DocumentManager: Paperdoc::create(), Paperdoc::open(), Paperdoc::save(), Paperdoc::convert(), Paperdoc::renderAs(), Paperdoc::openBatch().

Artisan commands (when the package is installed in a Laravel app):

  • php artisan paperdoc:build-dictionary <path> — build a dictionary from text files for OCR spell correction.
  • php artisan paperdoc:train-ngram <path> — train an n-gram model from text files for OCR post-processing.

Configuration

Laravel only. Publish the config file:

php artisan vendor:publish --tag=paperdoc-config

This creates config/paperdoc.php. Main options:

  • Default format — default output when creating documents
  • Typography — fonts and sizes applied to documents
  • Storage paths — where to read/write files
  • OCR — Tesseract and post-processing settings
  • LLM — AI extraction and augmentation (Neuron AI)

API reference

Main entry point: DocumentManager (standalone) or Paperdoc facade (Laravel).

Method                                   Description
create($format, $title = '')             Create a new empty document (format: pdf, docx, html, csv, etc.).
open($filename, $options = [])           Parse a file. Options: ocr, llm, language.
save($document, $path)                   Write document to a file.
renderAs($document, $format)             Render document to string (e.g. 'html', 'md').
convert($source, $destination, $format)  Open source file and save as destination in given format.
openBatch($filenames, $options = [])     Open multiple files; returns array of documents. Same options as open().

Testing

Run the test suite from the library directory:

composer test
# or
./vendor/bin/phpunit

Integration tests are in tests/Integration/, unit tests in tests/Unit/.

Architecture

High-level layout of the package source:

src/
├── Concerns/          # Shared traits
├── Console/           # Artisan commands
├── Contracts/         # DocumentInterface, ParserInterface…
├── Document/          # Core model (Document, Section, Paragraph…)
├── Enum/              # Format enums
├── Facades/           # Laravel Facade
├── Factory/           # Document/Parser factories
├── Llm/               # AI/LLM integration (Neuron AI)
├── Ocr/               # OCR integration
├── Parsers/           # Format-specific parsers
├── Renderers/         # Format-specific renderers
├── Support/           # DocumentManager and helpers
└── PaperdocServiceProvider.php

Paperdoc Library is proprietary software. © 2024–2026 Paperdoc — paperdoc.dev