PDF Extract Text — OpenClaw Plugin

Extract text from PDFs using the PDFAPIHub API. This OpenClaw plugin gives your AI agent 8 tools for plain text extraction, structured parsing, OCR, and format conversion.

What It Does

Pull text from digital or scanned PDFs with the method that fits the document. Extract plain text, parse into structured JSON with tables and bounding boxes, OCR scanned documents in 100+ languages, or convert to Word, Excel, CSV, or HTML.

Features

Plain Text Extraction — Fast text extraction from digital PDFs with page selection
Structured Parsing — JSON output with layout blocks, normalized bounding boxes, tables, and image metadata
4 Parse Modes — text, layout, tables, full (text + blocks + tables + images)
PDF OCR — Tesseract OCR for scanned PDFs with configurable DPI (72-400)
Image OCR — OCR photos of receipts, documents, signs with preprocessing
Multi-Language OCR — 100+ languages (eng, hin, fra, deu, etc.), combine with +
Word-Level Bounding Boxes — Per-word positions and confidence scores
Character Whitelisting — Restrict OCR to digits-only for invoices/meters
Image Preprocessing — Grayscale, sharpen, threshold, resize for noisy inputs
PDF to DOCX — Editable Word documents with formatting preserved
PDF to Excel — Tables extracted into XLSX (one sheet per page)
PDF to CSV — Tabular data for databases and BI tools
PDF to HTML — Styled HTML for web publishing
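
The OCR features above mention two constraints that are easy to get wrong by hand: language codes are combined with `+`, and DPI is limited to 72-400. The sketch below shows one way to prepare those values before handing them to an OCR tool. The option names `lang` and `dpi` and both helper functions are hypothetical and used only for illustration; the real parameter names are in the API docs.

```python
from typing import Iterable

# DPI bounds taken from the feature list above (72-400).
MIN_DPI, MAX_DPI = 72, 400


def build_lang(codes: Iterable[str]) -> str:
    """Join Tesseract-style language codes with '+', e.g. eng and hin -> 'eng+hin'."""
    cleaned = [c.strip().lower() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    # Drop duplicates while keeping the caller's order.
    return "+".join(dict.fromkeys(cleaned))


def check_dpi(dpi: int) -> int:
    """Reject DPI values outside the documented 72-400 range."""
    if not MIN_DPI <= dpi <= MAX_DPI:
        raise ValueError(f"dpi must be between {MIN_DPI} and {MAX_DPI}, got {dpi}")
    return dpi


# Hypothetical option names (assumption, not taken from the plugin or API docs).
ocr_options = {
    "lang": build_lang(["eng", "hin"]),
    "dpi": check_dpi(300),
}
print(ocr_options)  # {'lang': 'eng+hin', 'dpi': 300}
```

Validating locally like this gives a clear error message before any request is made, which is usually easier to debug than a rejected API call.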

Tools

| Tool | Description |
| --- | --- |
| `extract_text_from_pdf` | Extract plain text from PDF pages |
| `parse_pdf` | Parse into structured JSON (text, layout blocks, tables, images) |
| `ocr_pdf` | OCR scanned PDFs with multi-language Tesseract support |
| `ocr_image` | OCR images with preprocessing |
| `pdf_to_docx` | Convert PDF to editable Word document |
| `pdf_to_excel` | Extract tables into Excel workbook |
| `pdf_to_csv` | Extract tables into CSV format |
| `pdf_to_html` | Convert PDF to styled HTML |
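
Structured parsing is described as returning normalized bounding boxes. If you need pixel coordinates, for example to draw boxes on a rendered page image, a conversion along these lines may help. This is a sketch under an assumption: that a box is `[x0, y0, x1, y1]` with each value in the range 0 to 1 relative to page width and height. The actual field layout is not specified here, so check the API docs before relying on it. The function name `to_pixels` is hypothetical.

```python
from typing import Sequence, Tuple


def to_pixels(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale an assumed [x0, y0, x1, y1] box in 0..1 to integer pixel coordinates."""
    x0, y0, x1, y1 = bbox
    for value in (x0, y0, x1, y1):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"expected normalized coordinates in [0, 1], got {value}")
    # x values scale with page width, y values with page height.
    return (round(x0 * width), round(y0 * height), round(x1 * width), round(y1 * height))


# Example: a box on a page rendered at 800 x 1000 pixels.
print(to_pixels([0.25, 0.5, 0.75, 0.625], 800, 1000))  # (200, 500, 600, 625)
```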

Installation

openclaw plugins install clawhub:pdf-extract-text

Configuration

Add your API key in ~/.openclaw/openclaw.json:

{
  "plugins": {
    "entries": {
      "pdf-extract-text": {
        "enabled": true,
        "env": {
          "PDFAPIHUB_API_KEY": "your-api-key-here"
        }
      }
    }
  }
}

Get your free API key at https://pdfapihub.com.
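
To confirm the key is where the plugin expects it, a small check like the following can be run after editing the config. It is a sketch that assumes the file is strict JSON (no comments or trailing commas) and follows the structure shown above. The helper name `find_api_key` is hypothetical and not part of OpenClaw.

```python
import json
from pathlib import Path
from typing import Optional

PLUGIN_ID = "pdf-extract-text"
PLACEHOLDER = "your-api-key-here"


def find_api_key(config: dict) -> Optional[str]:
    """Return the configured key, or None if it is missing or still the placeholder."""
    entry = config.get("plugins", {}).get("entries", {}).get(PLUGIN_ID, {})
    key = entry.get("env", {}).get("PDFAPIHUB_API_KEY")
    if not key or key == PLACEHOLDER:
        return None
    return key


def main() -> None:
    path = Path.home() / ".openclaw" / "openclaw.json"
    if not path.exists():
        print(f"Config file not found: {path}")
        return
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        print(f"Could not parse {path} as strict JSON: {exc}")
        return
    if find_api_key(config) is None:
        print("PDFAPIHUB_API_KEY is missing or still set to the placeholder.")
    else:
        print("PDFAPIHUB_API_KEY is set.")


if __name__ == "__main__":
    main()
```

The script only reports whether a key is present. It does not print the key or contact the API.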

Usage Examples

Just ask your OpenClaw agent:

"Extract all text from this PDF"
"Parse the tables from this invoice"
"OCR this scanned document in English and Hindi"
"Convert this PDF to an Excel spreadsheet"
"Extract text from this receipt photo"
"Convert pages 1-3 to a Word document"
"Get the structured layout with bounding boxes"

Use Cases

Invoice Parsing — Extract line items, totals, and vendor info from PDF invoices
Resume Parsing — Extract name, experience, and skills from PDF resumes
Full-Text Search — Extract text for indexing in search engines
AI/LLM Processing — Feed PDF text into language models or chatbots
Financial Data — Extract tables from bank statements into Excel/CSV
Receipt Scanning — OCR receipts and invoices for expense tracking
Document Digitization — Convert scanned legacy documents into searchable text
Content Migration — Pull text from PDFs for migration to new systems
Translation Workflows — Convert PDFs to DOCX for easier translation

API Documentation

Full API docs: https://pdfapihub.com/docs

License

MIT
