@rishabhdugar

PDF Extract Text

Extract text from PDFs — plain text, structured JSON with layout blocks and tables, OCR for scanned PDFs with multi-language support, and conversion to Word/Excel/CSV/HTML. Powered by PDFAPIHub.

Current version
v1.0.0
code-plugin · Community · source-linked

PDF Extract Text — OpenClaw Plugin

Extract text from PDFs using the PDFAPIHub API. This OpenClaw plugin gives your AI agent 8 tools for plain text extraction, structured parsing, OCR, and format conversion.

What It Does

Pull text content from any PDF — digital or scanned — using the best method for the job. Extract plain text, parse into structured JSON with tables and bounding boxes, OCR scanned documents in 100+ languages, or convert to Word/Excel/CSV/HTML.

Features

  • Plain Text Extraction — Fast text extraction from digital PDFs with page selection
  • Structured Parsing — JSON output with layout blocks, normalized bounding boxes, tables, and image metadata
  • 4 Parse Modes — text, layout, tables, full (text + blocks + tables + images)
  • PDF OCR — Tesseract OCR for scanned PDFs with configurable DPI (72-400)
  • Image OCR — OCR photos of receipts, documents, signs with preprocessing
  • Multi-Language OCR — 100+ languages (eng, hin, fra, deu, etc.), combine with +
  • Word-Level Bounding Boxes — Per-word positions and confidence scores
  • Character Whitelisting — Restrict OCR to digits-only for invoices/meters
  • Image Preprocessing — Grayscale, sharpen, threshold, resize for noisy inputs
  • PDF to DOCX — Editable Word documents with formatting preserved
  • PDF to Excel — Tables extracted into XLSX (one sheet per page)
  • PDF to CSV — Tabular data for databases and BI tools
  • PDF to HTML — Styled HTML for web publishing
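
The OCR options in the list above (language codes joined with +, a DPI between 72 and 400, and the four parse modes) can be sanity-checked before a request is made. The sketch below is only an illustration: the helper names `build_lang`, `clamp_dpi`, and `check_parse_mode` are hypothetical and are not part of the plugin or the PDFAPIHub API.

```python
# Hypothetical helpers for illustration; none of these names come from the plugin.
MIN_DPI, MAX_DPI = 72, 400                           # DPI range stated in the feature list
PARSE_MODES = ("text", "layout", "tables", "full")   # the four parse modes listed above


def build_lang(*codes: str) -> str:
    """Join Tesseract-style language codes with '+', e.g. 'eng+hin'."""
    cleaned = [c.strip().lower() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    # dict.fromkeys drops duplicates while keeping first-seen order
    return "+".join(dict.fromkeys(cleaned))


def clamp_dpi(dpi: int) -> int:
    """Clamp a requested DPI into the 72-400 range."""
    return max(MIN_DPI, min(MAX_DPI, dpi))


def check_parse_mode(mode: str) -> str:
    """Return the normalized mode, or raise if it is not one of the four listed."""
    normalized = mode.strip().lower()
    if normalized not in PARSE_MODES:
        raise ValueError(f"unknown parse mode {mode!r}; expected one of {PARSE_MODES}")
    return normalized


if __name__ == "__main__":
    print(build_lang("eng", "hin"))   # English plus Hindi
    print(clamp_dpi(600))             # above the range, so clamped to the maximum
    print(check_parse_mode("Tables"))
```

Whether the service itself clamps or rejects out-of-range values is not stated in this listing, so validating on the caller's side is a conservative choice.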

Tools

Tool | Description
--- | ---
extract_text_from_pdf | Extract plain text from PDF pages
parse_pdf | Parse into structured JSON (text, layout blocks, tables, images)
ocr_pdf | OCR scanned PDFs with multi-language Tesseract support
ocr_image | OCR images with preprocessing
pdf_to_docx | Convert PDF to editable Word document
pdf_to_excel | Extract tables into Excel workbook
pdf_to_csv | Extract tables into CSV format
pdf_to_html | Convert PDF to styled HTML
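
To make the division of labor between the eight tools concrete, here is a rough sketch of how a request could be routed to one of them. The tool names are taken from the table; the `pick_tool` function, its parameters, and the routing order are assumptions made for illustration. In practice the OpenClaw agent chooses the tool itself.

```python
from typing import Optional

# Target formats mapped to the conversion tools named in the Tools table.
CONVERSION_TOOLS = {
    "docx": "pdf_to_docx",
    "xlsx": "pdf_to_excel",
    "csv": "pdf_to_csv",
    "html": "pdf_to_html",
}


def pick_tool(
    is_image: bool = False,
    scanned: bool = False,
    structured: bool = False,
    target: Optional[str] = None,
) -> str:
    """Hypothetical routing: return the tool name that best fits the request."""
    if is_image:
        return "ocr_image"          # photos of receipts, signs, documents
    if target is not None:
        key = target.lower()
        if key not in CONVERSION_TOOLS:
            raise ValueError(f"no conversion tool for target format {target!r}")
        return CONVERSION_TOOLS[key]
    if scanned:
        return "ocr_pdf"            # scanned PDFs carry no text layer
    if structured:
        return "parse_pdf"          # layout blocks, tables, bounding boxes
    return "extract_text_from_pdf"  # default for digital PDFs


if __name__ == "__main__":
    print(pick_tool(scanned=True))
    print(pick_tool(target="XLSX"))
    print(pick_tool())
```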

Installation

openclaw plugins install clawhub:pdf-extract-text

Configuration

Add your API key in ~/.openclaw/openclaw.json:

{
  "plugins": {
    "entries": {
      "pdf-extract-text": {
        "enabled": true,
        "env": {
          "PDFAPIHUB_API_KEY": "your-api-key-here"
        }
      }
    }
  }
}

Get your free API key at https://pdfapihub.com.
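
If the plugin does not seem to pick up the key, a quick check of the config file can help. The following is a minimal sketch, assuming the file is strict JSON and has exactly the shape shown above; `read_api_key` is a hypothetical helper, not something the plugin or OpenClaw provides.

```python
import json
from pathlib import Path
from typing import Optional

PLUGIN_ID = "pdf-extract-text"
PLACEHOLDER = "your-api-key-here"  # the sample value from the config above


def read_api_key(config: dict) -> Optional[str]:
    """Return the configured key, or None if the entry is disabled or unset."""
    entry = config.get("plugins", {}).get("entries", {}).get(PLUGIN_ID, {})
    if not entry.get("enabled", False):
        return None
    key = entry.get("env", {}).get("PDFAPIHUB_API_KEY")
    # Treat an empty value or the untouched placeholder as "not configured".
    if not key or key == PLACEHOLDER:
        return None
    return key


if __name__ == "__main__":
    path = Path.home() / ".openclaw" / "openclaw.json"
    if not path.exists():
        print(f"{path} not found")
    else:
        try:
            config = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            # Assumption: strict JSON. Comments or trailing commas would fail here.
            print(f"could not parse {path}: {exc}")
        else:
            print("API key configured" if read_api_key(config) else "API key missing")
```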

Usage Examples

Just ask your OpenClaw agent:

  • "Extract all text from this PDF"
  • "Parse the tables from this invoice"
  • "OCR this scanned document in English and Hindi"
  • "Convert this PDF to an Excel spreadsheet"
  • "Extract text from this receipt photo"
  • "Convert pages 1-3 to a Word document"
  • "Get the structured layout with bounding boxes"

Use Cases

  • Invoice Parsing — Extract line items, totals, and vendor info from PDF invoices
  • Resume Parsing — Extract name, experience, and skills from PDF resumes
  • Full-Text Search — Extract text for indexing in search engines
  • AI/LLM Processing — Feed PDF text into language models or chatbots
  • Financial Data — Extract tables from bank statements into Excel/CSV
  • Receipt Scanning — OCR receipts and invoices for expense tracking
  • Document Digitization — Convert scanned legacy documents into searchable text
  • Content Migration — Pull text from PDFs for migration to new systems
  • Translation Workflows — Convert PDFs to DOCX for easier translation

API Documentation

Full API docs: https://pdfapihub.com/docs

License

MIT

Source and Version

Source repository

PdfApiHub/openclaw-pdf-extract-text

Source commit

4869f67e362d10b6e0c5a0e3678e7ede00053aee

Metadata

  • Package name: pdf-extract-text
  • Created: 2026/04/17
  • Updated: 2026/04/17
  • Executes code:
  • Source tag: main

Compatibility

  • Built with OpenClaw: 2026.3.24-beta.2
  • Plugin API range: >=2026.3.24-beta.2
  • Tag: latest
  • File count: 7