I needed to build a semantic search system for audio equipment manuals. The goal was to extract text from approximately 200 PDF manuals, generate embeddings using a sentence transformer model, and store them in a vector database for retrieval. The manuals ranged from 20 to 300 pages each, totaling around 15,000 pages of technical documentation.
Initial Approach with Unstructured
I started with the `unstructured` library, which promised comprehensive document parsing with layout preservation and element detection. The implementation was straightforward:
```python
from unstructured.partition.pdf import partition_pdf

def extract_with_unstructured(pdf_path: str) -> str:
    elements = partition_pdf(pdf_path)
    text = "\n\n".join([str(el) for el in elements])
    return text
```
The library worked correctly and extracted text with good accuracy. However, processing time became a problem. For a single 150-page manual, extraction took approximately 13 minutes. Extrapolating at that rate across the 200+ manuals in the pipeline put the total processing time above 45 hours.
The Performance Bottleneck
The `unstructured` library performs extensive document analysis, including:
- Layout detection and element classification
- Table extraction with structure preservation
- Image detection and OCR integration
- Complex PDF structure parsing
While these features are valuable for document understanding tasks, they introduced overhead I didn’t need. My use case required clean text extraction without layout analysis or element classification.
Switching to PyMuPDF
PyMuPDF is a lightweight PDF parser built on the MuPDF library. It focuses on speed and direct text extraction without the additional processing layers.
Here’s the implementation I switched to:
```python
import pymupdf
from typing import Any, Dict, List
from pathlib import Path


def extract_text_from_pdf(pdf_path: str) -> Dict[str, Any]:
    """
    Extract text from a PDF with metadata using PyMuPDF.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary containing extracted text and metadata
    """
    doc = pymupdf.open(pdf_path)

    # Extract document metadata
    metadata = {
        "title": doc.metadata.get("title", ""),
        "author": doc.metadata.get("author", ""),
        "page_count": len(doc),
        "filename": Path(pdf_path).name,
    }

    # Extract text from all pages, skipping empty ones
    text_blocks = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text()
        if text.strip():
            text_blocks.append({
                "page": page_num + 1,
                "content": text.strip(),
            })

    doc.close()

    # Combine all page text
    full_text = "\n\n".join([block["content"] for block in text_blocks])

    return {
        "text": full_text,
        "metadata": metadata,
        "blocks": text_blocks,
    }


def process_manual_for_embeddings(pdf_path: str, chunk_size: int = 512) -> List[Dict]:
    """
    Process a PDF manual and prepare text chunks for embedding generation.

    Args:
        pdf_path: Path to the PDF file
        chunk_size: Target size for each text chunk in characters

    Returns:
        List of text chunks with metadata
    """
    result = extract_text_from_pdf(pdf_path)

    # Simple chunking strategy: split by paragraphs and combine to target size
    chunks = []
    current_chunk = ""
    current_page = 1

    for block in result["blocks"]:
        paragraphs = block["content"].split("\n\n")
        for para in paragraphs:
            para = para.strip()
            if not para:
                continue

            if len(current_chunk) + len(para) > chunk_size and current_chunk:
                # Current chunk has reached the target size: emit it, start a new one
                chunks.append({
                    "text": current_chunk.strip(),
                    "page": current_page,
                    "metadata": result["metadata"],
                })
                current_chunk = para
                current_page = block["page"]
            else:
                current_chunk += ("\n\n" + para) if current_chunk else para
                current_page = block["page"]

    # Add the remaining chunk
    if current_chunk:
        chunks.append({
            "text": current_chunk.strip(),
            "page": current_page,
            "metadata": result["metadata"],
        })

    return chunks
```
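Driving this over the whole collection is then a simple loop. A minimal sketch, assuming the manuals live in a single directory (the path is a placeholder):

```python
from pathlib import Path

# Hypothetical location of the manual PDFs
manuals_dir = Path("data/manuals")

all_chunks = []
for pdf_path in sorted(manuals_dir.glob("*.pdf")):
    chunks = process_manual_for_embeddings(str(pdf_path), chunk_size=512)
    all_chunks.extend(chunks)
    print(f"{pdf_path.name}: {len(chunks)} chunks")

print(f"Total chunks across all manuals: {len(all_chunks)}")
```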
Performance Comparison
I measured extraction performance on a representative sample of 10 manuals with varying page counts:
| Manual | Pages | Unstructured | PyMuPDF | Speedup |
|---|---|---|---|---|
| Manual A | 50 | 4.2 min | 1.8 sec | 140x |
| Manual B | 120 | 10.1 min | 3.2 sec | 189x |
| Manual C | 200 | 16.8 min | 4.9 sec | 205x |
| Manual D | 85 | 7.1 min | 2.4 sec | 177x |
| Manual E | 150 | 12.6 min | 3.8 sec | 199x |
| Manual F | 45 | 3.8 min | 1.5 sec | 152x |
| Manual G | 180 | 15.2 min | 4.3 sec | 212x |
| Manual H | 95 | 8.0 min | 2.7 sec | 178x |
| Manual I | 160 | 13.4 min | 4.0 sec | 201x |
| Manual J | 110 | 9.2 min | 3.0 sec | 184x |
| Average | 119.5 | 10.04 min | 3.16 sec | ~190x |
Processing the complete dataset of 215 manuals:
- Unstructured: 36 hours (estimated)
- PyMuPDF: 11.3 minutes (actual)
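These comparison numbers are plain wall-clock timings of the extraction call. A minimal sketch of the kind of harness that produces them, assuming both extraction functions from earlier are in scope (the sample directory is a placeholder):

```python
import time
from pathlib import Path

def time_call(extract_fn, pdf_path: str) -> float:
    """Wall-clock time of a single extraction run, in seconds."""
    start = time.perf_counter()
    extract_fn(pdf_path)
    return time.perf_counter() - start

# Hypothetical sample of manuals used for the comparison
sample = sorted(Path("data/manuals/sample").glob("*.pdf"))

for pdf_path in sample:
    slow = time_call(extract_with_unstructured, str(pdf_path))
    fast = time_call(extract_text_from_pdf, str(pdf_path))
    print(f"{pdf_path.name}: unstructured {slow / 60:.1f} min, "
          f"pymupdf {fast:.1f} sec, speedup {slow / fast:.0f}x")
```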
The speedup enabled rapid iteration during development. I could reprocess the entire dataset with chunking adjustments or preprocessing changes in minutes rather than waiting hours.
Text Quality Comparison
Text extraction quality was comparable between the two libraries for standard text content. PyMuPDF correctly handled:
- Multi-column layouts
- Headers and footers
- Embedded fonts and special characters
- Mixed text encodings
For my use case, which involved technical manuals with primarily text content and simple layouts, PyMuPDF produced equivalent results to unstructured without the processing overhead.
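One way to spot-check that equivalence is to compare the two extractions for the same manual with a rough similarity ratio. A minimal sketch using difflib (the threshold and path are arbitrary, and SequenceMatcher is slow on long texts, so this is only practical on a handful of sample manuals):

```python
import difflib

def extraction_similarity(pdf_path: str) -> float:
    """Rough similarity between unstructured and PyMuPDF output for one PDF."""
    text_a = extract_with_unstructured(pdf_path)
    text_b = extract_text_from_pdf(pdf_path)["text"]
    # Collapse whitespace so line-break differences don't dominate the ratio
    norm_a = " ".join(text_a.split())
    norm_b = " ".join(text_b.split())
    return difflib.SequenceMatcher(None, norm_a, norm_b).ratio()

# Example: flag a manual where the two extractions diverge noticeably
ratio = extraction_similarity("data/manuals/manual_a.pdf")  # placeholder path
if ratio < 0.95:
    print(f"Extractions diverge (similarity {ratio:.2f}); inspect manually")
```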
Implementation Notes
A few considerations when using PyMuPDF:
- Memory efficiency: PyMuPDF loads documents into memory. For very large PDFs (500+ pages), process in batches or use `page.get_text("blocks")` to extract incrementally (see the sketch after this list).
- Text ordering: PyMuPDF returns text in reading order, which works correctly for most documents. For complex layouts, verify the text sequence manually.
- Missing features: PyMuPDF doesn't provide OCR or advanced table structure extraction. If you need these features, consider using PyMuPDF for text extraction and targeted tools for the specific requirements.
- Dependencies: PyMuPDF has minimal dependencies compared to unstructured, which requires multiple heavyweight libraries.
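For the memory point above, here is a minimal sketch of page-by-page block extraction. It is not the pipeline code from earlier; it only shows how `get_text("blocks")` lets you stream text one page at a time instead of joining the whole document in memory:

```python
import pymupdf

def iter_text_blocks(pdf_path: str):
    """Yield (page_number, text) for each text block, one page at a time."""
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type);
            # block_type 0 is text, 1 is an image block.
            for block in page.get_text("blocks"):
                if block[6] == 0 and block[4].strip():
                    yield page.number + 1, block[4].strip()

# Example: count text blocks in a large manual without holding all text at once
count = sum(1 for _ in iter_text_blocks("data/manuals/large_manual.pdf"))  # placeholder
print(f"{count} text blocks")
```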
Results
After switching to PyMuPDF, the complete pipeline processed all 215 manuals in roughly 15 minutes of extraction, chunking, and database insertion, with embedding generation running as a separate batch job:
- Text extraction: 11.3 minutes
- Chunking and preprocessing: 2.1 minutes
- Embedding generation: 8.4 minutes (separate batch job)
- Vector database insertion: 1.8 minutes
The processed corpus contained approximately 42,000 text chunks, which enabled semantic search across the entire manual collection with sub-second query response times.
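For completeness, the downstream steps look roughly like this. The post doesn't hinge on a particular embedding model or vector store, so the model name and ChromaDB below are illustrative stand-ins rather than the exact stack used here; `all_chunks` is the chunk list from the earlier directory loop:

```python
# Illustrative sketch: the sentence-transformers model and ChromaDB are assumptions,
# not necessarily what this pipeline used.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")            # assumed embedding model
client = chromadb.PersistentClient(path="manuals_index")   # assumed vector store
collection = client.get_or_create_collection("manual_chunks")

# Embed every chunk and insert it with its page/filename metadata
# (for tens of thousands of chunks you would batch these inserts)
embeddings = model.encode([c["text"] for c in all_chunks], batch_size=64)
collection.add(
    ids=[f"chunk-{i}" for i in range(len(all_chunks))],
    documents=[c["text"] for c in all_chunks],
    embeddings=[e.tolist() for e in embeddings],
    metadatas=[{"page": c["page"], "filename": c["metadata"]["filename"]}
               for c in all_chunks],
)

# Semantic search: embed the query and retrieve the closest chunks
query_vec = model.encode("How do I calibrate the input gain?").tolist()
results = collection.query(query_embeddings=[query_vec], n_results=5)
print(results["documents"][0])
```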
For PDF text extraction where speed matters and complex layout analysis isn’t required, PyMuPDF delivers significant performance advantages over more comprehensive document parsing libraries.