How to Extract Text from PDF Files in Python
Extract all readable text from a PDF file using PyPDF2, iterating over each page and concatenating the content.
pip install PyPDF2
Python code
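import PyPDF2


def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF, separated by newlines."""
    text = ""
    # PDFs are binary files, so open them in "rb" mode.
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            # Fall back to "" so a page without text cannot break the concatenation.
            text += (page.extract_text() or "") + "\n"
    return text.strip()


if __name__ == "__main__":
    pdf_path = "sample.pdf"
    extracted_text = extract_text_from_pdf(pdf_path)
    # Print only the first 500 characters as a preview.
    print(extracted_text[:500])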
Output
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris.
How it works
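The file is opened in binary mode with open() and handed to PyPDF2's PdfReader, which parses it. The reader's .pages property exposes each page object in order, and extract_text() returns the text on that page, which is appended to a string buffer followed by a newline. The final .strip() removes leading and trailing whitespace. This works for text-based PDFs. Scanned (image-only) documents yield little or no text without OCR, and pages with complex layouts may come out with scrambled reading order or spacing.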
Common mistakes
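- Opening the PDF without 'rb' (binary read) mode makes parsing fail, because PdfReader expects bytes rather than decoded text; the exact exception depends on the library version.
- Expecting `extract_text()` to work on a non-text PDF (e.g., image-only scans): it returns an empty string, and such files need OCR first.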
- Not using `.strip()` can leave unwanted newlines or spaces at the edges.
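The image-only case above can be detected before the text is used downstream. The following is a minimal sketch, assuming that a page whose `extract_text()` result is empty or whitespace-only is probably a scan. The helper name `pages_without_text` is hypothetical, not part of PyPDF2, and it accepts any iterable of page-like objects such as `reader.pages`.

```python
def pages_without_text(pages):
    """Return the 1-based numbers of pages that yield no extractable text."""
    empty_pages = []
    for number, page in enumerate(pages, start=1):
        # Defensive guard (an assumption): treat a None result like an empty string.
        text = page.extract_text() or ""
        if not text.strip():
            empty_pages.append(number)
    return empty_pages
```

Called as `pages_without_text(reader.pages)` on the reader from the main example, a non-empty result suggests those pages would need OCR before any text can be recovered.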
Variations
- Use `pdfplumber` for better layout preservation and table extraction.
- Use `pypdf`, the maintained successor to PyPDF2 (which no longer receives updates), for a nearly identical API with ongoing fixes.
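As an illustration of the `pypdf` variation, here is a minimal sketch. It assumes `pypdf` is installed (`pip install pypdf`) and relies on its `PdfReader` accepting a file path directly. The function name `extract_text_with_pypdf` is hypothetical.

```python
def extract_text_with_pypdf(pdf_path):
    """Extract the text of every page in a PDF using pypdf."""
    # Imported inside the function so this module loads even without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Collect per-page text and join once, instead of repeated string concatenation.
    # The "or ''" guard is a defensive assumption for pages that yield no text.
    parts = [(page.extract_text() or "") for page in reader.pages]
    return "\n".join(parts).strip()
```

A call such as `extract_text_with_pypdf("sample.pdf")` should behave like the PyPDF2 version above. Joining a list is a small design choice that avoids building many intermediate strings on long documents.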
Real-world use cases
- Parsing invoices or receipts from PDF attachments in an email processing pipeline.
- Extracting text from contracts (after an OCR step, if they are scanned) before feeding it into a document classifier.
- Building a personal search engine over a library of e-books and research papers.