How to Extract Text from PDF Files in Python

Extract all readable text from a PDF file using PyPDF2, iterating over each page and concatenating the content.

Easy Python 3.6+ Jun 27, 2026 Files & data

pdf text-extraction pypdf2 file-io

Requires third-party packages — install first

pip install PyPDF2

Python code

import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as file:  # PDFs are binary, so open with "rb"
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text() + "\n"  # newline separates pages
    return text.strip()

if __name__ == "__main__":
    pdf_path = "sample.pdf"
    extracted_text = extract_text_from_pdf(pdf_path)
    print(extracted_text[:500])

Output

stdout

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris.

How it works

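The file is opened in binary mode with open() and passed to PyPDF2's PdfReader. The reader's .pages property yields each page object, and extract_text() returns the text on that page, which is appended to a string buffer followed by a newline. The final .strip() removes leading and trailing whitespace. This works for text-based PDFs. Scanned (image-only) documents have no text layer, so they yield little or no text without OCR, and highly formatted pages may come out with words or columns in an unexpected order.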
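
As a minimal sketch of the same loop, the page texts can be collected in a list and joined once at the end, which avoids repeated string concatenation on long documents. The helper name `join_page_text` is hypothetical, and the `or ""` guard is a defensive assumption for pages that yield no text. PyPDF2 is imported inside the function so the helper works with any object that offers an `extract_text()` method.

```python
def join_page_text(pages):
    """Join the text of each page with newlines (hypothetical helper)."""
    # `or ""` is a defensive guard: treat a missing result as empty text.
    parts = [page.extract_text() or "" for page in pages]
    return "\n".join(parts).strip()


def extract_text_from_pdf(pdf_path):
    # Imported here so join_page_text stays usable without PyPDF2 installed.
    import PyPDF2

    with open(pdf_path, "rb") as file:  # binary mode, as in the main snippet
        reader = PyPDF2.PdfReader(file)
        # Pages are read while the file is still open.
        return join_page_text(reader.pages)
```

For the same input this should produce the same string as the concatenating version, since both separate pages with a single newline and strip the result.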

Common mistakes

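Opening the PDF without 'rb' (binary read) mode makes parsing fail, because PdfReader needs a byte stream. The exact exception depends on the library version.
Expecting text from a non-text PDF (e.g., image-only): `extract_text()` returns an empty string because there is no text layer to read.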
Not using `.strip()` can leave unwanted newlines or spaces at the edges.
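
One way to catch the empty-text case early is to check which pages come back blank. This is only a sketch: the function name `pages_without_text` is hypothetical, and a blank page is taken as a hint, not proof, that the page is a scan.

```python
def pages_without_text(pages):
    """Return 1-based numbers of pages whose extracted text is blank."""
    blank = []
    for number, page in enumerate(pages, start=1):
        # Treat a missing result the same as an empty string.
        text = page.extract_text() or ""
        if not text.strip():
            blank.append(number)
    return blank
```

With a reader created as in the main snippet, `pages_without_text(reader.pages)` returning a non-empty list suggests those pages may need OCR before any text can be extracted.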

Variations

Use `pdfplumber` for better layout preservation and table extraction.
Use `pypdf` (the maintained successor to PyPDF2, which is now deprecated) for the same API with ongoing updates.
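
Both variations can be sketched briefly, assuming the packages are installed with `pip install pdfplumber` or `pip install pypdf`. The function names `extract_text_with_pdfplumber` and `get_pdf_reader_class` are hypothetical. The imports sit inside the functions so that only the library actually used has to be present.

```python
def extract_text_with_pdfplumber(pdf_path):
    """Extract text page by page with pdfplumber."""
    import pdfplumber  # third-party, imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        # Guard with `or ""` in case a page yields no text.
        parts = [page.extract_text() or "" for page in pdf.pages]
    return "\n".join(parts).strip()


def get_pdf_reader_class():
    """Prefer pypdf's PdfReader, falling back to PyPDF2's."""
    try:
        from pypdf import PdfReader
    except ImportError:
        # Raises ImportError again if PyPDF2 is missing as well.
        from PyPDF2 import PdfReader
    return PdfReader
```

The class returned by `get_pdf_reader_class()` can stand in for `PyPDF2.PdfReader` in the main snippet, since `.pages` and `extract_text()` are used the same way.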

Real-world use cases

Parsing invoices or receipts from PDF attachments in an email processing pipeline.
Extracting text from contracts before feeding it into a document classifier (scanned contracts need OCR first, since they have no text layer).
Building a personal search engine over a library of e-books and research papers.
