How to Extract Text from PDF Files in Python

Extract all readable text from a PDF file using PyPDF2, iterating over each page and concatenating the content.

Easy · Python 3.6+ · Jun 27, 2026 · Files & data

Requires a third-party package. Install it first:
pip install PyPDF2

Python code

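import PyPDF2

def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF, joined by newlines."""
    page_texts = []
    # PDFs are binary files, so open in "rb" mode.
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            # Fall back to "" in case a page yields no text.
            page_texts.append(page.extract_text() or "")
    return "\n".join(page_texts).strip()

if __name__ == "__main__":
    pdf_path = "sample.pdf"
    extracted_text = extract_text_from_pdf(pdf_path)
    # Print only the first 500 characters as a preview.
    print(extracted_text[:500])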

Output

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris.

How it works

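The file is opened in binary mode with open(), and PyPDF2's PdfReader parses it. The reader's .pages property yields each page object, and extract_text() returns the text found on that page, which is collected page by page and joined with newlines. The final .strip() removes leading and trailing whitespace. This works for text-based PDFs. Scanned (image-only) pages have no text layer, so they yield empty output unless OCR is applied first, and heavily formatted pages may come out with scrambled ordering or spacing.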
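
The per-page flow described above can be sketched as a few small helpers. This is a minimal sketch rather than PyPDF2's own API: the names extract_pages, join_pages, and extract_text_from_file are hypothetical, and the `or ""` guard is a defensive assumption in case a page yields no text.

```python
def extract_pages(reader):
    """Return one string per page; pages with no text become ""."""
    # `reader` is any object exposing `.pages`, such as PyPDF2.PdfReader.
    return [(page.extract_text() or "") for page in reader.pages]


def join_pages(page_texts):
    """Join per-page strings with newlines and trim the outer whitespace."""
    return "\n".join(page_texts).strip()


def extract_text_from_file(pdf_path):
    """Open a PDF in binary mode and return its joined page text."""
    import PyPDF2  # third-party (pip install PyPDF2), imported lazily

    # Extraction must happen while the file is still open.
    with open(pdf_path, "rb") as file:
        return join_pages(extract_pages(PyPDF2.PdfReader(file)))
```

Keeping the pages in a list before joining also makes it easy to report a page count or to process pages individually.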

Common mistakes

  • Opening the PDF without 'rb' (binary read) mode breaks parsing, because PDF is a binary format. The exact error depends on the library version.
  • Calling `extract_text()` on a non-text PDF (e.g., image-only) returns an empty string.
  • Not using `.strip()` can leave unwanted newlines or spaces at the edges.
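
The image-only case above can be detected before the empty result causes trouble downstream. The following is a rough sketch under stated assumptions: find_empty_pages and looks_image_only are hypothetical helper names, and "no extractable text on any page" is only a heuristic for "probably scanned", not a definitive test.

```python
def find_empty_pages(reader):
    """Return 1-based numbers of pages that yield no extractable text."""
    empty = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if not text.strip():
            empty.append(number)
    return empty


def looks_image_only(reader):
    """Heuristic: True when the PDF has pages but none yield any text."""
    total = len(reader.pages)
    return total > 0 and len(find_empty_pages(reader)) == total
```

With PyPDF2, both helpers could be called on a PyPDF2.PdfReader created inside the `with open(pdf_path, "rb")` block. A True result from looks_image_only suggests the file needs OCR rather than plain text extraction.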

Variations

  1. Use `pdfplumber` for better layout preservation and table extraction.
  2. Use `pypdf`, the maintained successor to PyPDF2 (which is now deprecated), for a near-identical API with ongoing updates.
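
Both variations can be sketched side by side. This is an illustrative sketch, not an official recipe from either project: the function names, the BACKENDS table, and the extract_text dispatcher are hypothetical, and each library is imported lazily so only the one actually used needs to be installed (pip install pypdf or pip install pdfplumber).

```python
def extract_with_pypdf(pdf_path):
    """Extract text with pypdf, which accepts a file path directly."""
    from pypdf import PdfReader  # third-party, imported lazily

    reader = PdfReader(pdf_path)
    return "\n".join((page.extract_text() or "") for page in reader.pages).strip()


def extract_with_pdfplumber(pdf_path):
    """Extract text with pdfplumber, which opens the file itself."""
    import pdfplumber  # third-party, imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for a page without text.
        return "\n".join((page.extract_text() or "") for page in pdf.pages).strip()


# Hypothetical lookup table mapping a backend name to its extractor.
BACKENDS = {"pypdf": extract_with_pypdf, "pdfplumber": extract_with_pdfplumber}


def extract_text(pdf_path, backend="pypdf"):
    """Dispatch to the chosen backend, rejecting unknown names."""
    try:
        extractor = BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return extractor(pdf_path)
```

For example, extract_text("sample.pdf", backend="pdfplumber") would run the pdfplumber variant on the same sample file used above.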

Real-world use cases

  • Parsing invoices or receipts from PDF attachments in an email processing pipeline.
  • Extracting text from digitally generated contracts before feeding it into a document classifier (scanned contracts need OCR first).
  • Building a personal search engine over a library of e-books and research papers.
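
As a rough illustration of the search-engine use case, extracted text can be fed into a small inverted index. This is a toy sketch under stated assumptions: build_index and search are hypothetical names, the file names in the usage note are made up, and real search engines add ranking, stemming, and persistence that are omitted here.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    # Start from the first word's matches, then intersect with the rest.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Here `documents` would be a dict such as {"paper.pdf": extracted_text}, where each value comes from the extraction function shown earlier.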

Run locally

This sample depends on a third-party package. Install PyPDF2 as shown at the top, then run the script in your own Python environment with a file named sample.pdf in the working directory.
