Split CSV Files into Smaller Chunks in Python
Splits a large CSV file into multiple smaller chunk files, preserving the header row in each chunk.
Python code
import csv
import os


def split_csv(input_file, chunk_size=1000, output_prefix="chunk"):
    """Split a large CSV file into smaller chunks, repeating the header in each."""
    with open(input_file, 'r', newline='') as infile:
        reader = csv.reader(infile)
        # Read the header once; an empty file has nothing to split.
        header = next(reader, None)
        if header is None:
            print(f"'{input_file}' is empty; nothing to split.")
            return

        file_count = 1
        row_count = 0
        outfile = None
        writer = None

        try:
            for row in reader:
                # Start a new chunk file every chunk_size data rows.
                if row_count % chunk_size == 0:
                    if outfile:
                        outfile.close()
                    output_file = f"{output_prefix}_{file_count}.csv"
                    outfile = open(output_file, 'w', newline='')
                    writer = csv.writer(outfile)
                    writer.writerow(header)
                    file_count += 1
                writer.writerow(row)
                row_count += 1
        finally:
            # Close the last chunk even if reading or writing fails.
            if outfile:
                outfile.close()

    print(f"Split '{input_file}' into {file_count-1} chunks.")


if __name__ == "__main__":
    # Create a sample large CSV for demonstration
    sample_file = "large_data.csv"
    with open(sample_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "value"])
        for i in range(2500):
            writer.writerow([i, f"item_{i}", i * 1.5])

    # Split into chunks of 1000 rows each
    split_csv(sample_file, chunk_size=1000, output_prefix="split_chunk")

    # Cleanup sample files (2500 rows at 1000 per chunk gives 3 chunk files)
    os.remove(sample_file)
    for i in range(1, 4):
        os.remove(f"split_chunk_{i}.csv")
Output
Split 'large_data.csv' into 3 chunks.
How it works
The script reads the header once from the original CSV, then writes that header at the start of each new chunk file. It tracks the row count and creates a new output file every N rows (default 1000) using modulo logic. Each chunk file is named with an incrementing suffix (e.g., chunk_1.csv) so you can easily identify parts. The csv module handles quoting and line endings correctly, making the split reliable for real-world data.
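To see why the csv module matters here, the following minimal sketch compares naive line counting with `csv.reader` on a record whose quoted field contains a line break. The sample data is invented for illustration; splitting such a file by physical lines would cut the last record in half.

```python
import csv
import io

# A header plus two records; the second record has a line break
# inside a quoted field (invented sample data).
raw = 'id,note\n1,plain\n2,"first line\nsecond line"\n'

# Splitting on line breaks sees four physical lines.
physical_lines = raw.splitlines()

# csv.reader keeps the quoted line break inside the field,
# so it sees three logical rows (header included).
rows = list(csv.reader(io.StringIO(raw, newline='')))

print(len(physical_lines))  # 4
print(len(rows))            # 3
print(repr(rows[2][1]))     # 'first line\nsecond line'
```

Because `split_csv` counts rows yielded by the reader rather than lines in the file, a multi-line record always lands whole in a single chunk.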
Common mistakes
- Forgetting to write the header in every chunk file, causing data to lose column names.
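- Not closing the previous output file before opening a new one, which leaks file handles and can leave buffered rows unwritten.
- Assuming all CSV files have a header row; with a headerless file the script does not fail, but it treats the first data row as the header and repeats it in every chunk.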
- Using `'w'` mode without `newline=''`, which can add extra blank lines on Windows.
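One way to avoid the file-closing mistake is to give each chunk its own `with` block. The sketch below is a hypothetical variant, not the script above: the name `split_csv_safely` and its return value are assumptions made for illustration. It pulls `chunk_size` rows at a time with `itertools.islice`, so at most one chunk of rows is held in memory.

```python
import csv
import os
import tempfile
from itertools import islice


def split_csv_safely(input_file, chunk_size=1000, output_prefix="chunk"):
    """Hypothetical variant: write chunks and return the list of paths written."""
    written = []
    with open(input_file, "r", newline="") as infile:
        reader = csv.reader(infile)
        header = next(reader, None)
        if header is None:  # empty input, nothing to split
            return written
        while True:
            # Take up to chunk_size rows; an empty list means the input is exhausted.
            rows = list(islice(reader, chunk_size))
            if not rows:
                break
            path = f"{output_prefix}_{len(written) + 1}.csv"
            # The with block closes this chunk even if a write fails.
            with open(path, "w", newline="") as outfile:
                writer = csv.writer(outfile)
                writer.writerow(header)
                writer.writerows(rows)
            written.append(path)
    return written


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "data.csv")
    with open(source, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["id", "value"])
        w.writerows([i, i * 2] for i in range(25))

    chunks = split_csv_safely(
        source, chunk_size=10, output_prefix=os.path.join(tmp, "part")
    )
    chunk_names = [os.path.basename(p) for p in chunks]

print(chunk_names)  # ['part_1.csv', 'part_2.csv', 'part_3.csv']
```

The trade-off is memory: each chunk is buffered as a list before it is written, which is fine for chunks of a few thousand rows but worth reconsidering for very large chunk sizes.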
Variations
- Use `pandas.read_csv` with the `chunksize` parameter, writing each chunk with `to_csv`, for memory-efficient splitting of huge files.
- If the original CSV has no header, skip the header handling and split only the data rows.
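The headerless variation can be sketched with a flag that controls whether the first row is consumed as a header. The generator name `iter_chunks` and the `has_header` parameter are hypothetical, introduced only for this example; it works on any iterable of lines, so an in-memory buffer stands in for a file here.

```python
import csv
import io
from itertools import islice


def iter_chunks(lines, chunk_size, has_header=True):
    """Yield (header, rows) pairs; header is None when has_header is False."""
    reader = csv.reader(lines)
    # Only consume the first row as a header when the caller says there is one.
    header = next(reader, None) if has_header else None
    while True:
        rows = list(islice(reader, chunk_size))
        if not rows:
            return
        yield header, rows


# Invented headerless sample: three data rows, no column names.
headerless = io.StringIO("1,a\n2,b\n3,c\n", newline="")
chunks = list(iter_chunks(headerless, chunk_size=2, has_header=False))

print(chunks)
# [(None, [['1', 'a'], ['2', 'b']]), (None, [['3', 'c']])]
```

A caller writing these chunks to disk would simply skip the `writerow(header)` call whenever the header is `None`.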
Real-world use cases
- Breaking a multi-gigabyte log export into 10 MB chunks for uploading to cloud storage with file size limits.
- Distributing a customer database across parallel batch processing jobs where each job handles one chunk.
- Splitting a monthly sales report into daily partitions so analysts can load one day at a time.
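The daily-partition use case splits by a column value rather than by row count, which the script above does not do. A minimal sketch, assuming a column named `date` (a made-up name for this example) and data small enough to group in memory:

```python
import csv
import io
from collections import defaultdict

# Invented sample report; the "date" column name is an assumption.
sample = io.StringIO(
    "date,amount\n"
    "2024-01-01,10\n"
    "2024-01-02,20\n"
    "2024-01-01,30\n",
    newline="",
)

reader = csv.DictReader(sample)

# Group rows by the value of the partition column.
partitions = defaultdict(list)
for row in reader:
    partitions[row["date"]].append(row)

# Write one CSV per day, each with the original header.
# In-memory buffers stand in for real output files here.
outputs = {}
for day, rows in sorted(partitions.items()):
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    outputs[day] = out.getvalue()

print(sorted(outputs))  # ['2024-01-01', '2024-01-02']
print(outputs["2024-01-01"].splitlines())
# ['date,amount', '2024-01-01,10', '2024-01-01,30']
```

For a report too large to group in memory, one option is to keep one open writer per partition value and stream rows to it, at the cost of one open file handle per day.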