INNOVATION

Optimizing Large-Scale Data Analysis Pipelines with Pandas

Bùi Đăng Minh • Friday, June 12, 2026 • 7 min read

Pandas is one of the most widely used data analysis libraries in Python. However, once a dataset grows beyond a few GB, default Pandas operations can consume large amounts of RAM and slow the system down significantly.

1. Downcasting Data Types

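By default, Pandas stores integer columns as int64 and floating-point columns as float64, which is 8 bytes per value. Downcasting to a narrower type cuts that proportionally: int16 saves 75% per column, int8 saves 87.5%, and float32 saves 50%. The narrower type must still cover the column's value range, and float32 keeps only about 7 significant decimal digits.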

import pandas as pd
import numpy as np

# Read sample data
df = pd.read_csv('huge_dataset.csv')

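# Optimize numerical types.
# Integer and float columns are handled separately: running both downcasts
# on every numeric column can silently turn an integer column into float32.
for col in df.select_dtypes(include=[np.integer]).columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
for col in df.select_dtypes(include=[np.floating]).columns:
    df[col] = pd.to_numeric(df[col], downcast='float')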
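
To see how much a downcast actually saves, one option is to compare `memory_usage(deep=True)` before and after. The sketch below uses a small synthetic DataFrame in place of `huge_dataset.csv`; the column names `age` and `price` are made up for illustration, and the exact byte counts will depend on your data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (hypothetical columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(0, 100, size=100_000),   # int64 by default, fits in int8
    "price": rng.random(100_000) * 1000,         # float64 by default
})

before = df.memory_usage(deep=True).sum()

# Downcast integer and float columns separately
for col in df.select_dtypes(include=[np.integer]).columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
for col in df.select_dtypes(include=[np.floating]).columns:
    df[col] = pd.to_numeric(df[col], downcast="float")

after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"{before:,} bytes -> {after:,} bytes")
```

Measuring on a sample like this before converting the full dataset is a cheap way to check that the narrower types are worth it and that no column loses precision you care about.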

2. Using the Category Data Type

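For text columns whose values repeat many times (such as gender, country, or order status), convert them to the 'category' type. Pandas then stores each distinct value once and keeps a small integer code per row, which reduces memory and can speed up sorting and grouping. The benefit shrinks as the number of distinct values approaches the number of rows.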

# Convert object column to category
df['gender'] = df['gender'].astype('category')
df['status'] = df['status'].astype('category')

# Inspect memory usage (df.info() prints its report itself and returns None)
df.info(memory_usage='deep')
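
As a rough illustration of the saving, the following sketch compares the memory footprint of the same column stored as plain text and as a category. The `status` Series and its three labels are hypothetical sample data, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct labels repeated many times
status = pd.Series(["pending", "shipped", "delivered"] * 100_000)

as_text = status.memory_usage(deep=True)
cat = status.astype("category")
as_category = cat.memory_usage(deep=True)

print(f"text:     {as_text:,} bytes")
print(f"category: {as_category:,} bytes")

# A category column keeps each label once, plus one small integer code per row
print(cat.cat.categories.tolist())
print(cat.cat.codes.dtype)
```

With only three distinct labels, each row needs just a one-byte code, which is where the reduction comes from.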

3. Vectorization over Loops

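Iterating over DataFrame rows with a `for` loop or `.iterrows()` is extremely slow: every row is handled by the Python interpreter, and `.iterrows()` also builds a new Series object for each row. Instead, use vectorized operations, which apply one operation to a whole column at once in compiled C code. The speedup comes from avoiding per-row Python overhead, not from running in parallel.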

# BAD - Iterating through rows using loops
for idx, row in df.iterrows():
    df.at[idx, 'new_col'] = row['val'] * 2

# GOOD - Vectorized operation
df['new_col'] = df['val'] * 2
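
The gap is easy to measure yourself. The sketch below times both approaches on a small synthetic column and checks that they produce the same values. The 20,000-row size is an arbitrary choice, and timings will vary by machine and Pandas version.

```python
import time

import numpy as np
import pandas as pd

# Small synthetic column; the row count is arbitrary
df = pd.DataFrame({"val": np.arange(20_000, dtype="int64")})

# Row-by-row loop: each iteration runs in the Python interpreter
start = time.perf_counter()
looped = []
for _, row in df.iterrows():
    looped.append(row["val"] * 2)
loop_seconds = time.perf_counter() - start

# Vectorized: one operation over the whole column
start = time.perf_counter()
vectorized = df["val"] * 2
vector_seconds = time.perf_counter() - start

same = np.array_equal(vectorized.to_numpy(), np.array(looped))
print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vector_seconds:.4f} s")
print(f"same result: {same}")
```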

Conclusion

By downcasting numeric types, using categories for repeated text columns, and replacing row-by-row loops with vectorized operations, your pipelines can process millions of records up to 10x faster and use considerably less memory. The actual gain depends on the data and the workload, so measure before and after each change.