Advanced techniques to efficiently process and load data
In this story, I’d like to talk about the Pandas features I often rely on in the ETL applications I write to process data. We will cover exploratory data analysis, data cleaning, and data frame transformations, and I’ll demonstrate some of my favorite techniques for optimizing memory usage and efficiently processing large amounts of data with this library.

Working with relatively small datasets in Pandas is rarely a problem: it manages data in data frames with ease and provides a very convenient set of commands for processing it. For transformations on much larger data frames (1 GB and above), I would normally use Spark and distributed compute clusters. Spark can handle terabytes and even petabytes of data, but running all that hardware will probably cost a lot of money. This is why Pandas can be a better choice when we need to process medium-sized datasets in environments with limited memory resources.
Pandas and Python generators
In one of my previous stories, I explained how to efficiently process data using generators in Python (1).
This is a simple trick to optimize memory usage. Imagine we have a huge dataset somewhere in external storage: a database, or just a large CSV file. We need to process this 2-3 TB file and apply a transformation to each row of data it contains. Let’s say the service performing this task has only 32 GB of memory. That limits how much data we can load: we won’t be able to read the entire file into memory and split it with Python’s simple split('\n') operator. The solution is to process the file line by line, yielding each line so that memory is freed before we move on to the next one, as the sketch below shows. This can help us create a constant flow of ETL data to the final destination of our data pipeline. It could be anything: a cloud storage bucket, another database, a data warehouse (DWH) solution, a streaming topic or something else…
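Here is a minimal sketch of the idea, combining a Python generator with Pandas. The file name events.csv, the chunk size, and the transform function are illustrative assumptions, not part of any specific pipeline; Pandas’ read_csv does, however, genuinely return an iterator of data frames when given a chunksize.

import pandas as pd

# Illustrative chunk size: tune it to the memory actually available.
CHUNK_SIZE = 100_000  # rows per chunk

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical placeholder for the per-chunk transformation logic.
    return df

def process_large_csv(path: str):
    # read_csv with chunksize returns an iterator of DataFrames,
    # so only one chunk lives in memory at a time.
    for chunk in pd.read_csv(path, chunksize=CHUNK_SIZE):
        # yield keeps the pipeline lazy: the next chunk is read only
        # when the consumer asks for it, keeping memory usage flat.
        yield transform(chunk)

if __name__ == "__main__":
    for result in process_large_csv("events.csv"):  # assumed file name
        # Placeholder sink: in a real pipeline this could write to a
        # cloud bucket, a database, a DWH table, or a streaming topic.
        print(len(result))

Because the generator yields one chunk at a time, peak memory usage stays roughly constant regardless of the file size, which is exactly what a 32 GB machine facing a multi-terabyte file needs.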