Modin for Pandas

The Pandas package wasn’t built for parallel computation, so by default it cannot use multiple CPUs to speed up its operations. This is especially critical when dealing with a large dataset. In such cases, you can take advantage of Modin to make operations faster. Modin runs these operations in parallel on all available CPUs, or on as many as you specify, whereas Pandas uses a single CPU by default. 
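To see what Modin automates, here is a minimal sketch of the underlying idea: split a DataFrame into chunks and apply a function to each chunk concurrently. The column name "value", the doubling function, and the chunk count are hypothetical; a thread pool is used here only to keep the sketch self-contained, while Modin uses separate worker processes (via Ray or Dask) for true multi-CPU parallelism.

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def double_values(chunk):
    # Stand-in for any per-chunk computation.
    return chunk.assign(value=chunk["value"] * 2)

def split_rows(df, n_chunks):
    # Row-wise partitioning into roughly equal chunks.
    size = -(-len(df) // n_chunks)  # ceiling division
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]

def chunked_apply(df, func, n_chunks=4):
    chunks = split_rows(df, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = list(pool.map(func, chunks))
    return pd.concat(results)

df = pd.DataFrame({"value": range(8)})
out = chunked_apply(df, double_values)
print(out["value"].tolist())  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Modin does this splitting, dispatching, and re-assembling for you, behind the ordinary Pandas API.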

In this article, let’s look at how you can make your Pandas operations faster by changing one line in your Python code.  

Modin Installation 

Modin can be installed using either pip or conda. It requires Ray or Dask to be installed. If you are on Windows, you have to choose Dask because Ray isn’t supported on that platform. If you are on Linux, you can choose either. However, you don’t need any Dask or Ray experience: Modin simply uses one of them as its backend engine. It abstracts away the parallelization so that you only have to worry about writing your Pandas code the way you are already used to. 

Modin uses Ray by default. Ray is an open-source framework that provides an API for building distributed applications. Ray runs on a single machine and can also be programmed to scale to hundreds of machines. 

Dask is currently an experimental backend engine in Modin. It is also an open-source distributed computing library that is written in Python. Dask provides the ability to perform actions that require a lot of computation by distributing the workload to all available CPUs.  

Since Dask can also be used for loading large DataFrames, questions have arisen as to why one should use Modin and not Dask. There are a couple of subtle differences between the two. One is that Dask doesn’t implement the entire Pandas DataFrame API, whereas Modin aims to implement the Pandas API in its entirety. This means that you will find many Pandas functions missing in Dask. The other difference is that the Dask API is lazy, meaning function calls are not computed immediately: to see the result of an operation in Dask, you have to call the compute() method on it. In Modin, the result of an operation is available immediately. Modin can therefore be used as a drop-in replacement for Pandas: you can take your old notebooks, change the import statement, and everything will work as if nothing changed. 

You can install modin with conda using the following command:

$ conda install -c conda-forge modin

Use any of the following commands to install it via `pip`:

pip install modin[ray] # Install Modin dependencies and Ray to run on Ray
pip install modin[dask] # Install Modin dependencies and Dask to run on Dask
pip install modin[all] # Install all of the above

Modin Architecture 

Before you start using it, let’s take a moment to look at its system architecture. Here is the architecture as represented in the official documentation:

Diagram sourced from https://modin.readthedocs.io/en/latest/developer/architecture.html

The system is logically separated into different layers, which makes it easy to replace the components in each layer. Let’s mention these layers briefly: 

  • APIs: the user-facing part of Modin. At the moment, the Pandas API receives the most focus and is the only stable API. 
  • Query Compiler: receives queries from the Pandas API and passes them to the Modin DataFrame. The Pandas API ensures that the data passed to the Query Compiler is clean.
  • Modin DataFrame: the equivalent of the Pandas DataFrame, only faster. 
  • Execution engine: responsible for performing the computations on the data. 

There are other parts of the framework that are not visualized in the image above. One of them is the partition manager. It is responsible for altering the shape and size of data partitions based on the operation in question. Partitions are used to manage different subsets of a DataFrame. A DataFrame is usually partitioned row and column-wise. 
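The row- and column-wise partitioning can be illustrated with a toy example. The 2×2 grid below is a deliberate simplification of the block partitioning Modin's partition manager maintains internally; the column names and split points are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8],
                   "c": [9, 10, 11, 12], "d": [13, 14, 15, 16]})

# Split the DataFrame into a 2x2 grid of blocks: rows are split in
# half and columns are split in half.
row_mid, col_mid = len(df) // 2, len(df.columns) // 2
partitions = [
    [df.iloc[:row_mid, :col_mid], df.iloc[:row_mid, col_mid:]],
    [df.iloc[row_mid:, :col_mid], df.iloc[row_mid:, col_mid:]],
]

# Each block can be handed to a different worker for processing.
print([[block.shape for block in row] for row in partitions])
# [[(2, 2), (2, 2)], [(2, 2), (2, 2)]]
```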

Using Modin 

Modin detects the engine you have installed and uses it for parallelization. At the moment, it supports 93% of the Pandas API, which covers the most popular Pandas functions. To start using it, you only need to import Pandas from Modin; after that, you continue with your normal Pandas code. 

import modin.pandas as pd

For instance, here is a comparison between reading in a CSV file with Modin and normal Pandas.

%%time
import pandas as pds
dff = pds.read_csv("kiva_loans.csv")

%%time
df = pd.read_csv("kiva_loans.csv")  # pd is modin.pandas, imported above

You can see that `modin.pandas` loads the file faster than normal Pandas. `modin.pandas` takes 967 milliseconds while Pandas takes 1.67 seconds.


Defaulting to Pandas

Methods that are not implemented in the library default to the normal Pandas implementations. This ensures that your code runs as usual, but it can come with performance trade-offs for those functions. For example, the `hist` function defaults to Pandas. 

df["term_in_months"].hist()

A complete list of supported and unsupported Pandas methods can be found in the Modin documentation.

Advanced Usage

If you are an advanced user, you can set up your Ray or Dask environment to your preference. It is, however, important not to change your compute engine after importing `modin`, as doing so could lead to unexpected behavior. For example, you can set up Dask with the number of workers you need, and `modin` will then connect to this Dask client. 

from distributed import Client
client = Client(n_workers=6)
import modin.pandas as pd

The process is the same if you are using Ray. 

import ray
ray.init(plasma_directory="/path/to/custom/dir", object_store_memory=10**10)
# Modin will connect to the existing Ray environment
import modin.pandas as pd

Final Thoughts 

In this article, you have seen how you can speed up Pandas for your data analysis through the use of the Modin library. This is very important when loading large files or when running compute- and time-intensive operations, such as group operations on large datasets. Armed with this information, your Pandas usage should now get an upgrade. 

Resources

Notebook used in the article.