How to Make Pandas 2.0 Up to 30x Faster

Pandas is a popular library for data manipulation and analysis in Python. However, as datasets grow larger, Pandas can become slow and inefficient. The Extension Array Interface, introduced in the pandas 0.23/0.24 releases and matured through 1.0 and 2.0, allows custom data types to be used within Pandas. In this blog, we will explore how to use the Extension Array Interface, together with a few complementary techniques, to make Pandas 2.0 up to 30x faster, with example code snippets.

Use Typed Arrays

One of the key benefits of the Extension Array Interface is the ability to use typed arrays. Typed arrays are arrays where the data type of each element is explicitly defined, unlike in traditional Python lists. By using typed arrays, Pandas can avoid the overhead of dynamically allocating memory and casting data types. Here’s an example of creating a typed array using the NumPy library:

import numpy as np
import pandas as pd

# create a NumPy array with a specific data type
arr = np.array([1, 2, 3], dtype=np.int32)

# convert the NumPy array to a Pandas Series with a nullable integer dtype
s = pd.Series(arr, dtype='Int32')

In this example, we create a NumPy array with the data type ‘int32’ and then convert it to a Pandas Series with the data type ‘Int32’ (note the capital ‘I’), Pandas’ nullable integer extension type, which can hold missing values as pd.NA without falling back to float64 or object dtype.
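
To see why the nullable dtype matters, here is a small sketch (illustrative values only): with the classic NumPy-backed dtype a single missing value silently upcasts the whole column to float64, while the nullable ‘Int32’ dtype keeps integer storage and represents the gap as pd.NA.

import pandas as pd

# classic NumPy-backed integers: one missing value upcasts the column to float64
s_numpy = pd.Series([1, 2, None])
print(s_numpy.dtype)      # float64

# nullable extension integers: the missing value becomes pd.NA and the
# data stays stored as 32-bit integers
s_nullable = pd.Series([1, 2, None], dtype='Int32')
print(s_nullable.dtype)   # Int32
print(s_nullable[2])      # <NA>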

Use Custom Extension Arrays

Another benefit of the Extension Array Interface is the ability to create custom extension arrays. Custom extension arrays can be used to efficiently store and manipulate data that doesn’t fit into traditional data types. Here’s an example of creating a custom extension array for storing IP addresses:

import ipaddress

import numpy as np
import pandas as pd

# Simplified sketch: a complete extension array must also define a custom
# ExtensionDtype and methods such as isna, take, copy, nbytes, and
# _concat_same_type (see the Pandas extension array documentation).
class IPAddressArray(pd.api.extensions.ExtensionArray):
    def __init__(self, data):
        # store the parsed IP address objects in an object-dtype NumPy array
        self.data = np.asarray(list(data), dtype=object)

    @property
    def dtype(self):
        # a full implementation would return a custom IPAddressDtype here
        return pd.StringDtype()

    @classmethod
    def _from_sequence(cls, data, dtype=None, copy=False):
        # materialize the parsed addresses into a list (not a lazy generator)
        return cls([ipaddress.ip_address(x) for x in data])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # scalar access only; a full implementation must also handle slices
        # and boolean/integer array indexers
        return str(self.data[index])

# create a Pandas Series backed by the custom extension array
s = pd.Series(IPAddressArray._from_sequence(['192.168.0.1', '8.8.8.8']))

In this example, we sketch a custom extension array for storing IP addresses using the ipaddress module. The class implements a few of the methods Pandas requires of an extension array, including __init__, __len__, __getitem__, and _from_sequence; as the comments note, a production implementation must also define a custom ExtensionDtype plus methods such as isna, take, copy, nbytes, and _concat_same_type. We then create a Pandas Series by passing a list of IP address strings to _from_sequence, which parses them into ipaddress objects.

Use Fast Pandas Operations

In addition to using typed arrays and custom extension arrays, there are several Pandas operations that are optimized for performance. Here are some examples:

  • Use a single vectorized .loc selection (label or boolean mask) instead of looping over rows: one .loc call is evaluated in compiled code and is far faster than filtering a large DataFrame row by row in Python.
  • Use .at instead of .loc for scalar value access and assignment: .at targets exactly one cell and skips the more general alignment machinery of .loc, so it is faster for single-value reads and writes (see the sketch after the .loc example below).
  • Use .isin instead of .apply(lambda x: x in my_list): .isin performs the membership test as a single vectorized operation instead of calling a Python lambda once per row (also shown in that sketch).

Here’s an example of using .loc with a boolean mask:

import pandas as pd

# create a DataFrame with a range of integers
df = pd.DataFrame({'a': range(1000000)})

# use .loc with a boolean mask to select the matching rows
df.loc[df['a'] == 999999]

In this example, we create a DataFrame with a million integers and use .loc with the boolean mask df['a'] == 999999 to select the matching row. Both the comparison and the .loc lookup are vectorized, so this is far faster than checking each row in a Python-level loop.
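
As a quick sketch of the other two tips (the column names and values here are illustrative, not from any particular dataset), .at assigns a single cell by label without the extra machinery of .loc, and .isin replaces a per-row membership test with one vectorized call:

import pandas as pd

df = pd.DataFrame({'a': range(1000000), 'b': 0})

# scalar assignment: .at touches exactly one cell and skips the
# alignment logic that .loc performs
df.at[999999, 'b'] = 42

# membership test: one vectorized .isin call instead of
# df['a'].apply(lambda x: x in targets)
targets = [10, 500000, 999999]
mask = df['a'].isin(targets)
print(mask.sum())   # 3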

Use Cython or Numba

If you need to perform custom calculations or operations that are not available in Pandas, you can use Cython or Numba to speed up your code. Cython is a programming language that is a superset of Python, while Numba is a just-in-time (JIT) compiler for Python. Here’s an example of using Numba to speed up a custom function:

import numba as nb
import pandas as pd

# define a custom function and compile it with Numba
@nb.njit
def custom_function(a):
    return a * 2

# create a Pandas Series
s = pd.Series(range(1000000))

# pass the Series' underlying NumPy array to the compiled function,
# then wrap the result back into a Series
result = pd.Series(custom_function(s.to_numpy()), index=s.index)

In this example, we define a custom function that multiplies a value by 2 and compile it with Numba’s @nb.njit decorator. Rather than calling it once per element through .apply(), which would pay Python-level dispatch overhead on every row, we pass the Series’ underlying NumPy array to the compiled function and wrap the result back into a Series, so the whole loop over a million values runs in machine code.
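
Numba pays off most for loop-heavy logic that has no built-in vectorized equivalent. Here is a hypothetical sketch (the function name, cap value, and data are made up for illustration) of a running total that resets whenever it exceeds a cap, the kind of sequential dependency that is awkward to express with plain NumPy operations:

import numba as nb
import numpy as np
import pandas as pd

# hypothetical custom calculation: a running total that resets to zero
# whenever it exceeds a cap -- there is no built-in Pandas method for this
@nb.njit
def capped_running_total(values, cap):
    out = np.empty_like(values)
    total = 0
    for i in range(values.shape[0]):
        total += values[i]
        if total > cap:
            total = 0          # reset once the cap is exceeded
        out[i] = total
    return out

s = pd.Series(np.random.randint(0, 10, size=1000000))

# the explicit loop is compiled to machine code by Numba, so it runs far
# faster than the same loop written in pure Python
result = pd.Series(capped_running_total(s.to_numpy(), 50), index=s.index)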

Use Parallel Processing

If you have a large dataset and need to perform operations that can be parallelized, you can use parallel processing to speed up your code. A convenient option is the Dask library, whose dask.dataframe module mirrors the Pandas API and is designed for parallel computing on larger-than-memory datasets. Here’s an example of using Dask to parallelize a Pandas-style operation:

import dask.dataframe as dd
import pandas as pd

# create a Dask DataFrame with a range of integers, split into 4 partitions
df = dd.from_pandas(pd.DataFrame({'a': range(1000000)}), npartitions=4)

# use .loc with a boolean mask, then trigger the computation
df.loc[df['a'] == 999999].compute()

In this example, we wrap a Pandas DataFrame in a Dask DataFrame split across four partitions and use .loc with a boolean mask, just as before. Dask builds the operation lazily, and calling .compute() executes it in parallel across the partitions. For a toy million-row frame the scheduling overhead can outweigh the gain, but for datasets that are much larger, or that do not fit in memory at all, this parallel execution can be substantially faster than plain Pandas.
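
Dask’s real payoff comes when the data no longer fits comfortably in memory. As a hedged sketch (the file path and column names here are hypothetical), you can point dd.read_csv at a glob of CSV files and aggregate across all of them in parallel without ever loading the full dataset at once:

import dask.dataframe as dd

# hypothetical path: a directory of CSV files too large to load at once
df = dd.read_csv('data/part-*.csv')

# the groupby/mean is planned lazily and executed partition by partition,
# in parallel, when .compute() is called
result = df.groupby('a')['b'].mean().compute()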

In conclusion, Pandas’ Extension Array Interface provides a powerful tool for optimizing performance. By combining typed arrays, custom extension arrays, fast vectorized Pandas operations, Cython or Numba, and parallel processing with Dask, you can make Pandas 2.0 up to 30x faster and handle larger datasets with ease.