DataFrame
To write a dataframe-agnostic function, follow these steps:
- Opt in to the Dataframe API by calling __dataframe_consortium_standard__ on your dataframe.
- Express your logic using methods from the Dataframe API. You may want to look at the official examples for inspiration.
- If you need to return a dataframe to the user in its original library, call DataFrame.dataframe.
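The opt-in step can be wrapped in a small guard, so that objects which do not support the Dataframe API fail with a readable message. This is only a minimal sketch: the helper name to_standard and the error text are illustrative and not part of the standard, and the default api_version is simply the value used in the example on this page.

```python
def to_standard(df, api_version='2023.11-beta'):
    """Opt in to the Dataframe API standard, or raise a readable error.

    `to_standard` is a hypothetical helper name, not part of the standard.
    """
    # Look up the opt-in entry point without assuming it exists.
    opt_in = getattr(df, '__dataframe_consortium_standard__', None)
    if opt_in is None:
        raise TypeError(
            f"{type(df).__name__} does not implement "
            "__dataframe_consortium_standard__"
        )
    # Delegate to the library's own implementation of the standard.
    return opt_in(api_version=api_version)
```

Whether a given pandas or Polars version exposes the entry point depends on the installed libraries, so a guard like this mainly helps with error reporting.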
Let's try writing a simple example.
Example 1: group-by and mean
Make a Python file t.py with the following content:
def func(df):
    # 1. Opt-in to the API Standard
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    # 2. Use methods from the API Standard spec
    df_s = df_s.group_by('a').mean()
    # 3. Return a dataframe from the user's original library
    return df_s.dataframe
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})
print(func(df))
   a    b
0  1  4.5
1  2  6.0
import polars as pl
df = pl.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})
print(func(df))
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)
AGGREGATE
[col("b").mean()] BY [col("a")] FROM
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"
If you look at the two outputs, you'll see that:
- For pandas, the output is a pandas.DataFrame.
- But for Polars, the output is a polars.LazyFrame.
This is because the Dataframe API only has a single DataFrame class, so for Polars, all operations are done lazily in order to make full use of Polars' query engine.
If you want to convert the result to a polars.DataFrame, it is the caller's responsibility to call .collect(). The modified example below shows this:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})
print(func(df))
   a    b
0  1  4.5
1  2  6.0
import polars as pl
df = pl.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})
print(func(df).collect())
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 4.5 │
│ 2   ┆ 6.0 │
└─────┴─────┘
└─────┴─────┘
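Callers who handle results from several libraries may prefer not to special-case Polars by hand. The sketch below is one possible approach, not something the Dataframe API provides: the helper name materialize is hypothetical, and it assumes that any object with a callable collect attribute (such as polars.LazyFrame) is lazy and that everything else is already eager.

```python
def materialize(result):
    """Return an eager result, collecting it first if it looks lazy.

    Hypothetical helper: duck-types on a callable `collect` attribute
    rather than importing any particular dataframe library.
    """
    collect = getattr(result, 'collect', None)
    # Lazy objects expose collect(); eager ones are returned unchanged.
    return collect() if callable(collect) else result
```

Under that assumption, print(materialize(func(df))) can be used for both the pandas and the Polars input above.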