Column
In dataframe.md, you learned how to write a dataframe-agnostic function.
We only used DataFrame methods there - but what if we need to operate on its columns?
Extracting a column
Example 1: filter based on a column's values
def my_func(df):
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df_s = df_s.filter(df_s.col('a') > 0)
    return df_s.dataframe
import pandas as pd
df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
   a  b
0  1  5
1  3 -3
import polars as pl
df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df).collect())
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
│ 3   ┆ -3  │
└─────┴─────┘
Example 2: multiply a column's values by a constant
Let's write a dataframe-agnostic function which multiplies the values in column 'a' by 2.
def my_func(df):
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df_s = df_s.assign(df_s.col('a')*2)
    return df_s.dataframe
import pandas as pd
df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
   a  b
0 -2  3
1  2  5
2  6 -3
import polars as pl
df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df).collect())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ -2  ┆ 3   │
│ 2   ┆ 5   │
│ 6   ┆ -3  │
└─────┴─────┘
Note that column 'a' was overwritten. If we had wanted to add a new column called 'c' containing the values of column 'a' multiplied by 2, we could have used Column.rename:
def my_func(df):
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df_s = df_s.assign((df_s.col('a')*2).rename('c'))
    return df_s.dataframe
import pandas as pd
df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
   a  b  c
0 -1  3 -2
1  1  5  2
2  3 -3  6
import polars as pl
df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df).collect())
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ -1  ┆ 3   ┆ -2  │
│ 1   ┆ 5   ┆ 2   │
│ 3   ┆ -3  ┆ 6   │
└─────┴─────┴─────┘
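One way to picture why renaming changes the result is to treat a dataframe as a mapping from column names to values. The following is a minimal pure-Python sketch under that assumption; `toy_assign` is a hypothetical helper written for illustration, not a function from pandas, Polars, or the dataframe standard, and it only mimics the behaviour visible in the outputs above.

```python
# Toy model only: toy_assign is a hypothetical helper, not part of any library.
def toy_assign(table, name, values):
    """Return a new table in which `name` maps to `values`."""
    # An existing key keeps its position and gets new values;
    # a new key is appended after the existing columns.
    return {**table, name: values}


table = {'a': [-1, 1, 3], 'b': [3, 5, -3]}
doubled = [v * 2 for v in table['a']]

# Same name as an existing column: the column is overwritten.
overwritten = toy_assign(table, 'a', doubled)
print(overwritten)  # {'a': [-2, 2, 6], 'b': [3, 5, -3]}

# New name: the original column is kept and a new one is added.
added = toy_assign(table, 'c', doubled)
print(added)  # {'a': [-1, 1, 3], 'b': [3, 5, -3], 'c': [-2, 2, 6]}
```

In this picture, multiplying column 'a' by 2 yields values that still carry the name 'a', so assigning them replaces the original; renaming to 'c' first gives them a name that does not collide.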
Example 3: cross-dataframe column comparisons
You might expect a function like the following to just work:
def my_func(df1, df2):
    df1_s = df1.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df2_s = df2.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df1_s = df1_s.filter(df2_s.col('a') > 0)
    return df1_s.dataframe
However, if you tried passing two different dataframes to this function, you'd get a message saying something like:
cannot compare columns from different dataframes
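This is because Columns for the Polars implementation are backed by polars.Exprs, which are not tied to a particular dataframe: an expression such as pl.col('a') refers to column 'a' of whichever dataframe it is evaluated against.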
The error is there to ensure that the Polars and pandas implementations behave in the same way.
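To make the reasoning concrete, here is a minimal pure-Python sketch of a column backed by a deferred expression. `ToyExpr`, `ToyColumn`, and `ToyFrame` are hypothetical stand-ins invented for this illustration; they are not how Polars or the standard's implementations are written, and the exception type is an assumption. The sketch only shows why an expression taken from one dataframe would silently read another dataframe's data, and how an ownership check can turn that into an error.

```python
# Toy model only: ToyExpr, ToyColumn and ToyFrame are hypothetical stand-ins,
# not classes from Polars, pandas, or the dataframe standard.

class ToyExpr:
    """A deferred computation: it knows what to compute, but holds no data."""

    def __init__(self, evaluate):
        self.evaluate = evaluate  # callable: dict of columns -> list of values

    def __gt__(self, threshold):
        # Build a new deferred expression; nothing is computed yet.
        return ToyExpr(lambda data: [v > threshold for v in self.evaluate(data)])


class ToyColumn:
    """Pairs an expression with the frame it was extracted from."""

    def __init__(self, owner, expr):
        self.owner = owner
        self.expr = expr

    def __gt__(self, threshold):
        return ToyColumn(self.owner, self.expr > threshold)


class ToyFrame:
    def __init__(self, data):
        self.data = data

    def col(self, name):
        return ToyColumn(self, ToyExpr(lambda data: data[name]))

    def filter(self, mask):
        # Guard: reject columns that were extracted from a different frame.
        if mask.owner is not self:
            raise ValueError("cannot compare columns from different dataframes")
        keep = mask.expr.evaluate(self.data)
        return ToyFrame(
            {name: [v for v, k in zip(values, keep) if k]
             for name, values in self.data.items()}
        )


df1 = ToyFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
df2 = ToyFrame({'a': [5, 4, 2], 'b': [5, -3, 0]})

# Without the guard, df2's expression would silently read df1's column 'a':
unguarded = (df2.col('a') > 0).expr.evaluate(df1.data)
print(unguarded)  # [False, True, True]

# With the guard, the mismatch is reported instead:
try:
    df1.filter(df2.col('a') > 0)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)  # cannot compare columns from different dataframes

# Filtering a frame by one of its own columns works as usual.
same_frame = df1.filter(df1.col('a') > 0)
print(same_frame.data)  # {'a': [1, 3], 'b': [5, -3]}
```

Note that every value in `df2`'s column 'a' is positive, yet the unguarded mask drops the first row: it was computed from `df1`'s values. Raising an error avoids that silent mismatch.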
If you wish to compare columns from different dataframes, you should first join the dataframes.
For example:
def my_func(df1, df2):
    df1_s = df1.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df2_s = df2.__dataframe_consortium_standard__(api_version='2023.11-beta')
    # Bring df2's column 'a' into df1 under a new name by joining on 'b'.
    df1_s = df1_s.join(
        df2_s.rename({'a': 'a_right'}),
        left_on='b',
        right_on='b',
        how='inner',
    )
    # Both columns now live in the same dataframe, so the filter is allowed.
    df1_s = df1_s.filter(df1_s.col('a_right') > 0)
    return df1_s.dataframe
import pandas as pd
df1 = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
df2 = pd.DataFrame({'a': [5, 4], 'b': [5, -3]})
print(my_func(df1, df2))
   a  b  a_right
0  1  5        5
1  3 -3        4
import polars as pl
df1 = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
df2 = pl.DataFrame({'a': [5, 4], 'b': [5, -3]})
print(my_func(df1, df2).collect())
shape: (2, 3)
┌─────┬─────┬─────────┐
│ a   ┆ b   ┆ a_right │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64     │
╞═════╪═════╪═════════╡
│ 1   ┆ 5   ┆ 5       │
│ 3   ┆ -3  ┆ 4       │
└─────┴─────┴─────────┘