Column

In dataframe.md, you learned how to write a dataframe-agnostic function.

We only used DataFrame methods there, but what if we need to operate on a dataframe's columns?

Extracting a column

Example 1: filter based on a column's values

def my_func(df):
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df_s = df_s.filter(df_s.col('a') > 0)
    return df_s.dataframe

import pandas as pd

df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
   a  b
0  1  5
1  3 -3

import polars as pl

df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df).collect())
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
│ 3   ┆ -3  │
└─────┴─────┘
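
To make the filtering step concrete, here is a minimal plain-Python sketch of the idea behind it: comparing a column with a constant yields one boolean per row, and filtering keeps the rows where that boolean is true. The `filter_rows` helper and the dict-of-lists layout are hypothetical and only illustrate the semantics; they are not part of the Standard or of how any backend implements it.

```python
def filter_rows(data, mask):
    """Keep the rows of `data` (a dict of equal-length lists) where `mask` is True."""
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in data.items()
    }


data = {'a': [-1, 1, 3], 'b': [3, 5, -3]}

# Plays the role of df_s.col('a') > 0: one boolean per row.
mask = [value > 0 for value in data['a']]

filtered = filter_rows(data, mask)
print(mask)      # [False, True, True]
print(filtered)  # {'a': [1, 3], 'b': [5, -3]}
```

The surviving rows match the pandas and Polars outputs above.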

Example 2: multiply a column's values by a constant

Let's write a dataframe-agnostic function which multiplies the values in column 'a' by 2.

def my_func(df):
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df_s = df_s.assign(df_s.col('a')*2)
    return df_s.dataframe

import pandas as pd

df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
   a  b
0 -2  3
1  2  5
2  6 -3

import polars as pl

df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df).collect())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ -2  ┆ 3   │
│ 2   ┆ 5   │
│ 6   ┆ -3  │
└─────┴─────┘

Note that column 'a' was overwritten. To instead add a new column 'c' holding the values of column 'a' multiplied by 2, we could have used Column.rename:

def my_func(df):
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df_s = df_s.assign((df_s.col('a')*2).rename('c'))
    return df_s.dataframe

import pandas as pd

df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
   a  b  c
0 -1  3 -2
1  1  5  2
2  3 -3  6

import polars as pl

df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df).collect())
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ -1  ┆ 3   ┆ -2  │
│ 1   ┆ 5   ┆ 2   │
│ 3   ┆ -3  ┆ 6   │
└─────┴─────┴─────┘
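
The difference between overwriting 'a' and adding 'c' comes down to the name the new column carries when it is assigned. The sketch below models that with a plain dict, assuming columns are keyed by name; the `assign` helper is hypothetical and is not the Standard's implementation.

```python
def assign(data, name, values):
    """Return a copy of `data` with column `name` set to `values`.

    An existing name keeps its position and gets new values;
    a new name is appended after the existing columns.
    """
    result = dict(data)
    result[name] = list(values)
    return result


data = {'a': [-1, 1, 3], 'b': [3, 5, -3]}
doubled = [value * 2 for value in data['a']]

overwritten = assign(data, 'a', doubled)  # same name: 'a' is replaced
extended = assign(data, 'c', doubled)     # renamed to 'c': a column is added

print(overwritten)  # {'a': [-2, 2, 6], 'b': [3, 5, -3]}
print(extended)     # {'a': [-1, 1, 3], 'b': [3, 5, -3], 'c': [-2, 2, 6]}
```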

Example 3: cross-dataframe column comparisons

You might expect a function like the following to just work:

def my_func(df1, df2):
    df1_s = df1.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df2_s = df2.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df1_s = df1_s.filter(df2_s.col('a') > 0)
    return df1_s.dataframe

However, if you passed two different dataframes to this function, you would get an error with a message like:

cannot compare columns from different dataframes

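This is because Columns in the Polars implementation are backed by polars.Expr objects. An expression only names a column and holds no data of its own, so it can only be evaluated against the dataframe it is applied to. The error is raised for every backend so that the Polars and pandas implementations behave in the same way. If you wish to compare columns from different dataframes, you should first join the dataframes. For example: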

def my_func(df1, df2):
    df1_s = df1.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df2_s = df2.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df1_s = df1_s.join(
        df2_s.rename({'a': 'a_right'}),
        left_on='b',
        right_on='b',
        how='inner',
    )
    df1_s = df1_s.filter(df1_s.col('a_right') > 0)
    return df1_s.dataframe

import pandas as pd

df1 = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
df2 = pd.DataFrame({'a': [5, 4], 'b': [5, -3]})
print(my_func(df1, df2))
   a  b  a_right
0  1  5        5
1  3 -3        4

import polars as pl

df1 = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
df2 = pl.DataFrame({'a': [5, 4], 'b': [5, -3]})
print(my_func(df1, df2).collect())
shape: (2, 3)
┌─────┬─────┬─────────┐
│ a   ┆ b   ┆ a_right │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64     │
╞═════╪═════╪═════════╡
│ 1   ┆ 5   ┆ 5       │
│ 3   ┆ -3  ┆ 4       │
└─────┴─────┴─────────┘
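
To see why a column from one dataframe cannot be used to filter another, it may help to model a column as a lazy expression that remembers which dataframe it came from. The following is a toy sketch only: `ToyFrame` and `ToyColumn` are hypothetical names, and the real implementations may detect the mismatch differently.

```python
class ToyColumn:
    """Hypothetical lazy column: stores a name and a rule, never the values."""

    def __init__(self, parent, name, predicate=None):
        self.parent = parent        # the frame this column was taken from
        self.name = name
        self.predicate = predicate  # element-wise test, e.g. v > 0

    def __gt__(self, other):
        # Build a new lazy column; nothing is computed yet.
        return ToyColumn(self.parent, self.name, lambda v: v > other)


class ToyFrame:
    """Hypothetical stand-in for a Standard-compliant dataframe."""

    def __init__(self, data):
        self.data = data

    def col(self, name):
        return ToyColumn(self, name)

    def filter(self, mask):
        # A lazy column can only be resolved against its own frame.
        if mask.parent is not self:
            raise ValueError('cannot compare columns from different dataframes')
        keep = [mask.predicate(v) for v in self.data[mask.name]]
        return ToyFrame({
            name: [v for v, k in zip(values, keep) if k]
            for name, values in self.data.items()
        })


df1 = ToyFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
df2 = ToyFrame({'a': [5, 4], 'b': [5, -3]})

same_frame = df1.filter(df1.col('a') > 0).data  # works: column belongs to df1

try:
    df1.filter(df2.col('a') > 0)  # rejected: column belongs to df2
    error = None
except ValueError as exc:
    error = str(exc)

print(same_frame)  # {'a': [1, 3], 'b': [5, -3]}
print(error)       # cannot compare columns from different dataframes
```

Joining first, as in the example above, sidesteps the problem because every column used in the filter then belongs to the joined dataframe.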