Scalar

In column.md, you learned how to write a dataframe-agnostic function involving both dataframes and columns.

But what if we want to extract scalars as well?

Example 1: Center features

Let's try writing a function which, for each column, subtracts its mean.

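def my_func(df):
    # Convert the native dataframe to its Standard-compliant counterpart
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    # Subtract each column's mean (a Scalar) from the column itself
    new_columns = [col - col.mean() for col in df_s.iter_columns()]
    # Replace the original columns with the centered ones
    df_s = df_s.assign(*new_columns)
    # Return the underlying native dataframe
    return df_s.dataframe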

Let's run it:

import pandas as pd

df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
     a         b
0 -2.0  1.333333
1  0.0  3.333333
2  2.0 -4.666667

import polars as pl

df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df).collect())
shape: (3, 2)
┌──────┬───────────┐
│ a    ┆ b         │
│ ---  ┆ ---       │
│ f64  ┆ f64       │
╞══════╪═══════════╡
│ -2.0 ┆ 1.333333  │
│ 0.0  ┆ 3.333333  │
│ 2.0  ┆ -4.666667 │
└──────┴───────────┘

The output looks as expected. Calling mean() on a Column, for example df.col(column_name).mean(), returns a Scalar, which can be combined with a Column from the same dataframe. Just as we saw for Columns, Scalars from different dataframes cannot be compared directly: you'll first need to join the underlying dataframes.
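As a sanity check on the numbers printed above, here is a minimal sketch that redoes the centering arithmetic with the Python standard library only. It does not use the dataframe Standard at all, and the helper name `center` is made up for this illustration.

```python
from statistics import fmean


def center(data):
    """Subtract each column's mean from every value in that column.

    `data` maps column names to lists of numbers. This is a plain-Python
    stand-in for the arithmetic only, not the Standard's API.
    """
    centered = {}
    for name, values in data.items():
        column_mean = fmean(values)  # always a float, e.g. 1.0 for [-1, 1, 3]
        centered[name] = [value - column_mean for value in values]
    return centered


# Same toy data as in the example above.
result = center({"a": [-1, 1, 3], "b": [3, 5, -3]})
for name, values in result.items():
    # Round only for display, to mirror the six decimals shown by pandas/polars.
    print(name, [round(v, 6) for v in values])
```

Rounded to six decimals, the values should agree with the pandas and polars outputs shown earlier.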

Example 2: Store mean of each column as Python float

We saw in the above example that df.col(column_name).mean() returns a Scalar, which may be lazy. In particular, it's not a Python scalar. So, how would we force execution and store a Python scalar, in a dataframe-agnostic manner?

def my_func(df):
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    # We'll learn more about `persist` on the next page
    df_s = df_s.mean().persist()
    means = []
    for column_name in df_s.column_names:
        mean = float(df_s.col(column_name).get_value(0))
        means.append(mean)
    return means

Let's run it:

import pandas as pd

df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
[1.0, 1.6666666666666667]

import polars as pl

df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
[1.0, 1.6666666666666667]
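To get a feel for why a lazy Scalar is not the same thing as a Python float, the toy sketch below may help. `ToyLazyScalar` and `expensive_mean` are hypothetical names invented for this illustration; this is only an analogy for deferred computation and does not model `persist`, `get_value`, or any real library's Scalar class.

```python
from typing import Callable


class ToyLazyScalar:
    """Hypothetical stand-in for a lazy scalar (NOT the Standard's Scalar)."""

    def __init__(self, compute: Callable[[], float]) -> None:
        self._compute = compute  # stored, but not called yet

    def __add__(self, other: float) -> "ToyLazyScalar":
        # Arithmetic builds a new deferred computation, not a number.
        return ToyLazyScalar(lambda: self._compute() + other)

    def __float__(self) -> float:
        # Only an explicit conversion triggers the actual work.
        return float(self._compute())


calls = []  # records each time the underlying computation really runs


def expensive_mean() -> float:
    calls.append("ran")
    values = [3, 5, -3]
    return sum(values) / len(values)


lazy = ToyLazyScalar(expensive_mean) + 1
print(len(calls))  # nothing has been computed at this point

value = float(lazy)  # forcing the value runs the computation once
print(type(value).__name__, len(calls))
```

In this toy, nothing is computed until `float()` is called. The function above has the same shape: it first materialises the result and only then converts each value to a Python float.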

We'll learn more about DataFrame.persist on the next page.