Scalar
In column.md, you learned how to write a dataframe-agnostic function involving both dataframes and columns.
But what if we want to extract scalars as well?
Example 1: center features
Let's try writing a function which, for each column, subtracts its mean.
def my_func(df):
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    new_columns = [col - col.mean() for col in df_s.iter_columns()]
    df_s = df_s.assign(*new_columns)
    return df_s.dataframe
Let's run it:
import pandas as pd
df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
     a         b
0 -2.0  1.333333
1  0.0  3.333333
2  2.0 -4.666667
import polars as pl
df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df).collect())
shape: (3, 2)
┌──────┬───────────┐
│ a    ┆ b         │
│ ---  ┆ ---       │
│ f64  ┆ f64       │
╞══════╪═══════════╡
│ -2.0 ┆ 1.333333  │
│ 0.0  ┆ 3.333333  │
│ 2.0  ┆ -4.666667 │
└──────┴───────────┘
The output looks as expected. df.col(column_name).mean() returns a Scalar, which can be combined with a Column from the same dataframe. Just like we saw for Columns, scalars from different dataframes cannot be compared: you'll first need to join the underlying dataframes.
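As a cross-check on the numbers printed above, the same centering can be sketched in plain Python, with no dataframe library involved. The `center` helper and the dict-of-lists layout are illustrative assumptions, not part of the dataframe API standard; the point is only to show the arithmetic that `col - col.mean()` performs on each column.

```python
from statistics import fmean

# Same data as in the example above, held as a plain dict of lists.
data = {"a": [-1, 1, 3], "b": [3, 5, -3]}


def center(columns):
    """Subtract each column's mean from every value in that column.

    Hypothetical helper for illustration only.
    """
    centered = {}
    for name, values in columns.items():
        mean = fmean(values)  # plays the role of the column's mean scalar
        centered[name] = [value - mean for value in values]
    return centered


centered = center(data)
print(centered["a"])  # [-2.0, 0.0, 2.0]
print(centered["b"])
```

Up to rounding in the display, the values for column b should match the 1.333333, 3.333333 and -4.666667 shown by pandas and polars.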
Example 2: store the mean of each column as a Python float
We saw in the above example that df.col(column_name).mean() returns a Scalar, which may be lazy. In particular, it's not a Python scalar. So, how would we force execution and store a Python scalar, in a dataframe-agnostic manner?
def my_func(df):
    df_s = df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    # We'll learn more about `persist` on the next page
    df_s = df_s.mean().persist()
    means = []
    for column_name in df_s.column_names:
        mean = float(df_s.col(column_name).get_value(0))
        means.append(mean)
    return means
Let's run it:
import pandas as pd
df = pd.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
[1.0, 1.6666666666666667]
import polars as pl
df = pl.DataFrame({'a': [-1, 1, 3], 'b': [3, 5, -3]})
print(my_func(df))
[1.0, 1.6666666666666667]
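To make the idea of a "lazy" scalar more concrete, here is a minimal toy model in plain Python. `ToyLazyScalar`, `toy_mean` and `materialise` are hypothetical names invented for this sketch; they are not part of the dataframe API standard or of any backend, and real implementations may work quite differently. The sketch only illustrates why a value that merely describes a computation has to be forced before it can be stored as a Python float.

```python
from typing import Callable, Sequence


class ToyLazyScalar:
    """Hypothetical stand-in for a lazy scalar; not a real library class."""

    def __init__(self, compute: Callable[[], float]) -> None:
        self._compute = compute  # deferred computation, not yet run
        self.evaluated = False

    def materialise(self) -> float:
        # Force the deferred computation and return a plain Python float.
        self.evaluated = True
        return float(self._compute())


def toy_mean(values: Sequence[float]) -> ToyLazyScalar:
    # Building the scalar does no arithmetic; it only records what to do.
    return ToyLazyScalar(lambda: sum(values) / len(values))


lazy_mean = toy_mean([3, 5, -3])
was_evaluated_before = lazy_mean.evaluated  # False: nothing computed yet
python_mean = lazy_mean.materialise()       # now an ordinary float
print(was_evaluated_before, python_mean)
```

In this toy model, `materialise` plays roughly the role that forcing execution plays in `my_func` above: only after that step is there a concrete number to pass to `float`.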
We'll learn more about DataFrame.persist on the next page.