Dataframe object¶
A conforming implementation of the dataframe API standard must provide and support a dataframe object having the following methods, attributes, and behavior.
- class DataFrame(*args, **kwargs)¶
DataFrame object.
Note that this dataframe object is not meant to be instantiated directly by users of the library implementing the dataframe API standard. Rather, use the implementation's constructor functions, or work with an already-created dataframe object.
Python operator support
All arithmetic operators defined by the Python language, except for __matmul__, __neg__ and __pos__, must be supported for numerical data types.
All comparison operators defined by the Python language must be supported by the dataframe object for all data types for which those comparisons are supported by the builtin scalar types corresponding to a data type.
In-place operators must not be supported. All operations on the dataframe object are out-of-place.
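The out-of-place requirement means every operator returns a new dataframe and never mutates its operand. A minimal toy sketch (illustrative only, not a conforming implementation):

```python
# Toy frame whose arithmetic is out-of-place, as the standard requires.
# The class and its dict-of-columns layout are illustrative, not the standard's API.
class ToyFrame:
    def __init__(self, data):
        self.data = dict(data)

    def __add__(self, other):
        # Returns a NEW object; self is never mutated.
        return ToyFrame(
            {name: [v + other for v in col] for name, col in self.data.items()}
        )


df = ToyFrame({"a": [1, 2, 3]})
result = df + 1
assert df.data["a"] == [1, 2, 3]      # original unchanged
assert result.data["a"] == [2, 3, 4]  # new object holds the result
```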
Methods and Attributes
- __add__(other: AnyScalar) Self ¶
Add other scalar to this dataframe.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __and__(other: bool) Self ¶
Apply logical ‘and’ to other scalar and this dataframe.
Nulls should follow Kleene Logic.
- Parameters:
other (bool) –
- Returns:
DataFrame[bool]
- Raises:
ValueError – If self or other is not boolean.
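Kleene (three-valued) logic can be sketched in plain Python, with None standing in for a null (the actual null representation is implementation-specific):

```python
# Kleene 'and': a definite False dominates; otherwise a null propagates.
# This mirrors the semantics the standard prescribes for __and__/__or__.
def kleene_and(a, b):
    if a is False or b is False:
        return False  # False wins even against a null
    if a is None or b is None:
        return None   # otherwise nulls propagate
    return True


assert kleene_and(False, None) is False  # known False dominates
assert kleene_and(True, None) is None    # null propagates
assert kleene_and(True, True) is True
```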
- __dataframe_namespace__() Namespace ¶
Return an object that has all the top-level dataframe API functions on it.
- Returns:
namespace (Any) – An object representing the dataframe API namespace. It should have every top-level function defined in the specification as an attribute. It may contain other public names as well, but it is recommended to only include those names that are part of the specification.
- __divmod__(other: AnyScalar) tuple[DataFrame, DataFrame] ¶
Return quotient and remainder of integer division. See the divmod builtin.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
A tuple of two `DataFrame`s
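The elementwise semantics mirror Python's built-in divmod, which pairs floor division with the matching remainder:

```python
# divmod pairs floor division with the remainder that satisfies
# quotient * divisor + remainder == dividend.
assert divmod(7, 2) == (3, 1)
assert divmod(-7, 2) == (-4, 1)  # floors toward negative infinity
assert (-4) * 2 + 1 == -7        # the identity round-trips
```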
- __eq__(other: AnyScalar) Self ¶
Compare for equality.
Nulls should follow Kleene Logic.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __floordiv__(other: AnyScalar) Self ¶
Floor-divide (returns integers) this dataframe by other scalar.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __ge__(other: AnyScalar) Self ¶
Compare for “greater than or equal to” other.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __gt__(other: AnyScalar) Self ¶
Compare for “greater than” other.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __init__(*args, **kwargs)¶
- __invert__() Self ¶
Invert truthiness of (boolean) elements.
- Raises:
ValueError – If any of the DataFrame’s columns is not boolean.
- __iter__() NoReturn ¶
Iterate over elements.
This is intentionally “poisoned” to discourage inefficient code patterns.
- Raises:
NotImplementedError –
- __le__(other: AnyScalar) Self ¶
Compare for “less than or equal to” other.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __lt__(other: AnyScalar) Self ¶
Compare for “less than” other.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __mod__(other: AnyScalar) Self ¶
Return modulus of this dataframe by other (% operator).
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __mul__(other: AnyScalar) Self ¶
Multiply other scalar with this dataframe.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __ne__(other: AnyScalar) Self ¶
Compare for non-equality.
Nulls should follow Kleene Logic.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __or__(other: bool) Self ¶
Apply logical ‘or’ to other scalar and this DataFrame.
Nulls should follow Kleene Logic.
- Parameters:
other (bool) –
- Returns:
DataFrame[bool]
- Raises:
ValueError – If self or other is not boolean.
- __pow__(other: AnyScalar) Self ¶
Raise this dataframe to the power of other.
Integer dtype to the power of non-negative integer dtype is integer dtype. Integer dtype to the power of float dtype is float dtype. Float dtype to the power of integer dtype or float dtype is float dtype.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
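These promotion rules mirror Python's own scalar behaviour:

```python
# int ** non-negative int stays integral; any float operand produces a float.
assert isinstance(2 ** 3, int)      # int ** int -> int
assert isinstance(2 ** 3.0, float)  # int ** float -> float
assert isinstance(2.0 ** 3, float)  # float ** int -> float
assert 2 ** 3 == 8
```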
- __radd__(other: AnyScalar) Self ¶
- __rand__(other: AnyScalar) Self ¶
- __rfloordiv__(other: AnyScalar) Self ¶
- __rmod__(other: AnyScalar) Self ¶
- __rmul__(other: AnyScalar) Self ¶
- __ror__(other: AnyScalar) Self ¶
Return other | self.
- __rpow__(other: AnyScalar) Self ¶
- __rsub__(other: AnyScalar) Self ¶
- __rtruediv__(other: AnyScalar) Self ¶
- __sub__(other: AnyScalar) Self ¶
Subtract other scalar from this dataframe.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __truediv__(other: AnyScalar) Self ¶
Divide this dataframe by other scalar. True division, returns floats.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
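The contrast with __floordiv__ matches Python scalars: true division always yields floats, floor division floors to an integer:

```python
# __truediv__ semantics: float result regardless of operand types.
assert 7 / 2 == 3.5
assert isinstance(7 / 2, float)
# __floordiv__ semantics: floored, integral result.
assert 7 // 2 == 3
```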
- all(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- Raises:
ValueError – If any of the DataFrame’s columns is not boolean.
- any(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- Raises:
ValueError – If any of the DataFrame’s columns is not boolean.
- assign(*columns: Column) Self ¶
Insert new column(s), or update values in existing ones.
If inserting new columns, the column’s names will be used as the labels, and the columns will be inserted at the rightmost location.
If updating existing columns, their names will be used to tell which columns to update. To update a column with a different name, combine with Column.rename(), e.g.:
new_column = df.col('a') + 1
df = df.assign(new_column.rename('b'))
- Parameters:
*columns (Column) – Column(s) to update/insert. If updating/inserting multiple columns, they must all have different names.
- Returns:
DataFrame
Notes
The parent DataFrame of each of columns must be self - else, the operation is unsupported and may vary across implementations.
- cast(dtypes: Mapping[str, DType]) Self ¶
Convert specified columns to specified dtypes.
The following is not specified and may vary across implementations:
- Cross-kind casting (e.g. integer to string, or to float)
- Behaviour in the case of overflows
- col(name: str, /) Column ¶
Select a column by name.
- Parameters:
name (str) –
- Returns:
Column
- Raises:
KeyError – If the key is not present.
- property column_names: list[str]¶
Get column names.
- Returns:
list[str]
- property dataframe: SupportsDataFrameAPI¶
Return underlying (not-necessarily-Standard-compliant) DataFrame.
If a library only implements the Standard, then this can return self.
- drop(*labels: str) Self ¶
Drop the specified column(s).
- Parameters:
*labels (str) – Column name(s) to drop.
- Returns:
DataFrame
- Raises:
KeyError – If the label is not present.
- drop_nulls(*, column_names: list[str] | None = None) Self ¶
Drop rows containing null values.
- Parameters:
column_names (list[str] | None) – A list of column names to consider when dropping nulls. If None, all columns will be considered.
- Raises:
KeyError – If column_names contains a column name that is not present in the dataframe.
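The row-wise semantics can be sketched in plain Python, with None standing in for a null; the helper and its dict-of-rows layout are illustrative only, not the standard's API:

```python
# A row is dropped when any of the considered columns holds a null (None here).
def drop_nulls(rows, column_names=None):
    def keep(row):
        cols = column_names if column_names is not None else row.keys()
        return all(row[c] is not None for c in cols)

    return [row for row in rows if keep(row)]


rows = [{"a": 1, "b": None}, {"a": 2, "b": 5}, {"a": None, "b": 6}]
# Considering all columns, only the fully non-null row survives:
assert drop_nulls(rows) == [{"a": 2, "b": 5}]
# Considering only "a", the row with a null in "b" is kept:
assert drop_nulls(rows, column_names=["a"]) == [
    {"a": 1, "b": None},
    {"a": 2, "b": 5},
]
```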
- fill_nan(value: float | NullType | Scalar, /) Self ¶
Fill nan values with the given fill value.
The fill operation will apply to all columns with a floating-point dtype. Other columns remain unchanged.
- Parameters:
value (float or null) – Value used to replace any nan in the column with. Must be of the Python scalar type matching the dtype of the column (or be null).
- fill_null(value: AnyScalar, /, *, column_names: list[str] | None = None) Self ¶
Fill null values with the given fill value.
This method can only be used if all columns that are to be filled are of the same dtype (e.g., all of Float64 or all of string dtype). If that is not the case, it is not possible to use a single Python scalar type that matches the dtype of all columns to which fill_null is being applied, and hence an exception will be raised.
- Parameters:
value (Scalar) – Value used to replace any null values in the dataframe with. Must be of the Python scalar type matching the dtype(s) of the dataframe.
column_names (list[str] | None) – A list of column names for which to replace nulls with the given scalar value. If None, nulls will be replaced in all columns.
- Raises:
TypeError – If the columns of the dataframe are not all of the same kind.
KeyError – If column_names contains a column name that is not present in the dataframe.
- filter(mask: Column) Self ¶
Select a subset of rows corresponding to a mask.
- Parameters:
mask (Column) –
- Returns:
DataFrame
Notes
mask’s parent DataFrame must be self - else, the operation is unsupported and may vary across implementations.
- group_by(*keys: str) GroupBy ¶
Group the DataFrame by the given columns.
- Parameters:
*keys (str) –
- Returns:
GroupBy
- Raises:
KeyError – If any of the requested keys are not present.
Notes
Downstream operations from this function, like aggregations, return results for which row order is not guaranteed and is implementation defined.
- is_nan() Self ¶
Check for nan entries.
- Returns:
DataFrame
See also
is_null()
Notes
This only checks for ‘NaN’. Does not include ‘missing’ or ‘null’ entries. In particular, does not check for np.timedelta64('NaT').
- is_null() Self ¶
Check for ‘missing’ or ‘null’ entries.
- Returns:
DataFrame
See also
is_nan()
Notes
Does not include NaN-like entries. May optionally include ‘NaT’ values (if present in an implementation), but note that the Standard makes no guarantees about them.
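The distinction between the two checks mirrors plain Python floats, where NaN is a value in its own right rather than a missing entry:

```python
import math

# NaN is a floating-point value, distinct from a missing/null entry:
# is_nan() targets the former, is_null() the latter.
nan = float("nan")
assert nan != nan        # NaN compares unequal even to itself...
assert math.isnan(nan)   # ...so detection needs an explicit check
assert nan is not None   # NaN is a value, not a missing entry
```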
- join(other: Self, *, how: Literal['left', 'inner', 'outer'], left_on: str | list[str], right_on: str | list[str]) Self ¶
Join with other dataframe.
Other than the joining column name(s), no column name is allowed to appear in both self and other. Rename columns before calling join if necessary using rename().
.- Parameters:
other (Self) – Dataframe to join with.
how (str) – Kind of join to perform. Must be one of {‘left’, ‘inner’, ‘outer’}.
left_on (str | list[str]) – Key(s) from self to perform join on. If more than one key is given, it must be the same length as right_on.
right_on (str | list[str]) – Key(s) from other to perform join on. If more than one key is given, it must be the same length as left_on.
- Returns:
DataFrame
- Raises:
ValueError – If, apart from left_on and right_on, there are any column names present in both self and other.
- max(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- mean(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- median(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- min(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- persist() Self ¶
Hint that computation prior to this point should not be repeated.
This is intended as a hint, rather than as a directive. Implementations which do not separate lazy vs eager execution may ignore this method and treat it as a no-op.
Note
This method may trigger execution. If necessary, it should be called at most once per dataframe, and as late as possible in the pipeline.
For example, do this
df: DataFrame
result = df.std() > 0
result = result.persist()
features = [col.name for col in df.iter_columns() if col.get_value(0)]
instead of this:
df: DataFrame
result = df.std() > 0
features = [
    # Do NOT do this! This will trigger execution of the entire
    # pipeline for each element in the for-loop!
    col.name
    for col in df.iter_columns()
    if col.get_value(0).persist()
]
- prod(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- rename(mapping: Mapping[str, str]) Self ¶
Rename columns.
- Parameters:
mapping (Mapping[str, str]) – Keys are old column names, values are new column names.
- Returns:
DataFrame
- property schema: dict[str, DType]¶
Get dataframe’s schema.
- Returns:
dict[str, DType] – Mapping from column name to data type.
- select(*names: str) Self ¶
Select multiple columns by name.
- Parameters:
*names (str) –
- Returns:
DataFrame
- Raises:
KeyError – If any requested key is not present.
- shape() tuple[int, int] ¶
Return number of rows and number of columns.
- slice_rows(start: int | None, stop: int | None, step: int | None) Self ¶
Select a subset of rows corresponding to a slice.
- Parameters:
start (int or None) –
stop (int or None) –
step (int or None) –
- Returns:
DataFrame
- sort(*keys: str, ascending: Sequence[bool] | bool = True, nulls_position: Literal['first', 'last'] = 'last') Self ¶
Sort dataframe according to given columns.
If you only need the indices which would sort the dataframe, use sorted_indices.
- Parameters:
*keys (str) – Names of columns to sort by. If not specified, sort by all columns.
ascending (Sequence[bool] or bool) – If True, sort by all keys in ascending order. If False, sort by all keys in descending order. If a sequence, it must be the same length as keys, and determines the direction with which to use each key to sort by.
nulls_position ({'first', 'last'}) – Whether null values should be placed at the beginning or at the end of the result. Note that the position of NaNs is unspecified and may vary based on the implementation.
- Returns:
DataFrame
- Raises:
ValueError – If keys and ascending are sequences of different lengths.
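The per-key direction semantics can be sketched with Python's stable sort, applying the secondary key first so the primary key decides ties last:

```python
# Sort rows by field 0 ascending, then field 1 descending: with a stable
# sort, apply the keys in reverse priority order.
rows = [(1, "x"), (2, "y"), (1, "z")]
rows.sort(key=lambda r: r[1], reverse=True)  # secondary key, descending
rows.sort(key=lambda r: r[0])                # primary key, ascending (stable)
assert rows == [(1, "z"), (1, "x"), (2, "y")]
```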
- std(*, correction: float | Scalar = 1, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- Parameters:
correction – Degrees-of-freedom correction to apply to the result: 0 for the population standard deviation and 1 for the sample standard deviation. See Column.std for a more detailed description.
skip_nulls – Whether to skip null values.
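The correction adjusts the denominator: the squared deviations are divided by N - correction, so correction=0 gives the population statistic and the default correction=1 gives the sample statistic. A plain-Python sketch:

```python
import math

# Standard deviation with a degrees-of-freedom correction:
# denominator is N - correction.
def std(values, correction=1):
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - correction))


values = [1.0, 2.0, 3.0, 4.0]
# Squared deviations sum to 5.0 around mean 2.5:
assert math.isclose(std(values, correction=0), math.sqrt(5 / 4))  # population
assert math.isclose(std(values, correction=1), math.sqrt(5 / 3))  # sample
```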
- sum(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- take(indices: Column) Self ¶
Select a subset of rows, similar to ndarray.take.
- Parameters:
indices (Column) – Positions of rows to select.
- Returns:
DataFrame
Notes
indices’s parent DataFrame must be self - else, the operation is unsupported and may vary across implementations.
- to_array() Any ¶
Convert to array-API-compliant object.
The resulting array will have the corresponding dtype from the Array API:
Bool() -> ‘bool’
Int8() -> ‘int8’
Int16() -> ‘int16’
Int32() -> ‘int32’
Int64() -> ‘int64’
UInt8() -> ‘uint8’
UInt16() -> ‘uint16’
UInt32() -> ‘uint32’
UInt64() -> ‘uint64’
Float32() -> ‘float32’
Float64() -> ‘float64’
and multiple columns’ dtypes are combined according to the Array API’s type promotion rules.
- Returns:
Any – An array-API-compliant object.
Notes
While numpy arrays are not yet array-API-compliant, implementations may choose to return a numpy array (for numpy prior to 2.0), with the understanding that consuming libraries would then use the array-api-compat package to convert it to a Standard-compliant array.
- var(*, correction: float | Scalar = 1, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- Parameters:
correction – Degrees-of-freedom correction to apply to the result: 0 for the population variance and 1 for the sample variance. See Column.std for a more detailed description of the correction parameter.
skip_nulls – Whether to skip null values.