Dataframe object

A conforming implementation of the dataframe API standard must provide and support a dataframe object having the following methods, attributes, and behavior.

class DataFrame(*args, **kwargs)

DataFrame object.

Note that this dataframe object is not meant to be instantiated directly by users of the library implementing the dataframe API standard. Rather, use constructor functions or an already-created dataframe object retrieved via

Python operator support

All arithmetic operators defined by the Python language, except for __matmul__, __neg__ and __pos__, must be supported for numerical data types.

All comparison operators defined by the Python language must be supported by the dataframe object for all data types for which those comparisons are supported by the builtin scalar types corresponding to a data type.

In-place operators must not be supported. All operations on the dataframe object are out-of-place.

Methods and Attributes

__add__(other: AnyScalar) Self

Add other scalar to this dataframe.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__and__(other: bool) Self

Apply logical ‘and’ to other scalar and this dataframe.

Nulls should follow Kleene Logic.

Parameters:

other (bool) –

Returns:

DataFrame[bool]

Raises:

ValueError – If self or other is not boolean.
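
As a rough illustration of the Kleene logic mentioned above, the following sketch models a null as Python's None and applies a three-valued 'and' element by element. The helper name kleene_and is hypothetical and not part of the Standard; it only shows the truth table that an implementation would be expected to follow.

```python
from typing import Optional


def kleene_and(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued 'and'; None stands in for a null (an assumption of this sketch)."""
    # False dominates: the result is False even if the other operand is null.
    if a is False or b is False:
        return False
    # Otherwise any null makes the result unknown (null).
    if a is None or b is None:
        return None
    return True


column = [True, False, None]
print([kleene_and(x, True) for x in column])   # [True, False, None]
print([kleene_and(x, False) for x in column])  # [False, False, False]
```

The key point is that a null combined with False is False, because the outcome does not depend on the unknown value.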

__dataframe_namespace__() Namespace

Return an object that has all the top-level dataframe API functions on it.

Returns:

namespace (Any) – An object representing the dataframe API namespace. It should have every top-level function defined in the specification as an attribute. It may contain other public names as well, but it is recommended to only include those names that are part of the specification.

__divmod__(other: AnyScalar) tuple[DataFrame, DataFrame]

Return quotient and remainder of integer division. See divmod builtin.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

A tuple of two `DataFrame`s
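
The shape of the result can be pictured with Python's builtin divmod applied to each element of a plain list. This is only an analogy using Python integers, not an implementation of the Standard; the lists stand in for the two returned dataframes.

```python
values = [7, -7, 10]
other = 3

# divmod(x, y) returns the pair (x // y, x % y) for Python numbers.
pairs = [divmod(x, other) for x in values]
quotients = [q for q, _ in pairs]
remainders = [r for _, r in pairs]

print(quotients)   # [2, -3, 3]
print(remainders)  # [1, 2, 1]
```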

__eq__(other: AnyScalar) Self

Compare for equality.

Nulls should follow Kleene Logic.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__floordiv__(other: AnyScalar) Self

Floor-divide (returns integers) this dataframe by other scalar.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__ge__(other: AnyScalar) Self

Compare for “greater than or equal to” other.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__gt__(other: AnyScalar) Self

Compare for “greater than” other.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__init__(*args, **kwargs)
__invert__() Self

Invert truthiness of (boolean) elements.

Raises:

ValueError – If any of the DataFrame’s columns is not boolean.

__iter__() NoReturn

Iterate over elements.

This is intentionally “poisoned” to discourage inefficient code patterns.

Raises:

NotImplementedError
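
A minimal sketch of what "poisoning" iteration can look like is shown below. The class PoisonedFrame is a hypothetical stand-in, not a compliant dataframe; it only demonstrates that a for-loop over such an object fails immediately.

```python
from typing import NoReturn


class PoisonedFrame:
    """Hypothetical stand-in for a dataframe; not a compliant implementation."""

    def __iter__(self) -> NoReturn:
        # Refuse element-wise Python iteration outright.
        raise NotImplementedError("iteration is intentionally unsupported")


message = ""
try:
    for _ in PoisonedFrame():
        pass
except NotImplementedError as exc:
    message = str(exc)

print(message)  # iteration is intentionally unsupported
```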

__le__(other: AnyScalar) Self

Compare for “less than or equal to” other.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__lt__(other: AnyScalar) Self

Compare for “less than” other.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__mod__(other: AnyScalar) Self

Return modulus of this dataframe by other (% operator).

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__mul__(other: AnyScalar) Self

Multiply other scalar with this dataframe.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__ne__(other: AnyScalar) Self

Compare for non-equality.

Nulls should follow Kleene Logic.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__or__(other: bool) Self

Apply logical ‘or’ to other scalar and this DataFrame.

Nulls should follow Kleene Logic.

Parameters:

other (bool) –

Returns:

DataFrame[bool]

Raises:

ValueError – If self or other is not boolean.

__pow__(other: AnyScalar) Self

Raise this dataframe to the power of other.

Integer dtype to the power of non-negative integer dtype is integer dtype. Integer dtype to the power of float dtype is float dtype. Float dtype to the power of integer dtype or float dtype is float dtype.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame
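
Python's own scalars happen to follow the same result-kind rules, which makes them a convenient way to picture the dtype behaviour described above. This is an analogy only. The last line shows why the integer rule is limited to non-negative exponents: Python returns a float for a negative integer exponent, and this section does not say what a dataframe implementation should do in that case.

```python
int_pow_int = 2 ** 3        # integer to a non-negative integer power
int_pow_float = 2 ** 0.5    # integer to a float power
float_pow_int = 2.0 ** 3    # float to an integer power
int_pow_negative = 2 ** -1  # negative exponent: Python falls back to float

print(type(int_pow_int).__name__, int_pow_int)            # int 8
print(type(int_pow_float).__name__)                       # float
print(type(float_pow_int).__name__, float_pow_int)        # float 8.0
print(type(int_pow_negative).__name__, int_pow_negative)  # float 0.5
```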

__radd__(other: AnyScalar) Self
__rand__(other: AnyScalar) Self
__rfloordiv__(other: AnyScalar) Self
__rmod__(other: AnyScalar) Self
__rmul__(other: AnyScalar) Self
__ror__(other: AnyScalar) Self

Return value|self.

__rpow__(other: AnyScalar) Self
__rsub__(other: AnyScalar) Self
__rtruediv__(other: AnyScalar) Self
__sub__(other: AnyScalar) Self

Subtract other scalar from this dataframe.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

__truediv__(other: AnyScalar) Self

Divide this dataframe by other scalar. True division, returns floats.

Parameters:

other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.

Returns:

DataFrame

all(*, skip_nulls: bool | Scalar = True) Self

Reduction returns a 1-row DataFrame.

Raises:

ValueError – If any of the DataFrame’s columns is not boolean.

any(*, skip_nulls: bool | Scalar = True) Self

Reduction returns a 1-row DataFrame.

Raises:

ValueError – If any of the DataFrame’s columns is not boolean.

assign(*columns: Column) Self

Insert new column(s), or update values in existing ones.

If inserting new columns, the columns’ names will be used as the labels, and the columns will be inserted at the rightmost location.

If updating existing columns, their names will be used to tell which columns to update. To update a column with a different name, combine with Column.rename(), e.g.:

new_column = df.col('a') + 1
df = df.assign(new_column.rename('b'))

Parameters:

*columns (Column) – Column(s) to update/insert. If updating/inserting multiple columns, they must all have different names.

Returns:

DataFrame

Notes

The parent DataFrame of every column in columns must be self; otherwise the operation is unsupported and its behavior may vary across implementations.
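
The insert-or-update behaviour can be pictured with an ordinary Python dict, which also appends new keys at the end. The sketch below is a toy model, not the Standard's API: the function assign and the dict-of-lists representation are hypothetical, and it assumes that an updated column keeps its existing position.

```python
# Hypothetical model: column name -> list of values, in column order.
frame = {"a": [1, 2, 3], "b": [4, 5, 6]}


def assign(frame: dict, name: str, values: list) -> dict:
    """Return a new mapping; the input is left untouched (out-of-place)."""
    result = dict(frame)
    # Assigning to an existing key keeps its position; a new key is appended.
    result[name] = values
    return result


updated = assign(frame, "a", [10, 20, 30])
inserted = assign(frame, "c", [7, 8, 9])

print(list(updated))   # ['a', 'b']
print(list(inserted))  # ['a', 'b', 'c']
print(frame["a"])      # [1, 2, 3]
```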

cast(dtypes: Mapping[str, DType]) Self

Convert specified columns to specified dtypes.

The following is not specified and may vary across implementations:

  • Cross-kind casting (e.g. integer to string, or to float)

  • Behaviour in the case of overflows

col(name: str, /) Column

Select a column by name.

Parameters:

name (str) –

Returns:

Column

Raises:

KeyError – If the key is not present.

property column_names: list[str]

Get column names.

Returns:

list[str]

property dataframe: SupportsDataFrameAPI

Return underlying (not-necessarily-Standard-compliant) DataFrame.

If a library only implements the Standard, then this can return self.

drop(*labels: str) Self

Drop the specified column(s).

Parameters:

*labels (str) – Column name(s) to drop.

Returns:

DataFrame

Raises:

KeyError – If the label is not present.

drop_nulls(*, column_names: list[str] | None = None) Self

Drop rows containing null values.

Parameters:

column_names (list[str] | None) – A list of column names to consider when dropping nulls. If None, all columns will be considered.

Raises:

KeyError – If column_names contains a column name that is not present in the dataframe.
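
The effect of column_names can be sketched with a list of row dicts in which None marks a null. The function below is a hypothetical toy model for non-empty input, not a compliant implementation.

```python
rows = [
    {"a": 1, "b": None},
    {"a": None, "b": 2.0},
    {"a": 3, "b": 4.0},
]


def drop_nulls(rows, column_names=None):
    """Keep rows with no null (None) in the considered columns."""
    # Assumes at least one row, so the first row can supply the column names.
    names = list(rows[0]) if column_names is None else list(column_names)
    for name in names:
        if name not in rows[0]:
            raise KeyError(name)
    return [row for row in rows if all(row[name] is not None for name in names)]


print(drop_nulls(rows))                      # [{'a': 3, 'b': 4.0}]
print(drop_nulls(rows, column_names=["a"]))  # [{'a': 1, 'b': None}, {'a': 3, 'b': 4.0}]
```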

fill_nan(value: float | NullType | Scalar, /) Self

Fill nan values with the given fill value.

The fill operation will apply to all columns with a floating-point dtype. Other columns remain unchanged.

Parameters:

value (float or null) – Value with which to replace any nan in the column. Must be of the Python scalar type matching the dtype of the column (or be null).

fill_null(value: AnyScalar, /, *, column_names: list[str] | None = None) Self

Fill null values with the given fill value.

This method can only be used if all columns that are to be filled are of the same dtype (e.g., all of Float64 or all of string dtype). If that is not the case, it is not possible to use a single Python scalar type that matches the dtype of all columns to which fill_null is being applied, and hence an exception will be raised.

Parameters:
  • value (Scalar) – Value with which to replace any null values in the dataframe. Must be of the Python scalar type matching the dtype(s) of the dataframe.

  • column_names (list[str] | None) – A list of column names for which to replace nulls with the given scalar value. If None, nulls will be replaced in all columns.

Raises:
  • TypeError – If the columns of the dataframe are not all of the same kind.

  • KeyError – If column_names contains a column name that is not present in the dataframe.
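
The same-dtype restriction can be illustrated with a toy model in which a dataframe is a dict of lists and None marks a null. The function fill_null below is hypothetical, and it uses the Python type of the values as a crude stand-in for the column dtype, which is an assumption of this sketch.

```python
def fill_null(frame, value, column_names=None):
    """Hypothetical helper; `frame` maps column name -> list, None marks a null."""
    names = list(frame) if column_names is None else list(column_names)
    for name in names:
        if name not in frame:
            raise KeyError(name)
    # Use the Python type of the non-null values as a stand-in for the dtype.
    kinds = {type(v) for name in names for v in frame[name] if v is not None}
    kinds.add(type(value))
    if len(kinds) > 1:
        found = sorted(kind.__name__ for kind in kinds)
        raise TypeError(f"cannot fill columns of mixed kinds: {found}")
    result = dict(frame)
    for name in names:
        result[name] = [value if v is None else v for v in frame[name]]
    return result


frame = {"x": [1.0, None], "y": [None, 2.0], "label": ["a", None]}
filled = fill_null(frame, 0.0, column_names=["x", "y"])
print(filled)  # {'x': [1.0, 0.0], 'y': [0.0, 2.0], 'label': ['a', None]}
```

Calling fill_null(frame, 0.0) without column_names raises TypeError in this sketch, because the string column cannot be filled with a float.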

filter(mask: Column) Self

Select a subset of rows corresponding to a mask.

Parameters:

mask (Column) –

Returns:

DataFrame

Notes

The parent DataFrame of mask must be self; otherwise the operation is unsupported and its behavior may vary across implementations.

group_by(*keys: str) GroupBy

Group the DataFrame by the given columns.

Parameters:

*keys (str) –

Returns:

GroupBy

Raises:

KeyError – If any of the requested keys are not present.

Notes

Downstream operations from this function, like aggregations, return results for which row order is not guaranteed and is implementation defined.

is_nan() Self

Check for nan entries.

Returns:

DataFrame

See also

is_null

Notes

This only checks for ‘NaN’. Does not include ‘missing’ or ‘null’ entries. In particular, does not check for np.timedelta64('NaT').

is_null() Self

Check for ‘missing’ or ‘null’ entries.

Returns:

DataFrame

See also

is_nan

Notes

Does not include NaN-like entries. May optionally include ‘NaT’ values (if present in an implementation), but note that the Standard makes no guarantees about them.
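
The difference between is_nan and is_null can be sketched on a plain list, with None standing in for a null and float("nan") for NaN. This is an illustration only. In particular, the sketch reports False under is_nan for the null entry, which is a simplification; how an implementation treats nulls there is not stated in this section.

```python
import math

values = [1.0, float("nan"), None]

# NaN is a floating-point value; null (None here) marks a missing entry.
is_nan = [v is not None and math.isnan(v) for v in values]
is_null = [v is None for v in values]

print(is_nan)   # [False, True, False]
print(is_null)  # [False, False, True]
```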

iter_columns() Iterator[Column]

Return iterator over columns.

join(other: Self, *, how: Literal['left', 'inner', 'outer'], left_on: str | list[str], right_on: str | list[str]) Self

Join with other dataframe.

Other than the joining column name(s), no column name is allowed to appear in both self and other. Rename columns before calling join if necessary using rename().

Parameters:
  • other (Self) – Dataframe to join with.

  • how (str) – Kind of join to perform. Must be one of {‘left’, ‘inner’, ‘outer’}.

  • left_on (str | list[str]) – Key(s) from self to perform join on. If more than one key is given, it must be the same length as right_on.

  • right_on (str | list[str]) – Key(s) from other to perform join on. If more than one key is given, it must be the same length as left_on.

Returns:

DataFrame

Raises:

ValueError – If, apart from left_on and right_on, there are any column names present in both self and other.
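
The column-name restriction can be sketched as a check on lists of names. The helper check_join_names is hypothetical and reflects one reading of the rule: any name present on both sides must be one of the join keys.

```python
def check_join_names(left_names, right_names, left_on, right_on):
    """Raise ValueError if a non-key column name appears on both sides."""
    left_keys = {left_on} if isinstance(left_on, str) else set(left_on)
    right_keys = {right_on} if isinstance(right_on, str) else set(right_on)
    clashes = (set(left_names) & set(right_names)) - left_keys - right_keys
    if clashes:
        raise ValueError(f"column names present in both dataframes: {sorted(clashes)}")


# Only the key column is shared, so this passes silently.
check_join_names(["id", "price"], ["id", "qty"], "id", "id")

error = ""
try:
    check_join_names(["id", "price"], ["id", "price"], "id", "id")
except ValueError as exc:
    error = str(exc)

print(error)  # column names present in both dataframes: ['price']
```

Renaming the clashing column on one side before joining, as suggested above, avoids the error.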

max(*, skip_nulls: bool | Scalar = True) Self

Reduction returns a 1-row DataFrame.

mean(*, skip_nulls: bool | Scalar = True) Self

Reduction returns a 1-row DataFrame.

median(*, skip_nulls: bool | Scalar = True) Self

Reduction returns a 1-row DataFrame.

min(*, skip_nulls: bool | Scalar = True) Self

Reduction returns a 1-row DataFrame.

persist() Self

Hint that computation prior to this point should not be repeated.

This is intended as a hint, rather than as a directive. Implementations which do not separate lazy vs eager execution may ignore this method and treat it as a no-op.

Note

This method may trigger execution. If necessary, it should be called at most once per dataframe, and as late as possible in the pipeline.

For example, do this:

df: DataFrame
result = df.std() > 0
result = result.persist()
features = [col.name for col in result.iter_columns() if col.get_value(0)]

instead of this:

df: DataFrame
result = df.std() > 0
features = [
    # Do NOT do this! This will trigger execution of the entire
    # pipeline for each element in the for-loop!
    col.name for col in result.iter_columns() if col.get_value(0).persist()
]

prod(*, skip_nulls: bool | Scalar = True) Self

Reduction returns a 1-row DataFrame.

rename(mapping: Mapping[str, str]) Self

Rename columns.

Parameters:

mapping (Mapping[str, str]) – Keys are old column names, values are new column names.

Returns:

DataFrame

property schema: dict[str, DType]

Get dataframe’s schema.

Returns:

dict[str, DType] – Mapping from column name to data type.

select(*names: str) Self

Select multiple columns by name.

Parameters:

*names (str) –

Returns:

DataFrame

Raises:

KeyError – If any requested key is not present.

shape() tuple[int, int]

Return number of rows and number of columns.

slice_rows(start: int | None, stop: int | None, step: int | None) Self

Select a subset of rows corresponding to a slice.

Parameters:
  • start (int or None) –

  • stop (int or None) –

  • step (int or None) –

Returns:

DataFrame
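
Assuming that start, stop and step follow the semantics of Python's builtin slice (this section does not spell that out), the selection can be pictured on a list of row labels. The function slice_rows below is a hypothetical stand-in operating on a list.

```python
rows = ["r0", "r1", "r2", "r3", "r4", "r5"]


def slice_rows(rows, start, stop, step):
    # None means "use the default" for that bound, as with builtin slices.
    return rows[slice(start, stop, step)]


print(slice_rows(rows, 1, None, 2))     # ['r1', 'r3', 'r5']
print(slice_rows(rows, None, 3, None))  # ['r0', 'r1', 'r2']
```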

sort(*keys: str, ascending: Sequence[bool] | bool = True, nulls_position: Literal['first', 'last'] = 'last') Self

Sort dataframe according to given columns.

If you only need the indices which would sort the dataframe, use sorted_indices.

Parameters:
  • *keys (str) – Names of columns to sort by. If not specified, sort by all columns.

  • ascending (Sequence[bool] or bool) – If True, sort by all keys in ascending order. If False, sort by all keys in descending order. If a sequence, it must be the same length as keys, and each element gives the sort direction for the corresponding key.

  • nulls_position ({'first', 'last'}) – Whether null values should be placed at the beginning or at the end of the result. Note that the position of NaNs is unspecified and may vary based on the implementation.

Returns:

DataFrame

Raises:

ValueError – If keys and ascending are sequences of different lengths.
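
The interaction between ascending and nulls_position can be sketched for a single key on a plain list, with None standing in for a null. The helper sort_values is hypothetical, and the sketch assumes that nulls_position is applied independently of the sort direction.

```python
def sort_values(values, ascending=True, nulls_position="last"):
    """Sort non-null values, then place the nulls (None) first or last."""
    non_null = sorted((v for v in values if v is not None), reverse=not ascending)
    nulls = [None] * (len(values) - len(non_null))
    return nulls + non_null if nulls_position == "first" else non_null + nulls


data = [3, None, 1, 2]
print(sort_values(data))                                           # [1, 2, 3, None]
print(sort_values(data, ascending=False, nulls_position="first"))  # [None, 3, 2, 1]
```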

std(*, correction: float | Scalar = 1, skip_nulls: bool | Scalar = True) Self

Reduction returns a 1-row DataFrame.

Parameters:
  • correction – Correction to apply to the result. For example, 0 for population standard deviation and 1 for sample standard deviation. See Column.std for a more detailed description.

  • skip_nulls – Whether to skip null values.
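
A minimal sketch of how a degrees-of-freedom correction is conventionally applied: the sum of squared deviations is divided by N - correction, where N is the number of values. The function std below is a hypothetical pure-Python helper for a single list of floats, compared against Python's statistics module. With this convention, correction=0 matches statistics.pstdev (population) and correction=1 matches statistics.stdev (sample).

```python
import math
import statistics


def std(values, correction=1.0):
    """Standard deviation with divisor N - correction."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - correction))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(std(data, correction=0))  # 2.0
print(math.isclose(std(data, correction=0), statistics.pstdev(data)))  # True
print(math.isclose(std(data, correction=1), statistics.stdev(data)))   # True
```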

sum(*, skip_nulls: bool | Scalar = True) Self

Reduction returns a 1-row DataFrame.

take(indices: Column) Self

Select a subset of rows, similar to ndarray.take.

Parameters:

indices (Column) – Positions of rows to select.

Returns:

DataFrame

Notes

The parent DataFrame of indices must be self; otherwise the operation is unsupported and its behavior may vary across implementations.

to_array() Any

Convert to array-API-compliant object.

The resulting array will have the corresponding dtype from the Array API:

  • Bool() -> ‘bool’

  • Int8() -> ‘int8’

  • Int16() -> ‘int16’

  • Int32() -> ‘int32’

  • Int64() -> ‘int64’

  • UInt8() -> ‘uint8’

  • UInt16() -> ‘uint16’

  • UInt32() -> ‘uint32’

  • UInt64() -> ‘uint64’

  • Float32() -> ‘float32’

  • Float64() -> ‘float64’

and multiple columns’ dtypes are combined according to the Array API’s type promotion rules.

Returns:

Any – An array-API-compliant object.

Notes

While numpy arrays are not yet array-API-compliant, implementations may choose to return a numpy array (for numpy prior to 2.0), with the understanding that consuming libraries would then use the array-api-compat package to convert it to a Standard-compliant array.

var(*, correction: float | Scalar = 1, skip_nulls: bool | Scalar = True) Self

Reduction returns a 1-row DataFrame.

Parameters:
  • correction – Correction to apply to the result. For example, 0 for population variance and 1 for sample variance. See Column.std for a more detailed description.

  • skip_nulls – Whether to skip null values.