Dataframe object¶
A conforming implementation of the dataframe API standard must provide and support a dataframe object having the following methods, attributes, and behavior.
- class DataFrame(*args, **kwargs)¶
DataFrame object.
Note that this dataframe object is not meant to be instantiated directly by users of the library implementing the dataframe API standard. Rather, use the implementation's constructor functions, or work with an already-created dataframe object.
Python operator support
All arithmetic operators defined by the Python language, except for __matmul__, __neg__ and __pos__, must be supported for numerical data types.
All comparison operators defined by the Python language must be supported by the dataframe object for all data types for which those comparisons are supported by the builtin scalar types corresponding to a data type.
In-place operators must not be supported. All operations on the dataframe object are out-of-place.
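The out-of-place requirement means every operator returns a new dataframe and never mutates its operand. A minimal toy sketch (illustrative only, not a conforming implementation):

```python
# Toy frame whose arithmetic is out-of-place, as the standard requires.
# The class and its dict-of-columns layout are illustrative, not the standard's API.
class ToyFrame:
    def __init__(self, data):
        self.data = dict(data)

    def __add__(self, other):
        # Returns a NEW object; self is never mutated.
        return ToyFrame(
            {name: [v + other for v in col] for name, col in self.data.items()}
        )


df = ToyFrame({"a": [1, 2, 3]})
result = df + 1
assert df.data["a"] == [1, 2, 3]      # original unchanged
assert result.data["a"] == [2, 3, 4]  # new object holds the result
```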
Methods and Attributes
- __add__(other: AnyScalar) Self ¶
Add other scalar to this dataframe.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __and__(other: bool) Self ¶
Apply logical ‘and’ to other scalar and this dataframe.
Nulls should follow Kleene Logic.
- Parameters:
other (bool) –
- Returns:
DataFrame[bool]
- Raises:
ValueError – If self or other is not boolean.
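Kleene (three-valued) logic can be sketched in plain Python, with None standing in for a null (the actual null representation is implementation-specific):

```python
# Kleene 'and': a definite False dominates; otherwise a null propagates.
# This mirrors the semantics the standard prescribes for __and__/__or__.
def kleene_and(a, b):
    if a is False or b is False:
        return False  # False wins even against a null
    if a is None or b is None:
        return None   # otherwise nulls propagate
    return True


assert kleene_and(False, None) is False  # known False dominates
assert kleene_and(True, None) is None    # null propagates
assert kleene_and(True, True) is True
```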
- __dataframe_namespace__() Namespace ¶
Return an object that has all the top-level dataframe API functions on it.
- Returns:
namespace (Any) – An object representing the dataframe API namespace. It should have every top-level function defined in the specification as an attribute. It may contain other public names as well, but it is recommended to only include those names that are part of the specification.
- __divmod__(other: AnyScalar) tuple[DataFrame, DataFrame] ¶
Return quotient and remainder of integer division. See the divmod builtin.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
A tuple of two `DataFrame`s
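The elementwise semantics mirror Python's built-in divmod, which pairs floor division with the matching remainder:

```python
# divmod pairs floor division with the remainder that satisfies
# quotient * divisor + remainder == dividend.
assert divmod(7, 2) == (3, 1)
assert divmod(-7, 2) == (-4, 1)  # floors toward negative infinity
assert (-4) * 2 + 1 == -7        # the identity round-trips
```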
- __eq__(other: AnyScalar) Self ¶
Compare for equality.
Nulls should follow Kleene Logic.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __floordiv__(other: AnyScalar) Self ¶
Floor-divide (returns integers) this dataframe by other scalar.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __ge__(other: AnyScalar) Self ¶
Compare for “greater than or equal to” other.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __gt__(other: AnyScalar) Self ¶
Compare for “greater than” other.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __init__(*args, **kwargs)¶
- __invert__() Self ¶
Invert truthiness of (boolean) elements.
- Raises:
ValueError – If any of the DataFrame’s columns is not boolean.
- __iter__() NoReturn ¶
Iterate over elements.
This is intentionally “poisoned” to discourage inefficient code patterns.
- Raises:
NotImplementedError –
- __le__(other: AnyScalar) Self ¶
Compare for “less than or equal to” other.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __lt__(other: AnyScalar) Self ¶
Compare for “less than” other.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __mod__(other: AnyScalar) Self ¶
Return modulus of this dataframe by other (% operator).
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __mul__(other: AnyScalar) Self ¶
Multiply other scalar with this dataframe.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __ne__(other: AnyScalar) Self ¶
Compare for non-equality.
Nulls should follow Kleene Logic.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __or__(other: bool) Self ¶
Apply logical ‘or’ to other scalar and this DataFrame.
Nulls should follow Kleene Logic.
- Parameters:
other (bool) –
- Returns:
DataFrame[bool]
- Raises:
ValueError – If self or other is not boolean.
- __pow__(other: AnyScalar) Self ¶
Raise this dataframe to the power of other.
Integer dtype to the power of non-negative integer dtype is integer dtype. Integer dtype to the power of float dtype is float dtype. Float dtype to the power of integer dtype or float dtype is float dtype.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
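These promotion rules mirror Python's own scalar behaviour:

```python
# int ** non-negative int stays integral; any float operand produces a float.
assert isinstance(2 ** 3, int)      # int ** int -> int
assert isinstance(2 ** 3.0, float)  # int ** float -> float
assert isinstance(2.0 ** 3, float)  # float ** int -> float
assert 2 ** 3 == 8
```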
- __radd__(other: AnyScalar) Self ¶
- __rand__(other: AnyScalar) Self ¶
- __rfloordiv__(other: AnyScalar) Self ¶
- __rmod__(other: AnyScalar) Self ¶
- __rmul__(other: AnyScalar) Self ¶
- __ror__(other: AnyScalar) Self ¶
Return other | self.
- __rpow__(other: AnyScalar) Self ¶
- __rsub__(other: AnyScalar) Self ¶
- __rtruediv__(other: AnyScalar) Self ¶
- __sub__(other: AnyScalar) Self ¶
Subtract other scalar from this dataframe.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
- __truediv__(other: AnyScalar) Self ¶
Divide this dataframe by other scalar. True division, returns floats.
- Parameters:
other (Scalar) – “Scalar” here is defined implicitly by what scalar types are allowed for the operation by the underlying dtypes.
- Returns:
DataFrame
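The contrast with __floordiv__ matches Python scalars: true division always yields floats, floor division floors to an integer:

```python
# __truediv__ semantics: float result regardless of operand types.
assert 7 / 2 == 3.5
assert isinstance(7 / 2, float)
# __floordiv__ semantics: floored, integral result.
assert 7 // 2 == 3
```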
- all(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- Raises:
ValueError – If any of the DataFrame’s columns is not boolean.
- any(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- Raises:
ValueError – If any of the DataFrame’s columns is not boolean.
- assign(*columns: Column) Self ¶
Insert new column(s), or update values in existing ones.
If inserting new columns, the column’s names will be used as the labels, and the columns will be inserted at the rightmost location.
If updating existing columns, their names will be used to tell which columns to update. To update a column with a different name, combine with Column.rename(), e.g.:
new_column = df.col('a') + 1
df = df.assign(new_column.rename('b'))
- Parameters:
*columns (Column) – Column(s) to update/insert. If updating/inserting multiple columns, they must all have different names.
- Returns:
DataFrame
Notes
The parent DataFrame of each of columns must be self - else, the operation is unsupported and may vary across implementations.
- cast(dtypes: Mapping[str, DType]) Self ¶
Convert specified columns to specified dtypes.
The following is not specified and may vary across implementations:
- Cross-kind casting (e.g. integer to string, or to float)
- Behaviour in the case of overflows
- col(name: str, /) Column ¶
Select a column by name.
- Parameters:
name (str) –
- Returns:
Column
- Raises:
KeyError – If the key is not present.
- property column_names: list[str]¶
Get column names.
- Returns:
list[str]
- property dataframe: SupportsDataFrameAPI¶
Return underlying (not-necessarily-Standard-compliant) DataFrame.
If a library only implements the Standard, then this can return self.
- drop(*labels: str) Self ¶
Drop the specified column(s).
- Parameters:
*labels (str) – Column name(s) to drop.
- Returns:
DataFrame
- Raises:
KeyError – If the label is not present.
- drop_nulls(*, column_names: list[str] | None = None) Self ¶
Drop rows containing null values.
- Parameters:
column_names (list[str] | None) – A list of column names to consider when dropping nulls. If None, all columns will be considered.
- Raises:
KeyError – If column_names contains a column name that is not present in the dataframe.
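The row-wise semantics can be sketched in plain Python, with None standing in for a null; the helper and its dict-of-rows layout are illustrative only, not the standard's API:

```python
# A row is dropped when any of the considered columns holds a null (None here).
def drop_nulls(rows, column_names=None):
    def keep(row):
        cols = column_names if column_names is not None else row.keys()
        return all(row[c] is not None for c in cols)

    return [row for row in rows if keep(row)]


rows = [{"a": 1, "b": None}, {"a": 2, "b": 5}, {"a": None, "b": 6}]
# Considering all columns, only the fully non-null row survives:
assert drop_nulls(rows) == [{"a": 2, "b": 5}]
# Considering only "a", the row with a null in "b" is kept:
assert drop_nulls(rows, column_names=["a"]) == [
    {"a": 1, "b": None},
    {"a": 2, "b": 5},
]
```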
- fill_nan(value: float | NullType | Scalar, /) Self ¶
Fill nan values with the given fill value.
The fill operation will apply to all columns with a floating-point dtype. Other columns remain unchanged.
- Parameters:
value (float or null) – Value used to replace any nan in the column with. Must be of the Python scalar type matching the dtype of the column (or be null).
- fill_null(value: AnyScalar, /, *, column_names: list[str] | None = None) Self ¶
Fill null values with the given fill value.
This method can only be used if all columns that are to be filled are of the same dtype (e.g., all of Float64 or all of string dtype). If that is not the case, it is not possible to use a single Python scalar type that matches the dtype of all columns to which fill_null is being applied, and hence an exception will be raised.
- Parameters:
value (Scalar) – Value used to replace any null values in the dataframe with. Must be of the Python scalar type matching the dtype(s) of the dataframe.
column_names (list[str] | None) – A list of column names for which to replace nulls with the given scalar value. If None, nulls will be replaced in all columns.
- Raises:
TypeError – If the columns of the dataframe are not all of the same kind.
KeyError – If column_names contains a column name that is not present in the dataframe.
- filter(mask: Column) Self ¶
Select a subset of rows corresponding to a mask.
- Parameters:
mask (Column) –
- Returns:
DataFrame
Notes
mask’s parent DataFrame must be self - else, the operation is unsupported and may vary across implementations.
- group_by(*keys: str) GroupBy ¶
Group the DataFrame by the given columns.
- Parameters:
*keys (str) –
- Returns:
GroupBy
- Raises:
KeyError – If any of the requested keys are not present.
Notes
Downstream operations from this function, like aggregations, return results for which row order is not guaranteed and is implementation defined.
- is_nan() Self ¶
Check for nan entries.
- Returns:
DataFrame
See also
is_null()
Notes
This only checks for ‘NaN’. Does not include ‘missing’ or ‘null’ entries. In particular, does not check for np.timedelta64('NaT').
- is_null() Self ¶
Check for ‘missing’ or ‘null’ entries.
- Returns:
DataFrame
See also
is_nan()
Notes
Does not include NaN-like entries. May optionally include ‘NaT’ values (if present in an implementation), but note that the Standard makes no guarantees about them.
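The distinction between the two checks mirrors plain Python floats, where NaN is a value in its own right rather than a missing entry:

```python
import math

# NaN is a floating-point value, distinct from a missing/null entry:
# is_nan() targets the former, is_null() the latter.
nan = float("nan")
assert nan != nan        # NaN compares unequal even to itself...
assert math.isnan(nan)   # ...so detection needs an explicit check
assert nan is not None   # NaN is a value, not a missing entry
```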
- join(other: Self, *, how: Literal['left', 'inner', 'outer'], left_on: str | list[str], right_on: str | list[str]) Self ¶
Join with other dataframe.
Other than the joining column name(s), no column name is allowed to appear in both self and other. Rename columns before calling join if necessary using rename().
.- Parameters:
other (Self) – Dataframe to join with.
how (str) – Kind of join to perform. Must be one of {‘left’, ‘inner’, ‘outer’}.
left_on (str | list[str]) – Key(s) from self to perform join on. If more than one key is given, it must be the same length as right_on.
right_on (str | list[str]) – Key(s) from other to perform join on. If more than one key is given, it must be the same length as left_on.
- Returns:
DataFrame
- Raises:
ValueError – If, apart from left_on and right_on, there are any column names present in both self and other.
- max(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- mean(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- median(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- min(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- persist() Self ¶
Hint that computation prior to this point should not be repeated.
This is intended as a hint, rather than as a directive. Implementations which do not separate lazy vs eager execution may ignore this method and treat it as a no-op.
Note
This method may trigger execution. If necessary, it should be called at most once per dataframe, and as late as possible in the pipeline.
For example, do this
df: DataFrame
result = df.std() > 0
result = result.persist()
features = [col.name for col in df.iter_columns() if col.get_value(0)]
instead of this:
df: DataFrame
result = df.std() > 0
features = [
    # Do NOT do this! This will trigger execution of the entire
    # pipeline for each element in the for-loop!
    col.name
    for col in df.iter_columns()
    if col.get_value(0).persist()
]
- prod(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- rename(mapping: Mapping[str, str]) Self ¶
Rename columns.
- Parameters:
mapping (Mapping[str, str]) – Keys are old column names, values are new column names.
- Returns:
DataFrame
- property schema: dict[str, DType]¶
Get dataframe’s schema.
- Returns:
dict[str, DType] – Mapping from column name to data type.
- select(*names: str) Self ¶
Select multiple columns by name.
- Parameters:
*names (str) –
- Returns:
DataFrame
- Raises:
KeyError – If any requested key is not present.
- shape() tuple[int, int] ¶
Return number of rows and number of columns.
- slice_rows(start: int | None, stop: int | None, step: int | None) Self ¶
Select a subset of rows corresponding to a slice.
- Parameters:
start (int or None) –
stop (int or None) –
step (int or None) –
- Returns:
DataFrame
- sort(*keys: str, ascending: Sequence[bool] | bool = True, nulls_position: Literal['first', 'last'] = 'last') Self ¶
Sort dataframe according to given columns.
If you only need the indices which would sort the dataframe, use sorted_indices.
- Parameters:
*keys (str) – Names of columns to sort by. If not specified, sort by all columns.
ascending (Sequence[bool] or bool) – If True, sort by all keys in ascending order. If False, sort by all keys in descending order. If a sequence, it must be the same length as keys, and determines the direction with which to use each key to sort by.
nulls_position ({'first', 'last'}) – Whether null values should be placed at the beginning or at the end of the result. Note that the position of NaNs is unspecified and may vary based on the implementation.
- Returns:
DataFrame
- Raises:
ValueError – If keys and ascending are sequences of different lengths.
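The per-key direction semantics can be sketched with Python's stable sort, applying the secondary key first so the primary key decides ties last:

```python
# Sort rows by field 0 ascending, then field 1 descending: with a stable
# sort, apply the keys in reverse priority order.
rows = [(1, "x"), (2, "y"), (1, "z")]
rows.sort(key=lambda r: r[1], reverse=True)  # secondary key, descending
rows.sort(key=lambda r: r[0])                # primary key, ascending (stable)
assert rows == [(1, "z"), (1, "x"), (2, "y")]
```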
- std(*, correction: float | Scalar = 1, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- Parameters:
correction – Degrees-of-freedom correction to apply to the result: 0 for the population standard deviation and 1 for the sample standard deviation. See Column.std for a more detailed description.
skip_nulls – Whether to skip null values.
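The correction adjusts the denominator: the squared deviations are divided by N - correction, so correction=0 gives the population statistic and the default correction=1 gives the sample statistic. A plain-Python sketch:

```python
import math

# Standard deviation with a degrees-of-freedom correction:
# denominator is N - correction.
def std(values, correction=1):
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - correction))


values = [1.0, 2.0, 3.0, 4.0]
# Squared deviations sum to 5.0 around mean 2.5:
assert math.isclose(std(values, correction=0), math.sqrt(5 / 4))  # population
assert math.isclose(std(values, correction=1), math.sqrt(5 / 3))  # sample
```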
- sum(*, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- take(indices: Column) Self ¶
Select a subset of rows, similar to ndarray.take.
- Parameters:
indices (Column) – Positions of rows to select.
- Returns:
DataFrame
Notes
indices’s parent DataFrame must be self - else, the operation is unsupported and may vary across implementations.
- to_array() Any ¶
Convert to array-API-compliant object.
The resulting array will have the corresponding dtype from the Array API:
Bool() -> ‘bool’
Int8() -> ‘int8’
Int16() -> ‘int16’
Int32() -> ‘int32’
Int64() -> ‘int64’
UInt8() -> ‘uint8’
UInt16() -> ‘uint16’
UInt32() -> ‘uint32’
UInt64() -> ‘uint64’
Float32() -> ‘float32’
Float64() -> ‘float64’
and multiple columns’ dtypes are combined according to the Array API’s type promotion rules.
- Returns:
Any – An array-API-compliant object.
Notes
While numpy arrays are not yet array-API-compliant, implementations may choose to return a numpy array (for numpy prior to 2.0), with the understanding that consuming libraries would then use the array-api-compat package to convert it to a Standard-compliant array.
- var(*, correction: float | Scalar = 1, skip_nulls: bool | Scalar = True) Self ¶
Reduction returns a 1-row DataFrame.
- Parameters:
correction – Degrees-of-freedom correction to apply to the result: 0 for the population variance and 1 for the sample variance. See Column.std for a more detailed description of the correction parameter.
skip_nulls – Whether to skip null values.