pyspark_utils package

Submodules

pyspark_utils.utils module

get_spark_session(app_name: str) → SparkSession

Retrieve an appropriate SparkSession for the given application name.

Parameters:

app_name (str) – Name of the application

Returns:

A Spark session with name app_name

Return type:

pyspark.sql.SparkSession
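
A minimal usage sketch (assumes a working local Spark installation; the app name is illustrative):

>>> spark = get_spark_session("my_app")
>>> spark.sparkContext.appName
'my_app'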

assert_cols_in_df(df: DataFrame, *columns: List[str], df_name: str | None = '') → None

Asserts that all specified columns are present in the given dataframe. If not, displays an informative message.

Parameters:
  • df (pyspark.sql.DataFrame) – pyspark dataframe

  • columns (str) – names of the columns that must be present in df

  • df_name (Optional[str], optional) – name of the dataframe, used in the message. Defaults to “”.
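
A minimal usage sketch (the column names and the failure behaviour shown in the comments are assumptions, not taken from the source):

>>> df = spark.createDataFrame([(1, "a")], ["id", "value"])
>>> assert_cols_in_df(df, "id", "value", df_name="df")   # passes silently
>>> assert_cols_in_df(df, "missing", df_name="df")       # fails with an informative message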

assert_df_close(df1: DataFrame, df2: DataFrame, **kwargs) → None

Asserts that two dataframes are (almost) equal, even if the order of the columns is different.

Parameters:
  • df1 (pyspark.sql.DataFrame) – first dataframe to compare

  • df2 (pyspark.sql.DataFrame) – second dataframe to compare

  • kwargs (Optional[dict]) – any keyword argument accepted by pandas.testing.assert_frame_equal
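
A minimal usage sketch (assumes the comparison is delegated to pandas.testing.assert_frame_equal, so keyword arguments such as check_exact apply; the data is illustrative):

>>> df1 = spark.createDataFrame([(1, 0.30000000001)], ["id", "x"])
>>> df2 = spark.createDataFrame([(1, 0.3)], ["id", "x"]).select("x", "id")
>>> assert_df_close(df1, df2, check_exact=False)   # passes: column order differs, values are close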

with_columns(df: DataFrame, col_func_mapping: Dict[str, Column]) → DataFrame

Use multiple ‘withColumn’ calls on a dataframe in a single command. This function is tail recursive.

Parameters:
  • df (pyspark.sql.DataFrame) – pyspark dataframe

  • col_func_mapping (Dict[str, pyspark.sql.Column]) – dict mapping each column name to the Column expression to assign to it

Returns:

A pyspark dataframe identical to df but with additional columns.

Return type:

pyspark.sql.DataFrame
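
A minimal usage sketch (assumes mapping entries are applied in insertion order, like successive withColumn calls; the column names are illustrative):

>>> import pyspark.sql.functions as F
>>> df = spark.createDataFrame([(1, 2)], ["a", "b"])
>>> with_columns(df, {"total": F.col("a") + F.col("b"),
...                   "a_doubled": F.col("a") * 2}).columns
['a', 'b', 'total', 'a_doubled']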

keep_first_rows(df: DataFrame, partition_cols, order_cols)

Keep the first row of each group defined by partition_cols and order_cols.

Parameters:
  • df (pyspark.sql.DataFrame) – pyspark dataframe

  • partition_cols – columns defining the groups (the partition keys)

  • order_cols – columns used to order the rows within each group

Returns:

A dataframe containing only the first row of each group.

Return type:

pyspark.sql.DataFrame
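
A minimal usage sketch (assumes partition_cols and order_cols are passed as lists of column names and that rows are ordered ascending; both points are assumptions, not taken from the source):

>>> df = spark.createDataFrame(
...     [("u1", "2024-01-02", 5), ("u1", "2024-01-01", 3), ("u2", "2024-01-01", 7)],
...     ["user", "date", "score"],
... )
>>> keep_first_rows(df, partition_cols=["user"], order_cols=["date"]).show()
# expected: one row per user, presumably the earliest date for each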

Module contents