pyspark_utils package

Submodules

pyspark_utils.utils module

get_spark_session(app_name: str) → SparkSession

Retrieve an appropriate SparkSession for the given application name.

Parameters:

app_name (str) – Name of the application

Returns:

A Spark session with name app_name

Return type:

pyspark.sql.SparkSession
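
A minimal usage sketch (assumes a working local Spark installation; the app name is illustrative):

>>> spark = get_spark_session("my_app")
>>> spark.sparkContext.appName
'my_app'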

assert_cols_in_df(df: DataFrame, *columns: List[str], df_name: str | None = '') → None

Asserts that all specified columns are present in the given dataframe. If not, displays an informative message.

Parameters:
  • df (pyspark.sql.DataFrame) – pyspark dataframe

  • columns (str) – names of the columns that must be present in df

  • df_name (Optional[str], optional) – name of the dataframe, used in the message. Defaults to “”.
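
A minimal usage sketch (the column names and the failure behaviour shown in the comments are assumptions, not taken from the source):

>>> df = spark.createDataFrame([(1, "a")], ["id", "value"])
>>> assert_cols_in_df(df, "id", "value", df_name="df")   # passes silently
>>> assert_cols_in_df(df, "missing", df_name="df")       # fails with an informative message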

assert_df_close(df1: DataFrame, df2: DataFrame, **kwargs) → None

Asserts that two dataframes are (almost) equal, even if the order of the columns is different.

Parameters:
  • df1 (pyspark.sql.DataFrame) – first dataframe to compare

  • df2 (pyspark.sql.DataFrame) – second dataframe to compare

  • kwargs (Optional[dict]) – any keyword argument accepted by pandas.testing.assert_frame_equal
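
A minimal usage sketch (assumes the comparison is delegated to pandas.testing.assert_frame_equal, so keyword arguments such as check_exact apply; the data is illustrative):

>>> df1 = spark.createDataFrame([(1, 0.30000000001)], ["id", "x"])
>>> df2 = spark.createDataFrame([(1, 0.3)], ["id", "x"]).select("x", "id")
>>> assert_df_close(df1, df2, check_exact=False)   # passes: column order differs, values are close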

with_columns(df: DataFrame, col_func_mapping: Dict[str, Column]) → DataFrame

Use multiple ‘withColumn’ calls on a dataframe in a single command. This function is tail recursive.

Parameters:
  • df (pyspark.sql.DataFrame) – pyspark dataframe

  • col_func_mapping (Dict[str, pyspark.sql.Column]) – dict mapping each column name to the Column expression to assign to it

Returns:

A pyspark dataframe identical to df but with additional columns.

Return type:

pyspark.sql.DataFrame
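
A minimal usage sketch (assumes mapping entries are applied in insertion order, like successive withColumn calls; the column names are illustrative):

>>> import pyspark.sql.functions as F
>>> df = spark.createDataFrame([(1, 2)], ["a", "b"])
>>> with_columns(df, {"total": F.col("a") + F.col("b"),
...                   "a_doubled": F.col("a") * 2}).columns
['a', 'b', 'total', 'a_doubled']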

keep_first_rows(df: DataFrame, partition_cols, order_cols)

Keep the first row of each group defined by partition_cols and order_cols.

Parameters:
  • df (pyspark.sql.DataFrame) – pyspark dataframe

  • partition_cols – columns defining the groups (the partition keys)

  • order_cols – columns used to order the rows within each group

Returns:

A dataframe containing only the first row of each group.

Return type:

pyspark.sql.DataFrame
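
A minimal usage sketch (assumes partition_cols and order_cols are passed as lists of column names and that rows are ordered ascending; both points are assumptions, not taken from the source):

>>> df = spark.createDataFrame(
...     [("u1", "2024-01-02", 5), ("u1", "2024-01-01", 3), ("u2", "2024-01-01", 7)],
...     ["user", "date", "score"],
... )
>>> keep_first_rows(df, partition_cols=["user"], order_cols=["date"]).show()
# expected: one row per user, presumably the earliest date for each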

Module contents