pyspark_utils package
Submodules
pyspark_utils.utils module
- get_spark_session(app_name: str) → SparkSession [source]
Retrieve an appropriate SparkSession for the application.
- Parameters:
app_name (str) – Name of the application
- Returns:
A Spark session with name app_name
- Return type:
pyspark.sql.SparkSession
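A minimal sketch of how such a helper is typically written (an assumption: the real implementation may configure the builder further):

```python
def get_spark_session(app_name: str):
    """Return a SparkSession named app_name, creating it if needed."""
    # Lazy import so this sketch loads even where pyspark is not installed.
    from pyspark.sql import SparkSession

    # getOrCreate() returns the already-active session if one exists;
    # the application name is only applied when a new session is built.
    return SparkSession.builder.appName(app_name).getOrCreate()
```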
- assert_cols_in_df(df: DataFrame, *columns: List[str], df_name: str | None = '') → None [source]
Asserts that all specified columns are present in the given dataframe; if not, raises with an informative message.
- Parameters:
df (pyspark.sql.DataFrame) – pyspark dataframe
columns (str) – names of the columns expected to be present in df
df_name (Optional[str], optional) – name of the dataframe, used in the error message. Defaults to “”.
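The check can be sketched in plain Python (assumptions: the helper consults `df.columns` and raises an `AssertionError` listing the missing names; the `FakeDF` stub below stands in for a real pyspark dataframe):

```python
def assert_cols_in_df(df, *columns, df_name=""):
    # Collect every requested column that the dataframe does not expose.
    missing = [c for c in columns if c not in df.columns]
    assert not missing, (
        f"Columns {missing} missing from dataframe {df_name!r}; "
        f"available columns: {list(df.columns)}"
    )

class FakeDF:
    """Stand-in for pyspark.sql.DataFrame; only .columns is needed here."""
    columns = ["user_id", "price"]

assert_cols_in_df(FakeDF(), "user_id", df_name="orders")  # passes silently
```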
- assert_df_close(df1: DataFrame, df2: DataFrame, **kwargs) → None [source]
Asserts that two dataframes are (almost) equal, even if the order of the columns is different.
- Parameters:
df1 (pyspark.sql.DataFrame) – first dataframe to compare
df2 (pyspark.sql.DataFrame) – second dataframe to compare
kwargs (Optional[dict]) – Keyword arguments forwarded to pandas.testing.assert_frame_equal
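The idea can be illustrated without Spark: align the column order, then compare values approximately. (The real helper presumably converts both dataframes to pandas and delegates the comparison to pandas.testing.assert_frame_equal; the `FakeDF` stub and `rel_tol` parameter below are illustrative assumptions.)

```python
import math

class FakeDF:
    """Stand-in for a dataframe: named columns plus rows of dicts."""
    def __init__(self, rows):
        self.rows = rows
        self.columns = list(rows[0]) if rows else []

def assert_df_close(df1, df2, rel_tol=1e-6):
    # Compare on a canonical column order, so column order never matters.
    cols = sorted(df1.columns)
    assert sorted(df2.columns) == cols, "column sets differ"
    assert len(df1.rows) == len(df2.rows), "row counts differ"
    for r1, r2 in zip(df1.rows, df2.rows):
        for c in cols:
            assert math.isclose(r1[c], r2[c], rel_tol=rel_tol), (
                f"{c}: {r1[c]} != {r2[c]}"
            )

# Same data, columns stored in a different order: still considered equal.
a = FakeDF([{"x": 1.0, "y": 2.0}])
b = FakeDF([{"y": 2.0, "x": 1.0 + 1e-12}])
assert_df_close(a, b)
```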
- with_columns(df: DataFrame, col_func_mapping: Dict[str, Column]) → DataFrame [source]
Apply multiple ‘withColumn’ calls to a dataframe in a single command. This function is tail recursive.
- Parameters:
df (pyspark.sql.DataFrame) – pyspark dataframe
col_func_mapping (Dict[str, pyspark.sql.Column]) – dict to map each column name with the function to apply to it
- Returns:
A pyspark dataframe identical to df but with the mapped columns added (or replaced, if they already exist).
- Return type:
pyspark.sql.DataFrame
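Since the docstring notes the function is tail recursive, it can be sketched as peeling one mapping entry off per call (`FakeDF` below is a stub that merely records the `withColumn` calls; the real function operates on a pyspark dataframe):

```python
def with_columns(df, col_func_mapping):
    # Base case: no columns left to add.
    if not col_func_mapping:
        return df
    # Peel off one (name, column expression) pair and recurse on the rest.
    (name, col), *rest = col_func_mapping.items()
    return with_columns(df.withColumn(name, col), dict(rest))

class FakeDF:
    """Stand-in for pyspark.sql.DataFrame that records withColumn calls."""
    def __init__(self):
        self.calls = []
    def withColumn(self, name, col):
        self.calls.append(name)
        return self

df = with_columns(FakeDF(), {"a": "expr_a", "b": "expr_b"})
# df.calls == ["a", "b"]
```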
- keep_first_rows(df: DataFrame, partition_cols, order_cols)[source]
Keep the first row of each group defined by partition_cols, where rows within a group are ordered by order_cols.
- Parameters:
df (pyspark.sql.DataFrame) – pyspark dataframe
partition_cols (List[str]) – columns defining the groups
order_cols (List[str]) – columns used to order the rows within each group
- Returns:
A pyspark dataframe containing only the first row of each group.
- Return type:
pyspark.sql.DataFrame
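The semantics can be sketched on plain rows. (An assumption: the real function presumably uses a Window partitioned by partition_cols and ordered by order_cols, keeping the rows where row_number() equals 1; the pure-Python version below mirrors that on a list of dicts.)

```python
from itertools import groupby

def keep_first_rows(rows, partition_cols, order_cols):
    part_key = lambda r: tuple(r[c] for c in partition_cols)
    order_key = lambda r: tuple(r[c] for c in order_cols)
    # Group rows by the partition key, then keep the smallest row
    # (per order_cols) from each group.
    first = []
    for _, group in groupby(sorted(rows, key=part_key), key=part_key):
        first.append(min(group, key=order_key))
    return first

rows = [
    {"user": "a", "ts": 2, "v": "late"},
    {"user": "a", "ts": 1, "v": "early"},
    {"user": "b", "ts": 5, "v": "only"},
]
keep_first_rows(rows, ["user"], ["ts"])
# -> [{"user": "a", "ts": 1, "v": "early"}, {"user": "b", "ts": 5, "v": "only"}]
```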