list_builder module

list_builder.build_list(df, measure_list, weight_list, show_weightings=False, hide_rank_cols=True, return_df=False)

Construct a “hybrid” list ordering.

Note: first run the “prepare_master_list” function and use the output for the “df” input here.

Combine and sort various attributes according to variable multipliers to produce a list order. The resulting order reflects a sliding scale of the priority assigned among the attributes.

The default output is a dataframe containing the new hybrid list order and employee numbers (empkey) only, and is written to disk as ‘dill/p_hybrid.pkl’.

The entire hybrid-sorted dataframe may be returned by setting the “return_df” input to True. This does not affect the hybrid list order dataframe - it is produced and stored regardless of the “return_df” option.

inputs
df

the prepared dataframe output of the prepare_master_list function

measure_list

a list of attributes that form the basis of the final sorted list. The employee groups will be combined, sorted, and numbered according to these attributes one by one. Each time the current attribute numbered list is formed, a weighting is applied to that order column. The final result number will be the rank of the cumulative total of the weighted attribute columns.

weight_list

a list of decimal weightings to apply to each corresponding measure within the measure_list. Normally the total of the weight_list should be 1, but any numbers may be used as weightings since the final result is a ranking of a cumulative total.

show_weightings

add columns to display the product of the weight/column multiplication

return_df

option to return the new sorted hybrid dataframe as output. Normally, the function produces a list ordering file which is written to disk and used as an input by the compute measures script.

hide_rank_cols

remove the attribute rank columns from the dataframe unless visual review is desired
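The ranking arithmetic described for measure_list and weight_list can be illustrated with a minimal pandas sketch. This is not the build_list source: the helper name hybrid_order_sketch, the measure_a and measure_b columns, the idx output column, and the sample values are all assumptions made up for illustration. The sketch also skips the prepare_master_list step and the file written to disk.

```python
import pandas as pd


def hybrid_order_sketch(df, measure_list, weight_list):
    """Rank each measure, weight the ranks, then rank the weighted total."""
    out = df.copy()
    total = pd.Series(0.0, index=out.index)
    for measure, weight in zip(measure_list, weight_list):
        # Position of each row when ordered by this measure (1 = first).
        rank = out[measure].rank(method="first")
        total += rank * weight
    # The final order is the rank of the cumulative weighted total.
    out["idx"] = total.rank(method="first").astype(int)
    return out.sort_values("idx")


# Hypothetical sample data keyed by employee number.
sample = pd.DataFrame(
    {"measure_a": [3, 1, 2, 4], "measure_b": [50, 40, 45, 30]},
    index=pd.Index([101, 102, 103, 104], name="empkey"),
)
hybrid = hybrid_order_sketch(sample, ["measure_a", "measure_b"], [0.7, 0.3])
```

Because only the rank of the weighted total matters, scaling every weight by the same factor leaves the resulting order unchanged in this sketch.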

list_builder.compare_dataframes(base, compare, return_orphans=True, ignore_case=True, print_info=False, convert_np_timestamps=True)

Compare all common index and common column DataFrame values and report if any value is not equal in a returned dataframe.

Values are compared only by index and column label, not order. Therefore, the only values compared are within common index rows and common columns. The routine will report the common columns and any unique index rows when the print_info option is selected (True).

Inputs are pandas dataframes and/or pandas series.

This function works well when comparing initial data lists, such as those which may be received from opposing parties.

If return_orphans, returns tuple (diffs, base_loners, compare_loners), else returns diffs. diffs is a differential dataframe.

inputs
base

baseline dataframe or series

compare

dataframe or series to compare against the baseline (base)

return_orphans

separately calculate and return the rows which are unique to base and compare

ignore_case

convert the column labels and column data to be compared to lowercase - this will avoid differences detected based on string case

print_info

option to print out to console verbose statistical information and the dataframe(s) instead of returning dataframe(s)

convert_np_timestamps

numpy returns datetime64 objects when the source is a datetime date-only object. This option will convert back to a date-only object for comparison.
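The label-based comparison described above (shared index rows and shared columns only, regardless of order) might look roughly like the following. This is a simplified sketch, not the compare_dataframes implementation: the helper name diff_common_sketch, the output column names, and the sample frames are assumptions, unique index labels are assumed, and the case and timestamp options are left out.

```python
import pandas as pd


def diff_common_sketch(base, compare):
    """Report cells that differ within the shared rows and columns."""
    rows = base.index.intersection(compare.index)
    cols = base.columns.intersection(compare.columns)
    b = base.loc[rows, cols]
    c = compare.loc[rows, cols]
    # A cell differs when the values are unequal and not both missing.
    mismatch = (b != c) & ~(b.isna() & c.isna())
    records = []
    for col in cols:
        for label in rows[mismatch[col].to_numpy()]:
            records.append((label, col, b.at[label, col], c.at[label, col]))
    return pd.DataFrame(records, columns=["label", "column", "base", "compare"])


# Hypothetical data: rows 10 and 11 are shared, column order differs.
base = pd.DataFrame(
    {"lname": ["smith", "jones", "lee"], "eg": [1, 2, 1]}, index=[10, 11, 12]
)
compare = pd.DataFrame(
    {"eg": [1, 3, 2], "lname": ["smith", "jones", "kim"]}, index=[10, 11, 13]
)
diffs = diff_common_sketch(base, compare)
```

Rows 12 and 13 are ignored here because each appears in only one frame; those are the rows the return_orphans option reports separately.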

list_builder.find_index_locs(df, index_values)

Find the pandas dataframe index location of an array-like input of index labels.

Returns a list containing the index location(s).

inputs
df

dataframe - the index_values input is a subset of the dataframe index.

index_values

array-like collection of values which are a subset of the dataframe index
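For reference, pandas can perform a similar label-to-position lookup directly with Index.get_indexer. The snippet below is only an illustration with made-up data; it assumes a unique index and is not necessarily how find_index_locs is implemented.

```python
import pandas as pd

# Hypothetical dataframe with employee numbers as the index.
df = pd.DataFrame({"eg": [1, 2, 1, 3]}, index=[1001, 1002, 1003, 1004])

# Integer positions of the requested labels, in the order requested.
locs = df.index.get_indexer([1003, 1001]).tolist()
```

Note that get_indexer returns -1 for any label that is not present in the index.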

list_builder.find_row_orphans(base_df, compare_df, col, ignore_case=True, print_output=False)

Given two columns (series) with the same column label in separate pandas dataframes, return values which are unique to one or the other column, not common to both series. Will also work with dataframe indexes.

Returns tuple (base_loners, compare_loners) if not print_output. These are dataframes with the series orphans.

Note: If orphans with identical values are found, both will be reported. However, the routine currently locates only the first corresponding index location and reports that location for both orphans.

inputs
base_df

first dataframe to compare

compare_df

second dataframe to compare

col

column label of the series to compare. The routine will compare the dataframe indexes when the input is ‘index’.

ignore_case

convert col to lowercase prior to comparison

print_output

print results instead of returning results
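The idea of orphan values can be sketched with Series.isin. The helper below is a simplified illustration, not the find_row_orphans source: the name row_orphans_sketch and the sample data are assumptions, the column is assumed to hold strings, and the ‘index’ option and index-location reporting are omitted.

```python
import pandas as pd


def row_orphans_sketch(base_df, compare_df, col, ignore_case=True):
    """Return the rows whose col value appears in only one dataframe."""
    base_vals = base_df[col]
    comp_vals = compare_df[col]
    if ignore_case:
        # Assumes string data; compare lowercase copies of the values.
        base_vals = base_vals.str.lower()
        comp_vals = comp_vals.str.lower()
    base_loners = base_df[~base_vals.isin(comp_vals)]
    compare_loners = compare_df[~comp_vals.isin(base_vals)]
    return base_loners, compare_loners


# Hypothetical name lists from two sources.
base_df = pd.DataFrame({"lname": ["Smith", "Jones", "Lee"]})
compare_df = pd.DataFrame({"lname": ["SMITH", "Kim", "lee"]})
base_loners, compare_loners = row_orphans_sketch(base_df, compare_df, "lname")
```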

list_builder.find_series_locs(df, series_values, column_label)

Find the pandas dataframe index location of an array-like input of series values.

Returns a list containing the index location(s).

inputs
df

dataframe - the series_values input is a subset of one of the dataframe columns.

series_values

array-like collection of values which are a subset of one of the dataframe columns (the column_label input)

column_label

the series within the pandas dataframe containing the series_values
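A comparable lookup by column value can be written with Series.isin and numpy.flatnonzero. This is an illustrative sketch with made-up data, not the find_series_locs source, and unlike the description above it returns positions in dataframe order rather than in the order of the requested values.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe; lname is the column being searched.
df = pd.DataFrame(
    {"lname": ["smith", "jones", "lee", "kim"]},
    index=[1001, 1002, 1003, 1004],
)

# Boolean mask of matching rows, converted to integer positions.
mask = df["lname"].isin(["lee", "smith"])
locs = np.flatnonzero(mask.to_numpy()).tolist()
```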

list_builder.names_to_integers(names, leading_precision=5, normalize_alpha=True)

Convert a list or series of string names (i.e. last names) into integers for numerical sorting.

Returns tuple (int_names, int_range, name_percentages)

inputs
names

List or pandas series containing strings for conversion to integers

leading_precision

Number of characters to use with full numeric precision, remainder of characters will be assigned a rounded single digit between 0 and 9

normalize_alpha

If True, insert ‘aaaaaaaaaa’ and ‘zzzzzzzzzz’ as bottom and top names. Otherwise, bottom and top names will be calculated from within the names input

output

  1. an array of the name integers

  2. the range of the name integers

  3. an array of corresponding percentages for each name integer relative to the range of name integers array

Note: This function demonstrates the possibility of constructing a list using any type or combination of attributes.
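One way to picture the name-to-integer idea is to treat the leading letters of each name as base-26 digits. The sketch below is a deliberately simplified stand-in and not the names_to_integers algorithm: the helper name names_to_int_sketch is hypothetical, characters beyond leading_precision are ignored rather than rounded to a single digit, and the bottom and top names are taken from the input instead of being inserted.

```python
def names_to_int_sketch(names, leading_precision=5):
    """Encode the leading letters of each name as a base-26 integer."""
    ints = []
    for name in names:
        letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
        # Pad short names with 'a' so every name has the same digit count.
        letters = (letters + ["a"] * leading_precision)[:leading_precision]
        value = 0
        for ch in letters:
            value = value * 26 + (ord(ch) - ord("a"))
        ints.append(value)
    return ints


# Hypothetical names; the integers sort in the same order as the names.
name_ints = names_to_int_sketch(["baker", "Adams", "clark"])
low, high = min(name_ints), max(name_ints)
int_range = high - low
# Position of each name within the range, from 0.0 to 1.0.
percentages = [(n - low) / int_range for n in name_ints]
```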

list_builder.prepare_master_list(name_int_demo=False, pre_sort=True)

Add attribute columns to a master list. One or more of these columns will be used by the build_list function to construct a “hybrid” list ordering.

Within each employee group, employees must be listed in seniority order relative to the other employees from the same group. Order between groups is unimportant at this step.

New columns added: [‘age’, ‘s_lmonths’, ‘jnum’, ‘job_count’, ‘rank_in_job’, ‘jobp’, ‘eg_number’, ‘eg_spcnt’]

inputs
name_int_demo

if True, lname strings are converted to an integer then a corresponding alpha-numeric percentage for constructing lists by last name. This is a demo only to show that any attribute may be used as a list weighting factor.

pre_sort

sort the master data dataframe doh and ldate columns prior to beginning any calculations. This sort has no effect on the other columns. The s_lmonths column will be calculated on the sorted ldate data.

Job-related attributes are referenced to job counts from the settings dictionary.

list_builder.sort_and_rank(df, col, tiebreaker1=None, tiebreaker2=None, reverse=False)

Sort a dataframe by a specified attribute and insert a column indicating the resultant ranking. Tiebreaker inputs select columns to be used for secondary ordering in the event of value ties. Reverse ordering may be selected as an option.

inputs
df

input dataframe

col (string)

dataframe column to sort

tiebreaker1, tiebreaker2 (string(s))

second and third sort columns to break ties with primary col sort

reverse (boolean)

If True, reverses sort (descending values)
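The sort-then-number pattern can be sketched with DataFrame.sort_values. The helper below is illustrative only and not the sort_and_rank source: the name sort_and_rank_sketch, the "_rank" column suffix, and the sample data are assumptions, and the sketch assumes that reverse applies to the primary column only while tiebreakers stay ascending.

```python
import pandas as pd


def sort_and_rank_sketch(df, col, tiebreakers=(), reverse=False):
    """Sort by col (then tiebreakers) and number the rows from 1."""
    keys = [col, *tiebreakers]
    # Assumption: only the primary column is reversed.
    ascending = [not reverse] + [True] * len(tiebreakers)
    out = df.sort_values(keys, ascending=ascending)
    out[col + "_rank"] = range(1, len(out) + 1)
    return out


# Hypothetical data with an age tie broken by date of hire.
df = pd.DataFrame(
    {"age": [40, 35, 40], "doh": ["2001-05-01", "1999-01-01", "1998-03-01"]},
    index=[1, 2, 3],
)
ranked = sort_and_rank_sketch(df, "age", tiebreakers=("doh",))
```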

list_builder.sort_eg_attributes(df, attributes=['doh', 'ldate'], reverse_list=[0, 0], add_columns=False)

Sort master list attribute columns by employee group in preparation for list construction. The overall master list structure and order is unaffected; only the selected attribute columns are sorted (normally date-related columns such as doh or ldate).

inputs
df

The master data dataframe (does not need to be sorted)

attributes

columns to sort by eg (inplace)

reverse_list

If an attribute is to be sorted in reverse order (descending), use a ‘1’ in the list position corresponding to the position of the attribute within the attributes input

add_columns

If True, an additional column for each sorted attribute will be added to the resultant dataframe, with the suffix ‘_sort’ added to it.
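Sorting a column within each group while leaving every row in place can be sketched as follows. This is an illustration under stated assumptions, not the sort_eg_attributes source: the helper name sort_within_groups_sketch and the sample data are hypothetical, the group column is assumed to be named eg, the index is assumed to be unique, and the add_columns option is omitted.

```python
import pandas as pd


def sort_within_groups_sketch(df, attributes, reverse_list, group_col="eg"):
    """Sort each attribute's values within each group; rows stay in place."""
    out = df.copy()
    for attr, rev in zip(attributes, reverse_list):
        for _, labels in out.groupby(group_col).groups.items():
            values = out.loc[labels, attr].sort_values(ascending=not rev)
            # Write the sorted values back into the same row positions.
            out.loc[labels, attr] = values.to_numpy()
    return out


# Hypothetical master data with interleaved employee groups.
master = pd.DataFrame(
    {
        "eg": [1, 2, 1, 2],
        "lname": ["a", "b", "c", "d"],
        "ldate": pd.to_datetime(
            ["2005-01-01", "2009-01-01", "2003-01-01", "2007-01-01"]
        ),
    }
)
result = sort_within_groups_sketch(master, ["ldate"], [0])
```

In this example the eg and lname columns are untouched; only the ldate values move, and only among rows of the same group.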

list_builder.test_df_col_or_idx_equivalence(df1, df2, col=None)

Check whether two dataframes contain the same elements (but not necessarily in the same order) in either the indexes or a selected column.

inputs
df1, df2

the dataframes to check

col

if not None, test this dataframe column for equivalency, otherwise test the dataframe indexes

Returns True or False
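An order-insensitive equivalence check of this kind can be sketched by comparing sorted copies of the values. The helper name same_elements_sketch and the sample frames are assumptions for illustration; the actual function may use a different approach.

```python
import pandas as pd


def same_elements_sketch(df1, df2, col=None):
    """True if the indexes (or the col values) match, ignoring order."""
    left = df1.index if col is None else df1[col]
    right = df2.index if col is None else df2[col]
    return sorted(left) == sorted(right)


# Hypothetical frames: same index labels, different order and values.
df1 = pd.DataFrame({"eg": [1, 2, 3]}, index=[10, 11, 12])
df2 = pd.DataFrame({"eg": [3, 1, 1]}, index=[12, 10, 11])

index_match = same_elements_sketch(df1, df2)
column_match = same_elements_sketch(df1, df2, col="eg")
```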