functions module
The functions module contains core program routines related to building and working with the data model and associated files.
General definitions:
- "month_form" dataset is length n, where n is the number of months in the model
- "short_form" data has a length equal to the number of employees
- "long_form" data is the length of the cumulative sum of non-retired employees for all months in the data model (could be millions of rows, depending on workgroup size and age)
functions.add_zero_col(arr)
Add a column of zeros as the first column in a 2d array. Output will be a numpy array.
example:
input array:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]])
output array:
array([[ 0,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 0, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [ 0, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [ 0, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [ 0, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]])
- inputs
- arr (array)
2-dimensional numpy array
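A minimal numpy sketch of this behavior (an assumed implementation, not necessarily the library's own code):

```python
import numpy as np

def add_zero_col(arr):
    """Insert a column of zeros before the first column of a 2d array."""
    return np.insert(arr, 0, 0, axis=1)

arr = np.arange(50).reshape(5, 10)
out = add_zero_col(arr)
# out[:, 0] is all zeros; out[:, 1:] equals the original array
```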
functions.age_correction(month_nums_array, ages_array, retage)
(Long_Form)
Returns a long_form (all months) array of employee ages by incrementing starting ages according to month number.
Note: Retirement age increases are handled by the build_program_files script by incrementing retirement dates and by the clip_ret_ages function within the make_skeleton script.
- inputs
- month_nums_array (array)
gen_month_skeleton function output (ndarray)
- ages_array (array)
starting_age function output aligned with long_form (ndarray) i.e. s_age is starting age (aligned to empkeys) repeated each month.
- retage (integer or float)
output clip upper limit for retirement age
Output is s_age incremented by a decimal month value according to month_num (this is a candidate for refactoring with np.put).
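The increment-and-clip logic described above can be sketched as follows (a minimal assumed implementation; the seed-free signature matches the documented inputs):

```python
import numpy as np

def age_correction(month_nums_array, ages_array, retage):
    # increment each starting age by 1/12 of a year per month number,
    # then clip at the retirement age upper limit
    return np.clip(ages_array + month_nums_array * (1.0 / 12), 0, retage)

ages = age_correction(np.array([0, 6]), np.array([30.0, 64.9]), 65)
# → array([30., 65.])  (64.9 + 0.5 is clipped to the 65 retage limit)
```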
functions.align_fill_down(l, u, long_indexed_df, long_array)
Data align current values to all future months (short array segment aligned to long array).
This function is used to set the values from the last standalone month as the initial data for integrated dataset computation when a delayed implementation exists.
Uses pandas dataframe auto-alignment, which is relatively slow. TODO (for developer): consider an all-numpy solution.
- inputs
- l, u (integers)
current month slice indexes (from long df)
- long_indexed_df (dataframe)
empty long dataframe with empkey indexes
- long_array (array)
long array of multiple month data (orig_job, fur_codes, etc)
Process: declare the long indexed dataframe outside of the function (input); grab the current month slice for array insertion (copy); chop the long dataframe to begin with the current month (copy); assign the array to the short dataframe; data align the short dataframe to the long dataframe (chopped to the current month and future); copy the chopped dataframe column as an array into long_array; return long_array.
functions.align_next
"Carry forward" data from one month to the next. Compare indexes (empkeys) from one month to the next month. When a matching index is found, assign the corresponding value to a new result array. Effectively finds the remaining (not retired) employees in the next month and copies the target column data values for them from the current month's data into the next month's data.
- inputs
- this_index_arr (array)
current month index of unique employee keys
- next_index_arr (array)
next month index of unique employee keys (a subset of this_index_arr)
- these_vals_arr (array)
the data column segment (attribute) to carry forward
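A pandas sketch of the carry-forward described above (the actual implementation may be a numpy or numba loop; this version uses index alignment via reindex):

```python
import numpy as np
import pandas as pd

def align_next(this_index_arr, next_index_arr, these_vals_arr):
    # map current-month values onto the next month's (smaller) empkey index
    s = pd.Series(these_vals_arr, index=this_index_arr)
    return s.reindex(next_index_arr).to_numpy()

this_idx = np.array([101, 102, 103, 104])
next_idx = np.array([101, 103, 104])      # employee 102 retired
vals = np.array([5, 6, 7, 8])
carried = align_next(this_idx, next_idx, vals)
# → array([5, 7, 8])
```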
functions.anon_dates(df, date_col_list, max_adj=5, positive_only=True, inplace=False)
Add (or optionally, add or subtract) a random number of days to each element of a date attribute column.
- inputs
- df (dataframe)
short-form (master list) pandas dataframe containing a date attribute column
- date_col_list (list)
name(s) of date attribute column(s) to be adjusted (as a list of strings)
Example:
['ldate', 'doh', 'dob']
- max_adj (integer)
the maximum number of days to add (or optionally subtract) from each element within the date column
- positive_only (boolean)
if True limit the range of adjustment days from zero to the max_adj value. If False, limit the range of adjustment from negative max_adj value to positive max_adj value.
- inplace (boolean)
if True, insert the results directly into the date column(s) of the input dataframe. Caution: make a copy first!
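A minimal sketch of the adjustment, assuming one random offset per row applied to every listed column (the seed parameter is an addition for reproducibility and returning a copy stands in for the inplace option):

```python
import numpy as np
import pandas as pd

def anon_dates(df, date_col_list, max_adj=5, positive_only=True, seed=0):
    """Return a copy of df with random day offsets added to date columns."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    low = 0 if positive_only else -max_adj
    # one offset per row, applied to every listed column (synchronized)
    offsets = pd.to_timedelta(rng.integers(low, max_adj + 1, size=len(out)),
                              unit="D")
    for col in date_col_list:
        out[col] = out[col] + offsets
    return out

df = pd.DataFrame({"ldate": pd.to_datetime(["2010-01-15", "2011-06-30"])})
adj = anon_dates(df, ["ldate"], max_adj=5)
# each adjusted date is 0-5 days after the original
```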
functions.anon_empkeys(df, seq_start=10001, frame_num=10000000, inplace=False)
Produce a list of unique, randomized employee numbers, categorized by employee group number code. Output may be used to anonymize a dataset empkey column.
The dataframe input (df) must contain an employee group (eg) column.
- inputs
- df (dataframe)
short-form (master list) pandas dataframe containing an employee group code column
- seq_start (integer)
this number will be added to each employee group cumulative count to “seed” the random employee numbers. These numbers will be shuffled within employee groups by the function for the output
- frame_num (integer)
This number will be multiplied by each employee code and added to the employee group cumulative counts (added to the seq_start number), and should be much larger than the data model population to provide a constant length employee number (empkey) for all employees.
- inplace (boolean)
if True, insert the results directly into the “empkey” column of the input dataframe. Caution: make a copy first!
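The seed/frame arithmetic described above can be sketched like this (an assumed implementation; the seed parameter and return-instead-of-inplace behavior are additions for illustration):

```python
import numpy as np
import pandas as pd

def anon_empkeys(df, seq_start=10001, frame_num=10000000, seed=0):
    """Anonymized empkeys: group code * frame_num + shuffled group sequence."""
    rng = np.random.default_rng(seed)
    # cumulative count within each employee group, seeded with seq_start
    seq = df.groupby("eg").cumcount() + seq_start
    # shuffle the sequence numbers within each employee group
    shuffled = seq.groupby(df["eg"]).transform(
        lambda s: rng.permutation(s.to_numpy()))
    return df["eg"] * frame_num + shuffled

df = pd.DataFrame({"eg": [1, 1, 1, 2, 2]})
keys = anon_empkeys(df)
# keys are unique, constant length, and begin with the group code digit
```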
functions.anon_master(case, empkey=True, name=True, date=False, sample=False, seq_start=10001, frame_num=10000000, min_seg=3, max_seg=3, add_rev=False, date_col_list=['ldate', 'doh'], max_adj=5, positive_only=True, date_col_list_sec=['dob'], max_adj_sec=5, positive_only_sec=True, n=None, frac=None, reset_index=False)
Specialized function to anonymize selected columns from a master.xlsx file and/or select a subset. All operations are inplace. The original master file is copied and saved as master_orig.xlsx.
The default parameters will replace last names and employee keys with substitute values. Date columns (doh, ldate, dob) will also be adjusted if the date input is set to True and the proper column names are set as column list inputs.
The function reads the original excel file, copies and saves it, modifies the original file as directed, and writes the results back to the original file. Subsequent dataset creation runs will use the modified data. The output master list will be sorted according to the original master list order.
- inputs
- case (string)
the case study name
- empkey (boolean)
if True, anonymize the empkey column
- name (boolean)
if True, anonymize the lname column
- date (boolean)
if True, anonymize date columns as designated with the date_col_list and the date_col_list_sec inputs
- sample (boolean)
if True, sample the dataframe if the n or frac inputs is/are not None
- seq_start (integer)
beginning anonymous employee number portion of empkey
- frame_num (integer)
large frame number which will contain all generated employee numbers. This number will be adjusted to begin with the appropriate employee group code
- min_seg (integer)
minimum number of 2-character segments to include in the generated substitute last names.
- max_seg (integer)
maximum number of 2-character segments to include in the generated substitute last names.
- add_rev (boolean)
if True, add reversed, non-duplicated 2-character segments to the pool of strings for name construction. This is normally not necessary and will construct output strings with multiple consecutive consonants/vowels.
- date_col_list (list)
list of date value columns to adjust. All columns in this list will be adjusted in a synchronized fashion, meaning a single random day adjustment for each row will be applied to each row member of all columns.
- max_adj (integer)
maximum random adjustment deviation, in days, from the original date(s)
- positive_only (boolean)
if True, only adjust dates forward in time
- date_col_list_sec (list)
a secondary list of date column(s) which will be adjusted independently from the date columns in the date_col_list
- max_adj_sec (integer)
maximum random adjustment deviation, in days, from the original date(s) in the date_col_list_sec columns
- positive_only_sec (boolean)
if True, only adjust dates forward in time (for secondary cols)
- n (integer or None)
number of rows to sample if the sample input is True. This input will override the frac input
- frac (float (0.0 - 1.0) or None)
decimal fraction (0.0 to 1.0) of the master list to sample, if the sample input is True and the n input is None
- reset_index (boolean)
if True, reset the index of the output master list (zero-based integer index). Do not use this option normally because it will wipe out the empkey index of the master list.
functions.anon_names(length=10, min_seg=3, max_seg=3, add_rev=False, df=None, inplace=False)
Generate a list of random strings.
Output may be used to anonymize a dataset name column
The length of the output strings will be determined by the min_seg and max_seg inputs. The segments (seg) are random 2-letter combinations of a consonant and a vowel. An additional random consonant or vowel will be added to the segment combinations, so the length of the output strings will always be an odd number. The min and max may be the same value to produce a list of strings of uniform length.
Example:
If the min_seg input is 1 and the max_seg input is 3, the output list will contain strings from 3 (2-letter seg + 1 random letter) to 7 characters.
- inputs
- length (integer)
the length of the output list
- min_seg (integer)
the minimum number of 2 letter segments to include in the output list
- max_seg (integer)
the maximum number of 2 letter segments to include in the output list (must be >= the "min_seg" input)
- add_rev (boolean)
add vowel-consonant combinations to the consonant-vowel segments. (this is not normally needed to produce random and readable strings)
- df (dataframe)
optional short-form pandas dataframe input. If not None, use the length of the dataframe as the “length” input value
- inplace (boolean)
if the “df” input is not None, insert the results directly into the input “lname” column. Caution: make a copy first!
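The segment construction described above can be sketched as follows (an assumed implementation; the seed parameter is an addition, and the add_rev and df/inplace options are omitted):

```python
import random

CONSONANTS = "bcdfghjklmnpqrstvwz"
VOWELS = "aeiou"

def anon_names(length=10, min_seg=3, max_seg=3, seed=0):
    """Generate random pronounceable strings of odd length."""
    rng = random.Random(seed)
    names = []
    for _ in range(length):
        n_seg = rng.randint(min_seg, max_seg)
        # each segment is a random consonant-vowel pair...
        name = "".join(rng.choice(CONSONANTS) + rng.choice(VOWELS)
                       for _ in range(n_seg))
        # ...plus one extra random letter, so total length is always odd
        name += rng.choice(CONSONANTS + VOWELS)
        names.append(name)
    return names

sample = anon_names(5, min_seg=1, max_seg=3)
# each string is 3 to 7 characters long (always an odd length)
```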
functions.anon_pay(df, proportional=True, mult=1.0, inplace=False)
Substitute pay table baseline rate information with a proportional method or with a non-linear, non-proportional method.
- inputs
- df (dataframe)
pandas dataframe containing pay rate data (dataframe representation of the "rates" worksheet from the pay_tables.xlsx workbook)
- proportional (boolean)
if True, use the mult input to increase or decrease all of the “rates” worksheet pay data proportionally. If False, use a fixed algorithm to disproportionally adjust the pay rates.
- mult (integer or float)
if the proportional input is True, multiply all pay rate values by this input value
- inplace (boolean)
if True, replace the values within the original dataframe with the "anonymized" values.
functions.anon_pay_table(case, proportional=True, mult=1.0)
Anonymize the "rates" worksheet of the "pay_tables.xlsx" input file. The rates may be proportionally adjusted (larger or smaller) or disproportionally adjusted with a fixed algorithm.
The original excel file is copied and saved as "pay_tables_orig.xlsx".
All modifications are inplace.
- inputs
- case (string)
the case name
- proportional (boolean)
if True, use the mult input to increase or decrease all of the “rates” worksheet pay data proportionally. If False, use a fixed algorithm to disproportionally adjust the pay rates.
- mult (integer or float)
if the proportional input is True, multiply all pay rate values by this input value
functions.assign_cond_ratio(job, this_job_count, ratio_dict, orig_range, assign_range, eg_range, fur_range, cap=None)
Apply a job ratio assignment condition.
The main job assignment function calls this function at the appropriate months.
This function applies a ratio for job assignment between ratio groups. Each ratio group may contain one or more employee groups. The number of jobs affected may be limited with the “cap” input.
The ratio for job assignment is set with the inputs on the “ratio_cond” worksheet of the settings.xlsx input spreadsheet, using the “group” columns and the corresponding “weights” columns.
Optionally, the ratio of jobs which exists during the “month_start” spreadsheet input may be captured and used for job assignment during the data model months when the ratio job assignment condition is applicable (“month_start” through “month_end”). The existing ratios are captured and used by setting the “snapshot” input cell to True for the appropriate basic job row. When using the snapshot option, any weightings designated within the “weights” columns will be ignored.
There may be a mix of snapshot ratios and ratios set by the “weight” columns for use within the program. There may also be count-capped ratio assignments and straight ratio assignments within the same data model as long as the effective months and jobs do not overlap, but there may only be one row of ratio data for a job level within the same input worksheet.
No bump, no flush rules apply when assigning jobs by ratio, meaning only job openings due to retirements, increases in job counts, or other openings will be assigned according to the ratio schedule. Employees previously holding a job affected by the ratio condition will not be displaced to allow an employee from a different ratio group to have that job when the ratio assignment period begins. Therefore, it may take some time for the desired ratio of job assignments to be achieved if it differs significantly from the actual ratio(s) when the time period of conditional job assignment begins.
- inputs
- job (integer or float)
job level number
- this_job_count (integer or float)
number of jobs available
- ratio_dict (dictionary)
ratio condition dictionary, constructed with the build_program_files script and possibly modified by the set_snapshot_weights function if the “snapshot” option is set to True on the “ratio_cond” worksheet of the “settings.xlsx” input spreadsheet.
- orig_range (1d array)
original job range: month slice of the orig_job column array (normally pertaining to a specific month)
- assign_range (1d array)
job assignment range: month slice of the assign_range column array
- eg_range (1d array)
employee group range: month slice of the eg_range column array
- fur_range (1d array)
furlough range: month slice of the fur_range column array
- cap (integer (or whole float))
if a count ratio job assignment is being used, this number represents the number of jobs affected by the conditional assignment. Available jobs above this amount are not affected.
functions.assign_job_counts
Assign job counts to the job count array month slice.
- inputs
- job_count_range (array)
month slice of long job count array
- assign_range (array)
month slice of long job assignment array
- job (integer)
job level number
- this_job_count (integer)
job count allotted for job level
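The assignment step can be sketched with a boolean mask (a minimal assumed implementation; the real function is likely numba-compiled):

```python
import numpy as np

def assign_job_counts(job_count_range, assign_range, job, this_job_count):
    # write the job level's count wherever that job level was assigned
    job_count_range[assign_range == job] = this_job_count

counts = np.zeros(6, dtype=int)
assigns = np.array([1, 2, 2, 1, 3, 2])
assign_job_counts(counts, assigns, 2, 25)
# counts is now [0, 25, 25, 0, 0, 25]
```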
functions.assign_jobs_full_flush_job_changes(nonret_counts, job_counts, num_job_levels)
(Long_Form)
Using the nonret counts for each month:
- determine the long form slice for assignment
- slice the jobs list from the top to create the job assignment column
- create a corresponding furlough column
- create a job count column
Uses the job_counts (job_gain_loss_table function output [0]) to build stovepiped job lists, allowing for job count changes each month, and a furlough status column.
Unassigned employees (not enough jobs) are left at job number zero. This is the full bump and full flush version.
- inputs
- nonret_counts (numpy array)
array containing the number of non-retired employees for each month
- job_counts (numpy array)
array containing the monthly counts of jobs for each job level
- num_job_levels (integer)
the number of job levels in the model (excluding furlough level)
functions.assign_jobs_nbnf_job_changes(df, lower, upper, total_months, job_reduction_months, start_month, condition_list, sdict, tdict, fur_return=False)
(Long_Form)
Uses the job_gain_or_loss_table job count array for job assignments. Job counts may change up or down in any category for any time period. Handles furlough and return of employees, prior rights/conditions and restrictions, and recall of initially furloughed employees.
Inputs are precalculated outside of function to the extent possible. Returns tuple (long_assign_column, long_count_column, orig jobs, fur_data)
- inputs
- df (dataframe)
long-form dataframe with [‘eg’, ‘sg’, ‘fur’, ‘orig_job’] columns.
- lower (array)
ndarray from make_lower_slice_limits function (calculation derived from cumsum of count_per_month function)
- upper (array)
cumsum of count_per_month function
- total_months (integer or float)
sum of count_per_month function output
- job_reduction_months (list)
months in which the number of jobs is decreased from the get_job_reduction_months function
- start_month (integer)
integer representing the month number to begin calculations, likely month of integration when there exists a delayed integration (from settings dictionary)
- condition_list (list)
list of special job assignment conditions to apply, example: [‘prex’, ‘count’, ‘ratio’]
- sdict (dictionary)
the program settings dictionary (produced by the build_program_files script)
- tdict (dictionary)
job tables dictionary (produced by the build_program_files script)
- fur_return (boolean)
model employee recall from furlough if True using recall schedule from settings dictionary (allows call to mark_for_recall function)
Assigns jobs so that original standalone jobs are assigned each month (if available) unless a better job is available through attrition of employees.
Each month loop starts with the lowest job number.
For each month and for each job level:
1. assigns the nbnf (orig) job if the job array (long_assign_column) element is zero (unassigned) and the orig job number is less than or equal to the job level in the current loop, then
2. assigns the job level in the current loop to unassigned slots from top to bottom in the job array (up to the count of that job level remaining after step one above)
Each month range is determined by slicing using the lower and upper inputs. A comparison is made each month between the original job numbers and the current job loop number.
Job assignments are placed into the monthly segment (assign_range) of the long_assign_column.
The long_assign_column eventually becomes the job number (jnum) column in the dataset.
Original job numbers of 0 indicate no original job and are treated as furloughed employees. No jobs are assigned to furloughees unless furlough_return option is selected.
functions.assign_standalone_job_changes(eg, df_align, lower, upper, total_months, job_counts_each_month, total_monthly_job_count, nonret_each_month, job_change_months, job_reduction_months, start_month, sdict, tdict, apply_sg_cond=True)
(Long_Form)
Uses the job_gain_or_loss_table job count array for job assignments. Job counts may change up or down in any category for any time period. Handles furlough and return of employees, prior rights/conditions and restrictions, and recall of initially furloughed employees.
Inputs are precalculated outside of function to the extent possible. Returns tuple (long_assign_column, long_count_column, held_jobs, fur_data, orig_jobs)
- inputs
- eg (integer)
input from an incremental loop which is used to select the proper employee group recall schedule
- df_align (dataframe)
dataframe with [‘sg’, ‘fur’] columns
- num_of_job_levels (integer)
number of job levels in the data model (excluding a furlough level)
- lower (1d array)
ndarray from make_lower_slice_limits function (calculation derived from cumsum of count_per_month function)
- upper (1d array)
cumsum of count_per_month function
- total_months (integer or float)
sum of count_per_month function output
- job_counts_each_month (array)
output of job_gain_loss_table function[0] (precalculated monthly count of jobs in each job category, size (months,jobs))
- total_monthly_job_count (array)
output of job_gain_loss_table function[1] (precalculated monthly total count of all job categories, size (months))
- nonret_each_month (1d array)
output of count_per_month function
- job_change_months (list)
the min start month and max ending month found within the array of job_counts_each_month inputs (find the range of months to apply consideration for any job changes - prevents unnecessary looping)
- job_reduction_months (list)
months in which the number of jobs is decreased (list). from the get_job_reduction_months function
- start_month (integer)
starting month for calculations, likely implementation month from settings dictionary
- sdict (dictionary)
the program settings dictionary (produced by the build_program_files script)
- tdict (dictionary)
job tables dictionary (produced by the build_program_files script)
- apply_sg_cond (boolean)
compute with pre-existing special job quotas for certain employees marked with a one in the sg column (special group) according to a schedule defined in the settings dictionary
Assigns jobs so that original standalone jobs are assigned each month (if available) unless a better job is available through attrition of employees.
Each month loop starts with the lowest job number.
For each month and for each job level:
1. assigns the nbnf (orig) job if the job array (long_assign_column) element is zero (unassigned) and the orig job number is less than or equal to the job level in the current loop, then
2. assigns the job level in the current loop to unassigned slots from top to bottom in the job array (up to the count of that job level remaining after step one above)
Each month range is determined by slicing using the lower and upper inputs.
A comparison is made each month between the original job numbers and the current job loop number.
Job assignments are placed into the monthly segment (assign_range) of the long_assign_column.
The long_assign_column eventually becomes the job number (jnum) column in the dataset.
Original job numbers of 0 indicate no original job and are treated as furloughed employees - no jobs are assigned to furloughees unless furlough_return option is selected.
functions.career_months(ret_input, start_date)
(Short_Form)
Determine how many months each employee will work, including the partial month of retirement.
“ret_input” (retirement dates) may be in the form of a pandas dataframe, pandas series, array, list, or string
Output is a numpy array of integers containing the number of months between the start_date and each date in the ret_input (months from start date to retirement for each employee)
- inputs
- ret_input (dataframe, series, array, list, or string)
retirement dates input
- start_date (string date)
comparative date for the retirement dates input, normally the data model starting date
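A sketch of the month-count calculation using calendar month differences (an assumed implementation; the actual handling of the partial retirement month may differ):

```python
import pandas as pd

def career_months(ret_input, start_date):
    """Whole months from start_date to each retirement date (sketch)."""
    start = pd.Timestamp(start_date)
    rets = pd.to_datetime(pd.Series(ret_input))
    return ((rets.dt.year - start.year) * 12
            + (rets.dt.month - start.month)).to_numpy()

months = career_months(["2014-12-31", "2015-06-30"], "2014-01-31")
# → array([11, 17])
```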
functions.clear_dill_files()
Remove all files from the 'dill' folder. Used when changing case study; avoids the possibility of files from previous calculations being used by the new study.
functions.clip_ret_ages(ret_age_dict, init_ret_age, dates_long_arr, ages_long_arr)
Clip employee ages in each employee's final month to the proper retirement age if the model includes an increasing retirement age over time.
- inputs
- ret_age_dict (dictionary)
dictionary of retirement increase date to new retirement age as defined in settings dictionary
- init_ret_age
initial retirement age prior to any increase
- dates_long_arr (numpy array)
array of month dates (long form, same value during each month)
- ages_long_arr (numpy array)
array of employee ages (long form)
functions.contract_year_and_raise(df, settings_dict)
(Month_Form)
Generate the contract pay year for indexing into the pay table. Pay year is clipped to last year of contract.
Also create an annual assumed raise column applicable to the time period beyond the contract duration. This is a multiplier column with a compounded value each subsequent year. If no raise is elected (via the settings.xlsx input file, “scalars” worksheet), then this column will be all ones. The annual raise percentage is designated on the same worksheet. The input df must be a single column dataframe containing end-of-month dates, one for each month of the data model.
NOTE: this function can accept any number of pay exception periods through the pay_exceptions dictionary, populated by the "pay_exceptions" worksheet values within the settings.xlsx input file; see the program documentation for more information.
- inputs
- df (dataframe)
a single column dataframe containing end-of-month dates, one for each month of the data model
- settings_dict (dictionary)
dictionary of program settings generated by the build_program_files script
functions.convert_to_datetime(date_data, attribute)
Convert a dataframe column, series, list, or string input into an array of datetimes.
- inputs
- date_data (dataframe, series, array, list, or string)
pandas dataframe with a date column containing string dates or datetime objects, pandas series of dates (strings or datetime objects), a list/array of date strings or datetime objects, or a single comma-separated string containing date information.
- attribute (string)
if the date_data type is a dataframe, the name of the column containing the date information. Otherwise, this input is ignored.
functions.convert_to_enhanced(eg_job_counts, j_changes, job_dict)
Convert employee basic job counts to enhanced job counts (includes full-time and part-time job level counts) and convert basic job change schedules to enhanced job change schedules. Returns tuple (enhanced_job_counts, enhanced_job_changes)
- inputs
- eg_job_counts
A list of lists of the basic level job counts for each employee group. Each nested list has a length equal to the number of basic job levels.
example:
[[197, 470, 1056, 412, 628, 1121, 0, 0], [80, 85, 443, 163, 96, 464, 54, 66], [0, 26, 319, 0, 37, 304, 0, 0]]
- j_changes
input from the settings dictionary describing change of job quantity over months of time (list)
example:
[1, [35, 64], 87, [80, 7, 0]]
[job level, [start month, end month], total job count change, [eg allotment of change for standalone calculations]]
- job_dict
conversion dictionary for an enhanced model. This is the "jd" key value from the settings dictionary. It uses the basic job levels as the keys, and lists as values which contain the new full- and part-time job level numbers and the percentage of basic job counts to be converted to full-time jobs.
example:
{1: [1, 2, 0.6], 2: [3, 5, 0.625], 3: [4, 6, 0.65], 4: [7, 8, 0.6], 5: [9, 12, 0.625], 6: [10, 13, 0.65], 7: [11, 14, 0.65], 8: [15, 16, 0.65]}
functions.convert_to_hex(rgba_input)
Convert float rgba color values to string hex color values.
rgba = color values expressed as red, green, blue, and (optionally) alpha float values.
rgba_input may be:
- a single rgba list or tuple
- a list or tuple containing rgba lists or rgba tuples
- a dictionary of key: rgba values
Output is string hex color values in place of the rgba values.
Examples:
input single rgba value:
>>> sample_value = (.5, .3, .2)
>>> convert_to_hex(sample_value)
'#7f4c33'
input list:
>>> sample_list = [[0.65, 0.81, 0.89, 1.0], [0.31, 0.59, 0.77, 1.0],
...                [0.19, 0.39, 0.70, 1.0], [0.66, 0.85, 0.55, 1.0]]
>>> convert_to_hex(sample_list)
['#a5cee2', '#4f96c4', '#3063b2', '#a8d88c']
input dict:
>>> sample_dict = {1: (.65, .45, .45, 1.), 2: [.60, .45, .45, 1.], 3: (.55, .45, .45, 1.)}
>>> convert_to_hex(sample_dict)
{1: '#a57272', 2: '#99593a', 3: '#8c7249'}
- inputs
- rgba_input (tuple, list, or dictionary)
input may be a single list or tuple OR a list of float rgba values as lists or tuples OR a dictionary with values as lists or tuples. Valid string hex values may also be passed as inputs and will be returned unchanged.
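The scalar conversion can be sketched as below; `rgb_to_hex` is a hypothetical helper covering only the single-tuple case (the full function would also dispatch on list and dictionary inputs, and the alpha value is ignored here):

```python
def rgb_to_hex(rgba):
    """Convert a single rgb(a) float tuple to a hex string (alpha ignored)."""
    r, g, b = rgba[:3]
    # scale each 0.0-1.0 float channel to an integer 0-255 and format as hex
    return "#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255))

rgb_to_hex((.5, .3, .2))
# → '#7f4c33'
```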
functions.copy_excel_file(case, file, return_path_and_df=False, revert=False, verbose=True)
Copy an excel file and add '_orig' to the file name, or restore an excel file from the '_orig' copy.
- inputs
- case (string)
the data model case name
- file (string)
the excel file name without the .xlsx extension
- return_path_and_df (boolean)
if True, return a tuple containing the file path as a string and the worksheet designated by the “file” input as a dataframe
- revert (boolean)
if False, copy the excel file and add '_orig' to the file name. If True, restore the copied file and drop the '_orig' suffix
- verbose (boolean)
if True, print a brief summary of the operation result
functions.count_avail_jobs
Use numba to loop through the job assignment range and count the number of positions in a specified job level already assigned (carried over from the previous month), then subtract that result from the total job level positions count. The result identifies the number of openings available for the current month.
- inputs
- assign_range (array)
monthly slice of job assignment array
- job (integer)
job level being tested
- this_job_count (integer)
total job positions count for the job being tested
functions.count_per_month(career_months_array)
(Month_Form)
Returns the number of employees remaining (not retired) for each month: a count of the career_months_array input (np array) elements that are greater than or equal to each incremental loop month number.
Note: an alternate method to this function is a value count of mnums:
    df_actives_each_month = pd.DataFrame(df_idx.mnum.value_counts())
    df_actives_each_month.columns = ['count']
- input
- career_months_array
output of career_months function. This input is an array containing the number of months each employee will work until retirement.
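A minimal sketch of this count, assuming an employee with k career months is active in months 0 through k-1:

```python
import numpy as np

def count_per_month(career_months_array):
    """Count employees still active (not yet retired) in each model month."""
    cmths = np.asarray(career_months_array)
    # an employee working k months is active in months 0 .. k-1
    return np.array([np.count_nonzero(cmths > m) for m in range(cmths.max())])

counts = count_per_month([3, 1, 2])
# → array([3, 2, 1]); counts.sum() equals total employee-months (6)
```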
functions.create_snum_and_spcnt_arrays(jnums, job_level_count, monthly_population_counts, monthly_job_counts, lspcnt_remaining_only)
Calculates: long_form seniority number ('snum', only active employees), seniority percentage ('spcnt', only active employees), list number ('lnum', includes furloughees), list percentage ('lspcnt', includes furloughees).
Iterate through monthly jobs count data, capturing monthly_job_counts to be used as the denominator for percentage calculations.
This function produces four ndarrays which will make up four columns in the long_form pandas dataset.
Returns tuple (long_snum_array, long_spcnt_array, long_list_array, long_lspcnt_array)
- inputs
- jnums
the long_form jnums result
- job_level_count
number of job levels in model
- monthly_population_counts
count_per_month function output
- monthly_job_counts
total of all jobs each month, derived from the job_gain_loss_table function output (table): np.add.reduce(table, axis=1)
- lspcnt_remaining_only
calculate list percentage based on employees remaining in each month including furloughees, otherwise percentage calculation denominator is the greater of employees remaining (incl fur) or jobs available
functions.distribute(available, weights, cap=None)
Proportionally distribute 'available' according to 'weights'.
usage example:
distribute(334, [2.48, 1])
returns distribution as a list, rounded as integers:
[238, 96]
- inputs
- available (integer)
the count (number) to divide
- weights (list)
relative weighting to be applied to available count for each section. numbers may be of any size, integers or floats. the number of resultant sections is the same as the number of weights in the list.
- cap (integer)
limit distribution total to this amount, if less than the “available” input.
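A minimal sketch reproducing the documented example (an assumed implementation; note that simple nearest-integer rounding may not always sum exactly to the available count, so the real function may apply a remainder correction):

```python
def distribute(available, weights, cap=None):
    """Proportionally split an integer count according to weights (sketch)."""
    if cap is not None:
        available = min(available, cap)
    total = sum(weights)
    # round each proportional share to the nearest integer
    return [int(round(available * w / total)) for w in weights]

distribute(334, [2.48, 1])
# → [238, 96]
```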
functions.eg_quotas(quota, actual, cap=None, this_job_count=None)
Determine the job counts to be assigned to each ratio group during a ratio condition job assignment routine.
- inputs
- quota (list or list-like)
the desired job counts for each employee group
- actual (list or list-like)
the actual job counts for each employee group
- cap (integer (or whole float))
if a count ratio routine is being used, this is the total count of jobs to be affected by the ratio
- this_job_count (integer (or whole float))
the monthly count of the applicable job
-
functions.
find_index_val
(df1, df2, df2_vals, col1=None, col2=None)[source]¶ find values in one dataframe which share an index with given values in another dataframe.
df1 index, df2 index, and the value columns must contain unique values.
- inputs
- df1 (dataframe)
the first dataframe containing values to index match in another dataframe
- df2 (dataframe)
the second dataframe with corresponding index values
- df2_vals (list)
values to match
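The lookup pattern described above can be sketched like this (assumed behavior; the handling of the col1/col2 defaults is omitted):

```python
import pandas as pd

def find_index_val(df1, df2, df2_vals, col1=None, col2=None):
    # Sketch: map each df2 value back to its (unique) index label, then
    # look up col1 in df1 at those same index labels.
    mapper = pd.Series(df2.index.values, index=df2[col2].values)
    idx = mapper.loc[df2_vals].values
    return df1.loc[idx, col1].tolist()

df1 = pd.DataFrame({'name': ['a', 'b', 'c']}, index=[10, 20, 30])
df2 = pd.DataFrame({'val': [100, 200, 300]}, index=[10, 20, 30])
find_index_val(df1, df2, [300, 100], col1='name', col2='val')   # ['c', 'a']
```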
-
functions.
gen_month_skeleton
(month_count_array)[source]¶ Long_Form
Create an array of month numbers with each month number repeating n times for n non-retired employees in each month. i.e. the first month section of the array will be all zeros (month: 0) repeating for the number of non-retired employees. The next section of the array will be all ones (month: 1) repeating for the number of employees remaining in month 1. Output is a 1d ndarray.
This function creates the first column and the basic form of the skeleton dataframe which is the basis for the dataset dataframes.
- inputs
- month_count_array
a numpy array containing the number of employees remaining or not retired for each month. This input is the result of the count_per_month function.
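The repetition described above maps directly onto np.repeat; a sketch of the assumed mechanism:

```python
import numpy as np

def gen_month_skeleton(month_count_array):
    # Sketch: month number i is repeated month_count_array[i] times.
    months = np.arange(len(month_count_array))
    return np.repeat(months, month_count_array.astype(int))

# 5 employees in month 0, 3 remaining in month 1, 2 in month 2:
gen_month_skeleton(np.array([5, 3, 2]))
# array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2])
```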
-
functions.
gen_skel_emp_idx
(monthly_count_array, career_mths_array, empkey_source_array)[source]¶ Long_Form
For each employee who remains for each month, grab that employee index number.
This index will be the key to merging in other data using data alignment. Inputs are the result of the count_per_month function (np.array) and the result of the career_months function
- inputs
- monthly_count_array (numpy array)
count of non-retired active employees for each month in the model, the output from the count_per_month function.
- career_mths_array (numpy array)
career length in months for each employee, output of the career_months function.
- empkey_source_array (numpy array)
empkey column data as array
Returns tuple (skel_idx_array, skel_empkey_array)
-
functions.
get_job_change_months
(job_changes)[source]¶ extract a sorted list of only the unique months containing a change in any job count as defined within the settings dictionary job change schedules
- input
- job_changes
list of job change schedule lists, normally equal to the j_changes variable from the settings dictionary
-
functions.
get_job_reduction_months
(job_changes)[source]¶ extract a sorted list of only the unique months containing a reduction in any job count as defined within the settings dictionary job change schedules
- input
- job_changes
list of job change schedule lists, normally equal to the j_changes variable from the settings dictionary
-
functions.
get_month_slice
(df, l, h)[source]¶ Convenience function to extract data for a particular month. Input is low and high indexes of target month data (within dataset containing many months)
The input may also be an array (not limited to a dataframe).
- inputs
- df
dataframe (or array) to be sliced
- l
lower index of slice
- h
upper index of slice
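A minimal sketch of the slicing convenience described (the dataframe/array branching shown is an assumption):

```python
import numpy as np

def get_month_slice(df, l, h):
    # Sketch: positional row slice for a dataframe, plain slice otherwise.
    try:
        return df.iloc[l:h]
    except AttributeError:
        return df[l:h]

get_month_slice(np.arange(10), 2, 5)   # array([2, 3, 4])
```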
-
functions.
get_recall_months
(list_of_recall_schedules)[source]¶ extract a sorted list of only the unique months containing a recall as defined within the settings dictionary recall schedules.
- input
- list_of_recall_schedules
list of recall schedule lists, normally equal to the recalls variable from the settings dictionary
-
functions.
job_gain_loss_table
(months, job_levels, init_job_counts, job_changes, standalone=False)[source]¶ Make two arrays of job tally information.
The first array has a row for each month in the model, and a column for each job level (excluding furlough). This array provides a count for each job for each month of the model accounting for changes provided by the job change schedules defined by the settings dictionary. The second array is a one-dimensional array containing the sum for all jobs for each month of the model.
- inputs
- months (integer)
number of months in model
- job_levels (integer)
number of job levels in model (excluding furlough level)
- init_job_counts (tuple of two numpy arrays)
initial job counts. Output from the make_jcnts function, essentially an array of the job count lists for each employee group and an array of the combined counts.
- job_changes (list)
The list of job changes from the settings dictionary.
- standalone (boolean)
if True, use the job count lists for the separate employee groups, otherwise use the combined job count
Returns tuple (job_table, monthly_job_totals)
-
functions.
load_datasets
(other_datasets=['standalone', 'skeleton', 'edit', 'hybrid'])[source]¶ Create a dictionary of proposal names to corresponding datasets. The datasets are generated with the RUN_SCRIPTS notebook. This routine reads the names of the case study proposals from a pickled dataframe (‘dill/proposal_names.pkl’), created by the build_program_files.py script. It then looks for the matching stored datasets within the dill folder. The datasets are loaded into a dictionary, using the proposal names as keys.
The dictionary allows easy reference to datasets from the Jupyter notebook and from within functions.
- input
- other_datasets (list)
list of datasets to load in addition to those computed from the proposals (from the case-specific proposals.xlsx Excel file)
-
functions.
longevity_at_startdate
(ldate_input, start_date, return_as_months=False)[source]¶ (Short_Form)
determine how much longevity (years) each employee has accrued as of the start date
float output is longevity in years (+1 added to reflect current 1-based pay year)
- inputs
- ldate_input (dataframe, series, list, or string)
list of longevity dates in datetime format
- start_date (string date)
comparative date for longevity dates, normally the data model starting date
- return_as_months (boolean)
option to return result as month value instead of year value
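The calculation can be sketched as below. The 365.25-day-year convention and the application of the +1 pay-year offset to the month output are assumptions; the library implementation may count months differently:

```python
import pandas as pd

def longevity_at_startdate(ldate_input, start_date, return_as_months=False):
    # Sketch only - the day-count convention here is an assumption.
    ldates = pd.to_datetime(pd.Series(ldate_input))
    start = pd.to_datetime(start_date)
    years = (start - ldates).dt.days / 365.25 + 1   # +1 for 1-based pay year
    return (years * 12 if return_as_months else years).tolist()

# one year of longevity plus the 1-based pay-year offset -> ~2.0 years
longevity_at_startdate(['2010-01-01'], '2011-01-01')
```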
-
functions.
make_cat_order
(ds, table)[source]¶ make a long-form “cat_order” (global job ranking) column. This function assigns a global job position value to each employee, considering the modeled job level hierarchy and the job count within each level. For example, if a case study contains 3 job levels with 100 jobs in each level, an employee holding a job in the middle of job level 2 would be assigned a cat_order value of 150.
Category order for standalone employee groups is “normalized” to an integrated scale by applying standalone job level percentage (relative position within a job level) to the integrated job level counts. This process allows “apples to apples” comparison between standalone and integrated job progression.
Standalone cat_order will only reflect job levels available within the standalone scenario. If the integrated model contains job levels which do not exist within a standalone employee group model, standalone cat_order results will exclude the respective job level rank segments and will rank the existing standalone data according to the integrated ranking scale.
The routine creates numpy array lookup tables from integrated job level count data for each month of the model. The tables are the source for count and additive information which is used to calculate a rank number within job level and cumulative job count additives.
Month number and job number arrays (from the input ds (dataset)) are used to index into the numpy lookup arrays, producing the count and additive arrays.
A simple formula is then applied to the percentage, count, and additive arrays to produce the cat_order array.
- inputs
- ds (dataframe)
a dataset containing [‘jobp’, ‘mnum’, ‘jnum’] columns
- table (numpy array)
the first output from the job_gain_loss_table function which is a numpy array with total job counts for each job level for each month of the data model
-
functions.
make_decile_bands
(num_bands=40, num_returned_bands=10)[source]¶ creates an array of lower and upper percentile values surrounding a consistent schedule of percentile markers. If the user desires to sample data at every 10th percentile, this function provides selectable bottom and top percentile limits surrounding each 10th percentile, or variable-width sample ranges.
The num_bands input must be a multiple of 5, greater than or equal to 10, and less than 10000.
The num_returned_bands input must be a multiple of 5, equal to or less than the num_bands input, and num_bands/num_returned_bands must have no remainder.
Used for selecting sample employees surrounding deciles (0, 10, 20 etc. percent levels).
Top and bottom bands will be half of normal size.
- inputs
- num_bands
The width of the bands in percentage is determined by the num_bands input. An input of 40 would produce bands 2.5% wide (100/40); top and bottom bands would be 1.25% wide. Ex. 0-1.25%, 8.75-11.25%, … 98.75-100%
- num_returned_bands
number of returned delineated sections. Must be a multiple of 5 less than or equal to the num_bands value with no remainder when divided into the num_bands value. (note: an input of 10 would result in 11 actual segments, one-half band at the top and bottom of list (0% and 100%), and 9 full bands surrounding each decile, 10% to 90%)
-
functions.
make_delayed_job_counts
(imp_month, delayed_jnums, lower, upper)[source]¶ Make an array of job counts to be inserted into the long_form job counts array of the job assignment function. The main assignment function calls this function prior to the implementation month. The array output of this function is inserted into what will become the job count column. These jobs are from the standalone job results. The job count column displays a total monthly count of the job in the corresponding jnum (job number) column.
- inputs
- imp_month (integer)
implementation month, defined by settings dictionary
- delayed_jnums (numpy array)
array of job numbers, normally data from the start of the model through the implementation month
- lower (numpy array)
array of indexes marking the beginning of data for each month within a larger array of stacked, multi-month data
- upper (numpy array)
array of indexes marking the end of data for each month
-
functions.
make_dict_from_columns
(df, key_col, value_col)[source]¶ Make a dictionary from two dataframe columns. One column will be the keys and the other the values.
Unique key column values will be assigned dictionary values. If the key_col input contains duplicates, only the last duplicate key-value pair will exist within the returned dictionary.
- inputs
- df (dataframe)
pandas dataframe containing the columns
- key_col (string (or possibly integer))
dataframe column which will become dictionary keys
- value_col (string (or possibly integer))
dataframe column which will become dictionary values
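The two-column-to-dictionary conversion reduces to a zip (a sketch of the documented behavior, including the last-duplicate-wins rule):

```python
import pandas as pd

def make_dict_from_columns(df, key_col, value_col):
    # Duplicate keys are overwritten, so only the last key-value pair survives.
    return dict(zip(df[key_col], df[value_col]))

df = pd.DataFrame({'k': [1, 2, 2], 'v': ['a', 'b', 'c']})
make_dict_from_columns(df, 'k', 'v')   # {1: 'a', 2: 'c'}
```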
-
functions.
make_eg_pcnt_column
(df, recalc_each_month=False, mnum=0, inplace=True, trim_ones=True, fixed_col_name='eg_start_pcnt', running_col_name='eg_pcnt')[source]¶ make an array derived from the input df reflecting one of the following options:
- Option A:
The percentage of each employee within his/her original employee group for a selected month. The array values will be data-aligned with the df input index. This option is useful for tracking percentile cohorts throughout the model.
- Option B:
The percentage of each employee within his/her original employee group recalculated each month. This has the effect of adjusting each group relative percentage for population changes due to retirements, furlough, etc. This option is useful for monthly percentile cohort comparisons.
This function either adds a column to the input dataframe or returns an array of values, the same length as the input dataframe.
Note: this function’s calculations include any furloughed employees.
Assign to a long-form dataframe (with default month 0 values aligned):
make_eg_pcnt_column(df)
- inputs
- df (dataframe)
pandas dataframe containing an employee group code column (‘eg’) and a month number column (‘mnum’). The dataframe must be indexed with employee number code integers (‘empkey’)
- recalc_each_month (boolean)
- if True:
recalculate separate employee group percentage each month of data model
- if False:
calculate values for one month only - align those values by employee number (empkey) to the entire data model
- mnum (integer)
if recalc_each_month is True, calculate values for this selected month number
- inplace (boolean)
if True, add a column to the input dataframe with the calculated values. If False, return a numpy array of the calculated values.
- trim_ones (boolean)
if True, replace 100% values (1.0) with a value slightly under 1.0 (.9999). This action assists construction of percentile quantiles for membership grouping purposes.
- exclude_fur (boolean)
if True, remove furloughed employees from percentage calculations
- fixed_col_name (string)
manually designated name for dataframe column when recalc_each_month input is False and inplace input is True.
- running_col_name (string)
manually designated name for dataframe column when recalc_each_month input is True and inplace input is True.
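Option A can be illustrated with a simplified sketch. The helper name eg_start_pcnt is hypothetical, and the trim_ones and furlough handling are omitted:

```python
import pandas as pd

def eg_start_pcnt(df, mnum=0):
    # Hypothetical simplified helper: each employee's percentage within
    # his/her employee group at the selected month, data-aligned to the
    # full long-form index by empkey.
    m0 = df[df.mnum == mnum]
    pcnt = (m0.groupby('eg').cumcount() + 1) / \
        m0.groupby('eg')['eg'].transform('count')
    return df.index.map(pd.Series(pcnt.values, index=m0.index))

df = pd.DataFrame({'eg': [1, 1, 2, 1, 1, 2],
                   'mnum': [0, 0, 0, 1, 1, 1]},
                  index=[101, 102, 103, 101, 102, 103])   # empkey index
list(eg_start_pcnt(df))   # [0.5, 1.0, 1.0, 0.5, 1.0, 1.0]
```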
-
functions.
make_group_lists
(df, column_name)[source]¶ this function is used with Excel input to convert string objects and integers into Python lists containing integers. This function is used with the count_ratio_dict dictionary construction. The function works with one column at a time.
Output is a list of lists which may be reinserted into a column of the dataframe.
example:
A    B    C    D
1    6    0    “2,3”
8    4    5    “5”

make_group_lists(df, ["D"])

[[2, 3], [5]]
This function allows the user to type the string 2,3 into an Excel worksheet cell and have it interpreted by seniority_list as [2, 3]
- inputs
- df (dataframe)
dataframe containing Excel employee group codes
- column_name
dataframe column name to convert
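The string-to-list conversion can be sketched as follows (assumed behavior; accepting the column name as a single-item list matches the documented example call):

```python
import pandas as pd

def make_group_lists(df, column_name):
    # Sketch: convert Excel cell contents ("2,3", "5", or plain integers)
    # into lists of integers, one list per row.
    col = column_name[0] if isinstance(column_name, list) else column_name
    result = []
    for value in df[col]:
        if isinstance(value, str):
            result.append([int(x) for x in value.split(',')])
        else:
            result.append([int(value)])
    return result

df = pd.DataFrame({'D': ['2,3', '5']})
make_group_lists(df, ['D'])   # [[2, 3], [5]]
```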
-
functions.
make_intgrtd_from_sep_stove_lists
(job_lists_arr, eg_arr, fur_arr, eg_total_jobs, num_levels, skip_fur=True)[source]¶ Month_Form
Compute an integrated job list built from multiple independent eg stovepiped job lists.
This function is for multiple egs (employee groups) - multiple lists in one job_lists_arr.
Creates an ndarray of job numbers.
Function takes independent job number lists and an array of eg codes which represent the eg ordering in the proposed list.
Job numbers from the separate lists are added to the result array according to the eg_arr order. Jobs on each list do not have to be in any sort of order. The routine simply adds items from the list(s) to the result array slots in list order.
- inputs
- job_lists_arr
array of the input job number arrays. represents the jobs that would be assigned to each employee in a list form. each list within the array will be the length of the respective eg.
- eg_arr
short_form array of eg codes (proposal eg ordering)
- fur_arr
short_form array of fur codes from proposal
- eg_total_jobs
list of length n (for n egs) containing the sum of total jobs available for each eg, form: [n, n, n]
- num_levels
number of job levels in model (excluding furlough level)
- skip_fur (boolean)
ignore or skip furloughs when assigning stovepipe jobs. If True, employees who are originally marked as furloughed are assigned the furlough level number which is 1 greater than the number of job levels. If False, jobs are assigned within each employee group in a stovepipe fashion, including those employees who are marked as furloughed
-
functions.
make_jcnts
(job_count_lists)[source]¶ Make two arrays: 1. array of n lists of job counts for n number of eg job count input lists 2. array of one summation list of first array (total count of all eg jobs) The arrays above will not contain a furlough count. Returns tuple (eg_job_counts, combined_job_count)
- inputs
- job_count_lists
list of the employee job count list(s). If the program is using the enhanced jobs option, this input will be the output of the convert_jcnts_to_enhanced function. Otherwise, it will be the eg_counts variable from the settings dictionary.
Example return:
(array([[ 237,  158,  587, 1373,  352,  739,  495,  330,  784, 1457,    0,  471,  785,    0,    0,    0],
        [  97,   64,  106,  575,   64,  310,  196,  130,  120,  603,   71,   72,  325,   38,   86,   46],
        [   0,    0,   33,  414,   20,  223,    0,    0,   46,  395,    0,   28,  213,    0,    0,    0]]),
 array([ 334,  222,  726, 2362,  436, 1272,  691,  460,  950, 2455,   71,  571, 1323,   38,   86,   46]))
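In numpy terms the construction reduces to stacking the group lists and summing across groups (a sketch of the documented return shape):

```python
import numpy as np

def make_jcnts(job_count_lists):
    # Sketch: stack the per-group job count lists, then sum element-wise
    # across groups to get the combined count of all eg jobs.
    eg_job_counts = np.array(job_count_lists)
    combined_job_count = np.add.reduce(eg_job_counts, axis=0)
    return eg_job_counts, combined_job_count

egs, combined = make_jcnts([[3, 2], [1, 4]])
combined   # array([4, 6])
```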
-
functions.
make_lists_from_columns
(df, columns, remove_zero_values=False, try_integers=False, as_tuples=False)[source]¶ combine columns row-wise into separate lists, return a list of lists
- example:
A    B    C    D
1    6    0    2
8    4    5    3

make_lists_from_columns(df, ["A", "B", "C"])

[[1, 6, 0], [8, 4, 5]]

make_lists_from_columns(df, ["A", "B", "C"], remove_zero_values=True, as_tuples=True)

[(1, 6), (8, 4, 5)]
- inputs
- df (dataframe)
pandas dataframe containing columns to combine
- columns (list)
list of column names
- try_integers (boolean)
if True and all column values are numerical, the output values will be converted to integers
- remove_zero_values (boolean)
if True, remove zero values from list or tuple outputs. The routine checks for zeros as a zero value or a list with a single zero value
- as_tuples (boolean)
if True, output will be a list of tuples instead of a list of lists
-
functions.
make_lower_slice_limits
(month_counts_cumsum)[source]¶ for use when working with individual month data within a larger (stacked) array.
The top of each month’s slice is the cumulative sum value; the bottom of each slice is the corresponding value of this function’s output array. The output is used as input for the nbnf (no bump, no flush) functions.
- inputs
- month_counts_cumsum (numpy array)
cumsum of count_per_month function output (employee count each month)
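The lower limits are simply the cumulative sums shifted by one month (a sketch):

```python
import numpy as np

def make_lower_slice_limits(month_counts_cumsum):
    # Sketch: each month's lower boundary is the prior month's cumulative
    # total; month 0 starts at index 0.
    return np.concatenate(([0], month_counts_cumsum[:-1]))

upper = np.array([5, 8, 10])        # cumsum of employee counts per month
make_lower_slice_limits(upper)      # array([0, 5, 8])
# month 1 data would then be long_array[5:8]
```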
-
functions.
make_original_jobs_from_counts
(jobs_arr_arr, eg_array, fur_array, num_levels)[source]¶ Short_Form
This function grabs jobs from standalone job count arrays (normally stovepiped) for each employee group and inserts those jobs into a proposed integrated list, or a standalone list. Each eg (employee group) is assigned jobs from their standalone list in order top to bottom.
Result is a combined list of jobs with each eg maintaining ordered, independent stovepipe jobs within the combined list of jobs. jobs_arr_arr is an array of arrays, likely output[0] from the make_array_of_job_lists function.
Order of job count arrays within jobs_arr_arr input must match emp group codes order (1, 2, 3, etc.).
If total group counts of job(s) is less than slots available to that group, remaining slots will be assigned (remain) a zero job number (0).
eg_array is list (order sequence) of employee group codes from proposed list with length equal to length of proposed list.
Result of this function is ultimately merged into long form for no bump no flush routine.
employees who are originally marked as furloughed are assigned the furlough level number which is 1 greater than the number of job levels.
- inputs
- jobs_arr_arr (numpy array of arrays)
lists of job counts for each job level within each employee group, each list in order starting with job level one.
- eg_array (numpy array)
employee group (eg) column data from master list source
- fur_array
furlough (fur) column data from master list source
- num_levels
number of job levels (without furlough level) in the model
-
functions.
make_preimp_array
(ds_stand, ds_integrated, imp_high, compute_cat, compute_pay)[source]¶ Create an ordered numpy array of pre-implementation data gathered from the pre-calculated standalone dataset and a dictionary to keep track of the information. This data will be joined with post_implementation integrated data and then copied into the appropriate columns of the final integrated dataset.
- inputs
- ds_stand (dataframe)
standalone dataset
- ds_integrated (dataframe)
dataset ordered for proposal
- imp_high
highest index (row number) from implementation month data (from long-form dataset)
- compute_cat (boolean)
if True, compute and append a job category order column
- compute_pay (boolean)
if True, compute and append a monthly pay column and a career pay column
-
functions.
make_starting_val_column
(df, attr, inplace=True)[source]¶ make an array of values derived from the input dataframe which will reflect the starting value (month zero) of a selected attribute. Each employee will be assigned the zero-month attribute value specific to that employee, duplicated in each month of the data model.
This column allows future attribute analysis with a constant starting point for all employees. For example, retirement job position may be compared to initial list percentage.
assign to long-form dataframe:
df['start_attr'] = make_starting_val_column(df, attr)
- input
- df (dataframe)
pandas dataframe containing the attr input column and a month number column. The dataframe must be indexed with employee number code integers (‘empkey’)
- attr (column name in df)
selected zero-month attribute (column) from which to assign values to the remaining data model months
-
functions.
make_stovepipe_jobs_from_jobs_arr
(jobs_arr, total_emp_count=0)[source]¶ Month_Form
Compute a stovepipe job list derived from the total count of jobs in each job level.
This function is for one eg (employee group) and one jobs_arr (list).
Creates an array of job numbers from a job count list (converted to np.array).
Result is an array with each job number repeated n times for an n job count.
job count list like: job_counts = [334, 222, 701, 2364]
jobs_array = np.array(job_counts)
- inputs
- jobs_arr (numpy array)
job counts starting with job level 1
- total_emp_count
if zero (normal input), sum of jobs_arr elements, otherwise user-defined size of result_jobs_arr
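The stovepipe construction can be sketched with np.repeat (the zero-padding for a user-defined total_emp_count is an assumption):

```python
import numpy as np

def make_stovepipe_jobs_from_jobs_arr(jobs_arr, total_emp_count=0):
    # Sketch: repeat each 1-based job number by its count; optionally pad
    # with zeros (or truncate) to a user-defined employee count.
    job_nums = np.arange(1, len(jobs_arr) + 1)
    stove = np.repeat(job_nums, jobs_arr)
    if total_emp_count:
        result = np.zeros(total_emp_count, dtype=int)
        end = min(len(stove), total_emp_count)
        result[:end] = stove[:end]
        return result
    return stove

make_stovepipe_jobs_from_jobs_arr(np.array([3, 2, 4]))
# array([1, 1, 1, 2, 2, 3, 3, 3, 3])
```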
-
functions.
make_stovepipe_prex_shortform
(job_list, sg_codes, sg_rights, fur_codes)[source]¶ Short_Form
Creates a ‘stovepipe’ job assignment within a single eg including a special job assignment condition for a subgroup. The subgroup is identified with a 1 in the sg_codes array input, originating with the sg column in the master list.
This function applies a pre-existing (prior to the merger) contractual job condition, which is likely the result of a previous seniority integration.
The subset group will have priority assignment for the first n jobs in the affected job category; the remaining jobs are assigned in seniority order.
The subgroup jobs are assigned in subgroup stovepipe order. This function is applicable to a condition with known job counts. The result of this function is used with standalone calculations or combined with other eg lists to form an integrated original job assignment list.
- inputs
- job_list
list of job counts for eg, like [23,34,0,54,…]
- sg_codes
ndarray eg group members entitled to job condition (marked with 1, others marked 0) length of this eg population
- sg_rights
list of lists (from settings dictionary) including job numbers and job counts for condition. Columns 2 and 3 are extracted for use.
- fur_codes
array of ones and zeros, one indicates furlough status
-
functions.
make_tuples_from_columns
(df, col_list, return_as_list=True, date_cols=[], return_dates_as_strings=False, date_format='%Y-%m-%d')[source]¶ Combine row values from selected columns to form tuples. Returns a list of tuples which may be assigned to a new column. The length of the list is equal to the length of the input dataframe. Date columns may be first converted to strings before adding to output tuples if desired.
- inputs
- df (dataframe)
input dataframe
- col_list (list)
columns from which to create tuples
- return_as_list (boolean)
if True, return a list of tuples
- date_cols (list)
list of columns to treat as dates
- return_dates_as_strings (boolean)
if True, for columns within the date_cols input, convert date values to string format
- date_format (string)
string format of converted date columns
-
functions.
mark_for_furlough
(orig_range, fur_range, month, jobs_avail, num_of_job_levels)[source]¶ Assign a fur code to employees, in inverse seniority order, when the count of jobs is less than the count of active employees, and assign the furloughed job level number. note: normally only called during a job change month, though it will do no harm if called in other months
- inputs
- orig_range
current month slice of jobs held
- fur_range
current month slice of fur data
- month
current month (loop) number
- jobs_avail
total number of jobs for each month array, job_gain_loss_table function output [1]
- num_of_job_levels
from settings dictionary, used to mark fur job level as num_of_job_levels + 1
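A simplified sketch of the furlough marking described. The in-place mutation of the month slices and the exact selection of the most junior actives are assumptions:

```python
import numpy as np

def mark_for_furlough(orig_range, fur_range, month,
                      jobs_avail, num_of_job_levels):
    # Sketch: when actives exceed jobs available this month, furlough the
    # excess from the bottom of the range (inverse seniority order).
    active = np.where(fur_range == 0)[0]
    excess = len(active) - jobs_avail[month]
    if excess > 0:
        cut = active[-excess:]                    # most junior actives
        fur_range[cut] = 1                        # mark as furloughed
        orig_range[cut] = num_of_job_levels + 1   # furlough job level

orig = np.array([1, 1, 2, 2, 3])
fur = np.zeros(5, dtype=int)
mark_for_furlough(orig, fur, 0, np.array([3]), 3)
# fur is now [0, 0, 0, 1, 1]; orig is [1, 1, 2, 4, 4]
```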
-
functions.
mark_for_recall
(orig_range, num_of_job_levels, fur_range, month, recall_sched, jobs_avail, standalone=True, eg_index=0, method='sen_order', stride=2)[source]¶ change fur code to non-fur code for returning employees according to selected method (seniority order, every nth furloughee, or random) note: function assumes it is only being called during a recall month
- inputs
- orig_range
original job range
- num_of_job_levels
number of job levels in model, normally from settings dictionary
- fur_range
current month slice of fur data
- month
current month (loop) number
- recall_sched
list(s) of recall schedule (recall amount/month, recall start month, recall end month)
- jobs_avail
total number of jobs for each month array, job_gain_loss_table function output [1]
- standalone (boolean)
This function may be used with both standalone and integrated dataset generation. Set this variable to True for use within standalone dataset calculation, False for integrated dataset calculation routine.
- eg_index (integer)
selects the proper recall schedule for standalone dataset generation, normally from a loop increment. The recall schedule is defined in the settings dictionary. set to zero for an integrated routine (integrated routine uses a global recall schedule)
- method
means of selecting employees to be recalled. The default is by seniority order, most senior recalled first. Other options are:
- stride:
i.e. every nth furloughed employee. (note: there could be multiple strides per month if multiple recall lists are designated).
- random:
use shuffled list of furloughees
- stride
set stride if stride option for recall selected. default is 2.
-
functions.
mark_fur_range
(assign_range, fur_range, job_levels)[source]¶ apply fur code to current month fur_range based on job assignment status
- inputs
- assign_range
current month assignment range (array of job numbers, 0 indicates no job)
- fur_range
current month fur status (1 means furloughed, 0 means not furloughed)
- job_levels
number of job levels in model (from settings dictionary)
-
functions.
max_of_nested_lists
(nested_list, return_min=False)[source]¶ Find the maximum value within a list of lists (or tuples or arrays). The function may optionally return the minimum value within nested containers.
- inputs
- nested_list (list, tuple, or array)
nested container input
- return_min (boolean)
if True, return minimum of nested_list input (vs. max)
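The nested search reduces to a one-level flatten (a sketch of the documented behavior):

```python
def max_of_nested_lists(nested_list, return_min=False):
    # Sketch: flatten one level of nesting, then take the max (or min).
    flat = [item for sub in nested_list for item in sub]
    return min(flat) if return_min else max(flat)

max_of_nested_lists([[1, 9], (4, 2), [7]])        # 9
max_of_nested_lists([[1, 9], (4, 2), [7]], True)  # 1
```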
-
functions.
monotonic
(sequence)[source]¶ test for strictly increasing array-like input. May be used to determine when the need for the no bump, no flush routine is no longer required. If the test is true, and there are no job changes, special rights, or furlough recalls, then a straight stovepipe job assignment routine may be implemented (fast).
- inputs
- sequence
array-like input (list or numpy array ok)
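The strictness test reduces to a one-line numpy comparison (a sketch of the documented behavior):

```python
import numpy as np

def monotonic(sequence):
    # Strictly increasing: every element greater than its predecessor.
    seq = np.asarray(sequence)
    return bool(np.all(seq[1:] > seq[:-1]))

monotonic([1, 2, 5, 9])   # True
monotonic([1, 3, 3, 4])   # False (equal neighbors fail the strict test)
```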
-
functions.
print_settings
()[source]¶ grab the scalar data from the settings dictionary, put it in a dataframe, and then print it for a quick summary of scalar settings dictionary inputs
-
functions.
remove_zero_groups
(ratio_dict)[source]¶ remove data related to a “dummy” group represented by a zero
example:
{2: [([2], [0], [1]), [0, 2, 6], 34, 120]}
becomes:
{2: [([2], [1]), [0, 6], 34, 120]}
- inputs
- ratio_dict (dictionary)
the ratio dictionary produced by the build_program_files script originating from the “ratio_cond” worksheet of the settings.xlsx input file
-
functions.
sample_dataframe
(df, n=None, frac=None, reset_index=False)[source]¶ Get a random sample of a dataframe by rows, with the number of rows in the returned sample defined by a count or fraction input.
- inputs
- df (dataframe)
pandas dataframe for sampling
- n (integer)
If not None, the count of the rows in the returned sample dataframe. The “n” input will override the “frac” input if both are not None. Will be clipped between zero and len(df) if the input exceeds these boundaries.
- frac (float)
If not None, the size of the returned sample dataframe relative to the input dataframe. Will be ignored if the “n” input is not None. Will be clipped between 0.0 and 1.0 if the input exceeds these boundaries. An input of .3 would randomly select 30% of the rows from the input dataframe.
- reset_index (boolean)
If True, reset the output dataframe index
If both the “n” and “frac” inputs are None, a random single row will be returned.
The rows in the output dataframe will be sorted according to original order.
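A sketch built on pandas’ own DataFrame.sample, with the clipping and ordering behavior described above (restoring original order via sort_index is an assumption):

```python
import pandas as pd

def sample_dataframe(df, n=None, frac=None, reset_index=False):
    # Sketch: "n" overrides "frac"; both are clipped to valid bounds;
    # a single random row is returned when both are None.
    if n is not None:
        result = df.sample(n=int(min(max(n, 0), len(df))))
    elif frac is not None:
        result = df.sample(frac=min(max(frac, 0.0), 1.0))
    else:
        result = df.sample(n=1)
    result = result.sort_index()     # restore original row order
    return result.reset_index(drop=True) if reset_index else result
```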
-
functions.
save_and_load_dill_folder
(save_as=None, load_case=None, print_saved=False)[source]¶ Save the current “dill” folder to the “saved_dill_folders” folder, or load a saved dill folder as the “dill” folder if it exists.
This function allows calculated case study pickle files (including the calculated datasets) to be saved to or loaded from a “saved_dill_folders” folder.
The “saved_dill_folders” folder is created if it does not already exist. The load_case input is a case study name. If the load_case input is set to None, the function will only save the current dill folder and do nothing else. If a load_case input is given, but is incorrect or no matching folder exists, the function will only save the current dill folder and do nothing else.
The user may print a list of available saved dill folders (for loading) by setting the print_saved input to True. No other action will take place when this option is set to True.
If an award has conditions which differ from proposed conditions, the settings dictionary must be modified and the dataset rebuilt.
This function allows previously calculated datasets to be quickly retrieved and eliminates continual adjustment of the settings spreadsheet if the user switches between case studies (assuming the award has been determined and no more input adjustment will be made).
- inputs
- save_as (string)
A user-specified folder prefix. If None, the current “dill” folder will be saved using the current case study name as a prefix. If set to a string value, the current dill folder will be saved with the “save_as” string value prefix.
Example with the save_as variable set to “test1”. The existing dill folder would be saved as:
saved_dill_folders/test1_dill_folder
- load_case (string)
The name of a case study. If None, the only action performed will be to save the current “dill” folder to the “saved_dill_folders” folder. If the load_case variable is a valid case study name and a saved dill folder for that case study exists, the saved dill folder will become the current dill folder (contents of the saved dill folder will be copied into the current dill folder). This action will occur after the contents of the current dill folder are copied into the “saved_dill_folders” folder.
- print_saved (boolean)
option to print the saved folder prefixes only. This provides a quick check of the folders available to be loaded. No other action will take place with this option set to True.
-
functions.
set_snapshot_weights
(job, ratio_dict, orig_rng, eg_range)[source]¶ Determine the job distribution ratios to carry forward during the ratio condition application period using actual jobs-held ratios. Likely called at the implementation month by the main job assignment function. Count the number of jobs held by each of the ratio groups for each of the affected job level numbers, and set the weightings in the distribute function accordingly.
- inputs
- ratio_dict (dictionary)
dictionary containing job levels as keys and ratio groups, weightings, month_start and month_end as values.
- orig_rng (numpy array)
month slice of original job array
- eg_range (numpy array)
month slice of employee group code array
-
functions.
squeeze_increment
(data, eg, low_num, high_num, increment)[source]¶ Move members of a selected eg (employee group) within a list according to an increment input (positive or negative) while retaining relative ordering within all eg groups.
- inputs
- data (dataframe)
dataframe with empkey as index which at minimum includes an order column and an eg column
- eg (integer)
employee group number code
- low_num and high_num (integers)
indexes marking the beginning and end of the list zone to be reordered
- increment (integer)
the amount to add to or subtract from the appropriate eg order numbers; the increment can be positive (move down the list) or negative (move up the list, toward zero)
Selected eg order numbers within the selected zone (as a numpy array) are incremented; then the entire group's order numbers are reset within the zone using scipy.stats.rankdata.
The array is then assigned to a dataframe with empkeys as index.
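The increment-and-rerank mechanism can be sketched as below; the function name and signature are assumptions, but the rankdata step mirrors the description above.

```python
import numpy as np
from scipy.stats import rankdata

def squeeze_increment_sketch(order, eg, eg_num, low, high, increment):
    # hypothetical sketch: shift the selected group's order numbers
    # inside the zone, then rerank the whole zone so order numbers
    # stay dense; "ordinal" ranking preserves relative order within
    # each eg group
    order = order.astype(float).copy()
    zone = slice(low, high)
    sel = eg[zone] == eg_num
    zone_vals = order[zone]
    zone_vals[sel] += increment          # positive: down list, negative: up
    order[zone] = rankdata(zone_vals, method="ordinal") + low
    return order
```

A fractional increment (e.g. 1.5) slots the moved members between the untouched order numbers before reranking, which is why the working array is cast to float.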
-
functions.
squeeze_logrithmic
(data, eg, low_value, high_value, log_factor=1.5, put_segment=1, direction='d')[source]¶ Perform a log squeeze (logarithmic-based movement of one eg (employee group)): determine the closest matching indexes within the rng to fit the squeeze, put the affected group in those indexes, then fill in the remaining slots with the other group(s), maintaining original ordering within each group at all times.
- inputs
- data (dataframe)
a dataframe indexed by empkey with at least 2 columns: employee group (eg) and order (order)
- eg (employee code integer)
the employee group to move
- low_value and high_value (integers)
integers marking the boundaries (rng) for the operation (high_value must be greater than low_value)
- log_factor (float)
determines the degree of “logarithmic packing”
- put_segment (float)
allows compression of the squeeze result (values under 1)
- direction (string)
squeeze direction: “u” - move up the list (more senior) “d” - move down the list (more junior)
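The index-selection step behind the log squeeze can be illustrated roughly as follows; this is a sketch of the general technique (logarithmically spaced target slots within a zone), not the module's actual algorithm, and the helper name is hypothetical.

```python
import numpy as np

def log_indexes_sketch(count, low, high, log_factor=1.5, direction="d"):
    # hypothetical sketch: spread `count` target slots across
    # [low, high) with logarithmically varying density, so the
    # squeezed group packs toward one end of the zone
    # (assumes count >= 2 and log_factor > 1)
    base = np.log(np.linspace(1.0, log_factor, count))
    frac = base / base.max()                 # 0..1, log-shaped spacing
    idx = np.unique(low + np.round(frac * (high - low - 1)).astype(int))
    if direction == "u":
        return idx
    return (low + high - 1) - idx[::-1]      # mirror: reverse the packing
```

A larger `log_factor` makes the spacing more uneven (stronger packing); the real routine then fills the remaining zone slots with the other group(s) in their original order.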
-
functions.
starting_age
(dob_input, start_date)[source]¶ Short_Form
Returns decimal age at given date.
“dob_input” (birth dates) may be in the form of a pandas dataframe, pandas series, list, or string
- inputs
- dob_input (dataframe, series, list, or string)
birth dates input
- start_date
comparative date for the birth dates, normally the data model starting date
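A decimal-age calculation of this kind can be sketched with pandas; the helper name and the 365.25-day year convention are assumptions for illustration.

```python
import pandas as pd

def starting_age_sketch(dob_input, start_date):
    # hypothetical sketch: decimal age at start_date for one or more
    # birth dates; pd.Series normalizes a string, list, or series input
    dobs = pd.to_datetime(pd.Series(dob_input))
    start = pd.to_datetime(start_date)
    return ((start - dobs).dt.days / 365.25).values
```

For example, a birth date of 2000-01-01 measured at 2020-01-01 yields 20.0.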
-
functions.
update_excel
(case, file, ws_dict={}, sheets_to_remove=None)[source]¶ Read an Excel file, optionally remove worksheet(s), add worksheets or overwrite worksheets with a dictionary of ws_name, dataframe key-value pairs, and write the Excel file back to disk.
- inputs
- case (string)
the data model case name
- file (string)
the Excel file name without the .xlsx extension
- ws_dict (dictionary)
dictionary of worksheet names as keys and pandas dataframes as values. The items in this dictionary will be passed into the Excel file as worksheets. The worksheet name keys may be the same as some or all of the worksheet names in the Excel file. In the case of matching names, the data from the input dict will overwrite the existing data (worksheet) in the Excel file. Non-overlapping worksheet names/dataframe values will be added as new worksheets.
- sheets_to_remove (list)
a list of worksheet names (strings) representing worksheets to remove from the Excel workbook. It is not necessary to remove sheets which are being replaced by worksheets with the same name.
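The remove/overwrite/append cycle described above can be sketched with openpyxl and pandas; the helper name and path handling are assumptions, while `load_workbook`, `Workbook.remove`, and `ExcelWriter(mode="a", if_sheet_exists="replace")` are real library APIs.

```python
import pandas as pd
from openpyxl import load_workbook

def update_excel_sketch(path, ws_dict, sheets_to_remove=None):
    # hypothetical sketch: drop unwanted worksheets first, then
    # overwrite or append the worksheets supplied in ws_dict
    wb = load_workbook(path)
    for name in (sheets_to_remove or []):
        if name in wb.sheetnames:
            wb.remove(wb[name])
    wb.save(path)
    # mode="a" appends to the existing workbook; matching sheet
    # names are replaced rather than duplicated
    with pd.ExcelWriter(path, engine="openpyxl", mode="a",
                        if_sheet_exists="replace") as writer:
        for name, df in ws_dict.items():
            df.to_excel(writer, sheet_name=name, index=False)
```

Removing sheets and rewriting in two passes keeps the logic simple; sheets being replaced by a same-named entry in `ws_dict` need not appear in `sheets_to_remove`, matching the note above.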