IPF

The IPF module provides classes and functions to run iterative proportional fitting (IPF) procedures. IPF iteratively rescales (factors) the cells of an n-dimensional array so that its marginal sums align with provided marginal targets. When the marginal values closely match the marginal targets (within a convergence tolerance), the resulting ndarray object is returned. If the IPF process reaches the maximum number of iterations, or successive iterations fail to meaningfully close the convergence gap, the process exits and returns the ndarray as of the latest iteration. Convergence is determined based on the root-mean-squared error between the marginals and the targets.
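To make the procedure concrete, here is a minimal two-dimensional sketch in plain NumPy. It is not the emma implementation: the function name `ipf_2d` and its return values are assumptions made for illustration, and it assumes a strictly positive seed so that no marginal sum is zero.

```python
import numpy as np

def ipf_2d(seed, row_targets, col_targets, converges_at=1e-5, max_iters=500):
    """Hypothetical helper: scale a 2-D seed so its row and column sums approach the targets."""
    mat = np.array(seed, dtype=float)  # work on a copy of the seed
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for iteration in range(1, max_iters + 1):
        # Scale each row so the row sums match the row targets.
        mat *= (row_targets / mat.sum(axis=1))[:, None]
        # Scale each column so the column sums match the column targets.
        mat *= (col_targets / mat.sum(axis=0))[None, :]
        # Root-mean-squared error between current marginals and targets.
        errors = np.concatenate([mat.sum(axis=1) - row_targets,
                                 mat.sum(axis=0) - col_targets])
        rmse = float(np.sqrt(np.mean(errors ** 2)))
        if rmse <= converges_at:
            break
    return mat, iteration, rmse

balanced, iterations, rmse = ipf_2d(np.ones((2, 3)), [40, 60], [20, 30, 50])
print(balanced.sum(axis=1), balanced.sum(axis=0), iterations)
```

With a uniform seed and targets that share the same total, the marginals line up after a single pass; uneven seeds generally need several iterations.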

Classes and Functions

  • IPF: a function to set up and apply the IPF procedure for an n-dimensional labeled array.

  • IPF_problem_np: a class to store marginal target information and apply the IPF process to balance values in a labeled array.

  • IPF_problem_series: a child class of IPF_problem_np that facilitates application of the IPF process in series (iterating over rows in a pandas DataFrame).

  • buildSeedMatrix: a legacy function to construct a naive seed matrix as a numpy array. The seed then needs to be converted to a labeled array for use in IPF functions and classes.

IPF_problem_np

class emma.ipf.IPF_problem_np(m_targets)

A class defining key parameters for an iterative proportional fitting problem starting with a labeled array (n-dimensional seed matrix) and specifying marginal targets to guide factoring.

Parameters

m_targets ([array_like,..] or ndarray) – Marginal target values along dimensional axes that guide the IPF’s factoring of matrix values in the attempt to achieve marginal values in line with the targets.

solve(lb_array, converges_at=1e-05, max_iters=500, tolerance=1e-08, report_convergence=False)

Solve the IPF problem.

Parameters
  • lb_array (LabeledArray) –

    A Labeled Array object on which to perform IPF. This is the “seed” matrix that will be balanced to achieve marginals that match (or approximate) m_targets.

    NOTE: The data in the array will be changed by the IPF process. Use the copy method of the LbArray class before running to preserve the input array.

  • converges_at (Float) – Specifies a convergence value. If the percentage error between lb_array marginals and m_targets is less than or equal to this value, the IPF process exits, returning an adequately fitted matrix. Default is 1e-5.

  • max_iters (Int) – Maximum number of iterations allowed. The IPF process exits after this number of iterations even if convergence has not been achieved. Default is 500.

  • tolerance (Float) – The IPF process exits if the difference between the convergence values of two consecutive iterations is below this value. If there is minimal difference between two iterations, the process is unlikely to achieve substantially stronger convergence through additional iterations. Default is 1e-8.

  • report_convergence (Boolean) – If False (default), only the rebalanced matrix is returned. If True, the details of the IPF are returned as a tuple.

Returns

  • bal_mat (numpy ndarray) – A balanced matrix where marginals match (or approximate) the m_targets attribute.

  • number_of_iterations (Int) – If report_convergence is True, the second value returned is the number of iterations completed by the IPF process.

  • convergence (Boolean) – If report_convergence is True, the third value returned is a boolean flag indicating whether convergence was achieved.

  • narrowing (Boolean) – If report_convergence is True, the fourth value returned is a boolean flag indicating whether convergence was narrowing in the final iteration. If False, the IPF has exited due to minimal improvement in convergence between two consecutive iterations.
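The interplay between converges_at, max_iters, tolerance, and the reported flags can be sketched as follows. This is an illustrative stand-in written in plain NumPy, not the emma solver: `solve_sketch` and `marginal_rmse` are hypothetical names, the seed is an ordinary ndarray rather than a labeled array, and the convergence measure is assumed to be the RMSE between marginals and targets.

```python
import numpy as np

def marginal_rmse(mat, targets):
    """RMSE between the marginals of `mat` and the targets (one target per axis)."""
    diffs = [mat.sum(axis=tuple(a for a in range(mat.ndim) if a != ax)) - t
             for ax, t in enumerate(targets)]
    return float(np.sqrt(np.mean(np.concatenate(diffs) ** 2)))

def solve_sketch(seed, m_targets, converges_at=1e-5, max_iters=500,
                 tolerance=1e-8, report_convergence=False):
    mat = np.array(seed, dtype=float)  # copy, so the caller's seed is preserved
    targets = [np.asarray(t, dtype=float) for t in m_targets]
    previous_gap = np.inf
    converged, narrowing, iterations = False, True, 0
    for iterations in range(1, max_iters + 1):
        for axis, target in enumerate(targets):
            others = tuple(a for a in range(mat.ndim) if a != axis)
            sums = mat.sum(axis=others)
            # Leave empty slices at zero instead of dividing by zero.
            factors = np.divide(target, sums, out=np.zeros_like(target), where=sums > 0)
            shape = [1] * mat.ndim
            shape[axis] = -1
            mat *= factors.reshape(shape)
        gap = marginal_rmse(mat, targets)
        if gap <= converges_at:
            converged = True
            break
        if abs(previous_gap - gap) < tolerance:
            narrowing = False  # no meaningful improvement since the last iteration
            break
        previous_gap = gap
    if report_convergence:
        return mat, iterations, converged, narrowing
    return mat

# Feasible problem: row and column targets share the same total.
fit, n_iters, ok, still_narrowing = solve_sketch(
    [[1, 2], [3, 4]], [[40, 60], [35, 65]], report_convergence=True)

# Inconsistent targets (totals of 100 and 120) can never be matched together.
_, bad_iters, bad_ok, bad_narrowing = solve_sketch(
    np.ones((2, 2)), [[40, 60], [60, 60]], report_convergence=True)
print(ok, still_narrowing, bad_iters, bad_ok, bad_narrowing)
```

In the second call the convergence gap stops changing from one iteration to the next, so the sketch exits early with convergence False and narrowing False instead of running all 500 iterations.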

IPF_problem_series

class emma.ipf.IPF_problem_series(targets_df, index_cols, dim_cols, use_index=False)

A class defining key parameters for an iterative proportional fitting problem starting with an n-dimensional seed matrix (lb_array) and specifying columns in a pandas data frame that provide marginal targets to guide factoring.

This is a child class of IPF_problem_np and simplifies the specification of marginal targets by providing them in a pandas data frame. This facilitates repetitive applications of the IPF process across many features stored as rows in the “targets” data frame.

Parameters
  • targets_df (pandas data frame) – Data frame from which to look up marginal targets to guide the IPF process.

  • index_cols (String or [String,..]) – Specify column(s) in the data frame to uniquely identify each row. IPF results will be stored in a data frame containing these same index columns. If use_index is True, you can specify here the index labels to include or set index_cols=None to include all labels.

  • dim_cols ([[String,..]]) – A list of column name lists. Each inner list item specifies a column in targets_df corresponding to a marginal target guiding the IPF process; each list of columns defines a group corresponding to the dimensions of the seed matrix.

  • use_index (Boolean, default=False) – If True, the data frame’s index is used to determine index columns identifying each row.

See also

IPF_problem_np

solve(lb_array, converges_at=1e-05, max_iters=500, tolerance=1e-08, report_convergence=False)

Solve the IPF problem iteratively for all rows in targets_df.

Parameters
  • lb_array (LabeledArray) – A Labeled Array object on which to perform IPF. This is the “seed” matrix that will be balanced to achieve marginals that match (or approximate) targets_df.

  • converges_at (Float) – Specifies a convergence value. If the percentage error between matrix marginals and m_targets is less than or equal to this value, the IPF process exits, returning an adequately fitted matrix. Default is 1e-5.

  • max_iters (Int) – Maximum number of iterations allowed. The IPF process exits after this number of iterations even if convergence has not been achieved. Default is 500.

  • tolerance (Float) – The IPF process exits if the difference between the convergence values of two consecutive iterations is below this value. If there is minimal difference between two iterations, the process is unlikely to achieve substantially stronger convergence through additional iterations. Default is 1e-8.

  • report_convergence (Boolean) – If False (default), only the rebalanced matrix is returned. If True, the details of the IPF are returned as a tuple.
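The row-by-row pattern described for IPF_problem_series can be sketched with pandas and NumPy as follows. This does not call emma: the `balance` helper, the column names, and the dictionary used to collect results are assumptions for illustration, and the seed is a plain ndarray with strictly positive cells.

```python
import numpy as np
import pandas as pd

def balance(seed, targets, converges_at=1e-5, max_iters=500):
    """Hypothetical helper: fit a copy of `seed` to one target vector per axis."""
    mat = np.array(seed, dtype=float)
    for _ in range(max_iters):
        for axis, target in enumerate(targets):
            others = tuple(a for a in range(mat.ndim) if a != axis)
            shape = [1] * mat.ndim
            shape[axis] = -1
            mat *= (target / mat.sum(axis=others)).reshape(shape)
        gaps = [mat.sum(axis=tuple(a for a in range(mat.ndim) if a != ax)) - t
                for ax, t in enumerate(targets)]
        if np.sqrt(np.mean(np.concatenate(gaps) ** 2)) <= converges_at:
            break
    return mat

# Each row is one feature (here, a zone) with its own marginal targets.
targets_df = pd.DataFrame({
    "zone": ["A", "B"],
    "hh1": [40.0, 10.0], "hh2": [60.0, 30.0],     # targets for dimension 0
    "wrk0": [25.0, 15.0], "wrk1": [75.0, 25.0],   # targets for dimension 1
})
index_cols = ["zone"]
dim_cols = [["hh1", "hh2"], ["wrk0", "wrk1"]]     # one column group per seed dimension
seed = np.ones((2, 2))

results = {}
for _, row in targets_df.iterrows():
    # Pull one target vector per dimension from the row's column groups.
    row_targets = [row[cols].to_numpy(dtype=float) for cols in dim_cols]
    results[tuple(row[index_cols])] = balance(seed, row_targets)

print(results[("A",)])
```

Because `balance` copies the seed, the same seed matrix is reused for every row; each group in dim_cols must have as many columns as the corresponding seed dimension has labels.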

Functions

emma.ipf.IPF(lb_array, m_targets, key_dims=None, converges_at=1e-05, max_iters=500, tolerance=1e-08, report_convergence=False, shadows=[], logger=None, log_axes=None)

Solve the IPF problem.

Parameters
  • lb_array (LabeledArray) –

    A Labeled Array object on which to perform IPF. This is the “seed” matrix that will be balanced to achieve marginals that match (or approximate) m_targets.

    NOTE: The data in the array will be changed by the IPF process. Use the copy method of the LbArray class before running to preserve the input array.

  • m_targets ([array_like,..] or ndarray) – Marginal target values along dimensional axes that guide the IPF’s factoring of matrix values in the attempt to achieve marginal values in line with the targets.

  • key_dims ([LbAxis, String, Integer,..], default=None) – If given, the IPF will focus only on the specified dimensions of the labeled array. Other dimensions will not factor into the balancing procedure. The number of key dims must match the number of m_targets. If None, all dimensions of the labeled array are factored in balancing.

  • converges_at (Float) – Specifies a convergence value. If the percentage error between lb_array marginals and m_targets is less than or equal to this value, the IPF process exits, returning an adequately fitted matrix. Default is 1e-5.

  • max_iters (Int) – Maximum number of iterations allowed. The IPF process exits after this number of iterations even if convergence has not been achieved. Default is 500.

  • tolerance (Float) – The IPF process exits if the difference between the convergence values of two consecutive iterations is below this value. If there is minimal difference between two iterations, the process is unlikely to achieve substantially stronger convergence through additional iterations. Default is 1e-8.

  • report_convergence (Boolean) – If False (default), only the rebalanced matrix is returned. If True, the details of the IPF are returned as a tuple.

  • shadows ([Shadow,..], default=[]) – Shadow arrays are modified in the IPF process in the same way lb_array is modified. The factors used to adjust lb_array are inherited by the shadow array based on each shadow’s ref_level (and leader_level if needed).

  • logger (logger) – If given, the logger will record info about the IPF process.

  • log_axes ([String,..]) – A list of axes to sum by when logging (if logger is not None).

Returns

  • bal_mat (LbArray) – A balanced matrix where marginals match (or approximate) the m_targets attribute.

  • number_of_iterations (Int) – If report_convergence is True, the second value returned is the number of iterations completed by the IPF process.

  • convergence (Boolean) – If report_convergence is True, the third value returned is a boolean flag indicating whether convergence was achieved.

  • narrowing (Boolean) – If report_convergence is True, the fourth value returned is a boolean flag indicating whether convergence was narrowing in the final iteration. If False, the IPF has exited due to minimal improvement in convergence between two consecutive iterations.

  • (shadows ([LbArray,…])) – If shadows are provided, they are not returned, but they have been modified in place by the IPF process.
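The ideas behind key_dims and shadows can be sketched in plain NumPy as follows. This is a simplified illustration, not emma's behavior: `ipf_key_dims` is a hypothetical name, axes are given as integer positions, and each shadow is assumed to have the same shape as the seed (the ref_level and leader_level mechanics are not modeled).

```python
import numpy as np

def ipf_key_dims(seed, m_targets, key_dims, shadows=(), converges_at=1e-5, max_iters=500):
    """Hypothetical helper: balance `seed` IN PLACE on `key_dims` only,
    applying every scaling factor to each shadow array as well."""
    targets = [np.asarray(t, dtype=float) for t in m_targets]
    for _ in range(max_iters):
        for axis, target in zip(key_dims, targets):
            others = tuple(a for a in range(seed.ndim) if a != axis)
            shape = [1] * seed.ndim
            shape[axis] = -1
            factors = (target / seed.sum(axis=others)).reshape(shape)
            seed *= factors            # the seed is modified in place
            for shadow in shadows:
                shadow *= factors      # shadows inherit the same factors
        gaps = [seed.sum(axis=tuple(a for a in range(seed.ndim) if a != ax)) - t
                for ax, t in zip(key_dims, targets)]
        if np.sqrt(np.mean(np.concatenate(gaps) ** 2)) <= converges_at:
            break
    return seed

seed = np.ones((2, 2, 3))                # three dimensions, only two are balanced
shadow = np.full((2, 2, 3), 0.5)
original = seed.copy()                   # keep a copy, since the seed will change
ipf_key_dims(seed, [[40, 60], [30, 70]], key_dims=[0, 1], shadows=[shadow])

print(seed.sum(axis=(1, 2)), seed.sum(axis=(0, 2)), seed.sum(axis=(0, 1)))
```

The third axis has no target, so its marginal is simply whatever falls out of scaling the other two; the shadow ends up scaled by exactly the same factors as the seed, without being returned.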

emma.ipf.buildSeedMatrix(dimensions, axis_labels, exclusions={})

Create a simple seed matrix for IPF processing based on the given dimensions. The default simple seed matrix assumes a value of 1.0 for each cell. Exclusions may be specified to set seed values to zero for particular axis intersections.

Parameters
  • dimensions ([String,..]) – A list of dimension names. The number of dimensions in the seed matrix will equal the length of this list.

  • axis_labels ({String: [String,..]}) – A dictionary with keys matching values in the dimensions list and values each containing a list of labels identifying axis items in that dimension.

  • exclusions ({(String, String): {String: [String,..]}}) –

    A nested dictionary defining axis intersections where seed values should be set to zero. Keys in the exclusions dictionary are (dimension, axis label) tuples. Values are dictionaries with dimension names as keys and lists of axis labels as values.

    For example: {("HH size", "HH1"): {"Workers": ["Wrk2", "Wrk3p"]}} would indicate that, in the household size dimension, households of size 1 are mutually exclusive with households having 2 or more workers.
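The effect of the exclusions dictionary can be sketched in plain NumPy as follows. This is not the emma implementation: `build_seed_sketch` is a hypothetical stand-in, and the labels HH2, HH3p, Wrk0, and Wrk1 are invented to complete the example above.

```python
import numpy as np

def build_seed_sketch(dimensions, axis_labels, exclusions=None):
    """Hypothetical stand-in: a seed of ones with excluded intersections set to zero."""
    exclusions = exclusions or {}
    shape = [len(axis_labels[dim]) for dim in dimensions]
    seed = np.ones(shape, dtype=float)
    for (dim, label), excluded in exclusions.items():
        # Start from "everything", then pin the keyed dimension to its label.
        base = [slice(None)] * len(dimensions)
        base[dimensions.index(dim)] = axis_labels[dim].index(label)
        for other_dim, other_labels in excluded.items():
            other_axis = dimensions.index(other_dim)
            for other_label in other_labels:
                cell = list(base)
                cell[other_axis] = axis_labels[other_dim].index(other_label)
                seed[tuple(cell)] = 0.0   # zero out the excluded intersection
    return seed

dimensions = ["HH size", "Workers"]
axis_labels = {
    "HH size": ["HH1", "HH2", "HH3p"],
    "Workers": ["Wrk0", "Wrk1", "Wrk2", "Wrk3p"],
}
exclusions = {("HH size", "HH1"): {"Workers": ["Wrk2", "Wrk3p"]}}

seed = build_seed_sketch(dimensions, axis_labels, exclusions)
print(seed)
```

Cells that start at zero stay at zero through IPF, since the procedure only multiplies cells by scaling factors; this is how exclusions carry through to the balanced result.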