API documentation

Core Utilities

Matching Data

class pybalance.utils.MatchingHeaders(categoric, numeric)[source]

MatchingHeaders is a simple data structure that stores information about which features are to be used for matching, separating these features into categoric (e.g. country, gender) and numeric (e.g. age, weight) types.

Parameters:
  • categoric (List[str]) – List of features to be treated as categoric variables.

  • numeric (List[str]) – List of features to be treated as numeric variables.

class pybalance.utils.MatchingData(data, headers=None, population_col='population')[source]

Matching problems commonly require basic metadata about the data. For instance, the data may contain columns such as “patient_id”, “population” and “index_date”, which are not intended to be used for matching but which must “go along for the ride” and follow the main data everywhere. MatchingData is a wrapper around pandas.DataFrame that adds this required logic about the columns. Features required for matching are described by a “headers” field, while other columns are carried alongside. See MatchingHeaders.

Parameters:
  • data (pd.DataFrame) – Data frame containing both matching feature data for all populations as well as at least one additional column specifying to which population each row belongs. If a string is passed, it is assumed to be a path to the data frame.

  • headers (Optional[MatchingHeaders]) – A MatchingHeaders object with keys “numeric” and “categoric” whose values are the names of columns to be used for matching. If None is passed, headers will be inferred based on how many unique values each column has. Since guessing the headers can lead to errors, it is recommended to supply them explicitly.

  • population_col (str) – Name of the column used to split data into subpopulations.
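
For illustration, a minimal construction might look as follows (the column names and values here are hypothetical):

    import pandas as pd

    from pybalance.utils import MatchingData, MatchingHeaders

    # Hypothetical input data: two populations stacked in one data frame.
    df = pd.DataFrame({
        "age": [34, 51, 47, 62],
        "weight": [70.2, 81.5, 66.0, 90.1],
        "gender": ["f", "m", "f", "m"],
        "patient_id": [1, 2, 3, 4],
        "population": ["target", "pool", "pool", "pool"],
    })

    # Declare explicitly which columns are matching features.
    headers = MatchingHeaders(categoric=["gender"], numeric=["age", "weight"])
    matching_data = MatchingData(df, headers=headers, population_col="population")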

append(df, name=None)[source]

Append a population to an existing MatchingData instance. This operation is performed in place.

copy()[source]

Create a new MatchingData instance with the exact same data and metadata.

Return type:

MatchingData

property data: DataFrame

Pointer to underlying pandas DataFrame.

describe(normalize=True, aggregations=['mean', 'std'], quantiles=[0, 0.25, 0.5, 0.75, 1])[source]

Calls describe_categoric() and describe_numeric() and returns the results in a single dataframe.

Return type:

DataFrame

describe_categoric(normalize=True)[source]

Create a summary statistics table split by population for categoric variables.

Return type:

DataFrame

describe_numeric(aggregations=['mean', 'std'], quantiles=[0, 0.25, 0.5, 0.75, 1], long_format=True)[source]

Create a summary statistics table split by population for numeric variables.

Return type:

DataFrame
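
As a usage sketch (matching_data as constructed above):

    # Summary statistics split by population; describe() concatenates the
    # categoric and numeric summaries into a single table.
    summary = matching_data.describe(normalize=True)

    # Or separately:
    cat_summary = matching_data.describe_categoric(normalize=True)
    num_summary = matching_data.describe_numeric(aggregations=["mean", "std"])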

get_population(population)[source]

Get the matching data for a population by its name.

Return type:

DataFrame

head(n=5)[source]

Return first n rows from underlying pandas DataFrame.

Return type:

DataFrame

property populations: List[str]

List of all populations present in the MatchingData object.

sample(n=5)[source]

Sample underlying pandas DataFrame.

Return type:

DataFrame

tail(n=5)[source]

Return last n rows from underlying pandas DataFrame.

Return type:

DataFrame

to_csv(*args, **kwargs)[source]

Write underlying pandas DataFrame to csv. Call signature is identical to pandas method.

to_parquet(*args, **kwargs)[source]

Write underlying pandas DataFrame to parquet. Call signature is identical to pandas method.

pybalance.utils.infer_matching_headers(data, max_categories=10, ignore_cols=['patient_id', 'patientid', 'population', 'index_date'])[source]

This utility function guesses which columns of the input data are numeric and which are categoric, based on the number of unique values in each column (see max_categories). It returns a MatchingHeaders object listing the column names of each type. Columns listed in ignore_cols (by default, identifier-like columns such as patient_id and population) are excluded.

Return type:

MatchingHeaders

pybalance.utils.split_target_pool(matching_data, pool_name=None, target_name=None)[source]

Split matching_data into target and pool populations. If the names of the target and pool populations are not explicitly provided, the routine will attempt to infer them, assuming that the target population is the smaller of the two.

Return type:

DataFrame
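
A short sketch of both helpers, reusing the hypothetical df from above (split_target_pool is assumed here to return the target and pool as a pair):

    from pybalance.utils import MatchingData, infer_matching_headers, split_target_pool

    # Guess which columns are categoric vs. numeric; columns listed in
    # ignore_cols (e.g. patient_id, population) are excluded.
    headers = infer_matching_headers(df, max_categories=10)
    matching_data = MatchingData(df, headers=headers)

    # Split into target and pool; if names are not given, the smaller
    # population is taken to be the target.
    target, pool = split_target_pool(matching_data)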

Preprocessing

class pybalance.utils.BaseMatchingPreprocessor[source]

BaseMatchingPreprocessor is an abstract preprocessor class for organizing data transformations on matching data. It keeps track of all preprocessing steps so that data are always transformed in the same way.

The inherited class must implement:

_fit(), _transform(), _get_output_headers()

and should implement, if possible,

_get_feature_names_out()

The class extends the preprocessing classes defined in sklearn. In addition to handling transformations of the data, this class also handles the corresponding logic for MatchingHeaders. Preprocessors that conform to the BaseMatchingPreprocessor standard are chainable, allowing one to easily combine preprocessing tasks; see ChainPreprocessor.

abstract _fit(matching_data)[source]

This method performs all calculations needed in order to perform the transformation tasks of the preprocessor (e.g. computing means and standard deviations). Should accept a MatchingData instance as input and return None. This method must be overridden by the subclass.

_get_feature_names_out(feature_name_in)[source]

Same as get_feature_names_out but on a single feature level. This method should be overridden by the subclass.

Return type:

List[str]

abstract _get_output_headers()[source]

Return headers on the output matching data. This method must be overridden by the subclass.

abstract _transform(matching_data)[source]

Transform matching_data. Should accept a MatchingData instance as input and return a transformed MatchingData instance, including in particular transformed metadata. This method must be overridden by the subclass.

Return type:

MatchingData

fit(matching_data, refit=False)[source]

A simple wrapper around the workhorse method _fit() that marks the preprocessor as fitted after a call to fit() and prevents fitting twice.
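
For orientation, a minimal sketch of a custom preprocessor is shown below. It assumes attribute access of the form matching_data.data and matching_data.headers (with headers.numeric as documented for MatchingHeaders); details of a production implementation may differ.

    from pybalance.utils import BaseMatchingPreprocessor, MatchingData

    class NumericStandardizer(BaseMatchingPreprocessor):
        """Hypothetical preprocessor that z-scores all numeric features."""

        def _fit(self, matching_data):
            # Record the input headers and the statistics needed for the
            # transformation (means and standard deviations).
            self._headers = matching_data.headers
            numeric = self._headers.numeric
            self._means = matching_data.data[numeric].mean()
            self._stds = matching_data.data[numeric].std()

        def _transform(self, matching_data):
            # Transform a copy of the data; non-feature columns are untouched.
            df = matching_data.data.copy()
            numeric = self._headers.numeric
            df[numeric] = (df[numeric] - self._means) / self._stds
            return MatchingData(df, headers=self._get_output_headers())

        def _get_output_headers(self):
            # The feature set is unchanged by this transformation.
            return self._headers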

class pybalance.utils.CategoricOneHotEncoder(drop='first')[source]

CategoricOneHotEncoder converts categoric covariates into one-hot encoded variables. Numeric columns are unaffected.

Parameters:

drop (str | None) – Which, if any, columns to drop in the transformation. Choices are: {‘first’, ‘if_binary’, None}. See sklearn.preprocessing.OneHotEncoder for more details.

class pybalance.utils.NumericBinsEncoder(n_bins=5, strategy='uniform', encode='onehot-dense', cumulative=False)[source]

NumericBinsEncoder discretizes numeric covariates according to specified binning strategy. Categoric columns are unaffected.

Parameters:
  • n_bins (int) – Number of bins to split each numeric variable into. Note that in the case of cumulative = True, the last bin would always be one. To avoid this, n_bins + 1 bins are used internally and the last bin is dropped.

  • strategy (str) – Strategy to use for binning. Choices are: {‘uniform’, ‘quantile’, ‘kmeans’}. See sklearn.preprocessing.KBinsDiscretizer for more details.

  • encode (str) – Method to use for encoding numeric variable. Choices are: {‘onehot’, ‘onehot-dense’, ‘ordinal’}. See sklearn.preprocessing.KBinsDiscretizer for more details.

  • cumulative (bool) – Whether to transform numeric variables to discretized cumulative distribution. E.g. if x is in the 3rd bin and n_bins = 4, then x will map to [0, 0, 1, 0] if cumulative = False or [0, 0, 1, 1] otherwise.
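
A short usage sketch follows (matching_data as in the earlier examples; a public transform() method is assumed alongside fit(), following the sklearn-style pattern described for BaseMatchingPreprocessor):

    from pybalance.utils import NumericBinsEncoder

    # Discretize numeric features into quantile bins with a cumulative
    # one-hot encoding; categoric features pass through unchanged.
    encoder = NumericBinsEncoder(
        n_bins=4, strategy="quantile", encode="onehot-dense", cumulative=True
    )
    encoder.fit(matching_data)
    binned = encoder.transform(matching_data)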

class pybalance.utils.DecisionTreeEncoder(keep_original_features=False, **decision_tree_params)[source]

DecisionTreeEncoder transforms all covariates into binary covariates corresponding to their terminal (leaf) positions on a decision tree.

class pybalance.utils.ChainPreprocessor(preprocessors)[source]

ChainPreprocessor applies a sequence of preprocessors.

Parameters:

preprocessors (List[BaseMatchingPreprocessor]) – A list of preprocessors to be applied in sequence to the input data.
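
For example, one-hot encoding of categoric features and binning of numeric features can be combined (a sketch under the same fit()/transform() assumption as above):

    from pybalance.utils import (
        CategoricOneHotEncoder,
        ChainPreprocessor,
        NumericBinsEncoder,
    )

    # One-hot encode categoric features, then bin numeric features.
    preprocessor = ChainPreprocessor([
        CategoricOneHotEncoder(drop="first"),
        NumericBinsEncoder(n_bins=5, strategy="uniform"),
    ])
    preprocessor.fit(matching_data)
    transformed = preprocessor.transform(matching_data)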

Balance Calculators

class pybalance.utils.BaseBalanceCalculator(matching_data, preprocessor, feature_weights=None, order=1, standardize_difference=True, device=None)[source]

BaseBalanceCalculator is the low-level interface to calculating balance. BaseBalanceCalculator can be used with any preprocessor defined as a subclass of BaseMatchingPreprocessor. BaseBalanceCalculator implements matrix calculations in pytorch to allow for GPU acceleration.

BaseBalanceCalculator performs two main tasks:

  1. Computes a per-feature loss based on the output features of the given preprocessor, and

  2. Aggregates the per-feature loss into a single loss value.

Furthermore, the calculator can compute the loss for many populations at a time.

Parameters:
  • matching_data – Input matching data to be used for distance calculations. Must contain exactly two populations. The smaller population is used as the reference population; calls to distance() compute the distance to this reference population.

  • preprocessor – Preprocessor to use for the per-feature loss calculation. The per-feature loss is, up to some normalizations, the mean difference in the features at the output of the preprocessor.

  • feature_weights – How to weight features when aggregating the per-feature loss.

  • order – Exponent to use in combining the per-feature loss into an aggregate loss. The total loss is sum(feature_weight * feature_loss**order)**(1/order).

  • standardize_difference (bool) – Whether to use the absolute standardized mean difference for the per-feature loss (otherwise the absolute mean difference is used).

  • device – Name of the device to use for matrix computations. By default, a GPU will be used if one is found on the system.

per_feature_loss(pool_subsets, target_subsets=None)[source]

Compute mismatch (aka “distance”, aka “loss”) on a per-feature basis for a set of candidate populations.

Return type:

Tensor

class pybalance.utils.BetaBalance(matching_data, feature_weights=None, device=None, drop='first', standardize_difference=True)[source]

Convenience interface to BaseBalanceCalculator that computes the distance between populations as the mean standardized mean difference. Uses StandardMatchingPreprocessor as the preprocessor.

class pybalance.utils.BetaSquaredBalance(matching_data, feature_weights=None, device=None, drop='first', standardize_difference=True)[source]

Same as BetaBalance, except that per-feature balances are averaged in a mean square fashion.

class pybalance.utils.BetaMaxBalance(matching_data, feature_weights=None, device=None, drop='first', standardize_difference=True)[source]

Same as BetaBalance, except that the worst-matched feature determines the loss. This class is provided as a convenience, since this balance metric is often used as a criterion to determine whether matching is “sufficiently good”. However, be aware that using this balance metric as an optimization objective with the various matchers can lead to unwanted behavior: if improvements in the worst-matched feature are not possible, there is no signal from the balance function to improve any of the other features.

class pybalance.utils.GammaBalance(matching_data, feature_weights=None, device=None, n_bins=5, encode='onehot-dense', cumulative=True, drop='first', standardize_difference=True)[source]

Convenience interface to BaseBalanceCalculator that computes the balance between two populations as the mean area between their one-dimensional marginal distributions. See GammaPreprocessor for a description of the preprocessing options.

class pybalance.utils.GammaSquaredBalance(matching_data, feature_weights=None, device=None, n_bins=5, encode='onehot-dense', cumulative=True, drop='first', standardize_difference=True)[source]

Same as GammaBalance, except that per-feature balances are averaged in a mean square fashion.

class pybalance.utils.GammaXTreeBalance(matching_data, keep_original_features=False, device=None, standardize_difference=True, **decision_tree_params)[source]

pybalance.utils.BalanceCalculator(matching_data, objective='gamma', **kwargs)[source]

BalanceCalculator provides a convenience interface to balance calculators, allowing the user to initialize a balance calculator by name. The calculators are initialized with default parameters, but these can be overridden by passing the appropriate kwargs.

Parameters:
  • matching_data – MatchingData instance containing a reference to the data against which matching metrics will be computed.

  • objective – Name of objective function to be used for computing balance. Balance calculators must be implemented in utils.balance_calculators.py and registered in the BALANCE_CALCULATORS dictionary therein in order to be accessible from this interface.

  • kwargs – Any additional arguments required to configure the specific objective function (e.g. n_bins = 10 for “gamma”).
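
A usage sketch (the distance() method mentioned above for BaseBalanceCalculator is assumed to accept a candidate pool population, here simply the full pool):

    from pybalance.utils import BalanceCalculator, split_target_pool

    # Initialize a gamma balance calculator by name, overriding a default.
    balance_calculator = BalanceCalculator(matching_data, "gamma", n_bins=10)

    # Distance of the (unmatched) pool from the reference (target) population.
    target, pool = split_target_pool(matching_data)
    print(balance_calculator.distance(pool))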

class pybalance.utils.BatchedBalanceCaclulator(balance_calculator, max_batch_size_gb=8)[source]

Batch balance calculations to avoid large peak memory usage.

Matchers

Propensity Score Matcher

class pybalance.propensity.PropensityScoreMatcher(matching_data, objective='beta', caliper=None, max_iter=50, time_limit=300, method='greedy', verbose=True)[source]

Use a propensity score model to match two populations. The Matcher searches randomly over hyperparameters for the propensity score model and selects the match that performs best according to the given optimization objective.

Parameters:
  • matching_data (MatchingData) – Data containing pool and target populations to be matched.

  • objective (str | BaseBalanceCalculator) – Matching objective to optimize in hyperparameter search. Can be a string referring to any balance calculator known to utils.balance_calculators.BalanceCalculator or an instance of BaseBalanceCalculator.

  • caliper (float | None) – If defined, restricts matches to those patients with propensity scores within the caliper of each other. Note that caliper matching may lead to a loss of patients in the target population if no patient in the pool exists within the specified caliper. Should be in (0, 1].

  • max_iter (int) – Maximum number of hyperparameters to try before returning the best match.

  • time_limit (float) – Restrict hyperparameter search based on time. No new model will be trained after time_limit seconds have passed since matching began.

  • method (str) – Method to use for propensity score matching. Can be either ‘greedy’ or ‘linear_sum_assignment’. The former method is locally optimal but globally sub-optimal; the latter is globally optimal but far more compute intensive. For large problems, use greedy.

  • verbose (bool) – Flag to indicate whether to print diagnostic information during training.

match()[source]

Match populations passed during __init__(). Returns MatchingData instance containing the matched pool and target populations.

Return type:

MatchingData
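
A typical invocation, using only documented arguments (matching_data must contain a pool and a target population, e.g. the toy dataset from pybalance.sim below):

    from pybalance.propensity import PropensityScoreMatcher

    matcher = PropensityScoreMatcher(
        matching_data,
        objective="beta",
        max_iter=20,       # try at most 20 hyperparameter settings
        time_limit=120,    # stop searching after 120 seconds
        method="greedy",
    )
    matched = matcher.match()
    print(matched.describe())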

pybalance.propensity.plot_propensity_score_match_distributions(matcher)[source]

Plot histograms of the estimated propensity score for pool and target populations pre- and post-matching.

Parameters:

matcher (PropensityScoreMatcher) – Fitted PropensityScoreMatcher model.

pybalance.propensity.plot_propensity_score_match_pairs(matcher)[source]

Plot scatterplot of pool-target pairs formed by propensity score matching.

Parameters:

matcher (PropensityScoreMatcher) – Fitted PropensityScoreMatcher model.

Genetic Matcher

class pybalance.genetic.GeneticMatcher(matching_data, objective='beta', **params)[source]

Match two populations using a genetic algorithm.

Parameters:
  • matching_data (MatchingData) – MatchingData to be matched. Must contain exactly two populations. The larger population will be matched to the smaller.

  • objective (str | BaseBalanceCalculator) – Matching objective to optimize in hyperparameter search. Can be a string referring to any balance calculator known to utils.balance_calculators.BalanceCalculator or an instance of BaseBalanceCalculator.

  • params – Configuration params for the genetic matcher. See pybalance.genetic.get_global_defaults for a list of options.

match(seed=None)[source]

Match populations passed during __init__(). Returns MatchingData instance containing the matched pool and target populations.

pybalance.genetic.get_global_defaults(n_candidate_populations=5000)[source]

Get a set of reasonable default values for evolutionary configuration. We break parameters into two groups: evolutionary, i.e., those that govern how the candidate populations are mixed, and initialization, i.e., those that govern the initial set of candidate populations.

Parameters:

n_candidate_populations – Number of candidate populations to evolve.
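
A sketch combining get_global_defaults with GeneticMatcher:

    from pybalance.genetic import GeneticMatcher, get_global_defaults

    # Start from the recommended defaults and evolve 1000 candidate populations.
    params = get_global_defaults(n_candidate_populations=1000)
    matcher = GeneticMatcher(matching_data, objective="beta", **params)
    matched = matcher.match(seed=123)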

Constraint Satisfaction Matcher

class pybalance.lp.ConstraintSatisfactionMatcher(matching_data, objective='beta', match_size=None, pool_size=None, target_size=None, max_mismatch=None, time_limit=180, num_workers=4, ps_hinting=False, verbose=True)[source]

Population matching based on a constraint satisfaction formulation. This solver can only handle linear objective functions; see the “objective” parameter below.

The constraints and optimization target are specified to the solver via the options pool_size, target_size, and max_mismatch. The behavior of the solver depends on which of these options are specified, as given below:

(pool_size, target_size, max_mismatch) –> optimize balance subject to size and balance constraints

(pool_size, target_size) –> optimize balance subject to size constraints

(max_mismatch) –> optimize pool size subject to target_size = n_target and balance constraints

() –> optimize balance subject to size constraints with pool_size = target_size = n_target

Optimizing pool_size subject to a balance constraint is known as “cardinality matching”. See https://kosukeimai.github.io/MatchIt/reference/method_cardinality.html and references therein.

Parameters:
  • matching_data (MatchingData) – A MatchingData object describing the pool and target populations. See utils.matching_data.

  • objective (str | BaseBalanceCalculator) – Matching objective to optimize. Technically, you can pass any balance calculator, but this solver cannot handle non-linear objective functions. The solver uses the preprocessing from the balance calculator for setting up the problem; the balance calculator itself is used to report the balance of generated matches but not in actually finding solutions (since the CS solver needs a discretized objective function). The solver will optimize the absolute mean difference on the output features of the balance calculator’s preprocessing.

  • match_size (int | None) – Number of samples to include in the matched population. If match_size < size of target population, then the target is subsetted to be the same size, that is, pool_size = target_size = match_size. If match_size >= size of target population, then the full target is used and only the pool is subsetted, that is, pool_size = match_size and target_size = n_target. This option cannot be used in combination with pool_size or target_size. This option is deprecated and will be removed in a later release.

  • pool_size (int | None) – Number of samples to include from the pool in the matched population. Must be less than the size of the pool. If pool_size is not set, then max_mismatch and target_size must be set and pool_size will be optimized subject to the target_size and max_mismatch constraints.

  • target_size (int | None) – Number of samples to include from the target in the matched population. Must be less than or equal to the size of the target. If target_size is not set, then max_mismatch and pool_size must be set and target_size will be optimized subject to the pool_size and max_mismatch constraints.

  • max_mismatch (float | None) – Maximum allowable absolute mean difference for any feature.

  • time_limit (float) – Time limit to stop solving in seconds (def: 180 sec).

  • num_workers (int) – Number of workers to use to optimize objective. See https://github.com/google/or-tools/blob/stable/ortools/sat/sat_parameters.proto#L556 for more detail.

  • ps_hinting (bool) – Compute a propensity score match and use the result as a hint for the solver.

  • verbose (bool) – Verbose solving.

match(hint=None)[source]

Match populations passed during __init__(). Returns MatchingData instance containing the matched pool and target populations.

Parameters:

hint (List[int] | None) –

You can supply a “hint” in one of three ways: (1) a list of indices into the pool, in which case it is assumed that the entire target is used; (2) a list of two lists, the first being indices into the target and the second indices into the pool; or (3) by omitting the hint altogether and passing ps_hinting=True in __init__(). In case (3), a propensity score model is estimated on the fly and used to create a matched population as a hint to the solver.

The interface here is admittedly a bit confusing; it will be cleaned up in a later release.

Return type:

MatchingData
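
For example, matching with explicit size and balance constraints might be configured as follows (the sizes here are hypothetical and must not exceed the sizes of the respective populations):

    from pybalance.lp import ConstraintSatisfactionMatcher

    matcher = ConstraintSatisfactionMatcher(
        matching_data,
        objective="beta",
        pool_size=1000,      # samples to keep from the pool
        target_size=1000,    # samples to keep from the target
        max_mismatch=0.05,   # per-feature absolute mean difference bound
        time_limit=300,
        num_workers=4,
    )
    matched = matcher.match()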

Visualization

pybalance.visualization.plot_numeric_features(matching_data, col_wrap=2, height=6, **plot_params)[source]

Plot the one-dimensional marginal distributions for all numerical features and all treatment groups found in matching_data. Extra keyword arguments are passed to seaborn.histplot and override defaults.

Return type:

Figure

pybalance.visualization.plot_categoric_features(matching_data, col_wrap=2, height=6, include_binary=True, **plot_params)[source]

Plot the one-dimensional marginal distributions for all categoric features and all treatment groups found in matching_data. Extra keyword arguments are passed to seaborn.histplot and override defaults.

Return type:

Figure

pybalance.visualization.plot_binary_features(matching_data, max_features=25, include_only=None, orient_horizontal=False, standardize_difference=False, reference_population=None, **plot_params)[source]

Plot all binary features for all treatment groups found in matching_data. Additional keyword arguments are passed to sns.barplot and override defaults.

Parameters:
  • matching_data (MatchingData) – MatchingData instance containing at least a pool and target population.

  • max_features (int) – Max number of features to show in plot, in case there are a lot of binary features. Features are sorted in descending order by the initial mismatch between pool and target. The top max_features will be shown.

  • include_only (List[str] | None) – List of features to consider for plotting. Otherwise, all binary features are plotted.

  • orient_horizontal (bool) – If True, orient features along the x-axis. Otherwise, features will be along the y-axis.

  • standardize_difference (bool) – Whether to use the absolute standardized mean difference for the differences plot (otherwise plots absolute mean difference).

  • reference_population (str | None) – Name of population in matching_data against which other populations should be compared. If not supplied, will use the smaller population as the reference population.

  • plot_params – Parameters passed on to seaborn routines.

Return type:

Figure

pybalance.visualization.plot_joint_numeric_distributions(matching_data, joint_kind='kde', include_only=None, **plot_params)[source]

Plot 2D distributions of pairs of numeric features from matching_data. joint_kind can be either kde or scatter; scatter is usually a bad choice for large datasets. Choose subsets of features using include_only. Additional keyword arguments are passed to sns.JointGrid and override defaults.

pybalance.visualization.plot_joint_numeric_categoric_distributions(matching_data, include_only_numeric=None, include_only_categoric=None, **plot_params)[source]

Plot 2D distributions of pairs of numeric and categoric features from matching_data. Choose subsets of features using include_only_numeric and include_only_categoric. Additional keyword arguments are passed to sns.JointGrid and override defaults.

pybalance.visualization.plot_per_feature_loss(matching_data, balance_calculator, reference_population=None, debin=True, normalize=False, **plot_params)[source]

Plot the mismatch as a function of feature.

Parameters:
  • matching_data (MatchingData) – Input data to plot.

  • balance_calculator (BaseBalanceCalculator) – Balance metric to use for calculating the per feature loss. Balance calculator must implement a ‘per_feature_loss’ method.

  • reference_population (str | None) – Name of population in matching_data against which other populations should be compared. If not supplied, will use the smaller population as the reference population.

  • debin (bool) – If True, attempt to map effective features back into the real feature space. This is not always possible, e.g., features like age*height can’t be mapped back to a single feature, but features like country_US, country_Germany can. In the former case, the routine will plot the loss per effective feature; in the latter, the loss per input feature.

  • normalize (bool) – If True, divide loss by number of features such that the sum is the total loss. Otherwise, the plotted loss contributions must be averaged to obtain the total loss.

  • plot_params – Parameters passed on to seaborn routines.

Return type:

Figure
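
A sketch tying the plotting routines to a balance calculator (matching_data and BalanceCalculator as in the earlier examples):

    from pybalance.utils import BalanceCalculator
    from pybalance.visualization import (
        plot_numeric_features,
        plot_per_feature_loss,
    )

    # Marginal distributions of the numeric features, split by population.
    fig = plot_numeric_features(matching_data, col_wrap=2, height=4)

    # Per-feature mismatch under a beta (standardized mean difference) objective.
    balance_calculator = BalanceCalculator(matching_data, "beta")
    fig = plot_per_feature_loss(matching_data, balance_calculator, debin=True)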

Simulation

pybalance.sim.generate_toy_dataset(n_pool=10000, n_target=1000, seed=45)[source]

Generate a toy matching dataset with n_pool patients in the pool and n_target patients in the target population. For finer control, see generate_random_feature_data_rwd and generate_random_feature_data_rct.

pybalance.sim.load_paper_dataset()[source]

Load the simulated matching dataset presented in the pybalance paper (https://onlinelibrary.wiley.com/doi/10.1002/pst.2352).
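
An end-to-end sketch using the toy dataset (generate_toy_dataset is assumed here to return a ready-to-use MatchingData instance):

    from pybalance.propensity import PropensityScoreMatcher
    from pybalance.sim import generate_toy_dataset

    matching_data = generate_toy_dataset(n_pool=10000, n_target=1000, seed=45)
    print(matching_data.populations)

    matcher = PropensityScoreMatcher(matching_data, objective="beta")
    matched = matcher.match()
    print(matched.describe())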