Cardinality matching

Cardinality matching is the process of finding the size of the largest subset \(\hat{P}\) of a pool of patient \(P\) within some “distance” of a given target population \(T\):

\begin{equation} \begin{aligned} & \underset{\hat{P}}{\text{maximize}} & & |\hat{P}| \\ & \text{subject to} & & |\mu_{\hat{P}k} - \mu_{Tk}| \leq \delta \textrm{ for all }k \end{aligned} \end{equation}

where \(k\) indexes the covariates of \(P\) and \(T\), and \(\mu\) denotes the means of the corresponding covariates. In cardinality matching, at least as implemented here, we search only for the size of the largest subset. Then in a second step, we optimize the balance (distance) among all subsets of the determined size.

[1]:
import logging
logging.basicConfig(level='INFO')
[2]:
from pybalance.utils.balance_calculators import *
from pybalance.utils import MatchingData
from pybalance.sim import generate_toy_dataset, load_paper_dataset
from pybalance.lp import ConstraintSatisfactionMatcher
from pybalance.visualization import *
[3]:
m = generate_toy_dataset(n_pool=15000, n_target=1000)
m
[3]:
Headers Numeric:
['age', 'height', 'weight']

Headers Categoric:
['gender', 'haircolor', 'country', 'binary_0', 'binary_1', 'binary_2', 'binary_3']

Populations
['pool', 'target']
age height weight gender haircolor country population binary_0 binary_1 binary_2 binary_3 patient_id
0 62.731988 130.816972 76.100401 0.0 0 4 pool 0 0 0 1 0
1 26.403338 130.784188 80.134423 1.0 1 2 pool 0 0 0 1 1
2 58.155044 175.704961 90.806745 0.0 1 4 pool 0 0 0 0 2
3 68.334248 167.485984 90.081777 0.0 0 4 pool 0 0 0 1 3
4 54.114518 130.782073 53.612174 1.0 1 1 pool 0 1 0 1 4
... ... ... ... ... ... ... ... ... ... ... ... ...
995 21.474205 168.602546 70.342128 0.0 2 5 target 0 0 0 1 15995
996 40.643320 188.188724 61.611744 0.0 2 4 target 1 0 0 1 15996
997 29.472765 161.408162 57.214095 0.0 0 1 target 0 1 1 1 15997
998 41.291949 150.968833 91.270798 0.0 0 3 target 0 0 0 0 15998
999 67.530294 155.124741 56.196505 1.0 0 1 target 1 0 0 0 15999

16000 rows × 12 columns

Optimize pool size with balance constraint

[4]:
# Note that by default, gamma uses the standardized mean difference
# to calculate distance. The ConstraintSatisfactionMatcher, however,
# can only optimize linear objectives. It is not an error to pass
# gamma with standardized_difference=True, as this is only affects
# reporting, but to make the reporting consistent with what we're
# optimizing, we explicitly pass standardized_difference=False here.

gamma = GammaBalance(m, standardize_difference=False)
matcher = ConstraintSatisfactionMatcher(
    matching_data=m,
    objective=gamma,
    max_mismatch=0.05,
    time_limit=600
)
match_card = matcher.match()
INFO:pybalance.utils.preprocess:Discretized age with bins [18.02, 27.51, 37.01, 46.51, 56.0, 65.5, 75.0].
INFO:pybalance.utils.preprocess:Discretized height with bins [125.0, 136.67, 148.33, 159.99, 171.66, 183.32, 194.99].
INFO:pybalance.utils.preprocess:Discretized weight with bins [50.01, 61.67, 73.34, 85.0, 96.67, 108.33, 120.0].
INFO:pybalance.lp.matcher:Scaling features by factor 200.00 in order to use integer solver with <= 0.0000% loss.
INFO:pybalance.lp.matcher:Solving for match population with pool size = None and target size = 1000 subject to 0.05 balance constraint.
INFO:pybalance.lp.matcher:Matching on 27 dimensions ...
INFO:pybalance.lp.matcher:Building model variables and constraints ...
INFO:pybalance.lp.matcher:Calculating bounds on feature variables ...
INFO:pybalance.lp.matcher:Applying size constraints on pool and target ...
INFO:pybalance.lp.matcher:Solving with 4 workers ...
INFO:pybalance.lp.matcher:Initial balance score: 0.1236
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 1, time = 0.37 m
INFO:pybalance.lp.matcher:Objective:    13153.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0068
INFO:pybalance.lp.matcher:Patients (pool):      1847
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 2, time = 0.48 m
INFO:pybalance.lp.matcher:Objective:    12981.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0163
INFO:pybalance.lp.matcher:Patients (pool):      2019
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 3, time = 0.52 m
INFO:pybalance.lp.matcher:Objective:    12937.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0171
INFO:pybalance.lp.matcher:Patients (pool):      2063
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 4, time = 0.54 m
INFO:pybalance.lp.matcher:Objective:    12910.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0178
INFO:pybalance.lp.matcher:Patients (pool):      2090
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 5, time = 0.55 m
INFO:pybalance.lp.matcher:Objective:    12906.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0221
INFO:pybalance.lp.matcher:Patients (pool):      2094
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 6, time = 0.95 m
INFO:pybalance.lp.matcher:Objective:    12874.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0221
INFO:pybalance.lp.matcher:Patients (pool):      2126
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 7, time = 2.11 m
INFO:pybalance.lp.matcher:Objective:    12650.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0241
INFO:pybalance.lp.matcher:Patients (pool):      2350
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 8, time = 2.27 m
INFO:pybalance.lp.matcher:Objective:    12580.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0247
INFO:pybalance.lp.matcher:Patients (pool):      2420
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 9, time = 2.27 m
INFO:pybalance.lp.matcher:Objective:    10818.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0413
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:Status = FEASIBLE
INFO:pybalance.lp.matcher:Number of solutions found: 9

Note that as the optimization progresses, the size of the matched population increases and the balance also increases. This will generally be the case, since it is harder to find a larger set of patients (they have more degrees of freedom) that match a given target. Increased mismatch is the tradeoff of having a larger matched population.

Let’s check that the found solution satisfies the balance constraint:

[5]:
gamma.per_feature_loss(match_card.get_population('pool')).max()
[5]:
tensor(0.0500)
[6]:
%matplotlib inline
match_card = matcher.get_best_match()
m_data = m.copy().get_population('pool')
m_data.loc[:, 'population'] = m_data['population'] + ' (prematch)'
match_card.append(m_data)
fig = plot_numeric_features(match_card, hue_order=['pool (prematch)', 'pool', 'target', ])
fig = plot_categoric_features(match_card,  hue_order=['pool (prematch)', 'pool', 'target'])
../_images/demos_card_matcher_9_0.png
../_images/demos_card_matcher_9_1.png

Optimize balance with size and balance constraints

Now that we have determined the optimal size of the matched population, let’s optimize the balance for all subsets of that size. For this, we create a second matcher object, setting size constraints and keeping the balance constraint (max_mismatch).

[7]:
matcher = ConstraintSatisfactionMatcher(
    matching_data=m,
    objective=gamma,
    max_mismatch=0.05,
    pool_size=len(match_card.get_population('pool')),
    target_size=len(match_card.get_population('target')),
    time_limit=600
)
match_dist = matcher.match()
INFO:pybalance.lp.matcher:Scaling features by factor 200.00 in order to use integer solver with <= 0.0000% loss.
INFO:pybalance.lp.matcher:Solving for match population with pool size = 4182 and target size = 1000 subject to 0.05 balance constraint.
INFO:pybalance.lp.matcher:Matching on 27 dimensions ...
INFO:pybalance.lp.matcher:Building model variables and constraints ...
INFO:pybalance.lp.matcher:Calculating bounds on feature variables ...
INFO:pybalance.lp.matcher:Applying size constraints on pool and target ...
INFO:pybalance.lp.matcher:Solving with 4 workers ...
INFO:pybalance.lp.matcher:Initial balance score: 0.1236
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 1, time = 0.27 m
INFO:pybalance.lp.matcher:Objective:    825443200.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0366
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 2, time = 0.34 m
INFO:pybalance.lp.matcher:Objective:    825304800.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0365
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 3, time = 0.49 m
INFO:pybalance.lp.matcher:Objective:    733533600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0325
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 4, time = 3.38 m
INFO:pybalance.lp.matcher:Objective:    733527200.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0325
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 5, time = 3.92 m
INFO:pybalance.lp.matcher:Objective:    733465600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0325
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 6, time = 4.07 m
INFO:pybalance.lp.matcher:Objective:    733395200.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0325
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:Status = OPTIMAL
INFO:pybalance.lp.matcher:Number of solutions found: 6
[8]:
match_dist = matcher.get_best_match()
m_data = m.copy().get_population('pool')
m_data.loc[:, 'population'] = m_data['population'] + ' (prematch)'
match_dist.append(m_data)
fig = plot_numeric_features(match_dist, hue_order=['pool (prematch)', 'pool', 'target', ])
fig = plot_categoric_features(match_dist,  hue_order=['pool (prematch)', 'pool', 'target'])
../_images/demos_card_matcher_13_0.png
../_images/demos_card_matcher_13_1.png

Optimize balance without balance constraint

You can also remove the balance constraint, if you don’t care about individual mismatched features. Here we run a matcher fixed to the size of the optimal subset size but relax the mismatch constraint.

[9]:
matcher = ConstraintSatisfactionMatcher(
    matching_data=m,
    objective=gamma,
    pool_size=len(match_card.get_population('pool')),
    target_size=len(match_card.get_population('target')),
    time_limit=600
)
match_dist2 = matcher.match()
INFO:pybalance.lp.matcher:Scaling features by factor 200.00 in order to use integer solver with <= 0.0000% loss.
INFO:pybalance.lp.matcher:Solving for match population with pool size = 4182 and target size = 1000 subject to None balance constraint.
INFO:pybalance.lp.matcher:Matching on 27 dimensions ...
INFO:pybalance.lp.matcher:Building model variables and constraints ...
INFO:pybalance.lp.matcher:Calculating bounds on feature variables ...
INFO:pybalance.lp.matcher:Applying size constraints on pool and target ...
INFO:pybalance.lp.matcher:Solving with 4 workers ...
INFO:pybalance.lp.matcher:Initial balance score: 0.1236
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 1, time = 0.08 m
INFO:pybalance.lp.matcher:Objective:    2127925600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0942
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 2, time = 0.12 m
INFO:pybalance.lp.matcher:Objective:    2020125600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0895
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 3, time = 0.14 m
INFO:pybalance.lp.matcher:Objective:    2019725600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0894
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 4, time = 0.22 m
INFO:pybalance.lp.matcher:Objective:    2016325600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0893
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 5, time = 0.28 m
INFO:pybalance.lp.matcher:Objective:    2015325600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0892
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 6, time = 0.33 m
INFO:pybalance.lp.matcher:Objective:    2012125600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0891
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 7, time = 0.34 m
INFO:pybalance.lp.matcher:Objective:    2011725600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0891
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 8, time = 0.36 m
INFO:pybalance.lp.matcher:Objective:    2011525600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0891
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 9, time = 0.38 m
INFO:pybalance.lp.matcher:Objective:    2011325600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0891
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 10, time = 0.61 m
INFO:pybalance.lp.matcher:Objective:    478921600.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0212
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 11, time = 1.05 m
INFO:pybalance.lp.matcher:Objective:    478920000.0
INFO:pybalance.lp.matcher:Balance (gamma):      0.0212
INFO:pybalance.lp.matcher:Patients (pool):      4182
INFO:pybalance.lp.matcher:Patients (target):    1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:Status = FEASIBLE
INFO:pybalance.lp.matcher:Number of solutions found: 11
[10]:
match_dist2 = matcher.get_best_match()
m_data = m.copy().get_population('pool')
m_data.loc[:, 'population'] = m_data['population'] + ' (prematch)'
match_dist2.append(m_data)
fig = plot_numeric_features(match_dist2, hue_order=['pool (prematch)', 'pool', 'target', ])
fig = plot_categoric_features(match_dist2,  hue_order=['pool (prematch)', 'pool', 'target'])
../_images/demos_card_matcher_17_0.png
../_images/demos_card_matcher_17_1.png

Note that the solution no longer satisfies the max_mismatch constraint.

[11]:
gamma.per_feature_loss(match_dist2.get_population('pool')).max()
[11]:
tensor(0.1696)