Cardinality matching
Cardinality matching is the process of finding the size of the largest subset \(\hat{P}\) of a pool of patient \(P\) within some “distance” of a given target population \(T\):
\begin{equation} \begin{aligned} & \underset{\hat{P}}{\text{maximize}} & & |\hat{P}| \\ & \text{subject to} & & |\mu_{\hat{P}k} - \mu_{Tk}| \leq \delta \textrm{ for all }k \end{aligned} \end{equation}
where \(k\) indexes the covariates of \(P\) and \(T\), and \(\mu\) denotes the means of the corresponding covariates. In cardinality matching, at least as implemented here, we search only for the size of the largest subset. Then in a second step, we optimize the balance (distance) among all subsets of the determined size.
[1]:
import logging
logging.basicConfig(level='INFO')
[2]:
from pybalance.utils.balance_calculators import *
from pybalance.utils import MatchingData
from pybalance.sim import generate_toy_dataset, load_paper_dataset
from pybalance.lp import ConstraintSatisfactionMatcher
from pybalance.visualization import *
[3]:
m = generate_toy_dataset(n_pool=15000, n_target=1000)
m
[3]:
['age', 'height', 'weight']
Headers Categoric:
['gender', 'haircolor', 'country', 'binary_0', 'binary_1', 'binary_2', 'binary_3']
Populations
['pool', 'target']
age | height | weight | gender | haircolor | country | population | binary_0 | binary_1 | binary_2 | binary_3 | patient_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 62.731988 | 130.816972 | 76.100401 | 0.0 | 0 | 4 | pool | 0 | 0 | 0 | 1 | 0 |
1 | 26.403338 | 130.784188 | 80.134423 | 1.0 | 1 | 2 | pool | 0 | 0 | 0 | 1 | 1 |
2 | 58.155044 | 175.704961 | 90.806745 | 0.0 | 1 | 4 | pool | 0 | 0 | 0 | 0 | 2 |
3 | 68.334248 | 167.485984 | 90.081777 | 0.0 | 0 | 4 | pool | 0 | 0 | 0 | 1 | 3 |
4 | 54.114518 | 130.782073 | 53.612174 | 1.0 | 1 | 1 | pool | 0 | 1 | 0 | 1 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 21.474205 | 168.602546 | 70.342128 | 0.0 | 2 | 5 | target | 0 | 0 | 0 | 1 | 15995 |
996 | 40.643320 | 188.188724 | 61.611744 | 0.0 | 2 | 4 | target | 1 | 0 | 0 | 1 | 15996 |
997 | 29.472765 | 161.408162 | 57.214095 | 0.0 | 0 | 1 | target | 0 | 1 | 1 | 1 | 15997 |
998 | 41.291949 | 150.968833 | 91.270798 | 0.0 | 0 | 3 | target | 0 | 0 | 0 | 0 | 15998 |
999 | 67.530294 | 155.124741 | 56.196505 | 1.0 | 0 | 1 | target | 1 | 0 | 0 | 0 | 15999 |
16000 rows × 12 columns
Optimize pool size with balance constraint
[4]:
# Note that by default, gamma uses the standardized mean difference
# to calculate distance. The ConstraintSatisfactionMatcher, however,
# can only optimize linear objectives. It is not an error to pass
# gamma with standardized_difference=True, as this is only affects
# reporting, but to make the reporting consistent with what we're
# optimizing, we explicitly pass standardized_difference=False here.
gamma = GammaBalance(m, standardize_difference=False)
matcher = ConstraintSatisfactionMatcher(
matching_data=m,
objective=gamma,
max_mismatch=0.05,
time_limit=600
)
match_card = matcher.match()
INFO:pybalance.utils.preprocess:Discretized age with bins [18.02, 27.51, 37.01, 46.51, 56.0, 65.5, 75.0].
INFO:pybalance.utils.preprocess:Discretized height with bins [125.0, 136.67, 148.33, 159.99, 171.66, 183.32, 194.99].
INFO:pybalance.utils.preprocess:Discretized weight with bins [50.01, 61.67, 73.34, 85.0, 96.67, 108.33, 120.0].
INFO:pybalance.lp.matcher:Scaling features by factor 200.00 in order to use integer solver with <= 0.0000% loss.
INFO:pybalance.lp.matcher:Solving for match population with pool size = None and target size = 1000 subject to 0.05 balance constraint.
INFO:pybalance.lp.matcher:Matching on 27 dimensions ...
INFO:pybalance.lp.matcher:Building model variables and constraints ...
INFO:pybalance.lp.matcher:Calculating bounds on feature variables ...
INFO:pybalance.lp.matcher:Applying size constraints on pool and target ...
INFO:pybalance.lp.matcher:Solving with 4 workers ...
INFO:pybalance.lp.matcher:Initial balance score: 0.1236
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 1, time = 0.37 m
INFO:pybalance.lp.matcher:Objective: 13153.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0068
INFO:pybalance.lp.matcher:Patients (pool): 1847
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 2, time = 0.48 m
INFO:pybalance.lp.matcher:Objective: 12981.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0163
INFO:pybalance.lp.matcher:Patients (pool): 2019
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 3, time = 0.52 m
INFO:pybalance.lp.matcher:Objective: 12937.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0171
INFO:pybalance.lp.matcher:Patients (pool): 2063
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 4, time = 0.54 m
INFO:pybalance.lp.matcher:Objective: 12910.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0178
INFO:pybalance.lp.matcher:Patients (pool): 2090
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 5, time = 0.55 m
INFO:pybalance.lp.matcher:Objective: 12906.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0221
INFO:pybalance.lp.matcher:Patients (pool): 2094
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 6, time = 0.95 m
INFO:pybalance.lp.matcher:Objective: 12874.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0221
INFO:pybalance.lp.matcher:Patients (pool): 2126
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 7, time = 2.11 m
INFO:pybalance.lp.matcher:Objective: 12650.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0241
INFO:pybalance.lp.matcher:Patients (pool): 2350
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 8, time = 2.27 m
INFO:pybalance.lp.matcher:Objective: 12580.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0247
INFO:pybalance.lp.matcher:Patients (pool): 2420
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 9, time = 2.27 m
INFO:pybalance.lp.matcher:Objective: 10818.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0413
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:Status = FEASIBLE
INFO:pybalance.lp.matcher:Number of solutions found: 9
Note that as the optimization progresses, the size of the matched population increases and the balance also increases. This will generally be the case, since it is harder to find a larger set of patients (they have more degrees of freedom) that match a given target. Increased mismatch is the tradeoff of having a larger matched population.
Let’s check that the found solution satisfies the balance constraint:
[5]:
gamma.per_feature_loss(match_card.get_population('pool')).max()
[5]:
tensor(0.0500)
[6]:
%matplotlib inline
match_card = matcher.get_best_match()
m_data = m.copy().get_population('pool')
m_data.loc[:, 'population'] = m_data['population'] + ' (prematch)'
match_card.append(m_data)
fig = plot_numeric_features(match_card, hue_order=['pool (prematch)', 'pool', 'target', ])
fig = plot_categoric_features(match_card, hue_order=['pool (prematch)', 'pool', 'target'])
Optimize balance with size and balance constraints
Now that we have determined the optimal size of the matched population, let’s optimize the balance for all subsets of that size. For this, we create a second matcher object, setting size constraints and keeping the balance constraint (max_mismatch).
[7]:
matcher = ConstraintSatisfactionMatcher(
matching_data=m,
objective=gamma,
max_mismatch=0.05,
pool_size=len(match_card.get_population('pool')),
target_size=len(match_card.get_population('target')),
time_limit=600
)
match_dist = matcher.match()
INFO:pybalance.lp.matcher:Scaling features by factor 200.00 in order to use integer solver with <= 0.0000% loss.
INFO:pybalance.lp.matcher:Solving for match population with pool size = 4182 and target size = 1000 subject to 0.05 balance constraint.
INFO:pybalance.lp.matcher:Matching on 27 dimensions ...
INFO:pybalance.lp.matcher:Building model variables and constraints ...
INFO:pybalance.lp.matcher:Calculating bounds on feature variables ...
INFO:pybalance.lp.matcher:Applying size constraints on pool and target ...
INFO:pybalance.lp.matcher:Solving with 4 workers ...
INFO:pybalance.lp.matcher:Initial balance score: 0.1236
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 1, time = 0.27 m
INFO:pybalance.lp.matcher:Objective: 825443200.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0366
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 2, time = 0.34 m
INFO:pybalance.lp.matcher:Objective: 825304800.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0365
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 3, time = 0.49 m
INFO:pybalance.lp.matcher:Objective: 733533600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0325
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 4, time = 3.38 m
INFO:pybalance.lp.matcher:Objective: 733527200.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0325
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 5, time = 3.92 m
INFO:pybalance.lp.matcher:Objective: 733465600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0325
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 6, time = 4.07 m
INFO:pybalance.lp.matcher:Objective: 733395200.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0325
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:Status = OPTIMAL
INFO:pybalance.lp.matcher:Number of solutions found: 6
[8]:
match_dist = matcher.get_best_match()
m_data = m.copy().get_population('pool')
m_data.loc[:, 'population'] = m_data['population'] + ' (prematch)'
match_dist.append(m_data)
fig = plot_numeric_features(match_dist, hue_order=['pool (prematch)', 'pool', 'target', ])
fig = plot_categoric_features(match_dist, hue_order=['pool (prematch)', 'pool', 'target'])
Optimize balance without balance constraint
You can also remove the balance constraint, if you don’t care about individual mismatched features. Here we run a matcher fixed to the size of the optimal subset size but relax the mismatch constraint.
[9]:
matcher = ConstraintSatisfactionMatcher(
matching_data=m,
objective=gamma,
pool_size=len(match_card.get_population('pool')),
target_size=len(match_card.get_population('target')),
time_limit=600
)
match_dist2 = matcher.match()
INFO:pybalance.lp.matcher:Scaling features by factor 200.00 in order to use integer solver with <= 0.0000% loss.
INFO:pybalance.lp.matcher:Solving for match population with pool size = 4182 and target size = 1000 subject to None balance constraint.
INFO:pybalance.lp.matcher:Matching on 27 dimensions ...
INFO:pybalance.lp.matcher:Building model variables and constraints ...
INFO:pybalance.lp.matcher:Calculating bounds on feature variables ...
INFO:pybalance.lp.matcher:Applying size constraints on pool and target ...
INFO:pybalance.lp.matcher:Solving with 4 workers ...
INFO:pybalance.lp.matcher:Initial balance score: 0.1236
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 1, time = 0.08 m
INFO:pybalance.lp.matcher:Objective: 2127925600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0942
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 2, time = 0.12 m
INFO:pybalance.lp.matcher:Objective: 2020125600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0895
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 3, time = 0.14 m
INFO:pybalance.lp.matcher:Objective: 2019725600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0894
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 4, time = 0.22 m
INFO:pybalance.lp.matcher:Objective: 2016325600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0893
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 5, time = 0.28 m
INFO:pybalance.lp.matcher:Objective: 2015325600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0892
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 6, time = 0.33 m
INFO:pybalance.lp.matcher:Objective: 2012125600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0891
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 7, time = 0.34 m
INFO:pybalance.lp.matcher:Objective: 2011725600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0891
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 8, time = 0.36 m
INFO:pybalance.lp.matcher:Objective: 2011525600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0891
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 9, time = 0.38 m
INFO:pybalance.lp.matcher:Objective: 2011325600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0891
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 10, time = 0.61 m
INFO:pybalance.lp.matcher:Objective: 478921600.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0212
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:=========================================
INFO:pybalance.lp.matcher:Solution 11, time = 1.05 m
INFO:pybalance.lp.matcher:Objective: 478920000.0
INFO:pybalance.lp.matcher:Balance (gamma): 0.0212
INFO:pybalance.lp.matcher:Patients (pool): 4182
INFO:pybalance.lp.matcher:Patients (target): 1000
INFO:pybalance.lp.matcher:
INFO:pybalance.lp.matcher:Status = FEASIBLE
INFO:pybalance.lp.matcher:Number of solutions found: 11
[10]:
match_dist2 = matcher.get_best_match()
m_data = m.copy().get_population('pool')
m_data.loc[:, 'population'] = m_data['population'] + ' (prematch)'
match_dist2.append(m_data)
fig = plot_numeric_features(match_dist2, hue_order=['pool (prematch)', 'pool', 'target', ])
fig = plot_categoric_features(match_dist2, hue_order=['pool (prematch)', 'pool', 'target'])
Note that the solution no longer satisfies the max_mismatch constraint.
[11]:
gamma.per_feature_loss(match_dist2.get_population('pool')).max()
[11]:
tensor(0.1696)