Balance Calculators
[1]:
import numpy as np
from pybalance.utils.balance_calculators import *
from pybalance.utils import MatchingData
from pybalance.sim import load_paper_dataset
[2]:
m = load_paper_dataset()
m
[2]:
Headers Numeric:
['age', 'height', 'weight']
Headers Categoric:
['gender', 'haircolor', 'country', 'binary_0', 'binary_1', 'binary_2', 'binary_3']
Populations
['pool', 'target']
 | age | height | weight | gender | haircolor | country | population | binary_0 | binary_1 | binary_2 | binary_3 | patient_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 64.854093 | 189.466850 | 88.835049 | 1.0 | 1 | 4 | pool | 0 | 1 | 0 | 1 | 135740 |
1 | 52.571993 | 158.134940 | 94.215107 | 1.0 | 1 | 1 | pool | 0 | 1 | 0 | 1 | 49288 |
2 | 25.828361 | 154.692482 | 94.226222 | 1.0 | 0 | 3 | pool | 0 | 0 | 1 | 0 | 256676 |
3 | 70.177571 | 160.536632 | 94.244356 | 1.0 | 0 | 2 | pool | 0 | 0 | 0 | 1 | 338287 |
4 | 73.779164 | 153.551419 | 86.161814 | 0.0 | 0 | 1 | pool | 0 | 0 | 1 | 1 | 72849 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
274995 | 62.547794 | 186.005015 | 50.975051 | 0.0 | 0 | 1 | target | 0 | 0 | 1 | 1 | 579081 |
274996 | 69.879934 | 142.371386 | 100.138389 | 1.0 | 1 | 4 | target | 0 | 1 | 1 | 0 | 569939 |
274997 | 56.921402 | 130.639589 | 108.745182 | 1.0 | 1 | 5 | target | 0 | 1 | 0 | 0 | 532419 |
274998 | 34.082754 | 174.764051 | 67.998396 | 0.0 | 2 | 2 | target | 0 | 0 | 0 | 1 | 566266 |
274999 | 60.981259 | 137.419436 | 89.897817 | 1.0 | 0 | 5 | target | 1 | 1 | 1 | 1 | 544231 |
275000 rows × 12 columns
[13]:
m.counts()
[13]:
population | N |
---|---|
pool | 250000 |
target | 25000 |
Fit Balance Calculator
[14]:
# Balance calculators in general are "fit" to the whole population data
# Fitting here means fitting preprocessors (e.g. what bins to use when binning
# is involved). It's important to fit once so that all calls to distance()
# can be compared meaningfully.
beta = BetaBalance(m)
target, pool = split_target_pool(m)
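To see why fitting once matters, here is a toy sketch in plain NumPy. It is not pybalance's implementation: `fit_bin_edges` and `binned_distance` are hypothetical helpers and the data are synthetic. The bin edges are computed once on the combined data and then reused for every comparison, so all resulting distances are measured on the same scale.

```python
import numpy as np

def fit_bin_edges(values, n_bins=10):
    """Hypothetical 'fit' step: choose quantile bin edges from the full population."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

def binned_distance(a, b, edges):
    """Toy distance: mean absolute difference of bin frequencies."""
    # Clip so that every value falls inside the fitted range.
    a = np.clip(a, edges[0], edges[-1])
    b = np.clip(b, edges[0], edges[-1])
    freq_a = np.histogram(a, bins=edges)[0] / len(a)
    freq_b = np.histogram(b, bins=edges)[0] / len(b)
    return float(np.abs(freq_a - freq_b).mean())

rng = np.random.default_rng(0)
toy_pool = rng.normal(50.0, 15.0, size=5000)   # synthetic stand-in for a pool feature
toy_target = rng.normal(60.0, 10.0, size=500)  # synthetic stand-in for a target feature

# Fit once on everything, then reuse the same edges for every comparison.
edges = fit_bin_edges(np.concatenate([toy_pool, toy_target]))
d_full = binned_distance(toy_pool, toy_target, edges)
d_subset = binned_distance(rng.choice(toy_pool, size=100, replace=False), toy_target, edges)
print(d_full, d_subset)
```

If the edges were refit for each subset, the two numbers would be computed on different binnings and could not be compared directly.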
Balance between pool and target
[15]:
beta.distance(pool)
[15]:
tensor(0.2353, dtype=torch.float64)
[16]:
# Specifying target is optional
beta.distance(pool, target)
[16]:
tensor(0.2353, dtype=torch.float64)
Balance between subset of pool and target
[17]:
beta.distance(pool.sample(n=100))
[17]:
tensor(0.2366, dtype=torch.float64)
[18]:
# Can also take subsets of the target
beta.distance(pool.sample(n=100), target.sample(n=100))
[18]:
tensor(0.2669, dtype=torch.float64)
Balance between several subsets simultaneously
[19]:
pool_subsets = np.array([
    np.random.choice(pool.reset_index().index.values, size=100, replace=False),
    np.random.choice(pool.reset_index().index.values, size=100, replace=False)
])
beta.distance(pool_subsets)
[19]:
tensor([0.2404, 0.2418], dtype=torch.float64)
[9]:
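# Subsets can be given for both pool and target. Plain lists of arrays work,
# but they trigger the PyTorch warning shown in the output; stacking them
# with np.array(), as in the previous cell, is what the warning recommends.
pool_subsets = [
    np.random.choice(pool.reset_index().index.values, size=100, replace=False),
    np.random.choice(pool.reset_index().index.values, size=100, replace=False)
]
target_subsets = [
    np.random.choice(target.reset_index().index.values, size=100, replace=False),
    np.random.choice(target.reset_index().index.values, size=100, replace=False)
]
beta.distance(pool_subsets, target_subsets)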
/Users/gmema/src/pybalance/pybalance/utils/balance_calculators.py:224: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_new.cpp:278.)
subset_populations = torch.tensor(
[9]:
tensor([0.2602, 0.2757], dtype=torch.float64)
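The warning above comes from passing a Python list of arrays. One way to avoid it is to build the subsets as a single 2-D array from the start. The sketch below uses a hypothetical helper, `draw_subsets`, which is not part of pybalance. It assumes that `distance()` accepts a 2-D array for the target in the same way the earlier cell shows it does for the pool.

```python
import numpy as np

def draw_subsets(n_rows, subset_size, n_subsets, seed=None):
    """Hypothetical helper: return an (n_subsets, subset_size) array of row positions.

    Each row is drawn without replacement, independently of the other rows.
    """
    rng = np.random.default_rng(seed)
    return np.stack([
        rng.choice(n_rows, size=subset_size, replace=False)
        for _ in range(n_subsets)
    ])

subsets = draw_subsets(n_rows=25_000, subset_size=100, n_subsets=3, seed=0)
print(subsets.shape)  # (3, 100)
```

With the objects from this notebook, a call might then look like `beta.distance(draw_subsets(len(pool), 100, 2), draw_subsets(len(target), 100, 2))`. This assumes `pool` and `target` behave like pandas DataFrames, as their `reset_index()` and `sample()` calls suggest.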
[28]:
# The number of pool and target subsets must match; otherwise distance()
# raises a ValueError:
pool_subsets = [
    np.random.choice(pool.reset_index().index.values, size=100, replace=False),
    np.random.choice(pool.reset_index().index.values, size=100, replace=False)
]
target_subsets = [
    np.random.choice(target.reset_index().index.values, size=100, replace=False),
    np.random.choice(target.reset_index().index.values, size=100, replace=False),
    np.random.choice(target.reset_index().index.values, size=100, replace=False)
]
try:
    beta.distance(pool_subsets, target_subsets)
except ValueError as e:
    print(e)
Number of subset populations must be same for pool and target!
Basic Genetic Optimizer
Here is a very basic, unoptimized sketch of genetic matching. It is not very smart: each generation draws fresh random subsets and keeps the best pair seen so far, without mixing the good candidates. It is only meant to illustrate how to use the balance calculator.
[27]:
def get_subsets(pool, target, pool_size, target_size, n_subsets):
    # Draw n_subsets random subsets (as positional indices) from pool and target.
    pool = pool.reset_index()
    target = target.reset_index()
    pool_subsets = [
        np.random.choice(pool.index.values, size=pool_size, replace=False) for _ in range(n_subsets)
    ]
    target_subsets = [
        np.random.choice(target.index.values, size=target_size, replace=False) for _ in range(n_subsets)
    ]
    return pool_subsets, target_subsets

pool_size = 1000
target_size = 1000
n_subsets = 100
best_match = None
best_distance = float("inf")
for j in range(100):
    pool_subsets, target_subsets = get_subsets(pool, target, pool_size, target_size, n_subsets)
    distances = beta.distance(pool_subsets, target_subsets)
    this_best_distance = distances.min()
    # Keep the best pair of subsets seen so far.
    if this_best_distance < best_distance:
        best_distance = this_best_distance
        best_match_idx = distances.argmin()
        best_match = pool_subsets[best_match_idx], target_subsets[best_match_idx]
    if not j % 10:
        print(f'Generation {j} / Best distance found {best_distance:.3f}')
Generation 0 / Best distance found 0.215
Generation 10 / Best distance found 0.215
Generation 20 / Best distance found 0.211
Generation 30 / Best distance found 0.211
Generation 40 / Best distance found 0.208
Generation 50 / Best distance found 0.208
Generation 60 / Best distance found 0.208
Generation 70 / Best distance found 0.208
Generation 80 / Best distance found 0.208
Generation 90 / Best distance found 0.208
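The loop above is effectively a random search, since every generation starts from scratch. A genetic optimizer would instead build new candidates out of the best ones. The sketch below shows one possible crossover step in plain NumPy. It is an illustration only: `crossover` and `toy_distance` are hypothetical, the data are synthetic, and in practice the candidates would be scored with `beta.distance()`.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_pool = rng.normal(50.0, 15.0, size=10_000)  # synthetic pool feature
target_mean = 60.0                               # synthetic target summary

def toy_distance(subset):
    """Stand-in for a balance metric: gap between subset mean and target mean."""
    return abs(toy_pool[subset].mean() - target_mean)

def crossover(parent_a, parent_b, rng):
    """Child takes distinct row positions from the union of both parents."""
    union = np.union1d(parent_a, parent_b)
    return rng.choice(union, size=len(parent_a), replace=False)

subset_size, n_candidates = 100, 50
population = [rng.choice(len(toy_pool), size=subset_size, replace=False)
              for _ in range(n_candidates)]
initial_best = min(toy_distance(s) for s in population)

for generation in range(20):
    # Keep the better half (elitism), then refill by mixing pairs of survivors.
    population.sort(key=toy_distance)
    survivors = population[: n_candidates // 2]
    children = []
    for _ in range(n_candidates - len(survivors)):
        i, j = rng.choice(len(survivors), size=2, replace=False)
        children.append(crossover(survivors[i], survivors[j], rng))
    population = survivors + children

final_best = min(toy_distance(s) for s in population)
print(f"best distance: {initial_best:.3f} -> {final_best:.3f}")
```

Because the survivors are carried over unchanged, the best distance can never get worse from one generation to the next. A fuller implementation would typically add mutation as well, which is omitted here for brevity.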