Mother-ML
An ML framework that takes care.
Mother is a machine-learning framework for predicting properties of chemical molecules. Its major features are:
- SMILES preprocessing
- Generation of feature vectors from molecules
- Grouping and cross-validation based on chemical similarity
- Model training: standard CatBoost models and feature-selection methods
- Training, cross-validation, and hyperparameter optimization of machine-learning models
- Handling of gene-expression data from transcriptomics experiments, including different normalisation techniques
- Explainability analysis with SHAP (currently not supported; will be added in a later release)
- Generative chemistry (currently not supported)
Mother provides methods for each of these steps in the form of sklearn transformer objects, so all methods are easily accessible and usable in a modular way. They can be combined into ML workflows with sklearn pipelines, column transformers, and feature unions.
All methods can be used as sklearn transformers or estimators, so combining them with other methods, or with your own methods and models (e.g. using mother preprocessing with another model), is straightforward. For maximum compatibility, every transformer can be constructed from a dictionary containing the required parameters. For convenience, mother also provides a settings class, MotherSettings, which can store all relevant settings for your ML project.
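Because every transformer follows the sklearn parameter conventions, a parameter dictionary can be unpacked directly into a constructor. A minimal sketch of the pattern, using a plain sklearn transformer as a stand-in for a mother transformer:

```python
from sklearn.preprocessing import StandardScaler

# A settings dictionary, e.g. loaded from a config file
# (StandardScaler stands in for a mother transformer here)
params = {"with_mean": True, "with_std": False}

# Unpack the dictionary into the constructor; get_params() round-trips it
scaler = StandardScaler(**params)
assert scaler.get_params()["with_std"] is False
```

The same unpacking works for any estimator that follows the sklearn `get_params`/`set_params` contract.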
Usage
A basic example can be found in the example regression notebook. Other examples are in the examples folder.
SMILES preprocessing and mol-object generation
SMILES preprocessing is done with the StandardizerTransformer class. Combined with SmilesToMolTransformer, it yields a pipeline from SMILES strings to RDKit mol objects:
```python
import pandas as pd
from sklearn import pipeline as sklearn_pipeline

# StandardizerTransformer and SmilesToMolTransformer are imported from mother
preprocessor: sklearn_pipeline.Pipeline = sklearn_pipeline.Pipeline(
    [
        (
            "smiles_standardizer",
            StandardizerTransformer(flags=["STANDARDIZE", "DESALT", "NEUTRALIZE"]),
        ),
        ("smiles_to_mol", SmilesToMolTransformer()),
        # Add other column transformations here if needed
    ],
    memory=None,
).set_output(transform="pandas")

mol_data: pd.DataFrame = preprocessor.fit_transform(structure_data)
```
The standardization steps can be customized by changing the flags argument.
Feature Generation
Mother provides three types of feature generators: MaccsFingerprints, MorganFingerprints, and ChemicalDescriptors:
```python
import pandas as pd
from sklearn import pipeline as sklearn_pipeline

# MaccsFingerprints, MorganFingerprints, and ChemicalDescriptors are imported from mother
feature_generator = sklearn_pipeline.FeatureUnion(
    transformer_list=[
        ("maccs", MaccsFingerprints()),
        ("morgan", MorganFingerprints()),
        ("desc", ChemicalDescriptors()),
    ],
).set_output(transform="pandas")

features: pd.DataFrame = feature_generator.fit_transform(mol_data["Molecule"])
```
The FeatureUnion class concatenates the outputs of the individual feature generators, and each generator can be configured independently.
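Configuration of individual generators inside a FeatureUnion follows the standard sklearn nested-parameter convention. The sketch below uses plain sklearn transformers as stand-ins for the mother generators; the actual parameter names of MaccsFingerprints and friends are defined by mother:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in transformers in place of the mother feature generators
union = FeatureUnion([("pca", PCA(n_components=2)), ("scale", StandardScaler())])

# Nested parameters use the "<step>__<param>" convention; the mother
# generators can be configured the same way once their step names are known
union.set_params(pca__n_components=1)

X = np.random.default_rng(0).normal(size=(10, 3))
features = union.fit_transform(X)
assert features.shape == (10, 4)  # 1 PCA component + 3 scaled columns
```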
Grouping and Cross-Validation
For cross-validation or test-set selection based on chemical similarity, mother provides a transformer class for generating groups (TanimotoGroupingFromMols):
```python
# cv_module refers to mother's cross-validation module
groups_engine = cv_module.TanimotoGroupingFromMols(similarity_threshold=0.3)
groups: pd.DataFrame = groups_engine.set_output(transform="pandas").fit_transform(mol_data)
```
These groups can then be used, for example, with the GroupKFold class from the sklearn.model_selection module.
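For illustration, the generated groups plug into sklearn's GroupKFold like any other group labels (synthetic group labels here in place of the Tanimoto groups):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(24).reshape(12, 2)
y = np.arange(12)
# Synthetic similarity groups standing in for TanimotoGroupingFromMols output
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4])

cv = GroupKFold(n_splits=3)
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # No group is split between train and test, so similar molecules
    # never leak across the split
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```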
Model Training
The standard model setup of Mother consists of a feature-selection step and a classification or regression model, both based on CatBoost. The standard setup for a regression task would be:
```python
import mother.pipeline_utils as mother_takes_care

model_settings = {
    "feature_selection_flags": ["DROP_CORRELATED", "DROP_CONSTANT", "DROP_DUPLICATES", "DROP_UNIMPORTANT"],
    "feature_selection_threshold": 1e-5,
    "correlation_threshold": 0.9,
    "algorithm": "catboost",
    "feature_selection_type": "catboost",
    "type": "regression",
    "target_type": "single_target",
}

pipeline_settings = {
    # categorical_features is the list of categorical feature columns, if any
    "remainder": "drop" if len(categorical_features) == 0 else "passthrough",
    "verbose_feature_names_out": False,
}

# ml refers to mother's ml module; cv is a cross-validation splitter,
# e.g. GroupKFold with the groups generated above
model = ml.PipelineWithHyperparameterRooting(
    [
        (
            "feature_selector",
            mother_takes_care.get_feature_selection_pipeline(
                settings=model_settings,
                pipeline_settings=pipeline_settings,
                cv=cv,
            ).set_output(transform="pandas"),
        ),
        ("ml_model", ml.CatboostRegressorMother(target_type="single_target", logging_level="Silent")),
    ]
)
```
Here, we use the extended sklearn pipeline PipelineWithHyperparameterRooting, which provides additional methods for hyperparameter tuning.
Without feature selection, the setup simplifies to a pipeline containing only the model step.
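A sketch of that reduced setup, with a plain sklearn regressor standing in for CatboostRegressorMother in the "ml_model" step:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

# Ridge is a stand-in for ml.CatboostRegressorMother; only the model step remains
model = Pipeline([("ml_model", Ridge(alpha=1.0))])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)
model.fit(X, y)
assert model.score(X, y) > 0.9  # near-linear data, so a high R² is expected
```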
Any other sklearn model, or your own model, can be used instead of CatboostRegressorMother. An example of how a custom preprocessing step is added to the model can be found in the example notebook on custom preprocessing.
Cross-validation
Since every component is an sklearn pipeline, estimator, or transformer, the standard sklearn utilities, e.g. cross_validate for cross-validation, can be used directly.
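For example, cross_validate combined with GroupKFold (a stand-in regressor and synthetic groups are used here in place of the mother model and the Tanimoto groups):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=40)
groups = np.repeat(np.arange(8), 5)  # stand-in for the Tanimoto groups

results = cross_validate(
    Ridge(), X, y, cv=GroupKFold(n_splits=4), groups=groups, scoring="r2"
)
assert len(results["test_score"]) == 4  # one score per fold
```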
A more convenient method is provided by mother; it gives additional output regarding the CV and the groups.
```python
import mother.pipeline_utils as mother_takes_care

# cv is a cross-validation splitter, e.g. GroupKFold
mother_takes_care.mother_cv(estimator=model, X=features, y=data["target"], cv=cv)
```
Hyperparameter Optimization
The Mother object MotherTuner uses optuna to optimize hyperparameters:
```python
# opt refers to mother's hyperparameter-optimization module
tuner = opt.MotherTuner(
    scorer="r2",
    n_threads_optuna=10,  # parallel threads for cross-validation evaluation
)
model_tuned = tuner.optimize(
    model,
    features,
    targets,
    cv,
    groups=groups.values,
)
```
The method model.get_hyperparameter_space returns the hyperparameter space for the model. For the default catboost model and the PipelineWithHyperparameterRooting class, this is already implemented.
For examples of how to customize the hyperparameter optimization, or how to define hyperparameters for your own models, see the example notebook.
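As a rough sketch of exposing hyperparameters on a custom model, a custom estimator might implement its own get_hyperparameter_space. The dictionary layout below is an assumption for illustration, not the mother API; the example notebook defines the format the tuner actually expects:

```python
from sklearn.base import BaseEstimator, RegressorMixin

class MyRegressor(RegressorMixin, BaseEstimator):
    """Minimal custom estimator exposing a hyperparameter space.

    The space layout below is illustrative only; consult the mother
    example notebook for the format the tuner actually expects.
    """

    def __init__(self, alpha=1.0, max_depth=3):
        self.alpha = alpha
        self.max_depth = max_depth

    def get_hyperparameter_space(self):
        # Hypothetical layout: parameter name -> (low, high) search range
        return {"alpha": (1e-3, 10.0), "max_depth": (2, 8)}

space = MyRegressor().get_hyperparameter_space()
assert set(space) == {"alpha", "max_depth"}
```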
Handling gene-expression data from transcriptomics experiments
The RNA processing pipeline is implemented in the RNA class, which incorporates various preprocessing steps tailored to RNA-sequencing data. All RNA code can be found in the rna.py file.
The pipeline includes normalization, feature selection, and discretization, built on the scikit-learn framework. The available normalization methods are "Scanpy", "UQ", "CUF", and "CPM". You can customise the pipeline to your needs, or try different normalisation methods and bin sizes during hyperparameter tuning. The pipeline can be fitted and re-applied, which avoids data leakage during normalisation.
Here's how to set up and use the RNA processing pipeline:
```python
import pandas as pd
from mother.ml.rna import RNA
from sklearn.pipeline import Pipeline

rna_pipeline: Pipeline = RNA(
    n_features=None,  # number of genes to keep; None keeps all genes with non-zero importance
    n_bins=20,  # number of bins for discretising the target variable
    normalisation_method="Scanpy",  # which normalisation to use
)._build_pipeline()

# Fit the pipeline on the training data, then re-apply it to the test data
transformed_train_data: pd.DataFrame = rna_pipeline.fit_transform(rna_data_train)
transformed_test_data: pd.DataFrame = rna_pipeline.transform(rna_data_test)
```
A complete walkthrough of the RNA functionality is found in the example notebook.
Install
```shell
uv add mother-ml
```
Optional Features and Extras
To keep the package size small, some dependencies are added as optional extras. These extras provide additional functionality for specific use cases:
| Extra | Description | Key Packages | Notes |
|---|---|---|---|
| `all` | All optional features | All packages below | Installs everything |
| `report` | Visualization and reporting tools | plotly, kaleido | For generating plots and reports |
| `rna` | RNA sequence analysis | rnalib | RNA-specific preprocessing |
| `torch` | PyTorch neural network support | torch, pytorch-tabular | Adds ~3GB to environment size! |
| `tabpfn` | TabPFN model support | tabpfn | Prior-fitted networks for tabular data |
| `clustering` | Chemical compound clustering | mol2vec, cluster-my-molecules | For molecular clustering analysis |
Installation Examples
Using pip (note the package name mother-ml, not mother):

```shell
# Install with report generation support
pip install 'mother-ml[report]'

# Install with PyTorch support (adds ~3GB!)
pip install 'mother-ml[torch]'

# Install multiple extras
pip install 'mother-ml[report,torch,tabpfn]'
```
Using uv:
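The uv equivalents follow the same extras syntax (the chosen extras here are just examples):

```shell
# Install with report generation support
uv add 'mother-ml[report]'

# Install multiple extras
uv add 'mother-ml[report,torch,tabpfn]'
```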
Note: there is also a different `mother` package on PyPI. Be sure to install `mother-ml`.
Acknowledgements
Thank you to the following contributors:
- Thomas Wolf
- Lukas Hebing
- Kai Sommer
and all the other contributors.