Feature Generation in Cheminformatics
Feature generation is the process of extracting meaningful numerical values from chemical structures. These features capture important chemical properties and patterns, enabling machine learning models to learn relationships between molecular structure and target properties.
Chemical Features supported by Mother
The Mother framework provides tools to automate feature generation:
- Descriptors
  - Numerical values summarizing molecular properties (e.g., molecular weight, LogP (hydrophobicity), number of rotatable bonds, topological polar surface area).
  - Mother supports all descriptors provided by rdkit.Chem.Descriptors; the full list of available descriptors can be checked in that RDKit module.
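As a minimal sketch of one way to inspect the descriptor list, the snippet below reads the names straight from RDKit rather than through any Mother helper. It assumes RDKit is installed and relies on Descriptors._descList, an underscore-prefixed attribute that holds (name, function) pairs and may change between RDKit releases.

```python
from rdkit.Chem import Descriptors

# Descriptors._descList is a list of (name, function) pairs, one per descriptor.
# The leading underscore marks it as internal, so treat this as a convenience
# for exploration rather than a stable interface.
descriptor_names = [name for name, _func in Descriptors._descList]

print(len(descriptor_names))   # number of descriptors in this RDKit version
print(descriptor_names[:5])    # first few names, e.g. to pick descriptor_list values
```

The names printed here are the kind of strings passed to descriptor_list in the usage example, such as "MolWt" or "MolLogP".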
- Fingerprints
  - Binary or count vectors encoding the presence or absence of specific substructures.
  - Mother supports:
    - MACCS keys (MaccsFingerprints)
    - A convenience class, FingerprintsGeneric, to generate fingerprints supported by rdkit.Chem.rdFingerprintGenerator. You can provide one from the following list to the fp_type parameter:
      - RDKitFP
      - MorganFP
      - AtomPairFP
      - TopologicalTorsionFP
Binary vs. Count Fingerprints
By default, fingerprints are generated as binary vectors: each bit is either 0 (substructure absent) or 1 (substructure present). You can switch to count-based fingerprints by setting use_counts=True, which records how many times each substructure occurs in the molecule. Count fingerprints can improve model performance when the frequency of substructures is informative.

| Mode | Parameter | Output values | Use case |
| --- | --- | --- | --- |
| Binary (default) | use_counts=False | 0 or 1 | General-purpose fingerprinting |
| Count | use_counts=True | 0, 1, 2, … | When substructure frequency matters |

The use_counts parameter is available on both FingerprintsGeneric and the MorganFingerprints convenience class:

from mother.feature_generation import FingerprintsGeneric, MorganFingerprints

# Count-based Morgan fingerprints via the convenience class
morgan_counts = MorganFingerprints(radius=2, fpSize=1024, use_counts=True)
features = morgan_counts.fit_transform(molecule_objects)

# Count-based fingerprints via the generic class (works with any supported fp_type)
fp_counts = FingerprintsGeneric(
    fp_type="AtomPairFP",
    parameters={"fpSize": 2048},
    use_counts=True,
)
features = fp_counts.fit_transform(molecule_objects)

Note: use_counts is a standard scikit-learn parameter, so it is preserved through set_params(), get_params(), and sklearn.base.clone().

- Custom Features from Machine Learning Models
  - Features generated by trained machine learning models.
  - Any data can be provided as input to the final machine learning object.
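To make the binary and count modes from the fingerprints section concrete, here is a toy illustration in plain Python. It is not how Mother or RDKit compute fingerprints internally: the substructure identifiers are made up, and the helper names fold_counts and to_binary are hypothetical. It only shows how a count vector carries occurrence numbers while the binary vector keeps presence alone.

```python
def fold_counts(substructure_ids, fp_size):
    """Fold integer substructure identifiers into a fixed-length count vector."""
    counts = [0] * fp_size
    for sub_id in substructure_ids:
        # Each identifier is mapped to a position by taking it modulo fp_size.
        counts[sub_id % fp_size] += 1
    return counts


def to_binary(count_vector):
    """Collapse counts to presence/absence bits."""
    return [1 if c > 0 else 0 for c in count_vector]


# Made-up identifiers for one molecule; identifier 17 occurs three times.
substructure_ids = [17, 42, 17, 5, 17]

count_fp = fold_counts(substructure_ids, fp_size=8)
binary_fp = to_binary(count_fp)

print(count_fp)   # [0, 3, 1, 0, 0, 1, 0, 0]
print(binary_fp)  # [0, 1, 1, 0, 0, 1, 0, 0]
```

The repeated substructure shows up as 3 in the count vector but only as 1 in the binary vector, which is the information use_counts=True is meant to retain.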
Usage Example
from mother.feature_generation import ChemicalDescriptors
# Example: generate ring count and LogP descriptors
descriptor_generator = ChemicalDescriptors(descriptor_list=["RingCount", "MolLogP"])
features = descriptor_generator.fit_transform(molecule_objects)
You can combine multiple feature generators using scikit-learn’s FeatureUnion:
from sklearn.pipeline import FeatureUnion
from mother.feature_generation import ChemicalDescriptors, MorganFingerprints

# Class and parameter names follow the MorganFingerprints example shown earlier
feature_generator = FeatureUnion([
    ("descriptors", ChemicalDescriptors(descriptor_list=["MolWt", "MolLogP"])),
    ("morgan_fp", MorganFingerprints(radius=2, fpSize=1024)),
])
features = feature_generator.fit_transform(molecule_objects)
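FeatureUnion runs each transformer on the same input and concatenates the results column-wise, so the combined matrix has one row per input and the sum of the individual feature counts as columns. The sketch below shows this with toy stand-in transformers instead of Mother classes; the functions two_columns and three_columns are hypothetical and exist only to make the resulting shape easy to check. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


# Toy stand-ins for feature generators (hypothetical, not part of Mother):
# each maps n input rows to a fixed number of feature columns.
def two_columns(X):
    X = np.asarray(X, dtype=float)
    return np.hstack([X, X ** 2])


def three_columns(X):
    X = np.asarray(X, dtype=float)
    return np.hstack([X, X + 1.0, X * 10.0])


union = FeatureUnion([
    ("two", FunctionTransformer(two_columns)),
    ("three", FunctionTransformer(three_columns)),
])

X = np.array([[1.0], [2.0], [3.0]])
features = union.fit_transform(X)

print(features.shape)  # (3, 5): 2 columns from "two" followed by 3 from "three"
```

By the same logic, a union of two descriptors and a 1024-bit fingerprint would be expected to yield 1026 columns per molecule, with the descriptor columns first because that transformer is listed first.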