
Feature Generation in Cheminformatics

Feature generation is the process of extracting meaningful numerical values from chemical structures. These features capture important chemical properties and patterns, enabling machine learning models to learn relationships between molecular structure and target properties.


Chemical Features Supported by Mother

The Mother framework provides tools to automate feature generation:

  1. Descriptors

    • Numerical values summarizing molecular properties, e.g., molecular weight, LogP (hydrophobicity), number of rotatable bonds, and topological polar surface area.
    • Mother supports all descriptors provided by rdkit.Chem.Descriptors. You can check the full list of available descriptors:
      Python
      from rdkit.Chem import Descriptors
      for k, v in Descriptors.descList:
          print(k)
      
  2. Fingerprints

    • Binary or count vectors encoding the presence (or number of occurrences) of specific substructures.
    • Mother supports:
      • MACCS keys via MaccsFingerprints
      • A convenience class, FingerprintsGeneric, that generates any fingerprint supported by rdkit.Chem.rdFingerprintGenerator. Pass one of the following values to the fp_type parameter:
        • RDKitFP
        • MorganFP
        • AtomPairFP
        • TopologicalTorsionFP

    Binary vs. Count Fingerprints

    By default, fingerprints are generated as binary vectors: each bit is either 0 (substructure absent) or 1 (substructure present). Setting use_counts=True switches to count-based fingerprints, which record how many times each substructure occurs in the molecule. Count fingerprints can improve model performance when the frequency of substructures is informative.

    Mode             | Parameter        | Output values | Use case
    Binary (default) | use_counts=False | 0 or 1        | General-purpose fingerprinting
    Count            | use_counts=True  | 0, 1, 2, …    | When substructure frequency matters

    The use_counts parameter is available on both FingerprintsGeneric and the MorganFingerprints convenience class:

    Python
    from mother.feature_generation import FingerprintsGeneric, MorganFingerprints
    
    # Count-based Morgan fingerprints via the convenience class
    morgan_counts = MorganFingerprints(radius=2, fpSize=1024, use_counts=True)
    features = morgan_counts.fit_transform(molecule_objects)
    
    # Count-based fingerprints via the generic class (works with any supported fp_type)
    fp_counts = FingerprintsGeneric(
        fp_type="AtomPairFP",
        parameters={"fpSize": 2048},
        use_counts=True,
    )
    features = fp_counts.fit_transform(molecule_objects)
    

    Note

    use_counts is exposed as a regular scikit-learn estimator parameter, so it is preserved through set_params(), get_params(), and sklearn.base.clone().

  3. Custom Features from Machine Learning Models

    • Features generated by trained machine learning models.
    • Any input data can be passed to the final machine learning object.
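
For intuition about the binary and count modes described under Fingerprints, the sketch below calls RDKit's rdFingerprintGenerator directly on a single molecule. It is an illustration only: whether Mother uses these exact calls internally is an assumption, and the example molecule (hexane) is chosen purely because its repeated CH2 environments produce counts above 1.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Illustrative molecule (an assumption, not from the Mother docs): hexane.
# The same local environments repeat along the chain.
mol = Chem.MolFromSmiles("CCCCCC")

# Morgan generator with radius 2, folded into 1024 positions
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)

# Binary mode: each entry is 0 (absent) or 1 (present)
binary_fp = generator.GetFingerprintAsNumPy(mol)

# Count mode: each entry is the number of occurrences
count_fp = generator.GetCountFingerprintAsNumPy(mol)

print("set bits (binary):", int(binary_fp.sum()))
print("total occurrences (count):", int(count_fp.sum()))
print("largest single count:", int(count_fp.max()))
```

In the binary vector every entry is 0 or 1, while the count vector records repeated environments as larger integers. That extra information is what use_counts=True makes available to a model.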

Usage Example

Python
from mother.feature_generation import ChemicalDescriptors

# Example: Generate ring count and LogP descriptors
descriptor_generator = ChemicalDescriptors(descriptor_list=["RingCount", "MolLogP"])
features = descriptor_generator.fit_transform(molecule_objects)
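
The example above assumes that molecule_objects already exists. As a rough sketch of what such a generator computes, the snippet below builds RDKit molecules from a few illustrative SMILES strings and evaluates the same two descriptors with rdkit.Chem.Descriptors directly. That Mother expects RDKit Mol objects and evaluates descriptors this way is an assumption, and the SMILES strings are hypothetical inputs.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Hypothetical inputs: ethanol, benzene, aspirin
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
molecule_objects = [Chem.MolFromSmiles(s) for s in smiles]

descriptor_names = ["RingCount", "MolLogP"]

# Each descriptor is a function on the Descriptors module that takes a Mol
rows = [
    [getattr(Descriptors, name)(mol) for name in descriptor_names]
    for mol in molecule_objects
]

for s, row in zip(smiles, rows):
    print(s, dict(zip(descriptor_names, row)))
```

Each row holds one value per requested descriptor, so the result has one row per molecule and one column per descriptor name.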

You can combine multiple feature generators using scikit-learn’s FeatureUnion:

Python
from sklearn.pipeline import FeatureUnion
from mother.feature_generation import ChemicalDescriptors, MorganFingerprints

feature_generator = FeatureUnion([
    ("descriptors", ChemicalDescriptors(descriptor_list=["MolWt", "MolLogP"])),
    ("morgan_fp", MorganFingerprints(radius=2, fpSize=1024))
])
features = feature_generator.fit_transform(molecule_objects)
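
Feature generators placed in a FeatureUnion or pipeline are often cloned by scikit-learn utilities such as cross-validation and grid search, which is why it matters that settings like use_counts behave as ordinary estimator parameters. The sketch below shows the mechanism with a toy transformer; ToyFingerprints and its fp_size argument are hypothetical stand-ins and not part of Mother.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ToyFingerprints(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a fingerprint generator (not Mother's class)."""

    def __init__(self, fp_size=1024, use_counts=False):
        # scikit-learn convention: store constructor arguments unchanged
        self.fp_size = fp_size
        self.use_counts = use_counts

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Placeholder output; a real generator would compute fingerprints here
        return [[0] * self.fp_size for _ in X]


original = ToyFingerprints(use_counts=True)

# clone() builds a new, unfitted estimator from get_params()
copied = clone(original)

# set_params() changes a setting on an existing instance
updated = clone(original).set_params(use_counts=False)

print(original.get_params())
print(copied.use_counts, updated.use_counts)
```

Because the constructor stores its arguments under the same names, get_params() can discover them and clone() can rebuild an equivalent estimator without any extra code.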

Further Reading