Module dataset

The MoccaDataset class provides a high-level interface for managing datasets of chromatograms and libraries of compounds.

MoccaDataset

class mocca2.dataset.dataset.MoccaDataset

Collection of chromatograms, compounds, and other information

chromatograms: Dict[int, Chromatogram]: Chromatograms in the dataset [id -> Chromatogram]

compounds: Dict[int, Compound]: Compounds by ID [id -> Compound]

compound_references: Dict[int, Tuple[str, float | None]]: If a chromatogram is a reference for a compound, this is stored here as [chromatogram id -> (compound name, concentration or None)]

istd_concentrations: Dict[int, float]: Concentrations of internal standard [chromatogram id -> concentration]

istd_compound: int | None: ID of compound that is internal standard

settings: ProcessingSettings | None: Settings for automatic chromatogram processing

add_chromatogram(chromatogram: Chromatogram, istd_concentration: float | None = None, reference_for_compound: str | None = None, compound_concentration: float | None = None, istd_reference: bool = False) → int

Adds a chromatogram to the dataset and returns the assigned ID

Parameters

chromatogram: Chromatogram: the chromatogram that is added
istd_concentration: float | None: if internal standard is present, specify concentration
reference_for_compound: str | None: if this chromatogram contains compound reference, specify compound name
compound_concentration: float | None: if this chromatogram contains compound reference with known concentration, specify concentration
istd_reference: bool = False: specify whether this chromatogram is reference for internal standard
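
A calibration series for one compound can be registered by calling add_chromatogram once per reference run. The following is a minimal sketch, not part of mocca2: the helper add_calibration_series and the stand-in class _RecordingDataset are hypothetical and exist only so that the example runs without the library installed. Only the keyword arguments reference_for_compound and compound_concentration are taken from the signature above.

```python
from typing import Any, List, Sequence


def add_calibration_series(
    dataset: Any,
    chromatograms: Sequence[Any],
    compound_name: str,
    concentrations: Sequence[float],
) -> List[int]:
    """Add reference chromatograms of one compound at known concentrations.

    `dataset` is expected to behave like MoccaDataset, i.e. to provide
    add_chromatogram() with the keyword arguments documented above.
    Returns the IDs assigned by the dataset, in input order.
    """
    if len(chromatograms) != len(concentrations):
        raise ValueError("need exactly one concentration per chromatogram")
    ids = []
    for chromatogram, concentration in zip(chromatograms, concentrations):
        ids.append(
            dataset.add_chromatogram(
                chromatogram,
                reference_for_compound=compound_name,
                compound_concentration=concentration,
            )
        )
    return ids


class _RecordingDataset:
    """Hypothetical stand-in for MoccaDataset; it only records the calls."""

    def __init__(self) -> None:
        self.calls = []

    def add_chromatogram(self, chromatogram, istd_concentration=None,
                         reference_for_compound=None,
                         compound_concentration=None,
                         istd_reference=False) -> int:
        self.calls.append(
            (chromatogram, reference_for_compound, compound_concentration)
        )
        # The stand-in numbers chromatograms from 0; mocca2 may number differently.
        return len(self.calls) - 1


demo = _RecordingDataset()
ids = add_calibration_series(demo, ["run_a", "run_b"], "caffeine", [0.5, 1.0])
```

With a real MoccaDataset, the strings "run_a" and "run_b" would be replaced by Chromatogram objects.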

time() → NDArray | None: Returns time axis of the chromatograms, if there are any

wavelength() → NDArray | None: Returns wavelength axis of the chromatograms, if there are any

wavelength_raw() → NDArray | None: Returns wavelength axis of the raw data (without cropping), if there are any chromatograms

time_step() → float | None: Returns the sampling step of the time axis in the chromatograms, if there are any

wavelength_step() → float | None: Returns the sampling step of the wavelength axis in the chromatograms, if there are any

closest_time(time: float) → Tuple[int, float] | None: Returns index and value of time point that is closest to specified time, if there are any chromatograms

closest_wavelength(wavelength: float) → Tuple[int, float] | None: Returns index and value of wavelength point that is closest to specified wavelength, if there are any chromatograms
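
The (index, value) return convention of closest_time and closest_wavelength can be illustrated with a plain nearest-point search. This is a sketch of the idea only; the function closest_point is hypothetical and the actual mocca2 implementation may differ (for example in how ties are resolved).

```python
from typing import Optional, Sequence, Tuple


def closest_point(
    axis: Sequence[float], target: float
) -> Optional[Tuple[int, float]]:
    """Return (index, value) of the axis point nearest to `target`.

    Returns None for an empty axis, mirroring the documented behaviour
    when the dataset contains no chromatograms.
    """
    if len(axis) == 0:
        return None
    # On a tie, min() keeps the first (lowest) index.
    index = min(range(len(axis)), key=lambda i: abs(axis[i] - target))
    return index, axis[index]


time_axis = [0.0, 0.5, 1.0, 1.5, 2.0]  # made-up time axis
result = closest_point(time_axis, 1.2)  # nearest point is 1.0 at index 2
```

The returned index is the kind of value that methods taking an index argument, such as get_area_percent(wl_idx), expect.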

process_all(settings: ProcessingSettings, verbose: bool = True, cores: int = 1): Processes all chromatograms: finds and deconvolves peaks, creates averaged compounds, and refines peaks

get_area_percent(wl_idx: int) → Tuple[DataFrame, List[int]]

Calculates area % of deconvolved peaks at given wavelength

Parameters

wl_idx: int: index of wavelength which will be used for calculating area %

Returns

pd.DataFrame: The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds
List[int]: IDs of compounds in the same order as in DataFrame
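
Area % expresses each peak area as a share of the summed peak areas in one chromatogram. The sketch below shows that arithmetic for a single chromatogram. The function area_percent and the sample values are hypothetical, and whether mocca2 normalises over all peaks or only over peaks assigned to compounds is not stated here.

```python
from typing import Dict


def area_percent(integrals: Dict[str, float]) -> Dict[str, float]:
    """Convert peak areas at one wavelength into percentages of their sum."""
    total = sum(integrals.values())
    if total == 0:
        # No signal at all: report 0 % for every compound instead of dividing by zero.
        return {name: 0.0 for name in integrals}
    return {name: 100.0 * area / total for name, area in integrals.items()}


peak_areas = {"A": 30.0, "B": 50.0, "C": 20.0}  # made-up areas
percentages = area_percent(peak_areas)
```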

get_concentrations() → Tuple[DataFrame, List[int]]

Calculates concentrations of deconvolved peaks based on the concentration_factor specified for each compound

Returns

pd.DataFrame: The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds
List[int]: IDs of compounds in the same order as in DataFrame

get_relative_concentrations() → Tuple[DataFrame, List[int]]

Calculates concentrations of deconvolved peaks relative to internal standard.

If a compound has a concentration_factor specified, its integrals are multiplied by this factor

Returns

pd.DataFrame: The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds
List[int]: IDs of compounds in the same order as in DataFrame

get_integrals() → Tuple[DataFrame, List[int]]

Calculates integrals of deconvolved peaks.

Returns

pd.DataFrame: The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds
List[int]: IDs of compounds in the same order as in DataFrame

get_relative_integrals() → Tuple[DataFrame, List[int]]

Calculates integrals of deconvolved peaks relative to internal standard.

Returns

pd.DataFrame: The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds
List[int]: IDs of compounds in the same order as in DataFrame
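
Normalising to an internal standard means dividing each compound's integral by the integral of the internal standard in the same chromatogram. The following sketch shows this for one chromatogram. The function relative_integrals is hypothetical, and it is an assumption that no further scaling (for example by istd_concentrations) is applied; the library may handle this differently.

```python
from typing import Dict, Optional


def relative_integrals(
    integrals: Dict[int, float], istd_compound: int
) -> Optional[Dict[int, float]]:
    """Divide every integral by the integral of the internal standard.

    `integrals` maps compound ID -> integral for a single chromatogram.
    Returns None if the internal standard is missing or has zero area.
    """
    istd_area = integrals.get(istd_compound)
    if not istd_area:
        return None
    return {cid: area / istd_area for cid, area in integrals.items()}


integrals = {1: 200.0, 2: 50.0, 3: 100.0}  # made-up values; compound 3 is the ISTD
rel = relative_integrals(integrals, istd_compound=3)
```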

to_dict() → Dict[str, Any]: Converts the data to a dictionary for serialization

static from_dict(data: Dict[str, Any]) → MoccaDataset: Creates a MoccaDataset object from a dictionary

ProcessingSettings

The ProcessingSettings class contains all settings for automatic processing of batches of chromatograms. It is primarily used by the MoccaDataset class.

class mocca2.dataset.settings.ProcessingSettings(baseline_model: Literal['asls', 'arpls', 'flatfit'] = 'flatfit', baseline_smoothness: float = 1.0, min_rel_prominence: float = 0.01, min_prominence: float = 1, border_max_peak_cutoff: float = 0.1, split_threshold: float = 0.05, explained_threshold: float = 0.995, peak_model: Literal['BiGaussian', 'BiGaussianTailing', 'FraserSuzuki', 'Bemg'] = 'Bemg', max_peak_comps: int = 4, max_peak_distance: float = 1.0, min_spectrum_correl: float = 0.99, min_elution_time: float = 0.4, max_elution_time: float = 10.0, min_wavelength: float = 210.0, max_wavelength: float = 400.0, min_rel_integral: float = 0.01, relaxe_concs: bool = False)

Collection of all settings required for automatic chromatogram processing in MoccaDataset

baseline_model: Literal['asls', 'arpls', 'flatfit'] = 'flatfit': Name of baseline estimator

baseline_smoothness: float = 1.0: Smoothness penalty for baseline

min_rel_prominence: float = 0.01: Minimal relative peak height

min_prominence: float = 1: Minimal peak height

border_max_peak_cutoff: float = 0.1: Maximum relative peak height for peak cutoff

split_threshold: float = 0.05: Maximum relative height of minima between peaks to split them

explained_threshold: float = 0.995: Minimum R² for a peak to be considered resolved

peak_model: Literal['BiGaussian', 'BiGaussianTailing', 'FraserSuzuki', 'Bemg'] = 'Bemg': Model that describes the peak shape

max_peak_comps: int = 4: Maximum number of deconvolved components in single peak

max_peak_distance: float = 1.0: Maximum peak distance deviation relative to peak width for one compound

min_spectrum_correl: float = 0.99: Minimum correlation of spectra for one compound

min_elution_time: float = 0.4: Peaks with maxima before this time will not be considered

max_elution_time: float = 10.0: Peaks with maxima after this time will not be considered

min_wavelength: float = 210.0: The data will be cropped such that lower wavelengths are not included

max_wavelength: float = 400.0: The data will be cropped such that higher wavelengths are not included

min_rel_integral: float = 0.01: Minimum integral relative to the largest peak

relaxe_concs: bool = False: If True, the concentrations will be relaxed to fit the calibration curve without any peak model
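
To make the filtering settings more concrete, the sketch below applies min_elution_time, max_elution_time, and min_rel_integral (with their documented defaults) to a list of simplified peaks. The Peak record and the function filter_peaks are hypothetical. The order of the two filters, and the choice to take the largest peak from within the time window, are assumptions rather than documented mocca2 behaviour.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    """Hypothetical, simplified peak record (not a mocca2 class)."""

    maximum_time: float  # time of the peak maximum
    integral: float      # peak area


def filter_peaks(
    peaks: List[Peak],
    min_elution_time: float = 0.4,
    max_elution_time: float = 10.0,
    min_rel_integral: float = 0.01,
) -> List[Peak]:
    """Keep peaks inside the elution window that are large enough."""
    # Step 1: drop peaks whose maximum lies outside the time window.
    in_window = [
        p for p in peaks
        if min_elution_time <= p.maximum_time <= max_elution_time
    ]
    if not in_window:
        return []
    # Step 2: drop peaks that are small relative to the largest remaining peak.
    largest = max(p.integral for p in in_window)
    return [p for p in in_window if p.integral >= min_rel_integral * largest]


peaks = [
    Peak(0.2, 500.0),   # too early
    Peak(1.5, 1000.0),  # kept, largest peak in the window
    Peak(3.0, 5.0),     # below 1 % of the largest peak
    Peak(4.0, 20.0),    # kept
    Peak(12.0, 800.0),  # too late
]
kept = filter_peaks(peaks)
```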

to_yaml() → str: Converts self to YAML string

to_dict() → Dict[str, Any]: Converts the data to a dictionary for serialization

static from_dict(data: Dict[str, Any]) → ProcessingSettings: Creates a ProcessingSettings object from a dictionary
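
The to_dict / from_dict pair supports a serialization round trip. The sketch below demonstrates the pattern with a stand-in dataclass, SettingsSketch, which copies a few of the documented fields and defaults but is not the mocca2 class. Using JSON as the storage format is an assumption made for illustration; the real class also offers to_yaml.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class SettingsSketch:
    """Stand-in with a subset of the documented fields (not the mocca2 class)."""

    baseline_model: str = "flatfit"
    peak_model: str = "Bemg"
    max_peak_comps: int = 4
    min_wavelength: float = 210.0
    max_wavelength: float = 400.0

    def to_dict(self) -> Dict[str, Any]:
        """Convert the settings to a plain dictionary."""
        return asdict(self)

    @staticmethod
    def from_dict(data: Dict[str, Any]) -> "SettingsSketch":
        """Rebuild the settings from a dictionary produced by to_dict()."""
        return SettingsSketch(**data)


original = SettingsSketch(peak_model="FraserSuzuki", max_wavelength=350.0)
text = json.dumps(original.to_dict())                   # serialize
restored = SettingsSketch.from_dict(json.loads(text))   # deserialize
```

Because every field here is a plain string or number, the dictionary survives the JSON round trip unchanged.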