Module dataset

The MoccaDataset class provides a high-level interface for managing datasets of chromatograms and libraries of compounds.

MoccaDataset

class mocca2.dataset.dataset.MoccaDataset

Collection of chromatograms, compounds, and other information

chromatograms: Dict[int, Chromatogram]

Chromatograms in the dataset [id -> Chromatogram]

compounds: Dict[int, Compound]

Compounds by ID [id -> Compound]

compound_references: Dict[int, Tuple[str, float | None]]

If a chromatogram is a reference for a compound, this is stored here as [chromatogram id -> (compound name, concentration or None)]

istd_concentrations: Dict[int, float]

Concentrations of internal standard [chromatogram id -> concentration]

istd_compound: int | None

ID of the compound that serves as the internal standard

settings: ProcessingSettings | None

Settings for automatic chromatogram processing

add_chromatogram(chromatogram: Chromatogram, istd_concentration: float | None = None, reference_for_compound: str | None = None, compound_concentration: float | None = None, istd_reference: bool = False) int

Adds a chromatogram to the dataset and returns the assigned ID

Parameters

chromatogram: Chromatogram

the chromatogram that is added

istd_concentration: float | None

if an internal standard is present, specify its concentration

reference_for_compound: str | None

if this chromatogram is a reference for a compound, specify the compound name

compound_concentration: float | None

if this chromatogram is a reference for a compound with known concentration, specify the concentration

istd_reference: bool = False

specify whether this chromatogram is a reference for the internal standard
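
The bookkeeping described by these parameters can be illustrated with a minimal sketch. It does not use mocca2: `TinyDataset` is a hypothetical stand-in, the consecutive integer ID scheme is an assumption, and `istd_reference` is left out because its effect is not described here.

```python
from typing import Dict, Optional, Tuple


class TinyDataset:
    """Hypothetical stand-in that mimics only the dictionaries described above."""

    def __init__(self) -> None:
        self.chromatograms: Dict[int, object] = {}
        self.compound_references: Dict[int, Tuple[str, Optional[float]]] = {}
        self.istd_concentrations: Dict[int, float] = {}

    def add_chromatogram(
        self,
        chromatogram: object,
        istd_concentration: Optional[float] = None,
        reference_for_compound: Optional[str] = None,
        compound_concentration: Optional[float] = None,
    ) -> int:
        # Assumption: IDs are consecutive integers; the real scheme may differ.
        new_id = max(self.chromatograms, default=-1) + 1
        self.chromatograms[new_id] = chromatogram
        # Store the internal standard concentration only when one is given.
        if istd_concentration is not None:
            self.istd_concentrations[new_id] = istd_concentration
        # Store the reference as (compound name, concentration or None).
        if reference_for_compound is not None:
            self.compound_references[new_id] = (
                reference_for_compound,
                compound_concentration,
            )
        return new_id


dataset = TinyDataset()
sample_id = dataset.add_chromatogram("sample run", istd_concentration=0.5)
reference_id = dataset.add_chromatogram(
    "reference run", reference_for_compound="caffeine", compound_concentration=1.0
)
```

Strings stand in for Chromatogram objects here, and the compound name is made up for illustration.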

time() ndarray[Any, dtype[_ScalarType_co]] | None

Returns time axis of the chromatograms, if there are any

wavelength() ndarray[Any, dtype[_ScalarType_co]] | None

Returns wavelength axis of the chromatograms, if there are any

wavelength_raw() ndarray[Any, dtype[_ScalarType_co]] | None

Returns the wavelength axis of the raw data (without cropping), if there are any chromatograms

time_step() float | None

Returns the sampling step of the time axis in the chromatograms, if there are any

wavelength_step() float | None

Returns the sampling step of the wavelength axis in the chromatograms, if there are any

closest_time(time: float) Tuple[int, float] | None

Returns index and value of time point that is closest to specified time, if there are any chromatograms

closest_wavelength(wavelength: float) Tuple[int, float] | None

Returns index and value of wavelength point that is closest to specified wavelength, if there are any chromatograms
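
The nearest-point lookup that closest_time and closest_wavelength describe can be sketched in plain Python. The helper name `closest_point` and the axis values are hypothetical, and the library itself may implement the lookup differently.

```python
from typing import Optional, Sequence, Tuple


def closest_point(axis: Sequence[float], value: float) -> Optional[Tuple[int, float]]:
    """Return (index, axis value) of the sample nearest to `value`, or None for an empty axis."""
    if len(axis) == 0:
        return None
    # Pick the index whose axis value has the smallest absolute distance to `value`.
    index = min(range(len(axis)), key=lambda i: abs(axis[i] - value))
    return index, axis[index]


time_axis = [0.0, 0.5, 1.0, 1.5, 2.0]  # made-up time points
nearest = closest_point(time_axis, 1.2)
```

Returning None for an empty axis mirrors the documented behavior when the dataset holds no chromatograms.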

process_all(settings: ProcessingSettings, verbose: bool = True, cores: int = 1)

Processes all chromatograms: finds and deconvolves peaks, creates averaged compounds, and refines peaks

get_area_percent(wl_idx: int) Tuple[DataFrame, List[int]]

Calculates area % of deconvolved peaks at given wavelength

Parameters

wl_idx: int

index of wavelength which will be used for calculating area %

Returns

pd.DataFrame

The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds

List[int]

IDs of compounds in the same order as in DataFrame
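
As a rough illustration of what area % means, the sketch below normalizes peak areas at one wavelength so that they sum to 100. The function name, the compound names, and the handling of an all-zero row are assumptions, not the library's implementation.

```python
from typing import Dict


def area_percent(peak_areas: Dict[str, float]) -> Dict[str, float]:
    """Express each peak area as a percentage of the total area."""
    total = sum(peak_areas.values())
    if total == 0:
        # Assumption: report zeros rather than dividing by zero.
        return {name: 0.0 for name in peak_areas}
    return {name: 100.0 * area / total for name, area in peak_areas.items()}


areas_at_wavelength = {"compound A": 30.0, "compound B": 10.0}  # made-up areas
percentages = area_percent(areas_at_wavelength)
```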

get_concentrations() Tuple[DataFrame, List[int]]

Calculates concentrations of deconvolved peaks based on the concentration_factor specified for each compound

Returns

pd.DataFrame

The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds

List[int]

IDs of compounds in the same order as in DataFrame

get_relative_concentrations() Tuple[DataFrame, List[int]]

Calculates concentrations of deconvolved peaks relative to internal standard.

If a compound has a concentration_factor specified, its integrals are multiplied by this factor

Returns

pd.DataFrame

The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds

List[int]

IDs of compounds in the same order as in DataFrame
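
One way to read this description is sketched below: the peak integral is divided by the internal standard integral, and the result is multiplied by concentration_factor when one is available. This is an assumption about the arithmetic; in particular, whether the library also scales by the internal standard concentration is not stated here. The helper name is hypothetical.

```python
from typing import Optional


def relative_amount(
    integral: float,
    istd_integral: float,
    concentration_factor: Optional[float] = None,
) -> float:
    """Integral relative to the internal standard, optionally scaled by a factor."""
    relative = integral / istd_integral
    if concentration_factor is not None:
        # Applied only when the compound has a factor specified.
        relative *= concentration_factor
    return relative


without_factor = relative_amount(50.0, 100.0)
with_factor = relative_amount(50.0, 100.0, concentration_factor=2.0)
```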

get_integrals() Tuple[DataFrame, List[int]]

Calculates integrals of deconvolved peaks.

Returns

pd.DataFrame

The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds

List[int]

IDs of compounds in the same order as in DataFrame

get_relative_integrals() Tuple[DataFrame, List[int]]

Calculates integrals of deconvolved peaks relative to internal standard.

Returns

pd.DataFrame

The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds

List[int]

IDs of compounds in the same order as in DataFrame

to_dict() Dict[str, Any]

Converts the data to a dictionary for serialization

static from_dict(data: Dict[str, Any]) MoccaDataset

Creates a MoccaDataset object from a dictionary
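
The to_dict / from_dict pair suggests a round trip through a plain dictionary, for example for JSON storage. The sketch below shows that pattern on a hypothetical `TinyRecord` class; the actual keys and nesting produced by MoccaDataset.to_dict are not documented here and are not reproduced.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class TinyRecord:
    """Hypothetical class used only to illustrate the round-trip pattern."""

    name: str
    istd_compound: Optional[int] = None

    def to_dict(self) -> Dict[str, Any]:
        # Convert the fields to a plain dictionary.
        return asdict(self)

    @staticmethod
    def from_dict(data: Dict[str, Any]) -> "TinyRecord":
        # Rebuild the object from the dictionary produced by to_dict.
        return TinyRecord(**data)


original = TinyRecord(name="demo", istd_compound=3)
# Serialize to a JSON string and back, then rebuild the object.
restored = TinyRecord.from_dict(json.loads(json.dumps(original.to_dict())))
```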

ProcessingSettings

The ProcessingSettings class contains all settings for automatic batch processing of chromatograms. It is primarily used by the MoccaDataset class.

class mocca2.dataset.settings.ProcessingSettings(baseline_model: Literal['asls', 'arpls', 'flatfit'] = 'flatfit', baseline_smoothness: float = 1.0, min_rel_prominence: float = 0.01, min_prominence: float = 1, border_max_peak_cutoff: float = 0.1, split_threshold: float = 0.05, explained_threshold: float = 0.995, peak_model: Literal['BiGaussian', 'BiGaussianTailing', 'FraserSuzuki', 'Bemg'] = 'Bemg', max_peak_comps: int = 4, max_peak_distance: float = 1.0, min_spectrum_correl: float = 0.99, min_elution_time: float = 0.4, max_elution_time: float = 10.0, min_wavelength: float = 210.0, max_wavelength: float = 400.0, min_rel_integral: float = 0.01, relaxe_concs: bool = False)

Collection of all settings required for automatic chromatogram processing in MoccaDataset

baseline_model: Literal['asls', 'arpls', 'flatfit'] = 'flatfit'

Name of baseline estimator

baseline_smoothness: float = 1.0

Smoothness penalty for baseline

min_rel_prominence: float = 0.01

Minimal relative peak height

min_prominence: float = 1

Minimal peak height

border_max_peak_cutoff: float = 0.1

Maximum relative peak height for peak cutoff

split_threshold: float = 0.05

Maximum relative height of the minimum between two peaks for the peaks to be split
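
A possible reading of this threshold is sketched below: two neighbouring maxima are separated when the valley between them is low enough relative to the peaks. Measuring the valley against the lower of the two maxima is an assumption made for this sketch, and the function name is hypothetical.

```python
def should_split(
    left_max: float,
    valley: float,
    right_max: float,
    split_threshold: float = 0.05,
) -> bool:
    """Decide whether the valley between two maxima is deep enough to split them."""
    # Assumption: the valley height is compared against the lower maximum.
    reference = min(left_max, right_max)
    return valley / reference <= split_threshold


deep_valley = should_split(100.0, 2.0, 80.0)      # valley is 2.5 % of the lower peak
shallow_valley = should_split(100.0, 20.0, 80.0)  # valley is 25 % of the lower peak
```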

explained_threshold: float = 0.995

Minimal R² for a peak to be considered resolved

peak_model: Literal['BiGaussian', 'BiGaussianTailing', 'FraserSuzuki', 'Bemg'] = 'Bemg'

Model that describes the peak shape

max_peak_comps: int = 4

Maximum number of deconvolved components in a single peak

max_peak_distance: float = 1.0

Maximum peak distance deviation relative to peak width for one compound

min_spectrum_correl: float = 0.99

Minimum correlation of spectra for one compound
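
To make the threshold concrete, the sketch below compares spectra with the Pearson correlation coefficient. Using Pearson correlation is an assumption (the library may use a different similarity measure), and the spectra are made-up values.

```python
import math
from typing import Sequence


def spectrum_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long spectra."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(spread_a * spread_b)


spectrum_a = [0.1, 0.8, 1.0, 0.4]
spectrum_b = [2.0 * value for value in spectrum_a]  # same shape, different intensity
spectrum_c = [1.0, 0.4, 0.1, 0.8]                   # different shape

same_shape = spectrum_correlation(spectrum_a, spectrum_b)
different_shape = spectrum_correlation(spectrum_a, spectrum_c)
```

Because the correlation ignores overall intensity, a scaled copy of a spectrum passes a 0.99 threshold while a differently shaped spectrum does not.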

min_elution_time: float = 0.4

Peaks with maxima before this time will not be considered

max_elution_time: float = 10.0

Peaks with maxima after this time will not be considered
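
The elution window defined by min_elution_time and max_elution_time can be pictured as a simple filter on peak maxima. The helper below is hypothetical, and treating the window boundaries as inclusive is an assumption.

```python
from typing import List, Sequence


def within_elution_window(
    peak_maximum_times: Sequence[float],
    min_elution_time: float = 0.4,
    max_elution_time: float = 10.0,
) -> List[float]:
    """Keep only peak maxima that fall inside the elution window."""
    return [
        time
        for time in peak_maximum_times
        if min_elution_time <= time <= max_elution_time
    ]


kept_maxima = within_elution_window([0.2, 1.5, 9.9, 12.0])  # made-up times
```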

min_wavelength: float = 210.0

The data will be cropped such that lower wavelengths are not included

max_wavelength: float = 400.0

The data will be cropped such that higher wavelengths are not included
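
Cropping to the range between min_wavelength and max_wavelength can be sketched as follows. The data layout (one row of absorbance values per wavelength), the inclusive boundaries, and the function name are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple


def crop_wavelengths(
    wavelengths: Sequence[float],
    absorbance: Sequence[Sequence[float]],
    min_wavelength: float = 210.0,
    max_wavelength: float = 400.0,
) -> Tuple[List[float], List[Sequence[float]]]:
    """Drop wavelengths outside the range, together with their data rows."""
    # Indices of the wavelengths that stay after cropping.
    keep = [
        index
        for index, wavelength in enumerate(wavelengths)
        if min_wavelength <= wavelength <= max_wavelength
    ]
    return [wavelengths[i] for i in keep], [absorbance[i] for i in keep]


raw_wavelengths = [200.0, 250.0, 300.0, 450.0]  # made-up axis
raw_absorbance = [[1, 2], [3, 4], [5, 6], [7, 8]]  # one row per wavelength
cropped_wavelengths, cropped_absorbance = crop_wavelengths(raw_wavelengths, raw_absorbance)
```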

min_rel_integral: float = 0.01

Minimum integral relative to the largest peak
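
A minimal sketch of this filter, assuming that peaks whose integral falls below min_rel_integral times the largest integral are discarded. The function name and the peak labels are hypothetical.

```python
from typing import Dict


def filter_small_peaks(
    integrals: Dict[str, float], min_rel_integral: float = 0.01
) -> Dict[str, float]:
    """Keep peaks whose integral is at least a given fraction of the largest one."""
    if not integrals:
        return {}
    largest = max(integrals.values())
    return {
        name: value
        for name, value in integrals.items()
        if value >= min_rel_integral * largest
    }


peak_integrals = {"main": 1000.0, "minor": 50.0, "noise": 5.0}  # made-up integrals
kept_peaks = filter_small_peaks(peak_integrals)
```

With the default of 0.01, the cutoff in this example is 1 % of the largest integral, so only the smallest peak is dropped.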

relaxe_concs: bool = False

If True, the concentrations will be relaxed to fit the calibration curve without any peak model

to_yaml() str

Converts self to YAML string

to_dict() Dict[str, Any]

Converts the data to a dictionary for serialization

static from_dict(data: Dict[str, Any]) ProcessingSettings

Creates a ProcessingSettings object from a dictionary