Module dataset
The class MoccaDataset provides a high-level interface for managing datasets of chromatograms and libraries of compounds.
MoccaDataset
- class mocca2.dataset.dataset.MoccaDataset
Collection of chromatograms, compounds, and other information
- chromatograms: Dict[int, Chromatogram]
Chromatograms in the dataset [id -> Chromatogram]
- compound_references: Dict[int, Tuple[str, float | None]]
If a chromatogram is a reference for a compound, this will be stored here as [chromatogram id -> (compound name, concentration?)]
- istd_concentrations: Dict[int, float]
Concentrations of internal standard [chromatogram id -> concentration]
- istd_compound: int | None
ID of compound that is internal standard
- settings: ProcessingSettings | None
Settings for automatic chromatogram processing
- add_chromatogram(chromatogram: Chromatogram, istd_concentration: float | None = None, reference_for_compound: str | None = None, compound_concentration: float | None = None, istd_reference: bool = False) int
Adds chromatogram to the dataset, returns the assigned ID
Parameters
- chromatogram: Chromatogram
the chromatogram that is added
- istd_concentration: float | None
if internal standard is present, specify concentration
- reference_for_compound: str | None
if this chromatogram contains compound reference, specify compound name
- compound_concentration: float | None
if this chromatogram contains compound reference with known concentration, specify concentration
- istd_reference: bool = False
specify whether this chromatogram is reference for internal standard
- time() ndarray[Any, dtype[_ScalarType_co]] | None
Returns time axis of the chromatograms, if there are any
- wavelength() ndarray[Any, dtype[_ScalarType_co]] | None
Returns wavelength axis of the chromatograms, if there are any
- wavelength_raw() ndarray[Any, dtype[_ScalarType_co]] | None
Returns wavelength axis of the raw data (without cropping), if there are any chromatograms
- time_step() float | None
Returns the sampling step of the time axis in the chromatograms, if there are any
- wavelength_step() float | None
Returns the sampling step of the wavelength axis in the chromatograms, if there are any
- closest_time(time: float) Tuple[int, float] | None
Returns index and value of time point that is closest to specified time, if there are any chromatograms
- closest_wavelength(wavelength: float) Tuple[int, float] | None
Returns index and value of wavelength point that is closest to specified wavelength, if there are any chromatograms
- process_all(settings: ProcessingSettings, verbose: bool = True, cores: int = 1)
Processes all chromatograms: finds and deconvolves peaks, creates averaged compounds, and refines peaks
- get_area_percent(wl_idx: int) Tuple[DataFrame, List[int]]
Calculates area % of deconvolved peaks at given wavelength
Parameters
- wl_idx: int
index of wavelength which will be used for calculating area %
Returns
- pd.DataFrame
The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds
- List[int]
IDs of compounds in the same order as in DataFrame
- get_concentrations() Tuple[DataFrame, List[int]]
Calculates concentrations of deconvolved peaks based on specified concentration_factor
Returns
- pd.DataFrame
The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds
- List[int]
IDs of compounds in the same order as in DataFrame
- get_relative_concentrations() Tuple[DataFrame, List[int]]
Calculates concentrations of deconvolved peaks relative to internal standard.
If compound has concentration_factor specified, the integrals are multiplied by this factor
Returns
- pd.DataFrame
The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds
- List[int]
IDs of compounds in the same order as in DataFrame
- get_integrals() Tuple[DataFrame, List[int]]
Calculates integrals of deconvolved peaks.
Returns
- pd.DataFrame
The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds
- List[int]
IDs of compounds in the same order as in DataFrame
- get_relative_integrals() Tuple[DataFrame, List[int]]
Calculates integrals of deconvolved peaks relative to internal standard.
Returns
- pd.DataFrame
The columns of the dataframe are: ‘Chromatogram ID’, ‘Chromatogram’, and names of compounds
- List[int]
IDs of compounds in the same order as in DataFrame
- to_dict() Dict[str, Any]
Converts the data to a dictionary for serialization
- static from_dict(data: Dict[str, Any]) MoccaDataset
Creates a MoccaDataset object from a dictionary
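The quantities reported by get_area_percent, get_integrals, and get_relative_integrals can be illustrated with plain arithmetic. The sketch below does not call mocca2. It assumes that area % is each peak integral divided by the sum of all peak integrals in the same chromatogram, times 100, and that a relative integral is a peak integral divided by the integral of the internal standard. The compound names, the integral values, and both helper functions are hypothetical.

```python
from typing import Dict

# Hypothetical peak integrals for one chromatogram: compound name -> integral.
integrals: Dict[str, float] = {"istd": 50.0, "substrate": 30.0, "product": 20.0}


def area_percent(peaks: Dict[str, float]) -> Dict[str, float]:
    """Share of each peak in the total integrated area, in percent (assumed definition)."""
    total = sum(peaks.values())
    if total == 0:
        raise ValueError("total integrated area is zero")
    return {name: 100.0 * value / total for name, value in peaks.items()}


def relative_to_istd(peaks: Dict[str, float], istd: str) -> Dict[str, float]:
    """Each peak integral divided by the internal standard integral (assumed definition)."""
    reference = peaks[istd]
    if reference == 0:
        raise ValueError("internal standard integral is zero")
    return {name: value / reference for name, value in peaks.items()}


percent = area_percent(integrals)
relative = relative_to_istd(integrals, "istd")
print(percent)
print(relative)
```

Both helpers return one value per compound, which mirrors the one-column-per-compound layout of the DataFrames described above. Whether the library normalizes in exactly this way should be checked against its source.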
ProcessingSettings
The ProcessingSettings class contains all settings for automatic batch processing of chromatograms. It is primarily used by the MoccaDataset class.
- class mocca2.dataset.settings.ProcessingSettings(baseline_model: Literal['asls', 'arpls', 'flatfit'] = 'flatfit', baseline_smoothness: float = 1.0, min_rel_prominence: float = 0.01, min_prominence: float = 1, border_max_peak_cutoff: float = 0.1, split_threshold: float = 0.05, explained_threshold: float = 0.995, peak_model: Literal['BiGaussian', 'BiGaussianTailing', 'FraserSuzuki', 'Bemg'] = 'Bemg', max_peak_comps: int = 4, max_peak_distance: float = 1.0, min_spectrum_correl: float = 0.99, min_elution_time: float = 0.4, max_elution_time: float = 10.0, min_wavelength: float = 210.0, max_wavelength: float = 400.0, min_rel_integral: float = 0.01, relaxe_concs: bool = False)
Collection of all settings required for automatic chromatogram processing in MoccaDataset
- baseline_model: Literal['asls', 'arpls', 'flatfit'] = 'flatfit'
Name of baseline estimator
- baseline_smoothness: float = 1.0
Smoothness penalty for baseline
- min_rel_prominence: float = 0.01
Minimal relative peak height
- min_prominence: float = 1
Minimal peak height
- border_max_peak_cutoff: float = 0.1
Maximum relative peak height for peak cutoff
- split_threshold: float = 0.05
Maximum relative height of minima between peaks to split them
- explained_threshold: float = 0.995
Minimal R² to consider peak resolved
- peak_model: Literal['BiGaussian', 'BiGaussianTailing', 'FraserSuzuki', 'Bemg'] = 'Bemg'
Model that describes the peak shape
- max_peak_comps: int = 4
Maximum number of deconvolved components in single peak
- max_peak_distance: float = 1.0
Maximum peak distance deviation relative to peak width for one compound
- min_spectrum_correl: float = 0.99
Minimum correlation of spectra for one compound
- min_elution_time: float = 0.4
Peaks with maxima before this time will not be considered
- max_elution_time: float = 10.0
Peaks with maxima after this time will not be considered
- min_wavelength: float = 210.0
The data will be cropped such that lower wavelengths are not included
- max_wavelength: float = 400.0
The data will be cropped such that higher wavelengths are not included
- min_rel_integral: float = 0.01
Minimum integral relative to the largest peak
- relaxe_concs: bool = False
If True, the concentrations will be relaxed to fit the calibration curve without any peak model
- to_yaml() str
Converts self to YAML string
- to_dict() Dict[str, Any]
Converts the data to a dictionary for serialization
- static from_dict(data: Dict[str, Any]) ProcessingSettings
Creates a ProcessingSettings object from a dictionary
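A dictionary round trip like the one offered by to_dict and from_dict can be sketched with a standard dataclass. The class below is a hypothetical stand-in, not the library's ProcessingSettings: it copies only four fields and their defaults from the signature above, and the bodies of its to_dict and from_dict methods are assumptions about how such a round trip might work.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict


@dataclass
class SettingsSketch:
    """Hypothetical stand-in mirroring a few ProcessingSettings fields and defaults."""

    baseline_model: str = "flatfit"
    min_elution_time: float = 0.4
    max_elution_time: float = 10.0
    max_peak_comps: int = 4

    def to_dict(self) -> Dict[str, Any]:
        # Plain dict of field name -> value, suitable for JSON or YAML dumping.
        return asdict(self)

    @staticmethod
    def from_dict(data: Dict[str, Any]) -> "SettingsSketch":
        # Keep only known field names so that extra keys do not raise TypeError.
        known = {f.name for f in fields(SettingsSketch)}
        return SettingsSketch(**{k: v for k, v in data.items() if k in known})


settings = SettingsSketch(max_elution_time=8.0)
restored = SettingsSketch.from_dict(settings.to_dict())
print(restored == settings)
```

Dropping unknown keys in from_dict is a design choice of this sketch; the real class may instead reject them.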