Facilitating statistical significance in clinical research

SUPER SESI aims to enable scientists and clinicians around the globe to measure the breath composition of their patients and thereby identify biomarkers associated with a wide variety of health problems. Ultimately, the identification of breath biomarkers will enable breath diagnostics, which are non-invasive and provide results in real time.
This analyzer can produce as much as 500 Mbytes of data when measuring the breath composition of a single individual. This data is highly complex and requires extensive post-processing and data mining to differentiate the relevant biomarkers from other non-specific species that can also be found in breath.
For this reason, SUPER SESI is accompanied by dedicated software, named ARIADNA, that greatly facilitates data handling and data mining for this specific type of data. ARIADNA is currently under development.
The current version of ARIADNA is designed to assist each researcher individually. It incorporates a set of menus designed to follow the natural post-processing workflow.
Obtaining statistically significant results requires measuring large population samples. Each practitioner has access to a relatively small pool of patients, which is limited by the prevalence of the health problem being studied, the size of the hospital, the number of patients under the practitioner's care, and other factors. In particular, obtaining a large number of control subjects with features similar to those of the patients (in terms of gender, age, etc.) can be very challenging because patients are often young infants or elderly adults. As a result, even if the researcher has access to very good instrumentation, the results of the research are often inconclusive because the number of patients and controls analyzed in one clinical campaign is not sufficient.
The objective of this project is to develop post-processing software that allows different users to share their data. Each researcher can focus on a particular health problem, while the pool of control data is shared across all researchers. The total amount of data accessible to each user will increase dramatically, thereby improving the statistical significance of each particular study. Ultimately, this will allow scientists to collaborate in their quest to find reliable biomarkers.
The ARIADNA post-processing workflow includes the following steps:
(i) Definitions and medical history template: The first menu allows the researcher to define the characteristics of interest. These characteristics might include the gender and age of the patients, clinical results (such as being positive or negative for a diagnostic procedure), or other measurements that the researcher considers of interest. In this step, the researcher defines a template for the medical history that is to be used in the study.
(ii) Import individual data: This menu provides a tool to import the data acquired with the mass spectrometer, and allows the researcher to fill in the medical history using the template defined in menu (i). Once this information is complete, it can be uploaded to the database. This menu also allows the data to be edited.
(iii) Pre-process and evaluate individual data: This menu provides the tools to view and evaluate the data for each individual. This is important to make sure that the data is consistent and of adequate quality. Basic features include: (a) viewing the evolution of the total ion count, (b) viewing spectra for a set of time intervals, (c) viewing the signal evolution in time for a given set of molecular mass ranges, and (d) clustering the signals according to their profile pattern.
This menu also provides key post-processing tools required to improve the quality of the data. These tools include: (a) calibration of the spectra, (b) time smoothing, and (c) identification of unstable time intervals.
Finally, this menu includes a tool to identify exhalations and background signals and to quantify the quality of the data. Once this evaluation is completed and the operator concludes that the data meets the required quality standards, the results of the pre-processing and evaluation are stored together with each individual's data.
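As an illustration of two of the tools named in this step, time smoothing and identification of unstable time intervals, the following is a minimal sketch applied to a total ion count trace. It is not ARIADNA's implementation: the function names, the moving-average window, and the relative-deviation threshold are all hypothetical choices made for the example.

```python
from statistics import mean


def moving_average(signal, window=3):
    """Centered moving average; the window shrinks at the edges of the trace."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(mean(signal[lo:hi]))
    return smoothed


def unstable_intervals(signal, window=3, max_rel_dev=0.2):
    """Return (start, end) index pairs, end exclusive, where the raw signal
    departs from its smoothed value by more than max_rel_dev (relative)."""
    smoothed = moving_average(signal, window)
    intervals = []
    start = None
    for i, (raw, smooth) in enumerate(zip(signal, smoothed)):
        unstable = abs(raw - smooth) > max_rel_dev * abs(smooth)
        if unstable and start is None:
            start = i  # an unstable stretch begins here
        elif not unstable and start is not None:
            intervals.append((start, i))  # the stretch ended just before i
            start = None
    if start is not None:
        intervals.append((start, len(signal)))
    return intervals


# Hypothetical total ion count trace with a single spike at index 4.
total_ion_count = [100, 100, 100, 100, 300, 100, 100, 100]
print(unstable_intervals(total_ion_count))  # prints [(3, 6)]
```

The flagged interval is wider than the spike itself because the moving average spreads the spike over its neighbours; a real tool would likely use a more robust baseline, such as a running median.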
(iv) Campaign definition: This menu provides a set of filters and Boolean tools that the researcher can use to define the categories that separate the subgroups to be differentiated. The filters also allow the data mining algorithms to be applied to a subgroup of interest.
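The combination of filters and Boolean tools described in this step can be sketched as composable predicates over medical-history records. The `Subject` fields, the combinator names, and the age cut-off below are assumptions made for illustration; the source does not specify how ARIADNA represents filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    # Hypothetical medical-history fields, for illustration only.
    subject_id: str
    gender: str
    age: int
    test_positive: bool


def all_of(*predicates):
    """Boolean AND of several filters."""
    return lambda subject: all(p(subject) for p in predicates)


def any_of(*predicates):
    """Boolean OR of several filters."""
    return lambda subject: any(p(subject) for p in predicates)


def negate(predicate):
    """Boolean NOT of a filter."""
    return lambda subject: not predicate(subject)


def is_elderly(subject):
    return subject.age >= 65  # assumed cut-off


def is_positive(subject):
    return subject.test_positive


def define_groups(subjects, group_filters):
    """Map each group name to the ids of the subjects passing its filter."""
    return {
        name: [s.subject_id for s in subjects if keep(s)]
        for name, keep in group_filters.items()
    }


subjects = [
    Subject("S1", "F", 70, True),
    Subject("S2", "M", 72, False),
    Subject("S3", "F", 30, False),
    Subject("S4", "M", 68, True),
]
groups = define_groups(subjects, {
    "cases": all_of(is_elderly, is_positive),
    "controls": all_of(is_elderly, negate(is_positive)),
})
print(groups)  # prints {'cases': ['S1', 'S4'], 'controls': ['S2']}
```

Keeping each filter as a plain predicate means the same expression can both define a category and restrict a later data mining run to a subgroup.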
(v & vi) Harmonization and normalization: For the selected data sets, this tool harmonizes all the data by defining the mass ranges identified in the entire population. For each mass range and each individual, this tool provides the average and the variance of the exhaled breath and the background levels. The resulting matrix is then normalized to correct for gain drifts of the instrument, which can be easily identified by measuring the intensity of calibrants.
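A minimal sketch of this harmonization and normalization idea follows, under simplifying assumptions: each subject is represented as a mapping from a mass-range label to its exhaled-signal samples (background levels are omitted), and the gain correction divides every mean by the mean intensity of a single calibrant. The data layout and function names are hypothetical.

```python
from statistics import mean, pvariance


def harmonize(population):
    """population: {subject_id: {mass_range: [intensity samples]}}.
    Returns the mass ranges found in the entire population and, per subject,
    one (mean, variance) cell per mass range, or None where it was not seen."""
    mass_ranges = sorted({m for spectra in population.values() for m in spectra})
    matrix = {}
    for subject, spectra in population.items():
        row = []
        for m in mass_ranges:
            samples = spectra.get(m)
            row.append((mean(samples), pvariance(samples)) if samples else None)
        matrix[subject] = row
    return mass_ranges, matrix


def normalize(mass_ranges, matrix, calibrant_mass):
    """Divide each subject's mean intensities by that subject's calibrant mean,
    so that a gain drift shared by all signals cancels out."""
    col = mass_ranges.index(calibrant_mass)
    normalized = {}
    for subject, row in matrix.items():
        if row[col] is None or row[col][0] == 0:
            raise ValueError(f"no usable calibrant signal for {subject}")
        calibrant_mean = row[col][0]
        normalized[subject] = [
            None if cell is None else cell[0] / calibrant_mean for cell in row
        ]
    return normalized


# Hypothetical data: subject B was measured at twice the instrument gain.
population = {
    "A": {100.0: [10, 10], 200.0: [4, 6]},
    "B": {100.0: [20, 20], 200.0: [10, 10], 300.0: [2, 2]},
}
mass_ranges, matrix = harmonize(population)
print(normalize(mass_ranges, matrix, calibrant_mass=100.0))
# prints {'A': [1.0, 0.5, None], 'B': [1.0, 0.5, 0.1]}
```

After normalization the two subjects agree on the 200.0 range even though their raw intensities differ by a factor of two, which is the effect a calibrant-based gain correction is meant to achieve.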
(vii) Unsupervised data mining: This menu includes blind tools to unravel the structure of the data, the most common of which is Principal Component Analysis. This menu also provides the option to correlate different parameters, including characteristics of the medical history, quality data, harmonization data, normalization data, and signal data.
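To make the Principal Component Analysis mentioned here concrete, the sketch below extracts only the first principal component of a small data matrix by power iteration on the covariance matrix, using the standard library alone. This is an illustrative technique chosen for brevity, not a description of ARIADNA's algorithm; in practice a numerical library would normally be used, and the function name and iteration count are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.
    rows: list of equal-length numeric sequences, one per individual."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply cov and renormalize. The all-ones
    # start is an assumption; it fails if it is orthogonal to the answer.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Project each individual onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical two-signal data lying exactly on the line y = 2x.
data = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, scores = first_principal_component(data)
```

For this data the direction converges to (1, 2) scaled to unit length, and the scores order the four individuals along that line.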
(viii) Supervised data mining: This menu includes a set of classification algorithms that are used to identify species of interest. The typical output of these algorithms is an ordered list of biomarkers and the Receiver Operating Characteristic (ROC) curve.
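The two outputs named in this step can be illustrated with a short sketch that builds a ROC curve from classifier scores and ranks candidate signals by their individual area under the curve (AUC). The source does not say which classification algorithms are used, so ranking by single-feature AUC is an assumption made for the example, as are the function names and the data.

```python
def roc_curve(scores, labels):
    """Return (false positive rate, true positive rate) points obtained by
    sweeping the decision threshold from the highest score to the lowest.
    labels: 1 for patients, 0 for controls."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one patient and one control")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def rank_biomarkers(signals, labels):
    """Order candidate signals by how far their AUC is from chance (0.5)."""
    ranked = [(name, auc(roc_curve(values, labels))) for name, values in signals.items()]
    return sorted(ranked, key=lambda item: -abs(item[1] - 0.5))


# Hypothetical normalized intensities for two mass ranges in six subjects.
labels = [1, 1, 0, 1, 0, 0]
signals = {
    "mass_range_1": [0.9, 0.8, 0.7, 0.6, 0.4, 0.2],
    "mass_range_2": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
ranking = rank_biomarkers(signals, labels)
```

Here `mass_range_1` ranks first with an AUC of 8/9, while the constant `mass_range_2` carries no information and sits at 0.5.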