We first describe the construction of classification models for diagnosing patients with bronchitis and pneumonia, then introduce the details of collecting patient audios, and finally present statistics of the patient audios. Table 1 shows the notations used in this paper.
We use the training set to learn a cough sound classification model and then use the model to predict the test data. We formalize the problem of disease classification as follows. Let $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ denote the training set, where $(x_i, y_i)$ is a sample, $i = 1, 2, \ldots, N$. Let $x_i$ denote the features that the model uses for classification, and let $y_i \in \{-1, +1\}$ denote the diagnostic result of $x_i$: $y_i = -1$ if $x_i$ is a negative case (bronchitis) and $y_i = +1$ otherwise.
In the learning process, the learning system uses the training set $T$ to learn a classification decision function, denoted by $Y = f(X)$, which describes the mapping between inputs and outputs.
In the prediction process, for an input $x_{N+1}$ in the given test set, the prediction system outputs a classification result by applying the model.
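To make the formalization concrete, the sketch below trains a classifier on a training set $T$ and predicts the label of a new input $x_{N+1}$. It is a minimal illustration with a scikit-learn SVM and randomly generated placeholder features, not the exact pipeline used in our experiments.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder training set T = {(x_i, y_i)}: N samples of 20-dim features,
# with labels in {-1, +1} (-1 = bronchitis, +1 = pneumonia)
N = 100
X_train = np.random.rand(N, 20)
y_train = np.random.choice([-1, 1], size=N)

# Learning process: fit the classification decision function Y = f(X)
model = SVC(kernel="rbf")
model.fit(X_train, y_train)

# Prediction process: classify a new input x_{N+1}
x_new = np.random.rand(1, 20)
print(model.predict(x_new))  # outputs -1 or +1
```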
The patient audios used in this work are collected from the West China Second University Hospital, Sichuan University, Chengdu, China. The Ethics Committees of West China Second University Hospital approved the study and the verbal consent procedures. Verbal informed consent was obtained from the legal guardians of all participants and recorded with the audio recorder. We collect 173 audios from 91 bronchitis patients (51 male, 40 female; 1 acute asthmatic bronchitis, 13 acute bronchiolitis, 76 acute bronchitis) and 82 pneumonia patients (43 male, 39 female; 1 lobar pneumonia, 81 bronchopneumonia), aged from 0 to 11 years. Bronchitis and pneumonia were diagnosed according to Zhu Futang Practice of Pediatrics (8th Edition) [14]. Fig 2 shows the age distributions of bronchitis, pneumonia, and all patients; the proportion for each age group is the ratio of the number of patients in that group to the total number of patients. As shown in Fig 3, patient audios are collected in a pediatric consulting room as MP3 files. The distance between the recorder and the patient’s mouth varies from 20 to 40 cm, and the average audio duration per patient is 3.92 s. In addition, children are usually accompanied by their families, who may introduce extra noise into the recordings.
Table 2 shows the detailed statistics of the patient audios; each disease accounts for about half of the dataset. The age distribution approximately follows a power-law distribution, and children under the age of one account for more than 50% of the patients.
Structure of the feature aggregation framework
The core of the framework is the aggregation operation, which obtains the patient’s features. Fig 4 shows the structure of the feature aggregation framework. We first record the patients, then perform noise reduction and normalization. Next, we segment the patient audios into several cough chunks and apply three data augmentation techniques to them. We then extract MFCC features from the cough chunks. Finally, we train a classifier to distinguish pneumonia from bronchitis.
Data pre-processing. We start by improving the signal-to-noise ratio (SNR) of the patient audios. We first convert the patient audios into WAV format at a 44.1 kHz sampling frequency with 16 bits per sample. Fig 5A shows the waveform of the original patient audio. We adopt Log-MMSE [15], a frequently used speech enhancement algorithm [16], to improve the SNR. It minimizes the mean square error of the log-spectral amplitude, resulting in a much lower residual noise level without further distorting the patient audio itself. In addition, we normalize the amplitude of the patient audios by limiting the peak amplitude to -0.1 dB. Fig 5B shows the waveform after speech enhancement and normalization.
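A minimal pre-processing sketch is shown below. It assumes the PyPI `logmmse` package as a stand-in for the Log-MMSE enhancement of [15]; the file names are placeholders.

```python
import numpy as np
from scipy.io import wavfile
from logmmse import logmmse  # assumes the PyPI `logmmse` package

# Load a patient audio already converted to 44.1 kHz, 16-bit WAV
rate, data = wavfile.read("patient_001.wav")

# Log-MMSE speech enhancement to improve the SNR
enhanced = logmmse(data, rate)

# Peak-normalize to -0.1 dB relative to int16 full scale
peak = np.max(np.abs(enhanced.astype(np.float64)))
target = (10 ** (-0.1 / 20)) * 32767
normalized = (enhanced.astype(np.float64) * target / peak).astype(np.int16)
wavfile.write("patient_001_clean.wav", rate, normalized)
```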
Fig 5. Example of preprocessing and patient audio segmentation waveforms. (A) Waveform of the original patient audio; (B) waveform after noise reduction and normalization; (C) waveform after segmentation of patient audios according to the energy threshold.
Patient audio segmentation. After data pre-processing, the patient audios still contain people’s speaking voices, so we further need to segment them into cough chunks. There are many audio segmentation algorithms. A widely adopted approach is based on the Bayesian Information Criterion (BIC), applied within a sliding variable-size analysis window together with smoothing rules. The sliding variable-size analysis window classifies each one-second window into an audio class based on audio signal features, and the smoothing rules then segment the audio stream into speech, music, environmental sound, and silence [17,18]. Auditok is fast and works well for audio streams with low background noise (e.g., few people talking) [19]. Auditok uses a signal energy threshold to detect valid audio events, i.e., those whose signal energy is equal to or above the threshold. The energy of the audio signal is the log energy, computed as $E = 10 \log_{10}\left(\frac{1}{N}\sum_{i=1}^{N} a_i^2\right)$, where $a_i$ is the $i$th audio sample and $N$ is the number of audio samples in the data. Since our dataset is collected in a quiet environment with few people, we use the auditok toolkit to segment the patient audios. In the data collection process, the distance between the recorder and the patient’s mouth is no more than 40 cm. Statistical analysis gives us an energy threshold that separates cough sounds from speaking voices, and we use this threshold to segment the cough sounds. Fig 5C shows the waveform after segmentation according to the threshold: cough chunks are retained and speaking voices are discarded. In the future, we could use a sliding variable-size analysis window to perform segmentation in complex acoustic environments.
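The sketch below shows how such energy-based segmentation can be done with auditok; the threshold of 55 dB and the duration bounds are illustrative assumptions, not the values derived from our statistical analysis.

```python
import auditok

# Split a pre-processed patient audio into cough chunks by log-energy threshold.
regions = auditok.split(
    "patient_001_clean.wav",
    min_dur=0.2,          # shortest valid event, in seconds (assumed)
    max_dur=4.0,          # longest valid event (assumed)
    max_silence=0.3,      # longest tolerated silence inside an event (assumed)
    energy_threshold=55,  # events below this log energy are discarded (assumed)
)

for i, region in enumerate(regions):
    region.save(f"cough_chunk_{i}.wav")
```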
Data augmentation. Medical data is private and therefore expensive to collect, while deep learning relies on large-scale datasets. Data augmentation aims to increase the number and diversity of training samples to improve the robustness of deep learning models. We term the cough sounds collected in the hospital the RAW dataset. We adopt three data augmentation techniques on the RAW dataset: time shifting [20], pitch shifting [21], and noise adding [22]. Fig 6 shows the waveforms of the cough chunks after the three data augmentation methods, where the cough chunks are derived from the selection above.
Fig 6. Waveforms of the cough chunks under three data augmentation methods.
Time shifting. We adopt time shifting to increase the number of samples in the RAW dataset. This operation can be seen as deleting a small portion of cough sound information to obtain new samples. Time shifting deletes between 0 and 0.1 s of information at the beginning or end of a cough chunk and fills the gap with a fixed-frequency signal to keep the duration unchanged.
Pitch shifting. We also adopt pitch shifting to increase the number of samples in the RAW dataset. We raise the pitch of the cough chunks by up to five semitones, i.e., we turn up the frequency: the higher the pitch, the higher the frequency. By applying pitch shifting to the raw cough chunks, we obtain new cough sounds.
Noise adding (white/pink noise). To increase the diversity of samples in the RAW dataset, we mix noise with the original sounds. This operation can be seen as changing the SNR distribution of each cough chunk. We mix each cough chunk with white noise or pink noise. White noise covers a wide range of noise characteristics, while pink noise is the most common noise in nature and can simulate traffic sound. We therefore mix white and pink noise into the cough chunks to obtain new cough sounds. A code sketch of the three augmentations follows below.
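The sketch uses librosa and numpy; the zero padding for time shifting and the white-noise amplitude of 0.005 are assumptions (the paper fills the gap with a fixed-frequency signal and also uses pink noise).

```python
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("cough_chunk_0.wav", sr=44100)

# Time shifting: delete up to 0.1 s from the start, pad the end to keep
# the duration unchanged (zero padding here; the paper fills a fixed frequency)
cut = np.random.randint(1, int(0.1 * sr))
time_shifted = np.pad(y[cut:], (0, cut))

# Pitch shifting: raise the pitch by up to five semitones
pitch_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(0, 5))

# Noise adding: mix in white noise (amplitude 0.005 is an illustrative choice)
noisy = y + 0.005 * np.random.randn(len(y))

for name, aug in [("time", time_shifted), ("pitch", pitch_shifted), ("noise", noisy)]:
    sf.write(f"cough_chunk_0_{name}.wav", aug, sr)
```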
Feature extraction. In speech recognition, there are many feature extraction methods [23–25]. We extract Mel frequency cepstral coefficients (MFCC) [26] from the cough chunks using a non-parametric FFT-based approach. MFCC describes the energy distribution of a signal in the frequency domain. The dimension of MFCC is the number of leading coefficients retained after the discrete cosine transform (DCT); because the DCT concentrates most of the signal information in the low-order coefficients, it suffices to keep the leading coefficients and discard the redundant data. MFCC is frequently used as an acoustic feature to assess pathological voice quality [27,28]. This study uses a 20-dimensional feature vector consisting of the log energy and 19 Mel frequency cepstral coefficients. Computing MFCC involves four steps: (1) divide the audio stream into overlapping frames; (2) perform an FFT on each frame to obtain the frequency spectrum; (3) map the spectrum onto the Mel scale with a filter bank and take the logarithm of the filter-bank energies; (4) apply the DCT to the log Mel spectrum.
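The feature extraction can be sketched with librosa as follows; the frame length and hop size are assumptions, since the paper does not specify them.

```python
import numpy as np
import librosa

y, sr = librosa.load("cough_chunk_0.wav", sr=44100)

# 19 Mel frequency cepstral coefficients per frame (FFT-based)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19)  # shape: (19, n_frames)

# Per-frame log energy; frame_length/hop_length are assumed values
frames = librosa.util.frame(y, frame_length=2048, hop_length=512)
log_energy = np.log(np.sum(frames.astype(np.float64) ** 2, axis=0) + 1e-10)

# 20-dimensional feature vector per frame: log energy + 19 MFCCs
n = min(len(log_energy), mfcc.shape[1])
features = np.vstack([log_energy[np.newaxis, :n], mfcc[:, :n]])
```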
Aggregation operation. Some cough chunks carry information about the disease, while others do not. In reality, we do not know the labels of individual cough chunks in the dataset; we only have the patient audios and their disease labels. Thus we want to analyze all the cough chunks comprehensively. We first intuitively concatenate the cough chunks along the time axis: starting from 0 s, if the first cough chunk lasts $t_1$ s, then the second cough chunk starts at $t_1$ s, immediately following the first, and so on. With this concatenation, SVM [29], XGBoost [30], RF [31], LSTM [32], RNN [33], and GRU [34] achieve classification accuracies of 69.79%, 64.38%, 67.5%, 66.46%, 62.5%, and 66.25%, respectively, over 45 random tests on the RAW dataset. These results leave much room for improvement, so we further take into account the interconnection of the cough chunks. The binary classification result is determined by the distribution of features. For example, as shown in the third part of Fig 1, the features of all cough chunks are distributed in the same feature space, and the hyperplane divides them into two classes. If multiple cough chunks of one patient audio are mapped into the same class and lie far from the hyperplane, the patient is more likely to belong to that class. Furthermore, the mean value of all cough chunks (i.e., the center value in Fig 1) can comprehensively represent them. We therefore use the mean of the features of all cough chunks in one patient audio as the patient’s disease feature. Formally, the feature extracted from the $i$th cough chunk of patient $k$ can be defined as $x_{ki} = (x_{ki}^{(1)}, x_{ki}^{(2)}, \ldots, x_{ki}^{(20)})$, a 20-dimensional feature vector, where $x_{ki}^{(t)}$ is the $t$th feature of $x_{ki}$. From the above, we set the feature of patient $k$ as follows:
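As a sketch, the aggregation reduces to an element-wise mean over the per-chunk feature vectors of one patient; the feature values below are placeholders.

```python
import numpy as np

def patient_feature(chunk_features):
    """Aggregate the per-chunk feature vectors of one patient into a single
    disease feature by taking the element-wise mean (the center value in Fig 1)."""
    return np.asarray(chunk_features).mean(axis=0)

# Placeholder: three 20-dimensional cough-chunk features x_k1, x_k2, x_k3
chunks = np.random.rand(3, 20)
x_k = patient_feature(chunks)  # 20-dimensional feature of patient k
```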