Acoustic Scene Classification using CNN

Acoustic Scene Classification (ASC) aims to distinguish between different acoustic environments and is a technology that smart devices can use for contextualisation and personalisation. In this document we propose to implement the algorithm of Battaglino, Lepauloux and Evans (2016), which uses a Convolutional Neural Network (CNN), to tackle this task.

Domain background

Acoustic Scene Classification (ASC) is the task of classifying audio samples on the basis of their soundscape. It consists of using acoustic information to infer the context of the recorded environment. The basic idea is to analyse a recording or a real-time audio stream and assign it to one of a set of predetermined categories. For example, a recording composed of unintelligible conversations (babble) and the sounds of plates, glasses and cutlery can be categorized as a restaurant scene. Enabling devices to make sense of their environment through the analysis of sounds is one of the main objectives of machine listening, a broad area of investigation related to Computational Auditory Scene Analysis (CASA) (see Wang & Brown, 2006). Machine listening systems perform processing tasks analogous to those of the human auditory system and are part of a wider research theme linking fields such as machine learning, robotics and artificial intelligence.

ASC refers to the task of associating a semantic label with an audio stream, where the label identifies the environment in which the stream was produced. There are two main areas in ASC. The first aims to understand the human cognitive process that enables the understanding of acoustic scenes. The second concerns computational algorithms that attempt to perform this task automatically, using signal processing and machine learning methods to identify the environment where the recording was made. We will focus on the latter in this proposal.

The applications are numerous. ASC is currently used to offer better voice communication on smartphones, and in research to categorize several types of environments or animals. In our case, the motivation is the development of an open source hearing aid system. Current manufacturers are working with research institutions to use machine learning to improve automatic parameter fitting and offer a better experience to users. Here, using machine learning to predict the type of environment would allow the signal processing algorithms to be updated in real time to fit the current listening situation, according to the impairment of the user.

Problem statement

Classifying an acoustic scene takes several steps. Given an audio file representing the scene, we first need to extract acoustic features that are relevant for the semantic classification. These features are usually acoustic cues coming either from the signal processing domain (intensity cues, recurrence of an event, spatial information and so on) or from the psychoacoustic domain, which describes how humans perceive and process sounds. Once extracted, the features are used to train a classifier. The choice of features is crucial because no single feature can capture both the type of events occurring during the acoustic scene and how they evolve (their nature, temporality, position in space, recurrence, etc.). Likewise, even though a sound is primarily a signal evolving over time, the choice of classifier is not obvious and will depend on the nature of the features.

Dataset and inputs

The Detection and Classification of Acoustic Scenes and Events (DCASE) challenge data set (Giannoulis et al., 2013) was specially created to provide researchers with a standardized set of recordings produced in ten different urban environments. Each recording is a 30-second audio file recorded using binaural headphones in locations around London at various times in 2012 by three different people. The format of the audio files is described in Table 1.

Format | Encoding | Channels | Sample rate | Depth
WAV    | PCM      | 2        | 44.1 kHz    | 16 bits

Table 1: Format of recorded audio files.
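
The format in Table 1 can be checked programmatically before any processing. The following is a minimal sketch using Python's standard wave module. The helper names write_silence and matches_table_1 are hypothetical, and the in-memory file of silence only stands in for a real recording from the data set.

```python
import io
import wave


def write_silence(buffer, seconds=1, channels=2, rate=44100, sample_width=2):
    """Write `seconds` of PCM silence to `buffer` (a stand-in for a real clip)."""
    with wave.open(buffer, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)   # bytes per sample: 2 bytes = 16 bits
        w.setframerate(rate)
        w.writeframes(b"\x00" * (seconds * rate * channels * sample_width))


def matches_table_1(source):
    """Return True if the WAV file is uncompressed PCM, stereo, 44.1 kHz, 16 bits."""
    with wave.open(source, "rb") as w:
        return (
            w.getcomptype() == "NONE"      # "NONE" means uncompressed PCM
            and w.getnchannels() == 2
            and w.getframerate() == 44100
            and w.getsampwidth() == 2
        )


# Build a synthetic file in memory and check it against Table 1.
demo = io.BytesIO()
write_silence(demo)
demo.seek(0)
print(matches_table_1(demo))
```

With a real recording, matches_table_1 can be given a file path instead of an in-memory buffer.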

Locations were selected to represent instances of each class. The 2013 data set consists of 10 recordings of each of its ten classes, making 100 recordings in total, while the 2016 edition of the challenge targeted here uses the 15 classes described in Table 2. The training set can be downloaded from the official website (A. Mesaros, Heittola & Virtanen, 2016) and results can be tested against the private data set from the previous year (Stowell & Benetos, 2016).

Scene type
Bus - traveling by bus in the city (vehicle)
Cafe / Restaurant - small cafe/restaurant (indoor)
Car - driving or traveling as a passenger, in the city (vehicle)
City center (outdoor)
Forest path (outdoor)
Grocery store - medium size grocery store (indoor)
Home (indoor)
Lakeside beach (outdoor)
Library (indoor)
Metro station (indoor)
Office - multiple persons, typical work day (indoor)
Residential area (outdoor)
Train (traveling, vehicle)
Tram (traveling, vehicle)
Urban park (outdoor)

Table 2: Environment label categories.

Solution statement

The proposed solution is to implement the algorithm of Battaglino et al. (2016). CNNs have been successfully used in speech recognition, music analysis and event detection. As stated in their paper, the motivation for this approach lies in the potential of using raw time-frequency representations as the input [1], the replacement of hand-crafted features (such as Mel-Frequency Cepstral Coefficients, MFCCs) with automatically learned features, and the potential of capturing recurrent spectro-temporal structures. A spectrogram is obtained using a Fourier transform. Because a spectrogram is a 2D image, a CNN classifier suits the problem well: it preserves the spatial relationships in the data and can detect recurrences of an event that is translated in time or shifted in frequency.

[Image: Schematic of the algorithm]
Figure 1: The CNN architecture used in this work. The input is static and dynamic spectrograms. These are followed by two stacked convolutional and pooling layers. Fully connected and output layers produce the probabilities of the input data belonging to each acoustic class. Convolutional filters are illustrated in light orange while pooling blocks are illustrated in dark green. Reproduced from Battaglino, Lepauloux and Evans (2016).

The main structure is given in Figure 1. CNNs have a multi-layered, deep network architecture. They are in some ways a natural extension of the standard multilayer perceptron model, but with several differences. First, CNNs can handle high-dimensional data. Second, each hidden unit is connected only to a sub-region of the input data (referred to as its receptive field) and therefore captures only local structure. Lastly, CNNs are able to capture recurrent local structure in an audio stream. The architecture of the algorithm is composed of an input layer, a stack of convolutional and pooling layers, a fully connected hidden layer and a final output layer.
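
The receptive field idea can be illustrated with a toy example. The code below is not the network of Battaglino et al. (2016). It is a plain Python sketch of a single filter applied as a valid cross-correlation, with made-up 2 x 2 pattern values and hypothetical helper names. It shows that one filter gives its strongest response wherever the same local pattern occurs, whether the pattern is shifted in frequency (rows) or in time (columns).

```python
def correlate_valid(image, kernel):
    """Slide `kernel` over `image` without padding and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


def place(pattern, height, width, top, left):
    """Return a zero 'spectrogram' with `pattern` pasted at (top, left)."""
    canvas = [[0.0] * width for _ in range(height)]
    for a, row in enumerate(pattern):
        for b, value in enumerate(row):
            canvas[top + a][left + b] = value
    return canvas


def argmax2d(matrix):
    """Return the (row, column) index of the largest value in `matrix`."""
    return max((value, (i, j))
               for i, row in enumerate(matrix)
               for j, value in enumerate(row))[1]


# Made-up local pattern; the filter is set to the same values for illustration.
pattern = [[1.0, 2.0], [3.0, 4.0]]

# The same event at two different frequency/time positions (6 bands x 8 frames).
early_low = place(pattern, 6, 8, top=1, left=2)
late_high = place(pattern, 6, 8, top=3, left=5)

print(argmax2d(correlate_valid(early_low, pattern)))   # (1, 2)
print(argmax2d(correlate_valid(late_high, pattern)))   # (3, 5)
```

In a trained CNN the filter values are learned rather than fixed by hand, but the same weights are still shared across all positions of the input.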

Benchmark model and evaluation metrics

To benchmark the algorithm, it can be tested against a baseline. Here the baseline provided with the database consists of classical MFCC features and a Gaussian Mixture Model (GMM) based classifier (see Annamaria Mesaros, Heittola & Virtanen, 2016). For each acoustic scene, a GMM class model with 32 components was trained on the described features using the expectation-maximization algorithm. The testing stage uses a maximum likelihood decision among all acoustic scene class models. Classification performance is measured using accuracy: the number of correctly classified segments divided by the total number of test segments. The classification results using the cross-validation setup for the development set are given in Figure 2. Because DCASE is a challenge, the evaluation metric is the same for everyone, in order to have a common basis for comparing one algorithm to another.
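
The decision rule and the metric described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the per-class log-likelihoods are made-up numbers rather than outputs of trained GMMs, and only three classes are shown.

```python
def classify_max_likelihood(log_likelihoods):
    """Return the class whose model gives the highest log-likelihood."""
    return max(log_likelihoods, key=log_likelihoods.get)


def accuracy(predicted, reference):
    """Fraction of segments whose predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Made-up log-likelihoods of three test segments under three class models.
segment_scores = [
    {"car": -410.2, "home": -455.9, "tram": -430.0},
    {"car": -502.3, "home": -470.1, "tram": -481.7},
    {"car": -399.5, "home": -420.8, "tram": -398.9},
]
reference = ["car", "home", "car"]

predicted = [classify_max_likelihood(scores) for scores in segment_scores]
print(predicted)                       # ['car', 'home', 'tram']
print(accuracy(predicted, reference))  # 2 of 3 segments are correct
```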

[Image: Baseline results across all the classes]
Figure 2: TUT Acoustic Scenes 2016: Baseline system performance on the development set. Reproduced from Annamaria Mesaros, Heittola and Virtanen (2016).

Project design

Data transformation

Before training, the audio data will need to be transformed using a discrete Fourier transform with a 40 ms window and an overlap of 20 ms. We will then extract the static spectrograms. These are formed from magnitude spectra which are passed through a bank of 60 log-Mel filters with a maximum frequency of 22050 Hz. The dynamic spectrograms are calculated in the usual way with a time window of 9 frames. Each 30 s clip of the database is split into 25 sub-clips of 1.2 seconds duration. Each sub-clip will be represented with both static and dynamic spectrogram segments, as illustrated in Figure 3, resulting in input data of 60 bands x 60 frames.
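
The arithmetic behind these numbers can be checked with a short sketch. It is written in plain Python and makes several assumptions that are not stated in the text: a 20 ms hop (40 ms window with 20 ms overlap), the common 2595 * log10(1 + f / 700) Mel formula, the standard regression formula for the first derivative, and repetition of the edge frames as padding. All function names are hypothetical, and the Fourier transform and filtering steps themselves are left out.

```python
import math

SAMPLE_RATE = 44100   # Hz, from Table 1
WINDOW_MS = 40        # analysis window length
HOP_MS = 20           # assumed hop: 40 ms window with 20 ms overlap
SUBCLIP_MS = 1200     # 1.2 s sub-clips
CLIP_MS = 30000       # 30 s recordings
N_BANDS = 60          # number of Mel bands
F_MAX = 22050.0       # Hz, maximum filter frequency

WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_MS // 1000   # 1764 samples
HOP_SAMPLES = SAMPLE_RATE * HOP_MS // 1000         # 882 samples
FRAMES_PER_SUBCLIP = SUBCLIP_MS // HOP_MS          # 60 frames
SUBCLIPS_PER_CLIP = CLIP_MS // SUBCLIP_MS          # 25 sub-clips


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_max, f_min=0.0):
    """Edge frequencies (Hz) of n_bands filters spaced evenly on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


def delta(frames, width=9):
    """First derivative over `width` frames; `frames` is a list of band lists."""
    n = (width - 1) // 2
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for b in range(len(frames[0])):
            acc = 0.0
            for k in range(1, n + 1):
                ahead = frames[min(t + k, last)][b]   # repeat last frame at the end
                behind = frames[max(t - k, 0)][b]     # repeat first frame at the start
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def split_subclips(frames, frames_per_subclip):
    """Cut a spectrogram into consecutive, non-overlapping sub-clips."""
    n_full = len(frames) // frames_per_subclip
    return [frames[i * frames_per_subclip:(i + 1) * frames_per_subclip]
            for i in range(n_full)]


print(WINDOW_SAMPLES, HOP_SAMPLES, FRAMES_PER_SUBCLIP, SUBCLIPS_PER_CLIP)
```

Under these assumptions a 30 s clip yields 1500 frames, which split_subclips cuts into 25 segments of 60 frames, each with 60 bands.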


[Image: Example of spectrograms]
Figure 3: CNN input data is a pair of static (log-Mel) and dynamic (first derivatives, ∆) spectrograms. Each is first segmented into smaller sub-clips, illustrated in red, each then forming separate input data. Reproduced from Battaglino, Lepauloux and Evans (2016).

The CNN has two stacked pairs of convolutional and pooling layers. The first convolutional layer contains 32 filters, each of which spans 57 frequency bands and 6 frames (342 elements). This results in a set of 32 feature maps, each of 4 bands and 55 frames (220 elements). The pooling layer performs max-pooling over 2 adjacent units in both frequency and time. A second convolutional layer creates 32 feature maps using filters each of which spans 1 band and 2 frames (2 elements). The fully connected layer comprises 2000 nodes and is followed by a softmax layer which returns output probabilities for all 15 DCASE classes. Data is processed in batches of 1000 input samples and the network is trained for 100 epochs. The learning rate is set to 0.001 with an initial momentum of 0.9, which is increased linearly to 0.99 for the final epoch. More details of the implementation are given in Battaglino et al. (2016).
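
The layer sizes quoted above follow from simple shape arithmetic, sketched below in plain Python. The sketch assumes stride-1 convolutions without padding and non-overlapping pooling that discards an incomplete edge. The function names are hypothetical. The size of the second pooling layer is not stated here, so the sketch stops after the second convolution. The momentum ramp is likewise only one plausible reading of "increased linearly", with epochs counted from 0.

```python
def conv_valid(shape, kernel):
    """Output (bands, frames) of a stride-1 convolution without padding."""
    return (shape[0] - kernel[0] + 1, shape[1] - kernel[1] + 1)


def max_pool(shape, pool):
    """Output (bands, frames) of non-overlapping pooling (floor division)."""
    return (shape[0] // pool[0], shape[1] // pool[1])


def momentum_at(epoch, n_epochs=100, start=0.9, end=0.99):
    """Momentum ramped linearly from `start` (epoch 0) to `end` (last epoch)."""
    return start + (end - start) * epoch / (n_epochs - 1)


input_shape = (60, 60)                    # 60 bands x 60 frames
conv1 = conv_valid(input_shape, (57, 6))  # (4, 55), i.e. 220 elements per map
pool1 = max_pool(conv1, (2, 2))           # (2, 27)
conv2 = conv_valid(pool1, (1, 2))         # (2, 26)

print(conv1, pool1, conv2)
print(momentum_at(0), momentum_at(99))
```

The first result reproduces the 4 bands x 55 frames stated for the first convolutional layer.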


Battaglino, D., Lepauloux, L. & Evans, N. (2016). Acoustic scene classification using convolutional neural networks. In IEEE AASP.

Giannoulis, D., Stowell, D., Benetos, E., Rossignol, M., Lagrange, M. & Plumbley, M. D. (2013). A database and challenge for acoustic scene classification and event detection. Submitted to Proc. EUSIPCO.

Mesaros, A. [A.], Heittola, T. & Virtanen, T. (2016). TUT Acoustic Scenes 2016. Retrieved from

Mesaros, A. [Annamaria], Heittola, T. & Virtanen, T. (2016). TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (pp. 1128–1132). EUSIPCO 2016. Budapest, Hungary.

Stowell, D. & Benetos, E. (2016). DCASE private scene classification testing dataset. Retrieved from classification_testset

Wang, D. & Brown, G. J. (2006). Computational auditory scene analysis: principles, algorithms, and applications (D. Wang & G. J. Brown, Eds.). Wiley-IEEE Press.

  1. A time-frequency representation is a two-dimensional plane used in acoustics to observe the frequency content of an audio file over time. It is also called a spectrogram.