One thing I’ve been really interested in lately is the wide range of wildlife sounds I hear in the hills where I’ve been living. Each season brings new sounds that I’ve never heard before: bizarre bird calls, frogs croaking together in loud, intricate patterns, and yipping howls from coyotes in the field by our house late at night. Mixed together, many of these sounds form interesting patterns with defined pitches and random rhythms, somewhat reminiscent of moments in musical genres like minimal techno and ambient music.

I started recording some of these sounds and wound up with a good number of files that seemed to showcase the different characteristics of these wildlife sounds. I thought it would be interesting to figure out how to classify them using deep learning by extracting features from the audio.

Frog Song

First I needed to figure out how to extract features from the audio files and represent the data as input for a neural network. I decided to use spectral representations after reading about this approach in [1] and [2]. I used librosa to extract various features from my field recordings using a couple of different techniques.

Here is an example of a chorus of frogs croaking that I recorded in SW Washington State:

First I performed a short-time Fourier transform (STFT) and then used it to compute a power spectrogram, spectral contrast and mean Mel-frequency cepstral coefficients (MFCCs). Here is what the spectral features might look like for our frog sound:

Additionally, we can use the tonal centroid features (tonnetz) and a chromagram, which showcase the pitch and harmonic content of our audio. This is what those features look like for the frog audio:
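In code, the extraction might look roughly like this; the helper name and parameters such as `n_mfcc=40` are my own choices here, not necessarily what the repo uses:

```python
import numpy as np
import librosa

def extract_features(path):
    """Pull spectral and tonal features from one audio file and
    average each over time to get a fixed-length vector."""
    y, sr = librosa.load(path)

    # Short-time Fourier transform -> magnitude spectrogram
    stft = np.abs(librosa.stft(y))

    # Mean MFCCs plus spectral contrast computed from the spectrogram
    mfccs = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)

    # Chromagram and tonal centroid (tonnetz) for pitch/harmony
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)

    return np.hstack([mfccs, contrast, chroma, tonnetz])
```

Averaging each feature over time gives one fixed-length vector per clip, which is what a 1D convolutional network can consume.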

Classification

The next step was to create a classifier that could identify different wildlife sounds. I used Keras to create a model that reads in the extracted audio features and labels and, hopefully, learns the different categories of audio.

Model

I used a convolutional neural network similar to the one described in [2]. It starts with two 1D convolutional layers with 64 filters each, feeding into ReLU activations and then a max pooling layer. Then we do it all over again, but with 128 filters each and a global average pooling layer. Lastly, we add a dropout layer to prevent overfitting and a final fully-connected layer with softmax activation. For my loss function I use categorical cross-entropy, optimized with stochastic gradient descent with Nesterov momentum, which is different from the recommended Adam optimizer.
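A rough Keras sketch of that stack might look like the following; the kernel size, dropout rate, and input shape are guesses on my part, so check the repo for the real values:

```python
from keras.models import Sequential
from keras.layers import Conv1D, Dense, Dropout, GlobalAveragePooling1D, MaxPooling1D

NUM_FEATURES = 65  # length of the feature vector from extract_features (assumed)
NUM_CLASSES = 10   # animal categories

model = Sequential([
    # Two 1D convolutions with 64 filters, ReLU, then max pooling
    Conv1D(64, 3, activation='relu', input_shape=(NUM_FEATURES, 1)),
    Conv1D(64, 3, activation='relu'),
    MaxPooling1D(3),
    # Same again with 128 filters, then global average pooling
    Conv1D(128, 3, activation='relu'),
    Conv1D(128, 3, activation='relu'),
    GlobalAveragePooling1D(),
    # Dropout to fight overfitting, then the softmax classifier
    Dropout(0.5),
    Dense(NUM_CLASSES, activation='softmax'),
])
```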

Here is a link to my model code on Github: https://github.com/accraze/petsounds/blob/master/model.py

Training

I decided to create my own custom dataset by using a subset of the ESC-50 dataset: just the animal sounds, which cover 10 categories. Next, I extracted the audio features for each sample in every category and stored the features and labels in NumPy arrays.
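Building those arrays might look something like this; the directory layout here is hypothetical, as is the folder name:

```python
import glob
import os

import numpy as np
from keras.utils import to_categorical

# Hypothetical layout: one folder of .wav clips per animal category
DATA_DIR = 'data/esc50_animals'
categories = sorted(os.listdir(DATA_DIR))

features, labels = [], []
for idx, category in enumerate(categories):
    for path in glob.glob(os.path.join(DATA_DIR, category, '*.wav')):
        features.append(extract_features(path))  # helper from the sketch above
        labels.append(idx)

# Add a channel axis so the arrays fit the Conv1D input shape
X = np.array(features)[..., np.newaxis]
y = to_categorical(np.array(labels), num_classes=len(categories))
```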

I fed the feature and label arrays into the CNN model and trained with a batch size of 128 and a learning rate of 0.002. After training for 10,000 epochs, I was able to get an accuracy of ~93% on the test data.
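Wired together, the training step might look like this; the train/test split ratio and the momentum value are assumptions on my part:

```python
from keras.optimizers import SGD
from sklearn.model_selection import train_test_split

# Hold out a test split (ratio assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.002, momentum=0.9, nesterov=True),  # momentum assumed
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=128,
          epochs=10000,
          validation_data=(X_test, y_test))
```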

Evaluation

Now I wanted to see if my model could identify the sounds from my field recordings around the house. I extracted audio features from a small dataset of bird, frog and coyote sounds and was only able to get an accuracy of around 88%. Taking a second look at my audio, there was a large amount of background noise, and I was not using great equipment (just an iPhone).
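For a single clip, running it through the trained model looks something like this (the file name is hypothetical):

```python
# Classify one field recording with the trained model
clip = extract_features('recordings/frog_chorus.wav')
pred = model.predict(clip[np.newaxis, :, np.newaxis])  # shape (1, NUM_FEATURES, 1)
print(categories[np.argmax(pred)])
```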

Next Steps

There were a couple of outcomes from my experiment that I would like to explore further in the upcoming weeks. First, I would like to try pre-processing my audio a bit more, using compression or maybe even EQ, before extracting the audio features. Another option might be to get better recordings that are closer to the source. You know the saying… garbage in, garbage out. I would also like to explore transfer learning, or maybe few-shot learning, with my small custom dataset.
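For the EQ idea, even a simple high-pass filter over the raw samples might help before feature extraction; here is a minimal sketch with SciPy, where the cutoff frequency and filter order are guesses to tune:

```python
from scipy import signal

def highpass(y, sr, cutoff=200):
    """Cut low-frequency wind and handling noise before feature
    extraction. Cutoff and filter order are guesses to experiment with."""
    sos = signal.butter(5, cutoff, btype='highpass', fs=sr, output='sos')
    return signal.sosfilt(sos, y)
```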

Here is a link to my code on Github if you would like to see what I have so far: https://github.com/accraze/petsounds

References

[1] Audio spectrogram representations for processing with Convolutional Neural Networks - Wyse - https://arxiv.org/pdf/1706.09559.pdf

[2] Environmental Sound Classification With Convolutional Neural Networks - Piczak - http://karol.piczak.com/papers/Piczak2015-ESC-ConvNet.pdf