Librosa Library – Speech Recognition, Speech Tone Recognition Training and Applications

Many students think speech recognition is very difficult. I thought so too at first, but it turned out to be easier than expected, because Python has an audio processing library called Librosa. The library is very powerful: it handles audio processing, spectrogram representation, amplitude conversion, time-frequency conversion, and feature extraction (timbre, pitch, and so on). For a fuller introduction to Librosa and its applications, see the official website or other blog posts. Here I will simply install it and then walk through a small recognition example: telling different speakers apart by the timbre of their voices.

Step 1: Install the Librosa library in the terminal

Method 1: Use the pip command

pip install librosa

Method 2: Use the conda command

conda install -c conda-forge librosa
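
After either install method, it can help to confirm that the package is visible to the Python environment your notebook uses. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of Librosa.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None

# Prints True or False depending on whether the install succeeded
print('librosa installed:', is_installed('librosa'))
```

If this prints False inside Jupyter but the install succeeded in the terminal, the notebook is probably running in a different environment than the one you installed into.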

Step 2: Open Jupyter and import the libraries for this guide

import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
import IPython.display as ipd

Step 3: Prepare the voice data

This means recording the voices of different people. There is no strict requirement on the length of each recording; I personally find 20-30 seconds enough. I use 3 recordings, one per speaker, because the method below is multi-class classification and treats each speaker as one class. Librosa generally works with WAV and MP3 files. Note that if your recording is not in MP3 or WAV format, do not simply rename the file to change its suffix. Use a proper conversion tool, otherwise loading fails with an error (I tried). My 3 training recordings are tbb-01.mp3 (my own voice), aichen-01.mp3, and xsc-01.mp3. Replace them with your own recordings. If anything is still unclear, see you in the comments section.
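
Because a renamed suffix does not change what is inside a file, one way to catch a mislabeled recording before loading it is to look at its first bytes. The sketch below is a rough heuristic, not a full format detector: the function name sniff_audio_format is hypothetical, it only knows the WAV container header and the usual ways an MP3 file begins, and other formats can occasionally look similar.

```python
import os
import tempfile
import wave

def sniff_audio_format(path):
    """Guess 'wav' or 'mp3' from the first bytes of the file, or return None."""
    with open(path, 'rb') as f:
        head = f.read(12)
    # WAV files are RIFF containers: 'RIFF' <4 size bytes> 'WAVE'
    if len(head) >= 12 and head[:4] == b'RIFF' and head[8:12] == b'WAVE':
        return 'wav'
    # MP3 files usually start with an ID3v2 tag ...
    if head[:3] == b'ID3':
        return 'mp3'
    # ... or directly with a frame sync word (11 set bits)
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return 'mp3'
    return None

# Demo: write one second of silence as a real WAV file,
# but give it a misleading .mp3 suffix on purpose
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.mp3')
with wave.open(demo_path, 'wb') as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(22050)
    w.writeframes(b'\x00\x00' * 22050)

detected_format = sniff_audio_format(demo_path)
print(detected_format)
```

Here the file is reported as WAV even though its name ends in .mp3, which is exactly the mismatch that renaming a file produces.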

# Load data sets
def load_data():
    # Load the audio of the three speakers: tbb, aichen, xsc
    tbb, sr1 = librosa.load('tbb-01.mp3')
    aichen, sr2 = librosa.load('aichen-01.mp3')
    xsc, sr3 = librosa.load('xsc-01.mp3')

    # Extract MFCC features, which capture the timbre of each voice;
    # each result has shape (n_mfcc, n_frames)
    tbb_mfcc = librosa.feature.mfcc(y=tbb, sr=sr1)
    aichen_mfcc = librosa.feature.mfcc(y=aichen, sr=sr2)
    xsc_mfcc = librosa.feature.mfcc(y=xsc, sr=sr3)

    # Stack the per-frame MFCC vectors of all speakers into one dataset
    X = np.concatenate((tbb_mfcc.T, aichen_mfcc.T, xsc_mfcc.T), axis=0)

    # Generate one label per frame: 0 = tbb, 1 = aichen, 2 = xsc
    y = np.concatenate((np.zeros(len(tbb_mfcc.T)), np.ones(len(aichen_mfcc.T)), 2*np.ones(len(xsc_mfcc.T))))

    return X, y

Run the function and display the labels:

# Load data sets
X, y = load_data()
y

Why do the values run from 0 to 2? Because there are 3 recordings, and each one gets a default label in the generated dataset: the first is labeled 0, the second 1, the third 2, and so on for as many recordings as you have. Then why are there so many 0s, 1s, and 2s? Because the MFCC step splits each recording into short frames, and every frame becomes one labeled sample. More recordings therefore give more training samples, which generally makes the training better.
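
To make the "many 0s, 1s, and 2s" concrete, here is a small sketch that estimates how many frames a recording produces and builds the matching label vector. It assumes a sample rate of 22050 Hz, a hop length of 512 samples, and centered frames (which I understand to be Librosa's defaults); the recording lengths and the helper name expected_frames are made up for illustration.

```python
import numpy as np

def expected_frames(duration_s, sr=22050, hop_length=512):
    """Frame count for a centered short-time analysis: 1 + n_samples // hop_length."""
    n_samples = int(duration_s * sr)
    return 1 + n_samples // hop_length

# Hypothetical recording lengths in seconds for three speakers
durations = [20.0, 25.0, 30.0]
frame_counts = [expected_frames(d) for d in durations]

# One label per frame: the speaker index repeated for every frame of that speaker
y = np.concatenate([np.full(n, i) for i, n in enumerate(frame_counts)])

print(frame_counts)      # frames per recording
print(np.bincount(y))    # how often each label occurs
```

Under these assumptions a 20-second recording already yields several hundred labeled samples, which is why three short recordings are enough to train a simple classifier.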

Step 4: Training with the dataset processed above

# Training models
def train(X, y):
    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

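    # Multi-class classification with logistic regression (one-vs-rest).
    # Note: newer scikit-learn releases deprecate the multi_class argument;
    # there, sklearn.multiclass.OneVsRestClassifier(LogisticRegression())
    # gives the same one-vs-rest setup.
    model = LogisticRegression(multi_class='ovr')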

    # Training models
    model.fit(X_train, y_train)

    return model

Run the function:

# Training models
model = train(X, y)
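
The training function splits off a test set but never scores the model on it. A minimal sketch of that missing check is below. It uses synthetic, well-separated data as a stand-in for real MFCC frames (the cluster means, sizes, and the max_iter value are assumptions chosen for the demo), so the number it prints says nothing about real recordings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for MFCC frames: 3 "speakers", 20 features per frame
rng = np.random.default_rng(0)
n_per_class, n_features = 300, 20
X = np.concatenate([
    rng.normal(loc=5.0 * k, scale=1.0, size=(n_per_class, n_features))
    for k in range(3)
])
y = np.repeat(np.arange(3), n_per_class)

# Same split settings as in the guide
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score the model on frames it did not see during training
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print('Frame-level test accuracy:', test_accuracy)
```

With real data, keep in mind that frames from the same recording end up in both the training and the test split, so this score tends to be optimistic. Testing on a separate recording, as in Step 5, is the more honest check.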

Step 5: Perform model testing

# Test models
def predict(model, audio_file):
    # Load the audio file and extract MFCC features (one row per frame after .T)
    y, sr = librosa.load(audio_file)
    mfcc = librosa.feature.mfcc(y=y, sr=sr)

    # Predict class probabilities for every frame
    proba = model.predict_proba(mfcc.T)

    # Average the probabilities over all frames and pick the most likely class
    mean_proba = proba.mean(axis=0)
    max_prob_label = model.classes_[np.argmax(mean_proba)]

    return max_prob_label

Run the function. Here I recorded a new sample of my own voice for the test:

# Test models
label = predict(model, 'tbb-02.mp3')

print('Tone is:', label)

The identified label is 0, which is indeed correct, since label 0 is my own voice (tbb).
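
The model predicts one label per MFCC frame, so a file-level answer has to combine those frame predictions in some way. One simple option, sketched here with made-up frame labels rather than real model output, is a majority vote. The helper name majority_vote is hypothetical.

```python
import numpy as np

def majority_vote(frame_labels):
    """Return the label that occurs most often among the per-frame predictions."""
    values, counts = np.unique(frame_labels, return_counts=True)
    return values[np.argmax(counts)]

# Made-up per-frame predictions for one recording (not real model output)
frame_labels = np.array([0., 0., 2., 0., 1., 0., 0., 2.])

file_label = majority_vote(frame_labels)
print('Majority label:', file_label)
```

A few misclassified frames do not change the outcome as long as most frames agree, which makes the file-level decision more stable than trusting any single frame.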

That is really all there is to this recognition example. Of course, what I did here is only timbre recognition, that is, recognizing the voices of different speakers. Librosa can be used for other kinds of recognition as well, which I leave for you to explore.

One more library worth mentioning is IPython.display:

import IPython.display as ipd

It lets you play audio directly in Jupyter:

audio_data = 'nideyangzi.mp3'
ipd.Audio(audio_data)

Running the cell shows an audio player for the file.

Well, that’s it for this speech recognition guide, thanks again for your support!
