Speech recognition in action (python code)

Time:2024-2-14

(python : pyttsx, SAPI, SpeechLib example code) (I)

Table of Contents for this article:

I. Basic Principles of Speech Recognition

(1) The origin and development of speech recognition

(2) Basic principles of speech recognition

(3) Speech recognition process

(4) Recent developments in speech recognition

II. Python Speech Recognition

(1) Text-to-speech conversion

(2) Saving text as a WAV voice file

III. Summary


I. Basic Principles of Speech Recognition

(1) The origin and development of speech recognition

Speech recognition is a complex cross-cutting technical discipline involving acoustics, linguistics, signal processing, pattern recognition, psychology, and computer science, among other subject areas.

Figure: the development of speech recognition technology.

(2) Basic principles of speech recognition

To most people, speech seems to be made up of individual words. But how is it actually produced, and how do we perceive it?

In fact, speech is a continuous, dynamic audio stream in which fairly stable states are mixed with dynamically changing ones. Within this sequence of states, one can define classes of more or less similar sounds, called phonemes.

Figure: speech as a dynamic waveform that changes over time.

A typical voice dialog system generally consists of the following technical modules:

  • Dialog Manager
  • Speech Recognizer
  • Language Parser
  • Language Generator (LG)
  • Speech Synthesizer

Among them, the speech recognizer (also called the speech recognition module or speech recognition system) is mainly used to convert the user's spoken input into text.

(3) Speech recognition process

The general approach to speech recognition is as follows.

First, sound is captured as a waveform. The waveform is then split into speech segments, and the recognizer tries to identify what is said in each segment.

To do this, we try to match all possible word combinations against the audio and select the best-matching combination. The matching relies on an acoustic model, a language model, and a pronunciation dictionary. Because the number of parameters involved in this matching is large, they need to be optimized. In practice, the speech is divided into small units called frames; for each frame (usually about 10 ms long), 39 numbers representing speech features are extracted, and together these numbers form the speech feature vector.
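The framing step described above can be sketched in a few lines of standard-library Python. This is only an illustration under simplifying assumptions: the frames do not overlap, a synthetic 440 Hz tone stands in for real speech, and each frame gets a toy two-number feature vector (log energy and zero-crossing rate) rather than the 39 features mentioned above. The function names `split_into_frames` and `frame_features` are made up for this sketch.

```python
import math


def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split a list of samples into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    # Drop the incomplete tail so that every frame has the same length.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def frame_features(frame):
    """Return a toy feature vector: [log energy, zero-crossing rate]."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zero_crossing_rate = crossings / (len(frame) - 1)
    return [log_energy, zero_crossing_rate]


# One second of a synthetic tone sampled at 16 kHz (an assumption for the demo).
SAMPLE_RATE = 16000
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

frames = split_into_frames(signal, SAMPLE_RATE)
features = [frame_features(frame) for frame in frames]
print(len(frames), "frames of", len(frames[0]), "samples each")
```

A real recognizer would typically use overlapping windows and richer spectral features, but the shape of the result is the same: one feature vector per frame.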

Figure: schematic diagram of the speech recognition process.

Speech recognition is a process of encoding followed by decoding. Signal processing and feature extraction form the front end of a speech recognition system; this is the encoding stage.

Feature extraction means obtaining speech feature vectors from the raw speech input after appropriate processing.

Language modeling: a sequence of words can be considered a sentence only when the sequence is grammatical, so language models are introduced into speech recognition to enforce this constraint. There are currently two main categories of language model: syntax-based language models and statistical language models.

A Syntactic Language Model (SLCM), also known as a Deterministic Language Model (DLM) or Formal Language Model (FLM), is a manually constructed set of formally deducible and extensible grammar rules that summarizes the intrinsic regularities of human language. Recognition results that do not conform to the grammar are excluded. This method achieves good results in some recognition tasks.

A statistical language model counts how frequently words occur, and under what conditions, across a large amount of text. The statistical language model is usually combined with the acoustic model to accomplish the recognition task, which can reduce the error rate caused by unreasonable output from the acoustic model. The N-gram language model is currently the one commonly used in large-vocabulary continuous speech recognition [23]; for Chinese, it is called the Chinese Language Model (CLM).

The quality of a language model is usually evaluated by its perplexity, which is defined as the inverse of the geometric mean of the probabilities of the words in a sequence. For a word sequence $W = w_1 w_2 \ldots w_N$:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}$$

The lower the perplexity, the more certain the language model is about its prediction of the current word. Language model training therefore generally aims to minimize the perplexity of the training utterances. To do so, the word frequencies in the training utterances are counted first, and the parameters of the language model are computed from them. When the vocabulary is large and the training data is not large enough, some word sequences have very small probabilities or never occur in the training data at all (for example, sequences containing out-of-vocabulary, or OOV, words). Techniques such as discounting and back-off are needed to deal with these problems.
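To make the counting and the perplexity measure concrete, here is a minimal sketch of a bigram model in standard-library Python. It is not the method of any particular recognizer. For brevity it handles unseen word pairs with add-one (Laplace) smoothing, a simpler stand-in for the discounting and back-off techniques mentioned above, and the three training sentences are invented for the demo.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev); unseen pairs get a small value."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def perplexity(sentence, unigrams, bigrams):
    """Inverse geometric mean of the per-word probabilities of the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = sum(math.log(bigram_prob(p, w, unigrams, bigrams)) for p, w in pairs)
    return math.exp(-log_prob / len(pairs))


# Toy training data (made up for illustration).
unigrams, bigrams = train_bigram(["the cat sat", "the dog sat", "the cat ran"])

seen = perplexity("the cat sat", unigrams, bigrams)
unseen = perplexity("dog the ran", unigrams, bigrams)
print(f"seen word order: {seen:.2f}, unseen word order: {unseen:.2f}")
```

The sentence that follows the word order seen in training receives the lower perplexity, which is the behavior the text describes.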

(4) Recent developments in speech recognition

According to statistics in the China Voice Industry Alliance's "China Intelligent Voice Industry Development Report (2021-2022)":

From a global perspective, the scale of the global intelligent voice industry will reach $35.12 billion in 2022, maintaining a high growth rate of 33.1%; from China’s point of view, according to Deloitte’s statistics, China’s intelligent voice market will reach 34.1 billion yuan in 2022, with a year-on-year growth of 13.4%.

Leading companies such as KDDI, Baidu, and Ali are driving technological innovation and application development in the industry by creating open platforms for their technological capabilities and building open-source ecosystems.

The report points out that China’s intelligent voice enterprises have realized new breakthroughs in a number of difficult technologies. Vertically, it extends from speech recognition, synthesis, and translation to the fields of computer vision, cognitive intelligence, and motion intelligence, and horizontally, it develops from a single-point technological breakthrough mode to machine cognition and multi-modal complex scene applications.

  In speech synthesis, with the boom in e-commerce live streaming and similar industries, the technology is trending toward more human-like, conversational speech.

  In speech recognition, multimodal interaction technology based on audio-visual fusion has become the main direction of technological evolution.

  In the industrial sector, China has built a number of voice technology innovation "national teams", including the National Intelligent Speech Innovation Center, to carry out research on key common technologies such as industrial acoustics, multilingual speech, and AI voice chips.

  In the urban sphere, intelligent voice technology has been applied in innovative demonstration projects such as the Anhui integrated online government service platform, the Liaocheng City Brain, the Sanya Yazhou Bay Science and Technology City Intelligent Industry and City Park, and the Tianjin AI Silver Hair Intelligent Service Platform.

  In the medical field, smart outbound calling and smart medical assistants can be used for the daily care and safety-net protection of elderly people living alone and left-behind children.

  In the field of education, intelligent voice and artificial intelligence technology can provide a one-stop service for teaching, learning, examination, assessment, and management in English listening and speaking, and reduce ineffective training.

  In the field of operators, intelligent voice technology is combined with health care, family education, family entertainment, and other scenarios to bring a smarter family life experience. Barrier-free intelligent communication that integrates 5G and machine translation lets ordinary 5G cell phone users, without downloading any software, use real-time translation and transcription services to make barrier-free video calls across languages.

  In the automotive field, intelligent voice has become a key link in human-computer interaction. It is expanding from in-vehicle to out-of-vehicle interaction, from single-mode to multi-mode interaction, and from passive to active interaction, providing full-stack technology for automobile companies.

  In the area of consumer products, AI+ learning products such as AI learning machines and translation pens help students learn with less burden and more efficiency; AI+ office products such as smart voice recorders, smart office notebooks, and smart mice are popular with working professionals; and AI+ life products such as AI translators, smart microphones, smart voice keyboards, smart headsets, and smart hearing aids let more people enjoy the convenience of AI technology.

  The development path of voice:

Given the multidisciplinary nature of intelligent speech, researchers need to explore new principles, mechanisms, materials, processes, and devices, and to integrate these innovations to advance the core technologies. At the same time, speech technology needs to extend further toward deep understanding: a more advanced voice interaction system should not only "be able to listen and speak" but also understand human information in depth. With a clear direction of development, intelligent voice technology can continue to achieve breakthroughs.

II. Python Speech Recognition

(1) Text-to-speech conversion

(a) Use of pyttsx

Install the pyttsx3 package:

  • pip install pyttsx3
import pyttsx3 as pyttsx
engine = pyttsx.init()
engine.say('I can because i think i can. Life is not all roses. Life is not all roses. ')
engine.runAndWait()

If the package installed without error, turn on your computer's sound and you will hear the text from the code read aloud.

Code Analysis:

pyttsx3 obtains the speech engine by initializing it; calling init() returns an engine object:
import pyttsx3
# Initialize the speech engine
engine = pyttsx3.init()
Set parameters such as speech rate and volume:
engine.setProperty('rate', 100)    # Set the speech rate
engine.setProperty('volume', 0.6)  # Set the volume
View parameters such as speech rate and volume:
rate = engine.getProperty('rate')
print(f'Rate of speech: {rate}')
volume = engine.getProperty('volume')
print(f'volume: {volume}')
Complete example code:
import pyttsx3 as pyttsx
engine = pyttsx.init()
engine.say('I can because i think i can. Life is not all roses. Life is not all roses. ')
rate = engine.getProperty('rate')
print(f'Rate of speech: {rate}')
volume = engine.getProperty('volume')
print(f'volume: {volume}')
engine.runAndWait()

Running this code reads the text aloud and prints the current speech rate and volume.

View the available speech synthesizers (voices):
voices = engine.getProperty('voices')
for voice in voices:
    print(voice)

The main attributes of a synthesizer (voice) are as follows:

age: age of the speaker; defaults to None
gender: gender of the speaker, given as a string (male, female, or neutral); defaults to None
id: string identifier of the voice
languages: list of languages the voice supports; defaults to an empty list
name: name of the voice; defaults to None

By default there are two speech synthesizers. Both can synthesize English audio, but only the first can synthesize Chinese audio. If you need other speech synthesizers, you have to download and set them up yourself.
# Set the first speech synthesizer
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[0].id)
Speak the text:
engine.say('I can because i think i can. Life is not all roses. Life is not all roses. ')
engine.runAndWait()
engine.stop()
Complete example code:
import pyttsx3
engine = pyttsx3.init()  # Initialize the speech engine
# Read the current (default) settings before changing them
rate = engine.getProperty('rate')
print(f'Rate of speech: {rate}')
volume = engine.getProperty('volume')
print(f'volume: {volume}')
engine.setProperty('rate', 100)    # Set the speech rate
engine.setProperty('volume', 0.6)  # Set the volume
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[0].id)  # Use the first speech synthesizer
# Queue the text after the settings so that they apply to this utterance
engine.say('I can because i think i can. Life is not all roses. Life is not all roses. ')
engine.runAndWait()
engine.stop()

The results of the run are:

If everything ran without error, turn on your computer's sound and you will hear the text from the code read aloud: "I can because i think i can. Life is not all roses."

The code selects the first speech synthesizer, and the screen prints the current speech rate (200) and volume (1.0).
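Which synthesizer sits at index 0 depends on the machine, so it can be more robust to pick a voice by name than by position. The helper below is a hypothetical sketch, not part of pyttsx3: it only assumes each voice object has the `id` and `name` attributes listed above, and it uses made-up stand-in objects so that it runs without a speech engine.

```python
from types import SimpleNamespace


def find_voice_id(voices, keyword):
    """Return the id of the first voice whose name or id contains keyword, else None."""
    keyword = keyword.lower()
    for voice in voices:
        name = getattr(voice, "name", None) or ""
        voice_id = getattr(voice, "id", None) or ""
        if keyword in name.lower() or keyword in str(voice_id).lower():
            return voice.id
    return None


# Stand-in voice objects; the ids and names are invented for the demo.
demo_voices = [
    SimpleNamespace(id="voice-0", name="Example Chinese Voice"),
    SimpleNamespace(id="voice-1", name="Example English Voice"),
]

print(find_voice_id(demo_voices, "english"))
print(find_voice_id(demo_voices, "french"))
```

With a real engine, the list would come from engine.getProperty('voices'), and a non-None result could be passed to engine.setProperty('voice', ...).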

(b) Use of SAPI

You can also use SAPI for text-to-speech conversion.

SAPI is the Microsoft Speech API, a speech interface introduced by Microsoft. The win32com module used below is provided by the pywin32 package.

from win32com.client import Dispatch
# Getting the object of speech
speaker = Dispatch('SAPI.SpVoice')
# Speech content
speaker.Speak('Just Be You')
speaker.Speak('No one can do it just like you')
speaker.Speak('Something magic in the way you move')
speaker.Speak('You are original, you know it is true')
speaker.Speak('do not let anybody take your tune')
speaker.Speak('You are not got a single thing to prove')
speaker.Speak('You are original, so just be you')
speaker.Speak('Just be you')
speaker.Speak('You are one of a kind')
speaker.Speak('The kind of once in a lifetime')
speaker.Speak('Not just another face in the crowd')
speaker.Speak('You light up a room')
speaker.Speak('With the light that is inside you')
speaker.Speak('do not be afraid to let it out')
speaker.Speak('So hold your head up')
speaker.Speak('do not let anybody get you down')
speaker.Speak('No one can do it just like you')
speaker.Speak('Something magic in the way you move')
speaker.Speak('You are original, you know it is true')
speaker.Speak('do not let anybody take your tune')
speaker.Speak('You are not got a single thing to prove')
speaker.Speak('You are original, so just be you')
speaker.Speak('Just be you')
speaker.Speak('Ooh, just be you')
speaker.Speak('Ooh, just be you')
speaker.Speak('You gotta believe')
speaker.Speak('You are here for a reason')
speaker.Speak('This world needs somebody like you')
speaker.Speak('Cause anybody can be a copy')
speaker.Speak('And there will always be people talking')
speaker.Speak('So face your fears and chase your dreams')
speaker.Speak('And dance like no one is watching')
speaker.Speak('No one can do it just like you')
speaker.Speak('Something magic in the way you move')
speaker.Speak('You are original, you know it is true')
speaker.Speak('do not let anybody take your tune')
speaker.Speak('You are not got a single thing to prove')
speaker.Speak('You are original, so just be you')
speaker.Speak('Just be you')
speaker.Speak('Ooh, just be you')
speaker.Speak('Ooh, just be you')
speaker.Speak('You are one of a kind')
speaker.Speak('The kind of once in a lifetime')
speaker.Speak('Not just another face in the crowd')
speaker.Speak('No one can do it just like you')
speaker.Speak('Something magic in the way you move')
speaker.Speak('You are original, you know it is true')
speaker.Speak('do not let anybody take your tune')
speaker.Speak('You are not got a single thing to prove')
speaker.Speak('You are original, so just be you')
speaker.Speak('Just be you')
speaker.Speak('Ooh, just be you')
speaker.Speak('Ooh, just be you')

speaker.Speak('No one can be like you')
speaker.Speak('so full of magic')
speaker.Speak("You know, you're the original you")
speaker.Speak("Don't let anyone change you")
speaker.Speak("You don't have to prove anything")
speaker.Speak('You are the original you, be yourself')
speaker.Speak('Be yourself')
speaker.Speak('You are unique')
speaker.Speak('the once-in-a-lifetime kind')
speaker.Speak("You don't look like another face in the crowd")
speaker.Speak('You use the warmth of your heart')
speaker.Speak('lit up the room')
speaker.Speak("Don't be afraid to speak up")
speaker.Speak('Keep your head up')
speaker.Speak("Don't let anyone get you down")
speaker.Speak('No one can be like you')
speaker.Speak('so full of magic')
speaker.Speak("You know, you're still the same person you were in the beginning")
speaker.Speak("Don't let anyone change you")
speaker.Speak("You don't have to prove anything")
speaker.Speak('Because you are you')
speaker.Speak(' Be yourself ')
speaker.Speak('Ooh, just be you')
speaker.Speak('Be yourself')
speaker.Speak('Ooh, just be you')
speaker.Speak(' Be yourself ')
speaker.Speak('You have to believe')
speaker.Speak("You're here for a reason")
speaker.Speak('The world needs people like you to exist')
speaker.Speak('Because anyone can be a copy')
speaker.Speak('People are always pointing fingers behind their backs')
speaker.Speak('Conquer your inner fears and go after your dreams')
speaker.Speak('Pretend no one is watching you, dance at your own pace')
speaker.Speak('No one can be like you')
speaker.Speak('so full of magic')
speaker.Speak("You know, you're still the same person you were in the beginning")
speaker.Speak("Don't change yourself for anyone")
speaker.Speak("And you don't have to prove anything")
speaker.Speak('You are the original you')
speaker.Speak(' Be yourself ')
speaker.Speak('Be yourself')
speaker.Speak(' Be yourself ')
speaker.Speak('You are unique')
speaker.Speak('the once-in-a-lifetime kind')
speaker.Speak('not like another face in the crowd')
speaker.Speak("There's no one like you")
speaker.Speak('so magical')
speaker.Speak("And you know it, you're still the same person you were")
speaker.Speak("Don't let anyone change you")
speaker.Speak("You don't have to prove anything")
speaker.Speak("You're still the same")
speaker.Speak(' Be yourself ')
speaker.Speak('Be yourself')
speaker.Speak('Ooh, just be you')
speaker.Speak('Be yourself')
# Release objects
del speaker

If everything ran without error, turn on your computer's sound and you will hear the text from the code read aloud.
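When there are many lines to speak, a loop over a list of lines is easier to maintain than one Speak call per line. The sketch below assumes only that the speaker object has a Speak(text) method, as the SAPI voice object above does. `speak_lines` and `PrintingSpeaker` are hypothetical names, and the stand-in class prints instead of speaking so that the example runs on any platform.

```python
def speak_lines(speaker, lines):
    """Speak each non-empty line in order and return how many lines were spoken."""
    spoken = 0
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines between verses
            speaker.Speak(text)
            spoken += 1
    return spoken


class PrintingSpeaker:
    """Hypothetical stand-in for the SAPI voice object; prints instead of speaking."""

    def __init__(self):
        self.history = []

    def Speak(self, text):
        self.history.append(text)
        print(text)


lyrics = """Just Be You
No one can do it just like you

Something magic in the way you move"""

demo_speaker = PrintingSpeaker()
count = speak_lines(demo_speaker, lyrics.splitlines())
```

On Windows, the object returned by Dispatch('SAPI.SpVoice') could be passed in place of the stand-in.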

(2) Saving text as a WAV voice file

(a) Use the SpeechLib library

Method: read the text to be spoken from a text file, convert it to speech, and save it in .wav format.

SpeechLib is accessed through the comtypes package, which needs to be installed first with the following command:

  • pip install comtypes
from comtypes.client import CreateObject

engine = CreateObject('SAPI.SpVoice')
stream = CreateObject('SAPI.SpFileStream')
# Import SpeechLib only after CreateObject has generated the SAPI wrapper module
from comtypes.gen import SpeechLib

infile = 'Even if the world has no fairy tales.txt'
outfile = 'Even if the world has no fairy tales.wav'

# Read the text to be spoken
with open(infile, 'r', encoding='utf-8') as f:
    theText = f.read()

# Send the voice output to the .wav file, speak the text, then close the file
stream.Open(outfile, SpeechLib.SSFMCreateForWrite)
engine.AudioOutputStream = stream
engine.Speak(theText)
stream.Close()

If everything ran without error, the converted voice file "Even if the world has no fairy tales.wav" appears in the same directory on your computer.
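One way to check the generated file without playing it is to read its header with Python's standard wave module. This is a minimal sketch that assumes the file is uncompressed PCM audio, the only kind the wave module reads. `describe_wav` is a made-up helper name, and the demo writes one second of silence to a temporary file so that it runs on any platform.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return the basic properties of a PCM .wav file as a dict."""
    with wave.open(path, "rb") as wav:
        frame_count = wav.getnframes()
        frame_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate": frame_rate,
            "duration_s": frame_count / frame_rate,
        }


# Demo input: one second of 16-bit mono silence at 16 kHz in a temporary file.
demo_path = os.path.join(tempfile.gettempdir(), "describe_wav_demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

info = describe_wav(demo_path)
print(info)
os.remove(demo_path)
```

Calling describe_wav on the file produced above should report a duration greater than zero if the text was actually spoken into it.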

III. Summary

This article first discussed, with illustrations, the origin and development of speech recognition, its basic principles, the recognition process, and recent developments in the field.

It then walked through Python example code for two tasks: (1) converting text to speech, and (2) saving text as a WAV voice file, with complete source code provided for reference.

There are other things Python can do with speech, such as recognizing voice input from a microphone. These will be analyzed in more detail in a later update to this blog post.
