Spring Boot integration with Vosk for a simple speech recognition feature

Date: 2024-05-09

Vosk open-source speech recognition

Vosk is an open-source speech recognition toolkit. Vosk supports:

  1. Nineteen languages are supported – Chinese, English, Indian English, German, French, Spanish, Portuguese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Persian, Filipino, Ukrainian, Kazakh.

  2. Works offline, including on lightweight devices – Raspberry Pi, Android, iOS.

  3. Install it with a simple pip3 install vosk.

  4. Portable per-language models are only about 50 MB each, with larger server models also available.

  5. Provides a streaming API for the best user experience (unlike popular Python speech recognition packages).

  6. There are also wrappers for other programming languages – Java, C#, JavaScript and so on.

  7. The vocabulary can be quickly reconfigured for optimal accuracy.

  8. Supports speaker recognition.

vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and with Python, Java, C#, etc.

Link: vosk-api GitHub repository

Examples of use in each language are provided.

vosk-server

WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries

Link: vosk-server GitHub repository

Examples of use in each language are provided.

Using vosk-api in a Java Spring Boot project

Import the Maven dependencies

<!-- Speech recognition -->
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
            <version>5.13.0</version>
        </dependency>
        <dependency>
            <groupId>com.alphacephei</groupId>
            <artifactId>vosk</artifactId>
            <version>0.3.45</version>
        </dependency>

        <!-- The JAVE2 (Java Audio Video Encoder) library is a Java wrapper around the ffmpeg project. -->
        <dependency>
            <groupId>ws.schild</groupId>
            <artifactId>jave-core</artifactId>
            <version>3.1.1</version>
        </dependency>

        <!-- Native ffmpeg binaries for development on Windows (32-bit and 64-bit) -->
        <dependency>
            <groupId>ws.schild</groupId>
            <artifactId>jave-nativebin-win32</artifactId>
            <version>3.1.1</version>
        </dependency>
        <dependency>
            <groupId>ws.schild</groupId>
            <artifactId>jave-nativebin-win64</artifactId>
            <version>3.1.1</version>
        </dependency>

VoskResult

public class VoskResult {

    private String text;

    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }
}

vosk model loading

package com.fjdci.vosk;

import org.vosk.LibVosk;
import org.vosk.LogLevel;
import org.vosk.Model;

import java.io.IOException;

/**
 * Vosk model loading
 * @author zhou
 */
public class VoskModel {

    /**
     * 3. volatile for thread safety:
     * forbids instruction reordering and guarantees visibility,
     * but does not guarantee atomicity.
     */
    private static volatile VoskModel instance;

    private Model voskModel;

    public Model getVoskModel() {
        return voskModel;
    }

    /**
     * 1. Private constructor
     */
    private VoskModel() {
        System.out.println("SingleLazyPattern instantiated ");
        //String modelStr = "D:\\work\\project\\fjdci-vosk\\src\\main\\resources\\vosk-model-small-cn-0.22";
        String modelStr = "D:\\work\\fjdci\\docker\\vosk\\vosk-model-cn-0.22";
        try {
            voskModel = new Model(modelStr);
            LibVosk.setLogLevel(LogLevel.INFO);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * 2. Obtain the unique instance through a static method.
     * DCL (double-checked locking)
     * keeps performance high under multi-threading.
     */
    public static VoskModel getInstance() {
        if (instance == null) {
            synchronized (VoskModel.class) {
                if (instance == null) {
                    // 1. allocate memory 2. run the constructor to initialize the object 3. point the reference at that memory
                    instance = new VoskModel();
                }
            }
        }
        return instance;
    }

    /**
     * Multi-threaded loading test
     * @param args
     */
    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            new Thread(() -> {
                VoskModel.getInstance();
            }).start();
        }
    }


}
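As an aside, the same lazy, thread-safe initialization can be achieved without explicit locking via the initialization-on-demand holder idiom, since the JVM guarantees a nested class is initialized exactly once. This is a generic sketch, not part of the original project; the model path here is a placeholder.

```java
// Initialization-on-demand holder idiom: the nested Holder class is not
// initialized until getInstance() is first called, and JVM class loading
// guarantees thread safety without synchronized or volatile.
public class LazyHolderExample {

    private final String modelPath;

    private LazyHolderExample() {
        // Placeholder for the expensive resource load (e.g. new Model(path))
        this.modelPath = "path/to/model";
    }

    public String getModelPath() {
        return modelPath;
    }

    private static class Holder {
        private static final LazyHolderExample INSTANCE = new LazyHolderExample();
    }

    public static LazyHolderExample getInstance() {
        return Holder.INSTANCE;
    }
}
```

Compared with DCL, the holder idiom needs no volatile field and no synchronized block, at the cost of not being able to pass runtime arguments to the constructor.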

Speech recognition utility

package com.fjdci.vosk;

import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
import org.vosk.Model;
import org.vosk.Recognizer;
import ws.schild.jave.EncoderException;
import ws.schild.jave.MultimediaObject;
import ws.schild.jave.info.AudioInfo;
import ws.schild.jave.info.MultimediaInfo;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.UUID;

@Slf4j
@Component
public class VoiceUtil {

    public static void main(String[] args) throws EncoderException {
        String wavFilePath = "D:\\fjFile\\annex\\xwbl\\tem_2.wav";
        // chunk length in seconds
        long cutDuration = 20;
        String waveForm = acceptWaveForm(wavFilePath, cutDuration);
        System.out.println(waveForm);
    }

    /**
     * Speech recognition of a WAV-format audio file.
     *
     * @param wavFilePath path of the WAV file
     * @param cutDuration chunk length in seconds
     * @return recognized text
     * @throws EncoderException
     */
    private static String acceptWaveForm(String wavFilePath, long cutDuration) throws EncoderException {
        // Determine the length of the audio
        long startTime = System.currentTimeMillis();
        MultimediaObject multimediaObject = new MultimediaObject(new File(wavFilePath));
        MultimediaInfo info = multimediaObject.getInfo();
        // Duration/milliseconds
        long duration = info.getDuration();
        AudioInfo audio = info.getAudio();
        // Number of channels
        int channels = audio.getChannels();
        // starting offset in seconds
        long offset = 0;
        long forNum = (duration / 1000) / cutDuration;
        if (duration % (cutDuration * 1000) > 0) {
            forNum = forNum + 1;
        }
        // Perform chunking
        List<String> strings = cutWavFile(wavFilePath, cutDuration, offset, forNum);
        // Loop over translations
        StringBuilder result = new StringBuilder();
        for (String string : strings) {
            File f = new File(string);
            result.append(VoiceUtil.getRecognizerResult(f, channels));
        }
        long endTime = System.currentTimeMillis();
        String msg = "Elapsed: " + (endTime - startTime) + "ms";
        System.out.println(msg);
        return result.toString();
    }

    /**
     * Split a WAV file into chunks.
     *
     * @param wavFilePath path of the WAV file to process
     * @param cutDuration fixed chunk length, in seconds
     * @param offset starting offset, in seconds
     * @param forNum number of chunks to cut
     * @return
     * @throws EncoderException
     */
    private static List<String> cutWavFile(String wavFilePath, long cutDuration, long offset, long forNum) throws EncoderException {
        UUID uuid = UUID.randomUUID();
        // Cut large files into smaller files of fixed duration
        List<String> strings = new ArrayList<>();
        for (int i = 0; i < forNum; i++) {
            String target = "D:\\fjFile\\annex\\xwbl\\" + uuid + "\\" + i + ".wav";
            Float offsetF = Float.valueOf(String.valueOf(offset));
            Float cutDurationF = Float.valueOf(String.valueOf(cutDuration));
            Jave2Util.cut(wavFilePath, target, offsetF, cutDurationF);
            offset = offset + cutDuration;
            strings.add(target);
        }
        return strings;
    }

    /**
     * Recognition
     *
     * @param f
     * @param channels
     */
    public static String getRecognizerResult(File f, int channels) {
        StringBuilder result = new StringBuilder();
        Model voskModel = VoskModel.getInstance().getVoskModel();
        // The recognizer's sample rate is the audio sample rate multiplied by the channel count
        log.info("==== loading is complete, start analyzing ====");
        try (
                Recognizer recognizer = new Recognizer(voskModel, 16000 * channels);
                InputStream ais = new FileInputStream(f)
        ) {
            int nbytes;
            byte[] b = new byte[4096];
            while ((nbytes = ais.read(b)) >= 0) {
                if (recognizer.acceptWaveForm(b, nbytes)) {
                    // Return speech recognition results
                    result.append(getResult(recognizer.getResult()));
                }
            }
            // Same as getResult(), but does not wait for silence. Usually called at the end of the
            // stream to get the last portion of the audio; it flushes the pipeline so that all
            // remaining audio chunks are processed.
            result.append(getResult(recognizer.getFinalResult()));
            log.info("Recognize result: {}", result.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result.toString();
    }

    /**
     * Extract the recognized text from the JSON result.
     *
     * @param result
     * @return
     */
    private static String getResult(String result) {
        VoskResult voskResult = JacksonMapperUtils.json2pojo(result, VoskResult.class);
        return Optional.ofNullable(voskResult).map(VoskResult::getText).orElse("");

    }

}
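The chunk-count arithmetic in acceptWaveForm (integer division followed by a remainder check) is simply a ceiling division, and can be written in one expression. A small illustrative sketch, not part of the original class:

```java
public class ChunkMath {

    /**
     * Number of fixed-length chunks needed to cover an audio file.
     * Equivalent to ceil(durationMs / (cutDurationSeconds * 1000)),
     * i.e. the forNum value computed in acceptWaveForm.
     */
    public static long chunkCount(long durationMs, long cutDurationSeconds) {
        long cutMs = cutDurationSeconds * 1000;
        // Classic integer ceiling-division trick: add (divisor - 1) before dividing
        return (durationMs + cutMs - 1) / cutMs;
    }
}
```

For example, a 20 500 ms file cut into 20-second chunks yields 2 chunks, matching the original division-plus-remainder logic.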

JAVE2 audio processing toolkit

package com.fjdci.vosk;

import ws.schild.jave.Encoder;
import ws.schild.jave.EncoderException;
import ws.schild.jave.InputFormatException;
import ws.schild.jave.MultimediaObject;
import ws.schild.jave.encode.AudioAttributes;
import ws.schild.jave.encode.EncodingAttributes;
import ws.schild.jave.info.AudioInfo;
import ws.schild.jave.info.MultimediaInfo;

import java.io.File;

public class Jave2Util {

    /**
     * @param src source file path
     * @param target target file path
     * @param offset Sets the starting offset in seconds.
     * @param duration Sets the length of the sliced audio (in seconds).
     * @throws EncoderException
     */
    public static void cut(String src, String target, Float offset, Float duration) throws EncoderException {

        File targetFile = new File(target);
        if (targetFile.exists()) {
            targetFile.delete();
        }

        File srcFile = new File(src);
        MultimediaObject srcMultiObj = new MultimediaObject(srcFile);
        MultimediaInfo srcMediaInfo = srcMultiObj.getInfo();

        Encoder encoder = new Encoder();

        EncodingAttributes encodingAttributes = new EncodingAttributes();
        //Set the starting offset (sec)
        encodingAttributes.setOffset(offset);
        //Set the audio length of the slice (in seconds)
        encodingAttributes.setDuration(duration);
        // Input format
        encodingAttributes.setInputFormat("wav");

        //Set audio properties
        AudioAttributes audio = new AudioAttributes();
        audio.setBitRate(srcMediaInfo.getAudio().getBitRate());
        //audio.setSamplingRate(srcMediaInfo.getAudio().getSamplingRate());
        // Convert to 16 kHz to match Vosk's expected input
        audio.setSamplingRate(16000);
        audio.setChannels(srcMediaInfo.getAudio().getChannels());
        // To transcode while cutting, set a different codec here.
//        audio.setCodec("pcm_u8");
        //audio.setCodec(srcMediaInfo.getAudio().getDecoder().split(" ")[0]);
        encodingAttributes.setAudioAttributes(audio);
        // Write file
        encoder.encode(srcMultiObj, new File(target), encodingAttributes);
    }

    /**
     * Audio format conversion
     *
     * @param oldFormatPath : original audio path
     * @param newFormatPath : target audio path
     * @return
     */
    public static boolean transforMusicFormat(String oldFormatPath, String newFormatPath) {
        File source = new File(oldFormatPath);
        File target = new File(newFormatPath);
        // Audio conversion format class
        Encoder encoder = new Encoder();
        // Setting Audio Properties
        AudioAttributes audio = new AudioAttributes();
        audio.setCodec(null);
        // Setting transcoding properties
        EncodingAttributes attrs = new EncodingAttributes();
        attrs.setInputFormat("wav");
        attrs.setAudioAttributes(audio);
        try {
            encoder.encode(new MultimediaObject(source), target, attrs);
            System.out.println("Conversion complete...");
            return true;
        } catch (IllegalArgumentException e) {
            e.printStackTrace();
        } catch (InputFormatException e) {
            e.printStackTrace();
        } catch (EncoderException e) {
            e.printStackTrace();
        }
        return false;
    }

    
    
    public static void main(String[] args) throws EncoderException {

        String src = "D:\\fjFile\\annex\\xwbl\\ly8603f22f24e0409fa9747d50a78ff7e5.wav";
        String target = "D:\\fjFile\\annex\\xwbl\\tem_2.wav";

        Jave2Util.cut(src, target, 0.0F, 60.0F);

        String inputFormatPath = "D:\\fjFile\\annex\\xwbl\\ly8603f22f24e0409fa9747d50a78ff7e5.m4a";
        String outputFormatPath = "D:\\fjFile\\annex\\xwbl\\ly8603f22f24e0409fa9747d50a78ff7e5.wav";

        info(inputFormatPath);

       // audioEncode(inputFormatPath, outputFormatPath);


    }

    /**
     * Obtain the encoding information of an audio file
     *
     * @param filePath
     * @throws EncoderException
     */
    private static void info(String filePath) throws EncoderException {
        File file = new File(filePath);
        MultimediaObject multimediaObject = new MultimediaObject(file);
        MultimediaInfo info = multimediaObject.getInfo();
        // Duration
        long duration = info.getDuration();
        String format = info.getFormat();
        // format:mov
        System.out.println("format:" + format);
        AudioInfo audio = info.getAudio();
        // It sets the number of audio channels that will be used in the re-encoded audio stream (1 = mono, 2 = stereo). If no channel values are set, the encoder will select the default values.
        int channels = audio.getChannels();
        // It sets the bitrate value for the new re-encoded audio stream. If the bitrate value is not set, the encoder will select the default value.
        // The value should be expressed in bits per second. For example, if you want a bit rate of 128 kb / s, setBitRate (new Integer (128000)) should be called.
        int bitRate = audio.getBitRate();
        // Sample rate of the audio stream, in Hz (e.g. 16000 = 16 kHz). If no sample rate is set,
        // the encoder will select a default value.
        int samplingRate = audio.getSamplingRate();

        // Setting the audio volume
        // This method can be called to change the volume of the audio stream. A value of 256 means that the volume is unchanged. Therefore, a value less than 256 indicates a decrease in volume, while a value greater than 256 will increase the volume of the audio stream.
        // setVolume(Integer volume)

        String decoder = audio.getDecoder();

        System.out.println("duration (ms): " + duration);
        System.out.println(" channels :" + channels);
        System.out.println("bitRate:" + bitRate);
        System.out.println("samplingRate (16000 = 16 kHz): " + samplingRate);
        // aac (LC) (mp4a / 0x6134706D)
        System.out.println("decoder:" + decoder);
    }

    /**
     * Audio format conversion
     * @param inputFormatPath
     * @param outputFormatPath
     * @return
     */
    public static boolean audioEncode(String inputFormatPath, String outputFormatPath) {
        String outputFormat = getSuffix(outputFormatPath);
        String inputFormat = getSuffix(inputFormatPath);
        File source = new File(inputFormatPath);
        File target = new File(outputFormatPath);
        try {
            MultimediaObject multimediaObject = new MultimediaObject(source);
            // Get the encoding information of the audio file
            MultimediaInfo info = multimediaObject.getInfo();
            AudioInfo audioInfo = info.getAudio();
            //Set audio properties
            AudioAttributes audio = new AudioAttributes();
            audio.setBitRate(audioInfo.getBitRate());
            audio.setSamplingRate(audioInfo.getSamplingRate());
            audio.setChannels(audioInfo.getChannels());
            // Setting transcoding properties
            EncodingAttributes attrs = new EncodingAttributes();
            attrs.setInputFormat(inputFormat);
            attrs.setOutputFormat(outputFormat);
            attrs.setAudioAttributes(audio);
            // Audio conversion format class
            Encoder encoder = new Encoder();
            // Convert
            encoder.encode(new MultimediaObject(source), target, attrs);
            return true;
        } catch (IllegalArgumentException | EncoderException e) {
            e.printStackTrace();
        }
        return false;
    }

    /**
     * Get the file extension (the suffix after the last '.') from a file path
     * @param outputFormatPath
     * @return
     */
    private static String getSuffix(String outputFormatPath) {
        return outputFormatPath.substring(outputFormatPath.lastIndexOf(".") + 1);
    }


}
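Note that getSuffix above returns the whole path when there is no dot at all (lastIndexOf returns -1, so substring(0) is taken), and can pick up a dot belonging to a directory name. A slightly more defensive variant, offered only as a sketch:

```java
public class PathUtil {

    /** Returns the extension after the last '.', or "" if there is none. */
    public static String getSuffix(String path) {
        int dot = path.lastIndexOf('.');
        // Ignore dots that belong to a directory component of the path
        int sep = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        if (dot <= sep || dot == path.length() - 1) {
            return "";
        }
        return path.substring(dot + 1);
    }
}
```

This returns "wav" for "a.wav", but "" for "noext", "file." and "D:\dir.v\file", where the original would return the whole string or a spurious suffix.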

Speech model download address (access from some regions may require a proxy):
https://alphacephei.com/vosk/models

Reference Links

Link: vosk open-source speech recognition

Link: Summary of Whisper-based audio transcription services

Link: Several recommended free speech-to-text tools

Link: Java offline Chinese speech-to-text recognition

Link: ASR – Chinese speech recognition in Python using vosk

Link: NeMo is very powerful, covering ASR, NLP and TTS, providing pre-trained models and complete training modules. Its commercial version is RIVA.

Link: ASRT speech recognition documentation
ASRT is a deep-learning-based speech recognition tool for building state-of-the-art speech recognition systems. It is an open-source project maintained since 2016 by the AI Lemon blogger (Xidian University – Xi'an Key Laboratory of Big Data and Visual Intelligence), with a baseline recognition accuracy of 85% and up to about 95% under certain conditions. ASRT includes a speech recognition algorithm server (for training or for deploying API services) and client SDKs for multiple platforms and programming languages, supporting both single-utterance and real-time streaming recognition; the code is open-sourced on GitHub and Gitee.
The ASRT speech recognition API has also been provided to the AI Lemon site search engine to implement its voice search feature.

Building an offline speech recognition system with Web API access.

Some directions and thoughts:

  1. Determine the speech recognition engine

First, you need to choose a suitable speech recognition engine. Some common engines are CMU Sphinx, Kaldi, Baidu Speech, Xunfei Open Platform and so on. After selecting an engine, you need to configure and train it so that it can adapt to your application scenario.

  2. Build an offline speech recognition system

Next, the offline speech recognition system itself needs to be built. This can be done on a Linux system such as Ubuntu; the engine chosen in the previous step and its dependency packages need to be installed on the system.

  3. Provide Web API access

To make the offline speech recognition system easy to access and use, you need to provide a corresponding Web API. A framework such as Flask can be used to build the web service, calling the speech recognition engine within it to perform the recognition work.
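In a Java stack the same idea can be sketched with the JDK's built-in HttpServer; the recognize call below is a stub (in this project it would delegate to VoiceUtil.getRecognizerResult), and the endpoint name and port are arbitrary:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class AsrHttpServer {

    // Stub for the actual recognition call; a real implementation would
    // write the bytes to a temp file and pass it to the Vosk recognizer.
    static String recognize(byte[] audio) {
        return "{\"text\":\"stub result, " + audio.length + " bytes received\"}";
    }

    /** Starts an HTTP server exposing POST /asr; returns it so the caller can stop it. */
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/asr", exchange -> {
            // Read the raw audio bytes from the request body
            byte[] audio = exchange.getRequestBody().readAllBytes();
            byte[] body = recognize(audio).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

After calling AsrHttpServer.start(8080) from the application, the endpoint can be exercised with something like `curl -X POST --data-binary @audio.wav http://localhost:8080/asr`. In a Spring Boot project a @RestController would be the more natural choice; this JDK-only version just keeps the sketch self-contained.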

Finally, to ensure accurate and smooth speech recognition, a series of optimization and debugging steps is needed, such as noise reduction, speech-rate control and model tuning. We hope the directions above are helpful.

2 Whisper

Whisper is an automatic speech recognition model trained on 680,000 hours of multilingual data collected from the web.

Whisper is a speech recognition engine that can be used to develop voice control applications, but it is typically used on mobile and embedded devices to provide offline speech recognition. If you want to build offline speech recognition using Java, you can try using other speech recognition engines such as CMU Sphinx and Kaldi. These engines support offline speech recognition and provide Java APIs for developers to use.

3 Kaldi

Open-source Chinese speech recognition project: ASRFrame

https://blog.csdn.net/sailist/article/details/95751825

Tencent AI Lab open source lightweight speech processing toolkit PIKA

Focus on E2E speech recognition, Tencent AI Lab open source lightweight speech processing toolkit PIKA-Community

What open-source Python Chinese speech-to-text projects are there?

https://blog.csdn.net/devid008/article/details/129656356

Offline speech recognition third-party service providers

1 iFLYTEK (Xunfei)

https://www.xfyun.cn/service/offline_iat

The iFLYTEK offline package is Android-only and does not offer an offline Java version.

It also appears possible to call a local DLL for offline recognition.

2 Baidu Speech Recognition

https://ai.baidu.com/tech/speech/realtime_asr

Offline recognition is not supported.

3 AliCloud Speech Recognition

https://ai.aliyun.com/nls/trans