Parameter description of the speech recognition model whisper

I. Introduction to whisper:

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.

II. Parameters of whisper

1、-h, --help

Show the help message listing all whisper parameters, then exit

2、--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}

Select the model to use, default: small


3、--model_dir MODEL_DIR

The path where the model file is saved, default value: ~/.cache/whisper

4、--device DEVICE

Device used for PyTorch inference, default: cuda if a CUDA device is available, otherwise cpu

5、--output_dir OUTPUT_DIR, -o OUTPUT_DIR

The directory where the output is saved, default value: current directory

6、--output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}

Format of the output file, default: all

7、--verbose VERBOSE

Whether to print progress and debug messages, default: True

8、--task {transcribe,translate}

transcribe: speech to text in the spoken language

translate: speech to English text

Default value: transcribe

9、--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}

Language spoken in the audio; leave it as None to have the language detected automatically

Default: None
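
The basic options above can be combined into a single command. The following is a minimal sketch that only assembles the argument list; the helper name build_whisper_command and the file name interview.mp3 are made up for illustration, while the flags are the ones described above.

```python
import shlex


def build_whisper_command(audio_path, model="small", task="transcribe",
                          language=None, output_dir=".", output_format="all"):
    """Assemble an argument list for the whisper command-line tool.

    Hypothetical helper: the defaults mirror the ones listed in this article.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    cmd = [
        "whisper", audio_path,
        "--model", model,
        "--task", task,
        "--output_dir", output_dir,
        "--output_format", output_format,
    ]
    # Leaving --language out keeps it at None, so the language is detected.
    if language is not None:
        cmd += ["--language", language]
    return cmd


if __name__ == "__main__":
    cmd = build_whisper_command("interview.mp3", model="medium",
                                task="translate", language="de",
                                output_format="srt")
    # Print a shell-ready command line for inspection.
    print(shlex.join(cmd))
```

Once whisper is installed, the returned list could be passed to subprocess.run(cmd, check=True).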

10、--temperature TEMPERATURE

Temperature to use for sampling

Default value: 0

11、--best_of BEST_OF

Number of candidates when sampling at non-zero temperatures

Default value: 5

12、--beam_size BEAM_SIZE

Number of beams in beam search; only applies when the temperature is 0

Default value: 5

13、--patience PATIENCE

Optional patience value to use in beam decoding, see https://arxiv.org/abs/2204.05424; when unset, a patience of 1.0 is used, which is equivalent to conventional beam search

Default: None (treated as 1.0)
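
The interaction between temperature, best_of, beam_size and patience can be sketched as follows. This is an illustration under the assumption that beam search settings are ignored when sampling (temperature above 0) and best_of is ignored at temperature 0; the function name select_decode_options is hypothetical.

```python
def select_decode_options(temperature, beam_size=5, best_of=5, patience=None):
    """Return only the decoding options that apply at this temperature."""
    options = {
        "temperature": temperature,
        "beam_size": beam_size,
        "best_of": best_of,
        "patience": patience,
    }
    if temperature > 0:
        # Sampling: beam search settings do not apply.
        options.pop("beam_size")
        options.pop("patience")
    else:
        # Temperature 0: beam search is used, so best_of does not apply.
        options.pop("best_of")
    return options


if __name__ == "__main__":
    print(select_decode_options(0.0))
    print(select_decode_options(0.4))
```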

14、--length_penalty LENGTH_PENALTY

Optional token length penalty coefficient (alpha), see https://arxiv.org/abs/1609.08144; simple length normalization is used by default

Default: None

15、--suppress_tokens SUPPRESS_TOKENS

Comma-separated list of token IDs to suppress during sampling; '-1' suppresses most special characters except common punctuation

Default: -1

16、--initial_prompt INITIAL_PROMPT

Optional text to provide as a prompt for the first window

Default: None

17、--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT

If True, the model's previous output is provided as a prompt for the next window; disabling this may make the text inconsistent across windows, but the model becomes less likely to get stuck in a failure loop

Default value: True

18、--fp16 FP16

Whether to perform inference in FP16 (half precision)

Default value: True

19、--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK

Amount by which the temperature is increased on each fallback, that is, each time decoding fails to meet either of the thresholds below (compression_ratio_threshold or logprob_threshold)

Default value: 0.2
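
As an illustration of the fallback, the sketch below lists the temperatures that would be tried in order, assuming the schedule starts at --temperature and rises by the increment until it reaches 1.0. The function name temperature_schedule and the max_temperature parameter are hypothetical.

```python
import math


def temperature_schedule(temperature=0.0, increment=0.2, max_temperature=1.0):
    """List the temperatures tried in order when decoding keeps failing."""
    if increment is None:
        # No fallback: only the initial temperature is used.
        return [temperature]
    # Small epsilon so that max_temperature itself is included.
    steps = math.ceil((max_temperature + 1e-6 - temperature) / increment)
    # Round to hide floating-point noise such as 0.6000000000000001.
    return [round(temperature + i * increment, 10) for i in range(max(steps, 0))]


if __name__ == "__main__":
    print(temperature_schedule())
```

With the default values this gives six attempts, from 0.0 up to 1.0 in steps of 0.2.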

20、--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD

If the gzip compression ratio of the decoded text is higher than this value, the decoding is considered to have failed (the text is too repetitive)

Default: 2.4
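
The compression ratio is the size of the text divided by the size of its compressed form, so highly repetitive output yields a large ratio. A minimal sketch, assuming Python's zlib module (the same DEFLATE compression that gzip uses) as the compressor; the helper is_too_repetitive is hypothetical.

```python
import zlib


def compression_ratio(text):
    """Ratio of the raw UTF-8 size to the zlib-compressed size."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def is_too_repetitive(text, threshold=2.4):
    """True when the text compresses better than the threshold allows."""
    return compression_ratio(text) > threshold


if __name__ == "__main__":
    looping = "la " * 200  # typical shape of a decoding loop
    normal = "The quick brown fox jumps over the lazy dog."
    print(is_too_repetitive(looping), is_too_repetitive(normal))
```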

21、--logprob_threshold LOGPROB_THRESHOLD

If the average log probability of the decoded tokens is lower than this value, the decoding is considered to have failed

Default value: -1.0

22、--no_speech_threshold NO_SPEECH_THRESHOLD

If the probability of the <|nospeech|> token is higher than this value and decoding has failed because of `logprob_threshold`, the segment is considered silent

Default value: 0.6
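
The three thresholds work together. The sketch below shows one way to express the two decisions described above, taking the measured values as plain numbers; the function names needs_fallback and is_silent are hypothetical and this is not Whisper's own code.

```python
def needs_fallback(compression_ratio, avg_logprob,
                   compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    """Decoding failed if the text is too repetitive or too improbable."""
    too_repetitive = (compression_ratio_threshold is not None
                      and compression_ratio > compression_ratio_threshold)
    too_improbable = (logprob_threshold is not None
                      and avg_logprob < logprob_threshold)
    return too_repetitive or too_improbable


def is_silent(no_speech_prob, avg_logprob,
              no_speech_threshold=0.6, logprob_threshold=-1.0):
    """A segment counts as silent only if the no-speech probability is high
    and the decoded text did not pass the log probability check."""
    if no_speech_threshold is None:
        return False
    silent = no_speech_prob > no_speech_threshold
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        # The text itself looks confident, so keep the segment.
        silent = False
    return silent


if __name__ == "__main__":
    print(needs_fallback(3.0, -0.5), is_silent(0.9, -1.5))
```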

23、--word_timestamps WORD_TIMESTAMPS

Extract word-level timestamps and refine the results based on them (experimental)

Default value: False

24、--prepend_punctuations PREPEND_PUNCTUATIONS

If word_timestamps is True, merge these punctuation marks with the following word

Default: "'“¿([{-

25、--append_punctuations APPEND_PUNCTUATIONS

If word_timestamps is True, merge these punctuation marks with the previous word

Default: "'.。,，!！?？:：”)]}、
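
To show what merging means here, the following simplified sketch works on a plain list of words and single-character punctuation tokens. The function merge_punctuations below is an illustration only: Whisper's real implementation operates on timed tokens, and the two default strings are assumed to match the CLI defaults.

```python
# Assumed to match the CLI defaults for the two options.
PREPEND = "\"'“¿([{-"
APPEND = "\"'.。,，!！?？:：”)]}、"


def merge_punctuations(words, prepended=PREPEND, appended=APPEND):
    """Attach leading marks to the next word and trailing marks to the previous one."""
    # Pass 1: glue prepended marks onto the word that follows them.
    merged, pending = [], ""
    for word in words:
        if len(word) == 1 and word in prepended:
            pending += word
        else:
            merged.append(pending + word)
            pending = ""
    if pending:
        merged.append(pending)  # a trailing mark with no following word

    # Pass 2: glue appended marks onto the word that precedes them.
    result = []
    for word in merged:
        if result and len(word) == 1 and word in appended:
            result[-1] += word
        else:
            result.append(word)
    return result


if __name__ == "__main__":
    print(merge_punctuations(["(", "hello", ",", "world", ")", "!"]))
```

A straight quote appears in both sets; this sketch always treats it as a leading mark, which is one of the simplifications.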

26、--highlight_words HIGHLIGHT_WORDS

Underline each word as it is spoken in srt and vtt output (requires --word_timestamps True)

Default value: False

27、--max_line_width MAX_LINE_WIDTH

Maximum number of characters in a line before the line is broken (requires --word_timestamps True)

Default: None

28、--max_line_count MAX_LINE_COUNT

Maximum number of lines in a segment (requires --word_timestamps True)

Default: None
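
The effect of the two limits can be pictured with a simple greedy wrap: words fill a line until max_line_width would be exceeded, and lines are grouped into segments of at most max_line_count. This is a sketch of the idea only; wrap_subtitle is a hypothetical helper and ignores the word timings that the real subtitle writer uses.

```python
def wrap_subtitle(words, max_line_width, max_line_count):
    """Greedily wrap words into lines, then group lines into segments."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if current and len(candidate) > max_line_width:
            # The word does not fit: close the line and start a new one.
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    # Each segment holds at most max_line_count lines.
    return [lines[i:i + max_line_count]
            for i in range(0, len(lines), max_line_count)]


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog"
    for segment in wrap_subtitle(text.split(), max_line_width=15, max_line_count=2):
        print(segment)
```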

29、--threads THREADS

Number of threads used by torch for CPU inference; overrides MKL_NUM_THREADS/OMP_NUM_THREADS

Default value: 0
