I. Introduction to Whisper
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.
II. Parameters of the whisper command
1、-h, --help
Show the help message listing all whisper parameters, then exit.
2、--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}
Name of the model to use.
Default: small
3、--model_dir MODEL_DIR
The directory where model files are saved.
Default: ~/.cache/whisper
4、--device DEVICE
Device used for PyTorch inference.
Default: cuda if a CUDA device is available, otherwise cpu
5、--output_dir OUTPUT_DIR, -o OUTPUT_DIR
The directory where the output files are saved.
Default: the current directory
6、--output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
Format of the output file; all writes every supported format.
Default: all
7、--verbose VERBOSE
Whether to print progress and debug messages.
Default: True
8、--task {transcribe,translate}
The task to perform:
transcribe: speech recognition in the spoken language (speech to text)
translate: speech translation into English
Default: transcribe
9、--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
Language spoken in the audio, given as a language code or a language name; leave it unset (None) to have Whisper detect the language automatically.
Default: None
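The basic options above can be combined on one command line. The following is a minimal sketch that only assembles the command without running it; build_whisper_command is a hypothetical helper written for this note, and audio.mp3 and subtitles are placeholder names.

```python
import shlex


def build_whisper_command(audio_path, **options):
    """Assemble a whisper command line (hypothetical helper, not part of Whisper)."""
    command = ["whisper", audio_path]
    for name, value in options.items():
        # Each keyword becomes a "--name value" pair, matching the flags listed above.
        command += [f"--{name}", str(value)]
    return command


command = build_whisper_command(
    "audio.mp3",            # placeholder input file
    model="small",
    language="ja",
    task="translate",       # translate Japanese speech into English text
    output_format="srt",
    output_dir="subtitles",  # placeholder output directory
)
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run once whisper is installed.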
10、--temperature TEMPERATURE
Temperature used for sampling.
Default: 0
11、--best_of BEST_OF
Number of candidates to draw when sampling with a non-zero temperature.
Default: 5
12、--beam_size BEAM_SIZE
Number of beams in beam search; applies only when the temperature is 0.
Default: 5
13、--patience PATIENCE
Optional patience value for beam decoding, as described in https://arxiv.org/abs/2204.05424; when unset, the value 1.0 is used, which is equivalent to conventional beam search.
Default: None
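The interaction between the temperature and the beam search and sampling options can be sketched as follows. This is an illustration of the rule described above, not Whisper's own code; options_for_temperature is a hypothetical name.

```python
def options_for_temperature(temperature, beam_size=5, best_of=5, patience=None):
    """Return only the decoding options that apply at the given temperature."""
    if temperature > 0:
        # Sampling: best_of candidates are drawn; beam search settings do not apply.
        return {"temperature": temperature, "best_of": best_of}
    # Temperature 0: beam search is used, so beam_size and patience apply instead.
    return {"temperature": temperature, "beam_size": beam_size, "patience": patience}


print(options_for_temperature(0.0))
print(options_for_temperature(0.4))
```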
14、--length_penalty LENGTH_PENALTY
Optional token length penalty coefficient (alpha), as described in https://arxiv.org/abs/1609.08144; simple length normalization is used when unset.
Default: None
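To make the two normalization modes concrete, here is a small sketch of how a candidate might be scored. It assumes the length penalty formula from the cited paper, ((5 + length) / 6) ** alpha; ranking_score is a hypothetical name, and the exact scoring inside Whisper may differ in detail.

```python
def ranking_score(sum_logprob, length, length_penalty=None):
    """Score a candidate by its summed log probability, normalized for length."""
    if length_penalty is None:
        # Simple length normalization: average log probability per token.
        penalty = length
    else:
        # Length penalty from https://arxiv.org/abs/1609.08144 with coefficient alpha.
        penalty = ((5 + length) / 6) ** length_penalty
    return sum_logprob / penalty


print(ranking_score(-6.0, 12))                     # simple normalization
print(ranking_score(-6.0, 7, length_penalty=1.0))  # alpha = 1.0
```

A higher (less negative) score means the candidate is preferred.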
15、--suppress_tokens SUPPRESS_TOKENS
Comma-separated list of token IDs to suppress during sampling; "-1" suppresses most special characters except common punctuation.
Default: -1
16、--initial_prompt INITIAL_PROMPT
Optional text to provide as a prompt for the first window.
Default: None
17、--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
If True, the model's previous output is provided as a prompt for the next window; disabling this may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop.
Default: True
18、--fp16 FP16
Whether to perform inference in FP16 (half precision).
Default: True
19、--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
Amount by which the temperature is raised on each fallback, that is, whenever decoding fails either of the thresholds below (compression_ratio_threshold, logprob_threshold).
Default: 0.2
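The fallback behaviour can be pictured as a schedule of temperatures tried in order. The sketch below assumes the schedule starts at the temperature option and rises by the increment up to 1.0; fallback_temperatures is a hypothetical helper, not a Whisper function.

```python
def fallback_temperatures(start=0.0, increment=0.2, maximum=1.0):
    """List the temperatures tried in order when decoding keeps failing."""
    if increment <= 0:
        # No increment means no fallback: only the starting temperature is used.
        return [start]
    temperatures = []
    step = 0
    while True:
        # Round to hide floating-point noise such as 0.6000000000000001.
        value = round(start + step * increment, 10)
        if value > maximum + 1e-6:
            break
        temperatures.append(value)
        step += 1
    return temperatures


print(fallback_temperatures())  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```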
20、--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
If the gzip compression ratio of the decoded text is higher than this value, the decoding is treated as failed.
Default: 2.4
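A high compression ratio signals repetitive output, which is typical of a model stuck in a loop. The following sketch measures the ratio with Python's zlib module (the same DEFLATE compression that gzip uses); the sample strings are made up for illustration.

```python
import zlib


def compression_ratio(text):
    """Ratio of the raw UTF-8 size to the compressed size of the text."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


looping = "thank you. " * 50  # repetitive text, as produced by a failure loop
normal = "The quick brown fox jumps over the lazy dog near the river bank."

print(compression_ratio(looping) > 2.4)  # repetitive text exceeds the threshold
print(compression_ratio(normal) > 2.4)   # ordinary text does not
```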
21、--logprob_threshold LOGPROB_THRESHOLD
If the average log probability is lower than this value, the decoding is treated as failed.
Default: -1.0
22、--no_speech_threshold NO_SPEECH_THRESHOLD
If the probability of the <|nospeech|> token is higher than this value and the decoding has failed because of logprob_threshold, the segment is treated as silence.
Default: 0.6
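The three thresholds work together. The sketch below is a simplified reading of the rules described above, not Whisper's actual control flow; classify_segment and its return labels are hypothetical.

```python
def classify_segment(compression_ratio, avg_logprob, no_speech_prob,
                     compression_ratio_threshold=2.4,
                     logprob_threshold=-1.0,
                     no_speech_threshold=0.6):
    """Decide whether a decoded segment is accepted, retried, or treated as silence."""
    too_repetitive = compression_ratio > compression_ratio_threshold
    too_uncertain = avg_logprob < logprob_threshold
    if no_speech_prob > no_speech_threshold and too_uncertain:
        # Likely no speech at all: skip the segment instead of retrying.
        return "silence"
    if too_repetitive or too_uncertain:
        # Decoding failed: retry at the next, higher temperature.
        return "retry"
    return "accept"


print(classify_segment(1.5, -0.3, 0.1))  # accept
print(classify_segment(3.0, -0.3, 0.1))  # retry
print(classify_segment(1.5, -1.4, 0.9))  # silence
```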
23、--word_timestamps WORD_TIMESTAMPS
(Experimental) Extract word-level timestamps and refine the results based on them.
Default: False
24、--prepend_punctuations PREPEND_PUNCTUATIONS
If word_timestamps is True, merge these punctuation marks with the following word.
Default: "'“¿([{-
25、--append_punctuations APPEND_PUNCTUATIONS
If word_timestamps is True, merge these punctuation marks with the previous word.
Default: "'.。,，!！?？:：”)]}、
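The effect of the two punctuation options can be illustrated with a small sketch. It is a simplified stand-in for what Whisper does with word timings, and it operates on plain strings only; merge_punctuations here is written for this note, and the punctuation sets passed in are an ASCII subset chosen for illustration.

```python
def merge_punctuations(words, prepended, appended):
    """Attach leading marks to the next word and trailing marks to the previous word."""
    merged = []
    pending = ""
    for word in words:
        if word and all(ch in prepended for ch in word):
            # Hold opening marks until the following word arrives.
            pending += word
        else:
            merged.append(pending + word)
            pending = ""
    if pending:
        merged.append(pending)

    result = []
    for word in merged:
        if result and word and all(ch in appended for ch in word):
            # Glue closing marks onto the previous word.
            result[-1] += word
        else:
            result.append(word)
    return result


tokens = ["Hello", ",", "(", "world", ")", "!"]
print(merge_punctuations(tokens, prepended="\"'([{-", appended="\"'.,!?:)]}"))
# ['Hello,', '(world)!']
```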
26、--highlight_words HIGHLIGHT_WORDS
Underline each word as it is spoken in srt and vtt output (requires --word_timestamps True).
Default: False
27、--max_line_width MAX_LINE_WIDTH
Maximum number of characters in a line before the line is broken (requires --word_timestamps True).
Default: None
28、--max_line_count MAX_LINE_COUNT
Maximum number of lines in a segment (requires --word_timestamps True).
Default: None
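How a line width and a line count shape subtitle segments can be shown with a greedy wrapping sketch. This is not Whisper's subtitle writer, which also uses word timings; wrap_words is a hypothetical helper that works on text alone.

```python
def wrap_words(words, max_line_width, max_line_count):
    """Greedily wrap words into lines, then group the lines into segments."""
    lines = []
    current = ""
    for word in words:
        candidate = word if not current else current + " " + word
        if current and len(candidate) > max_line_width:
            # The word does not fit: close the current line and start a new one.
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    # Each segment holds at most max_line_count lines.
    return [lines[i:i + max_line_count] for i in range(0, len(lines), max_line_count)]


words = "the quick brown fox jumps over the lazy dog".split()
for segment in wrap_words(words, max_line_width=10, max_line_count=2):
    print(segment)
```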
29、--threads THREADS
Number of threads used by torch for CPU inference; overrides MKL_NUM_THREADS/OMP_NUM_THREADS.
Default: 0