Robust Speech Recognition via Large-Scale Weak Supervision
https://arxiv.org/abs/2212.04356

Abstract & Introduction

680,000 ์‹œ๊ฐ„์˜ ๋‹ค๊ตญ์–ด ํ•™์Šต์„ ์ง„ํ–‰ ์‹œ fine-tuning ์—†์ด zero-shot transfer benchmark ์ˆ˜์ค€์˜ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ์‚ฌ๋žŒ์— ๊ทผ์ ‘ํ•œ accuracy์™€ robustness๋ฅผ ๊ฐ€์ง€๊ฒŒ ๋จ.

๊ธฐ์กด์˜ ๋ฐ์ดํ„ฐ ํ•™์Šต ๋ฐฉ์‹์€ Wave2Vec์„ ์ด์šฉํ•œ ๋น„์ง€๋„ ํ•™์Šต ๋ฐฉ์‹์ด๋‹ค. ์‚ฌ๋žŒ์˜ labeling ์—†์ด ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์„ ์ƒ์‚ฐํ•˜์—ฌ ๋งŽ์€ ์–‘์˜ ๋ฐ์ดํ„ฐ ํ•™์Šต์„ ์ง„ํ–‰์‹œํ‚ค๋ฏ€๋กœ data setting์— ๋Œ€ํ•œ ๋ถ€๋‹ด์„ ์ค„์˜€๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Ÿฌํ•œ ํ•™์Šต ๋ฐฉ๋ฒ•์€ ๊ณ ํ’ˆ์งˆ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ‘œํ˜„๋งŒ ์ข‹์„ ๋ฟ unsupervised data์— ๋Œ€ํ•œ decoder mapping์€ ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง„๋‹ค.

๊ธฐ์กด์˜ ๋ฐฉ๋ฒ•์€  fine-tuning ์‹œ ๋ณต์žกํ•œ ์ž‘์—…์„ ์‹ค๋ฌด์ž๊ฐ€ ์ง„ํ–‰ํ•ด์•ผํ•œ๋‹ค.(risk ์ถ”๊ฐ€, fine-tuning์‹œ ์„ฑ๋Šฅ์ด ์ž˜ ์•ˆ ๋‚˜์˜ฌ ์ˆ˜๋„ ์žˆ์Œ)

Machine learning methods find patterns within a training dataset, but those patterns can be brittle and spurious: models fail on datasets that deviate from the training distribution.

Radford et al. (2021):

  • Fine-tuning a zero-shot CLIP model on ImageNet raised accuracy by 9.2%, but did not improve average robustness across seven other natural distribution-shift test sets.

large-scale์— ๋Œ€ํ•œ ํŽธํ–ฅ์„ ๊ฐ€์ง„ dataset์„ ํ•™์Šต ์‹œํ‚ค๊ณ  ๋‹ค์‹œ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šต์‹œํ‚ฌ ๋•Œ high-quality dataset์˜ ๋ช‡ ๋ฐฐ๋‚˜ ๋˜๊ณ  ์ด์ „ ํ•™์Šต๋ณด๋‹ค ์ ์€ ์–‘์˜ ํ•™์Šต์„ ํ•˜๊ฒŒ ๋œ๋‹ค. (์˜ฌ๋ฐ”๋ฅธ ํ•™์Šต ๋ฏธ๋ฏธ)

OpenAI ์—ฐ๊ตฌ์ง„์€ ๋ฐ์ดํ„ฐ inbalanced ๋ฌธ์ œ๋ฅผ ์ขํžˆ๊ธฐ ์œ„ํ•ด 68,000์‹œ๊ฐ„์˜ labeling์ด ๋œ ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค.

  • Whisper: large-scale weakly supervised speech recognition

์˜์–ด๋ฟ๋งŒ์•„๋‹ˆ๋ผ 117,000 h 96๊ฐœ์˜ ์–ธ์–ด dataset, 125,000h Xโ†’en ์˜ ์˜๋ฌธ ๋ณ€ํ™˜ ๋ฒˆ์—ญ dataset ํ•™์Šต

์ผ๋ฐ˜์ ์œผ๋กœ large-model์— ์žˆ์–ด ๋‹ค๊ตญ์–ด ํ•™์Šต์€ ๋‹จ์ ์ด๋‚˜ ์žฅ์  ๋‘˜ ๋‹ค ์—†๋‹ค.

This suggests that simply scaling weakly supervised pre-training has been underappreciated: the results are achieved without the self-supervision or self-training techniques that recent large-scale speech work relies on.

์Œ์„ฑ ์–ธ์–ด ๋ชจ๋ธ๋ง ์—ฐ๊ตฌ์— ๊ธฐ์—ฌํ•˜๊ธฐ ์œ„ํ•ด OpenAI whisper๋ฅผ ๊ณต๊ฐœํ•จ. 

 

Approach

Data Processing

Trend์— ๋”ฐ๋ผ ML system์˜  web-scale text๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์ตœ์†Œํ•œ์˜ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ๋ฅผ ์ง„ํ–‰

๊ธฐ์กด์˜ ๋ฐฉ์‹๊ณผ ๋‹ค๋ฅด๊ฒŒ Significant standardization(ํ‘œ์ค€ํ™”) ์—†์ด seq2seq model๋กœ audio data์™€ transcript pair ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“ฆ

Skipping text normalization lets the model learn to produce naturalistic transcriptions directly, which simplifies the pipeline.

The resulting dataset is diverse, spanning many environments, recording setups, speakers, and languages.

While diversity in audio quality helps train a robust model, diversity in transcript quality is not similarly beneficial.

๋ฐ์ดํ„ฐ ๊ฒ€์ฆ ๊ฒฐ๊ณผ, ์› ๋ฐ์ดํ„ฐ์— ์ˆ˜์ค€ ์ดํ•˜์˜ ๋ฐ์ดํ„ฐ๋“ค์ด ๋งŽ์ด ๋ฐœ๊ฒฌ๋˜์—ˆ๋‹ค.

์ธํ„ฐ๋„ท์— ์žˆ๋Š” ๋งŽ์€ transcript ๋ฐ์ดํ„ฐ๋Š” ์‹ค์ œ๋กœ ์‚ฌ๋žŒ์ด ์ƒ์„ฑํ•œ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๊ธฐ์กด ASR์˜ ๊ฒฐ๊ณผ๋ฌผ์ด ๋งŽ์•˜๊ณ  ์ธ๊ฐ„๊ณผ ๊ธฐ๊ณ„๊ฐ€ ์ƒ์„ฑํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šตํ•  ๊ฒฝ์šฐ ํ•™์Šต ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ์ €ํ•˜์‹œํ‚จ๋‹ค๋Š” ์—ฐ๊ตฌ ๊ฒฐ๊ณผ๊ฐ€ ์žˆ๋‹ค. (https://arxiv.org/abs/2109.07740)

๊ธฐ๊ณ„์Œ์„ฑ ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•ด heuristics ๊ฐœ๋ฐœ / ๊ธฐ์กด์˜ ๋งŽ์€ ASR์€ ๋ณต์žกํ•œ ๊ตฌ๋‘์ (. , ! ?) ๋‹จ๋ฝ๊ณผ ๊ฐ™์€ ์„œ์‹ ๊ณต๋ฐฑ, ๋Œ€๋ฌธ์ž ๋“ฑ ์˜ค๋””์˜ค vocab์— ์“ฐ๊ธฐ ์–ด๋ ค์šด ๊ฒฝ์šฐ๊ฐ€ ์ •๊ทœํ™”๋ฅผ ํ†ตํ•œ ์ œํ•œ๋œ ์ง‘ํ•ฉ์˜ ๋ฌธ์ž์–ธ์–ด๋งŒ ์‚ฌ์šฉํ–ˆ๋‹ค.

๋งŽ์€ ASR ์‹œ์Šคํ…œ์—๋Š” ์–ด๋Š ์ •๋„์˜ ํ…์ŠคํŠธ ์ •๊ทœํ™”๊ฐ€ ์ง„ํ–‰๋˜์ง€๋งŒ ๋‹จ์ˆœํ•˜๊ฑฐ๋‚˜ rule-base๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค.

An audio language detector is used, created by fine-tuning a prototype model (trained on a prototype version of the dataset) on VoxLingua107.

  • CLD2์— ๋”ฐ๋ผ ์Œ์„ฑ ์–ธ์–ด๊ฐ€ script ๊ฒฐ๊ณผ์™€ ์ผ์น˜ํ•˜๋Š”์ง€ ๋น„๊ต (์ผ์น˜ํ•˜์ง€ ์•Š์œผ๋ฉด train ์ œ์™ธ)
  • Exception: if the transcript is in English but the audio is not, the pair is added to the dataset as an X→en speech translation training example
  • Transcripts are additionally cleaned with fuzzy de-duping (reducing duplication and machine-generated content)
  • ์˜ค๋””์˜ค ํŒŒ์ผ 30์ดˆ segment๋กœ ๋‚˜๋ˆ„๊ณ  ํ•ด๋‹น ์‹œ๊ฐ„ ๋‚ด์—์„œ ๋ฐœ์ƒํ•˜๋Š” transcript ํ•˜์œ„ ์ง‘ํ•ฉ๊ณผ pair ์ง„ํ–‰ (audiofile_1 โ€“ transcript_1, audiofile_2 โ€“ transcript_2 ..)

์Œ์„ฑ์ด ์—†๋Š” ๋ฐ์ดํ„ฐ๋Š” ์Œ์„ฑ ํ™œ๋™ ๊ฐ์ง€๋ฅผ ์œ„ํ•œ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋กœ ํ™œ์šฉ

์ด ๊ณผ์ •์—์„œ ๋ฐ์ดํ„ฐ ์˜ค๋ฅ˜์œจ์„ ์ง‘๊ณ„ํ•˜๊ณ  ํ’ˆ์งˆ์ด ๋‚ฎ์€ ๋ฐ์ดํ„ฐ๋ฅผ ์‹๋ณ„ํ•˜๊ณ  ์ œ๊ฑฐ, ์ˆ˜๋™์ž‘์—…

์Šคํฌ๋ฆฝํŠธ์™€ ์Œ์„ฑ ์ •๋ ฌ์ด ์ž˜๋ชป๋˜์—ˆ๊ฑฐ๋‚˜ ํ•„ํ„ฐ๋ง์— ๊ฑธ๋ฆฌ์ง€ ๋ชปํ•œ ์ด์ƒ ๋ฐ์ดํ„ฐ ๋Œ€๋Ÿ‰ ๋ฐœ๊ฒฌ

To avoid contamination, transcript-level de-duplication was performed between the training set and evaluation datasets considered at risk of overlap, such as TED-LIUM 3.

Model

To keep the findings from being confounded, an established model architecture is used as-is (the contribution is the large-scale weak supervision and data processing, not a new model).

๊ฒ€์ฆ๋œ ์•„ํ‚คํ…์ฒ˜ Transformer(Attention Is All You Need) encoder-decoder๋ฅผ ์‚ฌ์šฉํ•จ.

All audio is resampled to 16,000 Hz, and an 80-channel log-mel spectrogram is computed.

The spectrogram uses 25 ms windows with a stride of 10 ms.

Normalization: input features are globally scaled to lie between -1 and 1, with approximately zero mean.

Stem: Conv1D(kernel=3) + Conv1D(kernel=3, stride=2), each followed by GELU

Sinusoidal position embeddings (fixed positional encodings) are added to the stem output.

N encoder blocks → N decoder blocks (see the sketch below)
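
A minimal PyTorch sketch of the preprocessing constants and the convolutional stem described above; the dimensions follow these notes, while the class and helper names are illustrative rather than the official openai/whisper code:

import torch
import torch.nn as nn

SAMPLE_RATE = 16_000       # all audio resampled to 16 kHz
N_MELS = 80                # 80-channel log-mel spectrogram
WIN_MS, HOP_MS = 25, 10    # 25 ms windows, 10 ms stride

def sinusoids(length: int, channels: int) -> torch.Tensor:
    """Standard fixed sinusoidal position embeddings."""
    inv_freq = 1.0 / (10_000 ** (torch.arange(0, channels, 2) / channels))
    pos = torch.arange(length).float()[:, None] * inv_freq[None, :]
    return torch.cat([pos.sin(), pos.cos()], dim=-1)

class EncoderStem(nn.Module):
    def __init__(self, n_mels: int = N_MELS, d_model: int = 512):
        super().__init__()
        # two conv layers with filter width 3 and GELU; the second has stride 2
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames); a 30 s segment -> 3000 frames at 10 ms hop
        x = torch.nn.functional.gelu(self.conv1(mel))
        x = torch.nn.functional.gelu(self.conv2(x))   # stride 2 halves the time axis
        x = x.permute(0, 2, 1)                        # (batch, time, d_model)
        return x + sinusoids(x.shape[1], x.shape[2])  # add fixed position embeddings

mel = torch.randn(1, N_MELS, 3000)   # dummy 30-second segment
print(EncoderStem()(mel).shape)      # torch.Size([1, 1500, 512])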

 

Multitask Format

๋ถ„์ ˆ๋œ audio๊ฐ€ ์–ด๋–ค ๋‹จ์–ด๊ฐ€ ๋ฐœํ™”๋˜์—ˆ๋Š”์ง€ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์ด speech recognition ๋ถ„์•ผ์—์„œ ์ฃผ๋กœ ๋‹ค๋ฃจ๋Š” ํ•ต์‹ฌ ๋ฌธ์ œ์ด์ง€๋งŒ, ์ด ๋ฌธ์ œ๋งŒ์ด ์žˆ๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋‹ค.

A full speech recognition system involves more components than this core problem:

 

1. Voice activity detection

2. Speaker diarization

3. Inverse text normalization (ITN, e.g. spoken "thirty-one minutes" → "31 minutes")

 

These components are often handled by separate models, which makes the overall system relatively complex.

Whisper๋Š” ๋‹จ์ผ ๋ชจ๋ธ๋กœ ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•จ

 

๋ชจ๋ธ์˜ ์ธํ„ฐํŽ˜์ด์Šค์—์„œ ์ˆ˜ํ–‰ํ•˜๋Š” Task

1. Transcription

2. Translation

3. Voice activity detection

4. Alignment

5. Language identification

 

์ด ๊ฒฝ์šฐ ๋‹จ์ผ ๋ชจ๋ธ์—์„œ  one to many ์ž‘์—…์ด ๋ถˆ๊ฐ€ํ”ผํ•ด resource๊ฐ€ ํ•„์ˆ˜๋กœ ํ•„์š”ํ•˜๋‹ค.

Conditional information is passed to the decoder as a sequence of input tokens.

Start token: <|startoftranscript|>

 

Task specification sequence
First, the language being spoken is predicted via one of 99 unique language tokens (targets come from the VoxLingua107-trained language detector)
If an audio segment contains no speech, the model is trained to predict a <|nospeech|> token indicating this
The next token specifies the task: <|transcribe|> or <|translate|>
A <|notimestamps|> token then indicates whether timestamps are predicted (it is included when they are not)
Timestamp mode
* Times are predicted relative to the current audio segment, quantized to the nearest 20 ms
* Additional vocabulary tokens are added for each of these time values
* Format: <|start_time|> <texttoken_1> … <texttoken_n> <|end_time|> (a start-time token before each segment's text, an end-time token after it)
When the final transcript segment is only partially included in the current 30-second chunk, only its start-time token is predicted in timestamp mode, so that subsequent decoding is performed on an audio window aligned to that time
Otherwise, the audio is truncated so it does not include the partial segment
Lastly, an <|endoftranscript|> token is added
Only the training loss over the previous context text is masked; the model is trained to predict all other tokens (a sketch of the resulting prompt follows)
 

Training Details

Whisper models were trained at multiple scales to study scaling behavior
Trained with data parallelism across accelerators using FP16 (with dynamic loss scaling)
Warmup, learning rate decay, and gradient norm clipping are used to keep gradients stable
Optimizer: AdamW (a sketch of this setup follows at the end of this section)
Batch size: 256
Preventing overfitting:
* Few epochs (only a couple of passes over the data)
* No data augmentation
* No regularization
Instead, robustness relies on the diversity of the large dataset
Early models often produced plausible but incorrect guesses for speakers' names, since many training transcripts include the name of the person speaking
* The models were briefly fine-tuned on the subset of transcripts without speaker annotations, which removes this behavior

Experiments

Zero-shot Evaluation

Whisper์˜ ํŠน์ • ๋ถ„ํฌ์—์„œ ๋†’์€ ๊ฒฐ๊ณผ๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด Dataset ๋ณ„ fine-tuning ์—†์ด ์•ˆ์ •์ ์œผ๋กœ ์ž‘๋™ํ•˜๋Š” single robust speech processing์„ ๊ฐœ๋ฐœํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•จ
To check whether Whisper generalizes across domains, tasks, and languages, a wide suite of existing speech processing datasets is reused for evaluation.
Whisper์˜ ์ผ๋ฐ˜ํ™”๋ฅผ ๋ณด๊ธฐ ์œ„ํ•ด dataset๋“ค์˜ ํ‘œ์ค€ ํ‰๊ฐ€๋Š” train / test๋ฅผ ๋ถ„๋ฆฌํ•˜๋Š” ๋Œ€์‹  train ๋ฐ์ดํ„ฐ์— ์ „ํ˜€ ์‚ฌ์šฉ๋˜์ง€ ์•Š๊ณ  zero-shot setting์—์„œ whisper๋ฅผ evaluateํ•œ๋‹ค.

 

Evaluation Metrics

The standard metric in speech recognition research, WER (word error rate), is used.

   * ๊ทธ๋Ÿฌ๋‚˜ WER์€ string edit distance(levenshtein distance)๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ฐจ์ด๋ฅผ ๊ณ„์‚ฐํ•ด transcript์˜ ํŠน์„ฑ์ด ๋ฌด์‹œ๋จ

   * ์‚ฌ๋žŒ์ด ์ •ํ™•ํ•˜๋‹ค๊ณ  ํŒ๋‹จํ•˜๋Š” transcript๋ฅผ WER์€ ์ž‘์€ ์ฐจ์ด๋กœ ํฌ๊ฒŒ ๊ณ„์‚ฐํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Œ

   * This affects all transcribers, but hits zero-shot models like Whisper especially hard, since they never observe each dataset's particular transcript format

์ธ๊ฐ„์˜ ํŒ๋‹จ๊ณผ ๋” ์ž˜ ๋งž๋Š” metric์„ ๊ฐœ๋ฐœํ•˜๋Š” ๊ฒƒ ๋˜ํ•œ ํ•˜๋‚˜์˜ ์—ฐ๊ตฌ ๋ถ„์•ผ์ด๋‹ค. ํ•˜์ง€๋งŒ ์Œ์„ฑ ์ธ์‹์— ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๊ฒƒ์€ ์•„์ง ์—†์Œ

Before computing WER, text is extensively standardized to minimize penalties for differences that do not change meaning.

Whisper์˜ text normalizer(4.4)๋Š” ๋ฐ˜๋ณต์ ์ธ ์ˆ˜๋™ manual์„ ํ†ตํ•ด ๊ฐœ๋ฐœ

WER์ด ์‚ฌ์†Œํ•œ ์ฐจ์ด๋กœ ์ธํ•ด Whisper ๋ชจ๋ธ์˜ loss๋ฅผ ์˜ฌ๋ฆฌ๋Š” ๊ฒƒ์„ ํŒจํ„ด์„ ํ†ตํ•ด ์ฐพ์Œ

On some datasets, WER dropped by as much as 50%, usually due to quirks such as reference transcripts separating contractions from words with whitespace.

The normalizer code is public (https://github.com/openai/whisper/tree/main/whisper/normalizers); the metric itself is sketched below.
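
For reference, a minimal WER implementation with a toy normalizer; the real normalizer in the linked repo is far more thorough, and these names are illustrative:

def normalize(text: str) -> list[str]:
    """Toy standardization: lowercase, strip basic punctuation, split on whitespace."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level Levenshtein distance / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                            # deletion
                          d[i][j - 1] + 1,                            # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[-1][-1] / max(len(ref), 1)

print(wer("Hello, world!", "hello world"))   # 0.0 after normalization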

English Speech Recognition

The Deep Speech 2 paper reported human-level accuracy when transcribing the LibriSpeech test-clean split.
Its authors concluded there was little room left to improve on clean read speech without domain adaptation.
  * Since then, the LibriSpeech test-clean SOTA WER has dropped another 73%, from 5.3% to 1.4%
  * ์ธ๊ฐ„์˜ Speech Recognition ์ˆ˜์ค€ WER(5.8%) ๋ณด๋‹ค ๋” ๋‚ฎ์Œ
LibriSpeech๋กœ ํ•™์Šต๋œ ๋ชจ๋ธ์€ LibriSpeech train data ์ด์™ธ์˜ dataset์—์„œ๋Š” ์ธ๊ฐ„์˜ ์˜ค๋ฅ˜๋ณด๋‹ค ํผ
Why does this gap appear?
  * ์ธ๊ฐ„์ด ์ƒ๊ฐํ•˜๋Š” Speech Recognition ๋Šฅ๋ ฅ์˜ metric ๋ฐฉ์‹๊ณผ Machine์ด ์ธก์ •ํ•˜๋Š” metric์˜ ์ฐจ์ด๊ฐ€ ์žˆ์Œ
  * Test ๊ณผ์ •์—์„œ ์ธ๊ฐ„๊ณผ Machine์€ ๊ฐ™์€ Test๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค๊ณ  ์ƒ๊ฐํ•˜์ง€๋งŒ ๋‹ค๋ฆ„
  * ์ธ๊ฐ„์€ ์ข…์ข… ์—ฐ๊ตฌ ์ค‘์ธ ํŠน์ • data ๋ถ„ํฌ์— ๋Œ€ํ•œ ์ง€์‹์ด ์—†๋Š” ์ƒํƒœ์—์„œ ํ–‰๋™์„ ํ•˜๊ฒŒ ์š”์ฒญ ๋ฐ›์Œ
  * ์ธ๊ฐ„์˜ ์„ฑ๋Šฅ์€ data์— ์—†๋Š” ๊ฒƒ์„ test ๋ฐ›๋„๋ก ์ˆ˜ํ–‰
  * Machine์€ evaluate ๋ถ„ํฌ๊ฐ€ ์ด๋ฏธ ์ผ๋ฐ˜ํ™”๊ฐ€ ๋˜์–ด์žˆ์Œ
  * Unsupervised test task(Human) vs supervised test task(Machine)
Whisper์˜ train
  * Zero-shot ํ™˜๊ฒฝ์—์„œ evaluate ์ง„ํ–‰์œผ๋กœ ์ธ๊ฐ„์˜ ํ–‰๋™๊ณผ ์ผ์น˜ํ•˜๋„๋ก ๋งŒ๋“ฆ
  * ์ธ๊ฐ„๊ณผ ๊ฐ™์€ ๊ท€๋ฅผ ๊ตฌํ˜„ํ• ์ง€ / suphuman์ด ๋˜์–ด์•ผ ํ• ์ง€

To quantify the human vs. machine gap, average performance across many distributions/datasets is compared
  * Measuring Robustness to Natural Distribution Shifts in Image Classification (Taori et al.)
  * Effective robustness: performing better than expected on out-of-distribution datasets, relative to a model's in-distribution performance
  * LibriSpeech is used as the reference (in-distribution) dataset for characterizing this robustness behavior
Zero-shot Whisper is compared against supervised LibriSpeech models with the closest performance on LibriSpeech test-clean.
Despite matching on LibriSpeech test-clean, zero-shot Whisper makes on average 55.2% fewer (relative) errors on other speech datasets.
์ธ๊ฐ„์˜ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด model์˜ zero-shot & out-of-distribution์„ ์‚ฌ์šฉํ•  ๊ฒƒ์„ ๊ฐ•์กฐ

 

Multi-lingual Speech Recognition

๋‹ค๊ตญ์–ด ์Œ์„ฑ ์ธ์‹์„ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด ๋‘๊ฐ€์ง€ low-data benchmark ์‚ฌ์šฉ

In the zero-shot setting, Whisper performs well on MLS, outperforming XLS-R, mSLAM, and Maestro.

  • Caveat: the result uses Whisper's text standardizer, so it cannot be directly compared against prior SOTA numbers
  • On VoxPopuli, Whisper significantly underperforms prior work, beating only the VP-10K+FT baseline
    • Voxpopuli์—์„œ Whisper ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๋‚ฎ๊ฒŒ ๋‚˜์˜จ ๊ฒƒ์€ ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค์ด unsupervised pre-training data ๋•Œ ์ด dataset์„ ์‚ฌ์šฉํ•˜๊ณ  dataset ํ•™์Šต์— ์žˆ์–ด supervised data๋ฅผ ๋” ๋งŽ์ด ์‚ฌ์šฉํ•ด fine-tuning์— ์ด์ ์ด ์žˆ์„ ๊ฒƒ์ด๋ผ ๋ด„
  • MLS has 10 hours of training data per language, while VoxPopuli has on average roughly 10x more per language.

 

์ด ๋‘ benchmark๋Š” 15๊ฐœ์˜ ์–ธ์–ด๋งŒ ํฌํ•จํ•˜๋ฉฐ, ๋Œ€๋ถ€๋ถ„์ด ์ธ๋„-์œ ๋Ÿฝ๊ถŒ์— ์†ํ•˜๊ณ  ๋ฆฌ์†Œ์Šค๊ฐ€ ๋งŽ์ด ํ•„์š”ํ•œ ์–ธ์–ด์ด๊ธฐ ๋•Œ๋ฌธ์— ๋ฒ”์œ„๊ฐ€ ์ข์Œ

Whisper๋ฅผ ์ข€๋” broadํ•˜๊ฒŒ ์„ฑ๋Šฅ ์ธก์ •์„ ์œ„ํ•ด Fleurs dataset์„ ์‚ฌ์šฉ

  • The relationship between the amount of training data for a language and its zero-shot downstream performance is studied
  • Fig. 3 plots log(WER) against log(training data per language): a strong correlation with coefficient 0.83
  • Estimated: WER halves for every 16x increase in training data (see the sketch after this list)
  • ๋˜ํ•œ ์˜ˆ์ธก์ด ๊ฐ€์žฅ ๋–จ์–ด์ง€๋Š” ํžˆ๋ธŒ๋ฆฌ์–ด(HE), ํƒˆ๋กœ๊ทธ์–ด(TE), ์ค‘๊ตญ์–ด(ZH), ํ•œ๊ตญ์–ด(KO) ๋“ฑ์€ ์ธ๋„-์œ ๋Ÿฝ ์–ธ์–ด์™€ ๋” ๋ฉ€๋ฆฌ ๋–จ์–ด์ ธ ์žˆ๊ณ  unique scripts๋ฅผ ๊ฐ€์ง„ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚จ
    • Linguistic distance
    • Poor match with the byte-level BPE tokenizer
    • Variation in data quality
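
As a quick check, the halving estimate corresponds to a power law WER ∝ (data)^-0.25; this reading is an inference from the notes, not a formula from the paper:

import math

# WER halving per 16x data means WER ~ data**k with 16**k = 0.5
k = -math.log(2) / math.log(16)
print(k)             # -0.25
# e.g. a hypothetical 20% WER with 16x more data:
print(20 * 16 ** k)  # ~10.0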

Translation

Whisper's translation performance is studied on the X→en portion of the CoVoST2 dataset.

Compared against prior models: Maestro, mSLAM, and XLS-R.

Whisper achieves 29.1 BLEU zero-shot, without using any CoVoST2 training data.

  • Likely thanks to the 68,000 hours of X→en translation data in Whisper's pre-training set, vastly more than the 861 hours of X→en training data in CoVoST2

Whisper์˜ ๊ฒ€์ฆ์€ zero-shot์ด๊ธฐ ๋•Œ๋ฌธ์— CoVoST2์—์„œ๋„ ์ ์€ resource๋กœ mSLAM๋ณด๋‹ค 6.7 BLEU ๊ฐœ์„ ๋จ

In the highest-resource grouping, Whisper does not improve over Maestro and mSLAM.

Language identification performance

  • On the Fleurs dataset, Whisper scores lower than prior models, partly because 20 of the Fleurs languages are absent from its training data (see Language Identification below)

The correlation between the amount of translation training data per language and zero-shot Fleurs BLEU is visualized.

Train data์˜ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ๊ฐœ์„ ๋˜๋Š” ์ถ”์„ธ๋Š” ๋ถ„๋ช…ํ•˜์ง€๋งŒ, WER์˜ ์ƒ๊ด€๊ด€๊ณ„์™€ ๋น„๊ตํ–ˆ์„ ๋•Œ 0.83๋ณด๋‹ค ํ›จ์”ฌ ๋‚ฎ์€ 0.24์— ๋ถˆ๊ณผ

  • ๋ถ€๋ถ„์ ์œผ๋กœ ์Œ์„ฑ ์–ธ์–ด ์‹๋ณ„ ์˜ค๋ฅ˜๋กœ ์ธํ•ด train data๊ฐ€ noise๊ฐ€ ๋งŽ๊ธฐ ๋•Œ๋ฌธ์ธ ๊ฒƒ์œผ๋กœ ์ถ”์ •
  • ์›จ์ผ์ฆˆ(CY) ์–ธ์–ด๋ฅผ ์˜ˆ๋ฅผ ๋“ค๋ฉด 9000์‹œ๊ฐ„์˜ translate data๋ฅผ train ํ–ˆ์Œ์—๋„ ์ „์ฒด translate ๋ณด๋‹ค ๋‚ฎ์€ 13 BLEU performance๋ฅผ ๋ณด์ด๊ณ  ์ „์ฒด 4์œ„๋ฅผ ์ฐจ์ง€ํ•˜๋ฉฐ ํ”„๋ž‘์Šค์–ด, ์ŠคํŽ˜์ธ์–ด, ๋Ÿฌ์‹œ์•„์–ด ๋“ฑ ์„ธ๊ณ„์—์„œ ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ์–ธ์–ด๋ณด๋‹ค ์•ž์„œ ์žˆ์Œ
  • ์›จ์ผ์ฆˆ ์–ธ์–ด๋ฅผ ์กฐ์‚ฌํ•ด๋ณด๋‹ˆ ๋Œ€๋ถ€๋ถ„์˜ data์— ์˜์–ด caption์ด ์žˆ๊ณ  indentification system์— ์˜ํ•ด ์˜์–ด audio๊ฐ€ ์›จ์ผ์ฆˆ ์–ธ์–ด์— ์ž˜๋ชป ํฌํ•จ๋˜์–ด์žˆ๋Š” data๊ฐ€ ์žˆ์Œ

 

Language Identification

Evaluation dataset: Fleurs

pretrained ๋˜์–ด ์žˆ์ง€ ์•Š์œผ๋ฏ€๋กœ SOTA ๋Œ€๋น„ 13.6% ๋‚ฎ์€ ์„ฑ๋Šฅ ๋ณด์œ 

However, 20 of the 102 languages in Fleurs are absent from Whisper's training data.

๋‹ค๋ฅธ SOTA ๋ชจ๋ธ์ด 80.4% ์ƒํ•œ ์„ ์„ ๋„˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— 20๊ฐœ ์–ธ์–ด๋ฅผ ์ œ์™ธํ•˜๊ณ  ๋‹ค๋ฅธ ๋ชจ๋ธ๊ณผ ๋น„๊ตํ•˜๊ธฐ๋Š” ํž˜๋“ฆ

Robustness to Additive Noise

Test dataset: LibriSpeech

Model : 14๊ฐœ์˜ LibriSpeech Trained Model, noise robustness of Whisper models

The Audio Degradation Toolbox is used to degrade audio quality.

White noise and pub noise are added.

  • Pub noise: ambient noise and indistinct chatter, as in a noisy restaurant or pub environment

14๊ฐœ์˜ ๋ชจ๋ธ ์‚ฌ์ด 12๊ฐœ์˜ ๋ชจ๋ธ์€ LibriSpeech์— ๋Œ€ํ•ด pre-trainded model ๋˜๋Š” fine-tuning ๋œ ๋ชจ๋ธ์ด๋ฉฐ, ๋‚˜๋จธ์ง€ ๋‘๊ฐœ์˜ ๋ชจ๋ธ์€ SpeechStew์™€ ๊ฐ™์€ ์„ ํ–‰ ๋ชจ๋ธ๊ณผ ์œ ์‚ฌํ•œ ํ˜ผํ•ฉ dataset์— ๋Œ€ํ•ด ํ•™์Šต๋œ NVIDIA STT models์ž„

The signal-to-noise ratio (SNR) for each degraded sample is computed from signal power.

  • SNR = P_signal / P_noise (in decibels: SNR_dB = 10 · log10(P_signal / P_noise)); see the sketch below
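
A small sketch of mixing noise into a clean signal at a target SNR, following that definition; the function and array names are illustrative:

import numpy as np

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # choose scale s so that p_signal / (s**2 * p_noise) == 10**(snr_db / 10)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise[: len(signal)]   # noise assumed >= signal length

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 100, 16_000))        # 1 s of dummy 16 kHz audio
noisy = mix_at_snr(clean, rng.normal(size=16_000), snr_db=10.0)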

WER as noise increases (i.e., as SNR decreases):

  • ์ €์žก์Œ(40dB SNR)์—์„œ whisper ๋Šฅ๊ฐ€(LibriSpeech ๊ธฐ๋ฐ˜ train)
  • 14๊ฐœ์˜ ๋ชจ๋ธ์€ 10dB ๋ฏธ๋งŒ์—์„œ Whisper๋ณด๋‹ค ์„ฑ๋Šฅ์ด ๋‚˜๋น ์ง
  • ์ด๋Š” whisper์˜ Robustness to Additive Noise ์„ฑ๋Šฅ

 

Long-form Transcription

Chunk size: 30 seconds (the model cannot consume longer audio in one pass)

Chunk size์— ๋Œ€ํ•œ ISSUE

  • ์งง์€ ๋ฐœํ™” audio๋กœ ๊ตฌ์„ฑ๋œ train dataset์—์„œ๋Š” ๋ฌธ์ œ๊ฐ€ ๋˜์ง€ ์•Š์ง€๋งŒ ๋ช‡๋ถ„ ๋ช‡ ์‹œ๊ฐ„ ๋ถ„๋Ÿ‰์˜ audio๋ฅผ ํ…์ŠคํŠธ๋กœ ๋ณ€ํ™˜ํ•ด์•ผํ•˜๋Š” ์ž‘์—…์—์„œ๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒ
  • 30์ดˆ ๋ถ„๋Ÿ‰์˜ audio segment๋ฅผ ์—ฐ์†์œผ๋กœ transcriptํ•˜๊ณ  model์ด ์˜ˆ์ธกํ•œ timestamp์— ๋”ฐ๋ผ window๋ฅผ ์ด๋™ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๊ธด audio์˜ buffering์„ transcriptํ•˜๋Š” ์ „๋žต

To transcribe long audio reliably, heuristics based on the repetitiveness of the model's predictions and their log probability are used.

์ด ๊ณผ์ •์—์„œ beam  search์™€ temperature scheduling์ด ์ค‘์š” (4.5)

To measure long-form performance, seven datasets spanning diverse recording lengths and conditions are used (chosen to cover varied data distributions):

  • TED-LIUM3์˜ full-length TED ๊ฐ•์—ฐ
  • The Late Show Jargon-laden(with Stephen Colbert)
  • Videos/podcasts
  • online blogs (Rev16 and Kincaid46)
  • recordings of earnings calls
  • the full-length interviews from the Corpus of Regional African American Language (CORAAL)
  • Full detail longform dataset Appendix A

Results

4๊ฐœ์˜ ์ƒ์—…์šฉ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” ASR model์„ ๋น„๊ต
  * WER ๋น„๊ตํ•ด๋ณด๋ฉด earnings-21, CORAAL ๋นผ๊ณ  ์ค€์ˆ˜
  * NVIDIA STT Conformer-CTC NeMo(Large model)์„ ๋ชจ๋“  dataset์—์„œ ๋Šฅ๊ฐ€
