728x90

https://github.com/openai/whisper 

 

GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Robust Speech Recognition via Large-Scale Weak Supervision - GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision

github.com

 

Paper Review 

Abstract & Introduction

  • 680,000 ์‹œ๊ฐ„์˜ ๋‹ค๊ตญ์–ด ํ•™์Šต์„ ์ง„ํ–‰ ์‹œ fine-tuning ์—†์ด zero-shot transfer benchmark ์ˆ˜์ค€์˜ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ์‚ฌ๋žŒ์— ๊ทผ์ ‘ํ•œ accuracy์™€ robustness๋ฅผ ๊ฐ€์ง€๊ฒŒ ๋จ.
  •  ๊ธฐ์กด์˜ ๋ฐ์ดํ„ฐ ํ•™์Šต ๋ฐฉ์‹์€ Wave2Vec์„ ์ด์šฉํ•œ ๋น„์ง€๋„ ํ•™์Šต ๋ฐฉ์‹์ด๋‹ค. ์‚ฌ๋žŒ์˜ labeling ์—†์ด ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์„ ์ƒ์‚ฐํ•˜์—ฌ ๋งŽ์€ ์–‘์˜ ๋ฐ์ดํ„ฐ ํ•™์Šต์„ ์ง„ํ–‰์‹œํ‚ค๋ฏ€๋กœ data setting์— ๋Œ€ํ•œ ๋ถ€๋‹ด์„ ์ค„์˜€๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Ÿฌํ•œ ํ•™์Šต ๋ฐฉ๋ฒ•์€ ๊ณ ํ’ˆ์งˆ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ‘œํ˜„๋งŒ ์ข‹์„ ๋ฟ unsupervised data์— ๋Œ€ํ•œ decoder mapping์€ ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง„๋‹ค.
  • ๊ธฐ์กด์˜ ๋ฐฉ๋ฒ•์€  fine-tuning ์‹œ ๋ณต์žกํ•œ ์ž‘์—…์„ ์‹ค๋ฌด์ž๊ฐ€ ์ง„ํ–‰ํ•ด์•ผํ•œ๋‹ค.(risk ์ถ”๊ฐ€, fine-tuning์‹œ ์„ฑ๋Šฅ์ด ์ž˜ ์•ˆ ๋‚˜์˜ฌ ์ˆ˜๋„ ์žˆ์Œ)
  • ๋จธ์‹  ๋Ÿฌ๋‹ method๋Š” ๊ฐ™์€ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ•™์Šต ํŒจํ„ด์„ ์ฐพ๋Š”๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ outlier(brittle, spurious)์™€ ์ •์ƒ์น˜์™€ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์— ์˜ํ•ด ํ•™์Šต์ด ์ž˜๋˜์ง€ ์•Š๋Š”๋‹ค.
    • Radford et al 
      • ImageNet classification์—์„œ ๊ฐ™์€ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ํด๋ž˜์Šค๋ฅผ 7๊ฐ€์ง€ ๋‹ค๋ฅธ ๋ถ„๋ฅ˜๋กœ ์„ธ๋ถ„ํ™”ํ–ˆ์„ ๋•Œ acc 9.2% ์ฆ๊ฐ€ ์‹œ์ผฐ๋‹ค.  
  • large-scale์— ๋Œ€ํ•œ ํŽธํ–ฅ์„ ๊ฐ€์ง„ dataset์„ ํ•™์Šต ์‹œํ‚ค๊ณ  ๋‹ค์‹œ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šต์‹œํ‚ฌ ๋•Œ high-quality dataset์˜ ๋ช‡ ๋ฐฐ๋‚˜ ๋˜๊ณ  ์ด์ „ ํ•™์Šต๋ณด๋‹ค ์ ์€ ์–‘์˜ ํ•™์Šต์„ ํ•˜๊ฒŒ ๋œ๋‹ค. (์˜ฌ๋ฐ”๋ฅธ ํ•™์Šต ๋ฏธ๋ฏธ)
  • OpenAI ์—ฐ๊ตฌ์ง„์€ ๋ฐ์ดํ„ฐ inbalanced ๋ฌธ์ œ๋ฅผ ์ขํžˆ๊ธฐ ์œ„ํ•ด 68,000์‹œ๊ฐ„์˜ labeling์ด ๋œ ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค.
    • Whisper: *weakly supervised speech recognition
  • ์˜์–ด๋ฟ๋งŒ์•„๋‹ˆ๋ผ 117,000 h 96๊ฐœ์˜ ์–ธ์–ด dataset, 125,000h Xโ†’en ์˜ ์˜๋ฌธ ๋ณ€ํ™˜ ๋ฒˆ์—ญ dataset ํ•™์Šต
  • * large-model์— ์žˆ์–ด ๋‹ค๊ตญ์–ด ํ•™์Šต์€ ๋‹จ์ ์ด๋‚˜ ์žฅ์  ๋‘˜ ๋‹ค ์—†๋‹ค.
  • ์ตœ๊ทผ weakly supervised pre-training์ด ์ €ํ‰๊ฐ€๋จ์— ์žˆ์–ด lage-scale dataset์„ ํ•™์Šต ์‹œ self-supervision ๋˜๋Š” self-trainig์— ๋Œ€ํ•œ ๊ณ ์ฐฐ์ด ํ•„์š”ํ•˜๋‹ค. 
  • ์Œ์„ฑ ์–ธ์–ด ๋ชจ๋ธ๋ง ์—ฐ๊ตฌ์— ๊ธฐ์—ฌํ•˜๊ธฐ ์œ„ํ•ด OpenAI whisper๋ฅผ ๊ณต๊ฐœํ•จ. 

 

 

๋ฐ˜์‘ํ˜•
๋‹คํ–ˆ๋‹ค