
Model Info

  • End-to-end speech recognition model released by Baidu (China) in December 2015
  • Applies mel spectrograms to the audio data
    • A plain Fourier transform cannot tell where in time each audio feature occurs.
      • Apply the STFT (short-time Fourier transform): running the FT over short windows of the audio preserves where each feature occurs
    • Human hearing is sensitive to differences at low frequencies and much less sensitive at high frequencies.
      • Convert frequency to the mel scale, a unit that follows human perception
        • Mel(f) = 2595 * log10(1 + f / 700)
    • The mel features pass through a CNN and an RNN, after which CTC (Connectionist Temporal Classification) is applied
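
The two preprocessing ideas above can be sketched in plain Python. This is only an illustration, not the model's actual feature pipeline: the function names, the toy two-tone signal, the 4000 Hz sample rate, and the 64-sample frame length are all assumptions chosen so the numbers come out clean. A naive DFT is used per frame to keep the example dependency-free; a real pipeline would use an FFT library.

```python
import cmath
import math


def hz_to_mel(f_hz: float) -> float:
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def stft_peak_hz(signal, sample_rate, frame_len, hop):
    """Return the dominant frequency (Hz) of each short frame.

    Each frame is Hann-windowed and transformed with a naive O(n^2) DFT,
    which is slow but keeps the sketch free of third-party imports.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    peaks = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        peak_bin = max(range(len(mags)), key=mags.__getitem__)
        peaks.append(peak_bin * sample_rate / frame_len)
    return peaks


def tone(freq_hz, n_samples, sample_rate):
    """Pure sine tone, used only to build the toy signal below."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


# Toy signal (made up): 0.1 s at 250 Hz followed by 0.1 s at 1000 Hz.
SAMPLE_RATE = 4000
signal = tone(250, 400, SAMPLE_RATE) + tone(1000, 400, SAMPLE_RATE)
peaks = stft_peak_hz(signal, SAMPLE_RATE, frame_len=64, hop=64)
```

The early frames peak at the first tone and the late frames at the second, which is the time information a single Fourier transform over the whole signal would lose. `hz_to_mel` shows the compression of high frequencies: 1000 Hz maps to roughly 1000 mel, while doubling the frequency to 2000 Hz adds far less than another 1000 mel.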
       

 

CTC (Connectionist Temporal Classification)

  • Advantages
    • Learns how well the audio sequence matches the text sequence without any separate frame-level labeling of the audio data
    • Uses conditional probabilities over the sequence, P(S_t|S_t+1), to treat alignments that collapse to the same text as the same transcription ("-" is the blank symbol): C(hel-lo) = C(hhel-lo) = C(h-el-lo) = hello
  • Disadvantages
    • Working with conditional probabilities across the sequence increases the amount of computation.
      • Beam computation (beam search) avoids redundant operations
    • Applying the mel transform together with CTC changes the feature frames, so training may not proceed properly.
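
The collapse function C and the cost of summing over alignments can be sketched as follows. This is a minimal illustration under the standard CTC rule (merge consecutive repeats, then drop blanks), not the code used by the model or by any library; `ctc_collapse`, `ctc_prob_bruteforce`, and the uniform toy probabilities are hypothetical names and values introduced here.

```python
import itertools

BLANK = "-"  # blank symbol, written "-" as in the notes


def ctc_collapse(alignment: str, blank: str = BLANK) -> str:
    """Merge consecutive repeated symbols, then drop blanks."""
    out = []
    prev = None
    for ch in alignment:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)


def ctc_prob_bruteforce(frame_probs, target, blank=BLANK):
    """Sum the probability of every alignment that collapses to `target`.

    frame_probs: one dict per frame mapping symbol -> probability.
    All len(alphabet) ** T alignments are enumerated, so this is only
    usable for tiny inputs; it exists to show where the cost comes from.
    """
    symbols = sorted(frame_probs[0])
    total = 0.0
    for path in itertools.product(symbols, repeat=len(frame_probs)):
        if ctc_collapse("".join(path), blank) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= frame_probs[t][sym]
            total += p
    return total


# Toy input (made up): 3 frames, uniform probabilities over {blank, a, b}.
uniform = [{BLANK: 1 / 3, "a": 1 / 3, "b": 1 / 3} for _ in range(3)]
p_a = ctc_prob_bruteforce(uniform, "a")  # 6 of the 27 alignments collapse to "a"
```

Under this rule a blank is required between the two l's: `ctc_collapse("hel-lo")` yields `hello`, while `ctc_collapse("hello")` merges the repeated l. The brute-force sum grows exponentially with the number of frames, which is why practical implementations rely on dynamic programming for the loss and beam search for decoding.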

 

Train Compose (AI Hub Korean speech data, 17 GB)

Using Kospeech

batch size 32
init_lr_scale 0.01
final_lr_scale 0.05
optimizer adam
init_lr 0.000001
final_lr 0.000001
 

Result

Elapsed Time 137.68h
Epoch 18 / 70 
CER 0.26
loss 0.419
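
For reference, CER (character error rate) is conventionally the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below shows that definition; it is not the evaluation code that produced the number above, and details such as whether spaces are counted are assumptions that vary between toolkits.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against a non-empty reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Read this way, a CER of 0.26 means roughly one character edit for every four reference characters.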

 

 
