
"Should You Mask 15% in Masked Language Modeling?" 

https://arxiv.org/abs/2202.08005

 


 

 BERT ๋…ผ๋ฌธ์€ (2024.3.24) ๊ธฐ์ค€ 95486๋ฒˆ ์ธ์šฉ๋  ์ •๋„๋กœ NLP์—์„œ ๋น ์งˆ ์ˆ˜ ์—†๋Š” ๋ชจ๋ธ์ด๋‹ค. Generation ๋ชจ๋ธ์ด ์œ ๋ช…ํ•ด์ง€๊ธฐ ์ „๊นŒ์ง€ pre-training์ด๋ผ๋Š” ๊ฐœ๋…๊ณผ NLP์—์„œ NLU๋กœ Task ํ™•์žฅ์˜ ๊ฐ€๋Šฅ์„ฑ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ ์ฃผ์š” ๋“ฑ์žฅํ•˜๋Š” Task์ธ Masked LM, Next Sentence Prediction NSP ์ค‘ Masked LM์˜ ๋น„์œจ์ด ์™œ? 15% ์ธ๊ฐ€ ๊ถ๊ธˆํ•ด ๋ณธ ๋…ผ๋ฌธ์„ ๊ฒ€์ƒ‰ํ•ด๋ณด์•˜๋‹ค.

 

"Should You Mask 15% in Masked Language Modeling?" (2022) reports experiments in which the masking rate of BERT models is varied systematically.

 

 

1) The effect of the masking rate

๋งˆ์Šคํ‚น ๋น„์œจ์ด๋ž€ ์›๋ณธ ๋ฌธ์žฅ์—์„œ ๋งˆ์Šคํ‚น๋œ ํ† ํฐ์˜ ๋น„์œจ์„ ์˜๋ฏธํ•œ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ BERT๋ชจ๋ธ์—์„œ 15%์˜ ๋งˆ์Šคํ‚น ๋น„์œจ์ด ์‚ฌ์šฉ๋˜๋Š”๋ฐ ์›๋ณธ ๋ฌธ์žฅ์—์„œ 15%์˜ ํ† ํฐ์ด ๋งˆ์Šคํ‚น๋˜๊ณ  ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šต์ด ์ด๋ฃจ์–ด์ง„๋‹ค. ์—ฐ๊ตฌ ๊ฒฐ๊ณผ, ๋งˆ์Šคํ‚น ๋น„์œจ์ด ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ–ˆ๋‹ค.

  • When pre-training from scratch and fine-tuning on GLUE and SQuAD, a 40% masking rate outperforms the standard 15% for the BERT-large model.
  • Interestingly, even at very high masking rates (up to 80%), large models still learn good representations and largely retain their pre-trained performance on downstream tasks.

 ๊ฒฐ๊ณผ์ ์œผ๋กœ ์ผ๋ฐ˜์ ์ธ ๋ชจ๋ธ์—์„œ๋„ ๋งˆ์Šคํ‚น ๋น„์œจ์„ ๋†’์ด๋Š” ๊ฒƒ์ด ๋” ์ข‹์€ ํ•™์Šต ๋ชจ๋ธ์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์„ ๊ฒƒ์„ ์‹œ์‚ฌ.

 

2) How masking works and why: a factor analysis

 ๋งˆํ‚น์ด ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋Š” ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค. ๋งˆ์Šคํ‚น์„ ๋†’์ด๋ฉด ๋ชจ๋ธ์€ ๋” ๋งŽ์€ ๋ฌธ๋งฅ์„ ์˜ˆ์ธกํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋” ๋งŽ์€ ์ •๋ณด๋ฅผ ๋ฐฐ์šธ ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ง€๋งŒ, ๋„ˆ๋ฌด ๋†’์€ ๋งˆ์Šคํ‚น ์ž‘์—…์€ ๋ชจ๋ธ์—๊ฒŒ ์–ด๋ ค์šด ์˜ˆ์ธก ๊ณผ์ œ๋ฅผ ์ œ๊ณตํ•˜๊ธฐ ์œ„ํ•ด ์ ์ ˆํ•œ ๋งˆ์Šคํ‚น์„ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค.

 ๋งˆ์Šคํ‚น์˜ ์„ฑ๋Šฅ์€ ๋‘ ๊ฐ€์ง€ ์š”์ธ์ด ์žˆ๋‹ค. ์ฒซ ๋ฒˆ์งธ ์š”์ธ์€ ๋ฌธ๋งฅ์˜ loss๋กœ, ๋งˆ์Šคํ‚น๋œ ํ‘œํ˜„์˜ ์˜๋ฏธ์ž…๋‹ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ ์š”์ธ์€ Prediction rate์œผ๋กœ, ์˜ˆ์ธก์ด ์˜ˆ์ธกํ•˜๋Š” ์˜๋ฏธ๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ์ด ๋‘ ๊ฐ€์ง€ ์š”์ธ์€ ์ƒ๋ฐ˜๋œ ํšจ๊ณผ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ, ์˜ˆ์ธก ๋ฒ”์œ„๊ฐ€ ๋†’์Œ์„ ์ข€ ๋” ๋ชจ๋ธ์˜ ํ•™์Šต ์‹ ํ˜ธ๊ฐ€ ๋” ๋งŽ์ด ์ƒ์„ฑ๋˜์–ด ์ตœ์ ํ™”์— ๋„์›€์ด ๋˜๋Š” ๊ฒฝ์šฐ, ๋ฌธ๋งฅ์˜ ์ •๋ณด๊ฐ€ ์ ์„ ์ˆ˜๋ก ์˜ˆ์ธก์ด ๋” ์–ด๋ ค์›Œ์ง€๋ฏ€๋กœ ์ ์ ˆํ•œ ๋ฒ”์œ„๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ์ค‘์š” (๋„ˆ๋ฌด ๋งŽ์€ ์ •๋ณด์˜ ์†์‹ค์€ ๊ณผ์ ํ•ฉ์„ ๋ถˆ๋Ÿฌ์˜จ๋‹ค๋Š” ์˜๋ฏธ)

 

 

 ๋ณธ ๋…ผ๋ฌธ์—์„œ GLUE, SQuAD์— ๋Œ€ํ•ด์„œ๋งŒ ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•˜๊ณ  ํ‘œ์ค€ ์ผ๋ฐ˜์ ์ธ ์—ฐ๊ตฌ์—์„œ ์ •๋Ÿ‰์ ์ธ ํšจ๊ณผ๋ฅผ ๊ฒ€์ฆํ•˜๋ คํ–ˆ์œผ๋‚˜ NLP์˜ ์ ์šฉ ๋ถ„์•ผ Task(NLU, sentiment classification ๋“ฑ) ๋งˆ๋‹ค ๋‹ค๋ฅธ ํšจ๊ณผ๋ฅผ ๊ฐ€์ง€๊ธฐ์— ์ œํ•œ์ ์ธ ์„ค๋ช…์„ ํ•œ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Ÿฐ ์—ฐ๊ตฌ๋ฅผ ํ† ๋Œ€๋กœ ๋งˆ์Šคํ‚น ๋น„์œจ๋งŒ์œผ๋กœ ํ•™์Šตํ•˜๋ ค๋Š” ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๋†’์ด๊ณ  ์‹œ๊ฐ„์„ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ๋ ค์ค€๋‹ค.  

๋ฐ˜์‘ํ˜•
๋‹คํ–ˆ๋‹ค