
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) is one example showing that image classification can be performed without a CNN. There had been earlier attempts to combine Transformers with vision, but no prior work trained them at a truly large scale. ViT addresses the problems that arise when the training set and the label space are large, and it leans heavily on the advantages of attention: less computation than a comparable CNN and a strong pre-trained model that transfers to downstream tasks.
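
The core idea behind the title is to treat an image as a sequence of 16x16 patch "words". Below is a minimal sketch of that input pipeline in PyTorch, assuming the ViT-Base configuration from the paper (224x224 input, 16x16 patches, 768-dim embeddings); the class and variable names are illustrative, not the authors' reference implementation.

```python
# Minimal sketch of the ViT input pipeline (assumed ViT-Base sizes).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        # A strided convolution cuts the image into non-overlapping patches
        # and linearly projects each patch in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) -- the "16x16 words"
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the [CLS] token
        return x + self.pos_embed            # add learned position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The resulting token sequence is then fed to a standard Transformer encoder, and the final [CLS] token state is used for classification.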

It addresses two issues relative to the CNN state of the art.

1. CNNs (Convolutional Neural Networks) bake translation equivariance into the architecture; ViT drops this built-in bias and compensates with self-attention and large-scale training.

*translation equivariance : if the object in an image shifts, a convolutional feature map shifts by the same amount, so the network's response follows the object's position (see the sketch after this list).


2. ๋ชจ๋ธ ์ผ๋ฐ˜ํ™” ๊ฐœ์„ 

* inductive bias : the extra assumptions a model relies on to make accurate predictions for inputs it was not given during training

์˜ˆ์ธกํ•ด์•ผํ•˜๋Š” label์ด ๋งŽ์„ ๊ฒฝ์šฐ globalํ•œ training์ด ์ˆ˜ํ–‰ํ•ด์•ผํ•˜๋Š”๋ฐ CNN์˜ ๊ฒฝ์šฐ local์ ์œผ๋กœ ์–ป๋Š” ์ •๋ณด๊ฐ€ ๋งŽ์•„ Transformer ์‚ฌ์šฉ (์˜ˆ์ธก Label์ด ์ ์„ ๊ฒฝ์šฐ CNN ๊ณ„์—ด ์‚ฌ์šฉ์ด ๋” ์„ฑ๋Šฅ์ด ์ข‹๋‹ค.)


paper : https://arxiv.org/pdf/2010.11929.pdf

๋ฐ˜์‘ํ˜•
๋‹คํ–ˆ๋‹ค