The Path to Achieve Ultra-Low Inference Latency With LLaMA 65B on PyTorch/XLA
๐Ÿ—ฃ๏ธ Natural Language Processing
BACKGROUND & STATE OF THE ART

In the natural language processing (NLP) domain, language models generate a token (e.g., a word) from a sequence of past input tokens. Large Language Models (LLMs) are the latest deep learning innovation in this space, designed to generate text in a human-like fashion. These models typically use transformers to improve attention over large sequences of input tokens. LLaMA, open-sourced by Meta AI, is a powerful foundation LLM trained on over 1 trillion tokens. LLaMA is competitive with many best-in-class models such as GPT-3, Chinchilla, and PaLM, and LLaMA (13B) outperforms GPT-3 on most benchmarks.
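The autoregressive loop described above, where each new token is produced from the sequence of all past tokens, can be sketched as follows. This is a minimal illustration, not LLaMA's actual inference code: `toy_next_token` is a hypothetical stand-in for a transformer forward pass over the context.

```python
def toy_next_token(tokens):
    # A real LLM would run a transformer over `tokens` and pick the
    # next token from a distribution over the vocabulary; here we use
    # a deterministic toy function of the context for illustration.
    return (sum(tokens) + len(tokens)) % 100

def generate(prompt, n_new):
    """Greedy autoregressive decoding: append one token at a time,
    feeding the entire past sequence back into the model each step."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(toy_next_token(tokens))
    return tokens

print(generate([1, 2, 3], 1))  # the prompt plus one generated token
```

Because every step re-consumes the whole growing sequence, naive decoding cost grows with context length; this per-step dependency is exactly what makes inference latency the central challenge for serving LLMs.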