The Path to Achieve Ultra-Low Inference Latency With LLaMA 65B on PyTorch/XLA

Parameter count	Parameters (MB)	Cache (MB)	Total (GB)	Min. TPU v4 chips
7B	14,000	134	14.128	1
33B	66,000	408	66.41	3
65B	130,000	671	130.67	5
175B	350,000	1,208	351.21	11
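
The arithmetic behind the table above can be sketched as follows. This is a rough reconstruction, not the original authors' calculation. It assumes 2 bytes per parameter (16-bit weights), 32 GB of HBM per TPU v4 chip, and decimal units (1 GB = 1,000 MB). The cache sizes are copied from the table rather than derived, and the names `footprint`, `MODELS`, `HBM_PER_CHIP_GB`, and `BYTES_PER_PARAM` are illustrative only.

```python
import math

HBM_PER_CHIP_GB = 32  # assumption: HBM capacity of one TPU v4 chip
BYTES_PER_PARAM = 2   # assumption: 16-bit (bf16/fp16) weights

# (label, parameter count in billions, cache size in MB copied from the table)
MODELS = [("7B", 7, 134), ("33B", 33, 408), ("65B", 65, 671), ("175B", 175, 1208)]


def footprint(params_billion, cache_mb):
    """Return (weights in MB, total in GB, minimum chip count)."""
    # 1e9 parameters * 2 bytes = 2,000 MB per billion parameters
    weights_mb = params_billion * 1000 * BYTES_PER_PARAM
    total_gb = (weights_mb + cache_mb) / 1000
    # Weights and cache must fit in the combined HBM of the chips
    chips = math.ceil(total_gb / HBM_PER_CHIP_GB)
    return weights_mb, total_gb, chips


for label, params_billion, cache_mb in MODELS:
    weights_mb, total_gb, chips = footprint(params_billion, cache_mb)
    print(f"{label}\t{weights_mb:,}\t{cache_mb:,}\t{total_gb:.2f}\t{chips}")
```

Under these assumptions the chip counts match the table (1, 3, 5, 11). The 7B total comes out as 14.134 GB rather than the table's 14.128, which suggests the original figure used a slightly different unit convention for the cache.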

