
Google Vertex AI Python SDK

 

 The gemini tokenizer was originally not public, so counting tokens meant calling the API, which was a hassle. So I posted a question on Google Cloud Community. The gist was a request to release it so that tokens could be counted locally, as with tiktoken. Without knowing the token count, you have to keep adjusting the request to fit max_token when using gemini-api, which means calling the API several times. Perhaps aware of this problem, Google exposed token counting through an API, but I was not happy with having to make two API calls either.

 

 With tiktoken, OpenAI publishes its tokenizer and vocab, so API costs are billed transparently. ChatGPT's vocab is cl100k_base, which has 100,256 entries used for inference. Only 281 of them are Korean, so you have probably heard many times that GPT-3.5 and GPT-4 perform well but are very slow and consume a lot of tokens in Korean, which is a problem when using them in a service. In the result below, even a simple answer differs by roughly 36 percent in token count.

 

cl100k_base 39 vs 25 o200k_base

 

 cl100k_base contains 281 Korean tokens, while o200k_base contains 2,360, about 8 times as many. Tokenization really matters for inference, and it is a pity to think how many tokens were wasted over the past year before this was addressed, but I am grateful that it was improved at all.
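One plausible way to arrive at per-language vocabulary figures like these is to decode each token's bytes and check for Hangul syllables. The sketch below is an assumption about the method, not necessarily how the 281 and 2,360 figures were obtained; `contains_hangul` and `count_hangul_tokens` are hypothetical helper names, and the toy vocabulary stands in for a real one.

```python
def contains_hangul(token_bytes: bytes) -> bool:
    """Return True if the token decodes as UTF-8 and holds a Hangul syllable."""
    try:
        text = token_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Byte-level BPE vocabularies contain partial UTF-8 sequences; skip them.
        return False
    # U+AC00..U+D7A3 is the Hangul Syllables block.
    return any("\uac00" <= ch <= "\ud7a3" for ch in text)


def count_hangul_tokens(vocab) -> int:
    """Count vocabulary entries (bytes) containing at least one Hangul syllable."""
    return sum(1 for token in vocab if contains_hangul(token))


# Toy vocabulary for illustration only (not real tokenizer data).
toy_vocab = [
    b"hello",
    " 안녕".encode("utf-8"),
    "하세요".encode("utf-8"),
    b"\xec\x95",  # truncated UTF-8 sequence, as byte-level vocabularies may contain
    b"!",
]
print(count_hangul_tokens(toy_vocab))  # 2
```

With tiktoken installed, the real vocabulary could be passed in as `tiktoken.get_encoding("cl100k_base").token_byte_values()`. The result depends on how "Korean token" is defined, so it may not match the figures above exactly.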

 

 Now that we know why tokens matter, let's use the gemini tokenizer locally through the Google Vertex AI Python SDK.

 

1. Install vertexai

python -m pip install --upgrade "google-cloud-aiplatform[tokenization]"

 

2. Run the tokenization module

from vertexai.preview import tokenization

# Load the tokenizer for the target model so tokens can be counted locally.
model_name = "gemini-1.5-flash-001"
tokenizer = tokenization.get_tokenizer_for_model(model_name)

# Sample Korean greeting and reply to tokenize.
contents = "안녕하세요. 안녕하세요! 궁금한 점이 있나요? 어떻게 도와드릴까요?"
result = tokenizer.count_tokens(contents)

print(f"{result.total_tokens = :,}")
# result.total_tokens = 28
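
With a local counter, a prompt can be trimmed to a token budget before making a single API call. This is a minimal sketch, not part of the SDK: `fit_to_budget` is a hypothetical helper, and `count_tokens` is assumed to be any callable that returns a token count for a string (for example a wrapper returning `tokenizer.count_tokens(text).total_tokens`). It also assumes the count never decreases as the prefix grows, which holds for the stub counter below but is only approximately true for real tokenizers.

```python
from typing import Callable


def fit_to_budget(text: str, count_tokens: Callable[[str], int], budget: int) -> str:
    """Return the longest prefix of `text` whose token count is within `budget`.

    Uses binary search over the prefix length, assuming the token count
    does not decrease as the prefix grows.
    """
    if count_tokens(text) <= budget:
        return text
    lo, hi = 0, len(text)  # invariant: text[:lo] fits, text[:hi] does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if count_tokens(text[:mid]) <= budget:
            lo = mid
        else:
            hi = mid
    return text[:lo]


# Stub counter for illustration: one "token" per whitespace-separated word.
def word_count(s: str) -> int:
    return len(s.split())


trimmed = fit_to_budget("one two three four five", word_count, 3)
print(repr(trimmed))
```

Binary search keeps the number of counting calls logarithmic in the text length, which matters less for a local tokenizer than for a remote one but still avoids re-tokenizing every prefix.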

 

  Looking at the difference for the same input and output tokens, and comparing with the cl100k_base (39) and o200k_base (25) counts above, the tokenizers rank as follows for Korean efficiency:

 

o200k_base > gemini-tokenizer > cl100k_base 

o200k_base 547 vs 620 gemini tokenizer
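
The ranking can be made concrete by expressing each count relative to o200k_base. This is plain arithmetic on the figures quoted in this post; nothing here calls a tokenizer.

```python
# Token counts quoted in this post for the same Korean text.
counts = {"o200k_base": 25, "gemini": 28, "cl100k_base": 39}

baseline = counts["o200k_base"]
# Print from most to least efficient (fewest tokens first).
for name, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{name}: {n} tokens, {n / baseline:.2f}x o200k_base")

# The second comparison quoted above: 547 (o200k_base) vs 620 (gemini tokenizer).
second_ratio = 620 / 547
print(f"gemini / o200k_base on the second comparison: {second_ratio:.2f}x")
```

Both comparisons put the gemini tokenizer roughly 12 to 13 percent above o200k_base, while cl100k_base needs about 56 percent more tokens than o200k_base on the short text.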

 

 Since GPT-4o (o200k_base) is the most recently released model, it presumably reflects more Korean, but it is still far less efficient than English. This will only improve once there is a Korean foundation model, and realistically that is hard to achieve. It is a pity that base models built abroad handle Korean better, but I still think the research has to continue. If an LLM that reflects one country's ethics or view of history becomes established in education or is deployed as a service in another country, it is likely to cause serious problems. Naver's HyperCLOVA X, LG AI Research's EXAONE 2.0, Samsung Electronics' Gauss, NCSOFT's VARCO: even if their performance is not great today, I am rooting for them in the hope that they will catch up later.

 

 

References

https://www.googlecloudcommunity.com/gc/AI-ML/Please-share-Gemini-tokenize-information/m-p/773390

 

Re: Please share Gemini tokenize information


 

https://platform.openai.com/playground/chat?models=gpt-3.5-turbo&models=gpt-4o

 

https://orange-mansion.com/news/240418_fm/

 

Orange Mansion - "There is no foundation model developed by Korea"


 

 

 
