LLM + Chain Tool
Recently, GPT agent tools built on LangChain have been appearing at a rapid pace. They use an LLM to decide, automatically, which Tool to invoke.
One result: building the website you want, automatically - Auto-GPT
https://youtu.be/gWy-pJ2ofEM?si=f3pADRKEIZMsdhB2
Building a search bot on top of a Q&A knowledge base
https://youtu.be/cFCGUjc33aU?si=s7m0nw4MjKzaoQII
All kinds of services are now being built on top of LLMs.
However, the resources actually available to us are limited. We could build a service on Meta's LLaMA or on Dolly, which Databricks released as open source, but even after spending millions of won on training such a model, serving it still requires expensive GPUs. In the end, the practical choice is OpenAI's API.
Managing OpenAI API tokens efficiently
The ChatGPT service we use handles queries in chat form, so every request has to be phrased as full sentences. Because the same task must be re-defined step by step before each question, this consumes a surprising number of tokens (ChatGPT: $0.002 / 1K tokens). Tokens are money: roughly 2 won per 1,000 tokens is not something you can dismiss at scale. Moreover, since these are not Korean-specific LLMs, tokenizing Korean input costs even more tokens.
https://devocean.sk.com/blog/techBoardDetail.do?ID=164758&boardType=techBlog
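To get a feel for what "$0.002 / 1K tokens" means at scale, here is a rough back-of-the-envelope sketch. The exchange rate (1,300 KRW/USD) and the traffic figures are illustrative assumptions, not numbers from this post.

```python
# Rough cost arithmetic for the gpt-3.5-turbo price quoted above
# ($0.002 / 1K tokens). Exchange rate and traffic are assumptions.
PRICE_PER_1K_TOKENS_USD = 0.002
KRW_PER_USD = 1300.0  # illustrative rate

def monthly_cost_krw(tokens_per_request, requests_per_day, days=30):
    """Estimated monthly spend in KRW for a fixed request pattern."""
    total_tokens = tokens_per_request * requests_per_day * days
    usd = total_tokens / 1000 * PRICE_PER_1K_TOKENS_USD
    return usd * KRW_PER_USD

# 500 tokens per request, 10,000 requests a day:
print(round(monthly_cost_krw(500, 10_000)))  # 390000 KRW per month
```

Per-request the cost looks negligible, but multiplied by daily traffic it becomes a real line item, which is why the token savings below matter.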
In short, to use tokens efficiently, it is better to work with embedding vectors in the form OpenAI's GPT expects rather than with raw words; the same question can then be processed as a much shorter request.
Building a QnA bot from 10 QnA documents using similarity
1) Register the OpenAI API key & load the documents
import os
import time
import pandas as pd
import openai
import re
import requests
import sys
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity
import tiktoken

openai.api_key = os.getenv("OPENAI_API_KEY")
openai.organization = os.getenv("OPENAI_ORGANIZATION")

start_time = time.time()
path = './QnA/'

d = []
text = ""
for root, directories, files in os.walk(path, topdown=False):
    for file in files:
        if file.lower().endswith(".txt"):
            name = os.path.join(root, file)
            with open(name, "r", encoding="utf-8") as f:
                for line in f:
                    text += line
            d.append({'FILE NAME': file, 'CONTENT': text})
            text = ""
df = pd.DataFrame(d)  # assign the result so the later steps can use df

end_time = time.time()
duration = end_time - start_time
print("Script Execution:", duration)
2) Preprocess the sentences
# preprocess the input sentences
def normalize_text(s, sep_token=" \n "):
    s = re.sub(r'\s+', ' ', s).strip()  # collapse whitespace runs into single spaces
    s = re.sub(r". ,", "", s)
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    s = s.replace("\n", "")
    s = s.replace("#", "")
    s = s.strip()
    if s == "":
        s = "<blank>"
    return s

df_normalized = df.copy()
df_normalized['CONTENT'] = df["CONTENT"].apply(lambda x: normalize_text(x))
3) Tokenize
tokenizer = tiktoken.get_encoding("cl100k_base")
df_tok=df_normalized.copy()
df_tok['n_tokens'] = df_normalized["CONTENT"].apply(lambda x: len(tokenizer.encode(x)))
df_tok
4) Compute the token cost
# Based on https://openai.com/api/pricing/ on 01/29/2023
# If you were using this for approximating pricing with Azure OpenAI adjust the values below with: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/
#MODEL USAGE
#Ada v1 $0.0040 / 1K tokens
#Babbage v1 $0.0050 / 1K tokens
#Curie v1 $0.0200 / 1K tokens
#Davinci v1 $0.2000 / 1K tokens
#MODEL USAGE
#Ada v2 $0.0004 / 1K tokens
#This Ada model, text-embedding-ada-002, is a better and lower cost replacement for our older embedding models.
n_tokens_sum = df_tok['n_tokens'].sum()  # n_tokens was added to df_tok in step 3
ada_v1_embeddings_cost = (n_tokens_sum/1000) *.0040
babbage_v1_embeddings_cost = (n_tokens_sum/1000) *.0050
curie_v1_embeddings_cost = (n_tokens_sum/1000) *.02
davinci_v1_embeddings_cost = (n_tokens_sum/1000) *.2
ada_v2_embeddings_cost = (n_tokens_sum/1000) *.0004
print("Number of tokens: " + str(n_tokens_sum) + "\n")
print("MODEL VERSION COST")
print("-----------------------------------")
print("Ada" + "\t\t" + "v1" + "\t$" + '%.8s' % str(ada_v1_embeddings_cost))
print("Babbage" + "\t\t" + "v1" + "\t$" + '%.8s' % str(babbage_v1_embeddings_cost))
print("Curie" + "\t\t" + "v1" + "\t$" + '%.8s' % str(curie_v1_embeddings_cost))
print("Davinci" + "\t\t" + "v1" + "\t$" + '%.8s' % str(davinci_v1_embeddings_cost))
print("Ada" + "\t\t" + "v2" + "\t$" + '%.8s' %str(ada_v2_embeddings_cost))
Davinci, which also serves the GPT-3 API, is clearly expensive; Ada v2 is about 500x cheaper. If you are only using embeddings for classification, Ada v2 is the economical choice, and most LangChain-based services use it. Define the task through classification first, and reserve Davinci or ChatGPT for the final completion step.
Model dimension
Ada(1024)
Babbage(2048)
Curie(4096)
Davinci(12288)
def generate_embeddings(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

df['ada_v2_embedding'] = df.CONTENT.apply(lambda x: generate_embeddings(x, model='text-embedding-ada-002'))
len(df['ada_v2_embedding'][1])
# 1536
The output dimension of ada-v2 is 1536, i.e. 768 × 2: twice the embedding size of a typical BERT model.
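Step 5 ranks documents with `cosine_similarity` from `openai.embeddings_utils`. Written out in pure Python over two embedding vectors, that computation is just the cosine of the angle between them:

```python
import math

# What cosine_similarity computes, spelled out: dot product divided by
# the product of the vector norms. 1.0 means identical direction,
# 0.0 means orthogonal (unrelated) vectors.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the norms divide out, the score depends only on direction, not magnitude, which is why it works well for comparing embedding vectors.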
5) Score the similarity against the input text and show the top 3 most similar documents.
# search embedded docs based on cosine similarity
def get_embedding(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

def search_docs(df, user_query, top_n=3, to_print=True):
    embedding = get_embedding(user_query, model="text-embedding-ada-002")
    df_similarities = df.copy()  # work on a copy of the frame holding ada_v2_embedding
    df_similarities["similarities"] = df.ada_v2_embedding.apply(
        lambda x: cosine_similarity(x, embedding)
    )
    res = df_similarities.sort_values("similarities", ascending=False).head(top_n)
    if to_print:
        display(res)
    return res

question = input("What can I help you with?\n\n")
res = search_docs(df, question, top_n=3)
https://github.com/seohyunjun/openAI_API_token/blob/main/openaiAPI_embedding.ipynb