
LLM + Chain Tool 

 

Welcome to LangChain 🦜🔗 LangChain 0.0.175 (python.langchain.com)

 

Recently, third-party GPT tools built on LangChain have been pouring out. An LLM is used to pick and invoke the right tool automatically.

 

One result: building a real, working site with Auto-GPT

https://youtu.be/gWy-pJ2ofEM?si=f3pADRKEIZMsdhB2

 

 

Building a knowledge-based Q&A search bot

https://youtu.be/cFCGUjc33aU?si=s7m0nw4MjKzaoQII

 

 

A wide range of services are being built on top of LLMs.

 

But with the resources actually available to us, we would have to build a service on Meta's LLaMA or on Dolly, the model Databricks released as open source. Training alone runs to millions of won, and even a finished model demands high-spec GPUs just to serve it. In the end there is little choice but to use OpenAI's API.

 

OpenAI API ํ† ํฐ ํšจ์œจ์ ์œผ๋กœ ๊ด€๋ฆฌํ•˜๊ธฐ

The ChatGPT service we use answers chat-style queries, so every request must be sent as full sentences. Because the same task has to be re-stated with each question, this burns a considerable number of tokens (ChatGPT: $0.002 / 1K tokens). Tokens are money, and roughly 2 won per 1,000 tokens is not something to dismiss. On top of that, since this is not a Korean-specific LLM, tokenizing Korean text leaks even more tokens.

 

https://devocean.sk.com/blog/techBoardDetail.do?ID=164758&boardType=techBlog

ChatGPT handles Korean well, so is a separate Korean language model even necessary? (devocean.sk.com)

 

To sum up: to use tokens efficiently, work with embedding vectors in the form OpenAI's GPT expects rather than raw words; the same question can then be handled with a much shorter prompt.
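The comparison behind this can be sketched with plain numpy. The toy 4-dimensional vectors below stand in for the real 1536-dimensional ada embeddings, and the numbers are made up for illustration:

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity = dot product of the two L2-normalized vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy embeddings: a query and two candidate documents
query = [0.9, 0.1, 0.0, 0.2]
doc_similar = [0.8, 0.2, 0.1, 0.3]    # points in nearly the same direction
doc_unrelated = [0.0, 0.9, 0.9, 0.0]

print(cosine_sim(query, doc_similar))    # close to 1
print(cosine_sim(query, doc_unrelated))  # much smaller
```

Once each document is embedded once, every new question costs only one short embedding call instead of re-sending the documents as prompt text.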

 

10๊ฐ€์ง€์˜ QnA ๋ฌธ์„œ๊ฐ„ ์œ ์‚ฌ๋„๋ฅผ ํ†ตํ•ด QnA bot ๋งŒ๋“ค๊ธฐ

 

1) Register the OpenAI API key & load the documents

import os
import time
import re
import sys

import numpy as np
import pandas as pd
import requests
import tiktoken
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

openai.api_key = os.getenv("OPENAI_API_KEY")
openai.organization = os.getenv("OPENAI_ORGANIZATION")

start_time = time.time()
path = './QnA/'

d = []

# collect every .txt file under ./QnA/ into one record per file
for root, directories, files in os.walk(path, topdown=False):
    for file in files:
        if file.lower().endswith(".txt"):
            name = os.path.join(root, file)
            with open(name, "r", encoding="utf-8") as f:
                text = f.read()
            d.append({'FILE NAME': file, 'CONTENT': text})

# build the DataFrame once, after the walk
df = pd.DataFrame(d)

duration = time.time() - start_time
print("Script Execution:", duration)

 

2) Preprocess the text

# normalize the input text before tokenizing / embedding
def normalize_text(s, sep_token=" \n "):
    # collapse runs of whitespace into single spaces
    s = re.sub(r'\s+', ' ', s).strip()
    s = re.sub(r". ,", "", s)
    # clean up stray punctuation artifacts
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    s = s.replace("\n", "")
    s = s.replace("#", "")
    s = s.strip()
    if s == "":
        s = "<blank>"
    return s

df_normalized = df.copy()
df_normalized['CONTENT'] = df["CONTENT"].apply(normalize_text)

 

3) Tokenize 

tokenizer = tiktoken.get_encoding("cl100k_base")
df_tok=df_normalized.copy()
df_tok['n_tokens'] = df_normalized["CONTENT"].apply(lambda x: len(tokenizer.encode(x)))
df_tok

 

4) Calculate token cost

# Based on https://openai.com/api/pricing/ on 01/29/2023
# If you were using this for approximating pricing with Azure OpenAI adjust the values below with: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/

#MODEL	USAGE
#Ada     v1	$0.0040 / 1K tokens
#Babbage v1	$0.0050 / 1K tokens
#Curie   v1	$0.0200 / 1K tokens
#Davinci v1	$0.2000 / 1K tokens

#MODEL	USAGE
#Ada     v2	$0.0004 / 1K tokens
#This Ada model, text-embedding-ada-002, is a better and lower cost replacement for our older embedding models.

n_tokens_sum = df_tok['n_tokens'].sum()

ada_v1_embeddings_cost = (n_tokens_sum/1000) *.0040
babbage_v1_embeddings_cost = (n_tokens_sum/1000) *.0050
curie_v1_embeddings_cost = (n_tokens_sum/1000) *.02
davinci_v1_embeddings_cost = (n_tokens_sum/1000) *.2

ada_v2_embeddings_cost = (n_tokens_sum/1000) *.0004

print("Number of tokens: " + str(n_tokens_sum) + "\n")

print("MODEL        VERSION    COST")
print("-----------------------------------")
print("Ada" + "\t\t" + "v1" + "\t$" + '%.8s' % str(ada_v1_embeddings_cost))
print("Babbage" + "\t\t" + "v1" + "\t$" + '%.8s' % str(babbage_v1_embeddings_cost))
print("Curie" + "\t\t" + "v1" + "\t$" + '%.8s' % str(curie_v1_embeddings_cost))
print("Davinci" + "\t\t" + "v1" + "\t$" + '%.8s' % str(davinci_v1_embeddings_cost))
print("Ada" + "\t\t" + "v2" + "\t$" + '%.8s' %str(ada_v2_embeddings_cost))

 

Davinci, being the flagship GPT-3 API, is clearly quite expensive. Ada v2, on the other hand, is 500x cheaper. If the job is classification over plain embeddings, Ada v2 is the economical choice, and most LangChain services use it: define the task via embedding-based classification, and only call a Davinci completion or ChatGPT at the final step.
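The 500x figure follows directly from the price table above:

```python
davinci_v1 = 0.2000  # $ per 1K tokens
ada_v2 = 0.0004      # $ per 1K tokens

ratio = davinci_v1 / ada_v2
print(round(ratio))  # 500
```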

 

 

Model embedding dimensions:

Ada (1024)

Babbage (2048)

Curie (4096)

Davinci (12288)

 

def generate_embeddings(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']
 
df['ada_v2_embedding'] = df.CONTENT.apply(lambda x: generate_embeddings(x, model='text-embedding-ada-002'))

len(df['ada_v2_embedding'][1])
# 1536

 

ada_v2์˜ Output dimension์€ 1536์ด๋‹ค. 768 * 2 ์ผ๋ฐ˜ Bert ๋ชจ๋ธ์— 2๋ฐฐ์˜ embedding size๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. 

5) Compute similarity against the input text and show the top 3 most similar documents.

 

# search embedded docs based on cosine similarity

def get_embedding(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

def search_docs(df, user_query, top_n=3, to_print=True):
    embedding = get_embedding(user_query, model="text-embedding-ada-002")

    df = df.copy()
    df["similarities"] = df.ada_v2_embedding.apply(lambda x: cosine_similarity(x, embedding))

    res = df.sort_values("similarities", ascending=False).head(top_n)
    if to_print:
        display(res)
    return res

question = input("What can I help you with?\n\n")

res = search_docs(df, question, top_n=3)
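The retrieved documents can then feed the final completion step described above (classify cheaply with Ada v2 embeddings, answer with ChatGPT). A minimal sketch: `build_prompt` is a hypothetical helper of my own, `docs` would in practice be `res['CONTENT'].tolist()`, and the `openai.ChatCompletion.create` call (pre-1.0 openai SDK, matching the code above) is left commented out because it needs a live API key:

```python
def build_prompt(question, docs, max_docs=3):
    # concatenate the top-ranked documents as context for the final answer
    context = "\n---\n".join(docs[:max_docs])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# stand-in for res['CONTENT'].tolist()
docs = ["Q: How do I reset my password? A: Use the 'Forgot password' link.",
        "Q: How do I change my email? A: Go to account settings."]
prompt = build_prompt("How can I reset my password?", docs)

# answer = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": prompt}],
# )["choices"][0]["message"]["content"]
```

Only the short question and the few retrieved snippets are sent, instead of all 10 documents on every query.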

 

 

 

https://github.com/seohyunjun/openAI_API_token/blob/main/openaiAPI_embedding.ipynb

 

GitHub - seohyunjun/openAI_API_token: openAI API token information (github.com)

 

๋ฐ˜์‘ํ˜•
Done.