
https://betterprogramming.pub/openais-embedding-model-with-vector-database-b69014f04433

 

OpenAI's Embedding Model With Vector Database

The updated Embedding model offers State-of-the-Art performance with 4x longer context window. The new model is 90% cheaper. The smaller…

betterprogramming.pub

 

Introduction

In December 2022, OpenAI updated its embedding model to text-embedding-ada-002. The new model offers:

  • 90%-99.8% lower cost
  • 1/8 the embedding dimension size, which reduces vector database cost
  • A unified endpoint for ease of use
  • State-of-the-art performance for text search, code search, and sentence similarity
  • A context window increased from 2048 to 8192

 

This tutorial walks through the embedding endpoint using a clustering task, then stores and retrieves the embeddings in a vector database. It covers the following questions about embedding models and vector databases:

  • Why was cost a problem with the previous embedding endpoints?
  • How can the embedding model be used in practice for NLP tasks?
  • What is a vector database?
  • How do you integrate OpenAI text embeddings with a vector database service?
  • How do you query a vector database?

You need OpenAI API access to follow this tutorial. Embedding 300 reviews costs a few cents in tokens.

text-embedding-ada-002: $0.0004 per 1K tokens
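A quick back-of-the-envelope check of the "few cents" figure. This sketch assumes the $0.0004 price is per 1,000 tokens and that a review averages about 200 tokens; the average is a made-up number, so substitute a token count from your own data.

```python
PRICE_PER_1K_TOKENS = 0.0004  # USD; the per-1K-token unit is an assumption


def estimate_cost(total_tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Estimated embedding cost in USD for a given number of tokens."""
    return total_tokens / 1000 * price_per_1k


n_reviews = 300
avg_tokens_per_review = 200  # hypothetical average length
cost = estimate_cost(n_reviews * avg_tokens_per_review)
print(f"about ${cost:.3f} for {n_reviews} reviews")
```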

 

!pip install plotly
!pip install -U scikit-learn
import os
import time # optional 
import pandas as pd
import numpy as np
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity
import matplotlib
import matplotlib.pyplot as plt

 

Add the OpenAI API key. This example reads it from an environment variable set on Windows.

openai.api_key = os.getenv("OPENAI_API_KEY")
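os.getenv returns None without complaint when the variable is missing, so the problem only shows up later as an authentication error. A small guard like the one below fails early instead. `require_env` is a hypothetical helper, not part of the openai package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage with the line above:
# openai.api_key = require_env("OPENAI_API_KEY")
```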

 

This example uses only text embeddings and t-SNE.

 

Define the embedding function

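# Local helper; shadows the get_embedding imported from openai.embeddings_utils above.
def get_embedding(text, model="text-embedding-ada-002"):
   # Replace newlines with spaces before sending the text to the endpoint.
   text = text.replace("\n", " ")
   # openai.Embedding.create is the pre-1.0 openai-python interface.
   return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']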

 

Convert the Context data to embedding vectors (embedding dim = 1536)

df['ada_embedding'] = df['Context'].apply(lambda x : get_embedding(x, model="text-embedding-ada-002"))
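The apply call above sends one request per row, so a single rate-limit or network error aborts the whole run. One way to soften that is a retry wrapper with exponential backoff. This is a generic sketch: `with_retry` is a hypothetical helper, and it catches every Exception for brevity, where real code would catch only the client's transient errors.

```python
import time


def with_retry(fn, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Possible usage with the helper defined earlier:
# df['ada_embedding'] = df['Context'].apply(lambda x: with_retry(get_embedding, x))
```

The `sleep` parameter exists only so the delay can be replaced in tests; the default is time.sleep.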

 

Convert the embeddings to a NumPy array (one row per text)

matrix = np.array(df.ada_embedding.to_list())
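The import list pulls in cosine_similarity, although the rest of this example never calls it. For reference, here is what cosine similarity computes, written in plain Python on toy vectors. `cosine_sim` is an illustrative name, and the three vectors are made up.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v1 = [1.0, 0.0, 1.0]
v2 = [2.0, 0.0, 2.0]  # same direction as v1, different length
v3 = [0.0, 1.0, 0.0]  # orthogonal to v1

print(cosine_sim(v1, v2))
print(cosine_sim(v1, v3))
```

The same function works on two rows of the matrix above, for example cosine_sim(matrix[0], matrix[1]).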

 

What is t-SNE?

 

t-SNE (TSNE) converts the affinities of data points into probabilities. Affinities in the original space are represented by Gaussian joint probabilities, and affinities in the embedded space are represented by Student's t-distributions. This makes t-SNE particularly sensitive to local structure and gives it a few other advantages over existing techniques:

- Revealing structure at many scales on a single map
- Revealing data that lie in multiple, different manifolds or clusters
- Reducing the tendency to crowd points together at the center
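For reference, the two kinds of affinities described above are usually written as follows. The notation follows the standard t-SNE formulation rather than anything in this post: x are the original points, y the 2-D map points, N the number of points.

```latex
% High-dimensional affinities (Gaussian)
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Low-dimensional affinities (Student t with one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective: KL divergence between the two distributions
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Each bandwidth \sigma_i is chosen so that the conditional distribution around x_i matches the requested perplexity, which is the perplexity argument passed to TSNE.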

 

Dimensionality reduction with t-SNE

t-SNE is a decomposition method that takes the distribution of the data into account; it is commonly used to visualize clusters of text embeddings.

 

 

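from sklearn.manifold import TSNE

# Reduce the 1536-dimensional embeddings to 2 dimensions for plotting.
tsne = TSNE(n_components=2, perplexity=15, random_state=11, init='random', learning_rate=200)
vis_dims = tsne.fit_transform(matrix)
vis_dims.shape  # (number of rows, 2)
x = vis_dims[:, 0]
y = vis_dims[:, 1]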

 

Add the 2-D coordinates as columns

df['embedding_x'] = x
df['embedding_y'] = y

 

 

Define a random color function to generate the point colors for the plot

import random

def random_color():
    """
    Generate a random color in hexadecimal format.
    """
    r = random.randint(0, 255)
    g = random.randint(0, 255)
    b = random.randint(0, 255)
    return "#{:02x}{:02x}{:02x}".format(r, g, b)
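Random colors can come out nearly identical for two categories. If that is a problem, one alternative is to space the hues evenly around the color wheel. This is a sketch using only the standard library; `distinct_colors` and its default saturation and brightness are arbitrary choices, not part of the original example.

```python
import colorsys


def distinct_colors(n: int, saturation: float = 0.65, value: float = 0.9):
    """Return n hex colors whose hues are evenly spaced around the color wheel."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        colors.append("#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255)))
    return colors


palette = distinct_colors(5)
print(palette)
```

The returned list can stand in for the list of random_color() calls when building the scatter plot.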

 

Create the scatter plot

fig, ax = plt.subplots(figsize=(15, 10))
categories = df['category'].unique()  # use one column for both the colors and the filtering
for color, category in zip([random_color() for _ in categories], categories):
    p = df[df['category']==category]
    ax.scatter(p['embedding_x'], p['embedding_y'], color=color, alpha=0.3,label=f'{category}')
    avg_x = p['embedding_x'].mean()
    avg_y = p['embedding_y'].mean()
    ax.scatter(avg_x, avg_y, marker="x", color=color, s=100)


ax.set_title("Context embeddings by category, visualized in 2D with t-SNE (text-embedding-ada-002)", fontsize=14)
plt.legend()
plt.show()

The sample dataset is private.

๋ฐ˜์‘ํ˜•