
Llama-Index with Pinecone

์ด ๋…ธํŠธ๋ถ์—์„œ๋Š” semantic-search๋ฅผ ์œ„ํ•ด Pinecone๊ณผ llama-index(์ด์ „์˜ GPT-index) ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค€๋‹ค. ์ด ๋…ธํŠธ๋ถ์€ llama-index์˜ ์˜ˆ์‹œ์ด๋ฉฐ ํ–ฅํ›„ ๋ฆด๋ฆฌ์Šค์—์„œ๋Š” Pinecone ์˜ˆ์ œ ์ €์žฅ์†Œ์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

1) install packages

!pip install -qU llama-index datasets pinecone-client openai transformers

 

2) Load the SQuAD dataset

  • Each SQuAD record carries a Wikipedia passage (context) and its article title
from datasets import load_dataset

data = load_dataset('squad', split='train')
data = data.to_pandas()[['id', 'context', 'title']]
data.drop_duplicates(subset='context', keep='first', inplace=True)
data.head()
len(data)
  • DataFram์„ llama_index๋กœ ์ธ๋ฑ์‹ฑ
  • ๊ฐ ๋ฌธ์„œ์— ํ…์ŠคํŠธ ๊ตฌ์ ˆ, ๊ณ ์œ  ID, ๋ฌธ์„œ ์ œ๋ชฉ ์ถ”๊ฐ€
from llama_index import Document

docs = []

for i, row in data.iterrows():
    docs.append(Document(
        text=row['context'],
        doc_id=row['id'],
        extra_info={'title': row['title']}
    ))
docs[0]
print(f"document count: {len(docs)}")

 

3) Set the OpenAI API key and initialize SimpleNodeParser

  • m1 mac์—์„œ๋Š” SimpleNodeParser๋ฅผ default๋กœ ์‚ฌ์šฉ
import os

os.environ['OPENAI_API_KEY'] = 'OPENAI_API_KEY'  # platform.openai.com
from llama_index.node_parser import SimpleNodeParser


# parser = SimpleNodeParser()
parser = SimpleNodeParser.from_defaults()

nodes = parser.get_nodes_from_documents(docs)
nodes[0]
len(nodes)
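Conceptually, the node parser splits each document's text into smaller, often overlapping chunks (nodes) that fit an embedding model's input window. A minimal stdlib sketch of that idea — the `chunk_text` helper and its sizes are illustrative, not llama-index internals:

```python
def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks overlap."""
    chunks = []
    step = chunk_size - overlap  # advance less than chunk_size so chunks overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("a" * 50, chunk_size=20, overlap=5)
print(len(chunks))     # 3 nodes produced from one 50-character document
print(len(chunks[0]))  # each node holds at most chunk_size characters -> 20
```

Real parsers chunk by tokens or sentences rather than raw characters, but the overlap serves the same purpose: context near a chunk boundary is not lost.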

 

4) Indexing in Pinecone

  • Pinecone: a managed vector database service designed for ML applications
  • It is mainly used to store embedding vectors produced by LLMs (large language models) and to serve semantic, similarity-based search over them.
  1. Initialize Pinecone with the API key and environment, both available for free from the console (https://app.pinecone.io/), then create a new index.
  2. The index has dimension 1536 and uses cosine similarity; embeddings are typically produced with text-embedding-ada-002, a cheap and fast embedding model.
  • PINECONE_API_KEY: the Pinecone API key
  • index_name: the name of the Pinecone index
  • PINECONE_ENVIRONMENT: the name of the Pinecone server environment
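The cosine metric chosen for the index scores two embedding vectors by the angle between them, ignoring magnitude. A plain-Python sketch with 3-dimensional toy vectors in place of the real 1536-dimensional ada-002 embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# same direction -> 1.0 (scaling a vector does not change the score)
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Pinecone computes this score between the query embedding and every stored vector, returning the highest-scoring matches.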
import pinecone

# find API key in console at app.pinecone.io
os.environ['PINECONE_API_KEY'] = 'PINECONE_API_KEY'
# environment is found next to API key in the console
os.environ['PINECONE_ENVIRONMENT'] = 'PINECONE_ENVIRONMENT'

# initialize connection to pinecone
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)

# create the index if it does not exist already
index_name = 'llama-index-intro'
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,
        metric='cosine'
    )

# connect to the index
pinecone_index = pinecone.Index(index_name)

 

  • Pinecone ์ธ๋ฑ์Šค๋กœ PineconeVectorStore๋ฅผ ์ดˆ๊ธฐํ™”
  • PineconeVectorStore๋Š” Pinecone์˜ vector database์—์„œ document embedding์„ ์œ„ํ•œ ์ €์žฅ ๋ฐ ๊ฒ€์ƒ‰ ์ธํ„ฐํŽ˜์ด์Šค ์—ญํ• ์„ ์ œ๊ณต
from llama_index.vector_stores import PineconeVectorStore

# we can select a namespace (acts as a partition within an index);
# it is not passed below, so the default namespace is used
namespace = 'test'

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
  • PineconeVectorStore๋ฅผ storage๋กœ, OpenAIEmbedding์„ embedding, GPTVectorStoreIndex๋ฅผ Document ๊ฐ์ฒด ๋ชฉ๋ก์œผ๋กœ ์ดˆ๊ธฐํ™”
  • StorageContext๋Š” storage ์„ค์ •์„ ๊ตฌ์„ฑํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋ฉฐ, ServiceContext๋Š” embedding ๋ชจ๋ธ์„ ์„ค์ •
  • GPTVectorStoreIndex๋Š” ์ œ๊ณต๋œ storage์™€ service_context๋ฅผ ํ™œ์šฉํ•ด ์ธ๋ฑ์‹ฑ ๋ฐ ์ฟผ๋ฆฌ ํ”„๋กœ์„ธ์Šค๋ฅผ ์ฒ˜๋ฆฌ

 

from llama_index import GPTVectorStoreIndex, StorageContext, ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding

# setup our storage (vector db)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)
# setup the index/query process, ie the embedding model (and completion if used)
embed_model = OpenAIEmbedding(model='text-embedding-ada-002', embed_batch_size=100)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

index = GPTVectorStoreIndex.from_documents(
    docs, storage_context=storage_context,
    service_context=service_context
)

 

  • Run question answering through the index's query engine
query_engine = index.as_query_engine()
res = query_engine.query("์ด์ˆœ์‹  ์žฅ๊ตฐ์— ๋Œ€ํ•ด ์„ค๋ช…ํ•ด ์ฃผ์„ธ์š”.")
print(res)

# ์ด์ˆœ์‹  ์žฅ๊ตฐ์€ ํ•œ๊ตญ์˜ ์œ ๋ช…ํ•œ ์žฅ๊ตฐ์œผ๋กœ ์•Œ๋ ค์ ธ ์žˆ์Šต๋‹ˆ๋‹ค. 
# ๊ทธ๋Š” ์กฐ์„  ์‹œ๋Œ€์— ํ™œ์•ฝํ•œ ์žฅ๊ตฐ์œผ๋กœ, ์กฐ์„  ์™•์กฐ ์‹œ๋Œ€์— ๋งŽ์€ ์ „์Ÿ์—์„œ ์„ฑ๊ณผ๋ฅผ ๋‚ด์—ˆ์Šต๋‹ˆ๋‹ค. 
# ์ด์ˆœ์‹  ์žฅ๊ตฐ์€ ์กฐ์„ ์˜ ํ•ด์ƒ ์•ˆ๋ณด๋ฅผ ๊ฐ•ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์กฐ์„  ์ตœ์ดˆ์˜ ๊ฐ•ํ™”๋„๋ฅผ ๊ฑด์„คํ•˜๊ณ , ์กฐ์„ ์˜ ํ•ด์ƒ ์ „๋žต์„ ๊ฐœ๋ฐœํ–ˆ์Šต๋‹ˆ๋‹ค. 
# ๊ทธ๋Š” ์ผ๋ณธ์˜ ์นจ๋žต์„ ๋ง‰๊ธฐ ์œ„ํ•ด ๋งŽ์€ ์ „ํˆฌ์—์„œ ์Šน๋ฆฌ๋ฅผ ๊ฑฐ๋‘์—ˆ์œผ๋ฉฐ, ํŠนํžˆ ๋ช…๋Ÿ‰ํ•ด์ „์—์„œ์˜ ์Šน๋ฆฌ๋กœ ์œ ๋ช…ํ•ฉ๋‹ˆ๋‹ค. 
# ์ด์ˆœ์‹  ์žฅ๊ตฐ์€ ์กฐ์„ ์˜ ๊ตฐ์‚ฌ ์ „๋žต๊ณผ ๋›ฐ์–ด๋‚œ ์ง€ํœ˜๋ ฅ์œผ๋กœ ๋งŽ์€ ์‚ฌ๋žŒ๋“ค์—๊ฒŒ ์กด๊ฒฝ๋ฐ›๋Š” ์ธ๋ฌผ์ž…๋‹ˆ๋‹ค.

 

  • vector db์— ์—†๋Š” ์งˆ๋ฌธ
query_engine = index.as_query_engine()
res = query_engine.query("์ด์ˆœ์‹  ์žฅ๊ตฐ์ด ์‚ฌ์šฉํ•œ ์Œ์‹์— ๋Œ€ํ•ด ์„ค๋ช…ํ•ด์ฃผ์„ธ์š”")
print(res)

# ์ด์ˆœ์‹  ์žฅ๊ตฐ์ด ์‚ฌ์šฉํ•œ ์Œ์‹์— ๋Œ€ํ•ด๋Š” ์ œ๊ณต๋œ ๋ฌธ๋งฅ ์ •๋ณด๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค.

 

  • ์˜๋ฏธ ์—†๋Š” index๊ฐ€ vectordb์— ์ถ”๊ฐ€ ๋  ๊ฒฝ์šฐ index_name์œผ๋กœ ์‚ญ์ œ
pinecone.delete_index(index_name)

 


 

๋ฐ˜์‘ํ˜•
๋‹คํ–ˆ๋‹ค