
Llama-Index with Pinecone

์ด ๋…ธํŠธ๋ถ์—์„œ๋Š” semantic-search๋ฅผ ์œ„ํ•ด Pinecone๊ณผ llama-index(์ด์ „์˜ GPT-index) ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค€๋‹ค. ์ด ๋…ธํŠธ๋ถ์€ llama-index์˜ ์˜ˆ์‹œ์ด๋ฉฐ ํ–ฅํ›„ ๋ฆด๋ฆฌ์Šค์—์„œ๋Š” Pinecone ์˜ˆ์ œ ์ €์žฅ์†Œ์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

1) install packages

!pip install -qU llama-index datasets pinecone-client openai transformers

 

2) Load the SQuAD dataset

  • Each SQuAD record carries a Wikipedia passage (context) and its article title
from datasets import load_dataset

data = load_dataset('squad', split='train')
data = data.to_pandas()[['id', 'context', 'title']]
data.drop_duplicates(subset='context', keep='first', inplace=True)
data.head()
len(data)
  • DataFram์„ llama_index๋กœ ์ธ๋ฑ์‹ฑ
  • ๊ฐ ๋ฌธ์„œ์— ํ…์ŠคํŠธ ๊ตฌ์ ˆ, ๊ณ ์œ  ID, ๋ฌธ์„œ ์ œ๋ชฉ ์ถ”๊ฐ€
from llama_index import Document

docs = []

for i, row in data.iterrows():
    docs.append(Document(
        text=row['context'],
        doc_id=row['id'],
        extra_info={'title': row['title']}
    ))
docs[0]
print(f"document count: {len(docs)}")

 

3) Set the OpenAI API key and initialize SimpleNodeParser

  • m1 mac์—์„œ๋Š” SimpleNodeParser๋ฅผ default๋กœ ์‚ฌ์šฉ
import os

os.environ['OPENAI_API_KEY'] = 'OPENAI_API_KEY'  # platform.openai.com
from llama_index.node_parser import SimpleNodeParser


# parser = SimpleNodeParser()
parser = SimpleNodeParser.from_defaults()

nodes = parser.get_nodes_from_documents(docs)
nodes[0]
len(nodes)
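Conceptually, the node parser splits each document's text into smaller, often overlapping chunks (nodes) that fit an embedding model's input window. A minimal stdlib sketch of that idea — the `chunk_text` helper and its sizes are illustrative, not llama-index internals:

```python
def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks overlap."""
    chunks = []
    step = chunk_size - overlap  # advance less than chunk_size so chunks overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("a" * 50, chunk_size=20, overlap=5)
print(len(chunks))     # 3 nodes produced from one 50-character document
print(len(chunks[0]))  # each node holds at most chunk_size characters -> 20
```

Real parsers chunk by tokens or sentences rather than raw characters, but the overlap serves the same purpose: context near a chunk boundary is not lost.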

 

4) Indexing in Pinecone

  • Pinecone: a managed vector database service designed for ML applications
  • It is mainly used to store embedding vectors produced by LLMs (large language models) and to serve semantic, similarity-based search over them.
  1. Initialize Pinecone with the API key and environment, both available for free from the console (https://app.pinecone.io/), then create a new index.
  2. The index has dimension 1536 and uses cosine similarity; embeddings are typically produced with text-embedding-ada-002, a cheap and fast embedding model.
  • PINECONE_API_KEY: the Pinecone API key
  • index_name: the name of the Pinecone index
  • PINECONE_ENVIRONMENT: the name of the Pinecone server environment
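The cosine metric chosen for the index scores two embedding vectors by the angle between them, ignoring magnitude. A plain-Python sketch with 3-dimensional toy vectors in place of the real 1536-dimensional ada-002 embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# same direction -> 1.0 (scaling a vector does not change the score)
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Pinecone computes this score between the query embedding and every stored vector, returning the highest-scoring matches.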
import pinecone

# find API key in console at app.pinecone.io
os.environ['PINECONE_API_KEY'] = 'PINECONE_API_KEY'
# environment is found next to API key in the console
os.environ['PINECONE_ENVIRONMENT'] = 'PINECONE_ENVIRONMENT'

# initialize connection to pinecone
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)

# create the index if it does not exist already
index_name = 'llama-index-intro'
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,
        metric='cosine'
    )

# connect to the index
pinecone_index = pinecone.Index(index_name)

 

  • Pinecone ์ธ๋ฑ์Šค๋กœ PineconeVectorStore๋ฅผ ์ดˆ๊ธฐํ™”
  • PineconeVectorStore๋Š” Pinecone์˜ vector database์—์„œ document embedding์„ ์œ„ํ•œ ์ €์žฅ ๋ฐ ๊ฒ€์ƒ‰ ์ธํ„ฐํŽ˜์ด์Šค ์—ญํ• ์„ ์ œ๊ณต
from llama_index.vector_stores import PineconeVectorStore

# we can select a namespace (acts as a partition within an index);
# it is not passed below, so the default namespace is used
namespace = 'test'

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
  • PineconeVectorStore๋ฅผ storage๋กœ, OpenAIEmbedding์„ embedding, GPTVectorStoreIndex๋ฅผ Document ๊ฐ์ฒด ๋ชฉ๋ก์œผ๋กœ ์ดˆ๊ธฐํ™”
  • StorageContext๋Š” storage ์„ค์ •์„ ๊ตฌ์„ฑํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋ฉฐ, ServiceContext๋Š” embedding ๋ชจ๋ธ์„ ์„ค์ •
  • GPTVectorStoreIndex๋Š” ์ œ๊ณต๋œ storage์™€ service_context๋ฅผ ํ™œ์šฉํ•ด ์ธ๋ฑ์‹ฑ ๋ฐ ์ฟผ๋ฆฌ ํ”„๋กœ์„ธ์Šค๋ฅผ ์ฒ˜๋ฆฌ

 

from llama_index import GPTVectorStoreIndex, StorageContext, ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding

# setup our storage (vector db)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)
# setup the index/query process, ie the embedding model (and completion if used)
embed_model = OpenAIEmbedding(model='text-embedding-ada-002', embed_batch_size=100)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

index = GPTVectorStoreIndex.from_documents(
    docs, storage_context=storage_context,
    service_context=service_context
)

 

  • Run question answering through the index's query engine
query_engine = index.as_query_engine()
res = query_engine.query("์ด์ˆœ์‹  ์žฅ๊ตฐ์— ๋Œ€ํ•ด ์„ค๋ช…ํ•ด ์ฃผ์„ธ์š”.")
print(res)

# ์ด์ˆœ์‹  ์žฅ๊ตฐ์€ ํ•œ๊ตญ์˜ ์œ ๋ช…ํ•œ ์žฅ๊ตฐ์œผ๋กœ ์•Œ๋ ค์ ธ ์žˆ์Šต๋‹ˆ๋‹ค. 
# ๊ทธ๋Š” ์กฐ์„  ์‹œ๋Œ€์— ํ™œ์•ฝํ•œ ์žฅ๊ตฐ์œผ๋กœ, ์กฐ์„  ์™•์กฐ ์‹œ๋Œ€์— ๋งŽ์€ ์ „์Ÿ์—์„œ ์„ฑ๊ณผ๋ฅผ ๋‚ด์—ˆ์Šต๋‹ˆ๋‹ค. 
# ์ด์ˆœ์‹  ์žฅ๊ตฐ์€ ์กฐ์„ ์˜ ํ•ด์ƒ ์•ˆ๋ณด๋ฅผ ๊ฐ•ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์กฐ์„  ์ตœ์ดˆ์˜ ๊ฐ•ํ™”๋„๋ฅผ ๊ฑด์„คํ•˜๊ณ , ์กฐ์„ ์˜ ํ•ด์ƒ ์ „๋žต์„ ๊ฐœ๋ฐœํ–ˆ์Šต๋‹ˆ๋‹ค. 
# ๊ทธ๋Š” ์ผ๋ณธ์˜ ์นจ๋žต์„ ๋ง‰๊ธฐ ์œ„ํ•ด ๋งŽ์€ ์ „ํˆฌ์—์„œ ์Šน๋ฆฌ๋ฅผ ๊ฑฐ๋‘์—ˆ์œผ๋ฉฐ, ํŠนํžˆ ๋ช…๋Ÿ‰ํ•ด์ „์—์„œ์˜ ์Šน๋ฆฌ๋กœ ์œ ๋ช…ํ•ฉ๋‹ˆ๋‹ค. 
# ์ด์ˆœ์‹  ์žฅ๊ตฐ์€ ์กฐ์„ ์˜ ๊ตฐ์‚ฌ ์ „๋žต๊ณผ ๋›ฐ์–ด๋‚œ ์ง€ํœ˜๋ ฅ์œผ๋กœ ๋งŽ์€ ์‚ฌ๋žŒ๋“ค์—๊ฒŒ ์กด๊ฒฝ๋ฐ›๋Š” ์ธ๋ฌผ์ž…๋‹ˆ๋‹ค.

 

  • vector db์— ์—†๋Š” ์งˆ๋ฌธ
query_engine = index.as_query_engine()
res = query_engine.query("์ด์ˆœ์‹  ์žฅ๊ตฐ์ด ์‚ฌ์šฉํ•œ ์Œ์‹์— ๋Œ€ํ•ด ์„ค๋ช…ํ•ด์ฃผ์„ธ์š”")
print(res)

# ์ด์ˆœ์‹  ์žฅ๊ตฐ์ด ์‚ฌ์šฉํ•œ ์Œ์‹์— ๋Œ€ํ•ด๋Š” ์ œ๊ณต๋œ ๋ฌธ๋งฅ ์ •๋ณด๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค.

 

  • ์˜๋ฏธ ์—†๋Š” index๊ฐ€ vectordb์— ์ถ”๊ฐ€ ๋  ๊ฒฝ์šฐ index_name์œผ๋กœ ์‚ญ์ œ
pinecone.delete_index(index_name)

 


 

๋ฐ˜์‘ํ˜•
๋‹คํ–ˆ๋‹ค