doc2vec extends the word2vec concept to whole sentences or documents. The idea of learning word vectors by predicting the next word from the preceding words can be extended to learning sentence, paragraph, or document vectors.

doc2vec can be used incrementally. New documents are fed to an already trained model to generate new document vectors. In the inference stage, the algorithm keeps the word vector matrix and its weights frozen, computes the new document vectors against them, and adds those vectors to the document matrix.
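The inference step described above can be sketched in plain Python. This is a simplified toy, not gensim's implementation: it assumes a full softmax over a tiny vocabulary, and the names softmax, log_likelihood, infer_doc_vector, word_matrix and doc_matrix are all made up for illustration. Only the new document vector is updated; the word matrix is never modified.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(doc_vec, word_ids, word_matrix):
    """Sum of log P(word | doc_vec) under a softmax over the whole vocabulary."""
    scores = [sum(d * w for d, w in zip(doc_vec, row)) for row in word_matrix]
    probs = softmax(scores)
    return sum(math.log(probs[i]) for i in word_ids)

def infer_doc_vector(word_ids, word_matrix, steps=50, lr=0.1, seed=0):
    """Fit a vector for one new document while word_matrix stays frozen."""
    rng = random.Random(seed)
    dim = len(word_matrix[0])
    doc_vec = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    for _ in range(steps):
        scores = [sum(d * w for d, w in zip(doc_vec, row)) for row in word_matrix]
        probs = softmax(scores)
        # Expected word vector under the current prediction.
        expected = [sum(p * row[k] for p, row in zip(probs, word_matrix))
                    for k in range(dim)]
        # Gradient of the log-likelihood with respect to doc_vec only.
        grad = [sum(word_matrix[i][k] - expected[k] for i in word_ids)
                for k in range(dim)]
        doc_vec = [d + lr * g for d, g in zip(doc_vec, grad)]
    return doc_vec

# Toy "trained" word matrix: 6 vocabulary words, 4 dimensions (made-up values).
rng = random.Random(42)
word_matrix = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(6)]
frozen_copy = [row[:] for row in word_matrix]

new_doc = [0, 2, 5]  # word ids of an unseen document (hypothetical)
doc_matrix = []      # stands in for the matrix of document vectors
doc_matrix.append(infer_doc_vector(new_doc, word_matrix))
```

Because the word matrix is read but never written, the same trained model can produce vectors for any number of new documents without retraining.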

Training document vectors

Generate document vectors with the doc2vec functions in the gensim package.

Number of CPU cores to use

import multiprocessing
num_cores = multiprocessing.cpu_count()

Import gensim's Doc2Vec model, the TaggedDocument container for corpus documents, and the simple_preprocess tokenizer.

from gensim.models.doc2vec import TaggedDocument,Doc2Vec
from gensim.utils import simple_preprocess

Define the corpus as a list of strings so that the documents can be processed one at a time.

corpus = ['This is the first document ...',
          'another document ...']

gensim's TaggedDocument lets you attach tags to a document: strings or integers that carry meta information such as the document's category name or keywords.

training_corpus = []
for i, text in enumerate(corpus):
    tagged_doc = TaggedDocument(simple_preprocess(text),[i])
    training_corpus.append(tagged_doc)
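To see the shape of the data without installing gensim, the tagging step can be mimicked with the standard library. gensim's TaggedDocument is a namedtuple with the fields words and tags; TaggedDoc below is a stand-in with the same fields, rough_preprocess is only an approximation of simple_preprocess (lowercasing, keeping tokens of 2 to 15 letters), and the labels are hypothetical category names.

```python
import re
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags).
TaggedDoc = namedtuple('TaggedDoc', 'words tags')

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split on non-letters, and drop very short or very long tokens."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

corpus = ['This is the first document ...',
          'another document ...']
labels = ['intro', 'misc']  # hypothetical category names

# Each document gets a unique string id plus its category label as tags.
tagged = [TaggedDoc(rough_preprocess(text), ['doc_%d' % i, label])
          for i, (text, label) in enumerate(zip(corpus, labels))]
```

Here tagged[0].words holds the token list and tagged[0].tags holds the two tags for the first document.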

Create a doc2vec model with a context window of 10 words and 100-dimensional word and document vectors (unlike the 300 dimensions used for word2vec).
min_count is the minimum number of times a word must occur in the corpus to be included in the model's vocabulary.

model = Doc2Vec(vector_size=100, window=10, min_count=2, workers=num_cores, epochs=10)  # gensim 4.x parameter names (3.x used size= and iter=)

Compile the vocabulary with build_vocab, then run training.

model.build_vocab(training_corpus)
model.train(training_corpus, total_examples=model.corpus_count, epochs=model.epochs)
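As a rough illustration of what min_count does while the vocabulary is compiled, the filtering can be sketched with a Counter. build_vocab_sketch is a hypothetical helper, not part of gensim, and it only models the frequency cutoff, nothing else that build_vocab does.

```python
from collections import Counter

def build_vocab_sketch(tokenized_docs, min_count=2):
    """Keep only words that occur at least min_count times across all documents."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

docs = [['this', 'is', 'the', 'first', 'document'],
        ['another', 'document']]
vocab = build_vocab_sketch(docs, min_count=2)
```

With this two-document toy corpus and min_count=2, only 'document' survives the cutoff, so a real corpus needs to be far larger for the model to learn anything useful.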

 

Once the model is trained, call the infer_vector() method to infer a document vector for a new, unseen document.

model.infer_vector(simple_preprocess('This is a completely unseen document'), epochs=10)
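An inferred vector is usually put to work by comparing it with the vectors of known documents. The sketch below assumes cosine similarity as the measure and uses made-up three-dimensional vectors in place of real model output; cosine_similarity, inferred and known are illustrative names only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for infer_vector() output and trained doc vectors.
inferred = [0.2, -0.1, 0.4]
known = {'doc_0': [0.1, -0.05, 0.2],
         'doc_1': [-0.3, 0.2, 0.1]}

# Tag of the known document whose vector points in the most similar direction.
best = max(known, key=lambda tag: cosine_similarity(inferred, known[tag]))
```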
 
๋ฐ˜์‘ํ˜•
๋‹คํ–ˆ๋‹ค