
 Several companies have released pretrained word vector models. Of the libraries for working with them, gensim is the most popular.

 Hands-on practice with the NLPIA package

from nlpia.data.loaders import get_data
word_vectors = get_data('word2vec')

gensim.models.utils_any2vec:loaded (3000000, 300) matrix load

 The smaller the vocabulary, the less powerful the word vector model becomes. An NLP pipeline will not perform well on documents that contain words for which no word vector exists. It is therefore best to limit the size of the word vector model only during development.
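 One way to see how much a reduced vocabulary hurts is to measure how many tokens in a document have no vector. The following is a minimal sketch, not part of NLPIA or gensim: `oov_report` and `toy_vocab` are hypothetical names, and a plain dict stands in for the loaded vectors (gensim's KeyedVectors also supports `in` membership tests).

```python
def oov_report(tokens, vocab):
    """Split tokens into known and out-of-vocabulary (OOV) lists and return the OOV rate."""
    known = [t for t in tokens if t in vocab]
    unknown = [t for t in tokens if t not in vocab]
    rate = len(unknown) / len(tokens) if tokens else 0.0
    return known, unknown, rate


# Hypothetical toy vocabulary standing in for a size-limited word vector model.
toy_vocab = {"cooking": 0, "potatoes": 1, "milk": 2}

known, unknown, rate = oov_report(["cooking", "potatoes", "zucchini", "milk"], toy_vocab)
print(unknown, rate)
```

 A high OOV rate on real documents is a hint that the vocabulary limit used during development is too tight for production.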

 The gensim KeyedVectors.most_similar() method efficiently finds the word vectors closest to a given word vector. If you pass a list of words to the positive keyword argument, the method adds those word vectors together and finds the word vectors closest to the sum. To subtract word vectors, use the negative argument. The topn argument sets the number of results the method returns.

 Unlike traditional thesauruses, word2vec defines synonymy as a single continuous value: the distance between two words. This is because word2vec itself is a continuous vector space model.

word_vectors.most_similar(positive=['cooking','potatoes'],topn=5)
# [('cook', 0.6973530054092407),
#  ('oven_roasting', 0.6754531860351562),
#  ('Slow_cooker', 0.6742031574249268),
#  ('sweet_potatoes', 0.6600279808044434),
#  ('stir_fry_vegetables', 0.6548758745193481)]
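
 To make the vector arithmetic behind positive, negative, and topn concrete, here is a rough NumPy sketch. It is not gensim's actual implementation: `most_similar_sketch` is a hypothetical helper, and the 3-dimensional toy vectors are invented for illustration (real word2vec vectors have 300 learned dimensions).

```python
import numpy as np


def unit(v):
    """Scale a vector to length 1 so that dot products become cosine similarities."""
    return v / np.linalg.norm(v)


def most_similar_sketch(vectors, positive, negative=(), topn=5):
    """Add the positive vectors, subtract the negative ones, rank the rest by cosine similarity."""
    query = sum(unit(vectors[w]) for w in positive)
    for w in negative:
        query = query - unit(vectors[w])
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words are not returned as results
    scores = [
        (w, float(np.dot(unit(v), query)))
        for w, v in vectors.items()
        if w not in exclude
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Invented toy vectors; the axes loosely mean (royalty, male, female).
toy = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

top = most_similar_sketch(toy, positive=["king", "woman"], negative=["man"], topn=2)
print(top)
```

 With these toy vectors the top-ranked word is queen, because the score is a continuous cosine similarity rather than a yes/no synonym lookup.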

 The gensim library also provides a doesnt_match() method that finds the unrelated word in a list. Of the four words below, computer scores lowest against the others and is returned as the odd one out.

word_vectors.doesnt_match("potatoes milk cake computer".split())
# 'computer'
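
 The idea of "lowest score against the group" can be sketched in NumPy as follows. This is an approximation under stated assumptions, not gensim's code: `doesnt_match_sketch` is a hypothetical helper, and the 2-dimensional toy vectors are invented so that the first axis loosely means "food" and the second "technology".

```python
import numpy as np


def doesnt_match_sketch(vectors, words):
    """Return the word whose unit vector is least similar to the mean of all the words."""
    unit_vecs = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit_vecs.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    scores = unit_vecs @ mean  # cosine similarity of each word to the group mean
    return words[int(np.argmin(scores))]


# Invented toy vectors for illustration only.
toy = {
    "potatoes": np.array([1.0, 0.1]),
    "milk": np.array([0.9, 0.2]),
    "cake": np.array([0.8, 0.1]),
    "computer": np.array([0.1, 1.0]),
}

odd_one = doesnt_match_sketch(toy, "potatoes milk cake computer".split())
print(odd_one)
```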

 king + woman - man = ? 

word_vectors.most_similar(positive=['king','woman'],negative=['man'],topn=2)
# [('queen', 0.7118192911148071), ('monarch', 0.6189674139022827)]
word_vectors.similarity('princess','queen')
# 0.7070532
word_vectors['phone']
word_vectors['phone'].shape
# (300,)

 Each word is a 1×300 NumPy row vector (a 1-D array of shape (300,)). It is hard to interpret what the many individual components of a word vector mean, but linear combinations of these values make it possible to find synonyms.

  ๋‚˜๋งŒ์˜ ๋‹จ์–ด ๋ฒกํ„ฐ ๋ชจํ˜• ๋งŒ๋“ค๊ธฐ

Before training a domain-specific word2vec model with the gensim library, the corpus needs to be preprocessed.

  Preprocessing step

 What is needed is to split the documents in the corpus into sentences and the sentences into tokens. gensim's word2vec model expects a list of sentences in which each sentence is a list of tokens.
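
 As a minimal sketch of that input shape, the following uses only the standard library. The regular expressions are a deliberately naive assumption (a real pipeline would likely use a proper sentence segmenter and tokenizer), and `corpus_to_token_list` and `docs` are hypothetical names.

```python
import re


def corpus_to_token_list(documents):
    """Turn raw documents into a list of sentences, each a list of lowercase tokens."""
    token_list = []
    for doc in documents:
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            tokens = re.findall(r"[A-Za-z0-9']+", sentence.lower())
            if tokens:
                token_list.append(tokens)
    return token_list


# Hypothetical two-document corpus.
docs = ["Potatoes cook slowly. Milk boils fast!", "Cake needs an oven."]
token_list = corpus_to_token_list(docs)
print(token_list)
```

 The resulting token_list has one inner list per sentence, which is the structure passed to Word2Vec.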

from gensim.models.word2vec import Word2Vec
num_features = 300 # number of word vector dimensions
min_word_count = 3 # minimum frequency for a word to be included in the word2vec model; lower it for a small corpus, raise it for a large one
num_workers = 2 # number of CPU cores used for training; multiprocessing.cpu_count() can match this to the system
window_size = 6 # size of the context window
subsampling = 1e-3 # subsampling threshold for high-frequency terms

# token_list: a list of sentences, each sentence a list of tokens
model = Word2Vec(
    token_list,
    workers=num_workers,
    size=num_features,  # gensim < 4.0; this argument is named vector_size in gensim 4.0+
    min_count=min_word_count,
    window=window_size,
    sample=subsampling
)

 Calling init_sims freezes the model: it keeps the hidden-layer weights (the word vectors) and discards the output weights that predict word co-occurrence probabilities. A model trimmed this way cannot be trained any further.

model.init_sims(replace=True)  # deprecated in gensim 4.0+
model_name = "my_domain_specific_word2vec_model"
model.save(model_name)