
  LDiA

 ๋Œ€๋ถ€๋ถ„ ์ฃผ์ œ ๋ชจํ˜•ํ™”๋‚˜ ์˜๋ฏธ ๊ฒ€์ƒ‰, ๋‚ด์šฉ ๊ธฐ๋ฐ˜ ์ถ”์ฒœ ์—”์ง„์—์„œ ๊ฐ€์žฅ ๋จผ์ € ์„ ํƒํ•ด์•ผ ํ•  ๊ธฐ๋ฒ•์€ LSA์ด๋‹ค. ๋‚ด์šฉ ๊ธฐ๋ฐ˜ ์˜ํ™”์ถ”์ฒœ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์˜ํ•˜๋ฉด LSA๊ฐ€ LDiA ๋ณด๋‹ค ์•ฝ ๋‘๋ฐฐ๋กœ ์ •ํ™•ํ•˜๋‹ค. LSA์— ๊น”๋ฆฐ ์ˆ˜ํ•™์€ ๊ฐ„๋‹จํ•˜๊ณ  ํšจ์œจ์ ์ด๋‹ค.

 NLP์˜ ๋งฅ๋ฝ์—์„œ LDiA๋Š” LSA์ฒ˜๋Ÿผ ํ•˜๋‚˜์˜ ์ฃผ์ œ ๋ชจํ˜•์„ ์‚ฐ์ถœํ•œ๋‹ค. LDiA๋Š” ์ด๋ฒˆ ์žฅ ๋„์ž…๋ถ€์—์„œ ํ–ˆ๋˜ ์‚ฌ๊ณ  ์‹คํ—˜๊ณผ ๋น„์Šทํ•œ ๋ฐฉ์‹์œผ๋กœ ์˜๋ฏธ ๋ฒกํ„ฐ ๊ณต๊ฐ„(์ฃผ์ œ ๋ฒกํ„ฐ๋“ค์˜ ๊ณต๊ฐ„)์„ ์‚ฐ์ถœํ•œ๋‹ค.  

๋””๋ฆฌํด๋ ˆ ๋ถ„ํฌ


 LDiA๊ฐ€ LSA์™€ ๋‹ค๋ฅธ ์ ์€ ๋‹จ์–ด ๋นˆ๋„๋“ค์ด ๋””๋ฆฌํด๋ ˆ ๋ถ„ํฌ๋ฅผ ๋”ฐ๋ฅธ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค. LSA์˜ ๋ชจํ˜•๋ณด๋‹ค LDiA์˜ ๋””๋ฆฌํด๋ ˆ ๋ถ„ํฌ๊ฐ€ ๋‹จ์–ด ๋นˆ๋„๋“ค์˜ ๋ถ„ํฌ๋ฅผ ์ž˜ ํ‘œํ˜„ํ•œ๋‹ค.

 

 

 LDiA๋Š” ์˜๋ฏธ ๋ฒกํ„ฐ ๊ณต๊ฐ„์„ ์‚ฐ์ถœํ•œ๋‹ค. ์‚ฌ๊ณ  ์‹คํ—˜์—์„œ ํŠน์ • ๋‹จ์–ด๋“ค์ด ๊ฐ™์€ ๋ฌธ์„œ์— ํ•จ๊ป˜ ๋“ฑ์žฅํ•˜๋Š” ํšŸ์ˆ˜์— ๊ธฐ์ดˆํ•ด์„œ ๋‹จ์–ด๋“ค์„ ์ฃผ์ œ๋“ค์— ์ง์ ‘ ๋ฐฐ์ •ํ–ˆ๋‹ค. ํ•œ ๋ฌธ์„œ์— ๋Œ€ํ•œ ๊ฐ ๋‹จ์–ด์˜ ์ฃผ์ œ ์ ์ˆ˜๋“ค์„ ์ด์šฉํ•ด ๋ฌธ์„œ์— ๋ฐฐ์ •ํ•˜๋Š” ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋”ฐ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— LSA๋ณด๋‹ค ์ดํ•ดํ•˜๊ธฐ ์‰ฝ๋‹ค.

 LDiA๋Š” ๊ฐ ๋ฌธ์„œ๋ฅผ ์ž„์˜์˜ ๊ฐœ์ˆ˜์˜ ์ฃผ์ œ๋“ค์˜ ํ˜ผํ•ฉ์œผ๋กœ ๊ฐ„์ฃผํ•œ๋‹ค. ์ฃผ์ œ ๊ฐœ์ˆ˜๋Š” LDiA ๋ชจํ˜•์„ ํ›ˆ๋ จํ•˜๊ธฐ ์ „์— ๊ฐœ๋ฐœ์ž๊ฐ€ ๋ฏธ๋ฆฌ ์ •ํ•œ๋‹ค. LDiA๋Š” ๋˜ํ•œ ๊ฐ ์ฃผ์ œ๋ฅผ ๋‹จ์–ด ์ถœํ˜„ ํšŸ์ˆ˜๋“ค์˜ ๋ถ„ํฌ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค. LDiA๋Š” ๋˜ํ•œ ๊ฐ ์ฃผ์ œ๋ฅผ ๋‹จ์–ด ์ถœํ˜„ ํšŸ์ˆ˜๋“ค์˜ ๋ถ„ํฌ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค. (์‚ฌ์ „(prior) ํ™•๋ฅ ๋ถ„ํฌ)

    LDiA์˜ ๊ธฐ์ดˆ

 ๋””๋ฆฌํด๋ ˆ ๋ถ„ํฌ์— ๊ธฐ์ดˆํ•œ ๋ถ„์„ ๋ฐฉ๋ฒ•์€ ์œ ์ „์ž ์„œ์—ด์—์„œ ์ง‘๋‹จ ๊ตฌ์กฐ(population structure)๋ฅผ ์ถ”๋ก ํ•˜๊ธฐ ์œ„ํ•ด ๊ณ ์•ˆํ–ˆ๋‹ค. 

LDiA์— ๊ธฐ์ดˆํ•œ ๋ฌธ์„œ ์ƒ์„ฑ๊ธฐ๋Š” ๋‹ค์Œ ๋‘๊ฐ€์ง€๋ฅผ ๋‚˜์ˆ˜๋กœ ๊ฒฐ์ •ํ•œ๋‹ค.

  • ๋ฌธ์„œ๋ฅผ ์œ„ํ•ด ์ƒ์„ฑํ•  ๋‹จ์–ด๋“ค์˜ ์ˆ˜(ํฌ์•„์†ก ๋ถ„ํฌ)
  • ๋ฌธ์„œ๋ฅผ ์œ„ํ•ด ํ˜ผํ•ฉํ•  ์ฃผ์ œ๋“ค์˜ ์ˆ˜(๋””๋ฆฌํด๋ ˆ ๋ถ„ํฌ)

 ๋ฌธ์„œ์˜ ๋‹จ์–ด ์ˆ˜๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๋ฐ ์“ฐ์ด๋Š” ํฌ์•„์†ก ๋ถ„ํฌ๋Š” ํ‰๊ท  ๋ฌธ์„œ ๊ธธ์ด๋ผ๋Š” ๋งค๊ฐœ ๋ณ€์ˆ˜ ํ•˜๋‚˜๋กœ ์ •์˜๋œ๋‹ค. ์ฃผ์ œ ๊ฐœ์ˆ˜๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๋ฐ ์“ฐ์ด๋Š” ๋””๋ฆฌํด๋ ˆ ๋ถ„ํฌ๋Š” ๊ทธ๋ณด๋‹ค ๋‘๊ฐœ ๋งŽ์€ ์„ธ๊ฐœ์˜ ๋งค๊ฐœ ๋ณ€์ˆ˜๋กœ ์ •์˜๋œ๋‹ค. ๋‘ ์ˆ˜์น˜๋ฅผ ๊ฒฐ์ •ํ•œ ํ›„์—๋Š” ๋ฌธ์„œ์— ์‚ฌ์šฉํ•  ๋ชจ๋“  ์ฃผ์ œ์˜ ์šฉ์–ด-์ฃผ์ œ ํ–‰๋ ฌ๋งŒ ์žˆ์œผ๋ฉด ๊ฐ„๋‹จํ•œ ์ ˆ์ฐจ๋กœ ์ƒˆ๋กœ์šด ๋ฌธ์„œ๋“ค์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค.
 
 ํ†ต๊ณ„ํ•™์ ์œผ๋กœ ๋ถ„์„ํ•˜๋ฉด ๋‘ ๋‚œ์ˆ˜ ๋ฐœ์ƒ ํ™•๋ฅ  ๋ถ„ํฌ์˜ ๋งค๊ฐœ ๋ณ€์ˆ˜๋“ค์„ ๊ตฌํ•  ์ˆ˜ ์žˆ์Œ์„ ๊นจ๋‹ฌ์•˜๋‹ค. 1๋ฒˆ ์ˆ˜์น˜, ์ฆ‰ ๋ฌธ์„œ์˜ ๋‹จ์–ด ์ˆ˜๋ฅผ ๊ฒฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ•ด๋‹น ํฌ์•„์†ก ๋ถ„ํฌ๋ฅผ ์ •ํ•ด์•ผํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•œ ํ‰๊ท ์€ ๋ง๋ญ‰์น˜์˜ ๋ชจ๋“  ๋ฌธ์„œ์— ๋Œ€ํ•œ ๋‹จ์–ด ๋ชจ์Œ๋“ค์˜ ํ‰๊ท  ๋‹จ์–ด ์ˆ˜(ํ‰๊ท  n-gram ์ˆ˜)๋กœ ์„ค์ •ํ•˜๋ฉด ๋œ๋‹ค. 

ํฌ์•„์†ก ๋ถ„ํฌ ํ•จ์ˆ˜ $f(x) = \frac{e^{-\lambda}\lambda^x}{x!}$ 

ํ‰๊ท ($u$) = ${\lambda}$ = mean_document_len

from nltk.tokenize import casual_tokenize

# Mean token count across the sms corpus (sms is loaded earlier)
total_corpus_len = 0
for document_text in sms.text:
    total_corpus_len += len(casual_tokenize(document_text))
mean_document_len = total_corpus_len / len(sms)
round(mean_document_len, 2)  # 21.35

 ์ด ํ†ต๊ณ„๋Ÿ‰์€ ๋ฐ˜๋“œ์‹œ BOW๋“ค์—์„œ ์ง์ ‘๊ณ„์‚ฐํ•ด์•ผํ•œ๋‹ค. ๋ถˆ์šฉ์–ด ํ•„ํ„ฐ๋ง์ด๋‚˜ ๊ธฐํƒ€ ์ •๊ทœํ™”๋ฅผ ์ ์šฉํ•œ ๋ฌธ์„œ๋“ค์„ ํ† ํฐํ™”ํ•˜๊ณ  ๋ฒกํ„ฐํ™”ํ•œ ๋‹จ์–ด๋“ค์˜ ์ˆ˜๋ฅผ ์„ธ์–ด์•ผํ•œ๋‹ค.(๋‹จ์–ด ์‚ฌ์ „์˜ ๊ธธ์ด)  

 2๋ฒˆ ์ˆ˜์น˜, ์ฆ‰ ์ฃผ์ œ์˜ ์ˆ˜๋Š” ์‹ค์ œ๋กœ ๋‹จ์–ด๋“ค์„ ์ฃผ์ œ๋“ค์— ๋ฐฐ์ •ํ•ด ๋ณด๊ธฐ ์ „๊นŒ์ง€๋Š” ์•Œ์ˆ˜ ์—†๋‹ค. ์ด๋Š” K-NN์ด๋‚˜ K-means
clustering ๊ตฐ์ง‘ํ™”์™€ ๊ฐ™์€ ๊ตฐ์ง‘ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์ฒ˜๋Ÿผ ๋จผ์ € k๋ฅผ ๊ฒฐ์ •ํ•ด์•ผ ๋‹ค์Œ ๋‹จ๊ณ„๋กœ ๋‚˜์•„๊ฐˆ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ๊ณผ ๋น„์Šทํ•œ ์ƒํ™ฉ์ด๋‹ค.
์ฃผ์ œ์˜ ๊ฐœ์ˆ˜๋ฅผ ์ž„์˜๋กœ ์ •ํ•˜๊ณ  ๊ฐœ์„ ํ•ด ๋‚˜๊ฐ€๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ผ๋‹จ ์ฃผ์ œ ๊ฐœ์ˆ˜๋ฅผ ์ง€์ •ํ•ด ์ฃผ๋ฉด LDiA๋Š” ๊ฐ ์ฃผ์ œ์— ๋Œ€ํ•ด ๋ชฉ์ ํ•จ์ˆ˜๊ฐ€ ์ตœ์ ๊ฐ’์ด ๋˜๋Š” ๋‹จ์–ด๋“ค์˜ ํ˜ผํ•ฉ์„ ์ฐพ์•„๋‚ธ๋‹ค.
 
 LDiA๋ฅผ ๋ฐ˜๋ณตํ•˜๋ฉด์„œ k๋ฅผ ์กฐ์œจํ•˜๋ฉด ์ตœ์ ์˜ k์— ๋„๋‹ฌํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฐ ์ตœ์ ํ™” ๊ณผ์ •์„ ์ž๋™ํ™”ํ•  ์ˆ˜์žˆ๋‹ค. LDiA ์–ธ์–ด ๋ชจํ˜•์˜ ํ’ˆ์งˆ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ํ•„์š”ํ•˜๋‹ค. LDiA์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋ง๋ญ‰์น˜์— ์žˆ๋Š” ์–ด๋–ค ๋ถ€๋ฅ˜ ๋˜๋Š” ํšŒ๊ท€๋ฌธ์ œ์— ์ ์šฉํ•ด์„œ ๊ทธ ๊ฒฐ๊ณผ์™€ ์ •๋‹ต์˜ ์˜ค์ฐจ๋ฅผ ์ธก์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋Š” ๋น„์šฉํ•จ์ˆ˜(cost function)๋ฅผ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์— ํ•ด๋‹นํ•œ๋‹ค. ์ •๋‹ต์œผ๋กœ ๋ถ„๋ฅ˜๋ช…(๊ฐ์ •, ํ‚ค์›Œ๋“œ, ์ฃผ์ œ)์ด ๋ถ™์€ ๋ฌธ์„œ๋“ค๋กœ LDiA ๋ชจํ˜•์„ ์‹คํ–‰ํ•ด์„œ ์˜ค์ฐจ๋ฅผ ์ธก์ •ํ•˜๋ฉด ๋œ๋‹ค.

   ๋ฌธ์ž ๋ฉ”์‹œ์ง€ ๋ง๋ญ‰์น˜์— ๋Œ€ํ•œ LDiA ์ฃผ์ œ ๋ชจํ˜•

 LDiA๊ฐ€ ์‚ฐ์ถœํ•œ ์ฃผ์ œ๋Š” ์‚ฌ๋žŒ์ด ์ดํ•ดํ•˜๊ธฐ ์ข€ ๋” ์‰ฝ๋‚Ÿ. LSA๊ฐ€ ๋–จ์–ด์ € ์žˆ๋Š” ๋‹จ์–ด๋“ค์„ ๋” ๋–จ์–ด๋œจ๋ฆฐ๋‹ค๋ฉด LDiA๋Š” ๊ฐ€๊น๊ฒŒ ์ƒ๊ฐํ•˜๋Š” ๋‹จ์–ด๋“ค์„ ๋” ๊ฐ€๊น๊ฒŒ ๋งŒ๋“ ๋‹ค. 
 LDiA๋Š” ๊ณต๊ฐ„์„ ๋น„์„ ํ˜•์ ์ธ ๋ฐฉ์‹์œผ๋กœ ๋น„ํ‹€๊ณ  ์ผ๊ทธ๋Ÿฌ ๋œจ๋ฆฐ๋‹ค. ์›๋ž˜์˜ ๊ณต๊ฐ„์ด 3์ฐจ์›์ด๊ณ  ์ด๋ฅผ 2์ฐจ์›์œผ๋กœ ํˆฌ์˜ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ์•„๋‹Œํ•œ ์‹œ๊ฐํ™”ํ•˜๊ธฐ ์–ด๋ ต๋‹ค.

# ๋ฉ”์„ธ์ง€ ์ŠคํŒธ ๋ฌธ์ œ์— ๋Œ€์ž… 
์‚ฌ์šฉํ•  ์ฃผ์ œ์˜ ์ˆ˜๋Š” 16. ์ฃผ์ œ์˜ ์ˆ˜๋ฅผ ๋‚ฎ๊ฒŒ ์œ ์ง€ํ•˜๋ฉด ๊ณผ๋Œ€์ ํ•ฉ(overfitting)์„ ์ค„์ด๋Š”๋ฐ ๋„์›€์ด ๋œ๋‹ค.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import casual_tokenize
import numpy as np
np.random.seed(42)  # LDiA relies on random numbers
# Use the stopword-filtered, tokenized BOW counts
counter = CountVectorizer(tokenizer=casual_tokenize)
# Row labels 'sms0', 'sms1', ... with '!' appended to spam messages
index = ['sms{}{}'.format(i, '!'*j) for (i, j) in zip(range(len(sms)), sms.spam)]
bow_docs = pd.DataFrame(counter.fit_transform(raw_documents=sms.text).toarray(),
                        index=index)
bow_docs.head()
#        0     1     2     3     4     5     ...  9226  9227  9228  9229  9230  9231
# sms0      0     0     0     0     0     0  ...     0     0     0     0     0     0
# sms1      0     0     0     0     0     0  ...     0     0     0     0     0     0
# sms2!     0     0     0     0     0     0  ...     0     0     0     0     0     0
# sms3      0     0     0     0     0     0  ...     0     0     0     0     0     0
# sms4      0     0     0     0     0     0  ...     0     0     0     0     0     0

column_nums, terms = zip(*sorted(zip(counter.vocabulary_.values(),
                                     counter.vocabulary_.keys())))
bow_docs.columns = terms

 

 ์ฒซ ๋ฌธ์ž ๋ฉ”์‹œ์ง€ sms0์œ ๋‹จ์–ด ๋ชจ์Œ 

sms.loc['sms0'].text
# 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet..
bow_docs.loc['sms0'][bow_docs.loc['sms0'] > 0].head()
# ,            1
# ..           1
#           2
# amore        1
# available    1
# Name: sms0, dtype: int64

 

 LDiA ์ ์šฉ ํ›„ ์ฃผ์ œ ๋ฒกํ„ฐ๋“ค์„ ์‚ฐ์ถœ

from sklearn.decomposition import LatentDirichletAllocation as LDiA
ldia = LDiA(n_components=16, learning_method='batch')  # 16 topics

ldia = ldia.fit(bow_docs) # bow_docs.shape (4837, 9232)
ldia.components_.shape # (16, 9232)

 

 9232๊ฐœ์˜ ๋‹จ์–ด๋ฅผ 16๊ฐœ์˜ ์ฃผ์ œ๋กœ ์••์ถ•ํ–ˆ๋‹ค. ์„ฑ๋ถ„ ํ™•์ธ(component)
๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋œ ๋ฌธ์ž๋Š” !์ด๊ณ  LDiA์—์„œ topic4์— ๊ฐ€์žฅ ๋งŽ์€ ์ ์ˆ˜๋ฅผ ํ• ๋‹นํ–ˆ๋‹ค. 


columns = ["topic{}".format(i) for i in range(1,17)]

components = pd.DataFrame(ldia.components_.T, index=terms, columns=columns)
components.round(2).head(3)
#    topic1  topic2  topic3  topic4  ...  topic13  topic14  topic15  topic16
# !  184.03   15.00   72.22  394.95  ...    64.40   297.29    41.16    11.70
# "    0.68    4.22    2.41    0.06  ...     0.07    62.72    12.27     0.06
# #    0.06    0.06    0.06    0.06  ...     1.07     4.05     0.06     0.06
# [3 rows x 16 columns]

 

topic4๋Š” ๋‹ค๋ฅธ ๊ฐ์ • ํ‘œํ˜„๋ณด๋‹ค !์— ๋งŽ์ด ์ ์ˆ˜๋ฅผ ์ค€๊ฒƒ์œผ๋กœ๋ณด์•„ ๊ฐ•ํ•œ ๊ฐ•์กฐ์ผ ๊ฐ€๋Šฅ์„ฑ์ด ํฌ๋‹ค.

components.topic4.sort_values(ascending=False)[:20]
# !         394.952246
# .         218.049724
# to        119.533134
# u         118.857546
# call      111.948541
# ยฃ         107.358914
# ,          96.954384
# *          90.314783
# your       90.215961
# is         75.750037
# the        73.335877
# a          62.456249
# on         61.814983
# claim      57.013114
# from       56.541578
# prize      54.284250
# mobile     50.273584
# urgent     49.659121
# &          47.490745
# now        47.419239
# Name: topic4, dtype: float64

์œ ๋กœ์™€ ! call์ด ์žˆ๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์•„ ๊ด‘๊ณ ์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’๋‹ค.

 LSA์™€ ๋‹ค๋ฅด๊ฒŒ ์ง๊ด€์ ์œผ๋กœ ํŒ๋ณ„ํ•  ์ˆ˜ ์žˆ๋‹ค.

 ๋ฌธ์ž ๋ฉ”์‹œ์ง€๋ฅผ ์ŠคํŒธ ๋˜๋Š” ๋น„์ŠคํŒธ์œผ๋กœ ๋ถ„๋ฅ˜ํ•˜๊ธฐ ์œ„ํ•ด LDiA ์ฃผ์ œ ๋ฒกํ„ฐ๋“ค์„ ๊ณ„์‚ฐํ•œ ๋‹ค์Œ LDA(์„ ํ˜• ํŒ๋ณ„ ๋ถ„์„)์— ์ ์šฉํ•œ๋‹ค.
0์ธ ๊ฒƒ์ด ๋งŽ์€ ๊ฒƒ์€ ์ž˜ ๋ถ„๋ฅ˜ํ•œ ๊ฒƒ์ด๋‹ค. 0์€ ์ฃผ์ œ์™€ ์ƒ๊ด€์—†๋‹ค๋Š” ์˜๋ฏธ์ด๋‹ค. LDiA ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ธฐ์ดˆํ•ด ์‚ฌ์—…์ƒ์˜ ๊ฒฐ์ •์„ ๋‚ด๋ฆด ๋•Œ ์ด๋Š” ์ค‘์š”ํ•œ ์žฅ์ ์ด๋‹ค.

ldia16_topic_vectors = ldia.transform(bow_docs)
ldia16_topic_vectors = pd.DataFrame(ldia16_topic_vectors, index=index, columns=columns)
ldia16_topic_vectors.round(2).head()

#        topic1  topic2  topic3  topic4  ...  topic13  topic14  topic15  topic16
# sms0     0.62    0.00    0.00    0.00  ...     0.00     0.00     0.00      NaN
# sms1     0.01    0.01    0.01    0.01  ...     0.01     0.01     0.01      NaN
# sms2!    0.00    0.00    0.00    0.00  ...     0.00     0.00     0.00      NaN
# sms3     0.00    0.00    0.00    0.09  ...     0.00     0.00     0.00      NaN
# sms4     0.00    0.33    0.00    0.00  ...     0.00     0.00     0.00      NaN
# [5 rows x 16 columns]



 LDiA + LDA ์ŠคํŒธ ๋ถ„๋ฅ˜๊ธฐ

from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# Replace the NaN topic16 column with zeros before splitting
ldia16_topic_vectors = ldia16_topic_vectors.fillna(0)
X_train, X_test, y_train, y_test = train_test_split(ldia16_topic_vectors, sms.spam, test_size=0.5)
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
sms['lda16_spam'] = lda.predict(ldia16_topic_vectors)
round(float(lda.score(X_test, y_test)), 2)  # 0.93 accuracy

 


์ฐจ์›์ด 16->32์ผ ๋•Œ LDiA ๋น„๊ต

ldia32 = LDiA(n_components=32, learning_method='batch')
ldia32 = ldia32.fit(bow_docs)
ldia32.components_.shape # (32, 9232)

ldia32_topic_vectors = ldia32.transform(bow_docs)
columns32 = ['topic{}'.format(i) for i in range(ldia32.n_components)]
ldia32_topic_vectors = pd.DataFrame(ldia32_topic_vectors, index=index, columns=columns32)
ldia32_topic_vectors.round(2).head()
#        topic0  topic1  topic2  topic3  topic4  topic5  ...  topic26  \
# sms0      0.0     0.0     0.0    0.24     0.0    0.00  ...     0.00
# sms1      0.0     0.0     0.0    0.00     0.0    0.00  ...     0.12
# sms2!     0.0     0.0     0.0    0.00     0.0    0.00  ...     0.00
# sms3      0.0     0.0     0.0    0.93     0.0    0.00  ...     0.00
# sms4      0.0     0.0     0.0    0.00     0.0    0.24  ...     0.00
#        topic27  topic28  topic29  topic30  topic31
# sms0       0.0      0.0     0.00     0.00      0.0
# sms1       0.0      0.0     0.00     0.00      0.0
# sms2!      0.0      0.0     0.98     0.00      0.0
# sms3       0.0      0.0     0.00     0.00      0.0
# sms4       0.0      0.0     0.00     0.14      0.0
# [5 rows x 32 columns]

0์ด ๋งŽ์€ ๊ฒƒ์œผ๋กœ ๋ณด์•„ ๊น”๋”ํ•˜๊ฒŒ ๋ถ„๋ฆฌ๋˜์—ˆ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

์ •ํ™•๋„ ์ธก์ •

 

from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
X_train, X_test, y_train, y_test = train_test_split(ldia32_topic_vectors, sms.spam, test_size=0.5)
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
sms['lda_32_spam'] = lda.predict(ldia32_topic_vectors)
X_train.shape  # (2418, 32)
round(float(lda.score(X_train, y_train)), 3)  # 0.927 (training-set accuracy)

 ์ฃผ์ œ์˜ ์ˆ˜๋ฅผ ๋Š˜๋ ค ์ข€ ๋” ๋ช…ํ™•ํ•œ ๋ถ„๋ฆฌ๋ฅผ ํ–ˆ๋‹ค. ์ •ํ™•๋„๊ฐ€ ์•„์ง๊นŒ์ง€ PCA + LDA๋ฅผ ๋„˜์ง€๋Š” ๋ชปํ–ˆ๋‹ค. 

 

 

๋ฐ˜์‘ํ˜•
๋‹คํ–ˆ๋‹ค