Training a Linear Discriminant Analysis Model on Labeled Text Messages
LDA is similar to LSA, but to find the best linear combination of the dimensions of the high-dimensional space (BOW, TF-IDF), it needs training data that already carries class labels or other scores.
LSA - spreads all vectors in the new vector space as far apart from one another as possible
LDA - maximizes the distance between classes, i.e., the distance between the centroid of the vectors belonging to one class and the centroid of the vectors belonging to another class
To run LDA, we must supply the algorithm with labeled samples that tell it which topic we want to model (spam = 1 / ham = 0).
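As a minimal sketch of the centroid idea (not the book's code, and with made-up toy vectors): when the classes have equal, roughly isotropic covariance, LDA's discriminant direction lies along the line connecting the two class centroids, and projecting onto it collapses each document to a single score.

```python
import numpy as np

# Toy illustration: two tiny made-up classes in 2-D.
ham = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
spam = np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])

# For equal, isotropic class covariance, the discriminant direction is
# the (normalized) vector between the two class centroids.
c_ham, c_spam = ham.mean(axis=0), spam.mean(axis=0)
w = c_spam - c_ham
w /= np.linalg.norm(w)

# Projecting onto w gives each document a single "spaminess" score;
# the two classes separate cleanly along this one axis.
scores_ham = ham @ w
scores_spam = spam @ w
print(scores_ham.mean(), scores_spam.mean())
```

This is the geometric intuition behind the `lda_spaminess` score computed below with scikit-learn.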
Data Load
# data load
import pandas as pd
from nlpia.data.loaders import get_data
sms = get_data("sms-spam")
sms.head()
# build index labels, appending '!' to the label of each spam message
index = ['sms{}{}'.format(i, '!' * j)
         for (i, j) in zip(range(len(sms)), sms.spam)]
sms.index = index
sms -> TF-IDF Vectorize
# tf-idf docs
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
tfidf_docs = pd.DataFrame(tfidf_docs)
LDA predict
The LDA model shows 0% error, but since predict was run on the very same tfidf_docs used for fitting, zero error is nothing to be impressed by.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1)
lda = lda.fit(tfidf_docs,sms.spam)
sms['lda_spaminess'] = lda.predict(tfidf_docs)
((sms.spam - sms.lda_spaminess) ** 2).sum() ** .5  # 0.0: zero error, perfect classification of the training set
(sms.spam == sms.lda_spaminess).sum() # 4837
len(sms) # 4837
LDA cross val score
With an accuracy of 0.77, this is not a good model after all. To check whether the cross_val_score figure holds up, we also measure accuracy with a train:test = 2:1 split.
from sklearn.model_selection import cross_val_score
lda = LDA(n_components=1)  # fixed typo: was `lad = LDA(...)`
scores = cross_val_score(lda, tfidf_docs, sms.spam, cv=5)
"Accuracy: {:.2f} (+/- {:.2f})".format(scores.mean(), scores.std() * 2)
# 'Accuracy: 0.77 (+/- 0.02)'
Held-out accuracy
The held-out accuracy differs from the cross-validation figure and is even lower, at 0.741.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf_docs,sms.spam,test_size=0.33)
lda = LDA(n_components=1)
lda.fit(X_train,y_train)
lda.score(X_test,y_test).round(3) # 0.741
LSA+LDA
Running LDA after PCA (i.e., on LSA topic vectors) lifts performance from about 0.74 to roughly 0.96, a much better result.
from sklearn.decomposition import PCA
pca = PCA(n_components=16)
pca.fit(tfidf_docs)
pca_topic_vectors = pca.transform(tfidf_docs)
columns = ['topic{}'.format(i) for i in range(pca.n_components)]
pca_topic_vectors = pd.DataFrame(pca_topic_vectors,columns=columns,index=index)
X_train, X_test, y_train, y_test = train_test_split(pca_topic_vectors.values,sms.spam,test_size=0.3)
lda = LDA(n_components=1)
lda.fit(X_train,y_train)
lda.score(X_test,y_test).round(3) # 0.962
lda = LDA(n_components=1)
scores = cross_val_score(lda, pca_topic_vectors, sms.spam, cv=10)  # fixed typo: was `lad`
"Accuracy: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std() * 2)
# 'Accuracy: 0.957 (+/- 0.022)'
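A side note on the choice of n_components=16: a common way to pick the number of PCA components is to look at cumulative explained variance. The sketch below is a hedged illustration, not the book's procedure; it uses a random stand-in matrix (loading the real TF-IDF matrix requires nlpia), computes the explained-variance ratio directly from singular values, and the 90% threshold and matrix shape are arbitrary assumptions.

```python
import numpy as np

# Random stand-in for the document-term matrix (200 docs x 50 terms).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

Xc = X - X.mean(axis=0)                  # PCA centers the data first
s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
var_ratio = s**2 / np.sum(s**2)          # per-component explained variance
cum = np.cumsum(var_ratio)

# smallest number of components that keeps 90% of the variance
k = int(np.searchsorted(cum, 0.90)) + 1
print(k)
```

With real TF-IDF data the same computation is available directly as `pca.explained_variance_ratio_` on a fitted scikit-learn PCA object.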