[Kaggle] Naver Movie Review Classification (2)
·
🗣️ Natural Language Processing
# Create the preprocessing function, then apply it
import re
from nltk.tokenize import word_tokenize

def preprocessing(data,stopword):
    # Strip punctuation and bracket characters
    rm = re.compile(r'[:;\'\"\[\]\(\)\.,@]')
    rm_data = data.astype(str).apply(lambda x: re.sub(rm, '', x))
    word_token = [word_tokenize(x) for x in rm_data]
    remove_stopwords_tokens = []
    for sentence in word_token:
        temp = []
        for word in sentence:
            if word not in stopword:
                temp.append(word)
        remove_stopwords_tokens.append(temp)
    return remove_stopwo..
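To see what this kind of preprocessing produces, here is a minimal, self-contained sketch. It assumes plain whitespace splitting as a stand-in for nltk's `word_tokenize` and takes a list of strings rather than a pandas Series; the name `preprocess`, the sample reviews, and the stopword set are hypothetical.

```python
import re

# Same punctuation set as the excerpt's pattern, written as a raw string.
PUNCT = re.compile(r"[:;'\"\[\]().,@]")

def preprocess(texts, stopwords):
    """Strip punctuation, tokenize, and drop stopwords for each text."""
    cleaned = (PUNCT.sub("", str(t)) for t in texts)
    # Assumption: whitespace splitting stands in for nltk's word_tokenize.
    return [[w for w in line.split() if w not in stopwords] for line in cleaned]

# Hypothetical sample data, not taken from the NSMC dataset.
reviews = ["This movie, (really) was great.", "Not good; not bad."]
tokens = preprocess(reviews, stopwords={"was", "not"})
print(tokens)
```

Stopword matching here is case-sensitive, so "Not" survives while "not" is dropped. Lowercase the text first if that is not the intended behaviour.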
Building a Chatbot (1)
·
🗣️ Natural Language Processing
We build a chatbot, the most frequently used application of natural language processing. There are many approaches, such as a simple rule-based design or machine-learning similarity matching, but this exercise uses deep learning. Within deep learning, we build the chatbot with a sequence-to-sequence model. Data: github.com/songys/Chatbot_data (Chatbot_data_for_Korean). The dataset was produced with reference to stories that come up often at ( http://cafe116.daum.net/_c21_/home?grpid=1bld ) ..
MaLSTM
·
🗣️ Natural Language Processing
############## MaLSTM model ############## An LSTM-family model is used to compute the similarity between sentences. The MaLSTM model was first introduced in a 2016 paper by Jonas Mueller of MIT. It is trained on text as sequences and proved more effective than a conventional RNN at learning long-range dependencies. MaLSTM is short for Manhattan Distance + LSTM. It uses the Manhattan distance (L1) in place of cosine similarity. The last-step values, $h_5^{a}$ from $LSTM_a$ and $h_4^{b}$ from $LSTM_b$, are used as the hidden state vectors. Each value reflects information about every word in its sentence, so it serves as a vector representing the whole sentence. ..
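A minimal sketch of the similarity step follows. It assumes the two final hidden states are already available as plain lists of floats; the vectors below are made-up values, and `manhattan_similarity` is a hypothetical helper name. The form $\exp(-\lVert h^{a} - h^{b} \rVert_1)$ is the one used in Mueller's paper; it is not spelled out in the excerpt above.

```python
import math

def manhattan_similarity(h_a, h_b):
    """Return exp(-||h_a - h_b||_1), a score in (0, 1]."""
    if len(h_a) != len(h_b):
        raise ValueError("hidden state vectors must have the same length")
    # L1 (Manhattan) distance between the two sentence vectors
    l1 = sum(abs(x - y) for x, y in zip(h_a, h_b))
    return math.exp(-l1)

# Made-up final hidden states for two sentences (illustration only).
h_a = [0.2, -0.1, 0.5]
h_b = [0.1, -0.1, 0.7]
sim = manhattan_similarity(h_a, h_b)
print(round(sim, 4))
```

Identical vectors score exactly 1, and the score decays toward 0 as the L1 distance grows, which makes it easy to compare against a 0/1 duplicate label.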
[Kaggle] Naver Movie Review Classification (1)
·
🗣️ Natural Language Processing
www.kaggle.com/c/dfc615k/data (DFC615K: DFC615 Natural Language Processing Task 1)
NSMC: a binary-class dataset in which the star ratings attached to Naver movie reviews have been converted to positive/negative labels.
# kaggle-nsmc
import os
import zipfile

def extractall(path,s_path,info=None,f_type=None):
    file_list = os.listdir(path)
    for file in file_list:
        try:
            if file.split('.')[1] in "zip":
                zipRef = zipfile.ZipFile(path + file, 'r')
                zipRef.extractall(s_path)
                #..
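One possible variant of the extraction helper is sketched below, using only the standard library. It is not the post's full function (the excerpt is cut off, and the purpose of `info` and `f_type` is unknown, so they are omitted). The name `extract_all_zips` is hypothetical. Note that `file.split('.')[1] in "zip"` is a substring test on the second dot-separated piece, so this sketch checks the real extension with `os.path.splitext` instead.

```python
import os
import zipfile

def extract_all_zips(src_dir, dest_dir):
    """Extract every .zip archive found directly in src_dir into dest_dir.

    Returns the names of the archives that were extracted.
    """
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        # Check the actual extension, so names like "train.v1.zip" also work.
        if os.path.splitext(name)[1].lower() != ".zip":
            continue
        # Skip files that carry the extension but are not valid archives.
        if not zipfile.is_zipfile(path):
            continue
        with zipfile.ZipFile(path, "r") as archive:
            archive.extractall(dest_dir)
        extracted.append(name)
    return extracted
```

Building paths with `os.path.join` avoids depending on a trailing slash in the directory argument, which `path + file` requires.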
PCA and SVD for Latent Semantic Analysis
·
🗣️ Natural Language Processing
=== PCA ===
Apply scikit-learn's PCA model to SMS messages.
import pandas as pd
from nlpia.data.loaders import get_data

sms = get_data("sms-spam")
sms.head()
index = ['sms{}{}'.format(i,'!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms.index = index

# Compute the TF-IDF vector for each message
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
tfidf = TfidfVectorizer(t..
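To make the link between PCA and SVD concrete, here is a small NumPy sketch. It is not the post's scikit-learn code: it assumes a tiny made-up matrix standing in for TF-IDF vectors, and `pca_project` is a hypothetical helper. It centres the matrix, takes its SVD, and projects onto the top right-singular vectors.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # centre each feature (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # principal axes, one per row
    explained_variance = (S[:n_components] ** 2) / (X.shape[0] - 1)
    return Xc @ components.T, explained_variance

# Made-up 4 documents x 3 terms matrix (stand-in for TF-IDF vectors).
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.3],
    [0.0, 0.8, 0.4],
])
Z, var = pca_project(X, n_components=2)
print(Z.shape, var)
```

Singular values come back in descending order, so the first projected column always carries the largest variance.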
CNN Text Similarity Analysis (Feat. Quora pairs)
·
🗣️ Natural Language Processing
kaggle: www.kaggle.com/c/quora-question-pairs/submissions (Quora Question Pairs: Can you identify question pairs that have the same intent?)
The task is to decide whether two questions posted on Quora, a question-and-answer site, are the same question.
columns: ['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate']
CNN - convolutional neural network
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied ..
KoNLPy Analyzer Types
·
🗣️ Natural Language Processing
# KoNLPy library objects
# The Okt analyzer is the one used most often
# Hannanum: 한나눔. Developed by the KAIST Semantic Web Research Center.
# http://semanticweb.kaist.ac.kr/hannanum/
# Kkma: 꼬꼬마. Developed by the IDS (Intelligent Data Systems) lab at Seoul National University.
# http://kkma.snu.ac.kr/
# Komoran: 코모란. Developed by Shineware.
# https://github.com/shin285/KOMORAN
# Mecab: 메카브. A Japanese morphological analyzer modified to work with Korean.
# https://bitbucket.org/eunjeon/mecab-ko
# Open Korean Text: open so..
SVD(singular value decomposition) VS SVM(support vector machine)
·
🗣️ Natural Language Processing
dspace.mit.edu/bitstream/handle/1721.1/77902/18-337j-spring-2005/contents/lecture-notes/chapter_12.pdf
SVM: a classifier that assigns data to categories
SVD: a decomposition of a matrix into components
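Despite the similar abbreviations, the two are different kinds of objects. The standard textbook definitions are given below as a reminder; they are not taken from the linked lecture notes. $A$ is any real matrix, $x$ is a feature vector, and $w$, $b$ are the learned parameters of a linear SVM.

```latex
% SVD: factorises a matrix into orthogonal directions and non-negative scales
A = U \Sigma V^{\top}, \qquad
U^{\top} U = I, \quad V^{\top} V = I, \quad
\Sigma = \operatorname{diag}(\sigma_1 \ge \sigma_2 \ge \dots \ge 0)

% Linear SVM: predicts a class from the side of a separating hyperplane
\hat{y}(x) = \operatorname{sign}\left(w^{\top} x + b\right)
```

SVD needs no labels and returns a factorisation of the data, whereas an SVM is trained on labelled examples and returns a decision rule.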
다했다
Posts in the '🗣️ Natural Language Processing' category (Page 6)