Sentiment Analysis

Dataset: Large Movie Review Dataset, ai.stanford.edu/~amaas/data/sentiment/
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

Classification is performed on the original data released by the Stanford University AI group.
import glob
import os
from random import shuffle

import numpy as np  # used later to reshape the padded data into tensors
Data load
def preprocess_data(filepath):
    """Read the aclImdb pos/neg text files into a shuffled (label, text) list."""
    positive_path = os.path.join(filepath, 'pos')
    negative_path = os.path.join(filepath, 'neg')
    pos_label = 1
    neg_label = 0
    dataset = []
    for filename in glob.glob(os.path.join(positive_path, "*.txt")):
        with open(filename, 'r') as f:
            dataset.append((pos_label, f.read()))
    for filename in glob.glob(os.path.join(negative_path, "*.txt")):
        with open(filename, 'r') as f:
            dataset.append((neg_label, f.read()))
    shuffle(dataset)
    return dataset
dataset = preprocess_data('개인환경/aclimdb/train')  # '개인환경' is the author's local directory; point this at your aclImdb/train folder
Word vectorization
from nltk.tokenize import TreebankWordTokenizer
from gensim.models.keyedvectors import KeyedVectors
from nlpia.loaders import get_data
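Note that word_vectors is used below but never defined in the excerpt above. A minimal sketch of loading pretrained 300-dimensional Google News word2vec vectors via the nlpia get_data helper imported above (the commented lines show the plain-gensim equivalent, assuming a local copy of the vector file):

word_vectors = get_data('w2v', limit=200000)  # downloads/caches the GoogleNews vectors; limit keeps memory manageable
# Plain-gensim equivalent:
# word_vectors = KeyedVectors.load_word2vec_format(
#     'GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)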
Tokenization uses NLTK's TreebankWordTokenizer; each review in the dataset is tokenized.
def tokenize_and_vectorize(dataset):
    """Tokenize each review and replace its tokens with their word2vec vectors."""
    tokenizer = TreebankWordTokenizer()
    vectorized_data = []
    for sample in dataset:
        tokens = tokenizer.tokenize(sample[1])
        sample_vecs = []
        for token in tokens:
            try:
                sample_vecs.append(word_vectors[token])
            except KeyError:
                pass  # token not in the pretrained vocabulary; skip it
        vectorized_data.append(sample_vecs)
    return vectorized_data
vectorized_data = tokenize_and_vectorize(dataset)
collect_expected extracts the expected sentiment label for each sample.
def collect_expected(dataset):
    """Pull the target sentiment labels out of the (label, text) tuples."""
    expected = []
    for sample in dataset:
        expected.append(sample[0])
    return expected
expected = collect_expected(dataset)
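As a quick sanity check (the aclImdb train set is balanced, with 12,500 positive and 12,500 negative reviews), a hypothetical one-liner:

print(sum(expected), len(expected))  # expect roughly 12500 25000 on the full train set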
Train/test split at an 8:2 ratio.
split_point = int(len(vectorized_data) * .8)
x_train = vectorized_data[:split_point]
y_train = expected[:split_point]
x_test = vectorized_data[split_point:]
y_test = expected[split_point:]  # test targets come from the labels, matching y_train
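Because preprocess_data already shuffled the samples, a plain slice is a valid split. For reference, the same 8:2 split with scikit-learn (an alternative sketch, assuming scikit-learn is available):

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    vectorized_data, expected, test_size=0.2, shuffle=False)  # data already shuffled above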
- maxlen: cap each review at 400 tokens
- embedding_dims: length of the token vectors fed into the convolutional network
- filters: number of filters to train
- kernel_size: width of each 1D filter
- hidden_dims: number of neurons in the feed-forward layer at the end of the network
maxlen = 400
batch_size = 32
embedding_dims = 300
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 2
pad_trunc adjusts each sample to maxlen (400) tokens, playing the same role as Keras's pad_sequences.
def pad_trunc(data, maxlen):
    """Pad each sample with zero vectors, or truncate it, to exactly maxlen tokens."""
    new_data = []
    # A zero vector with the same dimensionality as the word vectors.
    zero_vector = [0.0] * len(data[0][0])
    for sample in data:
        if len(sample) > maxlen:
            temp = sample[:maxlen]
        elif len(sample) < maxlen:
            temp = list(sample)  # copy so the original sample is not mutated
            for _ in range(maxlen - len(sample)):
                temp.append(zero_vector)
        else:
            temp = sample
        new_data.append(temp)
    return new_data
x_train = pad_trunc(x_train,maxlen)
x_test = pad_trunc(x_test,maxlen)
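For comparison, the Keras pad_sequences mentioned above can do the same job directly on the lists of word vectors; a sketch assuming float32 vectors and post-padding/truncation to match pad_trunc's behavior:

from keras.preprocessing.sequence import pad_sequences

x_train = pad_sequences(x_train, maxlen=maxlen, dtype='float32',
                        padding='post', truncating='post')  # already returns an ndarray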
x_train = np.reshape(x_train,(len(x_train),maxlen,embedding_dims))
y_train = np.array(y_train)
x_test = np.reshape(x_test,(len(x_test),maxlen,embedding_dims))
y_test = np.array(y_test)
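A quick shape check before training (hypothetical; with the full 25,000-review train split the first dimension would be 20,000):

print(x_train.shape, y_train.shape)  # e.g. (20000, 400, 300) (20000,)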
Model build
- Sequential creates the scaffolding onto which layers are stacked in order
- Conv1D slides kernel_size-wide filters across the token sequence with stride 1; padding='valid' adds no padding, so the output sequence is slightly shorter than the input
- ReLU activation
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPool1D

model = Sequential()
model.add(Conv1D(
    filters,
    kernel_size,
    padding='valid',
    activation='relu',
    strides=1,
    input_shape=(maxlen, embedding_dims)))
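For reference: with padding='valid' and stride 1, the convolution output length is maxlen - kernel_size + 1 = 400 - 3 + 1 = 398, so this layer emits a (398, 250) tensor per sample.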
Max pooling extracts features by keeping the maximum activation of each filter across the whole sequence, reducing the (398, 250) tensor to a 250-dimensional vector.
model.add(GlobalMaxPool1D())
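The excerpt stops at the pooling layer, but the hidden_dims and epochs hyperparameters defined above imply the rest of the network. A minimal sketch of how this model is typically finished, with a feed-forward head and a sigmoid output for binary sentiment (an assumption, not the author's exact code):

from keras.layers import Dense, Dropout, Activation

# Feed-forward head on top of the pooled 250-dimensional feature vector.
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(1))               # single output unit
model.add(Activation('sigmoid'))  # probability that the review is positive

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))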