This paper proposes a model for improving text classification performance; its main points are as follows.
It can cope with unknown tokens (new words that never appeared during training), since it operates at the character level.
Model
1) Input for first convolution
After downloading the example data, you can see Yelp reviews, each with a star rating.
import tensorflow as tf
import numpy as np
import json
f = open('./yelp_academic_dataset_review.json')
files = list(f)  # each line of the file is one JSON-encoded review
files[0]
'{"votes": {"funny": 0, "useful": 5, "cool": 2}, "user_id": "rLtl8ZkDX5vH5nAx9C3q5Q", "review_id": "fWKvX83p0-ka4JS3dc6E5A", "stars": 5, "date": "2011-01-26", "text": "My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\\n\\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\\n\\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best \\"toast\\" I\'ve ever had.\\n\\nAnyway, I can\'t wait to go back!", "type": "review", "business_id": "9yKzy9PApeiPPOUJEtnvkg"}\n'
Each line is read in as a plain string, so files[0] above is raw text rather than a dictionary.
But since we want the original dictionary form, we parse each line as JSON.
# parse each JSON string into a dict
examples = [json.loads(i) for i in files]
examples[0]
{'votes': {'funny': 0, 'useful': 5, 'cool': 2},
'user_id': 'rLtl8ZkDX5vH5nAx9C3q5Q',
'review_id': 'fWKvX83p0-ka4JS3dc6E5A',
'stars': 5,
'date': '2011-01-26',
'text': 'My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!',
'type': 'review',
'business_id': '9yKzy9PApeiPPOUJEtnvkg'}
Let's look at a single review.
# give an example
review = examples[0]
# star rating (out of five)
stars = review['stars']
# text corresponding to the example
example = list(review['text'].lower())
print(example)
['m', 'y', ' ', 'w', 'i', 'f', 'e', ' ', 't', 'o', 'o', 'k', ' ', 'm', 'e', ' ', 'h', 'e', 'r', 'e', ' ', 'o', 'n', ' ', 'm', 'y', ' ', 'b', 'i', 'r', 't', 'h', 'd', 'a', 'y', ' ', 'f', 'o', 'r', ' ', 'b', 'r', 'e', 'a', 'k', 'f', 'a', 's', 't', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 'e', 'x', 'c', 'e', 'l', 'l', 'e', 'n', 't', '.', ' ', ' ', 't', 'h', 'e', ' ', 'w', 'e', 'a', 't', 'h', 'e', 'r', ' ', 'w', 'a', 's', ' ', 'p', 'e', 'r', 'f', 'e', 'c', 't', ' ', 'w', 'h', 'i', 'c', 'h', ' ', 'm', 'a', 'd', 'e', ' ', 's', 'i', 't', 't', 'i', 'n', 'g', ' ', 'o', 'u', 't', 's', 'i', 'd', 'e', ' ', 'o', 'v', 'e', 'r', 'l', 'o', 'o', 'k', 'i', 'n', 'g', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'g', 'r', 'o', 'u', 'n', 'd', 's', ' ', 'a', 'n', ' ', 'a', 'b', 's', 'o', 'l', 'u', 't', 'e', ' ', 'p', 'l', 'e', 'a', 's', 'u', 'r', 'e', '.', ' ', ' ', 'o', 'u', 'r', ' ', 'w', 'a', 'i', 't', 'r', 'e', 's', 's', ' ', 'w', 'a', 's', ' ', 'e', 'x', 'c', 'e', 'l', 'l', 'e', 'n', 't', ' ', 'a', 'n', 'd', ' ', 'o', 'u', 'r', ' ', 'f', 'o', 'o', 'd', ' ', 'a', 'r', 'r', 'i', 'v', 'e', 'd', ' ', 'q', 'u', 'i', 'c', 'k', 'l', 'y', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 's', 'e', 'm', 'i', '-', 'b', 'u', 's', 'y', ' ', 's', 'a', 't', 'u', 'r', 'd', 'a', 'y', ' ', 'm', 'o', 'r', 'n', 'i', 'n', 'g', '.', ' ', ' ', 'i', 't', ' ', 'l', 'o', 'o', 'k', 'e', 'd', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'h', 'e', ' ', 'p', 'l', 'a', 'c', 'e', ' ', 'f', 'i', 'l', 'l', 's', ' ', 'u', 'p', ' ', 'p', 'r', 'e', 't', 't', 'y', ' ', 'q', 'u', 'i', 'c', 'k', 'l', 'y', ' ', 's', 'o', ' ', 't', 'h', 'e', ' ', 'e', 'a', 'r', 'l', 'i', 'e', 'r', ' ', 'y', 'o', 'u', ' ', 'g', 'e', 't', ' ', 'h', 'e', 'r', 'e', ' ', 't', 'h', 'e', ' ', 'b', 'e', 't', 't', 'e', 'r', '.', '\n', '\n', 'd', 'o', ' ', 'y', 'o', 'u', 'r', 's', 'e', 'l', 'f', ' ', 'a', ' ', 'f', 'a', 'v', 'o', 'r', ' ', 'a', 'n', 'd', ' ', 'g', 'e', 't', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'b', 'l', 'o', 'o', 'd', 'y', ' ', 'm', 'a', 'r', 'y', '.', ' ', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 'p', 'h', 'e', 'n', 'o', 'm', 'e', 'n', 'a', 'l', ' ', 'a', 'n', 'd', ' ', 's', 'i', 'm', 'p', 'l', 'y', ' ', 't', 'h', 'e', ' ', 'b', 'e', 's', 't', ' ', 'i', "'", 'v', 'e', ' ', 'e', 'v', 'e', 'r', ' ', 'h', 'a', 'd', '.', ' ', ' ', 'i', "'", 'm', ' ', 'p', 'r', 'e', 't', 't', 'y', ' ', 's', 'u', 'r', 'e', ' ', 't', 'h', 'e', 'y', ' ', 'o', 'n', 'l', 'y', ' ', 'u', 's', 'e', ' ', 'i', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's', ' ', 'f', 'r', 'o', 'm', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'g', 'a', 'r', 'd', 'e', 'n', ' ', 'a', 'n', 'd', ' ', 'b', 'l', 'e', 'n', 'd', ' ', 't', 'h', 'e', 'm', ' ', 'f', 'r', 'e', 's', 'h', ' ', 'w', 'h', 'e', 'n', ' ', 'y', 'o', 'u', ' ', 'o', 'r', 'd', 'e', 'r', ' ', 'i', 't', '.', ' ', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', '.', '\n', '\n', 'w', 'h', 'i', 'l', 'e', ' ', 'e', 'v', 'e', 'r', 'y', 't', 'h', 'i', 'n', 'g', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 'm', 'e', 'n', 'u', ' ', 'l', 'o', 'o', 'k', 's', ' ', 'e', 'x', 'c', 'e', 'l', 'l', 'e', 'n', 't', ',', ' ', 'i', ' ', 'h', 'a', 'd', ' ', 't', 'h', 'e', ' ', 'w', 'h', 'i', 't', 'e', ' ', 't', 'r', 'u', 'f', 'f', 'l', 'e', ' ', 's', 'c', 'r', 'a', 'm', 'b', 'l', 'e', 'd', ' ', 'e', 'g', 'g', 's', ' ', 'v', 'e', 'g', 'e', 't', 'a', 'b', 'l', 'e', ' ', 's', 'k', 'i', 'l', 'l', 'e', 't', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 't', 'a', 's', 't', 'y', ' ', 'a', 'n', 'd', ' ', 'd', 'e', 'l', 'i', 'c', 'i', 'o', 'u', 's', '.', ' ', 
' ', 'i', 't', ' ', 'c', 'a', 'm', 'e', ' ', 'w', 'i', 't', 'h', ' ', '2', ' ', 'p', 'i', 'e', 'c', 'e', 's', ' ', 'o', 'f', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'g', 'r', 'i', 'd', 'd', 'l', 'e', 'd', ' ', 'b', 'r', 'e', 'a', 'd', ' ', 'w', 'i', 't', 'h', ' ', 'w', 'a', 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'a', 'b', 's', 'o', 'l', 'u', 't', 'e', 'l', 'y', ' ', 'm', 'a', 'd', 'e', ' ', 't', 'h', 'e', ' ', 'm', 'e', 'a', 'l', ' ', 'c', 'o', 'm', 'p', 'l', 'e', 't', 'e', '.', ' ', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 't', 'h', 'e', ' ', 'b', 'e', 's', 't', ' ', '"', 't', 'o', 'a', 's', 't', '"', ' ', 'i', "'", 'v', 'e', ' ', 'e', 'v', 'e', 'r', ' ', 'h', 'a', 'd', '.', '\n', '\n', 'a', 'n', 'y', 'w', 'a', 'y', ',', ' ', 'i', ' ', 'c', 'a', 'n', "'", 't', ' ', 'w', 'a', 'i', 't', ' ', 't', 'o', ' ', 'g', 'o', ' ', 'b', 'a', 'c', 'k', '!']
# cut by a specific threshold
def extract_fore(char_seq):
    if len(char_seq) > 1014:
        char_seq = char_seq[0:1014]
    return char_seq

text_fore_extracted = extract_fore(example)
print(text_fore_extracted)
['m', 'y', ' ', 'w', 'i', 'f', 'e', ' ', 't', 'o', 'o', 'k', ' ', 'm', 'e', ' ', 'h', 'e', 'r', 'e', ' ', 'o', 'n', ' ', 'm', 'y', ' ', 'b', 'i', 'r', 't', 'h', 'd', 'a', 'y', ' ', 'f', 'o', 'r', ' ', 'b', 'r', 'e', 'a', 'k', 'f', 'a', 's', 't', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 'e', 'x', 'c', 'e', 'l', 'l', 'e', 'n', 't', '.', ' ', ' ', 't', 'h', 'e', ' ', 'w', 'e', 'a', 't', 'h', 'e', 'r', ' ', 'w', 'a', 's', ' ', 'p', 'e', 'r', 'f', 'e', 'c', 't', ' ', 'w', 'h', 'i', 'c', 'h', ' ', 'm', 'a', 'd', 'e', ' ', 's', 'i', 't', 't', 'i', 'n', 'g', ' ', 'o', 'u', 't', 's', 'i', 'd', 'e', ' ', 'o', 'v', 'e', 'r', 'l', 'o', 'o', 'k', 'i', 'n', 'g', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'g', 'r', 'o', 'u', 'n', 'd', 's', ' ', 'a', 'n', ' ', 'a', 'b', 's', 'o', 'l', 'u', 't', 'e', ' ', 'p', 'l', 'e', 'a', 's', 'u', 'r', 'e', '.', ' ', ' ', 'o', 'u', 'r', ' ', 'w', 'a', 'i', 't', 'r', 'e', 's', 's', ' ', 'w', 'a', 's', ' ', 'e', 'x', 'c', 'e', 'l', 'l', 'e', 'n', 't', ' ', 'a', 'n', 'd', ' ', 'o', 'u', 'r', ' ', 'f', 'o', 'o', 'd', ' ', 'a', 'r', 'r', 'i', 'v', 'e', 'd', ' ', 'q', 'u', 'i', 'c', 'k', 'l', 'y', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 's', 'e', 'm', 'i', '-', 'b', 'u', 's', 'y', ' ', 's', 'a', 't', 'u', 'r', 'd', 'a', 'y', ' ', 'm', 'o', 'r', 'n', 'i', 'n', 'g', '.', ' ', ' ', 'i', 't', ' ', 'l', 'o', 'o', 'k', 'e', 'd', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'h', 'e', ' ', 'p', 'l', 'a', 'c', 'e', ' ', 'f', 'i', 'l', 'l', 's', ' ', 'u', 'p', ' ', 'p', 'r', 'e', 't', 't', 'y', ' ', 'q', 'u', 'i', 'c', 'k', 'l', 'y', ' ', 's', 'o', ' ', 't', 'h', 'e', ' ', 'e', 'a', 'r', 'l', 'i', 'e', 'r', ' ', 'y', 'o', 'u', ' ', 'g', 'e', 't', ' ', 'h', 'e', 'r', 'e', ' ', 't', 'h', 'e', ' ', 'b', 'e', 't', 't', 'e', 'r', '.', '\n', '\n', 'd', 'o', ' ', 'y', 'o', 'u', 'r', 's', 'e', 'l', 'f', ' ', 'a', ' ', 'f', 'a', 'v', 'o', 'r', ' ', 'a', 'n', 'd', ' ', 'g', 'e', 't', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'b', 'l', 'o', 'o', 'd', 'y', ' ', 'm', 'a', 'r', 'y', '.', ' ', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 'p', 'h', 'e', 'n', 'o', 'm', 'e', 'n', 'a', 'l', ' ', 'a', 'n', 'd', ' ', 's', 'i', 'm', 'p', 'l', 'y', ' ', 't', 'h', 'e', ' ', 'b', 'e', 's', 't', ' ', 'i', "'", 'v', 'e', ' ', 'e', 'v', 'e', 'r', ' ', 'h', 'a', 'd', '.', ' ', ' ', 'i', "'", 'm', ' ', 'p', 'r', 'e', 't', 't', 'y', ' ', 's', 'u', 'r', 'e', ' ', 't', 'h', 'e', 'y', ' ', 'o', 'n', 'l', 'y', ' ', 'u', 's', 'e', ' ', 'i', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's', ' ', 'f', 'r', 'o', 'm', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'g', 'a', 'r', 'd', 'e', 'n', ' ', 'a', 'n', 'd', ' ', 'b', 'l', 'e', 'n', 'd', ' ', 't', 'h', 'e', 'm', ' ', 'f', 'r', 'e', 's', 'h', ' ', 'w', 'h', 'e', 'n', ' ', 'y', 'o', 'u', ' ', 'o', 'r', 'd', 'e', 'r', ' ', 'i', 't', '.', ' ', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', '.', '\n', '\n', 'w', 'h', 'i', 'l', 'e', ' ', 'e', 'v', 'e', 'r', 'y', 't', 'h', 'i', 'n', 'g', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 'm', 'e', 'n', 'u', ' ', 'l', 'o', 'o', 'k', 's', ' ', 'e', 'x', 'c', 'e', 'l', 'l', 'e', 'n', 't', ',', ' ', 'i', ' ', 'h', 'a', 'd', ' ', 't', 'h', 'e', ' ', 'w', 'h', 'i', 't', 'e', ' ', 't', 'r', 'u', 'f', 'f', 'l', 'e', ' ', 's', 'c', 'r', 'a', 'm', 'b', 'l', 'e', 'd', ' ', 'e', 'g', 'g', 's', ' ', 'v', 'e', 'g', 'e', 't', 'a', 'b', 'l', 'e', ' ', 's', 'k', 'i', 'l', 'l', 'e', 't', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 't', 'a', 's', 't', 'y', ' ', 'a', 'n', 'd', ' ', 'd', 'e', 'l', 'i', 'c', 'i', 'o', 'u', 's', '.', ' ', 
' ', 'i', 't', ' ', 'c', 'a', 'm', 'e', ' ', 'w', 'i', 't', 'h', ' ', '2', ' ', 'p', 'i', 'e', 'c', 'e', 's', ' ', 'o', 'f', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'g', 'r', 'i', 'd', 'd', 'l', 'e', 'd', ' ', 'b', 'r', 'e', 'a', 'd', ' ', 'w', 'i', 't', 'h', ' ', 'w', 'a', 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'a', 'b', 's', 'o', 'l', 'u', 't', 'e', 'l', 'y', ' ', 'm', 'a', 'd', 'e', ' ', 't', 'h', 'e', ' ', 'm', 'e', 'a', 'l', ' ', 'c', 'o', 'm', 'p', 'l', 'e', 't', 'e', '.', ' ', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 't', 'h', 'e', ' ', 'b', 'e', 's', 't', ' ', '"', 't', 'o', 'a', 's', 't', '"', ' ', 'i', "'", 'v', 'e', ' ', 'e', 'v', 'e', 'r', ' ', 'h', 'a', 'd', '.', '\n', '\n', 'a', 'n', 'y', 'w', 'a', 'y', ',', ' ', 'i', ' ', 'c', 'a', 'n', "'", 't', ' ', 'w', 'a', 'i', 't', ' ', 't', 'o', ' ', 'g', 'o', ' ', 'b', 'a', 'c', 'k', '!']
" "
padding을 하여, 모든 입력사이즈의 길이를 동일하게 해준다.
# add " " padding character
def pad_sentence(char_seq, padding_char=" "):
    char_seq_length = 1014
    num_padding = char_seq_length - len(char_seq)
    new_char_seq = char_seq + [padding_char] * num_padding
    return new_char_seq
padded = pad_sentence(text_fore_extracted)
print(padded)
['m', 'y', ' ', 'w', 'i', 'f', 'e', ' ', 't', 'o', 'o', 'k', ' ', 'm', 'e', ' ', 'h', 'e', 'r', 'e', ' ', 'o', 'n', ' ', 'm', 'y', ' ', 'b', 'i', 'r', 't', 'h', 'd', 'a', 'y', ' ', 'f', 'o', 'r', ' ', 'b', 'r', 'e', 'a', 'k', 'f', 'a', 's', 't', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 'e', 'x', 'c', 'e', 'l', 'l', 'e', 'n', 't', '.', ' ', ' ', 't', 'h', 'e', ' ', 'w', 'e', 'a', 't', 'h', 'e', 'r', ' ', 'w', 'a', 's', ' ', 'p', 'e', 'r', 'f', 'e', 'c', 't', ' ', 'w', 'h', 'i', 'c', 'h', ' ', 'm', 'a', 'd', 'e', ' ', 's', 'i', 't', 't', 'i', 'n', 'g', ' ', 'o', 'u', 't', 's', 'i', 'd', 'e', ' ', 'o', 'v', 'e', 'r', 'l', 'o', 'o', 'k', 'i', 'n', 'g', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'g', 'r', 'o', 'u', 'n', 'd', 's', ' ', 'a', 'n', ' ', 'a', 'b', 's', 'o', 'l', 'u', 't', 'e', ' ', 'p', 'l', 'e', 'a', 's', 'u', 'r', 'e', '.', ' ', ' ', 'o', 'u', 'r', ' ', 'w', 'a', 'i', 't', 'r', 'e', 's', 's', ' ', 'w', 'a', 's', ' ', 'e', 'x', 'c', 'e', 'l', 'l', 'e', 'n', 't', ' ', 'a', 'n', 'd', ' ', 'o', 'u', 'r', ' ', 'f', 'o', 'o', 'd', ' ', 'a', 'r', 'r', 'i', 'v', 'e', 'd', ' ', 'q', 'u', 'i', 'c', 'k', 'l', 'y', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 's', 'e', 'm', 'i', '-', 'b', 'u', 's', 'y', ' ', 's', 'a', 't', 'u', 'r', 'd', 'a', 'y', ' ', 'm', 'o', 'r', 'n', 'i', 'n', 'g', '.', ' ', ' ', 'i', 't', ' ', 'l', 'o', 'o', 'k', 'e', 'd', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'h', 'e', ' ', 'p', 'l', 'a', 'c', 'e', ' ', 'f', 'i', 'l', 'l', 's', ' ', 'u', 'p', ' ', 'p', 'r', 'e', 't', 't', 'y', ' ', 'q', 'u', 'i', 'c', 'k', 'l', 'y', ' ', 's', 'o', ' ', 't', 'h', 'e', ' ', 'e', 'a', 'r', 'l', 'i', 'e', 'r', ' ', 'y', 'o', 'u', ' ', 'g', 'e', 't', ' ', 'h', 'e', 'r', 'e', ' ', 't', 'h', 'e', ' ', 'b', 'e', 't', 't', 'e', 'r', '.', '\n', '\n', 'd', 'o', ' ', 'y', 'o', 'u', 'r', 's', 'e', 'l', 'f', ' ', 'a', ' ', 'f', 'a', 'v', 'o', 'r', ' ', 'a', 'n', 'd', ' ', 'g', 'e', 't', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'b', 'l', 'o', 'o', 'd', 'y', ' ', 'm', 'a', 'r', 'y', '.', ' ', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 'p', 'h', 'e', 'n', 'o', 'm', 'e', 'n', 'a', 'l', ' ', 'a', 'n', 'd', ' ', 's', 'i', 'm', 'p', 'l', 'y', ' ', 't', 'h', 'e', ' ', 'b', 'e', 's', 't', ' ', 'i', "'", 'v', 'e', ' ', 'e', 'v', 'e', 'r', ' ', 'h', 'a', 'd', '.', ' ', ' ', 'i', "'", 'm', ' ', 'p', 'r', 'e', 't', 't', 'y', ' ', 's', 'u', 'r', 'e', ' ', 't', 'h', 'e', 'y', ' ', 'o', 'n', 'l', 'y', ' ', 'u', 's', 'e', ' ', 'i', 'n', 'g', 'r', 'e', 'd', 'i', 'e', 'n', 't', 's', ' ', 'f', 'r', 'o', 'm', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'g', 'a', 'r', 'd', 'e', 'n', ' ', 'a', 'n', 'd', ' ', 'b', 'l', 'e', 'n', 'd', ' ', 't', 'h', 'e', 'm', ' ', 'f', 'r', 'e', 's', 'h', ' ', 'w', 'h', 'e', 'n', ' ', 'y', 'o', 'u', ' ', 'o', 'r', 'd', 'e', 'r', ' ', 'i', 't', '.', ' ', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', '.', '\n', '\n', 'w', 'h', 'i', 'l', 'e', ' ', 'e', 'v', 'e', 'r', 'y', 't', 'h', 'i', 'n', 'g', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 'm', 'e', 'n', 'u', ' ', 'l', 'o', 'o', 'k', 's', ' ', 'e', 'x', 'c', 'e', 'l', 'l', 'e', 'n', 't', ',', ' ', 'i', ' ', 'h', 'a', 'd', ' ', 't', 'h', 'e', ' ', 'w', 'h', 'i', 't', 'e', ' ', 't', 'r', 'u', 'f', 'f', 'l', 'e', ' ', 's', 'c', 'r', 'a', 'm', 'b', 'l', 'e', 'd', ' ', 'e', 'g', 'g', 's', ' ', 'v', 'e', 'g', 'e', 't', 'a', 'b', 'l', 'e', ' ', 's', 'k', 'i', 'l', 'l', 'e', 't', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 't', 'a', 's', 't', 'y', ' ', 'a', 'n', 'd', ' ', 'd', 'e', 'l', 'i', 'c', 'i', 'o', 'u', 's', '.', ' ', 
' ', 'i', 't', ' ', 'c', 'a', 'm', 'e', ' ', 'w', 'i', 't', 'h', ' ', '2', ' ', 'p', 'i', 'e', 'c', 'e', 's', ' ', 'o', 'f', ' ', 't', 'h', 'e', 'i', 'r', ' ', 'g', 'r', 'i', 'd', 'd', 'l', 'e', 'd', ' ', 'b', 'r', 'e', 'a', 'd', ' ', 'w', 'i', 't', 'h', ' ', 'w', 'a', 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'a', 'b', 's', 'o', 'l', 'u', 't', 'e', 'l', 'y', ' ', 'm', 'a', 'd', 'e', ' ', 't', 'h', 'e', ' ', 'm', 'e', 'a', 'l', ' ', 'c', 'o', 'm', 'p', 'l', 'e', 't', 'e', '.', ' ', ' ', 'i', 't', ' ', 'w', 'a', 's', ' ', 't', 'h', 'e', ' ', 'b', 'e', 's', 't', ' ', '"', 't', 'o', 'a', 's', 't', '"', ' ', 'i', "'", 'v', 'e', ' ', 'e', 'v', 'e', 'r', ' ', 'h', 'a', 'd', '.', '\n', '\n', 'a', 'n', 'y', 'w', 'a', 'y', ',', ' ', 'i', ' ', 'c', 'a', 'n', "'", 't', ' ', 'w', 'a', 'i', 't', ' ', 't', 'o', ' ', 'g', 'o', ' ', 'b', 'a', 'c', 'k', '!', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
There is no need to keep the characters themselves. Define an alphabet dictionary and take each character's index in that dictionary.
def indexing(char_seq, alphabet):
    # str.find returns -1 for characters not in the alphabet (including the " " padding)
    x = np.array([alphabet.find(char) for char in char_seq], dtype=np.int8)
    return x
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}\n"
text_int8_repr = indexing(padded, alphabet)
text_int8_repr
array([12, 24, -1, ..., -1, -1, -1], dtype=int8)
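These -1 entries are how out-of-alphabet characters are handled: str.find returns -1 for anything not in the alphabet, including the " " padding character, so such positions later stay all-zero in the one-hot tensor. A quick illustrative check (not from the original code):
print(alphabet.find('a'))   # 0: first entry of the alphabet
print(alphabet.find(' '))   # -1: the padding space is not in the alphabet
print(alphabet.find('é'))   # -1: characters never seen in the alphabet behave the same way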
Reviews with high ratings (4 or 5 stars) are labeled with a 1 in the second column, and reviews with low ratings (1 or 2 stars) with a 1 in the first column; 3-star reviews are treated as neutral and skipped.
# preprocessing
i = 0
reviews = []
labels = []
for e in examples:
    stars = e["stars"]
    text = e["text"]
    if stars != 3:
        text_end_extracted = extract_fore(list(text.lower()))
        padded = pad_sentence(text_end_extracted)
        text_int8_repr = indexing(padded, alphabet)
        if stars == 1 or stars == 2:
            labels.append([1, 0])
            reviews.append(text_int8_repr)
        elif stars == 4 or stars == 5:
            labels.append([0, 1])
            reviews.append(text_int8_repr)
        i += 1
        if i % 10000 == 0:
            print("Non-neutral instances processed: " + str(i))
Non-neutral instances processed: 10000
Non-neutral instances processed: 20000
Non-neutral instances processed: 30000
Non-neutral instances processed: 40000
Non-neutral instances processed: 50000
Non-neutral instances processed: 60000
Non-neutral instances processed: 70000
Non-neutral instances processed: 80000
Non-neutral instances processed: 90000
Non-neutral instances processed: 100000
Non-neutral instances processed: 110000
Non-neutral instances processed: 120000
Non-neutral instances processed: 130000
Non-neutral instances processed: 140000
Non-neutral instances processed: 150000
Non-neutral instances processed: 160000
Non-neutral instances processed: 170000
Non-neutral instances processed: 180000
Non-neutral instances processed: 190000
The data contains 194,544 reviews, each a fixed-length sequence of 1,014 characters.
x = np.array(reviews, dtype=np.int8)
y = np.array(labels, dtype=np.int8)
print("x_char_seq_ind=" + str(x.shape))
print("y shape=" + str(y.shape))
x_char_seq_ind=(194544, 1014)
y shape=(194544, 2)
print('X:', x[0:10],'\n\n','Y:', y[0:10])
X: [[12 24 -1 ... -1 -1 -1]
[ 8 -1 7 ... 4 -1 2]
[11 14 21 ... -1 -1 -1]
...
[11 20 2 ... -1 -1 -1]
[ 3 4 5 ... -1 -1 -1]
[13 14 1 ... -1 -1 -1]]
Y: [[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]
[0 1]]
Shuffle the data so the model sees it in a representative order, and split it into train and dev sets so the model can be evaluated.
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]
# splitting off a dev set
n_dev_samples = int(len(x_shuffled)*0.3)
x_train, x_dev = x_shuffled[:-n_dev_samples], x_shuffled[-n_dev_samples:]
y_train, y_dev = y_shuffled[:-n_dev_samples], y_shuffled[-n_dev_samples:]
print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev)))
Train/Dev split: 136181/58363
Assume a batch size of 20.
# as an example, take an arbitrary batch of size 20
x_batch = x_train[0:20]
y_batch = y_train[0:20]
print(x_batch.shape)
print(y_batch.shape)
(20, 1014)
(20, 2)
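x_batch and y_batch above are one manual slice; a full training loop would draw such batches repeatedly. A minimal batch-generator sketch (batch_iter is a hypothetical helper, not from the original code):
def batch_iter(x, y, batch_size=20):
    # yield consecutive (x, y) mini-batches of size batch_size
    for start in range(0, len(y), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

x_batch, y_batch = next(batch_iter(x_train, y_train))  # equivalent to the manual slice above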
We now want to build a tensor of shape (20, 70, 1014, 1): a batch of 20 examples, a 70-character dictionary, a maximum character-sequence length of 1,014, and a depth of 1. First create a tensor of that shape filled with zeros, then set a 1 at each character's position for each example, turning the indices into one-hot vectors.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}\n"
x_batch_one_hot = np.zeros(shape=[len(x_batch), len(alphabet), x_batch.shape[1], 1])
print(x_batch_one_hot.shape)
(20, 70, 1014, 1)
list(enumerate(x_batch))[0:2]
[(0, array([14, 20, 17, ..., 7, 4, -1], dtype=int8)),
(1, array([13, 14, 19, ..., -1, -1, -1], dtype=int8))]
example_i, char_seq_indices = list(enumerate(x_batch))[0]
print('example_i:',example_i)
print('char_seq_indices:',char_seq_indices)
example_i: 0
char_seq_indices: [14 20 17 ... 7 4 -1]
# pick a specific character
char_pos_in_seq, char_seq_char_ind = list(enumerate(char_seq_indices))[0]
print('char_pos_in_seq:',char_pos_in_seq)
print('char_seq_char_ind:',char_seq_char_ind )
char_pos_in_seq: 0
char_seq_char_ind: 14
char_seq_char_ind
14
# x_batch_one_hot.shape: (batch_size=20, dictionary size=70, max_length=1014, depth=1)
x_batch_one_hot[example_i][char_seq_char_ind][char_pos_in_seq] = 1
print(x_batch_one_hot[example_i][char_seq_char_ind])
[[1.]
[0.]
[0.]
...
[0.]
[0.]
[0.]]
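The walkthrough above fills a single position; encoding the whole batch simply repeats it in two nested loops. A minimal sketch of the full loop under that assumption (indices of -1, i.e., padding and out-of-alphabet characters, are skipped so their columns stay all-zero):
for example_i, char_seq_indices in enumerate(x_batch):
    for char_pos_in_seq, char_seq_char_ind in enumerate(char_seq_indices):
        if char_seq_char_ind != -1:  # skip padding / unknown characters
            x_batch_one_hot[example_i][char_seq_char_ind][char_pos_in_seq][0] = 1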
2) CNN architecture
The network is composed of: input $\rightarrow$ Conv 1 $\rightarrow$ pooling 1 $\rightarrow$ Conv 2 $\rightarrow$ pooling 2 $\rightarrow$ Conv 3 $\rightarrow$ Conv 4 $\rightarrow$ Conv 5 $\rightarrow$ Conv 6 $\rightarrow$ pooling 6 $\rightarrow$ dropout 1 $\rightarrow$ FcL 1 $\rightarrow$ dropout 2 $\rightarrow$ FcL 2 $\rightarrow$ FcL 3 $\rightarrow$ output.
num_classes=2
filter_sizes=(7, 7, 3, 3, 3, 3)
num_filters_per_size=256
l2_reg_lambda=0.0
sequence_max_length=1014
num_quantized_chars=70 # alphabet size
The placeholders, fed through a feed_dict to supply the model's inputs, are defined as follows.
# num_quantized_chars: the alphabet size (70), used as the height of the input
input_x = tf.placeholder(tf.float32, [None, num_quantized_chars, sequence_max_length, 1],
                         name="input_x")
input_x
<tf.Tensor 'input_x:0' shape=(?, 70, 1014, 1) dtype=float32>
# positive or negative
input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
input_y
<tf.Tensor 'input_y:0' shape=(?, 2) dtype=float32>
dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
dropout_keep_prob
<tf.Tensor 'dropout_keep_prob:0' shape=<unknown> dtype=float32>
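As a usage sketch (hypothetical, assembled from the pieces above), a single training step would feed these placeholders like so:
feed_dict = {
    input_x: x_batch_one_hot,   # (batch, 70, 1014, 1) one-hot inputs
    input_y: y_batch,           # (batch, 2) labels
    dropout_keep_prob: 0.5,     # keep probability while training (1.0 at evaluation)
}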
The filter height corresponds to the dictionary length, so it is set to 70; the filter width is the window over which the convolution is computed, so it is set to the filter size.
# tf.nn.conv2d(filter = [filter_height, filter_width, in_channels, out_channels])
filter_shape = [num_quantized_chars, filter_sizes[0], 1, num_filters_per_size]
filter_shape
[70, 7, 1, 256]
W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.05), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_filters_per_size]), name="b")
print(W)
print(b)
<tf.Variable 'W:0' shape=(70, 7, 1, 256) dtype=float32_ref>
<tf.Variable 'b:0' shape=(256,) dtype=float32_ref>
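With VALID padding and stride 1, the conv1 output width is (1014 - 7)/1 + 1 = 1008, which the tensor shape below confirms.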
conv = tf.nn.conv2d(input_x, W, strides=[1, 1, 1, 1], padding="VALID", name="conv1")
conv
<tf.Tensor 'conv1:0' shape=(?, 1, 1008, 256) dtype=float32>
h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
h
<tf.Tensor 'relu:0' shape=(?, 1, 1008, 256) dtype=float32>
Pooling
tf.nn.max_pool takes both its ksize and its strides in the form (batch_size, height, width, depth), where the width axis is the character subsequence being pooled over. With 'VALID' padding, trailing inputs that do not fill a complete window are dropped; with 'SAME' padding, the input is zero-padded so that every position is covered:
'VALID' (no padding):
inputs:  1  2  3  4  5  6  7  8  9  10 11 (12 13)
         |________________|                dropped
                        |_________________|
'SAME' (zero padding):
      pad|                                        |pad
inputs: 0 |1  2  3  4  5  6  7  8  9  10 11 12 13| 0  0
        |________________|
                       |_________________|
                                      |______________|
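In general (a standard formula, stated here for reference), with 'VALID' padding a window of size $k$ moved with stride $s$ over an input of width $n$ produces an output of width $\left\lfloor (n - k)/s \right\rfloor + 1$.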
Computing the output size of pool1 this way: (1008 - 3)/3 + 1 = 336.
pooled = tf.nn.max_pool(
    h,
    ksize=[1, 1, 3, 1],
    strides=[1, 1, 3, 1],
    padding='VALID',
    name="pool1")
pooled
<tf.Tensor 'pool1:0' shape=(?, 1, 336, 256) dtype=float32>
output size in conv2 = (336 - 7)/1 + 1 = 330; output size in pool2 = (330 - 3)/3 + 1 = 110
# conv-maxpool-2
filter_shape = [1, filter_sizes[1], num_filters_per_size, num_filters_per_size]
W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.05), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_filters_per_size]), name="b")
conv = tf.nn.conv2d(pooled, W, strides=[1, 1, 1, 1], padding="VALID", name="conv2")
h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
pooled = tf.nn.max_pool(
    h,
    ksize=[1, 1, 3, 1],
    strides=[1, 1, 3, 1],
    padding='VALID',
    name="pool2")
pooled
<tf.Tensor 'pool2:0' shape=(?, 1, 110, 256) dtype=float32>
output size in conv3 = (110 - 3)/1 + 1 = 108
# conv-3
filter_shape = [1, filter_sizes[2], num_filters_per_size, num_filters_per_size]
W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.05), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_filters_per_size]), name="b")
conv = tf.nn.conv2d(pooled, W, strides=[1, 1, 1, 1], padding="VALID", name="conv3")
h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
h
<tf.Tensor 'relu_2:0' shape=(?, 1, 108, 256) dtype=float32>
output size in conv4 = (108 - 3)/1 + 1 = 106
# conv-4
filter_shape = [1, filter_sizes[3], num_filters_per_size, num_filters_per_size]
W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.05), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_filters_per_size]), name="b")
conv = tf.nn.conv2d(h, W, strides=[1, 1, 1, 1], padding="VALID", name="conv4")
h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
h
<tf.Tensor 'relu_3:0' shape=(?, 1, 106, 256) dtype=float32>
output size in conv5 = (106 - 3)/1 + 1 = 104
# conv-5
filter_shape = [1, filter_sizes[4], num_filters_per_size, num_filters_per_size]
W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.05), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_filters_per_size]), name="b")
conv = tf.nn.conv2d(h, W, strides=[1, 1, 1, 1], padding="VALID", name="conv5")
h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
h
<tf.Tensor 'relu_4:0' shape=(?, 1, 104, 256) dtype=float32>
output size in conv6 = (104 - 3)/1 + 1 = 102; output size in pool6 = (102 - 3)/3 + 1 = 34
# conv-maxpool-6
filter_shape = [1, filter_sizes[5], num_filters_per_size, num_filters_per_size]
W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.05), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_filters_per_size]), name="b")
conv = tf.nn.conv2d(h, W, strides=[1, 1, 1, 1], padding="VALID", name="conv6")
h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
pooled = tf.nn.max_pool(
    h,
    ksize=[1, 1, 3, 1],
    strides=[1, 1, 3, 1],
    padding='VALID',
    name="pool6")
pooled
<tf.Tensor 'pool6:0' shape=(?, 1, 34, 256) dtype=float32>
num_features_total = pooled.get_shape().as_list()[2] * num_filters_per_size # (final depth=256)
h_pool_flat = tf.reshape(pooled, [-1, num_features_total])
h_pool_flat
<tf.Tensor 'Reshape:0' shape=(?, 8704) dtype=float32>
Dropout masks input values to 0 with probability 0.5.
drop1 = tf.nn.dropout(h_pool_flat, keep_prob=0.5)
drop1
<tf.Tensor 'dropout/mul:0' shape=(?, 8704) dtype=float32>
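Note that keep_prob is hard-coded to 0.5 here for illustration; in an actual training graph one would presumably feed it through the dropout_keep_prob placeholder defined earlier, so it can be set to 1.0 at evaluation time:
drop1 = tf.nn.dropout(h_pool_flat, keep_prob=dropout_keep_prob)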
Fully connected layer 1, with a 1,024-unit hidden layer.
# fc_1
W = tf.Variable(tf.truncated_normal([num_features_total, 1024], stddev=0.05), name="W")
b = tf.Variable(tf.constant(0.1, shape=[1024]), name="b")
fc_1_output = tf.nn.relu(tf.nn.xw_plus_b(drop1, W, b), name="fc-1-out")
fc_1_output
<tf.Tensor 'fc-1-out:0' shape=(?, 1024) dtype=float32>
Fully connected layer 2, with a 1,024-unit hidden layer.
drop2 = tf.nn.dropout(fc_1_output, keep_prob=0.5)
drop2
<tf.Tensor 'dropout_1/mul:0' shape=(?, 1024) dtype=float32>
# fc_2
W = tf.Variable(tf.truncated_normal([1024, 1024], stddev=0.05), name="W")
b = tf.Variable(tf.constant(0.1, shape=[1024]), name="b")
fc_2_output = tf.nn.relu(tf.nn.xw_plus_b(drop2, W, b), name="fc-2-out")
fc_2_output
<tf.Tensor 'fc-2-out:0' shape=(?, 1024) dtype=float32>
Since the output distinguishes high-rating from low-rating reviews, num_classes is set to 2.
# fc_3
W = tf.Variable(tf.truncated_normal([1024, num_classes], stddev=0.05), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
scores = tf.nn.xw_plus_b(fc_2_output, W, b, name="output")
predictions = tf.argmax(scores, 1, name="predictions")
predictions
<tf.Tensor 'predictions:0' shape=(?,) dtype=int64>
The logits, a linear combination with W and b, are passed through softmax to obtain values between 0 and 1, and the loss between target and prediction is computed. correct_predictions, which compares the true and predicted labels, contains one entry per example (batch_size).
# CalculateMean cross-entropy loss
losses = tf.nn.softmax_cross_entropy_with_logits_v2(logits=scores, labels=input_y)
losses
<tf.Tensor 'softmax_cross_entropy_with_logits/Reshape_2:0' shape=(?,) dtype=float32>
# Accuracy
correct_predictions = tf.equal(predictions, tf.argmax(input_y, 1))
print(correct_predictions)
Tensor("Equal:0", shape=(?,), dtype=bool)
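To complete the graph, the per-example losses and boolean comparisons would typically be reduced to scalars and an optimizer attached. A minimal sketch of these assumed remaining pieces (not shown in the original walkthrough):
loss = tf.reduce_mean(losses)  # mean cross-entropy over the batch
accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name="accuracy")
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)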