Convolutional Neural Networks for Sentence Classification

October 13, 2018 NLP CNN

본 논문은 텍스트 분류 성능을 향상시키기 위해 CNN(Convolutional Neural Network)를 사용한 다양한 feature들을 생성해내고, fine-tunning을 통해 개선한 내용을 담고 있다. 아래의 전체코드는 여기에 참조되어 있다.

Model

1) $n \times k$ representation of sentence

$V$(단어의 unique set의 길이)차원을 가지는 단어벡터(1-hot vector)를 word2vec모델로 $k$차원으로 표현한다.
- 아래의 코드의 input_x는 각 문장이 row, 그 문장에서 발생되는 단어들의 시퀀스가 column으로 표현된다.
- 위에서 말한 Matrix(문장 $\times$ 단어)의 값은 Vocabulary에 대응되는 단어 idx로 표현된다.
문장에서 $i$번째에 위치한 $k$-dimensional를 가지는 word vector: $x_i \in \mathbb{R}^{k}$를 정의한다.
문장마다 길이가 다르므로, 특정문장길이 $n$을 고정하여, 긴문장은 짧게, 짧은 문장은 0으로 padding을 적용해 준다.
- 아래의 그림에서 각 row의 수(9)가 $n$이 된다.
- 아래의 그림에서 각 column의 수(6)가 $k$가 된다. 따라서 각 word vector가 $k$-dimensional를 가지게 된다.
구현관점에서 생각해 보면 $V$가 vocab size가 되며 embedding size는 위에서 언급한 $k$가 된다.

본 논문에 있는 위의 그림을 코드상의 tensor를 구성해본다면,

batch size = 2
fixed_length(max words) = 8
embed_dim = 6
input_x $ = (\mathbb{x}_1,\mathbb{x}_2,…\mathbb{x}_8 ), \quad \mathbb{x}_i \in i^{th} \text{word index in the vocabulary}$

kernel_size = 2

with tf.device('/cpu:0'), tf.name_scope("embedding"):
  W = tf.Variable(
  tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
  name="W")
  embedded_chars = tf.nn.embedding_lookup(W, input_x)
  embedded_chars_expanded = tf.expand_dims(embedded_chars, -1)

위의 맨 왼쪽의 있는 그림을 나타낸 것이 embedded_chars $\rightarrow$ shape(batch_size, fixed_length, W2V_dim)이다.
```
print(embedded_chars)
<tf.Tensor 'embedding_lookup_5:0' shape=(2, 9, 6) dtype=float32>
```
tf.nn.conv2d 입력채널 1을 표현하기 위해 하나의 차원을 늘린 embedded_chars_expanded를 사용하게 된다.
```
print(embedded_chars_expanded)
<tf.Tensor 'ExpandDims_5:0' shape=(2, 9, 6, 1) dtype=float32>
```

2) Convolutional layer with mutiple filter widths

이제 이 입력 matrix를 convolution을 하게 되는데 한축(1-D)으로 filter($\mathbf{w} \in \mathbb{R}^{h \times k}$)가 sliding(sum product of $\mathbf{w}$ with bias $b$ and $\mathbf{x}$)하게 된다.
- 위 그림에서 맨 앞에 있는 filter(red box which width size =2)는 $\mathbf{w}^{2 \times k}$의 weight matrix를 가지고 $\mathbf{x}$와 sum product가 이루어 진다.
- 위 그림에서 맨 마지막에 있는 filter(yellow box which width size =3)는 $\mathbf{w}^{3 \times k}$의 weight matrix를 가지고 $\mathbf{x}$와 sum product가 이루어 진다.
- 위 그림에서는 총 4개의 filter들이 사용되었다.
convolve된 아웃풋을 비선형 변환을 위해 relu function으로 활성화 시켜주고, 그것을 feature map이라고 부른다.
하나의 filter에 대한 과정들을 수식으로 표현하면 다음과 같다.

\[c_i = f(\mathbf{w} \cdot \mathbf{x}_{i:i+h-1}+b)\] \[\mathbf{c } = [c_1, c_2, c_3,...c_{n-h+1}]\]

filter_shape = [filter_size, embedding_size, 1, num_filters]
W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
conv = tf.nn.conv2d(
    self.embedded_chars_expanded,
    W,
    strides=[1, 1, 1, 1],
    padding="VALID",
    name="conv")
# Apply nonlinearity
c = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")

print(c)

# 위 그림에는 7개로 보이지만 잘못 그린 것 같다. output_size = (9-2)/1+1=8
<tf.Tensor 'conv-maxpool-2_4/relu:0' shape=(2, 8, 1, 4) dtype=float32>

위의 filter($\mathbf{w}$) shape를 이해하자면, [width,height,channel_in, channel_out]로 생각할 수 있다. (위 그림에서는 아래로 슬라이딩하는데 tf.nn.conv2d는 왼쪽에서 오른쪽으로 가므로 이것을 고려해준 설정값을 아래와 같다.)
- filter_size(filter_height)는 h(word subsequence)를 의미(위 그림에서 첫번째 필터사이즈는 2)
- embedding_size(filter_width)는 k(w2v dim)를 의미(위 그림에서는 6으로 설정됨)
- channel_in은 이미지에서 RGB같은 3개의 depth가 존재하지만, 텍스트는 그런 개념이 없기 때문에 1로 입력채널을 설정해준다.
- channel_out(num_filters)은 각 필터 사이즈종류마다 사용된 고정된 필터들의 수이며, 이 값을 4로 정할 경우 필터의 총 개수는 각 필터 종류의 4배로 표현 된다.
- 다시 말해서 filter_size=[2,3]을 설정하면, size 2에 대한 필터가 4개, size 3에 대한 필터가 4개로 생각하면 된다.
stride는 1D-conv는 한쪽 방향으로 몇 칸씩 움직이는지에 대한 설정값이다.
- [batch=1, width=1, height=1, depth=1]의 의미는 한 문장에 대해서(batch=1), 가로방향에 대해서는 width=1 일지라도 embedding_size만큼의 filter와 input의 overlay이 되기 때문에 움직이지 않는다. 반면에 단어 sequence의 세로방향을 따라 stride를 하게되며(height=1), 한 sentence는 3차원이 아니므로 depth는 1이다(depth=1). e.g.이미지에서는 RGB $\rightarrow$ 3 depth를 가짐

padding은 두가지 방식이 존재하는데 위에선 “VALID” 방식을 사용하였다. (일반적으로는 “SAME”방식이 정보손실이 적으므로 더 적합해 보임)

“VALID” = without padding:

inputs:     1  2  3  4  5  6  7  8  9  10 11 (12 13)
            |________________|                dropped
                           |________________|

“SAME” = with zero padding:

             pad|                                      |pad
 inputs:      0 |1  2  3  4  5  6  7  8  9  10 11 12 13|0  0
             |________________|
                           |_________________|
                                             |______________|

3) Max-over-time pooling

\[\tilde{c} = max\{c\}\]

위 식과 같이 각 filter에서 생성된 feature map에 있는 element들중(아래의 코드에서 그 범위는 sequence_length - filter_size + 1)에 가장 큰값을 그 filter의 대표값 하나를 뽑는다.

아래의 코드에서 strides는 움직일 필요가 없으므로 [batch = 1, width = 1, height=1, depth = 1]로 설정하였다.

pooled = tf.nn.max_pool(
  c,
  ksize=[1, sequence_length - filter_size + 1, 1, 1],
  strides=[1, 1, 1, 1],
  padding='VALID',
  name="pool")

모든 필터마다 pooling된 하나의 scalar가 나오게 되며, 최종 아웃풋은 총 필터의 갯수만큼의 차원을 가질 것이다.
- 각 filter_size마다 num_filters만큼 가지고 있으므로, 총 필터의 갯수는 아래의 코드처럼 num_filters * len(filter_sizes)으로 표현된다.(위 그림으로는 정확하게 그리지 않아 식별하기 어려울것 같다.)
```
num_filters_total = num_filters * len(filter_sizes)
c_pool = tf.concat(pooled_outputs, 3)
c_pool_flat = tf.reshape(c_pool, [-1, num_filters_total])
```
c_pool은 Max-over-time pooling이 이루어진 것으로 위에서 3번째 그림을 의미한다.
```
print(c_pool)
<tf.Tensor 'concat_7:0' shape=(2, 1, 1, 8) dtype=float32>
```
c_pool_flat은 tensor의 shape을 2차원으로 바꿔주기 위한 reshape을 적용하였다.
```
print(c_pool_flat)
<tf.Tensor 'Reshape_1:0' shape=(2, 8) dtype=float32>
```
결과를 해석하자면, 2개의 문장들의 feature를 길이가 8인 벡터로 표현한것이다.
이 벡터는 후에 fully connected layer(dense-layer)의 입력으로 사용된다.

4) Fully connected layer with dropout and softmax output

Conv feature의 값을 입력데이터로 사용해 인공신경망을 학습시키게 되고, 과적합 방지를 위해 dropout을 적용한다.
이진분류문제는 class 2개이므로 마지막 노드는 각 class에 대한 2개의 확률을 가진다.

Convolutional Neural Networks for Sentence Classification

October 13, 2018 NLP CNN

Related Posts

July 25, 2021

Fairseq 코드리뷰 Wav2vec 2.0 (Finetune)

July 24, 2021

Fairseq 코드리뷰 Wav2vec 2.0 (Pretrain)

July 5, 2021

Docker container를 vscode로 원격제어하기