SentencePiece Tokenizer

September 1, 2019 Text Mining

Google: SentencePiece Tutorial

$n$-gram 기반으로 token들의 빈도가 가장 빈번하게하는 단위로 tokenize
- vocab size 큰 경우, 단어단위로 tokenize되는 경향
- vocab size 작은 경우, 음절단위로 tokenize되는 경향

패키지 설치를 아래와 같이 합니다.

pip install sentencepiece

Requirement already satisfied: sentencepiece in /home/donghwa/anaconda3/envs/lab/lib/python3.5/site-packages (0.1.82)

그리고 tokenizer를 학습시킬 데이터가 아래와 같이 있습니다.

head -n 10  "./botchan.txt"

Project Gutenberg's Botchan (Master Darling), by Kin-nosuke Natsume
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: Botchan (Master Darling)
Author: Kin-nosuke Natsume
Translator: Yasotaro Morri
Posting Date: October 14, 2012 [EBook #8868]
Release Date: September, 2005

1. 모델 hyperparameter 설정

모델의 hyperparameter를 string으로 정의해 줍니다.
- character_coverage: 얼마나 자소단위 셋을 줄여 단어단위 셋으로 coverage 시킬 것인지에 대한 모델 하이퍼파라미터
- 실험적으로
  - 중국어, 일본어 같이 자소단위로 이루어진 언어(rich character set)에서는 0.9995
  - 다른언어(small character set)에 대해서 1로 설정

templates= '--input={} \
--pad_id={} \
--bos_id={} \
--eos_id={} \
--unk_id={} \
--model_prefix={} \
--vocab_size={} \
--character_coverage={} \
--model_type={}'


train_input_file = "./botchan.txt"
pad_id=0  #<pad> token을 0으로 설정
vocab_size = 2000 # vocab 사이즈
prefix = 'botchan_spm' # 저장될 tokenizer 모델에 붙는 이름
bos_id=1 #<start> token을 1으로 설정
eos_id=2 #<end> token을 2으로 설정
unk_id=3 #<unknown> token을 3으로 설정
character_coverage = 1.0 # to reduce character set 
model_type ='unigram' # Choose from unigram (default), bpe, char, or word


cmd = templates.format(train_input_file,
                pad_id,
                bos_id,
                eos_id,
                unk_id,
                prefix,
                vocab_size,
                character_coverage,
                model_type)

cmd                

'--input=./botchan.txt --pad_id=0 --bos_id=2000 --eos_id=botchan_spm --unk_id=1 --model_prefix=2 --vocab_size=3 --character_coverage=1.0 --model_type=unigram'

2. Tokenizer 학습

학습이 잘 완료되면 True라는 인자를 반환합니다.
이부분이 완료되면, prefix를 가진 아래의 두개의 파일이 생성됩니다.
- model:실제로 사용되는 tokenizer 모델
- vocab: 참조하는 단어집합
  
  ├── botchan_spm.model
  ├── botchan_spm.vocab

import sentencepiece as spm

spm.SentencePieceTrainer.Train(cmd)

True

3. 문장에 Tokenizer 적용

이제 실제 문장에 적용하기 위해서 sp모듈 새로 정의해, 이전에 학습한 botchan_spm.model을 불러옵니다.

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load('botchan_spm.model') # prefix이름으로 저장된 모델

True

botchan_spm.vocab에 대한 정보가 필요하다면 아래와 같이 불러올수 있습니다.

with open('./botchan_spm.vocab', encoding='utf-8') as f:
    Vo = [doc.strip().split("\t") for doc in f]

# w[0]: token name    
# w[1]: token score
word2idx = {w[0]: i for i, w in enumerate(Vo)}

{'▁idea': 958,
'▁in': 15,
'▁sport': 1617,
'▁child': 1975,
'▁country': 425,
'▁......': 601, ...

Tokenize하기 전에 데이터 자체에 <s>, </s>를 넣어줘도 되지만, 패키지 자체에서 알아서 해주는 옵션(sp.SetEncodeExtraOptions)이 있습니다.

# 문장 양 끝에 <s> , </s> 추가
sp.SetEncodeExtraOptions('bos:eos')

그 다음에 tokenize를 수행하게 됩니다..
- EncodeAsPieces: string으로 tokenize
- EncodeAsIds: ids으로 tokenize

tokens = sp.EncodeAsPieces('This eBook is for the use of anyone anywhere at no cost')
tokens

['<s>',
'▁This',
'▁eBook',
'▁is',
'▁for',
'▁the',
'▁use',
'▁of',
'▁anyone',
'▁any',
'where',
'▁at',
'▁no',
'▁cost',
'</s>']

tokensIDs = sp.EncodeAsIds('This eBook is for the use of anyone anywhere at no cost')
tokensIDs

[1, 209, 810, 31, 33, 5, 520, 11, 1458, 117, 1505, 40, 74, 981, 2]

4. Token들을 문자열로 원복

위에서 쪼개진 token들을 그냥 공백기준으로 붙이면 _와 띄어쓰기 모호함에 따라 어색함이 생길 수 있습니다.
sp내의 Decode함수를 사용하면 쉽게 원복할 수 있습니다.

# string token들을 원복
sp.DecodePieces(tokens)

'This eBook is for the use of anyone anywhere at no cost'

# idx token들을 원복
sp.DecodeIds(tokensIDs)

'This eBook is for the use of anyone anywhere at no cost'

(추가) 샘플링 기능

SentencePiece Tokenizer는 n-gram의 가장 빈도가 높은 Top 1의 기준만 사용하게 됩니다.
그래서 이 패키지 안에는 확률적으로 다양한 tokenized된 example들을 샘플링 할 수 있는데 사용방법은 아래와 같습니다.

N best tokens

for i in sp.NBestEncodeAsPieces('This eBook is for the use of anyone anywhere at no cost', 5):
    print(i)

['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁a', 't', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁us', 'e', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁co', 'st', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁', 'at', '▁no', '▁cost', '</s>']

Pieces sampling
- deterministic-based: 1
- probability-based: -1
  - alpha =1.0: more likely sampling
  - alpha =0.0: uniformly sampling

for x in range(5):
    print(sp.SampleEncodeAsPieces('This eBook is for the use of anyone anywhere at no cost', 
    1,
    0.0)
    )

['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']

for x in range(5):
    print(sp.SampleEncodeAsPieces('This eBook is for the use of anyone anywhere at no cost', 
    -1, 
    1.0)
    )

['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁a', 't', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁a', 't', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']

for x in range(5):
    print(sp.SampleEncodeAsPieces('This eBook is for the use of anyone anywhere at no cost',
     -1,
      0.5)
      )

['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁any', 'on', 'e', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁a', 'n', 'y', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁', 'T', 'h', 'i', 's', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁an', 'y', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁', 'This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '</s>']
['<s>', '▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁', 'at', '▁no', '▁cost', '</s>']

for x in range(5):
    print(sp.SampleEncodeAsPieces('This eBook is for the use of anyone anywhere at no cost',
     -1,
     0.0)
     )

['<s>', '▁T', 'h', 'i', 's', '▁e', 'B', 'o', 'o', 'k', '▁', 'is', '▁fo', 'r', '▁', 't', 'h', 'e', '▁', 'us', 'e', '▁', 'o', 'f', '▁', 'an', 'y', 'on', 'e', '▁any', 'w', 'h', 'e', 'r', 'e', '▁a', 't', '▁', 'n', 'o', '▁c', 'o', 's', 't', '</s>']
['<s>', '▁', 'This', '▁e', 'B', 'o', 'o', 'k', '▁is', '▁fo', 'r', '▁t', 'h', 'e', '▁', 'u', 'se', '▁of', '▁any', 'o', 'ne', '▁an', 'y', 'where', '▁', 'at', '▁no', '▁co', 's', 't', '</s>']
['<s>', '▁T', 'h', 'is', '▁', 'e', 'B', 'ook', '▁', 'is', '▁f', 'or', '▁', 'the', '▁us', 'e', '▁of', '▁', 'an', 'y', 'on', 'e', '▁a', 'n', 'y', 'w', 'he', 're', '▁', 'at', '▁', 'n', 'o', '▁co', 's', 't', '</s>']
['<s>', '▁This', '▁', 'e', 'B', 'o', 'o', 'k', '▁', 'is', '▁', 'f', 'or', '▁t', 'he', '▁', 'us', 'e', '▁', 'o', 'f', '▁any', 'on', 'e', '▁', 'a', 'ny', 'w', 'h', 'e', 'r', 'e', '▁a', 't', '▁', 'n', 'o', '▁co', 's', 't', '</s>']
['<s>', '▁', 'This', '▁e', 'B', 'ook', '▁', 'i', 's', '▁', 'f', 'o', 'r', '▁', 't', 'he', '▁use', '▁', 'o', 'f', '▁', 'a', 'n', 'y', 'o', 'ne', '▁a', 'n', 'y', 'w', 'h', 'er', 'e', '▁', 'a', 't', '▁', 'no', '▁', 'c', 'o', 'st', '</s>']

SentencePiece Tokenizer

September 1, 2019 Text Mining

Related Posts

July 25, 2021

Fairseq 코드리뷰 Wav2vec 2.0 (Finetune)

July 24, 2021

Fairseq 코드리뷰 Wav2vec 2.0 (Pretrain)

July 5, 2021

Docker container를 vscode로 원격제어하기