Tutorial: How to Train a RoBERTa Language Model for Spanish


SpanBERTa: How We Trained a RoBERTa Language Model for Spanish from Scratch

    

Originally published by Chris Tran, machine learning research intern at Skim AI.








Run in Google Colab

Introduction

Self-training methods with transformer models have achieved state-of-the-art performance on most NLP tasks. However, because training them is computationally expensive, most pretrained transformer models currently available are English-only. To improve performance on NLP tasks in our Spanish-language projects, my team at Skim AI decided to train a RoBERTa language model for Spanish from scratch, which we call SpanBERTa.

SpanBERTa is the same size as RoBERTa-base. Following RoBERTa's training schema, we trained it on 18 GB of OSCAR's Spanish corpus in 8 days using 4 Tesla P100 GPUs.

In this blog post, we walk through the end-to-end process of training a BERT-like language model from scratch using the transformers and tokenizers libraries from Hugging Face. There is also a Google Colab notebook for running the code in this article directly. You can modify the notebook to train a BERT-like model for another language or to fine-tune one on your own dataset.

Before moving on, I want to express a huge thank-you to the Hugging Face team for making state-of-the-art NLP models accessible to everyone.

Setup

1. Install Dependencies

In [0]:

%%capture
!pip uninstall -y tensorflow
!pip install transformers==2.8.0

2. Data

We pretrained SpanBERTa on OSCAR's Spanish corpus. The full dataset is 150 GB, of which we used an 18 GB portion for training.

For simplicity, this example uses a dataset of Spanish movie subtitles from OpenSubtitles. The dataset is 5.4 GB, and we will train on a subset of about 300 MB.

In [0]:

import os
# Download and unzip the movie subtitle dataset
if not os.path.exists('data/dataset.txt'):
  !wget "https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/es.txt.gz" -O dataset.txt.gz
  !gzip -d dataset.txt.gz
  !mkdir data
  !mv dataset.txt data
--2020-04-06 15:53:04--  https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/es.txt.gz
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1859673728 (1.7G) [application/gzip]
Saving to: 'dataset.txt.gz'

dataset.txt.gz 100%[===================>] 1.73G 17.0MB/s in 1m 46s

2020-04-06 15:54:51 (16.8 MB/s) - 'dataset.txt.gz' saved [1859673728/1859673728]

In [0]:

# Total number of lines and a few random lines
!wc -l data/dataset.txt
!shuf -n 5 data/dataset.txt
179287150 data/dataset.txt
Sabes, pensé que tenías más pelotas que para enfrentarme a través de mi hermano.
Supe todos los encantamientos en todas las lenguas de los Elfos hombres y Orcos.
Anteriormente en Blue Bloods:
Y quiero que prometas que no habrá ningún trato con Daniel Stafford.
Fue comiquísimo.

In [0]:

# Get a subset of the first 1,000,000 lines for training
TRAIN_SIZE = 1000000 #@param {type:"integer"}
!(head -n $TRAIN_SIZE data/dataset.txt) > data/train.txt

In [0]:

# Get a subset of the next 10,000 lines for validation
VAL_SIZE = 10000 #@param {type:"integer"}
!(sed -n {TRAIN_SIZE + 1},{TRAIN_SIZE + VAL_SIZE}p data/dataset.txt) > data/dev.txt
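
The same split can be done in pure Python, which may be useful outside Colab where `head` and `sed` are not available. The following is a minimal sketch; the function name `split_corpus` and its parameters are illustrative and not part of the original notebook.

```python
from itertools import islice


def split_corpus(src_path, train_path, dev_path, train_size, val_size):
    """Write the first `train_size` lines to train_path and the next
    `val_size` lines to dev_path, streaming so the corpus never sits in memory."""
    with open(src_path, encoding="utf-8") as src:
        with open(train_path, "w", encoding="utf-8") as train:
            train.writelines(islice(src, train_size))
        # The file iterator resumes where the previous islice stopped.
        with open(dev_path, "w", encoding="utf-8") as dev:
            dev.writelines(islice(src, val_size))
```

A call such as `split_corpus("data/dataset.txt", "data/train.txt", "data/dev.txt", TRAIN_SIZE, VAL_SIZE)` should produce the same two files as the shell commands.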

3. Train a Tokenizer

The original BERT implementation uses a WordPiece tokenizer with a vocabulary of 32K subword units. This method, however, can produce "unknown" tokens when it encounters rare words.

In this implementation, we use a byte-level BPE tokenizer with a vocabulary of 50,265 subword units (the same as RoBERTa-base). Byte-level BPE makes it possible to learn a subword vocabulary of modest size that can encode any input without "unknown" tokens.

Because ByteLevelBPETokenizer produces two files (vocab.json and merges.txt) while BertWordPieceTokenizer produces only one (vocab.txt), using BertWordPieceTokenizer to load the output of a BPE tokenizer will raise an error.
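
To see why a byte-level vocabulary never needs an "unknown" token, it helps to look at the base alphabet: every string is a sequence of UTF-8 bytes, and there are only 256 possible byte values. The sketch below illustrates just that property; it is not the tokenizers library's implementation, and the helper name `to_byte_symbols` is made up for this example.

```python
def to_byte_symbols(text):
    """Map text to its UTF-8 byte values, the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))


# A 256-entry base vocabulary covers every possible byte, so any input,
# including accented Spanish characters, can be encoded before merges are applied.
BASE_VOCAB = set(range(256))

for sample in ["jabón", "¿Qué tal?", "niño"]:
    symbols = to_byte_symbols(sample)
    assert set(symbols) <= BASE_VOCAB
    print(symbols)
```

BPE training then merges frequent byte pairs into larger units until the vocabulary reaches the requested size.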

In [0]:

%%time
from tokenizers import ByteLevelBPETokenizer

path = "data/train.txt"

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=path,
                vocab_size=50265,
                min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# Save the vocabulary files to disk
!mkdir -p "models/roberta"
tokenizer.save("models/roberta")
CPU times: user 1min 37s, sys: 1.02 s, total: 1min 38s
Wall time: 1min 38s

Super fast! It takes only about 2 minutes to train on 10 million lines.

Training a Language Model from Scratch

1. Model Architecture

RoBERTa has exactly the same architecture as BERT. The only differences are:

  • RoBERTa uses a byte-level BPE tokenizer with a larger subword vocabulary (50K vs. 32K).
  • RoBERTa implements dynamic word masking and drops the next-sentence-prediction task.
  • RoBERTa uses different training hyperparameters.

Other architecture configurations can be found in the documentation (RoBERTa, BERT).
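
Dynamic masking means the mask pattern is resampled every time a sequence is fed to the model, instead of being fixed once during preprocessing. The sketch below assumes the standard BERT masking recipe (15% of tokens selected; of those, 80% replaced by the mask token, 10% by a random token, 10% left unchanged). The function name, the token ids, and the mask id are hypothetical, and the -100 label follows the PyTorch convention for positions the loss ignores.

```python
import random


def dynamic_mask(token_ids, mask_id, vocab_size, rng, mlm_probability=0.15):
    """Return (inputs, labels) with a freshly sampled mask pattern."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions excluded from the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


rng = random.Random(42)
sentence = [11, 22, 33, 44, 55, 66, 77, 88]  # hypothetical token ids
# Re-masking on every pass means each epoch generally sees a different pattern.
for epoch in range(3):
    print(epoch, dynamic_mask(sentence, mask_id=4, vocab_size=50265, rng=rng))
```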

In [0]:

import json

# RoBERTa-base architecture with our 50,265-token vocabulary
config = {
    "architectures": [
        "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-05,
    "max_position_embeddings": 514,
    "model_type": "roberta",
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 1,
    "vocab_size": 50265
}

with open("models/roberta/config.json", 'w') as fp:
    json.dump(config, fp)

tokenizer_config = {"max_len": 512}

with open("models/roberta/tokenizer_config.json", 'w') as fp:
    json.dump(tokenizer_config, fp)
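
As a sanity check on the configuration, the number of weights it implies can be estimated by hand. The sketch below counts only the embeddings and the transformer layers and assumes the standard BERT/RoBERTa layer layout (four attention projections, a two-layer feed-forward block, two layer norms per layer). The function name is illustrative, and the pooler and masked-LM head are deliberately left out.

```python
def count_encoder_parameters(cfg):
    """Count weights and biases in the embeddings and transformer layers."""
    h = cfg["hidden_size"]
    ffn = cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and bias

    embeddings = (
        cfg["vocab_size"] * h                 # token embeddings
        + cfg["max_position_embeddings"] * h  # position embeddings
        + cfg["type_vocab_size"] * h          # segment embeddings
        + layer_norm
    )
    attention = 4 * (h * h + h)               # query, key, value, output projections
    feed_forward = (h * ffn + ffn) + (ffn * h + h)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + cfg["num_hidden_layers"] * per_layer


# Values taken from the configuration in this section.
roberta_base_like = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 514,
    "num_hidden_layers": 12,
    "type_vocab_size": 1,
    "vocab_size": 50265,
}
print(f"{count_encoder_parameters(roberta_base_like):,}")  # roughly 124 million
```

Nearly a third of these weights sit in the token embedding matrix, which is why the vocabulary size matters for model size.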

2. Training Hyperparameters

Hyperparameter        BERT-base    RoBERTa-base
Sequence Length       128, 512     512
Batch Size            256          8K
Peak Learning Rate    1e-4         6e-4
Max Steps             1M           500K
Warmup Steps          10K          24K
Weight Decay          0.01         0.01
Adam $\epsilon$       1e-6         1e-6
Adam $\beta_1$        0.9          0.9
Adam $\beta_2$        0.999        0.98
Gradient Clipping     0.0          0.0

Note that RoBERTa is trained with a batch size of 8,000 sequences. Therefore, although RoBERTa-base was trained for only 500K steps, its training compute cost is about 16 times that of BERT-base. The RoBERTa paper shows that training with large batches improves perplexity on the masked language modeling objective as well as end-task accuracy. A larger effective batch size can be obtained by increasing gradient_accumulation_steps.

Due to computational constraints, we followed BERT-base's training schema and trained our SpanBERTa model on 4 Tesla P100 GPUs for 200K steps over 8 days.
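
The factor of 16 follows from simple arithmetic on the table: compare the number of sequences each model processes in total. The sketch below assumes "8K" means 8,192 and ignores sequence length (BERT trains partly at length 128), so it is a rough estimate only. The helper names are illustrative.

```python
def sequences_processed(batch_size, steps):
    """Total number of training sequences seen over a full run."""
    return batch_size * steps


def effective_batch_size(per_gpu_batch_size, n_gpu, gradient_accumulation_steps):
    """Sequences contributing to one optimizer update."""
    return per_gpu_batch_size * n_gpu * gradient_accumulation_steps


bert = sequences_processed(256, 1_000_000)
roberta = sequences_processed(8192, 500_000)  # assuming "8K" = 8,192
print(roberta / bert)  # 16.0

# Gradient accumulation is how a small GPU can emulate a larger batch.
print(effective_batch_size(16, 1, 4))  # 64
print(effective_batch_size(4, 1, 4))   # 16
```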

3. Start Training

We will train our model from scratch using run_language_modeling.py, a script provided by Hugging Face that preprocesses and tokenizes the corpus and trains the model on the masked language modeling task. The script is optimized for training on a single large corpus. If your dataset is large and you want to split it and train on the pieces sequentially, you will need to modify the script, or be ready to get a monster machine with lots of memory.

In [0]:

# Update, April 22, 2020: Hugging Face updated the run_language_modeling.py script.
# Please use this version from before the update.
!wget -c https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/run_language_modeling.py
--2020-04-24 02:28:21--  https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/run_language_modeling.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34328 (34K) [text/plain]
Saving to: 'run_language_modeling.py'

run_language_modeli 100%[===================>] 33.52K --.-KB/s in 0.003s

2020-04-24 02:28:21 (10.1 MB/s) - 'run_language_modeling.py' saved [34328/34328]

Important Arguments

  • --line_by_line Whether distinct lines of text in the dataset are to be handled as distinct sequences. If each line in your dataset is long, at around 512 tokens or more, you should use this setting. If each line is short, the default text preprocessing concatenates all lines, tokenizes them, and splits the tokenized output into blocks of 512 tokens. You can also split your dataset into small chunks and preprocess them separately. Processing 3 GB of text takes about 50 minutes with the default TextDataset class.
  • --should_continue Whether to continue from the latest checkpoint in output_dir.
  • --model_name_or_path The model checkpoint for weight initialization. Leave it as None to train a model from scratch.
  • --mlm Train with masked language modeling loss instead of standard language modeling loss.
  • --config_name, --tokenizer_name Optional pretrained config and tokenizer name or path, if not the same as model_name_or_path. If both are None, a new config is initialized.
  • --per_gpu_train_batch_size Batch size per GPU/CPU for training. Choose the largest number that fits on your GPUs. You will see an out-of-memory error if the batch size is too large.
  • --gradient_accumulation_steps Number of update steps to accumulate before performing a backward/update pass. You can use this trick to increase the effective batch size. For example, with per_gpu_train_batch_size = 16 and gradient_accumulation_steps = 4, the total train batch size is 64.
  • --overwrite_output_dir Overwrite the content of the output directory.
  • --no_cuda, --fp16, --fp16_opt_level Arguments for training on GPU/CPU.
  • The other arguments are model paths and training hyperparameters.
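
The difference between the default preprocessing and --line_by_line can be sketched with plain lists of token ids. This is an illustration under simplifying assumptions, not the script's code: the token ids are made up, the block size is tiny, and the trailing partial block is simply dropped so that no padding is needed.

```python
def chunk_into_blocks(token_ids, block_size):
    """Default-style preprocessing: cut one long token stream into full blocks."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


def line_by_line(lines_of_ids, block_size):
    """Line-by-line preprocessing: each line is its own sequence, truncated to fit."""
    return [ids[:block_size] for ids in lines_of_ids if ids]


lines = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]   # hypothetical token ids per line
stream = [tok for line in lines for tok in line]  # concatenate all lines

print(chunk_into_blocks(stream, block_size=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(line_by_line(lines, block_size=4))        # [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
```

With short subtitle lines, concatenation packs many lines into each 512-token block, which wastes far less computation on padding than treating every line as a separate sequence.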

It is highly recommended to include the model type (e.g., "roberta", "bert", "gpt2") in the model path, because the script uses the AutoModels class to guess the model's configuration by pattern matching on the provided path.
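
To make the idea of pattern matching on a path concrete, here is a hypothetical helper that guesses a model type from a directory name. It is not the library's code; the list of known types is an illustrative subset. It does show why ordering matters: "roberta" contains "bert", so the more specific name has to be tried first.

```python
# More specific names come first because "roberta" and "distilbert" contain "bert".
KNOWN_TYPES = ["roberta", "distilbert", "bert", "gpt2"]  # illustrative subset


def guess_model_type(path):
    """Return the first known model type found in the path, or None."""
    lowered = path.lower()
    for model_type in KNOWN_TYPES:
        if model_type in lowered:
            return model_type
    return None


print(guess_model_type("models/roberta"))    # roberta
print(guess_model_type("models/bert-base"))  # bert
print(guess_model_type("models/my-model"))   # None
```

A path such as "models/my-model" gives the matcher nothing to work with, which is the situation the recommendation above avoids.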

In [0]:

# Model paths
MODEL_TYPE = "roberta" #@param ["roberta", "bert"]
MODEL_DIR = "models/roberta" #@param {type: "string"}
OUTPUT_DIR = "models/roberta/output" #@param {type: "string"}
TRAIN_PATH = "data/train.txt" #@param {type: "string"}
EVAL_PATH = "data/dev.txt" #@param {type: "string"}

For this example, we will train for only 25 steps on a Tesla P4 GPU provided by Colab.

In [0]:

!nvidia-smi
Mon Apr  6 15:59:35 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

In [0]:

# Command line (the trailing backslashes join everything into a single shell command)
cmd = """python run_language_modeling.py \
    --output_dir {output_dir} \
    --model_type {model_type} \
    --mlm \
    --config_name {config_name} \
    --tokenizer_name {tokenizer_name} \
    {line_by_line} \
    {should_continue} \
    {model_name_or_path} \
    --train_data_file {train_path} \
    --eval_data_file {eval_path} \
    --do_train \
    {do_eval} \
    {evaluate_during_training} \
    --overwrite_output_dir \
    --block_size 512 \
    --max_steps 25 \
    --warmup_steps 10 \
    --learning_rate 5e-5 \
    --per_gpu_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 100.0 \
    --save_total_limit 10 \
    --save_steps 10 \
    --logging_steps 2 \
    --seed 42
"""

In [0]:

# Arguments for training from scratch. evaluate_during_training,
# line_by_line, should_continue, and model_name_or_path are turned off.
train_params = {
    "output_dir": OUTPUT_DIR,
    "model_type": MODEL_TYPE,
    "config_name": MODEL_DIR,
    "tokenizer_name": MODEL_DIR,
    "train_path": TRAIN_PATH,
    "eval_path": EVAL_PATH,
    "do_eval": "--do_eval",
    "evaluate_during_training": "",
    "line_by_line": "",
    "should_continue": "",
    "model_name_or_path": "",
}
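
Before launching a long run, it can help to preview the argument list the template expands to. The sketch below uses a shortened, illustrative template rather than the full command, and shows that placeholders set to an empty string simply disappear once the string is split the way a shell would split it.

```python
import shlex

# Shortened template for illustration; backslash-newline pairs inside a
# regular triple-quoted string are removed, leaving a single line.
template = """python run_language_modeling.py \
    --output_dir {output_dir} \
    --mlm \
    {line_by_line} \
    --train_data_file {train_path}
"""

params = {
    "output_dir": "models/roberta/output",
    "line_by_line": "",  # switched off: contributes no argument
    "train_path": "data/train.txt",
}

argv = shlex.split(template.format(**params))
print(argv)
```

The resulting list could also be passed to `subprocess.run(argv)` when running outside a notebook.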

If you are training on a virtual machine, you can install TensorBoard to monitor the training process. We did this for our SpanBERTa training run.

pip install tensorboard==2.1.0
tensorboard dev upload --logdir runs

After 200K steps, the loss reached 1.8 and the perplexity reached 5.2.

Now let's start training!

In [ ]:

!{cmd.format(**train_params)}
    04/06/2020 15:59:55 - INFO - __main__ - Creating features from dataset file at data
    04/06/2020 16:04:43 - INFO - __main__ - Saving features into cached file data/roberta_cached_lm_510_train.txt
    04/06/2020 16:04:46 - INFO - __main__ - ***** Running training *****
    04/06/2020 16:04:46 - INFO - __main__ - Num examples = 165994
    04/06/2020 16:04:46 - INFO - __main__ - Num Epochs = 1
    04/06/2020 16:04:46 - INFO - __main__ - Instantaneous batch size per GPU = 4
    04/06/2020 16:04:46 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
    04/06/2020 16:04:46 - INFO - __main__ - Gradient Accumulation steps = 4
    04/06/2020 16:04:46 - INFO - __main__ - Total optimization steps = 25
    Epoch:   0% 0/1 [00:00<?, ?it/s]
    Iteration:   0% 0/41499 [00:00<?, ?it/s]
    Iteration:   0% 1/41499 [00:01<13:18:02, 1.15s/it]
    Iteration:   0% 2/41499 [00:01<11:26:47, 1.01it/s]
    Iteration:   0% 3/41499 [00:02<10:10:30, 1.13it/s]
    Iteration:   0% 4/41499 [00:03<9:38:10, 1.20it/s]
    Iteration:   0% 5/41499 [00:03<8:52:44, 1.30it/s]
    Iteration:   0% 6/41499 [00:04<8:22:47, 1.38it/s]
    Iteration:   0% 7/41499 [00:04<8:00:55, 1.44it/s]
    Iteration:   0% 8/41499 [00:05<8:03:40, 1.43it/s]
    Iteration:   0% 9/41499 [00:06<7:46:57, 1.48it/s]
    Iteration:   0% 10/41499 [00:06<7:35:35, 1.52it/s]
    Iteration:   0% 11/41499 [00:07<7:28:29, 1.54it/s]
    Iteration:   0% 12/41499 [00:08<7:41:41, 1.50it/s]
    Iteration:   0% 13/41499 [00:08<7:34:28, 1.52it/s]
    Iteration:   0% 14/41499 [00:09<7:28:46, 1.54it/s]
    Iteration:   0% 15/41499 [00:10<7:23:29, 1.56it/s]
    Iteration:   0% 16/41499 [00:10<7:38:06, 1.51it/s]
    Iteration:   0% 17/41499 [00:11<7:29:13, 1.54it/s]
    Iteration:   0% 18/41499 [00:12<7:24:04, 1.56it/s]
    Iteration:   0% 19/41499 [00:12<7:21:59, 1.56it/s]
    Iteration:   0% 20/41499 [00:13<7:38:06, 1.51it/s]
    04/06/2020 16:06:23 - INFO - __main__ - ***** Running evaluation *****
    04/06/2020 16:06:23 - INFO - __main__ - Num examples = 156
    04/06/2020 16:06:23 - INFO - __main__ - Batch size = 4
    Evaluating: 100% 39/39 [00:08<00:00, 4.41it/s]
    04/06/2020 16:06:32 - INFO - __main__ - ***** Eval results *****
    04/06/2020 16:06:32 - INFO - __main__ - perplexity = tensor(6077.6812)
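
The evaluation log reports perplexity rather than loss. The two are related by an exponential: perplexity is the exponential of the mean per-token cross-entropy, measured in nats. The sketch below assumes that relationship to convert between the figures quoted in this post; the helper names are illustrative.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is exp(mean per-token cross-entropy in nats)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(ppl):
    """Inverse of perplexity(): recover the mean cross-entropy."""
    return math.log(ppl)


print(round(loss_from_perplexity(6077.6812), 2))  # 8.71, after only 25 steps
print(round(loss_from_perplexity(5.2), 2))        # 1.65, after 200K steps
print(round(perplexity(1.8), 2))                  # 6.05
```

A perplexity around 6,000 after 25 steps means the model is still close to guessing. The 200K-step figures of 1.8 (loss) and 5.2 (perplexity) do not map onto each other exactly, which would be expected if the loss was measured on training batches and the perplexity on the evaluation set.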

4. Predict Masked Words

After training your language model, you can upload it and share it with the community. We have uploaded our SpanBERTa model to Hugging Face's server. Before evaluating the model on downstream tasks, let's see how well it has learned to fill in masked words given a context.

In [0]:

%%capture
%%time
from transformers import pipeline

# Load the uploaded SpanBERTa model and tokenizer into a fill-mask pipeline
fill_mask = pipeline(
    "fill-mask",
    model="chriskhanhtran/spanberta",
    tokenizer="chriskhanhtran/spanberta"
)

I picked a sentence from Wikipedia's article about COVID-19.

The original sentence is "Lavarse frecuentemente las manos con agua y jabón," meaning "Frequently wash your hands with water and soap."

The masked word is "jabón" (soap), and the top 5 predictions are soap, salt, steam, lemon, and vinegar. It is interesting that the model has somehow learned that we should wash our hands with things that can kill bacteria or contain acid.

In [0]:

fill_mask("Lavarse frecuentemente las manos con agua y <mask>.")

Out[0]:

[{'score': 0.6469631195068359,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y jabón.</s>',
  'token': 18493},
 {'score': 0.06074320897459984,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y sal.</s>',
  'token': 619},
 {'score': 0.029787985607981682,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y vapor.</s>',
  'token': 11079},
 {'score': 0.026410052552819252,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y limón.</s>',
  'token': 12788},
 {'score': 0.017029203474521637,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y vinagre.</s>',
  'token': 18424}]
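
The scores in this output are probabilities: the model produces one logit per vocabulary entry at the masked position, a softmax turns the logits into a distribution, and the highest-probability entries are returned. The sketch below shows that step in plain Python on made-up logits over a five-token vocabulary; it illustrates the idea and is not the pipeline's implementation.

```python
import math


def top_k_softmax(logits, k):
    """Return the k most probable (index, probability) pairs from raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


toy_logits = [2.0, 0.5, -1.0, 3.5, 0.0]  # hypothetical scores for 5 tokens
for index, prob in top_k_softmax(toy_logits, k=3):
    print(index, round(prob, 3))
```

Because only the top entries are shown, the listed scores do not have to sum to 1. In the output above the five scores add up to about 0.78, with the remaining probability spread over the rest of the 50,265-token vocabulary.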

Conclusion

We have walked through how to train a BERT language model for Spanish from scratch and seen that the model learns properties of the language by predicting masked words from their context. You can also follow this article to fine-tune a pretrained BERT-like model on your own dataset.

Next, we will apply the pretrained model to downstream tasks, including sequence classification, NER, POS tagging, and NLI, and compare its performance with some non-BERT models.

Stay tuned for our next posts!

