Taking input #1 as the reference, we score its relationship with #2 and #3 and produce output #1. We then score #2 against #1 and #3, and continue to the next input in the same way. Collecting all of these scores yields the attention map.
1. Illustrations
The illustrations are divided into the following steps:
- Prepare inputs
- Initialise weights
- Derive key, query and value
- Calculate attention scores for Input 1
- Calculate softmax
- Multiply scores with values
- Sum weighted values to get Output 1
- Repeat steps 4–7 for Input 2 & Input 3
Step 1: Prepare inputs
Input 1: [1, 0, 1, 0]
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]
The input data.
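A minimal NumPy sketch of this step (the array name `x` is my own choice, not from the original article):

```python
import numpy as np

# The three inputs, stacked into a (3, 4) matrix: one row per input.
x = np.array([
    [1, 0, 1, 0],  # Input 1
    [0, 2, 0, 2],  # Input 2
    [1, 1, 1, 1],  # Input 3
], dtype=float)
```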
Step 2: Initialise weights
Each input is multiplied by three sets of weights to produce its key, query, and value.
In a neural network these weights are usually initialised from a random distribution such as Gaussian, Xavier, or He (Kaiming) initialisation. This initialisation happens once, before training.
key weights:
[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]

query weights:
[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]

value weights:
[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]
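The walkthrough fixes these weights by hand so the arithmetic stays checkable. Continuing the sketch, with a comment noting how they would normally be initialised:

```python
# Hand-picked weights from the illustration. In a real model they would be
# drawn randomly once before training, e.g. w_key = np.random.randn(4, 3)
# for a Gaussian, or via Xavier/He initialisation.
w_key = np.array([[0, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [1, 1, 0]], dtype=float)
w_query = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 1]], dtype=float)
w_value = np.array([[0, 2, 0],
                    [0, 3, 0],
                    [1, 0, 3],
                    [1, 1, 0]], dtype=float)
```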
Step 3: Derive key, query and value
These projections are dot products (matrix multiplications). For example, the keys for the first two inputs:

Input #1: [1, 0, 1, 0]

               [0, 0, 1]
[1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
               [0, 1, 0]
               [1, 1, 0]

Input #2: [0, 2, 0, 2]

               [0, 0, 1]
[0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
               [0, 1, 0]
               [1, 1, 0]
key (all three inputs stacked):

               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]

value:

               [0, 2, 0]
[1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3]
[0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]

query:

               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]
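Continuing the sketch, each projection is a single matrix multiplication:

```python
# (3, 4) @ (4, 3) -> (3, 3): one key/query/value row per input.
keys = x @ w_key      # [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
querys = x @ w_query  # [[1, 0, 2], [2, 2, 2], [2, 1, 3]]
values = x @ w_value  # [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
```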
Step 4: Calculate attention scores for Input 1
To get the attention scores for Input 1, take the dot product of its query (red) with every key (orange):

attention scores = query $\cdot$ keys$^\top$
Attention scores for input #1:

input 1 query · key 1 = 2
input 1 query · key 2 = 4
input 1 query · key 3 = 4

            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]
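In the sketch this is one vector-matrix product:

```python
# Dot product of Input 1's query with every key: (3,) @ (3, 3) -> (3,).
attn_scores_1 = querys[0] @ keys.T  # [2., 4., 4.]
```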
Step 5: Calculate softmax
Apply softmax to the attention scores to turn them into probabilities between 0 and 1.

softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]
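A numerically stable softmax for the sketch. Note the rounding to [0.0, 0.5, 0.5] follows the article; the exact values are about [0.06, 0.47, 0.47]:

```python
def softmax(v):
    # Subtracting the max keeps the exponentials from overflowing.
    e = np.exp(v - v.max())
    return e / e.sum()

attn_1 = softmax(attn_scores_1)  # ~[0.06, 0.47, 0.47]
```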
Step 6: Multiply scores with values
The softmaxed attention scores are used as weights: multiply each value by its score to get the weighted values.
1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]
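In the sketch, broadcasting does all three multiplications in one line:

```python
# Scale each value row by its attention weight: (3, 1) * (3, 3) -> (3, 3).
weighted_values_1 = attn_1[:, None] * values
```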
Step 7: Sum weighted values to get Output 1
Sum all the weighted values (yellow) element-wise.
[0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5] <- output for input #1
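And the final reduction in the sketch:

```python
# Column-wise sum of the weighted values gives Output 1.
# ~[1.9, 6.7, 1.6] exactly; [2.0, 7.0, 1.5] with the rounded softmax above.
output_1 = weighted_values_1.sum(axis=0)
```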
Step 8: Repeat for Input 2 & Input 3
Apply the same operations to input #2 and input #3; the sketch below computes all three outputs at once.
Note: query and key must share the same dimension because they are combined by a dot product, but the value dimension (and hence the output dimension) can be chosen independently.
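Steps 4 to 7 for all three inputs collapse into two matrix multiplications; a sketch under the same assumptions as above:

```python
def row_softmax(m):
    # Softmax over each row of a matrix, stabilised against overflow.
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# (3, 3) attention map: row i holds input i's weights over all inputs.
attn = row_softmax(querys @ keys.T)
outputs = attn @ values  # row 0 ~ [1.9, 6.7, 1.6], i.e. Output 1
```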
[Source] towardsdatascience.com/illustrated-self-attention-2d627e33b20a