动手学深度学习NLP

https://zh-v2.d2l.ai/chapter_convolutional-modern/googlenet.html

课程安排 - 动手学深度学习课程 (d2l.ai)

Base

时序模型

当前数据与之前数据相关

音乐、语言、文本

与前面所有数据有关:

image-20230515113440168

对过去的数据建模,然后预测自己:自回归模型

A:马尔可夫模型:当前数据只与最近的若干数据相关;例如用序列前 4 个值作为特征,预测下一个值,用 2 层 MLP

nn.Sequential(nn.Linear(4, 10),nn.ReLU(),nn.Linear(10, 1))

紫线为单步预测,绿线为多步预测

image-20230515115303920
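下面是按这个思路写的一个最小示例(假设数据是一条带噪声的正弦序列,τ=4,变量名为本笔记自拟,仅作示意):

```python
import torch
from torch import nn

T, tau = 1000, 4
time = torch.arange(1, T + 1, dtype=torch.float32)
x = torch.sin(0.01 * time) + torch.normal(0, 0.2, (T,))  # 带噪声的正弦序列

# 构造特征:features[i] = (x[i], ..., x[i+3]),标签为 x[i+4]
features = torch.zeros((T - tau, tau))
for i in range(tau):
    features[:, i] = x[i: T - tau + i]
labels = x[tau:].reshape((-1, 1))

net = nn.Sequential(nn.Linear(4, 10), nn.ReLU(), nn.Linear(10, 1))
loss = nn.MSELoss()
trainer = torch.optim.Adam(net.parameters(), lr=0.01)
for epoch in range(5):
    trainer.zero_grad()
    l = loss(net(features), labels)
    l.backward()
    trainer.step()
```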

B:潜变量模型:引入潜变量 h 来概括历史信息,RNN 即属于这一类,内部是两个模型(h 的更新与输出)。在实际训练中仍然切成一段段 step,可以理解为暗含了马尔可夫式的截断

o_t 由 h_t 输出(h_t 由 x_{t-1} 和 h_{t-1} 求出,保存历史信息),用来推测 x_t

image-20210818124506982 image-20210818124810932

QA:

  1. RNN甚至可以用来排序?因为它可以记住历史信息
  2. 数据到底和多长的前面的数据相关呢?Transformer 可以自动探索需要关注多少个
  3. 传感器、电池故障预测。单步多步不是重点,关键在于负样本数量
  4. 序列也是一维数据,可以用CNN做分类吗?可以用1维卷积,效果不错

Vocab

tokenize:将文章切分为 token,可以按字母(字符)也可以按词划分;中文需要先分词

Vocab:文本词汇表,可以按单词分也可以按字母分,将 token 映射为 index。按频率排序,方便观察,也让常用 token 在内存中相邻
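一个极简的词表草图(非 d2l 原实现,类名 SimpleVocab 为示意,只演示按频率排序和 token→idx 映射):

```python
import collections

class SimpleVocab:
    """极简词表:按频率排序,把 token 映射为 index(示意用)"""
    def __init__(self, tokens, min_freq=0, reserved_tokens=None):
        counter = collections.Counter(tokens)
        # 按频率从高到低排序,常用 token 的 idx 更小
        self._token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
        self.idx_to_token = ['<unk>'] + (reserved_tokens or [])
        self.idx_to_token += [t for t, f in self._token_freqs
                              if f >= min_freq and t not in self.idx_to_token]
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}

    def __getitem__(self, tokens):
        if isinstance(tokens, (list, tuple)):
            return [self.__getitem__(t) for t in tokens]
        return self.token_to_idx.get(tokens, 0)  # 未知词 -> <unk> 对应的 0

vocab = SimpleVocab(list('the time machine'))
print(vocab[['t', 'h', 'e']])
```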

语言模型

估计联合概率 p(x_1, x_2, …, x_T),即整个序列出现的概率

  • 做预训练模型(BERT、GPT-3)
  • 文本生成
  • 判断哪个序列更常见(语音识别、输入法候选哪个更自然)

使用计数建模:判断文本出现的概率

image-20210817212714212

n元语法:一个单词出现的概率只与它前面的 n-1 个单词有关,即 n-1 阶马尔可夫模型

二元语法(bigram):相邻两个词合起来算一个 token

```python
# 扫一遍,长度基本不变,但重复的 token 会减少,种类会增加
bigram_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])]
trigram_tokens = [triple for triple in zip(
    corpus[:-2], corpus[1:-1], corpus[2:])]
```

最常见的三元组示例:

```
[(('the', 'time', 'traveller'), 59),
 (('the', 'time', 'machine'), 30),]
```
image-20210817221810317

数据加载

将 corpus 切成小批量,批量大小为 batch_size,每个样本长度为 num_steps

1. batch 间随机:随机选起始偏移,每个 token 每个 epoch 只用一次

image-20230515160816014

```
b = 2, step = 5
X: tensor([[11, 12, 13, 14, 15],
           [ 6,  7,  8,  9, 10]])
Y: tensor([[12, 13, 14, 15, 16],
           [ 7,  8,  9, 10, 11]])
X: tensor([[ 1,  2,  3,  4,  5],
           [21, 22, 23, 24, 25]])
Y: tensor([[ 2,  3,  4,  5,  6],
```

2.batch间连续

```
X: tensor([[ 4,  5,  6,  7,  8],
           [19, 20, 21, 22, 23]])
Y: tensor([[ 5,  6,  7,  8,  9],
           [20, 21, 22, 23, 24]])
X: tensor([[ 9, 10, 11, 12, 13],
           [24, 25, 26, 27, 28]])
Y: tensor([[10, 11, 12, 13, 14],
```
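顺序分区的一个极简草图(思路与 d2l 的 seq_data_iter_sequential 相同,函数名与细节为示意):

```python
import random
import torch

def seq_iter_sequential(corpus, batch_size, num_steps):
    """batch 间连续:相邻两个 batch 的同一行在原序列上是接着的(示意实现)"""
    offset = random.randint(0, num_steps)  # 随机偏移起点
    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
    Xs = torch.tensor(corpus[offset: offset + num_tokens])
    Ys = torch.tensor(corpus[offset + 1: offset + 1 + num_tokens])
    Xs, Ys = Xs.reshape(batch_size, -1), Ys.reshape(batch_size, -1)
    for i in range(0, Xs.shape[1] // num_steps * num_steps, num_steps):
        yield Xs[:, i: i + num_steps], Ys[:, i: i + num_steps]

for X, Y in seq_iter_sequential(list(range(35)), batch_size=2, num_steps=5):
    print('X:', X, '\nY:', Y)
```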

load_data_time_machine: 封装数据并返回vocab

x = [b,t] y=[b,t] 特征抽取

x ->onehot-> [t, b, infeature] ->layer-> [t, b, hidden] ->linear-> [t*b, outfeature]

定义:**[in, hidden]**;state:(layers * directions, batch, hidden)

RNN

任务定义:给定一串字母,生成下一个或者n个

模型的好坏用困惑度衡量:每一个词的预测都可以看成一次分类,对每个词的交叉熵求平均,最后取指数

image-20210818125137879
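也就是说,困惑度的定义为:

$$
\mathrm{PPL} = \exp\left(-\frac{1}{n} \sum_{t=1}^{n} \log P(x_t \mid x_{t-1}, \ldots, x_1)\right)
$$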

T 个时间步上的梯度连乘容易爆炸,需要梯度裁剪;但裁剪无法处理梯度消失

image-20230515165612896

任务

image-20210818125528564

视频Tracking:不需要用rnn,直接判断bbox帧间周围的情况

手动实现

h 是一个长度为 num_hiddens 的特征向量,记录历史信息;每个输入 x 都会更新出下一个 h,同时由 h 给出一个输出 o,代表预测结果

关注h,h是对历史的建模,从h到o只是一个线性回归

image-20210818124506982 image-20210818124810932

参数定义:五个参数(W_xh、W_hh、b_h、W_hq、b_q),都需要梯度;还需要定义初始化 h 的函数,返回 ((b, num_hiddens),)

forward函数:

序列输入,所以t一定是在最外面。b的作用仅仅是泛化,b之间互不影响。h也是存储了b个

  1. 输入 [b, t] 和 state,转置并 one-hot 成 [t, b, vocab_size]
  2. 按 t 遍历输入到网络中,每个时间步输出 [b, vocab_size],并更新 t 次 state
  3. (和 y 计算损失函数,预测的是下一个字母)
  4. 最后堆叠输出 [t*b, vocab_size] 和 new_state
```python
# 传入 [t, b, vocab_size] 以及参数、初始状态,返回 [t*b, vocab_size]
def rnn(inputs, state, params):
    # `inputs`的形状:(`时间步数量`,`批量大小`,`词表大小`)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    # `X`的形状:(`批量大小`,`词表大小`)
    for X in inputs:
        # 每个时间步更新一次H,生成一个Y
        H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
        Y = torch.mm(H, W_hq) + b_q
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H,)
```

对于一个 [t, b, vocab_size], 根据state,每次传入[b, vocab_size]给出[b]个结果预测


```python
# 网络执行的方式,X传进来后需要onehot
def __call__(self, X, state):
    X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
    return self.forward_fn(X, state, self.params)
```

裁剪:计算梯度二范数 norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params)),若 norm > θ 则把梯度整体缩放为原来的 θ/norm 倍
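按这个公式,梯度裁剪的写法大致如下(与 d2l 的 grad_clipping 思路一致,假设 torch、nn 已导入,仅作示意):

```python
def grad_clipping(net, theta):
    """按二范数裁剪梯度:norm 超过 theta 时整体缩放到 theta(示意版)"""
    if isinstance(net, nn.Module):
        params = [p for p in net.parameters() if p.requires_grad]
    else:
        params = net.params
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
```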

预测: 用预先给的词初始化h,并不断forward给输出并更新state

```python
# 根据h和上一个x来预测下一个值x,并更新h
def predict_ch8(prefix, num_preds, net, vocab, device):  #@save
    """在`prefix`后面生成新字符。"""
    state = net.begin_state(batch_size=1, device=device)
    outputs = [vocab[prefix[0]]]
    get_input = lambda: torch.tensor([outputs[-1]], device=device).reshape((1, 1))
    # 根据input,更新state
    for y in prefix[1:]:  # 预热期
        _, state = net(get_input(), state)
        outputs.append(vocab[y])

    for _ in range(num_preds):  # 预测`num_preds`步
        y, state = net(get_input(), state)
        outputs.append(int(y.argmax(dim=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])
```

训练:一个 epoch 中,注意 batch 间如果打乱了的话,state 要重新初始化;否则沿用之前的,并且需要 detach_()

损失函数:直接 CrossEntropy;注意更新前先裁剪梯度。y 是 [b, t],传入前先转置一下

简洁实现

核心:通过保存state信息,对t个features编码,转为t个num_hiddens

RNN的定义是没有b的,只需要features num_hiddens,但state有b且多了个1维度

RNN实际上就是对输入的t个时间序列,进行建模处理,并返回hidden维度信息

```python
# 输入维度,隐藏层维度
rnn_layer = nn.RNN(features, num_hiddens)
# 传入的数据也是onehot后的;没有输出层,需要自己加一层linear;state需要自己传入

# X: [t, b, features]
# state: [1, b, num_hiddens]
Y, state_new = rnn_layer(X, state)
# Y: [t, b, num_hiddens],t 个时间步上最后一层的 H 拼接而成
# state_new: [1, b, num_hiddens],用于传入下一次,1 为 num_layers

# 所以网络的输出是 (t 个时间步最后一层的 H 的 cat, 最后一个 state),且 Y[-1] == state_new
```

输入数据中的t代表着输入数据的时序长度,很像t次MLP分类,但是前面的数据会影响state从而影响后面的分类

为什么没有out层?输出不一定要和输入的维度一样,比如我可以只去做一个情感分类,或者只想提取特征。如果想分类,直接输入到全连接

```python
Y, state = self.rnn(X, state)
# 全连接层首先将`Y`的形状改为 (num_steps * batch_size, num_hiddens)
output = self.linear(Y.reshape((-1, Y.shape[-1])))
```

QA:

  1. 处理视频时序序列,t就是想要关联的帧长度,而onehot则改成了由神经网络抽取出来单帧图片的特征。所以[t,features]输入到rnn后,rnn返回给你[t,features’ ] ,根据这个提取出信息

  2. 如果用单词作为预测目标onehot将会非常长。不利于预测

  3. RNN不能处理长序列:num_hiddens决定着你记录之前的状态。但太大会过拟合,太小会无法记录下之前的消息

  4. 高频词可以对概率开根号,或者随机去除

GRU

添加两个门,更好地保留以前的信息:如一群老鼠中突然出现一只猫,注意点要转移到猫上。(门的取值在 0~1 之间,按位乘)

重置门 R:计算候选隐状态时,决定 H_{t-1} 保留多少;更新门 Z:决定新候选状态 H̃_t 和旧状态 H_{t-1} 在 H_t 中所占的比例

image-20210819132116519
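图中的公式即(与下面代码一一对应):

$$
\begin{aligned}
R_t &= \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r) \\
Z_t &= \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z) \\
\tilde{H}_t &= \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h) \\
H_t &= Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t
\end{aligned}
$$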

实现

获取参数 :11个 W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q

forward函数:按照公式写

```python
def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        Z = torch.sigmoid((X @ W_xz) + (H @ W_hz) + b_z)
        R = torch.sigmoid((X @ W_xr) + (H @ W_hr) + b_r)
        H_tilda = torch.tanh((X @ W_xh) + ((R * H) @ W_hh) + b_h)
        H = Z * H + (1 - Z) * H_tilda
        Y = H @ W_hq + b_q
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H,)
```

和前面一样封装到类中,需要传入infeature hidden get_param init_state forward

```python
model = d2l.RNNModelScratch(len(vocab), num_hiddens, device, get_params,
                            init_gru_state, gru)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
```

简洁:

封装到RNNModel中

```python
gru_layer = nn.GRU(num_inputs, num_hiddens)
model = d2l.RNNModel(gru_layer, len(vocab))
# state依然是 (1, b, num_hiddens)
# nn.GRU 输入为 [t, b, in],输出为 [t, b, h] 和 [1, b, h]
```

对比普通 RNN,GRU 的计算更复杂,吞吐量有所下降:242822.8 -> 26820.1 tokens/sec

QA:

  1. GRU LSTM参数更多,但稳定性比RNN更好
  2. 尽量不要使用RNN

LSTM

两个state :C、H

image-20210819142947789image-20210819143001209

F(遗忘门)和 I(输入门)决定旧记忆 C_{t-1} 和候选记忆 C̃_t 所占的比例,O(输出门)决定 C_t 求出来后如何转换为 H_t
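对应的公式为:

$$
\begin{aligned}
I_t &= \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i) \\
F_t &= \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f) \\
O_t &= \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o) \\
\tilde{C}_t &= \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c) \\
C_t &= F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \\
H_t &= O_t \odot \tanh(C_t)
\end{aligned}
$$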

```python
lstm_layer = nn.LSTM(num_inputs, num_hiddens)
# state: ([1, b, num_hiddens], [1, b, num_hiddens]),分别为 H 和 C

Y, state_new = lstm_layer(X, state)
# Y: [t, b, num_hiddens],每个时间步最后一层 H 的集合
# state_new: ([1, b, num_hiddens], [1, b, num_hiddens]),用于传入下一次,1 为 num_layers
```

实际内存难以计算,cudnn会用内存换速度,直接跑来看占用

```python
class RNNModel(nn.Module):
    """The RNN model.

    Defined in :numref:`sec_rnn-concise`"""
    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        self.rnn = rnn_layer
        self.vocab_size = vocab_size
        self.num_hiddens = self.rnn.hidden_size
        # If the RNN is bidirectional (to be introduced later),
        # `num_directions` should be 2, else it should be 1.
        if not self.rnn.bidirectional:
            self.num_directions = 1
            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)
        else:
            self.num_directions = 2
            self.linear = nn.Linear(self.num_hiddens * 2, self.vocab_size)

    def forward(self, inputs, state):
        X = F.one_hot(inputs.T.long(), self.vocab_size)
        X = X.to(torch.float32)
        Y, state = self.rnn(X, state)
        # The fully connected layer will first change the shape of `Y` to
        # (`num_steps` * `batch_size`, `num_hiddens`). Its output shape is
        # (`num_steps` * `batch_size`, `vocab_size`).
        output = self.linear(Y.reshape((-1, Y.shape[-1])))
        return output, state

    def begin_state(self, device, batch_size=1):
        if not isinstance(self.rnn, nn.LSTM):
            # `nn.GRU` takes a tensor as hidden state
            return torch.zeros((self.num_directions * self.rnn.num_layers,
                                batch_size, self.num_hiddens),
                               device=device)
        else:
            # `nn.LSTM` takes a tuple of hidden states
            return (torch.zeros((
                self.num_directions * self.rnn.num_layers,
                batch_size, self.num_hiddens), device=device),
                    torch.zeros((
                        self.num_directions * self.rnn.num_layers,
                        batch_size, self.num_hiddens), device=device))


def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
    """Train a net within one epoch (defined in Chapter 8).

    Defined in :numref:`sec_rnn_scratch`"""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Initialize `state` when either it is the first iteration or
            # using random sampling
            state = net.begin_state(batch_size=X.shape[0], device=device)
        else:
            if isinstance(net, nn.Module) and not isinstance(state, tuple):
                # `state` is a tensor for `nn.GRU`
                state.detach_()
            else:
                # `state` is a tuple of tensors for `nn.LSTM` and
                # for our custom scratch implementation
                for s in state:
                    s.detach_()
        y = Y.T.reshape(-1)
        X, y = X.to(device), y.to(device)
        y_hat, state = net(X, state)
        l = loss(y_hat, y.long()).mean()
        if isinstance(updater, torch.optim.Optimizer):
            updater.zero_grad()
            l.backward()
            grad_clipping(net, 1)
            updater.step()
        else:
            l.backward()
            grad_clipping(net, 1)
            # Since the `mean` function has been invoked
            updater(batch_size=1)
        metric.add(l * d2l.size(y), d2l.size(y))
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()
```
```python
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers)
model = d2l.RNNModel(lstm_layer, len(vocab))

d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
```

深度

多个隐藏层获得非线性性

image-20210819150224101image-20230516144012521

在同一个时刻,保存多个Ht ,由左下角推理而来

image-20210819150931603

state: ([1, b, num_hiddens],[1, b, num_hiddens]) ->([2, b, num_hiddens],[2, b, num_hiddens]) 两层够了

```python
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers)  # num_layers 决定 H 的层数
```

双向

image-20230516145522237

两个H,一个依赖以前的,一个依赖以后的。相互独立,cat在一起决定输出

抽取特征,分类,填空、翻译。但不能预测未来,因为反方向不存在

image-20210819160940494image-20210819161400150image-20210819161423155

```python
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers, bidirectional=True)
```

对句子做特征提取:做翻译、改写,不能做预测,因为完全没有反方向的信息
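双向时输出维度翻倍,下面是一个形状上的小例子(变量取值为示意):

```python
import torch
from torch import nn

t, b, num_inputs, num_hiddens, num_layers = 5, 2, 8, 16, 2
lstm = nn.LSTM(num_inputs, num_hiddens, num_layers, bidirectional=True)
X = torch.randn(t, b, num_inputs)
Y, (H, C) = lstm(X)
print(Y.shape)  # [t, b, 2*num_hiddens]:正反两个方向的 H 在特征维 cat
print(H.shape)  # [2*num_layers, b, num_hiddens]
```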

机器翻译数据集

  • 读入数据,预处理去除大写、特殊字符。

  • 单词化后,英语、法语分别构建 vocab,并加入一些特殊 token

  • 结尾补上 vocab['<eos>'];为了批量计算,每个句子长度要相同,所以截断到最大长度 num_steps,不足的补 vocab['<pad>'],再转为 idx

```python
def build_array_nmt(lines, vocab, num_steps):
    """将机器翻译的文本序列转换成小批量"""
    lines = [vocab[l] for l in lines]
    lines = [l + [vocab['<eos>']] for l in lines]
    array = torch.tensor([truncate_pad(
        l, num_steps, vocab['<pad>']) for l in lines])
    valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)
    return array, valid_len
```
  • 封装成batch,每次返回 X, X_valid_len, Y, Y_valid_len. len为实际句子长度

```python
def load_data_nmt(batch_size, num_steps, num_examples=600):
    """返回翻译数据集的迭代器和词表"""
    text = preprocess_nmt(read_data_nmt())
    source, target = tokenize_nmt(text, num_examples)
    src_vocab = d2l.Vocab(source, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    tgt_vocab = d2l.Vocab(target, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
    data_iter = d2l.load_array(data_arrays, batch_size)
    return data_iter, src_vocab, tgt_vocab
```
```
for X, X_valid_len, Y, Y_valid_len in train_iter:
X: tensor([[93, 12,  4,  3,  1,  1,  1,  1],
           [13, 34,  5,  3,  1,  1,  1,  1]])
valid lengths for X: tensor([4, 4])

Y: tensor([[  0, 103, 104, 105,   5,   3,   1,   1],
           [121,   5,   3,   1,   1,   1,   1,   1]], dtype=torch.int32)
valid lengths for Y: tensor([6, 3])
```

b=2,每个句子最大长度num_steps=8 输入为一个句子,输出也为一个句子
不同于文本生成:序列中每一个输入都有一个输出
机器翻译为一整个序列输入:对应一整个序列输出

Encoder-Decoder

image-20210819171331079

encoder最后的隐藏状态作为decoder的输入,decoder还可以有额外输入。decoder时,由于不知道后面的信息,所以需要一个一个输入

```python
#@save
class EncoderDecoder(nn.Module):
    """编码器-解码器架构的基类"""
    def __init__(self, encoder, decoder, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)
```

Seq2Seq

句子生成句子,使用编码器解码器架构

编码器用于提取句子(生成context),解码器输入为

  • 预测阶段:前一个单词(1,b,h) 和 context的concate
  • 训练阶段:整个单词序列 (t, b, h) 和 context 的 concat,由于 context 不变,GRU 内部其实相当于进行了 t 次

image-20230516154007581

encoder 可以用双向,对输入编码后返回编码器最后的状态,作为 decoder 的初始状态

训练时 decoder 的输入需要右移一位,每一步用的是正确的词(强制教学);推理时用的是上一步的输出

衡量结果

如何衡量生成序列的好坏

image-20230516154646648
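这里的指标是 BLEU(假设图中即 d2l 的定义):p_n 为预测序列中 n-gram 的精度,前面的指数项惩罚过短的预测:

$$
\mathrm{BLEU} = \exp\left(\min\left(0,\ 1 - \frac{\mathrm{len}_{\text{label}}}{\mathrm{len}_{\text{pred}}}\right)\right) \prod_{n=1}^{k} p_n^{1/2^n}
$$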
  1. 编码器跑完后,把最后一个时间步的 state [num_layers, b, num_hiddens] 作为解码器的初始 state
  2. 解码器把 state[-1] 重复 t 次作为历史上下文,和输入 Y[b, t] 的 embedding concat 成 [t, b, emb+hid] 传入 GRU
  3. GRU 输出的 t 个状态作为 dense 的输入,得到 [t, b, vocab],再 permute 成 [b, t, vocab]

model

encoder
输入 [b, t];输出 [t, b, num_hiddens] 和 [num_layers, b, num_hiddens]

```python
class Seq2SeqEncoder(d2l.Encoder):
    """用于序列到序列学习的循环神经网络编码器"""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqEncoder, self).__init__(**kwargs)
        # 嵌入层
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, num_hiddens, num_layers,
                          dropout=dropout)

    def forward(self, X, *args):
        # 输出'X'的形状:(batch_size,num_steps,embed_size)
        X = self.embedding(X)
        # 在循环神经网络模型中,第一个轴对应于时间步
        X = X.permute(1, 0, 2)
        # 如果未提及状态,则默认为0
        output, state = self.rnn(X)
        # output的形状:(num_steps,batch_size,num_hiddens)
        # state的形状:(num_layers,batch_size,num_hiddens)
        return output, state
```
decoder
输入 X[b, t],embedding 后为 [t, b, embed_size],再 cat 上 context[t, b, num_hiddens],得到 [t, b, embed_size+num_hiddens]
输出 [t, b, num_hiddens],经过 dense 得到 [b, t, vocab_size]
训练时一次性得到 t 次预测做 loss(模型内部还是一个词一个词地输入输出,但强制教学使用已知的正确词作为前一个词,所以可以一次性全部输入模型)
predict 时每次输出一个词,并作为下一个时间步的输入

```python
class Seq2SeqDecoder(d2l.Decoder):
    """用于序列到序列学习的循环神经网络解码器"""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqDecoder, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers,
                          dropout=dropout)
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, *args):
        return enc_outputs[1]

    def forward(self, X, state):
        # 输出'X'的形状:(num_steps,batch_size,embed_size)
        X = self.embedding(X).permute(1, 0, 2)
        # 广播context,使其具有与X相同的num_steps
        context = state[-1].repeat(X.shape[0], 1, 1)
        X_and_context = torch.cat((X, context), 2)
        output, state = self.rnn(X_and_context, state)
        output = self.dense(output).permute(1, 0, 2)
        # output的形状:(batch_size,num_steps,vocab_size)
        # state的形状:(num_layers,batch_size,num_hiddens)
        return output, state
```

loss

```python
# 需要根据实际的长度,超出valid_len部分weights为0从而损失为0,忽略pad
X = torch.tensor([[1, 1, 1], [1, 1, 1]])  # [b, t]
sequence_mask(X, torch.tensor([1, 2]))
# 输出 [[1, 0, 0], [1, 1, 0]]


class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """带遮蔽的softmax交叉熵损失函数"""
    # pred的形状:(batch_size,num_steps,vocab_size)
    # label的形状:(batch_size,num_steps)
    # valid_len的形状:(batch_size,)
    def forward(self, pred, label, valid_len):
        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)
        self.reduction = 'none'
        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
            pred.permute(0, 2, 1), label)
        weighted_loss = (unweighted_loss * weights).mean(dim=1)
        return weighted_loss
```
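上面用到的 sequence_mask 大致如下(与 d2l 的实现思路一致,示意版):

```python
def sequence_mask(X, valid_len, value=0):
    """把每一行中超出 valid_len 的位置置为 value"""
    maxlen = X.size(1)
    mask = torch.arange(maxlen, dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    X[~mask] = value
    return X
```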

train

```python
for batch in data_iter:
    optimizer.zero_grad()
    X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
    bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
                       device=device).reshape(-1, 1)
    dec_input = torch.cat([bos, Y[:, :-1]], 1)  # 强制教学 你好啊 -> <bos>你好
    Y_hat, _ = net(X, dec_input, X_valid_len)  # 这里X_valid_len没有用上
    l = loss(Y_hat, Y, Y_valid_len)
    l.sum().backward()  # 反向传播
    d2l.grad_clipping(net, 1)
    optimizer.step()
```

predict

```python
#@save
def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
                    device, save_attention_weights=False):
    """序列到序列模型的预测"""
    # 在预测时将net设置为评估模式
    net.eval()
    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
        src_vocab['<eos>']]
    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
    # 添加批量轴
    enc_X = torch.unsqueeze(
        torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
    enc_outputs = net.encoder(enc_X, enc_valid_len)
    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
    # 添加批量轴
    dec_X = torch.unsqueeze(torch.tensor(
        [tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)
    output_seq, attention_weight_seq = [], []
    # 输入dec_X为(1, 1),t、b都等于1
    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, dec_state)
        # 我们使用具有预测最高可能性的词元,作为解码器在下一时间步的输入
        dec_X = Y.argmax(dim=2)
        pred = dec_X.squeeze(dim=0).type(torch.int32).item()
        # 保存注意力权重(稍后讨论)
        if save_attention_weights:
            attention_weight_seq.append(net.decoder.attention_weights)
        # 一旦序列结束词元被预测,输出序列的生成就完成了
        if pred == tgt_vocab['<eos>']:
            break
        output_seq.append(pred)
    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq
```

QA:

  1. word2vec没讲,跳过了
  2. transformer可以代替seq2seq

束搜索

预测时,每一步都是取最优的(贪心),但贪心不一定是全局最优,例如下面第二步取C

image-20230516202920183image-20230516202927638

穷举:指数级 太大了

束搜索:每次在所有 k·n 个候选(k 个束 × n 个词)中,保留 k 个概率最大的;k=1 就是贪心

image-20230516203605851
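最终在所有保留的候选序列之间,按长度归一化的对数概率打分(α 常取 0.75):

$$
\frac{1}{L^{\alpha}} \log P(y_1, \ldots, y_L) = \frac{1}{L^{\alpha}} \sum_{t=1}^{L} \log P(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{c})
$$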

注意力

attention

query:输入 key,value :已有的一些数据

核心:根据 query 和 key_i 的关系,决定 value_i 的权重,加权得到一个最终的 value

核回归

非参数

query 为输入 x,根据数据 (x_i, y_i) 给出预测 y;x_i 是 key,y_i 是 value

  1. 最简单的是对所有 y 求平均,这样每个输入给出的 f(x) 都一样
  2. 按权重加权求和,权重由核函数 K(距离)求出;如果 K 是高斯核,就等价于对负的平方距离做 softmax 加权

image-20230517120533488

image-20230517121458578

```python
# X_repeat的形状:(n_test,n_train),
# 每一行都包含着相同的测试输入(例如:同样的查询)
X_repeat = x_test.repeat_interleave(n_train).reshape((-1, n_train))
# x_train包含着键。attention_weights的形状:(n_test,n_train),
# 每一行都包含着要在给定的每个查询的值(y_train)之间分配的注意力权重
attention_weights = nn.functional.softmax(-(X_repeat - x_train)**2 / 2, dim=1)
# y_hat的每个元素都是值的加权平均值,其中的权重是注意力权重
y_hat = torch.matmul(attention_weights, y_train)
plot_kernel_reg(y_hat)
```

image-20230517114942706image-20230517115633471

热力图里一行代表一个查询对哪个 input 的权重更大。权重给得比较平滑,所以 pred 也比较平滑

参数化

引入可学习的w=nn.Parameter(torch.rand((1,), requires_grad=True)) 控制高斯核的窗口大小,w越大窗口越小
$$
\begin{aligned}
f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\
&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}((x - x_i)w)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}((x - x_j)w)^2\right)} y_i \\
&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}((x - x_i)w)^2\right) y_i.
\end{aligned}
$$
image-20230517120017210image-20230517120026332

窗口更窄了,只给离得近的分配权重,所以pred更加弯曲

注意力分数

拓展到高维情况,q k v都是向量

image-20230517121659643

Scaled Dot
  • k 和 q 长度一样:k、q 做内积后除以 √d(transformer 采用)。两次矩阵乘法,没有可学习参数,除以 √d 防止梯度问题

image-20230517122729061
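写成矩阵形式(Q、K、V 按行堆叠各个 query/key/value)即:

$$
\mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V
$$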

```python
class DotProductAttention(nn.Module):
    """Scaled dot product attention.

    Defined in :numref:`subsec_additive-attention`"""
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # Shape of `queries`: (`batch_size`, no. of queries, `d`)
    # Shape of `keys`: (`batch_size`, no. of key-value pairs, `d`)
    # Shape of `values`: (`batch_size`, no. of key-value pairs, value
    # dimension)
    # Shape of `valid_lens`: (`batch_size`,) or (`batch_size`, no. of queries)
    def forward(self, queries, keys, values, valid_lens=None):
        d = queries.shape[-1]
        # Set `transpose_b=True` to swap the last two dimensions of `keys`
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
```
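代码里的 masked_softmax 大致如下(示意版,与 d2l 思路一致,假设 d2l.sequence_mask 可用):把每个 query 在 valid_len 之外的分数置为很大的负数,再做 softmax,使这些位置的权重约等于 0:

```python
def masked_softmax(X, valid_lens):
    """在最后一个轴上做 softmax,valid_lens 之外的位置权重约为 0(示意版)"""
    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    shape = X.shape
    if valid_lens.dim() == 1:
        valid_lens = torch.repeat_interleave(valid_lens, shape[1])
    else:
        valid_lens = valid_lens.reshape(-1)
    # 被屏蔽位置用一个非常大的负值,softmax 后趋近于 0
    X = d2l.sequence_mask(X.reshape(-1, shape[-1]), valid_lens, value=-1e6)
    return nn.functional.softmax(X.reshape(shape), dim=-1)
```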
additive
  • k 和 q 长度可以不一样:k 和 q 分别线性投影后相加,经 tanh 得到隐藏单元数为 h 的特征,再乘上 w_v^T 输出为一个分数值。有可学习参数
image-20230517122801534

对于每一个query,我都需要得到一个len(“键-值”对)的向量,多个query就是一个weight矩阵 [len(query), len(“键-值”对)],weight*values得到加权输出[querys, d(v)]

```python
#@save
# [querys, d(q)] -> [querys, d(v)]
class AdditiveAttention(nn.Module):
    """加性注意力"""
    def __init__(self, key_size, query_size, num_hiddens, dropout, **kwargs):
        super(AdditiveAttention, self).__init__(**kwargs)
        self.W_k = nn.Linear(key_size, num_hiddens, bias=False)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        # valid_lens表示每个query考虑多少个kv
        queries, keys = self.W_q(queries), self.W_k(keys)
        # 在维度扩展后,
        # queries的形状:(batch_size,查询的个数,1,num_hidden)
        # key的形状:(batch_size,1,“键-值”对的个数,num_hiddens)
        # 使用广播方式进行求和
        features = queries.unsqueeze(2) + keys.unsqueeze(1)
        features = torch.tanh(features)
        # self.w_v仅有一个输出,因此从形状中移除最后那个维度。
        # scores的形状:(batch_size,查询的个数,“键-值”对的个数)
        scores = self.w_v(features).squeeze(-1)
        # masked_softmax把valid_lens之外位置的score置为很大的负数,权重约为0
        self.attention_weights = masked_softmax(scores, valid_lens)
        # values的形状:(batch_size,“键-值”对的个数,值的维度)
        return torch.bmm(self.dropout(self.attention_weights), values)
```

dropout增加模型的泛化能力

应用:key value query到底是什么

Bahdanau seq2seq

翻译时额外添加原句子的对应信息,而不是只用最后一个state。具体用以前的哪个state由attention决定

image-20230517144406568

image-20230524211826420
  • key-value:编码器每一次的RNN 的输出states
  • query:解码器上一次输出

对比之前的改进:之前context直接用最后一个state,现在对所有state拿出来做一个weight

image-20230517150923131

  • query:当前state[-1] [batch_size,1,num_hiddens]
  • key-value: encoder的output[batch_size,num_steps,num_hiddens]

以前的context:t次都一样,都是最后的state

```python
# encoder的state[-1]重复t次,代表t个时间步的上下文关注点都一样,都是最后的state
context = state[-1].repeat(X.shape[0], 1, 1)
X_and_context = torch.cat((X, context), 2)
output, state = self.rnn(X_and_context, state)  # 直接一次性输入到网络
```

现在的context:每次都不一样,为output的加权。C=attention(pre-state, (h1,h2...ht)) 不一样所以需要遍历

  • query为上次state[-1],代表着当前状态 [batch_size, query=1, num_hiddens] 当前状态的维度为num_hiddens
  • key-value = enc_outputs是encoder的output转置下 [batch_size, num_steps, num_hiddens] 代表着有t次状态,每个状态的维度为num_hiddens,attention对t次状态加权后得到[batch_size, query=1, num_hiddens], 权重矩阵为[b, query=1, num_steps]
```python
for x in X:  # 每一次query(当前状态)都不同,代表翻译到了哪个单词,不同状态有不同注意点
    # query的形状为(batch_size, 1, num_hiddens),1代表一次询问
    query = torch.unsqueeze(hidden_state[-1], dim=1)
    # context的形状为(batch_size, 1, num_hiddens),enc_valid_lens忽略输入的pad
    context = self.attention(query, enc_outputs, enc_outputs, enc_valid_lens)

    # 在特征维度上连结
    x = torch.cat((context, torch.unsqueeze(x, dim=1)), dim=-1)
    # 将x变形为(t=1, batch_size, embed_size+num_hiddens)
    out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state)
```

image-20230525165633633

纵坐标为生成的token['<bos>', 'je', 'suis', 'chez', 'moi', '.', '<eos>']
横坐标为输入的4个词 ["i'm", 'home', '.', '<eos>']

self-attention!

同时拉取汇聚全部时间的信息

image-20230517161838017

key value query都是x

image-20230517164245269

code

```python
# 原来: [querys, d(q)] -> [querys, d(v)]
# o = attention(queries, keys, values),例如 self.attention(query, enc_outputs, enc_outputs)

# 现在: [querys=t, d(x)] -> [t, d(x)]
X = self.attention(X, X, X)

# 如果是在decoder,则不能看到当前t时刻以后的信息,valid_lens形状为(b, t),每一行都是[1, 2, ..., n]代表可见长度
X = self.attention(X, X, X,
                   torch.arange(1, num_steps + 1, device=X.device).repeat(batch_size, 1))

# 如果是在预测阶段,key_value需要不断拼接生成,X为单个字符(1, 1, h),key_values为前t个字符(1, now_t, h)
enc_outputs, enc_valid_lens = state[0], state[1]
# 预测阶段,输出序列是通过词元一个接着一个解码的,
# 因此state[2][self.i]包含着直到当前时间步第i个块解码的输出表示
key_values = torch.cat((state[2][self.i], X), axis=1)
state[2][self.i] = key_values

# 自注意力
X2 = self.attention1(X, key_values, key_values, None)  # 预测阶段dec_valid_lens为None
Y = self.addnorm1(X, X2)
```

pos-encoding

自注意力失去了位置信息,所以加上位置矩阵 P:序列有 n 个位置 i,每个位置是 d 维,维度下标为 j。位置编码也可以是可学习的(BERT)
$$
\begin{aligned}
p_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right), \\
p_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right).
\end{aligned}
$$

```python
#@save
# [b, t, d] -> [b, t, d]   P: [1, max_len, num_hiddens]
class PositionalEncoding(nn.Module):
    """位置编码"""
    def __init__(self, num_hiddens, dropout, max_len=1000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(dropout)
        # 创建一个足够长的P
        self.P = torch.zeros((1, max_len, num_hiddens))
        X = torch.arange(max_len, dtype=torch.float32).reshape(
            -1, 1) / torch.pow(10000, torch.arange(
            0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
        self.P[:, :, 0::2] = torch.sin(X)
        self.P[:, :, 1::2] = torch.cos(X)

    def forward(self, X):
        X = X + self.P[:, :X.shape[1], :].to(X.device)
        return self.dropout(X)
```
image-20230517164528517

QA:

  1. self-attention理解为一个layer,有输入输出
  2. BERT 其实是纯self-attention + context-attention

transformer

  • encoder-decoder架构
  • 纯注意力,n个transformer块
  • block input-output形状一样

image-20230523122148500

编码器的状态信息会同时传给每一个解码器block

image-20230524212752754

multi-head-attention

多个 dot attention,也就是多个 head。dot attention 本身没有参数,加上多头的线性投影就相当于添加了可学习参数

image-20230523112624348

```python
num_hiddens, num_heads = 100, 5

X = torch.ones((b, num_queries, num_hiddens))
Y = torch.ones((b, num_kvpairs, num_hiddens))
attention = MultiHeadAttention(num_hiddens, num_hiddens, num_hiddens,
                               num_hiddens, num_heads, 0.5)
attention(X, Y, Y, valid_lens).shape  # [b, num_queries, num_hiddens]
```
```python
#@save
# t:查询或者“键-值”对的个数
# transpose_qkv:(b, t, num_hiddens) -> (b*num_heads, t, num_hiddens/num_heads)
# 相当于num_hiddens里保存了多个头的信息,计算时拆分出来,为了用一次大矩阵乘法去掉for循环

class MultiHeadAttention(nn.Module):
    """多头注意力"""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 num_heads, dropout, bias=False, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.num_heads = num_heads
        self.attention = d2l.DotProductAttention(dropout)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=bias)
        self.W_k = nn.Linear(key_size, num_hiddens, bias=bias)
        self.W_v = nn.Linear(value_size, num_hiddens, bias=bias)
        self.W_o = nn.Linear(num_hiddens, num_hiddens, bias=bias)

    def forward(self, queries, keys, values, valid_lens):
        # queries,keys,values的形状:
        # (batch_size,查询或者“键-值”对的个数,num_hiddens)
        # valid_lens 的形状:
        # (batch_size,)或(batch_size,查询的个数)
        # 经过变换后,输出的queries,keys,values 的形状:
        # (batch_size*num_heads,查询或者“键-值”对的个数,
        # num_hiddens/num_heads)
        queries = transpose_qkv(self.W_q(queries), self.num_heads)
        keys = transpose_qkv(self.W_k(keys), self.num_heads)
        values = transpose_qkv(self.W_v(values), self.num_heads)

        if valid_lens is not None:
            # 在轴0,将第一项(标量或者矢量)复制num_heads次,
            # 然后如此复制第二项,然后诸如此类。
            valid_lens = torch.repeat_interleave(
                valid_lens, repeats=self.num_heads, dim=0)

        # output的形状:(batch_size*num_heads,查询的个数,
        # num_hiddens/num_heads)
        output = self.attention(queries, keys, values, valid_lens)

        # output_concat的形状:(batch_size,查询的个数,num_hiddens)
        output_concat = transpose_output(output, self.num_heads)
        return self.W_o(output_concat)
```
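配套的 transpose_qkv / transpose_output 大致如下(示意版,与 d2l 思路一致):

```python
def transpose_qkv(X, num_heads):
    """(b, t, num_hiddens) -> (b*num_heads, t, num_hiddens/num_heads)"""
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)
    X = X.permute(0, 2, 1, 3)
    return X.reshape(-1, X.shape[2], X.shape[3])

def transpose_output(X, num_heads):
    """逆转transpose_qkv:(b*num_heads, t, h/heads) -> (b, t, num_hiddens)"""
    X = X.reshape(-1, num_heads, X.shape[1], X.shape[2])
    X = X.permute(0, 2, 1, 3)
    return X.reshape(X.shape[0], X.shape[1], -1)
```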

FFN

基于位置的前馈网络:两层 MLP,[b, t, in] -> [b, t, out]

self-attention是在不同的t之间汇聚信息,而mlp对单个t中的in信息做处理

image-20230524205443098
```python
#@save
class PositionWiseFFN(nn.Module):
    """基于位置的前馈网络"""
    def __init__(self, ffn_num_input=512, ffn_num_hiddens=2048,
                 ffn_num_outputs=512, **kwargs):
        super(PositionWiseFFN, self).__init__(**kwargs)
        self.dense1 = nn.Linear(ffn_num_input, ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.Linear(ffn_num_hiddens, ffn_num_outputs)

    def forward(self, X):
        return self.dense2(self.relu(self.dense1(X)))
```
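一个形状上的小例子,说明 FFN 只作用在最后一维、对每个位置独立(这里把默认的 512/2048/512 换成小数字便于演示):

```python
ffn = PositionWiseFFN(4, 8, 4)
ffn.eval()
print(ffn(torch.ones((2, 3, 4))).shape)  # torch.Size([2, 3, 4])
```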

Mask Multi-Head Attention

predict时,不能使用未来的信息,通过设置有效的attention长度

```python
if self.training:
    batch_size, num_steps, _ = X.shape
    # dec_valid_lens的形状:(batch_size,num_steps),
    # 其中每一行是[1,2,...,num_steps],因为X相当于有num_steps个query,越靠后的query能看到的长度越长
    dec_valid_lens = torch.arange(
        1, num_steps + 1, device=X.device).repeat(batch_size, 1)
else:
    dec_valid_lens = None

# mask-自注意力
X2 = self.attention1(X, X, X, dec_valid_lens)  # 预测阶段key value需要不断拼接得到
Y = self.addnorm1(X, X2)
```

Context Attention

decoder 第二层为 context attention:query 为当前状态,key、value 为 encoder 的输出

image-20230524204852405
```python
enc_outputs, enc_valid_lens = state[0], state[1]
Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)
Z = self.addnorm2(Y, Y2)
```

AddNorm

残差连接 + 层归一化(LayerNorm)

```python
#@save
class AddNorm(nn.Module):
    """残差连接后进行层规范化"""
    def __init__(self, normalized_shape, dropout, **kwargs):
        super(AddNorm, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(normalized_shape)

    def forward(self, X, Y):
        return self.ln(self.dropout(Y) + X)
```

Encoderblock

MultiHeadAttention + addnorm1 + ffn + addnorm2 输入输出维度不变

```python
X = torch.ones((2, 100, 24))
valid_lens = torch.tensor([3, 2])
encoder_blk = EncoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5)
encoder_blk.eval()
encoder_blk(X, valid_lens).shape  # [2, 100, 24]
```
```python
#@save
class EncoderBlock(nn.Module):
    """Transformer编码器块"""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
                 dropout, use_bias=False, **kwargs):
        super(EncoderBlock, self).__init__(**kwargs)
        self.attention = d2l.MultiHeadAttention(
            key_size, query_size, value_size, num_hiddens, num_heads, dropout,
            use_bias)
        self.addnorm1 = AddNorm(norm_shape, dropout)
        self.ffn = PositionWiseFFN(
            ffn_num_input, ffn_num_hiddens, num_hiddens)
        self.addnorm2 = AddNorm(norm_shape, dropout)

    def forward(self, X, valid_lens):
        Y = self.addnorm1(X, self.attention(X, X, X, valid_lens))
        return self.addnorm2(Y, self.ffn(Y))
```

Encoder

多个block堆叠。embedding + pos_encoding + 多个EncoderBlock

```python
encoder = TransformerEncoder(
    200, 24, 24, 24, 24, [100, 24], 24, 48, 8, 2, 0.5)
encoder.eval()
encoder(torch.ones((2, 100), dtype=torch.long), valid_lens).shape
# torch.Size([2, 100, 24])
```
```python
#@save
class TransformerEncoder(d2l.Encoder):
    """Transformer编码器"""
    def __init__(self, vocab_size, key_size, query_size, value_size,
                 num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens,
                 num_heads, num_layers, dropout, use_bias=False, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add_module("block"+str(i),
                EncoderBlock(key_size, query_size, value_size, num_hiddens,
                             norm_shape, ffn_num_input, ffn_num_hiddens,
                             num_heads, dropout, use_bias))

    def forward(self, X, valid_lens, *args):
        # 因为位置编码值在-1和1之间,
        # 因此嵌入值乘以嵌入维度的平方根进行缩放,
        # 然后再与位置编码相加。
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self.attention_weights = [None] * len(self.blks)
        for i, blk in enumerate(self.blks):
            X = blk(X, valid_lens)
            self.attention_weights[
                i] = blk.attention.attention.attention_weights
        return X
```

Decoderblock

需要自己的输入和encoder的输出。输入输出维度不变!

self-MultiHeadAttention + addnorm1 + MultiHeadAttention(编码器解码器注意力) + addnorm2 + fnn + addnorm3

  • 第一次mask-self-attention就是attention1(X, X, X, dec_valid_lens),dec_valid_lens保证不看后面
  • 第二次需要用到encoder输出attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)
```python
class DecoderBlock(nn.Module):
    """解码器中第i个块"""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
                 dropout, i, **kwargs):
        super(DecoderBlock, self).__init__(**kwargs)
        self.i = i
        self.attention1 = d2l.MultiHeadAttention(
            key_size, query_size, value_size, num_hiddens, num_heads, dropout)
        self.addnorm1 = AddNorm(norm_shape, dropout)
        self.attention2 = d2l.MultiHeadAttention(
            key_size, query_size, value_size, num_hiddens, num_heads, dropout)
        self.addnorm2 = AddNorm(norm_shape, dropout)
        self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens,
                                   num_hiddens)
        self.addnorm3 = AddNorm(norm_shape, dropout)

    def forward(self, X, state):
        enc_outputs, enc_valid_lens = state[0], state[1]
        # 训练阶段,输出序列的所有词元都在同一时间处理,
        # 因此state[2][self.i]初始化为None。
        # 预测阶段,输出序列是通过词元一个接着一个解码的,
        # 因此state[2][self.i]包含着直到当前时间步第i个块解码的输出表示
        if state[2][self.i] is None:
            key_values = X
        else:
            key_values = torch.cat((state[2][self.i], X), axis=1)
        state[2][self.i] = key_values
        if self.training:
            batch_size, num_steps, _ = X.shape
            # dec_valid_lens的形状:(batch_size,num_steps),
            # 其中每一行是[1,2,...,num_steps],因为X相当于有num_steps个query,越靠后的query能看到的长度越长
            dec_valid_lens = torch.arange(
                1, num_steps + 1, device=X.device).repeat(batch_size, 1)
        else:
            dec_valid_lens = None

        # 自注意力
        X2 = self.attention1(X, key_values, key_values, dec_valid_lens)
        Y = self.addnorm1(X, X2)
        # 编码器-解码器注意力。
        # enc_outputs的形状:(batch_size,num_steps,num_hiddens)
        Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)
        Z = self.addnorm2(Y, Y2)
        return self.addnorm3(Z, self.ffn(Z)), state
```
```python
decoder_blk = DecoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5, 0)
decoder_blk.eval()
X = torch.ones((2, 100, 24))
state = [encoder_blk(X, valid_lens), valid_lens, [None]]
decoder_blk(X, state)[0].shape  # [2, 100, 24]
```

Decoder

embedding + pos_encoding + 多个decoderBlock + dense;decoderBlock需要的state训练时不变,都是encoder给的

```python
class TransformerDecoder(d2l.AttentionDecoder):
    def __init__(self, vocab_size, key_size, query_size, value_size,
                 num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens,
                 num_heads, num_layers, dropout, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add_module("block"+str(i),
                DecoderBlock(key_size, query_size, value_size, num_hiddens,
                             norm_shape, ffn_num_input, ffn_num_hiddens,
                             num_heads, dropout, i))
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, enc_valid_lens, *args):
        return [enc_outputs, enc_valid_lens, [None] * self.num_layers]

    def forward(self, X, state):
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self._attention_weights = [[None] * len(self.blks) for _ in range(2)]
        for i, blk in enumerate(self.blks):
            X, state = blk(X, state)
            # 解码器自注意力权重
            self._attention_weights[0][
                i] = blk.attention1.attention.attention_weights
            # “编码器-解码器”自注意力权重
            self._attention_weights[1][
                i] = blk.attention2.attention.attention_weights
        return self.dense(X), state

    @property
    def attention_weights(self):
        return self._attention_weights
```

train

num_hiddens 512 1024, num_heads 8 16

```python
num_hiddens, num_layers, dropout, batch_size, num_steps = 32, 2, 0.1, 64, 10
lr, num_epochs, device = 0.005, 200, d2l.try_gpu()
ffn_num_input, ffn_num_hiddens, num_heads = 32, 64, 4
key_size, query_size, value_size = 32, 32, 32
norm_shape = [32]

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)

encoder = TransformerEncoder(
    len(src_vocab), key_size, query_size, value_size, num_hiddens,
    norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
    num_layers, dropout)
decoder = TransformerDecoder(
    len(tgt_vocab), key_size, query_size, value_size, num_hiddens,
    norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
    num_layers, dropout)
net = d2l.EncoderDecoder(encoder, decoder)
d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)
```

预测

预测t+1:前t个转为key value,第t个为query

QA:

  1. concat特征比加权平均好
  2. transformer硬件要求还好,BERT很大
  3. 很多模型只有encoder,如bert
  4. 可以处理图片,抠出一个个patch

BERT

使用预训练模型提取句子特征,如word2vec(忽略时序)。预训练模型可以不更新,只修改output layer

image-20230523145131209

只有 transformer 的 encoder:Base/Large 分别为 blocks=12/24,hidden size=768/1024,heads=12/16,参数量=110M/340M;训练数据约 10 亿个词

输入

  • Segment:没有解码器,所以两个句子都输入到 encoder,用 <sep> 分开,并添加额外的片段编码
  • Position:可学习
  • Token:普通编码

image-20230523150435634

训练任务

带掩码

transformer 的 encoder 是双向的,没法直接做单向的语言模型预测,怎么办?

带掩码的语言模型:随机选 15% 的词,把它们替换为 <mask>,让模型做完形填空

微调任务时压根没有 <mask>;为了缓解预训练和微调的不一致,对被选中的这 15% 的词:80% 替换为 <mask>、10% 换成随机词、10% 保持原词不变
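80/10/10 的替换逻辑大致如下(示意版,函数名为本笔记自拟,与 d2l 中 _replace_mlm_tokens 的思路对应,细节有简化):

```python
import random

def replace_mlm_token(token, vocab_idx_to_token):
    """对一个被选中参与 MLM 的词按 80/10/10 处理(示意)"""
    r = random.random()
    if r < 0.8:
        return '<mask>'                           # 80%:替换为 <mask>
    elif r < 0.9:
        return random.choice(vocab_idx_to_token)  # 10%:换成随机词
    else:
        return token                              # 10%:保持原词
```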

下一句子预测

句子是不是相邻

image-20230523151222497

BERT代码

1. 对 token 添加 <cls>、<sep>,并生成片段索引 segments

```python
#@save
def get_tokens_and_segments(tokens_a, tokens_b=None):
    """获取输入序列的词元及其片段索引"""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # 0和1分别标记片段A和B
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments
```
BERTEncoder

输入tokens,segments [b, t],返回[b, t, hidden]。Encoder中包含pos_embedding

```python
#@save
class BERTEncoder(nn.Module):
    """BERT编码器"""
    def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input,
                 ffn_num_hiddens, num_heads, num_layers, dropout,
                 max_len=1000, key_size=768, query_size=768, value_size=768,
                 **kwargs):
        super(BERTEncoder, self).__init__(**kwargs)
        self.token_embedding = nn.Embedding(vocab_size, num_hiddens)
        self.segment_embedding = nn.Embedding(2, num_hiddens)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add_module(f"{i}", d2l.EncoderBlock(
                key_size, query_size, value_size, num_hiddens, norm_shape,
                ffn_num_input, ffn_num_hiddens, num_heads, dropout, True))
        # 在BERT中,位置嵌入是可学习的,因此我们创建一个足够长的位置嵌入参数
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len,
                                                      num_hiddens))

    def forward(self, tokens, segments, valid_lens):
        # 在以下代码段中,X的形状保持不变:(批量大小,最大序列长度,num_hiddens)
        X = self.token_embedding(tokens) + self.segment_embedding(segments)
        X = X + self.pos_embedding.data[:, :X.shape[1], :]
        for blk in self.blks:
            X = blk(X, valid_lens)
        return X
```
MaskLM

对编码器的输出特征encoded_X,在指定位置上pred_positions,提取出该位置特征masked_X去分类

encoded_X:[b, t, hidden] pred_positions:[b, num_pred] masked_X:[b, num_pred, hidden]

out: [b, num_pred, vocab_size]

```python
#@save
class MaskLM(nn.Module):
    """BERT的掩蔽语言模型任务"""
    def __init__(self, vocab_size, num_hiddens, num_inputs=768, **kwargs):
        super(MaskLM, self).__init__(**kwargs)
        self.mlp = nn.Sequential(nn.Linear(num_inputs, num_hiddens),
                                 nn.ReLU(),
                                 nn.LayerNorm(num_hiddens),
                                 nn.Linear(num_hiddens, vocab_size))

    def forward(self, X, pred_positions):
        num_pred_positions = pred_positions.shape[1]
        pred_positions = pred_positions.reshape(-1)
        batch_size = X.shape[0]
        batch_idx = torch.arange(0, batch_size)
        # 假设batch_size=2,num_pred_positions=3
        # 那么batch_idx是np.array([0,0,0,1,1,1])
        batch_idx = torch.repeat_interleave(batch_idx, num_pred_positions)
        masked_X = X[batch_idx, pred_positions]
        masked_X = masked_X.reshape((batch_size, num_pred_positions, -1))
        mlm_Y_hat = self.mlp(masked_X)
        return mlm_Y_hat
```
NextSentencePred

用 encoded_X[:, 0, :](即 <cls> 位置)的特征进行二分类

```python
#@save
class NextSentencePred(nn.Module):
    """BERT的下一句预测任务"""
    def __init__(self, num_inputs, **kwargs):
        super(NextSentencePred, self).__init__(**kwargs)
        self.output = nn.Linear(num_inputs, 2)

    def forward(self, X):
        # X的形状:(batch_size,num_hiddens)
        return self.output(X)
```
BERTModel
```python
#@save
class BERTModel(nn.Module):
    """BERT模型"""
    def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input,
                 ffn_num_hiddens, num_heads, num_layers, dropout,
                 max_len=1000, key_size=768, query_size=768, value_size=768,
                 hid_in_features=768, mlm_in_features=768,
                 nsp_in_features=768):
        super(BERTModel, self).__init__()
        self.encoder = BERTEncoder(vocab_size, num_hiddens, norm_shape,
                                   ffn_num_input, ffn_num_hiddens, num_heads,
                                   num_layers, dropout, max_len=max_len,
                                   key_size=key_size, query_size=query_size,
                                   value_size=value_size)
        self.hidden = nn.Sequential(nn.Linear(hid_in_features, num_hiddens),
                                    nn.Tanh())
        self.mlm = MaskLM(vocab_size, num_hiddens, mlm_in_features)
        self.nsp = NextSentencePred(nsp_in_features)

    def forward(self, tokens, segments, valid_lens=None,
                pred_positions=None):
        encoded_X = self.encoder(tokens, segments, valid_lens)
        if pred_positions is not None:
            mlm_Y_hat = self.mlm(encoded_X, pred_positions)
        else:
            mlm_Y_hat = None
        # 用于下一句预测的多层感知机分类器的隐藏层,0是“<cls>”标记的索引
        nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :]))
        return encoded_X, mlm_Y_hat, nsp_Y_hat
```

数据集

  • 需要先获得 tokens, segments, is_next ; segments为0、1用于区别句子
  • 对tokens进行mask替换,返回tokens,positions,positions上原词汇mlm_Y
  • pad 和对应的 valid_lens;all_mlm_weights 取 0 或 1,用于过滤掉 mask 位置中属于 pad 的词
```python
for (tokens_X, segments_X, valid_lens_x, pred_positions_X,
     mlm_weights_X, mlm_Y, nsp_y) in train_iter:
    ...
```

```
torch.Size([512, 64]) torch.Size([512, 64]) torch.Size([512]) torch.Size([512, 10]) torch.Size([512, 10]) torch.Size([512, 10]) torch.Size([512])
```

mlm_weights_X 取 0/1,用于过滤掉 mask 位置中属于 pad 的词

训练

训练时并不直接使用返回的 encoded_X,两个任务都是为了提升 encoder 抽取特征的能力

```python
net = d2l.BERTModel(len(vocab), num_hiddens=128, norm_shape=[128],
                    ffn_num_input=128, ffn_num_hiddens=256, num_heads=2,
                    num_layers=2, dropout=0.2, key_size=128, query_size=128,
                    value_size=128, hid_in_features=128, mlm_in_features=128,
                    nsp_in_features=128)
devices = d2l.try_all_gpus()
loss = nn.CrossEntropyLoss()
```
```python
#@save
def _get_batch_loss_bert(net, loss, vocab_size, tokens_X,
                         segments_X, valid_lens_x,
                         pred_positions_X, mlm_weights_X,
                         mlm_Y, nsp_y):
    # 前向传播
    _, mlm_Y_hat, nsp_Y_hat = net(tokens_X, segments_X,
                                  valid_lens_x.reshape(-1),
                                  pred_positions_X)
    # 计算遮蔽语言模型损失,不计算pad位置的
    mlm_l = loss(mlm_Y_hat.reshape(-1, vocab_size), mlm_Y.reshape(-1)) *\
        mlm_weights_X.reshape(-1, 1)
    mlm_l = mlm_l.sum() / (mlm_weights_X.sum() + 1e-8)
    # 计算下一句子预测任务的损失
    nsp_l = loss(nsp_Y_hat, nsp_y)
    l = mlm_l + nsp_l
    return mlm_l, nsp_l, l
```

BERT表示文本

利用BERT获得句子的encoded_X,去进行分类、预测等

```python
def get_bert_encoding(net, tokens_a, tokens_b=None):
    tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b)
    token_ids = torch.tensor(vocab[tokens], device=devices[0]).unsqueeze(0)
    segments = torch.tensor(segments, device=devices[0]).unsqueeze(0)
    valid_len = torch.tensor(len(tokens), device=devices[0]).unsqueeze(0)
    encoded_X, _, _ = net(token_ids, segments, valid_len)
    return encoded_X
```
```python
tokens_a = ['a', 'crane', 'is', 'flying']
encoded_text = get_bert_encoding(net, tokens_a)  # [1, 6, 128]
# 词元:'<cls>','a','crane','is','flying','<sep>'
encoded_text_cls = encoded_text[:, 0, :]  # [1, 128]

tokens_a, tokens_b = ['a', 'crane', 'driver', 'came'], ['he', 'just', 'left']
encoded_pair = get_bert_encoding(net, tokens_a, tokens_b)
# 词元:'<cls>','a','crane','driver','came','<sep>','he','just','left','<sep>'
```

QA:

  1. 模型太大? model分在不同GPU上

微调BERT

利用 BERT 对每个词抽取的特征,我们不需要再考虑如何抽取句子特征、词特征,只需要添加输出层;微调时需要使用和预训练相同的 Vocab

image-20230523164053828
  • 句子分类:直接用 <cls> 位置的特征,用别的位置也可以,但最好用 <cls>

  • 识别词元是不是特殊词:人名、地名、机构。 对每一个词的特征做二分类

  • 问题回答:给出一段话和一个问题。对于一段话中每一个词,预测是不是问题的开始和结束。三分类

    image-20230523164551697

code

只需要用上 encoder;hidden 是 BERT 中输出到 NSP 前的处理,这里也用上,相当于用新的输出层替换了 NSP

```python
class BERTClassifier(nn.Module):
    def __init__(self, bert):
        super(BERTClassifier, self).__init__()
        self.encoder = bert.encoder
        self.hidden = bert.hidden
        self.output = nn.Linear(256, 3)

    def forward(self, inputs):
        tokens_X, segments_X, valid_lens_x = inputs
        encoded_X = self.encoder(tokens_X, segments_X, valid_lens_x)
        return self.output(self.hidden(encoded_X[:, 0, :]))
```

QA:

  1. YOLO基础效果不好,但加了大量trick细节
  2. 通过蒸馏可以把模型压缩到十分之一大小,精度不会下降很多