机器翻译:从 RNN 到 Transformer 的演进
1. 技术分析
1.1 机器翻译技术演进
机器翻译经历了从规则方法到深度学习的演进:
机器翻译技术路线 规则翻译: 基于语法规则 统计翻译: 基于语料统计 神经机器翻译: RNN/Transformer
1.2 神经机器翻译架构
| 架构 | 特点 | 代表模型 |
|---|
| RNN seq2seq | 循环结构 | Seq2Seq |
| Attention | 注意力机制 | Bahdanau Attention |
| Transformer | 全注意力 | Transformer |
| MLP-Mixer | 无注意力 | MLP-Mixer |
1.3 翻译质量评估指标
BLEU 评分计算 BLEU = BP * exp(Σ w_n * log(p_n)) 其中 p_n 是 n-gram 精度,BP 是简短惩罚
2. 核心功能实现
2.1 RNN Seq2Seq 实现
import torch import torch.nn as nn import torch.nn.functional as F class EncoderRNN(nn.Module): def __init__(self, input_size, hidden_size, num_layers=2): super().__init__() self.hidden_size = hidden_size self.embedding = nn.Embedding(input_size, hidden_size) self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=num_layers, bidirectional=True) def forward(self, x): embedded = self.embedding(x) outputs, (hidden, cell) = self.lstm(embedded) hidden = torch.cat((hidden[-2], hidden[-1]), dim=1) cell = torch.cat((cell[-2], cell[-1]), dim=1) return outputs, (hidden.unsqueeze(0), cell.unsqueeze(0)) class DecoderRNN(nn.Module): def __init__(self, hidden_size, output_size, num_layers=2): super().__init__() self.hidden_size = hidden_size self.embedding = nn.Embedding(output_size, hidden_size) self.lstm = nn.LSTM(hidden_size * 2, hidden_size, num_layers=num_layers) self.fc = nn.Linear(hidden_size, output_size) def forward(self, x, hidden, encoder_outputs=None): embedded = self.embedding(x).unsqueeze(0) output, hidden = self.lstm(embedded, hidden) prediction = self.fc(output.squeeze(0)) return prediction, hidden class Seq2Seq(nn.Module): def __init__(self, encoder, decoder): super().__init__() self.encoder = encoder self.decoder = decoder def forward(self, src, tgt): encoder_outputs, hidden = self.encoder(src) outputs = [] x = tgt[0] for _ in range(len(tgt)): output, hidden = self.decoder(x, hidden, encoder_outputs) outputs.append(output) x = output.argmax(dim=1) return torch.stack(outputs)
2.2 Attention Seq2Seq 实现
class AttentionDecoder(nn.Module): def __init__(self, hidden_size, output_size, num_layers=2): super().__init__() self.hidden_size = hidden_size self.embedding = nn.Embedding(output_size, hidden_size) self.lstm = nn.LSTM(hidden_size * 3, hidden_size, num_layers=num_layers) self.attn = nn.Linear(hidden_size * 2, hidden_size) self.fc = nn.Linear(hidden_size * 2, output_size) def forward(self, x, hidden, encoder_outputs): embedded = self.embedding(x).unsqueeze(0) attn_weights = F.softmax( torch.bmm(encoder_outputs.transpose(0, 1), self.attn(hidden[0].transpose(0, 1)).transpose(1, 2)), dim=1 ) context = torch.bmm(attn_weights, encoder_outputs.transpose(0, 1)).transpose(0, 1) lstm_input = torch.cat((embedded, context), dim=2) output, hidden = self.lstm(lstm_input, hidden) prediction = self.fc(torch.cat((output.squeeze(0), context.squeeze(0)), dim=1)) return prediction, hidden, attn_weights
2.3 Transformer 机器翻译
class TransformerMT(nn.Module): def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, num_heads=8, d_ff=2048, num_layers=6): super().__init__() self.src_embedding = nn.Embedding(src_vocab_size, d_model) self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model) self.positional_encoding = PositionalEncoding(d_model) self.transformer = nn.Transformer( d_model=d_model, nhead=num_heads, num_encoder_layers=num_layers, num_decoder_layers=num_layers, dim_feedforward=d_ff ) self.fc = nn.Linear(d_model, tgt_vocab_size) def forward(self, src, tgt): src = self.src_embedding(src) * torch.sqrt(torch.tensor(self.src_embedding.embedding_dim, dtype=torch.float32)) src = self.positional_encoding(src) tgt = self.tgt_embedding(tgt) * torch.sqrt(torch.tensor(self.tgt_embedding.embedding_dim, dtype=torch.float32)) tgt = self.positional_encoding(tgt) output = self.transformer(src, tgt) output = self.fc(output) return output def translate(self, src, max_len=100): self.eval() src = self.src_embedding(src) * torch.sqrt(torch.tensor(self.src_embedding.embedding_dim, dtype=torch.float32)) src = self.positional_encoding(src) tgt = torch.tensor([[self.tgt_start_token]], device=src.device) for _ in range(max_len): tgt_emb = self.tgt_embedding(tgt) * torch.sqrt(torch.tensor(self.tgt_embedding.embedding_dim, dtype=torch.float32)) tgt_emb = self.positional_encoding(tgt_emb) output = self.transformer(src, tgt_emb) prediction = self.fc(output[:, -1]) next_token = prediction.argmax(dim=1).unsqueeze(0) tgt = torch.cat([tgt, next_token], dim=1) if next_token.item() == self.tgt_end_token: break return tgt
3. 性能对比
3.1 机器翻译模型对比
| 模型 | BLEU | 训练时间 | 推理速度 | 适用场景 |
|---|
| RNN Seq2Seq | 45 | 快 | 快 | 小规模 |
| Attention Seq2Seq | 55 | 中 | 中 | 中等规模 |
| Transformer | 65 | 慢 | 中 | 大规模 |
| T5 | 70 | 很慢 | 慢 | 大规模 |
3.2 不同语言对的表现
| 语言对 | Transformer | T5 | mBART |
|---|
| 英中 | 55 | 62 | 60 |
| 中英 | 58 | 65 | 63 |
| 英德 | 62 | 68 | 66 |
3.3 模型大小影响
| 参数规模 | BLEU | 训练时间 | 内存 |
|---|
| 100M | 55 | 1天 | 8GB |
| 500M | 62 | 5天 | 16GB |
| 1B | 68 | 10天 | 32GB |
4. 最佳实践
4.1 机器翻译模型选择
def select_translation_model(direction, data_size): if data_size < 10000: return Seq2Seq(EncoderRNN(10000, 256), DecoderRNN(256, 10000)) else: return TransformerMT(10000, 10000) class TranslationModelFactory: @staticmethod def create(config): if config['type'] == 'seq2seq': return Seq2Seq(**config['params']) elif config['type'] == 'transformer': return TransformerMT(**config['params']) elif config['type'] == 't5': from transformers import T5ForConditionalGeneration return T5ForConditionalGeneration.from_pretrained(config['model_name'])
4.2 机器翻译训练流程
class TranslationTrainer: def __init__(self, model, optimizer, scheduler, loss_fn): self.model = model self.optimizer = optimizer self.scheduler = scheduler self.loss_fn = loss_fn def train_step(self, src, tgt): self.optimizer.zero_grad() output = self.model(src, tgt[:, :-1]) loss = self.loss_fn(output.reshape(-1, output.size(-1)), tgt[:, 1:].reshape(-1)) loss.backward() self.optimizer.step() self.scheduler.step() return loss.item() def evaluate(self, dataloader): self.model.eval() total_loss = 0 with torch.no_grad(): for src, tgt in dataloader: output = self.model(src, tgt[:, :-1]) loss = self.loss_fn(output.reshape(-1, output.size(-1)), tgt[:, 1:].reshape(-1)) total_loss += loss.item() return total_loss / len(dataloader)
5. 总结
机器翻译已进入 Transformer 时代:
- Transformer:目前最优的机器翻译架构
- 预训练模型:T5、mBART 等效果优秀
- 多语言支持:mBART 支持 50+ 语言
- 质量评估:BLEU 是主要评估指标
对比数据如下:
- Transformer 比 RNN Seq2Seq 提升约 20% BLEU
- T5 在大规模数据上表现最佳
- 双语语料越多效果越好
- 推荐使用预训练模型进行微调