告别样本失衡！用PyTorch手把手实现RetinaNet的Focal Loss（附完整代码）-洪萨配资

从数学本质到代码实现：彻底掌握RetinaNet的Focal Loss

当你在训练目标检测模型时，是否遇到过这样的困境：模型总是被大量简单负样本主导，导致对困难样本和正样本的学习效果不佳？这正是RetinaNet提出Focal Loss要解决的核心问题。不同于传统的交叉熵损失，Focal Loss通过巧妙的数学设计，让模型训练过程更加聚焦于那些真正需要学习的样本上。

1. 样本失衡问题的本质剖析

在目标检测任务中，样本失衡问题远比分类任务更为严重。想象一下，在一张普通图片中，可能有几十个物体需要检测（正样本），但同时会产生成千上万个背景区域（负样本）。这种极端不平衡会导致几个严重后果：

梯度被简单样本主导：大量容易分类的背景样本虽然单个损失很小，但累积起来会主导梯度方向
模型收敛困难：有用的信号被淹没在噪声中，模型难以学习到真正有判别性的特征
检测性能下降：特别是对小物体和密集物体的检测效果会明显恶化

传统解决方案如硬负样本挖掘（Hard Negative Mining）虽然有效，但存在两个主要缺陷：

增加了额外的计算开销和实现复杂度
破坏了端到端训练的统一性

Focal Loss的创新之处在于，它从损失函数层面优雅地解决了这个问题，不需要额外的采样策略，保持了端到端训练的优势。

2. Focal Loss的数学原理深度解析

2.1 从交叉熵到Focal Loss的演进路径

标准交叉熵损失(CE)可以表示为：

def cross_entropy(p, y): pt = p if y == 1 else 1 - p return -torch.log(pt)

Balanced Cross Entropy引入了α平衡因子：

def balanced_ce(p, y, alpha=0.25): pt = p if y == 1 else 1 - p alpha_t = alpha if y == 1 else 1 - alpha return -alpha_t * torch.log(pt)

Focal Loss在此基础上增加了调制因子(1-pt)^γ：

def focal_loss(p, y, alpha=0.25, gamma=2): pt = p if y == 1 else 1 - p alpha_t = alpha if y == 1 else 1 - alpha return -alpha_t * (1-pt)**gamma * torch.log(pt)

2.2 关键参数的作用机制

参数	作用	典型值	影响方向
α (alpha)	平衡正负样本权重	0.25	增大α会增加正样本重要性
γ (gamma)	调节难易样本权重	2.0	增大γ会聚焦于更难样本

这两个参数在实际应用中需要联合调整：

当γ增大时，简单样本的权重会被进一步压制，此时可能需要适当增大α来补偿正样本的损失
实验表明γ=2, α=0.25在大多数目标检测任务中表现良好

2.3 损失曲线的对比分析

通过绘制不同损失函数的曲线，可以直观理解Focal Loss的优势：

import matplotlib.pyplot as plt import numpy as np p = np.linspace(0.01, 0.99, 100) ce = -np.log(p) focal_loss_gamma1 = - (1-p)**1 * np.log(p) focal_loss_gamma2 = - (1-p)**2 * np.log(p) plt.plot(p, ce, label='Cross Entropy') plt.plot(p, focal_loss_gamma1, label='Focal Loss (γ=1)') plt.plot(p, focal_loss_gamma2, label='Focal Loss (γ=2)') plt.xlabel('Probability of ground truth class') plt.ylabel('Loss value') plt.legend() plt.show()

从曲线可以看出：

当p→1（易分类样本）时，Focal Loss的值急剧下降
γ越大，对易分类样本的抑制越强
难样本(p较小)的损失相对权重增加

3. PyTorch实现Focal Loss的工程细节

3.1 基础实现版本

class FocalLoss(nn.Module): def __init__(self, alpha=0.25, gamma=2, reduction='mean'): super(FocalLoss, self).__init__() self.alpha = alpha self.gamma = gamma self.reduction = reduction def forward(self, inputs, targets): BCE_loss = F.binary_cross_entropy_with_logits( inputs, targets, reduction='none') pt = torch.exp(-BCE_loss) alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets) FL_loss = alpha_t * (1 - pt) ** self.gamma * BCE_loss if self.reduction == 'mean': return FL_loss.mean() elif self.reduction == 'sum': return FL_loss.sum() return FL_loss

关键实现要点：

使用binary_cross_entropy_with_logits确保数值稳定性
通过torch.exp(-BCE_loss)计算pt
动态计算alpha_t，对正负样本应用不同权重
支持不同的reduction方式（mean/sum/none）

3.2 多分类扩展版本

对于多分类任务，需要对每个类别独立计算Focal Loss：

class MultiClassFocalLoss(nn.Module): def __init__(self, num_classes, alpha=None, gamma=2, reduction='mean'): super(MultiClassFocalLoss, self).__init__() self.num_classes = num_classes self.gamma = gamma self.reduction = reduction if alpha is None: self.alpha = torch.ones(num_classes) else: self.alpha = torch.tensor(alpha) def forward(self, inputs, targets): log_softmax = F.log_softmax(inputs, dim=1) ce_loss = -log_softmax * targets pt = torch.exp(-ce_loss) alpha_t = self.alpha.to(inputs.device)[torch.argmax(targets, dim=1)] alpha_t = alpha_t.unsqueeze(1) FL_loss = alpha_t * (1 - pt) ** self.gamma * ce_loss if self.reduction == 'mean': return FL_loss.mean() elif self.reduction == 'sum': return FL_loss.sum() return FL_loss

3.3 训练过程中的实用技巧

学习率调整策略：
- 初始学习率可以比普通CE损失稍大（约1.5-2倍）
- 配合余弦退火或带热重启的学习率调度效果更好
Batch Size选择：
- Focal Loss对batch size更敏感
- 建议使用较大的batch size（≥32）以获得稳定的梯度估计
参数初始化：
- 最后一层的bias初始化为-log((1-π)/π)，其中π=0.01
- 这有助于训练初期的稳定性

4. RetinaNet中的Focal Loss实战应用

4.1 与RetinaNet架构的集成

在RetinaNet中，Focal Loss主要应用于分类分支。典型实现结构如下：

class RetinaNetClassifier(nn.Module): def __init__(self, in_channels, num_anchors, num_classes): super().__init__() self.conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1) self.conv2 = nn.Conv2d(in_channels, in_channels, 3, padding=1) self.conv3 = nn.Conv2d(in_channels, in_channels, 3, padding=1) self.conv4 = nn.Conv2d(in_channels, in_channels, 3, padding=1) self.output = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1) # 初始化输出层的bias prior_prob = 0.01 bias_value = -math.log((1 - prior_prob) / prior_prob) self.output.bias.data.fill_(bias_value) def forward(self, x): x = F.relu(self.conv1(x)) x = F.relu(self.conv2(x)) x = F.relu(self.conv3(x)) x = F.relu(self.conv4(x)) return self.output(x)

4.2 训练流程的关键调整

Anchor匹配策略：
- 正样本：IoU > 0.5
- 负样本：IoU < 0.4
- 忽略样本：0.4 ≤ IoU ≤ 0.5
损失计算细节：
- 分类损失：Focal Loss（所有样本）
- 回归损失：Smooth L1 Loss（仅正样本）

def compute_loss(classification, regression, anchors, annotations): # 1. Anchor匹配 matched_idxs, targets = match_anchors(anchors, annotations) # 2. 准备分类目标 cls_targets = prepare_cls_targets(matched_idxs, targets) # 3. 计算Focal Loss classification = classification.view(-1, num_classes) cls_targets = cls_targets.view(-1, num_classes) cls_loss = focal_loss(classification, cls_targets) # 4. 计算回归损失 pos_indices = (matched_idxs > 0).nonzero().squeeze(1) if pos_indices.numel() > 0: regression = regression.view(-1, 4) reg_targets = prepare_reg_targets(matched_idxs, targets) reg_loss = smooth_l1_loss(regression[pos_indices], reg_targets[pos_indices]) else: reg_loss = torch.tensor(0).float().to(device) return cls_loss, reg_loss