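Residual Networks (ResNet), Kaiming He and Jian Sun @ MSRA.
- A 152-layer residual network, 8× deeper than VGG nets, yet with lower complexity and better results.
- 3.57% top-5 error on the ImageNet test set
- 28% relative improvement on the COCO object detection dataset
- 1st place in the ILSVRC & COCO 2015 competitions on the tasks of ImageNet detection, ImageNet localization,
COCO detection, and COCO segmentation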
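- Deep convolutional networks (CNNs) were a major breakthrough for image classification.
The best ImageNet results all come from very deep models, from 16 to 30 layers. Deep models also help on other vision tasks.
The biggest problem in training deep models is the vanishing gradient, which prevents the model from converging.
The vanishing gradient problem has largely been addressed by normalized initialization and intermediate normalization layers.
The stacked layers are made to fit the residual $(\mathcal{F}(x):=\mathcal{H}(x) - x)$, and a shortcut (direct) connection is provided so that the output becomes $(\mathcal{F}(x)+x)$.
The whole network can be trained end-to-end with SGD and backpropagation, and can be implemented with existing solvers.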
We present successfully trained models on this dataset (CIFAR-10) with
over 100 layers, and explore models with over 1000 layers.
Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC
2015 classification competition.
The extremely deep representations also have excellent generalization performance
on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization,
COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions.
Residual Representations.
Shortcut Connection
Highway networks use gating functions on the shortcut connections; these gates have parameters that must be learned from data.
Deep Residual Learning
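- Hypothesis (still an open question): multiple nonlinear layers can approximate complicated functions.
- When the input and output have the same dimensions, one can equally hypothesize that the layers approximate the residual $(\mathcal{H}(x) - x)$. Although approximating the original function and approximating the residual are both complicated, the residual is hypothesized to be easier to optimize;
this suggests that the identity mapping is a good prior.
- The basic building block of a residual network is: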
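$$
y = \mathcal{F}(x, \{W_i\}) + x \\
\mathcal{F} = W_2 \sigma(W_1 x) \\
\sigma = \mathrm{ReLU}
$$
When the input and output dimensions differ, the shortcut applies a linear projection $(W_s)$:
$$
y = \mathcal{F}(x, \{W_i\}) + W_s x
$$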
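The building block above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: fully connected layers instead of convolutions, no biases, no batch normalization, and the helper names `relu`, `matvec` and `residual_block` are made up for this note rather than taken from the paper.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2, Ws=None):
    """Compute y = W2 * relu(W1 * x) + shortcut(x).

    The shortcut is the identity when Ws is None, otherwise the
    linear projection Ws * x (for mismatched dimensions).
    """
    f = matvec(W2, relu(matvec(W1, x)))  # residual branch F(x, {W_i})
    shortcut = x if Ws is None else matvec(Ws, x)
    return [a + b for a, b in zip(f, shortcut)]


x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With all-zero weights the residual branch vanishes and the block is the identity.
print(residual_block(x, zeros, zeros))  # [1.0, -2.0]
```

With zero weights the block reduces to the identity mapping, which is one way to read the claim that the identity is a convenient starting point for the layers to refine.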
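Overfitting is given as the cause, since strong regularization such as MaxOut[1] and Dropout[2] was not used.
By this time he (Kaiming He) had already joined the Facebook AI team!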
- AlexNet, 8 layers (ILSVRC 2012)
- VGG, 19 layers (ILSVRC 2014)
- GoogLeNet, 22 layers (ILSVRC 2014)
- ResNet, 152 layers (ILSVRC 2015)
- More than 200 citations in the 6 months after being posted on arXiv (Dec. 2015)
- 5 layers: easy
- 10 layers: initialization, Batch Normalization
- 30 layers: skip connections
- 100 layers: identity skip connections
- LeCun et al 1998 “Efficient Backprop”
- Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
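As a rough illustration of what "normalized initialization" means in the Glorot & Bengio reference, the sketch below samples weights from a uniform distribution whose range depends on the layer's fan-in and fan-out. The function name `glorot_uniform` and the layer sizes are assumptions chosen for this note; real frameworks have their own variants.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)
W = glorot_uniform(fan_in=256, fan_out=128, rng=rng)
flat = [w for row in W for w in row]
variance = sum(w * w for w in flat) / len(flat)
# The target variance is 2 / (fan_in + fan_out); the sample variance should be close to it.
print(variance, 2.0 / (256 + 128))
```

The point of the scaling is to keep the variance of activations and gradients roughly constant from layer to layer, which is what lets networks of about 10 layers train at all.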
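Batch Normalization
- Normalizes the input
- Normalizes each layer for each mini-batch
- Greatly accelerates training
- Reduces sensitivity to initialization
- Improves regularization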
$$
\hat{x} = \frac{x - \mu}{\sigma} \\
y = \gamma \hat{x} + \beta
$$
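- $(\mu, \sigma)$ are the mean and standard deviation of the mini-batch, computed from the data.
- $(\gamma, \beta)$ are the scale and shift, which the model has to learn.
- Note: during training the mean and variance are computed from the current mini-batch, whereas at test time the (averaged) statistics computed on the training set are used.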
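The difference between training and test behaviour can be made concrete with a small sketch for a single scalar feature. The class name `BatchNorm1D`, the moving-average `momentum`, and the small constant `eps` added under the square root are assumptions of this example, not part of the notes above; $(\gamma, \beta)$ are held fixed here instead of being learned.

```python
import math


class BatchNorm1D:
    """Batch normalization for a single scalar feature (illustrative only)."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0  # scale and shift (fixed here, normally learned)
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def forward(self, batch, training):
        if training:
            # Statistics come from the current mini-batch.
            mu = sum(batch) / len(batch)
            var = sum((x - mu) ** 2 for x in batch) / len(batch)
            # Keep moving averages for use at test time.
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mu
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Statistics come from the averages accumulated during training.
            mu, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mu) + self.beta for x in batch]


bn = BatchNorm1D()
train_out = bn.forward([1.0, 2.0, 3.0, 4.0], training=True)
test_out = bn.forward([1.0, 2.0, 3.0, 4.0], training=False)
print(train_out)
print(test_out)
```

The two printed lists differ because the test-time pass uses the accumulated averages instead of the statistics of the batch it is given.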
Deep Residual Network, 10-100 layers
- Simply stacking more layers makes things worse!
$$
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}_i(x_i) \\
\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}_i(x_i)\right)
$$
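A toy numerical check can show why the additive "1" in the gradient matters. The sketch below assumes scalar units with a hypothetical residual function $(\mathcal{F}_i(x) = w_i \tanh(x))$ and small weights, and estimates $(\partial x_L / \partial x_l)$ by finite differences for a plain stack and for a stack with identity shortcuts. Neither the choice of $(\mathcal{F}_i)$ nor the function names come from the paper.

```python
import math


def forward(x, weights, residual):
    """Run scalar layers F_i(x) = w_i * tanh(x), with or without identity shortcuts."""
    for w in weights:
        f = w * math.tanh(x)
        x = x + f if residual else f
    return x


def grad(x, weights, residual, h=1e-6):
    """Central finite-difference estimate of d x_L / d x_l."""
    up = forward(x + h, weights, residual)
    down = forward(x - h, weights, residual)
    return (up - down) / (2 * h)


weights = [0.1] * 20  # 20 stacked units with small weights
plain = grad(0.5, weights, residual=False)
shortcut = grad(0.5, weights, residual=True)
print(plain)     # vanishingly small: a product of 20 factors, each below 0.1
print(shortcut)  # stays of order 1: the identity term survives
```

In the plain stack the gradient is a product of small factors and collapses toward zero, while with identity shortcuts the signal from $(x_L)$ reaches $(x_l)$ directly.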
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
- Representation
- skipping 1 layer vs. multiple layers?
- Flat vs. Bottleneck?
- Inception-ResNet [Szegedy et al 2016]
- ResNet in ResNet [Targ et al 2016]
- Width vs. Depth [Zagoruyko & Komodakis 2016]
- Generalization
- DropOut, MaxOut, DropConnect, …
- Drop Layer (Stochastic Depth) [Huang et al 2016]
- Optimization
- Without residual/shortcut?