Residual Networks

About

Residual Networks, by Kaiming He and Jian Sun @ MSRA.

A guided reading of the 2015 residual networks paper.

Abstract

Introduction

Key papers on CNNs:

  1. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
    W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten
    zip code recognition. Neural computation, 1989.
  2. A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification
    with deep convolutional neural networks. In NIPS, 2012.
  3. P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun.
    Overfeat: Integrated recognition, localization and detection
    using convolutional networks. In ICLR, 2014.
  4. M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional
    neural networks. In ECCV, 2014.

Recent work shows that the depth of the stacked layers is of crucial importance.

  1. K. Simonyan and A. Zisserman. Very deep convolutional networks
    for large-scale image recognition. In ICLR, 2015.
  2. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
    V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions.
    In CVPR, 2015.

The leading results on ImageNet all use very deep models, with a depth of sixteen to thirty layers. Deep models also help with other image tasks.

  1. K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers:
    Surpassing human-level performance on imagenet classification. In
    ICCV, 2015.
  2. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep
    network training by reducing internal covariate shift. In ICML, 2015.
  3. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
    Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet
    large scale visual recognition challenge. arXiv:1409.0575, 2014.

The biggest problem in training deep models is the vanishing (and exploding) gradient, which hampers convergence from the start.
With the tricks discussed below, models tens of layers deep can still be trained with backpropagation and SGD.

  1. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies
    with gradient descent is difficult. IEEE Transactions on Neural
    Networks, 5(2):157–166, 1994.
  2. X. Glorot and Y. Bengio. Understanding the difficulty of training
    deep feedforward neural networks. In AISTATS, 2010.

The vanishing-gradient problem has been largely addressed by normalized initialization and intermediate normalization layers.

  1. Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop.
    In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.
  2. A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to
    the nonlinear dynamics of learning in deep linear neural networks.
    arXiv:1312.6120, 2013.
  3. K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers:
    Surpassing human-level performance on imagenet classification. In
    ICCV, 2015.
  4. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep
    network training by reducing internal covariate shift. In ICML, 2015.

As depth increases, accuracy saturates and then degrades rapidly. This degradation is not caused by overfitting: the deeper model is worse even on the training data!
In principle, a deeper model should be able to do at least as well as a shallower one: imagine the extra layers are identity mappings, in which case the deep model reproduces the shallow model's result exactly.
In practice, however, current solvers do not find such a solution.

A residual network does not fit the target mapping directly; it fits the residual. Let the underlying target mapping be $\mathcal{H}(x)$; the stacked nonlinear layers learn the residual
$\mathcal{F}(x) := \mathcal{H}(x) - x$, and a shortcut (skip) connection is added so that the block outputs $\mathcal{F}(x) + x$.
The hypothesis is that the residual is easier to optimize than the original mapping (a hypothesis!).
In the extreme case, the nonlinear layers can simply be driven to zero, so that the block passes its input straight through. (My thought: with a regularization term this solution is indeed favored, so does that prove that a residual network will never be worse than the shallower network?)
A shortcut connection may skip one or more layers. Identity shortcuts (shortcuts that output the input unchanged) add neither extra parameters nor extra computation.
The whole network can be trained end-to-end with SGD plus backpropagation, using existing solvers without modification.
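
A minimal numpy sketch of this reformulation (my own illustration, not code from the paper; the 64-dimensional shapes are arbitrary): the block outputs $\mathcal{F}(x) + x$, and driving the residual branch to zero turns the block into an exact identity mapping, which is the extreme case described above.

```python
import numpy as np

def residual_block(x, W1, W2):
    """One residual block: y = F(x) + x, with F(x) = W2 @ relu(W1 @ x)."""
    relu = lambda z: np.maximum(z, 0.0)
    F = W2 @ relu(W1 @ x)   # the residual branch (two weight layers)
    return F + x            # identity shortcut: add the input back

x = np.random.randn(64)

# Ordinary weights: the block learns a residual on top of x.
W1, W2 = 0.1 * np.random.randn(64, 64), 0.1 * np.random.randn(64, 64)
y = residual_block(x, W1, W2)

# Extreme case from the text: zero out the residual branch and the block
# becomes an exact identity mapping, so the deeper net can always fall
# back to behaving like its shallower counterpart.
y_id = residual_block(x, np.zeros((64, 64)), np.zeros((64, 64)))
assert np.allclose(y_id, x)
```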

  1. C. M. Bishop. Neural networks for pattern recognition. Oxford
    university press, 1995.
  2. B. D. Ripley. Pattern recognition and neural networks. Cambridge
    university press, 1996.
  3. W. Venables and B. Ripley. Modern applied statistics with s-plus.
    1999.

  4. The ImageNet paper: O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
    Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet
    large scale visual recognition challenge. arXiv:1409.0575, 2014.

  5. The CIFAR-10 paper: A. Krizhevsky. Learning multiple layers of features from tiny images.
    Tech Report, 2009.

We present successfully trained models on this dataset (CIFAR-10) with
over 100 layers, and explore models with over 1000 layers.
Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC
2015 classification competition.
The extremely deep representations also have excellent generalization performance
on other recognition tasks, and led us to further win 1st places in ImageNet detection, ImageNet localization,
COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions.

Related Work

Residual Representations.

Shortcut Connections.

Shortcut connections have been studied for a long time (quite a while indeed!). Early work on multilayer perceptrons added a separate linear layer connecting the network input directly to the output.
In two later papers, some intermediate layers are connected directly to auxiliary classifiers as a way of countering vanishing gradients.

  1. C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply supervised
    nets. arXiv:1409.5185, 2014.
  2. R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks.
    arXiv:1505.00387, 2015.

Other related work is omitted here.

Highway networks put a gating function on the shortcut connection; the gate has parameters that must be learned from data.
When the gated shortcut is closed (its value approaches 0), the shortcut contribution vanishes and the layer represents a plain, non-residual function rather than a residual one.
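
A rough sketch (my own, with hypothetical weight shapes) contrasting the two kinds of shortcut: a highway layer mixes the transform and the input with a learned gate, while a residual block always adds the input back through a parameter-free identity shortcut.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, Wh, Wt, bt):
    """Highway layer: y = T(x) * H(x) + (1 - T(x)) * x.

    The shortcut path (1 - T(x)) * x is gated, and the gate T has its own
    parameters (Wt, bt) learned from data. When that gated shortcut closes
    (T -> 1), the shortcut term vanishes and the layer is a plain,
    non-residual transform H(x).
    """
    H = np.tanh(Wh @ x)        # candidate transform
    T = sigmoid(Wt @ x + bt)   # transform gate, data-dependent
    return T * H + (1.0 - T) * x

def residual_block(x, W1, W2):
    """Residual block: y = F(x) + x; the shortcut is always the identity, with no parameters."""
    F = W2 @ np.maximum(W1 @ x, 0.0)
    return F + x
```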

Deep Residual Learning

$$
y = \mathcal{F}(x, \{W_i\}) + x \\
\mathcal{F}(x, \{W_i\}) = W_2 \, \sigma(W_1 x) \\
\sigma = \mathrm{ReLU}
$$

If the input and output dimensions differ, a linear projection on the shortcut can be used to match them. $W_s$ is only there to solve the dimension mismatch; when the dimensions already match, the identity mapping is all that is needed.

$$
y = \mathcal{F}(x, \{W_i\}) + W_s x
$$
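
A sketch of this building block in PyTorch (my own code, not the official implementation; the two 3x3 convolutions with batch norm and the 1x1 projection follow the paper's description, but the concrete channel counts and input size below are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """y = F(x, {W_i}) + x, with a 1x1 projection (W_s) on the shortcut only when shapes differ."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Residual branch: two 3x3 convolutions with batch norm, i.e. F(x) = W2 * relu(W1 * x)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Shortcut: identity when shapes match, otherwise a projection W_s (1x1 convolution)
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + self.shortcut(x))   # add the shortcut, then the final ReLU

# A block that doubles the channels and halves the spatial size, so the projection is used:
block = BasicBlock(64, 128, stride=2)
y = block(torch.randn(1, 64, 56, 56))           # -> shape (1, 128, 28, 28)
```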

Kaiming He's slides @ ICML 2016

By this time he had already moved to the Facebook AI team!

The evolution of depth

> 200 citations in 6 months after being posted on arXiv (Dec. 2015)

A spectrum of depth

Initialization tricks

To sum up: good initialization matters. For deeper networks (20-30 layers) a good initialization can speed up convergence, and a bad one may prevent convergence altogether; a small numerical sketch follows the references below.

  1. LeCun et al 1998 “Efficient Backprop”
  2. Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
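
A small numerical sketch of the point above (my own, not from the slides): push a random batch through 30 plain ReLU layers and watch the activation scale. With He-style initialization (std = sqrt(2 / fan_in)) the scale is preserved; with a naive fixed std the activations collapse toward zero after a few tens of layers, and the backward gradients behave the same way.

```python
import numpy as np

def forward_stats(num_layers=30, width=256, scheme="he"):
    """Push a random batch through `num_layers` ReLU layers and report the activation scale."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal((128, width))
    for _ in range(num_layers):
        if scheme == "he":
            std = np.sqrt(2.0 / width)   # He et al. 2015: preserves variance under ReLU
        else:
            std = 0.01                   # a naive fixed scale, too small for this width
        W = rng.standard_normal((width, width)) * std
        x = np.maximum(x @ W, 0.0)       # linear layer + ReLU
    return float(np.std(x))

print("he    :", forward_stats(scheme="he"))     # stays at a healthy scale
print("naive :", forward_stats(scheme="naive"))  # collapses toward zero
```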

Batch Normalization

$$
\hat{x} = \frac{x - \mu}{\sigma} \\
y = \gamma \hat{x} + \beta
$$
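
A direct numpy transcription of the two formulas (training-mode batch statistics only; the small epsilon and the running averages used at test time are standard details not shown in the slide formula):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift."""
    mu = x.mean(axis=0)                  # per-feature batch mean
    sigma = x.std(axis=0)                # per-feature batch std
    x_hat = (x - mu) / (sigma + eps)     # \hat{x} = (x - mu) / sigma
    return gamma * x_hat + beta          # y = gamma * \hat{x} + beta

x = np.random.randn(32, 8) * 5.0 + 3.0   # a batch with an arbitrary scale and shift
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```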

Deep Residual Networks: from 10 to 100 layers

The importance of identity mappings

With identity mappings:

$$
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}_i(x_i) \\
\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}_i(x_i)\right)
$$

With identity mappings, the gradient from any deeper block is passed back with a direct, unscaled term.
With any other scaling of the shortcut, that term either decays or explodes once the network becomes deep.

After the addition it is again best to keep the path a pure identity (my take: this is still a gradient-propagation issue; the mapping on the shortcut has to be norm-preserving to avoid vanishing or exploding gradients). This motivates the pre-activation design; a quick numerical check follows the reference below.

  1. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
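
A quick autograd check of the claim above (my own construction; weak random linear maps stand in for the residual branches $\mathcal{F}_i$): replace the identity shortcut by a scaled one, $x_{l+1} = \lambda x_l + \mathcal{F}(x_l)$, and the gradient reaching the input decays for $\lambda < 1$ and explodes for $\lambda > 1$, while $\lambda = 1$ keeps it at a stable scale.

```python
import torch

def input_grad_norm(lam, depth=50, width=64, seed=0):
    """Gradient norm at the input of a stack of blocks x_{l+1} = lam * x_l + F(x_l)."""
    torch.manual_seed(seed)
    x = torch.randn(width, requires_grad=True)
    h = x
    for _ in range(depth):
        W = 0.01 * torch.randn(width, width)   # a weak residual branch F(x) = W x
        h = lam * h + W @ h                    # lam = 1 is the identity shortcut
    h.sum().backward()
    return x.grad.norm().item()

for lam in (1.0, 0.8, 1.2):
    print(f"lambda = {lam}: |dE/dx_l| = {input_grad_norm(lam):.3e}")
# The identity shortcut (lam = 1) keeps the gradient at a stable scale;
# lam = 0.8 shrinks it roughly like 0.8**50, lam = 1.2 blows it up like 1.2**50.
```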

Future directions

References

  1. Deep Residual Learning for Image Recognition
  2. Deep Residual Networks