CS231N - Convolutional Neural Networks for Visual Recognition

About

Course taught by Fei-Fei Li at Stanford; see http://cs231n.stanford.edu/

Backpropagation and Computational Graphs

Each node in the graph implements a forward pass (computing the activation) and a backward pass (computing gradients); the edges of the graph correspond to variables.

Efficient Backpropagation

LeCun's 1998 paper gives several tricks for backpropagation.

Weight initialization: assuming unit-variance, uncorrelated inputs, the standard deviation of a unit's weighted sum is

$$
\sigma_{y_i} = (\sum_j w_{ij}^2)^{1/2}
$$

To ensure the output variance is also 1, the weights should be drawn with standard deviation

$$
\sigma_w = m^{-1/2}
$$

where m is the number of inputs (fan-in). For partially connected networks (e.g., CNNs, TDNNs), m should be the number of incoming connections to the unit.

Momentum is another trick: reuse part of the previous weight change:

$$
\Delta w(t+1) = \eta \frac{\partial E_{t+1}}{\partial w} + \mu \Delta w(t)
$$
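A minimal numpy sketch of this update rule (the function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def sgd_momentum(w, grad_fn, eta=0.01, mu=0.9, steps=100):
    """Momentum update: dw(t+1) = eta * dE/dw + mu * dw(t), then w -= dw."""
    dw = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)          # dE/dw at the current weights
        dw = eta * g + mu * dw  # accumulate a velocity term
        w = w - dw              # step against the gradient
    return w

# Example: minimize E(w) = ||w||^2 / 2, whose gradient is w itself.
w_opt = sgd_momentum(np.array([3.0, -2.0]), grad_fn=lambda w: w)
```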

Automatic Differentiation

Four ways to compute gradients:

  1. Derive the gradient formulas by hand, then implement them: error-prone and time-consuming.
  2. Numerical differentiation (finite differences): simple to implement, but slow and imprecise; still useful for gradient checking, as sketched after this list.
  3. Symbolic differentiation (Mathematica, Maple): the resulting expressions tend to blow up (expression swell), and the input must be in closed form.
  4. Automatic differentiation: machine precision, and within a constant factor of the ideal asymptotic cost (excellent performance!).
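As a concrete example of method 2, a finite-difference check is often used to verify method 1's hand-derived gradients; a minimal numpy sketch:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered differences: df/dx_i ~ (f(x + h*e_i) - f(x - h*e_i)) / (2h)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fp = f(x)                        # f at x + h*e_i
        x.flat[i] = old - h
        fm = f(x)                        # f at x - h*e_i
        x.flat[i] = old                  # restore the original value
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# Check the analytic gradient of f(x) = sum(x^2), which is 2x.
x = np.random.randn(5)
assert np.allclose(numerical_gradient(lambda v: np.sum(v ** 2), x), 2 * x, atol=1e-6)
```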

References

  1. Automatic differentiation in machine learning: a survey, 2015
  2. Efficient BackProp, LeCun et al., 1998
  3. https://cs231n.github.io/optimization-2/
  4. Stochastic Gradient Tricks

History of Neural Networks

The Perceptron

Frank Rosenblatt, 1957

$$
y = f(wx + b), \qquad
f(z) = \begin{cases} 1 & z > 0 \\ 0 & \text{otherwise} \end{cases}
$$

Weight update:

$$
w_i(t+1) = w_i(t) + \alpha (d_j - y_j(t))x_{j, i}
$$

This is equivalent to the following loss plus SGD (note: here the labels are +1/-1, unlike the 0/1 outputs above; this is just for notational convenience):

$$
\max(0, -d_j \, y_j)
$$
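A minimal sketch of the perceptron rule on made-up, linearly separable data (the function name and data are illustrative):

```python
import numpy as np

def train_perceptron(X, d, alpha=1.0, epochs=10):
    """Perceptron rule: w += alpha * (d_j - y_j) * x_j, with 0/1 outputs."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_j, d_j in zip(X, d):
            y_j = 1.0 if w @ x_j + b > 0 else 0.0  # threshold unit f(z)
            w += alpha * (d_j - y_j) * x_j         # no change when y_j == d_j
            b += alpha * (d_j - y_j)
    return w, b

# Toy data: logical AND, which is linearly separable.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 0., 0., 1.])
w, b = train_perceptron(X, d)
```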

Three-Layer Neural Networks

Rumelhart et al., 1986: the backpropagation (BP) algorithm.

Deep Networks via RBMs

Hinton 2006

First Strong Results

Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, George Dahl, Dong Yu, Li Deng, Alex Acero, 2010, MSR

ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012

Activation Functions

For example, the ELU (exponential linear unit):

$$
f(x) = \begin{cases} x & x > 0 \\ \alpha (\exp(x) - 1) & x \le 0 \end{cases}
$$

Centering: subtract the mean image (AlexNet), or subtract the per-channel mean (VGGNet).

Initialization:

"Xavier initialization" [Glorot et al., 2010]: $n_{in} \, \mathrm{Var}(w) = 1$.

For ReLU, since half of the activations are zeroed out, there is an extra factor of 1/2: $\frac{1}{2} n_{in} \, \mathrm{Var}(w) = 1$ (He et al., 2015). A numpy sketch of both rules follows.
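Layer sizes below are illustrative:

```python
import numpy as np

n_in, n_out = 512, 256

# Xavier: Var(w) = 1 / n_in, i.e. std = 1 / sqrt(n_in)  (tanh-like units)
W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# He: Var(w) = 2 / n_in, i.e. std = sqrt(2 / n_in)      (ReLU units)
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```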

Papers:

  1. Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
  2. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015

Batch Normalization [Ioffe and Szegedy, 2015]: normalize activations to a standard normal distribution, then let the network learn its own affine transform (scale and shift). Usually inserted between a fully connected layer and the activation function. A minimal sketch of the forward pass follows.
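Here gamma and beta are the learned affine parameters:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) minibatch. Normalize each feature over the batch, then scale/shift."""
    mu = x.mean(axis=0)                     # per-feature mean
    var = x.var(axis=0)                     # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # roughly zero mean, unit variance
    return gamma * x_hat + beta             # learned affine transform
```

(At test time, running averages of mu and var collected during training are used instead of batch statistics.)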

CNN

  1. LeNet-5: Gradient-Based Learning Applied to Document Recognition, LeCun, Bottou, Bengio, Haffner, 1998
  2. AlexNet: ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky, Sutskever, Hinton, 2012
  3. ZFNet: Visualizing and Understanding Convolutional Networks, Zeiler and Fergus, 2013
  4. VGGNet: Very Deep Convolutional Networks for Large-Scale Image Recognition
  5. GoogLeNet: Going Deeper with Convolutions
  6. ResNet: Deep Residual Learning for Image Recognition

Object Detection

Task: classification + localization

Localization as Regression: input an image, output 4 coordinates (very simple!). Use an L2 loss and treat it as a regression problem.
Classification + localization as a multi-task problem, sharing the same CNN as the feature extractor.

Input image => CNN => FC => softmax / box regression
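A sketch of the combined multi-task objective behind this pipeline: softmax cross-entropy for the classification head plus an L2 loss for the box head (the weight `lam` and all names are illustrative):

```python
import numpy as np

def cls_loc_loss(class_scores, true_class, box_pred, box_true, lam=1.0):
    """Cross-entropy on the class scores + L2 on the 4 box coordinates."""
    scores = class_scores - class_scores.max()          # stabilize the softmax
    log_probs = scores - np.log(np.exp(scores).sum())
    ce = -log_probs[true_class]                         # classification term
    l2 = np.sum((box_pred - box_true) ** 2)             # localization term
    return ce + lam * l2

loss = cls_loc_loss(np.random.randn(10), true_class=3,
                    box_pred=np.random.randn(4),
                    box_true=np.array([10., 20., 50., 80.]))
```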

Where to attach the regression head: after the conv layers (Overfeat, VGG), or after the fully connected (FC) layers (DeepPose, R-CNN).

Detecting multiple objects: share the conv layers as the feature extractor.

Pose estimation: Toshev and Szegedy, "DeepPose: Human Pose Estimation via Deep Neural Networks", CVPR 2014

Sliding window: Overfeat (Integrated Recognition, Localization and Detection using Convolutional Networks, ICLR 2014)

With a sliding window, object detection can be cast as a classification problem! But it requires evaluating an enormous number of window positions and scales!

HOG (Histogram of Oriented Gradients)

Dalal and Triggs, “Histograms of Oriented Gradients for Human Detection”, CVPR 2005

Deformable Parts Model (DPM)

Felzenszwalb et al., "Object Detection with Discriminatively Trained Part Based Models", PAMI 2010

Is DPM a CNN?

Girshick et al., "Deformable Part Models are Convolutional Neural Networks", CVPR 2015

Detection as Classification:

Problem: many positions and scales must be tested; computationally expensive!

Solution: only test a small subset of candidate regions!

How to:

Region Proposals: Selective Search

Bottom-up: over-segment the image, then merge similar regions level by level, producing segmentations (and candidate boxes) at multiple scales.

Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013

Other methods: EdgeBoxes?

Review of proposal methods: Hosang et al., "What makes for effective detection proposals?", PAMI 2015

R-CNN! Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014

Object detection datasets: PASCAL VOC (2010), ImageNet Detection (ILSVRC 2014), MS-COCO (2014)

Evaluation metric: mean average precision (mAP)

R-CNN problems: a full CNN forward pass per region proposal makes training and testing slow, and training is a multi-stage pipeline.

Fast R-CNN: Girshick, "Fast R-CNN", ICCV 2015

ROI (region of interest) pooling (see the sketch after this list):

  1. Run conv + pooling over the whole image once to obtain a high-resolution feature map.
  2. Project each region proposal onto the feature map, divide the projected region into an h × w grid, and max-pool within each cell to get a fixed-size feature.
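A minimal numpy sketch of the ROI max-pooling step (the feature map and ROI coordinates below are made up):

```python
import numpy as np

def roi_max_pool(fmap, x0, y0, x1, y1, h=7, w=7):
    """Divide ROI [y0:y1, x0:x1] of fmap (H, W, C) into an h*w grid and
    max-pool each cell, producing a fixed-size (h, w, C) output."""
    ys = np.linspace(y0, y1, h + 1).astype(int)  # grid row boundaries
    xs = np.linspace(x0, x1, w + 1).astype(int)  # grid column boundaries
    out = np.empty((h, w, fmap.shape[2]))
    for i in range(h):
        for j in range(w):
            cell = fmap[ys[i]:max(ys[i + 1], ys[i] + 1),   # guard empty cells
                        xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max(axis=(0, 1))
    return out

pooled = roi_max_pool(np.random.randn(32, 32, 256), x0=4, y0=6, x1=20, y1=28)
```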

Training is 8.8× faster and testing 146× faster than R-CNN!

Caveat: the test-time speedup does not include region proposal extraction!

Faster R-CNN: add a Region Proposal Network (RPN) on top of the last conv layer.

Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015

Speeds up test time by another 10×!

Visualizing Representations

t-SNE visualization: two images are placed nearby if their CNN codes are close.
Laurens van der Maaten and Geoffrey Hinton, 2008.

Deconv approach: pick a CNN layer, zero out all the gradients at that layer except one, then backprop to the input to obtain the "deconv" image (BP to image).

Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013

Optimization-to-Image approach: find an image that maximizes the score of a chosen class:

$$
\arg \max_{I} S_c(I) - \lambda ||I||_2^2
$$

  1. Initialize the input to zeros, i.e., feed in an all-zero image.
  2. Set the output layer's gradient to a one-hot vector (1 for the chosen class, 0 elsewhere), then backprop to the image! (A toy sketch follows.)
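A toy, self-contained sketch of the procedure: the "network" here is just a linear scorer $S_c(I) = W_c \cdot I$ so that the ascent step is explicit (a real use would backprop through a trained CNN):

```python
import numpy as np

np.random.seed(0)
D, num_classes, c = 3072, 10, 3        # flattened 32x32x3 image, target class c
W = np.random.randn(num_classes, D)    # stand-in for a trained network
I = np.zeros(D)                        # step 1: all-zero image

lr, lam = 0.1, 1e-3
for _ in range(100):
    # step 2: gradient of S_c(I) - lam * ||I||_2^2 w.r.t. I.
    # For the linear scorer, dS_c/dI = W[c]; a CNN would get this by backprop.
    grad = W[c] - 2 * lam * I
    I += lr * grad                     # gradient ascent on the image itself
```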

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014.

Understanding Neural Networks Through Deep Visualization, Yosinski et al. , 2015

Question: given a CNN code, can the original image be reconstructed?

Understanding Deep Image Representations by Inverting Them, Mahendran and Vedaldi, 2014

DeepDream, https://github.com/google/deepdream


RNN

Character-level language models

Image captioning: feed the CNN-extracted image features into the RNN as an extra input to the hidden layer. The RNN's first input is a fixed start token; at each subsequent step, the previous output word is fed back in as the input.
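A sketch of one recurrent step with the CNN code v as the extra hidden-layer input (all weights below are random stand-ins):

```python
import numpy as np

def captioning_rnn_step(x, h, v, Wxh, Whh, Wih, b):
    """h' = tanh(Wxh @ x + Whh @ h + Wih @ v): the image feature v
    feeds into the hidden state alongside the usual word input x."""
    return np.tanh(Wxh @ x + Whh @ h + Wih @ v + b)

D_x, D_h, D_v = 128, 256, 512  # word embedding, hidden, CNN feature dims
rng = np.random.default_rng(0)
Wxh = rng.normal(size=(D_h, D_x))
Whh = rng.normal(size=(D_h, D_h))
Wih = rng.normal(size=(D_h, D_v))
h = captioning_rnn_step(rng.normal(size=D_x), np.zeros(D_h),
                        rng.normal(size=D_v), Wxh, Whh, Wih, np.zeros(D_h))
```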

Image-sentence datasets: Microsoft COCO [Tsung-Yi Lin et al., 2014]

While generating each word, the RNN attends to different parts of the image: Show, Attend and Tell, Xu et al., 2015

CNN practice

Data Augmentation:

Transform the images:

  1. Horizontal flips
  2. Random crops and scales

Training: sample random crops / scales (see the sketch after the testing recipe below). ResNet:

  1. Pick a random L in the range [256, 480]
  2. Resize the training image so its short side = L
  3. Sample a random 224 × 224 patch

Testing: average over a fixed set of crops. ResNet:

  1. Resize the image at 5 scales: {224, 256, 384, 480, 640}
  2. For each scale, use 10 224 × 224 crops: 4 corners + center, plus flips
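A sketch of the training-time recipe, with a dependency-free nearest-neighbor resize (a real pipeline would use a proper image library):

```python
import numpy as np

def random_crop_scale(img, L_min=256, L_max=480, crop=224):
    """Resize so the short side is a random L in [L_min, L_max], then
    sample a random crop x crop patch (ResNet-style training augmentation)."""
    H, W = img.shape[:2]
    L = np.random.randint(L_min, L_max + 1)
    scale = L / min(H, W)
    newH, newW = int(round(H * scale)), int(round(W * scale))
    rows = (np.arange(newH) * H / newH).astype(int)  # nearest-neighbor resize
    cols = (np.arange(newW) * W / newW).astype(int)
    resized = img[rows][:, cols]
    y = np.random.randint(0, newH - crop + 1)
    x = np.random.randint(0, newW - crop + 1)
    return resized[y:y + crop, x:x + crop]

patch = random_crop_scale(np.random.rand(300, 400, 3))  # -> (224, 224, 3)
```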

  1. Color jitter: randomly perturb the colors
  2. More: random mixes/combinations of:
    • translation
    • rotation
    • stretching
    • shearing
    • lens distortions, … (go crazy)

A general theme: add noise!
Training: add random noise; testing: marginalize over (average out) the noise!

Transfer Learning

  1. Train a CNN on ImageNet.
  2. Small dataset: freeze all earlier layers and retrain only the final layer; this uses the CNN as a fixed feature extractor, with no fine-tuning.
  3. Medium dataset: freeze most of the earlier layers and fine-tune the last few layers.

CNN Features off-the-shelf: an Astounding Baseline for Recognition, Razavian et al., 2014

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, 2014.

CNN Details

Stacking several small filters beats one large filter: with fewer parameters you get the same receptive field (measured by how many input pixels a neuron in the final layer can see) plus more nonlinearity, and the computational cost is lower!
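For example, three stacked 3×3 conv layers (C channels in and out) have the same effective 7×7 receptive field as a single 7×7 layer, but fewer parameters and three nonlinearities instead of one:

$$
3 \times (3^2 C^2) = 27 C^2 \quad \text{vs.} \quad 7^2 C^2 = 49 C^2
$$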

1×1 filters are used for dimensionality reduction (GoogLeNet)!

Replace one N×N filter with a 1×N filter followed by an N×1 filter: fewer parameters.

Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”

Implementing Convolutions

Turn the whole set of convolution computations into a single matrix multiplication (im2col)! This needs a large amount of extra memory.

  1. Let the image feature map be H × W × C, and let each of the D convolution filters be K × K × C.
  2. Reshape the image patches into a (K²C) × N matrix, where N is the number of filter positions, and the filters into a D × (K²C) matrix; compute the matrix product, then reshape the D × N result to the output size. (A minimal sketch follows.)
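A minimal numpy im2col sketch for stride 1 and no padding (the Python-loop patch extraction trades speed for clarity):

```python
import numpy as np

def conv_im2col(img, filters):
    """img: (H, W, C); filters: (D, K, K, C). Stride-1, no-padding convolution
    expressed as one matrix multiply: (D, K*K*C) @ (K*K*C, N)."""
    H, W, C = img.shape
    D, K = filters.shape[0], filters.shape[1]
    outH, outW = H - K + 1, W - K + 1
    # Each column is one flattened K x K x C patch; N = outH * outW columns.
    cols = np.stack([img[i:i + K, j:j + K].ravel()
                     for i in range(outH) for j in range(outW)], axis=1)
    out = filters.reshape(D, -1) @ cols   # (D, N)
    return out.reshape(D, outH, outW)

out = conv_im2col(np.random.randn(8, 8, 3), np.random.randn(4, 3, 3, 3))
```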

FFT-based implementation: no speedup for small filters!
Vasilache et al., "Fast Convolutional Nets With fbfft: A GPU Performance Evaluation"

Strassen-style fast matrix multiplication algorithms for further speedups.
Lavin and Gray, "Fast Algorithms for Convolutional Neural Networks", 2015

GPUs: NVIDIA is much more common for deep learning.

CEO of NVIDIA: Jen-Hsun Huang
(Stanford EE Masters 1992)

CPU: few, fast cores (1-16); good at sequential processing.

GPU: many slower cores (thousands); originally for graphics; good at parallel computation.

CUDA vs OpenCL.

Udacity: Intro to Parallel Programming

GPUs are extremely well suited to matrix multiplication!

Multi-GPU training:

  1. Model parallelism: for the FC layers
  2. Data parallelism: for the conv layers

Alex Krizhevsky, “One weird trick for parallelizing convolutional neural networks”

Google: distributed CPU training! Both data parallelism and model parallelism!

Large Scale Distributed Deep Networks, Jeff Dean et al., 2012

Google: asynchronous and synchronous training.

Abadi et al, “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems”

GPU-CPU communication bottleneck: moving data between host and device is expensive.

CPU-disk bottleneck: move from spinning disk to SSD.

GPU memory bottleneck:
Titan X: 12 GB (currently the max);
GTX 980 Ti: 6 GB

AlexNet: ~3GB needed with batch size 256

Floating-Point Precision

How low can the precision go?
16-bit fixed point with stochastic rounding!

Gupta et al, “Deep Learning with Limited Numerical Precision”, ICML 2015

10-bit activations, 12-bit parameter updates!
Courbariaux et al, “BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1”, arXiv 2016

The future: binary networks?

Software Packages: Caffe / Torch / Theano / TensorFlow

Caffe

Main classes: Blob, Layer, Net, Solver.

Protocol Buffers: "typed JSON" from Google

Training and fine-tuning: no code needed!

Caffe Model Zoo: pretrained models!

Provides a Python interface.

Not good for RNNs.

Torch

Tensors: ndarray

Not good for RNNs.

Theano

Computational graphs!

Problem: each weight update moves the weights and gradients to the CPU for computation!
This can be solved with shared_variable (keeping the state on the GPU).

TensorFlow

Still fairly slow at present!

Video

Feature-based methods (action recognition):

Unsupervised Learning

Autoencoders

Variational autoencoder

Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014

Generative adversarial nets

Goodfellow et al., "Generative Adversarial Nets", NIPS 2014