CS231N - Convolutional Neural Networks for Visual Recognition

About

Course taught by Fei-Fei Li at Stanford; see http://cs231n.stanford.edu/

Backpropagation and Computational Graphs

Each node in the graph implements a forward pass (computing the activation) and a backward pass (computing gradients); the edges of the graph correspond to variables.

Efficient Backpropagation

LeCun's 1998 paper gives several tricks for backpropagation.

Weight initialization: assuming unit-variance, uncorrelated inputs, the standard deviation of a unit's weighted sum is

$$
\sigma_{y_i} = (\sum_j w_{ij}^2)^{1/2}
$$

To ensure the output variance is also 1, the weights should be drawn with standard deviation

$$
\sigma_w = m^{-1/2}
$$

where m is the number of inputs (fan-in). For partially connected networks (e.g., CNNs, TDNNs), m should be the number of incoming connections to the unit.

Momentum is another trick: reuse part of the previous weight change:

$$
\Delta w(t+1) = \eta \frac{\partial E_{t+1}}{\partial w} + \mu \Delta w(t)
$$
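A minimal numpy sketch of this update rule (the function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def sgd_momentum(w, grad_fn, eta=0.01, mu=0.9, steps=100):
    """Momentum update: dw(t+1) = eta * dE/dw + mu * dw(t), then w -= dw."""
    dw = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)          # dE/dw at the current weights
        dw = eta * g + mu * dw  # accumulate a velocity term
        w = w - dw              # step against the gradient
    return w

# Example: minimize E(w) = ||w||^2 / 2, whose gradient is w itself.
w_opt = sgd_momentum(np.array([3.0, -2.0]), grad_fn=lambda w: w)
```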

Automatic Differentiation

Four ways to compute gradients:

  1. Derive the gradient formulas by hand, then implement them: error-prone and time-consuming.
  2. Numerical differentiation (finite differences): simple to implement, but slow and imprecise; still useful for gradient checking, as sketched after this list.
  3. Symbolic differentiation (Mathematica, Maple): the resulting expressions tend to blow up (expression swell), and the input must be in closed form.
  4. Automatic differentiation: machine precision, and within a constant factor of the ideal asymptotic cost (excellent performance!).
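As a concrete example of method 2, a finite-difference check is often used to verify method 1's hand-derived gradients; a minimal numpy sketch:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered differences: df/dx_i ~ (f(x + h*e_i) - f(x - h*e_i)) / (2h)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fp = f(x)                        # f at x + h*e_i
        x.flat[i] = old - h
        fm = f(x)                        # f at x - h*e_i
        x.flat[i] = old                  # restore the original value
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# Check the analytic gradient of f(x) = sum(x^2), which is 2x.
x = np.random.randn(5)
assert np.allclose(numerical_gradient(lambda v: np.sum(v ** 2), x), 2 * x, atol=1e-6)
```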

References

  1. Automatic differentiation in machine learning: a survey, 2015
  2. Efficient BackProp, LeCun et al., 1998
  3. https://cs231n.github.io/optimization-2/
  4. Stochastic Gradient Tricks

History of Neural Networks

The Perceptron

Frank Rosenblatt, 1957

$$
y = f(wx + b), \qquad
f(z) = \begin{cases} 1 & z > 0 \\ 0 & \text{otherwise} \end{cases}
$$

Weight update:

$$
w_i(t+1) = w_i(t) + \alpha (d_j - y_j(t))x_{j, i}
$$

This is equivalent to the following loss plus SGD (note: here the labels are +1/-1, unlike the 0/1 outputs above; this is just for notational convenience):

$$
\max(0, -d_j \, y_j)
$$
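A minimal sketch of the perceptron rule on made-up, linearly separable data (the function name and data are illustrative):

```python
import numpy as np

def train_perceptron(X, d, alpha=1.0, epochs=10):
    """Perceptron rule: w += alpha * (d_j - y_j) * x_j, with 0/1 outputs."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_j, d_j in zip(X, d):
            y_j = 1.0 if w @ x_j + b > 0 else 0.0  # threshold unit f(z)
            w += alpha * (d_j - y_j) * x_j         # no change when y_j == d_j
            b += alpha * (d_j - y_j)
    return w, b

# Toy data: logical AND, which is linearly separable.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 0., 0., 1.])
w, b = train_perceptron(X, d)
```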

Three-Layer Neural Networks

Rumelhart et al., 1986: the backpropagation (BP) algorithm.

Deep Networks via RBMs

Hinton 2006

First Strong Results

Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, George Dahl, Dong Yu, Li Deng, Alex Acero, 2010, MSR

ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012

Activation Functions

For example, the ELU (exponential linear unit):

$$
f(x) = \begin{cases} x & x > 0 \\ \alpha (\exp(x) - 1) & x \le 0 \end{cases}
$$

Centering: subtract the mean image (AlexNet), or subtract the per-channel mean (VGGNet).

Initialization:

"Xavier initialization" [Glorot et al., 2010]: $n_{in} \, \mathrm{Var}(w) = 1$.

For ReLU, since half of the activations are zeroed out, there is an extra factor of 1/2: $\frac{1}{2} n_{in} \, \mathrm{Var}(w) = 1$ (He et al., 2015). A numpy sketch of both rules follows.
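Layer sizes below are illustrative:

```python
import numpy as np

n_in, n_out = 512, 256

# Xavier: Var(w) = 1 / n_in, i.e. std = 1 / sqrt(n_in)  (tanh-like units)
W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# He: Var(w) = 2 / n_in, i.e. std = sqrt(2 / n_in)      (ReLU units)
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```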

Papers:

  1. Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
  2. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015

Batch Normalization [Ioffe and Szegedy, 2015]: normalize activations to a standard normal distribution, then let the network learn its own affine transform (scale and shift). Usually inserted between a fully connected layer and the activation function. A minimal sketch of the forward pass follows.
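Here gamma and beta are the learned affine parameters:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) minibatch. Normalize each feature over the batch, then scale/shift."""
    mu = x.mean(axis=0)                     # per-feature mean
    var = x.var(axis=0)                     # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # roughly zero mean, unit variance
    return gamma * x_hat + beta             # learned affine transform
```

(At test time, running averages of mu and var collected during training are used instead of batch statistics.)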

CNN

  1. LeNet-5: Gradient-Based Learning Applied to Document Recognition, LeCun, Bottou, Bengio, Haffner, 1998
  2. AlexNet: ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky, Sutskever, Hinton, 2012
  3. ZFNet: Visualizing and Understanding Convolutional Networks, Zeiler and Fergus, 2013
  4. VGGNet: Very Deep Convolutional Networks for Large-Scale Image Recognition
  5. GoogLeNet: Going Deeper with Convolutions
  6. ResNet: Deep Residual Learning for Image Recognition

Object Detection

Task: classification + localization

Localization as Regression: input an image, output 4 coordinates (very simple!). Use an L2 loss and treat it as a regression problem.
Classification + localization as a multi-task problem, sharing the same CNN as the feature extractor.

Input image => CNN => FC => softmax / box regression
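A sketch of the combined multi-task objective behind this pipeline: softmax cross-entropy for the classification head plus an L2 loss for the box head (the weight `lam` and all names are illustrative):

```python
import numpy as np

def cls_loc_loss(class_scores, true_class, box_pred, box_true, lam=1.0):
    """Cross-entropy on the class scores + L2 on the 4 box coordinates."""
    scores = class_scores - class_scores.max()          # stabilize the softmax
    log_probs = scores - np.log(np.exp(scores).sum())
    ce = -log_probs[true_class]                         # classification term
    l2 = np.sum((box_pred - box_true) ** 2)             # localization term
    return ce + lam * l2

loss = cls_loc_loss(np.random.randn(10), true_class=3,
                    box_pred=np.random.randn(4),
                    box_true=np.array([10., 20., 50., 80.]))
```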

Where to attach the regression head: after the conv layers (Overfeat, VGG), or after the fully connected (FC) layers (DeepPose, R-CNN).

Detecting multiple objects: share the conv layers as the feature extractor.

Pose estimation: Toshev and Szegedy, "DeepPose: Human Pose Estimation via Deep Neural Networks", CVPR 2014

Sliding window: Overfeat (Integrated Recognition, Localization and Detection using Convolutional Networks, ICLR 2014)

With a sliding window, object detection can be cast as a classification problem! But it requires evaluating an enormous number of window positions and scales!

HOG (Histogram of Oriented Gradients)

Dalal and Triggs, “Histograms of Oriented Gradients for Human Detection”, CVPR 2005

Deformable Parts Model (DPM)

Felzenszwalb et al., "Object Detection with Discriminatively Trained Part Based Models", PAMI 2010

Is DPM a CNN?

Girshick et al., "Deformable Part Models are Convolutional Neural Networks", CVPR 2015

Detection as Classification:

Problem: many positions and scales must be tested; computationally expensive!

Solution: only test a small subset of candidate regions!

How to:

Region Proposals: Selective Search

Bottom-up: over-segment the image, then merge similar regions level by level, producing segmentations (and candidate boxes) at multiple scales.

Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013

Other methods: EdgeBoxes?

Review of proposal methods: Hosang et al., "What makes for effective detection proposals?", PAMI 2015

R-CNN! Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014

Object detection datasets: PASCAL VOC (2010), ImageNet Detection (ILSVRC 2014), MS-COCO (2014)

Evaluation metric: mean average precision (mAP)

R-CNN problems: a full CNN forward pass per region proposal makes training and testing slow, and training is a multi-stage pipeline.

Fast R-CNN: Girshick, "Fast R-CNN", ICCV 2015

ROI (region of interest) pooling (see the sketch after this list):

  1. Run conv + pooling over the whole image once to obtain a high-resolution feature map.
  2. Project each region proposal onto the feature map, divide the projected region into an h × w grid, and max-pool within each cell to get a fixed-size feature.
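A minimal numpy sketch of the ROI max-pooling step (the feature map and ROI coordinates below are made up):

```python
import numpy as np

def roi_max_pool(fmap, x0, y0, x1, y1, h=7, w=7):
    """Divide ROI [y0:y1, x0:x1] of fmap (H, W, C) into an h*w grid and
    max-pool each cell, producing a fixed-size (h, w, C) output."""
    ys = np.linspace(y0, y1, h + 1).astype(int)  # grid row boundaries
    xs = np.linspace(x0, x1, w + 1).astype(int)  # grid column boundaries
    out = np.empty((h, w, fmap.shape[2]))
    for i in range(h):
        for j in range(w):
            cell = fmap[ys[i]:max(ys[i + 1], ys[i] + 1),   # guard empty cells
                        xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max(axis=(0, 1))
    return out

pooled = roi_max_pool(np.random.randn(32, 32, 256), x0=4, y0=6, x1=20, y1=28)
```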

Training is 8.8× faster and testing 146× faster than R-CNN!

Caveat: the test-time speedup does not include region proposal extraction!

Faster R-CNN: add a Region Proposal Network (RPN) on top of the last conv layer.

Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015

Speeds up test time by another 10×!

Visualizing Representations

t-SNE visualization: two images are placed nearby if their CNN codes are close.
Laurens van der Maaten and Geoffrey Hinton, 2008.

Deconv approach: pick a CNN layer, zero out all the gradients at that layer except one, then backprop to the input to obtain the "deconv" image (BP to image).

Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013

Optimization-to-Image approach: find an image that maximizes the score of a chosen class:

$$
\arg \max_{I} S_c(I) - \lambda ||I||_2^2
$$

  1. Initialize the input to zeros, i.e., feed in an all-zero image.
  2. Set the output layer's gradient to a one-hot vector (1 for the chosen class, 0 elsewhere), then backprop to the image! (A toy sketch follows.)
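A toy, self-contained sketch of the procedure: the "network" here is just a linear scorer $S_c(I) = W_c \cdot I$ so that the ascent step is explicit (a real use would backprop through a trained CNN):

```python
import numpy as np

np.random.seed(0)
D, num_classes, c = 3072, 10, 3        # flattened 32x32x3 image, target class c
W = np.random.randn(num_classes, D)    # stand-in for a trained network
I = np.zeros(D)                        # step 1: all-zero image

lr, lam = 0.1, 1e-3
for _ in range(100):
    # step 2: gradient of S_c(I) - lam * ||I||_2^2 w.r.t. I.
    # For the linear scorer, dS_c/dI = W[c]; a CNN would get this by backprop.
    grad = W[c] - 2 * lam * I
    I += lr * grad                     # gradient ascent on the image itself
```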

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014.

Understanding Neural Networks Through Deep Visualization, Yosinski et al. , 2015

Question: given a CNN code, can the original image be reconstructed?

Understanding Deep Image Representations by Inverting Them, Mahendran and Vedaldi, 2014

DeepDream, https://github.com/google/deepdream


RNN

Character-level language models

Image captioning: feed the CNN-extracted image features into the RNN as an extra input to the hidden layer. The RNN's first input is a fixed start token; at each subsequent step, the previous output word is fed back in as the input.
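A sketch of one recurrent step with the CNN code v as the extra hidden-layer input (all weights below are random stand-ins):

```python
import numpy as np

def captioning_rnn_step(x, h, v, Wxh, Whh, Wih, b):
    """h' = tanh(Wxh @ x + Whh @ h + Wih @ v): the image feature v
    feeds into the hidden state alongside the usual word input x."""
    return np.tanh(Wxh @ x + Whh @ h + Wih @ v + b)

D_x, D_h, D_v = 128, 256, 512  # word embedding, hidden, CNN feature dims
rng = np.random.default_rng(0)
Wxh = rng.normal(size=(D_h, D_x))
Whh = rng.normal(size=(D_h, D_h))
Wih = rng.normal(size=(D_h, D_v))
h = captioning_rnn_step(rng.normal(size=D_x), np.zeros(D_h),
                        rng.normal(size=D_v), Wxh, Whh, Wih, np.zeros(D_h))
```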

Image-sentence datasets: Microsoft COCO [Tsung-Yi Lin et al., 2014]

While generating each word, the RNN attends to different parts of the image: Show, Attend and Tell, Xu et al., 2015

CNN practice

Data Augmentation:

Transform the images:

  1. Horizontal flips
  2. Random crops and scales

Training: sample random crops / scales (see the sketch after the testing recipe below). ResNet:

  1. Pick a random L in the range [256, 480]
  2. Resize the training image so its short side = L
  3. Sample a random 224 × 224 patch

Testing: average over a fixed set of crops. ResNet:

  1. Resize the image at 5 scales: {224, 256, 384, 480, 640}
  2. For each scale, use 10 224 × 224 crops: 4 corners + center, plus flips
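A sketch of the training-time recipe, with a dependency-free nearest-neighbor resize (a real pipeline would use a proper image library):

```python
import numpy as np

def random_crop_scale(img, L_min=256, L_max=480, crop=224):
    """Resize so the short side is a random L in [L_min, L_max], then
    sample a random crop x crop patch (ResNet-style training augmentation)."""
    H, W = img.shape[:2]
    L = np.random.randint(L_min, L_max + 1)
    scale = L / min(H, W)
    newH, newW = int(round(H * scale)), int(round(W * scale))
    rows = (np.arange(newH) * H / newH).astype(int)  # nearest-neighbor resize
    cols = (np.arange(newW) * W / newW).astype(int)
    resized = img[rows][:, cols]
    y = np.random.randint(0, newH - crop + 1)
    x = np.random.randint(0, newW - crop + 1)
    return resized[y:y + crop, x:x + crop]

patch = random_crop_scale(np.random.rand(300, 400, 3))  # -> (224, 224, 3)
```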

  1. Color jitter: randomly perturb the colors
  2. More: random mixes/combinations of:
    • translation
    • rotation
    • stretching
    • shearing
    • lens distortions, … (go crazy)

A general theme: add noise!
Training: add random noise; testing: marginalize over (average out) the noise!

Transfer Learning

  1. Train a CNN on ImageNet.
  2. Small dataset: freeze all earlier layers and retrain only the final layer; this uses the CNN as a fixed feature extractor, with no fine-tuning.
  3. Medium dataset: freeze most of the earlier layers and fine-tune the last few layers.

CNN Features off-the-shelf: an Astounding Baseline for Recognition, Razavian et al., 2014

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, 2014.

CNN Details

Stacking several small filters beats one large filter: with fewer parameters you get the same receptive field (measured by how many input pixels a neuron in the final layer can see) plus more nonlinearity, and the computational cost is lower!
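For example, three stacked 3×3 conv layers (C channels in and out) have the same effective 7×7 receptive field as a single 7×7 layer, but fewer parameters and three nonlinearities instead of one:

$$
3 \times (3^2 C^2) = 27 C^2 \quad \text{vs.} \quad 7^2 C^2 = 49 C^2
$$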

1×1 filters are used for dimensionality reduction (GoogLeNet)!

Replace one N×N filter with a 1×N filter followed by an N×1 filter: fewer parameters.

Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”

Implementing Convolutions

Turn the whole set of convolution computations into a single matrix multiplication (im2col)! This needs a large amount of extra memory.

  1. Let the image feature map be H × W × C, and let each of the D convolution filters be K × K × C.
  2. Reshape the image patches into a (K²C) × N matrix, where N is the number of filter positions, and the filters into a D × (K²C) matrix; compute the matrix product, then reshape the D × N result to the output size. (A minimal sketch follows.)
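A minimal numpy im2col sketch for stride 1 and no padding (the Python-loop patch extraction trades speed for clarity):

```python
import numpy as np

def conv_im2col(img, filters):
    """img: (H, W, C); filters: (D, K, K, C). Stride-1, no-padding convolution
    expressed as one matrix multiply: (D, K*K*C) @ (K*K*C, N)."""
    H, W, C = img.shape
    D, K = filters.shape[0], filters.shape[1]
    outH, outW = H - K + 1, W - K + 1
    # Each column is one flattened K x K x C patch; N = outH * outW columns.
    cols = np.stack([img[i:i + K, j:j + K].ravel()
                     for i in range(outH) for j in range(outW)], axis=1)
    out = filters.reshape(D, -1) @ cols   # (D, N)
    return out.reshape(D, outH, outW)

out = conv_im2col(np.random.randn(8, 8, 3), np.random.randn(4, 3, 3, 3))
```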

FFT-based implementation: no speedup for small filters!
Vasilache et al., "Fast Convolutional Nets With fbfft: A GPU Performance Evaluation"

Strassen-style fast matrix multiplication algorithms for further speedups.
Lavin and Gray, "Fast Algorithms for Convolutional Neural Networks", 2015

GPUs: NVIDIA is much more common for deep learning.

CEO of NVIDIA: Jen-Hsun Huang
(Stanford EE Masters 1992)

CPU: few, fast cores (1-16); good at sequential processing.

GPU: many slower cores (thousands); originally for graphics; good at parallel computation.

CUDA vs OpenCL.

Udacity: Intro to Parallel Programming

GPUs are extremely well suited to matrix multiplication!

Multi-GPU training:

  1. Model parallelism: for the FC layers
  2. Data parallelism: for the conv layers

Alex Krizhevsky, “One weird trick for parallelizing convolutional neural networks”

Google: distributed CPU training! Both data parallelism and model parallelism!

Large Scale Distributed Deep Networks, Jeff Dean et al., 2012

Google: asynchronous and synchronous training.

Abadi et al, “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems”

GPU-CPU communication bottleneck: moving data between host and device is expensive.

CPU-disk bottleneck: move from spinning disk to SSD.

GPU memory bottleneck:
Titan X: 12 GB (currently the max);
GTX 980 Ti: 6 GB

AlexNet: ~3GB needed with batch size 256

Floating-Point Precision

How low can the precision go?
16-bit fixed point with stochastic rounding!

Gupta et al, “Deep Learning with Limited Numerical Precision”, ICML 2015

10-bit activations, 12-bit parameter updates!
Courbariaux et al, “BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1”, arXiv 2016

The future: binary networks?

Software Packages: Caffe / Torch / Theano / TensorFlow

Caffe

Main classes: Blob, Layer, Net, Solver.

Protocol Buffers: "typed JSON" from Google

Training and fine-tuning: no code needed!

Caffe Model Zoo: pretrained models!

Provides a Python interface.

Not good for RNNs.

Torch

Tensors: ndarray

Not good for RNNs.

Theano

Computational graphs!

Problem: each weight update moves the weights and gradients to the CPU for computation!
This can be solved with shared_variable (keeping the state on the GPU).

TensorFlow

Still fairly slow at present!

Video

Feature-based methods (action recognition):

Unsupervised Learning

Autoencoders

Variational autoencoder

Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014

Generative adversarial nets

Goodfellow et al., "Generative Adversarial Nets", NIPS 2014