
deeplearning.ai Deep Learning Notes (Course 2, Week 1): Practical aspects of deep learning

Note: for easier reading, all posts in the deeplearning.ai Deep Learning Notes series are collected at http://dl-notes.imshuai.com

This is Course 2 (Week 1) material, covering:

  1. The concepts of training, dev, and test sets
  2. The bias/variance problem
  3. How to address high variance with regularization, and several commonly used regularization methods
  4. Some techniques that speed up minimizing J:
    • Normalizing inputs
    • The vanishing/exploding gradient problem and how to mitigate it
    • Checking that the gradient computation is correct (gradient checking)

1- Setting up your Machine Learning Application

1.1- Train / Dev / Test sets

  1. Applied ML is a highly iterative process
    It is highly iterative because there are many hyperparameters to set and many experiments to run. Choosing hyperparameters is not pure guesswork, though; there are techniques that make the process more efficient.

  2. Splitting the data set:

  • Train set: used to train the model
  • Dev set: i.e. the cross-validation set, used to compare different algorithms/models
  • Test set: used for the final, unbiased evaluation

  3. Some trends regarding data sets:
  • Traditionally (with relatively small data sets), a common split was 70%/30% (train/test) or 60%/20%/20% (train/dev/test); with today's large data sets the dev and test sets do not need to be such large fractions, only large enough to evaluate on, so splits like 98%/1%/1% or even more extreme are common.
  • In a real production setting, the training set and the dev/test sets may come from different sources (e.g. data gathered by the developers vs. data from actual users), so their distributions may differ.
  4. Rule of thumb: make sure the dev and test sets come from the same distribution; ideally the training set matches them too.
  5. In some cases it is fine to have no test set at all. Then there is only a training set and a dev set, and (somewhat loosely) the dev set is sometimes called the test set, even though it does not play the same role as the test set in a train/dev/test split; its actual role is still that of a dev set.

1.2- Bias / Variance

Bias: high bias means underfitting; the model is simpler than the data and cannot capture its structure.
Variance: high variance means overfitting; the model is more complex than the data warrants and even interprets random noise as if it were a real pattern.

Intuitively:
(Figure: decision boundaries illustrating high bias, "just right", and high variance)

How to tell whether the problem is bias or variance: compare the train set error and dev set error against the base error.
The table below gives the rule of thumb (the percentages in parentheses are example values, assuming a base error close to 0%):

| Train set error | Dev set error | Diagnosis |
| --- | --- | --- |
| low (1%) | high (11%) | high variance |
| high (15%) | high, but close to the train set error (16%) | high bias |
| high (15%) | high, and much higher than the train set error (30%) | high bias & high variance |
| low (0.5%) | low (1%) | low bias & low variance |

From the table:

  1. In theory the dev set error is usually greater than the train set error.
  2. A good algorithm has both low bias and low variance, and a bad one the opposite; there is no longer an obvious bias-variance trade-off.

The error a human can achieve is called the base error, or optimal error. If the base error in the table above were set to 15% instead of ~0%, the example in the second row would actually be low bias & low variance.

1.3- Basic Recipe for Machine Learning

The general procedure for addressing bias/variance:
(Figure: the basic recipe flowchart — if high bias (poor training set performance): try a bigger network, training longer, or a different architecture; if high variance (poor dev set performance): get more data, add regularization, or try a different architecture)

  1. Choose the remedy that matches the problem: high bias calls for different fixes than high variance.
  2. Early machine learning emphasized the bias-variance trade-off; in modern deep learning, training a bigger network (to reduce bias) or getting more data (to reduce variance) can address one problem without hurting the other. This is an important reason deep learning has been so successful at supervised learning.

Training a bigger network almost never hurts. And the main cost of training a neural network that's too big is just computational time, so long as you're regularizing.

2- Regularizing your neural network

  1. Regularization is an effective way to address overfitting.
  2. Commonly used regularization methods:
    • L2 regularization
    • Dropout regularization
  3. My understanding: overfitting happens because the data contains some random noise on top of the underlying pattern, and the algorithm learns that noise as if it were part of the pattern; regularization counteracts this by deliberately injecting some "disturbance" into the algorithm itself.

Below is a comparison with and without regularization; intuitively, regularization makes the decision boundary smoother:
(Figure: decision boundaries without vs. with regularization)

2.1- L2 Regularization

Add the \(L^2\) norm of the parameters to the cost function J; this is L2 regularization (also called weight decay). For a vector, the \(L^2\) norm is simply its Euclidean length.

  1. Examples:
  • Regularization for logistic regression:
    add the following to the cost function:
    $$\frac{\lambda}{2m}||w||^2_2$$

  • For a neural network, add the Frobenius norm of every weight matrix:
    $$\frac{\lambda}{2m} \sum\limits_{l = 1}^{L}||W^{[l]}||^2_F$$
    where
    $$||W^{[l]}||^2_F = \sum\limits_{i = 1}^{n^{[l]}} \sum\limits_{j = 1}^{n^{[l-1]}} (W^{[l]}_{ij})^2$$

Here λ is the regularization parameter.

  2. Why does L2 regularization reduce overfitting?

Regularization effectively makes the learned function simpler.

Why is it also called weight decay?
L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.

If λ is set very large, the cost J becomes dominated by the norm term (its value is relatively large), so minimizing J pushes W toward zero. With W close to 0, many neurons contribute very little; in the extreme the network almost degenerates to logistic regression.

In addition, an overly large λ makes W small, which keeps the activation function (e.g. tanh) in its nearly linear region, again pushing the network toward a simpler, more linear function:
(Figure: tanh is approximately linear for small |z|, the region that small weights keep z in)

A drawback of L2 regularization: you have to train repeatedly with different values of λ to pick a good one, which increases the amount of computation.
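As a concrete illustration, here is a minimal numpy sketch of adding the Frobenius-norm penalty to an already-computed cross-entropy cost; the function name compute_cost_with_regularization and the parameters dictionary layout ("W1", "b1", ...) are assumptions for this sketch, not prescribed by the notes above.

```python
import numpy as np

def compute_cost_with_regularization(cross_entropy_cost, parameters, lambd, m):
    """Return J + lambda/(2m) * sum_l ||W[l]||_F^2.

    cross_entropy_cost: the unregularized cost J
    parameters: dict holding "W1", "b1", ..., "WL", "bL"
    lambd: the regularization parameter lambda
    m: number of training examples
    """
    L = len(parameters) // 2  # number of layers (each layer has a W and a b)
    l2_penalty = sum(np.sum(np.square(parameters["W" + str(l)]))
                     for l in range(1, L + 1))
    return cross_entropy_cost + lambd / (2 * m) * l2_penalty
```

During backpropagation the penalty contributes an extra (λ/m)·W^{[l]} term to each dW^{[l]}, shrinking the weights a little on every update; this is where the name "weight decay" comes from.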

2.2- Dropout Regularization

  1. Dropout regularization: on every training iteration, randomly remove (zero out) some of the neurons in the network, so the effect is like training a smaller network each time. (A question of mine: why not just use a smaller network directly?)

Below are the examples from the programming assignment videos:

Drop-out on the second hidden layer:

At each iteration, you shut down (= set to zero) each neuron of a layer with probability 1 − keep_prob or keep it with probability keep_prob (50% here). The dropped neurons don't contribute to the training in both the forward and backward propagations of the iteration.

Drop-out on the first and third hidden layers:

\(1^{st}\) layer: we shut down on average 40% of the neurons. \(3^{rd}\) layer: we shut down on average 20% of the neurons.

  2. Implementing dropout ("inverted dropout"):
    illustrated with the 3rd layer, keeping each neuron with probability 0.8 (keep_prob = 0.8); see the code sketch after this list.
    (Figure: screenshot of the inverted dropout computation for layer 3)
  3. At test time, do not apply dropout (no neurons are dropped and no extra scaling is needed).
  4. Dropout intuition:
    Can't rely on any one feature, so have to spread out weights. And by spreading all the weights, this will tend to have an effect of shrinking the squared norm of the weights. And so, similar to what we saw with L2 regularization.
  5. Different layers can use different values of keep_prob: layers with many neurons can use a smaller keep_prob, while layers with only a few neurons (such as the output layer) usually have keep_prob close to or equal to 1. The input layer is also usually given keep_prob = 1.
  6. Dropout is used very heavily in computer vision, but is applied less readily in other fields.
  7. A drawback of dropout: the cost function J is no longer well defined, so it is hard to tell from the cost curve whether training is converging. The usual practice is to turn dropout off when checking convergence, confirm that J decreases monotonically, and then turn dropout back on.
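Here is a minimal sketch of the inverted-dropout step on the activations of layer 3, as described in item 2 above; the random stand-in for a3 is only there to make the snippet self-contained.

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8                    # probability of keeping each neuron
a3 = np.random.randn(5, 10)        # stand-in for the layer-3 activations, shape (n[3], m)

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean mask, True with prob keep_prob
a3 = a3 * d3                       # shut down the dropped neurons
a3 = a3 / keep_prob                # "inverted" step: rescale so the expected activation is unchanged
```

The same mask d3 has to be applied (and the same division by keep_prob repeated) on dA3 during backprop, and at test time no mask and no scaling are used.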

2.3- Other regularization methods

  1. Data augmentation
    Create more training examples by artificially transforming existing data, which effectively increases the data set; for example, flipping or cropping existing images produces new training images.
    (Figure: examples of flipped and cropped images)

  2. Early stopping
    While monitoring the training set error, also plot the dev set error. The dev set error typically decreases first and then starts to rise; stopping early at that point gives a relatively good model (see the sketch after this list).
    As the number of iterations grows, the weights W usually keep getting larger, so early stopping, much like L2 regularization (which keeps W small), simply stops before W has grown too large.
    The drawback of early stopping is that it works against the task of minimizing J (it deliberately stops optimizing J early). Its advantage over L2 regularization is that you don't have to search over λ; a single training run is enough to find the stopping point.
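A minimal sketch of the early stopping loop described above; train_one_epoch and dev_error are hypothetical helpers standing in for one pass of gradient descent and for evaluating the dev set error, and the patience value is an arbitrary choice.

```python
import copy

def train_with_early_stopping(parameters, train_one_epoch, dev_error,
                              max_epochs=1000, patience=10):
    """Stop training once the dev set error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_parameters = copy.deepcopy(parameters)
    bad_epochs = 0
    for epoch in range(max_epochs):
        parameters = train_one_epoch(parameters)   # hypothetical: one epoch of gradient descent
        err = dev_error(parameters)                # hypothetical: error on the dev set
        if err < best_error:
            best_error = err
            best_parameters = copy.deepcopy(parameters)
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:             # dev error has started rising: stop early
                break
    return best_parameters
```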

3- Setting up your optimization problem

This section covers ways to speed up the optimization algorithm.

3.1- Normalizing inputs

  1. Normalizing = centering (zero mean) + scaling to unit variance.
    Note: the train set and dev/test sets must be normalized in the same way, i.e. with the same μ and σ² (computed on the training set).
  2. Why normalize inputs?
    It puts every input feature on a similar scale, so that during gradient descent all parameters make progress at a similar rate; otherwise some parameters would be essentially done descending while others are still far away, as in the left picture below. (My understanding: without normalization you would need a different learning rate per parameter to keep the rates consistent.)
    (Figure: elongated vs. round cost contours, before and after normalizing the inputs)

Centering helps for a similar reason, since all parameters are initialized from the same random distribution.
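A minimal numpy sketch of the normalization step; the random arrays are only stand-ins, and the key point is that μ and σ² are computed on the training set and then reused, unchanged, on the dev/test data (array shapes follow the course convention (n_x, m)).

```python
import numpy as np

np.random.seed(0)
X_train = np.random.randn(3, 100) * 5 + 2   # stand-in training inputs, shape (n_x, m)
X_test = np.random.randn(3, 20) * 5 + 2     # stand-in test inputs

mu = np.mean(X_train, axis=1, keepdims=True)     # per-feature mean, from the training set only
sigma2 = np.var(X_train, axis=1, keepdims=True)  # per-feature variance, from the training set only

X_train_norm = (X_train - mu) / np.sqrt(sigma2)  # zero mean, unit variance
X_test_norm = (X_test - mu) / np.sqrt(sigma2)    # same mu and sigma2 applied to dev/test data
```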

3.2- Vanishing / Exploding gradients

Vanishing/exploding gradients: in a very deep neural network, if the weights are even slightly larger or slightly smaller than 1 (the identity), the activations can grow or shrink exponentially with the depth, and the gradients do the same, becoming extremely large or extremely small, which severely degrades gradient descent.

For a long time this problem was a barrier for deep networks. A partial remedy (it mitigates but does not fully solve the problem) is: a better or more careful choice of the random initialization for your neural network.

3.3- Weight Initialization for Deep Networks

In the earlier material on shallow neural networks, W was initialized with a small factor (e.g. multiplying by 0.01) so that W stays small and the activations fall in the region where the activation function still has a significant gradient; exactly how small the factor should be is what this section addresses.
For a DNN, because of vanishing/exploding gradients, we want the variance of z in each layer to stay roughly constant (close to 1). Since the inputs are normalized, this means the variance of each weight should be about 1/n, where n is the number of inputs into the neuron.
(Figure: derivation of the variance scaling for z = w_1 x_1 + … + w_n x_n)

针对tanh activation function,一般为(Xavier initialization):
$$W^{[l]} = np.random.randn(shape) *\sqrt{\frac{1}{n^{[l-1]}}}$$

针对ReLU activation function,一般为(He Initialization):
$$W^{[l]} = np.random.randn(shape) *\sqrt{\frac{2}{n^{[l-1]}}}$$

Conclusions (in light of the programming assignment):

  • Different initializations lead to different results
  • Random initialization is used to break symmetry and make sure different hidden units can learn different things
  • Don't initialize to values that are too large
  • He initialization works well for networks with ReLU activations
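A minimal sketch implementing the He formula above; layer_dims is an assumed list of layer sizes [n^{[0]}, n^{[1]}, …], and replacing the factor 2 with 1 gives the Xavier variant for tanh.

```python
import numpy as np

def initialize_parameters_he(layer_dims, seed=3):
    """He initialization: W[l] = randn(n[l], n[l-1]) * sqrt(2 / n[l-1]), b[l] = 0."""
    np.random.seed(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(2.0 / layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

parameters = initialize_parameters_he([2, 4, 1])  # e.g. 2 inputs, one hidden layer of 4 units, 1 output
```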

3.4- Gradient checking

  1. To make sure the derivatives computed during gradient descent (backprop) are correct, you can check them against a numerical approximation of the gradient. The approximation is just the two-sided limit definition of the derivative:

$$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} $$

  2. Gradient checking implementation
    Compute the numerical approximation for every parameter and compare it with the gradient computed by backprop:
    $$\frac{||d\theta_{approx} - d\theta||_2} {||d\theta_{approx}||_2 + ||d\theta||_2}$$
    The result should be roughly of the same order of magnitude as \(\varepsilon\).

  3. Notes:

    • Use gradient checking only for debugging, not during training
    • Don't forget to include the regularization term when checking gradients
    • Gradient checking does not work with dropout, because J is then not well defined; turn dropout off first
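A minimal sketch of the check above for a cost J(θ) with a flattened parameter vector θ; the toy cost and grad functions at the bottom are only there to make the snippet runnable.

```python
import numpy as np

def gradient_check(cost, grad, theta, epsilon=1e-7):
    """Compare the analytic gradient grad(theta) with the two-sided numerical
    approximation and return the relative difference (should be ~epsilon or smaller)."""
    d_theta = grad(theta)                         # the gradient computed by "backprop"
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy();  theta_plus[i] += epsilon
        theta_minus = theta.copy(); theta_minus[i] -= epsilon
        d_theta_approx[i] = (cost(theta_plus) - cost(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator

# Toy example: J(theta) = sum(theta^2), so dJ/dtheta = 2*theta.
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda t: np.sum(t ** 2), lambda t: 2 * t, theta))  # prints a value well below 1e-7
```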

4- Heros of Deep Learning: Yoshua Bengio interview

(Photo: Yoshua Bengio)

  1. Yoshua Bengio was Ian Goodfellow's advisor and is the second author of the book Deep Learning.
  2. When I started my graduate studies in 1985, I started reading neural net papers, and that's where I got all excited, and it became really a passion.
  3. We start with experiments, with intuitions, and theory sort of comes later. We now understand a lot better, for example, why backprop is working so well, why depth is so important.
  4. what were the biggest surprises of what turned out to be wrong, compared to what we knew 30 years ago?
    The ReLU was working a lot better than the sigmoids and tanh.
  5. The research results he is most proud of:
    • Long-term dependencies
    • curse of dimensionality
    • joint distributions with neural nets
    • word embeddings for joint distributions for words
    • the vanishing gradient in deep nets.
    • the denoising auto-encoders
    • the GANs
  6. Neural nets we used to think as machines that can map a vector to a vector. But really with attention mechanisms, you can now handle any kind of data structure. And this is really opening up a lot of interesting avenues.
  7. Although industry has been very successful with supervised learning, unsupervised learning is important: "It's about actually building a mental construction that explains how the world works by observation."
  8. what in deep learning today excites you the most?
    I feel like the current state of the science of deep learning is far from where I'd like to see it. And I have the impression that our systems right now make the kind of mistakes that suggest they have a very superficial understanding of the world.
    they would become much easier to tackle if we had systems that had a better understanding of how the world works.
  9. Advice for people entering AI
    • First of all, there are different motivations and different things you can do. What you need to become a deep learning researcher may not be the same as if you want to be an engineer who's going to use deep learning to build products.
    • You have to practice programming the things yourself. So don't just use one of the programming frameworks where you can do everything in a few lines of code, but you don't really know what just happened. Try to derive the thing yourself from first principles, if you can. That really helps. But yeah, the usual things you have to do like reading, looking at other people's code, writing your own code, doing lots of experiments, making sure you understand everything you do. So especially for the science part of it, trying to ask why am I doing this, why are people doing this? Maybe the answer is somewhere in the book and you have to read more. (Again emphasizing the importance of practice.)
    • On the Deep Learning book: "I feel like there are more people reading this book than people who can read it right now."
    • Proceedings of the ICLR (International Conference on Learning Representations) conference is probably the best concentrated place of good papers. (The ICLR conference matters and is worth following.)
    • Don't be afraid of the math. Just develop the intuitions, and then the math becomes much easier to understand once you get the hang of what's going on at the intuitive level. (Intuition matters.)
    • You don't need five years of PhD to become proficient at deep learning.
    • Required math background: probability, algebra, optimization and calculus.