/ deeplearning.ai深度学习笔记整理

# deeplearning.ai深度学习笔记（Course2 Week1）：Practical aspects of deep learning

1. 训练集、开发集、测试集的概念
2. Bias/Variance问题
3. 如何通过泛化（regularization）算法，解决High variance问题，以及常用的集中regularization算法。
4. 最小化J中的一些加快手段
• 标准化输入（Normalizing ）
• 验证梯度计算是否正确。

# 1- Setting up your Machine Learning Application

## 1.1- Train / Dev / Test sets

1. Applied ML is a highly iterative process
之所以说是highly interative,是因为要设置不少超参，不断试验。但超参也不是说靠猜。而是通过一些手段，使得这个过程更有效。

2. 数据集划分：

• Train set：训练集
• dev set：即cross validation set，测试不同的算法
• test set：测试集
1. 关于数据集的一些趋势：
• 传统来说（数据集较小），这三种测试数据的比例为：70%，20%和20%；但在目前大数据集的情况下，dev set和test set的比例并不需要太高，只要足够测试即可，很可能比例是98%，1%和1%，甚至更高。
• 实际生产环境，可能Training set和Dev/test sets来源不同，前者来自开发者，后者来自用户，导致数据集分布不同
1. 经验法则：尽量让Training set和Dev/test sets具有相同的分布
2. 某些情况下没有test set也是没问题的。这个时候只有training set和dev set，或者不严谨的，这个时候dev set被称作为test set（但和tain/dev/test中的test作用并不一样，实际作用还是dev）

## 1.2- Bias / Variance

Bias：偏差，High bias: underfitting；说明算法比数据简单，不足以描述数据。
Variance：方差，High variance: overfitting；说明算法超过了数据的实际复杂性，甚至将一些随机因素过度解释为了数据规律。

Train set error Dev set error type
low (1%) high (11%) high variance
high (15%) high, but near train set error (16%) high bias
high (15%) high, but much higher than Train set error (30%) high bias & high variance
low (0.5%) low (1%) low bias & low variance

1. Dev set error理论上通常是大于Train set error

## 1.3- Basic Recipe for Machine Learning

1. 针对bias和variance要选择对应的解决方法。
2. 在早期的ML，强调bias variance trade-off；但在现代deep learning，可以通过加大Neural Network或增加更多数据，在分别解决High bias和High variance的时候，并不会影响彼此。 这也是deep learning在supervised learning如此成功的重要原因。

Training a bigger network almost never hurts. And the main cost of training a neural network that's too big is just computational time, so long as you're regularizing.

# 2- Regularization your neural network

1. Regularzization可以有效解决overfitting问题。
2. 常用的regularization方法：
• L2 regularization
• dropout reuglarization
3. 我的理解：之所以会有过拟合问题，本质上是数据存在一定的随机性干扰（即在主要的规律外，还有一定的随机因素干扰了数据，而这些随机因素被算法当成规律学习了），而中和这种随机性的办法就是在算法中也增加一些“干扰”，这个“干扰”就是Regularization。

## 2.1- L2 Regularization

1. 举例：
• logistic regression做regularization
在cost function增加：
$$\frac{\lambda}{2m}||w||^2_2$$

• Neural Network，增加Frobenius norm
$$\frac{\lambda}{2m} \sum\limits_{l = 1}^{m}||W^{[l]}||^2_F$$
其中
$$||W^{[l]}||^2_F = \sum\limits_{i = 1}^{n^{[l]}} \sum\limits_{j = 1}^{n^{[l-1]}} (W^{[l]}_{ij})^2$$

1. Why L2 regularization reduces overfitting?

Regularization其实是让函数变得简化

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.

L2 regularization的不足：要通过不断的选用不同的λ进行测试，计算量加大了。

## 2.2- Dropout Regularization

1. Dropout Regularization：在每轮迭代计算时，随机的将Network中一些neuron剔除，效果就好像用了一个更小的Network。（我有个疑问：为什么不直接较小的Network？）

Drop-out on the second hidden layer：

At each iteration, you shut down (= set to zero) each neuron of a layer with probability $$1 - keep_prob$$ or keep it with probability $$keep_prob$$ (50% here). The dropped neurons don't contribute to the training in both the forward and backward propagations of the iteration.

Drop-out on the first and third hidden layers:

$$1^{st}$$ layer: we shut down on average 40% of the neurons. $$3^{rd}$$ layer: we shut down on average 20% of the neurons.

1. implementing dropout("Inverted dropout")
illustrate with 3rd layer, 以0.8的概率保留neuron（keep_prob=0.8）
2. 在test time，不需要做regularization
3. dropout intuition
Can’t rely on any one feature, so have to spread out weights. And by spreading all the weights, this will tend to have an effect of shrinking the squared norm of the weights. And so, similar to what we saw with L2 regularization.
4. 不同的layer，可以选用不同的keep-pro；比如neuron较多的层，可以将keep-pro设置的小一些，而本身就没有几个neuron的（比如最后一层），则将keep-pro倾向于设置为1。另外，input layter也尽量设置为keep-pro==1。
5. dropout在computer vision用的非常多；但其他领域不轻易使用。
6. dropout的缺点是：导致cost function J不稳定，在估计J是否收敛时，可能会有不稳定。一般会在检查收敛的时候，把dropout关掉，确认收敛后在打开。

## 2.3- Other regularization methods

1. Data augmentation
通过对已有数据的人工加工，形成更多的训练数据，变相实现了增加数据量。比如对已有图片的翻转、裁剪形成新的数据。

2. Early stopping
在观察training set error的同时，将dev set error也输出观察。通常dev set error会先下降再上升。因此在这个点提前停止会得到一个相对好的结果。
通常，随着迭代的次数增加，W通常会越来越大；因此同L2 regularization类似（让W变小了），early stop也变相让W还没变大的时候提前停止。
early stop的缺点是：这个过程是和最小化J这个任务相悖的。但与L2 regularization相比，也有它的优势，即不需要通过迭代选择最好的λ，一次迭代就能确定early stop的点。

# 3- setting up your optimization problem

## 3.1- Normalizing inputs

1. normalizing = 中心化（均值为0）+归一化（方差为1）
注意：train set和dev/test set应该用一样的normalizing方法。
2. why normalize inputs?

## 3.2- Vanishing / Exploding gradients

Vanishing / Exploding gradients：在层数很深的neural network，可能因为input数据>1或<1的区别，深层的activation function会指数级的变的很大或很小，进而activation function的梯度要么很大，要么很小，使得梯度下降算法性能下降。

## 3.3- Weight Initialization for Deep Networks

$$W^{[l]} = np.random.randn(shape) *\sqrt{\frac{1}{n^{[l-1]}}}$$

$$W^{[l]} = np.random.randn(shape) *\sqrt{\frac{2}{n^{[l-1]}}}$$

• Different initializations lead to different results
• Random initialization is used to break symmetry and make sure different hidden units can learn different things
• Don't intialize to values that are too large
• He initialization works well for networks with ReLU activations.

$$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$$

对每个参数进行近似导数求导，然后比较和实际计算的梯度相比：
$$\frac{||d_{approx} - d||_2} {||d_{approx}||_2 + ||d||_2}$$
上面的式子的结果应该和$$\varepsilon$$大概一个数量级

2. 注意：

• 不要漏掉 regularization 的部分
• dropout regularization不适用，因为J是不稳定的。

# 4- Heros of Deep Learning: Yoshua Bengio interview

1. Yoshua Bengio 是 Ian Goodfellow 的导师，也是 Deep learning 这本书的的第二作者。
2. When I started my graduate studies in 1985, I started reading neural net papers, and that's where I got all excited, and it became really a passion.
3. We start with experiments, with intuitions, and theory sort of comes later. We now understand a lot better, for example, why Backdrop is working so well, why depth is so important.
4. what were the biggest surprises of what turned out to be wrong, compared to what we knew 30 years ago?
The ReLU was working a lot better than the sigmoids and tanh.
5. 最骄傲的研究成果：
• Long-term dependencies
• curse of dimensionality
• joint distributions with neural nets
• word embeddings for joint distributions for words
• the vanishing gradient in deep nets.
• the denoising auto-encoders
• the GANs
6. Neural nets we used to think as machines that can map a vector to a vector. But really with attention mechanisms, you can now handle any kind of data structure. And this is really opening up a lot of interesting avenues.
7. 虽然，现在业界在supervised learning上很成功，但unsupervised learning很重要，It's about actually building a mental construction that explains how the world works by observation.
8. what in deep learning today excites you the most?
I feel like the current state of the science of deep learning is far from where I'd like to see it. And I have the impression that our systems right now make the kind of mistakes that suggest they have a very superficial understanding of the world.
they would become much easier to tackle if we had systems that had a better understanding of how the world works.
9. Advice for people entering AI
• First of all, there are different motivations and different things you can do. What you need to become a deep learning researcher may not be the same as if you want to be an engineer who's going to use deep learning to build products. T
• You have to practice programming the things yourself. So don't just use one of the programming frameworks where you can do everything in a few lines of code, but you don't really know what just happened. Trying to derive the thing yourself from first principles, if you can. That really helps. But yeah, the usual things you have to do like reading, looking at other people's code, writing your own code, doing lots of experiment, making sure you understand everything you do. So especially for the science part of it, trying to ask why am I doing this, why are people doing this? Maybe the answer is somewhere in the book and you have to read more. （再次提到实践的重要性）
• deep learning 这本书的看法： I feel like there is more people reading this book than people who can read it right now.
• Proceedings of the ICLR (International Conference on Learning Representations) conference is probably the best concentrated place of good papers. （ICLR 国际会议很重要，需要关注）
• Don't be afraid by the math. Just develop the intuitions, and then the math become really easier to understand once you get the hang of what's going on at the intuitive level. （Intuition很重要）
• You don't need five years of PhD to become proficient at deep learning.
• 需要的数学基础: probability, algebra, optimization and calculus.