The 10 Deep Learning Methods AI Practitioners Need to Apply

Interest in machine learning has exploded over the past decade. You see machine learning in computer science programs, industry conferences, and the Wall Street Journal almost daily. For all the talk about machine learning, many conflate what it can do with what they wish it could do. Fundamentally, machine learning is using algorithms to extract information from raw data and represent it in some type of model. We use this model to infer things about other data we have not yet modeled.

Neural networks are one type of model for machine learning; they have been around for at least 50 years. The fundamental unit of a neural network is a node, which is loosely based on the biological neuron in the mammalian brain. The connections between neurons are also modeled on biological brains, as is the way these connections develop over time (with “training”).

In the mid-1980s and early 1990s, many important architectural advancements were made in neural networks. However, the amount of time and data needed to get good results slowed adoption, and thus interest cooled. In the early 2000s, computational power expanded exponentially and the industry saw a “Cambrian explosion” of computational techniques that were not possible prior to this. Deep learning emerged from that decade’s explosive computational growth as a serious contender in the field, winning many important machine learning competitions. The interest has not cooled as of 2017; today, we see deep learning mentioned in every corner of machine learning.

To get myself into the craze, I took Udacity’s “Deep Learning” course, which is a great introduction to the motivation behind deep learning and to the design of intelligent systems that learn from complex and/or large-scale datasets in TensorFlow. For the class projects, I used and developed neural networks for image recognition with convolutions, natural language processing with embeddings, and character-based text generation with Recurrent Neural Networks / Long Short-Term Memory. All the code, in Jupyter notebooks, can be found in this GitHub repository.

Here is an outcome of one of the assignments, a t-SNE projection of word vectors, clustered by similarity.

[Figure: t-SNE projection of word vectors, clustered by similarity]

Most recently, I have started reading academic papers on the subject; several of the publications I came across have been hugely influential in the development of the field.

Through all this research and coursework, I have picked up a great deal of knowledge about deep learning. Here I want to share 10 powerful deep learning methods AI engineers can apply to their machine learning problems. But first of all, let’s define what deep learning is. Deep learning has been a challenge to define for many because it has slowly changed form over the past decade. To set deep learning in context visually, the figure below illustrates the relationship between AI, machine learning, and deep learning.

[Figure: the relationship between AI, machine learning, and deep learning]

The field of AI is broad and has been around for a long time. Deep learning is a subset of the field of machine learning, which is a subfield of AI. The facets that differentiate deep learning networks in general from “canonical” feed-forward multilayer networks are as follows:

  • More neurons than previous networks
  • More complex ways of connecting layers
  • “Cambrian explosion” of computing power to train
  • Automatic feature extraction

When I say “more neurons”, I mean that the neuron count has risen over the years to express more complex models. Layers have also evolved: from every layer being fully connected in multilayer networks, to locally connected patches of neurons between layers in Convolutional Neural Networks, and to recurrent connections back to the same neuron in Recurrent Neural Networks (in addition to the connections from the previous layer).

Deep learning then can be defined as neural networks with a large number of parameters and layers in one of four fundamental network architectures:

  • Unsupervised Pre-trained Networks
  • Convolutional Neural Networks
  • Recurrent Neural Networks
  • Recursive Neural Networks

In this post, I am mainly interested in the latter three architectures. A Convolutional Neural Network is basically a standard neural network that has been extended across space using shared weights. A CNN is designed to recognize images; the convolutions inside it detect features such as the edges of objects in an image. A Recurrent Neural Network is basically a standard neural network that has been extended across time by having edges that feed into the next time step instead of into the next layer in the same time step. An RNN is designed to recognize sequences, for example a speech signal or a text; the cycles inside it imply the presence of short-term memory in the net. A Recursive Neural Network is more like a hierarchical network, where there is really no time aspect to the input sequence but the input has to be processed hierarchically, in a tree fashion. The 10 methods below can be applied to all of these architectures.

1 — Back-Propagation

Back-prop is simply a method to compute the partial derivatives (or gradient) of a function that has the form of a function composition (as in neural nets). When you solve an optimization problem using a gradient-based method (gradient descent is just one of them), you want to compute the gradient of the function at each iteration.

[Figure: back-propagation]

For a neural net, the objective function has the form of a composition. How do you compute the gradient? There are two common ways to do it: (i) analytic differentiation: you know the form of the function, so you just compute the derivatives using the chain rule (basic calculus); (ii) approximate differentiation using finite differences: this method is computationally expensive because the number of function evaluations is O(N), where N is the number of parameters, which is expensive compared to analytic differentiation. Finite differences, however, are commonly used to validate a back-prop implementation when debugging.
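To make both options concrete, here is a minimal sketch in NumPy (the tiny composed function and all names are my own illustrative choices, not from the article): it computes the gradient analytically via the chain rule, then validates it with a centered finite difference, exactly the debugging use mentioned above.

    import numpy as np

    # A tiny "network": f(w) = sum(tanh(X @ w)) is a composition of functions.
    X = np.random.randn(5, 3)

    def f(w):
        return np.tanh(X @ w).sum()

    def analytic_grad(w):
        # Chain rule: d/dw sum(tanh(X w)) = X^T (1 - tanh(X w)^2)
        return X.T @ (1.0 - np.tanh(X @ w) ** 2)

    def numeric_grad(w, eps=1e-6):
        # Finite differences: one pair of evaluations per parameter, i.e. O(N).
        g = np.zeros_like(w)
        for i in range(w.size):
            e = np.zeros_like(w)
            e[i] = eps
            g[i] = (f(w + e) - f(w - e)) / (2 * eps)
        return g

    w = np.random.randn(3)
    print(np.allclose(analytic_grad(w), numeric_grad(w), atol=1e-5))  # True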

2 — Stochastic Gradient Descent

An intuitive way to think of gradient descent is to imagine the path of a river originating from the top of a mountain. The goal of gradient descent is exactly what the river strives to achieve: namely, to reach the bottom-most point (at the foothill) by climbing down the mountain.

Now, if the terrain of the mountain is shaped in such a way that the river doesn’t have to stop anywhere before arriving at its final destination (the lowest point at the foothill), then this is the ideal case we desire. In machine learning terms, this amounts to saying we have found the global minimum (or optimum) of the solution starting from the initial point (the top of the hill). However, it could be that the nature of the terrain forces several pits in the river’s path, which could trap the river and make it stagnate. In machine learning terms, such pits are called local minima, which are not desirable. (There are a number of ways to get out of them, which I am not discussing here.)

[Figure: gradient descent on a terrain with local minima]

Gradient descent is therefore prone to getting stuck in a local minimum, depending on the nature of the terrain (the function, in ML terms). But when you have a special kind of mountain terrain (shaped like a bowl; in ML terms this is called a convex function), the algorithm is always guaranteed to find the optimum. You can visualize this by picturing the river again. These kinds of special terrain (a.k.a. convex functions) are always a blessing for optimization in ML. Also, depending on where at the top of the mountain you initially start (i.e., the initial values of the function), you might end up following a different path. Similarly, depending on the speed at which the river climbs down (i.e., the learning rate or step size of the gradient descent algorithm), you might arrive at the final destination in a different manner. Both of these criteria can affect whether you fall into a pit (a local minimum) or are able to avoid it.
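As a minimal sketch of mini-batch stochastic gradient descent on a convex “bowl” (a least-squares loss; the data, constants, and names here are my own illustrative assumptions):

    import numpy as np

    # Least-squares loss: mean((X w - y)^2), a convex "bowl" with one global minimum.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    true_w = rng.normal(size=5)
    y = X @ true_w + 0.01 * rng.normal(size=1000)

    w = np.zeros(5)          # where on the mountain we start
    lr = 0.1                 # the river's "speed": the step size
    for step in range(500):
        batch = rng.integers(0, len(X), size=32)       # a random mini-batch
        xb, yb = X[batch], y[batch]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)      # gradient on the batch
        w -= lr * grad                                 # one step downhill
    print(np.allclose(w, true_w, atol=0.05))           # True: we reached the foothill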

3 — Learning Rate Decay

[Figure: learning rate decay]

Adapting the learning rate for your stochastic gradient descent optimization procedure can increase performance and reduce training time. Sometimes this is called learning rate annealing or adaptive learning rates. The simplest and perhaps most used adaptations of the learning rate during training are techniques that reduce the learning rate over time. These have the benefit of making large changes at the beginning of the training procedure, when larger learning rate values are used, and decreasing the learning rate so that smaller updates are made to the weights later in the training procedure. This has the effect of quickly learning good weights early and fine-tuning them later.

Two popular and easy-to-use learning rate decay schedules are as follows (a short sketch of both appears after the list):

  • Decrease the learning rate gradually based on the epoch.
  • Decrease the learning rate using punctuated large drops at specific epochs.
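Here is a small sketch of both schedules (the function names and constants are my own illustrative choices, not from the article):

    def time_based_decay(initial_lr, epoch, decay=0.01):
        # Gradual decay: the rate shrinks smoothly as epochs accumulate.
        return initial_lr / (1.0 + decay * epoch)

    def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
        # Punctuated decay: a large drop every `epochs_per_drop` epochs.
        return initial_lr * (drop ** (epoch // epochs_per_drop))

    for epoch in (0, 10, 20, 30):
        print(epoch, time_based_decay(0.1, epoch), step_decay(0.1, epoch))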

4 — Dropout

Deep neural networks with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Moreover, large networks are slow to use, which makes it difficult to deal with overfitting at test time by combining the predictions of many different large neural nets. Dropout is a technique for addressing this problem.

[Figure: dropout]

The key idea is to randomly drop units (along with their connections) from the neural network during training, thereby preventing overfitting. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network with smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. Dropout has been shown to improve the performance of neural networks on supervised learning tasks in computer vision, speech recognition, document classification, and computational biology, obtaining state-of-the-art results on many benchmark datasets.
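As a minimal illustration, here is the common “inverted dropout” variant, where the rescaling is folded into training so the unthinned test-time network is used unchanged (the function and names are my own, for illustration):

    import numpy as np

    def dropout_forward(activations, p_drop=0.5, training=True):
        # At test time, use the full (unthinned) network as-is.
        if not training:
            return activations
        # Randomly silence units; rescale survivors so expected values match.
        mask = (np.random.rand(*activations.shape) >= p_drop) / (1.0 - p_drop)
        return activations * mask

    h = np.random.randn(4, 8)              # activations of one hidden layer
    print(dropout_forward(h, p_drop=0.5))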

5 — Max Pooling

Max pooling is a sample-based discretization process. The objective is to down-sample an input representation (an image, a hidden-layer output matrix, etc.), reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions.

[Figure: max pooling]

This is done in part to help prevent overfitting, by providing an abstracted form of the representation. It also reduces the computational cost by reducing the number of parameters to learn, and it provides basic translation invariance to the internal representation. Max pooling is done by applying a max filter to (usually) non-overlapping sub-regions of the initial representation.
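A minimal sketch of 2x2 max pooling with stride 2 over a single feature map (the function and the toy example are my own illustrative choices):

    import numpy as np

    def max_pool_2x2(x):
        # 2x2 max pooling with stride 2 over an (H, W) feature map:
        # group pixels into non-overlapping 2x2 blocks, keep each block's maximum.
        h, w = x.shape
        return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    x = np.arange(16).reshape(4, 4)
    print(max_pool_2x2(x))                 # [[ 5  7]
                                           #  [13 15]]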

6 — Batch Normalization

Of course, neural networks, deep networks included, require careful tuning of weight initialization and learning parameters. Batch normalization helps make this a little easier.

The weight problem:

  • Whatever the initialization of the weights, random or empirically chosen, they are far from the learned weights. Consider a mini-batch: during the initial epochs, there will be many outliers in terms of the required feature activations.

  • A deep neural network is by itself ill-conditioned, i.e., a small perturbation in the initial layers leads to a large change in the later layers.

During back-propagation, these phenomena cause distraction to the gradients, meaning the gradients have to compensate for the outliers before learning weights that produce the required outputs. This leads to extra epochs being needed to converge.

[Figure: batch normalization]

Batch normalization brings these gradients from scattered to normal values and makes them flow towards the common goal (by normalizing them) within the range of the mini-batch.

The learning rate problem: Generally, learning rates are kept small so that only a small portion of the gradient corrects the weights, the reason being that the gradients from outlier activations should not affect already-learned activations. With batch normalization, these outlier activations are reduced, and hence higher learning rates can be used to accelerate the learning process.
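As a minimal sketch of the training-time forward pass (the names are mine; a real layer also learns gamma and beta and tracks running statistics for use at inference):

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        mu = x.mean(axis=0)                     # per-feature mean over the batch
        var = x.var(axis=0)                     # per-feature variance over the batch
        x_hat = (x - mu) / np.sqrt(var + eps)   # normalize: zero mean, unit variance
        return gamma * x_hat + beta             # learnable scale and shift

    x = np.random.randn(32, 4) * 10 + 5         # activations at an awkward scale
    out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1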

7 — Long Short-Term Memory

An LSTM network differs from the usual neurons in other recurrent neural networks in the following three ways:

  1. It has control over deciding when to let input enter the neuron;

  2. It has control over deciding when to remember what was computed in the previous time step;

  3. It has control over deciding when to let the output pass on to the next time step.

The power of the LSTM is that it decides all of the above based on the current input itself. Take a look at the diagram below:

[Figure: an LSTM cell and its gates]

The input signal x(t) at the current time stamp decides all three of the points above. The input gate takes the decision for point 1, the forget gate for point 2, and the output gate for point 3. The input alone is able to take all three decisions. This is inspired by how our brains work, and it can handle sudden context switches based on the input.
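As a minimal sketch of one LSTM time step in the standard formulation, where each gate reads the current input and the previous hidden state (the weight layout and names here are my own illustrative assumptions, not from the article):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W, U, b stack the parameters of the four gates row-wise.
        z = W @ x + U @ h_prev + b
        n = h_prev.size
        i = sigmoid(z[:n])          # input gate: what may enter the cell (point 1)
        f = sigmoid(z[n:2*n])       # forget gate: what memory to keep (point 2)
        o = sigmoid(z[2*n:3*n])     # output gate: what to pass on (point 3)
        g = np.tanh(z[3*n:])        # candidate cell content
        c = f * c_prev + i * g      # updated long-term (cell) state
        h = o * np.tanh(c)          # gated hidden state passed to the next step
        return h, c

    n_in, n_hid = 3, 4
    W = np.random.randn(4 * n_hid, n_in)
    U = np.random.randn(4 * n_hid, n_hid)
    b = np.zeros(4 * n_hid)
    h, c = lstm_step(np.random.randn(n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
    print(h.shape, c.shape)         # (4,) (4,)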

8 — Skip-gram

The goal of word embedding models is to learn a high-dimensional dense representation for each vocabulary term, in which the similarity between embedding vectors reflects the semantic or syntactic similarity between the corresponding words. Skip-gram is a model for learning word embeddings.

The main idea behind the skip-gram model (and many other word embedding models) is this: two vocabulary terms are similar if they share similar contexts.

[Figure: the skip-gram model]

In other words, suppose you have a sentence like “cats are mammals”. If you use the term “dogs” instead of “cats”, the sentence is still meaningful. So in this example, “dogs” and “cats” share a similar context (i.e., “are mammals”).

Based on the above hypothesis, you can consider a context window (a window containing K consecutive terms). Then you skip one of these words and try to learn a neural network that gets all the terms except the one skipped and predicts the skipped term. Therefore, if two words repeatedly share similar contexts in a large corpus, the embedding vectors of those terms will be similar.
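In the most common formulation, the network predicts each surrounding context word from the center word. Here is a toy sketch of how such training pairs are generated (the corpus and window size are my own illustrative choices):

    corpus = "cats are mammals and dogs are mammals".split()
    window = 2                      # look `window` words to each side of the center

    pairs = []
    for i, center in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                pairs.append((center, corpus[j]))   # (center, context) example

    print(pairs[:4])
    # [('cats', 'are'), ('cats', 'mammals'), ('are', 'cats'), ('are', 'mammals')]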

9 — Continuous Bag of Words

In natural language processing, we want to learn to represent each word in a document as a vector of numbers, such that words that appear in similar contexts have vectors that are close to each other. In the continuous bag-of-words model, the goal is to use the context surrounding a particular word to predict that word.

[Figure: the continuous bag-of-words model]

We do this by sampling lots and lots of sentences from a large corpus; every time we see a word, we also take its surrounding context words. Then we feed the context words into a neural network and predict the word at the center of this context.

When we have thousands of such context words and center words, we have instances of a dataset for the neural network. We train the neural network, and in the end the encoded hidden-layer output represents the embedding for a particular word. It so happens that when we train this over a large number of sentences, words in similar contexts get similar vectors.
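Here is a toy sketch of the CBOW forward pass (the vocabulary, dimensions, and untrained random weights are my own illustrative assumptions): the context embeddings are averaged and scored against every vocabulary word to predict the center word.

    import numpy as np

    vocab = {"cats": 0, "are": 1, "mammals": 2, "dogs": 3}
    V, D = len(vocab), 8                  # vocabulary size, embedding dimension
    W_in = np.random.randn(V, D) * 0.01   # input embeddings (the vectors we keep)
    W_out = np.random.randn(D, V) * 0.01  # output projection

    def cbow_forward(context_words):
        ids = [vocab[w] for w in context_words]
        h = W_in[ids].mean(axis=0)        # average the context embeddings
        scores = h @ W_out                # one score per candidate center word
        e = np.exp(scores - scores.max())
        return e / e.sum()                # softmax over the vocabulary

    probs = cbow_forward(["cats", "mammals"])   # the context around "are"
    print(probs.argmax(), round(probs.sum(), 3))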

10 — Transfer Learning

Let’s consider how an image actually flows through a convolutional neural network. Say you have an image, you apply convolution to it, and you get combinations of pixels as outputs. Suppose these outputs are edges; now apply convolution again, and your output will be combinations of edges, or lines. Apply convolution again, and your output will be combinations of lines, and so on. You can think of it as each layer looking for a specific pattern. The last layer of a neural network tends to become very specialized. If you were working on ImageNet, your network’s last layer would be looking for whole patterns such as children, dogs, or airplanes. A couple of layers back, you might see the network looking for components such as eyes, ears, mouths, or wheels.

[Figure: transfer learning]

Each layer of a deep convolutional neural network progressively builds up higher- and higher-level representations of features, and the last few layers tend to be specialized to whatever data you fed the model. The early layers, on the other hand, are much more generic, finding simple patterns common to a much larger class of pictures.

Transfer learning is when you take a CNN trained on one dataset, chop off the last layer, and retrain the model’s last layer on a different dataset. Intuitively, you are retraining the model to recognize different high-level features. As a result, training time is drastically reduced, which makes transfer learning a useful tool when you don’t have enough data or when training would take too many resources.
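The article doesn’t tie this to a specific library; as a hedged sketch in tf.keras, assuming a hypothetical 5-class target task, one might freeze a network pre-trained on ImageNet and retrain only a new final layer:

    import tensorflow as tf

    # Load a CNN pre-trained on ImageNet, without its specialized last layer.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(160, 160, 3), include_top=False,
        weights="imagenet", pooling="avg")
    base.trainable = False                # freeze the generic early layers

    # Attach a new "last layer" for the hypothetical 5-class target task.
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(5, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()                       # only the new head's weights will train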

This post has been a brief introduction to some methods in deep learning; if you want a deeper understanding of them, I encourage further reading.

Deep learning is heavily technique-focused: there are few elaborate explanations for each new idea, and most new ideas instead come with experimental results attached to prove that they work. Deep learning is like playing with LEGO. Mastering LEGO is as challenging as mastering any other art, but getting started with it is comparatively easy.

 

Original article: https://towardsdatascience.com/the-10-deep-learning-methods-ai-practitioners-need-to-apply-885259f402c1
