An explanation of the Xavier and MSRA weight initializations in Caffe
If you work through the Caffe MNIST tutorial, you’ll come across this curious line
weight_filler { type: "xavier" }
and the accompanying explanation
For the weight filler, we will use the xavier algorithm that automatically determines the scale of initialization based on the number of input and output neurons.
Unfortunately, as of the time this post was written, Google hasn’t heard much about “the xavier algorithm”. To work out what it is, you need to poke around the Caffe source until you find the right docstring and then read the referenced paper, Xavier Glorot & Yoshua Bengio’s Understanding the difficulty of training deep feedforward neural networks.
Why’s Xavier initialization important?
In short, it helps signals reach deep into the network.
- If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
- If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.
Xavier initialization makes sure the weights are ‘just right’, keeping the signal in a reasonable range of values through many layers.
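To see this concretely, here's a small numerical sketch (my own illustration, not from Caffe): a unit-variance signal pushed through a stack of purely linear layers, with the layer width and depth chosen arbitrarily.

```python
import numpy as np

# Push a unit-variance signal through 50 fully connected linear layers
# of width 512 and watch its standard deviation for three weight scales.
rng = np.random.default_rng(0)
n = 512
x0 = rng.standard_normal(n)

for scale, label in [(0.01, "too small"), (1.0, "too large"),
                     (np.sqrt(1.0 / n), "xavier")]:
    x = x0.copy()
    for _ in range(50):
        W = rng.standard_normal((n, n)) * scale
        x = W @ x  # linear layer, no nonlinearity
    print(f"{label:>9}: std after 50 layers = {x.std():.3e}")

# too small: the signal collapses toward zero
# too large: the signal blows up
#    xavier: the signal's scale stays roughly constant
```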
To go any further than this, you’re going to need a small amount of statistics - specifically you need to know about random distributions and their variance.
Okay, hit me with it. What’s Xavier initialization?
In Caffe, it's initializing the weights in your network by drawing them from a distribution with zero mean and a specific variance,

$$\mathrm{Var}(W) = \frac{1}{n_{\text{in}}}$$

where $W$ is the initialization distribution for the neuron in question, and $n_{\text{in}}$ is the number of neurons feeding into it.

It's worth mentioning that Glorot & Bengio's paper originally recommended using

$$\mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

where $n_{\text{out}}$ is the number of neurons the result is fed to.
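As a concrete illustration, here's a minimal numpy sketch of both variants for a single fully connected layer; the `fan_in`/`fan_out` names and the layer shape are illustrative choices, not Caffe's API. (If I read the Caffe source right, its `XavierFiller` draws from a uniform distribution on $[-\sqrt{3/n_{\text{in}}}, \sqrt{3/n_{\text{in}}}]$, which has exactly the variance above.)

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 300, 100  # illustrative layer sizes

# Caffe-style "xavier": variance 1 / fan_in, drawn here from a uniform
# distribution on [-sqrt(3/fan_in), sqrt(3/fan_in)].
limit = np.sqrt(3.0 / fan_in)
W_caffe = rng.uniform(-limit, limit, size=(fan_out, fan_in))

# Glorot & Bengio's original recommendation: variance 2 / (fan_in + fan_out),
# drawn here from a zero-mean Gaussian.
W_glorot = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)),
                      size=(fan_out, fan_in))

print(W_caffe.var(), 1 / fan_in)               # both ~0.00333
print(W_glorot.var(), 2 / (fan_in + fan_out))  # both ~0.005
```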
And where did those formulas come from?
Suppose we have an input $X$ with $n$ components and a linear neuron with random weights $W$ that spits out a number $Y$. What's the variance of $Y$? Well, we can write

$$Y = W_1 X_1 + W_2 X_2 + \dots + W_n X_n$$

And from Wikipedia we can work out that $W_i X_i$ is going to have variance

$$\mathrm{Var}(W_i X_i) = E[X_i]^2 \mathrm{Var}(W_i) + E[W_i]^2 \mathrm{Var}(X_i) + \mathrm{Var}(W_i)\,\mathrm{Var}(X_i)$$

Now if our inputs and weights both have mean 0, that simplifies to

$$\mathrm{Var}(W_i X_i) = \mathrm{Var}(W_i)\,\mathrm{Var}(X_i)$$

Then if we make a further assumption that the $X_i$ and $W_i$ are all independent and identically distributed, we can work out that the variance of $Y$ is

$$\mathrm{Var}(Y) = \mathrm{Var}(W_1 X_1 + \dots + W_n X_n) = n\,\mathrm{Var}(W_i)\,\mathrm{Var}(X_i)$$

Or in words: the variance of the output is the variance of the input, scaled by $n\,\mathrm{Var}(W_i)$. So if we want the variance of the input and output to be the same, $n\,\mathrm{Var}(W_i)$ needs to equal 1, which means the variance of the weights should be

$$\mathrm{Var}(W_i) = \frac{1}{n} = \frac{1}{n_{\text{in}}}$$
Voila. There’s your Caffe-style Xavier initialization.
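If you don't trust the algebra, a quick Monte Carlo check (my own sketch, with arbitrary sizes and variances) confirms $\mathrm{Var}(Y) = n\,\mathrm{Var}(W_i)\,\mathrm{Var}(X_i)$:

```python
import numpy as np

# Empirically verify Var(Y) = n * Var(W) * Var(X) for the single
# linear neuron Y = sum_i W_i X_i, with zero-mean independent W and X.
rng = np.random.default_rng(0)
n, trials = 100, 200_000
var_w, var_x = 0.04, 1.5  # arbitrary illustrative variances

W = rng.normal(0.0, np.sqrt(var_w), size=(trials, n))
X = rng.normal(0.0, np.sqrt(var_x), size=(trials, n))
Y = (W * X).sum(axis=1)

print(Y.var())            # empirical variance of the output, ~6.0
print(n * var_w * var_x)  # predicted: 100 * 0.04 * 1.5 = 6.0
```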
Glorot & Bengio's formula needs a tiny bit more work. If you go through the same steps for the backpropagated signal, you find that you need

$$n_{\text{out}}\,\mathrm{Var}(W_i) = 1$$

to keep the variance of the input gradient and the output gradient the same. These two constraints can only be satisfied simultaneously if $n_{\text{in}} = n_{\text{out}}$, so as a compromise Glorot & Bengio take the average of the two:

$$\mathrm{Var}(W_i) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$
I'm not sure why the Caffe authors used the $n_{\text{in}}$-only variant. The two possibilities that come to mind are
- that preserving the forward-propagated signal is much more important than preserving the back-propagated one.
- that for implementation reasons, it’s a pain to find out how many neurons in the next layer consume the output of the current one.
That seems like an awful lot of assumptions.
It is. But it works. Xavier initialization was one of the big enablers of the move away from per-layer generative pre-training.
The assumption most worth talking about is the "linear neuron" bit. This is justified in Glorot & Bengio's paper because immediately after initialization, the parts of the traditional nonlinearities - $\tanh$ and the sigmoid - that are being explored are the bits close to zero, where the gradient is close to 1. For the more recent rectifying nonlinearities that doesn't hold, and in a later paper (Delving Deep into Rectifiers) He, Zhang, Ren & Sun build on Glorot & Bengio and suggest using

$$\mathrm{Var}(W) = \frac{2}{n_{\text{in}}}$$

instead; this is what Caffe's "msra" filler implements. Which makes sense: a rectifying linear unit is zero for half of its input, so you need to double the weight variance to keep the signal's variance constant.
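Here's a small sketch of that factor of two in action (again my own illustration, with arbitrary width and depth): with $\mathrm{Var}(W) = 1/n_{\text{in}}$ the post-ReLU signal loses half its second moment at every layer, while $2/n_{\text{in}}$ keeps it steady.

```python
import numpy as np

# Track the root-mean-square of the activations - the quantity the
# derivation preserves - through a stack of ReLU layers.
rng = np.random.default_rng(0)
n, depth = 512, 30
x0 = rng.standard_normal(n)

for var_w, label in [(1.0 / n, "xavier 1/n_in"),
                     (2.0 / n, "msra   2/n_in")]:
    x = x0.copy()
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))
        x = np.maximum(W @ x, 0.0)  # ReLU zeroes half the pre-activations
    rms = np.sqrt((x ** 2).mean())
    print(f"{label}: RMS after {depth} ReLU layers = {rms:.3e}")

# xavier: RMS shrinks by ~sqrt(2) per layer and all but vanishes
# msra:   RMS stays roughly where it started
```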