How large should the batch size be for SGD?
The "sample size" you're talking about is referred to as the batch size, B.
To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Here is the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types [2]:

θ(t+1) ← θ(t) − ε(t) · (1/B) · Σ_{b=1..B} ∂L(θ(t), m_b)/∂θ

where θ are the weights, ε(t) is the learning rate at epoch t, B is the batch size, and m_b are the examples in the mini-batch m. The three variants differ only in the choice of B:

- Batch gradient descent: B = |x| (the entire training set)
- Online stochastic gradient descent: B = 1
- Mini-batch stochastic gradient descent: 1 < B < |x|
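As a minimal sketch of this update rule (the linear least-squares loss and all function names here are my own illustration, not from the original answer or the cited papers), a single step might look like:

```python
import numpy as np

def sgd_step(theta, X_batch, y_batch, eps):
    """One gradient step for squared-error linear regression.

    The gradient is averaged over the B examples in the batch, so the
    same function covers batch GD (B = |x|), online SGD (B = 1), and
    mini-batch SGD (1 < B < |x|) depending on how the batch is chosen.
    """
    B = X_batch.shape[0]
    residual = X_batch @ theta - y_batch     # predictions minus targets
    grad = X_batch.T @ residual / B          # gradient averaged over the batch
    return theta - eps * grad
```

Passing a single example gives online SGD, passing the whole dataset gives batch gradient descent, and anything in between is mini-batch SGD.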
Note that with B = |x| the loss is computed over the entire training set, so it is no longer a random variable and the update is not a stochastic approximation.
SGD converges faster than normal "batch" gradient descent because it updates the weights after looking at a randomly selected subset of the training set. Let x be our training set and let m ⊂ x be such a randomly selected subset; the batch size B is just the cardinality of m: B = |m|. Batch gradient descent updates the weights θ using the gradients of the entire dataset x, whereas SGD updates the weights using the average of the gradients over a mini-batch m.
Each such sample we take to update the weights is called a mini-batch; each time we run through the entire dataset is called an epoch.
Let's say that we have a dataset x of T training examples, an initial weight vector θ(0), and a loss function L(θ, x) that we are trying to minimize. Given a batch size B, we can split the T training examples into C = ⌈T/B⌉ mini-batches. For simplicity we can assume that T is evenly divisible by B; when it is not, as is often the case, each mini-batch should be given a proper weight as a function of its size.
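A quick worked example of the split (T = 1000 and B = 32 are hypothetical numbers, chosen only to illustrate):

```python
import math

T, B = 1000, 32
C = math.ceil(T / B)                 # 32 mini-batches per epoch
last = T - (C - 1) * B               # the final mini-batch has only 8 examples

# One reasonable weighting: scale each mini-batch's contribution
# by its size relative to the nominal batch size B.
weights = [min(B, T - i * B) / B for i in range(C)]
print(C, last, weights[-1])          # 32 8 0.25
```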
An iterative algorithm for SGD with M epochs is given below:

t ← 0
while t < M:
    θ(t+1) ← θ(t) − ε(t) · (1/B) · Σ_{b=1..B} ∂L(θ(t), m_b)/∂θ
    t ← t + 1
Note: in real life we're reading these training example data from memory and, due to cache pre-fetching and other memory tricks done by your computer, your algorithm will run faster if the memory accesses are coalesced, i.e. when you read the memory in order and don't jump around randomly. So, most SGD implementations shuffle the dataset and then load the examples into memory in the order that they'll be read.
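Putting the loop and the shuffle-then-read-in-order trick together, here is a runnable sketch (the least-squares loss and all names are hypothetical illustrations, not the original author's code):

```python
import numpy as np

def train_sgd(X, y, epochs=10, batch_size=32, lr_schedule=lambda t: 0.01):
    """Mini-batch SGD for linear least squares (illustrative only)."""
    T, D = X.shape
    theta = np.zeros(D)                      # theta(0): initial weights
    for t in range(epochs):
        eps = lr_schedule(t)                 # learning rate for this epoch
        # Shuffle once per epoch, then read mini-batches sequentially
        # so that memory accesses stay in order.
        perm = np.random.permutation(T)
        Xs, ys = X[perm], y[perm]
        for start in range(0, T, batch_size):
            Xb = Xs[start:start + batch_size]
            yb = ys[start:start + batch_size]
            B = Xb.shape[0]                  # the last batch may be smaller
            grad = Xb.T @ (Xb @ theta - yb) / B
            theta = theta - eps * grad
    return theta
```

Note that X[perm] materializes the shuffled rows as a new contiguous array, so the inner loop then walks through memory sequentially instead of jumping around.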
The major parameters for the vanilla (no momentum) SGD described above are:
- Learning rate ε: I like to think of epsilon as a function from the epoch count to a learning rate; this function is called the learning rate schedule. If you want to have the learning rate fixed, just define epsilon as a constant function (see the schedule sketch after this list).
- Batch size B: the batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal will be; the higher it is, the longer it will take to compute the gradient for each step (the second sketch after this list makes this trade-off concrete).
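As an example of a learning rate schedule as a function from epoch count to learning rate (the particular constants 0.01, 0.1, 0.5, and 10 are made up for illustration):

```python
def constant_schedule(t):
    return 0.01                        # fixed learning rate for every epoch

def step_decay_schedule(t):
    return 0.1 * (0.5 ** (t // 10))    # halve the rate every 10 epochs
```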
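And a small experiment (entirely my own illustration, with synthetic data) that makes the batch-size trade-off concrete: estimate the same gradient from many random mini-batches of each size and compare how far the estimates stray from the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.ones(5) + rng.normal(size=10_000)
theta = np.zeros(5)

full_grad = X.T @ (X @ theta - y) / len(y)    # the exact batch gradient

for B in (1, 32, 1024):
    errors = []
    for _ in range(200):
        idx = rng.choice(len(y), size=B, replace=False)
        g = X[idx].T @ (X[idx] @ theta - y[idx]) / B
        errors.append(np.linalg.norm(g - full_grad))
    print(B, np.mean(errors))   # the average error shrinks as B grows
```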
Citations & Further Reading:
1. Introduction to Gradient Based Learning
2. Practical recommendations for gradient-based training of deep architectures
3. Efficient Mini-batch Training for Stochastic Optimization