Dense-Captioning Events in Videos
Info
- Project page: http://cs.stanford.edu/people/ranjaykrishna/densevid/
The paper makes the following contributions:
- A new model that:
  - identifies all events in a single pass of the video;
  - describes each detected event with natural language;
  - uses a variant of an existing proposal module, designed to capture both short events and long events that span minutes.
- Capturing dependencies between events: a new captioning module that uses contextual information from past and future events to jointly describe all events.
- A new dataset: ActivityNet Captions.
Dense-captioning events model
Goal: design an architecture that
- jointly localizes temporal proposals of interest
- and then describes each with natural language.
Input: a sequence of video frames
Output: a set of sentences, each with start and end times
Event proposal module
Framework: the video frames are first fed into C3D to extract features, which are then passed to the proposal module (DAPs) to obtain proposals, each with start and end times, a confidence score, and a hidden representation.
Changes to DAPs: the training of DAPs is not modified; the model is only changed at inference time, outputting K proposals at every time step, each proposing an event with offsets.
While traditional DAPs uses non-maximum suppression to eliminate overlapping outputs, here the overlapping proposals are kept and treated as individual events.
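The inference-time change can be sketched as follows. This is a hypothetical illustration (the function name, input format, and `k` are assumptions, not from the paper): at every time step the proposal network emits candidate (offset, score) pairs, and the top K are all kept as separate events instead of being pruned by non-maximum suppression.

```python
# Hypothetical sketch: keep the K highest-scoring proposals at every
# time step, with no non-maximum suppression across overlapping events.
def collect_proposals(per_step_outputs, k=3):
    """per_step_outputs: list of (t_end, [(offset, score), ...]) per time step.
    Returns a list of (start, end, score) events."""
    events = []
    for t_end, candidates in per_step_outputs:
        # take the K highest-scoring offsets at this step; overlaps are kept
        top_k = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        for offset, score in top_k:
            events.append((t_end - offset, t_end, score))
    return events
```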
Captioning module with context
The module draws on temporal context. For a given event, all other events are partitioned into two groups: past and future. An event that is concurrent with the current one is classified as past if it ends before the current event ends, and as future otherwise. The final feature representation concatenates the past and future context with the hidden representation of the current event.
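The past/future partition described above can be sketched as a small helper. This is a hypothetical illustration (the function name and the (start, end) tuple format are assumptions): every other event is "past" if it ends before the current event ends, including concurrent events that finish earlier, and "future" otherwise.

```python
# Hypothetical sketch of the past/future split relative to the current event.
def split_context(current, others):
    """current and each element of others are (start, end) tuples."""
    past = [e for e in others if e[1] < current[1]]    # ends before current ends
    future = [e for e in others if e[1] >= current[1]]  # ends at or after
    return past, future
```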
实现细节
Loss: there are two losses, one for the proposal module and one for the captioning module. The total loss is a weighted combination of the two.
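As a sketch, assuming a simple weighted sum with weighting hyperparameters $\lambda_1$ and $\lambda_2$ (the symbols and exact values are assumptions, not reproduced from the paper), the combined objective can be written as:

```latex
\mathcal{L} = \lambda_1 \, \mathcal{L}_{\text{proposal}} + \lambda_2 \, \mathcal{L}_{\text{caption}}
```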
Training and optimization:
- train the full dense-captioning model by alternating between training the language model and the proposal module every 500 iterations.
- first train the captioning module by masking all neighboring events for 10 epochs before adding in the context features.
- initialize all weights using a Gaussian with standard deviation of 0.01.
- stochastic gradient descent with momentum 0.9.
- learning rate : 0.01 for the language model and 0.001 for the proposal module.
- For efficiency, we do not finetune the C3D feature extraction.
- training batch size is set to 1.
- all sentences are capped at a maximum length of 30 words.
- implemented in PyTorch 0.1.10.
- one mini-batch runs in approximately 15.84 ms on a Titan X GPU, and it takes 2 days for the model to converge.
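The alternating schedule above can be sketched as a small scheduling function. This is a hypothetical illustration (the function name and return labels are assumptions): every 500 iterations, training switches between the language model and the proposal module.

```python
# Hypothetical sketch of the alternating training schedule: switch the
# sub-module being updated every `period` iterations.
def module_to_train(iteration, period=500):
    """Return which sub-module is updated at this iteration."""
    return "language" if (iteration // period) % 2 == 0 else "proposal"
```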