Dense-Captioning Events in Videos


Info

Project page: http://cs.stanford.edu/people/ranjaykrishna/densevid/

The paper makes the following contributions:

  • a new model:

    • identify all events in a single pass of the video
    • describing the detected events with natural language
    • a variant of an existing proposal module, designed to capture both short events and long events that span minutes
  • capturing dependencies between events: a new captioning module that uses contextual information from past and future events to jointly describe all events

  • a new dataset: ActivityNet Captions

Dense-captioning events model

Goal : design an architecture that

  • jointly localizes temporal proposals of interest
  • and then describes each with natural language.

Input: a sequence of video frames
Output: a set of sentences, each paired with a start and end time

Event proposal module

(Figure: overall framework of the dense-captioning events model)

Framework: the video frames are first passed through C3D to extract features, which are fed to the proposal module (DAPs). This yields proposals, each with a start and end time, a confidence score, and a hidden representation h_i. Proposals whose score exceeds a threshold are passed to the language model, which uses the hidden representation to perform video captioning, producing a description for each event.
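A minimal sketch of this inference flow, assuming hypothetical callables `c3d`, `proposal_module`, and `language_model` (these names and the `score_threshold` parameter are illustrative, not the paper's actual interfaces):

```python
def dense_caption(frames, c3d, proposal_module, language_model,
                  score_threshold=0.5):
    """Sketch: frames -> C3D features -> proposals -> captions.

    All callables and the threshold are stand-ins for illustration.
    """
    features = c3d(frames)                 # clip-level C3D features
    proposals = proposal_module(features)  # (start, end, score, hidden) tuples
    results = []
    for start, end, score, hidden in proposals:
        if score < score_threshold:        # only confident proposals are captioned
            continue
        sentence = language_model(hidden)  # caption from the hidden representation
        results.append((start, end, sentence))
    return results
```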

Changes to DAPs: We do not modify the training of DAPs and only change the model at inference time by outputting K proposals at every time step, each proposing an event with different offsets.

While traditional DAPs uses non-maximum suppression to eliminate overlapping outputs, here the overlapping proposals are kept and treated as individual events, as sketched below.
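A sketch of this inference-time change, with illustrative names and shapes (the DAPs internals are abstracted away as per-time-step outputs):

```python
def collect_proposals(step_outputs):
    """step_outputs: iterable of (t_end, scores, offsets, hidden) per time step,
    where scores and offsets each have length K (one entry per anchor).
    The names and shapes here are assumptions for illustration."""
    events = []
    for t_end, scores, offsets, hidden in step_outputs:
        for score, offset in zip(scores, offsets):  # K proposals per time step
            start = t_end - offset                  # each offset proposes a start time
            events.append((start, t_end, score, hidden))
    return events  # no non-maximum suppression: overlapping events are kept
```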

Captioning module with context

The captioning module draws on information from the temporal context. For a given event, all other events are split into two groups: past and future. A concurrent event that ends before the current event ends is classified as past; otherwise it is future. The past and future representations are defined as follows:


(Figure: definitions of the past and future context representations h_i^past and h_i^future, computed from the hidden representations h_j of the other events)

Here h_j is the hidden representation of each of the other events.
The resulting feature representation (h_i^past, h_i, h_i^future) is fed into an LSTM, which produces the description of the event.
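A sketch of assembling this context feature for one event. The simple mean pooling over past and future hidden states is an assumption made to keep the example short; the paper weights the neighboring events before pooling:

```python
import torch

def context_features(h_i, others, t_end_i):
    """h_i: (D,) hidden state of the current event; others: list of
    (t_end_j, h_j) pairs for all other events. Mean pooling is a
    simplification of the paper's weighted pooling."""
    past = [h for t_end, h in others if t_end < t_end_i]     # end before current event
    future = [h for t_end, h in others if t_end >= t_end_i]  # end after it
    zeros = torch.zeros_like(h_i)
    h_past = torch.stack(past).mean(dim=0) if past else zeros
    h_future = torch.stack(future).mean(dim=0) if future else zeros
    return torch.cat([h_past, h_i, h_future], dim=-1)  # input to the captioning LSTM
```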

Implementation details

Loss: there are two losses, one for the proposal module and another for the captioning model. The total loss is:

L = λ1 · L_cap + λ2 · L_prop

where λ1 = 1.0 and λ2 = 0.1.
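As code, with the weights above (loss_cap and loss_prop are placeholders for values computed elsewhere):

```python
def total_loss(loss_cap, loss_prop, lambda_cap=1.0, lambda_prop=0.1):
    """Joint objective L = λ1·L_cap + λ2·L_prop, with λ1 = 1.0 and λ2 = 0.1."""
    return lambda_cap * loss_cap + lambda_prop * loss_prop
```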

Training and optimization:

  • train the full dense-captioning model by alternating between training the language model and the proposal module every 500 iterations
  • first train the captioning module with all neighboring events masked for 10 epochs before adding in the context features
  • initialize all weights using a Gaussian with standard deviation 0.01
  • train with stochastic gradient descent with momentum 0.9
  • learning rate: 0.01 for the language model and 0.001 for the proposal module
  • for efficiency, the C3D feature extraction is not fine-tuned
  • training batch size is set to 1
  • all sentences are capped at a maximum length of 30 words

  • implemented in PyTorch 0.1.10
  • one mini-batch runs in approximately 15.84 ms on a Titan X GPU, and the model takes 2 days to converge
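A sketch of the training schedule above, assuming `language_model` and `proposal_module` are `torch.nn.Module` instances and `compute_losses` is a hypothetical helper returning the two losses for a batch:

```python
import torch

def train(batches, language_model, proposal_module, compute_losses):
    # Per the paper: Gaussian init with std 0.01, SGD with momentum 0.9,
    # lr 0.01 (language model) / 0.001 (proposal module), alternating
    # which module is trained every 500 iterations, batch size 1.
    for module in (language_model, proposal_module):
        for p in module.parameters():
            if p.dim() > 1:  # init weight matrices; bias handling is an assumption
                torch.nn.init.normal_(p, std=0.01)
    opt_lang = torch.optim.SGD(language_model.parameters(), lr=0.01, momentum=0.9)
    opt_prop = torch.optim.SGD(proposal_module.parameters(), lr=0.001, momentum=0.9)

    for it, batch in enumerate(batches):         # one sample per batch
        train_lang = (it // 500) % 2 == 0        # alternate every 500 iterations
        opt = opt_lang if train_lang else opt_prop
        loss_cap, loss_prop = compute_losses(batch)
        loss = 1.0 * loss_cap + 0.1 * loss_prop  # λ1 = 1.0, λ2 = 0.1
        opt.zero_grad()
        loss.backward()
        opt.step()
```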
