End-to-end Learning of Action Detection from Frame Glimpses in Videos
Serena Yeung, Olga Russakovsky, Greg Mori, Li Fei-Fei
Summary
Papers I’ve read so far only do video classification: classify a video based on its activity, assuming there is generally one activity per video.
This paper proposes a method to identify not only whether an action occurs, but also where it occurs in time. Prior work relies on frame/segment-level analysis at multiple time scales; this method is trained end-to-end.
Architecture: inspired by the RAM model (a code sketch of one glimpse step follows this list)
- observation network
- receives a location (in time) and the video frame at that location. The frame is processed by a CNN (VGG-net features), the location by another net; the two are combined and passed through a fully connected net.
- output of this net is directly passed as input to the recurrent net
- recurrent network
- receives the processed input from the observation network and the previous hidden state
- outputs:
- candidate detection
- network that outputs the tuple (start time, end time, confidence)
- indicator that says whether to emit the candidate detection at that timestep
- network that parameterizes a Bernoulli distribution that is sampled from; at test time, the MAP estimate is used
- location of next frame to observe
- network that parameterizes the mean of a Gaussian distribution (with fixed variance) that is sampled from; at test time, the MAP estimate is used.
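To make the dataflow concrete, here is a minimal PyTorch-style sketch of one glimpse step, based only on the description above. The module names (`ObservationNet`, `GlimpseRNN`), the layer sizes, the GRU cell, the concatenation-based fusion, and the sigmoid squashing are all my assumptions; the notes only specify VGG features for the frame, a separate net for the location, and a fully connected fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObservationNet(nn.Module):
    """Fuses a frame's CNN features with its (normalized) temporal location."""
    def __init__(self, feat_dim=4096, hidden_dim=1024):
        super().__init__()
        self.frame_fc = nn.Linear(feat_dim, hidden_dim)    # VGG features -> hidden
        self.loc_fc = nn.Linear(1, hidden_dim)             # scalar location -> hidden
        self.fuse_fc = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, frame_feat, loc):
        combined = torch.cat([F.relu(self.frame_fc(frame_feat)),
                              F.relu(self.loc_fc(loc))], dim=-1)
        return F.relu(self.fuse_fc(combined))              # observation vector for the RNN


class GlimpseRNN(nn.Module):
    """One recurrent step: consume an observation, emit the three outputs."""
    def __init__(self, obs_dim=1024, hidden_dim=1024):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden_dim)         # GRU assumed; an LSTM would also fit
        self.detection_fc = nn.Linear(hidden_dim, 3)       # (start, end, confidence)
        self.indicator_fc = nn.Linear(hidden_dim, 1)       # Bernoulli parameter
        self.location_fc = nn.Linear(hidden_dim, 1)        # Gaussian mean (fixed variance)

    def step(self, obs, h):
        h = self.rnn(obs, h)
        detection = torch.sigmoid(self.detection_fc(h))    # keeps the tuple in [0, 1] (assumed)
        p_emit = torch.sigmoid(self.indicator_fc(h))       # prob. of emitting the detection now
        loc_mean = torch.sigmoid(self.location_fc(h))      # normalized location of next glimpse
        return detection, p_emit, loc_mean, h
```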
Training (a code sketch of both loss terms follows this list):
- Uses backprop
- Candidate prediction
- Loss function boosts the confidence score of a detection that is matched to a ground truth instance
- Loss function (L2 norm) penalizes the distance between the detection window's start/end times and those of the matched ground truth
- Uses REINFORCE
- Indicator prediction
- Location prediction
- The reward function:
- penalizes the agent for being conservative (not outputting predictions)
- rewards positive detections, penalizes negative
- definition of a positive: overlap with ground truth above a certain threshold (a hyperparameter)
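A rough sketch of how the two training signals could look, reusing the hypothetical names from the sketch above. The exact matching procedure, reward values, and baseline are simplified or assumed here, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def supervised_detection_loss(pred, matched_gt):
    """Backprop loss for an emitted candidate matched to a ground-truth interval.

    pred:       tensor of shape (3,) = (start, end, confidence), all in [0, 1]
    matched_gt: tensor of shape (2,) = (start, end) of the matched ground truth
    """
    loc_loss = F.mse_loss(pred[:2], matched_gt)                   # L2 on start/end times
    conf_loss = F.binary_cross_entropy(pred[2:], torch.ones(1))   # push matched confidence up
    # (presumably an unmatched candidate would use a target of 0 for the confidence term)
    return loc_loss + conf_loss

def reinforce_loss(log_probs, reward, baseline=0.0):
    """Surrogate loss whose gradient is the REINFORCE estimate for the sampled
    (non-differentiable) indicator and location decisions.

    log_probs: list of log-probabilities of the sampled indicators / locations
    reward:    scalar episode reward (positive detections rewarded, negative ones
               penalized, being too conservative penalized)
    """
    advantage = reward - baseline                   # a baseline reduces gradient variance
    return -advantage * torch.stack(log_probs).sum()
```

During the forward pass one would sample the indicator from `torch.distributions.Bernoulli(p_emit)` and the next location from `torch.distributions.Normal(loc_mean, sigma)`, collect their `log_prob`s, and feed them to `reinforce_loss` alongside the episode reward.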
Things I learned
- Training is done as 1-vs-all for each class. This seems to be the way to go for activity detection (and sometimes video classification), and it does seem to give a performance bump.
- Why do you need REINFORCE if you could reparameterize any sampling steps?
- Even if the sampling were differentiable, the indexing isn't
- Selecting a frame (indexed from 1 to T) is a discrete operation and is not differentiable
- A good way to see why this is not differentiable is the DRAW paper: there, attention is made differentiable by using a Gaussian filter over the entire frame instead of an indexing operation (hard attention vs. soft attention); a toy example follows this list.
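A toy illustration of that distinction (my own example, not from either paper): a hard, index-based read of one frame blocks gradients and needs a score-function (REINFORCE) estimator, while a soft Gaussian-weighted read over all frames is differentiable end to end.

```python
import torch

T, D = 10, 4                                   # 10 frames, 4-dim features (toy sizes)
frames = torch.randn(T, D)
logits = torch.randn(T, requires_grad=True)    # attention scores over frame indices

# Hard attention: sample a discrete index. The indexing op gives no gradient
# w.r.t. `logits`, so the policy must be trained via the sample's log-probability.
dist = torch.distributions.Categorical(logits=logits)
idx = dist.sample()
hard_read = frames[idx]                        # no gradient path through `idx`
log_prob = dist.log_prob(idx)                  # used in a REINFORCE term instead

# Soft attention (the DRAW-style idea): weight every frame with a Gaussian over
# positions. Everything is differentiable, so no REINFORCE is needed.
center = torch.tensor(4.0, requires_grad=True)
positions = torch.arange(T, dtype=torch.float32)
weights = torch.softmax(-(positions - center) ** 2, dim=0)
soft_read = weights @ frames                   # gradients flow back to `center`
```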
Questions
- Why doesn’t the prediction indicator depend on the candidate detection confidence score?
- Their relation is incorporated into the loss function
- How many steps does it run for? Can this be made dynamic if it isn’t already?
- N is chosen beforehand (N = 6)
- Is a single location transformed into a 1024-dimensional vector? Does that make sense?
- How exactly is the ground truth labeled? And if the RNN makes multiple predictions (say, with the indicator being positive) and the candidate predictions are all active too, how …
- What keeps the tuple (start, end, confidence) bounded to the right range, say [0, 1]?
- Probably clipped; the paper doesn’t say
- Doesn’t the matching function’s definition make sure that every candidate detection will be matched to some ground truth, regardless of whether they are even close to each other?
- How are there multiple ground truth instances? Does that mean they aren’t contiguous?
- Intuitively this model doesn’t make sense; it doesn’t function the way a human would. How is a glimpse enough to make predictions for a video it hasn’t even seen yet?
- this model is likely learning to find and exploit biases in the training data, i.e. that each class has some very specific intervals and patterns. It might generalize poorly to real/out-of-training videos
- assumes start/end times are normalized
- assumes fixed-length videos
- no constraint that the end time has to be greater than the start time; it doesn’t look like they encode this in the model (a possible parameterization is sketched below)
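One way the end > start constraint could be encoded (my own suggestion, not something the paper does): predict a start and a strictly positive length instead of two free endpoints.

```python
import torch
import torch.nn as nn

class OrderedIntervalHead(nn.Module):
    """Hypothetical detection head that guarantees 0 < start < end < 1 by construction."""
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, h):
        raw_start, raw_len = self.fc(h).unbind(dim=-1)
        start = torch.sigmoid(raw_start)                  # normalized start in (0, 1)
        length = torch.sigmoid(raw_len) * (1.0 - start)   # positive, keeps end inside (0, 1)
        end = start + length                              # end > start by construction
        return start, end
```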
- Is this actually actor-critic? They compute an approximate policy gradient and also use a baseline/advantage to reduce variance, so there are two function approximators.
- How efficient is this model in training and testing? They say they use only 2% of the total frames, but is it efficient in other respects?
- Does REINFORCE make sense for this problem (as opposed to some other approach like Q-learning)? As in, in order to learn a good policy, it needs to randomly stumble into a good one.
- What is the use case for REINFORCE? Does it suit a certain class of problems where the state/action space is relatively constrained? I can imagine it taking forever in more complicated scenarios.
- How to encourage efficient parameter exploration? It feels like most of the magic here is in the reward function.