Action Recognition and Video Description using Visual Attention

Shikhar Sharma


We propose soft-attention-based models for the tasks of action recognition in videos and generating natural language descriptions of videos. We use multi-layered Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units which are deep both spatially and temporally. Our model learns to focus selectively on parts of the video frames and classifies videos after taking a few glimpses. It is also able to generate sentences describing the videos using spatio-temporal glimpses across them. The model essentially learns which parts of the frames are relevant to the task at hand and attaches higher importance to them. We evaluate the action recognition model on the UCF-11 (YouTube Action), HMDB-51, and Hollywood2 datasets and analyze how the model focuses its attention depending on the scene and the action being performed. We evaluate the description generation model on the YouTube2Text dataset and visualize the model's attention as it generates words.
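As a rough illustration of the soft-attention idea the abstract describes, the sketch below computes a weighted average ("glimpse") over the spatial locations of one frame's convolutional feature map, with weights conditioned on the previous LSTM hidden state. This is a minimal NumPy sketch under assumed shapes; the names `soft_attention_glimpse`, `W_att`, and `b_att` are hypothetical and not taken from the thesis, which should be consulted for the exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention_glimpse(features, h_prev, W_att, b_att=0.0):
    """Soft attention over K*K feature-map locations (illustrative sketch).

    features : (K*K, D) conv features of one video frame, one row per location
    h_prev   : (H,)     previous LSTM hidden state
    W_att    : (D, H)   hypothetical attention projection (not from the thesis)
    """
    # One relevance score per spatial location, conditioned on h_prev.
    scores = features @ W_att @ h_prev + b_att          # (K*K,)
    # Normalize scores into an attention distribution over locations.
    alpha = softmax(scores)                             # (K*K,), sums to 1
    # Glimpse = expected feature vector under the attention distribution;
    # this (not the full frame) would be fed to the LSTM at each time step.
    glimpse = alpha @ features                          # (D,)
    return glimpse, alpha
```

Because the weights form a probability distribution over locations, they can also be visualized as a heat map over the frame, which is how attention maps of this kind are typically inspected.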


@mastersthesis{sharma2016action,
  author  = {Shikhar Sharma},
  title   = {Action Recognition and Video Description using Visual Attention},
  school  = {University of Toronto},
  address = {Toronto, Canada},
  year    = {2016},
  month   = {February},
  url     = {}
}