The ability to craft and understand stories is a crucial cognitive tool that humans use for communication. According to computational linguists, narrative theorists, and cognitive scientists, story understanding is a good proxy for measuring a reader's intelligence. Readers can treat story understanding as a form of problem solving in which, for example, they track how the main characters overcome obstacles throughout the story. Readers also need to make inferences, both in prospect and in retrospect, about the causal relationships between different events in the story.
In particular, video story data such as TV shows and movies can serve as an excellent testbed for evaluating human-level AI algorithms, for two reasons. First, video data combine multiple modalities: a sequence of images, audio (including dialogue, sound effects, and background music), and text (subtitles or added comments). Second, video data capture diverse cross-sections of everyday life. Understanding video stories can therefore be regarded as a significant challenge for current AI technology, as it involves analyzing and simulating human vision, language, thinking, and behavior.
Toward human-level video understanding, machine intelligence needs to extract meaningful information such as events from sequential multimodal video data, reason about the causal relationships between different events, and make inferences both in prospect and in retrospect about which events will occur and how they could occur. A story in a video is highly abstracted information consisting of a series of events across multiple scenes in a scenario.
In this workshop, we emphasize the need for findings and insights from various research domains for video story understanding. We aim to invite experts from a variety of related fields, including vision, language processing, computational narratology, and neuro-symbolic computing, to provide perspectives on existing research and to initiate discussion of future challenges in data-driven video understanding. Topics of interest include, but are not limited to:
We invite paper submissions as extended abstracts of up to 4 pages, excluding references and supplementary materials. All submissions must be in PDF format as a single file (including supplementary materials), use the templates below, and be submitted through this CMT link. The review process is single-round and double-blind, and all submissions must be anonymized.
All accepted papers will be presented as posters during the workshop and listed on the website. Additionally, a small number of accepted papers will be selected to be presented as contributed talks.
Note that this workshop will not publish official proceedings, and accepted submissions will not be counted as publications. We encourage submissions of relevant work that has been previously published or that will be presented at the main conference.
Paper Submission Deadline | September 10, 2019 (GMT+9) |
---|---|
Notification to Authors | October 7, 2019 |
Paper Camera-Ready Deadline | October 18, 2019 |
Workshop Date | November 2, 2019 |
Time | Presentation |
---|---|
08:30 - 08:45 | Opening Remarks: Video Turing Test, Byoung-Tak Zhang (Seoul National University) |
08:45 - 09:15 | Invited Talk 1: Video Understanding: Action, Activities and Beyond, Leonid Sigal (University of British Columbia) |
09:15 - 09:45 | Invited Paper 1: VideoMem: Constructing, Analyzing, Predicting Short-Term and Long-Term Video Memorability (Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Martin Engilberge); Invited Paper 2: Progressive Attention Memory Network for Movie Story Question Answering (Junyeong Kim, Minuk Ma, Kyungsu Kim, Sungjin Kim, Chang D. Yoo); Invited Paper 3: HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic) |
09:45 - 10:15 | Spotlight Talks (5 minutes each): (1) DIFRINT: Deep Iterative Frame Interpolation for Full-frame Video Stabilization (Jinsoo Choi, In So Kweon); (2) Adversarial Inference for Multi-Sentence Video Description (Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach); (3) Robust Person Re-identification via Graph Convolution Networks (Guisik Kim, Dongwook Shu, Junseok Kwon); (4) Enhancing Performance of Character Identification on Multiparty Dialogues of Drama via Multimodality (Donghwan Kim); (5) Dual Attention Networks for Visual Reference Resolution in Visual Dialog (Gi-Cheon Kang, Jaeseo Lim, Byoung-Tak Zhang); (6) Event Structure Frame-Annotated WordNet for Multimodal Inferencing (Seohyun Im) |
10:15 - 11:25 | Coffee Break & Poster Session |
11:25 - 11:55 | Invited Talk 2: High-level Video Understanding via Adversarial Inference and Spatio-temporal Graphs, Trevor Darrell (University of California, Berkeley) |
11:55 - 12:25 | Invited Talk 3: Video Recognition from a Story, Cees Snoek (University of Amsterdam) |
12:25 - 12:30 | Closing |
Invited Speaker 1: Leonid Sigal, University of British Columbia
Title: Video Understanding: Action, Activities and Beyond
Abstract: Automatic understanding and interpretation of videos is one of the core challenges in computer vision. Many real-world systems could benefit from various levels of human and non-human video "action" understanding. In this talk, I will discuss some of the approaches we have developed over the years for addressing aspects of this challenging problem. In particular, I will first discuss a strategy for learning activity progression in LSTM models, using structured rank losses, which explicitly encourage the architecture to increase its confidence in prediction over time. The resulting model turns out to be especially effective in early action detection. I will then talk about some of our recent work on single-frame situational recognition. Situational recognition goes beyond traditional action and activity understanding, which only focuses on detection of salient actions. Situational recognition further requires recognition of semantic roles for each action: for example, who is performing the action, and what is the source and/or target of the action. We propose a mixture-kernel attention Graph Neural Network (GNN) for addressing this problem. Finally, time permitting, I will also discuss our recent work on audio-visual weakly-supervised dense video-captioning.
Invited Speaker 2: Trevor Darrell, University of California, Berkeley
Title: High-level Video Understanding via Adversarial Inference and Spatio-temporal Graphs
Abstract: In this talk I'll present recent work towards High-level Video Understanding, including results on multi-sentence video description using novel adversarial inference methods. Our adversarial method relies on a hybrid discriminator design, with constituent elements for linguistic coherence, visual relevance, and paragraph consistency. I'll also present models for few-shot video activity recognition, leveraging scene graphs defined over space and time. Our method targets activity recognition where labeled training data are expensive and rare, i.e., in few-shot conditions, such as crash detection for autonomous driving. Time permitting, I'll cover other ongoing efforts on video description and activity recognition.
Invited Speaker 3: Cees Snoek, University of Amsterdam
Title: Video Recognition from a Story
Abstract: By 2022 there will be 45 billion cameras in the world, many of them tiny, connected and live streaming 24/7. Self-driving cars, drones and service robots are just three manifestations. For all these applications it will be of critical importance to understand what is happening where and when in the video streams. The common tactic for spatiotemporal video recognition is to track a human-specified box or to learn a deep classification network from a set of predefined action classes. In this talk I will present an alternative approach that allows for spatiotemporal recognition from a natural language sentence as input, and show its potential for object tracking and action segmentation. For object tracking, rather than specifying the target in the first frame of a video by a bounding box, a natural language specification of the target provides a more natural human-machine interaction as well as a means to improve tracking results. For action segmentation, rather than learning to segment from a fixed vocabulary of actor and action pairs, inference from a natural language input sentence allows us to distinguish between fine-grained actors in the same super-category, identify actor and action instances, and segment pairs that are completely outside of the vocabulary. For both tasks we discuss the realization via multimodal network architectures and sentence-augmented datasets, comparisons with the traditional state-of-the-art, as well as their potential for application in surveillance and other live video streams.
For questions about the workshop and submissions, please email vttws2019@gmail.com.