The ability to craft and understand stories is a crucial cognitive tool that humans use for communication. According to computational linguists, narrative theorists and cognitive scientists, story understanding is a good proxy for measuring a reader's intelligence. Readers can approach a story as a form of problem-solving in which, for example, they keep track of how the main characters overcome obstacles throughout the story. To do so, readers need to make inferences both in prospect and in retrospect about the causal relationships between different events in the story.
In particular, video story data such as TV shows and movies can serve as an excellent testbed for evaluating human-level AI algorithms, for two reasons. First, video data combine several modalities: a sequence of images, audio (including dialogue, sound effects and background music) and text (subtitles or added comments). Second, video data show various cross-sections of everyday life. Understanding video stories can therefore be thought of as a significant challenge for current AI technology, one that involves analyzing and simulating human vision, language, thinking, and behavior.
To achieve human-level video understanding, machine intelligence needs to extract meaningful information such as events from sequential multimodal video data, consider the causal relationships between different events, and make inferences both in prospect and in retrospect about what events will occur and how these events could occur. The story in a video is highly abstracted information that consists of a series of events across multiple scenes in a scenario.
In this workshop, we emphasize the need for findings and insights from various research domains for video story understanding. We aim to invite experts in a variety of related fields, including vision, language processing, computational narratology and neuro-symbolic computing, to provide a perspective on existing research and to initiate discussion of future challenges in data-driven video understanding. Topics of interest include, but are not limited to:
We invite submissions of extended abstracts of up to 4 pages, excluding references and supplementary materials. All submissions must be in PDF format as a single file (including supplementary materials), must use the templates below, and must be submitted through this CMT link. The review process is single-round and double-blind. All submissions must be anonymized.
All accepted papers will be presented as posters during the workshop and listed on the website. Additionally, a small number of accepted papers will be selected to be presented as contributed talks.
Note that this workshop will not publish official proceedings, and accepted submissions will not count as publications. We encourage submissions of relevant work that has been previously published or is to be presented at the main conference.
|Paper Submission Deadline||September 10, 2019 (GMT+9)|
|Notification to Authors||October 7, 2019|
|Paper Camera-Ready Deadline||October 18, 2019|
|Workshop Date||November 2, 2019|
|08:30 - 08:45||Welcome & Opening Talk|
|08:45 - 09:15||Invited Talk 1|
|09:15 - 09:45||Invited Paper 1:
VideoMem: Constructing, Analyzing, Predicting Short-Term and Long-Term Video Memorability
Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Martin Engilberge
Invited Paper 2:
Progressive Attention Memory Network for Movie Story Question Answering
Junyeong Kim, Minuk Ma, Kyungsu Kim, Sungjin Kim, Chang D. Yoo
Invited Paper 3:
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
|09:45 - 10:15||Spotlight Talks (5 minutes each)
(1) DIFRINT: Deep Iterative Frame Interpolation for Full-frame Video Stabilization (Jinsoo Choi, In So Kweon)
(2) Adversarial Inference for Multi-Sentence Video Description (Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach)
(3) Robust Person Re-identification via Graph Convolution Networks (Guisik Kim, Dongwook Shu, Junseok Kwon)
(4) Enhancing Performance of Character Identification on Multiparty Dialogues of Drama via Multimodality (Donghwan Kim)
(5) Dual Attention Networks for Visual Reference Resolution in Visual Dialog (Gi-Cheon Kang, Jaeseo Lim, Byoung-Tak Zhang)
(6) Event Structure Frame-Annotated WordNet for Multimodal Inferencing (Seohyun Im)
|10:15 - 11:25||Coffee Break & Poster Session|
|11:25 - 11:55||Invited Talk 2|
|11:55 - 12:25||Invited Talk 3|
|12:25 - 12:30||Closing|
Invited Speaker 1: Trevor Darrell, University of California, Berkeley (tbc)
Invited Speaker 2: Cees Snoek, University of Amsterdam
Invited Speaker 3: Leonid Sigal, University of British Columbia
For further questions about the workshop or submissions, please email firstname.lastname@example.org