End-to-End Fine-grained Multi-modal Understanding
- Speaker: Aishwarya Kamath, New York University
- Date & Time: Thursday 17 November 2022, 11:00 - 12:00
- Venue: GR04, English Faculty Building, 9 West Road, Sidgwick Site
Abstract
Previously, multi-modal reasoning systems relied on a pre-trained object detector to extract regions of interest from the image. However, this crucial module was typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This made it challenging for such systems to capture the long tail of visual concepts expressed in free-form text. In this talk, I will first discuss MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, such as a caption or a question. The model is trained on 1.3M text-image pairs mined from pre-existing multi-modal datasets that have explicit alignment between phrases in the text and objects in the image. Next, we will explore further developments in architecture design that fuse the visual and textual modalities deeper in the model, achieving state-of-the-art results when coupled with a coarse-to-fine pre-training strategy. Finally, I will discuss a novel fine-grained visual understanding task and evaluation benchmark, which shows that existing benchmarks overestimate VL models' ability to understand and reason over complex visual scenes, leaving substantial room for improvement.
Series
This talk is part of the Language Technology Lab Seminars series.