End-to-End Fine-grained Multi-modal Understanding

Aishwarya Kamath, New York University
Thursday 17 November 2022, 11:00-12:00
GR04, English Faculty Building, 9 West Road, Sidgwick Site.

If you have a question about this talk, please contact Panagiotis Fytas.

Previously, multi-modal reasoning systems relied on a pre-trained object detector to extract regions of interest from the image. However, this crucial module was typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This made it challenging for such systems to capture the long tail of visual concepts expressed in free-form text. In this talk, I will first discuss MDETR, an end-to-end modulated detector that detects objects in an image, conditioned on a raw text query such as a caption or a question. The model is trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets that have explicit alignment between phrases in the text and objects in the image. Next, we will explore further developments in architecture design that fuse the visual and textual modalities deeper in the model, achieving state-of-the-art results when coupled with a coarse-to-fine pre-training strategy. Finally, I will discuss a novel fine-grained visual understanding task and evaluation benchmark, which shows that existing benchmarks overestimate the ability of vision-language (VL) models to understand and reason over complex visual scenes, leaving substantial room for improvement.

This talk is part of the Language Technology Lab Seminars series.
