Elevator Pitch & ELI5 of my first PhD project: Image Retrieval from Contextual Descriptions


I recently presented my first paper at a conference in person: Image Retrieval from Contextual Descriptions.

ACL 2022 in Dublin was an amazing week, but I noticed that I did not have a super intuitive and structured elevator pitch or ELI5 prepared.

So here it is, on different levels!

2-sentence-pitch

ImageCoDe is a new, challenging benchmark for contextual vision-and-language understanding. The task is to select the right image out of 10 minimally contrastive images based on a complex description, and models do terribly! *highlights phenomena in example(s)* (more on our interactive demo!)

3-minute-pitch

ImageCoDe is a new, challenging benchmark for contextual vision-and-language understanding. The task is to select the right image out of 10 minimally contrastive images based on a complex description, and models do terribly!

*highlights phenomena in example(s)* (more on our interactive demo!)

We crowdsource 21k descriptions for more than 90k images (i.e., 9k sets of 10 minimally contrastive images). First, we collect sets of highly similar images, mostly from video frames and some from static image collections (but the video-based sets are way more interesting). Second, we show people 10 such images and ask them to write a description that distinguishes a randomly selected one from the others (with instructions like “no mentions of other images explicitly”).

Third, we give the description together with the 10 images to one or more retriever workers. Only if they retrieve the correct image does the description stay in our data!
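To make the first step a bit more concrete, here is a rough sketch (not our actual pipeline; the video file and sampling parameters are made up) of how near-identical candidates can be pulled out of a video: sample frames that lie close together in time, so they differ only in small details like pose or object position.

```python
import cv2  # OpenCV


def sample_contrastive_frames(video_path, start_frame=0, step=15, n=10):
    """Return n frames spaced `step` frames apart, starting at `start_frame`."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame + i * step)
        ok, frame = cap.read()
        if not ok:  # ran past the end of the video
            break
        frames.append(frame)
    cap.release()
    return frames


# Hypothetical usage: 10 frames spanning roughly 5 seconds of a 30fps clip
candidates = sample_contrastive_frames("some_video.mp4", start_frame=300, step=15)
```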

We evaluate CLIP, ViLBERT and UNITER in 5 different settings, and I will highlight three here:

  1. Zero-shot: all models land around ~20% accuracy, where random chance of picking the right image is 10% (see the sketch right after this list).
  2. Fine-tuning with hard negatives in the same batch (maximizing the text-image matching score for the correct image and minimizing it for the distractors): improves to 24-28%.
  3. Adding a ContextualTransformer with temporal embeddings (for video frames) on top of, e.g., the CLIP output vectors, which lets the model compare all 10 images in one forward pass: small further improvements, with our strongest CLIP-based model reaching ~30% accuracy (a second sketch follows below).
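To make the zero-shot setting concrete, here is a minimal sketch (with an assumed CLIP checkpoint and made-up file paths, not our actual evaluation code): encode the description and the 10 candidates with CLIP and pick the image with the highest text-image similarity.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

description = "The frisbee is just about to touch the player's outstretched left hand."
images = [Image.open(f"image_set/img{i}.jpg") for i in range(10)]  # hypothetical paths

inputs = processor(text=[description], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (1, 10): one similarity score per candidate image
predicted_index = outputs.logits_per_text.argmax(dim=-1).item()
print(f"Predicted image index: {predicted_index}")
```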

Our best model is far from human performance of 91%!
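And here is a hedged sketch of what such a contextual module on top of CLIP image and text features could look like (shapes, layer counts, and names are my assumptions for illustration, not the exact architecture from the paper): a small Transformer encoder lets the 10 candidates attend to each other and to the description in one forward pass, with learned temporal embeddings marking each frame's position in the set.

```python
import torch
import torch.nn as nn


class ContextualScorer(nn.Module):
    """Score 10 candidate images jointly, conditioned on one description."""

    def __init__(self, dim=512, n_images=10, n_layers=2, n_heads=8):
        super().__init__()
        self.temporal_emb = nn.Embedding(n_images, dim)  # frame position in the set
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score = nn.Linear(dim, 1)

    def forward(self, image_embs, text_emb):
        # image_embs: (batch, 10, dim) CLIP image features; text_emb: (batch, dim)
        positions = torch.arange(image_embs.size(1), device=image_embs.device)
        x = image_embs + self.temporal_emb(positions)     # add temporal embeddings
        x = torch.cat([text_emb.unsqueeze(1), x], dim=1)  # prepend the description
        x = self.encoder(x)                               # all candidates interact
        return self.score(x[:, 1:]).squeeze(-1)           # (batch, 10) logits


# Training with cross-entropy over the 10 logits also captures the hard-negative idea:
# the 9 distractors from the same set act as the in-batch negatives.
scores = ContextualScorer()(torch.randn(4, 10, 512), torch.randn(4, 512))
loss = nn.functional.cross_entropy(scores, torch.zeros(4, dtype=torch.long))
```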

Now the question everyone should have: Why do we need yet another vision-and-language benchmark? What’s new?

My answer:

It is the mix of several factors that makes this a unique challenge that is not trivial for models: the candidate images are minimally contrastive, the descriptions are long and nuanced, and getting them right often requires pragmatic and temporal reasoning rather than just object recognition.

If you like this, consider submitting to our leaderboard!

ELI5

I think an Explain-to-me-like-I’m-5 type of explanation makes the most sense when you know someone’s background and familiarity, and what they are curious about. So I would ask about that first.

However, I would probably talk generally about how, in vision-and-language research or more broadly in language grounding, you have to come up with good tasks that measure the phenomena you care about. I would tell them about a few examples from the past and how ImageCoDe fits into this line of work.

I would describe what kinds of skills a model needs to solve a task like ImageCoDe: grounding nuanced language in subtle visual differences, comparing very similar images against each other, and reasoning about time and context.

It would also be enlightening for a curious outsider to learn what happens “behind the scenes” when a dataset is created: the crowdsourcing, how to find workers and write instructions, and how to make the data accessible to the public via tools like Huggingface Datasets or an interactive demo.