[EARLY DRAFT] Defining the most basic concepts in Language Grounding


I am naturally inclined to go as deep as possible with why and what-is questions. In the case of language grounding, this has led me to compile this article of basic concepts that make up the field of language grounding, and of course more generally the field of language and AI.

My approach is the following: First, I will attempt to define the concepts based only on my own intuitions, past conversations or content I have read. Then, I will do some research and add what more experienced people have said about them, either by citing them or by interviewing them directly.

The definitions will be as specific to language grounding or AI as possible and should not be interpreted as general definitions: I am sure medical students would define a body differently than I do here.

Before we begin, let's clarify what it means "to exist" generally. Are you serious? No, just kidding, but if you want to head down that rabbit hole (as I did recently), you might wanna check out "On What There Is" by Quine.

Enough jokes. Let's start.

What is a word?

The most boring answer one could give from simple Natural Language Processing applications: a word is something that is surrounded by whitespaces. But what about languages like Chinese? And how would you even approach finding words in spoken language when there are no whitespaces?
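To make the whitespace definition concrete, here is a minimal sketch in Python. The function name `whitespace_words` and the two example sentences are my own illustrations, not taken from any NLP library. It only shows what the naive definition does: it works for English and returns the whole Chinese sentence as one "word".

```python
from __future__ import annotations


def whitespace_words(text: str) -> list[str]:
    """Naive definition: a word is whatever sits between whitespace."""
    return text.split()


english = "the cat sat on the mat"
chinese = "猫坐在垫子上"  # roughly "the cat sits on the mat", written without spaces

english_words = whitespace_words(english)
chinese_words = whitespace_words(chinese)

print(english_words)       # six separate tokens
print(len(chinese_words))  # the whole sentence comes back as a single token
```

Real tokenizers for Chinese need a dictionary or a learned model to find word boundaries, which is exactly why the whitespace definition cannot be the whole story.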

Wikipedia says: "In linguistics, a word of a spoken language can be defined as the smallest sequence of phonemes that can be uttered in isolation with objective or practical meaning."

okayish answer: a word is the smallest unit of meaning... but what about compound words, or even words like "went" or "hateful", which each carry more than one unit of meaning?

answers from phonetics? words are divided by pauses (instead of whitespaces)

are there more profound answers? such as a word is a sound that is a cluster in some space of measurements of the brain → "a distinct idea in the brain sort of" → look into cogsci

Is there something specific to say about words in the context of language grounding?

Nouns are grounded to objects in the world: things that have persistence, that you can grab and manipulate. The full notion of an object only starts to make sense when it persists over time, which can only happen in video-like data or environments, ideally on a real robot. Not all nouns are so physical, of course! Think of democracy, love...

Verbs are grounded in actions: they are not persistent but fleeting, they usually happen over a time span, and they are your way of changing the state of the world. Again, this concept really only makes sense when we deal with temporal data (videos, environments, etc.).

There are more kinds of words (parts of speech, as linguists call them). For example, adjectives and adverbs specify a noun or a verb: what a surface is like or how something tastes, or how fast something happened.

What is a sentence?

Loosely following Wittgenstein, a sentence connects the more atomic ideas/concepts and shows their relation to each other.

A sentence tells a very short story; several sentences tell longer stories. Alternative formulation: a sentence elicits imagining a scene in the listener’s mind.

A sentence has intent (see the field of pragmatics in linguistics). It wants you to act or think a certain way.

What are some online definitions?


In the context of grounding?

A sentence can literally describe a scene in a video or image.

In image captioning datasets, a sentence is usually a concise summary of the visual scene in the image.

In video datasets, sentences are often less related to the actual visual content and more pragmatic and functional, coming from the people talking in the video, unless the sentences are explicit annotations from crowdworkers tasked to e.g. “summarize the video”.

In action/environment grounding, sentences are very task-oriented: instructions, clarifying questions and so on.

What is an image?


What is a video?

What is an environment?

What is an action?

RL perspective

lexically: a verb

Additionally: language itself is an action, trivially because speaking requires motor control. But on a more suitable conceptual level, pragmatics and discourse take that approach to language.

What is grounding?

look at Timothy O'Donnell's Techaide conference talk or the corresponding paper

symbol grounding

What is communication?

What is an object?

What is a modality?

Also, does it make sense to say tokenized language is a modality? Surely not during actual language acquisition but maybe a reasonable assumption when we consider an agent that already knows language.

(references: AI Coffee Break video, Morency's survey paper...)

What is a body?

I was asked this question in an interview (though more like "what is the role of embodiment")...

My personal answers before consulting the literature:

First intuitive answer: when you close your eyes, you still have a model of your body and can mentally access it

More formally maybe: a body is that part of the environment of which you have the best model because you have direct access to its internal state. Also: a body is a tool of the environment (something you can manipulate to change the environment), but it is a tool you always have access to, even if you don't want it. Normal tools, like a knife, are only temporarily with you and have to be discarded if you want to use another tool. In that sense, a phone might slowly become part of the body?

A body is the direct source of sensory information. It is the sensory interface (and can be used in surprisingly versatile ways: see "seeing" with the tongue from neuroscience).

Before I go into the literature: I would expect that, if I go deep enough, I will end up reading about Friston's Markov blankets at some point.

Cite Joanna Bryson's Embodiment and Memetics paper?

In Dhruv Batra's NAACL keynote talk on "From Disembodied to Embodied Multimodal Learning" he defines embodiment (not a body though!) as simply: agents acting in environments, in contrast to static datasets. Is this already a full body if it is something like BabyAI though?
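To make the contrast between a static dataset and an agent acting in an environment concrete for myself, here is a minimal sketch in Python. Everything in it is hypothetical: `STATIC_DATASET`, the `LineWorld` class and its `step` method are toy names I made up. They are not the BabyAI API and not taken from the talk. The only point is that in the environment, the agent's own actions determine what it observes next.

```python
from dataclasses import dataclass

# Static dataset (hypothetical toy data): the pairs are fixed in advance,
# and nothing the learner does changes what it gets to see.
STATIC_DATASET = [("a red ball", "ball"), ("a blue key", "key")]
captions_seen = [caption for caption, _label in STATIC_DATASET]


@dataclass
class LineWorld:
    """Hypothetical 1-D world: the agent walks along positions 0..size-1."""

    size: int = 5
    position: int = 0

    def step(self, action: str) -> int:
        # The action changes the state, and the new state is the next observation.
        if action == "right":
            self.position = min(self.position + 1, self.size - 1)
        elif action == "left":
            self.position = max(self.position - 1, 0)
        return self.position


env = LineWorld()
observations = [env.step(action) for action in ["right", "right", "left"]]

print(captions_seen)  # always the same, regardless of the learner
print(observations)   # depends on the sequence of actions taken
```

Under this minimal definition even `LineWorld` would count as embodiment, which is why I doubt that "acting in an environment" is already enough for having a full body.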