A few years ago I was invited to give a keynote address at the NeurIPS Workshop on Reinforcement and Language Learning in Text-Based Games. I’ve always meant to post about it, but somehow never found the time.

(Fun fact: I was six months pregnant with twins when I filmed this presentation, and had such terrible dizzy spells that I couldn’t stay standing for more than two minutes at a time. All praise to the amazing Cayla Murdoch, who edited this video so perfectly that you’d never guess how often I had to pause for breath.)

So… a lot has happened in two years. GPT-2 happened. So did BERT, CTRL, Transformer-XL, PPLM, and GPT-3. (Oh, and AI Dungeon.) But the fundamental challenges of language learning remain the same.

From a machine learning perspective, human language is obnoxiously non-deterministic. There’s no one right answer to any language modeling question; instead, you have a plethora of appropriate responses for any given text prompt or dialog history, many of which bear no linguistic resemblance to one another. For example, the statement “Iron Man is an amazing movie” can appropriately be followed by literally thousands of sentences, including responses as diverse as

  • The cinematography was crisp and innovative, and the CGI effects were spectacular
  • Robert Downey Jr. is a great actor
  • Another win from the Marvel Universe, two thumbs up!
  • Wow, can I please have a robot like Jarvis, pretty pretty please?
  • NOT!

From this messy, one-to-many training structure, we somehow expect to produce an auto-regressive algorithm capable of learning the best possible response to any situation.

Yeah, right.

We machine learning researchers have our tools of the trade, of course. We have variational neural architectures, designed to model the dozens, perhaps hundreds, of invisible facts not present in the text history which nevertheless influence the choice of possible response. We have probabilistic decoders, which avoid converging to the degenerate behavior of always saying the most likely thing, and instead sometimes say more interesting but less probable things.
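
To make the second of those concrete, here’s a minimal sketch of the difference between greedy decoding and a probabilistic decoder, using nucleus (top-p) sampling over a toy next-token distribution. The vocabulary and probabilities are invented for illustration; this isn’t any particular model’s decoder.

```python
import numpy as np

def nucleus_sample(probs, top_p=0.9, rng=None):
    """Sample from the smallest set of tokens whose cumulative probability
    exceeds top_p, instead of always taking the argmax (greedy decoding)."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]            # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]                      # the "nucleus" of plausible tokens
    kept_probs = probs[kept] / probs[kept].sum()
    return rng.choice(kept, p=kept_probs)

# Toy next-token distribution over a five-word vocabulary.
vocab = ["great", "good", "fine", "terrible", "NOT"]
probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])

print(vocab[int(np.argmax(probs))])                            # greedy: always "great"
print([vocab[int(nucleus_sample(probs))] for _ in range(5)])   # varied, but still plausible
```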

Recently, my research has focused on a more foundational exploration of language. Rather than trying to predict what comes next, my collaborators and I have instead studied the behavior of language in mathematical terms. In other words, if you convert each word into a vector of numbers (using, for example, the well-known word2vec algorithm), how do the geometric locations of those words relate to their semantic meaning, and how can this knowledge be leveraged?
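
As a rough illustration of what that geometric probing looks like (a toy sketch, not our research code), here’s the classic similarity-and-offset check using gensim and a small pretrained GloVe model as a stand-in for word2vec; any set of static word vectors loaded as gensim KeyedVectors behaves the same way.

```python
# Sketch of geometric probing over static word embeddings. Assumes gensim is
# installed; the model is fetched by gensim's downloader on first use.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# Geometric proximity tracks semantic similarity...
print(vectors.similarity("movie", "film"))    # relatively high cosine similarity
print(vectors.similarity("movie", "carrot"))  # much lower

# ...and consistent vector offsets encode relationships:
# vec("king") - vec("man") + vec("woman") lands near vec("queen").
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```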

The results of that research led to several academic publications, including applications such as state space disambiguation, affordance detection in text-based games, and human-guided action selection. But it also led to a critical discovery: Although these linguistic embedding spaces, as they’re sometimes called, contain incredibly valuable semantic information encoded within the geometric proximity of words and sentences, that functionality is emergent rather than induced. By which I mean, it’s a happy by-product of learning algorithms that are trained primarily for other purposes. BERT, for example, creates amazingly accurate contextualized word embeddings that are fabulously useful as pre-trained inputs for neural networks learning a variety of language tasks. It’s insanely good at learning fine-grained semantic distinctions, but it wasn’t designed to encode these distinctions as a coherent geometric structure in which words with specific relationships appear at consistent vector offsets relative to one another.

In other words, when we tried to apply BERT vectors to tasks like the ones in the above academic papers, we failed miserably.
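
For a sense of what that kind of failure looks like, here’s a toy offset probe in the same spirit as the tasks described above (a sketch assuming the Hugging Face transformers library, not the actual evaluation from the papers):

```python
# Toy probe: do BERT's contextualized vectors place "France -> Paris" and
# "Japan -> Tokyo" at a consistent offset? Assumes torch and transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Mean-pool the final-layer BERT vectors for `word`'s subword tokens."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    positions = [i for i, t in enumerate(enc["input_ids"][0].tolist()) if t in word_ids]
    return hidden[positions].mean(dim=0)

sent_fr = "The capital of France is Paris."
sent_jp = "The capital of Japan is Tokyo."
offset_fr = word_vector(sent_fr, "Paris") - word_vector(sent_fr, "France")
offset_jp = word_vector(sent_jp, "Tokyo") - word_vector(sent_jp, "Japan")

# If the capital-of relationship were encoded as a consistent offset,
# these two difference vectors would point in nearly the same direction.
print(torch.nn.functional.cosine_similarity(offset_fr, offset_jp, dim=0).item())
```

Probes like this one are crude, but they capture the flavor of the mismatch we kept running into.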

Attempts using Skip-thought vectors, GPT-2 hidden states, and Transformer-XL hidden states were equally unfruitful. InferSent and Google’s Universal Sentence Encoder worked better, but not perfectly. This led us to wonder: What would it take to create an embedding space that actually has the properties we were looking for?

The answer? We’re not certain. But we are certain that the current trend in training large language models is taking us in the wrong direction. Now don’t get me wrong! It’s taking us in great directions in terms of creating AI-generated text, enabling text-based few-shot learning, and providing pre-trained input features for downstream tasks. But it’s not taking us in the right direction in terms of creating linguistic embedding spaces that encode language in semantically structured ways.

At DRAGN Labs, we’re starting to explore what it would take to create the type of embedding space we dream of. The first small step in that direction is an academic paper in Springer’s Advances in Intelligent Computing, called Rethinking Our Assumptions about Language Model Evaluation.

Like language itself, this problem is multifaceted, with many possible correct next steps. With luck, we’ll be able to find at least one of them.