Berkeley Andrus: This week OpenAI is wrapping up beta testing for GPT-3, a language model roughly 10 times bigger than any of its competitors. With 175 billion parameters and training data drawn from 45 TB of raw text, GPT-3 far surpasses anything we’ve ever seen in NLP.

But how good is it, really?

Last month I had the chance to find out for myself when I got beta access to OpenAI’s Playground, a website for interacting with GPT-3 at a small, non-industrial scale. I spent hours chatting with it, trying to find the limits of what it can and cannot do. In this blog post I’m going to share some of the most impressive and surprising things I encountered, as well as some of the weaknesses common to all neural language models that GPT-3 hasn’t quite overcome.

But first, let me provide a little more context on how GPT-3 works for those of you who don’t have experience with neural language models. If you want to jump straight to the examples, feel free to skip the next section.

How to talk to GPT-3

You can think of neural language models (including GPT-3) as an advanced version of the autocomplete feature on your phone. You give the model some context (an article headline, the first half of a sentence, the beginning of a conversation) and it tries to guess what should come next (the article body, the rest of the sentence, the next reply in the conversation). The autocomplete on your phone can save you a few seconds when you’re typing out a common sequence of words, but if your model is sophisticated enough you can use the same interface to accomplish so much more.

GPT-3’s interface is “text in, text out”, meaning you give it some starting text and it gives you back the text that should come next. The researchers at OpenAI have demonstrated that, using this basic interface, you can get GPT-3 to do all kinds of tasks that would usually require specialized models or programs. Basically, if you can describe a task in English and/or give it a few examples, GPT-3 can accomplish whatever it is you want it to do (within reason). 

Here are a few examples of tasks that actually work with GPT-3. The black text is what I provided to the model as the ‘context’ or ‘prompt’, and the blue text is what it responded with. The point of these examples is to get you thinking about how you can use text as ‘instructions’.

If you want GPT-3 to translate English to German:

Translate the following into German:
“How are you?” ==> “Wie gehts?”
“What is your name?” ==> “Wie heisst du?”

If you want GPT-3 to recall general information:

cat: mammal
eagle: bird
shark: fish

If you want GPT-3 to act as a ‘chatbot’:

user > hello!
bot > hi there
user > how are you doing?
bot > i’m doing fine, thanks for asking.

That’s the general idea. You give a minimal amount of instruction or examples in the prompt, almost as if you were briefing another person (OpenAI suggests writing the kind of instructions you would give to a middle schooler), and GPT-3 does its best to fill in the rest. In all the examples above I told GPT-3 to stop after it output a newline character, but it’s also capable of much longer sequences, as you’ll see later on.

If you’re curious, this is what the OpenAI Playground looks like:

Along the right you have options for constraining the output. The most interesting parameters (in my brief experience) are temperature, which basically controls how safe vs. chaotic the output is, and stop sequences, which tell GPT-3 when to stop outputting text. In the main text box, you enter some text and hit ‘Submit’ to watch GPT-3 try to autocomplete it.
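
If you’d rather work in code than in the Playground, the same knobs are exposed through OpenAI’s Python library. Here’s a minimal sketch of the earlier ‘general information’ example as an API call; the engine name and parameter values are just what I would reach for during the beta, not anything prescriptive:

import openai

openai.api_key = "YOUR_API_KEY"  # issued with your beta account

response = openai.Completion.create(
    engine="davinci",      # the full-size GPT-3 engine during the beta
    prompt="cat: mammal\neagle: bird\nshark:",
    max_tokens=5,          # keep the completion short
    temperature=0.2,       # low temperature = safer, more predictable output
    stop=["\n"],           # stop at the first newline, as in my examples
)

print(response["choices"][0]["text"])  # e.g. " fish"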

GPT-3’s beta testing will be ending soon, but starting in just a few days you will be able to access GPT-3 directly with a paid account or a limited free trial.

The Best of GPT-3

Let me start by saying that I am super impressed with GPT-3 so far. I’ve worked with other generative language models before, and this one just feels different. It’s certainly not perfect, but it consistently exceeded my expectations. To borrow a phrase from the field of User Experience Design, it is delightful. I’m going to include some of the best examples I encountered, grouped according to the specific strengths they demonstrate.

Open-Domain Knowledge

GPT-3 was effectively trained on all of the internet (that’s not strictly true, but it’s a good way of thinking of it). A side effect of its training data is that it knows a little bit about almost everything. Here’s an example, which was the very first exchange I had with GPT-3. My motivation was to see if GPT-3 could pick up on implicit grammatical patterns and relationships between named entities. I gave GPT-3 the following prompt:

When Harry married Ginny, Ginny became Harry’s wife.
When Hermione married Ron, Ron became Hermione’s husband.
When Arthur married Molly, Molly became Arthur’s wife.
When Tonks married Lupin,

I was hoping that GPT-3 would pick up on the structural pattern of the sentences and the fact that female characters are ‘wives’ while male characters are ‘husbands’. I used Harry Potter characters because that was the first thing that came to mind (although in hindsight, if I wanted GPT-3 to guess the characters’ genders, I could have done better than ‘Lupin’ and ‘Tonks’).

GPT-3, to my surprise and delight, responded with this (the black text is the prompt, blue text is GPT-3’s contribution):

When Harry married Ginny, Ginny became Harry’s wife.
When Hermione married Ron, Ron became Hermione’s husband.
When Arthur married Molly, Molly became Arthur’s wife.
When Tonks married Lupin, Lupin became Tonks’s husband.
When Fleur married Bill, Bill became Fleur’s husband.
When Luna married Neville, Neville became Luna’s husband.
When George married Angelina, Angelina became George’s wife.
When Percy married Audrey, Audrey became Percy’s wife.
When Fred married Roxanne, Roxanne became Fred’s wife.
When Hagrid married Madame Maxime, Madame Maxime became Hagrid’s wife.
When Arthur married Molly, Molly became

Not only did GPT-3 notice the patterns I hoped it would, it also realized that I was talking about Harry Potter characters! It kept the pattern going by spitting out more examples of couples from the Harry Potter universe. To be fair, some of these ‘couples’ are not canonical romantic couples in the books (Fred and Roxanne are the children of George and Angelina; Luna/Neville and Hagrid/Maxime were never married), but they are all at least pairs of characters commonly associated with one another, and 4 of the 7 lines GPT-3 generated are actual married couples from the Harry Potter books. In short, GPT-3 had some inaccuracies, but it still showed more Harry Potter knowledge than I have.

Let me also point out that on the last line, GPT-3 started repeating something that had already been said. This is a common weakness in most neural language models. We’ll talk more about that later.

I didn’t feel qualified to explore Harry Potter lore any deeper than the previous example took me, so I switched over to a domain where I’m much more knowledgeable: Dungeons and Dragons. To test how well GPT-3 understands the world’s greatest role-playing game, I decided to list some of my favorite character classes in D&D 5th edition and see if it could identify tropes associated with each class.

I gave the following prompt:

Wizards are a good class because they learn lots of spells.
Bards are a good class because they are fun to role play.
Druids are a good class because

My hope was that GPT-3 would be able to say something relevant to the druid class, and I wasn’t expecting much beyond that. Instead GPT-3 came back with this:

Wizards are a good class because they learn lots of spells.
Bards are a good class because they are fun to role play.
Druids are a good class because they can turn into animals.
Paladins are a good class because they can heal and cast spells.
Monks are a good class because they can fight with bare hands.
Rangers are a good class because they fight well with bows and arrows.
Warlocks are a good class because they can

This blew me away. GPT-3 did 3 things right. First, it picked up on the fact that I was listing concise attributes of each class that people tend to enjoy (as opposed to criticisms, detailed lists, or flowery definitions for each class). Second, it correctly identified one of the things that makes druids popular. Finally, it kept going. All the classes it listed are actual classes in D&D 5th edition, and everything it said about those classes is true. With this example, GPT-3 was 2 for 2 on open-domain knowledge.

I gave it one more test on this skill. Harry Potter and D&D are fun, but what can GPT-3 do with facts that are more…factual? As in, based in the real world? I decided to test its knowledge of recent US history with this prompt:

US Presidential Election results:
2000 Bush v Gore – Bush
2004 Bush v Kerry – Bush
2008

This prompt is interesting because it requires some extra-grammatical formatting. The years, the ‘v’ character, and the hyphens all need to be put in the right place. The information in this list is fairly common knowledge, but in order to respond to the prompt in a human-like manner GPT-3 would have to combine factual knowledge with patterns in the prompt text. As it turns out, that’s exactly the kind of thing GPT-3 was built to do, and it came up with this response:

US Presidential Election results:
2000 Bush v Gore – Bush
2004 Bush v Kerry – Bush
2008 McCain v Obama – Obama
2012 Romney v Obama – Obama
2016 Trump v Clinton – Trump

Not only did it correctly list the major candidates and results in each US presidential election since 2008, it also put Republican candidates on the left and Democratic candidates on the right, following the pattern of the first two examples. It also chose to stop at the right place (i.e. it knew not to make predictions about the future). 

Following Complicated Instructions

By this point I was getting used to the text in, text out interface and I wanted to push myself (and GPT-3) to see if I could give it more complicated tasks. It took some trial and error, but I stumbled on some effective ways of interfacing with it.

For example, I figured out that I could give it a list of examples to process instead of having it answer just one question at a time:

Please classify each of the following weapons as melee or ranged:

– Rapier
– Heavy Crossbow
– Quarterstaff
– Battle Axe
– Shortbow
– Dart
– Greatsword
– Flail

Rapier: Melee

Heavy Crossbow: Ranged

Quarterstaff: Melee

Battle Axe: Melee

Shortbow: Ranged

Dart: Ranged

Greatsword: Melee

Flail: Melee

The output then went on to ask a question about longswords and gave this hilarious insight:

Longsword would be considered a melee weapon as it can be used in close-quarters fighting to overpower and incapacitate an opponent. If you were to shoot a long sword, it would be a ranged weapon. (Emphasis added)

The off-topic (or at least unsolicited) rambling at the end could pose a problem for someone building an app around GPT-3, but the list of answers was correct and well-formatted. Note that I didn’t give any examples of correct answers, just a simple instruction and a list of data to classify. GPT-3 decided what the answers should look like. It also managed to cover every item from my initial list and kept them in the original order. 
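
If I were building an app around this, I would guard against that kind of rambling by keeping only the lines that match the expected ‘Item: Label’ format and discarding everything else. Here’s a rough sketch of that post-processing step (the format check is my own convention, not something GPT-3 guarantees):

import re

# GPT-3's raw output: the labeled list followed by unsolicited commentary
raw_output = """Rapier: Melee

Heavy Crossbow: Ranged

Longsword would be considered a melee weapon as it can be used..."""

def parse_classifications(output, expected_items):
    """Keep only lines shaped like 'Item: Melee' or 'Item: Ranged',
    dropping any off-topic text that follows."""
    labels = {}
    for line in output.splitlines():
        match = re.match(r"^\s*([\w ]+):\s*(Melee|Ranged)\s*$", line)
        if match:
            labels[match.group(1).strip()] = match.group(2)
    return {item: labels[item] for item in expected_items if item in labels}

print(parse_classifications(raw_output, ["Rapier", "Heavy Crossbow"]))
# {'Rapier': 'Melee', 'Heavy Crossbow': 'Ranged'}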

For the next test, I wanted to see how GPT-3 did with ambiguous instructions. Instead of giving it a discrete binary classification to make, I decided to make it come up with its own categories:

(This example goes back and forth between black and blue. This is because I gave a prompt, saw GPT-3’s output, added onto the prompt, let it output more text, etc.)

List some of the categories each animal belongs to.

Tiger: Feline, Predator, Mammal
Bear: Mammal, Carnivore, Predator
Turtle: Reptile, Vertebrate, Skeleton, Tropical, Terrestrial
Cat: Mammal, Carnivore, Feline
Shark: Fish, Vertebrate, Fish
Salamander: Amphibian, Reptile
Lion: Feline, Carnivore, Predator
Seahorse: Fish, Vertebrate, Tropical, Aquatic, Terrestrial
Eagle: Bird, Carnivore, Reptile, Predator

GPT-3 successfully interpreted and completed the task – albeit with a few inaccuracies. It incorrectly called eagles and salamanders reptiles and said that turtles are a type of skeleton. Aside from those inaccuracies, though, I was impressed with its response to ambiguous instructions. As a student, I can’t imagine anything more frustrating than being asked to ‘list some categories’ without any clarification about what that means, but GPT-3 played along without complaint.

I’ll end this section with one of my favorite exchanges with GPT-3. While GPT-3 probably isn’t ready to replace human writers any time soon, it has massive potential as a source of creative inspiration. For example, I love the idea of professional writers sitting down with GPT-3 for a brainstorming session. With that image in mind, I gave it a task I personally struggle with when writing stories: creating fun and unique character traits.

Let’s come up with 100 quirks that an elf merchant in a fantasy game might have.
1. Calls everybody ‘buckaroo’ instead of their name
2. Only takes copper pieces, won’t accept silver or gold
3. Comes 15 minutes late to every appointment
4. Sells the same knife for 200gp every time, because “It’s a better deal than the last time you came back for it”
5. Always sells milk for 3gp per gallon, despite it costing him 4 gp a gallon to buy it back from the cows
6. His beard is always half-cut
7. Always grants wishes for free, but only the ones involving him getting money out of it
8. Always wears a doughnut on his belt
9. Always has an infuriatingly lax attitude to opening times, despite being a ‘shopkeeper’
10. Always talks in the third person
11. Always talks in iambic pentameter and poetic rhymes
12. Attracts animals to his shop like moths to a flame
13. Can have hot soup on his breath, even though he doesn’t appear to have eaten anything hot recently
14. Can tell the history of every item that’s ever been sold, even though his memory doesn’t appear to work particularly well on any other topic
15. Chickens follow him around like he’s their mother
16. Constantly half-forgets to charge money for essential commodities like bread

I absolutely adore this list. In my opinion GPT-3’s response to my prompt reflected that it has enough knowledge about fantasy tropes to effectively play with and satirize them. The output was also impressively diverse while maintaining high quality. Some of the list items (#9, #14) are things I would actually expect to see in a fantasy book or game. Others (#4, #16) are phrased a little oddly, but could conceivably describe some interesting or funny characters. My favorite is #5, which seems like it might be a reference to the character Milo Minderbinder in Catch-22 (Milo famously found a way to buy eggs at 7 cents and sell them at 5 cents for a 2 cent profit).

This was the output from hitting the ‘Submit’ button once. After copying down this example I hit Submit repeatedly to try to complete the list of 100 items, but after #16 the suggestions quickly devolved into ideas that were boring, repetitive, and/or off-topic. It seems 100 creative elf merchant traits was too much to ask for.

Another weakness in this output is that GPT-3 decided to start using exclusively masculine pronouns to describe the elf merchant, even though I didn’t give any indication of the merchant’s gender identity. Because GPT-3 is trained on human-generated data, it has inherited some of the racial, gender, and other biases that unfortunately are still present in our society. I don’t feel qualified to delve into this sensitive topic right now, but if you’re interested in learning more about social bias in language models I recommend this article as a good starting place. OpenAI has taken a step in the right direction by creating a filter that detects harmful content in GPT-3’s output, but as a community we still have a long way to go in this area.

Logic, Math, and Reason

OpenAI’s original paper on GPT-3 pointed out its ability to perform simple arithmetic, and I was excited to see for myself what it could do. 

It handles simple addition and multiplication problems, which was cool but not really surprising. Here’s a multiplication run:

2 * 2 = 4
3 * 3 = 9
4 * 2 = 8
5 * 3 = 15
6 * 4 = 24
7 * 5 = 35
8 * 6 = 48
9 * 7 = 63
10 * 8 = 80
11 * 9 = 99
12 * 10 = 120

It was also able to do some simple algebra, though I found it was not consistent on anything more complicated than the following:

If 3x = 9, what does x equal?

x = 3

If 4 = 2x, what does x equal?

x = 2

If 2x = 6, what does x equal?

x = 3

GPT-3 was also able to pull numeric information out of natural language snippets and perform arithmetic on the result, as shown here:

Player: I started with 27 silver pieces, then I found 4 more. How much do I have now?
Banker: You have 31 silver pieces.
Player: Then I bought a potion for 15 silver pieces. How much do I have now?
Banker: You have 16 silver pieces.
Player: Then I found 17 silver pieces laying by the side of the road. How much do I have now?
Banker: You have 33 silver pieces.

These were all promising and impressive examples of GPT-3 performing mathematical operations. Unfortunately, when I tried anything more complicated than the above examples I found I often got inconsistent or inaccurate results.
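
Given that inconsistency, I wouldn’t trust GPT-3 with bookkeeping unless the numbers were verified in code. A quick sanity check like this would catch most slips (a sketch; the reply parsing is deliberately simplistic):

import re

def verify_total(start, deltas, banker_reply):
    """Re-do the arithmetic and compare it to the number GPT-3 claims.
    Assumes the reply contains exactly one number (a simplification)."""
    expected = start + sum(deltas)
    claimed = int(re.search(r"-?\d+", banker_reply).group())
    return claimed == expected

# The exchange above: 27 + 4 - 15 + 17 = 33
print(verify_total(27, [4, -15, 17], "You have 33 silver pieces."))  # True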

GPT-3’s Shortcomings

GPT-3 clearly has a lot going for it. It’s been satisfying and exciting to see what it can do. However, we’ve already seen a few of its flaws and there are others that are worth addressing. Here are some of the tasks that I was unable to get it to accomplish:

Logic with Extra Rules

Since the banking example above went so well, I wanted to try a more complicated example. I imagined a hypothetical game where the player has a health score that cannot go below zero. I tried a few different combinations of examples and instructions, but I never quite convinced GPT-3 to follow my special rules:

A player cannot have less than 0 hp. If they have 6 hp and take 10 damage, they are now at 0 hp. Players with 0 hp don’t immediately die, they are just unconscious.

Player 1: I started with 10 health, then I took 15 fire damage. How much health do I have now?
DM: You have 0 hp and are unconscious.

Player 2: I started with 56 health, then I took 15 cold damage. How much health do I have now?
DM: You have 41 hp
Player 2: I cast a spell and heal for 6 hp. How much health do I have now?
DM: You have 47 hp
Player 2: I take 26 slashing damage. How much health do I have now?
DM: You have 21 hp
Player 2: I take 25 fire damage. How much health do I have now?
DM: You have -4 hp and are unconscious.

GPT-3 insisted on following the standard rules of arithmetic instead of my imposed ‘health can’t go below 0’ rule. Of course, this is exactly the type of thing that we don’t need GPT-3 to do – any programmer could whip up a health-tracking subroutine with a few lines of code, something like the sketch below. My point in including this example is that there are limits to what we can tell GPT-3 to do. I am also new to writing prompts for GPT-3, so it may be that I simply haven’t framed the problem in the right way.
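
For the record, here is roughly what that subroutine looks like, with the ‘never below zero’ rule GPT-3 wouldn’t follow reduced to a single max():

def apply_damage(hp, damage):
    """Enforce the house rule GPT-3 kept ignoring: hp never drops below 0.
    Negative damage counts as healing."""
    return max(0, hp - damage)

def describe(hp):
    return f"You have {hp} hp" + (" and are unconscious." if hp == 0 else "")

hp = 56
for damage in [15, -6, 26, 25]:
    hp = apply_damage(hp, damage)
    print(describe(hp))  # ends with "You have 0 hp and are unconscious."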

Complex Language Tasks

The standard paradigm in natural language processing has been to create a different machine learning model for each language-based task – roughly a one-to-one ratio between tasks and models. GPT-3 fights that paradigm by performing well at many language tasks, as long as they can be framed using a “text in, text out” interface. Essentially, GPT-3 is big enough that it has learned to do many things well, potentially removing the need to create a small, specialized model for each individual task.

That is the hope, anyway. It turns out that while GPT-3 can do lots of things a language model should do, it doesn’t quite do everything.

For example, this was my attempt at ‘Detokenization’. In NLP research we often make text easier to process by adding extra spaces and making all the characters lowercase. I wanted to see if GPT-3 could reverse that process:

This is how you de-tokenize text.

Tokenized: what ‘s up , guys ?
Original: What’s up, guys?

Tokenized: why don ‘t you stop seeing them ?
Original: Why don’t you stop seeing them?

Tokenized: i ‘m going to the store .
Original: I’m going to the store.

Tokenized: my name ‘s thomas .
Original: My name is Thomas.

GPT-3 did well on the capitalization, but it changed ‘s to is, which was not the desired behavior. I tried a few other test cases and always got similar results.
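
For comparison, the mechanical parts of this task don’t need a language model at all. A rule-based sketch that covers only my test patterns:

import re

def detokenize(text):
    """Reattach punctuation and contractions, then capitalize the
    first letter. Covers only the patterns in my test cases."""
    text = re.sub(r" ([',.?!])", r"\1", text)  # "what 's up ," -> "what's up,"
    return text[0].upper() + text[1:]

print(detokenize("what 's up , guys ?"))  # What's up, guys?
print(detokenize("my name 's thomas ."))  # My name's thomas.

Note that the rules can’t know ‘thomas’ is a proper noun, which is exactly the part GPT-3 got right; its mistake was the opposite one, rewriting the contraction instead of leaving it alone.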

Here’s my attempt to get GPT-3 to do part-of-speech tagging:

Part of Speech tagging is where you label each word in a sentence according to its Part of Speech (noun, verb, adjective, etc.)

Please tag the Parts of Speech in the following sentences

Sentence: The dog caught the ball.
Tags: DET NOUN VERB DET NOUN

Sentence: The house is red.
Tags: DET NOUN VERB ADJ

Sentence: He ran quickly up the steps.
Tags: NOUN VERB ADV PREP DET NOUN

Sentence: She went to the store.
Tags: NOUN VERB PREP ADP DET NOUN

GPT-3 output correct tags for every word in the sentence, but it gave an extra tag (ADP) that doesn’t correspond to any of the original words – six tags for a five-word sentence.
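
Misalignments like that are easy to catch programmatically, so if I were using GPT-3 as a tagger I would validate the tag count before trusting the output. A minimal check (the punctuation handling is deliberately crude):

def tags_align(sentence, tags):
    """Reject a tagging when the number of tags doesn't match
    the number of words."""
    words = sentence.rstrip(".?!").split()
    return len(tags) == len(words)

print(tags_align("She went to the store.",
                 ["NOUN", "VERB", "PREP", "ADP", "DET", "NOUN"]))  # False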

Repetition

All neural language models that I’m aware of have problems with repetition. They get caught in loops where they spit out the same words over and over again. The better models won’t loop on a single word, but they will repeat sentences or phrases. GPT-3 does come with some nice tools for combating this (more on those below), but by default it is still vulnerable to repetition.

Here’s the detokenization example from above with the temperature turned down (to 0.2) and without a stop sequence:

This is how you de-tokenize text.

Tokenized: what ‘s up , guys ?
Original: What’s up, guys?

Tokenized: why don ‘t you stop seeing them ?
Original: Why don’t you stop seeing them?

Tokenized: i ‘m going to the store .
Original: I’m going to the store.

Tokenized: my name ‘s thomas .
Original: My name is Thomas.

Tokenized: i ‘m going to the store .
Original: I’m going to the store.

Tokenized: i ‘m going to the store .
Original: I’m going to the store.

Tokenized: i ‘m going to the store .
Original: I’m going to the store.

Tokenized: i ‘m going to the store .
Original: I’m going to the store.

Tokenized: i ‘m going to the store .
Original: I’m going to the store.
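
The ‘nice tools’ I mentioned are the frequency and presence penalties, which dock the scores of tokens the model has already produced. Through the API they look something like this (a sketch; the right values depend heavily on the task):

import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    engine="davinci",
    prompt="Tokenized: i 'm going to the store .\nOriginal:",
    max_tokens=60,
    temperature=0.2,
    frequency_penalty=0.8,  # penalty grows with how often a token has appeared
    presence_penalty=0.4,   # flat penalty once a token has appeared at all
)

print(response["choices"][0]["text"])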

This failure mode is a reminder of what GPT-3 really is: a highly effective pattern-detecting machine. Keeping the true nature of GPT-3 in mind will help us as we build awesome applications around this groundbreaking new technology.

Further Reading

If you want to know more of the technical details of how GPT-3 came to be and how it compares to other models, check out OpenAI’s paper Language Models are Few-Shot Learners. It gives a lot of great examples of why this language model enables an approach fundamentally different from the usual paradigm of fine-tuning models.

For a demonstration of some of the ethical perils inherent in models this good, check out this blog post from a student who had GPT-3 write blog posts for him – and used it to amass a surprisingly large online following.

If you want to know how GPT-3 is going to affect real-world businesses, I recommend this blog post, which gives an excellent summary of how it is already being used and how it will hopefully be used in the future.

Last but not least, if you want some first-hand experience, I would highly recommend the game AI Dungeon. It’s a text-based adventure game that feels a little like a Choose Your Own Adventure book, except it lets you improvise any actions you want and adapts to your choices in real-time. It’s a lot of fun and is a great way to get familiar with language models like GPT-3.