Orion Weller: Understanding and generating humor automatically with machines is an interesting and difficult area of research in natural language processing.  Because humor can take varied forms and structures, even something as simple as recognizing a joke can be hard!  Despite these difficulties, however, systems need to learn to handle humor: we can’t have an information extraction system treating something as true when it is clearly a joke (just imagine how this could go wrong when searching for security threats or during fact checking).

In late 2019, I was working with Dr. Fulda on automatic humor generation, extending some of my previous work on recognizing humor.  The project initially started by attempting to train a GAN to generate humorous text.  This proved difficult: our many early attempts produced sentences that were clearly not humorous and often unintelligible.  As we pondered these results, I realized there might be an easier way to generate humor automatically, one that had not yet been attempted.

The main idea was this: we train a machine translation system, but instead of translating from one language to another, we translate from non-humor into humor.  Translation into humor-ese, if you will.  I was familiar with the Humicroedit dataset used in recent SemEval competitions, in which humans changed words in news headlines to make them humorous.  Using that as training data, I set about training the model.  After some training, we found promising results.
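To give a feel for how Humicroedit rows become “translation” pairs, here is a minimal sketch.  It assumes the dataset’s convention of wrapping the word to be replaced in `<word/>` tags alongside a human-chosen edit word; the example headline and edit below are illustrative, not necessarily from our training data.

```python
import re

def apply_edit(original, edit_word):
    """Replace the <word/>-tagged span in a Humicroedit-style headline
    with the human-chosen edit word, producing the humorous version."""
    return re.sub(r"<[^/>]+/>", edit_word, original, count=1)

def make_pair(original, edit_word):
    """Build a (plain, humorous) 'translation' pair from one row:
    the plain side keeps the original word, the humorous side swaps it."""
    plain = re.sub(r"<([^/>]+)/>", r"\1", original, count=1)
    funny = apply_edit(original, edit_word)
    return plain, funny

plain, funny = make_pair(
    "Kushner to visit <Mexico/> following latest trump tirades",
    "therapist",
)
# plain: "Kushner to visit Mexico following latest trump tirades"
# funny: "Kushner to visit therapist following latest trump tirades"
```

Pairs like `(plain, funny)` can then be fed to any off-the-shelf sequence-to-sequence translation setup, with the plain headline as the source and the humorous edit as the target.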

In the table below, you can see the original headline, the human-edited headline from the Humicroedit dataset, and our “translated” headline.  We also include a random baseline, labeled “random” (more on that later).

Table: example news headlines edited to become jokes, with joke versions generated by humans, by random changes, and by our machine translation system.

Despite these promising results, we wanted to see whether they were actually an improvement, or whether the humor could simply be attributed to randomness.  We conducted an experiment on Mechanical Turk, an online crowdsourcing platform.  To test whether random effects were driving our results, we created a random baseline that samples a verb, adjective, or noun and replaces it with a random word of the same part of speech (thus preserving sentence fluency).
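The random baseline can be sketched as follows.  The real baseline would draw replacements from a large POS-tagged vocabulary (e.g. via an off-the-shelf tagger); this self-contained version substitutes a tiny hand-made lexicon and pre-supplied tags purely for illustration.

```python
import random

# Toy stand-in for a real POS-tagged vocabulary; the actual baseline
# would sample from a much larger word list per part of speech.
LEXICON = {
    "NOUN": ["llama", "toaster", "senator", "pigeon"],
    "VERB": ["juggles", "devours", "applauds"],
    "ADJ": ["soggy", "majestic", "suspicious"],
}

def random_edit(tokens, pos_tags, rng=random):
    """Pick one noun, verb, or adjective at random and replace it with
    a random word of the same part of speech."""
    candidates = [i for i, tag in enumerate(pos_tags) if tag in LEXICON]
    if not candidates:
        return list(tokens)
    i = rng.choice(candidates)
    edited = list(tokens)
    edited[i] = rng.choice(LEXICON[pos_tags[i]])
    return edited

tokens = "Senate approves budget deal".split()
tags = ["NOUN", "VERB", "NOUN", "NOUN"]
edited = random_edit(tokens, tags, random.Random(0))
```

Because the replacement respects part of speech, the output stays roughly grammatical even though the word choice is arbitrary — which is exactly the property that makes this a fair baseline for “randomness alone can be funny.”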

We then asked crowdworkers to rate headlines from all four categories: how fluent each was (on a scale of 1-5), how humorous it was (on a scale of 1-5), and whether they thought it was human-generated or machine-generated.  You can see our results below:

Graph: how favorably humans rated joke headlines written by humans vs. generated by random changes vs. generated by our machine translation system.

Bars indicate two standard deviations for the crowdsourcing ratings.  The “translation” model performed similarly to the human “edited” headlines in all cases, whereas the random model performed significantly worse.  Workers also judged the human-edited and machine-translated headlines to be equally human-like, with around 55% of workers believing each was human-generated.

To directly compare the output of the different models, we also conducted an A/B test, asking workers to decide which of two headlines was more humorous.  We found that the “translation” model performed significantly better than the random model, while the human-edited headlines were not significantly different from the “translated” model.
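One simple way to check whether an A/B preference like this is significant is a two-sided binomial sign test.  This is a generic sketch, not necessarily the exact test from our paper, and the counts below are made up for illustration.

```python
from math import comb

def sign_test_p(wins, trials):
    """Two-sided binomial sign test: p-value for a preference split at
    least this lopsided if both headlines were equally funny (p = 0.5)."""
    k = max(wins, trials - wins)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

# Hypothetical counts: translation model preferred in 70 of 100 pairings.
p = sign_test_p(70, 100)  # well below 0.05, so the preference is significant
```

A 70/100 split yields a tiny p-value, while something like 52/100 does not — which is the kind of distinction behind statements such as “significantly better than the random model” vs. “not significantly different.”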

Our results were published at the Figurative Language Workshop at ACL 2020.  We hope these intriguing results will draw increased research attention to humor generation.  If you’d like to take the model for a spin on some new headlines, feel free to grab it on GitHub or check out our paper!
