Humor Generation with Recurrent Neural Networks


Generating humor is a challenging problem in Computational Linguistics and Natural Language Processing. Humor is subjective, and different people interpret it in many different ways. There have been attempts [1][2] to formalize humor from the perspective of Artificial Intelligence. However, the drawback of these methods is that they generate only a very specific category of humor.

Deep Neural Networks have been successful in learning and understanding complex decision boundaries. In this blog post, I will discuss a method to generate funny texts using a character level LSTM model. The approach is quite simple:

  1. Collect a large corpus of jokes.
  2. Train a character-level LSTM network, tuning its hyperparameters carefully.
  3. Once training is done, sample jokes from the model.

To build some intuition for why Recurrent Neural Networks suit humor generation, let us first briefly discuss their architecture and how they work.

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are useful for modeling sequential data, that is, data with a temporal pattern, such as text, image captions, or ICU patient data. An RNN is a simple feedforward neural network with feedback. At each timestep, it generates a new output based on the current input and the past output. A simple Recurrent Neural Network architecture looks like:

Image: RNN (Image Credits: WildML)
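To make the feedback loop concrete, here is a minimal sketch of a single-layer vanilla RNN in plain Python. It is an illustration only, not the code used in this project: the function names `rnn_step` and `run_rnn`, the tanh nonlinearity, and the weight names `W_xh`, `W_hh` and `b` are assumptions chosen for the example.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One timestep of a vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        # contribution of the current input
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        # contribution of the previous state (the feedback connection)
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_rnn(inputs, W_xh, W_hh, b):
    """Run the RNN over a sequence of any length and return every state."""
    h = [0.0] * len(b)  # initial state (assumed to be all zeros)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states
```

Because `h` is passed back into `rnn_step` at every iteration, the state after the second input also depends on the first input, and the same loop works for sequences of any length.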

RNNs are much more flexible than a simple feedforward neural network. We can pass variable-sized inputs to an RNN and even get a variable-sized output. For example, an RNN can be trained to learn binary addition bit by bit. It learns the state machine of a binary adder. After training, we can pass it two inputs of arbitrary length and it generates the correct output without explicitly performing any addition operation!
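For reference, the state machine of a binary adder can be written out by hand as a lookup table. The sketch below is that hand-written machine, not a trained RNN. The names `ADDER_TRANSITIONS` and `add_bits` are made up for this example, and it assumes both inputs have the same length and are given least significant bit first.

```python
# (carry_in, bit_a, bit_b) -> (output_bit, carry_out)
ADDER_TRANSITIONS = {
    (0, 0, 0): (0, 0),
    (0, 0, 1): (1, 0),
    (0, 1, 0): (1, 0),
    (0, 1, 1): (0, 1),
    (1, 0, 0): (1, 0),
    (1, 0, 1): (0, 1),
    (1, 1, 0): (0, 1),
    (1, 1, 1): (1, 1),
}


def add_bits(a_bits, b_bits):
    """Add two equal-length bit lists, least significant bit first."""
    carry = 0  # the only state the machine has to remember
    out = []
    for a, b in zip(a_bits, b_bits):
        bit, carry = ADDER_TRANSITIONS[(carry, a, b)]
        out.append(bit)
    out.append(carry)  # final carry becomes the most significant bit
    return out
```

The single carry bit plays the role of the hidden state: it is all the machine needs to remember from earlier timesteps, which is why inputs of any length work.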

For this problem, I used a character-level LSTM to produce jokes, i.e. the input and the output of the LSTM at each timestep is just a single character. Hence we are not explicitly modelling any higher-level structure such as words or grammar. Astonishingly, as we will see in later sections, the RNN is able to learn this structure implicitly on its own.
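As a rough illustration of what "character level" means for the training data, the sketch below maps each distinct character to an integer and pairs every character with the one that follows it. The helper names `build_vocab` and `make_pairs` are hypothetical and are not taken from the project's source code.

```python
def build_vocab(text):
    """Map every distinct character to an integer id and back."""
    chars = sorted(set(text))
    char_to_idx = {c: i for i, c in enumerate(chars)}
    idx_to_char = {i: c for c, i in char_to_idx.items()}
    return char_to_idx, idx_to_char


def make_pairs(text, char_to_idx):
    """Return (inputs, targets) where each target is the next character."""
    ids = [char_to_idx[c] for c in text]
    return ids[:-1], ids[1:]


char_to_idx, idx_to_char = build_vocab("abca")
inputs, targets = make_pairs("abca", char_to_idx)
# inputs encode "abc" and targets encode "bca"
```

The network only ever sees these integer ids (typically one-hot encoded), so any notion of a word or a sentence has to emerge from the statistics of which character tends to follow which.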

Short Description of Dataset

For RNNs to work well, a good amount of data is needed. I tried to find a corpus of jokes from relevant resources and publications in humor research, but I could not collect more than 20,000 jokes. After removing inappropriate and repeated jokes, I was left with only about 17,000 jokes, which was not enough to get good, funny results from an LSTM model. I decided to build my own corpus of jokes by crawling various websites such as Reddit and Twitter. The final dataset I built is featured on Kaggle and contains 231,657 jokes. The crawler scripts can be viewed in my short-jokes-dataset repository.

Hyperparameter Tuning

To find a good model, I tried various RNN architectures, varying the number of layers, the number of hidden units in each layer, the sequence length and the batch size. All of these hyperparameters should be tuned carefully for the dataset at hand; otherwise the model may overfit or underfit.

The Short Jokes dataset has over 22 million tokens. It is generally advisable to use a model whose number of parameters is of the same order of magnitude as the number of tokens in the dataset. A 3-layer LSTM model with 1024 hidden units in each layer satisfies this criterion, as it has about 22 million parameters. I tried the model with and without dropout, but in both cases the validation loss plateaued at about 1.15 after a certain number of iterations.
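The parameter count can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a standard LSTM with four gates and one bias vector per gate, one-hot character inputs, and a final linear layer over the vocabulary. The vocabulary size of 100 is a placeholder I picked for illustration, and the exact total depends on the implementation (some libraries keep two bias vectors per gate).

```python
def lstm_param_count(vocab_size, hidden_size, num_layers):
    """Approximate trainable parameters of a stacked character-level LSTM."""
    total = 0
    input_size = vocab_size  # one-hot characters feed the first layer
    for _ in range(num_layers):
        # 4 gates, each with input weights, recurrent weights and a bias
        total += 4 * (hidden_size * (input_size + hidden_size) + hidden_size)
        input_size = hidden_size  # deeper layers read the layer below
    # final linear layer mapping the top hidden state to character scores
    total += hidden_size * vocab_size + vocab_size
    return total


# vocab_size=100 is an assumed value, not the dataset's real vocabulary size
print(lstm_param_count(vocab_size=100, hidden_size=1024, num_layers=3))
```

Under these assumptions the total comes out a little above 21 million, in line with the figure of about 22 million quoted above. The vocabulary size barely matters here, because the recurrent weights dominate the count.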

While analyzing the dataset, I observed that the training data contains many confusing samples, i.e. the same preceding context, such as "What do you call", is followed by a different continuation in different jokes. For example,

"What do you call a green cow in a field? Invisibull."

"What do you call bacteria that can swim fast? Micro Phelps."

Samples like these in the validation set make it hard for the model to predict the correct continuation. To find the lowest achievable validation loss, I increased the number of parameters fourfold by using 2 hidden layers with 2048 units each (no dropout), and expected the model to overfit heavily. After 100,000 iterations, as expected, the model started overfitting and the training loss dropped to around 0.8. This is evident from the plots below, which compare four different models:

Plots: Validation Loss, Training Loss

After analyzing the jokes in the Short Jokes dataset, I observed that most of them are shorter than 150 characters. The sequence length of the LSTM model is also a crucial parameter. It is the number of time steps over which the LSTM is unrolled and gradients are propagated in the backward pass, i.e. the RNN can learn dependencies only across that many time steps.
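One common way to apply a sequence length is to cut the encoded text into fixed-size chunks and backpropagate through one chunk at a time. The sketch below shows only that chunking step. The function name `make_chunks` is hypothetical, and the project's actual data loader may differ.

```python
def make_chunks(ids, seq_length):
    """Split encoded text into (input, target) chunks of seq_length steps.

    Each target chunk is the input chunk shifted by one character, so the
    network is asked to predict the next character at every timestep.
    """
    chunks = []
    for start in range(0, len(ids) - seq_length, seq_length):
        x = ids[start:start + seq_length]
        y = ids[start + 1:start + seq_length + 1]
        chunks.append((x, y))
    return chunks


# ten encoded characters, unrolled 3 timesteps at a time
for x, y in make_chunks(list(range(10)), seq_length=3):
    print(x, y)
```

Gradients flow only within a chunk, so a shorter sequence length makes each update cheaper but gives the network less context to learn from directly.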

With this in mind, I tweaked the sequence length while keeping the number of hidden units at 1024 and the number of layers at 3. As the plot shows, this change clearly improves performance, reducing the validation loss to 1.1. However, the best fit over this dataset comes from a 3-layer network with 1400 units and a dropout of 0.5. This is because the number of parameters is huge and a dropout of 0.5 has a strong regularizing effect, which helps avoid overfitting. It is also worth noting that the best-fit model, despite a sequence length of just 50, is still able to model long-term dependencies.
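For readers unfamiliar with dropout, here is a minimal sketch of the widely used "inverted dropout" formulation on a plain list of activations. The function name `dropout` and its signature are assumptions for this example, and the framework used for the project may implement dropout differently.

```python
import random


def dropout(values, p, training, rng=random):
    """Inverted dropout: zero each value with probability p during training."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(values)  # at sampling time the activations pass through
    keep = 1.0 - p
    # survivors are scaled by 1/keep so the expected activation is unchanged
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0] * 8
print(dropout(activations, p=0.5, training=True, rng=random.Random(0)))
```

With p = 0.5, roughly half of the units are silenced on every training step, so the network cannot rely on any single unit, which is where the regularizing effect comes from.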


The results generated will be heavily dependent on the data, and since the data has been crawled from various websites, it may be offensive and inappropriate at times.

Now comes the fun part! Let us have a look at some samples generated by the LSTM network, which I trained for a few days on the Short Jokes dataset.

I like my women like I like my coffee. I don't like coffee.

Why did the cowboy buy a frog? Because he didn't have any brains.

Why can't Dracula be true? Because there are too many cheetahs.

What do you call a black guy flying a plane? A pilot, you racist!

I think hard work is the reason they hate me.

Why did the chicken cross the road? To see the punchline.

What do you call a political position? A condescending con descending.

What do you call a black man on the moon? An astronaut.

I like my women like I like my coffee... Still no lady.

What do you call a cow with no legs? Ground beef.

What's the difference between a snowman and a snow woman? Snowballs.

Your momma's so fat she threw his family up in the morning.

What do you call a short sense of humor? A charming saucer.

Hilarious indeed!

Of course some of the generated samples don’t make sense because the RNN gets confused between contexts of two jokes. But sometimes, this confusion results in antijokes too!

(Antijokes tend to start like regular jokes but lack a punchline)

What do you call a woman who has no legs? Doesn't matter.

What do you call a deer with no eyes? No idea.

What is the best thing about a good joke?

In the third sample above, the RNN didn’t even care to complete the joke by coming up with a punchline!

The sampling temperature controls how much randomness goes into choosing each character. If we keep decreasing the temperature, the network does more exploitation than exploration, and keeps generating similar words and characters.
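Temperature is usually applied by dividing the network's output scores by it before the softmax. The sketch below shows that standard formulation in plain Python. The function names and the example scores are made up for illustration, and this is not necessarily how the sampling code of this project is written.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng=random):
    """Sample the index of the next character from the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


scores = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
print(softmax_with_temperature(scores, 1.0))
print(softmax_with_temperature(scores, 0.4))
```

At a temperature of 0.4 the highest-scoring character receives a much larger share of the probability mass than at 1.0, so the samples become safer and more repetitive.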

The following are some samples generated with a temperature of 0.4:

What do you call a black man who flies a plane? A pilot, you racist bastard.

What do you call a black man flying a plane? A pilot, you racist fuck.

What do you call a snake that starts with a vest? An investigator.

What do you call a cow with no legs? Ground beef.

What do you call a fat psychic? A pilot, you racist bastard.

What do you call a pig with no legs? Ground beef.

To explore further, I compiled transcripts of all the episodes of the TV series F.R.I.E.N.D.S into a single 4.79 MB text file and trained a 3-layer LSTM network with 512 units on it. It learns how to properly start and close an episode, with the curtains closing, and the way characters enter and leave. You can take a look at the generated text here.

All in all, it was fun playing around with the RNN hyperparameters to find a good fit for the Short Jokes dataset while carefully monitoring the training and validation loss. It feels so natural to watch the way an RNN learns to write characters, then words, and then sentences that actually make sense! But there is still a lot of scope for research on RNNs, to make them write creatively and learn to think step by step like humans.

All the source code for this project is available in my GitHub repository.