Finetuning GPT-2 to Generate Beatles Lyrics


The Beatles were a huge cultural phenomenon. Their timeless music still resonates with people today, both young and old. Personally, I’m a big fan. In my humble opinion, they are the greatest band to have ever lived¹. Their songs are full of interesting lyrics and deep ideas. Take these bars for example:

When you’ve seen beyond yourself
Then you may find peace of mind is waiting there²

Powerful stuff. However, the thing that made the Beatles great was their versatility. Some of their songs are deep and thoughtful while others are fun and lighthearted. Unsurprisingly, the biggest theme that weaves throughout their lyrics is love. Here is one such verse:

My sweet little darling can’t you see,
how wonderful it is when you’re mine.
If only I could be your lover forever,
I can never get my heart, I can’t get my mind.

Actually, these lyrics weren’t written by any Beatles you may know. Not by Lennon, or McCartney, or Harrison, or even, God forbid, Ringo Starr (just kidding, Ringo’s alright). They were in fact generated by a Machine Learning model, namely OpenAI’s GPT-2 [1]. Although these were generated with the smallest version of the model, the results are quite astonishing.

But before we get too ahead of ourselves, let’s take a step back and look at how this all works. As always, full working code is available on my GitHub.

Language Modeling

Language models try to learn the structure of a language (e.g. English or Beatles’ lyrics). They are generative models that are trained using supervised learning. Like other supervised learning tasks, language models try to predict a label given some features. However, unlike most supervised learning tasks, there are no explicit labels; rather, the language itself acts as both the features and the labels.

At a high level, what a language model tries to do is predict the next word given a sequence of previous ones. For example, a good language model might predict that “milk” is the logical conclusion to the phrase “to buy a gallon of ____.”

By trying to guess which word comes next, what we’re really doing is learning a probability distribution over the vocabulary conditioned on the words we’ve seen so far. That is, we want to learn

P(w_n | w_1, w_2, …, w_{n-1})

where the w_i are words in our vocabulary.

Because we’re explicitly modeling this distribution we can do some cool things with it, such as using it to generate words we haven’t seen before. The way we do this is by repeatedly sampling the next word from the distribution, then adding it to the conditioning context when we sample the word after that, and so on. To make it concrete, let’s see what this might look like in Python. If we had a model object with a sample method, then we could generate new samples by doing something like this:
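
A minimal sketch of that loop (the model object and its sample method are hypothetical stand-ins, not a real library API):

```python
def generate_sentence(model, max_len=50):
    """Generate text by repeatedly sampling the next word from the model."""
    words = []
    for _ in range(max_len):
        # Sample the next word conditioned on everything generated so far.
        next_word = model.sample(words)
        words.append(next_word)
    return " ".join(words)
```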

How we might generate sentences from a language model.

Of course I’m skipping over a few details but hopefully these will become clearer as we go on. For now, let’s consider the world’s simplest language model, the unigram.

The unigram model ignores any conditioning and simply chooses the next word randomly from the training data. What this amounts to is throwing our training data into a blender and spilling out the contents after a 10-minute blend on high. Needless to say, we’re not going to be generating anything that resembles English (unless of course we had a trillion monkeys with a trillion blenders, or was it typewriters).
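
In code, the unigram “model” really is that simple (training_words here is assumed to be a flat list of every word in the lyrics):

```python
import random

def unigram_sample(training_words):
    # Ignore all context; pick a random word from the training data
    # (frequent words come up more often since they appear more times in the list).
    return random.choice(training_words)
```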

The Bigram Model

A step above the unigram model is the bigram model. As you might have guessed from the name, the bigram model learns a distribution that is conditioned only on the previous word, i.e.

P(w_n | w_{n-1})

Since it’s so simple, the bigram model is easy to implement in Python, and building it will give us a deeper understanding of how language models work.

Collecting data

Before we get to the implementation, we first need some data. Our ultimate goal is to generate songs that would have made the Beatles proud, so let’s start by harvesting all their known lyrics.

I found this website which catalogues the lyrics for every song they ever released. It also has a helpful index page with links to the individual songs which we can use to scrape the site. I wrote a simple script to iterate over each link on the page, parse its HTML to extract the lyrics, and dump the lyrics to a file, one song per line. If you plan on following along, or just want the Beatles’ lyrics yourself, I highly recommend using it since scraping HTML is pretty tedious and annoying, even with tools like Beautiful Soup.
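
Roughly, such a script might look like the sketch below (the index URL and the HTML structure are placeholders, not the actual layout of the site):

```python
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example.com/beatles-lyrics-index"  # placeholder, not the real site

def scrape_lyrics(output_path="lyrics.txt"):
    index = BeautifulSoup(requests.get(INDEX_URL).text, "html.parser")
    with open(output_path, "w") as f:
        for link in index.find_all("a"):
            song_url = link.get("href")
            if not song_url:
                continue
            song_page = BeautifulSoup(requests.get(song_url).text, "html.parser")
            # Assumes each song page holds its lyrics in a <div class="lyrics"> element.
            lyrics_div = song_page.find("div", class_="lyrics")
            if lyrics_div is None:
                continue
            lyrics = lyrics_div.get_text(separator=" ").strip()
            f.write(lyrics.replace("\n", " ") + "\n")  # one song per line
```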

Once we have the data in a nice, clean format, the rest is easy. But don’t just take my word for it, take it from this chart:

Data cleaning and organizing accounts for the biggest chunk of data science projects (source).

Building the model

As mentioned above, the bigram model just samples the next word conditioned on the previous one. One simple way we can do this is to keep track of which words follow the current one and with what frequency. That is, we keep a dictionary for each word current_word in our training data, then every time we see a next_word, we update current_word[next_word] += 1. Then, to generate words we simply look up all the words and counts in the current_word dictionary and sample a word with probability proportional to its count. Here’s a sketch³ of what the full model would look like in Python:
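
Something along these lines (my reconstruction of the idea, not necessarily the exact code from the repo):

```python
import random
from collections import defaultdict

class BigramModel:
    def __init__(self):
        # counts[current][next] = how many times `next` followed `current` in training.
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, words):
        for current_word, next_word in zip(words, words[1:]):
            self.counts[current_word][next_word] += 1

    def predict(self, current_word):
        # Sample the next word with probability proportional to its count.
        candidates = self.counts[current_word]
        return random.choices(list(candidates), weights=list(candidates.values()))[0]
```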

A sketch of the bigram language model.

One last thing to note is that we probably want to preprocess the lyrics by adding some special tokens to denote the starts/ends of lines and songs. This is to force our model to maintain some of the song structure when generating new lyrics, otherwise the model will just spit out large blobs of text with no end. In my code I use XXSL, XXEL, XXSS, and XXES to denote start line, end line, start song, and end song respectively.
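
As a sketch, the preprocessing might look something like this (assuming each song is represented as a list of its lines):

```python
def preprocess_song(song_lines):
    """Turn a song (a list of lines) into tokens with start/end markers."""
    tokens = ["XXSS"]                 # start of song
    for line in song_lines:
        tokens.append("XXSL")         # start of line
        tokens.extend(line.split())
        tokens.append("XXEL")         # end of line
    tokens.append("XXES")             # end of song
    return tokens
```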

Finally, to generate songs, we can start with the XXSS token and keep calling model.predict() until we hit an XXES token.
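
Continuing the bigram sketch from above (cleaning up the special tokens in the output is left out for brevity):

```python
def generate_song(model, max_words=500):
    # Start from the start-of-song token and keep sampling until the model
    # emits the end-of-song token (with a length cap as a safety net).
    words, current_word = [], "XXSS"
    while current_word != "XXES" and len(words) < max_words:
        current_word = model.predict(current_word)
        words.append(current_word)
    return " ".join(words)
```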

Generating a song with the bigram model.

In theory, once the loop stops, we will have generated a never-before-seen Beatles song. But is it any good?

A never-before-seen Beatles’ song

Here’s a small snippet of one of the songs the bigram model generates:

She’s so I love her heart.
Well they are; they said I’m so many,
She no surprise
When you’re mine
Sad and the Amsterdam Hilton
they make my way,
Yes I wait a boy been born with a rich man,
all share

The generated samples sound like the ramblings of a madman and only make sense if we get extremely lucky.

We can keep extending the bigram model to take into account the previous two words. This is known as a trigram model, and you may find that it actually produces better-sounding lyrics. This is because there are fewer three-word combinations, so the model has fewer choices at each step, and at some steps it only has one choice. In general, we can create an arbitrary n-gram model by considering the previous n words. When n is equal to the length of a full song, you may find that the model is perfect at generating Beatles’ songs. Unfortunately, the songs it generates already exist.
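
The only real change from the bigram counts is to key them on the previous two words instead of one, e.g.:

```python
from collections import defaultdict

# Placeholder training tokens; in practice this would be the full preprocessed lyrics.
words = ["XXSS", "XXSL", "love", "me", "do", "XXEL", "XXES"]

trigram_counts = defaultdict(lambda: defaultdict(int))
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    # Condition on the previous *two* words by using the pair as the dictionary key.
    trigram_counts[(w1, w2)][w3] += 1
```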

Towards a better model

One of the most glaring issues with the bigram model is that it will only use the words and phrases it has seen in the training data. While we want to generate lyrics which sound like they were written by the Beatles, we don’t want to constrain ourselves to only the words they used. For example, if the word “parade” was never used by the Beatles, then the bigram model will not generate any songs about “parades”. Of course, since we’re training only on the Beatles’ lyrics, we can’t possibly expect our model to use words it’s never seen. What we need is to train on enormous corpora, such as Wikipedia or Reddit.

However, even if we trained on all of Wikipedia and saw every single word in the English language, our bigram model would still be way too rigid. For example, take the phrase “tall man”. Everyone with a basic grasp of the English language would recognize that “tall” is simply a modifier of “man” and not tied to it. Instead, “tall” could be used to modify countless other things, such as “woman”, “boy”, “building”, “giraffe”, etc. Our bigram model, however, can’t learn this; it must see each specific use of “tall” at least once before it can reproduce it. So if the model only ever saw “tall man”, “tall boy”, and “tall woman”, but not “tall girl”, then as far as it’s concerned, the phrase “tall girl” doesn’t even exist.

Therefore, what we want is a model which has a much richer vocabulary and a deeper understanding of the relationships between the words in the vocabulary. Luckily for us, clever researchers have already invented such powerful models and we can use them to generate much better songs.

The GPT-2 Model

OpenAI’s GPT-2 model [1] recently made headlines for being “too dangerous to release.” The model generated such convincing text that the authors believed it might be used for malicious purposes⁴. Instead they released two smaller versions for people to play around and experiment with. We’ll use the smaller of the two to generate our Beatles’ lyrics.

GPT-2 is a transformer-based model which was trained for hundreds of GPU hours on a huge amount of text linked from Reddit. During its training, it was able to learn a very good model of the English language (or at least the version of English used on Reddit). This means that it’s able to understand that words like “tall” can apply to humans or buildings or giraffes. Also, because it was trained on such a huge corpus, it’s likely that it has seen 99.9% of words and phrases in the English language. This is great news for us since this is exactly what we were looking for: an expansive vocabulary and a deep understanding of how to use that vocabulary.

However, if we turned on the model and asked it to generate something, it’s extremely unlikely that it would come up with anything resembling Beatles’ lyrics (even if r/beatles exists). This is because the model doesn’t know that what we care about is generating Beatles lyrics; after all, that’s not what it was trained to do. Instead, we need to nudge the model towards doing what we want it to do. One way we can do this is via transfer learning.

Transfer Learning

Transfer learning is the idea that we can leverage information we’ve learned by doing one thing and apply it to solve a related problem. For example, when you started reading this article, you didn’t have to relearn what words are, which words follow which other words, or how they fit together to form sentences. Imagine how tedious that would be. Instead you leveraged all those hours spent reading AP Literature books to understand what I’m talking about now (I guess Huck Finn came in handy after all).

In a similar fashion, we can leverage the knowledge GPT-2 has already gained through its hundreds of hours of reading Reddit posts and transfer it to our task of generating Beatles’ lyrics. The high-level idea is to take the pre-trained model and keep training it for a bit longer. However, instead of Reddit posts, we’ll use only the scraped Beatles’ lyrics. This will heavily bias the model towards generating Beatles-like songs.

I’ll skip exactly how to do this here since it would take another post of similar length to explain everything. Instead, if you’re interested in the exact details, I’ll refer you to [2]. It’s a great blog post with step-by-step instructions of how to transfer the GPT-2 model to any language task you care about. It’s also what I followed to get the results shown here.
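
Just to give a flavor of what this looks like, here’s a minimal fine-tuning loop using the Hugging Face transformers library. This is a simplified sketch, not the exact setup from [2], and it glosses over batching, learning-rate schedules, and checkpointing:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # the smallest GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# "lyrics.txt" is assumed to hold the scraped lyrics, one song per line.
songs = open("lyrics.txt").read().splitlines()

model.train()
for song in songs:
    input_ids = tokenizer.encode(song, return_tensors="pt")[:, :1024]
    # Passing labels=input_ids makes the model compute the standard
    # next-word (language modeling) loss internally.
    loss = model(input_ids, labels=input_ids)[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```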

The new Beatles

Like all good deep learning results, the lyrics I posted in the introduction were heavily cherry picked. The generated songs are not all as good and their quality depends on where in the fine-tuning phase you are. Here’s what you might get after fine-tuning for ~100 mini-batches, when the model is still heavily underfitting the training data:

Ev’ry I Love You
Lennon & McCartney
Ev’ry I will always be
Ev’ry you love me.
Ev’ry you love me,
Ev’ry you love me too.
Ev’ry I will always be

And it just goes on like that for another 10–15 lines. Still better than Lil’ Pump at least.

Joking aside, what’s very interesting to me is the first two lines. In the training data, each song starts out with the title on the first line, the writer(s) on the second, and the actual lyrics in the following lines. Even at this early stage the model has managed to learn the structure of our data: the first and second lines are special; in the second line there are only a few combinations of words that could possibly appear, with Lennon & McCartney being the most likely.

If we fine-tune for ~350 mini-batches, the model starts generating much more believable lyrics, such as the one in the intro, or this one:

Woman in Black
Lennon & McCartney
I’d make a scene
If you don’t want me to appear
You might as well leave me alone.
I’m near death and I’m in love

Not perfect, but not bad. Finally, this is what happens if we keep fine-tuning for way too long (~2800 mini-batches):

Get Back (Get Back)
Lennon & McCartney
Yellow Submarine
Lennon & McCartney
On a Saturday night as the sun shines down on me
The sun is out, the sails are clear
The sun is in, the sails are clear
Oooh — Hey

The model starts overfitting and the generated samples are things that are very likely to be present in the training data, such as the repeated “Lennon & McCartney” lines, “Yellow Submarine,” etc. I found that fine-tuning for ~300–500 steps produced the best lyrics.

Conclusion

Hopefully you now have a much better idea of how language models work and how we can leverage state-of-the-art models to greatly improve downstream tasks.

That being said, there’s so much left to explore with the GPT-2 model. The samples I generated used the default (and probably suboptimal) hyperparameters. It would be interesting to see how much better the generated lyrics could be if more time were spent getting the fine-tuning right. I also only used the smallest of the released models since I was using my laptop to train. I’m confident the bigger model would produce even more spectacular results. Finally, OpenAI recently released MuseNet, which is capable of generating pretty realistic-sounding music. How amazing would it be to put GPT-2 and MuseNet together (they’re basically the same model) and generate both the lyrics and the accompanying music? If I had more time, money, or any idea of what I’m doing, I’d love to generate a full-fledged song using Machine Learning, then have someone with actual talent perform it.

Thanks for reading!

Eugen Hotaj,
June 30, 2019

P.S. If you liked this article, then follow me to get notified about new posts! As always, all the code and data is available on my GitHub.

Footnotes

¹ Though I think the best solo musician would have to be another ’60s legend, Bob Dylan. His single, Like a Rolling Stone, may be the best song ever written, and that’s not just my opinion.

² From Within You Without You, my favorite Beatles’ song from my favorite Beatles’ album. It’s so weird, yet so good.

³ For the sake of brevity I’ve skipped over a few minor details. You can find all the gory details on my GitHub.

⁴ Some people believe that this was just a huge publicity stunt, but that’s neither here nor there.

References

[1] A. Radford et al., Language Models are Unsupervised Multitask Learners (2019)

[2] S. Todorov, Generating Fake Conversations by fine-tuning OpenAI’s GPT-2 on data from Facebook Messenger (2019)


