Ever since the groundbreaking release of ChatGPT, I’ve been wanting to look into these “large language models” (referred to from here on as LLMs). LLMs, at their core, are autoregressive transformer-based machine learning models scaled up to learn from vast collections of data scraped from the internet. The central concession of an autoregressive model is that it cannot remember everything it has seen: it takes the prior $n$ tokens as input to generate the $(n+1)$th token, then discards the earliest token and appends the newly generated one in a sliding-window fashion before passing the result back into the model to generate the $(n+2)$th token. One might not expect that intentionally forgetting the earliest inputs would make for an effective language model, but results ever since OpenAI’s Generative Pre-trained Transformer (GPT) have proven otherwise. Combined with major advancements in other areas of NLP like Google’s SentencePiece tokenization, researchers have been able to achieve record-breaking performance on many natural language tasks using autoregressive language models. The most recent iteration of OpenAI’s GPT, GPT-4, can even outscore most human test-takers on legal and medical exams.
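
To make that sliding-window idea concrete, here is a minimal sketch of autoregressive generation with a fixed context window, using distilgpt2 (the same small GPT-2 variant that shows up later in this post). The window size and greedy decoding are arbitrary choices for illustration, not anyone’s production setup.

# Minimal sketch of sliding-window autoregressive generation (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

WINDOW = 64  # pretend the model can only "remember" the last 64 tokens

context = tokenizer.encode("The quick brown fox", return_tensors="pt")
for _ in range(20):
    window = context[:, -WINDOW:]                 # keep only the most recent tokens
    with torch.no_grad():
        logits = model(window).logits             # (batch, seq_len, vocab_size)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
    context = torch.cat([context, next_token], dim=-1)          # append and repeat

print(tokenizer.decode(context[0]))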

But while the size of large language models is the source of their incredible success when computing resources are abundantly available, it becomes their greatest weakness in cases where compute is scarce. GPT-3 has 175 billion parameters, far too large to run on many single-node server systems even for inference alone (and GPT-4, while its exact parameter count has not been released publicly, is presumably even larger). I was never going to be able to run something of this scale on my personal laptop, not to mention that GPT-3, 3.5, and 4 are all closed source, so I couldn’t access the models even if I had the compute available.

Mood Analysis of Lyrics

Yet it turns out that SOTA language models are neither necessary nor desirable for many task-specific NLP applications. To argue this point, I present a little side-project I’ve been working on for ECLAIR, a student organization I recently joined. The org’s mission is to facilitate student learning and research in AI and robotics, and the project I chose to work on for the semester was developing an ML model capable of analyzing a song’s lyrics and determining which emotions are evoked by that song.

Attempt 1: Multilabel Classifier

Our team used this Kaggle dataset for song lyrics and the MuSe dataset for the textual emotional tags associated with each song. My initial approach to this problem was to treat each unique tag as a category and build a multilabel classifier that, given a song’s lyrics, predicts which “tag categories” the song belongs to. The problem was that there were 294 tags distributed surprisingly evenly across over 16K songs, meaning most tags appeared in less than 1% of the entire dataset. This created a severe class imbalance in the training data, and every classifier I trained simply learned to predict 0 for all 294 possible outputs.
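
For reference, here is a toy sketch of what that multilabel setup looks like; the data below is made up, and the real label matrix has 294 columns rather than five.

from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the real data: one list of mood tags per song.
tags_per_song = [
    ["happy", "upbeat"],
    ["sad"],
    ["happy"],
    ["melancholy", "calm"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags_per_song)   # shape: (num_songs, num_unique_tags)

# In the real dataset this matrix has 294 columns, most of them >99% zeros,
# so a classifier minimizing average loss can "win" by predicting all zeros.
print(mlb.classes_)
print(Y)
print(Y.mean(axis=0))                  # per-tag positive rate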

Attempt 2: 12 Binary Classifiers

My next attempt involved only investigating the presence of the 12 most frequent tags (those present in at least 1% of the dataset without any resampling), creating an independent classifier for each of those tags, and leaning heavily on random undersampling and SMOTE oversampling to create a 50/50 dataset of positive and negative examples for each tag. To support this strategy, I switched from one-hot word-level encodings to word encodings generated by a pretrained FastText model (Facebook’s improvement upon word2vec, released in 2016) so that spatial relationships between encodings would carry more meaning. Once every classifier was trained on its respective tag, I wrote a simple function that took in a song’s lyrics, passed them through all 12 binary classifiers, and returned the tag associated with each classifier that predicted a value of 1. This method was the first I tried that achieved anything resembling a solution to the emotion analysis problem, and it did perform better than random chance, but not by much.
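
Here’s a hedged sketch of how one of those per-tag classifiers could be put together with pretrained FastText vectors and imbalanced-learn’s resampling utilities. The FastText model file, the averaging-based sentence embedding, and the logistic regression classifier are assumptions on my part rather than an exact record of the original code.

import numpy as np
from gensim.models.fasttext import load_facebook_vectors
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

# Pretrained FastText vectors; "cc.en.300.bin" is an assumed file name.
wv = load_facebook_vectors("cc.en.300.bin")

def embed(lyrics):
    """Average the FastText vectors of all words in the lyrics."""
    words = lyrics.lower().split()
    return np.mean([wv[w] for w in words], axis=0) if words else np.zeros(wv.vector_size)

def train_tag_classifier(lyrics_list, has_tag):
    """Train a single binary classifier for one tag on a rebalanced dataset."""
    X = np.stack([embed(text) for text in lyrics_list])
    y = np.asarray(has_tag)                                    # 1 if the song carries this tag
    pipeline = Pipeline([
        ("under", RandomUnderSampler(sampling_strategy=0.5)),  # throw away negatives
        ("smote", SMOTE()),                                    # synthesize positives up to 50/50
        ("clf", LogisticRegression(max_iter=1000)),            # classifier choice is an assumption
    ])
    return pipeline.fit(X, y)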

The obvious problem with “creating” data using SMOTE is overfitting to the few positive examples present in the original dataset. At the end of the day, it was the garbage in, garbage out principle. Our data just wasn’t tagged well. How could any classifier achieve meaningful accuracy analyzing the moods of songs with so few examples of what each mood meant?

Attempt 3: Language Model Fine-Tuning

It was then that I connected the dots and realized that there was a way to solve this problem, using the very same language models that I’d been researching. My trained-from-scratch classifier would never see enough examples of “happiness” to understand what it meant for a song to be happy, but you can ask ChatGPT to define happiness right now and it’ll happily (no pun intended) do it for you. Fine-tuning a pretrained language model had the potential to be the solution to my emotion recognition problem.

So I went ahead and loaded a pretrained version of distilgpt2 (about the largest model I could fine-tune on my laptop) and reformulated my classification problem as a generative one, taking advantage of the question-and-answer format that GPT-2 had already encountered during pretraining. My training data looked something like the following:

Q: What are the moods evoked by the following song excerpt?
<Song_excerpt>
A: Happy, bright, exciting<|endoftext|>

Songs were split up into excerpts such that each data point was exactly 128 tokens in length (it was surprisingly difficult and slow to predict how many tokens the question and answer portions would be encoded as by distilgpt2’s byte-level BPE tokenizer). This was far less than distilgpt2’s actual context limit, but it was the limit of what would train within a reasonable time on my laptop. The model was fine-tuned in an autoregressive fashion, learning to predict the $(n+1)$th token given the past $n$ tokens. Fine-tuning took around 18 hours for 15 epochs before the loss stabilized.
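
A minimal sketch of that fine-tuning setup with the 🤗transformers Trainer is below. The hyperparameters, the dataset wrapper, and the toy train_texts list are placeholders standing in for the real preprocessed excerpts, not the exact original code.

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Toy stand-in for the real training data: each string follows the Q/A template above.
train_texts = [
    "Q: What are the moods evoked by the following song excerpt?\n"
    "<Song_excerpt>\n"
    "A: Happy, bright, exciting<|endoftext|>",
]

class LyricsMoodDataset(torch.utils.data.Dataset):
    """Wraps the pre-formatted prompt strings as fixed-length 128-token examples."""
    def __init__(self, texts):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=128, return_tensors="pt")
    def __len__(self):
        return self.enc["input_ids"].shape[0]
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.enc.items()}

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM labels
args = TrainingArguments(output_dir="mood-distilgpt2",            # assumed output path
                         num_train_epochs=15,
                         per_device_train_batch_size=4)
trainer = Trainer(model=model, args=args,
                  train_dataset=LyricsMoodDataset(train_texts),
                  data_collator=collator)
trainer.train()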

Model loss over 15 training epochs.

Performance Evaluation

So how well did the model end up working? From a subjective analysis of a few songs I knew… not that great, but it was still a giant improvement over what I was able to achieve with my previous trained-from-scratch classifier. Below is an example result for the song Poker Face by Lady Gaga:

Q: What are the moods evoked by the following song excerpt?
Chorus]
Oh, woah-oh, oh, oh
Oh-oh-oh-oh-oh-oh
I'll get him hot, show him what I've got
Oh, woah-oh, oh, oh
Oh-oh-oh-oh-oh-oh
I'll get him hot, show him what I've got

Can't read my, can't read my
No, he can't read my poker face
(She's got me like nobody)
Can't
A: ersatz, witty, sexual, confident, r
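
For reference, generations like the one above come from prompting the fine-tuned model with the same Q-and-A template used during training and decoding only the newly generated tokens. A rough sketch, where the checkpoint path and sampling settings are assumptions:

from transformers import AutoModelForCausalLM, AutoTokenizer

# "mood-distilgpt2" is the assumed path of the fine-tuned checkpoint from the
# training sketch above; generation parameters are guesses, not the real config.
tokenizer = AutoTokenizer.from_pretrained("mood-distilgpt2")
model = AutoModelForCausalLM.from_pretrained("mood-distilgpt2")

PROMPT = "Q: What are the moods evoked by the following song excerpt?\n{excerpt}\nA:"

def predict_moods(excerpt):
    """Prompt the fine-tuned model with one excerpt and return the generated tags."""
    inputs = tokenizer(PROMPT.format(excerpt=excerpt), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=True,
                                top_p=0.9, pad_token_id=tokenizer.eos_token_id)
    generated = output_ids[0, inputs["input_ids"].shape[1]:]  # tokens after the prompt
    return tokenizer.decode(generated, skip_special_tokens=True)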

Extending the model to operate over the entire song produces slightly more interesting results:

[['®overtible', 'slick', 'energetic'],
 ['ery', 'sexy', 'smooth', 'trippy', 'calm'],
 ['ive', 'fun', 'driving', 'energetic', 'energetic'],
 ['é', 'theatrical', 'earnest', 'reflective'],
 ['clinical', 'detached', 'warm', 'passionate', 'cerebral'],
 ['sexual', 'lyrical', 'smooth', 'narrative'],
 ['manic', 'playful', 'passionate', 'reflective', 'confident'],
 ['ersatz', 'sexual', 'sentimental', 'powerful', 'light'],
 ['ive', 'bittersweet', 'calm', 'soft', 'quiet']]

There are some obvious problems here (not every generated tag is a real word, some tags don’t make any sense, many duplicates appear, etc.), but many of these can be fixed by output parsing and collating. The rest of my semester in the org will be spent implementing those improvements and conducting a formal statistical analysis of the model to determine objectively how well it performs.
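
As a first pass at that post-processing, one could keep only generated tags that appear in the original MuSe tag vocabulary and collate them by frequency across excerpts. A hedged sketch, with made-up variable names and a deliberately truncated vocabulary:

from collections import Counter

def collate_tags(excerpt_tags, known_tags, top_k=5):
    """Keep only tags from the known MuSe vocabulary, ranked by frequency."""
    counts = Counter(
        tag.strip().lower()
        for tags in excerpt_tags
        for tag in tags
        if tag.strip().lower() in known_tags
    )
    return [tag for tag, _ in counts.most_common(top_k)]

# Example with two of the Poker Face excerpts above and a truncated vocabulary:
excerpt_tags = [["®overtible", "slick", "energetic"],
                ["ive", "fun", "driving", "energetic", "energetic"]]
known_tags = {"slick", "energetic", "fun", "driving", "sexy", "calm"}
print(collate_tags(excerpt_tags, known_tags))   # ['energetic', 'slick', 'fun', 'driving']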

Takeaways

This project was actually the first time I used the 🤗transformers library, and it was shockingly easy to use the same transformer architectures that I’d been previously implementing from scratch in TensorFlow. The final performance wasn’t astounding, but I can only really think of three things that could improve it:

  1. Fine-tuning a larger, more current language model.
  2. Training over a larger dataset.
  3. Increasing the excerpt size to give the model more context.

Obviously all three of these options are constrained by compute capability, so I don’t think I can do much better than what I already have with my laptop alone (simple output parsing methods aside).

Societal Implications of LLMs

What’s most interesting (and concerning), in my opinion, is that I was able to do something like this at all with so little code. The preprocessing of my lyrics and the splitting of songs into excerpts aside, it was only a few lines of code to instantiate a Trainer object and fine-tune distilgpt2. And since this is a highly generalized generative model, there’s nothing binding it to the specific task I fine-tuned it on. distilgpt2 could be fine-tuned to classify, summarize, and generate pretty much any textual data.

It would not have been much of a challenge for me to fine-tune distilgpt2 to, say, generate misinformation or hate speech based on prompts. It would’ve been pretty much the exact same training pipeline, but with different data and a different evaluation method. In fact, fine-tuning may not even be completely necessary. GPT-Neo, an older language model with 1.3 billion parameters, was right at the limit of what I could use for inferencing on my laptop. With the right prompts, even the vanilla model was able to produce the following generations:

Clearly it’s racist…

And sexist…

And capable of impersonating politicians with both those traits…

And capable of generating dangerous misinformation.

The most powerful open-source language models today already have an order of magnitude more parameters than GPT-Neo. It would be trivial for anyone with enough financial backing to fine-tune something like GPT-NeoX on vaccine misinformation, for example, and connect it to existing social media bots to influence the opinions of American voters. We’ve known about foreign influence on our elections through bots for a long time; in the past it was carried out primarily by human “troll farms” in low-income countries where cheap labor to generate misinformation is widely available. LLMs have the potential to fully automate this process and evade detection far better than the often non-native English speakers who write today’s bot posts. If I was able to do this with a laptop, imagine what state-backed malicious actors could do with large amounts of resources at their disposal.

OpenAI’s stated solution to this problem thus far has been to not open-source their models (personally I think they do that more for the profit incentive than the social implications, but I digress). While this does keep SOTA technology out of the hands of malicious actors, SOTA technology is clearly not necessary. In fact, in many cases it’s not even desirable. While I certainly could’ve produced better results with a much larger model, doing so would come with additional costs in terms of inferencing time and compute requirements. This is true even for troll farms, where quantity is often valued over quality. The technology that’s already out there is enough to cause us significant problems.

Not to mention that a single company’s decision to wall off SOTA language models has no bearing on other researchers. Open-source alternatives to OpenAI’s language models have been created for years now, and while they too often have some safeguards in place, it’s much easier to circumvent them because the code is publicly available and open to modification. OpenAI also can’t stop high-level malicious actors with enough data science talent and computing resources from simply creating their own copycat LLMs from scratch with the express purpose of generating misinformation and hate speech. All they’ve really done is create a massive roadblock for any well-intentioned developers wanting to use GPT-3, 3.5, 4, etc. for harmless purposes (case in point: me).

So how can we protect democracy from the threats posed by malicious actors equipped with LLMs? Honestly… I have no clue. This is an area in dire need of more research. Detection of AI-generated content is the simplest and most effective approach for the moment, but it’s a stopgap that will fall apart once LLMs get good enough at emulating human speech. Beyond that point, innocent people would start being incorrectly flagged as AI, because there just isn’t enough information in a short textual post to reliably determine whether it was human-generated.

Perhaps the solution lies in cracking down on bot and duplicate accounts and ensuring that every account belongs to a real, unique person (though there are difficulties with that too, as I previously elaborated). Or perhaps the solution is more social in nature, and the best thing we can do is educate people on the dangers posed by AI and the importance of independently verifying the things they see online. These approaches both have their own issues, but they’re undeniably better than attempting to lock down the technology and stop progress.

Anyways, I think that’s enough doom and gloom for one day. I’m sure I’ll come back to LLMs in the future now that I’ve learned how easy it is to use the 🤗transformers library. Feel free to check out the code repository for my song analysis project (apologies for the mess…) and reach out to me with any further questions about my work.