Artist-based lyrics generator using machine learning
In this article I’ll show you how you can fine-tune and train your own Transformer using HuggingFace.
As a sequel to my previous article about Music Generation using Recurrent Neural Networks, we will be using HuggingFace’s transformers to generate lyrics based on the artist we train it on. For this article, we will use Diomedes Diaz’s lyrics to generate a new song in his writing style. I chose Diomedes because of his legacy and influence in Colombian Caribbean culture.
I’d encourage you to follow along with this article: pick your favorite artist, try different parameters and/or different models, and see what kind of results you get. All the code is in a GitHub repo, linked at the end of the article.
In this article, we will follow these steps:
- Scrape the lyrics from the internet.
- Process the lyrics into a .txt file.
- Download the tokenizer and model from HuggingFace.
- Fine-tune the model on the scraped data.
- Generate a new song with the fine-tuned model.
Web scraping
We need a website that contains lyrics for the artist we want to use, in our case Diomedes. It turns out that letras.com has all of his lyrics in an accessible (and scrapable) form.
We will use Beautiful Soup for this task, so let’s import the relevant libraries:
import requests
from bs4 import BeautifulSoup
Web scraping is a very iterative process, and a basic understanding of HTML will help you follow what we’re doing here.
page = requests.get('https://www.letras.com/diaz-diomedes/')
soup = BeautifulSoup(page.text,'html.parser')
Inspecting the website’s code, we can see that the list of songs is inside list tags (<li>), so let’s take a look at those tags specifically.
listOfSongs = []
for i in soup.find_all('li'):
    listOfSongs.append(i)
listOfSongs = [str(x) for x in listOfSongs]
Once we have inspected their contents, let’s keep only the entries that contain ‘https’, to make sure we’re getting the links to the individual song lyrics:
newSongs = []
for link in listOfSongs:
    if 'https' in link:
        newSongs.append(link)
This is what our new list looks like:
Now let’s extract the links from the songs:
urls = []
for song in newSongs:
    try:
        # The link sits in the 13th double-quote-delimited field of the <li> tag's HTML
        urls.append(song.split('"')[13])
    except IndexError:
        print(song)
This is what our new list of song-lyric links looks like:
Now that we have the links to all of Diomedes’ songs, we can just scrape them one by one, following a similar pattern to what we have done so far.
We should end up with a list in which each element is a string of one song’s lyrics. From there, we can create a .txt file containing all of his songs’ lyrics, preferably separated by newlines; a sketch of this step is shown below.
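Here is a minimal sketch of that loop, reusing the requests and BeautifulSoup imports from above. Note that the CSS class used to locate the lyrics container ('lyric-original') and the output filename ('cancionesDiomedes.txt') are assumptions for illustration; inspect the actual song pages with your browser’s developer tools to find the right selector.

# Sketch: scrape each song page and dump all lyrics to a single .txt file.
# The 'lyric-original' class and the output filename are assumptions; check
# the real page structure before relying on them.
lyrics = []
for url in urls:
    songPage = requests.get(url)
    songSoup = BeautifulSoup(songPage.text, 'html.parser')
    container = songSoup.find('div', class_='lyric-original')  # hypothetical selector
    if container is None:
        continue
    # Each <p> in the container holds a verse; keep its line breaks as newlines
    verses = [p.get_text(separator='\n') for p in container.find_all('p')]
    lyrics.append('\n'.join(verses))

with open('cancionesDiomedes.txt', 'w') as f:
    f.write('\n'.join(lyrics))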
With that, the lyrics-scraping part is over. Now let’s move on to the tokenizer and model part of the article.
Tokenizers and Transformers
Before picking our tokenizer, we have to select a model to use for lyrics generation. A popular medium-sized transformer is GPT-2, so that is the one we will use. Others that worked reasonably well were BERT and RoBERTa, if you want to give those a try as well. I tested all of the above and found that GPT-2 gave the most consistent results.
Keeping in mind that all of Diomedes Diaz’s songs are in Spanish, we will have to use a Spanish tokenizer. In our case, we will not be training a tokenizer from scratch, because there are plenty of pretrained Spanish tokenizers that will suffice. But what does a tokenizer do exactly? A tokenizer splits text into tokens (words or subwords), which are then converted to integer IDs via a lookup table (the vocabulary).
After testing several recommended tokenizers from the HuggingFace website, I found that flax-community/gpt-2-spanish is a great fit for our use case, as it was trained on the OSCAR dataset.
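To make the idea concrete, here is a small example of what tokenization looks like with the tokenizer we will load for real later in the article. The example sentence is arbitrary, and the exact IDs and subwords you get back depend on the tokenizer’s vocabulary, so treat the output as illustrative only.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt-2-spanish")

# Text -> token IDs (the exact IDs depend on the tokenizer's vocabulary)
ids = tokenizer.encode("Ay hombe, llegó el cacique")
print(ids)                                    # a list of integers, one per token
print(tokenizer.convert_ids_to_tokens(ids))   # the subword strings behind those IDs
print(tokenizer.decode(ids))                  # back to the original text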
So far we have scraped our lyrics, and talked about tokenization, but we haven’t taken a look at what a transformer is.
Transformers are a model architecture composed of an encoder-decoder pair that uses an attention mechanism to transform one sequence into another. The architecture was first introduced in the paper ‘Attention Is All You Need’. One of the main advantages of transformers is that they can be parallelized, instead of processing input sequentially like RNNs (LSTM, GRU, etc.), which helps reduce training times.
The attention mechanism mentioned above provides context for any position in the input sequence. If the input data is a natural-language sentence, as in this article, the transformer does not have to process one word at a time.
If you want to learn more about transformers, I’d suggest you take a look at Maxime’s What is a Transformer? and also Jay Alammar’s article on it.
Transformers are great for translation, sequence generation and sentiment analysis.
Let’s download the tokenizer and model for our text processing:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt-2-spanish")
model = AutoModelForCausalLM.from_pretrained("flax-community/gpt-2-spanish")
Now let’s load our text into our notebook and split it at every newline; a minimal sketch of this step follows.
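A minimal sketch, assuming the lyrics were saved to 'cancionesDiomedes.txt' in the scraping step (the filename is an assumption):

# Read the scraped lyrics and split them into individual lines,
# dropping empty lines along the way.
with open('cancionesDiomedes.txt', 'r') as f:
    lines = [line.strip() for line in f.read().split('\n') if line.strip()]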
After doing so, let’s split it into our train/test sets using scikit-learn’s train_test_split:
from sklearn.model_selection import train_test_split

train, test = train_test_split(lines, test_size=0.15)
print("Train dataset length: "+str(len(train)))
print("Test dataset length: "+ str(len(test)))
>>> Train dataset length: 4676
>>> Test dataset length: 826
Now let’s save our train and test sets into separate .txt files for later reference. We will also keep their respective paths for our model to consume:
with open('traincancionesDiomedes.txt', 'w') as f:
    for t in train:
        f.write(t)
        f.write(' ')

with open('testcancionesDiomedes.txt', 'w') as f:
    for t in test:
        f.write(t)
        f.write(' ')

train_path = 'traincancionesDiomedes.txt'
test_path = 'testcancionesDiomedes.txt'
Our transformer requires us to load the data through a data collator. Data collators are objects that form a batch from a list of dataset elements; inputs are dynamically padded to the length of the longest element in the batch if they are not all the same length.
from transformers import TextDataset, DataCollatorForLanguageModeling

def load_dataset(train_path, test_path, tokenizer):
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128)

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=128)

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)
When calling DataCollatorForLanguageModeling we use mlm=False (mlm stands for masked language modeling), because GPT-2 is a causal language model: it predicts the next token rather than filling in masked ones.
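If you want to sanity-check what the collator produces, an optional inspection might look like the sketch below; with mlm=False the labels are simply a copy of the input IDs, which is what a causal language model expects.

# Optional sanity check: collate two examples from the training set and look at
# the shapes of the resulting tensors. TextDataset yields fixed-length blocks of
# token IDs, and with mlm=False the collator copies input_ids into labels.
batch = data_collator([train_dataset[0], train_dataset[1]])
print({key: value.shape for key, value in batch.items()})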
Now let’s set up the model’s training arguments by specifying our output directory (output_dir), number of training epochs (num_train_epochs), batch size for training (per_device_train_batch_size), batch size for model evaluation (per_device_eval_batch_size), number of update steps between two evaluations (eval_steps), checkpoint for saving the model every n steps (save_steps), and number of warmup steps for the learning rate scheduler (warmup_steps).
from transformers import Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("flax-community/gpt-2-spanish")

training_args = TrainingArguments(
    output_dir="./gpt2-diomedes-2",
    overwrite_output_dir=True,
    num_train_epochs=300,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    eval_steps=100,
    save_steps=800,
    warmup_steps=500
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)
This part may take a little while because the model needs to be downloaded. Depending on which model you are using while following along, the download time may vary.
Now that we have set up our trainer and our data, let’s start training the model.
trainer.train()
Using Google Colab Pro’s standard GPU, it took about an hour to go through 300 epochs. Here are the results of the model training
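One step worth calling out before we test it: the pipeline below loads the fine-tuned weights from './gpt2-diomedes-2', so the trained model (and, for convenience, the tokenizer) should be saved to that directory first. A minimal sketch:

# Save the fine-tuned model (and tokenizer) so the text-generation pipeline
# can load them from the local directory used below.
trainer.save_model("./gpt2-diomedes-2")
tokenizer.save_pretrained("./gpt2-diomedes-2")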
Now let’s test how the model performs if we give it a one-word prompt to complete:
from transformers import pipeline
diomedes = pipeline('text-generation',model='./gpt2-diomedes-2', tokenizer='flax-community/gpt-2-spanish')
results = diomedes('Ay! ')[0]['generated_text']
>>> 'Ay! Soy partidario decidir maldecir Que me perdones todo lo que yo hago Que no te conocí en la playa porque no me gustaba su actitud Que ya te la voy a pasar lista andarán mis canciones que acaban pero'
The generated text actually sounds like a song! Let’s now take a look at how we can generate a full song!
Generating a song
If you’ve been following along, you now have successfully trained and fine-tuned your own GPT-2 model with your favorite artist. Let’s generate a song with their style.
From our dataset, let’s pick a few examples, take their first one or two words, and use them as prompts:
import random

prompts = []
for line in lines:
    prompts.append(' '.join(line.split(' ')[:2]))

random_prompts = random.sample(prompts, 20)
With that, we have a list of 20 randomly sampled two-word prompts, which we will use to generate a 20-line song.
Let’s now call the model and iterate through the prompts to generate our new song. We can also set a max_length parameter to limit the number of tokens the model generates. Let’s try a max_length between 20 and 30 tokens, randomly chosen with random.randint().
song = []
for line in random_prompts:
    song.append(diomedes(line, max_length=random.randint(20, 30))[0]['generated_text'])
Let’s see what our model generated
>>> ['con el paso del tiempo Que no crea que todo terminó Y ni se imaginan el mundo se va, yo era el',
'Y que los cumplas a todos gusto y alegremente deseando que pase un año más más lleno de proyectos que he',
'Qué triste sería no quererte a ti cerca Para ser un gran victoria También es un compositor que vive d',
'Al fin : Aleluas No sé por qué será, pero aléjense de la',
'Dime nada yo tengo el present. Y otra ves maana morirme Pero no tienes que temer, no tienes porque aclarar',
'Y al que encuentra es al difunto Con lo más sublime del sentimiento Dime que hizo mal presentándote con su',
...
'Yo soy un hombre solo Será en este lugar especial Quedar para ninguno Sí, y yo no',
'Y andabas pensando en mí dijo la malvade, a la luna que respiros (se escuchaban dulces quejas ']
It definitely needs some fine-tuning! We can also try generating several candidates per prompt and hand-picking the ones that make the most sense, as in the sketch below.
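A minimal sketch, using the pipeline’s num_return_sequences parameter to produce a few candidates per prompt (the number of candidates here is arbitrary):

# Generate several candidate lines per prompt and keep them all,
# so the best one can be hand-picked afterwards.
candidates = {}
for line in random_prompts:
    outputs = diomedes(line, max_length=30, num_return_sequences=3, do_sample=True)
    candidates[line] = [out['generated_text'] for out in outputs]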
The method we used for generating a new song has one flaw: we randomly picked twenty two-word prompts from 100+ songs, and not all of those songs are about the same topic, so the generated lines may each be about a different topic.
Closing remarks
Learning how to use HuggingFace and its models has definitely been fun, and I look forward to continuing to use them. If you are interested in learning how to use it, their website has a lot of great documentation and videos showcasing how to use their models and tokenizers.
Once again, thank you for taking the time to read this.
Disclaimer: Do not use this to generate your own songs to profit from. This is meant to be an experiment and a showcase of what Machine Learning can do. Please respect the work of the artists who spent hours, days, months, and even years writing these songs. This article is meant for educational purposes only.