Fine-tuning is an important element of any AI practitioner's arsenal. It allows you to give the model more context about what you want it to produce, which is often needed to get higher-quality results and reinforce a consistent pattern. Remember, most of what these models do is pattern matching. We teach one a good pattern and it looks after us for the long run. In this blog post I want to go over how to train an AI model with fine-tuning and exactly what that process looks like. We'll be focusing on creating a rephrasing model from scratch.

Why Rephrasing is Important

Rephrasing is important because it lets you express the same text in different ways. You might want to refresh an old article and present it differently, or just explore different variations of your content. There are a lot of reasons why people want rephrasers, and we're busy implementing a highlight-based rephraser in the document view of the Content Villain web app. Rephrasing to a high standard can be difficult, however. Robotic "spinner" tools will swap words in your text out for similar synonyms. The issue is that they don't look at the structure of the content, so the result can be unreadable and will require heavy editing.

AI has the benefit of understanding the structure of the content and rewriting it in a different way that still makes sense grammatically. There are many toggles and switches on AI models that can affect the ability to rephrase content, but the most important thing is often overlooked, and that is data. You need a decent dataset to work from to get a decent rephraser.

Let's Talk Data

One of the things we go over in more detail in the Content Villain documentation is the difference between prompt engineering and fine-tuning. For a model where you want to rephrase content, fine-tuning is the way to go. It would be very difficult to provide enough examples in a prompt for an AI to follow a pattern successfully on any rephrasing task. Let's think about this problem. If we give the AI a few examples of sentence pairs showing the original and what we'd expect it to output, it would get a limited understanding of what we're trying to do. If we gave it a few hundred, it gets better; a few thousand, even better; tens of thousands, now we're talking!

Who has time to go through and make such a dataset though? This is one of the major problems. To create a dataset of such quality would take a lot of hours and effort. Fortunately, some datasets exist publicly online and we can extract what we need from them to build out our model. For a rephrasing model, we decided to use the Tapaco dataset. We filtered for English outputs only and created a CSV file with all of the sentence pairings. This gives us over 73,000 lines of data to work with. Awesome!
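As a rough sketch, the filtering and pairing step could look something like this in Python. The field names here (`paraphrase_set_id`, `paraphrase`) follow the Tapaco release on the Hugging Face Hub; treat them as an assumption if you're working from a different copy of the dataset.

```python
import csv
from collections import defaultdict

def build_pairs(rows):
    """Group Tapaco rows by paraphrase set, then pair up consecutive
    sentences within each set as (original, rephrased) examples."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["paraphrase_set_id"]].append(row["paraphrase"])
    pairs = []
    for sentences in groups.values():
        # Each consecutive pair within a paraphrase set is one example.
        pairs.extend(zip(sentences, sentences[1:]))
    return pairs

def write_csv(pairs, path):
    """Write the sentence pairings out as a two-column CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["original", "rephrased"])
        writer.writerows(pairs)

# A few rows in the assumed shape of the English split:
sample = [
    {"paraphrase_set_id": "1", "paraphrase": "Tom was grieving."},
    {"paraphrase_set_id": "1", "paraphrase": "Tom was in mourning."},
    {"paraphrase_set_id": "2", "paraphrase": "Isn't this blue?"},
    {"paraphrase_set_id": "2", "paraphrase": "Isn't that blue?"},
]
pairs = build_pairs(sample)
```

Run over the full English split, this yields the 73,000-plus pairings mentioned above.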

Compiling a Model

Now that we have the dataset and know what we want to do, we can create a model, but first we need to put the data into a format we can use. We still need to build a prompt that tells the AI what we want it to do. This is very simple: we just turn our dataset into lines like the following;

Another way of saying 'I'll go with you there.' is to say 'I'll accompany you there.'

In this example you can see the original sentence followed by the rephrased version we want the model to output. Formatting all 73,000-plus lines of data this way can be done in any spreadsheet software with the help of the CONCATENATE formula. The final step is to give the dataset a break sequence so that the AI knows where one example ends and the next begins. For this, we just use a #.

Another way of saying 'Tom was grieving.' is to say 'Tom was in mourning.'
#
Another way of saying 'Isn't this blue?' is to say 'Isn't that blue?'
#
Another way of saying 'I'll go with you there.' is to say 'I'll accompany you there.'
#
Another way of saying 'Tom gave Mary advice.' is to say 'Tom gave Mary some advice.'
#

You'll end up with something like that, repeated 73,000-plus times.
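If you'd rather do the CONCATENATE step in code than in a spreadsheet, a minimal sketch looks like this:

```python
def to_training_text(pairs):
    """Turn (original, rephrased) pairs into the training format,
    with the '#' break sequence separating every example."""
    lines = []
    for original, rephrased in pairs:
        lines.append(f"Another way of saying '{original}' is to say '{rephrased}'")
        lines.append("#")
    return "\n".join(lines)

text = to_training_text([
    ("Tom was grieving.", "Tom was in mourning."),
    ("I'll go with you there.", "I'll accompany you there."),
])
```

Writing `text` to a .txt file gives you the training document in one go.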

The Actual Fine-tuning

Now that we have everything prepared, we turn to actually fine-tuning the model. For this we need to decide which NLP provider we want to use. Fortunately, most NLP providers offer fine-tuning now. It is often a case of uploading the document in the right format, be that .jsonl or .txt, or sending it directly to the team. In this case, we have the .txt file, which is enough. We upload the file, give the separator and hit go. The actual process of fine-tuning can take a few hours depending on the size of your dataset and the size of the original model. It'll run in the background, so crack on with some other tasks whilst you wait.
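If your provider wants .jsonl rather than plain .txt, a common shape (assumed here; the exact schema varies by provider, so check their docs) is one prompt/completion object per line:

```python
import json

def to_jsonl(pairs):
    """Convert sentence pairs to JSON Lines, one prompt/completion
    record per example. The key names follow a common provider
    convention and are an assumption, not a universal standard."""
    records = []
    for original, rephrased in pairs:
        records.append(json.dumps({
            # The prompt stops mid-quote so the model learns to complete it.
            "prompt": f"Another way of saying '{original}' is to say '",
            # The completion closes the quote and ends on our break sequence.
            "completion": f"{rephrased}'\n#",
        }))
    return "\n".join(records)

jsonl = to_jsonl([("Tom gave Mary advice.", "Tom gave Mary some advice.")])
```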

When it is eventually done, you will often see a new endpoint that you can call to run the model. The advantage of fine-tuning over prompt engineering is that we no longer have to provide much context about what we want, as the model will already know. If we were to prompt-engineer a rephraser, we'd want to provide maybe 10 or 20 examples, which would come to roughly 2,000 characters in the prompt, so every time you run that model you are charged for those 2,000 characters plus the output characters. For our fine-tune, we are charged only for the following;

Another way of saying 'What kind of theme are you interested in?' is to say '

There are 77 characters in the above, so we have cut down on a prompt-engineered model significantly, giving us a model which is better quality and costs less to run. Better for us running our business and better for you, our users, as it provides higher quality at a lower price. Of course, the output is also added to this prompt, which would bring the total to around 120 characters, but you can see how this is considerably less.
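You can sanity-check the arithmetic yourself. The helper below is just for illustration; the billing formula (prompt characters plus completion characters) is the assumption described above.

```python
def billed_characters(original, completion):
    """Characters billed per call for the fine-tuned model:
    the short templated prompt plus the generated completion."""
    prompt = f"Another way of saying '{original}' is to say '"
    return len(prompt) + len(completion)

# The exact prompt from the example above is 77 characters long.
prompt = "Another way of saying 'What kind of theme are you interested in?' is to say '"
assert len(prompt) == 77
```

Compare that with roughly 2,000 prompt characters billed on every single call to a prompt-engineered version.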

Playing with Wildness

Now that we have the model loaded, it isn't just set and forget. We can still play with toggles including temperature, frequency and presence penalties, top-p, top-k and other settings. These can affect the quality of the rephrasing, but it is always good to give end users the choice and to educate them about the powers of AI. This is something we always try to do at Content Villain, as we believe a deeper understanding of this technology is great for everyone involved.
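As a sketch of what exposing those toggles to a caller might look like, here is a hypothetical request builder. The parameter names mirror common sampling settings, but the payload shape and defaults are assumptions, not any particular provider's API:

```python
def build_request(original, temperature=0.7, top_p=1.0, top_k=0,
                  frequency_penalty=0.2, presence_penalty=0.0):
    """Assemble a payload for a hypothetical fine-tuned rephrasing
    endpoint, surfacing the sampling knobs to the end user."""
    return {
        "prompt": f"Another way of saying '{original}' is to say '",
        "temperature": temperature,              # higher = wilder rewrites
        "top_p": top_p,                          # nucleus sampling cutoff
        "top_k": top_k,                          # restrict to k likeliest tokens
        "frequency_penalty": frequency_penalty,  # discourage repeated words
        "presence_penalty": presence_penalty,    # discourage reused topics
        "stop": "#",                             # the break sequence from training
    }

payload = build_request("Tom was grieving.", temperature=0.9)
```

Using the training-time break sequence as the stop token is what keeps the model from rambling on into a second example.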

We'll be making this model available both within Content Villain and through the API endpoint. We'd love for you to check it out and let us know your thoughts on using it. I hope this blog post has been useful in giving you some ideas of how fine-tuning works, what putting a model together looks like, and how to train an AI model with fine-tuning. The benefits are significant: better quality and more zoned-in generations for your needs. If you are looking for a premium AI content generator that offers great models and custom solutions for its customers, we'd love to hear from you! Sign up for a plan or get in touch with our team today.