Natural Language Processing, or NLP, is here to stay. Thanks to the predictive models of Machine Learning, or ML, on which NLP is based, we can automate tasks such as classifying incidents, translating texts, detecting the sentiment in a phone call, answering questions or even writing software.
One of the most promising techniques is the neural network. A neural network is a system of artificial neurons that, given an input, returns an output. Neural networks are, in very simplified terms, statistical models: given a certain input, the model selects the most probable output, which is the prediction offered in response.
An NLP model requires a huge amount of data to offer acceptable predictions. The GPT-3 model has around 175 billion parameters and was trained on an initial dataset of 45 TB of compressed text. Other architectures, like BERT, offer us generic pre-trained models, which we can fine-tune to specialize in a particular task.
Pre-trained models are a great starting point for projects based on deep learning and artificial intelligence. They offer us models that understand the basic fundamentals of the language. However, these models are prepared to perform general tasks, not to solve the specific tasks of each business.
This is where model training comes in. The word training is not chosen at random. A model trains not with weights, martial arts or ever-longer runs, but through examples. Since models are not competitive athletes but statistical models, the training they can be subjected to is of two types, supervised or unsupervised, depending on the task for which the model is being prepared. In other words, a model will not give its best if it has not previously been trained for the scenario in which it is to be deployed.
Using an ML model is not a panacea that automatically solves problems. An ML model in production may not meet the needs of a company. One possible reason is that the company uses pre-trained models but does not train them specifically for its business. If we do not train a model, we are not taking advantage of it: we could use a simpler solution, such as an arbitrary selection of an answer, and the result would be the same.
In this article we will look at how important fine-tuning a pre-trained model is for a specific task. Our task is the classification of texts, and for that we will use a pre-trained model plus four versions of this same model, each trained with a small dataset of a different size, from 120 to just over 2,000 examples. We will compare these five versions of the model with the accuracy of an arbitrary label choice and confirm, or not, whether training a model is worth it for this use case.
Explaining our use case

As we mentioned previously, we want to demonstrate how important it is to train a model, and to do so we will compare the performance of a pre-trained model with that of the same model trained with different numbers of examples.
To do this, we are going to train a model that, given a news headline, tells us which IPTC category it belongs to. IPTC stands for International Press Telecommunications Council, and the list of categories is as follows: arts, culture, entertainment and media; conflict, war and peace; crime, law and justice; disaster, accident and emergency incident; economy, business and finance; education; environment; health; human interest; labour; lifestyle and leisure; politics; religion and belief; science and technology; society; sport; and weather.
In our experiment we will use:
· a pre-trained BERT-type model in Italian, dbmdz/bert-base-italian-xxl-uncased (see the loading sketch after this list),
· an Italian training dataset of phrases labelled following the IPTC taxonomy, consisting of 2,199 phrases, and
· an evaluation dataset, also in Italian, with headlines from the Italian newspaper La Repubblica labelled following the IPTC taxonomy, made up of 6 utterances from each of the 17 categories.
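As an orientation, this is roughly how such a model could be loaded with the Hugging Face transformers library. The label list below is the IPTC taxonomy quoted above; the variable names are our own and the snippet is only a sketch, not the exact code used in the experiment.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The 17 IPTC top-level categories used as classification labels.
IPTC_LABELS = [
    "arts, culture, entertainment and media", "conflict, war and peace",
    "crime, law and justice", "disaster, accident and emergency incident",
    "economy, business and finance", "education", "environment", "health",
    "human interest", "labour", "lifestyle and leisure", "politics",
    "religion and belief", "science and technology", "society", "sport",
    "weather",
]

MODEL_NAME = "dbmdz/bert-base-italian-xxl-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# A 17-way classification head is added on top of the pre-trained encoder;
# its weights are randomly initialised until the model is fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(IPTC_LABELS),
    id2label=dict(enumerate(IPTC_LABELS)),
    label2id={label: i for i, label in enumerate(IPTC_LABELS)},
)
```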
We will do five iterations of the test:

· The first iteration consists of evaluating the pre-trained model with the test dataset,
· the second iteration is the evaluation of a model trained with 120 utterances, chosen at random from the training dataset (sketched below),
· the third iteration is the same as above but with 600 randomly chosen phrases,
· in the fourth iteration we will do the evaluation with a model trained with 1000 utterances,
· in the fifth iteration we will use the entire dataset to train the model.
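The random subsets could be drawn along these lines. This is a minimal sketch that assumes the training data sits in a hypothetical iptc_train.csv file with "text" and "label" columns (the label column holding integer ids from 0 to 16); the file name, column names and seed are our own assumptions.

```python
from datasets import load_dataset

# Hypothetical file: the 2,199 labelled training phrases as a CSV
# with "text" and "label" (integer id 0-16) columns.
train_data = load_dataset("csv", data_files={"train": "iptc_train.csv"})["train"]

# Random subsets used to fine-tune the four models (the last one is the full dataset).
subsets = {
    size: train_data.shuffle(seed=42).select(range(size))
    for size in (120, 600, 1000, len(train_data))
}
```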
Each model has been trained with the indicated number of phrases, for 6 epochs, or repetitions of the training.
To see the efficiency, we will look at the accuracy of each of the models and at the evaluation loss, which tells us how well the model fits the validation dataset. For this evaluation we will use the Hugging Face Trainer class.
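Below is a sketch of how the fine-tuning and evaluation could be wired up with the Trainer class. The tokenization settings, batch size, test file name and metric function are illustrative assumptions on our part; only the 6 epochs are taken from the setup described above, and the model, tokenizer and subsets come from the previous sketches.

```python
import numpy as np
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

# Hypothetical file: the 102 La Repubblica headlines with integer label ids (0-16).
test_data = load_dataset("csv", data_files={"test": "iptc_test.csv"})["test"]

def tokenize(batch):
    # Turn raw headlines into BERT input ids and attention masks.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

def compute_metrics(eval_pred):
    # Accuracy: fraction of headlines whose predicted label matches the gold label.
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

# Example: the 600-utterance model.
train_dataset = subsets[600].map(tokenize, batched=True)
eval_dataset = test_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="iptc-bert-600",
        num_train_epochs=6,              # the 6 epochs mentioned above
        per_device_train_batch_size=16,  # assumed batch size
    ),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())  # reports eval_loss and eval_accuracy
```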
Our initial accuracy reference is an arbitrary selection of the label. We must bear in mind that if we decided to ignore the models and always return a label chosen at random, we would have an accuracy of 100 / 17 = 5.88%. Any value below this number would indicate that the model performs worse than if we dispensed with it and chose the label at random. If the accuracy is equal to 5.88%, we would be using a whole ML model to obtain a result equivalent to a simple random choice.
Brief analysis of the training dataset

The IPTC dataset that we have used as a base is made up of 2,199 headlines, distributed among 17 labels, with an average of 133 headlines per label and a standard deviation of 59.45.
If we look at the sum and distribution percentages by category, we see that the categories with the most examples are sport, with 236; politics, 227; and economy/business and finance, with 207. The categories with fewer examples are weather, with 68 examples; science and technology, with 66 examples, and human interest, also with 66 examples. The percentages have been rounded to two digits.
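The figures above (counts per label, mean, standard deviation and rounded percentages) can be reproduced with a few lines of pandas. This sketch reuses the train_data object and the IPTC_LABELS list from the previous sketches, so it rests on the same hypothetical file layout.

```python
import pandas as pd

# Counts per category, mapping integer label ids back to category names.
df = train_data.to_pandas()
counts = df["label"].map(dict(enumerate(IPTC_LABELS))).value_counts()

print(counts)                                   # e.g. sport 236 ... human interest 66
print(round(counts.mean(), 2))                  # average headlines per label, ~133
print(round(counts.std(), 2))                   # standard deviation, ~59.45
print((counts / counts.sum() * 100).round(2))   # percentage per label, rounded to two digits
```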
As we can see in Figure 1, the dataset has an irregular distribution in the examples assigned to each of the labels. The human interest label is the one with the fewest examples, with 66, while sport is the category with the most examples, with 236.
Evaluation of the models
Starting from the previously mentioned models, that is, the base pre-trained model and the four models each trained with a different number of utterances from the IPTC training dataset, we are going to evaluate their accuracy against a test dataset of 6 utterances per label times 17 labels = 102 phrases.
We will show the accuracy as a percentage, while the evaluation loss is a number for which higher values mean worse performance. We will also include the confusion matrix for each of the models.
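The confusion matrices can be obtained from the Trainer predictions, for example with scikit-learn. This sketch reuses the trainer, eval_dataset and IPTC_LABELS objects from the previous sketches.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted label ids for the 102 test headlines.
output = trainer.predict(eval_dataset)
pred_ids = np.argmax(output.predictions, axis=-1)

# Rows are the true IPTC categories, columns the predicted ones.
matrix = confusion_matrix(output.label_ids, pred_ids, labels=list(range(len(IPTC_LABELS))))
print(matrix)
```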
Our initial baseline for whether training makes sense is 5.88%, which we calculated previously. It corresponds to 100 / 17, that is, what percentage of success we would obtain if we arbitrarily assigned a label to the example utterance.
Evaluation of the pre-trained model, without training
We start by evaluating the pre-trained model, without applying any training to prepare it for the classification task. The model, dbmdz/bert-base-italian-xxl-uncased, has obtained an accuracy of 5.88%, exactly equal to the value obtained if we arbitrarily assign a phrase to a label.
If we look at the confusion matrix, the model has assigned the vast majority of utterances to the labour category. This is consistent with the accuracy of 5.88%: concentrating nearly everything in a single category yields the same hit rate as assigning a category at random.
Evaluation of the fine-tuned model with 120 utterances
The dbmdz/bert-base-italian-xxl-uncased model improves if we train it with 120 utterances chosen at random from the dataset. Accuracy rises to 13.73%, with an evaluation loss of 2.785.
The confusion matrix shows how the predictions have mostly gone to the categories sport, conflict/war and peace, economy/business and finance and disaster/accident and emergency incident. Of these categories, sport was correct 4 out of 6 times, economy/business and finance 4 out of 6, disaster/accident and emergency incident 3 out of 6 and, finally, conflict/war and peace 2 out of 6.
Evaluation of the fine-tuned model with 600 utterances
This model has obtained an accuracy of 54.90%, with an evaluation loss of 1.672. The improvement is significant compared to the previous model: we have managed to get more than half of the evaluation examples correct, with a noticeably lower loss than the previously reviewed 120-utterance model.
The confusion matrix shows us how the model has improved. We can see how the arts/culture/entertainment and media and health categories have got all the examples right, and sport, society, labour and environment have got 5 out of 6 right.
Evaluation of the fine-tuned model with 1000 utterances
The next evaluated model has been fine-tuned with 1000 utterances, and we see that the accuracy is 50.99%, with an evaluation loss of 1.623. The accuracy is slightly lower than the model fine-tuned with 600 utterances, and the loss value is similar.
Looking at the confusion matrix, we again find two categories that have achieved a full score, arts/culture/entertainment and media and health, and four categories that have achieved 5 out of 6: economy/business and finance, environment, politics and society. Interestingly, we see that human interest has been tagged 5 times as society and once as environment.
Evaluating the fine-tuned model with our full dataset

The accuracy of the model trained with the 2,199 sentences has been 70.59%, much higher than the value obtained by all the previous models, with a loss of 1.279. In other words, the model has got almost three out of every four examples right.
Regarding the results by category, we see in the confusion matrix that all the examples of arts/culture/entertainment and media, politics and sport were correct. The model got 5 out of 6 right in the cases of disaster/accident and emergency incident, economy/business and finance, education, health, labour and weather. Curiously, the human interest category does not have any hits: 4 examples went to arts/culture/entertainment and media, 1 to crime/law and justice, and another to education.
Analysis
To analyze the results, let’s first compare the accuracy improvements. Then, we will check how the models have labelled the La Repubblica test dataset examples.
Accuracy Summary Table
In Table 2 we can see how the accuracy of the untrained model is very low: it is barely 5.88% correct, exactly the expected value if the assignment were arbitrary. This result improves to 13.73% with a minimal training of 120 sentences, chosen at random from the entire dataset, although this figure alone does not tell us whether the correct predictions are spread across the labels in proportion to their distribution in the dataset. If we train with 600 sentences the improvement is considerable, because the model is correct 54.90% of the time. With 1000 sentences the result is slightly lower, with 50.99% correct answers and an equivalent evaluation loss. As soon as we use the full dataset of 2,199 sentences, the accuracy goes up to 70.59% and the evaluation loss goes down to 1.279.
Summary table of hits
The confusion matrices show quite different results depending on the analyzed category. In Table 3 we have collected the results of the matrices and written in bold the results equal to 5 or 6 hits, which we consider satisfactory.
First, we clearly see that the more fine-tuning, the better the results. The base model, without fine-tuning, has 1 satisfactory result but no other hits. The model fine-tuned with 120 phrases has 0 satisfactory results. The models fine-tuned with 600 and 1000 sentences have each given us 6 satisfactory results, while the model fine-tuned with the full dataset achieved 9.
If we look at the categories that have not received any satisfactory result in any training, marked in red in the table, we see that, with the exception of crime/law and justice, they all account for less than 5% of the dataset, and that they can be considered categories that are difficult to distinguish from one another. We would have to look at the training datasets in more detail, but these categories are likely to be ambiguous, and more examples would be needed to differentiate them from the others.
We also see that there is no clear relationship between the percentage of the dataset and the number of hits. For example, the health category, with 3.27% of the dataset, is classified satisfactorily from the 600-example model onwards, while the human interest category, with a similar share of the dataset, 3.00%, barely receives any hits.
Conclusion
In a text classification task, the pre-trained model we have used is no more efficient than an arbitrary label assignment. Pre-trained models are not prepared to solve specific tasks. It is necessary to fine-tune them, and to check that the dataset is balanced, in order to obtain a satisfactory result in all categories and so that the model can help us in pre-annotating a larger dataset.
A model, on its own, needs some help to be able to perform a specific task, such as text classification. Applied to a specific task without fine-tuning, it gives a result equal to the one we would get from an arbitrary assignment. An annotated dataset, even a small one, and fine-tuning are required for it to deliver results. With a minimal dataset of 120 sentences we have obtained a small improvement; however, the result of the training begins to be interesting with a dataset of 600 sentences.
We need to do more tests to see how many examples are required for continuous improvement of the model's accuracy. We get similar results with 600 and 1000 examples. With the 2,199 utterances we reach an accuracy of 70.59%: almost three out of four sentences are correctly labelled. Such a percentage would allow us to automate a labelling task, reducing the work to reviewing that the labels offered by the model are correct.
We also found that there are labels that, because they have few examples and because of their own ambiguity, are difficult for a model to get right. The accuracy of the model could be improved with an extra dataset containing more examples of the worst-performing categories.
In conclusion, if we want a model to be able to help us in the task of annotating a dataset, we first need to provide a small dataset with which to train it. Later on, some iterations will be necessary to improve the training of the categories with less accuracy, until the model is able to accurately predict almost an entire dataset.