Integration with Hugging Face Model Hub and more

September 2022 Platform News

September 19, 2022

Even though annotating a dataset may seem simple, it is not. There are many factors to consider at each step of the process, such as the selection of the dataset, the distribution of tasks and the evaluation of the result. Good annotators need to watch for small details that could degrade the quality of the result.

At M47AI we know that, which is why we spent this summer improving our annotation platform. We focused on many aspects of it: deeper insights into dataset quality, a label distribution summary of the annotated tasks, integration of Hugging Face models, and the ability to split tasks among the annotation team.

And now, after months of hard work, we are proud to present the latest changes to the platform! 🚀

Integration with Hugging Face Model Hub for model-assisted labeling


We are now integrated with the Hugging Face Model Hub, and it is very easy to use a model hosted there for model-assisted labeling. You only have to set the name of the model, for example microsoft/deberta-base, and we'll use the Hugging Face Inference API to provide annotators with suggestions while they label.
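For readers curious about what such a call looks like, here is a minimal sketch of querying the public Hugging Face Inference API. The token and example text are placeholders, this is not the exact code the platform runs, and in practice you would point at a model fine-tuned for your labeling task (microsoft/deberta-base is just the naming example from above):

```python
import requests

# Hugging Face Inference API endpoint for a hosted model.
# In practice, use a model fine-tuned for your task (classification, NER, ...).
MODEL_ID = "microsoft/deberta-base"
API_URL = f"https://api-inference.huggingface.co/models/{MODEL_ID}"
HEADERS = {"Authorization": "Bearer <your-hf-api-token>"}  # placeholder token


def suggest_labels(text: str):
    """Query the hosted model and return its raw predictions as suggestions."""
    response = requests.post(API_URL, headers=HEADERS, json={"inputs": text})
    response.raise_for_status()
    return response.json()


# Example: ask the model for a suggestion on a sentence to be annotated.
print(suggest_labels("The new platform release was announced in September."))
```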

Bring Your Own Model

What if you want to assist the annotators with your own model, one that is not hosted anywhere public? Then bring it to the M47AI Platform as a private model and use it!

All you need is the URL where the model is located, the language the model supports and the task type you want the model for.

We offer different options for providing the model: the URL of a repository, such as GitHub, or a URL pointing to a zipped file containing the model. The model will be deployed into our system and made available for tagging datasets, just like any of the models already in the platform.
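As a purely illustrative sketch, the information a private model registration boils down to is the three fields mentioned above; the field names and values below are hypothetical, not the platform's actual schema:

```python
# Hypothetical registration payload for a private model.
# Field names and values are illustrative only, not the platform's real API.
private_model = {
    "model_url": "https://github.com/acme/my-ner-model",  # repo URL or link to a zipped model
    "language": "en",                                      # language the model supports
    "task_type": "token_classification",                   # task you want the model for
}

print(private_model)
```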

New Dataset Quality Tests

SEMANTIC AND GRAMMATICAL SIMILARITY

Sometimes a model may require a dataset with similar tasks; in other scenarios, we may prefer a dataset with more semantic or grammatical variety. It depends on the project: a dataset that fits one use case may not serve another as well.

The new similarity-focused quality checks analyze the variability between tasks. We implemented two: a semantic check, which finds sentences that are similar in meaning but not necessarily in the tokens they contain, and a grammatical check, which matches sentences with equivalent wording within the dataset.
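To give a feel for the idea, here is a rough sketch of how such checks can be approximated with off-the-shelf tools: sentence embeddings for the semantic side and simple token overlap for the grammatical side. The model name and thresholds are arbitrary choices for the sketch, not what the platform uses internally:

```python
import re
from itertools import combinations

from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

sentences = [
    "The delivery arrived two days late.",
    "My package showed up two days after the promised date.",
    "The delivery arrived two days late!",
]

# Semantic check: flag pairs whose embeddings are close in meaning,
# even when they share few tokens.
model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary small model for this sketch
embeddings = model.encode(sentences, convert_to_tensor=True)
for i, j in combinations(range(len(sentences)), 2):
    score = util.cos_sim(embeddings[i], embeddings[j]).item()
    if score > 0.8:  # arbitrary threshold
        print(f"Semantically similar ({score:.2f}): {sentences[i]!r} ~ {sentences[j]!r}")


# Grammatical check: flag pairs that share most of their tokens.
def token_overlap(a: str, b: str) -> float:
    ta, tb = set(re.findall(r"\w+", a.lower())), set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb)


for i, j in combinations(range(len(sentences)), 2):
    if token_overlap(sentences[i], sentences[j]) > 0.8:  # arbitrary threshold
        print(f"Equivalent wording: {sentences[i]!r} ~ {sentences[j]!r}")
```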


LANGUAGE DETECTION


The other check we include is an analysis of the languages present in the dataset. There are many ways to create a dataset, some of them automated, and evaluating the language of the collected sentences is one of the quickest and most effective ways to know whether it will serve our purpose. The report shows any languages that differ from the expected language of the dataset, making it easier to detect errors.


If we want to know more about how those languages are distributed, we can go into the report details. There we can see a column with the confidence percentage, the detected language and the text of the task.
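The underlying idea can be sketched in a few lines; the library choice (langdetect) and the sample texts below are ours for illustration, not necessarily what the platform uses:

```python
from langdetect import detect_langs  # pip install langdetect

expected_language = "en"
tasks = [
    "The support team answered quickly and solved my issue.",
    "El equipo de soporte respondió rápido y resolvió mi problema.",
]

# Flag tasks whose most likely language differs from the expected one,
# reporting confidence, detected language and text, as in the report details.
for text in tasks:
    best = detect_langs(text)[0]  # most probable language with its probability
    if best.lang != expected_language:
        print(f"{best.prob:.0%}  {best.lang}  {text}")
```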

Work Distribution

Sometimes a project does not need every task annotated by every annotator. Instead, it can be more productive to split the set into smaller chunks of tasks and assign each chunk to an annotator. Now, with the new work distribution system, that is possible!

In the Annotators tab of Project Settings, the new Work Distribution section allows you to assign every annotator to all tasks, or to make a custom distribution. If you choose the custom distribution, a new form appears where you can select either the task intervals or the number of tasks for each annotator.
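Conceptually, a custom distribution splits a set of tasks into consecutive intervals, one per annotator. The helper below is a simple illustrative sketch of that idea, not the platform's implementation:

```python
def split_tasks(task_ids, annotators):
    """Assign consecutive, roughly equal-sized intervals of tasks to each annotator."""
    chunk, remainder = divmod(len(task_ids), len(annotators))
    assignments, start = {}, 0
    for i, annotator in enumerate(annotators):
        end = start + chunk + (1 if i < remainder else 0)
        assignments[annotator] = task_ids[start:end]
        start = end
    return assignments


# Example: 10 tasks split among 3 annotators -> intervals of 4, 3 and 3 tasks.
print(split_tasks(list(range(1, 11)), ["ana", "bob", "carla"]))
```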

Labels Distribution

Last but not least, we would like to talk about the label distribution of a project. While dataset insights help us determine whether a dataset fits our needs, label distribution insights are useful for checking whether the annotations are adequate.

We can find the Labels distribution of a project in the Project Settings tab. This output shows which tags were chosen, how many there are and how they are distributed across the dataset. Available for text classification and token classification projects, such as Sentiment Analysis or Named-Entity Recognition, it shows the volume of each tag, that is, the number of times it is used, and the number of tasks with at least one of those tags.
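As a toy sketch of what such a summary computes for a text classification project, the snippet below counts how often each tag is used and how many tasks carry at least one tag; the data is made up for illustration:

```python
from collections import Counter

# Made-up annotations: each task maps to the tags applied to it.
annotations = {
    "task-1": ["positive"],
    "task-2": ["negative", "negative"],
    "task-3": [],
    "task-4": ["positive", "neutral"],
}

# Volume of each tag: how many times it was used across all tasks.
tag_volume = Counter(tag for tags in annotations.values() for tag in tags)

# Number of tasks with at least one tag.
tagged_tasks = sum(1 for tags in annotations.values() if tags)

print(tag_volume)    # Counter({'positive': 2, 'negative': 2, 'neutral': 1})
print(tagged_tasks)  # 3
```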


And that’s all for now. Soon you will hear more about the improvements we are working on!
