Artificial Intelligence (AI) is transforming industries and businesses worldwide, revolutionizing everything from healthcare facilities to newsrooms. One of the key components of AI that is often overlooked is labeled data. It is the foundation of AI and machine learning (ML) algorithms, providing the necessary information for machines to learn, adapt, make decisions, and evolve. So, why is it overlooked, and how can we change it? Let’s take a closer look together!

What is labeled data?

Labeled data is simply data that has been tagged or classified with specific attributes or features. For example, if you want to train a machine learning algorithm to recognize images of cats and dogs, you need to label each image as either “cat” or “dog.” Humans often do this labeling manually, which is time-consuming and expensive. However, it is a necessary step in training AI models that can accurately recognize patterns and make predictions.
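To make the cat-and-dog example concrete, here is a minimal sketch of what a labeled data set can look like in code. The file names, the helper functions `find_invalid_labels` and `label_counts`, and the allowed label set are all invented for illustration; real projects usually store labels in CSV or JSON files or in an annotation tool.

```python
from collections import Counter

# Hypothetical file names and labels, invented for illustration only.
labeled_images = [
    ("img_001.jpg", "cat"),
    ("img_002.jpg", "dog"),
    ("img_003.jpg", "cat"),
    ("img_004.jpg", "dog"),
    ("img_005.jpg", "cat"),
]

# The only labels this (assumed) project accepts.
ALLOWED_LABELS = {"cat", "dog"}


def find_invalid_labels(examples):
    """Return every (file, label) pair whose label is not in ALLOWED_LABELS."""
    return [(name, label) for name, label in examples if label not in ALLOWED_LABELS]


def label_counts(examples):
    """Count how many examples carry each label."""
    return Counter(label for _, label in examples)


if __name__ == "__main__":
    print("Invalid labels:", find_invalid_labels(labeled_images))
    print("Examples per label:", dict(label_counts(labeled_images)))
```

Checks like these are a cheap way to catch typos in labels and to see whether one class is heavily over-represented before any training starts.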

Why is labeled data important?

Labeled data is essential for several reasons:

Accuracy and reliability:

The accuracy and reliability of an AI model depend on the quality and quantity of the labeled data used to train it. In general, the more correctly labeled data an AI model can access, the better it can learn and make accurate predictions. Carefully labeled data also helps ensure that the AI model is trained on relevant and meaningful examples, reducing the risk of biased or inaccurate predictions.

Efficient learning:

Labeled data is essential for efficient machine learning. AI models learn by analyzing patterns in the labeled data and making predictions based on those patterns. Without labeled data, a supervised AI model would have no way of knowing which patterns to look for or which predictions to make.
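As a rough illustration of how a model can use labels to make predictions, the sketch below implements a very simple technique, a 1-nearest-neighbor classifier: a new example receives the label of the most similar labeled example. The features (body weight and ear length), the numbers, and the function name `predict` are assumptions made up for this example, not data or methods from any real project.

```python
import math

# Hypothetical toy features: (body weight in kg, ear length in cm).
# All values are invented for illustration.
training_data = [
    ((4.0, 6.5), "cat"),
    ((3.5, 6.0), "cat"),
    ((5.0, 7.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((30.0, 12.5), "dog"),
    ((12.0, 10.0), "dog"),
]


def predict(features, examples):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    _, nearest_label = min(
        examples, key=lambda example: math.dist(features, example[0])
    )
    return nearest_label


if __name__ == "__main__":
    print(predict((4.2, 6.4), training_data))
    print(predict((25.0, 12.0), training_data))
```

Without the "cat" and "dog" labels attached to the training examples, `predict` would have nothing to return, which is the point the paragraph above makes.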


Flexibility:

Labeled data is also essential for building flexible AI models that adapt to new situations and scenarios. An AI model trained on labeled data can learn from past examples and make informed decisions when faced with new data.


Cost-effectiveness:

While labeling data can be time-consuming and expensive, it does not always have to be done from scratch. Using existing labeled data sets can save significant time and resources, allowing businesses to focus on improving and refining their AI models instead of collecting and labeling new data themselves.

Why is the outsourcing of data labeling controversial?

Labeling data is a labor-intensive and time-consuming task that requires a significant amount of human effort and expertise. As a result, many businesses choose to outsource data labeling tasks to third-party companies or crowdsourcing platforms. This can effectively reduce costs and increase efficiency, as businesses can access a large pool of workers who can quickly and accurately label data.

However, outsourcing data labeling often raises ethical concerns, particularly if workers are underpaid or exposed to violent content. Reporters at TIME magazine found that OpenAI, the company behind ChatGPT, had outsourced not only the labeling but also the responsibility that comes with it. Many workers who perform data labeling tasks live in developing countries where labor is cheap. According to the report, many of them received meager wages for their work, less than $2 an hour. This has led to concerns about worker exploitation and unfair labor practices in the AI and machine learning communities.

Additionally, there have been cases where data labeling tasks involved sensitive or disturbing content, such as hate speech or pornography. Exposure to such content can have negative psychological effects on workers. That is why it is crucial to call for better working conditions and greater protections for the people who perform data labeling tasks.

Overall, while outsourcing data labeling can be a cost-effective and efficient way to train AI models, it is essential that workers are treated fairly and that their rights and well-being are protected. By ensuring that data labeling tasks are performed ethically and responsibly, innovators can build more reliable and trustworthy AI models that benefit everyone.

What is SELMA’s approach?

Labeled data is a critical component of training AI models. As AI continues to transform industries and change how we live and work, the importance of fairly and diversely labeled data cannot be overstated. It supports accuracy and reliability while allowing the AI model to learn efficiently and adapt to new scenarios. Automatically labeled data for archival storage could be a cost-effective solution that saves time and resources.

Deutsche Welle (DW), one of the consortium partners in the SELMA project, is one of the best-known German broadcasters worldwide. It broadcasts in 32 languages, mainly for an international audience. DW produces and airs an immense amount of content every day, which it stores in its archives. As a media partner, DW provides content so that SELMA can explore the possibility of using already labeled data from media output to train the AI.

Although the approach is promising, there are still many hurdles to overcome. Stay tuned and follow SELMA to stay up to date on the development of solutions for labeled data issues and other AI-related topics!