Friday, June 6, 2025

AI / ML ---> Musings on Data Quality

This post focuses on Data Quality, which, as we saw in the previous post, is one of the critical components of an AI/ML model.

Even in its nascent stage, Machine Learning was considered an intersection of Data Science and Software Engineering. As Machine Learning matured and paved the way for Artificial Intelligence, the importance of Data Science has only increased. In fact, for Large Language Models (LLMs) and Generative AI technologies, Data Science has become more crucial than ever before.

With the efficiency brought to building AI models these days - with deep neural networks and well-tested scripts and functions - it is quite an irony that Data Quality continues to be a challenge.

Let's look at the various aspects that impact Data Quality.

First and foremost, there are a few foundational challenges that data scientists need to take enormous care over before processing the data. A key challenge is to ensure that the data is fairly representative of real-world scenarios. When, deliberately or inadvertently, the dataset does not reflect the real-life data that users will eventually encounter in the production environment, this is referred to as Sampling Bias. We will deal with the concept of data drift separately (which could be due to the time lag between development and deployment), but the major reason here is skewness or sampling errors.
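One simple way to surface sampling bias is to compare the class distribution of the training sample against a reference sample of production-like data. The following is a minimal sketch; the class names and numbers are hypothetical.

```python
from collections import Counter

def class_proportions(labels):
    """Return each class's share of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

def sampling_skew(train_labels, production_labels):
    """Absolute gap in class share between training and production data."""
    train = class_proportions(train_labels)
    prod = class_proportions(production_labels)
    classes = set(train) | set(prod)
    return {cls: abs(train.get(cls, 0.0) - prod.get(cls, 0.0)) for cls in classes}

# A training set that over-represents class "A" relative to production traffic
train = ["A"] * 80 + ["B"] * 20
prod = ["A"] * 50 + ["B"] * 50
skew = sampling_skew(train, prod)
print(sorted((cls, round(gap, 2)) for cls, gap in skew.items()))
# [('A', 0.3), ('B', 0.3)]
```

A large gap for any class is a signal to re-sample or re-weight before training.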

Another foundational challenge is missing / incomplete data. A thorough review of the dataset for the completeness of various fields is vital for all types of supervised learning, which largely depend on structured data. There are multiple techniques to handle this, ranging from simple (for example, deleting the incomplete row or applying domain-specific rules) to complex (imputation or model-agnostic handling), depending on the size of the data and the time available to the data scientists.
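The two simple techniques mentioned above - dropping incomplete rows and mean imputation - can be sketched in a few lines. The records below are hypothetical, purely for illustration.

```python
from statistics import mean

rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # missing age
    {"age": 29, "income": None},      # missing income
]

# Simple approach: drop any row with a missing field
complete = [r for r in rows if all(v is not None for v in r.values())]

# Mean imputation: fill each missing value with the mean of the observed values
def impute_mean(rows, field):
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: r[field] if r[field] is not None else fill} for r in rows]

imputed = impute_mean(rows, "age")
print(len(complete))      # 1
print(imputed[1]["age"])  # 31.5 (mean of 34 and 29)
```

Dropping rows is safe when data is plentiful; imputation preserves rows at the cost of introducing assumptions about the missing values.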

Data Duplication refers to the situation where the same data appears more than once in the dataset due to oversight. It could be a case of exact duplicates or near duplicates - sometimes the duplication even spans the training set and the dev set. When we end up in this kind of scenario, we will be misled about the model's performance, and the credibility of our metrics will be at stake during the development process.
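Exact duplicates within a set, and overlap between the training and dev sets, can both be caught with simple set operations, as in this sketch (near-duplicate detection needs fuzzy matching such as text normalisation or MinHash, which is beyond this example; the records are hypothetical).

```python
def find_duplicates(records):
    """Return the set of records that appear more than once."""
    seen, dupes = set(), set()
    for r in records:
        if r in seen:
            dupes.add(r)
        seen.add(r)
    return dupes

train = ["the cat sat", "a dog barked", "the cat sat"]
dev = ["a dog barked", "birds fly south"]

print(find_duplicates(train))  # {'the cat sat'}
print(set(train) & set(dev))   # {'a dog barked'} - leaks across the splits
```

Any overlap between train and dev inflates evaluation metrics, since the model is being tested on examples it has already seen.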

All three of the above arise at the data-gathering stage and may be due to human error. However, there are more challenges due to human error - sometimes tricky ones - that can occur at various stages: data gathering, human labelling and data processing. Let's look at them one by one in the next part of the post.

For all kinds of supervised models, labels (the actual outcomes) are critical for evaluating the model's predicted outcomes. To get the best quality of labels, human labelling is normally employed, and even in the age of LLMs the importance of human labellers is not undermined. One mischievous kind of challenge is noisy labels, which is a broad topic by itself. The labels could be incorrect, inconsistent, or weak (when they are generated by rule-based methods). There can also be sensor or transcription errors. Without setting this right, starting off with model training will only be a crude joke.
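One cheap way to catch some noisy labels before training is a rule-based sanity check: flag records whose label contradicts a domain rule. The rule and the records below are hypothetical, purely to illustrate the pattern.

```python
# Hypothetical domain rule: a transaction over 10,000 should never be
# labelled "low_risk" - any such record is a candidate noisy label.
def flag_suspect_labels(rows):
    """Return the indices of records whose label violates the rule."""
    return [i for i, r in enumerate(rows)
            if r["amount"] > 10_000 and r["label"] == "low_risk"]

rows = [
    {"amount": 500,    "label": "low_risk"},
    {"amount": 25_000, "label": "low_risk"},   # violates the rule
    {"amount": 12_000, "label": "high_risk"},
]
print(flag_suspect_labels(rows))  # [1]
```

Flagged records can then be routed back to human labellers for review rather than silently fed into training.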

Another subset of this is annotation inconsistency, which is quite common in classification problems. A particular ambiguous case may be decided one way by one human labeller while another decides otherwise. When a team of people work on labelling, this issue is in a way unavoidable unless we have clear rules defining each data element. Secondary verification or group discussions can also help to substantially bridge the ambiguities.
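Annotation inconsistency can be measured with a simple percent-agreement check between two labellers, and the disputed items routed to the secondary verification mentioned above. This is a minimal sketch with made-up labels; real projects often use stronger measures such as Cohen's kappa.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items on which two annotators assign the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

ann1 = ["pos", "neg", "pos", "neu"]
ann2 = ["pos", "neg", "neg", "neu"]

rate = percent_agreement(ann1, ann2)
print(rate)  # 0.75

# Items the two annotators disagree on - candidates for group discussion
disputed = [i for i, (a, b) in enumerate(zip(ann1, ann2)) if a != b]
print(disputed)  # [2]
```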

Like the data leakages we saw earlier, we can also have label leakages. This is a serious situation because it means the labels (actual outcomes) have seeped into the input parameters, either through the design of the dataset or wilfully by some team member to show better performance on training data. Extra care should be taken by data scientists to review all the input parameters closely and ensure that there is no signal or influence of the output among the input parameters.

There are two more categories of data-quality challenges, which can be grouped as (a) challenges that are "unavoidable" in some cases but still need to be handled, and (b) the quality focus of the data-science team, which I will cover in the next post. In particular, I am keen to give more space to one of the quality-focus challenges, called "Low Signal", which was an eye-opener for me in understanding and appreciating the role of data scientists.
