Sunday, June 8, 2025

AI / ML --> Musings on Data Quality (Part 2)

So, let's first look at a couple of unavoidable issues in data quality. Not every project will face these challenges, but when they do arise, they need to be recognized and handled properly.

First in this category is "Class Imbalance". This typically happens in classification problems where there is a natural imbalance in the distribution of the expected outcomes. For example, consider a model being developed for cancer detection: naturally, the number of people diagnosed with cancer will be far smaller than the number of people who do not have the disease. In this kind of situation, when we know that less than 2% of the total population can have a rare disease, we need to be doubly careful when the trained model shows a 98% (or even 99%) success rate, since a model that simply predicts "no cancer" for everyone achieves the same accuracy. A larger weight has to be attached to misclassifications of the rare class so that the metrics behave in the expected manner, and a thorough review of both false positives and false negatives is needed. This kind of challenge is also quite common in multi-class classification models; the sketch below illustrates the binary case.
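As an illustration, here is a minimal sketch of class weighting on a synthetic 98/2 dataset. The use of scikit-learn, logistic regression and these particular numbers are my own assumptions for demonstration, not a prescription from the post:

```python
# Sketch: why accuracy misleads on a 98/2 class imbalance,
# and how class weights help. Synthetic data for illustration only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic "rare disease" dataset: roughly 2% positives.
X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for weight in (None, "balanced"):
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    pred = model.fit(X_tr, y_tr).predict(X_te)
    # Accuracy stays high either way; recall on the rare class is
    # what reveals whether the model actually detects the disease.
    print(f"class_weight={weight}: "
          f"accuracy={accuracy_score(y_te, pred):.3f}, "
          f"minority recall={recall_score(y_te, pred):.3f}")
```

Class weighting is only one option here; resampling (over/under-sampling) and decision-threshold tuning are common alternatives.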

Next is the concept of "drift", which was referred to in the earlier post when "sampling bias" was explained. Particularly when the project life cycle is long, and in cases where the model is used in a scenario that sees frequent updates and changes, keeping the drift factor in mind is very important for data scientists. Drift can be of various types: it can happen to the data (example: in a medical dataset, patients now come from a new region with different average body weights, but the diagnosis rules haven't changed), to the labels (example: credit card fraud increases post-COVID, so the percentage of fraud cases rises, but the fraud patterns are still detectable), or to the concept itself (example: a spam filter becomes outdated as spammers change tactics over time). Needless to say, the last one is the most dangerous type of drift. A simple check for the first type is sketched below.
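One common way to catch data drift is to compare a feature's training-time distribution against fresh production data with a two-sample statistical test. Here is a minimal sketch using a Kolmogorov-Smirnov test on the body-weight example; the distributions and the 0.01 threshold are assumptions for illustration:

```python
# Sketch: detecting data (covariate) drift on a single feature
# with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_weights = rng.normal(loc=70, scale=10, size=5_000)  # training-time population
live_weights = rng.normal(loc=78, scale=12, size=1_000)   # new region, heavier patients

stat, p_value = ks_2samp(train_weights, live_weights)
# A tiny p-value says the live distribution differs from training:
# a signal to investigate (and possibly retrain), not by itself
# proof that the concept has changed.
if p_value < 0.01:
    print(f"Drift suspected: KS statistic={stat:.3f}, p={p_value:.2e}")
```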

Now, there is one more possible slip for the data science team: ensuring the right granularity of the data, in other words the level of detail or resolution at which data is captured or processed. Wrong granularity occurs when data is too coarse (aggregated too much), too fine-grained (minor individual events / outliers taken into consideration), when input data and target labels are at mismatched levels, or when multiple tables or sources are wrongly joined. This is one kind of quality challenge that falls squarely within the scope of the data science team. It is something they can completely avoid by careful planning and seamless execution during the data processing stages; the sketch below shows the wrongly-joined case.
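As a small illustration of the wrongly-joined case, here is a pandas sketch (the table names and churn scenario are invented for demonstration) where transaction-level rows are joined to customer-level labels without aggregating first:

```python
# Sketch: a granularity mismatch - transaction-level features joined
# to customer-level labels without aggregating first.
import pandas as pd

transactions = pd.DataFrame({   # one row PER TRANSACTION
    "customer_id": [1, 1, 1, 2],
    "amount": [20.0, 35.0, 15.0, 500.0],
})
labels = pd.DataFrame({         # one row PER CUSTOMER
    "customer_id": [1, 2],
    "churned": [0, 1],
})

# Wrong: the join silently repeats customer 1 three times,
# skewing the class balance seen by the model.
wrong = transactions.merge(labels, on="customer_id")
print(len(wrong))   # 4 rows instead of 2

# Right: aggregate features up to the granularity of the labels first.
per_customer = transactions.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    n_txns=("amount", "count"),
).reset_index()
right = per_customer.merge(labels, on="customer_id")
print(len(right))   # 2 rows - one per customer, matching the labels
```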

Now, let's come to the last data quality challenge, referred to as "Low Signal", also described as a low SNR (signal-to-noise ratio). This one hurts AI/ML performance silently. If the features of the dataset carry too little useful signal (predictive power) compared to random noise, the model may do any of three things: overfit the noise, fail to generalize, or produce meaningless predictions. It is interesting to notice that the outcomes are varied, so the earlier this is detected, the more effective the remediation will be; otherwise we proceed on a wrong diagnosis, doing all sorts of irrelevant things to mitigate the issue. It is not uncommon that the development team is asked to fine-tune the model and chase algorithms instead of doing the right things. One way to detect the problem early is sketched below.
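A quick way to gauge how much signal each feature carries is to estimate its mutual information with the target; uniformly near-zero scores hint at a low-SNR dataset. This is one possible diagnostic among many, sketched here on synthetic data:

```python
# Sketch: estimating per-feature signal with mutual information.
# Scores near zero across the board suggest a low-SNR dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# 2 genuinely informative features buried among 8 pure-noise ones.
X, y = make_classification(n_samples=2_000, n_features=10,
                           n_informative=2, n_redundant=0, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)
for i, s in enumerate(scores):
    print(f"feature {i}: MI = {s:.3f}")   # noise features score near 0.0
```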

So what are the right things when it comes to low signal? Right at the feature selection stage, it is important to remove irrelevant/redundant features. Data scientists can also look at combining and transforming features to create better ones, or use embedding techniques to represent sparse features. Regularization is one technique adopted during model development to attack this issue, as it silences the noisy features. There are also tools available, like autoencoders, which can extract latent structure and discard weak noise in the data. A small sketch of regularization doing exactly that follows.
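To make the regularization point concrete, here is a minimal sketch of L1 regularization (Lasso) driving the coefficients of noise features to exactly zero; the alpha value and synthetic data are illustrative assumptions:

```python
# Sketch: L1 regularization (Lasso) silencing noisy features by
# shrinking their coefficients to exactly zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 3 informative features among 10; the other 7 are pure noise.
X, y = make_regression(n_samples=1_000, n_features=10,
                       n_informative=3, noise=10.0, random_state=0)

model = Lasso(alpha=5.0).fit(X, y)
print(np.round(model.coef_, 2))   # most coefficients collapse to 0.0
```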

To summarize, Data Science is an exciting area of AI/ML. There is a popular proverb, "the proof of the pudding is in the eating", which I now realize is only a half-truth. It should be a statement that comes after "the guarantee of the pudding is in the making" !! The main advantage of Data Science, compared to the process of making a pudding, is that we have a wide range of tools and remediation measures to mitigate the challenges of poor data quality even if we miss them earlier. The subject has evolved so much that it offers both preventive and remediation processes for each of these challenges. Relatively speaking, the proof of the pudding is more of an unavoidable result !!
