Sunday, June 15, 2025

AI / ML : Celebrating a Tiny Success

So, here I am - feeling blissful after completing an ML certificate course on Coursera - ready and eager to relax with a long drive in the evening (after all, it's a Sunday).

Over the last week, I had invested some time studying the challenges in data quality and was brooding a bit over why Mr. Andrew Ng did not cover these things in the other ML / Deep Learning courses that I have been taking quite systematically from the same academy.

For me, it was a pleasant surprise that some of the issues of data quality were addressed in this course - which is relatively small compared to the other courses I had undergone recently on Coursera - nevertheless, most enriching.

First things first: this course is not about algorithms, Python and its omnipotent libraries like PyTorch / TensorFlow, or any other theoretical framework of AI/ML. It talks about practical wisdom while implementing ML projects, and fittingly, the assessments towards the end of both modules ask questions only on practical decisions.

I should say I was reluctant to start this course even though my earlier course on Deep Learning / Neural Networks was completed in the last week of April 2025 itself. I had been putting off this module with the thought that it might turn out to be superficial and uninteresting. On the contrary, the entire week turned out to be quite gripping, keeping me hooked to the course and providing a lot of practical tips and tested processes.

Should a person like me - having no domain knowledge - take up this course as the first one in AI/ML? My answer would be a clear NO.

I quite sincerely feel that I would not have understood this course as well as I do now without the time invested in understanding AI / ML in a structured manner over the past several months. As the course title itself says - "Structuring" ML Projects - this course essentially helps us structure our overall understanding.

Well, do you like the cherry on top of the ice cream in a fruit salad?

I like it so much - and I liked this one too.....!!

Thursday, June 12, 2025

AI / ML --> How ChatGPT Works (Courtesy : Andrej Karpathy)


This is a summary of a video available on YouTube by none other than one of the AI legends of current times (Andrej Karpathy). This video has been a huge inspiration for me in recent times to get deeper and wider into AI/ML.

I felt like posting this today for two reasons. First is the compassion in my heart for all those people who don't have the time and patience to go through the video themselves (yes, the YouTube video runs for 3 hours and 30 minutes). The second reason, let me save for the end.

Well, I know a shorter version of this topic from Andrej himself is available too. Also, I know these days LLMs can give a nicer summary than the one below. However, I am compelled to post this one since I still exist and my craving to share is still intact.

*********************************************************************************

As an introduction, Andrej explains that this video is not about ChatGPT alone - it is also about LLMs in general. Also, he makes it quite clear in the opening statement that this video is meant for a general audience and that no technical knowledge is needed to understand it. So here is how it goes.

Stages of Development of an LLM like ChatGPT

Typically, there are 3 stages of development in any such LLM that acts as a platform for users to interact with directly.

Stage 1 : Pre-training (data pre-processing & neural training) - High-quality documents on diversified subjects are downloaded from the internet and pre-processed in a structured manner. For example, all URLs embedded in the original text are deleted, duplication filters are applied, and PII (personally identifiable information) is detected and removed. The entire data is then converted into sequences of unique symbols (tokenized) before being crunched in the training runs.
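For the technically curious, here is a tiny illustration (my own, not from the video) of what tokenization looks like, using the open-source tiktoken library as an example:

# A minimal tokenization sketch; assumes the `tiktoken` library,
# which implements the BPE tokenizers used by OpenAI models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era tokenizer

text = "Data quality matters in AI/ML."
tokens = enc.encode(text)     # text -> list of integer token ids
print(tokens)                 # a handful of integers, one per token / sub-word
print(enc.decode(tokens))     # round-trips back to the original text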

Neural Training - In this stage, a neural network is trained to predict the next token in a sequence: given a window of tokens, it outputs a probability for every possible next token, and its parameters are adjusted repeatedly so that the tokens which actually follow in the data become more probable. This happens on a bunch of tokens at a time but is repeated across the entire data set in parallel to improve the model overall.
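A toy illustration (mine, with NumPy - nothing like a real LLM) of that next-token objective: the network scores every candidate token, and training rewards it when the actual next token gets high probability.

# Toy next-token objective: logits -> softmax -> cross-entropy loss.
import numpy as np

vocab  = ["the", "cat", "sat", "mat"]
logits = np.array([2.0, 0.5, 1.0, -1.0])       # hypothetical network scores

probs  = np.exp(logits) / np.exp(logits).sum() # softmax -> probabilities
target = vocab.index("cat")                    # the token that actually came next
loss   = -np.log(probs[target])                # small when the model is right

print(dict(zip(vocab, probs.round(3))), "loss:", round(loss, 3))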

What gets created at the end of neural training is referred to as the base model. It is still not ready for the public to start interacting with directly.

Stage 2 : Post-training - Turning the base model into an assistant (instruct) model needs post-training. We need the involvement of human beings for this - a pool of people starts creating a "data set", essentially Q & A pairs, which are used to make the base model more intelligent and user-friendly.

Relatively speaking, post-training may take far less time, and the people involved in this process are called "human labellers" - they give the human touch that we all sense when we interact with ChatGPT. These people are normally well educated and experienced, and they also ensure ethical standards while developing responses to the hypothetical questions. Understandably, they cannot create every possible Q & A, but the data sets contain "personas", and the model can learn to interpret them based on the neural training given to the base model.
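To make this concrete, here is a hedged sketch of what one entry in such a post-training data set might look like - the field names are purely illustrative, since every lab uses its own schema:

# Illustrative shape of one supervised post-training example.
# Field names are hypothetical, not any lab's real schema.
sft_example = {
    "system": "You are a helpful, harmless assistant.",   # the "persona"
    "conversation": [
        {"role": "user",      "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "Sunlight scatters off air molecules, "
                                         "and blue light scatters the most..."},
    ],
}
# Training simply continues next-token prediction, but on these curated
# conversations instead of raw internet text - so the model imitates the
# labellers' style and values.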

Responses given by ChatGPT are just statistical imitations / simulations of human labellers, not anything magical.

Stage 3 : Reinforcement Training - Like stage 2, humans are involved here also. Why we need this stage can be understood with a textbook analogy. We have different layers of learning in any academic textbook. The volume of expository text is like the first stage of training of ChatGPT. The illustrations and worked questions within each chapter are like Stage 2 explained above. At the end of each chapter, we have only questions WITHOUT solutions (perhaps the final answers are given in the last few pages) - we work them out ourselves and check against the answer. This kind of learning is essentially what is achieved with Reinforcement Training.

For example, if we ask an LLM to tell a joke, its candidate outputs are reviewed by humans, who give their ranking of the best joke. In parallel, a reward model (which is a separate neural network) is asked for its prioritization, say on a scale of 1-9. We compare the scores from both sources, and the reward model is updated based on the human ranking - so that humans need not be involved in the entire joke-rating exercise forever. The reward model is thus nudged at the end of each iteration and moves towards the human scores.

In fact, as explained above, what is used here is RLHF (Reinforcement Learning from Human Feedback) rather than reinforcement learning in its strict textbook definition.
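Here is a toy sketch (my own simplification, not a real RLHF implementation) of how the reward model gets nudged towards the human ranking at each iteration:

# Toy "reward model nudging": the model's scores drift towards human scores.
import numpy as np

human_scores  = np.array([9.0, 3.0, 6.0])   # human rating of three jokes
reward_model  = np.array([5.0, 5.0, 5.0])   # reward model's initial scores
learning_rate = 0.3

for step in range(10):
    # move the reward model's scores a little towards the human scores
    reward_model += learning_rate * (human_scores - reward_model)

print(reward_model.round(2))   # converges towards [9. 3. 6.]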

How do ChatGPT / other LLMs work?

When a user asks a specific question, the chatbot first draws on the patterns learnt from the data set created by the human labellers; even if no directly matching example exists, it is capable of imitating the training information and providing the best possible response. It goes for an internet search if needed. The responses provided by ChatGPT may look very personalized and comprehensive at the same time, but the reality is that it is just generating a series of tokens.

We can test this by asking the same question repeatedly: the chatbot will reword or modify the response each time without changing the core answer. It is so eloquent because of the tonnes of data it has not just control over but has been trained over!
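Why does the wording vary each time? Because the next token is sampled from a probability distribution rather than picked deterministically. A toy sketch (mine, with NumPy):

# Same meaning, varied wording: tokens are sampled, not chosen by argmax.
import numpy as np

rng        = np.random.default_rng()
next_words = ["glad", "happy", "pleased"]   # candidate continuations
probs      = [0.5, 0.3, 0.2]                # model's probabilities for each

for run in range(3):
    word = rng.choice(next_words, p=probs)  # sample from the distribution
    print(f"Run {run + 1}: I am {word} to help!")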

To summarize, responses generated by ChatGPT / other LLMs are just statistical imitations / simulations of human labellers, not anything magical.

Myths about ChatGPT and other LLMs

(1)  Hallucination

ChatGPT (or any other LLM) will rarely accept that it does not know. It draws on its training data and tries to give a response somehow. This effect is called "hallucination", and we can test for it by asking a question with the special instruction "Do not use any tools". Now it cannot use the internet or any other source of data, and it will admit its ignorance.

Mitigation strategy 1 : In fact, some models provide methods to enrich their knowledge by allowing the user to add new information on top of the training data.

Mitigation strategy 2 : In situations where we know that the LLM does not know, we can provide contextual data along with our questions. It is smart enough to use it - and in fact becomes more powerful with contextual information.

(2) Knowledge of self 

"Who are you"  is a very dumb question to ask to ChatGPT or any other LLM since it is possible to add it in training data or it can be a hardcoded response in some models Afterall the model is just a “token tumbler” & has no memory / personality of its own ; 

Mitigation strategy : Make use of ChatGPT to learn more about things you don't know. There is no point trying to be smart about interrogating the source! (Well, there are a few LLM products - Perplexity, for instance - which are built on top of such models but go to the extent of also giving references for their responses. This is provided to gain more credibility with users.)

(3) Questions asking for arithmetic calculations

We need to remember that an LLM operates on just a "one-dimensional sequence of tokens", and any calculation is done over that stream of tokens. For an arithmetic problem, it is natural for the LLM to give the response step by step, one piece after the other. If we insist on getting the calculated value first and the detailed steps afterwards, it becomes quite a complex thing for ChatGPT, considering the limited computation available per token at each interaction with the user. So it is better to have step-wise responses for all arithmetic calculations.

Mitigation Strategy : A better way to ask ChatGPT is to give the arithmetic word problem and then type "use code". The result will be more accurate and reliable, since it uses Python arithmetic instead of the language model's mental arithmetic ("models need tokens to think"). By using "code", the problem is handed to a part of the system where a program is executed, and only the result is brought back to the interactive screen.

(4) As a subset of the earlier point, ChatGPT is not good at counting. Ask "how many dots are in the below ........" and don't be surprised if the answer is wrong. It tries to count dots that may have got split across different tokens. When we say "use code", it will count using a simple Python expression or loop instead.

Well, Andrej gives an example: it was quite a popular joke that ChatGPT was not able to correctly count the number of 'r's in the word "strawberry" until recently. He remarks that it no longer makes that mistake - perhaps it got fixed by the ChatGPT team.
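For the curious, the "use code" path boils down to something as trivial as this (the model writes and runs an equivalent snippet behind the scenes):

# Exact counting in Python instead of counting over tokens.
dots = "........"
print(len(dots))          # 8

word = "strawberry"
print(word.count("r"))    # 3 - the famous example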

To summarize, ChatGPT and other LLMs can be effectively used if we understand them better and use them wisely.

*********************************************************************************

So that was a quick summary of Andrej's video, and I hope you don't miss the details / references that he provides in the video itself.

Btw, the second reason for this post is that today is my birthday, and this post is just an expression of my bliss about today.

 

Regards // Suren 

 

 

 

Sunday, June 8, 2025

AI / ML --> Musings on Data Quality (Part 2)

So, let's first look at a couple of unavoidable issues in data quality. It is not that these challenges will be there in every project, but when they are, they need to be recognized and handled properly.

First in this category is "Class Imbalance". Typically this happens in classification problems where there is a natural imbalance in the distribution of the expected outcomes. For example, consider a model being developed for cancer detection - quite naturally, the number of people diagnosed with cancer will be far smaller than the number who do not have the disease. In this kind of situation, when we know that less than 2 % of the total population can have a rare disease, we need to be doubly careful when the trained model shows a 98 % (or even 99 %) success rate - a model that predicts "no disease" for everyone would score just as well. A larger weight has to be attached to wrong diagnoses of the rare class to ensure that the metrics behave in the expected manner, and a thorough review of the errors - both false positives and false negatives - is needed. This kind of challenge is also quite common in multi-class classification models.
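Here is a minimal sketch (with scikit-learn and synthetic data) of attaching a larger weight to the rare class, and of reading per-class metrics instead of plain accuracy:

# Class imbalance: weight the rare class up, judge by per-class metrics.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.02).astype(int)    # ~2 % positives: heavy imbalance

# class_weight="balanced" re-weights errors inversely to class frequency,
# so misclassifying the rare class costs much more.
model = LogisticRegression(class_weight="balanced").fit(X, y)

# Accuracy alone would look great even for a useless model; per-class
# precision / recall exposes behaviour on the rare class.
print(classification_report(y, model.predict(X), zero_division=0))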

Next is the concept of "drift", which was referred to in the earlier post when "sampling bias" was explained. Particularly when the project life cycle is long, and in cases where the model is used in a scenario that keeps changing, keeping the drift factor in mind is very important for data scientists. Drift can be of various types - it can happen to the data (example: in a medical dataset, patients now come from a new region with different average body weights, but the diagnosis rules haven't changed), to the labels (example: credit card fraud increases post-COVID, so the % of fraud cases rises, but the fraud patterns are still detectable), or to the concept (example: a spam filter becomes outdated as spammers change tactics over time). Needless to say, the last one is the most dangerous type of drift.
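A minimal sketch (with SciPy and synthetic numbers) of catching data drift in a single feature, by comparing its training-time distribution with live production data:

# Data drift check: compare train-time vs live distributions of a feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
train_weights = rng.normal(loc=70, scale=10, size=5000)  # body weights at training time
live_weights  = rng.normal(loc=78, scale=12, size=5000)  # new region, heavier patients

stat, p_value = ks_2samp(train_weights, live_weights)    # two-sample KS test
if p_value < 0.01:
    print(f"Drift suspected (KS statistic = {stat:.3f}) - consider retraining.")
else:
    print("No significant drift detected in this feature.")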

Now, there is one more possible slip for the data science team: ensuring the granularity of the data - in other words, the level of detail or resolution at which data is captured or processed. Wrong granularity occurs when data is too coarse (aggregated too much), too fine-grained (minor individual events / outliers taken into consideration), when input data and target labels are at mismatched levels, or when multiple tables or sources are wrongly joined (see the sketch below). This is one kind of quality challenge which falls entirely within the scope of the data science team. It is something they can completely avoid by careful planning and seamless execution during the data processing stages.
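Here is a small illustration (with pandas and made-up data) of the wrong-join variety of this slip, and its fix:

# Granularity slip: joining a per-patient table with a per-visit table
# silently duplicates patient rows.
import pandas as pd

patients = pd.DataFrame({"patient_id": [1, 2], "label": ["sick", "healthy"]})
visits   = pd.DataFrame({"patient_id": [1, 1, 1, 2], "bp": [120, 135, 140, 118]})

wrong = patients.merge(visits, on="patient_id")   # one row PER VISIT now
print(len(wrong))                                 # 4 rows - patient 1 counted 3x

# Fix: aggregate visits to the patient level first, so both tables share
# the same granularity before joining.
per_patient = visits.groupby("patient_id", as_index=False).agg(mean_bp=("bp", "mean"))
right = patients.merge(per_patient, on="patient_id")
print(len(right))                                 # 2 rows - one per patient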

Now, let's come to the last data quality challenge, referred to as "Low Signal" - also described as a low SNR (signal-to-noise ratio). This one hurts AI/ML performance silently. If the features of the data set carry too little useful signal (predictive power) compared to random noise, the model may do one of three things - overfit the noise, fail to generalize, or produce meaningless predictions. It is interesting to notice that the outcomes are varied - so the earlier this is detected, the more effective the remediation will be; otherwise we proceed on a wrong diagnosis, doing all sorts of irrelevant things to mitigate the issue. It is not uncommon for the development team to be asked to fine-tune the model and chase algorithms instead of doing the right things.

So what are the right things when it comes to low signal? Right at the feature selection stage, it is important to remove irrelevant / redundant features. Data scientists can also look at combining and transforming features to create better ones, or use embedding techniques to represent sparse features. Regularization is one technique adopted during model development to attack this issue, as it silences the noisy features (see the sketch below). There are also tools available, like autoencoders, which can extract latent structure and discard weak noise in the data.
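A minimal sketch (with scikit-learn and synthetic data) of regularization doing exactly that: L1 (Lasso) drives the weights of uninformative features towards zero.

# Low signal: L1 regularization squeezes pure-noise features to ~0 weight.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n = 500
signal = rng.normal(size=n)              # one genuinely predictive feature
noise  = rng.normal(size=(n, 9))         # nine pure-noise features
X = np.column_stack([signal, noise])
y = 3.0 * signal + rng.normal(scale=0.5, size=n)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_.round(2))              # first weight ~3, noisy ones near 0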

To summarize, data science is an exciting area of AI/ML. There is a popular proverb, "the proof of the pudding is in the eating", which I now realize is a half-truth. It should follow another statement: "the guarantee of the pudding is in the making"!! The main advantage of data science over pudding-making is that we have a wide range of tools and remediation measures to mitigate the challenges of poor data quality even if we miss them earlier. The subject has evolved so much that it offers both preventive and remedial processes for each of these challenges. Relatively speaking, the proof of the pudding is more of an unavoidable result!!

Friday, June 6, 2025

AI / ML ---> Musings on Data Quality

This post is focused on data quality - which, as we saw in the previous post, is one of the critical components of an AI/ML model.

Even during its nascent stage, Machine Learning was considered an intersection of Data Science and Software Engineering. As Machine Learning matured and paved the way to Artificial Intelligence, the importance of Data Science has only increased. In fact, for LLMs and Generative AI technologies, Data Science is becoming more crucial than ever before.

With the super efficiency brought to building AI models these days - deep neural networks and well-tested scripts / functions - it is quite an irony that data quality continues to be a challenge.

Let's look at the various aspects that impact data quality.

First and foremost, there are a few foundational challenges over which data scientists need to take enormous care before processing the data. A key challenge is to ensure that the data is fairly representative of the real-world scenario. Sampling bias refers to the situation where, deliberately or inadvertently, the data does not reflect real-life data (the data that users will eventually generate in the production environment). We will deal with the concept of data drift separately (which could be due to the time lag between development and deployment), but the major causes here are skewness and sampling errors.

Another foundational challenge is missing / incomplete data. A thorough review of the data set for the completeness of various fields is vital for all types of supervised learning, which depend heavily on structured data. There are multiple techniques to handle this - simple (for example, deleting the incomplete row or applying domain-specific rules) and complex (imputation or model-agnostic handling) - depending on the data size and the time available to the data scientists.
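A minimal sketch (with pandas / scikit-learn and made-up numbers) of the simple route and the imputation route:

# Missing data: drop incomplete rows, or impute with a column statistic.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 33],
    "income": [50_000, 62_000, np.nan, 58_000],
})

# Simple: drop incomplete rows entirely (cheap, but loses data).
print(df.dropna())

# Imputation: fill gaps with a column statistic such as the median.
imputer = SimpleImputer(strategy="median")
filled  = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)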

Data duplication refers to the situation where the same data appears more than once in the data set due to oversight. It could be a case of exact duplicates or near-duplicates - and sometimes the duplication is across the training set and the dev set. When we end up in this kind of scenario, we will be misled about the model's performance, and the credibility of the metrics will be at stake during the development process.
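A small illustration (with pandas and made-up data) of catching exact duplicates, including ones that leak across the training and dev splits:

# Duplicates within a split, and duplicates leaking across splits.
import pandas as pd

train = pd.DataFrame({"text": ["good movie", "bad plot", "good movie"],
                      "label": [1, 0, 1]})
dev   = pd.DataFrame({"text": ["great cast", "good movie"],
                      "label": [1, 1]})

train = train.drop_duplicates()                          # exact duplicates within train

overlap = pd.merge(train, dev, on="text", how="inner")   # rows shared by both splits
print(overlap["text"].tolist())                          # ['good movie'] - inflates dev scores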

All three of the above arise basically at the data gathering stage and may be due to human error. However, there are more challenges due to human error - sometimes tricky ones - which happen at various stages: data gathering, human labelling and data processing. Let's look at them one by one.

For all kinds of supervised models, labels (the actual outcomes) are critical for evaluating the model's predicted outcomes. To get the best quality of labels, human labelling is normally employed, and even in the age of LLMs the importance of human labellers is not diminished. One mischievous kind of challenge is noisy labels, which is a broad topic by itself. Labels could be incorrect, inconsistent, or weak (when they are generated by rule-based heuristics). There can also be sensor or transcription errors. Without setting this right, starting off with model training would only be a crude joke.

Another subset of this is annotation inconsistency, which is quite common in classification problems. A particular kind of ambiguous situation can be decided in one manner by one human labeller while another decides otherwise. When a team of people works on labelling, this issue is in a way unavoidable unless we have clear rules defining each data element. Secondary verification or group discussions can also help to substantially bridge the ambiguities.
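One handy way to quantify such inconsistency is an inter-annotator agreement score. A tiny sketch using Cohen's kappa from scikit-learn (labels are made up):

# Annotation inconsistency, quantified: Cohen's kappa between two labellers.
from sklearn.metrics import cohen_kappa_score

labeller_a = ["spam", "ham", "spam", "spam", "ham", "ham"]
labeller_b = ["spam", "ham", "ham",  "spam", "ham", "spam"]

kappa = cohen_kappa_score(labeller_a, labeller_b)
print(f"kappa = {kappa:.2f}")   # 1.0 = perfect agreement, ~0 = chance level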

Like the data leakage we saw earlier, we can also have label leakage. This could be a serious situation, because it means the label (actual outcome) has seeped into the input parameters, either through the design of the dataset or wilfully by some team member to show better performance on training data. Extra care should be taken by the data scientist to review all the input parameters closely and ensure that no signal / influence from the output is present among the inputs.
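A minimal screening sketch (with pandas and synthetic data): a feature that correlates almost perfectly with the label deserves suspicion.

# Label leakage screen: near-perfect feature-label correlation is a red flag.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 1000
label = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "age":          rng.normal(45, 12, size=n),         # genuine feature
    "claim_amount": label * 100 + rng.normal(size=n),   # leaked: derived from the label
    "label":        label,
})

corr = df.corr(numeric_only=True)["label"].drop("label")
print(corr.round(3))            # claim_amount ~1.0 -> investigate / drop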

There are two more categories of data quality challenges, which can be grouped as (a) "unavoidable" in some cases but needing to be handled, and (b) the quality focus of the data science team - both of which I will handle in the next post. In particular, I am keen to give more space to one of the quality-focus challenges called "Low Signal", which was an eye-opener for me in understanding and appreciating the role of data scientists.

Tuesday, June 3, 2025

AI / ML - Neurons - Biological vs Artificial

  • Forward pass gives structure.

  • Activation functions give expressiveness.

  • Backward pass enables learning.

  • Optimization algorithms guide learning.

  • Data brings meaning.
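Here is a tiny sketch (my own, with NumPy) tying these five bullets together for a single artificial neuron:

# One artificial neuron, end to end: forward pass -> activation ->
# backward pass -> optimization step, driven by data.
import numpy as np

x = np.array([0.5, -1.2, 0.8])      # data brings meaning
w = np.array([0.1, 0.4, -0.3])      # learnable weights
b = 0.0
target = 1.0

for step in range(100):
    z = w @ x + b                   # forward pass gives structure
    a = 1 / (1 + np.exp(-z))        # sigmoid activation gives expressiveness
    loss = (a - target) ** 2
    # backward pass enables learning: chain rule through loss -> sigmoid -> linear
    dz = 2 * (a - target) * a * (1 - a)
    w -= 0.5 * dz * x               # optimization (gradient descent) guides learning
    b -= 0.5 * dz

print(round(float(loss), 5))        # loss shrinks towards 0 as the neuron learns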

Life goes on with the priorities in our hands and whatever is urgent - doesn't it?

Sunday, June 1, 2025

AI / ML - History of Activations

In one of my favourite films (Tamil), which deals with the greatness of a 5th-century spiritual master (Bodhidharma), there is a wonderful dialogue. The film is Aezhaam Arivu (which means "7th sense"):

" We started losing our science when we started ignoring our history and heritage"

***********************************************************************************

 In 1943, "Step" function was used to have the early neural network just to enable "fire" or "don't fire"  in other word - "give output or keep quiet based on the computation made" (no details covered in this post please). You can appreciate this is the most simplistic way of looking at things but the data scientists those days were sincere to mimick their limited understanding of human brain which they observed that only few of the neurons of the brain were activated at any point of time.

Understandably, this milestone let the early neural networks carry data forward through the network - what is referred to as "forward propagation". However, the step function lacked the ability to support the reverse calculation of the network automatically, which is critical for optimizing the output (its derivative is zero everywhere it is defined, so no gradient can flow back). Can you believe that the AI experts used to work out derivatives (calculus) manually to handle back-propagation, since that was not enabled by the step function? It was cumbersome, but there was no choice for them.

During 1970-80, a breakthrough was achieved by using the sigmoid function in basic neural networks (also referred to as shallow networks, which did not have many layers). Being smooth and differentiable, it helped optimization once the concept of gradient descent (a method to do the reverse calculation / backward propagation automatically) was developed.

However, since - as we know - the sigmoid function returns a value between 0 and 1, there were challenges of "vanishing gradients" during the backward propagation process, so the whole idea of optimization got stuck. Around 1980, a better activation called the tanh function started getting used; it gives an output in the range -1 to 1 (zero-centred), which helped avoid the earlier challenge substantially. Still, for very large or very small input values the function saturates, and optimization continued to suffer.

After quite a while - which also marked the advent of the "deep learning era", supported by deep neural networks allowing multiple hidden layers - the ReLU activation function came into wide use around 2010 (proud to say one of its co-proposers was Vinod Nair, though he lives in Canada). This activation is very simple (output the input value or zero, whichever is higher) and its computational cost is minimal. It almost took away the woes of optimization and was a huge shot in the arm for deep learning. Even today, though there are many other sophisticated activation functions available, when in doubt or unsure the developer community goes for ReLU for hidden-layer activation as a safe bet. Is it the best one available today? It still has the issue of returning zeros sometimes (the "dying ReLU" problem).

During 2011-15, a very smart variation of ReLU, named "Leaky ReLU", was introduced (for negative inputs it returns a small fraction of the input, typically 0.01x, instead of zero). This adjusts for the issues caused by zero outputs: it won't return zero any more but a tiny value instead, keeping the backward propagation going with non-zero gradients!

In parallel, we had the softmax function introduced in deep neural networks, based on the need to return a probabilistic output over a chosen set of values (typically at the output layer of a classifier). We should be clear that this was more out of need than any logical progression in the history described so far.

After 2015, we have Swish, Mish, GELU and so on, which have made smoother activations possible for ultra-deep networks... and this is not an exhaustive list of activation functions.
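To make the history concrete, here is a small NumPy sketch (mine) of the activations discussed above; the GELU uses its common tanh approximation:

# The activations from the history above, as plain NumPy functions.
import numpy as np

def step(x):        return np.where(x >= 0, 1.0, 0.0)           # 1943: fire / don't fire
def sigmoid(x):     return 1 / (1 + np.exp(-x))                 # smooth, range (0, 1)
def tanh(x):        return np.tanh(x)                           # zero-centred, range (-1, 1)
def relu(x):        return np.maximum(0.0, x)                   # 2010: cheap, non-saturating
def leaky_relu(x, a=0.01): return np.where(x >= 0, x, a * x)    # keeps negatives alive
def swish(x):       return x * sigmoid(x)                       # post-2015 smooth variant
def gelu(x):        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, sigmoid, tanh, relu, leaky_relu, swish, gelu):
    print(f"{f.__name__:>10}: {np.round(f(x), 3)}")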

Well, my constant companion ChatGPT gave me a nice idea for remembering this history, meant for people over 40 years of age (when, obviously, the neurons start dying quite rapidly):

*************************************************************************

 "Some Teachers Run Like Super Geniuses"

S = Step & Sigmoid

T = Tanh

R = ReLU

L = Leaky ReLU

S = SWISH

G = GELU

**************************************************************************

 Another Memory Anchor - 

First they 'stepped' (binary), then made it smooth (Sigmoid), then centred it (tanh), then said 'let's forget curves, just cut' (ReLU), fixed 'dying neurons' (Leaky ReLU), and finally started 'smart, curvy activations' (Swish, GELU).