In one of my favorite Tamil films, Aezhaam Arivu (which means "7th sense"), which deals with the greatness of a 5th-century spiritual master (Bodhi Dharma), there is a wonderful dialogue:
" We started losing our science when we started ignoring our history and heritage"
***********************************************************************************
In 1943, "Step" function was used to have the early neural network just to enable "fire" or "don't fire" in other word - "give output or keep quiet based on the computation made" (no details covered in this post please). You can appreciate this is the most simplistic way of looking at things but the data scientists those days were sincere to mimick their limited understanding of human brain which they observed that only few of the neurons of the brain were activated at any point of time.
Understandably, this milestone let their initial neural networks carry data forward through the layers, or what is referred to as "forward propagation". However, it lacked the ability to support the reverse calculation through the network automatically, which is critical to optimize the output. Can you believe that the AI experts used to work out derivatives (calculus) manually to handle backpropagation, since that was not enabled by the "step" function? It was cumbersome, but there was no choice for them.
During 1970-80, a breakthrough was achieved by using the sigmoid function in basic neural networks (also referred to as shallow networks, which did not have many layers), which helped optimization once the concept of gradient descent (a method to do the reverse calculation / backward propagation automatically) was developed.
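As a rough illustration (my own sketch in Python, not historical code), the sigmoid squashes any input into the range (0, 1) and has a simple derivative, which is exactly what gradient descent needs to push corrections backwards through the network.

import numpy as np

def sigmoid(x):
    # Smoothly squash any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s); this is what backpropagation multiplies by.
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-4.0, 0.0, 4.0])
print(sigmoid(x))       # approx [0.018, 0.5, 0.982]
print(sigmoid_grad(x))  # approx [0.018, 0.25, 0.018]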
However, since, as we know, the sigmoid function returns a value between 0 and 1, its gradient is small (at most 0.25) and shrinks further as it passes back through the layers, so there were challenges of "vanishing gradients" during the backward propagation process and the whole idea of optimization got stuck. Around the 1980s, a better activation called the tanh function started getting used, which gives an output in the range -1 to 1 and helped avoid the earlier challenge substantially. Still, for very large or very small input values, optimization suffered.
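To see the saturation numerically, here is a small sketch (again just illustrative Python): tanh is zero-centered with output in (-1, 1), but its gradient still collapses towards zero for large-magnitude inputs.

import numpy as np

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2; it approaches zero for large |x| (saturation).
    return 1.0 - np.tanh(x) ** 2

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.tanh(x))    # approx [-1.0, -0.76, 0.0, 0.76, 1.0]
print(tanh_grad(x))  # approx [0.0, 0.42, 1.0, 0.42, 0.0] - gradients vanish at the extremes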
After quite a while, around 2010, which also marked the advent of the "deep learning era" (supported by deep neural networks allowing multiple hidden layers, and eventually large language models), the ReLU activation function was proposed (proud to say that one of the co-authors was Vinod Nair, though he lives in Canada). This activation is very simple (the output is either the input value or zero, whichever is higher) and its computational cost is minimal. It took away most of the woes of optimization and was a huge shot in the arm for deep learning. Even today, though many more sophisticated activation functions are available, when in doubt or unsure the developer community goes for ReLU for hidden-layer activation as a safe bet. Is it the best one available today? It still has the issue of returning zero sometimes.
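Here is how simple ReLU really is (an illustrative sketch, not the code from the paper): the output is max(0, x), and the gradient is exactly zero for negative inputs, which is where the "returning zeros" complaint, often called the "dying ReLU" problem, comes from.

import numpy as np

def relu(x):
    # ReLU: pass the input through if positive, otherwise output zero.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 for negative inputs (hence neurons can "die").
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))       # [0. 0. 0. 2.]
print(relu_grad(x))  # [0. 0. 0. 1.]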
During 2011-15, a very smart variation of ReLU named "Leaky ReLU" was introduced: it returns the input value when positive, and otherwise a small fraction (commonly 0.01) of the input. This addresses the issue caused by zero outputs, since it no longer returns exactly zero but a tiny value instead, keeping the backward propagation going with non-zero gradients!
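Leaky ReLU differs only in allowing a small slope on the negative side (I am using 0.01 here; the exact value varies by framework and paper), so the gradient never collapses to exactly zero.

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: identity for positive inputs, a small slope (alpha) for negative ones.
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 2.0])
print(leaky_relu(x))  # [-0.03  -0.005  2.  ]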
In parallel, we had the softmax function introduced in deep neural networks based on the need to return a probabilistic output over a chosen set of values. We should be clear that this was more out of need than any logical progression in the history described so far.
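A minimal softmax sketch (my own illustration): it turns a vector of raw scores into probabilities that sum to 1, which is why it sits at the output layer of classifiers rather than in the hidden layers.

import numpy as np

def softmax(scores):
    # Convert raw scores into a probability distribution (values in (0, 1) summing to 1).
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx [0.659, 0.242, 0.099]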
After 2015, we have Swish, Mish, GELU and so on, which have made smoother activation possible for ultra-deep networks... and this is not an exhaustive list of activation functions.
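For completeness, here is a sketch of two of these smoother activations, using the commonly quoted formulations (Swish as x * sigmoid(x) and the tanh approximation of GELU); treat this as illustrative rather than definitive.

import numpy as np

def swish(x):
    # Swish: x * sigmoid(x); smooth and non-monotonic near zero.
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # GELU (tanh approximation): x gated by an approximation of the Gaussian CDF.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, 0.0, 2.0])
print(swish(x))  # approx [-0.238, 0.0, 1.762]
print(gelu(x))   # approx [-0.045, 0.0, 1.955]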
Well, my constant companion ChatGPT gave me a nice idea for remembering this history, especially for people over 40 years of age (when, obviously, the neurons start dying quite rapidly).
*************************************************************************
"Some Teachers Run Like Super Geniuses"
S = Step & Sigmoid
T = Tanh
R = ReLU
L = Leaky ReLU
S = SWISH
G = GELU
**************************************************************************
Another Memory Anchor -
First they 'stepped' (binary), then made it smooth (Sigmoid), then centered it (tanh), then said 'let's forget curves, just cut' (ReLU), then fixed the 'dying neurons' (Leaky ReLU), and finally moved to 'smart, curvy activations' (Swish, GELU).