bias-in-neurons
The question
In general, why do we have a bias term in artificial neurons of ANNs (Artificial Neural Networks)?
Implications
This question touches on the mathematical and biological history of AI, and it invites us to rethink the explainability and expressivity of the building blocks of current ANNs.
Applications
I believe the question was not posed mindlessly, and the answer to it is not limited in scope. This question is the core curiosity of the KAN paper, published last week, which calls the expressive power of current neural network architectures into question. Out of 14 different scenarios, the MLP architecture outperforms KAN in only one.
KAN or not KAN illustration from [1]
My updated answers
Having read some papers on the history of AI, I had speculated that the bias term was introduced because of some experimental observation in neuroscience.
But I was wrong! In fact, the initial models of artificial neurons had no bias term at all.
I also could not pin down exactly when the bias term was introduced. Does anyone know?
Initial versions of artificial neurons, which do not have any bias term. From [2]
One explanation for why we need a bias term, which I forgot to mention in the session, is that it lets us shift the neuron's output.
If we want, for example, to predict points coming from y = 2*x + 3, we cannot do so with a single neuron without bias: with an identity activation it computes y = w*x, a line that always passes through the origin, so it can never produce the intercept of 3.
You might ask, "Well, we have enough computational power, and a moderately large MLP can surely predict y = 2*x + 3."
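To make this limitation concrete, here is a minimal sketch (not code from the original session) that finds the best possible bias-free neuron with identity activation for this line. The evenly spaced inputs in [-1, 1] and the names `w_best` and `mse_no_bias` are my own assumptions for illustration.

```python
import numpy as np

# Points on the target line y = 2x + 3 (assumed range: evenly spaced in [-1, 1]).
x = np.linspace(-1.0, 1.0, 101)
y = 2.0 * x + 3.0

# A bias-free neuron with identity activation computes y_hat = w * x.
# For this model the least-squares optimum for w has a closed form.
w_best = np.sum(x * y) / np.sum(x * x)

# Mean squared error of the best bias-free neuron.
mse_no_bias = np.mean((w_best * x - y) ** 2)

print(f"best w = {w_best:.3f}, MSE without bias = {mse_no_bias:.3f}")
```

On this symmetric input range the neuron recovers the slope of 2, but the error stays at about 9, which is exactly the square of the missing intercept.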
Good question :) I went and implemented a custom MLP with three neurons and it could fit that line.
To train our models, we need to sample from our line (y = 2x + 3). In the following code, I have implemented this part.
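The original listing is not reproduced in this text, so the following is only a minimal sketch of how such sampling might look. The function name `sample_line`, the input range [-1, 1], the sample count, the fixed seed, and the absence of noise are all assumptions on my part.

```python
import numpy as np

def sample_line(n_samples=200, low=-1.0, high=1.0, seed=0):
    """Draw n_samples noise-free points (x, y) from the line y = 2x + 3.

    The range, sample count, and seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, size=n_samples)
    y = 2.0 * x + 3.0
    return x, y

x_train, y_train = sample_line()
print(x_train.shape, y_train.shape)
```

Keeping the inputs in a small range such as [-1, 1] makes plain gradient descent stable with a moderate learning rate; a wider range would likely call for a smaller one.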
We know that a single neuron with a bias term can model our line. (Why? Because it has enough parameters, a weight and a bias, to represent any single line.) For the activation function, we need to choose the identity activation or any other activation with an unbounded range. (Why? Because our y can range from -infinity to +infinity.) OK, let's code this.
I: Identity activation function
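As a rough sketch of this single-neuron setup (my own reconstruction, not the original listing), the neuron below computes `w * x + b` with an identity activation and is trained with plain gradient descent on the mean squared error. The learning rate, step count, zero initialization, and input range are assumptions.

```python
import numpy as np

# Training data sampled from y = 2x + 3 (assumed range [-1, 1], fixed seed).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 3.0

# Single neuron with bias and identity activation: y_hat = w * x + b.
w, b = 0.0, 0.0
lr = 0.1  # assumed learning rate

for _ in range(1000):
    err = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b.
    w -= lr * 2.0 * np.mean(err * x)
    b -= lr * 2.0 * np.mean(err)

loss = np.mean((w * x + b - y) ** 2)
print(f"w = {w:.4f}, b = {b:.4f}, loss = {loss:.2e}")
```

Because the model family contains the target line exactly, the weight and bias should converge to roughly 2 and 3, and the loss should drop to nearly zero.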
As you can see, and as the loss value shows, our single neuron fits this task very well.
OK, let's design the simplest ANN without any bias term that can fit this line. Before continuing, can you try this yourself? I thought I could code this problem in less than 10 minutes, but it took an hour! It's a good example for refreshing our minds on different parts of ANNs.
Here is the illustration of the network and its code:
I: Identity activation function
S: Sigmoid activation function
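The original illustration and listing are not available in this text, so the sketch below is my best guess at a three-neuron, bias-free network that matches the legend: one hidden neuron with identity activation, one hidden neuron with sigmoid activation, and one identity output neuron. The weight names (`w1`, `w2`, `v1`, `v2`), the initial values, the learning rate, and the step count are all assumptions. The idea is that sigmoid(0) = 0.5, so a sigmoid neuron with a small input weight behaves almost like a constant and can stand in for the missing bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training data sampled from y = 2x + 3 (assumed range [-1, 1], fixed seed).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 3.0

# Hidden layer (no bias): h1 = w1 * x (identity), h2 = sigmoid(w2 * x).
# Output neuron (no bias, identity): y_hat = v1 * h1 + v2 * h2.
w1, w2, v1, v2 = 0.5, 0.5, 0.5, 0.5  # assumed initial weights
lr = 0.05                            # assumed learning rate

for _ in range(5000):
    h1 = w1 * x
    h2 = sigmoid(w2 * x)
    err = v1 * h1 + v2 * h2 - y
    # Backpropagated gradients of the mean squared error.
    g_v1 = 2.0 * np.mean(err * h1)
    g_v2 = 2.0 * np.mean(err * h2)
    g_w1 = 2.0 * np.mean(err * v1 * x)
    g_w2 = 2.0 * np.mean(err * v2 * h2 * (1.0 - h2) * x)
    v1 -= lr * g_v1
    v2 -= lr * g_v2
    w1 -= lr * g_w1
    w2 -= lr * g_w2

loss = np.mean((v1 * w1 * x + v2 * sigmoid(w2 * x) - y) ** 2)
print(f"w1={w1:.3f} w2={w2:.3f} v1={v1:.3f} v2={v2:.3f} loss={loss:.2e}")
```

In this setup the output weight `v2` should end up near 6, since 6 * sigmoid(0) = 3 supplies the intercept. The fit is only approximate whenever `w2` is not exactly zero, because the sigmoid then adds a slight curvature, which is consistent with a small but nonzero final loss.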
As shown above, we could fit our three-neuron network on this task, although with a slightly higher loss value.
Then you might ask: if we have two large MLPs, one with bias and one without, do they differ in expressive power?
If yes, why didn't we have a bias term in the initial representation of artificial neurons?
If no, why do we use them now?
My to-do list:
- Reading and implementing the KAN paper, which speaks to the above questions.
- Reading more about the history of AI
References
[1] KAN: Kolmogorov-Arnold Networks
[2] Anderson and G. McNeill. Artificial neural networks technology. Kaman Sciences Corporation, 1992.