
A Brief Introduction to Neural Networks


In this article, we will cover the key theoretical topics surrounding the Neural Network model, including:

  • Neurons and Activation Functions
  • Cost Functions
  • Gradient Descent
  • Backpropagation

A high-level understanding of these key elements will make it much easier to follow what’s happening when we begin to use TensorFlow.

The Perceptron

Before we launch straight into Neural Networks, we need to understand the individual components first, such as a single “neuron”. Artificial Neural Networks (ANNs) have a basis in biology! Let’s see how we can mimic a biological neuron with an artificial neuron, known as a perceptron. Once we’ve gone through how a simple perceptron works, we’ll see how to represent it mathematically.

Let’s start off with a biological neuron, such as a brain cell. In simplified terms, a biological neuron works in the following manner:

A neuron has dendrites that feed into the body of the cell: electrical signals get passed in through these dendrites, and a single output signal is then passed along an axon to connect to some other neuron. And that’s the basic idea.

So, an artificial neuron likewise has its inputs and outputs.

This simple model is known as a Perceptron. In this case, we have two inputs, and each input carries the value of a feature.

So, when you have your dataset, you are going to have various features, and these features can be anything from how many rooms a house has to how dark an image is, represented by some sort of pixel intensity value.

The next step is to have these inputs multiplied by some sort of weight.

So we have Weight 0 for Input 0 and Weight 1 for Input 1. Typically, the weights are initialized through some sort of random generation, so we just choose a random number for each weight. In this case, we’ll pretend the random numbers chosen are 0.5 and -1.

So now these inputs are going to be multiplied by the weights. And that ends up looking like this:

The next step is to take these results and pass them into an Activation Function. An Activation Function calculates a “weighted sum” of its inputs, adds a bias, and then decides whether the neuron should “fire” or not.
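The steps above can be sketched in a few lines of Python. This is a minimal sketch, not a full implementation: the example weights 0.5 and -1 come from the walkthrough above, while the input values, the zero bias, and the step-function threshold are illustrative assumptions.

```python
def step(x, threshold=0.0):
    """Binary step activation: fire (1) if the weighted sum clears the threshold."""
    return 1 if x >= threshold else 0

def perceptron(inputs, weights, bias=0.0):
    # Multiply each input by its weight, sum the results, and add the bias
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    # Pass the weighted sum through the activation function
    return step(total)

# Two inputs with the example weights 0.5 and -1
print(perceptron([1, 1], [0.5, -1]))  # weighted sum = -0.5, so the neuron does not fire
```

With inputs 1 and 1, the weighted sum is 1 × 0.5 + 1 × (-1) = -0.5, which falls below the threshold, so the output is 0.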

Deep Networks

We’ve seen how a single perceptron behaves; now let’s expand this concept to the idea of a neural network. Let’s see how to connect many perceptrons together and then how to represent this mathematically. A network of multiple perceptrons looks like this:

Here we can see various layers of single perceptrons connected to each other through their inputs and outputs. In this case, we have an input layer on the left (shown in purple), two hidden layers, and an output layer on the right. Hidden layers are the layers in between the input layer and the output layer. Essentially, hidden layers are the layers that don’t get to “see” the outside: neither the inputs on the far left nor the outputs on the far right. When there are several hidden layers (often three or more), the network is called a “Deep Network”.
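The layered structure described above can be sketched as a forward pass through a stack of weight matrices. This is a sketch under assumptions: the layer sizes, the random weight initialization, and the choice of ReLU at every layer are all illustrative, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Common nonlinear activation; discussed in the activation-function section below
    return np.maximum(0, x)

def forward(x, layers):
    """Pass an input vector through each (weights, bias) layer in turn."""
    for W, b in layers:
        x = relu(W @ x + b)  # weighted sum plus bias, then activation
    return x

# Input layer with 3 features, two hidden layers of 4 neurons, output layer of 2
sizes = [3, 4, 4, 2]
layers = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

out = forward(np.array([1.0, 0.5, -0.2]), layers)
print(out.shape)  # one value per output neuron
```

Each layer’s output becomes the next layer’s input, which is exactly the “connected through their inputs and outputs” idea in the diagram.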

Activation Functions

The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.

3 Types of Activation Functions

Binary Step Function

A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; below the threshold, it sends nothing.

The problem with a step function is that it does not allow multi-value outputs—for example, it cannot support classifying the inputs into one of several categories.

Linear Activation Function

A linear activation function takes the form:

A = cx

It takes the inputs, multiplies each by the weight for its neuron, and creates an output signal proportional to the input. In one sense, a linear function is better than a step function because it allows multiple output values, not just yes and no.

Non-Linear Activation Functions

Modern neural network models use non-linear activation functions. They allow the model to create complex mappings between the network’s inputs and outputs, which are essential for learning and modeling complex data, such as images, video, audio, and data sets which are non-linear or have high dimensionality.

Almost any process imaginable can be represented as a functional computation in a neural network, provided that the activation function is non-linear.

Non-linear functions address the problems of a linear activation function:

  1. They allow backpropagation because they have a derivative function which is related to the inputs.
  2. They allow “stacking” of multiple layers of neurons to create a deep neural network. Multiple hidden layers of neurons are needed to learn complex data sets with high levels of accuracy.

Common Nonlinear Activation Functions and How to Choose One

Sigmoid / Logistic

  • Smooth gradient, preventing “jumps” in output values.
  • Output values bound between 0 and 1, normalizing the output of each neuron.
  • Clear predictions—For X above 2 or below -2, it tends to bring the Y value (the prediction) to the edge of the curve, very close to 1 or 0. This enables clear predictions.
  • Vanishing gradient—for very high or very low values of X, there is almost no change to the prediction, causing a vanishing gradient problem. This can result in the network refusing to learn further, or being too slow to reach an accurate prediction.
  • Outputs not zero centered.
  • Computationally expensive.

TanH / Hyperbolic Tangent

  • Zero centered—making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
  • Otherwise like the Sigmoid function.

ReLU (Rectified Linear Unit)

  • Computationally efficient—allows the network to converge very quickly
  • Non-linear—although it looks like a linear function, ReLU has a derivative function and allows for backpropagation
  • The Dying ReLU problem—when inputs are negative, the gradient of the function is zero, so the network cannot perform backpropagation through those neurons and cannot learn.

Leaky ReLU

  • Prevents dying ReLU problem—this variation of ReLU has a small positive slope in the negative area, so it does enable backpropagation, even for negative input values
  • Otherwise like ReLU
  • Results not consistent—leaky ReLU does not provide consistent predictions for negative input values.

Parametric ReLU

  • Allows the negative slope to be learned—unlike leaky ReLU, this function treats the slope of the negative part as a learnable parameter α. It is therefore possible to perform backpropagation and learn the most appropriate value of α.
  • Otherwise like ReLU
  • May perform differently for different problems.


Softmax

  • Able to handle multiple classes, where other activation functions handle only one—it normalizes the outputs for each class between 0 and 1 and divides by their sum, giving the probability of the input value belonging to a specific class.
  • Useful for output neurons—typically Softmax is used only for the output layer, for neural networks that need to classify inputs into multiple categories.
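The nonlinear activation functions above can be sketched in a few lines of NumPy. This is a sketch, not a library implementation; the leaky-ReLU slope α = 0.01 is an assumed (though common) default.

```python
import numpy as np

def sigmoid(x):
    # Squashes any input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # Zero-centered variant, range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope in the negative region to keep the gradient from dying
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    # Subtract the max for numerical stability; outputs sum to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5, the midpoint of the curve
print(softmax(np.array([1.0, 2.0, 3.0])))  # a probability distribution over 3 classes
```

Note how softmax, unlike the others, operates on a whole vector at once: it is the usual choice for a multi-class output layer, while ReLU variants dominate in hidden layers.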

Cost Function

A cost function measures the performance of a Machine Learning model for given data. It quantifies the error between predicted values and expected values and presents it in the form of a single real number. Depending on the problem, the cost function can be formed in many different ways. The purpose of the cost function is to be either:

  • Minimized – the returned value is usually called cost, loss, or error. The goal is to find the values of the model parameters for which the cost function returns as small a number as possible.
  • Maximized – the value it yields is named a reward. The goal is to find the values of the model parameters for which the returned number is as large as possible.
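A classic example of a cost function to be minimized is mean squared error. This sketch uses made-up prediction values purely for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    # Average of the squared differences between expected and predicted values
    return np.mean((y_true - y_pred) ** 2)

perfect = mse(np.array([1.0, 2.0]), np.array([1.0, 2.0]))
off     = mse(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
print(perfect)  # 0.0 — a perfect fit costs nothing
print(off)      # 2.5 — errors of 1 and 2 average to (1 + 4) / 2
```

The single number it returns is exactly what gradient descent, described next, tries to drive down.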

Gradient Descent

To explain Gradient Descent I’ll use the classic mountaineering example.

Suppose you are at the top of a mountain, and you have to reach a lake at the lowest point of the mountain (a.k.a. the valley). The twist is that you are blindfolded, with zero visibility of where you are headed. So, what approach will you take to reach the lake?

The best way is to check the ground near you and observe where the land tends to descend. This will give an idea in what direction you should take your first step. If you follow the descending path, it is very likely you would reach the lake.


Let us now map this scenario in mathematical terms.

Suppose we want to find the best parameters θ1 and θ2 for our learning algorithm. Similar to the analogy above, we find comparable mountains and valleys when we plot our “cost space”. The cost space is nothing but how our algorithm performs when we choose particular values for the parameters.

So on the y-axis, we have the cost J(θ), plotted against our parameters θ1 and θ2 on the x-axis and z-axis respectively. Hills are represented by the red regions, which have a high cost, and valleys by the blue regions, which have a low cost.
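The blindfolded-descent idea can be sketched on a toy one-parameter cost, J(θ) = (θ − 3)², whose minimum sits at θ = 3. The starting point and learning rate here are arbitrary assumptions; “checking the ground near you” corresponds to evaluating the gradient, and each step moves downhill:

```python
def grad(theta):
    # Derivative of the toy cost J(theta) = (theta - 3)^2
    return 2 * (theta - 3)

theta = 0.0          # arbitrary starting point on the "mountain"
learning_rate = 0.1  # size of each downhill step

for _ in range(100):
    theta -= learning_rate * grad(theta)  # step in the direction of steepest descent

print(round(theta, 4))  # converges to 3.0, the bottom of the valley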

Back Propagation

Back-propagation is the essence of neural net training. It is the practice of fine-tuning the weights of a neural net based on the error rate (i.e. loss) obtained in the previous epoch (i.e. iteration). Proper tuning of the weights ensures lower error rates, making the model reliable by increasing its generalization.
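A one-neuron sketch can make the weight-tuning loop concrete: run a forward pass, measure the squared error, use the chain rule to get the loss gradient with respect to the weight, and nudge the weight downhill. The input, target, starting weight, and learning rate are all illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, target = 1.5, 1.0  # one input feature and its expected output
w = 0.2               # randomly initialized weight
lr = 0.5              # learning rate

for epoch in range(50):
    pred = sigmoid(w * x)            # forward pass
    loss = (pred - target) ** 2      # squared error for this epoch
    # backward pass, via the chain rule:
    # dL/dw = dL/dpred * dpred/dz * dz/dw
    dloss_dw = 2 * (pred - target) * pred * (1 - pred) * x
    w -= lr * dloss_dw               # fine-tune the weight

print(loss)  # far smaller than the starting loss
```

Each epoch repeats the same forward-then-backward cycle the paragraph describes; real networks do this for every weight in every layer simultaneously.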


In this article, we got a very brief introduction to Neural Networks and how they work. We discussed:

  • Perceptron
  • Activation Functions
  • Cost Function
  • Gradient Descent, and
  • Back Propagation

We’ll deal with the coding section in the next article. Till then, stay tuned and Happy Reading!

Surya Remanan

Budding Data Scientist and a Student. Loves to blog about Data Science.
