Neural Networks: Types and Applications


Types of DL networks

The most famous types of deep learning networks are discussed in this section: recursive neural networks (RvNNs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs). RvNNs and RNNs are explained only briefly, while CNNs are explained in depth because of their importance; furthermore, the CNN is the most widely used of these networks across applications.


Recursive neural networks

RvNNs can make predictions over hierarchical structures and classify outputs using compositional vectors. Recursive auto-associative memory (RAAM) was the primary inspiration for the development of the RvNN. The RvNN architecture is designed to process objects with arbitrarily shaped structures, such as graphs or trees. It generates a fixed-width distributed representation from a variable-size recursive data structure. The network is trained with the back-propagation through structure (BTS) learning scheme, which follows the same technique as the general back-propagation algorithm while supporting a tree-like structure. Auto-association trains the network to reproduce the input-layer pattern at the output layer. RvNNs are highly effective in the NLP context. Socher et al. introduced an RvNN architecture designed to process inputs from a variety of modalities. These authors demonstrated two applications: classifying natural language sentences, where each sentence is split into words, and classifying natural images, where each image is separated into segments of interest. The RvNN computes a plausibility score for merging every candidate pair of units and constructs a syntactic tree: the pair with the largest score is merged into a compositional vector. Following every merge, the RvNN generates (a) a larger region covering multiple units, (b) a compositional vector of that region, and (c) a class label (for instance, a noun phrase becomes the class label for the new region if the two merged units are noun words). The compositional vector for the entire region is the root of the RvNN tree structure. An example RvNN tree is shown in Fig. 5. RvNNs have been employed in several applications.
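
As a rough illustration of the greedy merge-and-score procedure described above, the following NumPy sketch composes pairs of child vectors into parent vectors and repeatedly merges the highest-scoring adjacent pair. The weight names (`W`, `b`, `w_score`), dimensions, and random inputs are purely illustrative assumptions, not Socher et al.'s implementation.

```python
import numpy as np

# Minimal sketch of one RvNN composition step (illustrative only):
# two child representations are merged into a parent vector, and a
# plausibility score for the merge is computed from that vector.
rng = np.random.default_rng(0)
d = 8                                  # dimensionality of every node vector
W = rng.standard_normal((d, 2 * d))    # composition weights (hypothetical)
b = np.zeros(d)                        # composition bias
w_score = rng.standard_normal(d)       # scoring weights (hypothetical)

def compose(left, right):
    """Merge two child vectors into a parent vector and score the merge."""
    parent = np.tanh(W @ np.concatenate([left, right]) + b)
    score = w_score @ parent
    return parent, score

# Greedy tree construction: repeatedly merge the adjacent pair with the
# highest score until a single root vector remains.
nodes = [rng.standard_normal(d) for _ in range(4)]   # e.g. word vectors
while len(nodes) > 1:
    scored = [(compose(nodes[i], nodes[i + 1]), i) for i in range(len(nodes) - 1)]
    (parent, _), i = max(scored, key=lambda t: t[0][1])
    nodes[i:i + 2] = [parent]          # replace the pair with its parent

root = nodes[0]                        # compositional vector of the whole input
```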

Fig. 5 An example of RvNN tree



Recurrent neural networks

RNNs are a widely used and familiar class of algorithms in the discipline of DL. They are mainly applied to speech processing and NLP. Unlike conventional networks, an RNN exploits sequential data; this feature is fundamental to a range of applications, because the structure embedded in the sequence of the data delivers valuable information. For instance, to determine the meaning of a specific word in a sentence, it is important to understand the context of that sentence. Thus, it is possible to consider the RNN as a unit of short-term memory, where x represents the input layer, y is the output layer, and s represents the state (hidden) layer. For a given input sequence, a typical unfolded RNN diagram is illustrated in Fig. 6. Pascanu et al. introduced three different deep RNN techniques, namely "Hidden-to-Hidden", "Hidden-to-Output", and "Input-to-Hidden", and proposed a deep RNN based on these three techniques that lessens the learning difficulty in deep networks and brings the benefits of a deeper RNN.

Fig. 6 Typical unfolded RNN diagram


However, one of the main issues with this approach is the RNN's sensitivity to the exploding and vanishing gradient problems. More specifically, during training, repeated multiplication of several large or small derivatives may cause the gradients to grow or decay exponentially. As new inputs arrive, the network stops considering the initial ones, so this sensitivity decays over time. This issue can be handled using the LSTM, which provides recurrent connections to memory blocks in the network. Each memory block contains several memory cells, which can store the temporal states of the network, together with gated units that control the flow of information. In very deep networks, residual connections can also considerably reduce the impact of the vanishing gradient issue, as explained in later sections. CNNs are considered more powerful than RNNs, which offer less feature compatibility in comparison.
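
To make the recurrence behind Fig. 6 concrete, here is a minimal NumPy sketch of a vanilla (unfolded) RNN forward pass; the weight names (`W_x`, `W_s`, `W_y`) and sizes are our own illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the unfolded RNN of Fig. 6 (illustrative names and sizes).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 16, 3
W_x = rng.standard_normal((n_hidden, n_in)) * 0.1       # input-to-hidden
W_s = rng.standard_normal((n_hidden, n_hidden)) * 0.1   # hidden-to-hidden (recurrent)
W_y = rng.standard_normal((n_out, n_hidden)) * 0.1      # hidden-to-output

def rnn_forward(xs):
    """Unroll the recurrence s_t = tanh(W_x x_t + W_s s_{t-1}); y_t = W_y s_t."""
    s = np.zeros(n_hidden)
    outputs = []
    for x in xs:                      # one step per element of the sequence
        s = np.tanh(W_x @ x + W_s @ s)
        outputs.append(W_y @ s)
    return outputs

ys = rnn_forward([rng.standard_normal(n_in) for _ in range(5)])
# During back-propagation through time, gradients are multiplied by W_s at
# every step, which is the source of the vanishing/exploding gradient problem.
```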


Convolutional neural networks

In the field of DL, the CNN is the most famous and commonly employed algorithm. The main advantage of the CNN over its predecessors is that it automatically identifies the relevant features without any human supervision. CNNs have been extensively applied in a range of fields, including computer vision, speech processing, and face recognition. Like a conventional neural network, the structure of the CNN was inspired by neurons in human and animal brains; more specifically, it simulates the complex sequence of cells that forms the visual cortex in a cat's brain. Goodfellow et al. identified three key benefits of the CNN: equivariant representations, sparse interactions, and parameter sharing. Unlike conventional fully connected (FC) networks, the CNN employs shared weights and local connections to make full use of 2D input-data structures such as image signals. This operation uses an extremely small number of parameters, which both simplifies the training process and speeds up the network, and it mirrors the behavior of visual cortex cells: these cells sense only small regions of a scene rather than the whole scene (i.e., they spatially extract the local correlation available in the input, like local filters over the input).

A commonly used type of CNN, which is similar to the multi-layer perceptron (MLP), consists of numerous convolution layers preceding sub-sampling (pooling) layers, while the ending layers are FC layers. An example of CNN architecture for image classification is illustrated in Fig. 7.

Fig. 7 An example of CNN architecture for image classification


The input x of each layer in a CNN model is organized in three dimensions: height, width, and depth, or m \times m \times r, where the height (m) is equal to the width. The depth is also referred to as the channel number; for example, in an RGB image, the depth (r) is equal to three. Each convolutional layer contains several kernels (filters), denoted by k, which also have three dimensions (n \times n \times q), similar to the input image; here, however, n must be smaller than m, while q is either equal to or smaller than r. The kernels are the basis of the local connections, which share the same parameters (bias b^{k} and weight W^{k}) to generate k feature maps h^{k}, each of size (m-n+1) \times (m-n+1), and are convolved with the input as mentioned above. The convolution layer computes a dot product between its input and the weights, as in Eq. 1, much like a conventional neural network, except that its inputs are small regions of the initial image rather than the full input. By applying a nonlinearity (activation function) to the convolution-layer output, we obtain the following:

h^{k}= f(W^{k}*x+ b^{k} ) (1)

The next step is down-sampling every feature map in the sub-sampling (pooling) layers. This reduces the number of network parameters, which accelerates the training process and in turn helps to handle the overfitting issue. For all feature maps, the pooling function (e.g. max or average) is applied to an adjacent area of size p \times p, where p is the kernel size. Finally, the FC layers receive the mid- and low-level features and create the high-level abstraction; these represent the last-stage layers, as in a typical neural network. The classification scores are generated by the final layer [e.g. using support vector machines (SVMs) or softmax]; for a given instance, every score represents the probability of a specific class.
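
As a concrete illustration of the conv–pool–FC pipeline of Fig. 7, the sketch below builds a small image classifier. It assumes PyTorch as the framework and uses arbitrary layer sizes, so it should be read as an example of the structure rather than a reference implementation.

```python
import torch
from torch import nn

# A minimal sketch of the conv -> pool -> ... -> FC pipeline of Fig. 7,
# written in PyTorch purely for illustration (framework and layer sizes
# are our assumptions, not part of the original description).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),   # input: 3-channel (RGB) image
    nn.ReLU(),                         # non-linearity of Eq. 1
    nn.MaxPool2d(2),                   # sub-sampling (pooling) layer
    nn.Conv2d(16, 32, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                      # feature maps -> vector for the FC layer
    nn.Linear(32 * 5 * 5, 10),         # FC layer producing 10 class scores
)

x = torch.randn(1, 3, 32, 32)          # one 32x32 RGB image
scores = model(x)                      # unnormalized class scores
probs = scores.softmax(dim=1)          # class probabilities
```

Note that the `32 * 5 * 5` input size of the FC layer follows directly from the (m - n + 1) shrinking rule above combined with the two 2 x 2 pooling steps applied to a 32 x 32 input.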


Benefits of employing CNNs

The benefits of using CNNs over other traditional neural networks in the computer vision environment are listed as follows:

  1. The main reason to consider CNN is the weight sharing feature, which reduces the number of trainable network parameters and in turn helps the network to enhance generalization and to avoid overfitting.
  2. Concurrently learning the feature extraction layers and the classification layer causes the model output to be both highly organized and highly reliant on the extracted features.
  3. Large-scale network implementation is much easier with CNN than with other neural networks.


CNN layers

The CNN architecture consists of a number of layers (or so-called multi-building blocks). Each layer in the CNN architecture, including its function, is described in detail below.

  1. Convolutional Layer: In CNN architecture, the most significant component is the convolutional layer. It consists of a collection of convolutional filters (so-called kernels). The input image, expressed as an N-dimensional matrix, is convolved with these filters to generate the output feature map.

    • Kernel definition: A grid of discrete numbers or values describes the kernel. Each value is called a kernel weight. Random numbers are assigned as the kernel weights at the beginning of the CNN training process (several other weight-initialization methods also exist). These weights are then adjusted at each training epoch; thus, the kernel learns to extract significant features.
    • Convolutional Operation: Initially, the CNN input format is described. While the input of a traditional neural network is a vector, the input of a CNN is a multi-channeled image: a gray-scale image has a single channel, while an RGB image has three. To understand the convolutional operation, consider a 4 x 4 gray-scale image with a 2 x 2 randomly weight-initialized kernel. First, the kernel slides over the whole image horizontally and vertically. At each position, the dot product between the kernel and the overlapping region of the input image is determined: their corresponding values are multiplied and then summed up to create a single scalar value. The whole process is repeated until no further sliding is possible. The calculated dot-product values form the output feature map (a numeric sketch of this sliding-window operation is given after the convolutional-layer benefits below). Figure 8 graphically illustrates the primary calculations executed at each step. In this figure, the light green color represents the 2 x 2 kernel, while the light blue color represents the similarly sized area of the input image. The two are multiplied element-wise; the result of summing up the product values (marked in light orange) represents one entry of the output feature map.

    Fig. 8 The primary calculations executed at each step of convolutional layer


    However, no padding was applied to the input image in the previous example, while a stride of one (the step size selected over all vertical and horizontal locations) was applied to the kernel. Note that other stride values can also be used; increasing the stride value yields a feature map of lower dimensions.

    On the other hand, padding is highly significant for preserving the information at the borders of the input image; without it, the border features are washed out very quickly. Applying padding increases the size of the input image and, in turn, the size of the output feature map.

    Core benefits of convolutional layers:

    • Sparse Connectivity: In FC neural networks, each neuron of a layer connects to all neurons in the following layer. By contrast, in CNNs only a few weights exist between two adjacent layers. Thus, the number of required weights or connections is small, as is the memory required to store them; hence, this approach is memory-efficient. In addition, the full matrix multiplication used in FC networks is computationally much more costly than the localized dot-product operations in a CNN.
    • Weight Sharing: In a CNN, no dedicated weights are allocated between every pair of neurons in neighboring layers; instead, the same set of kernel weights operates over all positions of the input matrix. Learning a single group of weights for the whole input significantly decreases the required training time and other costs, as it is not necessary to learn additional weights for each neuron.
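
The following NumPy sketch works through the sliding-window convolution promised above: a 4 x 4 gray-scale image and a 2 x 2 kernel with stride 1 and no padding. The pixel and kernel values are arbitrary examples.

```python
import numpy as np

# Sketch of the sliding-window dot product described above:
# a 4x4 gray-scale image convolved with a 2x2 kernel (stride 1, no padding).
image = np.array([[1, 2, 0, 1],
                  [3, 1, 1, 0],
                  [0, 2, 2, 3],
                  [1, 0, 1, 2]], dtype=float)
kernel = np.array([[1, 0],
                   [0, -1]], dtype=float)    # randomly chosen example weights

m, n = image.shape[0], kernel.shape[0]
out = np.zeros((m - n + 1, m - n + 1))       # output size (m - n + 1)
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        patch = image[i:i + n, j:j + n]      # region under the kernel
        out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum

print(out)   # the 3x3 output feature map
```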

  2. Pooling Layer: The main task of the pooling layer is the sub-sampling of the feature maps generated by the convolutional operations. In other words, this approach shrinks large feature maps into smaller ones while retaining most of the dominant information (or features) at every step of the pooling stage. In a similar manner to the convolutional operation, both the stride and the kernel size are assigned before the pooling operation is executed. Several types of pooling methods are available for use in various pooling layers, including tree pooling, gated pooling, average pooling, min pooling, max pooling, global average pooling (GAP), and global max pooling. The most familiar and frequently utilized pooling methods are max, min, and GAP pooling. Figure 9 illustrates these three pooling operations.

    Fig. 9 Three types of pooling operations


    However, the pooling layer's main shortfall is that the overall CNN performance can sometimes decrease as a result: the pooling layer helps the CNN determine whether or not a certain feature is present in the input image, but it discards the precise location of that feature. Thus, the CNN model can miss relevant spatial information.
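
Below is a minimal NumPy sketch of max, average, and global average pooling with a 2 x 2 kernel and stride 2, assuming a single feature map with arbitrary values.

```python
import numpy as np

# Minimal sketch of max and average pooling over non-overlapping 2x2 regions
# (kernel size p = 2, stride 2), applied to one feature map.
fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [3, 4, 1, 8]], dtype=float)

p = 2
h, w = fmap.shape
max_pooled = np.zeros((h // p, w // p))
avg_pooled = np.zeros((h // p, w // p))
for i in range(0, h, p):
    for j in range(0, w, p):
        region = fmap[i:i + p, j:j + p]
        max_pooled[i // p, j // p] = region.max()   # keep the dominant value
        avg_pooled[i // p, j // p] = region.mean()  # keep the average value

# Global average pooling (GAP) reduces the whole map to a single value:
gap = fmap.mean()
```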

  3. Activation Function (non-linearity): Mapping the input to the output is the core function of every activation function in every type of neural network. The input value of a neuron is the weighted sum of its inputs plus its bias (if present); the activation function then decides whether or not the neuron fires with reference to that value by producing the corresponding output.

    Non-linear activation layers are employed after all layers with weights (so-called learnable layers, such as FC layers and convolutional layers) in CNN architecture. This non-linear behavior of the activation layers means that the mapping from input to output will be non-linear; moreover, these layers give the CNN the ability to learn more complicated mappings. The activation function must also be differentiable, which is an extremely significant property, as it allows error back-propagation to be used to train the network. The following types of activation functions are most commonly used in CNNs and other deep neural networks.

    Sigmoid: The input of this activation function is real numbers, while the output is restricted to between zero and one. The sigmoid function curve is S-shaped and can be represented mathematically by Eq. 2.

    f(x)_{sigm}=\frac{1}{1+e^{-x}} (2)

    Tanh: It is similar to the sigmoid function, as its input is real numbers, but the output is restricted to between − 1 and 1. Its mathematical representation is in Eq. 3.

    f(x)_{tanh}=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} (3)

    ReLU: The most commonly used function in the CNN context. It passes positive inputs unchanged and sets all negative values to zero. Its main benefit over the others is a lower computational load. Its mathematical representation is in Eq. 4.

    f(x)_{ReLU}= max(0,x) (4)

    Occasionally, significant issues may occur when using ReLU. For instance, consider an error back-propagation algorithm with a large gradient flowing through it. Passing this gradient through the ReLU function may update the weights in a way that ensures the neuron is never activated again. This issue is referred to as the "Dying ReLU" problem. Some ReLU alternatives exist to solve such issues; several of them are discussed below.

    Leaky ReLU: Instead of zeroing out negative inputs as ReLU does, this activation function down-scales them so that they are never entirely ignored. It is employed to solve the Dying ReLU problem. Leaky ReLU can be represented mathematically as in Eq. 5.

    f(x)_{Leaky\,ReLU}= \begin{cases} x, & x > 0\\ mx, & x \le 0 \end{cases} (5)

    Note that the leak factor is denoted by m. It is commonly set to a very small value, such as 0.001.

    • Noisy ReLU: This function employs a Gaussian distribution to make ReLU noisy. It can be represented mathematically as in Eq. 6

    f(x)_{Noisy\,ReLU}= max(0, x+Y),\quad with\; Y \sim N (0,\sigma (x)) (6)

    • Parametric Linear Units: This is mostly the same as Leaky ReLU. The main difference is that the leak factor in this function is updated through the model training process. The parametric linear unit can be represented mathematically as in Eq. 7.

    f(x)_{Parametric\,Linear}= \begin{cases} x, & x > 0\\ ax, & x \le 0 \end{cases} (7)

    Note that the learnable weight is denoted as a.
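
For reference, the activation functions of Eqs. 2-5 and 7 can be written directly in NumPy as follows; the leak factor and test values are illustrative.

```python
import numpy as np

# Direct translations of Eqs. 2-5 and 7 (illustrative constants).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # Eq. 2, output in (0, 1)

def tanh(x):
    return np.tanh(x)                           # Eq. 3, output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                   # Eq. 4

def leaky_relu(x, m=0.001):
    return np.where(x > 0, x, m * x)            # Eq. 5, small leak factor m

def parametric_relu(x, a):
    return np.where(x > 0, x, a * x)            # Eq. 7, a is learned in training

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x))
```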

  4. Fully Connected Layer: Commonly, this layer is located at the end of each CNN architecture. Inside this layer, each neuron is connected to all neurons of the previous layer, the so-called Fully Connected (FC) approach. It is utilized as the CNN classifier. It follows the basic method of the conventional multiple-layer perceptron neural network, as it is a type of feed-forward ANN. The input of the FC layer comes from the last pooling or convolutional layer. This input is in the form of a vector, which is created from the feature maps after flattening. The output of the FC layer represents the final CNN output, as illustrated in Fig. 10.

    Fig. 10 Fully connected layer


  5. Loss Functions: The previous section presented the various layer types of the CNN architecture. The final classification is obtained from the output layer, which is the last layer of the CNN architecture. Loss functions are utilized in the output layer to calculate the prediction error across the training samples in the CNN model. This error reveals the difference between the actual output and the predicted one; it is then minimized during the CNN learning process.

    However, two parameters are used by the loss function to calculate the error. The CNN estimated output (referred to as the prediction) is the first parameter. The actual output (referred to as the label) is the second parameter. Several types of loss function are employed in various problem types. The following concisely explains some of the loss function types.

    (a) Cross-Entropy or Softmax Loss Function: This function is commonly employed for measuring the CNN model performance. It is also referred to as the log loss function. Its output is a probability p \in [0, 1]. In addition, it is usually employed as a substitute for the square error loss function in multi-class classification problems. In the output layer, it employs softmax activations to generate the output as a probability distribution. The mathematical representation of the output class probability is Eq. 8.

    p_{i}= \frac{e^{a_{i}}}{\sum _{k=1}^{N} e^{a_{k}}} (8)

    Here, e^{a_{i}} represents the non-normalized output from the preceding layer, while N represents the number of neurons in the output layer. Finally, the mathematical representation of the cross-entropy loss function is Eq. 9.

    H(p,y)=-\sum _{i=1}^{N} y_{i}\log (p_{i}), \quad i \in [1,N] (9)

    (b) Euclidean Loss Function: This function is widely used in regression problems; it is also known as the mean square error. The mathematical expression of the estimated Euclidean loss is Eq. 10.

    H(p,y)=\frac{1}{2N}\sum _{i=1}^{N} (p_{i}-y_{i})^{2} (10)

    (c) Hinge Loss Function: This function is commonly employed in binary classification problems and in maximum-margin-based classification; it is most notably used for SVMs, in which the optimizer attempts to maximize the margin around the two target classes. Its mathematical formula is Eq. 11.

    H(p,y)=\sum _{i=1}^{N} max (0, m-(2y_{i}-1)p_{i}) (11)

    The margin m is commonly set to 1. Moreover, the predicted output is denoted as p_i, while the desired output is denoted as y_i.
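
The loss functions of Eqs. 8-11 translate directly into the following NumPy sketch for a single training instance; the scores and one-hot label are arbitrary examples.

```python
import numpy as np

# Sketches of the loss functions in Eqs. 8-11 for a single training instance.
def softmax(a):
    e = np.exp(a - a.max())                     # Eq. 8 (shifted for stability)
    return e / e.sum()

def cross_entropy(p, y):
    return -np.sum(y * np.log(p))               # Eq. 9, y is a one-hot label

def euclidean_loss(p, y):
    return np.sum((p - y) ** 2) / (2 * len(p))  # Eq. 10 (mean square error)

def hinge_loss(p, y, m=1.0):
    # Eq. 11: y holds 0/1 labels, so (2y - 1) maps them to -1/+1.
    return np.sum(np.maximum(0.0, m - (2 * y - 1) * p))

a = np.array([2.0, 1.0, 0.1])                   # unnormalized scores
y = np.array([1.0, 0.0, 0.0])                   # one-hot ground truth
p = softmax(a)
print(cross_entropy(p, y), euclidean_loss(p, y), hinge_loss(a, y))
```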


Regularization to CNN

For CNN models, over-fitting represents the central obstacle to achieving well-behaved generalization. A model is said to be over-fitted when it performs especially well on the training data but fails on the test data (unseen data), as explained further in a later section. An under-fitted model is the opposite; this case occurs when the model does not learn a sufficient amount from the training data. A model is referred to as "just-fitted" if it performs well on both training and testing data. These three cases are illustrated in Fig. 11. Various intuitive concepts are used by regularization to help avoid over-fitting; more details about over-fitting and under-fitting are discussed in later sections.

  1. Dropout: This is a widely utilized technique for improving generalization. During each training epoch, neurons are randomly dropped. In doing this, the feature-selection power is distributed equally across the whole group of neurons, and the model is forced to learn different independent features. A dropped neuron does not take part in forward- or back-propagation during training. By contrast, the full-scale network is used to perform prediction during testing.
  2. Drop-Weights: This method is highly similar to dropout. In each training epoch, the connections between neurons (weights) are dropped rather than dropping the neurons; this represents the only difference between drop-weights and dropout.
  3. Data Augmentation: Training the model on a sizeable amount of data is the easiest way to avoid over-fitting. To achieve this, data augmentation is used: several techniques are applied to artificially expand the size of the training dataset. More details can be found in a later section, which describes data augmentation techniques.
  4. Batch Normalization: This method normalizes the output activations of a layer so that they follow a unit Gaussian distribution: subtracting the mean and dividing by the standard deviation normalizes the output at each layer. While it is possible to consider this a pre-processing task at every layer in the network, it is also differentiable and can be integrated with other networks. In addition, it is employed to reduce the "internal covariate shift" of the activation layers, defined as the variation of the activation distribution in each layer. This shift becomes very high due to the continuous weight updating through training, and may be aggravated if the samples of the training data are gathered from numerous dissimilar sources (for example, day and night images). Thus, the model consumes extra time to converge, and in turn, the time required for training also increases. To resolve this issue, a layer representing the batch normalization operation is applied in the CNN architecture (a minimal sketch of dropout and batch normalization is given at the end of this subsection).

    The advantages of utilizing batch normalization are as follows:

    • It prevents the problem of vanishing gradient from arising.
    • It can effectively control the poor weight initialization.
    • It significantly reduces the time required for network convergence (for large-scale datasets, this will be extremely useful).
    • It reduces the dependency of training on the choice of hyper-parameters.
    • It reduces the chances of over-fitting, since it has a mild regularization effect.
Fig. 11 Over-fitting and under-fitting issues

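
As promised above, here is a minimal NumPy sketch of dropout and batch normalization at training time; the dropout rate, batch size, and scale/shift parameters are illustrative assumptions.

```python
import numpy as np

# Minimal sketches of dropout and batch normalization as described above
# (training-time behaviour only; constants are illustrative).
rng = np.random.default_rng(0)

def dropout(activations, rate=0.5):
    """Randomly zero a fraction of neurons and rescale the rest (inverted dropout)."""
    mask = rng.random(activations.shape) > rate
    return activations * mask / (1.0 - rate)

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = rng.standard_normal((32, 64))    # 32 samples, 64 activations each
normalized = batch_norm(batch)
dropped = dropout(normalized)
```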



Optimizer selection

This section discusses the CNN learning process. Two major issues are included in the learning process: the first issue is the learning algorithm selection (optimizer), while the second issue is the use of many enhancements (such as AdaDelta, Adagrad, and momentum) along with the learning algorithm to enhance the output.

The core purpose of all supervised learning algorithms is to minimize the error (the variation between the actual and predicted output), i.e. a loss function defined over numerous learnable parameters (e.g. biases and weights). Gradient-based learning techniques are the usual choice for a CNN. The network parameters should be updated through all training epochs, with the network searching for the locally optimal solution in each epoch in order to minimize the error.

The learning rate is defined as the step size of the parameter update, while a training epoch represents one complete pass of the parameter-update process over the whole training dataset. Note that, although the learning rate is a hyper-parameter, it must be selected carefully so that it does not adversely affect the learning process.

Gradient Descent or gradient-based learning algorithm: To minimize the training error, this algorithm repeatedly updates the network parameters in every training epoch. More specifically, to update the parameters correctly, it computes the gradient (slope) of the objective function by applying a first-order derivative with respect to the network parameters. Next, each parameter is updated in the direction opposite to the gradient in order to reduce the error. The parameter-updating process is performed through network back-propagation, in which the gradient at every neuron is back-propagated to all neurons in the preceding layer. The mathematical representation of this operation is Eq. 12.

w_{ij}^{t}=w_{ij}^{t-1}-\Delta w_{ij}^{t},\quad \Delta w_{ij}^{t}=\eta *\frac{\partial E}{\partial w_{ij}} (12)

The final weight in the current training epoch t is denoted by w_{ij}^{t}, while the weight in the preceding epoch t-1 is denoted by w_{ij}^{t-1}. The learning rate is \eta and the prediction error is E. Several variants of the gradient-based learning algorithm are available and commonly employed; these include the following:

  1. Batch Gradient Descent (BGD): In this technique, the network parameters are updated only once per epoch, after the entire training dataset has passed through the network. In more depth, it calculates the gradient over the whole training set and subsequently uses this gradient to update the parameters. For a small dataset, the CNN model converges faster and creates an extra-stable gradient using BGD. Since the parameters are changed only once per training epoch, it requires a substantial amount of resources. By contrast, for a large training dataset, additional time is required for convergence, and it may converge to a local optimum (for non-convex instances).
  2. Stochastic Gradient Descent (SGD): In this technique, the parameters are updated after each training sample; it is preferable to randomly shuffle the training samples in every epoch before training. For a large training dataset, this technique is both more memory-effective and much faster than BGD. However, because it updates so frequently, it takes extremely noisy steps toward the solution, which makes the convergence behavior highly unstable.
  3. Mini-batch Gradient Descent: In this approach, the training samples are partitioned into several mini-batches, where each mini-batch is a small collection of samples with no overlap between them. Parameter updating is then performed after the gradient is computed on each mini-batch. This method combines the advantages of both BGD and SGD: it offers steady convergence, greater computational efficiency, and memory effectiveness. The following describes several enhancement techniques for gradient-based learning algorithms (usually applied to SGD) that further strengthen the CNN training process.
  4. Momentum: For neural networks, this technique is applied to the objective function. It enhances both the accuracy and the training speed by adding the gradient computed at the preceding training step, weighted by a factor \lambda (known as the momentum factor). The main disadvantage of plain gradient-based learning algorithms is that they can easily become stuck in a local minimum rather than reaching the global minimum; issues of this kind frequently occur when the problem has a non-convex surface (or solution space).

    Together with the learning algorithm, momentum is used to solve this issue, which can be expressed mathematically as in Eq. 13.

    \Delta w_{ij}^{t}= \left( \eta *\frac{\partial E}{\partial w_{ij}}\right) +(\lambda *\Delta w_{ij}^{t-1}) (13)

    The weight increment in the current training epoch t is denoted by \Delta w_{ij}^{t}, \eta is the learning rate, and \Delta w_{ij}^{t-1} is the weight increment in the preceding (t-1)th training epoch. The momentum factor value is kept within the range 0 to 1; it increases the step size of the weight updates toward the minimum in order to minimize the error. When the momentum factor value is very low, the model loses its ability to escape local minima; when it is high, the model converges much more rapidly. However, if a high momentum factor is used together with a large learning rate, the model may overshoot and miss the global minimum.

    However, when the gradient varies its direction continually throughout the training process, a suitable value of the momentum factor (which is a hyper-parameter) smooths out the weight-update variations.

  5. Adaptive Moment Estimation (Adam): This is another widely used optimization technique or learning algorithm. Adam [85] represents one of the latest trends in deep learning optimization and was designed specifically for training deep neural networks. Unlike second-order methods that rely on the Hessian matrix of second-order derivatives, Adam uses only first-order gradient information; it is more memory-efficient and requires less computational power. The mechanism of Adam is to calculate an adaptive learning rate for each parameter in the model, integrating the pros of both Momentum and RMSprop: it uses the squared gradients to scale the learning rate, as in RMSprop, and it resembles momentum in using a moving average of the gradient. The update equation of Adam is given in Eq. 14 (a small sketch of the momentum and Adam updates follows Eq. 14).

    w_{ij}^{t}=w_{ij}^{t-1}-\frac{\eta }{\sqrt{\widehat{E[\delta ^{2}]^{t}}}+\epsilon } *\widehat{E[\delta ]^{t}} (14)

    Here \widehat{E[\delta ]^{t}} and \widehat{E[\delta ^{2}]^{t}} denote the moving averages of the gradient and of the squared gradient, respectively, and \epsilon is a small constant added for numerical stability.
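
To make Eqs. 12-14 concrete, the sketch below applies gradient descent with momentum and an Adam-style update to a toy one-dimensional error function E(w) = (w - 3)^2; the learning rate, momentum factor, and Adam constants are illustrative assumptions rather than recommended values.

```python
import numpy as np

# Sketch of the weight updates in Eqs. 12-14 on a toy 1-D problem
# E(w) = (w - 3)^2, whose gradient is dE/dw = 2(w - 3). All constants
# (learning rate, momentum factor, Adam betas) are illustrative.
def grad(w):
    return 2.0 * (w - 3.0)

# Gradient descent with momentum (Eqs. 12 and 13)
w, delta, eta, lam = 0.0, 0.0, 0.1, 0.9
for _ in range(50):
    delta = eta * grad(w) + lam * delta
    w = w - delta

# Adam: moving averages of the gradient and the squared gradient (cf. Eq. 14)
w_adam, m, v = 0.0, 0.0, 0.0
beta1, beta2, eps = 0.9, 0.999, 1e-8
for t in range(1, 51):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (RMSprop-like)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w_adam -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(w, w_adam)    # both should approach the minimum at w = 3
```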


Design of algorithms (backpropagation)

Let us start with a notation that refers to the weights in the network unambiguously. We denote w_{ij}^{h} as the weight of the connection from the i-th neuron in the (h-1)-th layer (or the input) to the j-th neuron in the h-th layer. Thus, Fig. 12 shows the weight on a connection from a neuron in the first layer to another neuron in the next layer of the network.

Fig. 12 MLP structure


Here, w_{11}^{2} represents the weight from the first neuron in the first layer to the first neuron in the second layer; accordingly, the next weight into the same neuron is w_{21}^{2}, i.e. the weight coming from the second neuron of the previous layer to the first neuron of the next layer (the second layer in this network). Regarding the bias: since the bias is not a connection between neurons of adjacent layers, it is handled separately; each neuron has its own bias (in some networks, each layer has a single bias). It can be seen from the above network that each layer has its own bias. Each network is characterized by parameters such as the number of layers, the number of neurons in each layer, and the number of weights (connections) between the layers. The number of connections can easily be determined from the number of neurons in each layer; for example, if ten inputs are fully connected to two neurons in the next layer, then the number of connections between them is 10 * 2 = 20 (weights). To describe how the error is defined and how the weights are updated, we will imagine that there are two layers in our neural network:

\text{error} = \frac{1}{2}\left( d_{i}-y_{i}\right) ^{2} (15)

where d is the label of the individual input i and y is the output for the same individual input. Backpropagation is about understanding how to change the weights and biases in a network based on changes in the cost function (error). Ultimately, this means computing the partial derivatives \partial E / \partial w_{ij}^{h} and \partial E / \partial b_{j}^{h}. To compute these, a local variable \delta _{j}^{h} is introduced, called the local error of the j-th neuron in the h-th layer. Based on that local error, backpropagation provides the procedure to compute \partial E / \partial w_{ij}^{h} and \partial E / \partial b_{j}^{h} for the two-layer neural network shown in Fig. 13.

Fig. 13 Neuron activation functions


The output error \delta _{j}^{1}(k) is computed for each j = 1{:}L, where L is the number of neurons in the output layer:

\delta _{j}^{1}(k)=(-1)\, e(k)\, \vartheta ^{\prime }\left( v_{j}(k)\right) (16)

where e(k) is the error of the epoch, as defined in Eq. (15), and \vartheta ^{\prime }\left( v_{j}(k)\right) is the derivative of the activation function for v_{j} at the output.

The error is then back-propagated to every remaining layer except the output layer:

\delta _{j}^{h}(k) =\vartheta ^{\prime }\left( v_{j}(k)\right) \sum _{l=1}^{L} \delta _{l}^{h+1}(k)\, w_{jl}^{h+1}(k) (17)

where \delta _{l}^{h+1}(k) is the error of the following layer (the output error when h+1 is the output layer) and w_{jl}^{h+1}(k) represents the weight of the layer after the one whose error is being computed.

After finding the error at each neuron in each layer, we can now update the weights in each layer based on Eqs. (16) and (17).
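
Putting Eqs. 15-17 together, the following NumPy sketch runs one forward and backward pass through a two-layer MLP with sigmoid activations; the layer sizes, learning rate, and random initialization are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of back-propagation for a two-layer MLP using Eqs. 15-17
# (sigmoid activations; layer sizes and learning rate are illustrative).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, eta = 3, 4, 2, 0.1
W1, b1 = rng.standard_normal((n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.standard_normal((n_out, n_hidden)), np.zeros(n_out)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

x = rng.standard_normal(n_in)
d = np.array([1.0, 0.0])                  # desired output (label)

# Forward pass
v1 = W1 @ x + b1;  s1 = sigmoid(v1)       # hidden layer
v2 = W2 @ s1 + b2; y = sigmoid(v2)        # output layer

# Eq. 16: local error at the output layer (sigmoid' = s(1 - s))
e = d - y
delta2 = -e * y * (1.0 - y)

# Eq. 17: back-propagate the error to the hidden layer
delta1 = (W2.T @ delta2) * s1 * (1.0 - s1)

# Eq. 12: w <- w - eta * dE/dw, where dE/dw is the local error times the input
W2 -= eta * np.outer(delta2, s1); b2 -= eta * delta2
W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1
```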


Improving performance of CNN

Based on our experiments in different DL applications, we can conclude that the most effective solutions for improving the performance of a CNN are:

  • Expand the dataset with data augmentation or use transfer learning (explained in later sections).
  • Increase the training time.
  • Increase the depth (or width) of the model.
  • Add regularization.
  • Tune the hyperparameters more extensively.