

It is worthwhile to investigate the efficiency of this type of search. The answer depends primarily upon the form of the minimized function Q(w), so the efficiency of a gradient-based search can vary considerably. If we confine ourselves to so-called quadratic forms

Q(w) = Q_0 + b^T w + (1/2) w^T H w

(H being the symmetric Hessian matrix of Q), then the findings are far more conclusive. This class of functions, however, forms a very special case, and the outcomes of the analysis should not be overgeneralized. There are several points to be emphasized:

  the calculation of the updates requires the inverse of the Hessian matrix, which can become very tedious for high values of “n”;
  similarly, the Hessian matrix could be ill-conditioned, contributing to an even higher computational burden.

An obvious way of alleviating these shortcomings is to replace the inverse of the Hessian matrix by a constant (the learning rate). As we will see in a while, this path is often pursued in neural learning.
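To contrast the two updates, consider a minimal Python sketch (the matrix H, vector b, starting point, and learning rate alpha below are illustrative assumptions, not values taken from the text). For a quadratic Q(w), a single Newton step, which uses the inverse of the Hessian, lands on the minimum, whereas the constant-learning-rate version trades the matrix inversion for many cheap iterations:

import numpy as np

# Quadratic objective Q(w) = 0.5 * w^T H w + b^T w (illustrative values).
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric, positive-definite Hessian
b = np.array([1.0, -1.0])

def grad(w):
    # Gradient of Q: H w + b
    return H @ w + b

w0 = np.array([5.0, 5.0])    # arbitrary starting point

# Newton step: requires (in effect) the inverse of the Hessian; for a
# quadratic Q it reaches the minimizer in a single iteration.
w_newton = w0 - np.linalg.solve(H, grad(w0))

# Gradient descent: the Hessian inverse is replaced by a constant
# learning rate alpha; many inexpensive iterations instead of one costly one.
alpha = 0.1
w_gd = w0.copy()
for _ in range(200):
    w_gd = w_gd - alpha * grad(w_gd)

print(w_newton, w_gd)        # both approach the same minimizer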

Generally, learning in neural networks is a difficult and demanding problem; this study only scratches the surface and analyzes some basic issues. Even without embarking on structural learning (optimization of the topology of the neural network), one should be aware of the complexity of the learning dynamics introduced by the training method. In fact, we end up with a highly nonlinear dynamical system. Many notions borrowed from the theory of dynamical systems (Schuster, 1984) may explain some learning phenomena. In particular, one should make sure that local minima do not become attractors of the dynamics and that limit cycles do not occur. We should also be aware that the system could move into a complex attractor of the search space; in that case some diversification mechanism could mitigate this shortcoming.

2.5.2. Perceptron learning rule

This is one of the earliest learning rules. We start with a single processing unit that is equipped with a hard limiter, Fig. 2.11.


Figure 2.11  Perceptron learning in a single processing unit

In other words, the output of the neuron reads as

y = sign(x^T w)

where the hard limiter sign(u) returns +1 for u > 0 and -1 otherwise.

The learning set consists of input-output pairs {x(k), target(k)}, k = 1, 2, …, N, where target(k) ∈ {-1, 1}. The objective of the learning is to derive connections w* such that the following set of inequalities holds:

  if target(k) = +1 then x(k)^T w* > 0
  if target(k) = -1 then x(k)^T w* < 0

for all k=1, 2, …, N.

Note that the relationship

x^T w* = 0

describes a separating hyperplane; in this sense the overall mapping (classification) is definitely linear. The patterns that can be classified without any error with the use of this rule are called linearly separable.

The classic perceptron learning rule (Rosenblatt, 1961) is expressed in the form of cyclic updates of the weights w: start with any initial connection weight vector w(1) and cycle through the patterns, modifying the connections according to

w(k+1) = w(k) + α (target(k) - y(k)) x(k)

where α > 0 is the learning rate.

The most pronounced property of the perceptron algorithm states that if the training patterns are linearly separable, then the perceptron rule finds a classifier in a finite number of learning steps (iterations); moreover, the misclassification rate of the resulting classifier is equal to zero.

The cycling through the data can be expressed explicitly by expressing “k” as a function of the training cycle (iter), namely

k' = ((iter - 1) mod N) + 1

that is

w(iter+1) = w(iter) + α (target(k') - y(k')) x(k')
If α = 0.5, then the learning rule is expressed as

w(iter+1) = w(iter) + 0.5 (target(k') - y(k')) x(k')

Hence if target(k') = y(k') then

Δw = 0

that is, correctly classified patterns leave the connections unchanged. If target(k') differs from y(k') (that is, either target(k') = 1 and y(k') = -1, or target(k') = -1 and y(k') = 1), then the modification Δw depends upon the current target and reads as

  Δw = x(k') if target(k') = 1
  Δw = -x(k') if target(k') = -1
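The rule translates directly into a few lines of code. The following Python sketch (with an illustrative toy data set; the function name and the appended constant input are assumptions made here for the example) cycles through linearly separable patterns until all of the inequalities hold:

import numpy as np

def perceptron(X, target, alpha=0.5, max_cycles=100):
    # Perceptron learning rule: w <- w + alpha * (target - y) * x.
    # X      : (N, n) array of input patterns x(k)
    # target : (N,) array of desired outputs in {-1, +1}
    w = np.zeros(X.shape[1])                # any initial w(1) will do
    for _ in range(max_cycles):
        errors = 0
        for x, t in zip(X, target):         # cycle through the patterns
            y = 1.0 if x @ w > 0 else -1.0  # hard limiter
            if y != t:
                w += alpha * (t - y) * x    # Delta w = +/- x for alpha = 0.5
                errors += 1
        if errors == 0:                     # all inequalities satisfied
            return w
    return w                                # patterns may not be separable

# Toy linearly separable data; a constant input of 1 is appended so the
# separating hyperplane need not pass through the origin.
X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0],
              [-1.0, -1.5, 1.0], [-2.0, -1.0, 1.0]])
target = np.array([1.0, 1.0, -1.0, -1.0])
print(perceptron(X, target))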

2.5.3. Delta learning rule

The delta learning rule is one of the most popular learning rules encountered in the framework of neurocomputing. In its generic version (Widrow and Hoff, 1960), it applies to any neural network without a hidden layer. Let us start with a single linear unit y = x^T w. The quadratic performance (Euclidean distance) index computed over all input-output pairs is expressed as

Q = (1/2) Σ_{k=1}^{N} (target(k) - y(k))^2

(the factor of 1/2 merely simplifies the form of the gradient).
The gradient-descent method (Section 2.5.1) yields the expression

w(iter+1) = w(iter) - α ∇_w Q

Here we consider off-line (batch) learning, meaning that the connections are updated once we have cycled through all the training data. (Note that so-called on-line learning would involve updates occurring after the presentation of each individual input-output pair of the training set.)

The so-called delta effect becomes visible after a slight modification of the learning expression. Let us introduce the difference between the target and the corresponding output of the network,

e(k) = target(k) - y(k)

This implies

∇_w Q = - Σ_{k=1}^{N} e(k) x(k)

Finally, the off-line scheme reads as

w(iter+1) = w(iter) + α Σ_{k=1}^{N} e(k) x(k)
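A compact Python rendering of this batch scheme follows (the data set, learning rate, and number of iterations are illustrative assumptions):

import numpy as np

def delta_rule_batch(X, target, alpha=0.01, n_iters=500):
    # Off-line (batch) delta rule for a single linear unit y = x^T w:
    # after each full pass, w <- w + alpha * sum_k e(k) x(k).
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y = X @ w                   # outputs for all N patterns at once
        e = target - y              # delta: e(k) = target(k) - y(k)
        w += alpha * (X.T @ e)      # update after the full cycle
    return w

# Illustrative data: targets produced by a known linear unit plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
target = X @ w_true + 0.01 * rng.normal(size=50)
print(delta_rule_batch(X, target))  # should approach w_true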

Now we derive the delta learning rule for the sigmoidal nonlinear function; furthermore, we confine ourselves to the on-line learning scheme. First,

w(iter+1) = w(iter) + α (target(k) - y(k)) ∂y(k)/∂w

Here

∂y(k)/∂w = (dy/do) x(k)

where o(k) is the linear combination of the inputs, o(k) = x(k)^T w(iter), while y(k) is a nonlinear function of o(k). For the sigmoidal function

y = 1 / (1 + exp(-o))

simple computations reveal that

dy/do = y (1 - y)

so that

w(iter+1) = w(iter) + α (target(k) - y(k)) y(k) (1 - y(k)) x(k)
2.5.4. Backpropagation learning

The delta rule supports learning in neural networks without hidden layers. The backpropagation (Backprop or BP) method (Rumelhart et al., 1986; Hecht-Nielsen, 1990) is aimed at learning in multilayer neural networks. The reason behind the development of this rule is that the modification of the connections of neurons whose outputs are not directly confronted with the target values (and this happens at all hidden layers) has to be completed differently. The simple idea implemented by the BP method is to backpropagate an error signal from the output layer down to the input layer and use it as a reference signal to carry out the learning there. From the formal point of view, the BP method takes full advantage of the well-known chain rule of differential calculus.

Let us consider the multilayer network as portrayed in Fig. 2.12.


Figure 2.12  BP learning - computational details and notation

The notation we will be using becomes crucial in the derivation of the learning formulas. As usual, we describe each processing unit as a linear combination of the inputs followed by a nonlinear element (g), namely

y = g(Σ_i w_i x_i) = g(x^T w)
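Before the formal derivation, the mechanics of backpropagating the error signal can be made concrete with a minimal Python sketch for a single hidden layer of sigmoidal units (the architecture, weight shapes, data, and learning rate below are assumptions made for illustration, not the book's notation):

import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

def backprop_step(x, t, V, w, alpha=0.5):
    # One on-line BP update for a network with one hidden layer.
    # V : (h, n) hidden-layer weights, w : (h,) output weights.
    z = sigmoid(V @ x)                           # hidden-layer outputs
    y = sigmoid(w @ z)                           # network output
    delta_out = (t - y) * y * (1.0 - y)          # output error signal
    delta_hid = (w * delta_out) * z * (1.0 - z)  # error backpropagated through w
    w += alpha * delta_out * z                   # output-layer update
    V += alpha * np.outer(delta_hid, x)          # hidden-layer update
    return V, w

# Illustrative XOR-style training loop (2 inputs plus a constant input of 1).
rng = np.random.default_rng(1)
V = rng.normal(scale=0.5, size=(3, 3))           # 3 hidden units
w = rng.normal(scale=0.5, size=3)
data = [([0, 0, 1], 0.1), ([0, 1, 1], 0.9),
        ([1, 0, 1], 0.9), ([1, 1, 1], 0.1)]
for _ in range(5000):
    for x, t in data:
        V, w = backprop_step(np.array(x, dtype=float), t, V, w)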

