It is worthwhile to investigate the efficiency of this type of search. The answer depends primarily on the form of the minimized function Q(w). If we confine ourselves to so-called quadratic forms, the findings are far more conclusive. This class of Q(w), however, forms a very special case, and the outcomes of this analysis should not be overgeneralized. There are several points to be emphasized:
The self-evident way to alleviate these shortcomings is to replace the Hessian matrix by a constant (learning rate). As we will see shortly, this path is often pursued in neural learning. Generally, learning in neural networks is a difficult and demanding problem; this study only scratches the surface and analyzes some basic issues. Even without embarking on structural learning (optimization of the topology of the neural network), one should be aware of the complexity of the learning dynamics introduced by the training method. In fact, we end up with a highly nonlinear system. Many notions borrowed from the theory of dynamical systems (Schuster, 1984) may explain some learning phenomena. In particular, one should make sure that local minima do not become attractors of the dynamics and that limit cycles do not occur. One should also be aware that the system could move into a complex attractor of the search space; a diversification mechanism could then mitigate this shortcoming.

2.5.2. Perceptron learning rule

This is one of the earliest learning rules. We start with a single processing unit equipped with a hard limiter, Fig. 2.11.
In other words, the output of the neuron reads as

y(k) = sgn(x(k)Tw)

The learning set consists of input-output pairs {x(k), target(k)}, k = 1, 2, ..., N, where target(k) ∈ {-1, 1}. The objective of this learning is to derive connections w* such that the following set of inequalities holds

target(k) (x(k)Tw*) > 0
for all k = 1, 2, ..., N. Note that the relationship xTw* = 0 describes a separating hyperplane; in this sense the overall mapping (classification) is definitely linear. The patterns that can be classified without any error with the use of this rule are called linearly separable. The classic perceptron learning rule (Rosenblatt, 1961) is expressed in the form of cyclic updates of the weights: start with any initial connection weight vector w(1) and cycle through the patterns, modifying the connections

w(iter+1) = w(iter) + α(target(k) - y(k))x(k)

where α > 0 is the learning rate. The most pronounced property of the perceptron algorithm states that if the training patterns are linearly separable, then the perceptron rule finds the classifier in a finite number of learning steps (iterations); moreover, the misclassification rate of the classifier reaches zero. The cycling through data can be expressed explicitly by expressing k as a function of the training cycle (iter), namely

k = (iter - 1) mod N + 1

If α = 0.5 then the learning rule is expressed as

w(iter+1) = w(iter) + 0.5(target(k) - y(k))x(k)

Hence if target(k) = y(k) then

w(iter+1) = w(iter)

If target(k) differs from y(k) (that is, either target(k) = 1 and y(k) = -1, or target(k) = -1 and y(k) = 1), then the modification Δw depends upon the current target and reads as

Δw = target(k)x(k)
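The cyclic update scheme above can be sketched in a few lines of code. This is a minimal illustration, not the text's own program; the toy patterns and targets (with a bias input appended as the last component) are assumptions made for the example.

```python
import numpy as np

# Illustrative, linearly separable toy data; last component is a bias input.
patterns = np.array([[1.0, 2.0, 1.0],
                     [2.0, 1.0, 1.0],
                     [-1.0, -2.0, 1.0],
                     [-2.0, -1.0, 1.0]])
targets = np.array([1, 1, -1, -1])

alpha = 0.5
w = np.zeros(3)                              # any initial connection vector w(1)
for it in range(100):                        # cycle through the patterns
    k = it % len(patterns)
    y = 1 if patterns[k] @ w >= 0 else -1    # hard limiter (sgn convention: 0 -> 1)
    w = w + alpha * (targets[k] - y) * patterns[k]

# after convergence every pattern satisfies target(k) * x(k)^T w > 0
print(all(t * (x @ w) > 0 for x, t in zip(patterns, targets)))  # True
```

With α = 0.5, each misclassified pattern contributes exactly Δw = target(k)x(k), matching the rule derived above; correctly classified patterns leave w unchanged.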
2.5.3. Delta learning rule

The delta learning rule is one of the most popular learning rules encountered in the framework of neurocomputing. In its generic version (Widrow and Hoff, 1960), it applies to any neural network without a hidden layer. Let us start with a single linear unit, y = xTw. The quadratic performance (Euclidean distance) index computed over all input-output pairs is expressed as

Q = (1/2) Σ (target(k) - y(k))², k = 1, 2, ..., N

The gradient-descent method (Section 2.5.1) yields the expression

w(iter+1) = w(iter) + α Σ (target(k) - y(k)) x(k)

Here we consider off-line (batch) learning, meaning that the connections are updated once we have cycled through all the training data. Note that so-called on-line learning would involve updates occurring after the presentation of each input-output pair of the training set,

w(iter+1) = w(iter) + α(target(k) - y(k))x(k)

The so-called delta effect becomes visible after a slight modification of the learning expression. Let us introduce the difference between the target and the corresponding output of the network,

δ(k) = target(k) - y(k)

This implies

w(iter+1) = w(iter) + α δ(k) x(k)

Finally, the off-line scheme reads as

w(iter+1) = w(iter) + α Σ δ(k) x(k)

Now we derive the delta learning rule for the sigmoidal nonlinear function; furthermore, we confine ourselves to the on-line learning scheme. First,

w(iter+1) = w(iter) + α(target(k) - y(k)) dy(k)/dw

Here

dy(k)/dw = (dy/do) x(k)

where o(k) is the linear combination of the inputs, o(k) = x(k)Tw(iter), while y(k) is a nonlinear function of o(k). For the sigmoidal function

y = 1/(1 + exp(-o))

simple computations reveal that

dy/do = y(1 - y)

so that

w(iter+1) = w(iter) + α(target(k) - y(k)) y(k)(1 - y(k)) x(k)

2.5.4. Backpropagation learning

The delta rule supports learning in neural networks without hidden layer(s). The Backpropagation (Backprop or BP) method (Rumelhart et al., 1986; Hecht-Nielsen, 1990) is aimed at learning in multilayer neural networks. The reason behind the development of this rule is that the modifications of the connections of neurons whose outputs are not directly confronted with the target values (which happens at all hidden layers) must be completed differently.
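The on-line delta rule for a sigmoidal unit can be sketched as follows. The two-pattern training set, the number of epochs, and the learning rate are illustrative assumptions; targets are chosen in (0, 1) to match the range of the sigmoid.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Illustrative toy data; last component is a bias input.
X = np.array([[0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
targets = np.array([0.0, 1.0])   # sigmoid outputs lie in (0, 1)

alpha = 1.0
w = np.zeros(3)
for _ in range(5000):
    for x, t in zip(X, targets):
        o = x @ w                # linear combination o(k)
        y = sigmoid(o)           # nonlinear output y(k)
        # on-line delta rule with the sigmoid derivative y(1 - y)
        w = w + alpha * (t - y) * y * (1.0 - y) * x

print(np.all(np.abs(targets - sigmoid(X @ w)) < 0.1))
```

Note how the factor y(1 - y) shrinks the updates when the unit saturates near 0 or 1, which is why this scheme typically needs many more passes through the data than the linear version.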
The simple idea implemented by the BP method is to backpropagate an error signal from the output layer down to the input layer and use it as a reference signal to carry out learning there. From the formal point of view, the BP method takes full advantage of the well-known chain rule of differential calculus. Let us consider the multilayer network portrayed in Fig. 2.12.
The notation we will be using is crucial in the derivation of the learning formulas. As usual, we describe each processing unit as a linear combination of the inputs followed by a nonlinear element (g), namely

o = xTw,  y = g(o)
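A forward pass through such a multilayer network can be sketched in code. The layer sizes, random weights, and input vector below are illustrative assumptions, not taken from Fig. 2.12; each unit computes a linear combination of its inputs followed by the nonlinearity g.

```python
import numpy as np

def g(o):                                    # sigmoidal nonlinear element
    return 1.0 / (1.0 + np.exp(-o))

rng = np.random.default_rng(0)
W_hidden = rng.standard_normal((3, 2))       # 3 inputs -> 2 hidden units
W_output = rng.standard_normal((2, 1))       # 2 hidden units -> 1 output unit

x = np.array([0.5, -1.0, 1.0])
o_hidden = x @ W_hidden                      # linear combination at each hidden unit
z = g(o_hidden)                              # nonlinearity applied elementwise
o_out = z @ W_output                         # linear combination at the output unit
y = g(o_out)
print(0.0 < y[0] < 1.0)                      # True: sigmoid output stays in (0, 1)
```

The BP derivation then applies the chain rule through exactly this composition, layer by layer, from y back to the hidden connections.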
Copyright © CRC Press LLC