2.7. Enhancements of gradient-based learning in neural networks

It has become obvious that gradient-based methods can exploit only first-order information about the multidimensional surface of the performance index Q(w); the use of second-order information (conveyed by the Hessian matrix) is essentially computationally infeasible.

This limited search guidance, most profoundly manifested in the multilayer architectures of neural networks, usually results in very slow convergence. To alleviate this drawback, several enhancements of the generic learning algorithms have been proposed. In particular, one can be overwhelmed by the abundance of enhancements of the vanilla form of BP. All of them attempt to gain some extra mileage and to compensate for what dropping the Hessian matrix costs the efficiency of the optimization method. The reader should not be too optimistic, though: all of these attempts are rather local in nature and may not work well across a broad range of learning scenarios. In what follows, we briefly discuss some of these enhancements.

Modifiable learning rate Too low values of the learning rate slow down learning. Too high rates produce an oscillatory behavior. This simple observation leads to a number of useful heuristics. Starting from an initial learning rate α, its value is increased,

α(iter + 1) = κ α(iter)

where κ > 1, if the learning is smooth (no oscillations of Q along with its steady decrease), or decreased exponentially,

α(iter + 1) = α(iter) e^ρ

(ρ < 0) if the oscillations in the performance index were present.
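A minimal sketch of this heuristic follows. The constants kappa and rho and the smoothness test (comparing the last two recorded values of Q) are illustrative assumptions, not prescribed by the text.

```python
import math

def adapt_learning_rate(alpha, q_history, kappa=1.05, rho=-0.3):
    """Grow alpha by the factor kappa (kappa > 1) while Q decreases
    steadily; shrink it exponentially via exp(rho) (rho < 0) once
    oscillations in Q appear."""
    if len(q_history) < 2:
        return alpha                      # not enough history to decide
    if q_history[-1] < q_history[-2]:     # smooth, steady decrease of Q
        return kappa * alpha
    return alpha * math.exp(rho)          # oscillation detected
```

Note that growth is gentle (a few percent per step) while the exponential cut is aggressive; this asymmetry keeps a runaway learning rate from destabilizing the search for long.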

Momentum term Momentum is another augmentation of the standard delta rule; it adds an extra term to the original method,

Δconn(t) = -α ∇Q(t) + β Δconn(t - 1)

Δconn describes a change in the connections, while β > 0 is the momentum coefficient. In regions of high fluctuations of Q, the momentum term damps the oscillations of the weights. For flat regions of Q the effective learning rate increases. To observe that, let us rewrite the momentum formula over a p-step iteration (Hassoun, 1995), that is

Δconn(t) = -α Σ_{l=0}^{p} β^l ∇Q(t - l) + β^{p+1} Δconn(t - p - 1)

For flat regions we can assume that the gradient does not change too much and can be placed in front of the summation operation,

Δconn(t) ≈ -α (Σ_{l=0}^{p} β^l) ∇Q = -α ((1 - β^{p+1})/(1 - β)) ∇Q

Thus the effective learning rate α′ has increased and is equal to
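The increase of the effective learning rate can be verified numerically. In the sketch below (names are illustrative), a constant gradient g stands in for a flat region; the accumulated step then approaches -(α/(1 - β))·g, i.e., an effective learning rate of α′ = α/(1 - β).

```python
def momentum_steps(g, alpha=0.1, beta=0.9, p=200):
    """Iterate the momentum update Delta = -alpha*g + beta*Delta for a
    constant gradient g (flat region) and return the limiting step."""
    delta = 0.0
    for _ in range(p):
        delta = -alpha * g + beta * delta
    return delta
```

With α = 0.1 and β = 0.9 the effective rate grows tenfold: the returned step is close to -1.0 for g = 1.0, i.e., α′ = α/(1 - β) = 1.0.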

2.8. Concluding remarks

This chapter serves as a condensed yet comprehensive introduction to neural networks and neurocomputing. We have emphasized the role of neural networks as universal approximators. In fact, all applications dwell to some extent on this important finding. Neural networks are versatile computational structures endowed with the parametric flexibility conveyed by their connections. These variable weights strongly support learning, yet can hamper its efficiency due to the excessively large search space within which the optimization of neural networks needs to be completed. We have discussed a number of the main topologies and learning methods. What is also evident is the fact that learning as perceived in neural networks is primarily geared toward their parametric optimization; any structural changes call for a different methodology. Do neural networks constitute a new concept? The answer is partially affirmative, and this comes at the level of the idea of distributed processing. On the other hand, one may identify several examples showing that neural networks borrow a number of concepts from other areas. An interesting example of such associations has emerged from statistics; the list below compares the basic nomenclatures of neural networks and statistics. Amazingly, one can identify a series of useful similarities:

  neural networks        statistics
  -------------------    --------------------
  neural network         model
  input                  independent variable
  output                 dependent variable
  learning               estimation
  supervised learning    regression
  generalization         interpolation
  weight decay           ridge regression
  training set           observations

While the approximation capabilities of neural networks promise a lot, the role of the engineering of neural networks is to put these capabilities to work. How this potential can be exploited becomes a matter of choosing the right architecture, preprocessing the data, and carrying out efficient learning. As we discuss in the remainder of this book, neural networks operating alone cannot fully satisfy these design goals and need a symbiotic interaction with some other technologies, especially fuzzy sets and evolutionary computing.

2.9. Problems

2.1. Consider the modified performance index

Q(conn) = Q₀(conn) + μ ‖conn‖²

where Q₀ denotes the standard sum-of-squared-errors index, the vector conn symbolizes all the connections of the network, and μ > 0. Analyze the role of the second component as supporting the regularization effect.
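As a starting point for the analysis, the following sketch (an illustration, assuming the penalized index Q₀ + μ‖conn‖²) shows how the penalty term enters the gradient: it contributes 2μ·conn, so every update also shrinks the weights toward zero, which is the weight-decay behavior underlying the regularization effect.

```python
import numpy as np

def regularized_update(conn, grad_q0, alpha=0.1, mu=0.01):
    """One gradient step on Q0 + mu*||conn||^2: the penalty adds
    2*mu*conn to the gradient, pulling the weights toward zero."""
    grad = grad_q0 + 2.0 * mu * conn
    return conn - alpha * grad
```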

2.2. Discuss the role of the nonlinear functions in the basic neurons. Are these functions essential if the neurons are situated in the hidden layer(s)? What about the neurons forming the output layer?

2.3. Elaborate how the network in Fig. 2.18, where n >>m, can serve as a data compressor. What should be the target vector used in the training of this network?


Figure 2.18  Neural networks as a signal compressor

2.4. The single-variable function given by the input-output pairs shown in Fig. 2.19 is to be approximated by a neural network. Without carrying out detailed learning, discuss whether this could be a difficult task. Why? Which segment of the data would be the most difficult to represent (approximate)?


Figure 2.19  Experimental data to be approximated by a neural network

2.5. The standard sigmoidal nonlinearity can be equipped with two auxiliary parameters, say

y(u) = 1 / (1 + exp(-α(u - β)))
How do they impact the nonlinear characteristics? Elaborate on the role of α and β on the efficiency of any learning procedure. Would you recommend their changes over the course of training?
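To experiment with the question, the following sketch assumes the parameterized form y(u) = 1/(1 + exp(-α(u - β))); the exact functional form is an assumption. Under it, α scales the steepness of the transition while β shifts its center away from u = 0.

```python
import math

def sigmoid(u, alpha=1.0, beta=0.0):
    """Parameterized sigmoid: alpha controls the slope of the
    transition, beta shifts the transition point to u = beta."""
    return 1.0 / (1.0 + math.exp(-alpha * (u - beta)))
```

Plotting the function for several values of α (say 0.5, 1, 5) and β makes their roles in shaping the nonlinearity immediately visible.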

2.6. Considering the RBF neural network, discuss the learning of its output layer composed of “m” linear units y_i = w_i^T z. How can you introduce the effect of regularization into the training mechanism?

2.7. Generalize the delta rule for a neural network with a single layer having “n” inputs and “m” outputs. Derive detailed learning formulas for the tanh type of nonlinearities.
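As a check on the derivation, here is a minimal sketch of the resulting update for a single layer with tanh units, y = tanh(W x). Using d(tanh u)/du = 1 - tanh²(u), the delta rule becomes W ← W + α·outer((d - y)(1 - y²), x); the function name is illustrative.

```python
import numpy as np

def delta_rule_step(W, x, d, alpha=0.1):
    """One delta-rule update for y = tanh(W x): the output error is
    scaled by the tanh derivative (1 - y**2) before the outer product
    with the input pattern x."""
    y = np.tanh(W @ x)
    delta = (d - y) * (1.0 - y ** 2)
    return W + alpha * np.outer(delta, x)
```

Repeated application on a single input-target pair drives the network output toward the target, as long as the target stays within the (-1, 1) range of tanh.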

2.8. Using gradient-based learning, derive the perceptron learning rule. To accomplish that, it is convenient to consider the performance index of the form

Q(w) = - Σ_{x ∈ M} d w^T x

where M denotes the set of misclassified patterns and d = ±1 stands for the target class of x. If the sum is taken over an empty set then Q(w) = 0; otherwise it is positive. Moreover, the lower the values of Q, the better the performance of the network.
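The rule that the derivation should arrive at can be sketched as follows, assuming the perceptron criterion Q(w) = -Σ_{x∈M} d·wᵀx with targets d = ±1 (this form of the index is an assumption consistent with the stated properties). Its gradient over a single misclassified pattern yields the classical update w ← w + α·d·x.

```python
import numpy as np

def perceptron_step(w, x, d, alpha=1.0):
    """Classical perceptron rule: update w only when pattern x
    (target d = +/-1) falls on the wrong side of the hyperplane."""
    if d * (w @ x) <= 0:          # x is misclassified
        w = w + alpha * d * x     # move w toward the correct side
    return w
```

A correctly classified pattern contributes nothing to Q, so the rule leaves w unchanged in that case.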


Copyright © CRC Press LLC
