

The problem of training a neural network under this mixture of training data splits naturally into two tasks:

(a)  direct learning. This phase is based on explicitly labelled patterns and, in fact, amounts to the standard way of training many neural networks. Assume a certain performance index Q that describes the distance between the outputs of the network and the target vectors of class membership


Figure 2.15  Classification problem with direct and referential class assignment

The gradient-descent method applied to Q gives rise to the update expression
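The update expression itself did not survive this reproduction; with a learning rate α (symbol assumed here), the standard gradient-descent update can be sketched as

```latex
\mathrm{conn}(\mathrm{iter}+1) \;=\; \mathrm{conn}(\mathrm{iter}) \;-\; \alpha \, \frac{\partial Q}{\partial\, \mathrm{conn}}, \qquad \alpha > 0
```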

(b)  referential learning. Training the network in this mode requires more attention and is usually more demanding in terms of training time. First, we observe that the format of supervision available now is far weaker than that envisioned in the previous mode of learning. The guidance is provided as a single scalar reinforcement signal describing the similarity between two patterns. Thus the objective of learning is expressed as
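The objective itself is missing from this reproduction; consistent with the description that follows, it can be sketched as

```latex
V \;=\; \big(\,\mathrm{sim}\big(NN(\mathbf{x}_1),\, NN(\mathbf{x}_2)\big) - d\,\big)^2
```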

where x1 and x2 are two patterns coming from the training set. The expression sim(., .) quantifies the degree of similarity between the outputs of the neural network computed for x1 and x2, respectively. The similarity can be computed as the complement of the normalized distance between the outputs of the neural network for these two inputs. Let us emphasize that we are not furnished with complete class membership vectors but only with a level of similarity between the patterns, d ∈ [0, 1].

To fully explain the learning scheme, it is instructive to portray it using two copies of the same neural network; the training modifies both networks simultaneously, in a synchronous manner, Fig. 2.16.


Figure 2.16  Training with the use of referential labeling

Consider a single triple of elements in the training set, namely (x1, x2, d). The neural network (NN) responds to x1 by producing y(1); likewise, y(2) denotes the output of the same neural network for x = x2. The result of this matching is confronted with the required similarity level (d), and the connections of the network are updated so as to minimize V. The main mechanism of the optimization scheme comes in the form of the duplicated connections. Let us compute the derivative of V with respect to the connections (conn),

and

The first component in the above formula can be specified once the similarity expression (sim) has been explicitly defined.
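The two derivative expressions above did not survive this reproduction; applying the chain rule through the two duplicated copies of the network (in the y(1), y(2) notation of the text), they can be sketched as

```latex
\frac{\partial V}{\partial\, \mathrm{conn}}
  \;=\; \frac{\partial V}{\partial\, \mathrm{sim}}
    \left(
      \frac{\partial\, \mathrm{sim}}{\partial y^{(1)}}\,
      \frac{\partial y^{(1)}}{\partial\, \mathrm{conn}}
      \;+\;
      \frac{\partial\, \mathrm{sim}}{\partial y^{(2)}}\,
      \frac{\partial y^{(2)}}{\partial\, \mathrm{conn}}
    \right)
\qquad \text{and} \qquad
\frac{\partial V}{\partial\, \mathrm{sim}} \;=\; 2\,(\mathrm{sim} - d)
```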
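As a concrete illustration, the scheme can be sketched in Python for a single-layer sigmoid network whose two copies share one weight matrix W. The architecture, the learning rate, and the normalization of the distance by the square root of the output dimension are all assumptions made for this sketch, not details fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def referential_step(W, x1, x2, d, lr=0.1):
    """One gradient-descent step on V = (sim(y1, y2) - d)^2,
    with both copies of the network sharing the weight matrix W."""
    y1 = sigmoid(W @ x1)
    y2 = sigmoid(W @ x2)
    m = y1.size
    e = y1 - y2
    n = np.linalg.norm(e) + 1e-12            # guard against zero distance
    # similarity = complement of the normalized distance; outputs lie in
    # [0,1]^m, so the distance is normalized by its maximum value sqrt(m)
    s = 1.0 - n / np.sqrt(m)
    dV_ds = 2.0 * (s - d)
    ds_dy1 = -e / (n * np.sqrt(m))           # derivative through copy 1
    ds_dy2 = e / (n * np.sqrt(m))            # derivative through copy 2
    # chain rule: the contributions of the duplicated connections add up
    grad = (dV_ds * np.outer(ds_dy1 * y1 * (1.0 - y1), x1)
            + dV_ds * np.outer(ds_dy2 * y2 * (1.0 - y2), x2))
    return W - lr * grad, (s - d) ** 2       # updated weights, current V

W = rng.normal(size=(3, 4))
x1, x2 = rng.normal(size=4), rng.normal(size=4)
d = 0.9                                      # required similarity level
W1, v0 = referential_step(W, x1, x2, d)
_, v1 = referential_step(W1, x1, x2, d)
print(v1 < v0)   # the referential error V decreases after one update
```

Note that a single weight matrix is updated with the sum of the two branch gradients, which is exactly the effect of the duplicated connections in Fig. 2.16.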

2.6. Generalization abilities of neural networks

The (potential) approximation abilities of neural networks are guaranteed by the fundamental approximation theorem. Approximation alone, however, may not be sufficient for the network to perform equally well over the testing set.

So far, approximation of the training data was the only objective being optimized: the error computed over the training data was minimized by modifying the connections of the network. To make a neural network useful, we expect it to perform comparably well on data different from those used for training. To verify this, the already trained network should be tested on a separate testing data set. While the error diminishes over the training set, the same may not be true for the testing set. As shown in Fig. 2.17, this comes as an overtraining effect: training for too long leads to poor generalization. In other words, the network tends to memorize rather than generalize.

To enhance performance, training should have been terminated at an earlier stage, as suggested by the second curve, while the prediction error on the testing set was still declining. Otherwise the network simply follows the data too closely, even though some elements of the training set could be excessively noisy. The error over the training set can be made as small as desired by increasing the size of the network; this, however, implies poor generalization abilities.
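The earlier-termination rule suggested by Fig. 2.17 can be sketched as a simple monitor on the testing-set error; the patience parameter and the error sequence below are illustrative assumptions, not values from the text.

```python
class EarlyStopping:
    """Stop training once the testing-set error has not improved
    for `patience` consecutive epochs."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch, test_error):
        if test_error < self.best:
            self.best, self.best_epoch = test_error, epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience   # True -> stop now

# testing-set errors that first decline, then rise (overtraining sets in)
errors = [0.9, 0.6, 0.4, 0.3, 0.35, 0.4, 0.5, 0.6]
monitor = EarlyStopping(patience=3)
stopped_at = None
for epoch, e in enumerate(errors):
    if monitor.step(epoch, e):
        stopped_at = epoch
        break
print(monitor.best_epoch, stopped_at)  # best at epoch 3, stop at epoch 6
```

The connections saved at `best_epoch` — not those of the final epoch — would be the ones to keep.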


Figure 2.17  Performance index over a training and testing set during the course of learning

A more detailed statistical analysis sheds light on the very nature of this problem. Consider that the input (x) is governed by a certain probability density function (p.d.f.) p(x); it is then meaningful to talk about the expected squared error between the original function f(x) and the approximation NN(x) realized by the neural network. This error is equal to
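The expression itself is missing from this reproduction; consistent with the surrounding text, the expected squared error can be sketched as

```latex
E\big[(f(\mathbf{x}) - NN(\mathbf{x}))^2\big]
  \;=\; \int \big(f(\mathbf{x}) - NN(\mathbf{x})\big)^2\, p(\mathbf{x})\, d\mathbf{x}
```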

Let us rewrite this as a sum of two terms, namely2
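The decomposition itself did not survive this reproduction; applying the footnote identity with X = NN(x) and a = f(x), it can be sketched as

```latex
E\big[(f(\mathbf{x}) - NN(\mathbf{x}))^2\big]
  \;=\; \big[f(\mathbf{x}) - E(NN(\mathbf{x}))\big]^2
  \;+\; E\big[(NN(\mathbf{x}) - E(NN(\mathbf{x})))^2\big]
```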


2Let us expand the left-hand side of the expression
E(X-a)^2 = E(X^2) - 2aE(X) + a^2

The right-hand side reduces to the form
[a-E(X)]^2 + E[(X-E(X))^2] = a^2 - 2aE(X) + [E(X)]^2 + E[X^2 - 2XE(X) + [E(X)]^2] = a^2 - 2aE(X) + [E(X)]^2 + E(X^2) - [E(X)]^2 = a^2 - 2aE(X) + E(X^2)

Obviously, the two expressions are equivalent.
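The footnote identity can be checked numerically over a small empirical distribution; the sample values and the constant a below are arbitrary choices for this check.

```python
import numpy as np

# empirical check of E(X-a)^2 = [a - E(X)]^2 + E[(X - E(X))^2],
# i.e., expected squared error = bias^2 + variance
X = np.array([1.0, 2.0, 2.0, 5.0, 7.0])
a = 3.0
lhs = np.mean((X - a) ** 2)
rhs = (a - X.mean()) ** 2 + np.mean((X - X.mean()) ** 2)
print(np.isclose(lhs, rhs))  # True
```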

The first component is known as the bias while the second one is usually referred to as the variance. The left-hand side of the expression is obviously fixed; thus any increase of the bias reduces the variance and vice versa. Here we have to come to grips with how this balance is to be achieved:

  over-training the neural network on the training set reduces the bias; yet, as already noticed, one cannot then expect highly predictive behavior of the network on a testing set - the resulting high variance takes its toll.
  to reduce the variance (that is, the sensitivity of the network), it is worth admitting a slightly higher bias in the hope of better performance of the network.

Definitely, this bias-variance guideline is well-known in practice; even though we usually do not know p(x) (and this is the case in virtually any application), the overall expression serves as an important qualitative hint.

The solution to the overfitting problem is to extend the original performance index by adding a so-called regularization term (Poggio and Girosi, 1990). Thus
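The extended index itself is missing from this reproduction; writing the extended index as Q~ (a symbol assumed here) and using the notation of the following sentence, it can be sketched as

```latex
\tilde{Q} \;=\; Q \;+\; \lambda\,\|P\|
```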

This additional term ||P|| captures some extra requirements about the smoothness of the function to be approximated and is expressed over the connections of the network. λ stands for a scaling factor that helps achieve a reasonable compromise between the accuracy of the mapping and the produced regularization effect.
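As a minimal sketch of the regularization effect, the smoothness term can be replaced by the simplest connection-based penalty, λ·||w||^2 (weight decay), for a linear model where the regularized index has a closed-form minimizer; the data, the true weights, and the value of λ below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                 # synthetic inputs
w_true = np.array([1.0, -2.0, 0.5, 0.0])     # assumed underlying mapping
y = X @ w_true + 0.1 * rng.normal(size=50)   # noisy targets

lam = 0.1
# closed-form minimizer of Q~(w) = ||Xw - y||^2 + lam * ||w||^2
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

Q_data = np.sum((X @ w - y) ** 2)            # accuracy term
Q_total = Q_data + lam * np.sum(w ** 2)      # extended performance index
print(Q_total >= Q_data)  # the penalty only adds to the index
```

Increasing λ shrinks the connections (trading accuracy for smoothness), which is precisely the compromise the scaling factor controls.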



Copyright © CRC Press LLC
