Upgrading and Repairing Networks



- 32 -

Repairing Common Types of Problems

The increasing complexity of modern networks--and the growing diversity of network hardware and software--is reflected in the range of faults and glitches that today's network administrators must face. This chapter deals with some of the most common faults.

It would be impossible to cover the symptoms and fixes for all such problems here. For faults not explicitly dealt with in this chapter, the troubleshooting procedures outlined in the previous chapter--and, to a lesser extent, in this chapter--may be used to identify the approximate cause of any fault. You would do well at that stage to refer to the chapter dealing specifically with that element of the network.

This chapter describes typical symptoms and troubleshooting procedures for problems in three broad classes: server problems, connectivity problems on network links, and workstation problems.

You may need to refer to other chapters when attempting to solve a specific problem. For example, chapter 31, "Locating the Problem: Server versus Workstation versus Links," deals with problem-solving methodologies. Information related to a particular hardware or software component can be found in the relevant chapter of this book.

Server Problems

Because of the central role file servers play in their networks, server faults can cause serious disruption to a large number of users. In small networks, a crashed server can mean no service at all--if the server is down, the network is of no use. As a result, server troubleshooting is generally more urgent than workstation troubleshooting.

Hassle from users and the need for haste can get in the way of a clear-minded, methodical approach to solving server problems. Therefore, unless the source of the problem is immediately obvious, the first step is to create some space for yourself. Get someone to act as a buffer between you and the users so that your investigation of the problem is not constantly interrupted by people asking when the server will be back up. Announce the problem to your users so that they are aware that you are working on it.

In some cases, it is impossible to down a server or even to experiment much with it to examine a noncritical fault. If you fail to diagnose the problem while the server is "live," schedule some downtime when you can tackle the problem, and announce the downtime to your users well in advance.

Troubleshooting

Debugging the server startup process can be a tricky business. Text messages might flash by on-screen too rapidly for you to read, or the server might load an NLM at startup that causes the server to hang immediately. The following tips explain how to control the startup process and how to watch what happens at each step.

ECHO ON. The server normally shows the output resulting from commands in a script file without showing the commands themselves. It can be useful to see the commands while debugging, so add the line

ECHO ON

at the start of the AUTOEXEC.NCF file on each server. Each command in AUTOEXEC.NCF is then echoed on-screen as it is executed, prefaced by a ">" sign, as in the following example:

>load tcpip
Loading module TCPIP.NLM
  TCP/IP  v1.00 (910219)
  Auto-loading module SNMP.NLM
  SNMP Agent  v1.00 (910208)
>load 3C503 DIX int=3 mem=c8000 port=300
Loading module 3C503.LAN
  Previously loaded module was used re-entrantly
>bind IP to 3C503 address=111.222.333.444
IP: Bound to board 2.  IP address 111.222.333.444, net mask FF.FF.0.0
IP LAN protocol bound to 3Com EtherLink II 3C503  v3.11 (910121)

This makes it easy for you to match messages on-screen with the commands that generated them.
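If you capture the echoed console text to a file, the same matching can be done mechanically. The following is a minimal sketch in Python, run on a workstation rather than on the server; the function name group_console_output is hypothetical, and it assumes only what is described above, namely that each echoed command starts with a ">" sign.

```python
def group_console_output(lines):
    """Pair each echoed command (a line starting with '>') with the
    non-blank output lines that follow it, up to the next command."""
    groups = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A new echoed command begins a new group.
            current = (line[1:].strip(), [])
            groups.append(current)
        elif current is not None and line.strip():
            # Output belongs to the most recent command.
            # Lines seen before the first command are ignored.
            current[1].append(line.strip())
    return groups


if __name__ == "__main__":
    sample = [
        ">load tcpip",
        "Loading module TCPIP.NLM",
        "  Auto-loading module SNMP.NLM",
        ">load 3C503 DIX int=3 mem=c8000 port=300",
        "Loading module 3C503.LAN",
    ]
    for command, output in group_console_output(sample):
        print(command)
        for message in output:
            print("    " + message)
```

Grouping by command makes it easier to see which line of AUTOEXEC.NCF produced an unexpected message.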

CONLOG. The blur of text that whizzes past when the server starts is usually of little or no interest, as long as the server is working properly. When you are trying to track down a fault, however, you might want to examine each message in detail. This can be difficult, because there is no way to pause the screen (or SERVER.EXE) so that you can read messages before they scroll off.

CONLOG.NLM can help with this. This utility writes a copy of all text that appears on the console to a file on the server's NetWare partition: SYS:ETC/CONSOLE.LOG. To start logging console messages, just add the line

LOAD CONLOG

near the top of AUTOEXEC.NCF, immediately after the FILE SERVER NAME and IPX INTERNAL NET lines. If you try to start the log before these lines, SERVER.EXE has to prompt you for the file server name and IPX address before it can load CONLOG.

CONLOG starts a new CONSOLE.LOG file each time it loads, erasing the previous copy. This helps to keep the size of CONSOLE.LOG within reasonable limits; if CONLOG always appended data to the existing file, CONSOLE.LOG would grow rapidly. However, this start-from-scratch aspect can be a nuisance if the server crashes and you want to examine the log file, because restarting the server overwrites it.

If you want to see CONSOLE.LOG after a server crash, you can manually restart the server without executing the AUTOEXEC.NCF file (see the discussion on the -NA option below) so that CONLOG is not loaded, and the file is not erased.

Alternatively, use a utility such as Pierre Blanco's NCL.NLM to automatically make a copy of the previous CONSOLE.LOG before loading CONLOG. Something like the following in the AUTOEXEC.NCF file should do it:

LOAD NCL
NNCOPY SYS:ETC/CONSOLE.LOG SYS:ETC/CONSOLE.OLD
LOAD CONLOG

You may want to add the line

UNLOAD CONLOG

at the end of AUTOEXEC.NCF, to stop logging at the end of the startup process. This is not necessary, but it helps to prevent the CONSOLE.LOG file from becoming too large.

CONLOG can also be useful when debugging problems that arise during normal operation. Leave it running to gather messages over a period of time when dealing with an intermittent fault or any fault that is not readily reproducible.


TIP: The file appears to be zero bytes in length to a user at a client workstation as long as CONLOG is running. That doesn't mean that the file is empty--it just means that the file has not been released by CONLOG.

The main limitation of CONLOG when trying to track down serious faults is that it may not always manage to write the message text to the file before the system crashes. Even if CONLOG does its job successfully, the crash may cause the SYS volume to dismount, preventing the output that CONLOG traps from being saved to the file.

CONLOG does attempt to write each message to the file as it appears on the console, however, without any intermediate buffering. This means that, in general, CONLOG catches enough information to be of great use in post-crash debugging.

Server Startup Options. SERVER.EXE has some optional command-line parameters that can be used to control how STARTUP.NCF and AUTOEXEC.NCF are treated at startup time. The following sections describe these parameters.

-NS. Use the -NS option to prevent the server from reading both STARTUP.NCF and AUTOEXEC.NCF. This is particularly useful when debugging faults that arise during the startup process. After starting the server in this way, you can manually enter the commands in the usual startup script one at a time, while watching for errors.

This also is a sensible approach to take when tracing other difficult faults, as it reduces the unknown quantities to a minimum.

-NA. The -NA option is somewhat similar to -NS. It prevents the server from reading AUTOEXEC.NCF, but the commands in STARTUP.NCF are executed. If STARTUP.NCF does no more than load a disk driver, as is often the case, then you may want to run it before you begin debugging. In this case, use SERVER -NA and not -NS to begin your debugging session.


TIP: Create a special, stripped-down AUTOEXEC script for debugging sessions, and save it on the server's DOS partition as, for example, MYDEBUG.NCF. It should contain no more than the bare minimum to get the server started:

ECHO ON
LOAD NCL
NNCOPY SYS:ETC/CONSOLE.LOG SYS:ETC/CONSOLE.OLD
LOAD CONLOG

You can then use SERVER -NA to start the server without loading the usual AUTOEXEC.NCF; supply the file server name and number when prompted, and then enter MYDEBUG to get into fault-finding mode.


-S (Alternate Script). Use the -S option to specify an alternate startup script file. The specified file is parsed instead of STARTUP.NCF. For example, consider the following line:

C:\NETWARE.312> server -s C:\UTIL\STARTDBG.NCF

This line tells SERVER.EXE to use C:\UTIL\STARTDBG.NCF instead of the usual STARTUP.NCF. The AUTOEXEC.NCF file is parsed in the usual way, unless you use the -NA option as well.

Either a DOS path or NetWare path may be specified. The full path must be given, including the DOS drive letter or NetWare volume name.
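The way these options interact can be summarized in a few lines of Python. This is only an illustrative model of the behavior described above, not part of NetWare; the function name scripts_to_run is hypothetical, and the choice to let -NS override -S when both are given is an assumption.

```python
def scripts_to_run(args):
    """Return the script files SERVER.EXE would parse, in order,
    for the given command-line options (as described in the text)."""
    opts = [a.upper() for a in args]

    # -NS: neither STARTUP.NCF nor AUTOEXEC.NCF is read.
    # Assumption: -NS takes precedence if -S is also present.
    if "-NS" in opts:
        return []

    startup = "STARTUP.NCF"
    if "-S" in opts:
        i = opts.index("-S")
        if i + 1 < len(args):
            # -S names an alternate file parsed instead of STARTUP.NCF.
            startup = args[i + 1]

    scripts = [startup]
    # -NA: AUTOEXEC.NCF is skipped, but the startup script still runs.
    if "-NA" not in opts:
        scripts.append("AUTOEXEC.NCF")
    return scripts


if __name__ == "__main__":
    print(scripts_to_run([]))
    print(scripts_to_run(["-na"]))
    print(scripts_to_run(["-s", r"C:\UTIL\STARTDBG.NCF", "-na"]))
```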

Disk Drives

The need to provide more and faster storage space on servers encourages the use of relatively new, complex, or innovative technologies. This can apply to the hard disks themselves, the disk controllers, or the type of bus slot used. File server disks are likely to be quite busy, as they are used by a number of people simultaneously; they also are likely to be in use for longer periods of sustained activity than a workstation hard disk. It is hardly surprising, then, that disk problems are relatively common on file servers.

The impact of disk problems can be quite severe. Aside from the obvious danger of losing precious data, the operation of the server as a whole may be affected. After all, the bindery/NDS database is stored on disk and so are the files that comprise the NOS itself. Corruption of these files can mean server faults of many kinds. Complete disk failure can mean total loss of service.


CAUTION: There is no substitute for proper backups. Data is sometimes irretrievably lost from disk; even if a disk fault is completely rectified, some damage may have been done that cannot be undone. If you're reading this while trying to solve a server disk problem, and you don't have adequate backups, it may be too late--you'd better hope that the fault did no damage.

Disk Troubleshooting. Start by determining the extent of the problem. You need to establish answers to the following questions:

There is little point in wasting precious time trying to resolve faults that don't exist; it's better to take a while getting an overview of the nature and extent of the fault before launching into a repair procedure. Determining the answers to the above questions can bring you a long way toward full identification of the fault and can save you a lot of time in wasted trial-and-error troubleshooting.

Is It a Disk-Related Fault? Many problems appear to be disk-related when their causes actually lie elsewhere. For example, an inability to write a file to a volume might be due to access rights or quotas, and corruption of data might be caused by network errors.

Real evidence of a disk-related problem includes the following:

Checking the Volumes. The first thing to check is whether all volumes are mounted. If you know which volumes should be there, you can check this quite easily by entering the volumes command at the file server console. All mounted volumes are listed, along with the name spaces used by each:

Mounted Volumes              Name Spaces
  SYS                           DOS
  APP1                          DOS
  APP2                          DOS
  HOME                          DOS

This output is useful, but only if you know which volumes to expect. If you're not familiar with the server, then the only volume you can expect to see for certain is SYS; other than that, if you don't know what's supposed to be there, you can't tell what's missing.

Use MONITOR.NLM and INSTALL.NLM for a more thorough check. Load the NLMs on the server with the following commands if they are not already loaded:

LOAD MONITOR
LOAD INSTALL

If you can't do so, there may be a problem with the SYS volume. In that case, load the NLMs from the server's DOS partition:

LOAD C:\NETWARE.312\MONITOR
LOAD C:\NETWARE.312\INSTALL

If you do not have a copy of the NLMs on the hard disk, or if the server's DOS partition is inaccessible for some reason, get copies on floppy disk and load them.

Now, check that all disks are physically functioning:

1. Choose Disk Information from the main menu of MONITOR. MONITOR lists each hard disk.

2. If any disk is missing, check that the driver for that disk's controller is loaded by using the modules command at the console prompt.

3. If the driver is not loaded, load it manually and check the disk again.

4. If the driver is loaded but the disk is not listed, suspect physical failure of the disk or its controller--refer to the "Disk or Controller?" section that follows.

5. Select the first disk in the list. MONITOR displays information about the disk.

6. Check that the Hot Fix Status field reads Normal. If it says Not Active, then it is likely that there is an unacceptably high number of bad blocks on the disk. Consider the disk faulty and replace it.

7. Look at the Redirection Blocks and Redirected Blocks fields. If the number of redirected blocks is more than 50% or so of the total number of available redirection blocks, then the disk is likely to fail in the near future. Even if the number of redirected blocks is small but rising, the disk is deteriorating and should be replaced.
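The Hot Fix checks in steps 6 and 7 amount to a simple rule that can be written down as a sketch. The Python below is illustrative only: the function name assess_hot_fix is hypothetical, the values are ones you would read off the MONITOR screen by hand, and treating any status other than Normal as a failure is an assumption.

```python
def assess_hot_fix(status, redirection_blocks, redirected_blocks,
                   previous_redirected=None):
    """Apply the rules of thumb from steps 6 and 7 to one disk.

    status               -- text of the Hot Fix Status field
    redirection_blocks   -- total blocks set aside for redirection
    redirected_blocks    -- blocks already redirected
    previous_redirected  -- redirected count from an earlier check, if known
    """
    # Step 6: anything other than Normal is treated as a failed disk
    # (assumption: the text names only Normal and Not Active).
    if status.strip().lower() != "normal":
        return "replace: Hot Fix is not active"

    # Step 7: more than about half of the redirection area used.
    if redirection_blocks > 0 and redirected_blocks / redirection_blocks > 0.5:
        return "replace soon: over half of the redirection area is used"

    # Step 7: a small but rising count also indicates deterioration.
    if previous_redirected is not None and redirected_blocks > previous_redirected:
        return "replace: redirected block count is rising"

    return "ok"


if __name__ == "__main__":
    print(assess_hot_fix("Normal", 1000, 10))
    print(assess_hot_fix("Normal", 1000, 600))
    print(assess_hot_fix("Normal", 1000, 12, previous_redirected=10))
```

Recording the redirected block count each time you check a disk gives you the earlier figure needed for the last test.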

If all disks appear to be responding, check the partition tables:

1. Choose Volume Options from the main menu of INSTALL.

2. Press Enter to view information for the first volume in the list.

3. Look at the Status field--it says either Mounted or Dismounted.

4. If the volume is dismounted, try mounting it by using the mount command at the console prompt, as in the following example:

mount home2

If the volume fails to mount, note the error message.

5. Repeat steps 2 through 4 for each volume in the list.

At the end of these checks, you should know which disks are responding and which volumes are available. Use this information to determine your next course of action:

Eliminating Simple Causes. A quick check of some of the more common causes of disk faults might save you a lot of time. If you suspect physical failure of the disk or controller, then check the following before launching into any elaborate, time-consuming, or expensive troubleshooting and repair procedures:

Obviously, some of these situations are not going to spontaneously occur on a working server; for example, SCSI cables cannot switch orientation by themselves. Many disk problems arise, however, pretty soon after other work has been carried out on the server--it is possible that a disk fault could have arisen due to oversight or an accident during some previous work.

Disk or Controller? It isn't always obvious whether a fault is due to a disk or a controller failure. If a controller services multiple disks, you might be able to infer something; if all the disks are affected, it's likely to be the fault of the controller. If just one of the disks is affected, it's more likely to be the fault of the disk than the controller. These conclusions aren't definite, though, and in many cases, there is just one disk per controller.
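The inference described above can be expressed as a short sketch. The Python below is a hypothetical helper, not a diagnostic tool; it takes a hand-recorded mapping of the disks on one controller to whether each responds, and its answers are only as reliable as the rules of thumb in the text.

```python
def suspect_component(disk_ok_by_name):
    """Guess whether a disk or the controller is the likelier culprit.

    disk_ok_by_name maps each disk attached to ONE controller to True
    if the disk responds and False if it does not.
    """
    failed = [name for name, ok in disk_ok_by_name.items() if not ok]
    if not failed:
        return "none"
    if len(disk_ok_by_name) == 1:
        # One disk per controller: nothing can be inferred.
        return "undetermined"
    if len(failed) == len(disk_ok_by_name):
        # Every disk on the controller is affected.
        return "controller (likely)"
    return "disk (likely)"


if __name__ == "__main__":
    print(suspect_component({"disk0": False, "disk1": False}))
    print(suspect_component({"disk0": True, "disk1": False}))
    print(suspect_component({"disk0": False}))
```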

The situation is easiest to diagnose when the suspect disk is the home of the server's DOS partition. If the server boots successfully from this disk, then physical problems are unlikely; run a DOS-based disk-checking utility such as Norton Disk Doctor or Microsoft ScanDisk for a quick check. There's no guarantee that the disk is trouble-free if such a utility fails to turn up any errors, but you can at least eliminate total physical failure from the list of possibilities. If the volumes on the NetWare partition of such a disk cannot be mounted, then refer to "Using VREPAIR To Solve Volume Errors" later in this chapter.

Quite often, the only way to determine whether the disk or controller is at fault is to swap in a working replacement. If you can read a different disk using the original controller or read the original disk using a different controller, then you know which to replace. But which should you swap first, the controller or the disk?

You do not always have a choice. Many IDE disk controllers are built into the motherboard, in which case you need to find a working IDE disk with which to test the controller. You might have a spare disk on hand, but no spare controller; or you might have a spare controller, but no spare disk.

If a replacement controller is available, it makes sense to try that first. Installing it is not too big of a job--just insert it, load the driver, and configure the board. If the original controller was faulty, and the disk is not, the server should be up and running immediately with the new controller. You have, in effect, combined fault-tracing with repair work.

If you replace the disk instead of the controller, and find that the problem persists--implying that the controller is faulty--then you have to replace the controller, anyway. Of course, you might reach the conclusion that the disk is faulty in either case. If so, restoring service is a big job; the replacement disk might need to be formatted, and certainly does need to be partitioned. You then need to re-create the original volumes, and perform a full restore from tape.


CAUTION: When replacing a controller for diagnostic purposes, make sure that it can properly support any drives that are attached to it. Some older controllers cannot support drives larger than a gigabyte, for example. These may actually destroy data stored on a disk that is beyond their storage capacity.

If you decide that either the disk or the controller is physically faulty, replace it. There generally isn't anything you can do to repair physical faults in either case, and your priority must be to restore the server to production mode as quickly as possible.

Using VREPAIR To Solve Volume Errors. VREPAIR.NLM is designed to check the integrity of information on server disks at the volume, directory, and file levels. It cannot directly detect physical disk errors, but quite often you can make deductions about such problems based on error reports from VREPAIR.

Use VREPAIR to examine and fix problems on volumes that are visible under INSTALL.NLM (see "Checking the Volumes," earlier in this chapter) but that will not mount. If all volumes are mounted and you still suspect disk errors, dismount each volume in turn and run VREPAIR on it using the following steps:

1. If you haven't already done so, disable logins and get all users to log off the system. This helps to avoid confusion and possible loss of data by users while volumes are being dismounted and remounted.

2. If you haven't already done so, load VREPAIR. It's best to keep a copy on the server's DOS partition and load it from there--such a copy is always accessible, even when SYS is not:

load c:\netware.312\vrepair

3. Choose Repair a Volume--if only one volume is dismounted, VREPAIR immediately starts to diagnose it.

4. If more than one volume is dismounted, VREPAIR offers you a choice of volumes to repair. Go through the list one at a time, running VREPAIR on each in turn.

5. If VREPAIR finds an error on a volume, it reports the error on-screen. Take note of the type of error--for example, directory entry mismatch. The file names don't matter in this context, although they might be of interest to a user whose data has possibly been lost.

6. Keep going until VREPAIR has completed its sweep of the volume. If there are very many errors, you can speed up the process by pressing F1 when an error report is on-screen and turning off the Report Errors to Screen option.

7. If VREPAIR found errors on the volume, run VREPAIR on the same volume again. Repeat as necessary until VREPAIR reports no errors for the volume. If the errors persist--if VREPAIR reports the same errors during subsequent sweeps, indicating that it could not repair them, or if the total number of errors refuses to fall over a number of sweeps--then there probably is a physical fault.

8. Repeat this process for any other volumes that might be having problems. In particular, if VREPAIR detects errors on one volume of a physical disk, use VREPAIR to scan all other volume segments on the same disk.

9. Mount any volumes that were dismounted, using the mount command, as follows:

mount sys

If the volume fails to mount, check the available cache buffers. If this value is less than approximately 40%, then it is possible that the server could refuse to mount a perfectly good volume.

10. Reenable logins when you are satisfied that the problem has been resolved.
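The decision in step 7 depends on how the error count behaves over repeated sweeps. The following Python sketch shows one way to apply it to counts you note down by hand after each run; the function name vrepair_verdict is hypothetical, and the three-sweep window used to decide that the count "refuses to fall" is an assumption.

```python
def vrepair_verdict(error_counts):
    """Interpret the number of errors reported by successive VREPAIR
    sweeps of one volume, oldest sweep first."""
    if not error_counts:
        return "no sweeps recorded"

    # The latest sweep found nothing: the volume is considered repaired.
    if error_counts[-1] == 0:
        return "clean"

    # Assumption: three sweeps in a row without any decrease count as
    # errors that persist, which the text links to a physical fault.
    recent = error_counts[-3:]
    if len(recent) == 3 and recent[1] >= recent[0] and recent[2] >= recent[1]:
        return "suspect physical fault"

    return "run VREPAIR again"


if __name__ == "__main__":
    print(vrepair_verdict([12, 3, 0]))
    print(vrepair_verdict([12, 12, 12]))
    print(vrepair_verdict([12, 5]))
```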

Network Adapters

Network adapters are less likely to fail than hard disks, perhaps because they have no moving parts. They can and do fail physically, but this is relatively rare, and the majority of problems with server network adapters can be traced to software or configuration issues. When a network error is traced to a server, therefore, it is advisable to check the server's network configuration before examining the network hardware in detail.


TIP: Make sure that the server's SYS volume is mounted and functioning--if it is not, then the server will not respond, no matter how well the network adapter operates.

Traffic Statistics. Begin with a quick check of traffic statistics for the server's network adapter:

1. Load MONITOR if it is not already loaded.

2. Choose LAN Information from the main menu in MONITOR. MONITOR displays a list of the network drivers that are loaded. Drivers are listed once for each time they were loaded, so if the same driver has been used for two identical network adapters--or for two different frame types on the same adapter--it is listed twice. The traffic statistics for both instances are the same, however, so it doesn't matter which you choose.

3. Scroll down through the LAN Information screen to the General Statistics section. The Total Packets Sent and Total Packets Received values refer to all protocols used by the adapter, not just the protocol shown on this screen. For a properly functioning adapter on a typical network, both values should climb steadily, with increments every second or so. If this is the case, then you may assume that the adapter and its network connection are functional and that any problems you observe are due to frame or protocol misconfiguration.

4. Even on a network with no other traffic, the Total Packets Sent value should increase by one every couple of seconds as the server sends SAP packets to advertise its presence. (Realize that this might be spurious; the adapter has no way of telling how far any of its packets get after it sends them, so these packets might not really be transmitted.) If the Total Packets Sent value is static--or just increasing by a packet every second or two--and the Total Packets Received value is static, then you should treat the adapter as if it were passing no traffic at all.

5. Check the NO ECB Available Count value just below the traffic counts. This value reflects the number of times that a packet arrived at the server and the server had to allocate memory for an Event Control Block from its memory pool. This should never happen; if it does, the server is in need of tuning or the adapter driver is faulty.
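The reasoning in steps 3 and 4 can be sketched as a small function that works on two readings of the counters taken some seconds apart. This Python is illustrative only; the name classify_adapter is hypothetical, the readings are copied by hand from the MONITOR screen, and the one-packet-per-second threshold is taken from the "packet every second or two" guideline above.

```python
def classify_adapter(sent_delta, received_delta, interval_seconds):
    """Classify an adapter from the change in Total Packets Sent and
    Total Packets Received over an observation interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")

    sent_rate = sent_delta / interval_seconds

    # Nothing received, and at most the occasional outgoing packet
    # (such as the server's own SAP broadcasts): treat as dead.
    if received_delta == 0 and sent_rate <= 1.0:
        return "no throughput"

    # Both counters climbing: adapter and connection are probably
    # functional, so look at frame types and protocols instead.
    if sent_delta > 0 and received_delta > 0:
        return "passing traffic"

    return "inconclusive"


if __name__ == "__main__":
    print(classify_adapter(15, 0, 30))
    print(classify_adapter(400, 350, 30))
```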

Frame Type and Protocols. If the adapter appears to be handling at least some incoming and outgoing traffic, take a look at the protocols and frame types in use on the server. Use the protocol command at the server console:

protocol
The following protocols are registered:
  Protocol: IPX  Frame type: VIRTUAL_LAN     Protocol ID: 0
  Protocol: IPX  Frame type: ETHERNET_802.3  Protocol ID: 0
  Protocol: ARP  Frame type: ETHERNET_II     Protocol ID: 806
  Protocol: IP   Frame type: ETHERNET_II     Protocol ID: 800
  Protocol: IPX  Frame type: ETHERNET_II     Protocol ID: 8137

This lists all protocols the server knows about. It does not differentiate between different adapters or drivers or between different instances of the same driver. It is useful as a first step, however. If you don't see IPX loaded, for example, or if the server isn't using the frame type used by your workstations, then something obviously is amiss.
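If you have the output of the protocol command in a text file (from CONSOLE.LOG, for example), a short script can check it for the protocol and frame type you expect. The following Python sketch assumes the line layout shown in the listing above; the names parse_protocols and is_registered are hypothetical.

```python
import re

# Matches lines such as:
#   Protocol: IPX  Frame type: ETHERNET_802.3  Protocol ID: 0
PROTOCOL_LINE = re.compile(
    r"Protocol:\s*(\S+)\s+Frame type:\s*(\S+)\s+Protocol ID:\s*(\S+)"
)


def parse_protocols(text):
    """Return a list of (protocol, frame_type, protocol_id) tuples."""
    return [match.groups() for match in PROTOCOL_LINE.finditer(text)]


def is_registered(entries, protocol, frame_type):
    """True if the protocol is registered with the given frame type."""
    return any(
        p.upper() == protocol.upper() and f.upper() == frame_type.upper()
        for p, f, _ in entries
    )


if __name__ == "__main__":
    sample = """The following protocols are registered:
  Protocol: IPX  Frame type: VIRTUAL_LAN     Protocol ID: 0
  Protocol: IPX  Frame type: ETHERNET_802.3  Protocol ID: 0
  Protocol: IP   Frame type: ETHERNET_II     Protocol ID: 800
"""
    entries = parse_protocols(sample)
    print(is_registered(entries, "IPX", "ETHERNET_802.3"))
```

Like the protocol command itself, this check says nothing about which adapter or driver instance a protocol is bound to.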

Use MONITOR.NLM for a closer look:

1. Choose LAN Information from the main menu in MONITOR. MONITOR displays a list of each instance of every network driver that is loaded. Unlike when you checked the traffic statistics using MONITOR earlier, it now matters which instance you choose--check each one in turn unless you know in advance which one is for the particular protocol you want to investigate.

2. Select the adapter driver instance that you want to inspect. MONITOR displays a screenful of statistics for the adapter.

3. Look for the Protocols heading near the top of this window. Any protocols bound to this instance of the adapter are listed there, along with information relevant to the particular protocol. If IPX is bound, for instance, then the network address used on the bind IPX line is displayed here. Remember that separate instances of the adapter are displayed separately by MONITOR--if you don't see a protocol that you expected, it may be because it's listed under a separate instance of the same adapter.

4. Check the other information for each listed protocol. In the case of IPX, the network cable address should be listed; in the case of IP, the server's IP address should be listed. Verify the details of what MONITOR displays here.

By now, you should be able to place the problem in one of the following categories:

If the problem falls into one of the first two categories, load the correct adapter driver with the correct frame type. Load the appropriate protocol stack, if necessary, and then bind it to the adapter.

Problems in the third category generally can be resolved by reviewing the server's communications parameters--for example, MINIMUM PACKET RECEIVE BUFFERS--and adjusting them appropriately.

The remainder of the server network adapter coverage in this chapter deals with problems in the last category--where the adapter shows no life whatsoever or at most just an occasional outgoing packet.

No Adapter Throughput. Such symptoms can be caused by either a malfunctioning adapter or a faulty network connection. Chapter 31, "Locating the Problem: Server versus Workstation versus Links," describes how to distinguish between these two classes of fault; the description of a simple, single-strand thinwire network is particularly relevant in this case. Follow the instructions in that chapter to determine whether the adapter or its network connection is at fault.

If the adapter is still totally lifeless at this stage, check the following:

After checking all of that, only the crudest of debugging methods remains: Replace the adapter with one that you know to be functional. Ideally, use an identical adapter model.

If the replacement adapter doesn't work either, then the server system probably is faulty. Double-check this by installing the original adapter in another computer and testing it. If it works in the second system, then the first system needs attention. Of course, if the original adapter doesn't work in either system, it is possible that both the adapter and motherboard in the original server are faulty, perhaps as a result of a power surge.

If the replacement adapter works in the original server, then the original adapter is either faulty or misconfigured. Check the configuration of the original adapter.
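The swap test in the last three paragraphs is a small decision table, sketched here in Python. The function name adapter_swap_verdict is hypothetical, and the inputs are simply the outcomes of the two tests described above.

```python
def adapter_swap_verdict(replacement_works_in_server,
                         original_works_elsewhere=None):
    """Interpret the results of swapping a known-good adapter into the
    server and, if needed, testing the original adapter elsewhere.

    original_works_elsewhere is None when that second test has not
    been carried out yet.
    """
    if replacement_works_in_server:
        return "original adapter is faulty or misconfigured"
    if original_works_elsewhere is None:
        return "server probably faulty: test the original adapter in another computer"
    if original_works_elsewhere:
        return "server system needs attention"
    return "adapter and server may both be faulty"


if __name__ == "__main__":
    print(adapter_swap_verdict(True))
    print(adapter_swap_verdict(False))
    print(adapter_swap_verdict(False, original_works_elsewhere=True))
```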

Links--Connectivity Problems

Communications problems are often much more difficult to identify than server faults. The server is relatively finite, with a small number of components located at a single point in space; the network usually is more amorphous, comprising many active and passive components and possibly extending over a number of buildings. It is particularly important, therefore, that a logical investigative procedure be adopted when checking out communications faults.

Overview of the Fault

Start by determining the extent and general nature of the problem. A report from a single user of an inability to connect using the network could have many causes: workstation misconfiguration, user error, a server down, or perhaps a genuine network fault. If you have not already done so, use the procedures outlined in chapter 31, "Locating the Problem: Server versus Workstation versus Links," to determine whether the fault is definitely network-related.

The next step is to try to identify the extent of the problem in terms of the network. The details of how you go about this, and the questions you need to answer, depend on your particular network configuration; for starters, at least, consider the following:

It should be possible to zero in on the affected part of the network using this type of logic, with a certain amount of testing along the way. Apart from attempting to establish connections between a workstation and server, you may want to test individual cable runs with a cable tester.


TIP: Discard any faulty cable sections, connectors, or terminators as soon as you discover them. If they are left lying around, they might be reused by mistake, and you could find yourself debugging the same problem a second time.

Structural Limitations

Network problems sometimes arise as a result of the expansion of the network. The maximum number of nodes, maximum cable length, and other parameters are defined in the standards for each type of network; breaching these limits can mean overloading the active equipment or expecting too much of a network adapter or cable run. The following sections contain guidelines for each subnetwork.

Thinwire Networks. The relevant limits for thinwire or coaxial cable networks are the following:

Thinwire cable is more prone to damage and poor connections than other types. This is, in part, because it is physically lighter than thickwire cable, yet less flexible than twisted pair. It also is the most likely to have incorrectly installed connectors.

As a matter of policy, use the smallest possible number of cable segments in any given run. This minimizes the amount of cable splicing and reduces the risk of a faulty connection. It's also important to avoid crimping the cable or bending it in such a way as to cause kinks, which can damage the insulation. Try to secure the cable to fixed points rather than letting it run loosely across the floor (where you might pinch it by, for example, moving furniture).

When performing diagnostic tests on a network adapter with a thinwire (BNC) connector, attach a T-piece to the adapter and connect a 50-ohm terminator to each end of the T-piece.


TIP: Individual thinwire cable segments may be tested using an ordinary digital multimeter. Terminate one end of the cable segment using a 50-ohm terminator and attach the multimeter to the other end. The resistance should be 50 ohms. Now shake the cable a bit, in particular jiggling the ends near any connectors. Watch for any sudden changes in the resistance displayed by the multimeter--these indicate a poor quality connection that might lead to intermittent faults.
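If you write down the multimeter readings while shaking the cable, the check can be made systematic. The Python below is a minimal sketch; the function name check_segment is hypothetical, and the tolerance and jump thresholds are assumptions chosen for illustration, not values from any standard.

```python
def check_segment(readings_ohms, nominal=50.0, tolerance=5.0, max_jump=2.0):
    """Flag suspicious resistance readings for one terminated segment.

    readings_ohms -- readings taken in sequence while jiggling the cable
    nominal       -- expected resistance with a 50-ohm terminator fitted
    tolerance     -- assumed acceptable deviation from nominal
    max_jump      -- assumed largest acceptable change between readings
    """
    problems = []

    # Readings far from the terminator's value suggest a bad segment.
    for i, value in enumerate(readings_ohms):
        if abs(value - nominal) > tolerance:
            problems.append(
                f"reading {i}: {value} ohms is outside {nominal} +/- {tolerance}"
            )

    # Sudden changes between consecutive readings suggest a poor connection.
    for i in range(1, len(readings_ohms)):
        before, after = readings_ohms[i - 1], readings_ohms[i]
        if abs(after - before) > max_jump:
            problems.append(
                f"reading {i}: sudden change from {before} to {after} ohms"
            )

    return problems


if __name__ == "__main__":
    for problem in check_segment([50.0, 50.1, 120.0, 50.2]):
        print(problem)
```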

Thickwire Networks. The relevant limitations for thickwire Ethernet networks are the following:

1,640-(S÷3.28 meters)

Twisted-Pair Networks. The relevant limitations for twisted-pair networks are the following:

A twisted-pair adapter must be connected to a properly functioning concentrator before full diagnostics can be carried out.

Workstation Problems

Client workstations often are the least stable part of a network. This is partly because they are not dedicated network devices in the same way as file servers or active network equipment and also because these are the pieces of the network to which users have direct access.

Private, office desktop PCs are often prone to reconfiguration. The owner may decide to upgrade the operating system or hardware without consulting you or might try out some completely inappropriate network software. When the computer can no longer access the network, the problem suddenly becomes yours.

Public access computers, such as those used in student laboratories or shared facilities, fall victim to another set of problems. These machines often are in use for many hours at a stretch every day, far more than the typical office or home computer. They also are used by a number of different people in the course of a typical day, making the machines much more prone to physical wear and tear, virus infection, or someone's burning need to delete all that messy operating system stuff to make room for a really neat game.

The volatility of the average network client can be multiplied by the total number of client machines to estimate the magnitude of this particular maintenance headache for the network administrator. Most networks have far more client workstations than the combined total of servers, routers, concentrators, and computer support staff.

All in all, client faults should be expected to form a significant portion of the problems arising on any network. The remainder of this chapter deals with the most common problems, and suggests some possible solutions.


CAUTION: You must establish at an early stage where your responsibility toward a user's workstation starts and ends. If you are responsible only for the user's network connection and not for the configuration of the machine for optimal performance, make this clear at the outset; if your responsibilities extend to network software optimization but not to application tuning, let the user know. This can help you avoid being sucked into support issues far beyond the call of duty. If it is completely up to you to configure the machine in all details--system and network software, applications, and hardware--then this is not an issue.

Troubleshooting Procedures

Start by trying to replicate the reported problem. Remember that, unlike server and network equipment fault reports, workstation faults are likely to be reported by users without the technical expertise to tell a network fault from a typographical error. Normal network behavior is sometimes viewed as aberrant by new users. Discuss the circumstances of the problem with the reporting user, and establish whether what the user saw was actually a network-related fault.

Once you have satisfied yourself that there is a reproducible error or reasonable cause to suspect one, then you can begin to look for the underlying cause. How you go about this depends on the nature of the problem's symptoms.

Bootup Trouble. First, make sure that the workstation functions correctly as a stand-alone machine. At a minimum, this means that the machine should boot up normally without connecting to the network. The following guidelines should facilitate your assessment:

Errors Loading Drivers. Watch out for error messages while the workstation loads the network drivers. These are relatively uncommon, but occasionally you might see an error message from IPXODI.COM or the adapter's MLID if the version of LSL installed on the workstation is particularly old.

Verify that LSL.COM, IPXODI.COM, and the MLID are all loading successfully. If not, check the NET.CFG file for possible errors. Make sure that the workstation is using the latest available version of LSL.COM and IPXODI.COM; try to update the adapter's MLID if possible.

The most common error that occurs at this stage is generated when the shell or redirector cannot attach to a server. The shell reports "A File Server could not be found," but the redirector is more verbose: "A file server could not be found. Check the network cabling and the server's status before continuing." Both cases are dealt with in the following section.

Workstation Can't Connect to Server. Establishing the initial connection between the workstation and the server is often problematic. It is fairly easy to get to this point with an improper configuration or a faulty network adapter; the network drivers generally load without complaint. But that final step, where the workstation and server recognize each other, can't happen unless the adapter and its network connection are functioning, and the workstation's network configuration is at least mostly correct.

Start by checking whether the adapter is functioning properly. The basic criterion is whether the adapter can send packets to the network and receive packets from it:

If the adapter can transmit and receive packets on the network but still cannot connect to a server, then the problem is most likely due to misconfiguration of the workstation's network client software. There are a few other simple possibilities that must be eliminated first, though:

1. Has a preferred server been specified for the workstation's shell or requester? If not, and if no servers on the network are set to reply to GET NEAREST SERVER requests, the workstation will not get any response from a file server. Try loading the NetWare shell with the /ps=<server> option (if using NETX), or add a preferred server=<server> line to the NetWare DOS Requester section of the workstation's NET.CFG file (if using VLM).

2. Is the server working? Check that you can attach to it from another workstation. If this is impossible, then refer to chapter 31, "Locating the Problem: Server versus Workstation versus Links," for broad troubleshooting instructions.

3. Does the server have the right protocol loaded? If so, is it bound to the correct adapter using the correct frame type? If in doubt, refer to the "Server Problems" section earlier in this chapter.

4. Are there any available connection slots on the file server? Examine the number of connections in use (shown on the main screen of MONITOR under the heading Connections in Use). If none are available, clear some connections and try again.

At this stage you have established that the adapter is functioning, the network connection is valid, and a working server with a free connection slot is available on the network. The only remaining reason why the workstation would be unable to connect is a mismatch of frame or protocol types.

Review the workstation's NET.CFG file:
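As a rough aid when reviewing the file, the following Python sketch reads the text of a NET.CFG file and reports the frame types and preferred server it declares, so that they can be compared with the server's settings. It assumes the usual layout (section headings in the first column, options indented beneath them, comment lines beginning with a semicolon). The function names, the sample file contents, and the server name SERVER1 are all hypothetical.

```python
def parse_net_cfg(text):
    """Parse NET.CFG text into {section_heading: [(option, value), ...]}.

    Headings and option names are lower-cased so that lookups do not
    depend on how the file happens to be capitalized.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(";"):
            continue                      # skip blank lines and comments
        if not line[0].isspace():         # heading: starts in column one
            current = stripped.lower()
            sections.setdefault(current, [])
        elif current is not None:         # option: indented under a heading
            if "=" in stripped:
                key, value = stripped.split("=", 1)
            else:
                parts = stripped.split(None, 1)
                key, value = parts[0], parts[1] if len(parts) > 1 else ""
            sections[current].append((key.strip().lower(), value.strip()))
    return sections


def frame_types(sections):
    """Return every frame type declared under a Link Driver heading."""
    return [
        value
        for name, options in sections.items() if name.startswith("link driver")
        for key, value in options if key == "frame"
    ]


def preferred_server(sections):
    """Return the preferred server named in the requester section, if any."""
    for name, options in sections.items():
        if name.startswith("netware dos request"):   # accepts either spelling
            for key, value in options:
                if key == "preferred server":
                    return value
    return None


def frame_mismatch(sections, server_frame):
    """True if none of the workstation's frame types matches the server's."""
    return server_frame.lower() not in [f.lower() for f in frame_types(sections)]


# Hypothetical NET.CFG contents, for illustration only.
SAMPLE = """\
Link Driver NE2000
    INT 5
    PORT 340
    FRAME Ethernet_802.3

NetWare DOS Requester
    FIRST NETWORK DRIVE = F
    PREFERRED SERVER = SERVER1
"""

if __name__ == "__main__":
    cfg = parse_net_cfg(SAMPLE)
    print("Frame types:", frame_types(cfg))
    print("Preferred server:", preferred_server(cfg))
    print("Mismatch with Ethernet_802.2 server:", frame_mismatch(cfg, "Ethernet_802.2"))
```

In this made-up example the workstation declares only Ethernet_802.3, so it could not talk to a server whose protocol is bound with Ethernet_802.2. That is the kind of mismatch this review is meant to catch.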

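If any changes are made to the NET.CFG file, you need to unload the network client software and then reload it before attempting to reconnect. Unload the components in the reverse of the order in which they were loaded (the requester first, LSL.COM last), and then reload them in the normal order: LSL.COM, the MLID, IPXODI.COM, and finally the requester. Here's an example for a workstation using the NE2000 MLID and VLM:

vlm /u
ipxodi /u
ne2000 /u
lsl /u
lsl
ne2000
ipxodi
vlm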

Some MLIDs are a little buggy, so you might need to reboot the workstation after making such changes.

User Can't Log in to Server. With the basic connection established between server and workstation, most of the work is done. There still are some things that can go wrong, though. One of the more frustrating problems from a user's point of view is the inability to log in, in spite of being able to get to the server's login prompt. When you are faced with such a situation, proceed in the following manner:

login server1/jdean

Workstation Hanging. Intermittent hanging of the workstation is another frustrating problem. This almost always is caused by a memory or hardware interrupt configuration clash. Refer to chapters 6, "The Workstation Platform," and 11, "Network Client Software," for detailed information on the relevant configuration issues, and how to check for and resolve resource conflicts of this type.

The workstation's power supply is another possible cause of such glitches. A digital multimeter may be used for a simple check of the output voltage, but it won't necessarily catch periodic or occasional fluctuations. If you suspect the power source rather than the workstation's power supply, consider investing in a low-end UPS for the workstation. Refer to chapter 18, "Backup Technology: Uninterruptible Power Supplies," for details.

Windows Problems. Some workstation faults appear under Windows but not under DOS. These most often are due to configuration issues, in particular shared RAM.

If the workstation uses EMM386 for memory management, make sure that the adapter's shared RAM has been excluded using the X= option on the EMM386 load line in the workstation's CONFIG.SYS file. Check the EmmExclude= line in the Windows SYSTEM.INI file, too.

If these lines appear correct, check whether the adapter uses 16K or just 8K of shared RAM--some cards use as little as 2K. What matters is that EMM386 and Windows know which area to exclude, and that means specifying the precise memory range, not just the starting point.
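Working out the precise range is simple hexadecimal arithmetic, sketched below in Python. Segment addresses count 16-byte paragraphs, so each 1K of shared RAM spans 40 (hex) segment units. The base address D000 used in the example is an assumption; substitute the address your adapter is actually configured to use.

```python
def shared_ram_range(base_segment, size_kb):
    """Return the 'start-end' segment range covering size_kb of shared RAM.

    base_segment is the segment address where the adapter's shared RAM
    begins (for example 0xD000); size_kb is the amount of shared RAM in K.
    """
    if size_kb <= 0:
        raise ValueError("size_kb must be positive")
    paragraphs = size_kb * 1024 // 16          # 16 bytes per segment unit
    end_segment = base_segment + paragraphs - 1
    return f"{base_segment:04X}-{end_segment:04X}"


if __name__ == "__main__":
    # Assumed base address of D000 with 16K, 8K, and 2K of shared RAM.
    for size in (16, 8, 2):
        print(f"{size:>2}K  X={shared_ram_range(0xD000, size)}")
```

For a 16K window at D000 this gives D000-D3FF. The same range belongs after X= on the EMM386 line and after EmmExclude= in SYSTEM.INI.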

Another common cause of Windows problems is having mismatched network driver files. NETWARE.DRV and its associated files should always be upgraded as a set; refer to chapters 11, "Network Client Software," and 12, "Network Client Software for 32-Bit Windows," for details.

Application Problems. Many problems that appear to be the fault of the workstation are in fact due to unusual behavior by application software. The application might attempt to write a temporary file to a location where the user does not have write access, for example. A well-written application informs the user what it has tried to do and what error has occurred, but many programs simply ignore such errors or even hang the workstation.

On investigation, you should be able to determine what the application was trying to do and where. Some applications write to the location specified by the TMP or TEMP environment variables. Others write to the current default directory, while some insist on trying to write to the directory where the application itself is stored. Be careful not to grant write access too liberally as a quick fix--refer to chapter 20, "Tools for Restricting Access," for advice on access restriction as related to application setup.
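Where a scripting language is available on the workstation, a quick test of the TMP and TEMP locations might look like the following Python sketch. It assumes nothing about the application itself; it simply tries to create a scratch file in each directory and reports what happened. The function name and status strings are hypothetical.

```python
import os
import tempfile


def check_temp_dirs(env=None):
    """Report whether the directories named by TMP and TEMP accept new files.

    Returns {variable: status}, where status is one of
    'not set', 'missing', 'not writable', or 'writable'.
    """
    env = os.environ if env is None else env
    results = {}
    for var in ("TMP", "TEMP"):
        path = env.get(var)
        if not path:
            results[var] = "not set"
        elif not os.path.isdir(path):
            results[var] = "missing"
        else:
            try:
                # Creating a scratch file (removed again automatically) is a
                # more direct test than inspecting attributes or rights.
                with tempfile.TemporaryFile(dir=path):
                    pass
                results[var] = "writable"
            except OSError:
                results[var] = "not writable"
    return results


if __name__ == "__main__":
    for variable, status in check_temp_dirs().items():
        print(f"{variable}: {status}")
```

A result of "missing" or "not writable" for a variable that the application relies on points to the directory, rather than the application, as the thing to fix.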

Some application-specific problems can be resolved by adjusting the client configuration. Many applications--Windows applications in particular--need to hold a large number of files open simultaneously. Increasing the FILES= setting in the workstation's CONFIG.SYS can solve problems in such cases.

Other applications are sensitive to the combination of the read only compatibility= setting in NET.CFG and the user's write access on the server. In such cases, refer to the documentation for the particular application that is causing difficulty--there is no single correct way to resolve these issues.

Summary

Tackling faults on a live network can be a stressful business. Make it easier on yourself and your users by announcing that you are aware of the problem and working on it. If possible, announce downtime in advance. This should give you the space required to tackle the problem in a cool, logical way. Take care when making preliminary assessments of the type of problem; a mistake at this stage can cost a lot of precious time. And once you locate the source of the problem, draw freely on all the information available--including the relevant chapters of this book--before launching into a solution.




Macmillan Computer Publishing USA

© Copyright, Macmillan Computer Publishing. All rights reserved.