The Internet RFC 789 Case Study 4 – Faults and Solutions

Events like the ARPAnet crash are considered major failures: the network went down for quite some time. With today's management tools and software, the ARPAnet managers may have been able to avoid the crash completely, or at least detect and correct it much more efficiently than they did.

Dropped Bits

The three updates created through IMP 50 were not checked thoroughly for protection against dropped bits. Increased planning, organising and budgeting would have been valuable to the network. The managers realised that not enough resources were allocated to protection; resources were scarce, and the available CPU cycles and memory were allocated to other functions. Instead, only the update packets as a whole were checked for errors. Once an IMP receives an update it stores the information from the update in a table; if a re-transmission of the update is required, it merely sends a packet based on the information in that table. This means that for maximum reliability the tables would also need to be checksummed. Again, this would not appear to be a cost-effective option, as checksumming large tables requires a lot of CPU cycles (Rosen, 1981).
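To illustrate the trade-off, the following is a minimal Python sketch, not the IMPs' actual code: the checksum algorithm, packet fields and table size are all invented for illustration.

```python
# Hypothetical illustration of the trade-off described above: checksumming a
# small update packet is cheap, but re-verifying a large routing table on
# every retransmission costs far more CPU work. A simple 16-bit additive
# checksum stands in for whatever the IMPs actually used.

def checksum(data: bytes) -> int:
    """Sum all bytes modulo 2**16 -- a stand-in for the real algorithm."""
    return sum(data) % 65536

def verify(data: bytes, expected: int) -> bool:
    return checksum(data) == expected

# A routing update packet is small: checking it is cheap.
update_packet = bytes([44, 50, 3, 7])          # made-up fields
packet_sum = checksum(update_packet)
assert verify(update_packet, packet_sum)

# The table an IMP builds from updates is much larger; checksumming it on
# every retransmission multiplies the CPU cost by the table size.
routing_table = bytes(64 * 1024)               # e.g. a 64 KB table
table_sum = checksum(routing_table)

# A single dropped bit in the stored table goes undetected unless the
# table itself is checksummed -- the hole described above.
corrupted = bytearray(routing_table)
corrupted[100] ^= 0b000100                     # flip one bit
assert not verify(bytes(corrupted), table_sum)
```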

Apart from the lack of checksumming, the hardware had parity checking disabled. This was because the hardware reported parity errors when in fact there were none. This is a common security problem: fail-safe measures are installed but then simply disabled because they do not work correctly. Instead, the system should have been repaired so that it reported error messages correctly.

More checksumming may have detected the problem, but this is not to say that checksumming will always be free of cracks for bits to fall through. Another option would have been to modify the routing algorithm, but again this could have fixed one problem while allowing others to arise.

Performance and Fault Management

The crash of the ARPAnet was not caused by a single technical fault; rather, a number of faults added together to bring the network down. No algorithm or protection strategy can be guaranteed to be fail-safe. Instead, the managers should have aimed to reduce the likelihood of a crash or failure, and in the event of one, have had detection and performance methods in place that would have revealed earlier that something was wrong. Detecting high-priority processes that consume a given (high, e.g. 90) percentage of system resources would allow updates to occur while still allowing the line up/down packets to be sent.
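As a rough sketch of that detection idea (the process names, CPU figures and 90% threshold are all hypothetical; a real monitor would read these from the running system):

```python
# Hypothetical sketch of resource-threshold detection. The process names
# and CPU figures are invented for illustration.

CPU_THRESHOLD = 90.0  # percent; the "high, e.g. 90" figure from the text

def over_budget(cpu_by_process: dict, threshold: float = CPU_THRESHOLD) -> list:
    """Return the processes exceeding the CPU threshold."""
    return [name for name, cpu in cpu_by_process.items() if cpu > threshold]

# Sample readings: routing-update processing has starved everything else,
# including the line up/down protocol.
sample = {"routing-updates": 97.5, "line-up-down": 0.5, "forwarding": 2.0}
assert over_budget(sample) == ["routing-updates"]
```

Flagging the offending process early would have pointed straight at the runaway routing-update loop rather than leaving the managers to infer it from downed lines.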

If the ARPAnet managers had been able to properly control the network through network monitoring tools that report on system performance, they might have been able to respond to the problem before the network became unstable and unusable. Network monitoring can be active (routinely testing the network by sending messages) or passive (collecting data and recording it in logs). Either type might have been a great asset, as the falling performance could have been detected, allowing them to avoid fire-fighting the problem, i.e. trying to repair it after the damage is done (Dennis, 2002). Having said this, system resources, as already mentioned, were scarce, and the sending and recording of data requires memory and CPU time that might not have been available; some overall speed might have needed to be sacrificed to allow for network monitoring.
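The two styles can be contrasted in a short sketch; the probe mechanism, log format and thresholds are assumptions for illustration only.

```python
# Hypothetical sketch contrasting active and passive monitoring. A real
# tool would send test traffic to a neighbour and read device counters.

import time

log = []  # passive record: (timestamp, kind, value) tuples

def active_probe(send_message) -> float:
    """Active monitoring: send a test message and time the round trip."""
    start = time.monotonic()
    send_message()                              # e.g. an echo to a neighbour
    rtt = time.monotonic() - start
    log.append((time.time(), "probe-rtt", rtt))  # passive side: record it
    return rtt

def degrading(samples: list, limit: float) -> bool:
    """Passive monitoring: flag falling performance from recorded samples."""
    return len(samples) >= 3 and all(s > limit for s in samples[-3:])

# Three consecutive slow round trips would have flagged the problem early.
assert degrading([0.01, 0.02, 0.5, 0.6, 0.7], limit=0.1)
assert not degrading([0.01, 0.02, 0.03], limit=0.1)
```

Note the cost the paragraph mentions: every probe and every log append consumes exactly the CPU and memory that were scarce on the IMPs.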

Other network management tools such as alarms could also have allowed the problem to be corrected much more efficiently by alerting staff as soon as a fault occurred. Alarm software would not only have alerted staff earlier but would also have made it easier to pinpoint the cause of the fault, allowing a fix to be implemented quickly.

Network Control

The lack of control over the misbehaving IMPs proved to be a great factor in the downtime. The IMPs were a number of kilometres away from each other, and fixes and patches were loaded onto the machines remotely, which took several hours because of the network's slow speeds. The lack of control over the network's allocation of resources was detrimental to its recovery. Even with the network in a down state, modern monitoring tools would have been able to download software to a network device, configure parameters, and back up, change and alter settings at the manager's discretion (Duck & Read, 2003).

Planning and Organising

The lack of proper planning, organising and budgeting was one of the main factors that caused the network to fail. The ARPAnet managers were aware of the lack of protection against dropped bits, but due to cost and hardware constraints, and an 'it won't happen to us' attitude, they were prepared to disregard it.

Better forecasting and budgeting might have meant they were able to put in place more checking, which could have picked up the problem straight away. Only the packets were checked (checking the tables that the packets were copied into was not considered cost-effective), and this left a large hole in their protection. Evidently the risk was not considered a large enough problem, on the assumption that it probably would not happen. Documenting the fact that there was no error checking on the tables might also have reduced the amount of time it took to correct the error (Dennis, 2002).

The RFC 789 case study points out that they knew bit dropping could be a future problem. If this is true, then procedures should have been documented for future reference as a means of solving the problem of dropped bits. Planning could have severely cut the downtime of the network.

Correct application of the five key management tasks, together with the network management tools available today, would have made the ARPAnet crash of 1980 avoidable, or at least easier to detect and correct.

Better planning and documentation would have allowed the managers to look ahead, acknowledging any gaps in their protection. They could have prepared documentation and procedures to highlight the aspects of the network that lacked protection.

Once the crash had occurred, uncertain decisions were made as to what the problem could possibly be. Planning, organising and directing could have helped resolve the situation in a more productive manner than the fire-fighting technique used.

The Internet RFC 789 Case Study 3 – How the ARPA Crash Occurred

An interesting and unusual problem occurred on October 27th, 1980 in the ARPA network. For several hours the network was unusable but still appeared to be online. The outage was caused by a high-priority process executing and consuming more system resources than it should. The ARPAnet's IMPs (Interface Message Processors), which were used to connect computers to each other, suffered a number of faults. Restarting individual IMPs did nothing to solve the problem: as soon as they connected to the network again, the IMPs continued with the same behaviour and the network was still down.

It was eventually found that there were bad routing updates. These updates are created at least once per minute by each IMP and contain information such as the IMP's direct neighbours and the average packets per second across each line. The fact that the IMPs could not keep their lines up was also a clue: it suggested that they were unable to send the line up/down protocol messages because of heavy CPU utilisation. After a period of time the lines would have been declared down because the line up/down protocol could not be sent.

A core dump showed that all IMPs had routing updates waiting to be processed, and it was later revealed that all of the updates came from one IMP, IMP 50.

It also showed that IMP 50 had been malfunctioning before the network outage, unable to communicate properly with its neighbour IMP 29, which was itself malfunctioning: IMP 29 was dropping bits.

The updates waiting to be processed from IMP 50 followed a pattern: 8, 40, 44, 8, 40, 44, and so on. This was because of the way the algorithm determined which update was most recent: 44 was considered more recent than 40, 40 more recent than 8, and 8 more recent than 44. This set of updates therefore formed an infinite loop, and the IMPs were spending all their CPU time and buffer space processing it, accepting each update because the algorithm judged it more recent than the last. The loop was easily broken by ignoring any updates from IMP 50; but what remained to be found was how IMP 50 managed to get three such updates into the network at once.
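The cyclic comparison can be reproduced with a wraparound test on the 6-bit sequence numbers. The half-range (32 out of 64) convention in this sketch matches the behaviour just described, but the exact constants are an assumption, not a quote from the IMP code:

```python
# Sketch of the wraparound "more recent" test on 6-bit sequence numbers.
# Assumed rule: a beats b when a is 1..32 steps ahead of b, modulo 64.

def more_recent(a: int, b: int) -> bool:
    """True if 6-bit sequence number a is judged more recent than b."""
    d = (a - b) % 64
    return 0 < d <= 32

# The three updates from IMP 50 beat each other in a cycle:
assert more_recent(44, 40)   # 44 beats 40
assert more_recent(40, 8)    # 40 beats 8 (difference exactly 32)
assert more_recent(8, 44)    # 8 beats 44 -- the wraparound

# So an IMP holding any one of them always accepts the next, forever:
held = 8
for incoming in [40, 44, 8, 40, 44, 8]:
    assert more_recent(incoming, held)
    held = incoming
```

Wraparound comparisons like this are normally safe because sequence numbers advance one at a time; it takes three corrupted values sitting in the right arcs of the circle to form a cycle.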

The answer was in IMP 29, which was dropping bits. Looking at the six bits that make up the sequence numbers of the updates, we can see the problem:

8 – 001000

40 – 101000

44 – 101100

If the first update was 44, then 40 could easily have been created by an accidentally dropped bit, and 40 could in turn become 8 by dropping another bit. This would produce three updates from the same IMP that together create the infinite loop.
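The two bit drops can be checked directly against the binary patterns above:

```python
# Demonstration of how single dropped (cleared) bits turn 44 into 40 and
# then into 8, matching the 6-bit patterns listed above.

def drop_bit(value: int, bit: int) -> int:
    """Clear one bit of a 6-bit sequence number (a 'dropped bit')."""
    return value & ~(1 << bit) & 0b111111

assert format(44, "06b") == "101100"
assert drop_bit(44, 2) == 40          # 101100 -> 101000
assert format(40, "06b") == "101000"
assert drop_bit(40, 5) == 8           # 101000 -> 001000
assert format(8, "06b") == "001000"
```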

The Internet RFC 789 Case Study 2 – Network Management

Network managers play a vital part in any network system: the organisation and maintenance of networks so that they remain functional and efficient for all users. They must plan, organise, direct, control and staff the network to maintain speed and efficiency for all users. Once these tasks are completed, the four basic functions of a network manager will be covered: configuration management, performance and fault management, end-user support, and management of the ongoing costs associated with maintaining the network.

Network Managing Tasks

The five key tasks in network management as described in Networking in an Internet Age by Alan Dennis (2002, p.351) are:

Careful planning of the network, which includes forecasting, establishing network objectives, scheduling, budgeting, allocating resources and developing network policies.

Organising tasks, which include developing an organisational structure, delegating, establishing relationships, establishing procedures and integrating the small organisation with the larger organisation.

Directing tasks – initiating activities, decision making, communicating and motivating.

Controlling tasks – establishing performance standards, measuring performance, evaluating performance and correcting performance.

Staffing tasks – interviewing people, selecting people and developing people.

It is vital that these tasks are carried out; neglect in one area can cause problems later down the track. For example, bad organisation could mean an outage lasts twice as long as it should, or bad decision making when choosing the network topology and communication methods could mean the network is not fast enough for the organisation's needs even when running at full capacity.

Four Main Functions of a network manager

The functions of a network manager can be broken down into four basic functions: configuration management, performance and fault management, end-user support and cost management. Sometimes the tasks a network manager performs cover more than one of these functions, such as documenting the configuration of hardware and software, performance reports, budgets and user manuals. The five key tasks of a network manager must be done in order to cover these basic functions, as this will keep the network working smoothly and efficiently.

Configuration management

Configuration management is managing a network's hardware and software configuration and documentation. It involves keeping the network up to date, adding and deleting users and managing the constraints those users have, as well as writing documentation for everything from hardware and software to user profiles and application profiles.

Keeping the network up to date involves changing and reconfiguring network hardware, as well as updating software on client machines. Electronic software distribution (ESD) software is now available that allows managers to install software remotely on client machines over the network without physically touching them, saving a lot of time (Dennis, 2002).

Performance and Fault Management

Performance and fault management are two functions that need to be monitored continually in the network. Performance management is concerned with the optimal settings and setup of the network; it involves monitoring and evaluating network traffic, and then modifying the configuration based on those statistics (Chiu & Sudama, 1992).

Fault management is preventing, detecting and rectifying problems in the network, whether the problem is in the circuits, hardware or software (Dennis, 2002). Fault management is perhaps the most basic of the functions, as users expect a reliable network, whereas a slight improvement in efficiency can go unnoticed in most cases.

Performance and fault management rely heavily on network monitoring, which keeps track of the network circuits and the devices connected to them and ensures they are functioning properly (Fitzgerald & Dennis, 1999).

End User Support

End-user support involves solving any problems that users encounter while using the network. The three main functions of end-user support are resolving network faults, solving user problems and training end users. These problems are usually solved by working through troubleshooting guides set out by the support team (Dennis, 2002).

Cost Management

Costs increase as network services grow; this is a fundamental economic principle (Economics Basics: Demand and Supply, 2006). Organisations are committing more resources to their networks and need effective and efficient management in place to use those resources wisely and minimise costs.

In cost management, TCO (total cost of ownership) is used to measure how much it costs a company to keep a computer operating. It takes into account the costs of repairs, the support staff who maintain the network, and software and hardware upgrades. In addition to these costs it also counts wasted time; for example, the cost to a store manager while staff learn a newly implemented computer system. The inclusion of wasted time is widely accepted, although many companies dispute whether it should be counted. NCO (network cost of ownership) focuses on everything except wasted time: it examines direct costs rather than invisible costs such as wasted time.
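The difference between the two measures comes down to one line of arithmetic. In this worked example all figures are invented for illustration:

```python
# Hypothetical worked example of the TCO/NCO distinction. All dollar
# figures are invented for illustration.

direct_costs = {
    "repairs": 400,
    "support_staff": 2500,
    "software_and_upgrades": 800,
    "hardware_upgrades": 600,
}
wasted_time = 1200   # e.g. staff hours lost learning a new system, costed out

nco = sum(direct_costs.values())   # NCO: direct costs only
tco = nco + wasted_time            # TCO: also counts the invisible cost

assert nco == 4300
assert tco == 5500
```

On these sample figures, wasted time accounts for over a fifth of the total, which is why companies dispute whether it belongs in the measure.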

The Internet RFC 789 Case Study 1

The ARPANET (Advanced Research Projects Agency NETwork) was the beginning of the Internet: a network of four computers put together by the U.S. Department of Defense in 1969. It later expanded into a faster and more public network called NSFNET (National Science Foundation NETwork), which then grew into the Internet as we know it today. On October 27th, 1980, the ARPAnet crashed for several hours, due to high-priority processes that were executing, exhausting system resources and causing downtime within the system (Rosen, 1981).

With today's network management tools the system failure could have been avoided. Network manager responsibilities such as planning, organising, directing, controlling and staffing (Dennis, 2002) would have allowed the situation to be handled correctly had these tools been available. The case study RFC 789 by Rosen summarises that the main problems the managers experienced were the initial detection that a problem existed and control of the problematic software and hardware. Had they been available, these management responsibilities and tools would have allowed for a much quicker and more efficient recovery of the system; and if careful planning and organising had been carried out when the system was implemented, the crash might have been avoided completely.