The Internet RFC 789 Case Study 4 – Faults and Solutions

Events like the ARPAnet crash are considered major failures: the network went down for a considerable time. With the management tools and software available today, the ARPAnet managers may have been able to avoid the crash completely, or at least detect and correct it far more efficiently than they did.

Dropped Bits

The three updates created through IMP 50 were not checked thoroughly for protection against dropped bits. Increased planning, organising and budgeting would have been valuable here: the managers realised that not enough resources had been allocated to error checking, and although resources were scarce, the CPU cycles and memory that were available went to other functions. Only the update packets as a whole were checked for errors. Once an IMP receives an update it stores the information from the update in a table; if a retransmission of the update is required, it simply builds a packet from the information in that table. This means that for maximum reliability the tables would also need to be checksummed. Again, this did not appear to be a cost-effective option, as checksumming large tables requires a lot of CPU cycles (Rosen, 1981).
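As an illustration of the missing protection, a simple scheme might checksum the stored table entry as well as the incoming packet, so corruption in memory is caught before the entry is retransmitted. The following Python sketch is purely illustrative; the 16-bit additive checksum, the names and the data structures are assumptions, not taken from the actual IMP software or RFC 789.

# Minimal sketch (not the actual IMP code): checksum both the update packet
# and the table entry it is copied into, so a bit dropped in memory is
# caught before the entry is retransmitted.

def checksum16(data: bytes) -> int:
    """Simple 16-bit additive checksum over a byte string."""
    total = 0
    for i in range(0, len(data), 2):
        word = (data[i] << 8) | (data[i + 1] if i + 1 < len(data) else 0)
        total = (total + word) & 0xFFFF
    return total

routing_table = {}  # sequence number -> (update bytes, stored checksum)

def store_update(seq: int, update: bytes) -> None:
    # The checksum is computed once and kept alongside the table entry.
    routing_table[seq] = (update, checksum16(update))

def retransmit(seq: int) -> bytes:
    update, stored_sum = routing_table[seq]
    # Re-verifying here costs CPU cycles, which is exactly the trade-off
    # the ARPAnet designers decided against.
    if checksum16(update) != stored_sum:
        raise ValueError(f"table entry {seq} corrupted in memory")
    return update

The extra verification step is cheap per entry but, as the case study notes, becomes expensive when applied to large tables on the hardware of the time.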

Apart from checksumming, the hardware had parity checking disabled. This was because the hardware reported parity errors when in fact there were none. This is a common security problem: fail-safe measures are installed but simply disabled because they do not work correctly. Instead, the system should have been repaired so that it reported errors accurately.
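A small sketch shows what parity checking buys: a single dropped bit flips the parity of a word and is immediately detectable. The values below are arbitrary examples, not the figures from the incident, and the code is a generic illustration rather than the IMP's parity hardware.

# Even parity over a 16-bit word: one dropped (flipped) bit changes the
# parity and would have been caught had checking not been disabled.

def even_parity(word: int) -> int:
    """Return 0 if the number of 1-bits is even, 1 if it is odd."""
    return bin(word & 0xFFFF).count("1") % 2

stored = 0b101100                 # an arbitrary stored value
corrupted = 0b101000              # the same value with a single bit dropped

if even_parity(corrupted) != even_parity(stored):
    print("parity mismatch: memory error detected")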

More checksumming might have detected the problem, but that is not to say checksumming will always be free of cracks for bits to fall through. Another option would have been to modify the routing algorithm itself, but again this might have fixed one problem while allowing others to arise.

Performance and Fault Management

The crash of the ARPAnet was not a single technical fault; rather, it was a number of faults which added together to bring the network down. No algorithm or protection strategy can be assured to be fail-safe. Instead, the managers should have aimed to reduce the likelihood of a crash or failure and, in the event of one, to have detection and performance methods in place that would reveal earlier that something was wrong. Detecting when a high-priority process consumes more than a given (high, e.g. 90) percentage of system resources would still allow routing updates to occur while leaving capacity for the up/down packets to be sent, as sketched below.
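The following sketch shows the kind of resource watchdog described above: if the routing-update work exceeds a utilisation threshold, it is throttled so that line up/down traffic still gets CPU time. The 90% figure, the function names and the throttling action are all illustrative assumptions.

UPDATE_CPU_THRESHOLD = 0.90   # fraction of CPU the update process may use

def watchdog(sample_cpu_share, throttle_updates, raise_alarm):
    """sample_cpu_share() returns the update process's current CPU share (0..1);
    the other two callbacks are whatever actions the manager chooses."""
    share = sample_cpu_share()
    if share > UPDATE_CPU_THRESHOLD:
        raise_alarm(f"routing updates using {share:.0%} of CPU")
        throttle_updates()   # e.g. defer non-essential update processing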

If the ARPAnet managers had been able to properly monitor the network through tools that report on system performance, they might have been able to respond to the problem before the network became unstable and unusable. Network monitoring can be active (routinely testing the network by sending messages) or passive (collecting data and recording it in logs). Either type might have been a great asset, as the falling performance could have been detected, allowing them to avoid fire-fighting the problem, i.e. trying to repair it after the damage is done (Dennis, 2002). Having said this, system resources, as already mentioned, were scarce, and the sending and recording of monitoring data requires memory and CPU time that might not have been available; some overall speed might have needed to be sacrificed to allow for network monitoring. A small sketch of both styles follows.
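This sketch contrasts the two monitoring styles in modern terms; the addresses, timeout and log format are illustrative assumptions rather than anything from the ARPAnet environment.

import socket, time

def active_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Active monitoring: send a test connection and see whether it succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def passive_log(event: str, logfile: str = "netmon.log") -> None:
    """Passive monitoring: simply record observed events for later analysis."""
    with open(logfile, "a") as f:
        f.write(f"{time.ctime()} {event}\n")

Active probing consumes bandwidth and CPU on every test, while passive logging consumes memory and storage; both costs echo the resource trade-off noted above.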

Other network management tools, such as alarms, would also have allowed the problem to be corrected much more efficiently by alerting staff as soon as a fault occurred. Alarm software would not only have alerted staff earlier but would have made it easier to pinpoint the cause of the fault, allowing a fix to be implemented quickly.

Network Control

The lack of control over the misbehaving IMPs proved to be a major factor in the downtime. The IMPs were a number of kilometres away from each other, and fixes and patches had to be loaded onto the machines remotely, which took several hours because of the network's slow speeds. The lack of control over the network's allocation of resources was also detrimental, slowing the network's recovery. Even with the network in a degraded state, a modern network management tool would have been able to download software to a network device, configure its parameters, and back up, change or restore its configuration at the manager's discretion (Duck & Read, 2003).
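A hypothetical sketch of that kind of remote control is shown below: push a patch image to a device over a management channel, verify its integrity, then instruct the device to reload. The RemoteDevice interface is entirely illustrative; no real IMP or vendor API is implied.

import hashlib

class RemoteDevice:
    def __init__(self, address: str):
        self.address = address
        self.staged = b""

    def upload(self, image: bytes) -> None:
        # In practice this would travel over a (preferably out-of-band) channel.
        self.staged = image

    def staged_digest(self) -> str:
        return hashlib.sha256(self.staged).hexdigest()

    def reload(self) -> None:
        print(f"{self.address}: reloading with staged image")

def push_patch(device: RemoteDevice, image: bytes) -> None:
    device.upload(image)
    # Verify the image arrived intact before activating it.
    if device.staged_digest() != hashlib.sha256(image).hexdigest():
        raise RuntimeError(f"{device.address}: upload corrupted, aborting")
    device.reload()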

Planning and Organising

The lack of proper planning, organising and budgeting was one of the main factors that caused the network to fail. The ARPAnet managers were aware of the lack of protection against dropped bits, but due to cost and hardware constraints, and an "it won't happen to us" attitude, they were prepared to disregard it.

Better forecasting and budgeting might have allowed them to put in place more checking, which could have picked up the problem straight away. Only the packets were checked (checking the tables the packets were copied into was not considered cost-effective), and this left a large hole in their protection. Evidently the risk was not considered serious enough, on the assumption that it would probably never happen. Documenting the fact that there was no error checking on the tables might also have reduced the amount of time it took to correct the error (Dennis, 2002).

The RFC 789 case study points out that they knew bit dropping could become a problem. If this is true, then procedures for dealing with dropped bits should have been documented for future reference. Such planning could have severely cut the network's downtime.

Correct application of the five key management tasks and the network management tools available today would have made the ARPAnet crash of 1980 avoidable, or at least easier to detect and correct.

Better planning and documentation would have allowed the managers to look ahead and acknowledge any gaps in their protection. They could have prepared documentation and procedures highlighting the aspects of the network that were not protected.

Once the crash had occurred, uncertain decisions were made as to what the problem could possibly be. Planning, organising and directing could have helped resolve the situation in a more productive manner than the fire-fighting approach that was used.