Diagnostics – Architecture and technical overview

2.12.2 Diagnostics

Diagnostics refers to the identification of errors and symptoms, and to the determination of the potential causes of the identified errors.

The Power E1080 server is equipped with several advanced troubleshooting and diagnostic routines that are available through multiple service interfaces, as discussed in 2.12.1, “Error detection” on page 130. For more information about the available aids, see the following IBM Documentation web pages:

- Analyzing problems

- Isolation procedures

Reference codes

These codes include the system IPL progress codes, OS IPL progress codes, dump progress codes, service request numbers (SRNs), and so on. They serve as diagnostic aids that help determine the source of various hardware errors. Diagnostic applications report problems by means of SRNs.

The support team and service engineers use this information with reference code-specific information to analyze and determine the source of errors or find more information about other isolation procedures.

Automatic diagnostics

The processor and memory first-failure data capture (FFDC) is designed to work without user intervention and without the need to re-create the problem. Solid and intermittent errors are detected early and isolated at the time of failure. Runtime and boot-time diagnostics fall into this category.

Stand-alone diagnostics

These routines provide methods to test system resources by using diagnostics that are packaged on CD-ROM. They can be accessed by starting the system in service mode.

Service processor diagnostic

The service processor can monitor unrecoverable errors in the system processor on its own, without using resources from the system processor. It also can monitor HMC connections and the system thermal and operating environments, and it provides remote power control, reset, and maintenance functions.

2.12.3 Reporting

If a system hardware or environmentally induced failure occurs, the system runtime error diagnostics analyze the hardware error signature to determine the cause of failure.

The analysis is stored in the system NVRAM. The identified errors are reported to the operating system and recorded in the system logs of the operating system.

For an HMC-managed system in a PowerVM environment, an error log analysis (ELA) routine analyzes the error, forwards the event to the Service Focal Point (SFP) application that is running on the HMC, and notifies the system administrator that it isolated a likely cause of the system problem. The service processor event log also records unrecoverable checkstop conditions and forwards them to the SFP application.
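The HMC-managed reporting path that is described above can be pictured as a small routing model. The following Python sketch is illustrative only: the names `ErrorEvent`, `ServiceFocalPoint`, `ela_report`, and `forward_checkstop` are hypothetical and do not correspond to any IBM interface, and the analysis step is a placeholder. It mirrors only the order of the steps in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ErrorEvent:
    """Hypothetical error record; the field names are illustrative only."""
    source: str              # "os" or "service_processor"
    signature: str           # stand-in for the hardware error signature
    likely_cause: str = ""   # filled in by the analysis step


@dataclass
class ServiceFocalPoint:
    """Stand-in for the SFP application that runs on the HMC."""
    events: List[ErrorEvent] = field(default_factory=list)
    admin_notices: List[str] = field(default_factory=list)


def ela_report(event: ErrorEvent, sfp: ServiceFocalPoint) -> None:
    """Mirror the ELA steps from the text: analyze, forward, notify."""
    # Placeholder analysis: a real routine would decode the error signature.
    event.likely_cause = f"unit implicated by {event.signature}"
    sfp.events.append(event)
    sfp.admin_notices.append(f"Likely cause: {event.likely_cause}")


def forward_checkstop(event: ErrorEvent, sp_log: List[ErrorEvent],
                      sfp: ServiceFocalPoint) -> None:
    """Record an unrecoverable checkstop in the service processor log, then forward it."""
    sp_log.append(event)
    sfp.events.append(event)


sfp = ServiceFocalPoint()
sp_event_log: List[ErrorEvent] = []
ela_report(ErrorEvent("os", "SIG-EXAMPLE"), sfp)
forward_checkstop(ErrorEvent("service_processor", "CHECKSTOP-EXAMPLE"),
                  sp_event_log, sfp)
```

In this model, both paths end at the same SFP event list, but only the ELA path produces an administrator notice, which matches the wording of the paragraph above.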

The system can call home from the IBM i and AIX operating systems to report platform recoverable errors and errors that are associated with PCIe adapters and devices. In an HMC-managed system environment, a Call Home service request is started from the HMC, and the failure report, which carries parts information and part location, is sent to the IBM service organization.

Along with this information, the customer contact information, system-specific information (machine type, model, and serial number), and error logs also are sent electronically to the IBM service organization.
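The contents of a Call Home failure report that are listed above can be summarized as a simple record. This is a minimal sketch under stated assumptions: the class names, field names, and the `missing_fields` helper are hypothetical, the values are placeholders, and the real report format is not described in this section.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class FailingPart:
    """Hypothetical entry for the parts information and part location."""
    part_number: str
    location_code: str


@dataclass
class CallHomeReport:
    """Hypothetical container for the items that the text says are sent to IBM."""
    parts: List[FailingPart]
    customer_contact: str
    machine_type: str
    model: str
    serial_number: str
    error_logs: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in asdict(self).items()
                if value in ("", [], None)]


# Placeholder values only; none of these are real identifiers.
report = CallHomeReport(
    parts=[FailingPart("PART-PLACEHOLDER", "LOCATION-PLACEHOLDER")],
    customer_contact="admin@example.com",
    machine_type="MTYPE",
    model="MODEL",
    serial_number="",
    error_logs=["error log entry placeholder"],
)
incomplete = report.missing_fields()
```

A completeness check of this kind is one way to confirm that every item named in the text is present before a report is submitted.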
