Error detection – Architecture and technical overview
2.12.1 Error detection
A system's ability to detect errors immediately greatly improves its ability to take suitable action in time. The system can recover from many soft errors through its self-healing properties. For non-recoverable errors, it can send notifications through service interfaces so that further action can be taken. An overview of these service interfaces as they apply to the Power E1080 server follows.
Service interface
Service engineers are assisted by multiple system service interfaces that communicate with the service support applications in a server by using the operator console, the GUI on the management console or service processor menu, or an operating system terminal. The service interfaces help the support team to manage system resources and service information efficiently.
Applications that are available through the service interface are configured and placed to give service engineers access to important service functions. Depending on the system state, hypervisor, and the operating environment, one or more service interfaces can be useful in accessing logs and service information and communicating with the system. The following primary service interfaces are available:
- Light path diagnostics (LPD)
- Operator panel
- Service processor menu
- ASMI
- Operating system service menu
- Service focal point on the HMC or vHMC
With light path diagnostics, the system can identify components for replacement by using field-replaceable unit (FRU)-specific LEDs. The service engineer can use the identify function to set the FRU-level LED to blink, which also lights the blue enclosure locate and system locate LEDs. The enclosure LEDs turn on solid and guide the service engineer along the light path from the system to the enclosure and down to the specific FRU in error.
Like LPD, the other interfaces in the preceding list provide tools to capture logs, dumps, and other essential information that is needed to identify and detect errors.
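The light path behavior can be pictured as a small state model. The following Python sketch is illustrative only and is not an IBM interface: the class names, the `identify_fru` and `light_path` helpers, and the enclosure and FRU names are hypothetical, and modeling the system locate LED as solid is an assumption (the text states only that it is lit).

```python
from dataclasses import dataclass, field
from enum import Enum


class LedState(Enum):
    OFF = "off"
    SOLID = "solid"
    BLINK = "blink"


@dataclass
class Enclosure:
    name: str
    locate_led: LedState = LedState.OFF
    fru_leds: dict = field(default_factory=dict)  # FRU name -> LedState


@dataclass
class System:
    locate_led: LedState = LedState.OFF
    enclosures: dict = field(default_factory=dict)  # enclosure name -> Enclosure


def identify_fru(system, enclosure_name, fru_name):
    """Blink the FRU LED and light the locate LEDs above it (assumed solid)."""
    enclosure = system.enclosures[enclosure_name]
    enclosure.fru_leds[fru_name] = LedState.BLINK
    enclosure.locate_led = LedState.SOLID
    system.locate_led = LedState.SOLID


def light_path(system):
    """Follow lit LEDs from the system to the enclosure to the blinking FRU."""
    path = []
    if system.locate_led is LedState.SOLID:
        path.append("system")
        for enclosure in system.enclosures.values():
            if enclosure.locate_led is LedState.SOLID:
                path.append(enclosure.name)
                path.extend(
                    name
                    for name, state in enclosure.fru_leds.items()
                    if state is LedState.BLINK
                )
    return path
```

Calling `identify_fru` for one FRU and then `light_path` returns the chain that a service engineer would follow visually, while the LEDs of other enclosures stay off.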
For more information about service interfaces and the available service functions, see this web page.
Error check, first failure data capture, and fault isolation registers
Power processor-based systems feature specialized hardware detection circuits that are used to detect erroneous hardware operations. Error-checking hardware ranges from parity error detection that is coupled with Processor Instruction Retry and bus retry, to ECC correction on caches and system buses.
Within the processor and memory subsystems, error-checker signals are captured and stored in hardware fault isolation registers (FIRs). The associated logic circuitry is used to limit the domain of an error to the first checker that encounters the error. In this way, runtime error diagnostic tests can be deterministic so that for every check station, the unique error domain for that checker is defined and mapped to field-replaceable units (FRUs) that can be repaired when necessary.
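The pairing of parity detection with a retry can be illustrated in software. The following Python sketch is a simplified analogy under stated assumptions, not a description of how the hardware implements Processor Instruction Retry: `parity_bit`, `parity_ok`, and `read_with_retry` are hypothetical names, and the retry limit is an arbitrary choice.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """A single flipped bit changes the parity and is therefore detected."""
    return parity_bit(word) == stored_parity


def read_with_retry(read_fn, max_retries: int = 2):
    """Repeat a read until parity checks out or the retries are exhausted.

    read_fn is a caller-supplied function that returns (word, parity).
    Returns (word, attempts_used).
    """
    for attempt in range(1, max_retries + 2):
        word, parity = read_fn()
        if parity_ok(word, parity):
            return word, attempt
    raise RuntimeError("unrecoverable parity error")
```

A transient (soft) error disappears on the retry, so the read succeeds on a later attempt. A persistent error exhausts the retries and is reported. Parity detects single-bit errors only, which is why the text pairs it with retry and reserves ECC for correction.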
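The idea of limiting an error domain to the first checker can be sketched as follows. This Python model is a loose illustration, not the actual register layout: the `FaultIsolationRegister` class, the checker IDs, and the checker-to-FRU table are hypothetical.

```python
class FaultIsolationRegister:
    """Toy model of a FIR: one bit per checker, first reporter latched."""

    def __init__(self, checker_to_fru):
        self.checker_to_fru = checker_to_fru  # checker ID -> FRU name
        self.bits = 0                         # every checker that fired
        self.first_checker = None             # origin of the error

    def report(self, checker_id):
        """Record a checker signal; only the first one defines the domain."""
        self.bits |= 1 << checker_id
        if self.first_checker is None:
            self.first_checker = checker_id

    def failing_fru(self):
        """Map the first checker deterministically to its FRU."""
        if self.first_checker is None:
            return None
        return self.checker_to_fru[self.first_checker]
```

Because only the first checker is latched as the origin, errors that later propagate to downstream checkers do not change the FRU that is called out for repair.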
First-failure data capture (FFDC) is a technique that helps ensure that when a fault is detected in a system, the root cause of the fault is captured without the need to re-create the problem or run any sort of extended tracing or diagnostics program. For most faults, a good FFDC design means that the root cause also can be detected automatically without intervention by a service engineer.
FFDC information, error data analysis, and fault isolation are necessary to implement the advanced serviceability techniques that enable efficient service of the systems and to help determine the failing items.
In the rare absence of FFDC and error data analysis, diagnostics are required to re-create the failure and determine the failing items.
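The FFDC principle (record context continuously and freeze it at the first fault) can be sketched in a few lines. This Python example is a software analogy under stated assumptions, not the FFDC implementation of the server: the `FirstFailureCapture` class, its method names, and the history depth are hypothetical.

```python
from collections import deque


class FirstFailureCapture:
    """Keep a short rolling trace and snapshot it at the first fault."""

    def __init__(self, history: int = 4):
        self.trace = deque(maxlen=history)  # most recent events only
        self.snapshot = None                # frozen at the first fault

    def record(self, event: str) -> None:
        """Log an event during normal operation."""
        self.trace.append(event)

    def fault(self, description: str) -> dict:
        """Capture context once; later faults do not overwrite it."""
        if self.snapshot is None:
            self.snapshot = {
                "fault": description,
                "trace": list(self.trace),
            }
        return self.snapshot
```

Because the snapshot is taken when the first fault occurs, the data that is needed for analysis is already available and the failure does not have to be re-created.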