Proactive Identification and Remediation of HPC Network Subsystem Failures

October 20, 2016
3:00 pm - 3:30 pm
GRB 310 A

Track: Computer Systems Engineering
Type: Presentation
Level: Advanced

This presentation discusses a High Performance Computing network subsystem monitoring and detection solution that has greatly reduced the number and range of failures resulting from faults within this critical subsystem. Faults within this subsystem can be catastrophic, intermittent, or both. They are very hard to detect and often result in failures that do not readily indicate their root cause.
Susan Coulter, HPC Network Engineer, Los Alamos National Laboratory, High Performance Computing