
Kurt B Ferreira


kurt.ferreira@gmail.com

Journal articles

2011
Kurt Ferreira, Patrick Bridges, Ron Brightwell, Kevin Pedretti (2011)  The Impact of System Design Parameters on Application Noise Sensitivity   Journal of Cluster Computing (online): September  
Abstract: Operating system (OS) noise, or jitter, is a key limiter of application scalability in high end computing systems. Several studies have attempted to quantify the sources and effects of system interference, though few of these studies show the influence that architectural and system characteristics have on the impact of noise at scale. In this paper, we examine the impact of three such system properties: platform balance, noisy node distribution, and the choice of collective algorithm. Using a previously-developed noise injection tool, we explore how the impact of noise varies with these platform characteristics. We provide detailed performance results that indicate that a system with relatively less network bandwidth is able to absorb more noise than a system with more network bandwidth. Our results also show that application performance can be significantly degraded by only a subset of noisy nodes. Furthermore, the placement of the noisy nodes is also important, especially for applications that make substantial use of tree-based collective communication operations. Lastly, performance results indicate that non-blocking collective operations have the ability to greatly mitigate the impact of OS interference. When combined, these results show that the impact of OS noise is not solely a property of application communication behavior, but is also influenced by other properties of the system architecture and system software environment.
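One of the mitigation results above concerns non-blocking collectives: a rank delayed by noise during its local work no longer stalls the others at a blocking collective if the collective has already been started. The following is a minimal sketch of that overlap pattern using MPI-3's MPI_Iallreduce; it is illustrative only and is not the benchmark code used in the paper (the loop standing in for local work is an assumption).

    /* Minimal sketch: overlapping computation with a non-blocking
     * allreduce (MPI-3), one way to absorb OS noise during collectives.
     * Illustrative only; not the benchmark used in the paper. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank, global = 0.0;
        MPI_Request req;

        /* Start the reduction, then keep computing while it progresses. */
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        double work = 0.0;
        for (int i = 0; i < 1000000; i++)   /* stand-in for local work */
            work += (double)i * 1e-9;

        MPI_Wait(&req, MPI_STATUS_IGNORE);

        if (rank == 0)
            printf("sum of ranks = %g (work = %g)\n", global, work);

        MPI_Finalize();
        return 0;
    }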
2009
Rolf Riesen, Ron Brightwell, Patrick G Bridges, Trammell Hudson, Arthur B Maccabe, Patrick M Widener, Kurt Ferreira (2009)  Designing and Implementing Lightweight Kernels for Capability Computing   Concurrency and Computation: Practice and Experience 21: 6. 793-817 April  
Abstract: In the early 1990s, researchers at Sandia National Laboratories and the University of New Mexico began development of customized system software for massively parallel "capability" computing platforms. These lightweight kernels have proven to be essential for delivering the full power of the underlying hardware to applications. This claim is underscored by the success of several supercomputers, including the Intel Paragon, Intel Accelerated Strategic Computing Initiative Red, and the Cray XT series of systems, each having established a new standard for high-performance computing upon introduction. In this paper, we describe our approach to lightweight compute node kernel design and discuss the design principles that have guided several generations of implementation and deployment. A broad strategy of operating system specialization has led to a focus on user-level resource management, deterministic behavior, and scalable system services. The relative importance of each of these areas has changed over the years in response to changes in applications and hardware and system architecture. We detail our approach and the associated principles, describe how our application of these principles has changed over time, and provide design and performance comparisons to contemporaneous supercomputing operating systems.

Conference papers

2012
David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, Ron Brightwell (2012)  Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing   In: Proceedings of the International Conference on High-Performance Computing, Networking, Storage, and Analysis (SC'12)  
Abstract: This paper studies the potential for redundancy to both detect and correct soft errors in MPI message-passing applications. By assuming a model wherein corruption in application data manifests itself by producing differing MPI message data between replicas, we study the best suited protocols for detecting and correcting MPI data that is the result of corruption. To experimentally validate our proposed detection and correction protocols, we introduce RedMPI, which is capable of both online detection and correction of soft errors that occur in MPI applications without requiring any modifications to the application source by utilizing either double or triple redundancy. Our results indicate that our most efficient consistency protocol can successfully protect applications experiencing even high rates of silent data corruption with runtime overheads between 0% and 30% as compared to unprotected applications without redundancy.
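The detection model above assumes that corrupted application state eventually shows up as replicas producing different MPI message data. A minimal sketch of that comparison idea follows: each replica hashes the buffer it is about to send and checks the hash against its partner. The even/odd rank pairing and the FNV-1a hash are illustrative assumptions, not RedMPI's actual protocol.

    /* Sketch of the detection idea behind replica-based soft-error checks:
     * each replica hashes the buffer it is about to send and compares the
     * hash with its partner replica.  Hypothetical layout: even ranks are
     * primaries, odd ranks their replicas.  Not RedMPI's actual protocol. */
    #include <mpi.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t fnv1a(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double payload[4] = { 1.0, 2.0, 3.0, 4.0 };  /* data both replicas should agree on */
        uint64_t mine = fnv1a(payload, sizeof payload);
        uint64_t theirs = 0;

        /* Partner replica: rank r pairs with r^1 (0<->1, 2<->3, ...). */
        int partner = rank ^ 1;
        if (partner < size) {
            MPI_Sendrecv(&mine, 1, MPI_UINT64_T, partner, 0,
                         &theirs, 1, MPI_UINT64_T, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (mine != theirs)
                fprintf(stderr, "rank %d: replica hash mismatch -> possible silent data corruption\n", rank);
        }

        MPI_Finalize();
        return 0;
    }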
Dewan Ibtesham, Dorian Arnold, Patrick G Bridges, Kurt B Ferreira, Ron Brightwell (2012)  On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-based Fault Tolerance   In: Proceedings of the International Conference on Parallel Processing (ICPP)  
Abstract: The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. Therefore, optimizations that reduce checkpoint overheads are necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we show: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme scale systems; (2) checkpoint compression viability scales with checkpoint size; (3) user-level versus system-level checkpoints bears little impact on checkpoint compression viability; and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact checkpoint compression might have on projected extreme scale systems.
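A back-of-the-envelope form of the viability question the abstract refers to: compression pays off when the time to compress a checkpoint plus the time to commit the smaller result beats committing the raw checkpoint. The sketch below evaluates that inequality with made-up numbers; it is not the model or the data from the paper.

    /* Back-of-the-envelope viability check for checkpoint compression:
     * compressing pays off when (compress time + commit time of the smaller
     * checkpoint) beats committing the raw checkpoint.  All numbers are
     * illustrative assumptions, not measurements from the paper. */
    #include <stdio.h>

    int main(void)
    {
        double ckpt_gb        = 32.0;  /* checkpoint size per node (GB)         */
        double commit_bw      = 0.2;   /* effective bandwidth to storage (GB/s) */
        double compress_bw    = 0.5;   /* compression throughput (GB/s)         */
        double compress_ratio = 2.0;   /* raw size / compressed size            */

        double t_raw        = ckpt_gb / commit_bw;
        double t_compressed = ckpt_gb / compress_bw
                            + (ckpt_gb / compress_ratio) / commit_bw;

        printf("raw commit:        %.1f s\n", t_raw);
        printf("compress + commit: %.1f s\n", t_compressed);
        printf("compression %s viable under these assumptions\n",
               t_compressed < t_raw ? "is" : "is not");
        return 0;
    }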
2011
Kurt Ferreira, Rolf Riesen, Ron Brightwell, Patrick Bridges, Dorian Arnold (2011)  Libhashckpt: Hash-based Incremental Checkpointing Using GPU's   In: Proceedings of the 2011 European MPI Users' Group Conference 272-281 Santorini, Greece:  
Abstract: Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
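The hashing side of the technique can be pictured as follows: divide the checkpointed region into fixed-size blocks, hash each block, and write only the blocks whose hash changed since the previous checkpoint. The sketch below shows that bookkeeping on the CPU with an FNV-1a hash and a 4 KB block size, both of which are assumptions for illustration; libhashckpt itself performs the hashing on GPUs and combines it with page protection.

    /* CPU-side sketch of hash-based incremental checkpointing: hash fixed-size
     * blocks of a memory region and save only blocks whose hash changed since
     * the previous checkpoint.  Bookkeeping illustration only. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK 4096              /* illustrative block size */

    static uint64_t fnv1a(const unsigned char *p, size_t len)
    {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
        return h;
    }

    /* Write blocks whose hash differs from the stored one; update the table. */
    static size_t checkpoint(const unsigned char *mem, size_t len,
                             uint64_t *prev_hash, FILE *out)
    {
        size_t saved = 0;
        for (size_t off = 0, b = 0; off < len; off += BLOCK, b++) {
            size_t n = len - off < BLOCK ? len - off : BLOCK;
            uint64_t h = fnv1a(mem + off, n);
            if (h != prev_hash[b]) {
                fwrite(&off, sizeof off, 1, out);   /* block offset + data */
                fwrite(mem + off, 1, n, out);
                prev_hash[b] = h;
                saved++;
            }
        }
        return saved;
    }

    int main(void)
    {
        size_t len = 64 * BLOCK;
        unsigned char *mem = calloc(len, 1);
        uint64_t *hashes   = calloc(len / BLOCK, sizeof *hashes);
        FILE *out = fopen("ckpt.dat", "wb");
        if (!mem || !hashes || !out) return 1;

        printf("first checkpoint:  %zu blocks\n", checkpoint(mem, len, hashes, out));
        memset(mem + 10 * BLOCK, 0xff, BLOCK);      /* dirty one block */
        printf("second checkpoint: %zu blocks\n", checkpoint(mem, len, hashes, out));

        fclose(out);
        free(mem); free(hashes);
        return 0;
    }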
Kurt Ferreira, Jon Stearley, James Laros, Ron Oldfield, Kevin Pedretti, Ron Brightwell, Rolf Riesen, Patrick Bridges, Dorian Arnold (2011)  Evaluating the Viability of Process Replication Reliability for Exascale Systems   In: Proceedings of the 2011 ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis Seattle, Washington:  
Abstract: As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predicted to more than double an application's time to solution. Replicated computing techniques, particularly state machine replication, long used in distributed and mission critical systems, have been suggested as an alternative to checkpoint-restart. In this paper, we evaluate the viability of using state machine replication as the primary fault tolerance mechanism for upcoming exascale systems. We use a combination of modeling, empirical analysis, and simulation to study the costs and benefits of this approach in comparison to checkpoint/restart on a wide range of system parameters. These results, which cover different failure distributions, hardware mean time to failures, and I/O bandwidths, show that state machine replication is a potentially useful technique for meeting the fault tolerance demands of HPC applications on future exascale platforms.
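A rough sketch of the kind of trade-off studied above: checkpoint/restart efficiency can be approximated with the well-known first-order optimal-interval formula tau = sqrt(2 * delta * M), where delta is the checkpoint commit time and M the system MTBF, and compared against dual redundancy, which gives up half the machine but rarely needs to roll back. The numbers and the simplifications below are illustrative assumptions, not the paper's model or results.

    /* Rough sketch of the replication vs. checkpoint/restart trade-off using
     * the first-order optimal-interval approximation tau = sqrt(2*delta*M).
     * All numbers are illustrative assumptions. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double node_mtbf_h = 5.0 * 365.0 * 24.0;  /* assumed per-node MTBF (hours)  */
        double nodes       = 100000.0;
        double delta_h     = 0.25;                /* checkpoint commit time (hours) */

        double sys_mtbf_h = node_mtbf_h / nodes;              /* system MTBF      */
        double tau_h      = sqrt(2.0 * delta_h * sys_mtbf_h); /* optimal interval */

        /* Fraction of time spent on useful work (very rough: ignores rework
         * and restart time, so it overstates checkpoint/restart efficiency). */
        double cr_efficiency = tau_h / (tau_h + delta_h);

        /* Dual redundancy: half the machine does useful work, but failures are
         * masked unless both replicas of a rank fail, so checkpoints are rare. */
        double rep_efficiency = 0.5;

        printf("system MTBF: %.2f h, optimal interval: %.2f h\n", sys_mtbf_h, tau_h);
        printf("checkpoint/restart efficiency (rough): %.2f\n", cr_efficiency);
        printf("replication efficiency (upper bound):  %.2f\n", rep_efficiency);
        return 0;
    }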
2010
Ron Brightwell, Kurt Ferreira, Rolf Riesen (2010)  Transparent Redundant Computing with MPI   In: Proceedings of the 2010 European MPI Users' Group Conference 208-218 Stuttgart, Germany:  
Abstract: Extreme-scale parallel systems will require alternative methods for applications to maintain current levels of uninterrupted execution. Redundant computation is one approach to consider, if the benefits of increased resiliency outweigh the cost of consuming additional resources. We describe a transparent redundancy approach for MPI applications and detail two different implementations that provide the ability to tolerate a range of failure scenarios, including loss of application processes and connectivity. We compare these two approaches and show performance results from micro-benchmarks that bound worst-case message passing performance degradation. We propose several enhancements that could lower the overhead of providing resiliency through redundancy.
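One way to picture transparent redundancy at the MPI level (not necessarily how either implementation in the paper works): launch twice as many ranks as the application requests, split them into a primary set and a replica set, and let the unmodified application run on the split communicator while a library, for example via the PMPI profiling interface, mirrors its message traffic. A minimal sketch of the communicator setup:

    /* Sketch of one way to set up transparent redundancy: run 2x the ranks,
     * split them into a "primary" and a "replica" world, and let the
     * unmodified application use the split communicator.  Illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* First half of the ranks are primaries, second half are replicas.
         * Both halves see identical ranks 0..N-1 in app_comm, so the
         * application is unaware that it has been duplicated. */
        int is_replica = world_rank >= world_size / 2;
        MPI_Comm app_comm;
        MPI_Comm_split(MPI_COMM_WORLD, is_replica, world_rank, &app_comm);

        int app_rank;
        MPI_Comm_rank(app_comm, &app_rank);
        printf("world rank %d -> app rank %d (%s)\n",
               world_rank, app_rank, is_replica ? "replica" : "primary");

        MPI_Comm_free(&app_comm);
        MPI_Finalize();
        return 0;
    }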
Kurt Ferreira, Patrick Bridges, Ron Brightwell, Kevin Pedretti (2010)  The Impact of System Design Parameters on Application Noise Sensitivity   In: Proceedings of the 2010 IEEE International Conference on Cluster Computing 146-155 Crete, Greece:  
Abstract: Operating system noise, or "jitter," is a key limiter of application scalability in high end computing systems. Several studies have attempted to quantify the sources and effects of system interference, though few of these studies show the influence that architectural and system characteristics have on the impact of OS noise at scale. In this paper, we examine the impact of three such system properties: platform balance, "noisy" node distribution, and non-blocking collective operations. Using a previously-developed noise injection tool, we explore how the impact of noise varies with these platform characteristics. We provide detailed performance results that indicate that a system with relatively less network bandwidth is able to absorb more noise than a system with more network bandwidth. Our results also show that application performance can be significantly degraded by only a subset of noisy nodes. Furthermore, the placement of the noisy nodes is also important, especially for applications that make substantial use of collective communication operations that are tree-based. Lastly, performance results indicate that nonblocking collective operations have the ability to greatly mitigate the impact of OS interference. Combined, these results show that the impact of OS noise is not solely a property of application communication behavior, but is also influenced by other properties of the system architecture and system software environment.
2008
Kurt Ferreira, Ron Brightwell, Patrick Bridges (2008)  Characterizing Application Sensitivity to OS Interference Using Kernel-Level Noise Injection   In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'08) (Best Paper and Best Student Paper nominee) Austin, TX:  
Abstract: Operating system noise has been shown to be a key limiter of application scalability in high-end systems. While several studies have attempted to quantify the sources and effects of system interference using user-level mechanisms, there are few published studies on the effect of different kinds of kernel-generated noise on application performance at scale. In this paper, we examine the sensitivity of real-world, large-scale applications to a range of OS noise patterns using a kernel-based noise injection mechanism implemented in the Catamount lightweight kernel. Our results demonstrate the importance of how noise is generated, in terms of frequency and duration, and how this impact changes with application scale. For example, our results show that 2.5% net processor noise at 10,000 nodes can have no impact or can result in over a factor of 20 slowdown for the same application, depending solely on how the noise is generated. We also discuss how the characteristics of the applications we studied, for example computation/communication ratios, collective communication sizes, and other characteristics, related to their tendency to amplify or absorb noise. Finally, we discuss the implications of our findings on the design of new operating systems, middleware, and other system services for high-end parallel systems.
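The "how the noise is generated" point can be made concrete with a duty-cycle calculation: 2.5% net processor noise can mean 25 microseconds of interference every millisecond or 2.5 milliseconds every 100 milliseconds, and the two patterns can behave very differently at scale. The sketch below spins in user space with the first pattern; it is only an illustration, since the study injects noise inside the Catamount kernel.

    /* User-level sketch of the noise-injection idea: the same 2.5% net CPU
     * noise can be injected as many short interruptions or a few long ones.
     * The study injects the noise inside the Catamount kernel; this sketch
     * just spins in a user-space loop for illustration. */
    #include <stdio.h>
    #include <time.h>

    static double now_s(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void spin_for(double seconds)
    {
        double end = now_s() + seconds;
        while (now_s() < end)
            ;                              /* burn CPU, emulating an OS detour */
    }

    int main(void)
    {
        double period_s   = 1e-3;          /* noise event every 1 ms */
        double duration_s = 25e-6;         /* each event lasts 25 us */
        double net = duration_s / period_s;

        printf("net noise: %.1f%%\n", net * 100.0);   /* prints 2.5% */

        double t0 = now_s();
        while (now_s() - t0 < 1.0) {         /* run the pattern for one second  */
            spin_for(duration_s);            /* "noise"                         */
            spin_for(period_s - duration_s); /* stand-in for application work   */
        }
        return 0;
    }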

Workshop papers

2012
Kurt Ferreira, Kevin Pedretti, Patrick Bridges, Ron Brightwell, David Fiala (2012)  Evaluating Operating System Vulnerability to Memory Errors   In: Proceedings of the International Workshop on Runtime and Operating Systems for Supercomputers [Workshop papers]  
Abstract: Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically represents a small footprint of a compute node's physical memory, recent studies show more memory errors in this region of memory than the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.
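One simple flavor of the state-reconstruction methods mentioned above is to keep a shadow copy and a checksum for small, mostly read-only structures, detect corruption on use, and repair from the surviving copy. The sketch below illustrates that pattern in user space; the structure, the checksum, and the repair policy are all assumptions for illustration and are not taken from Kitten or CLE.

    /* Minimal illustration of one repair strategy for small, mostly read-only
     * OS structures: keep two copies plus a checksum, detect corruption on
     * use, and repair from the surviving copy.  Purely illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    struct critical {                      /* stand-in for a kernel structure */
        uint64_t fields[8];
    };

    static uint64_t checksum(const struct critical *c)
    {
        uint64_t h = 0;
        for (int i = 0; i < 8; i++)
            h ^= c->fields[i] * 1099511628211ULL;
        return h;
    }

    struct protected_critical {
        struct critical primary, shadow;
        uint64_t sum;
    };

    /* Return a usable copy, repairing the primary from the shadow if needed. */
    static struct critical *get(struct protected_critical *p)
    {
        if (checksum(&p->primary) != p->sum) {
            fprintf(stderr, "primary corrupted, restoring from shadow\n");
            p->primary = p->shadow;
        }
        return &p->primary;
    }

    int main(void)
    {
        struct protected_critical p = { 0 };
        p.primary.fields[0] = 42;
        p.shadow = p.primary;
        p.sum = checksum(&p.primary);

        p.primary.fields[0] = 7;           /* simulate a bit flip */
        printf("field[0] after repair: %llu\n",
               (unsigned long long)get(&p)->fields[0]);
        return 0;
    }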
2008
Ron Brightwell, Kevin Pedretti, Kurt Ferreira (2008)  Instrumentation and Analysis of MPI Queue Times on the SeaStar High-Performance Network   In: Proceedings of the 2008 Workshop on Advanced Networking and Communications at the 17th International Conference on Computer Communications and Networks [Workshop papers]  
Abstract: Understanding the communication behavior and network resource usage of parallel applications is critical to achieving high performance and scalability on systems with tens of thousands of network endpoints. The need for better understanding is not only driven by the desire to identify potential performance optimization opportunities for current networks, but is also a necessity for designing next-generation networking hardware. In this paper, we describe our approach to instrumenting the SeaStar interconnect on the Cray XT series of massively parallel processing machines to gather low-level network timing data. This data provides a new perspective on performance evaluation, both in terms of evaluating the resource usage patterns of applications as well as evaluating different implementation strategies in the network protocol stack.
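The paper gathers its timing data inside the SeaStar NIC itself. A rough host-side proxy for one of the quantities involved, how long a pre-posted receive sits in the posted-receive queue before its matching message arrives, can be taken with MPI_Wtime around MPI_Irecv and MPI_Wait, as in the two-rank sketch below (the one-second sender delay is an illustrative assumption).

    /* Rough user-level proxy for one queue-time metric: how long a pre-posted
     * receive waits before the matching message arrives.  The paper measures
     * such times in the SeaStar NIC; this two-rank sketch only approximates
     * it from the host side with MPI_Wtime. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int payload = 0;
        if (rank == 0) {
            MPI_Request req;
            double posted = MPI_Wtime();
            MPI_Irecv(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            printf("receive matched %.3f ms after it was posted\n",
                   (MPI_Wtime() - posted) * 1e3);
        } else if (rank == 1) {
            sleep(1);                      /* delay the send so the receive waits */
            payload = 42;
            MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }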