Abstract
The performance of production machines varies widely. On the UK National Supercomputing Service, the time a job takes to complete can vary by as much as 53%. Load imbalance and contention for shared resources are largely responsible, yet we find that previous efforts to model application/architecture performance typically do not take them into account.
In this research we model and simulate network contention. This allows us to explore both the impact of multiple interacting jobs and approaches to alleviating that impact, including network redesign and communication staging within applications. We demonstrate the utility of this work on a variety of systems and interacting applications.