A Machine Learning Approach for an HPC Use Case: the Jobs Queuing Time Prediction

Vercellino, C., Scionti, A., Varavallo, G., Viviani, P., Vitali, G., & Terzo, O. (2023). A machine learning approach for an HPC use case: The jobs queuing time prediction. Future Generation Computer Systems, 143, 215-230.

DOI: https://doi.org/10.1016/j.future.2023.01.020

Download

Abstract

High-Performance Computing (HPC) domain provided the necessary tools to support the scientific and industrial advancements we all have seen during the last decades. HPC is a broad domain targeting to provide both software and hardware solutions as well as envisioning methodologies that allow achieving goals of interest, such as system performance and energy efficiency. In this context, supercomputers have been the vehicle for developing and testing the most advanced technologies since their first appearance. Unlike cloud computing resources that are provided to the end-users in an on-demand fashion in the form of virtualized resources (i.e., virtual machines and containers), supercomputers’ resources are generally served through State-of-the-Art batch schedulers (e.g., SLURM, PBS, LSF, HTCondor). As such, the users submit their computational jobs to the system, which manages their execution with the support of queues. In this regard, predicting the behaviour of the jobs in the batch scheduler queues becomes worth it. Indeed, there are many cases where a deeper knowledge of the time experienced by a job in a queue (e.g., the submission of check-pointed jobs or the submission of jobs with execution dependencies) allows exploring more effective workflow orchestration policies. In this work, we focused on applying machine learning (ML) techniques to learn from the historical data collected from the queuing system of real supercomputers, aiming at predicting the time spent on a queue by a given job. Specifically, we applied both unsupervised learning (UL) and supervised learning (SL) techniques to define the most effective features for the prediction task and the actual prediction of the queue waiting time. For this purpose, two approaches have been explored: on one side, the prediction of ranges on jobs’ queuing times (classification approach) and, on the other side, the prediction of the waiting time at the minutes level (regression approach). Experimental results highlight the strong relationship between the SL models’ performances and the way the dataset is split. At the end of the prediction step, we present the uncertainty quantification approach, i.e., a tool to associate the predictions with reliability metrics, based on variance estimation.