Optimizing Data Analytics on Supercomputers
High Performance Computing (HPC) has traditionally been used for scientific computing, solving complex problems that require extreme amounts of computation. HPC systems are designed with performance as their principal focus, leveraging supercomputers along with parallel and distributed processing techniques.
The rise of Big Data came with an increasing adoption of data analytics and Artificial Intelligence in modern applications, which use data-driven models and analysis engines to extract valuable insights. Big Data analytics typically runs on Cloud data centers, which provide elastic environments based on commodity hardware and adapted software; instead of raw performance, they focus on flexibility and programming simplicity. Containerization, popularized by Docker, has greatly improved the productivity and simplicity of Cloud technologies and, together with advanced orchestration systems such as Kubernetes, has enabled the adoption of Big Data software by a large community.
Big Data analytics is applied extensively, as part of digitalization efforts, in industries such as pharmaceuticals, construction, and automotive, but also in agriculture and farming. Supercomputers and HPC can be of great benefit to Big Data applications, since large datasets can be processed in a timely manner.
But the steep learning curve of HPC systems software and parallel programming techniques, along with rigid environment deployment and resource management, remains an important obstacle to the use of HPC for Big Data analytics. In addition, the classic Cloud and Big Data tools for containerization and orchestration cannot be applied directly to HPC systems because of security and performance issues. Hence workflows mixing HPC and Big Data executions cannot yet be combined intelligently using off-the-shelf software.
CYBELE aims to address these issues. We have designed a prototype architecture for execution management services that combines HPC and Big Data technologies to enable the deployment of data analytics workflows in the context of precision agriculture and livestock farming. We propose a suite of Cloud-level tools, combined with Big Data and HPC systems software and adapted techniques, that brings the right abstractions to data scientists without HPC expertise so they can optimally leverage HPC platforms.
The architecture features one Big Data partition, composed of VMs managed by Kubernetes and using a mix of Docker and Singularity runtimes for containerization, along with one HPC partition, composed of bare-metal machines managed by Slurm or Torque and using Singularity containerization.
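As an illustration of how a workload might run on the HPC partition, the sketch below shows a Slurm batch script executing a containerized application with Singularity. The image name, script path, and resource values are hypothetical placeholders, not taken from the actual CYBELE deployment.

```shell
#!/bin/bash
#SBATCH --job-name=analytics-job   # job name shown in the Slurm queue
#SBATCH --nodes=2                  # number of bare-metal nodes
#SBATCH --ntasks-per-node=4        # parallel tasks per node
#SBATCH --gres=gpu:2               # GPUs per node, if the partition has them
#SBATCH --time=01:00:00            # wall-clock limit

# Run the containerized workload on every allocated task.
# 'analytics.sif' is a hypothetical Singularity image; --nv exposes the
# host NVIDIA GPU driver inside the container.
srun singularity exec --nv analytics.sif python3 /opt/app/train.py
```

Submitted with `sbatch job.sh`, the job is queued by Slurm and executed on the allocated nodes of the HPC partition.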
Furthermore, the proposed execution management stack includes:
- a flexible environment creation and deployment tool based on Singularity containerization and a specialized repository of pre-built images featuring Big Data and AI frameworks (such as PyTorch, TensorFlow and Horovod) optimally configured to use HPC resources (such as GPUs and InfiniBand).
- meta-scheduling and resource abstraction techniques enabling the execution of Big Data analytics as batch jobs on the Slurm- or Torque-managed HPC partition, along with the possibility to deploy Cloud-level tools on the VMs of the Big Data partition, abstracting HPC services behind the same Kubernetes API.
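To give a feel for the environment deployment workflow of the first point, the commands below sketch how a user could fetch and run one of the pre-built images. The repository URI and image name are illustrative placeholders, not the actual CYBELE repository.

```shell
# Pull a hypothetical pre-built image featuring PyTorch and Horovod
# (the library:// URI below is a placeholder).
singularity pull pytorch-horovod.sif library://cybele/ai/pytorch-horovod:latest

# Check that the containerized framework sees the host GPUs
# (--nv maps the NVIDIA driver into the container).
singularity exec --nv pytorch-horovod.sif \
    python3 -c "import torch; print(torch.cuda.is_available())"
```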
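The idea of the second point, a single Kubernetes API covering both partitions, can be sketched as follows: the user submits an ordinary Kubernetes Job, and a meta-scheduling component translates it into a batch job for the HPC partition. The manifest below, including the annotation key and image name, is a hypothetical illustration rather than the actual CYBELE interface.

```shell
# Submit a Kubernetes Job whose annotation hints the meta-scheduler to
# route it to the HPC partition; names below are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: analytics-job
  annotations:
    cybele.example/partition: "hpc"   # hint for the meta-scheduler
spec:
  template:
    spec:
      containers:
      - name: analytics
        image: registry.example/cybele/analytics:latest
        command: ["python3", "/opt/app/train.py"]
      restartPolicy: Never
EOF
```

From the data scientist's perspective, the submission looks identical whether the job lands on the Kubernetes-managed VMs or is translated into a Slurm or Torque batch job.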
The first version of the CYBELE execution management layer has been delivered. The developments are carried out by partners with cross-discipline expertise (HPC, Big Data, AI, Cloud) and are led by Ryax Technologies.
The design, implementation, and initial experimentation of the CYBELE execution management layer in the context of a precision agriculture use case have been accepted for publication in the proceedings of the scientific workshop ISC-VHPC'20 and were presented in June 2020. For more details, see the related presentation entitled “Converging HPC, Big Data and Cloud technologies for precision agriculture data analytics on supercomputers”, provided here.