[SPONSORED CONTENT] How did you, with the heart and training of a research scientist, financial analyst or product design engineer doing multi-physics CAE, end up as… a system administrator? You set out to be one thing and wound up being something else. You finished school and started working with some great HPC-class clusters. One day the system went down, and you, intrepid soul, stepped forward to fix it. Someone – maybe older – gave you a compliment: “Oh, that’s great. Man, I couldn’t do that at all…,” that kind of thing.
Word gets around, and soon you are the one people go to, often, when they have a problem with the cluster. Before long, there you are, sitting in front of a bank of screens monitoring the system while everyone else is doing science, comparing hedge fund portfolios, or modeling new product designs. And you may be asking yourself, “Well, how did I get here?”**
Organizations that rely on clusters – whether 100 nodes or 1,000 – are lost without a system administrator, that is, a system architect. It’s head-splitting, hard work that isn’t as glamorous as using the cluster. But everyone from the CEO on down knows that if the administrators don’t do their jobs well, the whole organization suffers.
And there aren’t enough of them. Clusters are getting bigger, more complex, more powerful and more heterogeneous, and they become harder to manage as they take on bigger and more complex workloads.
“You don’t start out thinking, ‘I’m going to get into cluster systems management,’” said Glen Otero, a Ph.D. who is the Director of Scientific Computing, Genomics AI and Machine Learning at CIQ, a technology firm specializing in HPC-class solutions. “You start out as someone who is going to do something great in science. But you end up in this role because – we joke about it – you got the system set up. And then people are like, ‘Hey, can you do this again? Can you do that again?’ And you wake up one day and you’re like, ‘Where did my life go? I need to do some research.’”
Cluster provisioning and management have long called for tools that simplify and automate – or at least partly automate – those processes once the clusters are in place. Three popular open source projects take on the difficulties of running clusters, and all three are led by Greg Kurtzer, founder and CEO of CIQ. The three projects are:
– The Rocky Linux operating system, which carries on from the CentOS Linux distribution. CentOS, started by Kurtzer, was widely used by organizations building large, complex, HPC-class infrastructure until Red Hat announced in December 2020 that it would end support for it (see related coverage).
– Warewulf, a cluster provisioning solution that Kurtzer began developing in 2001 while managing Linux clusters at Lawrence Berkeley National Laboratory for the Department of Energy.
– Apptainer, also created at Berkeley Lab by Kurtzer, an open source application container system that began life as “Singularity,” HPC’s answer to Docker.
Kurtzer started CIQ to provide Rocky Linux, Warewulf and Apptainer support, services, tools and other added value, and he is a strong driving force behind the open source communities involved in all three projects. CIQ provides HPC-related solutions and support and is behind a computing model, called HPC-2.0, that points the way to cloud-native, hybrid, converged computing (to be discussed in a later article on this site).
“Building and managing clusters is difficult and unforgiving,” said Brock Taylor, CIQ’s vice president of computing and strategic partners. “There are thousands of components in a cluster. You add up all the hardware and software, and the operating system by itself is a lot of pieces. It takes a lot of effort to get there, a lot of expertise.”
When Beowulf clusters first appeared in the 1990s, provisioning was script-based, hand-crafted and do-it-yourself. Soon tools became available, open source tools like OSCAR, Rocks and Warewulf.
“So you have these provisioning systems that make cluster deployment more efficient,” Taylor said, “but over time, complexity increases. It’s like entropy, right? With clusters, it’s never easy: the more complex they get, the harder it is for the solutions to keep up.”
Commercial software offerings also came to the market, such as those from Platform Computing, based in large part on Rocks and later acquired by IBM, and from Bright Computing, which NVIDIA added to its enterprise stack last January.
But for open source advocates, Warewulf and Apptainer have value in being community-supported and free of commercial licensing. That said, they are not a panacea – cluster entropy still exists, and there’s the problem of not enough system architects to meet the demand, especially those who can thrive in the alligator pit of HPC.
“This is a big problem in HPC,” Taylor said. “Finding people who can stay on top of all the technology, and keeping them, means drawing on a shrinking pool. And as they gain more expertise in managing HPC systems, they cost more and they have more opportunities to go elsewhere.”
Warewulf helps cluster management in part by simplifying the addition of new cluster nodes using “images,” says Taylor, “where all the magic happens.” An image captures the node’s entire software stack, a “golden image” of everything that runs on the node and uses its resources – compute, memory, network, everything. Images can be used to add new nodes that are exact copies of the nodes already working together. Making sure “that all the pipes and wires are connected correctly,” says Taylor, “is a very difficult task.”
In Rocky Linux-Warewulf-Apptainer shops, Warewulf images are delivered as containers used to spin up compute nodes on the cluster. These images can also include variations on the existing cluster configuration – say, a node with both GPUs and CPUs, while other nodes are CPU-only – and the resulting nodes still function as part of the cluster.
Jonathan Anderson, CIQ’s HPC solution architect, explains why Apptainer and Warewulf make such a powerful combination.
“Apptainer brings scientific computing end users into the container ecosystem, giving them control over the operating environment in which their applications run,” he said. “Warewulf 4 brings cluster administrators into that container ecosystem by mapping the compute node images to standard operating system containers. By bringing users and administrators together into a single ecosystem they can better collaborate and build on each other’s work.”
This is where CIQ can make a big difference to HPC shops. The company specializes not only in the base operating system layer but also in Warewulf and Apptainer.
“Warewulf helps you keep your system software consistent even when all your users are running different applications, ‘snowflake’ applications, in containers,” said Otero. All three (Rocky, Apptainer, Warewulf) are integrated in a way that allows organizations to create and expand clusters at scale, quickly and easily.
“Applications run in containers, and because they’re platform-independent – because everything is wrapped up in a container – an administrator can run these nodes as if they were all the same,” Otero said. “When snowflake situations arise, where some nodes have GPUs, for example, an administrator may want to use Warewulf to create a different version of Linux that will run on those nodes. Warewulf can push that container out to the nodes with the GPUs, and then Warewulf can restore those nodes to their original state.”
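For readers who think in code, the workflow Otero describes can be pictured with a small Python sketch. This is only a conceptual model under stated assumptions: the image names, the `Node` and `Cluster` classes and their methods are hypothetical and are not Warewulf's actual interface. It shows every node starting from one golden image, a variant image being pushed only to the GPU nodes, and a node later being restored to the golden image.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout: this models the idea, not Warewulf's API.
GOLDEN_IMAGE = "rocky-base"   # assumed name for the shared "golden" image
GPU_IMAGE = "rocky-gpu"       # assumed name for the GPU variant


@dataclass
class Node:
    name: str
    has_gpu: bool = False
    image: str = GOLDEN_IMAGE  # every node starts from the golden image


@dataclass
class Cluster:
    nodes: dict = field(default_factory=dict)

    def add_node(self, name, has_gpu=False):
        """Add a node that is an exact copy of the golden image."""
        node = Node(name=name, has_gpu=has_gpu)
        self.nodes[name] = node
        return node

    def push_image(self, image, only_gpu=False):
        """Assign an image to matching nodes; return the names changed."""
        changed = []
        for node in self.nodes.values():
            if only_gpu and not node.has_gpu:
                continue
            node.image = image
            changed.append(node.name)
        return changed

    def restore(self, name):
        """Return one node to the golden image."""
        self.nodes[name].image = GOLDEN_IMAGE


def demo():
    cluster = Cluster()
    cluster.add_node("n01")
    cluster.add_node("n02")
    cluster.add_node("g01", has_gpu=True)
    cluster.push_image(GPU_IMAGE, only_gpu=True)   # only g01 changes
    during = {n.name: n.image for n in cluster.nodes.values()}
    cluster.restore("g01")                         # back to the golden image
    after = {n.name: n.image for n in cluster.nodes.values()}
    return during, after
```

Calling `demo()` returns the node-to-image mapping while the GPU variant is in place and again after the restore. In a real deployment the provisioning system, not a Python dictionary, would hold this mapping.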
Node flexibility, scalability, provisioning and expansion of clusters, simplification of system management tasks – all of these matter to organizations that rely on HPC clusters to fulfill their missions.
And who knows, maybe some of the researchers, analysts and designers turned system administrators will get to spend more time doing what they set out to do in the first place.
** Talking Heads, “Once in a Lifetime”