Thin-OSCAR: Diskless Clustering for All
Solving problems that arise with diskless cluster support
Feb. 18, 2004 12:00 AM
While OSCAR (Open Source Cluster Application Resource) has been designed for clusters with disks since its very first version, diskless and systemless support is a feature that a lot of people have been expecting. The Center for Scientific Computing has built several clusters without disks; we tested OSCAR and were easily convinced of its quality, especially when compared to our own homemade scripts. We decided to use OSCAR for our diskless cluster and then transfer our diskless expertise to the OSCAR project. The thin-OSCAR workgroup was created specifically to analyze and solve the problems that arise while adding this diskless cluster support.
This article first defines essential notions for a diskless cluster. The actual implementation of thin-OSCAR is explored, along with a roadmap for the development of thin-OSCAR. Interactions with OSCAR are detailed so that features can be discussed, prioritized, and eventually added to the OSCAR framework.
Why Use Diskless Nodes?
First, disks are useless for calculation, and they don't get you on the Top500.org list. Save your money and buy more nodes with it. In addition, a disk is a mechanical part that is subject to failure; fewer parts mean greater reliability.
Also, consider the increased consistency across nodes. It's easier to manage one image than many individual installations. For example, if a package update occurs while a node is down, that node is no longer exactly the same as the others, and homogeneity issues can arise.
As a consequence, nodes with disks are subject to greater entropy than diskless nodes. As soon as a diskless node is rebooted, it is an exact copy of the image that was sent to it. Today this argument is somewhat softened by the existing multicast technology used to install nodes with disks. Still, multicast installation of diskless nodes is faster (the diskless image is smaller) and less error-prone (the only source of error is the network) than the automatic remote installation of nodes with disks.
Nodes with disks can be considered diskless when, for security or practical reasons, the disk cannot be used. This is the case, for instance, in a Grid environment in which cluster nodes are workstations during the day. It allows nodes that are not used in a cluster to be very quickly integrated into a cluster without any alteration of the main operating system stored on their disks.
Is Diskless Right for Your Application?
There are limitations on the type of computation that can be run on this kind of node: I/O-intensive applications can be executed on such nodes, but they will not scale well and will slow down each calculation, resulting in inefficiency.
Diskless and Systemless Techniques
Its main drawback is a scaling problem that is common to all clusters that share files with the NFS protocol. /home is generally exported and used like a distributed resource among the nodes. As a consequence, if computations cause intensive I/O usage, the network will be exclusively used by the NFS protocol. The cluster can be paralyzed and can even crash the NFS server, depending on the configuration of the NFS server and on the quality of the NFS implementation.
A common solution to this problem is to use a dedicated network exclusively for the transport of information for permanent storage. However, this doesn't solve the inherent NFS problem - the NFS server is central and network load can't be distributed. While this problem wasn't very important in building small clusters, it's very important today as clusters are commonly built with more than 1,000 nodes.
Diskless clusters have the same problem, and it appears at a smaller cluster size because NFS is used more heavily: the complete root file system of each node resides on the NFS server, and only /opt and /usr are common (and read-only). As such, diskless nodes are not good candidates for large Root-NFS-based clusters.
This approach is interesting because the transfer of the initial RAM disk can be multicast, so booting a cluster can be very fast. Another advantage is that, under certain conditions, the connection with the file server can be lost. In this case, this kind of diskless node will still be up and running correctly (as long as it doesn't need files from the file server, which is generally the case with scientific computation programs once they are loaded).
Single System Image Model
The Perl script is an interactive script that lets you configure the diskless cluster easily. It performs all the tasks necessary to transform a regular systemimager image into a set of RAM disks for diskless operation and to configure the OSCAR server correctly. Loopback device support has to be available on the system (master node) where thin-OSCAR is executed.
In order to launch the thin-OSCAR wizard, issue (as root) the following command:
The next section examines each of the steps necessary to use thin-OSCAR and create a diskless cluster.
Image Creation in the OSCAR Wizard
Linking Nodes to Models
The configuration is written in oscar-package/etc/link.xml.
Configuring the Details
The boot image contains the linuxrc script, which builds the RAID array in RAM and puts the run image into it. The script ends with a pivot_root to change the root, so the RAID array becomes the new root of the system. After that, the system boots (almost) as if it were a normal disk install.
The run image, typically about 25MB, is what resides on the node after the boot process. Thin-OSCAR simply copies the relevant directories from the SIS image to this RAM disk and creates empty directories for those that are mounted over NFS.
The relevant directories are copied from their systemimager image location to the image directory: /bin, /boot, /dev, /etc, /home, /lib, /mnt, /proc, /root, /sbin, /var.
The directories /usr, /opt, and /lib/modules are created and will be used as a mounting point for the NFS exported file system.
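The copy step described above can be sketched in a few lines of Python. This is only an illustration, not the thin-OSCAR Perl code: the function name build_run_image and its two path arguments are hypothetical, and emptying /lib/modules after copying /lib is an assumption about how the mount point ends up empty. Note also that shutil.copytree does not recreate the device nodes found in /dev, so a real tool would need something like cp -a or tar for that directory.

```python
import shutil
from pathlib import Path

# Directory lists taken from the text above.
COPIED_DIRS = ["bin", "boot", "dev", "etc", "home", "lib",
               "mnt", "proc", "root", "sbin", "var"]
NFS_MOUNT_POINTS = ["usr", "opt", "lib/modules"]


def build_run_image(sis_image: Path, run_image: Path) -> None:
    """Copy the RAM-resident directories and leave empty NFS mount points."""
    run_image.mkdir(parents=True, exist_ok=True)
    for name in COPIED_DIRS:
        src = sis_image / name
        if src.is_dir():
            # symlinks=True keeps links as links instead of following them.
            shutil.copytree(src, run_image / name, symlinks=True)
    for name in NFS_MOUNT_POINTS:
        mount_point = run_image / name
        # Assumption: /lib/modules arrives with the copy of /lib, so it is
        # emptied here and kept only as a mount point, like /usr and /opt.
        if mount_point.exists():
            shutil.rmtree(mount_point)
        mount_point.mkdir(parents=True)
```

Keeping the two lists as plain data makes it easy to see which directories live in RAM and which are served by the master node.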
The /etc/fstab file is then generated in the image directory. It lists the RAID array built in RAM (/dev/md0) as the root file system. /home is mounted via NFS (OSCAR standard), and the /opt, /usr, and /lib/modules directories are NFS mount points served from the systemimager image directory on the server (see Figure 1).
The network configuration is then generated, mainly /etc/sysconfig/network-scripts/ifcfg-eth0, which is configured via DHCP.
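To make the layout concrete, here is a minimal Python sketch that renders an fstab with the entries just described. It is not the generator shipped with thin-OSCAR. The function name render_fstab, the server name oscar_server, the image path, the ext2 type for the RAM root, and the mount options are all assumptions chosen for illustration.

```python
def render_fstab(server: str, image_dir: str) -> str:
    """Return fstab text for a diskless node, one entry per line."""
    image_dir = image_dir.rstrip("/")
    entries = [
        # Root lives on the RAID array built in RAM (ext2 is an assumption).
        ("/dev/md0", "/", "ext2", "defaults", 0, 0),
        # /home comes from the server, as on a standard OSCAR node.
        (f"{server}:/home", "/home", "nfs", "rw", 0, 0),
    ]
    # The remaining directories are served read-only from the image directory.
    for mount_point in ("/opt", "/usr", "/lib/modules"):
        source = f"{server}:{image_dir}{mount_point}"
        entries.append((source, mount_point, "nfs", "ro", 0, 0))
    return "".join(" ".join(str(field) for field in e) + "\n" for e in entries)


if __name__ == "__main__":
    # Hypothetical server name and image location.
    print(render_fstab("oscar_server", "/var/lib/systemimager/images/oscarimage"), end="")
```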
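For readers who have not seen one, a DHCP-driven interface file on a Red Hat style system is very short. The Python sketch below renders one; the function name render_ifcfg is hypothetical, and the exact set of keys written by thin-OSCAR is not given in this article, so the three shown here are only the usual minimum.

```python
def render_ifcfg(device: str = "eth0") -> str:
    """Return the text of an ifcfg file that configures `device` via DHCP."""
    settings = {
        "DEVICE": device,     # interface name
        "BOOTPROTO": "dhcp",  # obtain the address from the DHCP server
        "ONBOOT": "yes",      # bring the interface up at boot
    }
    return "".join(f"{key}={value}\n" for key, value in settings.items())


if __name__ == "__main__":
    print(render_ifcfg(), end="")
```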
Some information is deleted from the run image, for example the RPM database, because no further RPM operation will occur on the node.
The /etc/exports file is adjusted so that the systemimager image is exported (read only) to the cluster net with a given subnet mask. /home is exported read-write to the same network.
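The two export entries can be sketched as follows. This is an illustration in Python rather than the actual thin-OSCAR code; render_exports, the image path, and the 10.0.0.0/24 cluster network are assumptions, and only the ro and rw options named in the text are used.

```python
import ipaddress


def render_exports(image_dir: str, cluster_net: str) -> str:
    """Return /etc/exports lines for the image directory and /home."""
    net = ipaddress.ip_network(cluster_net)
    # exports accepts clients written as network/netmask.
    client = f"{net.network_address}/{net.netmask}"
    return (
        f"{image_dir} {client}(ro)\n"  # image is shared read-only
        f"/home {client}(rw)\n"        # /home is shared read-write
    )


if __name__ == "__main__":
    # Hypothetical image location and cluster network.
    print(render_exports("/var/lib/systemimager/images/oscarimage", "10.0.0.0/24"), end="")
```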
For example, there is a tool called cpush from the C3 package that puts a specified file on each node. Obviously, if the file is in RAM, it will be lost on the next reboot. For diskless nodes, cpush should copy the file into the image directory, rebuild the RAM disk if needed, and reboot the affected nodes. As a result, many commands from the traditional OSCAR environment should behave differently depending on the configuration of the node. A lot of integration work has to be done to seamlessly administer a heterogeneous cluster.
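The node-dependent behavior described here amounts to a small dispatch on the node type. The Python sketch below only plans the steps as text and does not call C3 or thin-OSCAR; NodeKind, plan_push, and the rebuild_ram_disk flag are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import List


class NodeKind(Enum):
    DISK = "disk"
    DISKLESS = "diskless"


def plan_push(kind: NodeKind, path: str, rebuild_ram_disk: bool = True) -> List[str]:
    """List the steps a cpush-like tool would take for one kind of node."""
    if kind is NodeKind.DISK:
        # Traditional behavior: the file persists on the node's own disk.
        return [f"copy {path} to the node's disk"]
    # Diskless: change the image on the server so the file survives a reboot.
    steps = [f"copy {path} into the image directory on the server"]
    if rebuild_ram_disk:
        steps.append("rebuild the RAM disk")
    steps.append("reboot the affected nodes")
    return steps


if __name__ == "__main__":
    for kind in NodeKind:
        print(kind.value, plan_push(kind, "/etc/hosts"))
```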
A more detailed version of the roadmap is available in the thin-OSCAR package in oscar-package/ROADMAP.