Flexibility Is Key to Cluster Administration Software
Adapting to a changing environment
Sep. 27, 2004 12:00 AM
When companies purchase a significant number of machines and cluster them together to solve their computing needs, their site environment often drives specific requirements for their clusters. These requirements can include specific networking configurations, particular applications they want to have managed, a specific approach to software installation and maintenance, or existing management software and processes that they want to use on the cluster. The key to successful cluster administration software is that it be flexible enough to accommodate many of these environments. For optimum flexibility, the systems management software must have the following characteristics:
Flexible Fundamental Capabilities
Cluster administration encompasses a wide variety of tasks that are often unique to the cluster or to the cluster's purpose. Cluster management tools therefore need to provide ways to accomplish many different tasks with simple tools. The more flexibility is built into these tools, the better. Basic functionality that is needed for cluster management includes:
Examples of cluster administration tools that include forms of the above functionality are xCAT, the C3 tools in OSCAR, Scali Manage, and CSM. Some of the tasks above can also be accomplished with enterprise-level software that can be used in clusters. One example is Red Carpet, which provides software maintenance.
Extensible Hardware Control
Many clusters consist of heterogeneous hardware. Even if all the nodes are the same machine type, there are still non-node devices such as switches and terminal servers to consider. This makes remote hardware control (power on, off, and query) challenging, because many models require their own method of power control.
To support the ever-growing number of power methods, the administration software must support user-defined power methods that can be plugged into the main power commands. A pluggable method allows the software to support new hardware more easily, and allows the user to run the same command against all the nodes despite their different control methods. It also allows other software components, such as installation, to drive power control for the various hardware.
In addition to power control of the cluster hardware, remote console is another area that requires pluggable methods. There is a wide variety of terminal servers on the market, and now Serial over LAN (SOL) support as well, and each has its own intricacies for establishing a remote console session to the node. Besides letting you write your own console method for new terminal servers, "in-house" development allows more flexibility when upgrading cluster hardware: instead of waiting for and upgrading to the latest version of the software to support new hardware, you can script your own solution.
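To make the idea of a pluggable power method concrete, here is a minimal Python sketch. It is not taken from xCAT, Scali Manage, or CSM; the registry, the `rpower` function, the node fields, and the `dummy_pdu` method are all hypothetical names chosen for illustration. The only externally defined detail is the Wake-on-LAN magic packet layout (6 bytes of 0xFF followed by the target MAC address repeated 16 times). The handlers only report what they would do and never touch real hardware.

```python
from typing import Callable, Dict, List

# Registry of power methods: method name -> handler(node, action) -> status.
# All names here are hypothetical, for illustration only.
POWER_METHODS: Dict[str, Callable[[dict, str], str]] = {}


def power_method(name: str):
    """Decorator that plugs a handler into the registry under `name`."""
    def register(func):
        POWER_METHODS[name] = func
        return func
    return register


def wol_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"bad MAC address: {mac}")
    return b"\xff" * 6 + raw * 16


@power_method("wol")
def wol_power(node: dict, action: str) -> str:
    # Wake-on-LAN can only power a node on.
    if action != "on":
        return "unsupported"
    packet = wol_magic_packet(node["mac"])
    # A real handler would broadcast `packet` over UDP; this sketch only builds it.
    return f"would send {len(packet)}-byte magic packet"


@power_method("dummy_pdu")
def pdu_power(node: dict, action: str) -> str:
    # Stand-in for a switched power strip; a real handler would talk to the device.
    return f"would set outlet {node['outlet']} {action}"


def rpower(nodes: List[dict], action: str) -> Dict[str, str]:
    """Run one power action on every node, whatever its control method."""
    results = {}
    for node in nodes:
        handler = POWER_METHODS.get(node["method"])
        results[node["name"]] = handler(node, action) if handler else "no such method"
    return results
```

With this shape, supporting a new device means registering one more handler; the caller still issues a single `rpower(nodes, "on")` across a mixed set of nodes.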
Examples of simple hardware control methods that cluster administrators can easily develop are power on through Wake On LAN, power off through a distributed shell, and power control via a power switch like APC or Baytech. Cluster products that provide extensible power control include xCAT, Scali Manage, and CSM.
Variety of Node Installation Methods
Installing the operating system and applications on nodes is one of the most important functions of cluster administration software, because it takes so long to do manually. Because the method of installation affects the other administration processes, it's important for the software to support a variety of installation methods.
For clusters in which the nodes are not all identical and for which a separate software maintenance procedure exists, installing the RPMs directly from the distribution media is generally the most useful approach. This allows the administrator to initiate an install with just the distribution CDs in hand and to specify a different list of RPMs for different nodes. Products that support this installation method include Rocks, Clusterworx, Scali Manage, xCAT, and CSM. They generally use the unattended installation features of Kickstart and AutoYaST to automate the installation of multiple nodes over the network in parallel.
While many users like the simplicity of the direct installation method, an equally large user camp prefers the cloning method. This generally combines the node installation method with the node software maintenance strategy. In this approach a typical node (sometimes called a "golden" node) is installed manually and configured exactly how the administrator wants the rest of the nodes to be. Then the software image is captured from the golden node and replicated to the other nodes. When updates or configuration changes are necessary, the golden node is updated and the capture/replicate process is done again.
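The capture/replicate cycle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method of any product named in this article: it assumes a file-level copy with rsync over ssh, and the host names, image directory, and exclude list are hypothetical. The functions only build the command lines and do not run them. A production tool would also have to handle disk partitioning and the boot loader, which this sketch ignores.

```python
from typing import List, Sequence

# Paths that should not be copied from a live system (assumed list; adjust per site).
DEFAULT_EXCLUDES = ("/proc/*", "/sys/*", "/tmp/*")


def capture_cmd(golden: str, image_dir: str,
                excludes: Sequence[str] = DEFAULT_EXCLUDES) -> List[str]:
    """Command that pulls the golden node's files into image_dir on the server."""
    cmd = ["rsync", "-a", "--delete", "-e", "ssh"]
    cmd += [f"--exclude={pattern}" for pattern in excludes]
    cmd += [f"root@{golden}:/", image_dir.rstrip("/") + "/"]
    return cmd


def replicate_cmds(image_dir: str, nodes: Sequence[str],
                   excludes: Sequence[str] = DEFAULT_EXCLUDES) -> List[List[str]]:
    """One push command per target node, all from the same captured image."""
    src = image_dir.rstrip("/") + "/"
    return [
        ["rsync", "-a", "--delete", "-e", "ssh"]
        + [f"--exclude={pattern}" for pattern in excludes]
        + [src, f"root@{node}:/"]
        for node in nodes
    ]
```

After the golden node is updated, running the capture command again and then the replicate commands repeats the cycle; because the copy is file-level, only changed files are transferred.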
This approach is most effective for clusters in which the nodes are almost identical, in terms of both hardware and software.
There are a variety of ways the software image can be captured. Some tools, like Clusterworx and the open source tool ghost, take a snapshot of the disk image and replicate that disk image to the nodes. This approach has the advantage of being independent of the operating system that is being captured, but it works only on homogeneous hardware, since the disks all need to be similar. Another approach, used by SystemImager, captures all the files in each of the file systems on the golden node using rsync. This has a couple of advantages:
While installing the operating system locally on each node generally works well (disks are cheap, and the OS files load more quickly from a local disk), some users are moving to diskless nodes. The motivation is generally not price or even easier maintenance (there are both pros and cons in this area). It is usually reliability in large clusters, because the last moving part in the node is eliminated. (For certain users, security can also be a motivation, since there is no persistent information on the nodes.) There are three main classes of diskless clusters:
Extensible Monitoring Capabilities
As with hardware control, extensible monitoring of the cluster is a useful tool for automating the response to cluster events. While there are many enterprise software packages that provide error detection and response, it's useful to have at least some customizable, user-defined monitoring capabilities in a cluster administration product. Common events across the cluster to which the software may need to respond include node down and up events (useful for manipulating workload queues), filesystem space used, processor idle time, network adapter throughput, and syslog entries. The following extensibility points are important in event monitoring:
Hierarchical Support
There are several possible reasons for using cluster administration in a hierarchical fashion. The obvious reason is to be able to manage more nodes than the current scaling limit of the administration software supports. Another reason is to divide the nodes into smaller sets that can be managed individually, sometimes by different administrators. A third reason is to handle unusual networking configurations, for example, cross-geography clusters.
A typical hierarchical cluster consists of a three-level hierarchy in which there are sets of nodes, with each set being managed by a management server (called the First Line Management Server in Figure 2, or FMS). A top-level management server (Executive Management Server or EMS) manages all of the First Line Management Servers. Ideally, all management operations could be done from the EMS; at a minimum, it's important that the following can be done from the EMS:
Modular and Customizable
We've already mentioned that customers often have established system management processes in their lab prior to using any of the administration products mentioned in this article. It's not normally well received when the product dictates the processes to be used for all the administration tasks (installation, software maintenance, user management, configuration, monitoring, etc.). To avoid this "barrier to entry," the product must have the following characteristics:
Frequent Updates and User Contributions
As we all know, Linux software and its associated hardware do not stand still. The many components of a typical Linux cluster continue to evolve with new versions, usually several times a year, with all the components on different release schedules. And new technology continually appears. As a result, the administration software needs to continually adapt to its changing environment. This requires the ability to put out frequent updates to the product. Open source solutions (e.g., Rocks, CIT, OSCAR) generally have an easier time of this due to their iterative development style and because less testing is done by the development team (and more by the user community). But even vendor products need to find ways to release updates often.
CSM uses a combination of traditional product releases and early updates on the IBM alphaWorks site. User contributions can also help tremendously in keeping up with all the changing components. This is business as usual for open source solutions, but can be difficult for vendor products due to legal restrictions. This issue must be resolved in order for vendor products to be able to keep up with the changing environment.
Summary
In Linux clusters, there are so many open source administration utilities and so many home-grown solutions that there is very little need for a one-size-fits-all cluster administration product. The administration software must be extremely flexible to accommodate a variety of environments and to complement, but not conflict with, the utilities already being used.
References
www.alphaworks.ibm.com/tech/ect4linux