Building a Real-World IaaS Cloud Foundation
May. 26, 2009 12:15 PM
I recently gave a talk at the Cloud Computing Expo in New York about where to begin if you're building a cloud infrastructure or "Infrastructure-as-a-Service." The response was great, so I'll try to summarize the high points here for others who are interested.
What Is IaaS Anyhow?
At its core, IaaS is the physical foundation of the cloud stack: the pooled CPU, storage, and networking hardware on which everything else rests. Why start there? Without a dependable, scalable, and expandable hardware foundation, building virtualization layers and higher-level services is for naught. It should also be noted that certain specialized "cloud" infrastructures may have no virtualization layer at all, which makes a flexible physical layer even more crucial.
This IaaS layer is assumed to provide a number of idealized properties: It should "present" a pool of highly available CPUs (and maybe even differentiated CPU types), capacity/utilization data, chargeback data, and data needed by CMDB/compliance systems. In return, it should "consume" requirements for real-time server needs, storage needs, network needs, and SLA requirements. In this way the infrastructure's control policy can be set to provide the necessary SLAs and performance (see Figure 2).
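The "present"/"consume" contract above can be sketched, very loosely, as a small resource-pool interface. Everything here (class names, the SLA field, the capacity numbers) is hypothetical and for illustration only, not any vendor's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    """What the IaaS layer 'consumes': resource and SLA requirements."""
    cpus: int
    storage_gb: int
    network_mbps: int
    sla: str  # e.g. "gold" = redundant paths, "bronze" = best effort

@dataclass
class CpuPool:
    """What the IaaS layer 'presents': a pool of available CPUs
    plus the capacity/utilization data management systems need."""
    free_cpus: int = 16
    allocations: dict = field(default_factory=dict)

    def provision(self, name: str, req: Request) -> bool:
        if req.cpus > self.free_cpus:
            return False            # insufficient capacity
        self.free_cpus -= req.cpus
        self.allocations[name] = req
        return True

    def utilization(self) -> float:
        used = sum(r.cpus for r in self.allocations.values())
        return used / (used + self.free_cpus)

pool = CpuPool()
pool.provision("web-farm", Request(cpus=4, storage_gb=200,
                                   network_mbps=1000, sla="gold"))
print(pool.free_cpus)      # 12
print(pool.utilization())  # 0.25
```

The point of the sketch is the shape of the contract: callers hand the pool requirements and SLAs, and the pool hands back capacity and utilization data, which is exactly the information chargeback and CMDB systems consume.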
Note that what we're talking about is very different from a virtual infrastructure - we're talking about the highly reliable and elastic physical infrastructure that underlies any software infrastructure, whether that infrastructure is physical or virtual.
In summary, such an IaaS architecture ought to act like an idealized CPU pool with a simple API - not unlike what you see when using Amazon's EC2 or other hosted CPU services. But unless you're an Amazon, Salesforce, or another mega-provider, how do you go about building one of these?
Where the Industry "Went Wrong"
It all starts with server hardware...motherboards to be exact. When the computer industry was just getting started, motherboards harbored a CPU and rudimentary I/O. But as processors got more sophisticated, they were integrated with complex I/O (e.g., Network Interface Cards or NICs) as well as with storage connectivity (e.g., Host Bus Adapters or HBAs). Plus, there was usually a local disk, of course. These components all added up to give the motherboard the concept of "state."
This "state" meant that the server retained static data, specifically things like I/O addressing and storage connectivity naming, not to mention data on the local disk. Usually the local network had state too - ensuring that the IP and MAC address of the motherboard were attached to switches and LANs in a particular way. Add to this the fact that for critical applications, all of these components (plus naming/addressing) were frequently duplicated for redundancy.
This meant that if you had to replace (or clone) this server, say because of a failure, you had to reconfigure all of these addresses, names, storage connections, and networks - and sometimes in duplicate. That left a lot to administer, and a lot of room for error. And frankly, this is probably where fundamental "data center complexity" arose.
In response to failures and complexity, vendors developed special-purpose clustering and failover software - often necessarily coupled closely to specific software and hardware - to reassign state to the new hardware and networking. This software often required hand-crafted integration and frequent testing to ensure that all of the addressing, I/O, and connectivity operations worked properly. Many of these special-purpose systems are still in use today.
Similarly, there are equally complicated software packages for scale-out and grid computing that perform similar operations - not to correct failures, but to "clone" hardware in order to scale out systems for parallel computing, databases, and the like. These systems are equally complex and usually application-specific, again having to deal with replicating stateful computing resources.
So the industry, in an effort to add "smarts" and sophistication to the server - to enable it to failover or scale - has instead created complexity and inflexibility for itself. The question is, what could the industry have done differently, and what can we do now?
A More Elegant Approach to Infrastructure Availability & Elasticity
Suppose instead that all of this state could be removed from the server itself. If that were possible, then the ability to re-purpose and re-assign CPUs would be massively simplified: CPUs, their compute loads, and their connections to the rest of the data center could be easily cloned. Why does cloning matter? Local cloning would be the equivalent of HA, and environment cloning would be the equivalent of DR. Much of today's clustering and DR administration would be greatly simplified by such an elegant approach.
Part of the secret here is figuring out a way to eliminate (or, at least, abstract away) the traditional I/O and storage naming/addressing, plus the static networking, that cause so many headaches and complications. There are essentially two solutions:
The first approach is a ground-up design in which the server motherboards are designed without the traditional NIC and HBA adapters at all, and where the board interconnections consist of a converged high-speed "fabric" that conveys both I/O and storage data. No Ethernet, no iSCSI, no Fibre Channel (except between the system and the "outside world"). Drivers are written for the OS so that it "thinks" it's talking to ordinary NICs and HBAs, but in reality, those entities are logical, not physical; hence they can be configured at will. Within the fabric's converged control are logical switches and load balancers, plus the ability to create secure single or redundant interconnections. This "purist" architecture was first brought to market by Egenera in 2001 with its BladeFrame product, which is still available today. This architecture has also more recently been brought to market by Cisco with their Unified Computing System.
The second approach leverages existing standard hardware (see Figure 3). The same replacement device drivers are written for I/O and storage connections. However, the existing networking hardware is repurposed to provide converged I/O and storage data along a single (or dual-redundant) standard networking wire. Standard switches can be used to route traffic in the fabric, and standard NIC and HBA cards can be used for communication between the fabric and the rest of the data center. This approach is used, for example, in a joint Dell/Egenera solution, and as a slight variant in solutions from HP and IBM.
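In both approaches, the key idea is that NIC and HBA identities become logical objects the fabric can mint and re-bind at will. A toy model of that idea follows; every name in it is hypothetical, with no relation to Egenera's, Cisco's, HP's, or IBM's actual interfaces:

```python
import itertools

class Fabric:
    """Toy model of a converged fabric that owns I/O addressing.

    MAC addresses are minted in software and bound to physical boards;
    because the address is logical, it can be moved to new hardware
    without touching switches, cables, or the OS's view of its 'NIC'.
    """

    def __init__(self):
        self._seq = itertools.count(1)
        self.bindings = {}   # logical address -> physical board

    def logical_nic(self) -> str:
        """Mint a fabric-managed MAC; no physical NIC is involved."""
        return f"02:00:00:00:00:{next(self._seq):02x}"

    def bind(self, address: str, board: str):
        """Attach a logical address to a physical board (re-bindable)."""
        self.bindings[address] = board

fabric = Fabric()
mac = fabric.logical_nic()
fabric.bind(mac, "board-7")    # initial placement
fabric.bind(mac, "board-12")   # failover: same MAC, different hardware
print(fabric.bindings[mac])    # board-12
```

The second `bind` call is the whole trick: the server's network identity follows the workload to new hardware, instead of being welded to a physical card.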
Regardless of the approach used, it's now possible to "orchestrate" physical infrastructure in software (logically) rather than in hardware (physically). This allows IT Ops to provide those idealized IaaS services which are expected: the ability to define hardware infrastructure configurations in software, the ability to "clone" failed machines (whether they're virtual hosts or physical hosts), and the ability to "clone" entire computing/networking infrastructures (again, regardless of software, OS, or VM host technology). And it's truly elegant - all this without having to worry about the minutiae of special-purpose clustering, HA, or DR systems.
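To make the "clone a failed machine" idea concrete, here is a minimal sketch in which a server's entire identity is just a movable profile of logical addresses and storage names. All names and values are invented for illustration:

```python
# A server's identity as a software profile: addressing, storage
# naming, and network attachments, with nothing welded to the board.
profile = {
    "mac": "02:00:00:00:00:01",
    "wwn": "50:01:43:80:00:00:00:01",
    "vlans": [10, 20],
    "boot_lun": "lun-db-01",
}

boards = {"board-1": profile, "board-2": None}   # board-2 is a bare spare

def fail_over(boards, failed, spare):
    """Re-assign the failed board's logical profile to the spare."""
    boards[spare] = boards[failed]   # identity moves; no rewiring, no renaming
    boards[failed] = None
    return boards[spare]

moved = fail_over(boards, "board-1", "board-2")
print(moved["boot_lun"])   # lun-db-01 -- same storage, new hardware
```

Contrast this with the stateful world described earlier, where the same recovery required reconfiguring addresses, names, storage connections, and networks by hand, often in duplicate.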
Engineers know that an "elegant" design is one that accomplishes a goal with a minimum of complexity and resources. And indeed, this "infrastructure orchestration" meets those criteria. Besides providing HA and DR for essentially any form of software (physical or virtual), it also reduces overall complexity by removing the need for multiple cables, I/O cards, and application-specific clustering - and it does so in a fully flexible way. In many respects, infrastructure orchestration does for IT "plumbing" what hypervisors have accomplished in the software and OS world.
Operating a Real-World Infrastructure-as-a-Service
As mentioned earlier, this simplified form of infrastructure management for IaaS is agnostic to whether it underpins physical or virtual environments. In physical scenarios, multiple scale-out physical instances (such as Web server farms or large-scale databases) can easily be provisioned on top of the IaaS foundation - which can be managed as "elastic" if needed, cloning instances as demand warrants, or retiring instances when demand ebbs.
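The clone-on-demand, retire-on-ebb behavior described above can be sketched as a trivial scaling rule. The thresholds and function name are made up purely for illustration:

```python
def rescale(instances: int, load_per_instance: float,
            high: float = 0.8, low: float = 0.3) -> int:
    """Return the new instance count for the current per-instance load."""
    if load_per_instance > high:
        return instances + 1       # demand warrants cloning another instance
    if load_per_instance < low and instances > 1:
        return instances - 1       # demand ebbed; retire one to the pool
    return instances               # steady state

print(rescale(4, 0.9))   # 5 -- clone
print(rescale(4, 0.1))   # 3 -- retire
print(rescale(4, 0.5))   # 4 -- hold
```

What makes this interesting in an orchestrated infrastructure is that "clone" can mean a physical Web server or a virtual host; the policy is the same either way, because the underlying CPUs are an undifferentiated pool.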
Similarly in virtual environments, each CPU can be automatically pre-provisioned with a virtual host - presenting an "elastic" pool of VMs when/as demand warrants. Traditional VM management tools can then transparently operate on top of the physical IaaS foundation. If new hosts are needed, the IaaS can provision as needed.
And the beauty of infrastructure orchestration is that both physical and virtual environments can be simultaneously built on top of the same IaaS.
As virtual infrastructures, cloud computing, and automation begin to permeate the market, this form of infrastructure management will increasingly be seen as the perfect complement to software virtualization...flexible software environment, flexible infrastructure architecture. And as higher-level PaaS and SaaS models establish themselves in the market, the need for a solid foundational IaaS architecture will be more prevalent than ever.
Consider that the entire IT industry will have advanced quite a bit once our complex hardwired plumbing finally evolves into a simple and elegant IaaS foundation.