Managing Cloud Applications
Cloud computing will change the processes and tools that IT organizations currently use
By: Peter Loh
Sep. 26, 2009 08:00 AM
As enterprises evaluate if and how cloud computing fits into their core IT services, they must consider how they will manage cloud services as part of their day-to-day operations. This article examines how operational management of cloud computing differs from traditional methods, and examines techniques for addressing these needs.
Cloud computing will change the processes and tools that IT organizations currently use. In a traditional datacenter environment, IT organizations have complete control and visibility into their infrastructure. They install each piece of hardware and therefore have complete configuration control. All components in the network are accessible and can be monitored with the right tools. Most enterprises have invested heavily in complex tools in order to manage this environment so that they can identify service-affecting conditions, and analyze performance metrics so they may tune their systems to optimize performance.
For cloud computing services, the enterprise no longer has control of, or visibility into, the components of the service. Yet if the cloud is to replace a core service, how can the IT organization guarantee equivalent availability and performance service levels? In today's IT environment, problems that must be isolated between an enterprise and its vendor are among the most difficult to resolve. Cloud vendors are painting a future in which an enterprise will pick multiple cloud services from a market of such services, which means these problems will become more common and more complex. Yet enterprises will not deploy services, even if they are less costly and more agile, if those services cannot provide an acceptable level of service. The relationship between cloud vendors and enterprises must evolve. Vendors must not only earn the trust of enterprises, but must also provide mechanisms by which enterprises can verify that trust in a transparent manner. One step toward that goal is to have management tools that can provide the in-depth views that customers need and that can prove promised service levels are being met.
Let's look at how such a system might work, from a technical perspective. Most enterprise class management systems include the following basic features:
The ability to gather metric information from a variety of components.
The process of metric gathering should be as open and simple as possible, supporting many scripting options. Many current enterprise tools require specific and deep programming skills in order to extend monitoring. This limits the use of the tool, since most management systems are deployed by system administrators, not software developers. For managing cloud applications this is even more important, since interfacing to specific cloud vendors requires writing to their APIs, which are typically REST-style interfaces.
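To make this concrete, a scriptable check can be as small as a few lines. The sketch below follows the Nagios plug-in convention (exit status 0/1/2 for OK/WARNING/CRITICAL); the metric name, thresholds, and value here are illustrative placeholders — in practice the value would come from a local command or a vendor's REST API.

```python
# A minimal Nagios-style check sketch: scriptable metric evaluation with
# warning/critical thresholds. The metric source is a placeholder; a real
# check would read it from a local command or a cloud vendor's REST API.

def evaluate_metric(name, value, warn, crit):
    """Return a (status, message) pair in the Nagios plug-in convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL."""
    if value >= crit:
        return 2, f"CRITICAL - {name} at {value}%"
    if value >= warn:
        return 1, f"WARNING - {name} at {value}%"
    return 0, f"OK - {name} at {value}%"

# A system administrator can wire this up without deep programming skill:
status, message = evaluate_metric("memory utilization", 98, warn=80, crit=95)
```

Because the check is just a script with a simple output contract, administrators can extend monitoring without developer involvement — which is the openness argued for above.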
Enterprise operations typically have additional requirements. Ideally, a tool can integrate with the other tools that the enterprise uses. For example, monitoring tools may get their configuration information from a configuration database, and alerts may be fed into a problem ticketing system. The more automation capability a tool has, the better chance it has of fitting into current IT operational processes.
How might this system deal with the dynamic configuration of cloud systems? For many existing tools the provisioning process is a problem. Existing systems management tools are designed to follow a host name/IP address model, not a virtualized model. You need to define the IP/host of each managed system. However, cloud instances are typically dynamically defined. Take the case of Amazon EC2. The host name and IP address are assigned on instance startup. As shown in Figure 1, in order to use these tools, the instance must first be started, the provisioning parameters (IP/host) extracted from the Amazon API and implemented in the management system, and then the management system must be reloaded to implement the change. The exact mechanics vary depending on the management system and cloud vendor, but all rely on a tight dependency between cloud configuration and the monitoring system. This disconnect can cause a lag time in monitoring the true cloud configuration, or worse, an incomplete monitoring system.
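The provisioning step in Figure 1 can be sketched as follows. This is a simplified stand-in, not Amazon's actual API: the instance records below mimic the shape of a DescribeInstances result, and the generated host definitions follow the Nagios configuration style.

```python
# Sketch of the Figure 1 provisioning step: after instances start, their
# dynamically assigned host names must be pulled from the cloud API and
# pushed into the management system's static configuration. The records
# below are a simplified stand-in for Amazon's DescribeInstances output.

def hosts_from_api(instances):
    """Turn running-instance records into Nagios-style host definitions."""
    definitions = []
    for inst in instances:
        if inst["state"] != "running":
            continue  # only monitor live instances
        definitions.append(
            "define host {\n"
            f"    host_name  {inst['instance_id']}\n"
            f"    address    {inst['public_dns']}\n"
            "}"
        )
    return definitions

api_response = [
    {"instance_id": "i-0a1b2c3d", "state": "running",
     "public_dns": "ec2-203-0-113-7.compute-1.amazonaws.com"},
    {"instance_id": "i-0e4f5a6b", "state": "terminated", "public_dns": ""},
]
configs = hosts_from_api(api_response)
# The management system must then be reloaded to pick up `configs`,
# which is exactly the lag described above.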
Another issue with existing management tools is limited visibility into the cloud infrastructure's operational data. Each cloud vendor has its own configuration definition and operational parameter fields. I call this data the cloud vendor's "metadata." In the case of Amazon EC2, the metadata includes instance ID, Amazon image ID (AMI), security groups, location, public DNS name, and private DNS name. Existing tools are not designed to gather this metadata. When IT operations personnel are troubleshooting EC2 problems, it is difficult to understand the entire scenario without it. Every cloud vendor has its own metadata, so the problem is compounded with each additional vendor.
An alternative to using existing management tools is to rely on the vendor to provide the required visibility. However, today most vendor tools and APIs provide limited visibility. Most infrastructure providers only show whether the instance is running or not. From the infrastructure cloud provider's perspective, this makes sense: they are responsible for the virtual server, not what a user might install on it. Amazon has recognized this issue and has tried to address it with their CloudWatch service. This is an optional service that allows the user to gather additional instance metrics, such as CPU utilization, disk read/write operations, and throughput, from Amazon's APIs. However, Amazon only exposes the information: it is up to the user to use the data for alerts or reporting. Though there are some entry-level cloud tools that read API information for status, they do not provide the management features listed previously.
Cloud-specific tools are usually not designed for use by enterprise IT operations. Simple web browser-oriented interfaces are fine for monitoring a few development instances, but enterprises can require monitoring of hundreds of instances and thousands of metrics, which is beyond the capability of most Web applications. For enterprise IT operations that are accustomed to in-depth monitoring and high-function capability, vendor services alone are inadequate.
The preferred approach is to integrate the capability of high-function enterprise management tools with information from vendor APIs. This system is shown in Figure 2.
In this system, standard monitoring scripts can be deployed either under the control of agents on monitored instances, or as active checks from the management server. Supporting open source scripts, such as those from the Nagios plug-in project, will allow in-depth monitoring of many components, including those listed above. However, this basic monitoring information must be augmented with vendor API information. Vendor data may be queried from the agent, the management server, or both. This approach allows the management system to process vendor metadata combined with monitoring data. Views presented to operators and system administrators can then show much richer information.
Dynamic changes in the cloud must be immediately recognized by the system. One way to handle this is to remove the requirement that managed systems be pre-defined. Events received from managed systems can be processed as long as they are authenticated. This "event-based" model allows new instances to be managed as soon as they start.
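The event-based model above can be sketched with a shared-key signature check. The key, field names, and HMAC scheme here are illustrative assumptions; the point is that a previously unknown instance registers itself simply by sending a correctly signed event.

```python
import hashlib
import hmac
import json

# Sketch of the "event-based" model: events from previously unknown instances
# are accepted as long as they carry a valid signature, so monitoring begins
# the moment an instance starts. The shared key and fields are illustrative.

SHARED_KEY = b"example-shared-key"
known_hosts = set()  # populated on the fly, not pre-defined

def sign(event: dict) -> str:
    """Compute an HMAC-SHA256 signature over a canonical JSON form."""
    payload = json.dumps(event, sort_keys=True).encode()
    return hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()

def receive(event: dict, signature: str) -> bool:
    """Accept an authenticated event, auto-registering its source host."""
    if not hmac.compare_digest(sign(event), signature):
        return False  # reject unauthenticated senders
    known_hosts.add(event["host"])  # new instances register themselves
    return True

event = {"host": "i-0a1b2c3d", "metric": "memory", "value": 98}
accepted = receive(event, sign(event))
```

There is no configuration reload: the monitored population grows and shrinks as authenticated events arrive, matching the dynamic nature of the cloud.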
Let's look at an example. Suppose we are monitoring a set of application servers running in the Amazon EC2 cloud. A standard script used to get memory utilization from the Linux system would result in the output, "Free memory at 98%. Critical severity."
In a traditional system, this information is associated with the host that it is run on. The management information received after the execution of such a script would look like Table 1.
However, if we combine the cloud metadata (using Amazon EC2 in this example), the event information would look like Table 2.
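The difference between the two tables amounts to a merge of the monitoring event with the vendor metadata. The sketch below illustrates this with made-up values; the metadata fields follow the EC2 list given earlier.

```python
# Illustrative sketch of the Table 1 vs. Table 2 difference: a traditional
# event keyed only by host, versus the same event merged with EC2 metadata.
# All values are made up for illustration.

traditional_event = {
    "host": "ec2-203-0-113-7.compute-1.amazonaws.com",
    "service": "memory",
    "message": "Free memory at 98%. Critical severity.",
}

ec2_metadata = {
    "instance_id": "i-0a1b2c3d",
    "ami": "ami-1a2b3c4d",
    "security_groups": ["app-servers"],
    "location": "us-east-1a",
}

# The enriched event carries both the measurement and its cloud context:
enriched_event = {**traditional_event, **ec2_metadata}
```

An operator seeing `enriched_event` knows not just which host alerted, but which image it runs, which security groups apply, and which datacenter it lives in.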
An operator looking at this event information would have more complete information on the source and impact of this memory alert. Greater value can be realized by correlating all event information by vendor metadata. For example, grouping instances by their location (that is, by which vendor datacenter the virtual instance is running in) might explain system behavior. Access to the metadata also gives the management system the opportunity to perform higher-level functional checks on the managed cloud application.
The following application scenario is based on a real user. The application consists of many server instances running in the Amazon EC2 cloud, organized into groups of servers, with each group performing a different role. There may be up to 50 groups running in the application. The group role is determined by a parameter passed in the "User Data" field of the Amazon EC2 metadata.
The management system accesses the metadata and customizes its active checks based on the role contained in the metadata. Since dynamic changes are handled, as the number of instances within each role group changes, the management system will adapt. Existing management tools were in place, so the cloud management system gathered some of its metric information from an existing tool rather than requiring re-instrumentation. This made the transition to managing the cloud easier for operations.
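Role-driven check selection can be sketched as a small lookup keyed by the User Data field. The `role=...` encoding, role names, and check lists below are hypothetical; the real user's conventions are not public.

```python
# Sketch of role-driven check selection: the group role arrives in the EC2
# "User Data" field, and the management system picks that role's active
# checks. Role names, check names, and the key=value encoding are all
# hypothetical illustrations.

CHECKS_BY_ROLE = {
    "web":      ["check_http", "check_memory"],
    "database": ["check_mysql", "check_disk", "check_memory"],
}

def checks_for_instance(user_data: str) -> list:
    """Parse 'role=<name>' out of User Data and return that role's checks."""
    fields = dict(
        pair.split("=", 1) for pair in user_data.split(";") if "=" in pair
    )
    role = fields.get("role", "")
    return CHECKS_BY_ROLE.get(role, ["check_memory"])  # baseline default

assigned = checks_for_instance("role=database;env=prod")
```

As instances start and report their User Data, each one automatically receives the right checks for its role — no per-instance configuration is written.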
Since the application consists of many groups of instances, the status of a single instance is not as important as the status of the role's group. One type of higher-level check that was applied is to compute average (or optionally, maximum or minimum) values across the group. Alerts can then be generated based on group metrics rather than instance metrics. The operator has the ability to drill down from the group to the instance values.
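The group-level check described above reduces to aggregating each role group's samples and alerting on the aggregate. A minimal sketch, with illustrative sample values:

```python
# Sketch of a group-level check: aggregate a metric across a role group
# (average by default, optionally max or min) and alert on the group value
# rather than on any single instance.

def group_metric(values, mode="avg"):
    """Aggregate per-instance samples into one group value."""
    if mode == "max":
        return max(values)
    if mode == "min":
        return min(values)
    return sum(values) / len(values)

def group_alert(values, threshold, mode="avg"):
    """Return (group_value, alerting) for a role group's metric samples."""
    value = group_metric(values, mode)
    return value, value >= threshold

# Per-instance memory utilization for one role group; one hot instance
# does not trip the group average, though a max-mode check would catch it:
samples = [72.0, 68.0, 95.0, 65.0]
value, alerting = group_alert(samples, threshold=90)
```

The operator still drills down to `samples` when the group value alerts, but routine noise from a single busy instance no longer pages anyone.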
This example uses infrastructure cloud provider services. An equivalent scenario can be applied to platform service providers; the difference is that the monitoring metrics would test the platform provider's APIs rather than traditional measurements. Again, this implies a tight integration between the management system and the cloud vendor's services.
The management system in this scenario can be used as a basis for establishing vendor trust because of the following advantages:
These advantages also enable the system to be a basis for trust-enabling applications. For example, a billing report can be generated based on the telemetry information gathered by the system. This report would be independent from the vendor, generated by the gathered metric data and the vendor's stated billing policy. This report could be used as an independent audit of the vendor's bill. Another example would be to gather the security information from the vendor's configuration and perform TCP port checks defined by those groups. This verifies the security policy stated by the vendor is enforced for this user's cloud configuration.
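The billing-audit idea can be sketched as a simple reconciliation: multiply the instance-hours the management system itself observed by the vendor's published rates, and compare against the invoice. The rates and usage figures below are illustrative, not actual Amazon prices.

```python
# Sketch of an independent billing audit: reconstruct the expected bill from
# the management system's own telemetry (instance-hours per instance type)
# and the vendor's stated rates, then flag any discrepancy with the invoice.
# Rates and usage figures are illustrative, not actual Amazon prices.

RATES_PER_HOUR = {"m1.small": 0.10, "m1.large": 0.40}

def audited_total(usage_hours: dict) -> float:
    """Compute expected charges from gathered instance-hour telemetry."""
    return round(
        sum(RATES_PER_HOUR[t] * h for t, h in usage_hours.items()), 2
    )

telemetry = {"m1.small": 720, "m1.large": 100}  # observed by the monitor
expected = audited_total(telemetry)
vendor_invoice = 118.50                          # amount the vendor billed
discrepancy = abs(expected - vendor_invoice) > 0.01
```

Because the telemetry is gathered independently of the vendor, the resulting figure serves as the transparent, verifiable check on trust that the article calls for.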
If we project this scenario into a future where multiple cloud vendors are used by an enterprise, the management system would look like the one shown in Figure 3.
This system would be a consolidation point for all cloud services, and would translate the heterogeneous cloud services into a common view, simplifying IT operations.
For cloud computing to fulfill its promise of enabling enterprise IT organizations to improve the service they provide to their users, traditional IT operational processes and tools must adapt to new ways of interacting with external vendor services. I've examined some of the issues that enterprises are encountering and have offered solutions. But it is clear that these issues are a barrier to adoption of cloud services, and the needs of enterprise operations must be addressed by the cloud community.