Ten Common Cloud Usage Traps
Many cloud users moved to the public cloud for cost reasons, but stay for the availability benefits
By: Aaron Klein
Jul. 18, 2013 11:41 AM
Many cloud users moved to the public cloud for cost reasons, but stay for the availability benefits. AWS (and other public clouds) offers users tremendous advantages in terms of elasticity. Need another 100 servers? We can spin those up instantly. Unexpected CDN demand? No problem. Want to test something out? We'll get the resources right away. In short, the public cloud offers users nearly unlimited capacity almost instantaneously, at least compared with the old "order the servers from IT" process.
However, along with the fantastic increase in availability, we have found that using a public cloud is far more complex than it first appears. As anyone who has tried to navigate through the AWS management console will attest, there are numerous opportunities for missteps. Unfortunately, the errors are rarely obvious and always sacrifice functionality. Here, organized by service, are 10 common and avoidable traps with a brief explanation of what each means and why each matters. Importantly, these are all readily identifiable through manual account inspection or with the help of an automated tool from CloudCheckr, Cloudyn, CloudVertical, or other vendors.
1. Over-utilizing instances. If the CPU Utilization of an instance is averaging higher than 75% over a two-day period, the server may be over-utilized. Potentially over-utilized instances should be reviewed to see if upgrading to a larger instance type may be appropriate to ensure system stability.
2. Failing to properly distribute instances across AZs. Unfortunately, outages do occur. Consequently, EC2 instances within a region should be evenly distributed across the Availability Zones within that region, so that when an Availability Zone does experience an outage or any other disruption, its impact on your services is minimized.
3. Failing to adequately update EBS snapshots. Users can take snapshots of their EBS volumes to act as backup, or to be used as a baseline for new volumes. Snapshots of EBS volumes should be taken regularly to be used in the event of disaster recovery.
4. Over-utilized EBS volumes. If the number of bytes written to and read from an EBS volume averages higher than 1,000,000,000 over a two-day period, the EBS volume may be over-utilized. To ensure optimal functionality, review the activity and consider adding volumes.
5. Failing to use multiple AZs. If an Auto Scaling group has only one Availability Zone, it is vulnerable to failure if that Availability Zone's service is interrupted. Configuring the Auto Scaling group to use multiple Availability Zones allows Auto Scaling to redistribute the load to a healthy Availability Zone, so the group is not negatively impacted by the interruption.
6. Failing to use an adequate cooldown period. Cooldown periods help to prevent Auto Scaling from initiating additional scaling activities before the effects of previous activities are visible. Because scaling activities are suspended while an Auto Scaling group is in cooldown, an adequate cooldown period helps to prevent a trigger from firing based on stale metrics.
7. Referencing an invalid security group or invalid key pair. If the security group or key pair referenced within the launch configuration has been deleted, the Auto Scaling group will not be able to launch new EC2 instances. A new launch configuration will need to be created and the Auto Scaling group will need to be updated to use it.
8. Over-utilized ELB. If the request count for a load balancer is greater than 1,000,000 over a two-day period, the load balancer may be over-utilized. To ensure optimal functionality, potentially over-utilized load balancers should be reviewed to see if additional ELBs need to be added to properly handle requests.
9. Failure to attach at least 2 healthy instances. There should always be a minimum of two healthy instances associated with a load balancer. If there is only one, the load balancer will not be able to fail over, as it will have no other instance to reroute traffic to.
10. Not detecting and replacing an unhealthy instance. Load balancers will consider an instance unhealthy if the instance is closing the connection to the load balancer, responses are timing out, or public key authentication is failing. While it is possible that the instance is failing health checks simply because it is under heavy load, any instance that a load balancer deems 'unhealthy' should be checked.
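The utilization thresholds quoted in items 1, 4, and 8 can be expressed as simple checks. The following Python is only a minimal sketch: the function names and the shape of the metric samples are assumptions made for illustration, the samples are taken to cover the two-day window, and the EBS rule is read here as the average of combined read and write bytes per sample. It is not how any particular vendor's tool performs these checks.

```python
from statistics import mean

# Thresholds quoted in the list above. Everything else (function names,
# sample format) is hypothetical and chosen only for this illustration.
CPU_AVG_LIMIT_PERCENT = 75.0
EBS_BYTES_LIMIT = 1_000_000_000
ELB_REQUESTS_LIMIT = 1_000_000


def cpu_over_utilized(cpu_percent_samples):
    """Item 1: True if average CPU utilization over the window exceeds 75%."""
    if not cpu_percent_samples:
        return False  # no data, nothing to flag
    return mean(cpu_percent_samples) > CPU_AVG_LIMIT_PERCENT


def ebs_over_utilized(read_bytes_samples, write_bytes_samples):
    """Item 4: True if combined read+write bytes average above the limit.

    Assumption: the two lists are aligned sample by sample.
    """
    totals = [r + w for r, w in zip(read_bytes_samples, write_bytes_samples)]
    if not totals:
        return False
    return mean(totals) > EBS_BYTES_LIMIT


def elb_over_utilized(request_count_samples):
    """Item 8: True if total requests over the window exceed 1,000,000."""
    return sum(request_count_samples) > ELB_REQUESTS_LIMIT
```

In practice the samples would come from whatever monitoring data you already collect for the account; a flagged resource is a candidate for review, not proof of a problem.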
Every issue on this list is relatively common (see CloudCheckr's March survey). Further, none of the fixes is particularly tricky or requires a significant amount of time. Plainly, the trick is identifying the problems before they degrade performance.
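As one illustration of what identifying such problems can look like, the redundancy traps (items 2, 5, and 9) reduce to counting. The sketch below assumes you already have an inventory of your account as plain Python data. The function names, the dictionary layouts, and the "healthy" state label are all hypothetical stand-ins, not the output format of any AWS API or vendor tool.

```python
from collections import Counter


def az_imbalance(instance_azs, region_azs):
    """Item 2: gap between the most and least populated AZ in the region.

    instance_azs: one AZ name per running instance (hypothetical input).
    region_azs:   every AZ available in the region, so empty AZs count as 0.
    """
    counts = Counter(instance_azs)
    per_az = [counts.get(az, 0) for az in region_azs]
    return max(per_az) - min(per_az) if per_az else 0


def single_az_groups(groups):
    """Item 5: names of Auto Scaling groups configured with fewer than two AZs.

    groups: mapping of group name -> list of AZ names (hypothetical layout).
    """
    return sorted(name for name, azs in groups.items() if len(set(azs)) < 2)


def under_provisioned_load_balancers(load_balancers, minimum_healthy=2):
    """Item 9: names of load balancers with fewer than two healthy instances.

    load_balancers: mapping of name -> list of instance states, where the
    string "healthy" is an illustrative label, not an AWS state name.
    """
    return sorted(
        name
        for name, states in load_balancers.items()
        if sum(1 for state in states if state == "healthy") < minimum_healthy
    )
```

An imbalance of 0 or 1 from `az_imbalance` means the instances are spread as evenly as the count allows; larger values, or any name returned by the other two functions, point at a deployment worth reviewing.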
Unfortunately, identification is complicated by the cloud's dynamic and elastic nature. With ever-changing and evolving deployments, last month's, last week's, or even yesterday's review is outdated. To avoid an unplanned outage, users need to regularly re-check and remain vigilant for these concerns. The monitoring can be completed manually or through an automated tool, but it needs to be done.
As a note of disclosure: I am a Founder and the COO of CloudCheckr Inc. We specialize in this space and devote our solution to addressing all of these core infrastructure monitoring and control issues. Obviously, I am biased in favor of our leading product: CloudCheckr Pro. However, do not take my word for which solution is optimal. I encourage you to try CloudCheckr Pro along with solutions from vendors such as Cloudyn, Cloudability, and others. Use the free trials and judge for yourself.