An Open Source-Based Cloud Data Storage and Processing Solution
Why is an on-demand storage solution required?
Jan. 13, 2011 11:45 AM
Applications are increasingly being made available over the Internet. Many of them have a large user base that produces a huge volume of data, for example, content in a community portal, emails in a web-based email system, and call log files generated at call centers. Because large amounts of data are added every minute and historical data must be kept for legal, reference, data warehousing, and analytics requirements, the systems' data size keeps growing exponentially. This requires a huge storage and processing infrastructure, which is expensive for companies to procure and maintain. Such large data sets also raise typical challenges: How do you store the data reliably and economically? How do you process it efficiently? How do you provide search?
Traditional storage solutions, as shown in Figure 1, use an n-tiered architecture with SAN or NAS for storage, relational databases (RDBMS) for search and retrieval, and separate compute servers for processing. This architecture, however, requires expensive hardware and a long lead time to scale.
Figure 1: Traditional storage solution architecture with NAS/SAN and RDBMS
On-Demand Data Storage and Processing Solution
Cloud computing offers on-demand scalability of resources that can be leveraged to provide scalable data storage. To efficiently and effectively manage the resources and data stored in the cloud, a cloud data storage and processing solution is presented here. Our solution uses Eucalyptus, an open source cloud platform, to manage the underlying storage infrastructure. Specialized open source projects such as Hadoop and Lucene offer low-cost, scalable alternatives for applications that need to process huge amounts of data.
The proposed solution uses Hadoop, an open source distributed storage and processing framework, to provide replication, a distributed file system, and a framework for running large data processing applications on clusters of commodity hardware. These layers give the solution a foundation for meeting the QoS requirements of reliability and performance for huge data volumes on hardware that is prone to failure.
Next, efficiently managing large data sets distributed across multiple machines in the cloud poses a great challenge. It is also necessary to know exactly when and how much capacity needs to be added or removed, and to deal with the complexity of provisioning new infrastructure.
To enable easy, need-based management of the cloud storage environment by the user, this solution has a web interface with capabilities such as monitoring the consumption and availability of storage and a method to quickly add/remove storage when required. To optimize resource usage, alerting mechanisms send messages when a large amount of space is lying unused and/or when more space will be required based on forecasting models' results. Thus, an on-demand storage solution will provide the following capabilities:
- Increase/decrease the storage as and when required
- Faster access to the distributed files
- Proactive alerts for increasing/decreasing the storage
Traditional vs Cloud Computing-Based On-Demand Storage Solutions
Table 1 provides a summary of the limitations of traditional solutions and how these new solutions address them.
Table 1: Comparison of traditional and Cloud computing solutions for data processing and storage
Architecture of Cloud Data Storage and Processing Solution
The solution architecture for a cloud-based data processing and storage solution is shown in Figure 2.
Figure 2: Cloud based storage and data processing solution architecture
The solution architecture consists of following components:
Specialized Cloud Infrastructure
The foundation layer of the solution consists of the cloud infrastructure that virtualizes the underlying hardware and provides components on demand. The solution leverages Eucalyptus, an open source cloud computing framework, to provide the base cloud infrastructure. Eucalyptus uses the Xen virtualization platform to virtualize the physical hardware. It provides on-demand scalability by enabling the addition, instantiation, and management of nodes in the cluster. These nodes can contain not only a virtual machine with the operating system but also a complete software stack, enabling the creation of virtual appliances that can be instantiated and shut down on demand. In addition, a cluster management module is included to automate and ease the management of these instances.
Figure 3: Eucalyptus Cloud infrastructure architecture
Distributed File System
The next layer in the solution is a distributed file system (DFS) that provides a scalable and fault-tolerant file system leveraging the storage capacity available on multiple machines. For this the solution uses the Apache Hadoop Distributed File System (HDFS), as it provides reliable data storage through replication.
Figure 4: Hadoop based distributed file system
The on-demand solution bundles HDFS data nodes as Eucalyptus images and keeps the Hadoop name node on an isolated machine. Whenever there is a storage requirement, data nodes are instantiated on new hardware from these images and added to the cluster.
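The scale-out decision behind this step can be sketched in a few lines. This is only an illustrative model, not Eucalyptus or Hadoop API code: the per-node capacity and the 80% utilization threshold are assumed values, and in the real solution the console would instantiate data-node images rather than just report a count.

```python
# Minimal sketch of the on-demand scale-out decision described above.
# Node capacity and the 80% threshold are illustrative assumptions.

def nodes_to_add(used_gb, capacity_gb, node_capacity_gb=500, threshold=0.8):
    """Return how many new data-node instances to start so that
    cluster utilization falls back below the threshold."""
    if used_gb / capacity_gb < threshold:
        return 0
    extra = 0
    while used_gb / (capacity_gb + extra * node_capacity_gb) >= threshold:
        extra += 1
    return extra

# A cluster at 90% of 2 TB needs one more 500 GB data node
# to drop below the 80% utilization threshold.
print(nodes_to_add(used_gb=1800, capacity_gb=2000))  # 1
```

In the proposed architecture, a nonzero result would trigger instantiation of that many data-node images on fresh Eucalyptus nodes.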
Data Processing Module
The next component of the solution is a highly scalable data processing engine that is based on a parallel processing algorithm and is co-located with the storage nodes. To implement this, Hadoop MapReduce is leveraged, as it partitions the processing into tasks and executes them in parallel across several nodes, reducing the overall processing time.
Figure 5: Hadoop Map-Reduce based Data Processing module
In the proposed solution, the Hadoop JobTracker and TaskTracker nodes are bundled as Eucalyptus images. This allows new processing nodes to be instantiated and added to the cluster on demand.
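The map/shuffle/reduce contract that these tracker nodes execute can be shown with a small single-process sketch. The word-count mapper and reducer below are the conventional introductory example, not code from the proposed solution; Hadoop would spread the same phases across many TaskTracker nodes.

```python
# Single-process sketch of the MapReduce flow used by the data
# processing module. Hadoop runs map and reduce tasks on many nodes,
# but the map -> shuffle -> reduce contract is the same.
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) pairs, as a word-count mapper would.
    return [(word, 1) for word in record.split()]

def reduce_phase(key, values):
    # Sum the counts emitted for one key.
    return key, sum(values)

def map_reduce(records):
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):   # map
            shuffled[key].append(value)        # shuffle: group by key
    return dict(reduce_phase(k, v) for k, v in shuffled.items())  # reduce

print(map_reduce(["big data", "big cluster"]))
# {'big': 2, 'data': 1, 'cluster': 1}
```

Because each map call depends only on its own record, and each reduce call only on one key's values, both phases parallelize naturally across nodes, which is what makes co-locating processing with the storage nodes effective.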
Distributed Search Engine
Another important component of the solution is a distributed search engine that enables search operations on the data stored in a distributed file system. There are two implementation options available: Hive and Lucene.
With the Lucene implementation, MapReduce tasks are used to build the index file shards. The shards are distributed across multiple Lucene search nodes to enable an efficient distributed search.
In the case of Hive, the query engine is implemented using MapReduce tasks for distributed data processing. Hive offers a SQL-like interface and converts search requests into MapReduce tasks that process the search operation in parallel to efficiently retrieve the results.
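The scatter-gather pattern behind the Lucene option can be sketched as follows. The shard contents and relevance scores are made-up values; a real deployment would query Lucene index shards over the network in parallel, not in-memory dictionaries.

```python
# Illustrative scatter-gather search over index shards: each shard
# holds part of the index, every shard is queried, and the partial
# hit lists are merged by score. Data and scores are invented.

shards = [
    {"doc1": {"hadoop": 0.9}, "doc2": {"lucene": 0.7}},
    {"doc3": {"hadoop": 0.4, "lucene": 0.8}},
]

def search_shard(shard, term):
    # Return (doc, score) hits for one shard.
    return [(doc, scores[term]) for doc, scores in shard.items() if term in scores]

def distributed_search(term):
    hits = []
    for shard in shards:  # in the real system this fan-out runs in parallel
        hits.extend(search_shard(shard, term))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)  # merge by score

print(distributed_search("hadoop"))  # [('doc1', 0.9), ('doc3', 0.4)]
```

The Hive option differs mainly in the front end: the SQL-like query is compiled into MapReduce jobs, but the work is likewise fanned out across nodes and the partial results merged.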
The top layer is a Management console that provides a web-based user interface to:
- Provision Infrastructure - For quickly and easily adding new nodes to the cluster. The console adds new node instances and removes unused nodes running on Eucalyptus to manage the storage capacity on demand.
Figure 6: Provision infrastructure of cloud management solution
- Monitor runtime usage of resource consumption and availability, thus enabling on-time warnings and accurate capacity management.
Figure 7: Infrastructure monitoring using Cloud management solution
The web console interacts with a Hadoop monitoring component to retrieve the usage and availability information and display it graphically in a single monitoring console.
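The aggregation step the console performs can be sketched as below. The report format and the 85%/30% alert thresholds are assumptions for illustration; the actual figures would come from the Hadoop monitoring component.

```python
# Sketch of how the console might roll per-node usage reports into a
# single cluster view and raise the warnings described above.
# Report fields and thresholds are illustrative assumptions.

def cluster_summary(node_reports, high=0.85, low=0.30):
    used = sum(r["used_gb"] for r in node_reports)
    capacity = sum(r["capacity_gb"] for r in node_reports)
    utilization = used / capacity
    if utilization >= high:
        alert = "add storage"           # nearing capacity
    elif utilization <= low:
        alert = "reclaim unused storage"  # space lying unused
    else:
        alert = None
    return {"utilization": round(utilization, 2), "alert": alert}

reports = [{"used_gb": 400, "capacity_gb": 500},
           {"used_gb": 480, "capacity_gb": 500}]
print(cluster_summary(reports))
# {'utilization': 0.88, 'alert': 'add storage'}
```

A single consolidated figure like this is what makes on-time warnings and accurate capacity management possible from one screen.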
- Forecast future storage requirements and automatically initialize new data nodes. The management console has a forecasting module that projects the expected data volume from historical volume information using statistical forecasting models. Depending on the forecast, new data nodes can be added proactively before any capacity-full alerts arrive from the monitoring system.
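The forecasting idea can be illustrated with the simplest such model, a least-squares line through historical daily usage. A production module would likely use a richer statistical model; the linear fit below is only a sketch, and the sample usage figures are invented.

```python
# Sketch of the forecasting module: fit a straight line to historical
# daily usage and extrapolate it forward. The linear model and the
# sample data are illustrative assumptions.

def linear_forecast(history, days_ahead):
    """Least-squares line through (day, usage) points, projected ahead."""
    n = len(history)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + days_ahead)

usage_gb = [100, 110, 120, 130]      # perfectly linear growth: +10 GB/day
print(linear_forecast(usage_gb, 7))  # 200.0 GB expected in a week
```

Comparing the projected figure against current cluster capacity tells the console how many data nodes to instantiate ahead of demand.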
Applicability of Cloud-Based On-Demand Storage Solution
This solution architecture is efficient when processing and search are needed in addition to storage, a combination that is otherwise less efficient to implement using traditional approaches. It is most useful for applications where data is "written once and read many times"; for applications that need low-latency retrieval or frequent updates to the data, a traditional RDBMS-based architecture remains the better fit. In summary, the solution is applicable when:
- Data processing and search are needed along with storage
- The application does not require low-latency retrieval
- The data follows a "write once, read many times" pattern
- Updates to the data are infrequent
Cloud computing has evolved quickly, and a whole range of commercial storage solutions is available in the market. Permabit Cloud Storage is a highly scalable, available, and secure storage platform designed for service providers. Nirvanix Storage Delivery Network (SDN) is a fully managed storage solution with unlimited, on-demand scalability; its standards-based integration APIs allow applications to quickly read and write to the cloud. The Mezeo Cloud Storage Platform is another highly secure platform, offering encryption in storage. This enables files stored in its cloud to be securely accessed through a variety of mobile devices or web browsers without any Virtual Private Network (VPN) setup.
Zetta Enterprise Cloud Storage solutions support all unstructured data types and are backed by industry-leading data integrity and security. EMC Atmos onLine is an Internet-delivered cloud storage service that provides Cloud Optimized Storage (COS) capabilities to customers with reliable SLAs and secure access. It enables customers to move data from on-premise to off-premise using policies. The ParaScale Cloud Storage (PCS) software does not require custom or dedicated hardware and can leverage existing IP networking interconnections. It aggregates disk storage on multiple standard Linux servers to present one or more logical namespaces, and enables file access via standard file-access protocols (NFS, HTTP, WebDAV, and FTP). Applications and clients don't have to be modified or recompiled to use PCS.
Because customers traditionally store data in-house, they find it difficult to put their business at risk by moving their data out of their premises. They also fear the risk of hardware failure, or of someone accidentally erasing or corrupting their high-value data, outside their control. Thus private clouds are much in demand, yet most existing solutions require the data to be moved out of the organization's premises. For an on-demand, scalable, distributed, fast-processing storage solution in a private cloud, very few options, such as the ParaScale software, are available. The open source-based solution proposed here, however, provides a cost advantage over commercial software, and it can be customized to client-specific requirements with minimal effort and cost.
To handle the huge volume of data generated by applications in an organization, a scalable storage infrastructure is required. This article described the architecture of a cloud storage and processing solution built from available open source options for a private cloud environment. It proposes Eucalyptus for cloud infrastructure management, Hadoop for distributed file storage and parallel processing, and Lucene/Hive for search. A web-based console is proposed to proactively and quickly monitor and manage these systems. This on-demand storage system will give IT administrators the capability to rapidly bring up hundreds of servers, run parallel computations on them, shut down the instances when no longer required, and monitor and proactively manage their cloud environment, all with minimal effort and at a low cost.
References
- Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, and Dmitrii Zagorodnov, "The Eucalyptus Open-source Cloud-computing System," in Proceedings of the 9th IEEE International Symposium on Cluster Computing and the Grid, Shanghai, China.
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," SOSP'03, October 19-22, 2003, Bolton Landing, New York, USA.
- Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004.
- Mark H. Butler and James Rutherford, "Distributed Lucene: A distributed free text index for Hadoop," HP Laboratories, June 7, 2008.