An Open Source-Based Cloud Data Storage and Processing Solution
Why is an on-demand storage solution required?

Applications are increasingly being made available over the Internet. Many of them have a large user base that produces a huge volume of data: content in a community portal, emails in a web-based email system, or call log files generated at call centers. Because large amounts of data are added every minute and historical data must be retained for legal, reference, data warehousing, and analytics requirements, the size of these systems' data keeps growing, requiring a huge storage and processing infrastructure that is costly to procure and maintain. Such large data sets also raise other challenges: How do you store the data reliably and economically? How do you process it efficiently? How do you provide search over it?

Traditional storage solutions, as shown in Figure 1, use an n-tiered architecture with SAN or NAS for storage, relational databases (RDBMS) for search and retrieval, and separate compute servers for processing. This architecture, however, requires expensive hardware and a long lead time to scale.

Figure 1: Traditional storage solution architecture with NAS/SAN and RDBMS

On-Demand Data Storage and Processing Solution
Cloud computing offers on-demand scalability of resources that can be leveraged to provide scalable data storage. To manage the resources and the data stored in the cloud efficiently and effectively, a cloud data storage and processing solution is presented here. Our solution uses Eucalyptus, an open source cloud platform, to manage the underlying storage infrastructure. Specialized open source projects such as Hadoop and Lucene offer low-cost, scalable alternatives for applications that need to process huge amounts of data.

The proposed solution uses Hadoop, an open source framework that provides replication, a distributed file system, and the capability to run large data processing applications on clusters of commodity hardware. These layers give our solution a foundation for meeting the QoS requirements of reliability and performance on huge data volumes and on systems whose individual nodes are prone to failure.

Next, efficiently managing the large volume of data distributed across multiple machines in the cloud poses a great challenge. It is also necessary to know exactly when and how much capacity needs to be added or removed, and to handle the complexity involved in provisioning new infrastructure.

To enable easy, need-based management of the cloud storage environment by the user, this solution has a web interface with capabilities such as monitoring storage consumption and availability and quickly adding or removing storage when required. To optimize resource usage, alerting mechanisms send messages when a lot of space is lying unused and/or when more space will be required based on the results of forecasting models (a simple threshold rule of this kind is sketched after the list below). Thus, an on-demand storage solution will provide the following capabilities:

  • Increase/decrease the storage as and when required
  • Faster access to the distributed files
  • Fault-tolerance
  • Proactive alerts for increasing/decreasing the storage
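
As a concrete illustration of the proactive alerts above, the following is a minimal Java sketch of a threshold rule the console could apply. The two utilization thresholds, the class name, and the method name are assumptions for illustration, not part of the described solution.

    public class StorageAlerts {
        // Assumed thresholds; the real solution would derive these from
        // forecasting models and administrator policy.
        static final double NEAR_FULL = 0.85;   // warn: add storage soon
        static final double UNDER_USED = 0.30;  // warn: capacity lying unused

        // usedBytes and capacityBytes would come from the Hadoop monitoring component.
        public static String check(long usedBytes, long capacityBytes) {
            double utilization = (double) usedBytes / capacityBytes;
            if (utilization >= NEAR_FULL) {
                return "ALERT: storage nearly full - consider adding data nodes";
            }
            if (utilization <= UNDER_USED) {
                return "ALERT: large amount of unused storage - consider removing nodes";
            }
            return "OK: utilization at " + Math.round(utilization * 100) + "%";
        }
    }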

Traditional vs Cloud Computing-Based On-Demand Storage Solutions
Table 1 provides a summary of the limitations of traditional solutions and how these new solutions address them.

Table 1: Comparison of traditional and Cloud computing solutions for data processing and storage

Architecture of Cloud Data Storage and Processing Solution
The solution architecture for a cloud-based data processing and storage solution is shown in Figure 2.

Figure 2: Cloud based storage and data processing solution architecture

The solution architecture consists of the following components:

Specialized Cloud Infrastructure
The foundation layer of the solution consists of the cloud infrastructure that virtualizes the underlying hardware and provides components on demand. The solution leverages Eucalyptus, an open source cloud computing framework, to provide the base cloud infrastructure [7]. Eucalyptus uses the Xen virtualization platform to virtualize the physical hardware. It provides on-demand scalability by enabling the addition, instantiation, and management of nodes in the cluster. These nodes can contain not only a virtual machine with an operating system but also a complete software stack, enabling the creation of virtual appliances that can be instantiated and shut down on demand. In addition, a cluster management module is included to automate and ease the management of these instances.
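
Because Eucalyptus exposes an EC2-compatible API, node instances can also be started programmatically with any EC2 client library. The snippet below is a hedged sketch using the AWS SDK for Java; the endpoint URL, credentials, image ID (emi-...), and instance type are placeholders rather than values from the described deployment.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.RunInstancesRequest;
    import com.amazonaws.services.ec2.model.RunInstancesResult;

    public class ProvisionNode {
        public static void main(String[] args) {
            AmazonEC2Client ec2 = new AmazonEC2Client(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
            // Point the EC2 client at the Eucalyptus cloud controller instead of AWS.
            ec2.setEndpoint("http://cloud-controller:8773/services/Eucalyptus");

            // Launch one instance of a pre-built image (for example, a bundled
            // HDFS data node appliance).
            RunInstancesRequest request = new RunInstancesRequest()
                    .withImageId("emi-12345678")
                    .withInstanceType("m1.small")
                    .withMinCount(1)
                    .withMaxCount(1);
            RunInstancesResult result = ec2.runInstances(request);
            System.out.println("Started: "
                    + result.getReservation().getInstances().get(0).getInstanceId());
        }
    }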

Figure 3: Eucalyptus Cloud infrastructure architecture

Distributed File System
The next layer in the solution is a distributed file system (DFS) that provides a scalable, fault-tolerant file system to leverage the storage capacity available on multiple machines. For this, the solution uses the Apache Hadoop Distributed File System (HDFS), as it provides reliable data storage through replication [8].
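
As an example of how an application interacts with this layer, the sketch below writes and then reads a file through the Hadoop FileSystem API; behind these calls, HDFS splits the file into blocks and replicates them across data nodes. The namenode URI and file path are placeholders for illustration.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Connect to the cluster's name node (placeholder host and port).
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode-host:9000"), new Configuration());

            // Write a file; its blocks are replicated across data nodes
            // according to the configured replication factor.
            Path file = new Path("/data/calllogs/sample.log");
            FSDataOutputStream out = fs.create(file);
            out.writeBytes("2009-06-01 10:15:00 call-id=123 duration=45s\n");
            out.close();

            // Read the file back from whichever data nodes hold its blocks.
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
            System.out.println(in.readLine());
            in.close();
            fs.close();
        }
    }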

Figure 4: Hadoop based distributed file system

The on-demand solution bundles HDFS data nodes as Eucalyptus images and keeps the Hadoop name node on an isolated machine. Whenever there is a new storage requirement, data nodes are instantiated on new hardware from these images and added to the cluster.
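
The following is a hedged sketch of how the solution could confirm that newly instantiated data nodes have joined the cluster and report per-node capacity, using the HDFS client API. Method names follow the Hadoop 0.20-era API and may differ in other versions; the namenode URI is a placeholder.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class ClusterReport {
        public static void main(String[] args) throws Exception {
            DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(
                    URI.create("hdfs://namenode-host:9000"), new Configuration());

            // List every data node the name node currently knows about,
            // with its used and remaining capacity in gigabytes.
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.printf("%s used=%dGB remaining=%dGB%n",
                        node.getName(),
                        node.getDfsUsed() / (1L << 30),
                        node.getRemaining() / (1L << 30));
            }
            dfs.close();
        }
    }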

Data Processing Module
The next component of the solution is a highly scalable data processing engine that is based on a parallel processing model and is co-located with the storage nodes. To implement this, Hadoop MapReduce [9] is leveraged, as it partitions the processing into tasks and executes them in parallel across several nodes, reducing the overall processing time.
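
To illustrate the programming model, the following is the canonical word-count job written against the Hadoop MapReduce API; it is a generic example rather than the solution's actual processing job. The map tasks run in parallel on the nodes holding the input blocks, and the reduce tasks aggregate their partial results.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit (word, 1) for every token in the input line.
                for (String token : value.toString().split("\\s+")) {
                    if (token.length() > 0) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                // Sum the partial counts produced by the mappers.
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }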

Figure 5: Hadoop Map-Reduce based Data Processing module

In the proposed solution, the Hadoop JobTracker and TaskTracker nodes are bundled as Eucalyptus images. This allows new processing nodes to be instantiated and added to the cluster on demand.

Distributed Search Engine
Another important component of the solution is a distributed search engine that enables search operations on the data stored in a distributed file system. There are two implementation options available: Hive and Lucene.

With the Lucene implementation, MapReduce tasks are used to build the index shards [10]. The shards are distributed across multiple Lucene search nodes to enable efficient distributed search.
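
As an illustration of the Lucene option, the sketch below (Lucene 3.x-era API) runs a term query against one index shard on a search node's local disk; in the distributed setup, each search node would execute the same query against its own shard and the results would be merged. The shard path and field names are placeholders.

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ShardSearch {
        public static void main(String[] args) throws Exception {
            // Open the shard that was built by the MapReduce indexing job and
            // copied to this search node's local disk.
            IndexSearcher searcher = new IndexSearcher(
                    IndexReader.open(FSDirectory.open(new File("/index/shard-0"))));

            // Look up documents whose "content" field contains the term "error".
            TopDocs hits = searcher.search(new TermQuery(new Term("content", "error")), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
            searcher.close();
        }
    }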

In the case of Hive, the query engine is implemented using MapReduce tasks for distributed data processing. Hive offers a SQL-like interface and converts search requests into MapReduce tasks that process the search operation in parallel to retrieve the results efficiently.
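
As an illustration of the Hive option, the sketch below submits a HiveQL query over JDBC using the early HiveServer JDBC driver; the server host, table name, and column names are placeholders. Hive compiles the query into MapReduce jobs that scan the underlying HDFS files in parallel.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive://hive-server:10000/default", "", "");
            Statement stmt = conn.createStatement();

            // Hive rewrites this SQL-like query into MapReduce jobs that scan
            // the call_logs table stored in HDFS in parallel.
            ResultSet rs = stmt.executeQuery(
                    "SELECT caller_id, COUNT(*) FROM call_logs GROUP BY caller_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            rs.close();
            conn.close();
        }
    }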

Management Console
The top layer is a management console that provides a web-based user interface to:

  • Provision infrastructure - for quickly and easily adding new nodes to the cluster. The console adds new node instances and removes unused node instances running on Eucalyptus to manage storage capacity on demand.

Figure 6: Provision infrastructure of cloud management solution

  • Monitor runtime resource consumption and availability, enabling on-time warnings and accurate capacity management.

Figure 7: Infrastructure monitoring using Cloud management solution

The web console interacts with a Hadoop monitoring component to retrieve the usage and availability information and displays it graphically in a single monitoring console.

  • Forecast future storage requirements and automatically initialize new data nodes. The management console includes a forecasting module that uses historical volume information and statistical forecasting models to project future storage requirements. Based on the forecast, new data nodes can be added proactively, before any capacity-full alerts arrive from the monitoring system (an illustrative trend-line sketch follows).
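
The trend-line sketch referenced above: an illustrative least-squares fit over recent daily usage samples that estimates how many days remain until capacity is exhausted. This is a generic statistical example, not the authors' specific forecasting model.

    public class StorageForecast {

        // usage[i] = bytes used on day i (oldest first); capacity = total bytes.
        public static double daysUntilFull(double[] usage, double capacity) {
            int n = usage.length;
            double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
            for (int i = 0; i < n; i++) {
                sumX += i;
                sumY += usage[i];
                sumXY += i * usage[i];
                sumXX += (double) i * i;
            }
            // Least-squares fit: usage ~ intercept + slope * day
            double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
            double intercept = (sumY - slope * sumX) / n;
            if (slope <= 0) {
                return Double.POSITIVE_INFINITY;  // usage flat or shrinking
            }
            // Day on which the fitted line crosses capacity, relative to today.
            return (capacity - intercept) / slope - (n - 1);
        }
    }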

Applicability of Cloud-Based On-Demand Storage Solution
This solution architecture is efficient when processing and search are needed in addition to storage, a combination that is otherwise less efficient to implement with traditional approaches. It is intended for data that is written once and read many times; it may not be efficient for applications that need low-latency retrieval or frequent updates to the data, requirements that a traditional RDBMS-based architecture addresses better. In summary, the solution fits scenarios where:

  • Data processing and search are needed along with storage
  • Low-latency retrieval is not required
  • Data is written once and read many times
  • Updates to the data are infrequent

Related Work
Cloud computing has evolved quickly, and many commercial storage solutions are available in the market. Permabit Cloud Storage [1] is a highly scalable, available, and secure storage platform designed for service providers. The Nirvanix Storage Delivery Network (SDN) [2] is a fully managed storage solution with unlimited, on-demand scalability; its standards-based integration APIs allow applications to quickly read and write to the cloud. The Mezeo Cloud Storage Platform [3] is another highly secure platform, offering encryption in storage, which enables files stored in its cloud to be securely accessed from a variety of mobile devices or web browsers without any Virtual Private Network (VPN) setup.

Zetta Enterprise Cloud Storage [4] supports all unstructured data types and is backed by industry-leading data integrity and security. EMC Atmos onLine [5] is an Internet-delivered cloud storage service that provides Cloud Optimized Storage (COS) capabilities with reliable SLAs and secure access; it enables customers to move data from on-premise to off-premise using policies. The ParaScale Cloud Storage (PCS) software [6] does not require custom or dedicated hardware and can leverage existing IP network interconnections. It aggregates disk storage on multiple standard Linux servers to present one or more logical namespaces and enables file access via standard file-access protocols (NFS, HTTP, WebDAV, and FTP). Applications and clients don't have to be modified or recompiled to use PCS.

Because customers have traditionally stored data in-house, they find it difficult to put their business at risk by moving their data off their premises. They also fear hardware failures, or someone accidentally erasing or corrupting their high-value data, outside their control. Thus private clouds are much in demand, yet most of the existing solutions require the data to be moved out of the organization's premises. Very few options, such as the ParaScale software, are available for building an on-demand, scalable, distributed, fast-processing storage solution in a private cloud. The open source-based solution proposed here provides a cost advantage over commercial software, and it can be customized to client-specific requirements with minimal effort and cost.

Summary
To handle the huge volume of data generated by applications in an organization, a scalable storage infrastructure is required. This article described the architecture of a cloud storage and processing solution built from available open source options for a private cloud environment. It proposes Eucalyptus for cloud infrastructure management, Hadoop (HDFS and MapReduce) for distributed file storage and parallel processing, and Lucene or Hive for search. A web-based console is proposed to monitor and manage these systems quickly and proactively. This on-demand storage system gives IT administrators the ability to rapidly bring up hundreds of servers, run parallel computations on them, shut down the instances when no longer required, and monitor and proactively manage their cloud environment, all with minimal effort and at low cost.

References:

  1. http://www.permabit.com/pressreleases/cloud-storage-solution-service-providers.asp
  2. http://www.nirvanix.com/solutions/service-providers.aspx
  3. http://www.hostreview.com/news/press/090701SparkCommunications2.html
  4. http://www.reuters.com/article/pressRelease/idUS30718+06-Apr-2009+BW20090406
  5. http://www.emc.com/about/news/press/2009/20090518-02.htm
  6. http://www.parascale.com/index.php/library/parascale-cloud-storage/reference-papers
  7. The Eucalyptus Open-source Cloud-computing System, Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, Dmitrii Zagorodnov, in Proceedings of 9th IEEE International Symposium on Cluster Computing and the Grid, Shanghai, China.
  8. The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, SOSP'03, October 19-22, 2003, Bolton Landing, New York, USA.
  9. MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI 2004
  10. Distributed Lucene: A distributed free text index for Hadoop, Mark H. Butler, James Rutherford, HP Laboratories, June 7, 2008.
About Shyam Kumar Doddavula
Shyam Kumar Doddavula works as a Principal Technology Architect at the Cloud Computing Center of Excellence Group at Infosys Technologies Ltd. He has an MS in computer science from Texas Tech University and over 13 years of experience in enterprise application architecture and development.

About Nidhi Tiwari
Nidhi Tiwari is a Senior Technical Architect with SETLabs, Infosys Technologies. She has over 10 years of experience in varied software technologies and has been working in performance engineering and cloud computing for 6 years. Her research interests include adoption of cloud computing and cloud databases, along with performance modeling. She has authored papers for international conferences and journals and holds a granted patent.
