Getting Automation Right with Big Data
Things To Remember While Automating With Big Data
By: Harry Trott
Feb. 1, 2017 05:00 PM
Why exactly do you need big data for your enterprise projects? Many industry observers note that although a lot of enterprises claim their big data projects are aimed at "deriving insights" that replace human intuition with data-driven alternatives, in reality the objective appears to be automation. They point out that the role of data scientists at many organizations has little to do with replacing human intuition. Instead, it is about augmenting human experience by making tasks easier, faster and more efficient.
But automating big data processing is easier said than done, and the biggest problem is that big data is, well, big. This means there is a lot of chaos and inconsistency in the available data. As a result, creating a single MapReduce script that can ingest all your data and instantly produce results is just wishful thinking. In reality, big data automation can mean writing dozens of scripts to process different input sources and aligning them in order to consolidate all this data and produce the required output.
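As a rough illustration of the "one script per input source" idea, here is a minimal Python sketch. The two sample sources, their field names (`amount_usd`, `total_cents`) and the shared record shape are assumptions made up for this example; real pipelines would have many more sources and far messier data.

```python
import csv
import io
import json

# Hypothetical raw inputs from two differently shaped sources.
CSV_ORDERS = "order_id,amount_usd\n1,10.50\n2,4.25\n"
JSON_ORDERS = '[{"id": 3, "total_cents": 799}]'


def parse_csv_orders(text):
    """Normalize CSV rows into the shared {order_id, amount} shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"order_id": int(row["order_id"]), "amount": float(row["amount_usd"])}
        for row in reader
    ]


def parse_json_orders(text):
    """Normalize JSON records (amounts in cents) into the same shape."""
    return [
        {"order_id": rec["id"], "amount": rec["total_cents"] / 100}
        for rec in json.loads(text)
    ]


def consolidate(sources):
    """Run each (parser, raw_text) pair and merge the normalized records."""
    merged = []
    for parser, raw in sources:
        merged.extend(parser(raw))
    return sorted(merged, key=lambda r: r["order_id"])


records = consolidate([
    (parse_csv_orders, CSV_ORDERS),
    (parse_json_orders, JSON_ORDERS),
])
total = sum(r["amount"] for r in records)
```

Each source gets its own small parser, and only the consolidation step knows about the common record shape, so adding a new source means adding one more parser rather than rewriting the whole job.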
The first thing to get right when automating big data is the architecture. One of the most popular ways to set up big data automation is through data lakes. Put simply, a data lake is a large storage repository that holds raw data until it is needed for processing. Unlike traditional hierarchical data warehouses, data lakes store raw data in a flat architecture. One key advantage is that data lakes can store all sorts of data (structured, semi-structured and unstructured) and are thus well suited to big data automation.
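To make the flat, store-everything-raw idea concrete, here is a toy in-memory sketch in Python. The `TinyDataLake` class, its method names and the metadata tags are all hypothetical and stand in for a real object store; the point is only that payloads are kept untouched in a flat namespace and given structure when they are read (an approach often called "schema on read").

```python
import json


class TinyDataLake:
    """Toy in-memory stand-in for a data lake: a flat store of raw objects."""

    def __init__(self):
        # Flat namespace: object id -> (raw bytes, metadata tags).
        self._objects = {}

    def put(self, object_id, raw, **tags):
        # Store the payload untouched; no schema is enforced at write time.
        self._objects[object_id] = (raw, tags)

    def find(self, **tags):
        # Return the ids whose metadata matches every requested tag.
        return sorted(
            oid for oid, (_, meta) in self._objects.items()
            if all(meta.get(k) == v for k, v in tags.items())
        )

    def read(self, object_id, parser):
        # Structure is applied only when the data is read.
        raw, _ = self._objects[object_id]
        return parser(raw)


lake = TinyDataLake()
lake.put("evt-1", b'{"user": "a", "clicks": 3}', fmt="json", source="web")
lake.put("log-1", b"2017-02-01 ERROR disk full", fmt="text", source="syslog")
lake.put("tbl-1", b"id,name\n1,Ann\n", fmt="csv", source="crm")

json_ids = lake.find(fmt="json")
event = lake.read("evt-1", lambda raw: json.loads(raw.decode("utf-8")))
```

Structured (CSV), semi-structured (JSON) and unstructured (log text) payloads sit side by side, and the caller chooses the parser at read time.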
The next thing to get right is agility. Traditional data sources are structured, and data warehouse technology ensures seamless and efficient processing of that data. With big data, though, that rigid structure can be a disadvantage. Data scientists need to build agile systems that can be easily configured and reworked in order to navigate the multitude of data sources quickly and build an automation system that works.
While challenges such as those mentioned above can be tackled by choosing the right technologies, other problems with big data need to be dealt with at a more granular level. One example is manipulative algorithms that can produce vastly different outputs; rogue or incompetent developers can likewise cause automation issues that are extremely difficult to track down and fix. Another issue is the misinterpretation of data. An automated big data system could magnify minor discrepancies in the data and feed them into a loop that leads to grossly misleading outputs.
These issues cannot be wished away, and the only way to get automation right in such cases is to diligently monitor and evaluate the code and its outputs. This makes it possible to identify discrepancies in the algorithm and outputs before they blow up. From a business perspective, this means additional resources to test and validate the code and output at each stage of the development and operational cycle. This could effectively reduce the cost advantage of big data automation. But it is a necessary expense if businesses want to establish a sustainable big data automation product that also works.
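One way to picture validating the output at each stage is a small check that runs after every step of a pipeline. The following Python sketch is an assumption-laden illustration: the function names, the `amount` field and the thresholds are invented for the example, and real checks would be tailored to the data in question.

```python
def check_stage(name, records, min_rows, amount_range):
    """Return a list of human-readable problems found in one stage's output."""
    problems = []
    if len(records) < min_rows:
        problems.append(
            f"{name}: expected at least {min_rows} rows, got {len(records)}"
        )
    low, high = amount_range
    for i, rec in enumerate(records):
        amount = rec.get("amount")
        if amount is None:
            problems.append(f"{name}: row {i} is missing 'amount'")
        elif not low <= amount <= high:
            problems.append(
                f"{name}: row {i} amount {amount} outside [{low}, {high}]"
            )
    return problems


def run_pipeline(records, stages, **limits):
    """Apply each stage in order, stopping at the first one that fails its checks."""
    for name, stage in stages:
        records = stage(records)
        problems = check_stage(name, records, **limits)
        if problems:
            return records, problems
    return records, []


raw = [{"amount": 10.0}, {"amount": 12.0}, {"amount": None}]
stages = [
    ("drop_missing", lambda rs: [r for r in rs if r["amount"] is not None]),
    # Deliberate bug for the demo: the multiplier should be 1.1, not 100.
    ("apply_tax", lambda rs: [{"amount": r["amount"] * 100} for r in rs]),
]
result, problems = run_pipeline(raw, stages, min_rows=1, amount_range=(0, 500))
```

Here the first stage passes its checks, while the buggy second stage is stopped as soon as its amounts leave the expected range, before the bad values can flow into later steps.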