As you look ahead to BDA Amsterdam, your focus is inevitably on the shiny new things promising to revolutionise your Big Data capabilities. Maybe they’re even promising ‘Big Data in a box!’. Whatever it is, every new component adds complexity to your mesh of technology, while Big Data demands fluid interoperability and tight communication. This blog outlines why data management automation is required as the basis of any efficient Big Data strategy.
Why Does This Concern Me?
Can you query a variety of diverse data sources and send accurate reports to end users in hours? If so, stop reading now. Perhaps you are ready to add new technology and branch out in a new direction. If you are heading to a Big Data event without this level of control, ask yourself what you aim to achieve.
Before you add something new, you must remove the latency, complexity and technical debt produced by existing data technology that does not do what you promised the business it would. Does your data warehouse work as well as you envisaged at the start of that project? The truth is that your own bespoke collection of products, services and platforms is not stable enough to build upon at present. Unfortunately, this problem is exclusively yours, because today everyone’s stack is different.
What Should I Do?
There are two choices here: continue to manage your data stack manually and keep producing huge volumes of technical debt with documentation, patches, tickets and so on, or hand control over to a data management automation tool. The best solutions merely take the system you already have and improve it. Rather than adding a new tool that your stack must incorporate, work around and/or learn to communicate with, automation uses metadata to make your existing components communicate more effectively. What’s the point in buying a new boiler if your water pipes just need repairing?
I often see data ecosystems that have everything they need for rapid, accurate reporting, but they are riddled with manual handoffs and a backlog of manual coding requirements the developers simply cannot keep up with. All their time is spent fighting fires caused by an imperfect process. In every case, these problems can be permanently eradicated with data warehouse automation.
How Does It Work?
Automation eradicates manual handoffs, so no connections are left waiting for a button to be pressed or an approval to be granted. There’s no “automation” box in the process flow, just arrows between pre-existing sources and targets. As (big) data volume increases, end-to-end visibility and documentation show how data spans and moves between environments such as Hive and the data warehouse. Job processing capabilities execute user-defined processing routines on Hive, which are incorporated into broader job processing routines, allowing Hive tasks to be managed in conjunction with data warehouse processing. Data flow is now bidirectional between Hive and your data warehouse environments.
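To make the “Hive tasks managed in conjunction with data warehouse processing” idea concrete, here is a minimal sketch of dependency-aware job processing. The job names and dependency map are hypothetical; a real automation tool derives this ordering from its metadata repository rather than from hand-written dictionaries.

```python
def run_jobs(jobs, deps):
    """Return jobs in dependency order (a simple topological sort)."""
    done, order = set(), []

    def visit(job):
        if job in done:
            return
        for dep in deps.get(job, []):   # run prerequisites first
            visit(dep)
        done.add(job)
        order.append(job)

    for job in jobs:
        visit(job)
    return order

# Hypothetical jobs: the Hive staging load must finish before the
# warehouse transform that reads it, which must finish before reporting.
jobs = ["load_hive_staging", "transform_dw", "publish_reports"]
deps = {
    "transform_dw": ["load_hive_staging"],
    "publish_reports": ["transform_dw"],
}
print(run_jobs(jobs, deps))
# → ['load_hive_staging', 'transform_dw', 'publish_reports']
```

Because the ordering comes from declared dependencies rather than a human pressing buttons between steps, there is no manual handoff for data to wait on.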
How Does It Really Work?
The automation tool utilizes the existing data warehouse metadata repository on, for example, Oracle, Teradata or SQL Server. Its server component is installed on an edge node. The HiveServer2 (HS2) component is accessed by the tool’s UI and edge component to issue DDL/DML commands during development and operations. Sqoop (v1) is accessed from the edge node and allows data to flow between the data warehouse RDBMS and Hive.
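As a rough illustration of the Sqoop leg of that architecture, the sketch below builds a standard Sqoop 1 import command that pulls a table from the warehouse RDBMS into Hive. The JDBC URL, user and table names are hypothetical; the flags shown (`--connect`, `--hive-import`, `--hive-table`) are standard Sqoop 1 options, but a real deployment would run this from the edge node with its own connection details and credentials handling.

```python
import shlex

def sqoop_import_cmd(jdbc_url, user, table, hive_table):
    """Build a Sqoop 1 import command that lands an RDBMS table in Hive."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC URL of the warehouse RDBMS
        "--username", user,
        "--table", table,            # source table in Oracle/Teradata/SQL Server
        "--hive-import",             # load the result straight into Hive
        "--hive-table", hive_table,  # target Hive table
    ]
    return " ".join(shlex.quote(a) for a in args)

# Hypothetical connection details for illustration only.
print(sqoop_import_cmd("jdbc:oracle:thin:@dwhost:1521/DW",
                       "etl_user", "SALES", "staging.sales"))
```

The reverse direction (Hive back to the RDBMS) would use `sqoop export` in the same style, which is what makes the data flow bidirectional.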
The tool will then create and manage Hive DDL for tables, views and indexes defined within its metadata repository. It supports Hive DDL options such as partitioning, clustering, table properties, skewing, row format and file format. It generates native Hive SQL code by combining metadata with templates (which can be used as they are, modified or created from scratch) and takes advantage of database-specific features where available. Technical artefacts – ETL code, data structures, tuning options and scheduling components – can also be created.
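The metadata-plus-template approach can be sketched as follows. The metadata dictionary layout and function name here are invented for illustration; the point is that one table definition in the repository is enough to render valid Hive DDL, including the partitioning, clustering and file-format options mentioned above.

```python
def hive_create_table(meta):
    """Render Hive CREATE TABLE DDL from a metadata dict (simplified template)."""
    cols = ",\n  ".join(f"{c} {t}" for c, t in meta["columns"])
    ddl = f"CREATE TABLE {meta['name']} (\n  {cols}\n)"
    if meta.get("partitioned_by"):
        parts = ", ".join(f"{c} {t}" for c, t in meta["partitioned_by"])
        ddl += f"\nPARTITIONED BY ({parts})"
    if meta.get("clustered_by"):
        ddl += (f"\nCLUSTERED BY ({', '.join(meta['clustered_by'])}) "
                f"INTO {meta['buckets']} BUCKETS")
    if meta.get("stored_as"):
        ddl += f"\nSTORED AS {meta['stored_as']}"
    return ddl

# Hypothetical table definition, as it might sit in a metadata repository.
meta = {
    "name": "sales_fact",
    "columns": [("order_id", "BIGINT"), ("amount", "DECIMAL(10,2)")],
    "partitioned_by": [("sale_date", "STRING")],
    "clustered_by": ["order_id"], "buckets": 8,
    "stored_as": "ORC",
}
print(hive_create_table(meta))
```

Change the metadata and the DDL is regenerated; nobody hand-edits the generated SQL, which is what keeps the repository and the deployed objects in step.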
All logs generated during Hive processing are captured, so documentation is automatically created and maintained as changes are made, including full data lineage, track-back, track-forward and impact analysis across both enterprise and Big Data solutions.
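Track-forward impact analysis amounts to walking a lineage graph downstream from a changed object. The sketch below assumes a hypothetical lineage map (source column to the objects built from it); in practice the graph is derived from the captured metadata and logs, not written by hand.

```python
def downstream(lineage, node):
    """Track forward: every object impacted by a change to `node`."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for child in lineage.get(cur, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Hypothetical lineage: a source column feeds Hive staging, which feeds
# the warehouse fact table, which feeds a report.
lineage = {
    "src.orders.amount": ["hive.stg_orders.amount"],
    "hive.stg_orders.amount": ["dw.fact_sales.revenue"],
    "dw.fact_sales.revenue": ["report.monthly_revenue"],
}
print(sorted(downstream(lineage, "src.orders.amount")))
# → ['dw.fact_sales.revenue', 'hive.stg_orders.amount', 'report.monthly_revenue']
```

Track-back is the same traversal over the reversed graph, which is how a broken report can be traced to the source column that changed.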
Whatever new tool you want to add, modern demands for agile reporting dictate that any new technology must look to break down silos rather than add new ones. The expansion of your Big Data strategy can only function as desired with an efficient data warehouse at its core, one that communicates well both internally and with all other components in its ecosystem.
In the past this meant a constant headache for the developers tasked with designing and maintaining the data warehouse. Now we simply use automation and concentrate on the latest shiny things.
Charlie Coffey (EMEA Marketing Manager at WhereScape)