For Cutting Edge Big Data Projects: On Demand, Metadata Driven Hadoop
What is “Metadata Driven” or “Dynamic” ETL? It is DTL. It’s a cool, compelling concept: the ability to generate MapReduce jobs on the fly, based on metadata stored in a repository of tables. As those of you familiar with my last blog, Effortless EDW Offload to Hadoop with DMX-h, will know, you may have hundreds of jobs and many lines of code to run every day. I’m sure that sounds familiar. More often than not, though, these jobs are highly dynamic and depend on daily, hourly, or ad hoc parameters that change on a whim due to business needs! So how do you cope with this level of flux when your critical data is on Hadoop?
Many organizations embarking on cutting-edge big data projects have explicitly asked for this flexibility, which is what inspired me to see how DMX-h fits the bill. It’s surprisingly trivial with DMX-h. It provides a 4GL-like language called Data Transformation Language (DTL) that can import any or all dynamic parameters and can be generated completely on the fly! Sounds pretty nifty, right? It is!
Why do MapReduce or Hadoop jobs need this? Because of the late-binding nature of Hadoop. Nobody wants pre-configured jobs on Hadoop when it was designed to ingest and process all kinds of data formats and data sets, not to mention to answer all kinds of ad hoc questions never asked before!
Let me give you an example. What if you need to change a source record format to a target format and roll up cell phone usage by account number? You could very easily do this dynamically by generating something like “source layout ACC_NO, USAGE_MINUTES target layout ACC_NO, TOTAL_USAGE group by ACC_NO.” Can you do this with SQL? Sure you can, but on Hadoop you need a robust, scalable, and efficient transformation engine to offload your conventional SQL jobs.
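To make the idea concrete, here is a minimal sketch in plain Python. It is not DTL and does not use DMX-h or MapReduce; it only illustrates the principle that the layouts and the grouping key come from metadata rather than from hard-coded job logic. The metadata structure, the `run_rollup` function, and the sample records are all hypothetical names invented for this illustration.

```python
from collections import defaultdict

# Hypothetical metadata row, as it might be read from a repository table.
JOB_METADATA = {
    "source_layout": ["ACC_NO", "USAGE_MINUTES"],
    "target_layout": ["ACC_NO", "TOTAL_USAGE"],
    "group_by": "ACC_NO",
    # target field -> (aggregate function, source field)
    "aggregate": {"TOTAL_USAGE": ("sum", "USAGE_MINUTES")},
}


def run_rollup(metadata, records):
    """Roll up records as described by the metadata, not by fixed code."""
    key_field = metadata["group_by"]
    totals = defaultdict(lambda: defaultdict(int))
    for raw in records:
        # Bind the raw tuple to field names only at run time (late binding).
        row = dict(zip(metadata["source_layout"], raw))
        for target_field, (func, source_field) in metadata["aggregate"].items():
            if func != "sum":
                raise ValueError(f"unsupported aggregate: {func}")
            totals[row[key_field]][target_field] += row[source_field]
    result = []
    for key in sorted(totals):
        out = {key_field: key, **totals[key]}
        # Emit fields in the order the target layout asks for.
        result.append(tuple(out[name] for name in metadata["target_layout"]))
    return result


records = [("A100", 30), ("A200", 12), ("A100", 45)]
print(run_rollup(JOB_METADATA, records))  # [('A100', 75), ('A200', 12)]
```

Changing the roll-up, say to group by a different field, would mean editing the metadata row only; the function itself stays untouched.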
I have received innumerable data points on the inefficiencies and inadequacies of Hive, which are probably worth saving for a future blog. Please also see my blog on migrating PL/SQL to Hadoop using DMX-h. More importantly, you can leverage your existing ETL and data warehouse (DWH) expertise to implement DMX-h on Hadoop.
To conclude: metadata-driven, runtime Hadoop is the way to go. It is very flexible and dynamic, which is extremely important when different questions can be asked every day and the parameters are stored in systems outside of Hadoop, most likely an enterprise scheduler like AutoSys or Control-M. It’s an amazingly simple way to adapt to changing requirements in real time without having to pull in a development team every time they change.
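As a rough illustration of parameters arriving from outside Hadoop, the sketch below fills in a job specification at run time from values a scheduler might pass in as environment variables. It is plain Python, not DTL, and it assumes nothing about how AutoSys or Control-M actually hand over parameters; the variable names (`JOB_SOURCE_FIELDS`, `JOB_TARGET_FIELDS`, `JOB_GROUP_KEY`) and the `render_job` function are hypothetical.

```python
import os
from string import Template

# Job specification with placeholders, in the informal style used earlier.
JOB_TEMPLATE = Template(
    "source layout $source_fields "
    "target layout $target_fields "
    "group by $group_key"
)


def render_job(env):
    """Build the job text from scheduler-supplied parameters.

    `env` is any mapping; the keys below are hypothetical names.
    A missing key raises KeyError, so a misconfigured run fails early.
    """
    return JOB_TEMPLATE.substitute(
        source_fields=env["JOB_SOURCE_FIELDS"],
        target_fields=env["JOB_TARGET_FIELDS"],
        group_key=env["JOB_GROUP_KEY"],
    )


if __name__ == "__main__":
    # In a real run the scheduler would set these before launching the job.
    os.environ.setdefault("JOB_SOURCE_FIELDS", "ACC_NO, USAGE_MINUTES")
    os.environ.setdefault("JOB_TARGET_FIELDS", "ACC_NO, TOTAL_USAGE")
    os.environ.setdefault("JOB_GROUP_KEY", "ACC_NO")
    print(render_job(os.environ))
```

Because the job text is rendered from parameters each time it runs, a change in the scheduler's settings changes the job without a code release.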