Effective metadata management processes can prevent analytics teams working in data lakes from creating inconsistencies that skew the results of big data analytics applications.

Deploying a data lake lets you absorb information from a wide variety of sources in a single repository. But that glut of available information creates challenges, especially when it comes to integrating and preparing data in a consistent way. Big data analytics applications can run aground as a result, a fate that metadata management tools are designed to help you avoid.

There’s no doubt that a data lake provides an efficient platform for capturing massive amounts of raw data and enabling data scientists and other downstream users to pursue different methods of preparing and analyzing the accumulated data. In addition, storage capacity can be easily expanded by adding more disk space to a data lake’s underlying Hadoop cluster; that in turn makes it easy to capture more and more data for processing and analysis.


Incoming data can be static or dynamic

The incoming data may be static, with an entire data set delivered at one time. Or it can be dynamic — data that’s continuously streamed and appended to existing files in the data lake. In either case, a broader collection of data clears the way for more advanced processing and analytics. Examples include indexing data for search uses, applying predictive models to data streams and implementing automated processes triggered in real time by predefined events or patterns recognized in a data set.
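To make the two ingestion modes concrete, here’s a minimal Python sketch contrasting a one-time static load with a streamed append that also fires an automated action when a predefined event appears. The lake path, file layout and event names are hypothetical, and local files stand in for what would normally be HDFS or object storage.

import json
from pathlib import Path

LAKE_DIR = Path("/data/lake/clickstream")  # hypothetical landing zone in the lake

def ingest_static(batch_file: str) -> None:
    # Static delivery: an entire data set arrives at once and lands as a new file.
    records = [json.loads(line) for line in Path(batch_file).read_text().splitlines()]
    target = LAKE_DIR / f"{Path(batch_file).stem}.jsonl"
    with open(target, "w") as out:
        for rec in records:
            out.write(json.dumps(rec) + "\n")

def ingest_streamed(record: dict) -> None:
    # Dynamic delivery: each streamed record is appended to an existing file,
    # and a predefined event pattern triggers an automated action in real time.
    with open(LAKE_DIR / "events.jsonl", "a") as out:
        out.write(json.dumps(record) + "\n")
    if record.get("event") == "checkout_error":  # pattern recognized in the stream
        print(f"triggering automated follow-up for user {record.get('user_id')}")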

However, all that varied information, and the various ways in which it’s used, puts data integration and preparation processes to the test. The main issue is what might be deemed an inverted approach to data ingestion, transformation and preparation.


One-size-fits-all view of data

In a traditional data warehouse environment, data sets are initially integrated and stored on a server that’s set up as a staging area, where they’re standardized, reorganized and configured into a predefined data model. Then the processed data is loaded into the data warehouse and made available for use. The work is typically managed by the IT department, which applies a monolithic set of data integration routines and transformations. This process, referred to as schema-on-write, limits all end users to working with the same interpretation of a particular data set.
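As a rough illustration of that schema-on-write flow, the Python sketch below uses the standard library’s sqlite3 module as a stand-in for the warehouse; the table layout, source field names and transformation rules are invented for the example. The key point is that the target model is fixed, and the same transformations are applied, before any data is loaded.

import sqlite3

# Schema-on-write: the target model is defined up front, so every consumer
# of the warehouse works with the same interpretation of the data.
DDL = """
CREATE TABLE IF NOT EXISTS orders (
    order_id   TEXT PRIMARY KEY,
    customer   TEXT NOT NULL,
    amount_usd REAL NOT NULL,   -- currency standardized in staging
    order_date TEXT NOT NULL    -- normalized to an ISO-8601 date
)
"""

def transform(raw: dict) -> tuple:
    # Staging-area transformation applied uniformly before the load.
    return (
        raw["id"].strip().upper(),
        raw["cust"].title(),
        round(float(raw["amt"]) * raw.get("fx_rate", 1.0), 2),
        raw["date"][:10],
    )

conn = sqlite3.connect("warehouse.db")
conn.execute(DDL)
raw_rows = [{"id": "a-100", "cust": "acme corp", "amt": "25.5", "date": "2024-03-01T09:15:00"}]
conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)",
                 [transform(r) for r in raw_rows])
conn.commit()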

That’s too much of a constraint for many data analysts; data science team members often have their own ideas about how they want to use data. As a result, each one is likely to want to assess data sets individually in their raw formats before devising a target data model for a specific analytics application and engineering the required data transformations.
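A brief sketch of that read-time approach, often called schema-on-read, might look like the following. The raw event file and field names are the same hypothetical ones used in the ingestion example above, and each function represents one analyst’s own interpretation of the untouched raw records.

import json
from pathlib import Path

RAW_EVENTS = Path("/data/lake/clickstream/events.jsonl")  # same raw files for every user

def load_raw() -> list[dict]:
    # No transformation on write; each analyst interprets the raw records at read time.
    with open(RAW_EVENTS) as f:
        return [json.loads(line) for line in f]

def sessions_per_user(events: list[dict]) -> dict:
    # One analyst's model: raw events rolled up into per-user activity counts.
    counts: dict = {}
    for e in events:
        counts[e["user_id"]] = counts.get(e["user_id"], 0) + 1
    return counts

def error_timeline(events: list[dict]) -> list:
    # Another analyst's model: the same raw events read as a time series of failures.
    return sorted(e["ts"] for e in events if e.get("event") == "checkout_error")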