Many IT organizations are simply overwhelmed by the sheer volume of data sets – small, medium, and large – that are stored in Hadoop, which although related, are not integrated. However, when done right, with an integrated data management framework, data lakes allow organizations to gain insights and discover relationships between data sets.
Data lakes created with an integrated data management framework eliminate the costly and cumbersome data preparation process of ETL that traditional EDW requires. Data is smoothly ingested into the data lake, where it is managed using metadata tags that help locate and connect the information when business users need it.
This approach frees analysts for the important task of finding value in the data without involving IT in every step of the process, thus conserving IT resources. Today, all IT departments are being mandated to do more with less. In such environments, well-governed and managed data lakes help organizations more effectively leverage all their data to derive business insight and make good decisions.
Data flowing in from everywhere
The main advantage of this architecture is that data can come into the data lake from anywhere, including online transaction processing (OLTP) or operational data store (ODS) systems, an EDW, logs or other machine data, or from cloud services. These source systems include many different formats, such as file data, database data, ETL, streaming data, and even data coming in through APIs.
The data is first loaded into a transient loading zone, where basic data quality checks are performed using MapReduce or Spark by leveraging the Hadoop cluster. Once the quality checks have been performed, the data is loaded into Hadoop in the raw data zone, and sensitive data can be redacted so it can be accessed without revealing personally identifiable information (PII), personal health information (PHI), payment card industry (PCI) information, or other kinds of sensitive or vulnerable data.
Data scientists and business analysts alike dip into this raw data zone for sets of data to discover. An organization can, if desired, perform standard data cleansing and data validation methods and place the data in the trusted zone. This trusted repository contains both master data and reference data.
Master data on the data lake ground
Master data is the basic data sets that have been cleansed and validated. For example, a healthcare organization may have master data sets that contain basic member information (names, addresses) and members’ additional attributes (dates of birth, social security numbers). An organization needs to ensure that this reference data kept in the trusted zone is up to date using change data capture (CDC) mechanisms.
Reference data, on the other hand, is considered the single source of truth for more complex, blended data sets. For example, that healthcare organization might have a reference data set that merges information from multiple source tables in the master data store, such as the member basic information and member additional attributes to create a single source of truth for member data. Anyone in the organization who needs member data can access this reference data and know they can depend on it.
So how are your experiences with data lakes? What is your opinion about them? We are looking forward to your feedback. Visit our linkedIn Group DWH & BI Experts.