Data lakes are all the rage right now, and they will continue to grow in 2017, but they’re much more than a dumping ground for unmodeled and unverified data of all types. Companies need to approach them strategically, with a solid understanding of current best practices, in order to keep management overhead to a minimum and give analytics tools the best shot at extracting meaningful insights.

Data lakes are a product of companies collecting more data than ever before and then demanding that technical teams turn that data into new insights. Data is persisted in its raw state so that the lake can absorb large volumes of diverse data, support quick ingestion, and leave analysts plenty of opportunities to attack it with new technology.
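
To make the “raw first” idea concrete, here is a minimal sketch of a landing step that copies incoming files into a date-partitioned path without parsing or remodeling them. The lake root, the helper function, and the directory layout are illustrative assumptions, not a standard.

```python
# Minimal "raw first" landing step: copy an incoming file into the lake
# exactly as received, under a date-partitioned path. No parsing, no schema.
import shutil
from datetime import date
from pathlib import Path

LAKE_ROOT = Path("/data/lake/raw")  # hypothetical landing-zone root


def land_raw_file(source_file: str, source_system: str) -> Path:
    """Copy a file into raw/<source_system>/<YYYY-MM-DD>/ untouched."""
    target_dir = LAKE_ROOT / source_system / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)  # raw bytes only; modeling happens later
    return target
```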


The data lake, in brief

Most data lakes are built using Hadoop, an open-source framework. Hadoop isn’t strictly required, but it is where most companies are headed. Russom praises Hadoop’s benefits, such as its ability to manage multi-structured and unstructured data and its relatively low cost compared to relational databases like MySQL. Companies are using data lakes for analytics, reporting, marketing, sales, and more. Best of all, a data lake helps companies get business value from both old and new data.
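
As a rough illustration of what that buys you, the sketch below assumes raw JSON event files have already landed under an HDFS path and uses PySpark to discover their structure at query time; the path and the event_type field are assumptions made for the example, not part of any particular company’s setup.

```python
# Schema-on-read with PySpark: raw JSON events already sitting in HDFS are
# structured at query time, not at ingestion time. Path and field names are
# illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Read never-modeled event data straight from the landing zone.
events = spark.read.json("hdfs:///data/lake/raw/web_events/")

# The structure is discovered now, on read, rather than enforced up front.
events.printSchema()

# A quick analytic pass over old and new data alike.
events.groupBy("event_type").count().orderBy("count", ascending=False).show()
```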


The lake is hungry—how to feed it right

If you simply launch a Hadoop-powered data lake and throw everything into it, without some smart management of the data going in, you’re going to end up with a “toxic dump,” according to Chuck Yarbrough, the senior director of solutions marketing and management at Pentaho. The challenge is that incoming data varies in volume, diversity, type, and whether it carries metadata at all. It’s a lot to think about, but getting ingestion right is essential if you want a variety of users to actually take advantage of the data.
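
One lightweight way to keep the lake from turning toxic is to record basic lineage for every file as it lands. The sketch below, using only the Python standard library, writes a small metadata sidecar next to each raw file; the field names and the sidecar convention are assumptions for illustration, not a vendor feature.

```python
# Record basic lineage for every file that lands: a small JSON sidecar noting
# where the data came from, when it arrived, and a checksum. Field names and
# the sidecar convention are assumptions for illustration.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(landed_file: Path, source_system: str, fmt: str) -> Path:
    """Write <file>.meta.json next to the raw file so it stays findable."""
    meta = {
        "source_system": source_system,
        "format": fmt,  # e.g. "json", "csv", "avro"
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": landed_file.stat().st_size,
        "sha256": hashlib.sha256(landed_file.read_bytes()).hexdigest(),
    }
    sidecar = landed_file.parent / (landed_file.name + ".meta.json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar
```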


Data can’t just be left to fester

Yarbrough says companies should rely on data integration tools and infrastructure to make that controlled, governed process possible. That includes metadata management and strong integration with any other data warehouses that exist within the enterprise. He also suggests developing metadata as data is ingested and doing more on-the-fly data modeling. Essentially, it’s less about abandoning older data management techniques and more about refining them for the data lake’s particular nuances.
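
A rough sketch of what “developing metadata as it’s ingested” can look like: sample each newly landed JSON-lines file, infer a crude field-to-type map, and append it to a simple catalog so later modeling has a starting point. The catalog path, the record layout, and the JSON-lines assumption are all illustrative, not a reference to Pentaho’s tooling.

```python
# Developing metadata as data is ingested: sample a newly landed JSON-lines
# file, infer a rough field -> observed-types map, and append it to a simple
# catalog. Catalog path and record layout are illustrative assumptions.
import json
from pathlib import Path

CATALOG = Path("/data/lake/catalog/datasets.jsonl")  # hypothetical catalog


def infer_fields(sample_file: Path, max_records: int = 100) -> dict:
    """Build a field -> sorted list of observed type names from a sample."""
    fields: dict = {}
    with sample_file.open() as fh:
        for i, line in enumerate(fh):
            if i >= max_records:
                break
            record = json.loads(line)
            for key, value in record.items():
                fields.setdefault(key, set()).add(type(value).__name__)
    return {key: sorted(types) for key, types in fields.items()}


def register_dataset(landed_file: Path, source_system: str) -> None:
    """Append a catalog entry describing the newly landed file."""
    entry = {
        "path": str(landed_file),
        "source_system": source_system,
        "fields": infer_fields(landed_file),
    }
    CATALOG.parent.mkdir(parents=True, exist_ok=True)
    with CATALOG.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")
```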