Data Glossary 🧠

What is a Data Lake?

Last updated Oct 25, 2022

Data engineering

A data lake is a storage system that holds vast amounts of structured and unstructured data as-is, without a specific purpose in mind. It can be built on multiple technologies, such as Hadoop, NoSQL, Amazon Simple Storage Service (S3), a relational database, or various combinations of these, and it can hold data in many different formats (e.g. Excel, CSV, text, logs).

According to the Hortonworks Data Lake Whitepaper, the data lake arose because new types of data needed to be captured and exploited by the enterprise. As this data became increasingly available, early adopters discovered that they could extract insight through new applications built to serve the business. The data lake supports the following capabilities:

To capture and store raw data at scale for a low cost
To store many types of data in the same repository
To transform data whose purpose may not yet be defined
To perform new types of data processing
To perform single-subject analytics based on particular use cases
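
The first three capabilities can be illustrated with a minimal Python sketch. It assumes a local temporary directory stands in for the lake's storage layer (in practice this would be something like HDFS or Amazon S3), and the helper names `land_raw` and `read_records`, the `raw/<source>/` layout, and the sample files are all hypothetical. Files are stored byte-for-byte, and structure is applied only when the data is read.

```python
from __future__ import annotations

import csv
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store a payload byte-for-byte under raw/<source>/ (hypothetical layout)."""
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, no validation: the data is kept as-is
    return target


def read_records(path: Path) -> list[dict]:
    """Apply structure only at read time, chosen by file extension."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".csv":
        return list(csv.DictReader(text.splitlines()))
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # Anything else (logs, free text) is returned line by line, uninterpreted.
    return [{"line": line} for line in text.splitlines()]


def demo() -> dict[str, list[dict]]:
    """Land three differently shaped files in one repository, then read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)  # stand-in for real lake storage
        files = [
            land_raw(lake, "crm", "customers.csv", b"id,name\n1,Ada\n2,Grace\n"),
            land_raw(lake, "web", "events.jsonl", b'{"user": 1, "action": "login"}\n'),
            land_raw(lake, "app", "server.log", b"2022-10-25 INFO started\n"),
        ]
        return {f.name: read_records(f) for f in files}


if __name__ == "__main__":
    for name, records in demo().items():
        print(name, records)
```

Because nothing is interpreted on write, a new consumer can later read the same raw files with a different parser or schema without re-ingesting them.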

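The term "data lake" is generally attributed to James Dixon, then CTO of Pentaho, who introduced it in 2010. The related Lakehouse concept was introduced by Databricks in the CIDR Paper in 2021. Read more on our Data Lake and Lakehouse Guide.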
