Data Glossary 🧠
What is a Data Lake?
A Data Lake is a storage system that holds vast amounts of structured and unstructured data, stored as-is, without a specific purpose in mind. It can be built on multiple technologies such as Hadoop, NoSQL, Amazon Simple Storage Service (S3), a relational database, or combinations of these, and it can hold many different formats (e.g. Excel, CSV, text, logs).
According to the Hortonworks Data Lake Whitepaper, the data lake arose because new types of data needed to be captured and exploited by the enterprise. As this data became increasingly available, early adopters discovered that they could extract insight through new applications built to serve the business. The data lake supports the following capabilities:
- To capture and store raw data at scale for a low cost
- To store many types of data in the same repository
- To transform data whose purpose may not yet be defined
- To perform new types of data processing
- To perform single-subject analytics based on particular use cases
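The capabilities above can be illustrated with a minimal sketch, assuming a local directory stands in for the lake's storage layer (in practice this would more likely be object storage such as S3 or HDFS). The names `land_raw`, `read_records`, and `demo`, the `raw/<source>/` layout, and the sample files are all hypothetical and chosen only for illustration. The sketch stores files byte-for-byte without parsing them, keeps several formats in one repository, and decides how to interpret each file only when it is read.

```python
import csv
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store a payload byte-for-byte under raw/<source>/ without parsing it."""
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_records(path: Path) -> list[dict]:
    """Interpret a file only at read time, based on its extension."""
    suffix = path.suffix.lower()
    text = path.read_text(encoding="utf-8")
    if suffix == ".csv":
        return list(csv.DictReader(text.splitlines()))
    if suffix == ".json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    # Anything else (logs, plain text) is exposed line by line.
    return [{"line": line} for line in text.splitlines()]


def demo() -> dict[str, list[dict]]:
    """Land three differently shaped files, then read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)  # hypothetical stand-in for the lake's storage
        land_raw(lake, "crm", "customers.csv", b"id,name\n1,Ada\n2,Grace\n")
        land_raw(lake, "web", "events.json", b'[{"user": 1, "action": "login"}]')
        land_raw(lake, "app", "server.log", b"INFO started\nWARN slow query\n")
        return {
            p.name: read_records(p)
            for p in sorted((lake / "raw").rglob("*"))
            if p.is_file()
        }


if __name__ == "__main__":
    for name, records in demo().items():
        print(name, records)
```

Nothing about the files is validated or reshaped at write time; structure is imposed only inside `read_records`, which is one simple way to picture storing data before its purpose is defined.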
The related Lakehouse concept, which builds on the data lake, was introduced by Databricks in a CIDR paper in 2021. Read more on our Data Lake and Lakehouse Guide.