Data Glossary 🧠
What is DuckDB?
DuckDB is an in-process SQL OLAP database management system with rich SQL support. It borrows SQLite’s shell implementation, and each database is a single file on disk. It’s analogous to “SQLite for analytical (OLAP) workloads” (see the direct comparison in the SQLite vs DuckDB paper below), whereas SQLite targets OLTP workloads. Yet it can handle vast amounts of data locally: think of it as the smaller, lighter version of Apache Druid and other OLAP technologies.
It’s designed to work as an embedded library, eliminating the network latency you usually get when talking to a database.
Skip working directly with error-prone Excel or CSV files
With DuckDB, we no longer need to work with plain files (CSV, Excel, Parquet) by hand. DuckDB puts a schema, enforced types, and a full SQL interface on top of them, and it is super fast.
- Ultra-fast analytical use cases locally. E.g., the Taxi example spans 10 years and 1.5 billion rows of taxi data and still works on a laptop. See benchmarks below.
- It can be used as a zero-copy SQL wrapper (e.g., on top of Parquet files in S3).
- Bring the data to your users instead of paying the round-trip latency of REST calls: put the data inside the client. You can render at 60 frames per second because the data sits where the query runs.
- DuckDB on Kubernetes for a zero-copy layer to read S3 in the Data Lake! Inspired by this Tweet. The cheapest and fastest option to get started.
Check out Rill Data, a BI tool that delivers sub-second interactivity because it’s backed by DuckDB (and Druid for enterprise-grade cloud services).
MotherDuck is the managed service around DuckDB that lets you scale from a local database to a cloud (or hybrid) one. It was founded by Jordan Tigani, one of the creators of Google BigQuery. Check out his discussion on the Analytics Engineering Podcast about The Personal Data Warehouse, including a stimulating conversation about WebAssembly: applications compiled from C code run super fast in the browser; Figma, for example, uses this for an application that would otherwise never work in a browser.
- Fast Analysis With DuckDB + Pyarrow - 2022
- SQL on Python, part 1: The simplicity of DuckDB: How to use DuckDB to analyze 4.6+ million mentions of climate change on Reddit
- Tweet: Impressively fast, collaborative exploratory data analytics over a 20+ million row data set, hosted in the cloud with Drifting’s Jamsocket + Rill Data + DuckDB - 2022
- Taking DuckDB for a spin | Uwe’s Blog - 2019
- SQLite vs DuckDB paper:
- SQLite outperforms DuckDB on a write-transaction benchmark by 10x-500x on a powerful cloud server and by 2x-60x on a Raspberry Pi, for small to large databases.
- On analytical benchmarks using the SSB (Star Schema Benchmark), DuckDB outperforms SQLite by 30-50x at the highest margin and 3-8x at the lowest.
# Example Projects
- Modern Data Stack in a Box with DuckDB: A fast, free, and open-source Modern Data Stack (MDS) can now be fully deployed on your laptop or to a single machine using the combination of DuckDB, Meltano, dbt, and Apache Superset.
- Fully Featured Project Example (S3, dbt, Parquet, and Snowflake) reading from Hacker News, orchestrated with Dagster.
- Data Engineering in 2022: Exploring dbt with DuckDB: Step-by-step guide on using dbt and DuckDB.
- Build a poor man’s data lake from scratch with DuckDB: fully fleshed-out example by Dagster, also available as a video.
- Using Polars on results from DuckDB’s Arrow interface in Rust
# Tech and Papers
It ships as an amalgamation build - a single giant C++ file (SQLite is a single giant C file). It’s also backed by strong computer science: it comes from the academic researchers behind MonetDB and includes implementations of a number of interesting papers:
- Data Management for Data Science - Towards Embedded Analytics (CIDR 2020)
- DuckDB: an Embeddable Analytical Database (SIGMOD 2019 Demo)
From Why DuckDB:
DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a “vector”) is processed in one operation. This dramatically reduces the per-row overhead present in traditional systems such as PostgreSQL, MySQL, or SQLite, which process each row sequentially. Vectorized query execution leads to far better performance in OLAP queries.
# Python API and Handling Data
# Fetching Data with SQL
# Create Table
# Insert Data