Data Glossary 🧠

Search

Search IconIcon to open search

What is DuckDB?

Last updated Nov 29, 2022 - Edit Source

DuckDB is an in-process SQL OLAP database management system. It has strong support for SQL. DuckDB is borrowing the SQLite shell implementation. Each database is a single file on disk. It’s analogous to “ SQLite for analytical (OLAP) workloads” (direct comparison on the SQLite vs DuckDB paper here), whereas SQLite is for OLTP ones. But it can handle vast amounts of data locally. It’s the smaller, lighter version of Apache Druid and other OLAP technologies.

It’s designed to work as an embedded library, eliminating the network latency you usually get when talking to a database.

Skip working with error-prone Excels or CSVs directly

With DuckDB, we no longer need to use plain files (CSV, Excel, Parquet). DuckDB supports schema, types, and SQL interface and is super fast.

# Use-Cases

Check out Rill Data, a BI tool that delivers sub-second interactivity because it’s backed by DuckDB (and  Druid for enterprise-grade cloud services).

MotherDuck is the managed service around DuckDB that lets you scale from a local DB to a cloud DB and hybrid—done by one of Google BigQuery creators or developers such as Jordan Tigani. Check his discussion on the Analytics Engineering Podcast about The Personal Data Warehouse. The stimulating conversation around connected WebAssembly, e.g., Is an application compiled to C code, which is super fast. E.g., Figma is using that, which would otherwise never work in a browser.

# Benchmarks

# Example Projects

# Tech and Papers

It ships as an amalgamation build - a single giant C++ file (SQLite is a single giant C file). And it’s also backed up by some strong computer science. It’s by the academic researchers behind MonetDB and includes implementations of a bunch of interesting papers:

From Why DuckDB:

DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a “vector”) is processed in one operation. This dramatically reduces overhead in traditional systems such as PostgreSQL, MySQL, or SQLite, which process each row sequentially. Vectorized query execution leads to far better performance in OLAP queries.

# Python API and Handling Data

From DuckDB Docs - Python API

# Fetching Data with SQL

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
# fetch as pandas data frame
df = con.execute("SELECT * FROM items").fetchdf()
print(df)
#        item   value  count
# 0     jeans    20.0      1
# 1    hammer    42.2      2
# 2    laptop  2000.0      1
# 3  chainsaw   500.0     10
# 4    iphone   300.0      2

# fetch as dictionary of numpy arrays
arr = con.execute("SELECT * FROM items").fetchnumpy()
print(arr)
# {'item': masked_array(data=['jeans', 'hammer', 'laptop', 'chainsaw', 'iphone'],
#              mask=[False, False, False, False, False],
#        fill_value='?',
#             dtype=object), 'value': masked_array(data=[20.0, 42.2, 2000.0, 500.0, 300.0],
#              mask=[False, False, False, False, False],
#        fill_value=1e+20), 'count': masked_array(data=[1, 2, 1, 10, 2],
#              mask=[False, False, False, False, False],
#        fill_value=999999,
#             dtype=int32)}

# Create Table

1
2
3
4
5
6
7
8
9
# create a table
con.execute("CREATE TABLE items(item VARCHAR, value DECIMAL(10,2), count INTEGER)")
# insert two items into the table
con.execute("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)")

# retrieve the items again
con.execute("SELECT * FROM items")
print(con.fetchall())
# [('jeans', 20.0, 1), ('hammer', 42.2, 2)]

# Insert Data

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# insert a row using prepared statements
con.execute("INSERT INTO items VALUES (?, ?, ?)", ['laptop', 2000, 1])

# insert several rows using prepared statements
con.executemany("INSERT INTO items VALUES (?, ?, ?)", [['chainsaw', 500, 10], ['iphone', 300, 2]] )

# query the database using a prepared statement
con.execute("SELECT item FROM items WHERE value > ?", [400])
print(con.fetchall())
# [('laptop',), ('chainsaw',)]