What is Data Observability?
Data observability, sometimes described as data monitoring, is the practice of continuously collecting metrics about your data. Typical metrics include the number of rows, the number of columns, and other properties of each dataset, along with metadata such as when the dataset was last updated.
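To make that concrete, here is a minimal Python sketch of the kind of snapshot an observability tool might collect for a single table. The SQLite database and the `orders` table are made-up stand-ins, not part of any particular tool.

```python
import sqlite3
from datetime import datetime, timezone

def collect_table_metrics(conn: sqlite3.Connection, table: str) -> dict:
    """Collect basic observability metrics for one table."""
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return {
        "table": table,
        "row_count": row_count,          # volume
        "column_count": len(columns),    # schema width
        "columns": columns,              # schema snapshot, useful for drift detection
        "collected_at": datetime.now(timezone.utc).isoformat(),  # freshness reference
    }

# Demo against an in-memory database with a made-up "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.execute("INSERT INTO orders (amount, updated_at) VALUES (19.99, '2024-01-01')")
print(collect_table_metrics(conn, "orders"))
# A scheduler (cron, Airflow, etc.) would run this periodically and compare
# successive snapshots to alert on unexpected changes in volume, schema, or freshness.
```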
The article Choosing a Data Quality Tool by Sarah Krasnik groups observability tools into several categories, each with representative tools:
- Auto-profiling data
  - Bigeye: unique in its wide range of ML-driven automatic threshold tests and alerts
  - Datafold: unique GitHub integration presenting a Data Diff between environments, with custom tests
  - Monte Carlo: unique in being the most enterprise-ready, with many data lake integrations
  - Lightup: unique self-hosted deployment option, appealing to highly regulated industries
  - Metaplane: unique in its high level of configuration for a hosted tool, with both out-of-the-box and custom tests
- Pipeline testing (see the sketch after this list)
  - Great Expectations: unique in its data-quality-specific community and automatic documentation of tests
  - Soda: unique in its self-hosted cloud option
  - dbt tests: unique integration with dbt Core and dbt Cloud builds (naturally), but not as versatile outside the dbt ecosystem
- Infrastructure monitoring
- A little bit of everything
  - Databand: unique integration with Airflow and specific Airflow metric monitoring
  - Unravel: unique support for other data sources such as Spark, data lakes, and NoSQL databases
- Data catalogs: helping observe existing data
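The pipeline-testing tools above all come down to assertions that run against each batch of data and fail the pipeline when they do not hold. Below is a minimal, library-agnostic sketch of that idea; the `order_id` and `amount` columns are made-up examples, and real tools such as Great Expectations or dbt tests add test catalogs, documentation, and orchestration on top.

```python
import pandas as pd

def run_pipeline_tests(df: pd.DataFrame) -> list[str]:
    """Return a list of failed checks; an empty list means the batch passes."""
    failures = []
    if df["order_id"].isnull().any():
        failures.append("order_id contains nulls")
    if df["order_id"].duplicated().any():
        failures.append("order_id is not unique")
    if (df["amount"] < 0).any():
        failures.append("amount contains negative values")
    if len(df) == 0:
        failures.append("batch is empty")
    return failures

# Demo with a made-up batch; in a real pipeline the DataFrame would come
# from the load or transform step.
batch = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, 7.5]})
failures = run_pipeline_tests(batch)
print(failures or "all checks passed")
# In a real pipeline you would raise an exception on failure so the orchestrator
# marks the run as failed instead of shipping bad data downstream.
```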
Related terms are Data Governance and Data Quality.