Data Glossary 🧠



What is a Data Contract?

Last updated Oct 20, 2022

Data Contracts are API-like agreements between the software/data engineers who own services and the data consumers who understand how the business works. The goal is to generate well-modeled, high-quality, trusted, real-time data.

A data contract is an abstraction that allows engineers to decouple their databases and services from analytics and ML requirements. Because contracts are validated and enforced, schema changes are much less likely to cause production-breaking incidents.
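To make "validated and enforced" concrete, here is a minimal sketch of checking a record against a contract before it is published. It is an illustration only: the `Field` class, `ORDER_CONTRACT`, and the `violations` function are hypothetical names invented for this example, not part of any tool mentioned here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One field of a contract: its name, expected Python type, and whether it is required."""
    name: str
    type_: type
    required: bool = True


# Hypothetical contract for an "order created" event (assumed for illustration).
ORDER_CONTRACT = [
    Field("order_id", str),
    Field("amount_cents", int),
    Field("coupon_code", str, required=False),
]


def violations(record: dict, contract: list[Field]) -> list[str]:
    """Return human-readable contract violations; an empty list means the record conforms."""
    problems = []
    for field in contract:
        if field.name not in record or record[field.name] is None:
            if field.required:
                problems.append(f"missing required field: {field.name}")
            continue
        value = record[field.name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        wrong_type = not isinstance(value, field.type_)
        bool_as_int = field.type_ is int and isinstance(value, bool)
        if wrong_type or bool_as_int:
            problems.append(
                f"{field.name}: expected {field.type_.__name__}, got {type(value).__name__}"
            )
    return problems


good = {"order_id": "o-1", "amount_cents": 1250}
bad = {"order_id": "o-2", "amount_cents": "12.50"}

print(violations(good, ORDER_CONTRACT))  # no violations
print(violations(bad, ORDER_CONTRACT))   # one type violation on amount_cents
```

A producer could run a check like this in CI or just before emitting an event, so that a schema change that breaks consumers fails early instead of in production.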

Illustration by Chad Sanderson, from The Rise of Data Contracts

Chad Sanderson said that at Convoy, they use Protobuf and Apache Kafka to abstract the CRUD transactions. They define the schema based on what they need, not what they get from the source. This is similar to how Software-Defined Assets describe a Data Asset in a declarative manner and set expectations for it.

Confluent also built similar functions on top of Kafka with their Schema Registry, and terms such as Semantic Layer and Analytics API (with GraphQL) are trying to achieve similar things.
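One thing a schema registry typically guards against is a new schema version that breaks existing consumers. The sketch below shows the idea with a deliberately simplified rule set; it is not Confluent's actual compatibility algorithm, and the `breaking_changes` function and the `v1`/`v2`/`v3` schemas are hypothetical names made up for this example.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Compare two schema versions.

    Each schema maps a field name to a (type name, required) tuple.
    Simplified assumption: removing a field, changing its type, or adding
    a new required field counts as a breaking change.
    """
    problems = []
    for name, (old_type, _) in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name][0] != old_type:
            problems.append(f"changed type of {name}: {old_type} -> {new[name][0]}")
    for name, (_, required) in new.items():
        if name not in old and required:
            problems.append(f"new required field: {name}")
    return problems


v1 = {"order_id": ("string", True), "amount_cents": ("int", True)}
# v2 only adds an optional field, so it does not break readers of v1.
v2 = {**v1, "currency": ("string", False)}
# v3 drops amount_cents and adds a required field, which breaks readers of v1.
v3 = {"order_id": ("string", True), "amount": ("float", True)}

print(breaking_changes(v1, v2))
print(breaking_changes(v1, v3))
```

Rejecting a schema change when this list is non-empty is one way to enforce the contract at the point where the schema is registered rather than after consumers have already failed.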

Data Contracts are not meant to replace data pipelines and the Modern Data Stack, which take a more batch-oriented approach and remain good for fast prototyping. You can start defining data contracts once you have some knowledge about your data.

Interestingly, data contracts differ from Data Mesh, which is an organizational framework with a microservice approach to data. Data Mesh doesn’t specify which data should be emitted, nor does it validate that the data emitted from production is correct or conforms to a consumer’s expectations.

Data contracts are also a form of Data Governance, a term that is very vague on its own and becomes more concrete with explicit contracts. You can also use Great Expectations to set expectations for your data, which I believe is a great way to start.
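The idea of setting expectations on data can be sketched in plain Python. This does not use the Great Expectations API; the `expect_not_null` and `expect_between` helpers and the sample rows are hypothetical, written only to show what a value-level expectation looks like.

```python
def expect_not_null(rows: list[dict], column: str) -> dict:
    """Check that no row has a missing or None value in the given column."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"expectation": f"{column} is not null", "success": not failed, "failed_rows": failed}


def expect_between(rows: list[dict], column: str, low: float, high: float) -> dict:
    """Check that every non-null value in the column lies within [low, high]."""
    failed = [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {
        "expectation": f"{column} between {low} and {high}",
        "success": not failed,
        "failed_rows": failed,
    }


# Sample data assumed for illustration; the second row violates both expectations.
rows = [
    {"order_id": "o-1", "amount_cents": 1250},
    {"order_id": None, "amount_cents": -5},
]

results = [
    expect_not_null(rows, "order_id"),
    expect_between(rows, "amount_cents", 0, 1_000_000),
]

for result in results:
    status = "PASS" if result["success"] else "FAIL"
    print(status, result["expectation"], result["failed_rows"])
```

Expectations like these complement a schema-level contract: the schema says which fields exist and what types they have, while expectations describe which values are acceptable.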

# From the Discussion on YouTube w/ Chad Sanderson vs Ethan Aaron

Chad Sanderson says in Data Contract Battle Royale w/ Chad Sanderson vs Ethan Aaron (YouTube):

Ethan Aaron says his problem with data contracts is that you focus on defining the interface/contract too early. For example, when a big task is split across several teams or people, a contract is how they agree on an interface. I’d argue that’s precisely what data products are: instead of agreeing on some artificial contract, decide on the product, so the tools and teams can stay distinct.

# Summary Blog Posts

Mehdi Ouazza gives an excellent summary of data contracts in From Zero To Hero. He illustrates how Apache Kafka could also be the interface that defines the contract.

Illustration from Data Contracts — From Zero To Hero

See also Semantic Warehouse.