Monday, July 14, 2025

Change-Aware Data Validation with Column-Level Lineage


Tools like dbt make developing SQL data pipelines straightforward and systematic. However, even with the added structure and clearly defined data models, pipelines can still become complex, which makes debugging issues and validating changes to data models difficult.

The growing complexity of data transformation logic gives rise to the following issues:

  1. Traditional code review processes only look at code changes and exclude the data impact of those changes.
  2. Data impact resulting from code changes is hard to trace. In sprawling DAGs with nested dependencies, discovering how and where data impact occurs is extremely time-consuming, or near impossible.

GitLab’s dbt DAG (shown in the featured image above) is the perfect example of a data project that’s already a house of cards. Imagine trying to follow a simple SQL logic change to a column through this entire lineage DAG. Reviewing a data model update would be a daunting task.

How would you approach such a review?

What is data validation?

Data validation refers to the process used to determine that the data is correct in terms of real-world requirements. This means ensuring that the SQL logic in a data model behaves as intended by verifying that the data is correct. Validation is usually performed after modifying a data model, such as accommodating new requirements, or as part of a refactor.

A unique review challenge

Data has states and is directly affected by the transformation used to generate it. This is why reviewing data model changes is a unique challenge, because both the code and the data need to be reviewed.

Because of this, data model updates should be reviewed not only for completeness, but also for context. In other words, that the data is correct and existing data and metrics weren’t unintentionally altered.

Two extremes of data validation

In most data teams, the person making the change relies on institutional knowledge, intuition, or past experience to assess the impact and validate the change.

“I’ve made a change to X, I think I know what the impact should be. I’ll check it by running Y”

The validation strategy usually falls into one of two extremes, neither of which is ideal:

  1. Spot-checking with queries and a few high-level checks like row count and schema. It’s fast but risks missing actual impact. Critical and silent errors can go unnoticed (see the sketch after this list).
  2. Exhaustive checking of every single downstream model. It’s slow and resource intensive, and can be costly as the pipeline grows.
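
To make the first extreme concrete, here is a minimal sketch of a typical spot check in Python. The run_query helper, the relation names, and the information_schema lookup are assumptions standing in for whatever warehouse client and naming conventions you actually use.

def column_names(run_query, table: str) -> set[str]:
    """Column names of a relation, via a standard information_schema query."""
    rows = run_query(
        "select column_name from information_schema.columns "
        f"where table_name = '{table}'"
    )
    return {row[0] for row in rows}

def spot_check(run_query, prod_table: str, dev_table: str) -> dict:
    """Compare row counts and column sets between the production and dev builds."""
    prod_count = run_query(f"select count(*) from {prod_table}")[0][0]
    dev_count = run_query(f"select count(*) from {dev_table}")[0][0]
    prod_cols = column_names(run_query, prod_table)
    dev_cols = column_names(run_query, dev_table)
    return {
        "row_count_delta": dev_count - prod_count,
        "added_columns": sorted(dev_cols - prod_cols),
        "removed_columns": sorted(prod_cols - dev_cols),
    }

Checks like this are quick to run, but they only see what you think to query: a silent change in values that keeps the row count and schema intact will pass unnoticed, which is exactly the risk described above.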

This results in a data review process that’s unstructured, hard to repeat, and often introduces silent errors. A new strategy is needed that helps the engineer perform precise and targeted data validation.

A better approach through understanding data model dependencies

To validate a change to a data project, it’s important to understand the relationships between models and how data flows through the project. These dependencies between models tell us how data is passed and transformed from one model to another.

Analyze the relationships between models

As we’ve seen, data project DAGs can be huge, but a data model change only impacts a subset of models. By isolating this subset and then analyzing the relationships between the models, you can peel back the layers of complexity and focus just on the models that actually need validating, given a specific SQL logic change.

The types of dependencies in a data project are:

Model-to-model

A structural dependency in which columns are selected from an upstream model.

-- downstream_model
select
  a,
  b
from {{ ref("upstream_model") }}

Column-to-column

A projection dependency that selects, renames, or transforms an upstream column.

-- downstream_model
select
  a,
  b as b2
from {{ ref("upstream_model") }}

Model-to-column

A filter dependency in which a downstream model uses an upstream model in a where, join, or other conditional clause.

-- downstream_model
select
  a
from {{ ref("upstream_model") }}
where b > 0

Understanding the dependencies between models helps us to define the impact radius of a data model logic change.

Identify the impact radius

When making changes to a data model’s SQL, it’s important to know which other models might be affected (the models you need to check). At a high level, this is achieved by following model-to-model relationships. This subset of DAG nodes is known as the impact radius.

In the DAG below, the impact radius includes nodes B (the modified model) and D (the downstream model). In dbt, these models can be identified using the modified+ selector.

DAG showing modified model B and downstream dependency D. Upstream model A and unrelated model C are not impacted (Image by author)
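
As a rough illustration of what the modified+ selection does conceptually, the sketch below (Python, with an adjacency list that mirrors the four-node example DAG; this is not dbt’s implementation) walks model-to-model edges downstream from the modified model to collect the impact radius.

from collections import deque

# Adjacency list mirroring the example DAG: A feeds B and C, B feeds D.
downstream_edges = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": [],
}

def impact_radius(dag, modified):
    """Return the modified models plus every model reachable downstream of them."""
    radius = set(modified)
    queue = deque(modified)
    while queue:
        node = queue.popleft()
        for child in dag.get(node, []):
            if child not in radius:
                radius.add(child)
                queue.append(child)
    return radius

print(impact_radius(downstream_edges, {"B"}))  # {'B', 'D'}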

Identifying modified nodes and their downstream dependencies is a good start, and by isolating changes like this you’ll reduce the potential data validation area. However, this could still result in a large number of downstream models.

Classifying the types of SQL changes can further help you prioritize which models actually require validation by understanding the severity of the change, eliminating branches with changes that are known to be safe.

Classify the SQL change

Not all SQL changes carry the same level of risk to downstream data, and so they should be categorized accordingly. By classifying SQL changes this way, you can add a systematic approach to your data review process.

A SQL change to a data model can be categorized as one of the following:

Non-breaking change

Changes that don’t impact the data in downstream models, such as adding new columns, changes to SQL formatting, or adding comments, etc.

-- Non-breaking change: New column added
select
  id,
  category,
  created_at,
  -- new column
  now() as ingestion_time
from {{ ref('a') }}

Partial-breaking change

Changes that only impact downstream models that reference certain columns, such as removing or renaming a column, or modifying a column definition.

-- Partial breaking change: `category` column renamed
select
  id,
  created_at,
  category as event_category
from {{ ref('a') }}

Breaking change

Changes that impact all downstream models, such as filtering, sorting, or otherwise altering the structure or meaning of the transformed data.

-- Breaking change: Filtered to exclude data
select
  id,
  category,
  created_at
from {{ ref('a') }}
where category != 'internal'

Apply classification to reduce scope

After applying these classifications, the impact radius, and the number of models that need to be validated, can be significantly reduced.

DAG showing three categories of change: non-breaking, partial-breaking, and breaking (Image by author)

In the above DAG, nodes B, C, and F have been modified, resulting in potentially 7 nodes that need to be validated (C to E). However, not every branch contains SQL changes that actually require validation. Let’s take a look at each branch:

Node C: Non-breaking change

C is classified as a non-breaking change. Therefore both C and H don’t need to be checked; they can be eliminated.

Node B: Partial-breaking change

B is classified as a partial-breaking change due to the change to column B.c1. Therefore, D and E need to be checked only if they reference column B.c1.

Node F: Breaking change

The modification to model F is classified as a breaking change. Therefore, all downstream nodes (G and E) need to be checked for impact. For instance, model G might aggregate data from the modified upstream column.

The initial 7 nodes have already been reduced to 5 that need to be checked for data impact (B, D, E, F, G). Now, by inspecting the SQL changes at the column level, we can reduce that number even further.
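
A sketch of this classification-based pruning is shown below. The edges, change classifications, and node names are assumptions copied from the example DAG, not output from any real tool; partial-breaking branches are kept whole here and narrowed further in the next step.

from collections import deque

# Model-to-model edges and change classifications mirroring the example DAG.
downstream_edges = {
    "B": ["D"], "C": ["H"], "D": ["E"], "F": ["G"], "G": ["E"], "E": [], "H": [],
}
changes = {
    "B": "partial-breaking",
    "C": "non-breaking",
    "F": "breaking",
}

def downstream_of(dag, start):
    """All models strictly downstream of `start`."""
    seen, queue = set(), deque(dag.get(start, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(dag.get(node, []))
    return seen

to_validate = set()
for model, change_type in changes.items():
    if change_type == "non-breaking":
        continue  # safe change: skip the model and don't walk its branch
    # Breaking and partial-breaking changes keep the model and its downstream
    # models as candidates for validation.
    to_validate |= {model} | downstream_of(downstream_edges, model)

print(sorted(to_validate))  # ['B', 'D', 'E', 'F', 'G']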

Narrowing the scope further with column-level lineage

Breaking and non-breaking changes are easy to classify but, when it comes to inspecting partial-breaking changes, the models need to be analyzed at the column level.

Let’s take a closer look at the partial-breaking change in model B, in which the logic of column c1 has been modified. This change could potentially result in 4 impacted downstream nodes: D, E, K, and J. After tracking column usage downstream, this subset can be further reduced.

DAG showing the column-level lineage used to trace the downstream impact of a change to column B.c1 (Image by author)

Following column B.c1 downstream we can see that:

  • B.c1 → D.c1 is a column-to-column (projection) dependency.
  • D.c1 → E is a model-to-column dependency.
  • D → K is a model-to-model dependency. However, as D.c1 is not used in K, this model can be eliminated.

Therefore, the models that need to be validated in this branch are B, D, and E. Together with the breaking change F and downstream G, the total models to be validated in this diagram are F, G, B, D, and E, or just 5 out of a total of 9 potentially impacted models.
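
The same narrowing can be sketched in code. The column-level edges below are assumptions read off the figure (B.c1 projected into D.c1, D.c1 filtered on in E, and no use of D.c1 in K), not the output of a real lineage tool.

from collections import deque

# Column-level lineage for the partial-breaking branch, read off the figure.
# Each (model, column) maps to the downstream dependencies on that column.
column_edges = {
    ("B", "c1"): [("column-to-column", "D", "c1")],  # B.c1 -> D.c1 (projection)
    ("D", "c1"): [("model-to-column", "E", None)],   # D.c1 used in a clause in E
}
# D -> K is only a model-to-model edge and D.c1 is not used in K, so no entry
# above mentions K and it never enters the validation set.

def impacted_by_column(edges, model, column):
    """Models that need validating because they depend on `model.column`."""
    impacted, queue = {model}, deque([(model, column)])
    while queue:
        current = queue.popleft()
        for dep_type, child_model, child_column in edges.get(current, []):
            impacted.add(child_model)
            if dep_type == "column-to-column" and child_column is not None:
                # The column itself propagates, so keep following it downstream.
                queue.append((child_model, child_column))
    return impacted

print(sorted(impacted_by_column(column_edges, "B", "c1")))  # ['B', 'D', 'E']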

Conclusion

Data validation after a model change is difficult, especially in large and complex DAGs. It’s easy to miss silent errors, and performing validation becomes a daunting task, with data models often feeling like black boxes when it comes to downstream impact.

A structured and repeatable process

By using this change-aware data validation technique, you can bring structure and precision to the review process, making it systematic and repeatable. This reduces the number of models that need to be checked, simplifies the review process, and lowers costs by only validating models that actually require it.

Before you go…

Dave is a senior technical advocate at Recce, where we’re building a toolkit to enable advanced data validation workflows. He’s always happy to talk about SQL, data engineering, or helping teams navigate their data validation challenges. Connect with Dave on LinkedIn.

Research for this article was made possible by my colleague Chen En Lu (Popcorny).
