Data Lineage

Even at the risk of having no readers for this article, I decided to boldly give it a tedious title.

Besides being self-explanatory, it’s about something important that is often overlooked.

To redeem myself for the lack of an imaginative title, I promise to personally thank whoever signals in the comments that they made it all the way to the end of the article. To everyone else, my apologies in advance.

What happened here?

This is a question that frequently comes to mind for whoever works with data produced or processed by others.

It’s frequently asked by people in analytics and regulatory reporting, by data scientists and, of course, by the data governance guys.

Even in an environment with data policies, strict compliance or data protection directives, and data quality requirements in place, it is almost impossible to have full control over all these requirements and their implementation.

Data gone bad

When data starts going sideways, everybody immediately wants to see root cause and impact analysis reports. These are not easy to produce when the data went through five different systems and twenty-five Excel spreadsheets, and was handled by ten different people in six different departments.

Data is present in all of the organization’s processes, from risk and regulatory compliance to routine operations. When data problems arise, bad data starts propagating across the organization and affects every process, from the most operational all the way to high-level decision making. Considering the number of decisions made daily, at both the operational and the strategic level, the impact on the business is significant and may extend into the long term.

This is where data lineage comes into play.

Data lineage maps the flows of data through the organization. It makes root cause and impact analysis viable by providing full visibility over the organization’s data flows, which makes it possible to identify and fix data problems, correct the processes involved, and implement effective corrective and preventive measures.
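
To make the idea concrete, here is a minimal sketch of lineage as a directed graph of data flows. The dataset names and the flows between them are invented for illustration, and a real lineage tool would harvest them automatically rather than have them typed in by hand. Walking the graph backwards from a broken report gives the root cause candidates; walking it forwards from a bad source gives the impact.

```python
from collections import defaultdict, deque

# Hypothetical data flows as (source, target) pairs; all names are invented.
FLOWS = [
    ("crm_export", "customer_master"),
    ("billing_system", "customer_master"),
    ("customer_master", "risk_spreadsheet"),
    ("customer_master", "sales_dashboard"),
    ("risk_spreadsheet", "regulatory_report"),
]


def build_graph(flows):
    """Index the flows in both directions so each analysis is one traversal."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for source, target in flows:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def reachable(graph, start):
    """Return every node reachable from `start` by following `graph` edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


upstream, downstream = build_graph(FLOWS)

# Root cause candidates: everything that feeds the broken report.
root_cause_candidates = reachable(upstream, "regulatory_report")

# Impact analysis: everything fed by a source known to be bad.
impacted = reachable(downstream, "billing_system")
```

Both questions are answered by the same traversal, only the direction of the edges changes. That is why a single, complete map of the flows is enough to support both kinds of report.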

Not without challenges

Data lineage can only be fully effective and reliable if it’s fully automated, and this is the truly challenging requirement.

Manual data lineage is a scenario that can’t even be considered. Besides being an intensive process that consumes time and resources, it is also highly susceptible to errors, sometimes adding to the problems instead of solving them.

If we look closely at the characteristics of a common data ecosystem, we can see that it tends to be siloed and to use multiple types of data sources, with data flows supported by multiple technologies and coded with very different methodologies and quality levels (all items to take into consideration when selecting a data lineage tool).

It’s easy to conclude that data lineage, taken in a holistic approach, can easily turn into an expensive initiative that consumes time and resources and spans long time frames.

Finally, we are talking about the kind of initiative where the added value is hard to determine, which makes it hard, even with strong sponsorship, to keep the traction needed to complete all the necessary changes.

Focused approach

As in other initiatives with similar characteristics, the approach should be focused and as little disruptive as possible, to allow data lineage to gain some traction within the organization.

Developing a sequence of targeted initiatives, instead of a single large one, has the beneficial effect of raising awareness of the importance and impact of data quality across the organization, which increases overall internal engagement.

Start with business areas that can clearly identify and measure the business impact of bad data on their processes. In every organization, opportunities to identify these cases are abundant. Across all business areas there are pain points related to the quality of data, and identifying them is not a challenge.

Once a critical pain point is identified, you’ll have a business stakeholder who can passionately and effectively articulate the impact of poor data quality on their processes and who will be eager to defend the project.

Having the business stakeholder working by your side will accelerate the move from findings to specific actions.

Opting for smaller, focused and efficient initiatives raises awareness across the enterprise and ends up acting as the motor, from within the organization, that will allow data lineage to propagate across all the data processes.