Eighty-twenting data

I entered the world of data through data quality, and quality has been one of the foundational themes of all the work I’ve produced since.

“If I have seen further it is by standing on the shoulders of Giants” is one of my favorite quotes. Attributed to Isaac Newton, it works as a reminder that everything we know and do is built on the work done before us.

One of these giants is Joseph M. Juran, whose work in the field of quality management is still a reference.

So why am I bringing up Juran here? Mainly because he introduced the Pareto principle to quality issues, observing that a small percentage of root causes contributes to a high percentage of defects.

The Pareto principle, or 80/20 rule, follows the observations of economist Vilfredo Pareto, whose studies showed that 80% of the land in Italy was owned by 20% of the population.

Although I’ve used it most often while dealing with data quality issues, the principle is applied in many different fields, even though there is little scientific analysis that either proves or refutes its validity.

This is also true when reflecting on some of the issues faced by those with responsibilities in data management. Correctly applied, the principle can bring a better understanding of those issues and possibly additional benefits, such as lower costs and greater efficiency, or at the very least serve as a tool to identify priorities.

To put it another way: if we consider data a corporate asset, the rule allows an organization to identify its best assets and use them efficiently to create maximum value.

Keep in mind that 80/20 is only a guideline; it’s in fact almost a brand name. The two numbers measure outputs and inputs, not necessarily even in the same units, so they don’t have to add up to 100. It can easily be 70/30, 50/10, or whatever combination.

What I’m proposing here is a questioning exercise that, in certain situations, can lead to a more efficient allocation of resources or even help define future investments.

Asking questions like:

  • Which 20% of data produces the most valuable business insights?

  • Which 20% of data is most critical for business continuity?

  • Which 20% of data is most exposed to security risks?

  • Which 20% of data is most frequently accessed?

  • Which 20% of data is least frequently updated?

  • Which 20% of data is most critical for regulatory purposes?

  • Which 20% of data takes the most processing time in loading and transformation processes?

  • Which 20% of data causes most of the data quality problems? *

These are just a few examples of the questions that can be asked. In some situations they can lead to a change in perspective followed by specific actions, especially when we start cross-referencing the answers to different questions.

As an example, identifying the 20% of data that is most valuable to the organization makes it possible to better prioritize and define future data initiatives, review current ones, or even adapt ongoing initiatives to maximize the overall efficiency of the data architecture.
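
Taking the data quality question as a concrete illustration, the analysis behind it can be very simple. Below is a minimal sketch in Python, assuming a hypothetical table of issue counts per data source (the source names and numbers are made up for illustration); it ranks the sources by contribution and keeps the smallest set that accounts for roughly 80% of the issues:

    import pandas as pd

    # Hypothetical example: number of data quality issues logged per data source
    # (source names and counts are invented for illustration).
    issues = pd.DataFrame({
        "source": ["crm", "billing", "web_logs", "inventory", "hr", "marketing"],
        "issue_count": [420, 310, 95, 60, 25, 10],
    })

    # Sort sources by contribution and compute each one's cumulative share of all issues.
    issues = issues.sort_values("issue_count", ascending=False)
    issues["cumulative_share"] = issues["issue_count"].cumsum() / issues["issue_count"].sum()

    # Keep the smallest set of sources whose combined share stays within ~80% of the issues.
    top_sources = issues[issues["cumulative_share"] <= 0.80]
    print(top_sources[["source", "issue_count", "cumulative_share"]])

In this made-up example, two of the six sources account for close to 80% of the issues, which is exactly the kind of signal that can drive prioritization. The same pattern works for any of the questions above, as long as there is a measurable output to rank by.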


* I had to include a data quality related question 😊