Fighting the data swamp monster

Every forecast agrees that the data lake market will grow at double-digit rates in the years to come. That growth is largely driven by a shift towards cloud-based data platforms, by the increasing need to derive in-depth insights from larger volumes of data, and by the need for streamlined access to organizational data held in departmental, mainframe, and legacy silos.

This fast-paced growth implies that existing data swamps will most likely grow in the same proportion.


An ocean of possibilities

Data Lakes hold an ocean of possibilities for organizations ready to face the emergence of new business innovations and new forms of competition. Combined with constant advances in digitization, analytics, artificial intelligence, machine learning, the Internet of Things, and robotics, they can provide the competitive edge that makes the difference in an increasingly competitive business environment.

Data Lakes offer the capability to introduce cutting-edge data initiatives and to capture vast amounts of data from unrelated sources (social, mobile, cloud applications, or IoT), making it possible to work with “raw” data in its native format, whether structured, semi-structured, or unstructured.

Data Lakes make it possible to break down data silos by centralizing and consolidating the organization’s batch and streaming data assets into a complete and authoritative data store, enabling organizations to:

· Power data science and machine learning initiatives.

· Centralize, consolidate, and catalogue data.

· Integrate disparate data sources and formats.

· Give users self-service tools.

Despite their unquestionable advantages, Data Lakes also bring their own challenges and risks, which an organization needs to tackle to avoid turning them into data swamps.

The Data Swamp

When organizations cannot find, understand, and trust the data they need from their data lakes to create business value and gain a competitive edge, they are facing a Data Swamp.

· Data that is dumped into the data lake without structure, processes, and rules governing it has no business value.

· Data that is not governed or cataloged will turn into a business liability.

· Data with undetermined origin and unknown transformations increases regulatory and privacy compliance risks, as regulators and privacy legislation require strong data governance, traceability, and privacy controls such as data masking and anonymization.

· Data without accuracy or context does not produce valid insights. This compromises any strategy built on data analytics and results in poor decisions, ultimately leading to financial and reputational risks.
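To make the privacy controls mentioned above more concrete, here is a minimal Python sketch of masking and pseudonymizing a field before it lands in the lake. The field names, the `mask_email` and `pseudonymize` helpers, and the key are all hypothetical, chosen only for illustration. A keyed hash is pseudonymization rather than full anonymization, so a real implementation would follow the organization's own privacy policy and tooling.

```python
import hashlib
import hmac


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"  # not a recognizable address: hide it entirely
    return f"{local[0]}{'*' * (len(local) - 1)}@{domain}"


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a value with a keyed SHA-256 hash so records can still be joined."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical record and key; in practice the key would be stored outside the lake.
record = {"customer_email": "jane.doe@example.com", "country": "PT"}
key = b"hypothetical-secret-key"

masked_record = {
    **record,
    "customer_email": mask_email(record["customer_email"]),
    "customer_id": pseudonymize(record["customer_email"], key),
}
```

Masking keeps the value readable enough for support or debugging, while the keyed hash gives analysts a stable identifier without exposing the original address.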

As data “hoarding” develops, so does the risk of falling into the murky waters of what was initially a Data Lake. A Data Swamp is easily recognized by the lack of metadata, large volumes of irrelevant, incomplete, or inconsistent data, the absence of governance processes, and a nonexistent data quality strategy.

Bottom line:

· Stakeholders don't trust data.

· The intended benefits of big data are not achieved.

· Decision processes are impaired.

· Compliance costs increase.

· Data scientist productivity is low.

· Data-dependent initiatives and projects fail to reach production.

The result is an organization that fails to exploit its data assets effectively to reach its strategic goals and create value.

Data that was initially a business asset becomes a business liability.

Data lake governance

It seems clear that the more data an organization has available, the more easily it will be able to pursue its objectives. It does not always work that way, especially when data is simply moved into the Data Lake without any Data Governance in place, which ultimately replicates all the data silos in a single place.

Organizations that aim to maximize the value of data stored in data lakes effectively and efficiently need to implement policy-driven processes that classify and identify what information is in the lake, why it’s there, what it means, who owns it, and who uses it, so that high-quality data is available throughout the data's full life cycle.
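As an illustration of those five questions, the sketch below models a catalog entry in Python and reports which questions are still unanswered for each dataset. The `CatalogEntry` class, the `governance_gaps` helper, and the dataset names are hypothetical and not taken from any specific catalog product. It only shows the kind of check a policy-driven process might automate.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    dataset: str                 # what information is in the lake
    purpose: str = ""            # why it is there
    description: str = ""        # what it means
    owner: str = ""              # who owns it
    consumers: list[str] = field(default_factory=list)  # who uses it


def governance_gaps(entry: CatalogEntry) -> list[str]:
    """Return the names of the governance questions left unanswered."""
    gaps = []
    if not entry.purpose.strip():
        gaps.append("purpose")
    if not entry.description.strip():
        gaps.append("description")
    if not entry.owner.strip():
        gaps.append("owner")
    if not entry.consumers:
        gaps.append("consumers")
    return gaps


# Hypothetical entries: one governed dataset and one that was simply dumped.
entries = [
    CatalogEntry(
        dataset="sales/orders",
        purpose="Revenue reporting",
        description="One row per confirmed order",
        owner="sales-ops",
        consumers=["finance-bi"],
    ),
    CatalogEntry(dataset="tmp/export_final_v2"),
]

report = {entry.dataset: governance_gaps(entry) for entry in entries}
for dataset, gaps in report.items():
    print(dataset, "->", "governed" if not gaps else f"missing: {', '.join(gaps)}")
```

A dataset whose entry reports every field as missing is a likely candidate for the swamp, since nobody can say why it is there or who is accountable for it.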

If an organization wants to use the data stored in the data lake to inform its decisions, the data must be governed.

Understanding which of the organization’s strategic goals underlie the Data Lake is essential. Clear, focused, and business-oriented objectives, controlled through Data Governance processes, are paramount for success during implementation.

Other aspects, also critical and directly dependent on a robust Data Governance framework, help avoid some of the pitfalls that inevitably lead to a Data Swamp, along with some of the negative effects mentioned above:

· Metadata Management.

· Data Lineage.

· Data Catalog.

· Business Glossary.

· Data Quality.

· Data Privacy.

· Data Security.

· Data Access Control.

· Data Encryption.

Successfully implementing a Data Governance plan in the Data Lake, and thereby creating an efficient Data Lake, brings clear benefits to the organization from both an operational and a management perspective:

· Optimization of the value of data.

· Increased synergies between business areas, through the establishment of standard analytics methodologies.

· Increased management productivity, driven by more efficient and effective evidence-based decision-making.

· Higher revenue generation through more efficient use of data assets.

· Reduction in time spent by knowledge workers in finding and acquiring information.

· Reduction of regulatory compliance and data privacy fines.

· ROI that can be associated with specific analytics initiatives.

Is it too late?

Data Governance is commonly seen as a long, painful, and expensive process, and it can be. For an organization already struggling with a Data Swamp, it seems an even larger challenge.

It doesn't have to be so.

Adopting a business-driven approach, aligned with business objectives, priorities, and needs, and complementing it with an agile implementation in which the benefits of each governance initiative quickly become visible to stakeholders, will allow the organization to start regaining control over the Data Lake.