Data Provenance

Loosely defined, Data Provenance is the record of where data came from. It covers every level of the source: the source application, database type, database instance, database name, schema name, table name, and attribute name.

More specifically, Data Provenance refers to the first instance of the data, that is, its original source.
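
The source elements listed above can be captured as a small record attached to each attribute. The following is a minimal sketch in Python, not a standard schema: the class name, field names, and example values are all hypothetical and chosen only to mirror the elements named in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Where a single attribute value originated (illustrative field names)."""
    source_application: str
    database_type: str
    database_instance: str
    database_name: str
    schema_name: str
    table_name: str
    attribute_name: str

    def qualified_name(self) -> str:
        # Dotted path from the database instance down to the attribute.
        return ".".join([
            self.database_instance,
            self.database_name,
            self.schema_name,
            self.table_name,
            self.attribute_name,
        ])


# Hypothetical example values, not taken from any real system.
email_origin = Provenance(
    source_application="CRM",
    database_type="PostgreSQL",
    database_instance="crm-prod-01",
    database_name="crm",
    schema_name="public",
    table_name="customer",
    attribute_name="email",
)

print(email_origin.qualified_name())
```

Making the record immutable (frozen) reflects the idea that the first source of a value does not change, even if the value is later copied or transformed.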

Definition from DataVersity – Quoting W3C

“Provenance is defined by W3C as “information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability, or trustworthiness.” By knowing the history of data, in other words, you know whether to trust it. Provenance comes in the form of metadata that details a data packet’s lineage: its origin and changes made to it, with timestamps. Users can more easily track down errors, inaccuracies, flaws, and fraud, leading to better data-analytics outcomes.” Source – DataVersity

The Importance of Provenance within a Data Warehouse

Enterprise reports and analytics depend on accurate Data Provenance that is maintained for each attribute and follows that attribute through all layers of the warehouse. This is required to accurately answer users’ data quality questions such as “Where did this data come from?”. Without this capability architected in as a foundational element of the modern data warehouse, finding the exact source database, table, and attribute where a value in an aggregated data set (e.g., a Golden Record) originated can take a great deal of time. Being able to answer user questions quickly and authoritatively is essential to users having faith in the quality of the data in the warehouse, which in turn is essential for high user adoption.
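
To illustrate how provenance can follow each attribute into an aggregated record, here is a minimal sketch in Python. It assumes a simple "first non-empty value wins" merge rule, which is only one possible survivorship rule, and all class names, function names, and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Origin:
    """Source database, table, and attribute of a value (illustrative)."""
    database: str
    table: str
    attribute: str


@dataclass(frozen=True)
class TracedValue:
    """A value that carries its origin with it through every layer."""
    value: Any
    origin: Origin


def build_golden_record(sources: List[Dict[str, TracedValue]]) -> Dict[str, TracedValue]:
    """Merge source records, keeping the first non-empty value per attribute.

    'First non-empty wins' is an assumed rule for illustration only.
    """
    golden: Dict[str, TracedValue] = {}
    for record in sources:
        for name, traced in record.items():
            if name not in golden and traced.value not in (None, ""):
                golden[name] = traced
    return golden


def where_from(golden: Dict[str, TracedValue], name: str) -> str:
    """Answer 'Where did this data come from?' for one attribute."""
    origin = golden[name].origin
    return f"{origin.database}.{origin.table}.{origin.attribute}"


# Hypothetical source records from two systems.
crm = {
    "email": TracedValue("ana@example.com", Origin("crm", "customer", "email")),
    "phone": TracedValue(None, Origin("crm", "customer", "phone")),
}
billing = {
    "email": TracedValue("ana@old.example.com", Origin("billing", "account", "contact_email")),
    "phone": TracedValue("555-0100", Origin("billing", "account", "phone_no")),
}

golden = build_golden_record([crm, billing])
print(where_from(golden, "email"))
print(where_from(golden, "phone"))
```

Because each value keeps its origin alongside it, the source of any attribute in the merged record can be looked up directly instead of being reconstructed after the fact.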

Research Data Provenance Interest Group

“Tracking provenance for research data is vital to science and scholarship, providing answers to common questions researchers pose when sharing and exchanging data: Where did it come from? Who modified it? Is this copy the same as the copy I deposited? In what way is it the same? How do I resolve discrepancies or anomalies?

This group focuses on the comparison and evaluation of models for data provenance. It is concerned with questions of data origins, maintenance of identity through the data lifecycle, and how we account for data modification. Objectives of this group include: recommending general and expressive frameworks for documenting research data transactions, proposing syntheses of complementary provenance views, and relating data provenance to problems of scientific equivalence and the assessment of data quality.

The Research Data Provenance group anticipates potential intersections with the Data Citation, Data Foundation and Terminology, and Metadata Standards working groups as well as the Data in Context interest group.” Source – rd-alliance.org/groups/research-data-provenance.html
