
Provenance Overview

This document introduces the provenance features of MIF and walks through an example. For a more detailed discussion of provenance in general (including the provenance API), see MeDICI Provenance. NOTE: This is very much a work in progress…

Provenance Concepts

Provenance is an optional feature of the MIF API that allows users to capture run-time metadata about the pipeline being executed. The metadata can be captured at various levels of the architecture including the module, component, and pipeline layers.

At the module layer, provenance capture records relevant metadata directly before and after the module's implementation code runs (see Figure below). Examples of provenance metadata include:

  • Implementation module code name/version
  • Time taken to execute module
  • Input and output data
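
The module-layer capture described above can be pictured with a minimal sketch. This is not the MIF API: the class ProvenanceCapturingModule, its constructor arguments, and the metadata key names are all hypothetical and only illustrate the idea of recording metadata directly before and after the implementation code runs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical wrapper used for illustration only; not part of the MIF API. */
final class ProvenanceCapturingModule {
    private final String moduleName;
    private final String moduleVersion;
    private final Function<String, String> processor;

    ProvenanceCapturingModule(String moduleName, String moduleVersion,
                              Function<String, String> processor) {
        this.moduleName = moduleName;
        this.moduleVersion = moduleVersion;
        this.processor = processor;
    }

    /** Runs the processor and returns the metadata captured around the call. */
    Map<String, String> process(String input) {
        long start = System.nanoTime();                 // taken directly before the implementation code
        String output = processor.apply(input);         // the module's implementation code
        long elapsedNanos = System.nanoTime() - start;  // taken directly after it returns

        // Key names are assumptions chosen for this sketch.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("module.name", moduleName);
        record.put("module.version", moduleVersion);
        record.put("elapsed.nanos", Long.toString(elapsedNanos));
        record.put("input", input);
        record.put("output", output);
        return record;
    }
}
```

For example, wrapping a processor such as s -> "Hey, " + s and calling process("Dave") would yield a record whose "output" entry is "Hey, Dave", alongside the module name, version, and elapsed time.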

High level provenance concepts

To enable provenance capture automatically, the pipeline designer only needs to set a Boolean flag when creating the module. Additionally, there is the option of implementing the MifProvenanceObject interface, which allows the developer to customize exactly which metadata is captured and how that metadata is annotated using application-specific knowledge.

Provenance at the component and pipeline level allows metadata capture to be defined at a coarser grain than the module level. This lets the pipeline designer trade off the quality of the recorded provenance against the potential overhead of capturing it. Moreover, defining provenance capture on a component or pipeline allows the designer to take advantage of the transaction features of the MIF API. If configured, the MIF groups a set of provenance actions within an explicitly defined provenance transaction boundary. At the end of the transaction boundary, a condition specified by the pipeline designer determines whether the transaction commits (recording the provenance captured within the boundary) or rolls back (discarding it). This feature is particularly useful in streaming applications where much of the processed data is ‘uninteresting’ and provenance needs to be captured only if the analytical components detect some ‘interesting’ data.
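
The commit-or-rollback behaviour of a provenance transaction boundary can be sketched as follows. This is an illustration under stated assumptions rather than the MIF transaction API: the class ProvenanceTransaction, its methods, and the use of plain strings as provenance actions are hypothetical, and an in-memory list stands in for the provenance store.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical illustration of a provenance transaction boundary; not the MIF API. */
final class ProvenanceTransaction {
    private final List<String> pending = new ArrayList<>();    // actions captured inside the boundary
    private final List<String> committed = new ArrayList<>();  // stands in for the provenance store

    /** Buffers one provenance action inside the current boundary. */
    void record(String provenanceAction) {
        pending.add(provenanceAction);
    }

    /**
     * Ends the boundary. If the designer-supplied condition holds for the buffered
     * actions they are committed; otherwise they are discarded (rollback).
     */
    boolean end(Predicate<List<String>> commitCondition) {
        boolean commit = commitCondition.test(Collections.unmodifiableList(pending));
        if (commit) {
            committed.addAll(pending);
        }
        pending.clear();  // either way, the next boundary starts empty
        return commit;
    }

    /** Returns a read-only snapshot of everything committed so far. */
    List<String> committed() {
        return Collections.unmodifiableList(new ArrayList<>(committed));
    }
}
```

In a streaming setting, the condition might test whether any buffered action was flagged as interesting, so that boundaries containing only uninteresting data leave nothing in the store.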

The core MIF API does not specify how provenance data is stored. The implementation of the MIF provenance API simply formats the captured provenance into a message that is published on a configurable JMS provenance topic. A provenance listener and provenance store must be configured to receive and persist the provenance messages for future inspection. Within the MeDICi project, we are building the MeDICi Provenance Store that can be used for this purpose. A description of this technology is in MeDICI Provenance.
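
The publish, listen, and persist flow described above can be sketched without a message broker. The text states that MIF publishes provenance messages on a JMS topic; the sketch below deliberately substitutes an in-memory list of listeners for the JMS topic so that it runs with the standard library alone. The class InMemoryProvenanceTopic and the key=value message format are assumptions made for illustration, not the format MIF uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Hypothetical in-memory stand-in for a provenance topic; a real deployment would use JMS. */
final class InMemoryProvenanceTopic {
    private final List<Consumer<String>> listeners = new ArrayList<>();

    /** Registers a provenance listener, for example one that writes to a store. */
    void subscribe(Consumer<String> listener) {
        listeners.add(listener);
    }

    /** Formats the captured provenance into a message and delivers it to every listener. */
    void publish(Map<String, String> provenance) {
        String message = format(provenance);
        for (Consumer<String> listener : listeners) {
            listener.accept(message);
        }
    }

    /** Assumed message format: key=value pairs joined by ';', sorted by key. */
    static String format(Map<String, String> provenance) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(provenance).entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

A listener that appends each message to a list plays the role of the provenance store here; if no listener is subscribed, published messages are simply dropped, which mirrors the requirement that a listener and store be configured for provenance to be persisted.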

Using Provenance - helloProvenance Example

The following describes:

  1. An overview of the helloProvenance sample included with MIF.
  2. How to run the helloProvenance sample.
  3. A walkthrough of the code used to construct this sample.

Prerequisites:

Description

When the pipeline starts, the user is prompted to enter a name. The name is sent to the helloNameModule, which calls the helloNameProcessor implementation to add “Hey” to the front of the string. The result is passed to the helloHalModule, which calls the helloHalProcessor to add “what are you doing” to the end of the string, and the final string is returned to the user via the console. The provenance features require that the incoming String be converted into a Java object implementing the DataObject interface (described below), so there are transformers at the beginning and end of the component to convert between String and HelloData objects.
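
The data flow just described can be sketched in plain Java, independent of MIF. The classes HelloDataSketch and HelloPipelineSketch are hypothetical stand-ins: the sample's real HelloData class, processors, and transformers are wired through the MIF pipeline and may look different. The sketch only shows the order of the steps: String to data object, the two processing steps, and data object back to String.

```java
/** Hypothetical stand-in for the sample's HelloData class; the single field is an assumption. */
final class HelloDataSketch {
    final String text;

    HelloDataSketch(String text) {
        this.text = text;
    }
}

/** Plain-Java walk through the steps of the sample; it does not use the MIF API. */
final class HelloPipelineSketch {

    /** Mirrors the helloNameProcessor step: adds the greeting to the front. */
    static HelloDataSketch helloName(HelloDataSketch in) {
        return new HelloDataSketch("Hey, " + in.text);
    }

    /** Mirrors the helloHalProcessor step: adds the question to the end. */
    static HelloDataSketch helloHal(HelloDataSketch in) {
        return new HelloDataSketch(in.text + ", what are you doing?");
    }

    static String run(String name) {
        HelloDataSketch data = new HelloDataSketch(name);  // inbound transformer: String -> data object
        HelloDataSketch afterName = helloName(data);
        HelloDataSketch afterHal = helloHal(afterName);
        return afterHal.text;                              // outbound transformer: data object -> String
    }

    public static void main(String[] args) {
        System.out.println(run("Dave"));  // prints "Hey, Dave, what are you doing?"
    }
}
```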

Provenance Component Diagram

Running helloProvenance

Note: $MIF_HOME refers to the MIF installation directory

Follow these steps:

  • Open a shell and navigate to $MIF_HOME/bin.
  • Execute the following:
      run-mif.bat -c gov.pnnl.mif.samples.helloProvenance.HelloProvenanceDriver
  • MeDICI has fully started when the console displays something similar to:
      MeDICI has started
  • Enter a name when prompted.
  • The console output will then look like:
      Enter name:Dave
      HelloNameProcessor: Hey, Dave
      HelloHalProcessor: Hey, Dave, what are you doing?
      Hey, Dave, what are you doing?
  • Additionally, there will be some provenance-related output.

Adding Provenance to HelloWorld

All the source files for this example can be found in $MIF_HOME/sources/mif-samples/src/gov/pnnl/mif/samples/helloProvenance. The following code snippets demonstrate the pertinent portions of the code for the provenance features. Please refer to the hello world example and the base component model for a complete description of the components, modules and implementation code.

Enabling Provenance on a Module

Enabling provenance on a module simply requires setting the module's provenance handler:

  //This is a MIF-implemented provenance handler which simply redirects the data
  //to an arbitrary endpoint.
  MifProvenanceHandler handler = pipeline.addEndpointProvenanceHandler("stdio://stdout");
 
  //Add HelloNameModule, setting the provenance handler enables provenance for this module.
  //This causes the provenance handler to run before and after the module's processor is called.
  MifModule helloModule = pipeline.addMifModule(HelloNameProvProcessor.class.getName(), inNameEndp, "vm://hal.queue");
  helloModule.setProvenanceHandler(handler);
 
  MifModule halModule = pipeline.addMifModule(HelloHalProvProcessor.class.getName(), "vm://hal.queue", outHalEndp);
  halModule.setProvenanceHandler(handler);

Provenance-Compliant Data Structure

In order to fully utilize the provenance-related features of the API, it is necessary to create a data structure that implements the 'DataObject' interface (declared as MifProvenanceObject in the listing below). This interface has methods that tell the provenance API how to handle a data object and retrieve its “important” information. Here are the methods defined in the interface:

import java.io.Serializable;
import java.util.Map;

public interface MifProvenanceObject extends Serializable {
 
	/**
	 * A mapping of the application-specific "important" data values to keys.  This will be 
	 * used when saving the data values to the provenance store.
	 * @return the map of keys to "important" data values
	 */
	public Map<String, String> getProvenanceValuesMap();
 
	/**
	 * A mapping of data keys to the associated object stored in the provenance store.  This 
	 * is managed by the MIF/Provenance API and is needed here in order to travel with the data as
	 * it flows through the pipeline.
	 * @return the map of data keys to their associated provenance store objects
	 */
	public Map<String, String> getProvenanceIdMap();

	/**
	 * Sets the map returned by getProvenanceIdMap().
	 * @param provIdMap the map of data keys to their associated provenance store objects
	 */
	public void setProvenanceIdMap(Map<String, String> provIdMap);
}
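
As an illustration of how a data structure might satisfy this interface, here is a minimal sketch. The interface is repeated so that the sketch compiles on its own. The class GreetingData, its field, and the "greeting" key are hypothetical; the sample's actual HelloData class may be implemented differently.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Repeated from the interface listing above so that this sketch is self-contained.
interface MifProvenanceObject extends Serializable {
    Map<String, String> getProvenanceValuesMap();
    Map<String, String> getProvenanceIdMap();
    void setProvenanceIdMap(Map<String, String> provIdMap);
}

/** Hypothetical data object for illustration; not the sample's HelloData class. */
final class GreetingData implements MifProvenanceObject {
    private static final long serialVersionUID = 1L;

    private final String greeting;
    // Managed by the provenance layer as the object travels through the pipeline.
    private Map<String, String> provIdMap = new HashMap<>();

    GreetingData(String greeting) {
        this.greeting = greeting;
    }

    String getGreeting() {
        return greeting;
    }

    @Override
    public Map<String, String> getProvenanceValuesMap() {
        // The application decides which values are "important"; the key name is an assumption.
        Map<String, String> values = new LinkedHashMap<>();
        values.put("greeting", greeting);
        return values;
    }

    @Override
    public Map<String, String> getProvenanceIdMap() {
        return provIdMap;
    }

    @Override
    public void setProvenanceIdMap(Map<String, String> provIdMap) {
        this.provIdMap = provIdMap;
    }
}
```

The values map is rebuilt on each call so that it always reflects the current state of the object, while the ID map is stored as given because, per the interface comments, it is managed by the MIF/Provenance API rather than by the application.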

Summary

This page has introduced the concepts of provenance and given an example application showing the features in live code. NOTE: The provenance API is a work in progress and is subject to change. This page will be updated to reflect any changes.

 
old_provenance_page.txt · Last modified: 2010/05/28 14:10 by adamw