Provenance Sample

Provenance is metadata about the source and manner in which data was processed. Provenance is important in scientific workflows in which experimental results must be analyzed or verified after they have been run.

This sample demonstrates creating and configuring a MifProvenanceHandler by extending the class AbstractProvenanceHandler. The provenance handler is implemented as an interceptor that handles data passed to a module twice: before it is received by the module's MIF processor, and after the MIF processor has operated on it. The user creates a provenance handler by implementing two methods: handleBefore, which MIF calls to record provenance before the data is received by the processor (here, the HelloNameProcessor), and handleAfter, which MIF calls after the processor returns its data.
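The before/after interception described above can be illustrated without any MIF classes. The following is a minimal sketch only: ProvenanceHook, intercept and demo are hypothetical names invented for this illustration, and the module metadata is reduced to a plain module name. It is not how MIF itself wires the interceptor.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class InterceptorSketch {

    /** Hypothetical stand-in for a provenance handler; not the MIF class. */
    public interface ProvenanceHook {
        void handleBefore(Serializable data, String moduleName);
        void handleAfter(Serializable data, String moduleName);
    }

    /** Calls the hook with the input, runs the processor, then calls the hook with the output. */
    public static Serializable intercept(String moduleName,
                                         ProvenanceHook hook,
                                         Function<Serializable, Serializable> processor,
                                         Serializable input) {
        hook.handleBefore(input, moduleName);
        Serializable output = processor.apply(input);
        hook.handleAfter(output, moduleName);
        return output;
    }

    /** Sends one name through a toy greeting processor and returns what the hook recorded. */
    public static List<String> demo(String name) {
        List<String> events = new ArrayList<>();
        ProvenanceHook hook = new ProvenanceHook() {
            @Override
            public void handleBefore(Serializable data, String moduleName) {
                events.add("before " + moduleName + ": " + data);
            }

            @Override
            public void handleAfter(Serializable data, String moduleName) {
                events.add("after " + moduleName + ": " + data);
            }
        };
        // The lambda plays the role of the processor: it prepends a greeting.
        intercept("nameModule", hook, in -> "Hey, " + in, name);
        return events;
    }

    public static void main(String[] args) {
        demo("Dave").forEach(System.out::println);
    }
}
```

The point of the sketch is the ordering: the hook sees the original data first and the processed data second, while the processor itself stays unaware of the hook.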

Sample Overview

This sample builds on the helloModule from the hello world example by setting a provenance handler on the module. As in the hello world sample, the user is prompted to enter a name via a stdio endpoint on the helloModule.

The diagram below shows the inner workings of the helloModule. The message containing the name is received and intercepted by the provenance handler, whose handleBefore method writes it to a temporary file along with some provenance data, including the time at which it was processed. The message is then sent to the processor, which prepends a greeting to the name and returns the resulting string. Lastly, the provenance handler intercepts the string returned from the processor and writes its provenance data to the same temporary file.

Diagram of Provenance Sample

Running the Sample

To run this sample, start the HelloProvenanceDriver class:

bin/mif.sh -c gov.pnnl.mif.samples.provenance.HelloProvenanceDriver

This starts MIF and prints the name of the temporary file where the provenance data will be written. Make a note of this file name so that you can check that the provenance data has been written there after the pipeline processes its data.

2010-05-28 15:54:07,247 INFO  [MifManagerImpl]  -- MIF has started --
Creating temporary file for provenance: /tmp/provenance-1998268655351158797.txt
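The driver's file-creation code is not shown in this sample. As a rough sketch, a temporary file with a name of the form shown above could be created with the standard File.createTempFile call. The class name TempProvenanceFile and the method create are hypothetical, and the actual driver may do this differently.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TempProvenanceFile {

    /** Creates an empty temp file named provenance-<random digits>.txt and reports its path. */
    public static File create() {
        try {
            // The JVM picks the directory (java.io.tmpdir) and the random part of the name.
            File provFile = File.createTempFile("provenance-", ".txt");
            System.out.println("Creating temporary file for provenance: " + provFile.getAbsolutePath());
            return provFile;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        create();
    }
}
```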

Then it prompts the user for a name. Enter a name and press return:

enter name:  Dave
before nameModule at 1275087426937
HelloNameProcessor: Hey,   Dave
after  nameModule at 1275087426942

As you can see above, the handleBefore method of the provenance handler prints “before nameModule at <time in ms>”, then the HelloNameProcessor receives the name and prepends “Hey,” to it. Lastly, the handleAfter method receives the processed data and prints “after nameModule at <time in ms>”. Both handle methods also write their provenance data to the temporary file. Open the file and view its contents, which will look something like:

before module: nameModule. Data: [Dave] at 1275087426937
after  module: nameModule. Data: [Hey, Dave] at 1275087426942

This shows that the before handler recorded the name of the module it is capturing provenance for (nameModule), the data it received (“Dave”), and the time in milliseconds at which it received the data (1275087426937). The after handler recorded the same fields for the processor's output (“Hey, Dave”).
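Because each record carries a timestamp, the file can be used to work out how long the processor took. The sketch below parses the two sample lines shown above and subtracts the timestamps. The class ProvenanceLineParser and its regular expression are assumptions based only on the line format in this sample; they are not part of MIF.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProvenanceLineParser {

    // Matches lines such as: "before module: nameModule. Data: [Dave] at 1275087426937"
    private static final Pattern LINE =
            Pattern.compile("^(before|after)\\s+module: (.+?)\\. Data: \\[(.*)\\] at (\\d+)$");

    /** Returns {phase, module, data, timestamp}; throws if the line does not match the format. */
    public static String[] parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized provenance line: " + line);
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    /** Milliseconds between the before record and the after record. */
    public static long elapsedMillis(String beforeLine, String afterLine) {
        long before = Long.parseLong(parse(beforeLine)[3]);
        long after = Long.parseLong(parse(afterLine)[3]);
        return after - before;
    }

    public static void main(String[] args) {
        String before = "before module: nameModule. Data: [Dave] at 1275087426937";
        String after = "after  module: nameModule. Data: [Hey, Dave] at 1275087426942";
        // 1275087426942 - 1275087426937 = 5
        System.out.println("elapsed ms: " + elapsedMillis(before, after));
    }
}
```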

Code Walkthrough

Implementing the provenance handler

The provenance handler in this sample writes its metadata to a file. As noted above, a provenance handler is created by extending the AbstractProvenanceHandler class:

public class FileProvenanceHandler extends AbstractProvenanceHandler {

The following is how our handleBefore and handleAfter methods are implemented:

	public void handleBefore(Serializable data, MifModuleMetadata moduleMetadata) {
		write("before", data, moduleMetadata);
	}
 
	public void handleAfter(Serializable data, MifModuleMetadata moduleMetadata) {
		write("after ", data, moduleMetadata);
	}

As you can see, both simply call a write() method, also implemented in this class, which prints the module name and time to standard output and writes the metadata (module name, data, and time in milliseconds) to the file:

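	private void write(String prepend, Serializable data, MifModuleMetadata moduleMetadata) {
		String module = moduleMetadata.getName();
		long time = System.currentTimeMillis();
		// Echo a short summary to the console
		System.out.println(prepend + " " + module + " at " + time);
 
		try {
			// Append the full record (module, data, time) to the provenance file
			writer.write(prepend + " module: " + module + ". Data: [" + data.toString() + "] at " + time + NL);
			// Flush immediately so the record is on disk even if the pipeline stops
			writer.flush();
		} 
		catch (IOException e) {
			// Surface I/O failures as unchecked exceptions
			throw new RuntimeException(e);
		}
	}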
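The snippets above leave out the writer and NL fields and the constructor. As a rough, self-contained sketch of how the rest of such a class might look, the version below opens the file in its constructor and flushes after each record. It is an assumption, not the sample's actual source: FileHandlerSketch is a hypothetical name, it does not extend AbstractProvenanceHandler, and it takes a plain module name instead of MifModuleMetadata so that it compiles without MIF.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.util.List;

public class FileHandlerSketch implements AutoCloseable {

    private static final String NL = System.lineSeparator();
    private final Writer writer;

    public FileHandlerSketch(File provFile) throws IOException {
        // Append mode, so before and after records accumulate in the same file.
        this.writer = new BufferedWriter(new FileWriter(provFile, true));
    }

    public void handleBefore(Serializable data, String module) {
        write("before", data, module);
    }

    public void handleAfter(Serializable data, String module) {
        write("after ", data, module);
    }

    private void write(String prepend, Serializable data, String module) {
        long time = System.currentTimeMillis();
        try {
            writer.write(prepend + " module: " + module + ". Data: [" + data + "] at " + time + NL);
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one before/after pair to a fresh temp file and returns the file's lines. */
    public static List<String> demo() {
        try {
            File f = File.createTempFile("provenance-", ".txt");
            f.deleteOnExit();
            try (FileHandlerSketch handler = new FileHandlerSketch(f)) {
                handler.handleBefore("Dave", "nameModule");
                handler.handleAfter("Hey, Dave", "nameModule");
            }
            return Files.readAllLines(f.toPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Note that the "after " prefix keeps its trailing space, which is why the after records in the file show two spaces before "module:".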

Constructing the pipeline

The HelloProvenanceDriver class constructs a pipeline which contains the nameModule from the hello world example:

MifModule nameModule = pipeline.addMifModule(HelloNameProcessor.class, "stdio://stdin?promptMessage=enter name: ", null);
nameModule.setName("nameModule");

Then, a MifProvenanceHandler is created and set on the nameModule:

MifProvenanceHandler provHandler = pipeline.addProvenanceHandler( new FileProvenanceHandler(provFile) );
nameModule.setProvenanceHandler(provHandler);

Then, the pipeline is started as normal – that's all there is to it.

Endpoint Provenance Handler

In addition to user-created provenance handlers, MIF also includes a predefined handler which simply intercepts a module's data before and after a processor is called, sending the data to an arbitrary endpoint. This opens up the possibility of handling provenance on a remote machine or creating entire provenance processing pipelines.

For example, to change this sample to send its provenance metadata to a JMS endpoint, one would create an EndpointProvenanceHandler with the following code:

MifProvenanceHandler jmsProvHandler = pipeline.addEndpointProvenanceHandler("jms://topic:provenanceTopic");

Then, set the provenance handler on the nameModule as described above. Since this endpoint uses JMS, we also need to create a JMS connector as described in the transport documentation.

 
provenance_sample.txt · Last modified: 2010/05/28 16:35 by adamw