Visualizing provenance files as graphs

This class reads a file with serialized provenance data into a NetworkX graph.

It provides methods for manipulating the graph to simplify the visualization, and for selecting which details of the captured information are displayed as node attributes. Finally, it allows saving the graph in formats used by graph visualization software, such as GEXF or GraphML.

See the External tools for provenance visualization section in the Installation section for instructions on how to download and set up Gephi, which can be used to visualize GEXF files.

class alpaca.ProvenanceGraph(*prov_file, annotations=None, attributes=None, strip_namespace=True, remove_none=True, use_name_in_parameter=True, use_class_in_method_name=True, time_intervals=True, value_attribute=None)

Directed Acyclic Graph representing the provenance history stored in an RDF file structured with the Alpaca ontology.

The visualization is based on NetworkX, and the graph can be accessed through the graph attribute.

DataObjectEntity and FileEntity individuals are nodes, identified by their respective URIs. FunctionExecution activities are also loaded as nodes. Each of the three node types is identified by the type node attribute. Interval strings for timeline visualization in Gephi are provided as the Time Interval node attribute.

Each node has an attribute dictionary with a general description:

  • for DataObjectEntity, the label node attribute contains the Python class name of the object (e.g., ndarray). The Python_name node attribute contains the full path to the class in the package (e.g., numpy.ndarray);

  • for FileEntity, the label node attribute is File;

  • for FunctionExecution activities, the label node attribute contains the function name (e.g., mean), and the Python_name node attribute contains the full path to the function in the package (e.g., numpy.mean).

Each node may also have additional attributes in the dictionary, with extended information:

  • for DataObjectEntity, it contains the Python object attributes and annotations that were saved as metadata in the PROV file;

  • for FileEntity, it contains the file information such as path and hash;

  • for FunctionExecution activities, it contains the values of the parameters used to call the function.

The node attributes to be included are selected by the annotations and attributes parameters during the initialization.
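To make the node attribute layout concrete, the sketch below filters `(node, data)` pairs by their label attribute. This is the shape that NetworkX yields from `graph.nodes(data=True)`. The node identifiers and attribute values are made up for illustration; only the label and Python_name keys come from the description above.

```python
def nodes_with_label(node_data_pairs, label):
    """Return the IDs of the nodes whose 'label' attribute equals `label`."""
    return [node for node, data in node_data_pairs if data.get("label") == label]


# Hypothetical node data, standing in for `prov_graph.graph.nodes(data=True)`.
example_nodes = [
    ("urn:example:object:1", {"label": "ndarray", "Python_name": "numpy.ndarray"}),
    ("urn:example:function:1", {"label": "mean", "Python_name": "numpy.mean"}),
    ("urn:example:file:1", {"label": "File"}),
]

array_nodes = nodes_with_label(example_nodes, "ndarray")
```

With a loaded provenance graph, the same helper could presumably be called as `nodes_with_label(prov_graph.graph.nodes(data=True), "ndarray")`.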

Finally, the graph can be simplified using methods for condensing memberships (e.g., elements inside lists) and for aggregation (e.g., repeated operations in tracks generated by loops).

Parameters:
prov_file : str or Path-like

Source file(s) with RDF provenance data in the Alpaca format based on W3C PROV-O. If multiple files are provided, all will be loaded into the same graph object. This is useful to integrate provenance captured from several sources for visualization (e.g., steps in workflows or parallel processes).

annotations : tuple of str or ‘all’, optional

Names of the object annotations to display in the graph as node attributes. Annotations are defined as values of an annotation dictionary that might be present in the object (e.g., Neo objects). In the PROV file, they are identified with the hasAnnotation property in individuals of the DataObjectEntity class. If ‘all’, all the annotations in the objects are included. Default: None

attributes : tuple of str or ‘all’, optional

Names of the object attributes to display in the graph as node attributes. Attributes are regular Python object attributes. In the PROV file, they are identified with the hasAttribute property in individuals of the DataObjectEntity class. If ‘all’, all the attributes in the objects are included. Default: None

strip_namespace : bool, optional

If False, the namespaces (i.e., attribute or annotation) will be shown for each requested attribute/annotation. For example, for an attribute ‘shape’, if strip_namespace is False, the key in the node attributes will be the full name ‘attribute:shape’. If True, the key in the node attributes will be just ‘shape’. The namespaces are annotation and attribute for object annotations and attributes, respectively. Default: True
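The key naming rule described above can be sketched in a few lines. This is not Alpaca's implementation, and `node_attribute_key` is a hypothetical helper that only mirrors the documented behavior of strip_namespace.

```python
def node_attribute_key(name, namespace, strip_namespace=True):
    """Build the node attribute key for an object attribute or annotation.

    `namespace` is either "attribute" or "annotation".
    """
    if strip_namespace:
        return name
    return f"{namespace}:{name}"


short_key = node_attribute_key("shape", "attribute", strip_namespace=True)
full_key = node_attribute_key("shape", "attribute", strip_namespace=False)
```

Keeping the namespace is mainly useful when an object has an attribute and an annotation with the same name, since stripped keys would then collide.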

remove_none : bool, optional

If True, the nodes representing the return values of functions that return None will be removed from the graph. This is useful to avoid clutter if a function that returns None is called frequently. Default: True

use_name_in_parameter : bool, optional

If True, the function name will be added to the parameter name in the node attributes (e.g., ‘function:param’). If False, the parameter name will be shown with a generic tag (e.g., ‘parameter:param’). Use this option if different functions share the same parameter names, to avoid ambiguity. Default: True

use_class_in_method_name : bool, optional

If True, function nodes that are methods in classes will be labeled with the class name as prefix (e.g., ClassName.method_name). If False, only the method name will appear in the node label (e.g., method_name). Default: True

time_intervals : bool, optional

If True, the nodes will have the Time Interval attribute containing time interval strings in the format supported by the Gephi timeline feature. If False, the attribute is not included. Default: True

value_attribute : str, optional

If provided, an attribute named value_attribute will be added to the node attributes to show the values stored in the provenance information. Alpaca stores the values of objects of the builtin types str, bool, int, float and complex, as well as the NumPy numeric types (e.g. numpy.float64) by default. The values of additional types can be defined using the alpaca.settings.alpaca_setting() function. Default: None

Attributes:
graph : nx.DiGraph

The NetworkX graph object representing the provenance read from the PROV file.

aggregate(group_node_attributes, use_function_parameters=True, output_file=None, remove_attributes=None, record_members=True)

Creates a summary graph based on a selection of attributes of the nodes in the graph.

The attributes can be individualized for each node label (as defined by the label node attribute), so that different levels of aggregation are possible. Therefore, it is possible to generate visualizations with different levels of detail to progressively inspect the provenance trace.

In the summarized nodes, the member_count node attribute stores the number of nodes in the group. If requested, the list with the IDs of the original nodes that are part of that group can be stored in the members node attribute.
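As a rough illustration of what member_count and members hold, the sketch below groups a handful of hypothetical nodes by label and selected attribute values. It deliberately ignores graph topology, so it is a simplified stand-in and not the SNAP-based aggregation this method performs. The `group_by_attributes` helper and the node data are assumptions made for the example.

```python
from collections import defaultdict


def group_by_attributes(nodes, group_node_attributes):
    """Group nodes by label plus the attribute values selected for that label.

    Topology is ignored here, unlike in the actual aggregation.
    """
    groups = defaultdict(list)
    for node_id, data in nodes.items():
        label = data["label"]
        selected = group_node_attributes.get(label, ())
        key = (label,) + tuple(data.get(attr) for attr in selected)
        groups[key].append(node_id)
    return {
        key: {"member_count": len(members), "members": members}
        for key, members in groups.items()
    }


# Hypothetical nodes: three Quantity objects with two distinct shapes.
nodes = {
    "q1": {"label": "Quantity", "shape": "(10,)"},
    "q2": {"label": "Quantity", "shape": "(10,)"},
    "q3": {"label": "Quantity", "shape": "(5,)"},
}
summary = group_by_attributes(nodes, {"Quantity": ("shape",)})
```

Here the two nodes sharing a shape collapse into one group with a member count of 2, while the third node forms its own group.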

Parameters:
group_node_attributes : dict

Dictionary selecting which attributes are used in the aggregation. The keys are the possible labels in the graph, and the values are tuples of the node attributes or callables used for determining supernodes.

For example, to aggregate Quantity nodes based on different shape attribute values, group_node_attributes would be {‘Quantity’: (‘shape’,)}. If an empty dictionary is passed, no attributes will be considered, and the aggregation will be based on the topology (i.e., nodes at similar levels will be grouped according to their connectivity).

In addition to attribute names, callables that take the arguments (graph, node, data), where graph is the graph being aggregated, node is the node being evaluated for grouping, and data is the dictionary of attributes, can be used. The returned value is used to define the group. This allows flexibility when grouping, as attribute values can be transformed (e.g., extracting a token such as file extension from an attribute that stores the path as a string), or the relationship of the node to neighbors and values of edges can be checked. However, this will increase the time to evaluate the grouping criteria of a node.
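A grouping callable for the file extension case mentioned above might look like the following sketch. The `(graph, node, data)` signature and the File and Quantity labels come from this documentation; the `File_path` attribute name and the example path are assumptions, so check the keys actually present in your graph.

```python
from pathlib import PurePosixPath


def file_extension(graph, node, data):
    """Group key: the extension of the path stored in the node attributes."""
    # "File_path" is an assumed attribute name, used only for illustration.
    return PurePosixPath(data["File_path"]).suffix


# Attribute names and callables can be mixed, one tuple per node label.
group_node_attributes = {
    "Quantity": ("shape",),
    "File": (file_extension,),
}

# Calling the function directly with hypothetical node data.
key = file_extension(None, "urn:example:file:1", {"File_path": "/data/session1.nix"})
```

The dictionary would then be passed as the first argument to aggregate, so that file nodes are grouped by extension instead of by their full path.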

use_function_parameters : bool, optional

If True, the parameters of function nodes in the graph will be considered in the aggregation, i.e., if the same function is called with different parameters, different supernodes will be generated. If False, a single supernode will be produced, regardless of the different parameters used. Default: True

output_file : str or Path-like, optional

If None, an nx.DiGraph object will be returned. If not None, the graph will be saved to the provided path, and the function will return None. The file must have either the .gexf or the .graphml extension, to save in the GEXF or GraphML format, respectively. Default: None

remove_attributes : str or tuple of str, optional

Remove the specified node attributes from the aggregated graph. Default: None

record_members : bool, optional

If True, the summarized nodes will have the members attribute with the identifiers of all nodes that are part of the group. Default: True

Returns:
nx.DiGraph or None

If an output file was not specified in output_file, returns the aggregated graph as a NetworkX object. The original graph stored in graph is not modified. If an output file was specified, returns None.

Raises:
ValueError

If output_file is not None and the file does not have either ‘.gexf’ or ‘.graphml’ as extension.
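The documented contract for output_file can be sketched as a small check. This mirrors the behavior described here and is not the library's code. `check_output_file` is a hypothetical helper, and treating the extension as case-sensitive is an assumption.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = (".gexf", ".graphml")


def check_output_file(output_file):
    """Return the extension of `output_file`, or None if no file was given.

    Raises ValueError for any extension other than .gexf or .graphml.
    """
    if output_file is None:
        return None
    suffix = Path(output_file).suffix
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"Expected a '.gexf' or '.graphml' file, got '{output_file}'"
        )
    return suffix


extension = check_output_file("summary.gexf")
```

Validating the path before calling aggregate can avoid running a long aggregation only to have it fail at the saving step.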

Notes

This function is an adaptation of the snap_aggregation function included in NetworkX 2.6, which implemented the SNAP algorithm based on [1].

The function was modified to group the nodes based on different attributes or callables (using a dictionary based on the labels) instead of attributes that are common to all nodes.

During the summary graph generation, the attribute values are also summarized, so that the user has an idea of all the possible values in the group.

Please refer to the Open software licenses section for copyright and license information.

References

[1] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation for graph summarization. In Proc. 2008 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’08), pages 567–580, Vancouver, Canada, June 2008.

condense_memberships(preserve=None)

Condense sequential entity membership relationships into a single node. This operation is done in-place, i.e., the graph stored as graph will be modified.

Membership relationships are used to describe relationships such as attributes (e.g. block.segments) or membership in containers (e.g., spiketrains[0]).

Parameters:
preserve : tuple of str, optional

Labels of nodes that should not be condensed if present in a membership relationship. Default: None

remove_attributes(*attributes)

Remove one or more attributes from the nodes.

Parameters:
attributes : str

Key(s) identifying the attribute(s) to be removed from the node attribute dictionary.

save_gexf(file_name)

Writes the current provenance graph as a GEXF file.

save_graphml(file_name)

Writes the current provenance graph as a GraphML file.
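Putting the pieces together, a typical session might look like the sketch below. It uses only the constructor arguments and methods documented on this page. The `export_provenance` wrapper, the choice of the `shape` attribute, and the file names in the usage comment are assumptions made for illustration.

```python
def export_provenance(prov_file, gexf_file, graphml_file):
    """Load a provenance file, condense memberships, and save both formats."""
    # Imported here so that defining the function does not require Alpaca.
    from alpaca import ProvenanceGraph

    # Show the `shape` attribute and every annotation as node attributes.
    prov_graph = ProvenanceGraph(prov_file, attributes=("shape",), annotations="all")

    # In-place simplification of the graph stored in `prov_graph.graph`.
    prov_graph.condense_memberships()

    prov_graph.save_gexf(gexf_file)
    prov_graph.save_graphml(graphml_file)

    # The underlying NetworkX graph, for further inspection.
    return prov_graph.graph


# Hypothetical usage (the file names are placeholders):
# graph = export_provenance("analysis.ttl", "analysis.gexf", "analysis.graphml")
```

The resulting GEXF file can then be opened in Gephi, as mentioned at the top of this page.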