Provenance tracker decorator

The Alpaca Provenance class is used as a decorator to instrument the functions in a script, so that their inputs, outputs, and parameters are tracked.

class alpaca.Provenance(inputs, file_input=None, file_output=None, container_input=None, container_output=False)

Class to capture and store provenance information in Python scripts.

The class is a callable object, to be used as a decorator on every function in the script whose calls should be tracked.
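The "callable object used as a decorator" pattern described above can be sketched in plain Python. This is a minimal illustration, not Alpaca's implementation: the class name `MiniTracker`, its shared `history` list, and the contents of each history entry are hypothetical stand-ins chosen for this example.

```python
import functools


class MiniTracker:
    """Hypothetical stand-in for a provenance decorator (not Alpaca's code)."""

    # Shared by all decorated functions, so calls are recorded in one place.
    history = []

    def __init__(self, inputs):
        # Mirrors the documented check: `inputs` must be a list or None.
        if inputs is not None and not isinstance(inputs, list):
            raise ValueError("`inputs` must be a list or None")
        self.inputs = inputs or []

    def __call__(self, function):
        # Calling the instance with a function returns the wrapped function,
        # which is what makes the instance usable after `@`.
        @functools.wraps(function)
        def wrapped(*args, **kwargs):
            result = function(*args, **kwargs)
            order = len(MiniTracker.history) + 1
            MiniTracker.history.append((function.__name__, order))
            return result

        return wrapped


@MiniTracker(inputs=["data"])
def scale(data, factor=2):
    return [value * factor for value in data]


scaled = scale([1, 2, 3], factor=10)
```

After the call, `MiniTracker.history` holds one entry for `scale`. The real class records far more per call, as listed under the `history` attribute.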

Parameters:
inputs : list of str

Names of the arguments that are considered inputs to the function. An input is a variable or value on which the function performs some computation or action. Arguments that only control the behavior of the function are considered parameters. The names can refer to either positional or keyword arguments. Every argument that is not named in inputs, container_input, file_input or file_output will be considered a parameter. If a function does not take any input (e.g., functions that generate data), inputs can be set to an empty list or None, in which case it is ignored.
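The classification rule above (any argument not named in one of the lists is a parameter) can be sketched with the standard `inspect` module. This is an illustrative sketch only: the helper `classify_arguments` and the example function `smooth` are hypothetical, and including default values among the parameters is a choice of this sketch, not a documented Alpaca behavior.

```python
import inspect


def classify_arguments(function, args, kwargs, inputs=None, file_input=None,
                       file_output=None, container_input=None):
    """Split the arguments of one call into roles (hypothetical helper)."""
    # Map positional and keyword arguments onto the parameter names.
    bound = inspect.signature(function).bind(*args, **kwargs)
    bound.apply_defaults()

    roles = {
        "inputs": inputs or [],
        "file_input": file_input or [],
        "file_output": file_output or [],
        "container_input": container_input or [],
    }
    named = {name for names in roles.values() for name in names}

    result = {role: {name: bound.arguments[name] for name in names}
              for role, names in roles.items()}
    # Anything not named in one of the lists above is treated as a parameter.
    result["params"] = {name: value
                        for name, value in bound.arguments.items()
                        if name not in named}
    return result


def smooth(signal, window=3, mode="valid"):
    return signal


roles = classify_arguments(smooth, ([1, 2, 3],), {"window": 5},
                           inputs=["signal"])
```

Here `signal` ends up under `"inputs"`, while `window` and `mode` end up under `"params"`, regardless of whether they were passed positionally or by keyword.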

file_input : list of str, optional

Names of the arguments that represent file(s) read from the disk by the function. Their hashes will be computed and stored. Default: None

file_output : list of str, optional

Names of the arguments that represent file(s) written to the disk by the function. Their hashes will be computed and stored. Default: None
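As an illustration of what computing a file hash involves, the following sketch hashes a file in chunks with the standard `hashlib` module. The choice of SHA-256, the chunk size, and the helper name `file_hash` are assumptions of this example; the text above does not state which algorithm Alpaca uses.

```python
import hashlib
import os
import tempfile


def file_hash(path, chunk_size=65536):
    """Return the hex digest of a file (SHA-256 is an assumption here)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Write a small temporary file and hash it.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data.txt")
    with open(path, "wb") as handle:
        handle.write(b"abc")
    digest = file_hash(path)
```

Two files with identical content produce the same digest, which is what allows a stored hash to identify the exact file version that a function read or wrote.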

container_input : list of str, optional

Names of the arguments that are containers of data (e.g., a list with data structures used by the function). Alpaca will track and identify the elements inside the container, instead of the container itself. Default: None

container_output : bool or int or tuple, optional

The function outputs data inside a container (e.g., a list).

If True, Alpaca will track and identify the elements inside the container, instead of the container itself. It will iterate over the function output object and identify the individual elements. However, for dictionary outputs, the dictionary object is identified together with its elements, to retain information on the keys. For other containers, the container object is not identified.

If an integer, this defines a multiple-level (nested) container. The number defines the depth down to which the objects are identified and serialized. In this case, the function output object will always be identified together with the element tree. For instance, consider the two-level list L = [[obj1, obj2], [obj3, obj4]]. With container_output=0, there will be a single function output node for list L. Starting from L, there will be one additional node for each of the two inner lists (L[0] and L[1], i.e., all elements from level zero). With container_output=1, there will again be a single function output node for list L and, starting from L, one additional node for each of the two inner lists (L[0] and L[1]). Finally, starting from each inner list, there will be output nodes for obj1 and obj2 (linked to L[0]) and for obj3 and obj4 (linked to L[1]). Therefore, all elements from level one are identified, and linked to the respective elements from level zero.

If a tuple, this defines a range of the levels in a nested container to consider when identifying the objects output by the function. For example, taking the same list above, container_output=(0, 1) will start from level zero and stop at the elements from level one (similar to container_output=1). With container_output=(1, 1), the top-level list L itself will be ignored as function output. The function will have two output nodes (directly for L[0] and L[1]). Starting from each inner list, there will be output nodes for obj1 and obj2 (linked to L[0]) and for obj3 and obj4 (linked to L[1]). Therefore, the first level (zero) of the container is ignored, and only elements from level one are described. The range feature is useful for functions where the relevant outputs are containers whose elements should also be described, but those containers are grouped inside a single return list instead of the function returning a tuple with the containers.

Note that every level given as an integer or as a range tuple must refer to a level of the nested container that holds iterables. For example, in the list L above, level 2 corresponds to the objects objX. If container_output=2, Alpaca will try to iterate over each objX and describe its elements. If they are not iterable, an error will be raised.

Default: False
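The integer and tuple forms of container_output described above can be sketched as a recursive walk over the nested output. This is an illustration of the documented level semantics, not Alpaca's implementation: the helper `container_nodes`, the string labels, and the (node, parent) pairs are hypothetical. The boolean form is not covered by this sketch.

```python
def container_nodes(output, levels):
    """Return (node_label, parent_label) pairs for a nested output.

    `levels` is an int (depth) or a (start, stop) tuple. Booleans are
    not handled here; pass only plain ints or tuples.
    """
    # An integer n behaves like the range (0, n).
    start, stop = (0, levels) if isinstance(levels, int) else levels
    nodes = []

    def walk(obj, label, depth, parent):
        # Containers above the start level are skipped as output nodes.
        if depth >= start:
            nodes.append((label, parent))
            parent = label
        # Elements of a container at `depth` belong to level `depth`.
        if depth <= stop:
            for index, element in enumerate(obj):
                walk(element, f"{label}[{index}]", depth + 1, parent)

    walk(output, "L", 0, None)
    return nodes


# Integers stand in for obj1..obj4 from the text.
L = [[1, 2], [3, 4]]

level_zero = container_nodes(L, 0)      # L plus the two inner lists
level_one = container_nodes(L, 1)       # L, inner lists, and their elements
inner_only = container_nodes(L, (1, 1)) # inner lists and their elements
```

With `levels=2`, the walk would try to iterate over the integers themselves and fail with a `TypeError`, which corresponds to the error case described above for non-iterable objects.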

Raises:
ValueError

If inputs is neither a list nor None.

Attributes:
active : bool

If True, provenance tracking is active. If False, provenance tracking is suspended. This attribute is set using the activate()/deactivate() interface functions.

history : list of FunctionExecution

All events that were tracked. Each function call is stored as a FunctionExecution named tuple with the following fields:

  • ‘function’: FunctionInfo named tuple;

  • ‘inputs’: dict with the DataObject or File named tuples associated with every input value to the function;

  • ‘params’: dict whose keys are the names of the positional/keyword arguments that are not data inputs, file inputs, or file outputs, and whose values are the argument values as passed to the function call;

  • ‘output’: dict with the DataObject or File named tuples associated with the values returned by the function or files written to the disk;

  • ‘arg_map’: names of the positional arguments;

  • ‘kwarg_map’: names of the keyword arguments;

  • ‘call_ast’: ast.AST object containing the Abstract Syntax Tree of the code that generated the function call;

  • ‘code_statement’: str with the code statement calling the function;

  • ‘time_stamp_start’, ‘time_stamp_end’: str with the ISO representation of the start and end times of the function execution;

  • ‘return_targets’: names of the variables that store the function output(s) in the source code;

  • ‘order’: integer defining the order of this function call in the whole tracking history;

  • ‘execution_id’: str with the UUID of the particular function execution tracked.
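To make the shape of one history entry concrete, the sketch below builds a named tuple with the fields listed above and fills it with sample values. It is an assumption-laden illustration: the tuple name `ExecutionRecord`, the field order, and the placeholder strings used for the function and data objects are hypothetical, since the structures of FunctionInfo, DataObject and File are not given here.

```python
import ast
import datetime
import uuid
from collections import namedtuple

# Field names follow the list above; their order is an assumption.
ExecutionRecord = namedtuple("ExecutionRecord", [
    "function", "inputs", "params", "output", "arg_map", "kwarg_map",
    "call_ast", "code_statement", "time_stamp_start", "time_stamp_end",
    "return_targets", "order", "execution_id",
])

code_statement = "result = scale(data, factor=2)"
statement = ast.parse(code_statement).body[0]  # an ast.Assign node

start = datetime.datetime.now().isoformat()
end = datetime.datetime.now().isoformat()

record = ExecutionRecord(
    function="scale",                    # placeholder for FunctionInfo
    inputs={"data": "<data object>"},    # placeholder for DataObject
    params={"factor": 2},
    output={0: "<data object>"},         # placeholder for DataObject
    arg_map=["data"],
    kwarg_map=["factor"],
    call_ast=statement.value,            # the ast.Call node of the call
    code_statement=code_statement,
    time_stamp_start=start,
    time_stamp_end=end,
    # Names on the left-hand side of the assignment.
    return_targets=[target.id for target in statement.targets
                    if isinstance(target, ast.Name)],
    order=1,
    execution_id=str(uuid.uuid4()),
)
```

Because the entries are named tuples, fields are read by attribute, e.g. `record.params["factor"]` or `record.return_targets`.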

source_file : str

Path to the script file being tracked.

session_id : str

Unique identifier (UUID) for this script execution.

inputs : list

Names of the function arguments that are considered inputs.

file_inputs : list

Names of the function arguments that are considered file inputs.

file_outputs : list

Names of the function arguments that are considered file outputs.

container_inputs : list

Names of the function arguments that are considered containers of data.

container_output : bool

True if the function outputs data in a container.