icat.ingest — Ingest metadata into ICAT
Added in version 1.1.0.
This module provides the class icat.ingest.IngestReader, which
reads metadata ingest files and adds their content to ICAT. It is designed
for the use case of ingesting metadata for datasets created during
experiments.
The IngestReader is based on the general purpose
class XMLDumpFileReader. It differs from
that base class in restricting the vocabulary of the input file: only
objects that need to be created during ingestion from the experiment
may appear in the input. This restriction is enforced by first
validating the input against an XML Schema Definition (XSD). In a
second step, the input is transformed into generic ICAT data XML
file format using an XSL Transformation (XSLT)
and then fed into XMLDumpFileReader. The
format of the input files may be customized to some extent by
providing custom versions of XSD and XSLT files, see
Customizing the input format below.
Some attributes and relations of the Dataset objects are
prescribed during the transformation into ICAT data file format,
namely the complete attribute and the name of the DatasetType
to relate them to. The prescribed values are set in class attributes
Dataset_complete and
DatasetType_name respectively. They
may be customized by overriding these class attributes.
The Dataset objects in the input will not be created by
IngestReader, because it is assumed that a
separate workflow in the caller will copy the content of datafiles to
the storage managed by IDS and create the corresponding Dataset
and Datafile objects in ICAT at the same time. But the attributes
of the datasets will be read from the input file and set in the
Dataset objects by IngestReader.
IngestReader will also create the related
DatasetTechnique, DatasetInstrument and DatasetParameter
objects read from the input file in ICAT.
- class icat.ingest.IngestReader(client, metadata, investigation)
Bases: XMLDumpFileReader
Read metadata from XML ingest files into ICAT.
The input file may contain one or more datasets and related objects that must all belong to a single investigation. The file is first validated against an XML Schema Definition (XSD) and then transformed on-the-fly into generic ICAT data file format using an XSL Transformation (XSLT). The result of that transformation is fed into the parent class XMLDumpFileReader.
- Parameters:
client (icat.client.Client) – a client object configured to connect to the ICAT server that the objects should be created in.
metadata (Path or file object) – the input file. Either the path to the file or a file object opened for reading binary data.
investigation (icat.entity.Entity) – the investigation object that the input data should belong to.
- Raises:
icat.exception.InvalidIngestFileError – if the input in metadata is not valid.
Changed in version 1.3.0: drop class attribute XSLT_name in favour of XSLT_Map.
Changed in version 1.3.0: inject an element _environment as first child of the root element into the input data.
- SchemaDir = PosixPath('/usr/share/icat')
Path to a directory to read XSD and XSLT files from.
- XSD_Map = {('icatingest', '1.0'): 'ingest-10.xsd', ('icatingest', '1.1'): 'ingest-11.xsd'}
A mapping to select the XSD file to use. Keys are pairs of root element name and version attribute, the values are the corresponding name of the XSD file.
- XSLT_Map = {'icatingest': 'ingest.xslt'}
A mapping to select the XSLT file to use. Keys are the root element name, the values are the corresponding name of the XSLT file.
Added in version 1.3.0.
- Dataset_complete = 'false'
Value to prescribe in the complete attribute of datasets.
Note
The value for this class attribute is subject to change in version 2.0. You might want to override it in order to pin it to a value that is suitable for you.
Added in version 1.5.0.
- DatasetType_name = 'raw'
Name of the DatasetType to relate datasets to.
Added in version 1.5.0.
- get_xsd(ingest_data)
Get the XSD file.
Inspect the root element in the input data and look up the tuple of element name and version attribute in XSD_Map. The value is taken as a file name relative to SchemaDir and this path is returned.
Subclasses may override this method to customize the XSD file to use. These derived versions may inspect the input data to select the appropriate file. Derived versions should raise InvalidIngestFileError if they decide to reject the input data.
- Parameters:
ingest_data (lxml.etree._ElementTree) – input data
- Returns:
path to the XSD file.
- Return type:
- Raises:
icat.exception.InvalidIngestFileError – if the pair of root element name and version attribute could not be found in
XSD_Map.
- get_xslt(ingest_data)
Get the XSLT file.
Inspect the root element in the input data and look up the element name in XSLT_Map. The value is taken as a file name relative to SchemaDir and this path is returned.
Subclasses may override this method to customize the XSLT file to use. These derived versions may inspect the input data to select the appropriate file. Derived versions should raise InvalidIngestFileError if they decide to reject the input data.
- Parameters:
ingest_data (lxml.etree._ElementTree) – input data
- Returns:
path to the XSLT file.
- Return type:
- Raises:
icat.exception.InvalidIngestFileError – if the root element name could not be found in
XSLT_Map.
Changed in version 1.3.0: look up the root element name in XSLT_Map rather than using a static file name.
- get_environment(client)
Get the environment to be injected as an element into the input.
Subclasses may override this method to control the attributes set in the environment.
Note
If you override this method, it is advisable to call the inherited method from the parent class and augment the result. This avoids inadvertently dropping environment settings added in future versions. E.g. do something like the following in your subclass:
    def get_environment(self, client):
        # Start from the inherited settings, then add custom keys.
        env = super().get_environment(client)
        env['mykey'] = 'value'
        return env
- Parameters:
client (icat.client.Client) – the client object being used by this IngestReader.
- Returns:
the environment.
- Return type:
Added in version 1.3.0.
- add_environment(client, ingest_data)
Inject environment information into input data.
The attributes set in the environment are determined by calling get_environment(). Subclasses may override this method to fully control the process of adding the environment element.
- Parameters:
client (icat.client.Client) – the client object being used by this IngestReader.
ingest_data (lxml.etree._ElementTree) – input data
Added in version 1.3.0.
- getobjs_from_data(data, objindex)
Iterate over the objects in a data chunk.
Yield a new entity object in each iteration. The object is initialized from the data, but not yet created at the client.
- getobjs()
Iterate over the objects in the ingest file.
- ingest(datasets, dry_run=False, update_ds=False)
Ingest metadata from an ingest file.
Read the metadata provided as argument to the constructor. The acceptable set of objects in the input is restricted: only Dataset and related DatasetInstrument, DatasetTechnique, and DatasetParameter objects are allowed. The Dataset objects must be in the list provided as argument.
If dry_run is False, the related objects will be created in ICAT. In this case, the datasets in the argument must already have been created in ICAT beforehand (i.e. the id attribute must be set). If dry_run is True, the objects in the metadata will be checked for conformance, but nothing will be committed to ICAT. In this case, the datasets don’t need to be created beforehand.
If update_ds is True, the objects in the datasets argument will be updated: the attributes and the relations to other objects will be set to the values read from the input. This is particularly useful in conjunction with dry_run in order to update the datasets from the metadata prior to creating them in ICAT.
- Parameters:
datasets (iterable of icat.entity.Entity) – list of allowed datasets in the input.
dry_run (bool) – if True, check the input without creating the related objects.
update_ds (bool) – flag whether to update the datasets in the argument.
- Raises:
icat.exception.InvalidIngestFileError – if the input is not valid, for instance if there is any unallowed object or duplicate objects.
icat.exception.SearchResultError – if any object references in the input could not be resolved.
Ingest process
The processing of the metadata during the instantiation of an
IngestReader object may be summarized by the
following steps:
1. Read the metadata and parse it into an lxml.etree._ElementTree.
2. Call get_xsd() to get the appropriate XSD file and validate the metadata against that schema.
3. Inject an _environment element as first child of the root element, see below.
4. Call get_xslt() to get the appropriate XSLT file and transform the metadata into generic ICAT data XML file format.
5. Feed the result of the transformation into the parent class XMLDumpFileReader.
Once this initialization is done,
ingest() may be called to read the
individual objects defined in the metadata.
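The validation, injection, and transformation steps listed above can be sketched with plain lxml calls. This is a minimal illustration under stated assumptions, not python-icat's actual implementation: the function name prepare_ingest_data, its parameters, and the use of ValueError in place of InvalidIngestFileError are all hypothetical. The last step, feeding the result into XMLDumpFileReader, is left out.

```python
from lxml import etree


def prepare_ingest_data(metadata, xsd_file, xslt_file, environment):
    """Validate, annotate, and transform an ingest file (illustrative only).

    metadata, xsd_file and xslt_file are paths or binary file objects;
    environment is a dict of attribute names and string values.
    """
    # Step 1: parse the metadata into an element tree.
    ingest_data = etree.parse(metadata)

    # Step 2: validate against the schema before anything is modified.
    schema = etree.XMLSchema(etree.parse(xsd_file))
    if not schema.validate(ingest_data):
        # ValueError is a stand-in for icat.exception.InvalidIngestFileError.
        raise ValueError(f"invalid ingest file: {schema.error_log}")

    # Step 3: inject _environment as the first child of the root element.
    env = etree.Element("_environment", attrib=environment)
    ingest_data.getroot().insert(0, env)

    # Step 4: transform into the generic data format using the stylesheet.
    transform = etree.XSLT(etree.parse(xslt_file))
    return transform(ingest_data)
```

Because the environment element is injected only after validation, the schema in this sketch does not need to know about it, while the stylesheet can still read its attributes.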
The environment element
During the processing of the metadata, an _environment element
will be injected as the first child of the root element. In the
current version of python-icat, this _environment element has the
following attributes:
- icat_version
Version of the ICAT server this client connects to, i.e. the icat.client.Client.apiversion attribute of the client object being used by this IngestReader.
- dataset_complete
The value of Dataset_complete.
- datasettype_name
The value of DatasetType_name.
More attributes may be added in future versions. This
_environment element may be used by the XSLT in order to adapt the
result of the transformation to the environment, in particular to
adapt the output to the ICAT schema version it is supposed to conform
to.
Changed in version 1.5.0: add attributes dataset_complete and datasettype_name.
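To make the shape of the injected element concrete, here is a small sketch of inserting an _environment element with the three attributes listed above. It uses the standard library xml.etree.ElementTree as a stand-in for lxml so that it runs without extra dependencies. The helper name inject_environment and the server version string "5.0.1" are made up for illustration and are not part of python-icat.

```python
import xml.etree.ElementTree as ET


def inject_environment(root, icat_version, dataset_complete, datasettype_name):
    """Insert an _environment element as the first child of root."""
    env = ET.Element("_environment", {
        "icat_version": icat_version,
        "dataset_complete": dataset_complete,
        "datasettype_name": datasettype_name,
    })
    # Index 0 makes it the first child, ahead of the existing content.
    root.insert(0, env)
    return env


# Hypothetical input; "5.0.1" is an invented server version.
root = ET.fromstring('<icatingest version="1.1"><data/></icatingest>')
inject_environment(root, "5.0.1", "false", "raw")
print(ET.tostring(root, encoding="unicode"))
```

An XSLT stylesheet could then read these values with expressions such as _environment/@icat_version, relative to the root element.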
Ingest example
It is assumed that the XSD and XSLT files (ingest-*.xsd,
ingest.xslt) provided with the python-icat source distribution are
installed in the directory pointed to by the class attribute
SchemaDir of
IngestReader. The core of an ingest script
might then look like:
from pathlib import Path
import icat
from icat.ingest import IngestReader
# prerequisite: search the investigation object to ingest into from
# ICAT and collect a list of dataset objects that should be ingested
# from the data collected at the experiment. The datasets should be
# instantiated (client.new('Dataset')) and include their respective
# datafiles, but not yet created at this point:
# investigation = client.assertedSearch(...)[0]
# datasets = [...]
# metadata = Path(...path to ingest file...)
# Make a dry run to check for errors and fail early, before having
# committed anything to ICAT yet. As a side effect, this will
# update the datasets, setting the attribute values that are read
# from the input file:
try:
    reader = IngestReader(client, metadata, investigation)
    reader.ingest(datasets, dry_run=True, update_ds=True)
except (icat.InvalidIngestFileError, icat.SearchResultError) as e:
    raise RuntimeError("invalid ingest file") from e
# Create the datasets. In a real production script, you'd copy the
# content of the datafiles to IDS storage at the same time:
for ds in datasets:
    ds.create()
# Now read the metadata into ICAT for real:
reader.ingest(datasets)
There is a somewhat more complete script in the example directory of the python-icat source distribution.
Customizing the input format
The ingest input file format may be customized by providing custom XSD
and XSLT files. The easiest way to do that is to subclass
IngestReader. In most cases, you’d only need to
override some class attributes as follows:
from pathlib import Path
import icat.ingest
class MyFacilityIngestReader(icat.ingest.IngestReader):
    # Override the directory to search for XSD and XSLT files:
    SchemaDir = Path("/usr/share/icat/my-facility")
    # Override the XSD files to use:
    XSD_Map = {
        ('legacyingest', '0.5'): "legacy-ingest-05.xsd",
        ('myingest', '4.3'): "my-ingest-40.xsd",
    }
    # Override the XSLT file to use:
    XSLT_Map = {
        'legacyingest': "legacy-ingest.xslt",
        'myingest': "my-ingest.xslt",
    }
XSD_Map and
XSLT_Map are mappings with
properties of the root element of the input data as keys and file
names as values. The methods
get_xsd() and
get_xslt() respectively inspect the
input file and use these mappings to select the XSD and XSLT file
accordingly. Note that XSD_Map
takes tuples of root element name and version attribute as keys, while
XSLT_Map uses the root element name
alone. It is assumed that it is fairly easy to
formulate adaptations to the input version directly in XSLT, so one
single XSLT file would be sufficient to cover all versions.
In the above example, MyFacilityIngestReader would recognize input files like
<?xml version='1.0' encoding='UTF-8'?>
<legacyingest version="0.5">
<!-- ... -->
</legacyingest>
and
<?xml version='1.0' encoding='UTF-8'?>
<myingest version="4.3">
<!-- ... -->
</myingest>
Input files having any other combination of root element name and version number would be rejected.
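The selection logic described above can be illustrated with a standalone sketch. It mimics the documented lookups using only the standard library and is not the code of get_xsd() or get_xslt(). The names select_files and RejectedInput are hypothetical, with RejectedInput standing in for InvalidIngestFileError. The maps are copied from the MyFacilityIngestReader example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Values copied from the MyFacilityIngestReader example.
SCHEMA_DIR = Path("/usr/share/icat/my-facility")
XSD_MAP = {
    ('legacyingest', '0.5'): "legacy-ingest-05.xsd",
    ('myingest', '4.3'): "my-ingest-40.xsd",
}
XSLT_MAP = {
    'legacyingest': "legacy-ingest.xslt",
    'myingest': "my-ingest.xslt",
}


class RejectedInput(Exception):
    """Hypothetical stand-in for icat.exception.InvalidIngestFileError."""


def select_files(xml_text):
    """Return the (XSD path, XSLT path) selected for an input document."""
    root = ET.fromstring(xml_text)
    # The XSD is keyed on (root element name, version attribute) ...
    xsd_key = (root.tag, root.get("version"))
    try:
        xsd_name = XSD_MAP[xsd_key]
        # ... while the XSLT is keyed on the root element name alone.
        xslt_name = XSLT_MAP[root.tag]
    except KeyError:
        raise RejectedInput(f"unknown ingest format {xsd_key}") from None
    return SCHEMA_DIR / xsd_name, SCHEMA_DIR / xslt_name


print(select_files('<myingest version="4.3"/>'))
```

Calling select_files('<myingest version="9.9"/>') raises RejectedInput in this sketch, because the pair of root element name and version is not a key of XSD_MAP even though the root element name alone is known to XSLT_MAP.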
In more involved scenarios of selecting the XSD or XSLT files based on
the input, one may also override the
get_xsd() and
get_xslt() methods.