Metadata Guide


This guide provides a simple, generic, working-level view of the needs, issues and processes around collecting and creating metadata for research data [7].

It is based on: Australian Research Data Commons (ARDC). (2020). ARDC Metadata Guide. Zenodo. https://doi.org/10.5281/zenodo.6459832

What is metadata?

Metadata is generally summarised as ‘data about data’. The following examples show what this means in practice:

Finding data

Data formats such as text can be indexed and searched themselves (as in a simple Google search). However, the ability to search formats like audio, images and video is limited, and discoverability relies on searching the metadata. Discovery metadata helps researchers find data that, for example:

Determining the value of data

To assess the usefulness, value and quality of a data set, researchers need to understand the context around the data. This is given in metadata that:

Accessing data

Access to research data requires:

Using and reusing data

To make use of any data set, researchers need metadata on:

Proper recording of this information is important for the producers of the data as well as future users [7].

In agricultural field trials, it is common practice to take soil samples under standardised conditions and evaluate them using specific analysis methods. If important metadata such as the exact location of the samples (GPS coordinates), the sampling date, the sampling depth, the soil moisture at the time of sampling or the analysis methods used (e.g. potassium determination using the CAL method) are not documented, the data cannot be reliably interpreted later or compared with other data sets. Satellite or drone images also lose their informative value if metadata on the time of recording, sensor parameters or weather conditions are missing.
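As a minimal sketch of how such gaps could be caught early, the following Python snippet checks a soil-sample record for the metadata items mentioned above. The field names and the `missing_metadata` helper are illustrative assumptions and are not taken from any standard.

```python
# Hypothetical field names for a soil-sample record; adapt them to your schema.
REQUIRED_FIELDS = (
    "latitude",
    "longitude",
    "sampling_date",
    "sampling_depth_cm",
    "soil_moisture_percent",
    "analysis_method",
)


def missing_metadata(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in the record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if record.get(field) in (None, "")
    ]


# Made-up example record: depth and soil moisture were not documented.
sample = {
    "latitude": 52.52,
    "longitude": 13.40,
    "sampling_date": "2020-04-15",
    "analysis_method": "CAL",
}

print(missing_metadata(sample))
```

A check like this could run whenever a record is saved, so that missing context is noticed while it can still be recovered.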

Levels of metadata

Metadata can be recorded at three levels. Data objects (datasets and their components) can be gathered into groups (collections):

  1. Collection level metadata describes the collection as a whole. For example, the BonaRes database (https://maps.bonares.de) collects soil data from various long-term experiments (LTEs). The collection-level metadata describe, among other things, the objective, geographical coverage, participating institutes, licensing and interfaces for subsequent use (e.g. via APIs or download formats).
  2. Dataset level metadata describes individual objects. For example, within an LTE, the metadata describing the soil measurements of a specific year are on dataset level.
  3. Component level metadata describes the parts that make up a single object. For example, a soil dataset can consist of several layer measurements per soil profile (e.g. 0-10 cm, 10-30 cm, 30-60 cm). Each layer represents a component and requires its own metadata - such as sample depth, analysis type or storage.

The collection approach and subsequent level of description impacts on discoverability, and cost and effort of management. Metadata ideally should be based on the needs of those for whom the collection is created [7].

EXAMPLE: Collection, dataset and component metadata

An agricultural research institute operates a central archive with data from multi-year field trials on crop rotation. To enable effective use and targeted reuse of this data, the data sets are described at several levels:

This hierarchical structure makes it possible to find specific data - e.g. all plot yields of spring barley in 2020 under reduced tillage - without having to manually search through countless individual files. It also supports automatic linking with other data sets, such as weather or soil condition [7].
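The query described above can be sketched in Python, assuming the three levels are modelled as nested records. The class names, field names and all values below are made up for illustration; a real archive would more likely use a database or a catalogue service.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Component level: one plot within a yearly dataset."""
    plot_id: str
    crop: str
    tillage: str
    yield_t_per_ha: float


@dataclass
class Dataset:
    """Dataset level: all plot results of one trial year."""
    year: int
    components: list[Component] = field(default_factory=list)


@dataclass
class Collection:
    """Collection level: the archive as a whole."""
    title: str
    datasets: list[Dataset] = field(default_factory=list)


def find_yields(collection, crop, year, tillage):
    """Return (plot_id, yield) pairs matching crop, year and tillage."""
    return [
        (c.plot_id, c.yield_t_per_ha)
        for d in collection.datasets
        if d.year == year
        for c in d.components
        if c.crop == crop and c.tillage == tillage
    ]


archive = Collection(
    title="Crop rotation field trials",
    datasets=[
        Dataset(2019, [Component("P1", "spring barley", "reduced", 5.1)]),
        Dataset(2020, [
            Component("P1", "spring barley", "reduced", 5.4),
            Component("P2", "spring barley", "conventional", 5.9),
            Component("P3", "winter wheat", "reduced", 7.2),
        ]),
    ],
)

print(find_yields(archive, "spring barley", 2020, "reduced"))
```

Because year, crop and tillage are recorded as metadata at the appropriate level, the search never has to open the underlying data files.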

Types of metadata

Metadata types are often grouped into functional types, but note that some elements will provide multiple functions.

The most common types are:

Metadata language

Understanding commonly used metadata terminology will help you better plan, collect and apply metadata [7].

Elements and schemas

Metadata schemas are an overall structure for metadata about a particular information resource or for a specific domain. A schema specifies a set of metadata concepts or terms (called elements), and their associated definitions (semantics) and relationships. The value given to each element is the content.
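To make the terms concrete, here is a small Python sketch of a toy schema and a record that uses it. The element names, definitions and values are invented for illustration and do not come from any published schema.

```python
# A toy schema: element names mapped to their definitions (semantics).
SCHEMA = {
    "title": "A name given to the dataset",
    "creator": "The person or organisation that produced the dataset",
    "sampling_date": "The date on which the samples were taken",
}

# A metadata record: each element is paired with its content (value).
record = {
    "title": "Soil potassium, plot 12",
    "creator": "Example Institute",
    "colour": "brown",  # not defined by the schema
}


def unknown_elements(record: dict, schema: dict) -> list[str]:
    """Return elements used in the record that the schema does not define."""
    return sorted(set(record) - set(schema))


print(unknown_elements(record, SCHEMA))
```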

Metadata schemas often emerge from a single community group (e.g. the agrosystem science community) or can be developed to describe a specific type of experiment or domain. For example, MIAPPE (https://www.miappe.org) is a standard for describing plant phenotyping experiments and ISO 19115 (http://www.dcc.ac.uk/resources/metadata-standards/iso-19115) is an international standard for describing geographic information and services.

A schema may also specify:

EXAMPLE

One of the most common generic schemas is Dublin Core, maintained by the Dublin Core Metadata Initiative (http://dublincore.org); it is the most widely adopted schema for descriptive metadata to date.

It is simple and generic, with just 15 elements in the original Dublin Core Metadata Element Set, including Title, Date, Type, Format, Creator, Coverage and Rights. Dublin Core has since been extended to 55 terms in the DCMI terms (http://dublincore.org/documents/dcmi-terms) namespace.

Each field is essentially free-form text, with some restrictions.

Dublin Core is widely used on web pages with the “dc” namespace, so you will often see things like “dc.Title” as a metadata tag inside a webpage’s HTML headers. If you are working with video, you may use Dublin Core and the MPEG-7 standard for video archives, and so get both the “dc.Title” and the “MPEG7.Title” fields. Each has its own rules for how you can use it, found in the schema’s namespace (see below) [7].
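As a rough illustration of such header tags, the following Python sketch renders a few Dublin Core fields as HTML `<meta>` elements. The `dc_meta_tags` helper and the example values are assumptions; the “dc.Title” spelling follows the text above, and real pages may use a different capitalisation.

```python
from html import escape


def dc_meta_tags(fields: dict) -> str:
    """Render Dublin Core fields as HTML <meta> tags, one per line."""
    return "\n".join(
        # escape() protects characters such as & and " inside the attribute.
        f'<meta name="dc.{name}" content="{escape(value, quote=True)}">'
        for name, value in fields.items()
    )


tags = dc_meta_tags({
    "Title": "Soil moisture at site A & B",
    "Creator": "Example Institute",
    "Date": "2020-04-15",
})
print(tags)
```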

Content rules and controlled vocabularies

Specifying rules around the allowable content and format of values in each metadata element improves accuracy and machine-readability of metadata, and hence discoverability of collections. Free-form text entry can lead to ambiguous data, for example, the date 3/10/15 could refer to 3 October or 10 March, in either 1915 or 2015.

Having a specific set of terms that can be used in a field, i.e. a controlled vocabulary, allows filtering and faceting of the data, improving search function.
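A minimal sketch of both ideas in Python, assuming a record with a date element and a tillage element: the date must follow ISO 8601, and the tillage value must come from a controlled vocabulary. The element names and the vocabulary terms are hypothetical.

```python
from datetime import date

# Hypothetical controlled vocabulary for a "tillage" element.
TILLAGE_TERMS = {"conventional", "reduced", "no-till"}


def validate(record: dict) -> list[str]:
    """Return a list of human-readable problems found in the record."""
    problems = []
    try:
        # An ISO 8601 date such as 2015-10-03 has no day/month/century ambiguity.
        date.fromisoformat(record.get("sampling_date", ""))
    except (TypeError, ValueError):
        problems.append("sampling_date must be YYYY-MM-DD")
    if record.get("tillage") not in TILLAGE_TERMS:
        problems.append(
            "tillage must be one of: " + ", ".join(sorted(TILLAGE_TERMS))
        )
    return problems


print(validate({"sampling_date": "3/10/15", "tillage": "Reduced"}))
print(validate({"sampling_date": "2015-10-03", "tillage": "reduced"}))
```

The first record is rejected on both counts (ambiguous date, term not in the vocabulary), while the second passes with an empty list of problems.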

Controlled vocabularies can be:

Likewise, formats can be:

Namespaces

A metadata schema’s ‘namespace’ declares a unique set of elements and definitions. By specifying the namespace(s) of the metadata schema(s) you are using, you can define which schema each element belongs to, and point people to the accepted definition of that element.

For example, the term “date” appears in many schemas. However, the way a “date” is defined or recorded may be different depending on the schema; “date” might refer variously to the date the data were published in one schema, or the date the data were collected in another schema [7].
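The following Python sketch shows how namespaces keep two “date” elements apart in an XML record. The Dublin Core namespace URI is the published one; the second namespace (`https://example.org/field-trial/`) and the example dates are made up for illustration.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"     # Dublin Core element set
TRIAL = "https://example.org/field-trial/"  # hypothetical local schema

ET.register_namespace("dc", DC)
ET.register_namespace("trial", TRIAL)

record = ET.Element("record")
# Same local name "date", two different meanings:
ET.SubElement(record, f"{{{DC}}}date").text = "2021-03-01"     # used here for publication
ET.SubElement(record, f"{{{TRIAL}}}date").text = "2020-04-15"  # used here for collection

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# The namespace-qualified name tells the two elements apart when reading.
published = record.find(f"{{{DC}}}date").text
collected = record.find(f"{{{TRIAL}}}date").text
print(published, collected)
```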

Finding and choosing a metadata schema

Schemas range from the very generic to extremely discipline or resource-specific:

Schemas that are often used within the agrosystem science community include the following:

Adopt, adapt, or create your schema?

Although it is possible to develop a metadata schema from scratch, it is preferable to use or adapt existing standards and/or widely-established schemas, as they offer:

Your choices are:

  1. If there is just one obvious metadata standard, and it meets your needs, use it.
  2. If there are several obvious schemas that meet your needs, follow models of ‘good practice’ within your community.
  3. Where you can find no single appropriate schema:
    1. Adapt or extend an existing schema to better fit your needs, and document the changes you make very carefully using the documentation methods and mappings deployed by existing standards as a guide. Contact the ‘owners’ of the schema and attempt to work with them, as others may benefit from your changes.
    2. Alternatively, develop a new ‘application profile’, where various metadata elements (and the elements’ guidelines and documentation) are taken from different metadata schemas and mixed together.

If there is absolutely no schema you can use, check again; it’s a rare situation nowadays. If there is still no schema to be found, then you may have to develop one. However, it takes a fair bit of work, and you should bring together as many interested people in your discipline as you can [7].

Collecting and linking metadata

Collecting metadata

Automatic metadata collection

A lot of metadata can be created automatically during the data collection process. Many scientific instruments generate metadata alongside the data itself. An obvious example is digital cameras, where some provenance metadata is written at the same time as the photo is taken, e.g. location, time and date. The same is also true for UAV images or sensor data.

Automatic metadata collection avoids data entry errors and reduces the effort required. You can also set up an automated process to sanity-check the metadata when the data comes in.
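An automated sanity check of this kind could look like the following Python sketch. The field names (`latitude`, `longitude`, `timestamp`) and the chosen checks are assumptions; which checks make sense depends on the instrument.

```python
from datetime import datetime, timezone
from typing import Optional


def sanity_check(meta: dict, now: Optional[datetime] = None) -> list[str]:
    """Return warnings for implausible values in instrument metadata."""
    # Timestamps are expected to be timezone-aware so they can be compared.
    now = now or datetime.now(timezone.utc)
    warnings = []
    if not -90 <= meta["latitude"] <= 90:
        warnings.append("latitude out of range")
    if not -180 <= meta["longitude"] <= 180:
        warnings.append("longitude out of range")
    if meta["timestamp"] > now:
        warnings.append("timestamp lies in the future")
    return warnings


# Made-up record with an impossible latitude.
incoming = {
    "latitude": 95.0,
    "longitude": 13.4,
    "timestamp": datetime(2020, 4, 15, 10, 30, tzinfo=timezone.utc),
}
print(sanity_check(incoming))
```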

Extracted metadata collection

Some metadata can be extracted from other systems. For example, a university’s human resource management, grant management or research management systems may be the best sources of information regarding researchers, research grants, or research projects.

Manual metadata collection

Some metadata requires human participation to create, and the process used depends a lot on the tools used to collect the data, for example, metadata captured in electronic lab notebooks (ELNs) or digital field books. It is also possible to build tools such as forms, which can present the metadata fields (and even pre-populate some if connected to other institutional systems) and automatically enforce the right vocabularies and data structures for each metadata record. Metadata records can be created as the data is being collected, which is preferable, or can be done later on when the researcher organises the data they have collected over some period of time, e.g. after a field trip.

Metadata is often hidden away in specification statements, database structures, data models, program code or master data reference structures. This metadata needs to be made explicit and human-readable to be useful [7].

Linking metadata and data

Metadata within the data file

The first place that metadata can go is inside the file with the data itself. There are many digital file formats that include a range of metadata fields, and some can be extended to hold almost anything. These include:

A benefit of storing metadata inside the file is that it moves with the file; the association between the data and its metadata is easy to maintain.

However, the downsides are:

Metadata as separate files

This approach lets you store any kind of metadata without the restrictions of a particular file format: write the metadata into a separate, well-structured file (perhaps using XML or JSON) and associate it with the data file. A common way to strengthen the association between the data file and its metadata file is to use the same filename stem, e.g. cat1.tiff is the image and cat1.xml is the metadata. Searching and modifying the metadata can also be slightly faster this way. A well-organised folder structure and consistent naming conventions are essential here.
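The filename-stem convention can be sketched in Python as follows. This example uses JSON rather than XML for brevity; the helper names and the metadata values are assumptions, and a temporary directory stands in for a real data folder.

```python
import json
import tempfile
from pathlib import Path


def sidecar_path(data_file: Path) -> Path:
    """Return the metadata file path that shares the data file's stem."""
    return data_file.with_suffix(".json")


def write_sidecar(data_file: Path, metadata: dict) -> Path:
    """Write metadata as JSON next to the data file and return its path."""
    target = sidecar_path(data_file)
    target.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target


def read_sidecar(data_file: Path) -> dict:
    """Load the metadata that belongs to the given data file."""
    return json.loads(sidecar_path(data_file).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "cat1.tiff"  # the data file itself is not needed here
    written = write_sidecar(image, {"title": "Cat, front view", "date": "2020-04-15"})
    sidecar_name = written.name
    loaded = read_sidecar(image)

print(sidecar_name, loaded)
```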

However, the downsides are:

Spreadsheets and databases

Aggregating metadata for multiple datasets into a single spreadsheet or database gives you a lot more flexibility in searching, making changes, and in the metadata fields recorded.

Collection-level changes are very easy to make in a database, as datasets that belong to a certain collection are simply flagged and pointed to the collection-level record. Only the collection-level record has to be changed for the change to apply to every dataset in that collection.

The metadata is associated with the data by recording the filename and location (or URI) for the data within the metadata record, and updating this anytime the data location is changed [7].
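A minimal sketch of this arrangement, using Python's built-in `sqlite3` module with an in-memory database. The table layout, column names, licence values and file URIs are assumptions chosen for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE collection (id INTEGER PRIMARY KEY, title TEXT, licence TEXT);
    CREATE TABLE dataset (
        id INTEGER PRIMARY KEY,
        collection_id INTEGER REFERENCES collection(id),
        file_uri TEXT  -- location of the data file
    );
    INSERT INTO collection VALUES (1, 'Crop rotation trials', 'CC BY-NC 4.0');
    INSERT INTO dataset VALUES (1, 1, 'file:///archive/2019/yields.csv');
    INSERT INTO dataset VALUES (2, 1, 'file:///archive/2020/yields.csv');
""")

# One update on the collection-level record ...
con.execute("UPDATE collection SET licence = 'CC BY 4.0' WHERE id = 1")

# ... is seen by every dataset that points to it.
rows = con.execute("""
    SELECT d.file_uri, c.licence
    FROM dataset AS d JOIN collection AS c ON c.id = d.collection_id
    ORDER BY d.id
""").fetchall()
print(rows)
```

If a data file is moved, only the `file_uri` of the affected dataset row needs to be updated to keep the link between metadata and data intact.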