This Guide is intended to provide a simple, generic, working-level view of the needs, issues and processes around metadata collection and creation as they relate to research data [7].
It is based on: Australian Research Data Commons (ARDC). (2020). ARDC Metadata Guide. Zenodo. https://doi.org/10.5281/zenodo.6459832
While metadata is generally summarised as ‘data about data’, the following examples show what this actually means:
- Metadata is information about an object or resource that describes characteristics of that object, such as content, quality, format, location and access rights.
- Metadata can be used to describe physical objects (e.g. soil samples, plant material and crop seeds) as well as digital objects (e.g. documents, images, data sets and software).
- Metadata can take many different forms, from free text (e.g. a README file) to standardised, structured, machine-readable, extensible content (the sketch after this list contrasts the two).
- Metadata is analogous to any other form of data, in terms of how it is created, managed, linked and stored.
- Metadata is associated with the data it describes. It can be embedded within the data file, recorded in a separate text/spreadsheet file that is linked to the collection of data files it describes, or contained in a catalogue record that points to the research data collection.
- Metadata enables and enhances the discovery and reuse of data [7].
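The contrast between free-text and structured, machine-readable metadata can be made concrete with a minimal sketch. The Python snippet below shows the same descriptive information in both forms; all field names and values are illustrative assumptions:

```python
import json

# The same descriptive metadata, first as free text (README style)...
readme_text = """\
Dataset: Winter wheat yield, Groß Kreutz trial site, 2020
Created by Jane Doe (Example Institute). Contact: jane.doe@example.org
Licence: CC BY 4.0
"""

# ...and as structured, machine-readable content that software can
# index, validate and exchange.
structured = {
    "title": "Winter wheat yield, Groß Kreutz trial site, 2020",
    "creator": "Jane Doe",
    "publisher": "Example Institute",
    "contact": "jane.doe@example.org",
    "licence": "CC BY 4.0",
}

print(readme_text)
print(json.dumps(structured, indent=2, ensure_ascii=False))
```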
Finding data
Data formats such as text can be indexed and searched themselves (as in a simple Google search). However, the ability to search formats like audio, images and video is limited, and discoverability relies on searching the metadata. Discovery metadata helps researchers find data that, for example:
- relates to a research discipline of interest via field of research, keyword or vocabulary metadata.
- is generated by another researcher whose work is of interest via lead researcher or contributor metadata.
- relates to a geographical area of interest via geospatial metadata.
- relates to a specific crop species/variety of interest, a specific phenotypic trait of interest (e.g. root architecture, biomass growth), a specific environmental impact or field management aspect of interest (e.g. fertilisation, crop rotation, erosion control) or a specific soil class of interest via descriptive and discipline-specific metadata.
- is generated using a specific method of interest (e.g. a specific sensor used, a specific depth of measurement, a specific software/analysis method used) via process metadata (see the search sketch after this list) [7].
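As a minimal sketch of how discovery metadata supports such searches, the following Python snippet filters a small set of invented catalogue records by keyword and crop species; the record structure and field names are assumptions for illustration, not a fixed standard:

```python
# Invented catalogue records; in practice these would come from a repository.
records = [
    {"title": "Soil moisture series 2020", "keywords": ["soil", "moisture"],
     "crop": "winter wheat", "creator": "J. Doe"},
    {"title": "Root architecture survey", "keywords": ["phenotyping", "roots"],
     "crop": "spring barley", "creator": "A. Smith"},
]

def find(records, keyword=None, crop=None):
    """Return the records matching a keyword and/or a crop species."""
    hits = records
    if keyword is not None:
        hits = [r for r in hits if keyword in r["keywords"]]
    if crop is not None:
        hits = [r for r in hits if r["crop"] == crop]
    return hits

print(find(records, keyword="phenotyping"))  # discovery via keyword metadata
print(find(records, crop="winter wheat"))    # via descriptive metadata
```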
Determining the value of data
To assess the usefulness, value and quality of a data set, researchers need to understand the context around the data. This is given in metadata that:
- describes why the data were collected, the experimental design and data collection methods
- links to the researchers and institution(s) involved
- identifies the research program or grant
- points to publications that have flowed from the research data
- explicitly provides provenance, licensing, rights, and technical information [7].
Accessing data
Access to research data requires:
- a direct download link to online data for open access, or
- contact information metadata for the data manager for mediated access [7].
Using and reusing data
To make use of any data set, researchers need metadata on:
- how the data is structured (e.g. table structure, time units, measurement series)
- what it describes (e.g. nutrient content, disease infestation, yield data)
- how to read it (e.g. column headings, units of measurement, coding)
- how the data was collected - including methodological details such as device types, calibrations, weather conditions, instrument settings or survey questions
- what can be done with the data (e.g. licence information, rights of use, public domain mark)
- how to acknowledge the original creators by citing the data.
Proper recording of this information is important for the producers of the data as well as future users [7].
In agricultural field trials, it is common practice to take soil samples under standardised conditions and evaluate them using specific analysis methods. If important metadata such as the exact location of the samples (GPS coordinates), the sampling date, the sampling depth, the soil moisture at the time of sampling or the analysis methods used (e.g. potassium determination using the CAL method) are not documented, the data cannot be reliably interpreted later or compared with other data sets. Satellite or drone images also lose their informative value if metadata on the time of recording, sensor parameters or weather conditions are missing.
There are three levels of data grouping: data objects (datasets and their components) can be collected into groups (collections):
- Collection level metadata describes the collection as a whole. For example, the BonaRes database (https://maps.bonares.de) collects soil data from various long-term experiments (LTEs). The collection-level metadata describe, among other things, the objective, geographical coverage, participating institutes, licensing and interfaces for subsequent use (e.g. via APIs or download formats).
- Dataset level metadata describes individual objects. For example, within an LTE, the metadata describing the soil measurements of a specific year are on dataset level.
- Component level metadata describes the parts that make up a single object. For example, a soil dataset can consist of several layer measurements per soil profile (e.g. 0-10 cm, 10-30 cm, 30-60 cm). Each layer represents a component and requires its own metadata, such as sample depth, analysis type or storage.
The chosen grouping approach and level of description affect discoverability as well as the cost and effort of management. Ideally, metadata should be based on the needs of those for whom the collection is created [7].
EXAMPLE: Collection, dataset and component metadata
An agricultural research institute operates a central archive with data from multi-year field trials on crop rotation. To enable effective use and targeted reuse of this data, the data sets are described at several levels:
- Collection level: The data is organized in collections according to trial type - e.g. “Crop rotation trials under conservation tillage”, “Soil monitoring on organically farmed land” or “Long-term trials on nitrogen fertilization in winter wheat”.
The collection level describes overarching metadata such as the locations of the trial sites, duration of the trial series, objectives, participating institutes and available data types (e.g. soil data, yield data, climate data).
- Dataset level: This level describes a specific trial year at a location, e.g. “Groß Kreutz long-term trial - year 2020”.
Metadata such as tillage method, cultivated crop rotation element (e.g. spring barley), sowing date, fertilization strategy, weather pattern and harvest date are documented here.
- Component level: Within a trial year, the data consists of several individual measurements, e.g. per plot or measurement time. For each plot, for example, yield data, disease infestation, nutrient analyses or drone images are recorded - each with specific metadata such as plot ID, GPS coordinates, measurement date, sensor parameters, laboratory methods used or scoring criteria.
This hierarchical structure makes it possible to find specific data - e.g. all plot yields of spring barley in 2020 under reduced tillage - without having to manually search through countless individual files. It also supports automatic linking with other data sets, such as weather or soil condition data [7].
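A minimal sketch of how such a three-level hierarchy might be represented, here as nested Python dictionaries; all identifiers, field names and values are illustrative assumptions:

```python
# Collection -> datasets (trial years) -> components (per-plot measurements).
archive = {
    "title": "Crop rotation trials under conservation tillage",
    "institutes": ["Example Research Institute"],
    "data_types": ["soil", "yield", "climate"],
    "datasets": [
        {
            "title": "Groß Kreutz long-term trial - year 2020",
            "tillage": "reduced",
            "crop": "spring barley",
            "components": [
                {"plot_id": "P-017", "measurement": "yield",
                 "date": "2020-07-28", "unit": "t/ha"},
                {"plot_id": "P-017", "measurement": "nutrient analysis",
                 "date": "2020-04-02", "lab_method": "CAL"},
            ],
        },
    ],
}

# Find all plot yields of spring barley in 2020 under reduced tillage.
for ds in archive["datasets"]:
    if ds["crop"] == "spring barley" and ds["tillage"] == "reduced":
        for comp in ds["components"]:
            if comp["measurement"] == "yield" and comp["date"].startswith("2020"):
                print(ds["title"], comp["plot_id"], comp["date"])
```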
Metadata types are often grouped into functional types, but note that some elements will provide multiple functions.
The most common types are:
- Descriptive metadata. Information required for discovery and assessment of the collection
- e.g. title, contributors, subject or keywords, study description, and the location and dates of the study.
- Provenance metadata. This relates to the origins and processing of the data, and enables interpretation and reuse of the data. It ranges from the easily human readable to the highly technical, and usually requires some knowledge of the domain to create.
- e.g. Where did the data come from? Why was it collected? Who collected it, when and where? What instruments/ technologies were used to collect the data, and how were they set up? How has the data been processed?
- Technical metadata. Fundamental information for a person or a computer application to read the data.
- e.g. How is the data set up? What formats, and versions of formats, are used? How is the database configured? How does it relate to other data?
- Rights and access metadata. Information to enable access, and licensing or usage rules.
- e.g. How can someone access the data? Who is allowed to view or modify the data, or the metadata, and under what conditions? Who has some kind of authority over the data? Are there costs associated with access? Under what licence is the data being made available?
- Preservation metadata. This builds on the history from the Provenance, Rights and Technical metadata, and also includes information to allow the data to be managed for long-term accessibility.
- e.g. Has there been any restructuring or other changes to the files, e.g. due to migration to new file formats? What software has been used to access the data?
- Citation metadata. Information required for someone to cite the data.
- e.g. Creator(s), Publication Year, Title, Publisher, Identifier [7].
Understanding commonly used metadata terminology will help you better plan, collect and apply metadata [7].
Elements and schemas
Metadata schemas are an overall structure for metadata about a particular information resource or for a specific domain. A schema specifies a set of metadata concepts or terms (called elements), and their associated definitions (semantics) and relationships. The value given to each element is the content.
Metadata schemas often emerge from a single community group (e.g. the agrosystem science community) or can be developed to describe a specific type of experiment or domain. For example, MIAPPE (https://www.miappe.org) is a standard for describing plant phenotyping experiments and ISO 19115 (http://www.dcc.ac.uk/resources/metadata-standards/iso-19115) is an international standard for describing geographic information and services.
A schema may also specify:
- content rules e.g. required formats and controlled vocabularies
- the syntax in which elements must be encoded (or expressed), such as XML (Extensible Markup Language) [7].
EXAMPLE
One of the most common generic schemas is Dublin Core (http://dublincore.org), maintained by the Dublin Core Metadata Initiative (DCMI); it is the most widely adopted schema for descriptive metadata to date.
It is simple and generic, with just 15 elements in the original Dublin Core Metadata Element Set, including Title, Date, Type, Format, Creator, Coverage and Rights. Dublin Core has since been extended to 55 terms in the DCMI terms (http://dublincore.org/documents/dcmi-terms) namespace.
Each field is essentially free-form text, with some restrictions.
Dublin Core is used a lot on web pages, with the “dc” namespace – so you will often see things like “dc.Title” as a metadata tag inside a webpage’s HTML headers. If you are working with video, you may use Dublin Core and the MPEG-7 standard for video archives, and so get the “dc.Title” and the “MPEG7.Title” fields. Each has its own rules for how you can use them, found in the schema’s namespace (see below) [7].
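As a sketch of what this looks like in practice, the following Python snippet emits Dublin Core meta tags for a webpage's HTML header; the field values are invented, and tag naming conventions vary between sites:

```python
# Emit Dublin Core <meta> tags for an HTML <head>. Values are invented;
# real pages often also include a schema link such as
# <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">.
dc_fields = {
    "dc.Title": "Soil monitoring on organically farmed land",
    "dc.Creator": "Example Research Institute",
    "dc.Date": "2020-10-03",
    "dc.Rights": "CC BY 4.0",
}

for name, content in dc_fields.items():
    print(f'<meta name="{name}" content="{content}">')
```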
Content rules and controlled vocabularies
Specifying rules around the allowable content and format of values in each metadata element improves accuracy and machine-readability of metadata, and hence discoverability of collections. Free-form text entry can lead to ambiguous data, for example, the date 3/10/15 could refer to 3 October or 10 March, in either 1915 or 2015.
Having a specific set of terms that can be used in a field, i.e. a controlled vocabulary, allows filtering and faceting of the data, improving search function.
Controlled vocabularies can be:
- locally defined (e.g. only allowing names to be selected from a list of employees), or an established standard (e.g. AGROVOC (https://agrovoc.fao.org/browse/agrovoc/en) as a thesaurus for agriculture-related terms, including terms for crop species or soil textures).
Likewise, formats can be locally defined or follow an established standard (e.g. ISO 8601 for dates, which removes the ambiguity described above); the sketch below shows both kinds of content rule in action.
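A minimal sketch of enforcing a controlled vocabulary and a date format rule, using only Python's standard library; the vocabulary, field names and validation rules are invented for illustration:

```python
from datetime import date

# Invented controlled vocabulary; a real one might be drawn from AGROVOC.
CROP_VOCAB = {"winter wheat", "spring barley", "maize"}

def validate(record):
    """Return a list of content-rule violations for one metadata record."""
    errors = []
    if record.get("crop") not in CROP_VOCAB:
        errors.append(f"crop {record.get('crop')!r} is not in the vocabulary")
    try:
        # Require ISO 8601 (YYYY-MM-DD) to avoid the 3/10/15 ambiguity.
        date.fromisoformat(record.get("date", ""))
    except ValueError:
        errors.append(f"date {record.get('date')!r} is not ISO 8601")
    return errors

print(validate({"crop": "spring barley", "date": "2015-10-03"}))  # no errors
print(validate({"crop": "barley", "date": "3/10/15"}))            # two errors
```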
Namespaces
A metadata schema’s ‘namespace’ declares a unique set of elements and definitions. By specifying the namespace(s) of the metadata schema(s) you are using, you can define which schema each element belongs to, and point people to the accepted definition of that element.
For example, the term “date” appears in many schemas. However, the way a “date” is defined or recorded may be different depending on the schema; “date” might refer variously to the date the data were published in one schema, or the date the data were collected in another schema [7].
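As a sketch of how namespaces keep same-named elements apart in practice, the following Python snippet builds a small XML fragment with two differently qualified 'date' elements; the Dublin Core namespace URI is the real one, while the second namespace is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Two schemas both define a "date" element; namespaces keep them apart.
DC = "http://purl.org/dc/elements/1.1/"         # Dublin Core (real URI)
FT = "http://example.org/fieldtrial/terms/"     # invented for illustration

ET.register_namespace("dc", DC)
ET.register_namespace("ft", FT)

record = ET.Element("record")
ET.SubElement(record, f"{{{DC}}}date").text = "2021-03-01"  # date published
ET.SubElement(record, f"{{{FT}}}date").text = "2020-07-28"  # date collected

# Serialises a <record> containing <dc:date> and <ft:date>, each element
# unambiguously tied to its schema by the declared namespace.
print(ET.tostring(record, encoding="unicode"))
```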
Schemas range from the very generic to extremely discipline or resource-specific:
- Generic schemas such as DCMI Metadata Terms (https://www.dublincore.org/specifications/dublin-core/dcmi-term) and the DataCite Metadata Schema (https://schema.datacite.org) are widely adopted and easy to use - but they are so generic that everyone can use them in quite different ways, and they reach their limits as descriptions become more specialised.
- Discipline-specific schemas provide a richer and more targeted structure and vocabulary that allows detailed information to be provided in a more structured, granular format. Finding these can be as simple as searching the internet for “[discipline] metadata schema”, starting at a very high level (e.g. “agriculture metadata schema”) and then getting steadily more specific (e.g. “crop yield metadata schema”) if required. You can also search metadata schema registries such as the Digital Curation Centre’s List of Metadata Standards (http://www.dcc.ac.uk/resources/metadata-standards/list), the Research Data Alliance’s Metadata Standards Directory (http://rd-alliance.github.io/metadata-directory) and the FAIRsharing Registry (https://fairsharing.org) [7].
Schemas that are often used within the agrosystem science community include, for example, MIAPPE for plant phenotyping experiments and ISO 19115 for geographic information, both introduced above.
Adopt, adapt, or create your schema?
Although it is possible to develop a metadata schema from scratch, it is preferable to use or adapt existing standards and/or widely-established schemas, as they offer:
- Cost savings - the schema and its usage guidelines have been developed, thus saving time and effort.
- Access to help and advice - a standard is likely to have a community of users.
- Usability - users are likely to be familiar with a standard and its terminology.
- Interoperability - information can be easily shared between systems.
- Sustainability - schemas need maintenance and updating if they are to remain usable; with an established standard, this burden is shared by its community rather than falling on you alone.
Your choices are:
- If there is just one obvious metadata standard, and it meets your needs, use it.
- If there are several obvious schemas that meet your needs, follow models of ‘good practice’ within your community.
- Where you can find no single appropriate schema:
- Adapt or extend an existing schema to better fit your needs, and document the changes you make very carefully using the documentation methods and mappings deployed by existing standards as a guide. Contact the ‘owners’ of the schema and attempt to work with them, as others may benefit from your changes.
- Alternatively, develop a new ‘application profile’, where various metadata elements (and the elements’ guidelines and documentation) are taken from different metadata schemas and mixed together.
If there is absolutely no schema you can use, check again; it’s a rare situation nowadays. If there is still no schema to be found, then you may have to develop one. However, it takes a fair bit of work, and you should bring together as many interested people in your discipline as you can [7].
A lot of metadata can be created automatically during the data collection process. Many scientific instruments generate metadata alongside the data itself. An obvious example is digital cameras, where some provenance metadata is written at the same time as the photo is taken, e.g. location, time and date. The same is also true for UAV images or sensor data.
Automatic metadata collection avoids data entry errors and reduces the effort required. You can also set up an automated process to sanity-check the metadata when the data comes in.
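For instance, the provenance metadata that digital cameras write is typically stored as EXIF tags inside the image file and can be read programmatically. The sketch below assumes the third-party Pillow library is installed and that a file named photo.jpg exists locally:

```python
# Read camera-written EXIF metadata from an image file.
# Assumes the Pillow library (pip install Pillow) and a local photo.jpg.
from PIL import ExifTags, Image

with Image.open("photo.jpg") as img:
    exif = img.getexif()

for tag_id, value in exif.items():
    # Map numeric EXIF tag IDs to readable names, e.g. DateTime or Model.
    name = ExifTags.TAGS.get(tag_id, tag_id)
    print(f"{name}: {value}")
```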
Some metadata can be extracted from other systems. For example, a university’s human resource management, grant management or research management systems may be the best sources of information regarding researchers, research grants, or research projects.
Some metadata requires human participation to create, and the process used depends a lot on the tools used to collect the data, for example, metadata captured in electronic lab notebooks (ELNs) or digital field books. It is also possible to build tools such as forms, which can present the metadata fields (and even pre-populate some if connected to other institutional systems) and automatically enforce the right vocabularies and data structures for each metadata record. Metadata records can be created as the data is being collected, which is preferable, or can be done later on when the researcher organises the data they have collected over some period of time, e.g. after a field trip.
Metadata is often hidden away in specification statements, database structures, data models, program code or master data reference structures. This metadata needs to be made explicit and human-readable to be useful [7].
The first place that metadata can go is inside the file with the data itself. There are many digital file formats that include a range of metadata fields, and some can be extended to hold almost anything. These include:
- text formats e.g. DOCX and PDF
- tabular formats e.g. XLSX
- image formats e.g. TIFF and JPEG
- video formats e.g. MPEG
- audio formats e.g. WAV
- specialist discipline formats e.g. HDF (Remote sensing).
A benefit of storing metadata inside the file is that it moves with the file; the association between the data and its metadata is easy to maintain.
However, the downsides are:
- Not every metadata field you want may be able to be added.
- Searching the collection is slow, as the computer has to open every single file for every single query, especially as the number of files and queries grows.
- Collection-level metadata is not easily managed. If you write a collection reference into each file and then decide to make a change to it, you have to edit every single affected file [7].
Writing the metadata into a separate, well-structured file (perhaps using XML or JSON) and associating it with the data file gives you almost unlimited flexibility for storing any and all kinds of metadata without restriction. A common approach to strengthen the association between data file and metadata file is to use the same filename stem, e.g. cat1.tiff is the image and cat1.xml is the metadata. This can slightly improve the performance of searching and of metadata modifications. A well-organized folder structure and naming conventions are essential here!
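A minimal sketch of this sidecar approach in Python, here writing a JSON metadata file that shares the data file's filename stem; the filenames and metadata fields are illustrative:

```python
import json
from pathlib import Path

data_file = Path("cat1.tiff")  # the data file from the example above

# Write the metadata next to the data, sharing the filename stem:
# cat1.tiff -> cat1.json
metadata = {"title": "Cat photo 1", "creator": "J. Doe", "date": "2020-10-03"}
sidecar = data_file.with_suffix(".json")
sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")

# Anyone (or any program) holding the data file can now derive the
# metadata file's name from the shared stem and read it back.
print(json.loads(sidecar.read_text(encoding="utf-8"))["title"])
```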
However, the downsides are:
- the data is still on the same storage medium, and you still have to open the same number of files to make queries or changes
- it increases the risk of separating the data and the metadata when files are moved [7].
Spreadsheets and databases
Aggregating metadata for multiple datasets into a single spreadsheet or database gives you a lot more flexibility in searching, making changes, and in the metadata fields recorded.
Collection-level changes are very easy to make in a database, as datasets that belong to a certain collection are just flagged and pointed to the collection-level record. Only one record (the collection-level record) in the database has to be changed for every dataset in that collection to be changed.
The metadata is associated with the data by recording the filename and location (or URI) for the data within the metadata record, and updating this anytime the data location is changed [7].
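A minimal sketch of this database approach, using Python's built-in sqlite3 module: dataset records point to a single collection record, so a collection-level change touches only one row, and each dataset record carries the location of the data it describes. Table and field names are illustrative assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE collections (id INTEGER PRIMARY KEY, title TEXT, licence TEXT);
    CREATE TABLE datasets (
        id INTEGER PRIMARY KEY,
        title TEXT,
        data_uri TEXT,  -- filename/location of the data being described
        collection_id INTEGER REFERENCES collections(id)
    );
""")
con.execute("INSERT INTO collections VALUES (1, 'Crop rotation trials', 'CC BY 4.0')")
con.execute("INSERT INTO datasets VALUES (1, 'Trial year 2020', 'file:///data/2020.csv', 1)")
con.execute("INSERT INTO datasets VALUES (2, 'Trial year 2021', 'file:///data/2021.csv', 1)")

# A collection-level change: one UPDATE takes effect for every dataset
# that points to this collection record.
con.execute("UPDATE collections SET licence = 'CC0 1.0' WHERE id = 1")

for row in con.execute("""
        SELECT d.title, d.data_uri, c.licence
        FROM datasets d JOIN collections c ON d.collection_id = c.id"""):
    print(row)
```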