The term metadata is used widely in the context of database and statistical systems, though there is no general agreement on exactly what it means. We take a very broad view:
Statistical Metadata is any information that is needed by people or systems to make proper and correct use of the real statistical data, in terms of capturing, reading, processing, interpreting, analysing and presenting the information (or any other use). In other words, statistical metadata is anything that might influence or control the way in which the core information is used by people or software.
It extends from very specific technical information, used, for example, to ensure that a data file is read correctly, or that a category in a summary table has the right label, or that the design of a sample is correctly taken into account when computing a statistical summary, right through to descriptive (intentional) information, for example about why a question was worded in a particular way or why a particular selection criterion was used for a sample.
Thus, metadata includes (but is not limited to) population definitions, sample designs, file descriptions and database schemas, codebooks and classification structures, processing details, checks, transformations, weighting, fieldwork reports and notes, conceptual motivations, and table designs and layouts.
Metadata can be used informally by people who read it (and use it to affect the way they work with or interpret information), and formally by software to guide and control the way information is processed. Processes can also generate metadata.
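As an illustration of the formal use of metadata by software, the following is a minimal sketch in Python. The metadata structure, the variable name purpose, the weight column wt and the function weighted_table are all hypothetical, chosen only for this example. The codebook supplies the category labels, the weight entry tells the process which column to weight by, and the process in turn generates a small piece of metadata describing what it did.

```python
from collections import Counter

# Hypothetical metadata for one variable: a codebook mapping stored codes
# to category labels, plus the name of the column holding sample weights.
metadata = {
    "variable": "purpose",
    "labels": {1: "Commuting", 2: "Shopping", 3: "Leisure"},
    "weight": "wt",
}

# Invented records, standing in for the real statistical data.
records = [
    {"purpose": 1, "wt": 1.5},
    {"purpose": 2, "wt": 1.0},
    {"purpose": 1, "wt": 0.5},
    {"purpose": 3, "wt": 2.0},
]


def weighted_table(records, metadata):
    """Tabulate one variable, letting the metadata choose labels and weights."""
    totals = Counter()
    for rec in records:
        code = rec[metadata["variable"]]
        label = metadata["labels"][code]          # label comes from the codebook
        totals[label] += rec[metadata["weight"]]  # weight column named in metadata
    # The process also generates metadata describing how the table was made.
    provenance = {
        "source_variable": metadata["variable"],
        "weighted_by": metadata["weight"],
        "n_records": len(records),
    }
    return dict(totals), provenance


table, provenance = weighted_table(records, metadata)
```

Changing a label or the weight column then only requires a change to the metadata, not to the tabulation code.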
Much of the metadata will be in formal, machine-processable form, so that it can be part of the input to other processes. Much of it will be searchable, so that users can discover whether the database contains information of interest. Some of it will be formally structured and some not; some will describe concrete aspects of particular stored information, while other parts will relate to the more abstract concepts that underlie the objectives of the information.
Properly structured metadata is essential for
Note that this is much broader than the view represented by the Dublin Core standard for metadata. That is essentially a cataloguing approach, which specifies certain items of descriptive information that are needed to accompany a dataset: while valuable, this is restricted to purely textual material.
The capture or collection of metadata should ideally be integrated with the creation of the resource to which it relates. Experience has shown that creating the metadata manually as a separate, subsequent process is error-prone and time-consuming. Thus the creation of physical and operational metadata should be built into the design or production processes of the resources, and, as far as possible, the descriptive and conceptual metadata should be created as adjuncts to these processes.
The concept of linking is essential for metadata, so that, for example, it is possible to move from reading the description of something to looking at the thing described, perhaps in the style of hyperlinks. Linking is also needed for statistical information. For example, with aggregate data we must be able to link from the dimensions of a data cube to the variables in the source data that were aggregated, and to the classifications used to define the possible groupings for each dimension.
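The linking of aggregate data back to its sources can be sketched as a small set of linked objects. This is an illustration only: the class names (Classification, Variable, Dimension, Cube), the trace method and the example travel data are assumptions made for this sketch, not part of any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    name: str
    categories: tuple  # the possible groupings for a dimension


@dataclass(frozen=True)
class Variable:
    name: str
    dataset: str  # the source data in which the variable is stored


@dataclass(frozen=True)
class Dimension:
    name: str
    source: Variable                # variable that was aggregated
    classification: Classification  # classification defining the groupings


@dataclass(frozen=True)
class Cube:
    name: str
    dimensions: tuple

    def trace(self, dimension_name):
        """Follow the links from a dimension to its variable and classification."""
        for dim in self.dimensions:
            if dim.name == dimension_name:
                return dim.source, dim.classification
        raise KeyError(dimension_name)


# Hypothetical example: a cube of journeys grouped by purpose.
purposes = Classification("Purpose groups", ("Commuting", "Shopping", "Leisure"))
jpurp = Variable("jpurp", "journey_records")
cube = Cube("Journeys by purpose", (Dimension("Purpose", jpurp, purposes),))

source, classification = cube.trace("Purpose")
```

Because each dimension holds references rather than copies of names, a user or a program can move from the cube to the variable and classification behind it in one step.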
With metadata it is important to represent abstract concepts as well as concrete instances of these concepts. For example, the idea of the Purpose of a Journey will need to exist within a travel database, so that it can be defined, and this concept will need to be linked to classifications that distinguish different purposes, and to variables that record the purpose (according to a classification) of actual journeys. Thus a user will be able to link to all the components of the database related to the concept of Journey Purpose, not simply to find components that include the text ‘Journey Purpose’ in their description or label.
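The difference between following concept links and searching text can be shown with a short sketch. The component identifiers, labels and the two helper functions below are invented for illustration. Each component carries an explicit link to the concept it relates to, so that components can be found even when their labels never mention the concept by name.

```python
# Hypothetical database components, each linked to an abstract concept.
components = [
    {"id": "concept:journey_purpose", "label": "Journey Purpose",
     "concept": "concept:journey_purpose"},
    {"id": "cls:purpose_codes", "label": "Trip reason codes",
     "concept": "concept:journey_purpose"},
    {"id": "var:jpurp", "label": "Why travelled",
     "concept": "concept:journey_purpose"},
    {"id": "var:jdist", "label": "Distance of journey",
     "concept": "concept:journey_distance"},
]


def by_concept(components, concept_id):
    """Find every component linked to the given concept."""
    return [c["id"] for c in components if c["concept"] == concept_id]


def by_text(components, text):
    """Find components whose label contains the given text (case-insensitive)."""
    return [c["id"] for c in components if text.lower() in c["label"].lower()]


linked = by_concept(components, "concept:journey_purpose")
matched = by_text(components, "Journey Purpose")
```

Here the concept link reaches the classification and the variable, while the text search finds only the component whose label happens to contain the phrase.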
Page last updated 05 February 2008.