The future of Statistical Metadata (and metadata in general) does not lie in further amateur attempts to design XML data structures for some particular problem. Rather, it lies in a professional approach to system design and software development, producing metadata-aware applications that can use, capture, present, manipulate and re-use metadata and the associated resources which it describes.
We have had innumerable proposals for metadata systems and standards over the past decade, but few have produced anything usable or lasting. The few that have been successful have either been extremely simple in structure and concept – such as triple-s[1] for data and label exchange, or the Dublin Core[2] for resource discovery – or have involved very significant investment – such as the proprietary Dimensions system from SPSS, complete with an API and its own version of Visual Basic.
Other initiatives have produced valuable insights into the requirements for metadata in statistical applications, but little in the way of usable functionality. For example, in the field of questionnaire and instrument design, the IQML project[4] and the QEDML system[5] (both presented, along with several others, at previous ASC conferences) provide many useful ideas about the representation of questionnaires and the data capture process, but neither has produced a direct legacy – IQML came to a natural end, so that its influence was always going to be indirect, and the QEDML web site has not been updated since 2004, in spite of very considerable enthusiasm about its potential.
This is not the right point for an exhaustive discussion of such initiatives, but there are many that could be mentioned, often with considerable merit but with only indirect influence on actual metadata practice. A few exceptions are discussed later.
The purpose of metadata is to support the use of actual statistical data, or some other resource. For metadata to be used it must be both useful and usable. To be useful it must be pertinent to the task for which it is to be used, and it must be of adequate quality for that task. Where metadata (or any other resource) is not used, there is no incentive for those collecting it to maintain its quality, which in turn makes it even less likely to be used: a vicious circle. We need to break that circle.
But usefulness is not enough to ensure usability. Reading extensive documentation may yield some information, but it is rarely the best way to use metadata in support of a task, particularly when the metadata is presented as a raw XML document. Instead we need metadata-enabled applications which make use of metadata to simplify or enrich the primary task of the user.
In the same way, metadata capture should not in general be a manual task. Metadata is rarely (perhaps never) new information; it is ancillary information, an adjunct to some other task. So the process of performing the initial task should automatically capture the related metadata. For example, the design of a questionnaire (with all its wording, conditions, filters, skips, fill-ins, etc.), in a suitable design program, already contains all the metadata about the questionnaire structure that might be needed to inform later analysis of data collected with that questionnaire. We just need to expose that information in a way that can be used in other contexts. Such exposure also makes it possible to enhance the design process itself, for example by importing questionnaire fragments from a library or from another questionnaire (perhaps designed in a different application).
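As a purely illustrative sketch (the class names and fields below are invented, not taken from any questionnaire package or standard), the structural information a design tool already holds might be exposed in a form such as the following, here in Python:

# A minimal sketch of the structural metadata a questionnaire design tool
# already holds as a by-product of design work.  All names and fields here
# are invented for illustration only.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Question:
    name: str                                            # variable the question will produce
    wording: str                                         # text shown to the respondent
    codes: dict[int, str] = field(default_factory=dict)  # code -> label
    ask_if: Optional[str] = None                         # routing/filter condition, if any

@dataclass
class Questionnaire:
    title: str
    questions: list[Question] = field(default_factory=list)

    def export_metadata(self) -> list[dict]:
        """Expose the questionnaire structure for use in other contexts."""
        return [{"name": q.name, "label": q.wording,
                 "codes": q.codes, "ask_if": q.ask_if}
                for q in self.questions]

# Example: a filtered question whose routing condition is captured at design time.
q2 = Question("q2", "How satisfied are you with the service overall?",
              codes={1: "Very", 2: "Fairly", 3: "Not at all"},
              ask_if="q1 == 1")

The same exported information could, in principle, feed the data-capture program, the analysis package and a questionnaire library without being re-keyed.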
Similarly, sample design specifications, or information about data transformations or adjustments, can be captured automatically for subsequent use. Quality is maintained through such automatic capture; where additional manual input is needed, the quantity (the additional workload) is minimised and its relevance should be clear, so quality should remain high.
The idea of metadata-aware software is not new. Virtually all survey and statistical packages from SPSS onwards include a specific data dictionary (or codebook) which is used internally and can sometimes be accessed externally. In the early 1980s the World Fertility Survey was using processing facilities built around a data dictionary system (as were others), and a comprehensive description of such an integrated metadata-aware statistical processing system was presented by Jean-Pierre Kent[6] in an SSDBM paper of 1997.
Many statistical and related systems have internal components which involve explicit representation of metadata, and its use within the system to enhance and facilitate the user experience. There is no direct incentive for developers to invest effort in designing interfaces that expose this information to other applications, whether complementary or competitive. Such pressure as there is has come from users or independent developers (often academics), and the simplest response has been the design of an exchange file format, for which import and export facilities can easily be appended to existing software.
The problem with this approach is that different developers use different concepts within their systems, or interpret and implement the same concepts in different ways, with different details and assumptions. The designers of the triple-s interchange standard recognised this problem and chose to take a highest common factor (HCF) approach. This results in a minimalist standard that is very limited in scope but usable in full by all the (initial) participating systems. It achieves the objective of moving datasets between supporting systems, but it means the user may lose information that was present in the source system because it is not available, or is represented differently, in the target one.
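To make that trade-off concrete, the following sketch (illustrative Python, not the actual triple-s format) shows a richer source-system variable definition being reduced to a minimal common record on export:

# Illustrative only (not the actual triple-s syntax): exporting a richer
# source-system variable definition to a minimal common record.  Fields the
# common format cannot express are simply dropped.
rich_variable = {
    "name": "agegrp",
    "label": "Age group",
    "codes": {1: "18-34", 2: "35-54", 3: "55+"},
    "derivation": "recode of age into three bands",   # source-specific detail
    "missing_values": {-9: "Refused"},                 # source-specific detail
}

def export_minimal(variable: dict) -> dict:
    """Keep only the fields that every participating system supports."""
    return {key: variable[key] for key in ("name", "label", "codes")}

minimal = export_minimal(rich_variable)
# 'derivation' and 'missing_values' are lost in the exchange file: exactly
# the trade-off made by the highest common factor approach.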
The independent approach has been to try to define a separate ‘standard’ as a super-set of concepts to which all systems can subscribe. Where this is done as a union of facilities from an existing set of systems there is a danger of too much detail. This can lead to ‘gothic castles’, which contain huge amounts of intricate (and sometimes interesting) detail, but no obvious overall structure. The more favoured approach (which was supported by a number of European 4th and 5th Framework projects through Eurostat) is to perform a more generic analysis and to identify concepts and principles that need to be represented, then to develop a metadata design based on these, using object concepts of inheritance and specialisation to handle details.
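A small sketch of that more generic style, with class names invented for illustration rather than taken from any published standard, might look like this: the general concept is identified first, and inheritance and specialisation supply the detail.

# A sketch of the generic-analysis style: identify the general concept first,
# then use inheritance and specialisation for the detail.  Class names are
# invented for illustration, not taken from any published standard.
from abc import ABC

class Variable(ABC):
    """Generic concept: something measured or recorded about a unit."""
    def __init__(self, name: str, label: str):
        self.name = name
        self.label = label

class CategoricalVariable(Variable):
    """Specialisation adding a code list."""
    def __init__(self, name: str, label: str, codes: dict[int, str]):
        super().__init__(name, label)
        self.codes = codes

class NumericVariable(Variable):
    """Specialisation adding a unit of measurement and a valid range."""
    def __init__(self, name: str, label: str, unit: str,
                 minimum: float, maximum: float):
        super().__init__(name, label)
        self.unit = unit
        self.valid_range = (minimum, maximum)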
Most initiatives in this area (and not just within statistics) seem to concentrate on defining the data structure needed to store instances of metadata, usually as an XML document – hence the derogatory remark at the start of this paper. While defining a suitable structure is an essential part of the design process, it is only a step on the way to producing metadata-aware applications; and while XML is an essential tool for the exchange of metadata instances between independent applications, an XML Schema is not an adequate medium in which to specify processing functionality. Many proposals for standards do include some ideas for actual applications, but this is usually seen as a separate phase following the structural definitions, and in consequence very little gets done.
The purpose of metadata is to support other tasks, so the design of a metadata system should be related to the design of the applications which are metadata-aware. The functionality provided to support the use of the metadata is more important than the structure used to store it. This leads to the conclusion that metadata design should be seen as an instance of the more general systems design process, and should take advantage of the progress that has been made in that area.
A great deal of effort in the Computer Science domain in recent decades has gone into methodologies to support the development of application software. Most of this has been focussed on the Object-Oriented approach, and within statistics we have S (and R) as an example based on these ideas. A number of different methodologies (procedural specification methods for software development) grew from this base, followed in the late 1990s by a consolidation and standardisation effort that produced UML, the Unified Modelling Language[7]. UML is now in version 2.
UML is widely supported by tools based on the standard, from initial design through to implementation of applications. It covers all stages and aspects of system design, so that (at least in theory) it is possible to completely specify an application in UML, and then have that design automatically translated into an executable program in a suitable language. In that sense it is language-independent, though it is probably most frequently used with Java.
Designs in UML include data structures (as needed for internal use in applications as well as for exchange), but also cover the behavioural and procedural aspects of applications. Thus, when thinking about the information structures that are needed for a metadata application, it is also possible to specify the behaviour and processes that are expected to be associated with this information. UML specifically supports the idea of abstraction, so it is not necessary to immediately specify all the details of some process: it is sufficient to specify just the outline, with further details supplied later at an implementation stage.
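The same idea can be suggested in code (a rough analogue only; UML itself would express it as classes and operations within a model): the behavioural outline is part of the design, while the implementation detail is deferred to a later, concrete specialisation. The names below are invented for illustration.

# A rough code analogue of abstraction in a design: the behavioural outline
# is specified now, while the detail is supplied later by a specialisation.
from abc import ABC, abstractmethod

class MetadataItem(ABC):
    def __init__(self, identifier: str):
        self.identifier = identifier

    @abstractmethod
    def describe(self) -> str:
        """Every item can describe itself; how it does so is supplied later."""

class DatasetDescription(MetadataItem):
    def __init__(self, identifier: str, n_cases: int):
        super().__init__(identifier)
        self.n_cases = n_cases

    def describe(self) -> str:
        return f"Dataset {self.identifier} with {self.n_cases} cases"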
Metadata needs to be used, and this use takes place in metadata-aware applications. So the design of metadata systems should take place in the context of (at least some of) the applications which will make use of such a metadata resource.
UML is an ideal medium in which to specify the design of such metadata systems, as it allows behaviour and process functionality to be designed, alongside the data structures needed to represent the metadata resource.
A number of recent metadata proposals have been designed in this way, and demonstrate the feasibility of this approach: specifically the SDMX[8] (Statistical Data and Metadata Exchange) standard for the exchange of aggregate data between statistical authorities, and Version 3 of the DDI[9] (Data Documentation Initiative) standard for documenting data resources.
As we start to think more specifically about the tasks for which statistical metadata can be used, we also need to think more widely about the areas of statistical processing that should be covered.
Statistical metadata started with data documentation in codebooks, containing label information about variables and codes within a single dataset. Extensions to this that have been implemented include multiple datasets, aggregated data, sample design and responsibility links. Specifications for derived variables and complete questionnaire specifications have also been attempted.
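A minimal sketch of that classic codebook content, with field names invented for illustration, is simply per-variable label information plus code labels for a single dataset:

# A minimal sketch of classic codebook content for one dataset: per-variable
# labels plus code labels.  Field names are invented for illustration.
codebook = {
    "dataset": "survey2007",
    "variables": [
        {"name": "q1", "label": "Overall satisfaction",
         "codes": {1: "Very satisfied", 2: "Fairly satisfied", 3: "Not satisfied"}},
        {"name": "age", "label": "Age in completed years", "codes": {}},
    ],
}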
As part of the MetaNet[10] project, Froeschl and colleagues[11] produced the first version of UMAS, a Unified MetaInformation Architecture for Statistics, intended to cover all aspects of statistical processing. This contains many good ideas, but is an ambitious project that needs further work.
The Opus Project[12] also extends metadata well beyond data description, though with a more specific objective. It attempts to document statistical analysis processes and to represent the statistical models used in the analysis of data, and thus to provide a link back from conclusions and results based on statistical information (evidence) to the assumptions and data used.
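Purely as an illustration of the kind of link involved (this is not the Opus project's own representation), a result can be made to point back to the model it came from, and the model back to its assumptions and the dataset used:

# Purely illustrative (not the Opus project's own representation): a result
# points back to the model it came from, and the model points back to its
# assumptions and the dataset used.
from dataclasses import dataclass, field

@dataclass
class StatisticalModel:
    name: str
    dataset: str
    assumptions: list[str] = field(default_factory=list)

@dataclass
class Result:
    statement: str
    model: StatisticalModel

model = StatisticalModel(
    name="logistic regression of uptake on age and region",
    dataset="survey2007",
    assumptions=["independent observations", "linearity on the logit scale"],
)
evidence = Result("Uptake declines with age", model)
# From 'evidence' one can trace back through the model to its assumptions
# and the underlying data: the kind of link Opus seeks to provide.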
The full version of this paper includes both a discussion of recent developments in metadata standards and a basic presentation of the work in the Opus Project to represent statistical models as metadata. A fuller presentation of the Opus metadata work is in the paper from SSDBM.