Models and Metadata

Introduction

Models are abstractions from real-world situations, designed to support some particular context. With Statistical Metadata we are mostly concerned with software to support the processing and analysis of statistical information. Models provide the opportunity to specify how information can be shared between stages of a process (so that later stages can make use of information entered in earlier ones) and how information and specifications can be moved between independent applications. Because we are supporting the development and use of software, our models need to be detailed and precise in their specification of the structures and semantics of the information. However, a model also determines a conceptual framework for process designers and software users, so they must be able to view elements of, or generalisations from, the model with less detail than software developers need. Furthermore, when developing a model we need to work with domain and subject specialists to discover their needs and to help them agree on model components and structures. These specialists will probably need assistance to express their knowledge in forms, and with sufficient precision, suitable for use in the model, and will need help in understanding the model's representation of that knowledge, so that they can confirm that the model represents it correctly.

This web page is an extract from a full paper. Click here to download the full version of this paper, which includes a discussion of the content and roles of UML and XML, and of some of the practical issues of using models to support the development, implementation and use of systems using statistical metadata.

Statistical Metadata

For completeness, we reproduce here our definition of statistical metadata.

Statistical Metadata is any information that is needed by people or systems to make proper and correct use of the real statistical data, in terms of capturing, reading, processing, interpreting, analysing and presenting the information (or any other use). In other words, statistical metadata is anything that might influence or control the way in which the core information is used by people or software.

It extends from very specific technical information, used, for example, to ensure that a data file is read correctly, or that a category in a summary table has the right label, or that the design of a sample is correctly taken into account when computing a statistical summary, right through to descriptive (intentional) information, for example about why a question was worded in a particular way or why a particular selection criterion was used for a sample.

Thus, metadata includes (but is not limited to) population definitions, sample designs, file descriptions and database schemas, codebooks and classification structures, processing details, checks, transformations, weighting, fieldwork reports and notes, conceptual motivations, and table designs and layouts.
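
To make this concrete, the sketch below shows how one such item of metadata, a codebook entry describing a single variable, might be represented in software. It is a minimal illustration in Python; the class and field names are hypothetical, not drawn from the paper or from any standard.

    from dataclasses import dataclass, field

    # A codebook entry for one variable in a data file (illustrative only).
    @dataclass
    class CodebookEntry:
        name: str                # variable name as it appears in the file
        label: str               # human-readable description
        categories: dict = field(default_factory=dict)    # maps codes to category labels
        missing_codes: list = field(default_factory=list)  # codes to treat as missing

    entry = CodebookEntry(
        name="empstat",
        label="Employment status of respondent",
        categories={1: "Employed", 2: "Unemployed", 3: "Not in labour force"},
        missing_codes=[9],
    )

Software that reads the data file can then use such an entry to label values correctly and to exclude missing codes from analysis, which is exactly the kind of proper and correct use the definition above refers to.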

Acknowledgements

The ideas presented here have been expounded and discussed during the MetaNet project, particularly in work groups 1 and 2 and at the final conference. Of particular importance is work with Chris Nelson, of Dimension EDI, but the author takes full responsibility for all the ideas and opinions expressed here.

The Role of Modelling

Models are abstractions, designed to meet a particular need in a particular context. Thus the form and roles of models can be very different. A few examples may help to show the range.

Conceptual Models are an attempt to form a frame of reference for some domain or collection of constructs or concepts. These are often similar to classification structures, and such structures (for example the International Classification of Diseases – ICD – or NACE, the General Industrial Classification of Economic Activities within the European Communities) can be seen as conceptual models.

The Relational Database Model is a formal specification of the structures and behaviour for databases formed from sets of rectangular tables. This provides a conceptual framework for thinking about databases (one that is widely used) but is also sufficiently detailed and precise to be the basis for the implementation of many database software systems.

The Object Oriented approach is an alternative (more general) way of thinking about databases and program structures (an alternative paradigm), built using a different set of primitive constructs, assumptions and conventions.
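
As a toy contrast (illustrative only, not from the paper), the same piece of information, a category in a classification, can be held as a row in a rectangular table or as an object that also carries behaviour:

    # Relational view: a rectangular table, i.e. rows with fixed columns.
    categories_table = [
        # (code, label, parent_code)
        ("A",   "Agriculture, forestry and fishing", None),
        ("A01", "Crop and animal production",        "A"),
    ]

    # Object-oriented view: the same information as linked objects with behaviour.
    class Category:
        def __init__(self, code, label, parent=None):
            self.code = code
            self.label = label
            self.parent = parent

        def path(self):
            # Codes from the root of the classification down to this node.
            return (self.parent.path() if self.parent else []) + [self.code]

    a = Category("A", "Agriculture, forestry and fishing")
    a01 = Category("A01", "Crop and animal production", parent=a)
    assert a01.path() == ["A", "A01"]

The relational form is easy to store and query; the object form bundles structure with operations (here, walking up the hierarchy), which is the essence of the paradigm difference.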

The statistical Generalised Linear Model is a mathematical specification of the way in which a set of predictor variables influence a dependent variable, together with the form of the variability about that relationship. This model is very flexible and is widely used for estimating statistical relationships (using suitable software to calibrate the model to a particular data set), and for discussing the potential form of such relationships. Of course, there are many situations where the GLM is not an appropriate form of model.
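
In its standard formulation (not spelled out in the original text, but uncontroversial), the GLM relates the expected value of the dependent variable to a linear combination of the predictors through a link function g:

    \[
      g\bigl(\mathrm{E}[Y_i]\bigr) = \mathbf{x}_i^{\top}\boldsymbol{\beta},
      \qquad Y_i \sim \text{an exponential-family distribution}.
    \]

The identity link with Gaussian errors recovers ordinary linear regression, while the logit link with Bernoulli responses gives logistic regression.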

Structural Models concentrate on the objects and attributes that are used to represent information structures. This is necessary for the exchange of information between systems, but needs to be accompanied by clear specifications of the intended purpose and use of the various elements. Inconsistent interpretation by independent users or implementers working with such a structure is a continuing concern, unless some enforcement mechanism can be specified and implemented.
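
A small sketch may make the risk concrete (Python, hypothetical names): two systems can both conform to the same structural definition while disagreeing about what an attribute denotes, unless the semantics are stated and, where possible, enforced.

    from dataclasses import dataclass

    # The structure fixes names and types, but not meaning: is `weight`
    # a survey design weight, a frequency count, or a percentage?
    # The structure alone cannot say.
    @dataclass
    class VariableInfo:
        name: str
        weight: float

    # Pairing the structure with explicit semantics and a simple
    # enforcement mechanism narrows the room for misinterpretation.
    @dataclass
    class DesignWeight:
        """Multiplicative survey design weight; must be strictly positive."""
        value: float

        def __post_init__(self):
            if self.value <= 0:
                raise ValueError("design weights must be strictly positive")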

With statistical metadata we are looking for models that allow us to interchange information between processes and systems, and that provide a stable conceptual framework for users working with complex information structures across those processes and systems. We want to support users of statistical systems, support the automation of statistical processes, and exchange information between systems and processes.

We can have more than one model, focussing on different parts of the statistical process, but they should dovetail together when a wider picture is needed. And we should aim to get suitable models accepted as standards, agreed and used across the statistical domain.

Elements of Models

To these ends, we need models that provide formal specifications of components and relationships, avoiding misinterpretation. They must address both the structures and the semantics of the information, and the processes and behaviour that operate on it.

To construct models quickly and accurately we also need a modelling framework or workbench, which provides generic building blocks for model components and tools to support the design process.

Conclusions

UML is the ideal form to hold the master version of the design of models or systems related to statistical metadata. This is because it is able to represent precisely and in detail almost all aspects of the structures, processes, methods, constraints, relationships, interactions, etc. involved.

UML is too complex to be a natural form of communication except for technical specialists, so other forms of representation or documentation are needed to support the discovery of knowledge to feed into the model, and to support designers or users of related applications and systems. These are standard issues for User-Centred Design.

XML is the ideal format for the exchange of actual instances of statistical metadata between applications, because it is simple to transmit over standard protocols, and because it is well supported by manipulation APIs at the application level. However, it is not adequate merely to design an XML structure as the specification of a model, since this cannot include a rigorous specification of the semantics of the elements of the structure, nor anything about processes or behaviour. Instead, a suitable XML structure should be derived from the structural components of a full UML design.
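
As a minimal sketch (element names hypothetical, not taken from any standard schema), such an exchange instance might be built and serialised with a standard XML API, here Python's xml.etree.ElementTree:

    import xml.etree.ElementTree as ET

    # Build a metadata instance for one variable and serialise it for exchange.
    variable = ET.Element("variable", name="empstat")
    ET.SubElement(variable, "label").text = "Employment status of respondent"
    categories = ET.SubElement(variable, "categories")
    for code, label in [("1", "Employed"), ("2", "Unemployed")]:
        ET.SubElement(categories, "category", code=code).text = label

    print(ET.tostring(variable, encoding="unicode"))

Note that the XML fixes structure only: what "empstat" means, or how its category codes are to be interpreted, must come from the design the structure was derived from, not from the XML itself.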

Practical issues over the introduction and acceptance of new systems can be as important to an organisation as the detailed correctness of a model, so must enter into the overall development process.

Click here to download the full version of this paper.


