This page introduces various topics relating to the use of computers in statistical analysis and dissemination. In each case a longer article is available - click on the title. Much of this material was initially prepared for presentation at conferences, and links to the full versions and presentations are also provided where possible.
Metadata
The term metadata is used widely in the context of database and statistical systems, though there is not general agreement on exactly what it means. We take a very broad view:
Statistical Metadata is any information that is needed by people or systems to make proper and correct use of the real statistical data, in terms of capturing, reading, processing, interpreting, analysing and presenting the information (or any other use). In other words, statistical metadata is anything that might influence or control the way in which the core information is used by people or software.
It extends from very specific technical information, used, for example, to ensure that a data file is read correctly, or that a category in a summary table has the right label, or that the design of a sample is correctly taken into account when computing a statistical summary, right through to descriptive (intentional) information, for example about why a question was worded in a particular way or why a particular selection criterion was used for a sample.
Thus, metadata includes (but is not limited to) population definitions, sample designs, file descriptions and database schemas, codebooks and classification structures, processing details, checks, transformations, weighting, fieldwork reports and notes, conceptual motivations, table designs and layouts.
Metadata can be used informally by people who read it (and use it to affect the way they work with or interpret information), and formally by software to guide and control the way information is processed. Processes can also generate metadata.
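As a small sketch of this formal use of metadata by software, the following illustration (not drawn from any real system; all field names are invented) shows a metadata record driving both the parsing of a raw data file and the labelling of a coded category:

```python
import csv
import io

# Hypothetical metadata record for a tiny survey file: column names,
# types, and a codebook mapping stored codes to category labels.
METADATA = {
    "columns": [
        {"name": "region", "type": str},
        {"name": "employees", "type": int},
    ],
    "codebook": {
        "region": {"N": "North", "S": "South"},
    },
}

def read_with_metadata(raw: str, meta: dict) -> list[dict]:
    """Parse a CSV file using the metadata, decoding coded categories."""
    rows = []
    for record in csv.reader(io.StringIO(raw)):
        row = {}
        for col, value in zip(meta["columns"], record):
            value = col["type"](value)
            # Apply the codebook so users see labels, not raw codes.
            labels = meta["codebook"].get(col["name"])
            if labels is not None:
                value = labels[value]
            row[col["name"]] = value
        rows.append(row)
    return rows

data = read_with_metadata("N,120\nS,85\n", METADATA)
print(data)  # [{'region': 'North', 'employees': 120}, {'region': 'South', 'employees': 85}]
```

The same record could equally be read informally by a person; the point is that one description serves both uses.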
From the perspective of use, we can formulate a definition more focussed on the way in which the metadata supports the use of the underlying resource.
If I have created a dataset, the metadata for that dataset is the information that I need to give to you for you to be able to use the dataset correctly for your chosen purpose (or to realise that the dataset does not support that use).
More generally, metadata is the information that the owner of a resource needs to supply to potential users of that resource, so that they can use it correctly. Click here for a paper presented at the 2007 ASC conference, on developments in this view of Statistical Metadata.
Models

Models are abstractions from real-world situations, designed to support work in some particular context. With Statistical Metadata we are mostly concerned with software to support the processing and analysis of statistical information.
Models provide the opportunity to specify how information can be shared between stages of processes (so that later stages can make use of information entered in earlier ones) and how information and specifications can be moved between independent applications. Because we are supporting the development and use of software, our models need to be detailed and precise in their specification of the structures and semantics of the information.
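One way to move a specification between independent applications is to serialise it to a neutral format. The sketch below is purely illustrative (the survey name and field names are invented, not part of any model described here), showing a specification written by one process stage and recovered, unchanged, by another:

```python
import json

# Hypothetical specification produced by an early process stage
# (sample design), to be consumed by a later stage (estimation).
spec = {
    "survey": "example-survey",                          # invented name
    "sample_design": {"type": "stratified", "strata": ["region"]},
    "weighting": {"variable": "design_weight"},
}

# Serialise to a neutral text format so that an independent
# application can load it without sharing any code.
payload = json.dumps(spec, indent=2, sort_keys=True)

# A later stage, possibly in different software, recovers the
# same structure and can act on the earlier decisions.
loaded = json.loads(payload)
assert loaded == spec
print(loaded["sample_design"]["type"])  # stratified
```

A real exchange format would also need the precise structures and semantics discussed above; this only shows the mechanism.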
However, the model also determines a conceptual framework for process designers and software users, so these people must be able to view elements of the model, or generalisations from it, with less detail than is needed by software developers.
Furthermore, when developing a model we need to work with domain and subject specialists to discover their needs and to help them agree on model components and structures. These specialists will probably need assistance to express their knowledge with sufficient precision for use in the model, and will need help in understanding the model's representation of that knowledge, so that they can confirm that the model represents it correctly.
We collect data in order to increase our knowledge, but we always have some knowledge before we start. Our existing knowledge raises the questions for which we need more information, and it also guides us in deciding what further data to collect and how to collect it.
Models allow us to generalise from specific observed data to a wider situation. When we analyse data we (usually) update our knowledge. If we can find a formal representation for our knowledge, then a standard statistical technique provides a way to formalise the process of updating it. This can be the basis for the integration of multiple data sets that relate to different aspects of the same system. This approach is of general importance, but it is also the only way of developing a coherent and integrated understanding of complex systems that are too extensive to observe with a single data set.
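The text does not name the statistical technique, but Bayesian updating is one natural reading. The sketch below (purely illustrative, with made-up numbers) updates a Beta prior with two separate binomial data sets about the same underlying proportion, showing how knowledge accumulates across sources:

```python
def update_beta(alpha: float, beta: float, successes: int, failures: int):
    """Conjugate Bayesian update of a Beta(alpha, beta) prior
    given binomial data (successes, failures)."""
    return alpha + successes, beta + failures

# Prior knowledge before any data: Beta(1, 1), i.e. uniform.
alpha, beta = 1.0, 1.0

# Two independent data sets relating to the same quantity.
alpha, beta = update_beta(alpha, beta, successes=7, failures=3)
alpha, beta = update_beta(alpha, beta, successes=12, failures=8)

posterior_mean = alpha / (alpha + beta)
print(round(posterior_mean, 3))  # 0.625
```

The posterior after the first data set serves as the prior for the second, which is exactly the sense in which multiple data sets about one system can be integrated.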
But complex methodology is difficult to understand, so we must address the issues of convincing users from the application domain that our models are appropriate and valid, and of making the results obtained from the methodology accessible.
Click here to download the full paper (PDF, 337Kb) on this topic.
The Opus project has developed a Metadata Model (called StatModel) for the representation of statistical models, covering model specification, fitting processes and the use of results. The metadata model is designed to support applications which report on the provenance of results from statistical models, and also to support investigation of the reliability of results obtained from these models. It also has the potential (still to be refined) to support the exchange of models between fitting applications.
A paper about this methodology has been published by IEEE in the proceedings of the 2006 SSDBM conference. The presentation from that conference can be viewed here, and you can read the full paper here.
In the paper we present a general model of the information (metadata) needed to represent statistical modelling and its results. We discuss the design process for this model (in terms of both objectives and methodology), along with the resultant design. However, we recognise that it is not enough to design a static structure for the metadata; we also need functionality to support the end users of the statistical results in exploring their provenance and reliability. We have identified some generic functionality that is useful here, but also recognise that its presentation must be tailored to the domain of interest and the level of understanding of the user.
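StatModel's actual structures are defined in the Opus deliverables listed below; the following is only a loose sketch of the idea, with invented field names, showing how a record tying a model specification to its data and fitting process can answer a provenance query:

```python
from dataclasses import dataclass, field

@dataclass
class FittedModel:
    """Illustrative (invented) record of a fitted statistical model;
    the real StatModel structures are defined in the Opus deliverables."""
    specification: str              # e.g. a model formula
    dataset: str                    # identifier of the input data
    fitting_software: str           # which application produced the fit
    diagnostics: dict = field(default_factory=dict)

    def provenance(self) -> str:
        """A human-readable account of where a result came from."""
        return (f"Model '{self.specification}' fitted to '{self.dataset}' "
                f"using {self.fitting_software}")

m = FittedModel(
    specification="flow ~ population + distance",  # hypothetical formula
    dataset="LATS-2001",                           # hypothetical dataset id
    fitting_software="R 2.4 glm",                  # hypothetical fitter
    diagnostics={"deviance": 123.4},
)
print(m.provenance())
```

Keeping the diagnostics alongside the specification is what would let an application also report on the reliability of results, as described above.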
A poster about the application of this methodology to Transport Modelling was presented at the 2006 TRB meeting, and the corresponding paper was published in Transportation Research Record. Click here to view the poster.
The formal deliverables relating to Metadata from Opus are all available here:
D6.2 (3.5Mb): Implementation Report on Using Information from Statistical Models. This is the most complete description of StatModel, and contains all the material on which the SSDBM paper is based.
D6.1 (0.4Mb): Using Information from Statistical Models. This report elaborates the context in which the system was developed, and the objectives addressed. Much of this material also appears in D6.2.
D3.2 (0.4Mb): Specifications for the Extension of the LATS Database System for the Transport Domain. This report is not a specification for the inclusion of modelling methodology in statistical databases, but rather an anticipation of the work in WP6 on making the results of modelling available in a statistical database context.
D3.1 (1.2Mb): Proposals for Metadata for Generic Support of Statistical Modelling in Statistical Databases. This report discusses the objectives for metadata for models, together with initial ideas about the appropriate data structures; much of this latter part is superseded by D6.2.
Statistical Disclosure Control

Standard security considerations arise during the collection of information for statistical purposes, and statistical organisations need to ensure that information supplied by individual people or organisations is not revealed to unauthorised viewers. For this they expect to use the standard security mechanisms developed in other domains. However, the purpose of a statistical inquiry is to reveal information - to provide access, not to restrict it. The objective of Statistical Disclosure Control is to control the risk that information about specific individuals can be extracted from statistical summary results. This paper reviews the issues and outlines some of the solutions that are being used.
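As a minimal sketch of one widely used disclosure-control measure (threshold-based primary cell suppression; the paper covers a broader range of solutions), with an invented frequency table:

```python
def suppress_small_cells(table: dict, threshold: int = 3) -> dict:
    """Replace counts below the threshold with None, so that summary
    cells based on very few individuals are not published."""
    return {cell: (count if count >= threshold else None)
            for cell, count in table.items()}

# Invented counts of people by (area, employment status).
counts = {("North", "self-employed"): 2,
          ("North", "employee"): 57,
          ("South", "self-employed"): 14}

published = suppress_small_cells(counts)
print(published[("North", "self-employed")])  # None
print(published[("North", "employee")])       # 57
```

Primary suppression alone is not sufficient in practice: if marginal totals are also published, a suppressed cell may be recoverable by subtraction, so secondary suppression or other measures are usually needed as well.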
Web Site Design

This paper presents some design principles and requirements for a particular type of web site: one whose function is to support the sharing of information and ideas among a community with a common interest in the domain covered by the site, while at the same time encouraging others to explore the domain and learn from the site and from other members of the community. We assume that the primary interest is in the substantive material covered by the site, so the graphical and presentational aspects of the design are of secondary importance. However, this does not mean that presentation is unimportant, rather that it must be designed to focus on supporting the functionality of the site and delivering useful facilities for the users.
Describing this as a web site design issue somewhat misses the point. What we are really dealing with is the design of an application to facilitate the delivery and sharing of information and ideas, an application that happens to use the Web to provide accessibility to users. So it is more relevant to concentrate on issues of application design, with all that implies in terms of modelling structures and processes, and in terms of usability and human-centred design. The design and presentation of particular pages within the site is important only in so far as it serves the principal function of the application.
Relational Databases certainly have a place in the tool kit used for the construction of systems for processing survey and statistical data.
Demand for statistical products is increasing, and producers need to find new ways of operating the production process and meeting customer requirements. Particular problems apply to large, recurrent surveys such as those often run by National Statistical Offices (NSOs). Clients (whether other departments or external, paying customers) often complain about delays in the production of paper series, and about the inflexibility of the published information.
One approach is to base the whole production process (after the capture and cleaning of raw data) on a database of aggregated data, in which the maximum amount of detail is retained. In particular, standard classifications (such as product type, or area, or time period) would be retained with the maximum amount of detail consistent with statistical reliability, and only grouped into broader categories when used. It is essential that the database has a formally defined structure and is supported by properly structured meta-data.
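A minimal sketch of this "store detail, group on use" idea, with an invented area classification: cells are stored at the most detailed level, and the broader categories are produced only when a table is requested.

```python
from collections import defaultdict

# Aggregated database: counts stored at the most detailed area level
# (area codes and counts are invented for illustration).
detailed = {"N01": 40, "N02": 25, "S01": 60, "S02": 15}

# Grouping metadata, applied only at publication time.
broad_of = {"N01": "North", "N02": "North", "S01": "South", "S02": "South"}

def publish(cells: dict, grouping: dict) -> dict:
    """Roll detailed cells up into broader categories on demand."""
    out = defaultdict(int)
    for area, count in cells.items():
        out[grouping[area]] += count
    return dict(out)

print(publish(detailed, broad_of))  # {'North': 65, 'South': 75}
```

Because the detail is never discarded, a different grouping (a new regional classification, say) can be published later from the same stored cells, which is exactly the flexibility the paper argues for.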
This paper reviews the technologies required for the production and publication of large volumes of statistical information, such as that produced by National Statistical Offices (NSOs). We review both academic and commercial developments, and present these in the context of an analysis of the tools and structures needed.
Our ultimate objective is improvement in the accessibility and use of data from large or continuous surveys and inquiries. These are characterised as having a significant volume of data processing and output, often with complex processing to produce representative and comparable figures, but not great depth of analysis. Production will generally be in the form of large amounts of classified summary data, and analysis will be in the form of comment on trends perceived in headline summary figures. Some clients will extract small volumes of data of specific interest, while others will want to perform more detailed analysis.
We are looking for an approach, to be applied after initial data entry and cleaning, that provides support for these modes of production and use.
We conclude that useful progress has been made, but much remains to be done.
Page last updated 05 February 2008