Statistical RDBMS

The Role of Relational DataBase Systems in Survey and Statistical Processing

Relational Databases certainly have a place in the tool kit used for the construction of systems for processing survey and statistical data.

The relational model provides a useful way of thinking about data structure, whether or not a relational database is actually used.
Relational systems can handle more complex structures than most statistical systems, generally provide excellent tools for data entry and checking, and can readily be linked to other systems for analysis and presentation of results. Extensions can be implemented through programming.
Statistical systems generally are more highly developed in some areas of data management (for example, meta–data and missing values).

The Relational Model

The Relational Model is a logical specification of the functionality, behaviour and scope of a relational database. It gives us a formal conceptual framework for thinking about data and about what operations are possible on a database. It does not address issues about the storage and access methods that are provided by a particular implementation.

The model is defined in formal, mathematical terms. As a result, we can know what the behaviour should be of any software implementation of the model. This is more than just having a standard, since it defines the relationships between different functional components of systems, and provides users with a firm basis on which to build expertise and intuition. The model requires that specific functionality be provided by the implementation, but does not specify the physical storage and access methods to be provided. The choice of those methods is very important but is not dictated by the model.

The relational model provides both more flexibility and more formality for data structure than is found in most statistical systems, and a good relational database system provides an efficient ways to store, manipulate and retrieve data.

The relational model is not rich with features for statistical applications: additional data types (domains and semantics), data structures and operators could all provide much better support for statistical uses. Many of these requirements can be implemented for specific projects within an database system through programming, but the provision of additional generic features aimed at statistical uses is unlikely to happen with the current generation of products.

Relational Databases

A Relational DataBase Management System (RDBMS) is a database which implements the ideas of the relational model. All serious RDBMS products use the standard language SQL at a fundamental level for data access and manipulation. It can be used directly by people or (more usefully) by programs which need to access the database. In addition to efficient storage and retrieval, an RDBMS provides facilities to ensure the security and integrity of the data.

A great strength of relational systems is their ability to handle complex structures (such as hierarchies and networks). Data in different records can be linked together as required (through Views), and in different and unanticipated ways if necessary. In contrast, most statistical systems require any such linking to be performed either at the design stage (requiring compromises and restricting flexibility), or through physical file linking, which is inefficient and can cause problems for data entry.

It is only relatively recently that good RDBMS software has become available on PCs (with Access and Paradox), but now the advantages of relational database technology, once reserved for mini and mainframe users, can be exploited at any scale of processing.

As a very rough generalisation, the older larger RDBMS products (Oracle, Ingres, ...) provide the robust, secure and efficient data management systems which are needed for larger projects, and run well on larger systems (under Windows NT, Unix, etc.). The newer PC products (Access, Paradox, ...) running under MS Windows provide excellent facilities for screen, query and report design (supported by event-driven programming) which enable the construction of attractive and easy to use front end environments. Their database engines (the RDBMS part) are also quite adequate for running less demanding projects on well-specified PCs.

Different practical situations have different requirements and can get very different performance from different implementations. Similarly, poor choices in the physical design of a database can lead to poor performance whatever the capabilities of the RDBMS.

Client–Server Computing

The relational model explicitly supports the idea of Client-Server computing. With database systems this means that the programs which use the database (the front-end clients) can be separate from the server (the RDBMS) which handles access to and management of the data (security, storage, insertion, updating, retrieval, integrity, ...) .

The client and server can run on the same computer, but will more usually be on different computers (connected across a network), each chosen for its suitability for its specific task. SQL is the language used for communication between the client and server, and the connection mechanism has recently been simplified by the development of connection standards such as ODBC (Open DataBase Connectivity).

With client-server structures, users on PCs can use standard packages and tools to access data from a remote RDBMS as though it were stored locally. Some of the larger statistical packages (notably SAS and SPSS) now come with transparent RDBMS access mechanisms built in, while a standard facility in MS Windows (ODBC) means that data access can be available in all programs. Thus solutions can be constructed which link database and statistical systems to combine the strengths of each.

Statistical Requirements

Relational systems can provide useful facilities for appropriate statistical requirements, but are not a panacea. The model does not address every issue of importance with statistical data, and some systems can be inefficient for some tasks. With care these problems can usually be overcome with suitable design and implementation of structure and processing.

The relational model explicitly includes the concept of Null, which can be used to represent missing data. It is not enough where more than one type of missing value is needed (missing, refused, inapplicable, ...), though additional concepts can be introduced through programming.

With many projects the handling of meta-data (information about the data rather than about the data subjects) is important. Statistical packages provide facilities which are useful, if pragmatic and inflexible. Most relational systems do not provide much direct support for meta-data beyond that needed by the RDBMS itself, but through programming it is possible to make statistical meta-data explicit in a relational database, so that it can be updated if necessary and extracted when required for use in other contexts. The scope of the meta-data can be as basic or as comprehensive as required, and could include (for example) scales, codes and labels used for categorical data, general descriptions, definitions and nomenclature, even sample design and data quality information.

For some applications the omission of the concept of time is important. Dates and times can be stored and manipulated, but there is no direct support for ideas such as the next or previous occurrence of an event, nor for linking events of the same or different types through their time ordering. Again, a lot can be achieved for particular projects through programming.

Summary

Relational systems are a good solution for many statistical processing systems, particularly where the requirements are more complex than can be handled easily in standard statistical packages. They can provide excellent flexibility, both for data structures and for the implementation of data handling processes, and they can be integrated with statistical packages for analysis and presentation.

Further Reading

Date, C. J. An Introduction to DataBase Systems (7th Edition), Volume 1. Addison Wesley, ISBN: 0-201-68419-5.

Date, C. J. And Darwen, H. A Guide to the SQL Standard (4th Edition). Addison Wesley.

Westlake, A. J. (Ed.). Relational Databases. ASC, ISBN: 0 9521682 0 0.

Page last updated 08 October 2003.