Abstract

Standard security considerations arise during the collection of information for statistical purposes, and statistical organisations need to ensure that information supplied by individual people or organisations is not revealed to unauthorised view. For this they expect to use the standard security mechanisms developed in other domains. However, the purpose of a statistical inquiry is to reveal information – to provide access, not restrict it. The objective of Statistical Disclosure Control is to control the risk that information about specific individuals can be extracted from amongst statistical summary results. This paper reviews the issues and outlines some of the solutions that are being used.

This is a draft (not yet completed) based on a presentation to the IBM Almaden Institute Symposium on Privacy, April 2003. Click here to download the Word file (160 Kb) or here for the PowerPoint presentation (155 Kb).

Contents

1      Introduction 2

2      Statistical Confidentiality 2

2.1   Promise to Respondent 2

2.2   Action to preserve confidentiality 2

2.3   Release of Statistical Information 2

2.4   Disclosure Control 2

3      Generic Security 2

4      SDC Approaches and Problems 3

4.1   Release of Summary Tables 3

4.2   Release of anonymised individual records 3

4.3   Release of Detailed Tables 4

4.4   Controlled Tabulation on Demand 4

4.5   Access Control and Trust 4

5      Disclosure Risk and Cell Suppression in Summary Tables 4

5.1   Sensitivity 4

5.2   The Three-Firm Problem   5

5.3   Cell Suppression 5

5.4   Re-estimation and Feasible Ranges 5

6      Anonymised data records 6

7      Multiple Summary Tables 6

8      Tabulation On-Demand 7

9      Current Projects 7

10    Conclusions 7

10.1 Current Advice 7

1 Introduction

Statistical dissemination is all about releasing information. However, the released information is supposed to enlighten the reader about the average behavior of groups, not the particular details of specific individuals.

The problem of maintaining confidentiality of individual responses has been studied for several decades, and an extensive literature exists. In this paper we attempt to identify the main themes and approaches that have been used. We do not claim to be experts in the field, but have maintained a watching brief of the area, and report our interpretation of the current situation. Pointers to further and more detailed reading can be found in the References section.

2 Statistical Confidentiality

2.1 Promise to Respondent

Whether the respondent is a person, business or other organisation, an honest data collector will give a promise that the individual information provided will be kept confidential. Sometimes this promise is enshrined in law, as with the census. Procedures are then needed to fulfil this promise, and to be able to demonstrate that it has been fulfilled.

In passing, it is worth noting that this promise can affect the quality of responses, since fear of disclosure may inhibit the respondent from responding truthfully (or at all) to sensitive questions.

2.2 Action to preserve confidentiality

Different types of disclosure risk arise at different stages in the statistical life cycle. Probably the greatest risk of compromise is in transmission and processing of the original information. However, at this stage the requirement is not one that is specific to statistical information, so more standard security systems can be used to protect the source data.

2.3 Release of Statistical Information

The purpose of a statistical inquiry, or of statistical reporting from an administrative system, is to release information about the characteristics of the system. This is usually done by fitting models and estimating parameters of those models, or by reporting aggregate information for all or sub-groups of the respondents.

2.4 Disclosure Control

When statistical results are based on very small numbers of respondents it can be possible to infer or extract information about an individual from aggregated results. This is a particular problem when the intruder has auxiliary information about a targeted respondent or about other respondents in the same subgroup. Discussion of the approaches to this problem form the main part of this document.

3 Generic Security

All systems that store and process sensitive information need appropriate security mechanisms. It is not our intention to discuss these mechanisms in this report, but no statistical system can be considered to have and adequate approach to the preservation of confidentiality if it does not address this area.

Examples of security mechanisms include:

Encryption of records, particularly when being transmitted via potentially insecure channels. An example would be the transmission of responses over the internet from a respondent browser to a central data store.
Protection of the data store, including access control to ensure that only those with suitable authority can view and/or modify the stored information.
Working practices that ensure that unauthorised staff cannot view sensitive information through watching authorised users, or by accessing logged-in but unattended work stations.
Concealment of identification fields when they are not needed for the current process, such as when checking for consistency, or when transferring cleaned records to a statistical agency for consolidation and analysis.

These are all standard security practices, not specific to statistical information, and statistical systems should make use of best (or at least good) practice. The necessary degree of investment in such systems will depend on the actual sensitivity of the information being collected and stored.

Particular issues arise with complex record structures, or with record matching. The use of consistently applied non-informative identifiers (which may be allocated or computed by encryption) allows records to be linked without revealing the identity of the respondent. Thus it may be reasonable to release hospital identity numbers of patients to outside statistical organisations who may need to link records, but do not have access to the internal hospital information system.

Where different organisations pool information about the same patients, it may be appropriate to release partially identifying identifiers. An example might be the pooling of information about HIV and AIDS tests at a statistical organisation. It would not be reasonable to release full names of patients. A reasonable strategy would be to release Soundex codes (which are based on the surname), together with dates of birth. Since both name and date of birth are not unique and are often reported with error, this does not provide fool-proof matching, but it does provide a high probability that most matches are correct.

4 SDC Approaches and Problems

In this section we review the various approaches to statistical disclosure control – these different approaches are elaborated in later sections.

4.1 Release of Summary Tables

Historically, this is the most common situation. Statistical results, whether standard series of figures or supporting results for commentary and conclusions, are reported in tabular structures, nowadays with machine-readable files to accompany the paper versions. Because of constraints of space, the level of detail is generally low, with relatively low numbers of cells, and high numbers of respondents summarised in each. Such tables are generally handled singly, and are relatively easy to protect. A common approach is to suppress figures where the number of contributions is too low, or where a single respondent dominates the overall result.

Note that sensitivity is related to more population numbers in small groups, than to the sample count in a group. Sampling gives some additional protection, since you cannot be certain whether a particular individual is present within the dataset. Not available for census or register information (of peoples or organisations).

4.2 Release of anonymised individual records

Summary tables do not provide particularly wide scope for further analysis and investigation of the statistical data – formally, it is only possible to perform analyses for which the summaries include the sufficient statistics. As a result there is considerable demand from researchers for the release of individual data records. It is clearly not possible to release the actual source records, so some transformed version is released. An example is the Sample of Anonymised Records released from many censuses, including UK and USA.

The principle adopted is to remove direct identifiers and reduce detail in identifying measurements until individuals cannot be distinguished. The extent to which this is depends on the size of dataset and number and detail of attributes. Software is available to perform various approaches to this transformation. Unfortunately, this can result in a significant reduction in the amount of information actually provided by the released records. An alternative approach is to introduce uncertainty into the identification, by randomisation of responses.

4.3 Release of Detailed Tables

An alternative approach to the provision of more information is to construct a set of summary tables, designed to release as much detail as possible. In particular we can have multiple tables with the same dimensions but different levels of detail.

4.4 Controlled Tabulation on Demand

There have been various proposals to provide access to statistical information through a tabulation engine that has access to the underlying data records, but with the automatic application of disclosure control screening to the results, with appropriate suppression if needed. This turns out to be a very hard problem.

4.5 Access Control and Trust

Not all statistical disclosure is to completely unknown users, and, where users can be positively identified, undertakings to respect confidentiality may be sufficient. An obvious example is for internal work on sensitive information. This approach is often used by archives and releasers of information for secondary analysis. For example, most of the UK organisations that fund social research require that researchers deposit their results at the Data Archive at Essex University. The data owners can set conditions on who the information can be released to and what undertakings as to use must be given by the recipients, and the Archive then has the responsibility to manage the release process.

It is perhaps difficult to monitor this approach in a wide context – if disclosure does occur, can it be traced back to the disclosure point? And even if data users are trustworthy, are they sufficiently equipped to perform their undertaking to protect confidentiality

5 Disclosure Risk and Cell Suppression in Summary Tables

5.1 Sensitivity

A Summary Table has Dimensions which are Classifications (cf OLAP) and together define a lattice of Cells. The content of cells are Measures, counts, sums, means, etc.

What is Sensitive (or Risky)? Many proposals, but no agreement – not a resolved problem

N-k rule

‘Cell is Sensitive if top N respondents contribute at least k% of cell total’

This rule is widely used, but it measures Dominance, not Sensitivity and is thus disliked by theoreticians.

Sensitivity is difficult to define (it is a measure, not a state). Ultimately the objective is to quantify disclosure risk, but this depends on the ancillary knowledge of the intruder.

5.2 The Three-Firm Problem

Archetypical problem: Consider a statistical table, broken down by area and industry sector, giving number of businesses and total turnover. Assume this is based on a register, not a sample, so there is no protection through selection probabilities. Each business knows its own response, so may be able to use this additional information to infer information about its competitors.

Sufficient to consider three contributors – fewer than 3 is always sensitive.
For more than three, consider (at least one) group as an aggregate.

Can the second largest company estimate the size of the biggest?

Middle size has most information about others. Can be a coalition.
If sensitivity is a proportional concept, the estimation of the largest value will always be the most sensitive.
If Firm2 = 30%
Firm1 in range 40 – 70, not sensitive
If Firm2 = 10%
Firm1 in range 80 – 90, probably sensitive
Sensitivity depends on the pattern of values, not the absolute level. Few contributions make sensitivity more likely, but not necessary.
Proportional disclosure by the small party can magnify disclosure about the large one.

5.3 Cell Suppression

Not sufficient to hide a single (primary) sensitive cell – because we can re-estimate the value from the margins and the other cells. Two approaches.

Reduce detail in some classifications (so that more responses contribute to cells). Not always possible, as some classifications have fixed structure, some may not have natural re-groups for the sensitive cells, and sometimes the minimal loss of information (particularly from other cells) is too great.
Suppress additional (secondary) cells, at least two in every group for which an overall total (or equivalent) that includes the primary cell is available.

This is a complex optimisation problem, requiring a suitable objective function. Interesting (and complex) extension when cooperating intruders have information about different suppressed cells.

Various proposals for a measure for information loss, such as the number of cells suppressed, or the total of suppressed values. Better to use something related to the number of items of information needed to reconstruct the original table. Interesting discussion in [RoEt02].

5.4 Re-estimation and Feasible Ranges

Given a table with some suppressed cells and their marginal totals, it is possible to estimate the range of values for each suppressed cell that is consistent with the margins. Simple for 2x2 case, but more complex for more cells and more dimensions. Interesting discussion of some cases in [Cox02].

Strong suggestion from some authors that these ranges should be published (since clever reader can compute them) rather than completely suppressing the cells.

Problem: bounds can be very narrow, perhaps sufficient to constitute sensitivity, so should contribute to determination of secondary suppression set.

100	5	105
3	2	5
103	7	110

103 - 98	2 - 7	105
0 - 5	0 - 5	5
103	7	110

Related problem: multidimensional summary hyper cubes (OLAP cubes) with classification hierarchies on (some) dimensions. This releases all possible margins (more than usually published), so more information to refine the estimated ranges. If some cells are sensitive at the most detailed level, then even if no margins are sensitive in themselves, then may be sufficient to reconstruct sensitive ranges for the cells, so some suppression would be needed in the margins. Extension of previous problem – see [GiRe02].

6 Anonymised data records

Related problem, not exactly the same. Big demand for release of actual records, since it does not limit the type of statistical investigation (or model fitting) that can be done.

By releasing records you release all possible tabulations. Issue is identification of record – gives access to all values. Intruder can use knowledge of some values for an individual to find that individual and get access to other (sensitive) values. Identification risk depends on number of respondents in sub-groups. Extreme values are risky because they characterise small groups.

Three general approaches to protection.

Reduce detail in classifications. Ie increase the size of the smaller groups. Tends to result in the loss of a great deal of useful information.
Model-based imputation (perturbation). Fit a model to the data, then replace sensitive (or all) variables in risky (or all) response records with estimated or sampled information from the model. Includes just adding noise. Suppresses (or dilutes) any relationship not within the model. [PFS02]
Data Swapping. Choose set of swap variables (probably involved in identification), and randomly select (perhaps based on risk) pairs of response records, in which the swap variable values are then exchanged (as a set). All relationships within the swapped set, and all within the non-swapped set are retained. Relationships between swapped and other variables are diluted by a (possibly known) amount. May need rules to decide on swapping set (ie cannot scramble gender-specific values). [GKS03]

Lots of work, but no consensus on measuring disclosure risk, loss of information or achieved protection (eg [Elli02], and others in [Ferr02]),. Measurement of protection is very dependent on the assumed model.

7 Multiple Summary Tables

Common proposal is to release sets of partially detailed summary information in hierarchically structured (OLAP) cubes. This gives the user the flexibility to select or re-aggregate the information in ways that are appropriate to their use. Particularly applies to classifications that are sensitive when provided together in detail, but not when used with coarse groups. Examples are age, location and occupation. All three in detail is very risky, and perhaps any two. But any one in detail with the others in broad groups has low risk. So suggestion is to release (sets of) three summary tables, with the same variables but each having one of the sensitive variables summarised to detailed groups. User can then recombine the tables to explore models across the three variables. Detailed interactions cannot be detected, but this would usually provide more information about the relationships than protected micro data, because the producer has limited the set of tabulations (and so the degree of intrusion) that is possible.

However, can also use the sets of tables to determine feasible bounds, which may be sensitive. So necessary to apply SDC methods to the complete set of released tables, not just the tables individually. Seen this way, this is just (!) a heavyweight extension of the previous table protection problems.

8 Tabulation On-Demand

Don’t release the micro data directly, but allow users to request tabulations that are computed. Apply SDC methods to the tables before releasing them to the user. Problem of SDC over multiple tables (as before). Also (potentially more serious) problem of differencing between tables. Requires complete history of all tabulation requests, potentially by all users (since intruders may collaborate).

Basically infeasible to do the SDC on the fly, so have to use protected data, so no protection needed for requests, which limits the detail. When randomisation is used this must not be dynamic (ie redone for each request) or else the intruder can statistically smooth the perturbations.

9 Current Projects

Various tool developments within statistical offices.

EU supported project CASC (Computational Aspects of Statistical Confidentiality), taking forward the activities of the previous SDC project. Continuing the development of Argus software, directed by Statistics Netherlands (both micro and macro data).

Dissemination of SDC ideas included in AMRADS dissemination project, and planned for NIPS.

10 Conclusions

Statisticians need standard security procedures and processes to ensure security of basic information.

Additional problems for the release of statistical information while still controlling risk of disclosure about individuals.

Long history of work. Advances being made, but still no really clear solutions. Some methods very expensive and some problems very difficult.

Often based on heuristic approaches, not particularly strong on risk or evaluation measures. How do we know whether the methods are effective in practice?

10.1 Current Advice

Do the SDC up-front on the data and tables.
Provide on-line tabulation service without disclosure control, but based on anonymised records (perhaps use μ-Argus software).
Also provide sets of summary tables with more detail in interesting dimensions, used through same tabulation interface.
Provide a system for users to request additional data release, which is then judged for disclosure relative to that already released.

References

[Cox02] L. Cox. Bounds on Entries in 3-dimensional Contingency Tables Subject to Given Marginal Totals (2002). In [Ferr02].

[Elli02] M. Elliot. Integrating File and Record Level Disclosure Risk Assessment (2002). In [Ferr02].

[Ferr02] J. Domingo-Ferrer (Ed). Inference Control in Statistical Databases. Springer Verlag – ISBN 3-540-43614-6 (2002)

[GiRe02] S. Giessing & D. Repsilber. Tools and Strategies to Protect Multiple Tables with the GHQUAR Cell Suppression Engine (2002). In [Ferr02].

[GKS03] S. Gomatam, A. Karr & A. Sanil. A Risk-Utility Framework for Data Swapping (2003). Proceedings of Interface 2003, forthcoming.

[Hund02] A. Hundepool. The CASC Project (2002). In [Ferr02].

[McL86] M. McLeish. Prior Knowledge and the Security of a Dynamic Statistical Database. In Proceedings of 3^rd International Workshop on Statistical and Scientific Database Management. Eurostat. (1986).

[PFS02] S. Polettini, L. Franconi & J. Stander. Model Based Disclosure Protection (2002). In [Ferr02].

[RoEt02] D. Robertson & R. Ethier. Cell Suppression: Experience and Theory (2002). In [Ferr02].

[Will01] L. Willenborg & T. de Waal. Elements of Statistical Disclosure Control. Springer Verlag (2001).

[CASC] Computational Aspects of Statistical Confidentiality. http://neon.vb.cbs.nl/casc

[AMRADS] Accompanying Measure to Research and Development in Official Statistics. http://amrads.jrc.cec.eu.int

[NIPS] Network for the Improvement of Public Statistics. http://www.publicstatistics.net