Privacy Protection through Computational Workflows

Research

Open Questions and Requirements

In our research we consider the following open questions and requirements, which we derived from insights gained from use cases studied in conjunction with our prototype implementation.

A Usage-Oriented Policy Language

A language for representing privacy policies for workflows needs to be developed, together with a semantics for reasoning about it. The language needs to cover a variety of aspects of private information and privacy-relevant algorithms, and to support novel types of privacy policies, such as:
  • Algorithmic policies, to specify what kinds of data analysis algorithms are allowed. These could be allowed or disallowed for specified data types, for specific data sources, or in a data-independent manner. For example, group detection algorithms could be disallowed for use with medical data sources. Another example would be to disallow the use of group detection followed by event detection algorithms unless the accuracy of the data sources is above a certain level. Such a policy could prevent individuals from being positively identified as threats with an accuracy so low that it endangers their liberties. Algorithmic policies may be contingent on properties of intermediate data products. They may also express that certain steps have to be performed before a result is stored or before data is transmitted over an unsecured network. Expressing and reasoning about these types of policies may build on Linear Temporal Logic, which has proved useful in other areas of computer science, most notably software verification and, more recently, automated planning.
  • Query-based policies, to specify what kinds of questions the system is allowed to act upon. These include both user-issued queries and system-generated intermediate sub-queries. For example, queries regarding payments may be allowed to access any kind of source, including medical and financial sources, while sub-queries regarding the nature or details of patient treatment may be disallowed.
  • Data integration policies, to specify at the workflow level whether diverse data sources may be integrated through data mining steps. These would essentially control which joins of workflow strands are permissible.
  • Data creation policies, to specify what kinds of data may be created by the workflow. This could be specified via attribute types, entity types, or specific values.
  • Provenance policies, to specify what information needs to be recorded and for how long it needs to be kept. These would reflect privacy needs for auditing and the statute of limitations for such requirements. Without these policies, there is no limit to the amount of detail that a system could be expected to provide well after a workflow is used, so it is best to state these expectations up front.
These policies augment and are complementary to access policies for specific data sources or services in the system; the sketch below illustrates how a few of them might be encoded.
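
As a minimal illustration, the following sketch encodes three of these policy types as Python objects. All class names, field names, and the shape of the checks are assumptions made here for illustration; an actual policy language would need a formal syntax and semantics, and the sequencing check merely approximates the kind of constraint an LTL formula would express.

```python
from dataclasses import dataclass, field

# Hypothetical policy objects; all names are illustrative only.

@dataclass
class AlgorithmicPolicy:
    """Disallow an algorithm, either in general or for given data types."""
    algorithm: str                                   # e.g. "group_detection"
    restricted_data_types: set = field(default_factory=set)  # empty = always banned

    def allows(self, algorithm: str, data_types: set) -> bool:
        if algorithm != self.algorithm:
            return True
        if not self.restricted_data_types:           # data-independent ban
            return False
        return not (data_types & self.restricted_data_types)

@dataclass
class SequencePolicy:
    """LTL-flavoured policy: step `first` must never be followed by `second`
    unless the accuracy of the data sources is above `min_accuracy`."""
    first: str
    second: str
    min_accuracy: float

    def allows(self, steps: list, source_accuracy: float) -> bool:
        if source_accuracy >= self.min_accuracy:
            return True
        return not any(self.second in steps[i + 1:]
                       for i, s in enumerate(steps) if s == self.first)

@dataclass
class DataIntegrationPolicy:
    """Forbid joining workflow strands over incompatible source categories."""
    incompatible: frozenset                          # e.g. {"medical", "financial"}

    def allows(self, joined_sources: set) -> bool:
        return not self.incompatible <= joined_sources

# Group detection is disallowed on medical data sources:
p = AlgorithmicPolicy("group_detection", {"medical"})
print(p.allows("group_detection", {"medical"}))      # False
print(p.allows("group_detection", {"financial"}))    # True

# Group detection followed by event detection requires accurate sources:
q = SequencePolicy("group_detection", "event_detection", min_accuracy=0.9)
print(q.allows(["group_detection", "event_detection"], source_accuracy=0.5))  # False
```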

Extending Workflow Systems

Given this language, existing workflow systems would need to be extended in the following three ways.
  1. Workflow creation and execution subsystems need to be extended. The workflow creation process, which selects the data mining processes and data sources used to answer a query or line of inquiry, needs to be governed by privacy policies that constrain the choices of data sources and algorithms. The extended workflow system should exercise full control over the design of the end-to-end data mining process before any computation occurs. The execution system needs to enforce privacy constraints regarding where data is analyzed, and to enforce aspects that can only be evaluated during execution itself. For example, a privacy policy may state that if the output of a clustering algorithm contains a cluster with fewer than k individuals, then the analysis is not allowed (see the first sketch after this list). Generally, the fidelity of the models of the components involved will not be high enough to predict such situations ahead of execution.
  2. Workflow systems need to leave detailed provenance trails of how data was processed and what mechanisms were used to ensure that the workflow complied with privacy policies, both in its design and in its execution. This supports transparency and accountability regarding violations of policies on the use of data. Re-execution of workflows from their provenance trails could be used to prove, during an audit, that a given result was obtained as advertised (see the second sketch after this list).
  3. Workflow systems should support a distributed architecture for the storage and retrieval of policy information, since privacy requirements may enter the system in several ways and need to be associated with different entities (see the third sketch after this list). Some privacy policies should be attached to data when it is collected. Others would be associated with collections or types of data (e.g., all the data collected by a clinical trial). Yet others may be application- or system-specific (e.g., federal or state privacy laws that apply).
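
A minimal sketch of the execution-time check from item 1, with invented names: since the contents of a clustering result generally cannot be predicted at design time, the runtime inspects the actual output and halts the step when any cluster falls below the threshold k.

```python
from collections import Counter

class PrivacyViolation(Exception):
    """Raised when an execution-time privacy check fails."""

def enforce_min_cluster_size(cluster_labels, k):
    """Execution-time policy check: every cluster produced by the
    clustering step must contain at least k individuals."""
    sizes = Counter(cluster_labels)
    too_small = {c: n for c, n in sizes.items() if n < k}
    if too_small:
        raise PrivacyViolation(
            f"clusters below the k={k} threshold: {too_small}; analysis disallowed")
    return cluster_labels

# Hypothetical run: cluster labels assigned to 8 individuals.
labels = ["A", "A", "A", "B", "B", "B", "B", "C"]   # cluster C has 1 member
try:
    enforce_min_cluster_size(labels, k=3)
except PrivacyViolation as err:
    print("workflow step halted:", err)
```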
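For item 2, one way an engine might record a provenance trail is to log, for each step, its inputs, a digest of its outputs, and the policy checks that were applied, so that an auditor can later re-execute the workflow and verify the result. The record format below is purely illustrative.

```python
import hashlib
import json
import time

def record_step(trail, step, inputs, outputs, policy_checks):
    """Append one provenance record. Hashing the outputs lets an auditor
    re-execute the step later and check that the result is as advertised."""
    digest = hashlib.sha256(json.dumps(outputs, sort_keys=True).encode()).hexdigest()
    trail.append({
        "step": step,
        "timestamp": time.time(),
        "inputs": inputs,
        "output_digest": digest,
        "policy_checks": policy_checks,
    })

trail = []
record_step(trail, "anonymize", ["patients.csv"], ["patients_anon.csv"],
            ["suppress_identifiers"])
record_step(trail, "cluster", ["patients_anon.csv"], ["clusters.json"],
            ["min_cluster_size(k=3)"])
print(json.dumps(trail, indent=2))
```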
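For item 3, a sketch of how the policies governing one data item might be resolved by combining the rules attached at the item, collection, and system levels; the registry layout and policy names are assumptions.

```python
# Hypothetical layered registry: policies can be attached to a specific
# data item, to a collection or type of data, or to the whole system.
ITEM_POLICIES = {"trial42/patient-007": ["no_reidentification"]}
COLLECTION_POLICIES = {"clinical_trial_42": ["consented_uses_only"]}
SYSTEM_POLICIES = ["state_privacy_law"]          # e.g. applicable legislation

def applicable_policies(item_id, collection_id):
    """Resolve all rules that govern one data item, across all levels."""
    return (ITEM_POLICIES.get(item_id, [])
            + COLLECTION_POLICIES.get(collection_id, [])
            + SYSTEM_POLICIES)

print(applicable_policies("trial42/patient-007", "clinical_trial_42"))
# ['no_reidentification', 'consented_uses_only', 'state_privacy_law']
```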
An important open issue is the trade-off between privacy and result quality. Many privacy-preserving operations abstract away information from the data, which leads to less accurate results. Data descriptions and algorithm models will have to be extended to represent the relative accuracy of algorithms based on the abstraction of data features.
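
As a toy illustration of this trade-off, suppose an algorithm model is annotated with an estimated accuracy for each level of abstraction applied to its inputs; a planner could then pick the least abstraction that still satisfies the privacy requirement. All numbers and names below are invented.

```python
# Toy model: each abstraction level generalizes more data features,
# gaining privacy (a larger guaranteed k) but losing estimated accuracy.
ACCURACY_BY_ABSTRACTION = {0: 0.95, 1: 0.88, 2: 0.74, 3: 0.55}
MIN_K_BY_ABSTRACTION = {0: 1, 1: 5, 2: 25, 3: 100}

def best_abstraction(required_k):
    """Pick the lowest abstraction level that meets the privacy
    requirement, i.e. the one that sacrifices the least accuracy."""
    feasible = [lvl for lvl, k in MIN_K_BY_ABSTRACTION.items() if k >= required_k]
    if not feasible:
        return None          # no workflow variant satisfies the policy
    level = min(feasible)
    return level, ACCURACY_BY_ABSTRACTION[level]

print(best_abstraction(required_k=25))   # (2, 0.74): more privacy, less accuracy
```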

Reasoning about Privacy and Privacy Policies

An important open question is the negotiation of policies. Mechanisms need to be developed that support arguing a "need to know" in order to relax privacy requirements when necessary. When the privacy policies are too constraining for the system to find a solution to a query, it is possible to explore relaxations of some subset of policies that would enable the original request to be fulfilled. By articulating the choices that the system rejected and the privacy policies that forbade those analyses, the system would be articulating its "need to know" for specific data sources and data products. Conversely, the same mechanisms could be used to check whether existing information disclosure agreements are indeed necessary for their purpose, or whether the level of privacy could be increased, e.g., via the inclusion of additional anonymization steps, without adversely affecting the quality of the final result. Such mechanisms for reasoning about policies may also assist in the design of the privacy policies themselves, by enabling exploration of workflows that are allowable under a given set of policies but undesirable. This is important because it may be difficult to design policies that are complete, in the sense that there is no way to exploit sensitive data while complying with them.
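
A brute-force sketch of this relaxation analysis, using invented policy predicates over candidate workflows: when every candidate is rejected, the system searches for the smallest policy subsets whose relaxation would admit one, and those subsets articulate its "need to know".

```python
from itertools import combinations

# Invented policy predicates over a candidate workflow, represented here
# as the set of data sources and algorithms the workflow uses.
policies = {
    "no_medical_financial_join": lambda wf: not {"medical", "financial"} <= wf,
    "no_group_detection_on_medical": lambda wf: not {"group_detection", "medical"} <= wf,
}
candidates = [
    {"medical", "financial", "event_detection"},
    {"medical", "group_detection"},
]

def admitted(wf, active):
    """True if workflow wf complies with every active policy."""
    return all(policies[name](wf) for name in active)

# Under the full policy set, no candidate workflow is admitted.
assert not any(admitted(wf, set(policies)) for wf in candidates)

# Find the smallest policy subsets whose relaxation admits some workflow;
# these subsets name exactly what the system "needs to know".
for size in range(1, len(policies) + 1):
    minimal = [set(combo) for combo in combinations(policies, size)
               if any(admitted(wf, set(policies) - set(combo)) for wf in candidates)]
    if minimal:
        print("minimal relaxations:", minimal)
        break
```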