-
Notifications
You must be signed in to change notification settings - Fork 1
/
pp-algorithms.tex
141 lines (100 loc) · 9.79 KB
/
pp-algorithms.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
\chapter{Privacy Preserving Algorithms}\label{c:pp-algorithms}
\section{Challenges in Privacy Preserving Algorithms}\label{s:challenges}
A private computation scheme allows the user to compute any arbitrary function, while revealing no information to any individual server about the identity of that function.
Computation over encrypted data can be utilized in a wide variety of use-cases.
For instance, the problem of how to use encrypted database fields in search queries, or the problem of testing for membership in a set.
Another very common use case is the private set intersection, \textit{i.e.} \textit{``Department of homeland security (DHS) wants to check its list of terrorist suspects against the passenger manifest of a flight operated by a foreign airline.
Neither party is willing to reveal its information, however, if there is a (non-empty) intersection, DHS will not give the flight permission to land".}
Despite the variety of applications that encrypted computation can assist, the translation of any function to its privacy preserving equivalent is not a trivial process.
As mentioned in section \ref{s:homomorphic-encryption}, termination problems are introduced and the avoidance of them is not straightforward.
First, one has to identify the corresponding termination channels that needs protection against side-channels and leakage; that is, any branch decision which is based on encrypted data (\textit{i.e.} \texttt{if ($Enc(X)$ is True) then statement}).
Consecutively, they have to translate the termination channels using oblivious computation in the encrypted domain.
Commonly, the \texttt{if} statement is replaced by a loop over all possible values and an oblivious selection of the desired value.
Early termination conditions, in general, eventually leak sensitive information, which is why the execution time should only depend on the length of the inputs, not their plaintext values (\textit{i.e.}, execution requires maximum iterations independent of the input).
Below we examine the privacy preserving equivalents of some simple and widely used algorithms in order to clarify the termination problems and the translation process.
\section{Branching Oracles}\label{s:bros}
Although we cannot take branch decisions that are based on encrypted values, recent research \cite{mazonka2016cryptoleq} has implemented special constructions -- dubbed \emph{BRanching} \emph{Oracles} (\emph{BROs}) -- which obliviously evaluate the branch outcomes.
\emph{BROs} are treated as decision-making black boxes and enable an alternative solution to branching on private information.
All the algorithms presented in the following sections suppose that the underlying encrypted architecture supports a privacy\hyp preserving \emph{BRO}.
More specifically, \emph{BROs} enable comparison of encrypted values and obliviously return an encrypted boolean variable (true, false).
For instance, the following example illustrates the usage of \emph{BROs}:
\texttt{if (x > 1) a = b; else a = c; }
Since \texttt{x} is encrypted, the host is unable to make the branch decision whether \texttt{a} should take the value of \texttt{b} or \texttt{c}.
Using a branching oracle, an encryption of true or false will emerge from the comparison \texttt{(x > 1)} (\textit{i.e.} \texttt{gt = (x > 1)}).
Then, one can compute obliviously the branch outcome using the result of the private comparison as follows:
\texttt{a = b * gt + c * (1-gt)}.
Using this technique a programmer can replace conditional statements on private data with such oblivious selection clauses.
Therefore, the control flow dependencies are converted into data dependencies.
\section{Notation}\label{s:notation}
Our notation for encrypted values and for operations between encrypted variables consists of a variation of the classical mathematical operations.
Namely, $\enc{X}$ corresponds to the encryption of number $X$, while $\hplus$ , $\hminus$ , $\hmul$ and $\hdiv$ represent homomorphic addition, subtraction, multiplication and division respectively, as shown in equation \ref{eq:operands}.
In a like manner, $\heq$ (eq. \ref{eq:equal}), $\hneq$ (eq. \ref{eq:not-equal}) $\hlt$ (eq. \ref{eq:lt}) and $\hgt$ (eq. \ref{eq:gt}) represent private equality and private comparison, returning an encryption of one (that is $\enc{1}$) if the operands map to equal plaintexts (or respectively the first is less than the second), or an encryption of zero ($\enc{0}$) otherwise.
All those aforementioned comparisons are utilizing the branching oracle of the underlying encrypted architecture.
\begin{equation}\label{eq:operands}
\begin{aligned}
Enc(X) \hplus Enc(Y) = Enc(X + Y)\\
Enc(X) \hminus Enc(Y) = Enc(X - Y)\\
Enc(X) \hmul Enc(Y) = Enc(X \cdot Y)\\
Enc(X) \hdiv Enc(Y) = Enc(X \div Y)
\end{aligned}
\end{equation}
\begin{equation}\label{eq:equal}
Enc(X) \heq Enc(Y) :\begin{cases}
\enc{1}, & \text{if $X = Y$}.\\
\enc{0}, & \text{otherwise}.
\end{cases}
\end{equation}
\begin{equation}\label{eq:not-equal}
Enc(X) \hneq Enc(Y) :\begin{cases}
\enc{1}, & \text{if $X \neq Y$}.\\
\enc{0}, & \text{otherwise}.
\end{cases}
\end{equation}
\begin{equation}\label{eq:lt}
Enc(X) \hlt Enc(Y) :\begin{cases}
\enc{1}, & \text{if $X < Y$}.\\
\enc{0}, & \text{otherwise}.
\end{cases}
\end{equation}
\begin{equation}\label{eq:gt}
Enc(X) \hgt Enc(Y) :\begin{cases}
\enc{1}, & \text{if $X > Y$}.\\
\enc{0}, & \text{otherwise}.
\end{cases}
\end{equation}
Also, in the algorithms described below, the variables in \red{red} correspond to private values, while the variables in black represent public data.
\section{Transforming Algorithms to their Privacy Preserving Equivalent}\label{s:simple-algorithms}
In this section we will examine three simple algorithms (namely {\fontfamily{lmss}\textsc{Sum}, \textsc{Max}, \textsc{PIR}}) and how to develop them in a way to preserve data privacy.
Firstly, algorithm \ref{a:sum} computes the sum of an array of numbers and it is quite similar to the private sum algorithm.
The latter obliviously updates the encrypted sum with each value of the array, utilizing the homomorphic properties of the encrypted variables.
\import{./}{algorithms/sum.tex}
On the other hand, algorithm \ref{a:max} finds and returns the maximum number of an array.
In the private {\fontfamily{lmss}\textsc{Max}} algorithm, the host is not able to determine for each value of the array if it is greater from the current value of max.
However, it can obliviously update the value of max, based on the encrypted comparison outcome (alg. \ref{a:max} line 13, 14).
In the cases that the value of the $array$ is less than the current $max$ the $gt$ flag will have the value $\enc{0}$, otherwise $\enc{1}$.
Thus the multiplexer operation in line 14 each time will obliviously ``keep'' the greatest value and set the variable $max$ to it.
\import{./}{algorithms/max.tex}
Both textbook algorithms' time complexity (algorithms. \ref{a:sum} and \ref{a:max}) is \bigo{n} and the transformation to their privacy preserving equivalents did not affect their complexity.
However, the privacy\hyp preserving algorithm is inevitably more computationally intensive, since the encrypted operations require many additional CPU cycles.
\import{./}{algorithms/pir.tex}
The third and final example is the Information Retrieval algorithm, where a user maintains a database and desires to retrieve some data based on a key.
The simplest textbook algorithm searches each consecutive position on the array to find a key match, and immediately returns the corresponding value.
In contrast, the privacy preserving algorithm (PIR) \cite{chor1995private} is not able to return before searching the whole database without revealing any information about the key and the result.
In this pair of examples the worst case complexity is also \bigo{n}, however in the average case, the former will search the half database.
The idea here is to brute\hyp force all possible table positions and obliviously find and keep the value for the requested key.
It has become obvious that it is not straightforward to make more sophisticated algorithms privacy preserving.
However, similar techniques are applied for simple and more elaborate algorithms, in order to transform them to their privacy preserving equivalents, sacrificing performance for privacy.
Motivated by privacy concerns and from the challenges that arise in securing any function, in this thesis we concentrate on developing an end-to-end infrastructure for privacy preserving analytics and incorporate some essential algorithms for aggregation statistics and classification.
\section{Algorithms for Two Types of Data: Categorical \& Numerical}\label{s:two-types-of-data}
Due to the limitless applications that our system can serve, the input data can have many different types.
We have separated the data in two broad categories -- categorical and continuous, therefore our algorithms are also logically separated for those two different kinds of data.
Categorical variables are those that can take on one of a limited, and usually fixed, number of possible values.
For instance, blood type, gender, the age-group and the country that a person lives in.
On the other hand, numerical\myslash continuous data can take any real number, rendering uncountable the number of different possible values.
Examples of the latter category include weight, price, profits, etc.
Medical data, of course, appear in both ways, depending on the ``nature" of the attributes.
One more reason that this distinction should be made is that even for the same attribute (for instance ``Age"), a doctor may either classify the exact age, or the birth date, or even classify the patient in one group (\textit{i.e.} ``child" or ``adult").
Due to these two different kinds of data that exist in medical datasets, we have separated our algorithms in two categories respectively, categorical and numerical.
\input{histograms}
\input{id3}
\input{c45}