We are starting a series of blog posts on DP-SGD that will range from gentle introductions to detailed coverage of the math and of engineering details in making it work.
In this first entry, we will go over the DP-SGD algorithm focusing on introducing the reader to the core concepts, without worrying about mathematical precision or implementation just yet (they will be covered in future episodes). The intended audience for this article is someone who has experience with training ML models, including deep nets via backpropagation using their favorite framework (PyTorch, of course 🙂).
Privacy in ML
We know that training a model is an attempt at induction: we learn something from our data and we plan to use it to predict something else in the future. To state the obvious plainly, this means that there is some information in our dataset, and by training a model we condense at least some of it into an artifact we plan to use later. We learn in Machine Learning 101 that memorization can happen, so it’s perhaps not surprising that memorization can indeed be exploited to extract information about training data from a model (see, e.g., [Carlini et al, 2018], [Feldman 2020]).
What is privacy, anyway? Let’s say we don’t know and let’s start fresh by looking at our problem. We can all agree that if the ML model has never seen your data in the first place, it cannot possibly violate your privacy. Let’s call this our baseline scenario. Intuitively, if you now added your own data to the training set and the resulting model changed a lot compared to the baseline, you would be concerned about your privacy. While this makes intuitive sense, the real world is more complex than this. In particular, there are two problems we should think about:
- We know that any tweak in the training process, no matter how trivial, will change the resulting model significantly. Permuting the training data, rerandomizing initial parameters, or running another task on the same GPU will produce a different model with potentially very different weights. This means that we can’t simply measure how different the weights are in these two scenarios as that will never work.
- If everyone expected that absolutely no change would happen in a model if they added their data, it means that there would be no training data and hence no ML models! We can see that this constraint is a bit too rigid.
Luckily for us, this was figured out by [Dwork et al, 2006] and the resulting concept of differential privacy provides a solution to both problems! For the first, rather than comparing the weights of the two models, we want to consider the probabilities of observing these weights. For the second, instead of insisting that nothing will change, let’s instead promise that while something will change, we guarantee it will never change by more than a specific and predefined amount. This way, we won’t learn too much to be nosy, but we can still learn enough to produce useful models.
These two principles are embodied in the definition of differential privacy, which goes as follows. Imagine that you have two datasets D and D′ that differ in only a single record (e.g., my data) and you interact with the data via a process or mechanism called M (this can be anything, more on this later). We say that M is ε-differentially private if, for every possible output x, the probabilities of observing this output in the two scenarios (with and without my data) never differ by more than a factor of exp(ε).
Or, if you prefer a formula:
∀ D and D′ that differ in one person’s data ∀ x: ℙ[M(D) = x] ≤ exp(ε) ⋅ ℙ[M(D′) = x]
One of the amazing things about differential privacy is that it imposes no limitations on the nature of M. It can be anything. It can be a database query, it can be asking a person a set of questions with pen and paper, or even just storing the data to disk or sending it over the wire, or anything else you want. As long as M enjoys this property over its outputs, it can claim its DP badge for a specific privacy budget ε. At the same time, you can choose what ε you want to be certified for: the higher it is, the less private you are (look at the formula: it means the probabilities are allowed to diverge more). For this reason, the quantity ε is commonly referred to as the privacy [loss] budget.
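To make the inequality concrete, here is a minimal sketch that checks it for one pair of output distributions. The two distributions and the function names are hypothetical, made up purely for illustration. A real proof has to cover every pair of neighbouring datasets, not just one.

```python
import math


def max_privacy_loss(p, q):
    """Smallest eps with p[x] <= exp(eps) * q[x] and q[x] <= exp(eps) * p[x] for all x."""
    worst = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        if px == 0.0 and qx == 0.0:
            continue
        if px == 0.0 or qx == 0.0:
            # One scenario can produce x and the other cannot: no finite eps works.
            return math.inf
        worst = max(worst, abs(math.log(px / qx)))
    return worst


def is_epsilon_dp(p, q, epsilon):
    """Check the inequality (in both directions) for this single pair of distributions."""
    return max_privacy_loss(p, q) <= epsilon


# Hypothetical output distributions of M on D (with my data) and D' (without it).
with_me = {"a": 0.50, "b": 0.30, "c": 0.20}
without_me = {"a": 0.40, "b": 0.35, "c": 0.25}
loss = max_privacy_loss(with_me, without_me)
```

The check is done in both directions because either dataset can play the role of D in the formula.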
If we go back to our case of training a model, we now have a way to formally certify the privacy level of our training algorithm. Suppose we train two models, one on all data (mine included) and the other on all data except mine. If we can prove that every possible set of weights is observed in the two scenarios with probabilities that lie within a predefined factor of exp(ε) of each other, then we can claim the cool DP badge for our training procedure (that’s right! It’s the overall process that gets the badge, not the data and certainly not the trained model!).
Notice that this task is harder than it looks: we can’t simply try 1000 examples (or a million, or a billion) and check whether they match. We need to prove this for all values, including never previously observed ones. The only way out of this is math and theorems. The good news is that if we do manage to get this done, the privacy claim holds no matter what. No future attack can extract our precious data from a trained model beyond what ε allows, and there are no bugs in the definition to exploit, just like you can’t break Pythagoras’s theorem. This is why it’s worth doing.
Providing a guarantee
So, how do we provide this guarantee then? The definition doesn’t say anything about the how.
It’s helpful to think about this problem on a simpler domain, so for now let us leave machine learning aside and focus on making private counting queries to a database — at the end of the day, we can see ML training as a special case of querying the training data to get numerical results out.
It is trivial to see that COUNT(*) WHERE <cond> queries can lead to a complete privacy breakdown against a sufficiently determined attacker. Consider the following example of a database that consists of two fields, name and salary, with the latter being kept “private” by mandating it can only be shown in aggregates. By repeatedly running queries such as COUNT(*) WHERE name="Alice" and salary < X, Alice’s salary can be recovered with binary search. Can we defend against this attack by disallowing queries that target individuals? If only! A pair of queries, COUNT(*) WHERE name<>"Alice" and salary < X and COUNT(*) WHERE salary < X, gets the job done just as easily as before.
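The attack described above can be sketched in a few lines. The toy database, the function names, and the assumption that salaries are integers in a known range and names are unique are all made up for illustration.

```python
# Hypothetical toy database of (name, salary) records.
DATABASE = [("Alice", 83_500), ("Bob", 61_000), ("Carol", 97_250)]


def count_where(predicate):
    """Exact counting query: the only interface the attacker gets."""
    return sum(1 for record in DATABASE if predicate(record))


def salary_below(name, x):
    """Learn whether name's salary is < x from two aggregate queries,
    neither of which selects the target individually."""
    everyone = count_where(lambda r: r[1] < x)
    everyone_else = count_where(lambda r: r[0] != name and r[1] < x)
    return everyone - everyone_else == 1


def recover_salary(name, low=0, high=1_000_000):
    """Binary search, assuming the salary is an integer in [low, high)."""
    while high - low > 1:
        mid = (low + high) // 2
        if salary_below(name, mid):
            high = mid
        else:
            low = mid
    return low
```

With a range of one million, about twenty rounds of two queries each are enough to pin down the exact value.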
It may seem that these simple attacks can be thwarted by making the server’s answers a bit less precise. For instance, what if the server rounds its responses to the closest multiple of 10? Or, to confuse the attacker even more, chooses the rounding direction randomly?
A seminal result from the early 2000s due to Irit Dinur and Kobbi Nissim states, loosely, that too many accurate answers to too many questions will violate privacy almost surely. This phenomenon is known in the literature as the Fundamental Law of Information Recovery and has been practically demonstrated in a variety of contexts time and time again. It effectively means that not only must the answers not be overly precise, but the error must also grow with the number of answers if we want to avoid nearly total reconstruction of the dataset.
The notion of differential privacy turns these observations into actionable guidance.
The remarkable fact is that we can enforce differential privacy for counting queries by simply computing the precise answer and adding noise randomly sampled from a carefully chosen probability distribution. In its simplest form, a privacy-preserving mechanism can be implemented with noise drawn from the Laplace distribution.
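As a rough illustration of the Laplace approach, the sketch below answers a counting query with added noise. The function names are hypothetical. The noise scale of 1/ε is the standard choice for a counting query, whose answer changes by at most 1 when one record is added or removed, but it is not stated in this post, so treat it as an assumption.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=random):
    """Exact count plus Laplace noise with scale 1 / epsilon.

    Assumes adding or removing one record changes the count by at most 1.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical usage: how many salaries are below 90,000?
salaries = [83_500, 61_000, 97_250, 72_000]
noisy = private_count(salaries, lambda s: s < 90_000, epsilon=0.5)
```

A smaller ε means a larger noise scale, so each answer is less accurate but reveals less.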
Of course, if we ask the same query multiple times, the additive noise will average out and the true answer will emerge, which is exactly what Dinur-Nissim warned us about. Take that, differential privacy!
Differential privacy allows us to analyze this effect too, and in a very neat way: if you take a measurement from a mechanism with privacy budget ε₁ and a second measurement from another mechanism with privacy budget ε₂, then the total privacy budget will be simply ε₁+ε₂. Sleek, eh? This property is called (simple) composition. This means that if the mechanism guarantees that a single query has ε=1 and you want to issue three queries, the total privacy budget expended will be ε=3.
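A small sketch of both effects: averaging many noisy answers to the same query recovers the true value, while simple composition says how much budget those answers cost in total. The true answer of 42, the per-query ε, and the function names are hypothetical, and Laplace noise with scale 1/ε is assumed.

```python
import random


def noisy_answers(true_answer, epsilon_per_query, num_queries, rng):
    """Answer the same counting query repeatedly with fresh Laplace noise each time."""
    rate = epsilon_per_query  # Laplace scale 1/eps == exponential rate eps
    return [
        true_answer + rng.expovariate(rate) - rng.expovariate(rate)
        for _ in range(num_queries)
    ]


def total_budget(epsilons):
    """Simple composition: the budgets of the individual measurements add up."""
    return sum(epsilons)


rng = random.Random(0)
answers = noisy_answers(42, epsilon_per_query=1.0, num_queries=1000, rng=rng)
averaged = sum(answers) / len(answers)  # lands close to the true answer
spent = total_budget([1.0] * len(answers))  # but the budget spent is 1000, not 1
```

The average is accurate precisely because the attacker paid for it: a total ε of 1000 is no meaningful guarantee at all.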
This “just add some noise” business sounds too good to be true, right? What if the attacker thinks really hard about the output of a differentially private computation, such as feeding it into a custom-made neural network trained to break privacy? Fear not! Differential privacy is preserved by post-processing, which means that results of running arbitrary computations on top of differentially private output won’t roll back the ε. Not only does it protect against clever adversaries, it gives us a lot of flexibility in designing differentially private mechanisms: once differential privacy is enforced anywhere in the data processing pipeline, the final output will satisfy differential privacy.
To recap, we learned that our solution will look like this:
- Our mechanism will be randomized, i.e., it will use noise.
- Our final privacy claim depends on the total number of interactions with the data.
- We can post-process results of a differentially private computation any way we want (as long as we don’t peek into the private dataset again).
Back to machine learning
To apply the concept of differential privacy to the original domain of machine learning, we need to land on two decisions: how we define “one person’s data” that separates D from D’ and what the mechanism M is.
Since in most applications of ML the inputs come without explicit user identifiers, with Federated Learning being one notable exception, we will default to protecting privacy of a single sample in the training dataset. We will discuss other options in future Medium posts.
As for the mechanism M, one possibility is to consider privacy of the model’s outputs only. This is indeed a valid option called private prediction, but it comes with many strings attached: the model can still memorize, so it’s up to your inference system to securely enforce those constraints. Also, this prevents us from ever releasing our ML model: if someone gets to see the weights, our privacy guarantees will be lost. This means, among other things, that deploying on mobile will be considerably less safe.
For this reason, it would be much preferable if we could instead insert the DP mechanism during model training, so that the resulting model could be safe for release. This brings us to the DP-SGD algorithm. (There is evidence that even when you only care about accuracy, private training still beats private prediction. See [van der Maaten, Hannun 2020] for a practical analysis and more discussion on the topic).
DP-SGD (Differentially-Private Stochastic Gradient Descent) modifies the minibatch stochastic optimization process that is so popular with deep learning in order to make it differentially private.
The core idea is that training a model in PyTorch can be done through access to its parameter gradients, i.e., the gradients of the loss with respect to each parameter of your model. If this access preserves differential privacy of the training data, so does the resulting model, per the post-processing property of differential privacy.
There is also an engineering angle here: since the PyTorch optimizer is already made to look at parameter gradients, we could add this noise business directly into it and we can hide away the complexity, allowing anyone to train a differentially private model simply. Profit!
The following code sample shows how simple this is:

    import torch
    from torch.utils.data import DataLoader

    # model, criterion, train_dataset and args are assumed to be defined elsewhere
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)

    for batch in DataLoader(train_dataset, batch_size=32):
        x, y = batch
        y_hat = model(x)
        loss = criterion(y_hat, y)
        loss.backward()

        # Now these are filled:
        gradients = (p.grad for p in model.parameters())

        with torch.no_grad():  # update parameters in place, outside autograd
            for p in model.parameters():
                # Add our differential privacy magic here
                p.grad += args.sigma * torch.randn_like(p.grad)

                # This is what optimizer.step() does
                p -= args.lr * p.grad
                p.grad.zero_()
We have only one question left: how much noise should we be adding? Too little and we can’t respect privacy, too much and we are left with a private but useless model. This turns out to be more than a minor issue. Our ambition is to guarantee that we respect the privacy of each and every sample, not of every batch (since these aren’t a meaningful unit privacy-wise). We’ll cover the details in a future installment of this series, but the intuition is very straightforward: the right answer depends on the largest gradient norm among the samples in a minibatch, as that is the sample most at risk of exposure.
We need to add just enough noise to hide the largest possible gradient so that we can guarantee that we respect the privacy of each and every sample in that batch. To this end, we use the Gaussian mechanism that takes in two parameters, the noise multiplier and the bound on the gradient norm. But wait… The gradients that arise during training of a deep neural network are potentially unbounded. In fact, for outliers and mislabeled inputs they can be very large indeed. What gives?
If the gradients are not bounded, we’ll make them so ourselves! Let C be the target bound for the maximum gradient norm. For each sample in the batch, we compute its parameter gradient and, if its norm is larger than C, we clip the gradient by scaling it down so that its norm equals C. Mission accomplished: all the gradients are now guaranteed to have norm bounded by C, which we naturally call the clipping threshold. Intuitively, this means that we disallow the model from learning more information than a set quantity from any given training sample, no matter how different it is from the rest.
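The clipping step on its own can be sketched as below. The function name is hypothetical, and taking the norm jointly over all of a sample's parameter gradients (rather than per parameter) is an assumption of this sketch. The small constant in the denominator only guards against division by zero.

```python
import torch


def clip_per_sample_gradient(grads, max_norm):
    """Scale one sample's gradients (a list of tensors, one per parameter)
    so that their joint L2 norm is at most max_norm."""
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    # Gradients already within the bound get a scale of 1 and are left alone.
    scale = torch.clamp(max_norm / (total_norm + 1e-6), max=1.0)
    return [g * scale for g in grads]


# Hypothetical gradients of a single sample with joint norm 5.
sample_grads = [torch.tensor([3.0, 0.0]), torch.tensor([[4.0]])]
clipped = clip_per_sample_gradient(sample_grads, max_norm=1.0)
```

Scaling the whole gradient by one factor preserves its direction and only limits its magnitude.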
This requires computing parameter gradients for each sample in a batch. We normally refer to them as per-sample gradients. Let’s spend a little more time here as these are a quantity that is normally not computed: usually, we process data in batches (in the code snippet above, the batch size is 32). The parameter gradients we have in p.grad are the average of the gradients for each example, which is not what we want: we want 32 different p.grad tensors, not their average into a single one.
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)

    for batch in DataLoader(train_dataset, batch_size=32):
        all_per_sample_gradients = []  # will have len = batch_size
        x_batch, y_batch = batch
        for x, y in zip(x_batch, y_batch):
            # Re-add the batch dimension: each microbatch holds a single sample
            y_hat = model(x.unsqueeze(0))
            loss = criterion(y_hat, y.unsqueeze(0))
            loss.backward()  # Now p.grad for this x is filled

            # Need to clone it to save it
            per_sample_gradients = [p.grad.detach().clone() for p in model.parameters()]
            all_per_sample_gradients.append(per_sample_gradients)
            model.zero_grad()  # p.grad is cumulative so we'd better reset it
Computing per-sample gradients like in the snippet above seems slow, and it is, as it forces us to run backward steps for one example at a time, thus losing the benefit of parallelization. There is no standard way around this, as once we look into p.grad, the per-sample information will have already been lost. It is, however, at least correct: a batch gradient is a per-sample gradient if batch_size=1. This method is called the microbatch method and it offers simplicity and universal compatibility (every possible layer is automatically supported) at the cost of training speed. Our library, Opacus, uses a different method that is much faster, at the cost of doing some extra engineering work. We will cover this method in depth in a follow-up Medium post. For now, let’s stick to microbatching.
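As a quick sanity check of the claim that p.grad holds the average of the per-sample gradients, the sketch below compares one batched backward pass against microbatches of size 1. The tiny linear model, the random data, and the mean-reduced MSE loss are assumptions chosen only to keep the example small.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()  # default reduction="mean"
x, y = torch.randn(8, 4), torch.randn(8, 1)

# One backward pass over the whole batch.
model.zero_grad()
criterion(model(x), y).backward()
batch_grad = model.weight.grad.detach().clone()

# One backward pass per sample (microbatches of size 1).
per_sample_grads = []
for xi, yi in zip(x, y):
    model.zero_grad()
    criterion(model(xi.unsqueeze(0)), yi.unsqueeze(0)).backward()
    per_sample_grads.append(model.weight.grad.detach().clone())

# Averaging the per-sample gradients should reproduce the batch gradient.
mean_of_per_sample = torch.stack(per_sample_grads).mean(dim=0)
```

With a sum-reduced loss the batch gradient would instead be the sum of the per-sample gradients.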
Opacus (https://opacus.ai/) is a library that enables training PyTorch models with differential privacy
Putting it all together, we want to:
- Compute the per-sample gradients
- Clip them to a fixed maximum norm
- Aggregate them back into a single parameter gradient
- Add noise to it
Here’s some sample code to do just that:
    import torch
    from torch.nn.utils import clip_grad_norm_
    from torch.utils.data import DataLoader

    # model, criterion, train_dataset and args are assumed to be defined elsewhere
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)

    for batch in DataLoader(train_dataset, batch_size=32):
        for param in model.parameters():
            param.accumulated_grads = []

        # Run the microbatches
        x_batch, y_batch = batch
        for x, y in zip(x_batch, y_batch):
            y_hat = model(x.unsqueeze(0))
            loss = criterion(y_hat, y.unsqueeze(0))
            loss.backward()

            # Clip this sample's gradient in place (norm taken over all parameters)
            clip_grad_norm_(model.parameters(), max_norm=args.max_grad_norm)
            for param in model.parameters():
                param.accumulated_grads.append(param.grad.detach().clone())
            model.zero_grad()  # p.grad is cumulative, reset before the next sample

        # Aggregate back and add noise
        for param in model.parameters():
            summed = torch.stack(param.accumulated_grads, dim=0).sum(dim=0)
            noise = args.noise_multiplier * args.max_grad_norm * torch.randn_like(summed)
            param.grad = (summed + noise) / len(param.accumulated_grads)

        # Now we are ready to update!
        with torch.no_grad():
            for param in model.parameters():
                param -= args.lr * param.grad
        model.zero_grad()  # Reset for next iteration
This already gives a good idea of how to implement the DP-SGD algorithm, although this is clearly suboptimal and (as we shall see) not fully secure. In future Medium posts, we will cover how we bring back parallelization to DP-SGD, add support for cryptographically secure randomness, analyze the algorithm’s differential privacy, and finally train some models. Stay tuned!
To learn more about Opacus, visit opacus.ai and github.com/pytorch/opacus.