
[Paper notes] Stochastic gradient descent with differentially private updates
A few questions noted while reading
- The sample size required for a target utility level increases with the privacy constraint.
- Optimization methods for large data sets must also be scalable.
- SGD algorithms satisfy asymptotic guarantees.
Introduction
- Summary of the main contribution:
  In this paper we derive differentially private versions of single-point SGD and mini-batch SGD, and evaluate them on real and synthetic data sets.
- Why SGD is attractive:
  Stochastic gradient descent (SGD) algorithms are simple and satisfy the same asymptotic guarantees as more computationally intensive learning methods.
- The downside of relying on asymptotic guarantees:
  To obtain reasonable performance on finite data sets, practitioners must take care in setting parameters such as the learning rate (step size) for the updates.
- How this sensitivity is addressed:
  Grouping updates into "mini-batches" alleviates some of this sensitivity and improves the performance of SGD. This makes the updates more robust at a moderate computational expense, but also introduces the batch size as a free parameter.
Preliminaries
- Optimization objective:
  Solve a regularized convex optimization problem: $w^* = \mathop{\mathrm{argmin}}\limits_{w \in \mathbb{R}^d} \frac{\lambda}{2} \Vert w \Vert^2 + \frac{1}{n} \sum\limits_{i=1}^{n} \ell(w, x_i, y_i)$
  where $w$ is the normal vector to the hyperplane separator, and $\ell$ is a convex loss function.
  If $\ell$ is the logistic loss, i.e. $\ell(w,x,y) = \log(1 + e^{-y w^T x})$, this gives logistic regression.
  If $\ell$ is the hinge loss, i.e. $\ell(w,x,y) = \max(0,\, 1 - y w^T x)$, this gives the SVM. (Both losses appear in the sketch after this list.)
- Optimization algorithm:
  SGD with mini-batch updates: $w_{t+1} = w_t - \eta_t \Big( \lambda w_t + \frac{1}{b} \sum\limits_{(x_i, y_i) \in B_t} \nabla \ell(w_t, x_i, y_i) \Big)$
  where $\eta_t$ is the learning rate, and the update at each step $t$ is based on a small subset $B_t$ of $b$ examples.
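A minimal NumPy sketch of the loss gradients and the (non-private) mini-batch update above, assuming a feature matrix `X` of shape (n, d), labels `y` in {-1, +1}, and a constant learning rate for simplicity; all function names and the synthetic-data usage are illustrative, not from the paper.

```python
import numpy as np

def logistic_loss_grad(w, x, y):
    """Gradient of l(w,x,y) = log(1 + exp(-y * w.x)) with respect to w."""
    return -y * x / (1.0 + np.exp(y * w.dot(x)))

def hinge_loss_grad(w, x, y):
    """Subgradient of l(w,x,y) = max(0, 1 - y * w.x) with respect to w."""
    return -y * x if y * w.dot(x) < 1 else np.zeros_like(x)

def minibatch_sgd(X, y, lam, eta, b, T, loss_grad=logistic_loss_grad, seed=0):
    """Mini-batch SGD for (lambda/2)*||w||^2 + (1/n) * sum_i l(w, x_i, y_i)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                                  # starting point w_0
    for t in range(T):
        idx = rng.choice(n, size=b, replace=False)   # batch B_t of size b
        grad = np.mean([loss_grad(w, X[i], y[i]) for i in idx], axis=0)
        w = w - eta * (lam * w + grad)               # w_{t+1} = w_t - eta*(lam*w_t + avg. gradient)
    return w

# Illustrative usage on synthetic data:
# X = np.random.randn(1000, 10); y = np.sign(X[:, 0] + 0.1)
# w = minibatch_sgd(X, y, lam=1e-3, eta=0.1, b=50, T=200)
```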
SGD with Differential Privacy
- Differentially private mini-batch SGD:
  A differentially private version of the mini-batch update: $w_{t+1} = w_t - \eta_t \Big( \lambda w_t + \frac{1}{b} \sum\limits_{(x_i, y_i) \in B_t} \nabla \ell(w_t, x_i, y_i) + \frac{1}{b} Z_t \Big)$
  where $Z_t$ is a random noise vector in $\mathbb{R}^d$ drawn independently from the density $\rho(z) \propto e^{-(\alpha/2) \Vert z \Vert}$.
- Conditions under which this mini-batch update is $\alpha$-differentially private:
  $\mathcal{Theorem}$: If the initialization point $w_0$ is chosen independently of the sensitive data, the batches $B_t$ are disjoint, and $\Vert \nabla \ell(w, x, y) \Vert \le 1$ for all $w$ and all $(x_i, y_i)$, then SGD with mini-batch updates is $\alpha$-differentially private.
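A sketch of how the private update could be implemented under the theorem's assumptions (disjoint batches from a single pass over the data, $w_0$ independent of the data, and gradients with $\Vert \nabla \ell \Vert \le 1$, e.g. via feature scaling). Sampling from $\rho(z) \propto e^{-(\alpha/2)\Vert z \Vert}$ is done here by drawing a uniformly random direction and a Gamma$(d,\, 2/\alpha)$-distributed norm; `sample_noise` and `dp_minibatch_sgd` are illustrative names, not from the paper.

```python
import numpy as np

def sample_noise(d, alpha, rng):
    """Draw Z in R^d from rho(z) proportional to exp(-(alpha/2) * ||z||):
    uniformly random direction, norm distributed as Gamma(shape=d, scale=2/alpha)."""
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=d, scale=2.0 / alpha) * direction

def dp_minibatch_sgd(X, y, lam, eta, b, alpha, loss_grad, seed=0):
    """One pass of the alpha-differentially-private mini-batch update.
    Assumes disjoint batches and data scaled so that ||grad l(w,x,y)|| <= 1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                              # w_0 chosen independent of the data
    perm = rng.permutation(n)
    for start in range(0, n - b + 1, b):         # disjoint batches B_t of size b
        idx = perm[start:start + b]
        grad = np.mean([loss_grad(w, X[i], y[i]) for i in idx], axis=0)
        Z = sample_noise(d, alpha, rng)
        w = w - eta * (lam * w + grad + Z / b)   # noise scaled by 1/b as in the update above
    return w
```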
Experiments
- Experimental observation:
  With a batch size of 1, DP-SGD has noticeably higher variance than ordinary SGD; with a larger batch size the variance drops considerably.
- Takeaway:
  In terms of objective value, guaranteeing differential privacy can come for "free" using SGD with a moderate batch size.
- In practice, however, the effect of batch size is not monotone:
  Increasing the batch size improves the performance of private SGD, but only up to a limit; much larger batch sizes actually degrade performance.
A few additional takeaways
- The data dimension $d$ and the privacy parameter affect how much data is needed:
  Differentially private learning algorithms often have a sample complexity that scales linearly with the data dimension $d$ and inversely with the privacy risk $\alpha$. Thus a moderate decrease in $\alpha$ or increase in $d$ may require more data.
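A rough way to read this scaling (a heuristic consequence of the linear-in-$d$, inverse-in-$\alpha$ dependence, not a bound stated in the paper):
  $\frac{n_{\text{new}}}{n_{\text{old}}} \approx \frac{d_{\text{new}}}{d_{\text{old}}} \cdot \frac{\alpha_{\text{old}}}{\alpha_{\text{new}}}$, so, for example, doubling $d$ while halving $\alpha$ calls for roughly four times as much data.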
Ref
S. Song, K. Chaudhuri, and A. D. Sarwate. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2013.