Li, Cai & Li (2020) Transfer learning for high-dimensional linear regression: prediction, estimation, and minimax optimality, arXiv.

Introduction

Target model: $y_i^{(0)} = x_i^{(0)\top}\beta + \epsilon_i^{(0)}$ , $i = 1, \ldots, n_0$ .
$\beta \in \mathbb{R}^p$ , $p$ can be larger than $n_0$ . $s$ : number of nonzero elements of $\beta$ . $s$ is much smaller than $p$ .
Auxiliary models: $y_i^{(k)} = x_i^{(k)\top}\omega^{(k)} + \epsilon_i^{(k)}$ , $i = 1, \ldots, n_k$ , $k = 1, \ldots, K$ .
$\beta$ と $\omega^{(k)}$ は一般的に異なる．しかしもし両者が近い値をとるならば，target model をより効率的に推定できるかもしれない．
Let $\delta^{(k)} := \beta - \omega^{(k)}$ . "Informative" auxiliary samples を次の集合で定義する:

\begin{align*} \mathcal{A}_q := \{k : ||\delta^{(k)}||_q \le h \}, \;\; \text{for} \;\; q \in [0,1] \end{align*}

$h$ が小さいほど $\mathcal{A}_q$ は informative. $\mathcal{A}_q$ が空集合であることも許容．

Estimation with known informative auxiliary samples

Let $\mathcal{A} \equiv \mathcal{A}_1$ .

Oracle trans-Lasso algorithm

$\hat{\omega}^\mathcal{A} := \arg\min_\omega\frac{1}{2(n_\mathcal{A} + n_0)}\sum_{k \in \mathcal{A} \cup\{0\}} ||y^{(k)} - X^{(k)}\omega||^2_2 + \lambda_\omega ||\omega||_1$ for some $\lambda_\omega = c\sqrt{\log p/(n_\mathcal{A} + n_0)}$
Let $\hat \beta := \hat{\omega}^\mathcal{A} + \hat{\delta}^\mathcal{A}$ , where $\hat{\delta}^\mathcal{A} := \arg\min_\delta\frac{1}{2 n_0} ||y^{(0)} - X^{(0)}(\hat{\omega}^\mathcal{A} + \delta)||^2_2 + \lambda_\delta ||\delta||_1$ for some $\lambda_\delta = c\sqrt{\log p/(n_\mathcal{A} + n_0)}$ .

$\hat{\omega}^\mathcal{A}$ の probability limit を $\omega^\mathcal{A} = \beta + \delta^\mathcal{A}$ とする．また， $||\delta^\mathcal{A}||_1 \le h$ である．
Parameter space: $\Theta_q(s,h) := \{(\beta, \delta^{(1)}, \ldots , \delta^{(K)}) : ||\beta||_0 \le s, \; \max_{k \in \mathcal{A}_q}||\delta^{(k)}||_q \le h\}$
各 $k \in \mathcal{A}\cup \{0\}$ について， $X_i^{(k)} \sim$ i.i.d. Gaussian + $\epsilon_i^{(k)} \sim$ i.i.d. sub-Gaussian など仮定し，それらの下で以下を得る：

Theorem 1.

$\sup_{\beta \in \Theta_1(s,h)} \frac{1}{n_0}||X^{(0)}(\hat \beta - \beta)||^2_2 \lor ||\hat \beta - \beta||_2^2 = O_P\left( \frac{s \log p}{n_\mathcal{A} + n_0} + \frac{s \log p}{n_0} \land h \sqrt{\frac{\log p}{n_0}} \land h^2\right)$

$\mathcal{A}$ が空集合の時は右辺は $O_P(s \log p/n_0)$ ．これは通常のLassoと同じレート．
$h \lt\lt s\sqrt{\log p/n_0}$ ならば転移学習によって効率性が改善する．つまり， $\delta$ のほうが $\beta$ よりはるかに sparse である必要．

Theorem 2.

$\inf_{\hat \beta}\sup_{\beta \in \Theta_1(s,h)}\Pr\left( ||\hat \beta - \beta||_2^2 \ge c_1 \frac{s \log p}{n_\mathcal{A} + n_0} + c_2 \frac{s \log p}{n_0} \land h \sqrt{\frac{\log p}{n_0}} \land h^2\right) \ge \frac{1}{2}$

Theorem 1 + Theorem 2 => Oracle trans-Lasso は minimax rate optimal.

Unknown Set of Informative Auxiliary Samples

Let $\hat Q(\mathcal{I}, b) := \sum_{i \in \mathcal{I}}||y_i^{(0)} - x_i^{(0)\top}b||_2^2$ denote the sum of squared prediction error under $b$ , and $\Lambda^{L + 1} := \{\nu \in \mathbb{R}^{L + 1}: \nu_l \ge 0, \: \sum_{l=0}^L \nu_l = 1\}$ denote an $L$ -dimensional simplex.

Oracle trans-Lasso algorithm

Let $\mathcal{I}$ be a random subset of $\{1, \ldots, n\}$ .
Construct $L + 1$ candidate sets of $\mathcal{A}$ , $\{\hat G_0, \hat G_1, \ldots , \hat G_L\}$ with $\hat G_0 = \emptyset$ .
各 $\hat G_l$ と $\{X_\mathcal{I}^{(0)}, y_\mathcal{I}^{(0)}\}$ を用いて trans-Lasso を回す．その結果を $\hat \beta (\hat G_l)$ とする．
Compute $\hat \theta := \arg\min_{\theta \in \Lambda^{L + 1}} \hat Q(\mathcal{I}^c, \sum_{l=0}^L \theta_l \hat \beta(\hat G_l)) + \sum_{l = 0}^L \theta_l \hat Q(\mathcal{I}^c, \hat \beta(\hat G_l)) + \frac{2 \lambda_\theta}{n_0} \sum_{l=0}^L \theta_l \log (\theta_l).$ Output $\hat \beta^{\hat \theta} := \sum_{l = 0}^L \hat \theta_l \hat \beta(\hat G_l)$ .

$\hat G_0$ を空集合にしておくことで，auxiliary sample が完全に uninformative でも結果が（そこまで）悪化しないように保険．
$\Pr(\hat G_l \subseteq \mathcal{A}, \;\; \text{for some}\;\; 1\le l \le L) \to 1$ ならば Oracle trans-Lasso に近い結果が期待できる．
$\hat G_l$ として $2^K$ すべての組み合わせを考えるのは計算が大変（理論的にはOK？？）．
どうやってを決めればいい？
- $\mathcal{A}$ の構成により， $\{\delta^{(k)}\}_{k \in \mathcal{A}}$ と $\{\delta^{(k)}\}_{k \in \mathcal{A}^c}$ では前者のほうがより sparse.
- $\delta^{(k)}$ が sparse な $k$ -th sample から $\hat G_l$ に追加すればよい．ただし $\delta^{(k)}$ は未知なので $\{X^{(k)}, y^{(k)}\}$ と $\{X_\mathcal{I}^{(0)}, y_\mathcal{I}^{(0)}\}$ を用いて推定．具体的な手順は論文参照．

諸々の仮定の下で以下を得る：

Theorem 3.

$\frac{1}{|\mathcal{I}^c|}||X_{\mathcal{I}^c}^{(0)}(\hat \beta^{\hat \theta} - \beta)||^2_2 \lor ||\hat \beta^{\hat \theta} - \beta||_2^2 = O_P\left( \frac{s \log p}{n_{\mathcal{A}^o} + n_0} + \frac{s \log p}{n_0} \land h \sqrt{\frac{\log p}{n_0}} \land h^2 + \frac{\log K}{n_0}\right)$