General Framework of Iterative Methods
The Richardson Method
For the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$, split the coefficient matrix as
$$\mathbf{A} = \mathbf{P} - \mathbf{Q}$$
where $\mathbf{P}$ is invertible. This yields the iteration
$$\mathbf{P}\mathbf{x}^{(k+1)} = \mathbf{Q}\mathbf{x}^{(k)} + \mathbf{b} \quad \Rightarrow \quad \mathbf{x}^{(k+1)} = \mathbf{M}\mathbf{x}^{(k)} + \mathbf{g}$$
where $\mathbf{M} = \mathbf{P}^{-1}\mathbf{Q}$ is the iteration matrix and $\mathbf{g} = \mathbf{P}^{-1}\mathbf{b}$ is the constant term.
This can be rewritten as
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{P}^{-1}(\mathbf{b} - \mathbf{A}\mathbf{x}^{(k)}) = \mathbf{x}^{(k)} + \mathbf{P}^{-1}\mathbf{r}^{(k)}$$
where the residual vector is defined as
$$\mathbf{r}^{(k)} \equiv \mathbf{b} - \mathbf{A}\mathbf{x}^{(k)}$$
Introducing a relaxation factor $\omega^{(k)}$ gives the more general Richardson iteration:
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \omega^{(k)} \mathbf{P}^{-1} \mathbf{r}^{(k)}$$
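The iteration above can be sketched in a few lines of numpy. This is a minimal illustration, not a production solver; the choice $\mathbf{P} = \mathrm{diag}(\mathbf{A})$ and $\omega = 1$ (which reduces to the Jacobi method) and the small test matrix are assumptions made here for demonstration.

```python
import numpy as np

# Stationary Richardson iteration x^(k+1) = x^(k) + omega * P^{-1} r^(k),
# with the illustrative choice P = diag(A) and fixed omega = 1 (Jacobi).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

P_inv = 1.0 / np.diag(A)          # P = diag(A), so P^{-1} acts elementwise
omega = 1.0                        # relaxation factor (stationary case)
x = np.zeros(2)
for _ in range(100):
    r = b - A @ x                  # residual r^(k) = b - A x^(k)
    x = x + omega * (P_inv * r)    # x^(k+1) = x^(k) + omega P^{-1} r^(k)

exact = np.linalg.solve(A, b)
```

For this diagonally dominant matrix the iteration matrix has spectral radius well below 1, so the iterates converge to the exact solution.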
Classification of Richardson Methods
Stationary Richardson methods: $\omega^{(k)} \equiv \omega$ is a constant (e.g. the Jacobi, Gauss-Seidel, and SOR methods);
Non-stationary Richardson methods: $\omega^{(k)}$ depends on the iteration index $k$ (e.g. steepest descent and the conjugate gradient method).
Note:
This lecture focuses on the case where $\mathbf{A}$ is symmetric positive definite; solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ is then strictly equivalent to minimizing a quadratic function, which enables steepest descent and the conjugate gradient method.
For a general system $\mathbf{A}\mathbf{x} = \mathbf{b}$, one can transform it into the normal equations $\mathbf{A}^\top \mathbf{A} \mathbf{x} = \mathbf{A}^\top \mathbf{b}$ and apply the same techniques.
Equivalence with Quadratic Minimization
Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be symmetric positive definite. Then solving the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$ is equivalent to minimizing the quadratic function
$$\phi(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{A} \mathbf{x} - \mathbf{b}^\top \mathbf{x}$$
Proof:
It suffices to show: $\mathbf{x}^*$ minimizes $\phi(\mathbf{x})$ $\Leftrightarrow$ $\mathbf{x}^*$ solves $\mathbf{A}\mathbf{x} = \mathbf{b}$.
$\Leftarrow$ direction:
For arbitrary vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ we have the identity
$$\frac{1}{2} \mathbf{x}^\top \mathbf{A} \mathbf{x} - \mathbf{b}^\top \mathbf{x} = \frac{1}{2} (\mathbf{x} - \mathbf{y})^\top \mathbf{A} (\mathbf{x} - \mathbf{y}) - \frac{1}{2} \mathbf{y}^\top \mathbf{A} \mathbf{y} + (\mathbf{A}\mathbf{y} - \mathbf{b})^\top \mathbf{x}$$
If $\mathbf{x}^*$ solves $\mathbf{A}\mathbf{x} = \mathbf{b}$, taking $\mathbf{y} = \mathbf{x}^*$ in the identity gives
$$\phi(\mathbf{x}) = \frac{1}{2} (\mathbf{x} - \mathbf{x}^*)^\top \mathbf{A} (\mathbf{x} - \mathbf{x}^*) - \frac{1}{2} (\mathbf{x}^*)^\top \mathbf{A} \mathbf{x}^*$$
Since $\mathbf{A}$ is positive definite, the first term is nonnegative and vanishes only at $\mathbf{x} = \mathbf{x}^*$, so the minimizer is exactly $\mathbf{x} = \mathbf{x}^*$.
$\Rightarrow$ direction:
Let $\mathbf{x}^*$ be a minimizer of $\phi(\mathbf{x})$. Since $\phi$ is differentiable, the gradient must vanish at a minimizer:
$$\nabla \phi(\mathbf{x}^*) = \mathbf{A} \mathbf{x}^* - \mathbf{b} = \mathbf{0}$$
That is, $\mathbf{A} \mathbf{x}^* = \mathbf{b}$, so $\mathbf{x}^*$ solves the original system.
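The equivalence can be illustrated numerically: at the solution of $\mathbf{A}\mathbf{x} = \mathbf{b}$, the value of $\phi$ is strictly lower than at any nearby perturbed point. The $2 \times 2$ matrix and random perturbations below are assumptions chosen just for this check.

```python
import numpy as np

# For SPD A, the minimizer of phi(x) = 1/2 x^T A x - b^T x solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)

def phi(x):
    return 0.5 * x @ A @ x - b @ x

# phi at the solution is lower than at randomly perturbed nearby points
rng = np.random.default_rng(0)
vals = [phi(x_star + 0.1 * rng.standard_normal(2)) for _ in range(100)]
```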
Notational Conventions
To simplify notation, we introduce the following, where $\mathbf{A}$ is symmetric positive definite:
Euclidean inner product:
$$\langle \mathbf{x}, \mathbf{y} \rangle \equiv \mathbf{x}^\top \mathbf{y}$$
$\mathbf{A}$-inner product:
$$\langle \mathbf{x}, \mathbf{y} \rangle_{\mathbf{A}} \equiv \mathbf{x}^\top \mathbf{A} \mathbf{y} = \langle \mathbf{x}, \mathbf{A} \mathbf{y} \rangle = \langle \mathbf{A} \mathbf{x}, \mathbf{y} \rangle$$
Note: since $\mathbf{A}$ is symmetric positive definite, $\langle \cdot\,, \cdot \rangle_{\mathbf{A}}$ is indeed an inner product on $\mathbb{R}^n$.
$\mathbf{A}$-norm:
$$\|\mathbf{x}\|_\mathbf{A} \equiv \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle_{\mathbf{A}}} = \sqrt{\mathbf{x}^\top \mathbf{A} \mathbf{x}}$$
The objective function can then be written compactly as
$$\phi(\mathbf{x}) = \frac{1}{2} \langle \mathbf{x}, \mathbf{x} \rangle_{\mathbf{A}} - \langle \mathbf{b}, \mathbf{x} \rangle = \frac{1}{2} \| \mathbf{x} - \mathbf{x}^* \|_{\mathbf{A}}^2 + C$$
Geometric meaning: $\phi(\mathbf{x})$ is an upward-opening paraboloid whose level sets are $\mathbf{A}$-ellipsoids.
Steepest Descent
Steepest descent is the specialization of gradient descent to this quadratic: at each step, perform a one-dimensional search along the direction in which the function decreases fastest at the current point (the negative gradient).
Let the current iterate be $\mathbf{x}^{(k)}$, the search direction $\mathbf{p}^{(k)}$, and the step size $\alpha^{(k)}$. Then
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha^{(k)} \mathbf{p}^{(k)}$$
where $\alpha^{(k)}$ satisfies
$$\begin{aligned}
\alpha^{(k)}
&= \argmin_{\alpha \in \mathbb{R}} \phi(\mathbf{x}^{(k)} + \alpha \mathbf{p}^{(k)}) \\
&= \argmin_{\alpha \in \mathbb{R}} \|\mathbf{x}^{(k)} + \alpha \mathbf{p}^{(k)} - \mathbf{x}^*\|_{\mathbf{A}}
\end{aligned}$$
Choosing the Search Direction
The gradient of $\phi(\mathbf{x})$ at $\mathbf{x}^{(k)}$ is
$$\nabla \phi(\mathbf{x}^{(k)}) \equiv \mathbf{g}^{(k)} = \mathbf{A} \mathbf{x}^{(k)} - \mathbf{b} = -\mathbf{r}^{(k)}$$
Since the negative gradient is the direction of fastest decrease, we take
$$\mathbf{p}^{(k)} = -\mathbf{g}^{(k)} = \mathbf{r}^{(k)}$$
Choosing the Step Size
Define the single-variable function
$$f^{(k)}(\alpha) = \phi(\mathbf{x}^{(k)} + \alpha \mathbf{p}^{(k)})$$
Expanding gives
$$\begin{aligned}
f^{(k)}(\alpha)
&= \frac{1}{2} \langle \mathbf{x}^{(k)} + \alpha \mathbf{p}^{(k)}, \mathbf{x}^{(k)} + \alpha \mathbf{p}^{(k)} \rangle_{\mathbf{A}} - \langle \mathbf{b}, \mathbf{x}^{(k)} + \alpha \mathbf{p}^{(k)} \rangle \\
&= \frac{1}{2} \langle \mathbf{x}^{(k)},\mathbf{x}^{(k)} \rangle_{\mathbf{A}} + \alpha \langle \mathbf{x}^{(k)},\mathbf{p}^{(k)} \rangle_{\mathbf{A}} + \frac{\alpha^2}{2} \langle \mathbf{p}^{(k)},\mathbf{p}^{(k)}\rangle_{\mathbf{A}} - \langle \mathbf{b}, \mathbf{x}^{(k)} \rangle - \alpha \langle \mathbf{b}, \mathbf{p}^{(k)} \rangle \\
&= \phi(\mathbf{x}^{(k)}) + \alpha \langle \mathbf{g}^{(k)}, \mathbf{p}^{(k)} \rangle + \frac{\alpha^2}{2} \langle \mathbf{p}^{(k)},\mathbf{p}^{(k)}\rangle_{\mathbf{A}}
\end{aligned}$$
This is a quadratic in $\alpha$ whose minimizer is
$$\alpha^{(k)} = -\frac{ \langle \mathbf{g}^{(k)}, \mathbf{p}^{(k)} \rangle }{ \langle \mathbf{p}^{(k)},\mathbf{p}^{(k)}\rangle_{\mathbf{A}} } = \frac{ \langle \mathbf{r}^{(k)}, \mathbf{p}^{(k)} \rangle }{ \langle \mathbf{p}^{(k)},\mathbf{p}^{(k)}\rangle_{\mathbf{A}} }$$
Substituting $\mathbf{p}^{(k)} = -\mathbf{g}^{(k)} = \mathbf{r}^{(k)}$ gives
$$\alpha^{(k)} = \frac{ \langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle }{ \langle \mathbf{g}^{(k)},\mathbf{g}^{(k)}\rangle_{\mathbf{A}} } = \frac{ \langle \mathbf{r}^{(k)}, \mathbf{r}^{(k)} \rangle }{ \langle \mathbf{r}^{(k)},\mathbf{r}^{(k)}\rangle_{\mathbf{A}} }$$
Gradient Orthogonality
Steepest descent has the property that consecutive gradients (residuals) are orthogonal:
$$\langle \mathbf{g}^{(k-1)}, \mathbf{g}^{(k)} \rangle = 0, \quad \forall k \geqslant 1$$
Proof:
From the update rule,
$$\mathbf{g}^{(k+1)} - \mathbf{g}^{(k)} = \mathbf{A} (\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}) = \alpha^{(k)} \mathbf{A} \mathbf{p}^{(k)}$$
Substituting $\mathbf{p}^{(k)} = -\mathbf{g}^{(k)}$ and taking the inner product with $\mathbf{g}^{(k)}$:
$$\langle \mathbf{g}^{(k)}, \mathbf{g}^{(k+1)} \rangle - \langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle = -\alpha^{(k)} \langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle_\mathbf{A}$$
Using $\alpha^{(k)} = \langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle / \langle \mathbf{g}^{(k)},\mathbf{g}^{(k)}\rangle_{\mathbf{A}}$, the right-hand side cancels the second term on the left, so
$$\langle \mathbf{g}^{(k)}, \mathbf{g}^{(k+1)} \rangle = 0$$
Note: this property holds only for consecutive steps; non-consecutive gradients are generally not orthogonal, which is the root cause of the slow convergence of steepest descent (it may produce a zigzag path).
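This adjacent-only orthogonality is easy to observe numerically. The sketch below runs a few steepest descent steps on a random SPD system (the test matrix is an assumption for illustration) and records the inner products of consecutive gradients, which stay at rounding level.

```python
import numpy as np

# Checking <g^(k), g^(k+1)> = 0 at every steepest descent step.
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)        # SPD by construction
b = rng.standard_normal(5)

x = np.zeros(5)
g = A @ x - b                      # gradient = -residual
grads = [g.copy()]
for _ in range(10):
    t = A @ g
    alpha = (g @ g) / (g @ t)      # exact line-search step
    x = x - alpha * g
    g = g - alpha * t
    grads.append(g.copy())

adjacent = [abs(grads[k] @ grads[k + 1]) for k in range(10)]
```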
Steepest Descent Code
Initialization: given an initial guess $\mathbf{x}^{(0)}$, compute the gradient $\mathbf{g}^{(0)} = \mathbf{A} \mathbf{x}^{(0)} - \mathbf{b}$
Loop:
Compute $\mathbf{t} = \mathbf{A} \mathbf{g}^{(k)}$
Compute the step size:
$$\alpha^{(k)} = \dfrac{ \langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle }{ \langle \mathbf{g}^{(k)}, \mathbf{t} \rangle }$$
Update the solution: $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \alpha^{(k)} \mathbf{g}^{(k)}$
Update the gradient: $\mathbf{g}^{(k+1)} = \mathbf{g}^{(k)} - \alpha^{(k)} \mathbf{t}$
If $\|\mathbf{r}^{(k+1)}\|_2 = \|\mathbf{g}^{(k+1)}\|_2 < \varepsilon$, stop; otherwise continue the loop
import numpy as np

# assumes the SPD matrix A, right-hand side b, initial guess x,
# maximum iteration count N, and tolerance EPS are already defined
k = 0
g = A @ x - b
while k < N:
    t = A @ g
    alpha = np.dot(g, g) / np.dot(g, t)  # step size
    x -= alpha * g                       # update solution
    g -= alpha * t                       # update gradient
    if np.linalg.norm(g) < EPS:
        break
    k += 1
Complexity: the dominant cost is the matrix-vector product $\mathbf{A} \mathbf{g}^{(k)}$. If $\mathbf{A}$ is sparse, the cost per step is $\mathcal{O}(\text{nnz}(\mathbf{A}))$.
Convergence Analysis of Steepest Descent
Lemma: let the symmetric positive definite matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ have eigenvalues $0 < \lambda_1 \leqslant \cdots \leqslant \lambda_n$, and let $P(t)$ be a polynomial in $t$. Then
$$\|P(\mathbf{A})\mathbf{x}\|_{\mathbf{A}} \leqslant \max_{1 \leqslant i \leqslant n}|P(\lambda_i)|\,\|\mathbf{x}\|_{\mathbf{A}}, \quad \mathbf{x} \in \mathbb{R}^n$$
Proof of the lemma:
Since $\mathbf{A}$ is symmetric positive definite, the spectral theorem gives an orthogonal matrix $\mathbf{Q}$ such that
$$\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top$$
where $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ and $0 < \lambda_1 \leqslant \lambda_2 \leqslant \cdots \leqslant \lambda_n$.
For the polynomial $P(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_m t^m$ we have
$$P(\mathbf{A}) = a_0\mathbf{I} + a_1\mathbf{A} + a_2\mathbf{A}^2 + \cdots + a_m\mathbf{A}^m$$
Using $\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top$ we get $\mathbf{A}^k = \mathbf{Q}\mathbf{\Lambda}^k\mathbf{Q}^\top$ for $k = 0, 1, 2, \ldots$, and therefore
$$P(\mathbf{A}) = \mathbf{Q}P(\mathbf{\Lambda})\mathbf{Q}^\top = \mathbf{Q}\,\mathrm{diag}(P(\lambda_1), P(\lambda_2), \ldots, P(\lambda_n))\,\mathbf{Q}^\top$$
Let $\mathbf{y} = \mathbf{Q}^\top\mathbf{x}$, i.e. $\mathbf{x} = \mathbf{Q}\mathbf{y}$. Then
$$\|\mathbf{x}\|_{\mathbf{A}}^2 = \mathbf{x}^\top\mathbf{A}\mathbf{x} = \mathbf{y}^\top\mathbf{\Lambda}\mathbf{y} = \sum_{i=1}^n \lambda_i y_i^2$$
$$\begin{aligned}
\|P(\mathbf{A})\mathbf{x}\|_{\mathbf{A}}^2
&= (P(\mathbf{A})\mathbf{x})^\top \mathbf{A} (P(\mathbf{A})\mathbf{x}) = \mathbf{x}^\top P(\mathbf{A})^\top \mathbf{A} P(\mathbf{A})\mathbf{x} \\
&= \mathbf{y}^\top P(\mathbf{\Lambda})\mathbf{\Lambda}P(\mathbf{\Lambda})\mathbf{y} = \sum_{i=1}^n \lambda_i [P(\lambda_i)]^2 y_i^2
\end{aligned}$$
Let $M = \max_{1 \leqslant i \leqslant n}|P(\lambda_i)|$. Then
$$\|P(\mathbf{A})\mathbf{x}\|_{\mathbf{A}}^2 = \sum_{i=1}^n \lambda_i [P(\lambda_i)]^2 y_i^2 \leqslant M^2 \sum_{i=1}^n \lambda_i y_i^2 = M^2 \|\mathbf{x}\|_{\mathbf{A}}^2$$
Taking square roots on both sides completes the proof.
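A quick numerical sanity check of the lemma is shown below. The polynomial $P(t) = 1 - 2t + t^2$ and the random SPD matrix are assumptions chosen purely for illustration; any polynomial would do.

```python
import numpy as np

# Checking ||P(A)x||_A <= max_i |P(lambda_i)| * ||x||_A for P(t) = 1 - 2t + t^2.
rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
A = B @ B.T + np.eye(6)            # SPD by construction
lams = np.linalg.eigvalsh(A)       # eigenvalues of A

def P_of_A(A):
    # matrix polynomial P(A) = I - 2A + A^2
    return np.eye(A.shape[0]) - 2 * A + A @ A

def A_norm(v):
    return np.sqrt(v @ A @ v)

x = rng.standard_normal(6)
lhs = A_norm(P_of_A(A) @ x)
rhs = np.max(np.abs(1 - 2 * lams + lams**2)) * A_norm(x)
```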
Error estimate for steepest descent:
Let the eigenvalues of $\mathbf{A}$ satisfy $0 < \lambda_1 \leqslant \lambda_2 \leqslant \cdots \leqslant \lambda_n$. Then for any initial guess $\mathbf{x}^{(0)}$,
$$\|\mathbf{x}^{(k)} - \mathbf{x}^*\|_\mathbf{A} \leqslant \left( \frac{ \lambda_n - \lambda_1 }{ \lambda_n + \lambda_1 } \right)^k \|\mathbf{x}^{(0)} - \mathbf{x}^*\|_\mathbf{A}$$
Proof:
Let $\mathbf{x}^*$ be the exact solution of $\mathbf{A}\mathbf{x} = \mathbf{b}$ and define the error vector $\mathbf{e}^{(k)} = \mathbf{x}^{(k)} - \mathbf{x}^*$.
Then
$$\mathbf{g}^{(k)} = \mathbf{A} \mathbf{x}^{(k)} - \mathbf{b} = \mathbf{A} \mathbf{e}^{(k)}$$
We also have
$$\mathbf{g}^{(k+1)} = \mathbf{g}^{(k)} + \alpha^{(k)} \mathbf{A} \mathbf{p}^{(k)} = (\mathbf{I} - \alpha^{(k)} \mathbf{A}) \mathbf{g}^{(k)}$$
and hence
$$\mathbf{e}^{(k+1)} = (\mathbf{I} - \alpha^{(k)} \mathbf{A}) \mathbf{e}^{(k)}$$
Moreover $\alpha^{(k)} = \argmin_{\alpha \in \mathbb{R}} \|\mathbf{e}^{(k+1)}\|_{\mathbf{A}}$, so for every $\alpha \in \mathbb{R}$:
$$\|\mathbf{e}^{(k+1)}\|_{\mathbf{A}} = \|(\mathbf{I} - \alpha^{(k)} \mathbf{A}) \mathbf{e}^{(k)}\|_{\mathbf{A}} \leqslant \|(\mathbf{I} - \alpha \mathbf{A}) \mathbf{e}^{(k)}\|_{\mathbf{A}}$$
Consider the linear polynomial $P(t) = 1 - \alpha t$, so that $P(\mathbf{A}) = \mathbf{I} - \alpha \mathbf{A}$.
By the lemma,
$$\|(\mathbf{I} - \alpha \mathbf{A})\mathbf{e}^{(k)}\|_{\mathbf{A}} \leqslant \max_{1 \leqslant i \leqslant n}|1 - \alpha \lambda_i| \cdot \|\mathbf{e}^{(k)}\|_{\mathbf{A}}$$
Therefore
$$\|\mathbf{e}^{(k+1)}\|_{\mathbf{A}} \leqslant \min_{\alpha \in \mathbb{R}} \max_{1 \leqslant i \leqslant n}|1 - \alpha \lambda_i| \cdot \|\mathbf{e}^{(k)}\|_{\mathbf{A}}$$
Since $\{\lambda_i\}_{i=1}^n \subset [\lambda_1, \lambda_n]$ and $|1-\alpha \lambda|$ is a piecewise linear function of $\lambda$, its maximum over the interval is attained at an endpoint:
$$\max_{1 \leqslant i \leqslant n}|1 - \alpha \lambda_i| \leqslant \max_{\lambda \in [\lambda_1, \lambda_n]}|1 - \alpha \lambda| = \max\{|1 - \alpha \lambda_1|, |1 - \alpha \lambda_n|\}$$
The optimal $\alpha^*$ makes $|1 - \alpha \lambda_1| = |1 - \alpha \lambda_n|$ with opposite signs, i.e. $1 - \alpha^* \lambda_1 = -(1 - \alpha^* \lambda_n)$,
which gives $\alpha^* = 2/(\lambda_1 + \lambda_n)$. Substituting back:
$$\max_{1 \leqslant i \leqslant n}|1 - \alpha^* \lambda_i| = 1 - \frac{2\lambda_1}{\lambda_1 + \lambda_n} = \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}$$
Altogether we obtain the one-step error reduction $\|\mathbf{e}^{(k+1)}\|_{\mathbf{A}} \leqslant \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} \|\mathbf{e}^{(k)}\|_{\mathbf{A}}$.
Applying this $k$ times completes the proof.
Note:
The convergence rate is determined by the eigenvalues;
when $\lambda_n / \lambda_1 \gg 1$ (e.g. a highly elongated ellipsoid), convergence is extremely slow (zigzag path).
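The one-step contraction factor $q = (\lambda_n - \lambda_1)/(\lambda_n + \lambda_1)$ can be checked directly. The diagonal test matrix with eigenvalues $1$ and $10$ (so $q = 9/11$) and the starting point are assumptions for illustration.

```python
import numpy as np

# Checking ||e^(k+1)||_A <= q * ||e^(k)||_A at every steepest descent step.
A = np.diag([1.0, 10.0])           # eigenvalues 1 and 10, so q = 9/11
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)
q = (10.0 - 1.0) / (10.0 + 1.0)

def A_norm(v):
    return np.sqrt(v @ A @ v)

x = np.array([5.0, -3.0])
ratios = []
for _ in range(20):
    g = A @ x - b
    alpha = (g @ g) / (g @ (A @ g))
    x_new = x - alpha * g
    ratios.append(A_norm(x_new - x_star) / A_norm(x - x_star))
    x = x_new
```

Every recorded ratio stays at or below $q = 9/11 \approx 0.818$, matching the single-step bound from the proof.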
Conjugate Gradient Method
The conjugate gradient (CG) method is a fundamental improvement over steepest descent: instead of the negative gradient, it constructs a set of $\mathbf{A}$-conjugate search directions $\{\mathbf{d}^{(k)}\}$ so that, in exact arithmetic, the iteration converges in at most $n$ steps and avoids oscillation.
A-Conjugacy and Conjugate Directions
Definitions:
$\mathbf{A}$-conjugate: for a symmetric positive definite matrix $\mathbf{A}$, vectors $\mathbf{x}, \mathbf{y}$ are $\mathbf{A}$-conjugate if
$$\langle \mathbf{x}, \mathbf{y} \rangle_\mathbf{A} = \mathbf{x}^\top \mathbf{A} \mathbf{y} = 0$$
$\mathbf{A}$-conjugate set: $\{\mathbf{d}^{(0)}, \mathbf{d}^{(1)}, \dots, \mathbf{d}^{(m)}\}$ satisfying
$$\langle \mathbf{d}^{(i)}, \mathbf{d}^{(j)} \rangle_\mathbf{A} = 0, \quad \forall i \ne j$$
Properties:
An $\mathbf{A}$-conjugate set is linearly independent;
$\mathbb{R}^n$ contains at most $n$ linearly independent $\mathbf{A}$-conjugate vectors;
If $\{\mathbf{d}^{(i)}\}_{i=0}^{n-1}$ forms an $\mathbf{A}$-conjugate basis, then any vector $\mathbf{x}$ has the unique projection representation
$$\mathbf{x} = \sum_{i=0}^{n-1} \lambda_i \mathbf{d}^{(i)}, \quad \lambda_i = \frac{ \langle \mathbf{x}, \mathbf{d}^{(i)} \rangle_\mathbf{A} }{ \langle \mathbf{d}^{(i)}, \mathbf{d}^{(i)} \rangle_\mathbf{A} }$$
In particular, for the solution $\mathbf{x}^*$ (using $\mathbf{A}\mathbf{x}^* = \mathbf{b}$):
$$\mathbf{x}^* = \sum_{i=0}^{n-1} \lambda_i \mathbf{d}^{(i)}, \quad \lambda_i = \frac{ \langle \mathbf{b}, \mathbf{d}^{(i)} \rangle }{ \langle \mathbf{d}^{(i)}, \mathbf{d}^{(i)} \rangle_\mathbf{A} }$$
Geometric meaning: $\mathbf{A}$-conjugate directions are mutually perpendicular in the stretched coordinate system (along the principal axes of the ellipsoid), which makes searching along them far more efficient.
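The expansion of $\mathbf{x}^*$ in an $\mathbf{A}$-conjugate basis can be verified directly. Here the basis is built by Gram-Schmidt in the $\mathbf{A}$-inner product starting from the standard basis (an assumption made for illustration; CG instead builds the basis iteratively).

```python
import numpy as np

# x* = sum_i lambda_i d_i with lambda_i = <b, d_i> / <d_i, d_i>_A
# recovers A^{-1} b when {d_i} is an A-conjugate basis.
rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)            # SPD by construction
b = rng.standard_normal(4)

# A-orthogonalize the standard basis (Gram-Schmidt in the A-inner product)
ds = []
for e in np.eye(4):
    d = e.copy()
    for dj in ds:
        d -= (e @ A @ dj) / (dj @ A @ dj) * dj
    ds.append(d)

# assemble x* from the projection coefficients
x = sum((b @ d) / (d @ A @ d) * d for d in ds)
```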
Recursive Construction of Conjugate Directions
The conjugate gradient method builds an $\mathbf{A}$-conjugate basis recursively, solving (projecting) as it goes. The construction is as follows:
Initial values:
$$\mathbf{x}^{(0)}, \quad \mathbf{d}^{(0)} = -\mathbf{g}^{(0)}$$
First form the new iterate and gradient:
$$\mathbf{x}^{(k)} = \mathbf{x}^{(k-1)} + \alpha^{(k-1)} \mathbf{d}^{(k-1)}$$
where $\alpha^{(k-1)}$ is determined as before (this is what enforces orthogonality of consecutive gradients):
$$\alpha^{(k-1)} = -\frac{ \langle \mathbf{g}^{(k-1)}, \mathbf{d}^{(k-1)} \rangle }{ \langle \mathbf{d}^{(k-1)}, \mathbf{d}^{(k-1)} \rangle_\mathbf{A} }$$
The new gradient is then $\mathbf{g}^{(k)} = \mathbf{A} \mathbf{x}^{(k)} - \mathbf{b}$.
Next, form the new conjugate direction:
$$\mathbf{d}^{(k)} = -\mathbf{g}^{(k)} + \beta^{(k-1)} \mathbf{d}^{(k-1)}$$
where $\beta^{(k-1)}$ is determined by the $\mathbf{A}$-conjugacy condition $\langle \mathbf{d}^{(k-1)}, \mathbf{d}^{(k)} \rangle_\mathbf{A} = 0$:
$$\beta^{(k-1)} = \frac{ \langle \mathbf{g}^{(k)}, \mathbf{d}^{(k-1)}\rangle_\mathbf{A} }{ \langle \mathbf{d}^{(k-1)}, \mathbf{d}^{(k-1)}\rangle_\mathbf{A} }$$
Theorem:
The directions $\{\mathbf{d}^{(k)}\}$ constructed by the conjugate gradient recursion form an $\mathbf{A}$-conjugate basis, i.e.
$$\langle \mathbf{d}^{(i)}, \mathbf{d}^{(j)} \rangle_\mathbf{A} = 0, \quad \forall i \neq j$$
Proof:
By induction, proving the following three properties simultaneously for every $k \geqslant 0$:
$\mathbf{A}$-conjugacy: $\langle \mathbf{d}^{(i)}, \mathbf{d}^{(j)} \rangle_\mathbf{A} = 0$ for $0 \leqslant j < i \leqslant k$
Gradient orthogonality: $\langle \mathbf{g}^{(i)}, \mathbf{g}^{(j)} \rangle = 0$ for $0 \leqslant j < i \leqslant k$
Gradient-direction orthogonality: $\langle \mathbf{g}^{(i)}, \mathbf{d}^{(j)} \rangle = 0$ for $0 \leqslant j < i \leqslant k$
The recursion gives
$$\mathbf{g}^{(k+1)} = \mathbf{g}^{(k)} + \alpha^{(k)}\mathbf{A}\mathbf{d}^{(k)}, \quad \mathbf{d}^{(k)} = -\mathbf{g}^{(k)} + \beta^{(k-1)} \mathbf{d}^{(k-1)}$$
and hence, for any fixed vector in the left slot,
$$\begin{align}
\langle \;\cdot\;, \mathbf{g}^{(m+1)} \rangle
&= \langle \;\cdot\;, \mathbf{g}^{(m)} \rangle + \alpha^{(m)}\langle \;\cdot\;, \mathbf{d}^{(m)} \rangle _\mathbf{A} \\
\langle \;\cdot\;, \mathbf{d}^{(m)} \rangle _\mathbf{A}
&= -\langle \;\cdot\;, \mathbf{g}^{(m)} \rangle _\mathbf{A}
+ \beta^{(m-1)} \langle \;\cdot\;, \mathbf{d}^{(m-1)} \rangle _\mathbf{A}
\end{align}$$
Base case: $\langle \mathbf{d}^{(1)}, \mathbf{d}^{(0)}\rangle_\mathbf{A} = 0$ holds by the choice of $\beta^{(0)}$.
Using $\mathbf{d}^{(0)} = -\mathbf{g}^{(0)}$ and $\alpha^{(0)} = \langle \mathbf{g}^{(0)}, \mathbf{g}^{(0)} \rangle / \langle \mathbf{g}^{(0)}, \mathbf{g}^{(0)} \rangle_\mathbf{A}$, substituting $\mathbf{g}^{(0)}$ and $m=0$ into $(1)$ gives
$$\langle \mathbf{g}^{(0)}, \mathbf{g}^{(1)} \rangle = \langle \mathbf{g}^{(0)}, \mathbf{g}^{(0)} \rangle - \alpha^{(0)}\langle \mathbf{g}^{(0)}, \mathbf{g}^{(0)} \rangle_\mathbf{A} = 0$$
Now assume the three properties hold for some $k \geqslant 1$; we show they hold for $k+1$.
First, gradient orthogonality: $\langle \mathbf{g}^{(k+1)}, \mathbf{g}^{(j)} \rangle = 0$ for $j \leqslant k$.
For $j = k$: by the induction hypothesis, $\langle \mathbf{g}^{(k)}, \mathbf{d}^{(k)} \rangle = -\langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle$, and the choice of $\alpha^{(k)}$ then makes the two terms of $(1)$ cancel.
For $j < k$: substituting $\mathbf{g}^{(j)}$ and $m=k$ into $(1)$, both terms vanish by the induction hypothesis (for the second term, expand $\mathbf{g}^{(j)} = -\mathbf{d}^{(j)} + \beta^{(j-1)}\mathbf{d}^{(j-1)}$ and use $\mathbf{A}$-conjugacy):
$$\langle \mathbf{g}^{(j)}, \mathbf{g}^{(k+1)} \rangle = \langle \mathbf{g}^{(j)}, \mathbf{g}^{(k)} \rangle + \alpha^{(k)}\langle \mathbf{g}^{(j)}, \mathbf{d}^{(k)} \rangle_\mathbf{A} = 0$$
Next, gradient-direction orthogonality: $\langle \mathbf{g}^{(k+1)}, \mathbf{d}^{(j)} \rangle = 0$ for $j \leqslant k$.
For $j = k$: immediate from the choice of $\alpha^{(k)}$.
For $j < k$: substituting $\mathbf{d}^{(j)}$ and $m=k$ into $(1)$, both terms vanish by the induction hypothesis:
$$\langle \mathbf{d}^{(j)}, \mathbf{g}^{(k+1)} \rangle = \langle \mathbf{d}^{(j)}, \mathbf{g}^{(k)} \rangle + \alpha^{(k)}\langle \mathbf{d}^{(j)}, \mathbf{d}^{(k)} \rangle_\mathbf{A} = 0$$
Finally, $\mathbf{A}$-conjugacy: $\langle \mathbf{d}^{(k+1)}, \mathbf{d}^{(j)} \rangle_\mathbf{A} = 0$ for $j \leqslant k$.
For $j = k$: this is exactly how $\beta^{(k)}$ is chosen.
For $j < k$: by $(2)$ with $m = k+1$,
$$\langle \mathbf{d}^{(j)}, \mathbf{d}^{(k+1)} \rangle_\mathbf{A} = -\langle \mathbf{d}^{(j)}, \mathbf{g}^{(k+1)} \rangle_\mathbf{A} + \beta^{(k)} \langle \mathbf{d}^{(j)}, \mathbf{d}^{(k)} \rangle_\mathbf{A}$$
The second term vanishes by the induction hypothesis. For the first term, $\mathbf{A}\mathbf{d}^{(j)} = \frac{1}{\alpha^{(j)}}(\mathbf{g}^{(j+1)} - \mathbf{g}^{(j)})$, so
$$\langle \mathbf{d}^{(j)}, \mathbf{g}^{(k+1)} \rangle_\mathbf{A} = \frac{1}{\alpha^{(j)}}\langle \mathbf{g}^{(j+1)} - \mathbf{g}^{(j)}, \mathbf{g}^{(k+1)} \rangle = 0$$
by the gradient orthogonality just proved (note $j+1 \leqslant k < k+1$).
This completes the proof.
Equivalent Forms of the Conjugacy Parameter
Hestenes–Stiefel (HS):
$$\beta^{(k-1)}_{\text{HS}} = \frac{\langle \mathbf{g}^{(k)}, \mathbf{d}^{(k-1)} \rangle_{\mathbf{A}}}{\langle \mathbf{d}^{(k-1)}, \mathbf{d}^{(k-1)} \rangle_{\mathbf{A}}}$$
Crowder–Wolfe (CW):
$$\beta^{(k-1)}_{\text{CW}} = \frac{\langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} - \mathbf{g}^{(k-1)} \rangle}{\langle \mathbf{d}^{(k-1)}, \mathbf{g}^{(k)} - \mathbf{g}^{(k-1)} \rangle}$$
Equivalence: substitute $\mathbf{g}^{(k)} - \mathbf{g}^{(k-1)} = \alpha^{(k-1)} \mathbf{A} \mathbf{d}^{(k-1)}$ into HS.
Dixon (D):
$$\beta^{(k-1)}_{\text{D}} = -\frac{\langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle}{\langle \mathbf{d}^{(k-1)}, \mathbf{g}^{(k-1)} \rangle}$$
Equivalence: substitute the orthogonality properties above into CW.
Fletcher–Reeves (FR):
$$\beta^{(k-1)}_{\text{FR}} = \frac{\langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle}{\langle \mathbf{g}^{(k-1)}, \mathbf{g}^{(k-1)} \rangle}$$
Equivalence: substitute $\langle \mathbf{d}^{(k)}, \mathbf{g}^{(k)} \rangle = -\langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle$ into D.
Polak–Ribière–Polyak (PRP):
$$\beta^{(k-1)}_{\text{PRP}} = \frac{\langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} - \mathbf{g}^{(k-1)} \rangle}{\langle \mathbf{g}^{(k-1)}, \mathbf{g}^{(k-1)} \rangle}$$
Equivalence: by gradient orthogonality, $\langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} - \mathbf{g}^{(k-1)} \rangle = \langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle$, so PRP reduces to FR.
Dai–Yuan (DY):
$$\beta^{(k-1)}_{\text{DY}} = \frac{\langle \mathbf{g}^{(k)}, \mathbf{g}^{(k)} \rangle}{\langle \mathbf{d}^{(k-1)}, \mathbf{g}^{(k)} - \mathbf{g}^{(k-1)} \rangle}$$
Equivalence: the denominator equals $-\langle \mathbf{d}^{(k-1)}, \mathbf{g}^{(k-1)} \rangle$ since $\langle \mathbf{d}^{(k-1)}, \mathbf{g}^{(k)} \rangle = 0$, so DY reduces to D.
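The equivalence of these formulas in exact arithmetic can be checked numerically: running CG on a random SPD system (an assumption for illustration) and evaluating FR, PRP, and HS at each step gives values that agree to rounding precision.

```python
import numpy as np

# FR, PRP and HS give the same beta when CG is applied to an SPD system.
rng = np.random.default_rng(4)
B = rng.standard_normal((6, 6))
A = B @ B.T + np.eye(6)            # SPD by construction
b = rng.standard_normal(6)

x = np.zeros(6)
g = A @ x - b
d = -g.copy()
diffs = []
for _ in range(4):
    t = A @ d                      # t = A d^(k)
    alpha = -(g @ d) / (d @ t)
    x = x + alpha * d
    g_new = g + alpha * t
    beta_fr = (g_new @ g_new) / (g @ g)
    beta_prp = (g_new @ (g_new - g)) / (g @ g)
    beta_hs = (g_new @ t) / (d @ t)  # <g^(k), d^(k-1)>_A / <d^(k-1), d^(k-1)>_A
    diffs.append(max(abs(beta_fr - beta_prp), abs(beta_fr - beta_hs)))
    g = g_new
    d = -g + beta_fr * d
```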
Conjugate Gradient Code
Initialization:
Given an initial guess $\mathbf{x}^{(0)}$,
compute the gradient $\mathbf{g}^{(0)} = \mathbf{A}\mathbf{x}^{(0)} - \mathbf{b}$,
and set the search direction $\mathbf{d}^{(0)} = -\mathbf{g}^{(0)}$
Loop:
Compute $\mathbf{t} = \mathbf{A}\mathbf{d}^{(k)}$
Compute $s = \langle \mathbf{d}^{(k)}, \mathbf{t} \rangle$ (i.e. $\langle \mathbf{d}^{(k)}, \mathbf{d}^{(k)} \rangle_\mathbf{A}$)
Compute the step size:
$$\alpha^{(k)} = -\dfrac{ \langle \mathbf{g}^{(k)}, \mathbf{d}^{(k)} \rangle }{ s }$$
Update the solution: $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha^{(k)} \mathbf{d}^{(k)}$
Update the gradient: $\mathbf{g}^{(k+1)} = \mathbf{g}^{(k)} + \alpha^{(k)} \mathbf{t}$
If $\|\mathbf{r}^{(k+1)}\|_2 = \|\mathbf{g}^{(k+1)}\|_2 < \varepsilon$, stop
Compute $\beta^{(k)}$: choose any of the equivalent forms; FR is the usual choice.
$$\beta^{(k)} = \dfrac{ \|\mathbf{g}^{(k+1)}\|_2^2 }{ \|\mathbf{g}^{(k)}\|_2^2 }$$
Update the search direction: $\mathbf{d}^{(k+1)} = -\mathbf{g}^{(k+1)} + \beta^{(k)} \mathbf{d}^{(k)}$
import numpy as np

# assumes the SPD matrix A, right-hand side b, initial guess x,
# maximum iteration count N, and tolerance EPS are already defined
k = 0
g = A @ x - b
d = -g.copy()                         # d^(0) = -g^(0)
g_norm = np.dot(g, g)                 # stores ||g^(k)||^2
while k < N:
    t = A @ d
    s = np.dot(d, t)
    alpha = -np.dot(g, d) / s         # step size
    x += alpha * d                    # update solution
    g += alpha * t                    # update gradient
    g_norm_new = np.dot(g, g)
    if np.sqrt(g_norm_new) < EPS:     # test ||g^(k+1)||_2, not its square
        break
    beta = g_norm_new / g_norm        # FR form
    g_norm = g_norm_new
    d = -g + beta * d                 # update search direction
    k += 1
Complexity:
As with steepest descent, the dominant cost per step is one matrix-vector product $\mathbf{A}\mathbf{d}^{(k)}$, which for a sparse matrix costs $\mathcal{O}(\text{nnz}(\mathbf{A}))$.
However, in exact arithmetic CG converges in at most $n$ steps, and in practice it usually reaches the required accuracy in far fewer than $n$ steps, converging markedly faster than steepest descent.
Convergence of the Conjugate Gradient Method
In exact arithmetic, CG converges in at most $n$ steps (i.e. $\mathbf{x}^{(n)} = \mathbf{x}^*$);
in practice, despite rounding errors, the required accuracy is usually reached after $k \ll n$ steps;
the convergence rate depends on the eigenvalue distribution but is far better than that of steepest descent.
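Finite termination can be demonstrated numerically: running exactly $n$ CG iterations on a random $8 \times 8$ SPD system (an assumption for illustration) drives the residual down to near machine precision.

```python
import numpy as np

# CG reaches the exact solution in at most n steps in exact arithmetic.
rng = np.random.default_rng(5)
B = rng.standard_normal((8, 8))
A = B @ B.T + np.eye(8)            # SPD by construction, n = 8
b = rng.standard_normal(8)

x = np.zeros(8)
g = A @ x - b
d = -g.copy()
gg = g @ g                         # ||g^(k)||^2
for _ in range(8):                 # at most n = 8 iterations
    t = A @ d
    alpha = -(g @ d) / (d @ t)
    x = x + alpha * d
    g = g + alpha * t
    gg_new = g @ g
    if gg_new < 1e-30:             # already converged to rounding level
        break
    d = -g + (gg_new / gg) * d     # FR update
    gg = gg_new

residual = np.linalg.norm(A @ x - b)
```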
Comparison and Summary
Comparison
| Property | Steepest Descent (SD) | Conjugate Gradient (CG) |
| --- | --- | --- |
| Search direction | $\mathbf{p}^{(k)} = -\mathbf{g}^{(k)}$ (negative gradient) | $\mathbf{d}^{(k)} = -\mathbf{g}^{(k)} + \beta^{(k-1)} \mathbf{d}^{(k-1)}$ ($\mathbf{A}$-conjugate basis) |
| Step size | $\alpha^{(k)} = \dfrac{\|\mathbf{r}^{(k)}\|^2}{\mathbf{r}^{(k)\top} \mathbf{A} \mathbf{r}^{(k)}}$ | $\alpha^{(k)} = \dfrac{\mathbf{r}^{(k)\top} \mathbf{d}^{(k)}}{\mathbf{d}^{(k)\top} \mathbf{A} \mathbf{d}^{(k)}}$ |
| Orthogonality | $\langle \mathbf{r}^{(k-1)}, \mathbf{r}^{(k)} \rangle = 0$ (adjacent only) | $\langle \mathbf{d}^{(i)}, \mathbf{d}^{(j)} \rangle_\mathbf{A} = 0,\; i \ne j$ (global $\mathbf{A}$-conjugacy) |
| Convergence speed | slow, $\sim \left(\frac{\lambda_n-\lambda_1}{\lambda_n+\lambda_1}\right)^k$ | fast; exact solution in $n$ steps in theory, robust for both well- and ill-conditioned matrices |
| Space complexity | $O(n)$ (store $\mathbf{x}, \mathbf{r}$) | $O(n)$ (store $\mathbf{x}, \mathbf{r}, \mathbf{d}$) |
| Time per step | 1 matvec ($\mathbf{A} \mathbf{r}$) | 1 matvec ($\mathbf{A} \mathbf{d}$) |
Weakness of steepest descent: in a narrow elongated ellipsoid, the negative-gradient direction oscillates back and forth between principal axes; each correction removes error along only one direction, which is inefficient.
Advantage of CG: $\mathbf{A}$-conjugacy guarantees that each search direction independently corrects the error along one principal axis, effectively performing an orthogonalized search in the $\mathbf{A}$-inner product, which greatly improves efficiency.
Preconditioned Conjugate Gradient
Preconditioned conjugate gradient (PCG): in practice, to accelerate CG convergence, one introduces a symmetric positive definite preconditioner $M \approx A$ and solves the equivalent system
$$M^{-1} A x = M^{-1} b$$
or, in the more stable symmetric form,
$$M^{-1/2} A M^{-1/2} y = M^{-1/2} b, \quad x = M^{-1/2} y$$
CG then runs in the $M$-inner product, and the convergence rate depends on $\kappa(M^{-1}A)$. Common preconditioners:
Diagonal (Jacobi) preconditioning: $M = \text{diag}(A)$
Incomplete Cholesky factorization (IC)
Algebraic multigrid (AMG)
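A minimal PCG sketch with the Jacobi preconditioner is given below. It uses the standard preconditioned recursion (apply $M^{-1}$ to the residual each step); the function name, interface, and test matrix are our own assumptions, not from the text.

```python
import numpy as np

def pcg(A, b, tol=1e-10, maxiter=None):
    """Preconditioned CG with the Jacobi preconditioner M = diag(A)."""
    n = len(b)
    maxiter = maxiter or n
    M_inv = 1.0 / np.diag(A)       # Jacobi: M = diag(A), applied elementwise
    x = np.zeros(n)
    r = b - A @ x                  # residual
    z = M_inv * r                  # preconditioned residual z = M^{-1} r
    d = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        t = A @ d
        alpha = rz / (d @ t)
        x += alpha * d
        r -= alpha * t
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        d = z + (rz_new / rz) * d  # new search direction
        rz = rz_new
    return x

# small SPD test system (illustrative assumption)
A = np.diag([1.0, 4.0, 9.0, 16.0]) + 0.1 * np.ones((4, 4))
b = np.ones(4)
x = pcg(A, b)
```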