Math Problem Statement

Let $A \in \mathbb{R}^{n \times n}$ be a constant matrix and $b \in \mathbb{R}^n$ be a constant vector.

Let $z \in \mathbb{R}^n$. Consider the function $g(z)$ defined as

$$g(z) = \frac{1}{2} \|Az - b\|_2^2 + \frac{1}{2} \|z\|_2^2$$

Consider the problem of finding the minimum of $g(z)$ using gradient descent. Derive the optimal step size $\alpha_k$ to be used at iteration number $k$ of the gradient descent scheme, assuming the estimate $z_{k-1}$ from the previous iteration.

Solution

To derive the optimal step size $\alpha_k$ for minimizing the function $g(z)$ using gradient descent, we first compute the gradient of $g(z)$ and then find the step size that minimizes $g(z)$ along the direction of the negative gradient at each iteration.

Step 1: Define the function and compute its gradient

The function $g(z)$ is given by:

$$g(z) = \frac{1}{2} \|Az - b\|_2^2 + \frac{1}{2} \|z\|_2^2$$

This can be expanded as:

$$g(z) = \frac{1}{2} (Az - b)^T (Az - b) + \frac{1}{2} z^T z$$

Expanding the first term:

$$(Az - b)^T (Az - b) = z^T A^T A z - 2 b^T A z + b^T b$$

Thus, $g(z)$ becomes:

$$g(z) = \frac{1}{2} \left( z^T A^T A z - 2 b^T A z + b^T b \right) + \frac{1}{2} z^T z$$
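Collecting terms (and using $b^T A z = (A^T b)^T z$), this is a standard quadratic form, from which the gradient can be read off directly:

$$g(z) = \frac{1}{2} z^T (A^T A + I) z - (A^T b)^T z + \frac{1}{2} b^T b$$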

The gradient of $g(z)$ with respect to $z$ is:

$$\nabla g(z) = \frac{\partial g(z)}{\partial z} = A^T A z - A^T b + z$$

So,

$$\nabla g(z) = (A^T A + I) z - A^T b$$
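Incidentally, since $A^T A + I$ is symmetric positive definite (its eigenvalues are at least 1), setting $\nabla g(z) = 0$ yields the unique minimizer in closed form, which is useful later as a reference for checking a gradient descent implementation:

$$z^\star = (A^T A + I)^{-1} A^T b$$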

Step 2: Gradient descent update rule

In gradient descent, the update rule is given by:

$$z_k = z_{k-1} - \alpha_k \nabla g(z_{k-1})$$

where $\alpha_k$ is the step size at iteration $k$.

Step 3: Derive the optimal step size $\alpha_k$

The optimal step size $\alpha_k$ is determined by minimizing $g(z)$ along the direction of the negative gradient. This requires minimizing the one-dimensional function:

$$\phi(\alpha) = g\left(z_{k-1} - \alpha \nabla g(z_{k-1})\right)$$

Substituting $z_k = z_{k-1} - \alpha \nabla g(z_{k-1})$ into $g(z)$:

$$\phi(\alpha) = \frac{1}{2} \|A(z_{k-1} - \alpha \nabla g(z_{k-1})) - b\|_2^2 + \frac{1}{2} \|z_{k-1} - \alpha \nabla g(z_{k-1})\|_2^2$$

This expands to:

$$\phi(\alpha) = \frac{1}{2} \left\{ \|Az_{k-1} - \alpha A \nabla g(z_{k-1}) - b\|_2^2 + \|z_{k-1} - \alpha \nabla g(z_{k-1})\|_2^2 \right\}$$

To find the optimal $\alpha_k$, differentiate $\phi(\alpha)$ with respect to $\alpha$ and set the derivative to zero:

$$\frac{d\phi(\alpha)}{d\alpha} = -\nabla g(z_{k-1})^T \left( A^T(Az_{k-1} - b) + z_{k-1} \right) + \alpha\, \nabla g(z_{k-1})^T (A^T A + I) \nabla g(z_{k-1})$$

Since $A^T(Az_{k-1} - b) + z_{k-1} = (A^T A + I) z_{k-1} - A^T b = \nabla g(z_{k-1})$, setting the derivative equal to zero gives:

$$0 = -\nabla g(z_{k-1})^T \nabla g(z_{k-1}) + \alpha\, \nabla g(z_{k-1})^T (A^T A + I) \nabla g(z_{k-1})$$

Solving for $\alpha$:

$$\alpha_k = \frac{\nabla g(z_{k-1})^T \nabla g(z_{k-1})}{\nabla g(z_{k-1})^T (A^T A + I) \nabla g(z_{k-1})}$$
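As a quick numerical sanity check (a sketch only, with $A$, $b$, and the current iterate drawn at random for illustration), the closed-form step size can be compared against a direct one-dimensional minimization of $\phi(\alpha)$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))   # illustrative data, not from the problem
b = rng.standard_normal(n)
z = rng.standard_normal(n)        # stands in for the current iterate z_{k-1}

H = A.T @ A + np.eye(n)           # Hessian of g
grad = H @ z - A.T @ b            # gradient of g at z_{k-1}

# Closed-form optimal step size derived above
alpha_formula = (grad @ grad) / (grad @ H @ grad)

# Direct numerical minimization of phi(alpha) = g(z - alpha * grad)
g = lambda v: 0.5 * np.sum((A @ v - b) ** 2) + 0.5 * np.sum(v ** 2)
alpha_numeric = minimize_scalar(lambda a: g(z - a * grad)).x

print(alpha_formula, alpha_numeric)  # the two should agree to high precision
```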

Conclusion

The optimal step size $\alpha_k$ for the gradient descent at iteration $k$ is given by:

$$\alpha_k = \frac{\|\nabla g(z_{k-1})\|_2^2}{\nabla g(z_{k-1})^T (A^T A + I) \nabla g(z_{k-1})}$$

This is the exact line-search (steepest-descent) step: at each iteration it achieves the largest possible decrease of $g$ along the negative-gradient direction. Note that the denominator is always positive, since $A^T A + I$ is positive definite, so $\alpha_k$ is well defined whenever the gradient is nonzero.
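For completeness, here is a minimal Python/NumPy sketch of the full scheme (the data $A$ and $b$, the starting point, and the stopping tolerance below are illustrative assumptions, not part of the problem statement):

```python
import numpy as np

def gradient_descent_exact(A, b, z0, tol=1e-10, max_iter=1000):
    """Minimize g(z) = 0.5*||Az - b||^2 + 0.5*||z||^2 by steepest descent
    with the exact line-search step size derived above."""
    H = A.T @ A + np.eye(A.shape[1])   # Hessian of g (constant for a quadratic)
    c = A.T @ b
    z = z0.astype(float).copy()
    for _ in range(max_iter):
        grad = H @ z - c                           # gradient at z_{k-1}
        if np.linalg.norm(grad) < tol:             # converged
            break
        alpha = (grad @ grad) / (grad @ H @ grad)  # optimal step size alpha_k
        z = z - alpha * grad                       # z_k = z_{k-1} - alpha_k * grad
    return z

# Illustrative usage with random data
rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
z_hat = gradient_descent_exact(A, b, np.zeros(n))

# Check against the closed-form minimizer (A^T A + I)^{-1} A^T b noted earlier
print(np.allclose(z_hat, np.linalg.solve(A.T @ A + np.eye(n), A.T @ b)))
```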


Would you like further details on any part of this derivation or have additional questions? Here are some related questions for further exploration:

  1. How does the optimal step size $\alpha_k$ affect the convergence rate of gradient descent?
  2. What happens if we use a fixed step size instead of the optimal $\alpha_k$?
  3. How can the convergence be analyzed using the eigenvalues of $A^T A + I$?
  4. What would the gradient descent update look like if we introduced an additional regularization term in $g(z)$?
  5. Can you implement this gradient descent method in a programming language like Python or MATLAB?

Tip: When optimizing functions using gradient descent, the choice of step size is crucial for convergence and efficiency, and can be problem-specific.


Math Problem Analysis

Mathematical Concepts

Matrix Algebra
Gradient Descent

Formulas

Gradient of a Quadratic Function


Suitable Grade Level

Advanced Undergraduate