Suppose we expect a response variable to be determined by a linear
combination of a subset of potential covariates. The LARS (least angle
regression) algorithm then provides a means of estimating which variables to
include, as well as their coefficients.
Instead of giving a single vector result, the LARS solution consists of a curve giving the coefficient vector for each value of the L1 norm of the parameter vector. The algorithm is similar to forward stepwise regression,
but instead of adding each variable in full at each step, the estimated
parameters are increased in a direction equiangular to each active variable's
correlation with the residual.
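To make the equiangular step concrete, here is a minimal sketch (not part of the post's code below) of the direction LARS moves along, following the formulas in Efron et al. (2004); the function name and variable names are illustrative.

import numpy as np

def equiangular_direction(x_active, signs):
    # Unit vector making equal angles with every sign-adjusted active predictor.
    xs = x_active * signs                       # flip columns to match correlation signs
    g = xs.T @ xs                               # Gram matrix of the active set
    ginv_ones = np.linalg.solve(g, np.ones(xs.shape[1]))
    a = 1.0 / np.sqrt(ginv_ones.sum())          # normalizing constant
    return xs @ (a * ginv_ones)                 # direction u, with unit norm

# Quick check on random standardized data: every active variable has the same
# absolute inner product with the direction, and the direction has unit length.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 3))
x -= np.mean(x, axis=0)
x /= np.sqrt(np.sum(x**2, axis=0))
u = equiangular_direction(x, np.array([1.0, -1.0, 1.0]))
print(np.round(x.T @ u, 6), np.round(u @ u, 6))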
Advantages:
1. It is computationally just as fast as forward selection.
2. It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model.
3. If two variables are almost equally correlated with the response,
then their coefficients should increase at approximately the same rate.
The algorithm thus behaves as intuition would expect, and is also more
stable.
4. It is easily modified to produce solutions for other estimators, like the LASSO (see the sketch after this list).
5. It is effective in contexts where p >> n (i.e., when the number of covariates is significantly greater than the number of observations).
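As a sketch of advantages 2 and 4, scikit-learn's lars_path can be used in place of the mlpy code shown below (the import and dataset loader here are scikit-learn's, not the post's): it returns the full piecewise-linear coefficient path in one call, and the same routine produces the LASSO path via its method argument.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

x, y = load_diabetes(return_X_y=True)
alphas, active, coefs = lars_path(x, y, method='lar')    # plain LARS path
_, _, coefs_lasso = lars_path(x, y, method='lasso')      # LASSO modification of LARS
print(coefs.shape)  # (n_features, n_breakpoints): one coefficient vector per step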
Disadvantages:
1. With any amount of noise in the dependent variable, and with
high-dimensional, multicollinear independent variables, there is no reason to
believe that the selected variables will have a high probability of
being the actual underlying causal variables. This problem is not unique
to LARS, as it is a general problem with variable selection approaches
that seek to find underlying deterministic components. Yet, because LARS
is based upon an iterative refitting of the residuals, it would appear
to be especially sensitive to the effects of noise.
2. Since almost all high-dimensional real-world data will, just by chance,
exhibit a fair degree of collinearity across at least some variables, the
problem that LARS has with correlated variables may limit its application to
high-dimensional data.
Python code:
import numpy as np
import mlpy
import matplotlib.pyplot as plt # required for plotting
diabetes = np.loadtxt("diabetes.data", skiprows=1) # http://www.stanford.edu/~hastie/Papers/LARS/diabetes.data
x = diabetes[:, :-1]
y = diabetes[:, -1]
x -= np.mean(x, axis=0) # center x
x /= np.sqrt(np.sum(x**2, axis=0)) # scale each column of x to unit norm
y -= np.mean(y) # center y
lars = mlpy.LARS()
lars.learn(x, y)
lars.steps() # number of steps performed
lars.beta() # estimated regression coefficients
lars.beta0() # estimated intercept
est = lars.est() # returns all LARS estimates
beta_sum = np.sum(np.abs(est), axis=1) # L1 norm of the coefficients at each step
fig = plt.figure(1)
plot1 = plt.plot(beta_sum, est)
xl = plt.xlabel(r'$\sum{|\beta_j|}$')
yl = plt.ylabel(r'$\beta_j$')
plt.show()