How Regression Is Applied in Contemporary Computing
Extensions
Numerous extensions of linear regression have been developed, which allow some or all of the assumptions underlying the basic model to be relaxed.
Simple and multiple linear regression
[Figure: Example of simple linear regression, which has one independent variable]
The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression (not to be confused with multivariate linear regression).
Multiple linear regression is a generalization of simple linear regression to the case of more than one independent variable, and a special case of general linear models, restricted to one dependent variable. The basic model for multiple linear regression is

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i \]

for each observation \(i = 1, \ldots, n\). In the formula above we consider \(n\) observations of one dependent variable and \(p\) independent variables. Thus, \(Y_i\) is the \(i\)th observation of the dependent variable and \(X_{ij}\) is the \(i\)th observation of the \(j\)th independent variable, for \(j = 1, 2, \ldots, p\). The values \(\beta_j\) represent parameters to be estimated, and the \(\varepsilon_i\) are independent, identically distributed normal errors.
In the more general multivariate linear regression, there is one equation of the above form for each of m > 1 dependent variables that share the same set of explanatory variables and hence are estimated simultaneously with each other:

\[ Y_{ij} = \beta_{0j} + \beta_{1j} X_{i1} + \beta_{2j} X_{i2} + \cdots + \beta_{pj} X_{ip} + \varepsilon_{ij} \]
for all observations indexed as i = 1, ... , n and for all dependent variables indexed as j = 1, ... , m.
Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression.
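The multiple regression model above can be sketched concretely. The following is a minimal illustration, not part of the original text: it builds a design matrix with an intercept column from synthetic data (the coefficient values are arbitrary assumptions) and fits it by ordinary least squares with numpy.

```python
import numpy as np

# A minimal sketch of the multiple linear regression model
# Y_i = beta_0 + beta_1 X_i1 + ... + beta_p X_ip + eps_i,
# fitted by ordinary least squares on synthetic (illustrative) data.
rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 2))                   # two independent variables
design = np.column_stack([np.ones(n), X])     # column of ones carries beta_0
true_beta = np.array([1.0, 2.0, -0.5])        # assumed, purely illustrative
y = design @ true_beta + 0.1 * rng.normal(size=n)

# Least squares estimate of (beta_0, beta_1, beta_2)
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
```

With low noise, `beta_hat` recovers the assumed coefficients closely; the same pattern extends to any number of predictors by widening the design matrix.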
General linear models
Heteroscedastic models
Generalized linear models
Generalized linear models (GLMs) are a framework for modeling response variables that are bounded or discrete. This is used, for example:
- when modeling positive quantities (e.g. prices or populations) that vary over a large scale and are better described using a skewed distribution such as the log-normal distribution or Poisson distribution (although GLMs are not used for log-normal data; instead the response variable is simply transformed using the logarithm function);
- when modeling categorical data, such as the choice of a given candidate in an election (which is better described using a Bernoulli distribution/binomial distribution for binary choices, or a categorical distribution/multinomial distribution for multi-way choices), where there are a fixed number of choices that cannot be meaningfully ordered;
- when modeling ordinal data, e.g. ratings on a scale from 0 to 5, where the different outcomes can be ordered but where the quantity itself may not have any absolute meaning (e.g. a rating of 4 may not be "twice as good" in any objective sense as a rating of 2, but simply indicates that it is better than 2 or 3 but not as good as 5).
Generalized linear models allow for an arbitrary link function, \(g\), that relates the mean of the response variable(s) to the predictors: \(E(Y) = g^{-1}(XB)\). The link function is often related to the distribution of the response, and in particular it typically has the effect of transforming between the range of the linear predictor and the range of the response variable.
Some common examples of GLMs are:
- Poisson regression for count data.
- Logistic regression and probit regression for binary data.
- Multinomial logistic regression and multinomial probit regression for categorical data.
- Ordered logit and ordered probit regression for ordinal data.
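As an illustration of the link-function idea, here is a small sketch of Poisson regression with a log link, fitted by iteratively reweighted least squares (the standard GLM fitting algorithm). The data and coefficient values are synthetic assumptions made for the example.

```python
import numpy as np

# Sketch of a GLM fit: Poisson regression with log link g(mu) = log(mu),
# so E[y] = g^{-1}(X beta) = exp(X beta). Fitted by iteratively
# reweighted least squares (IRLS). Data are synthetic and illustrative.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 2.0, n)
X = np.column_stack([np.ones(n), x])        # design matrix with intercept
true_beta = np.array([0.5, 1.2])            # assumed, purely illustrative
y = rng.poisson(np.exp(X @ true_beta))      # Poisson counts with log link

beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta                          # linear predictor
    mu = np.exp(eta)                        # inverse link
    z = eta + (y - mu) / mu                 # working response
    W = mu                                  # Poisson weights: Var(y) = mu
    XtW = X.T * W                           # weighted design, shape (2, n)
    beta = np.linalg.solve(XtW @ X, XtW @ z)
```

The loop solves a weighted least squares problem at each step; for the Poisson family with log link, the working weights equal the fitted means. The same skeleton covers logistic regression by swapping in the logistic inverse link and Bernoulli variance.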
Hierarchical linear models
Errors-in-variables
Group effects
In a multiple linear regression model

\[ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon, \]

parameter \(\beta_j\) of predictor variable \(x_j\) represents the individual effect of \(x_j\). It has an interpretation as the expected change in the response variable \(y\) when \(x_j\) increases by one unit with other predictor variables held constant. When \(x_j\) is strongly correlated with other predictor variables, it is improbable that \(x_j\) can increase by one unit with other variables held constant. In this case, the interpretation of \(\beta_j\) becomes problematic, as it is based on an improbable condition, and the effect of \(x_j\) cannot be evaluated in isolation.
For a group of predictor variables, say, \(\{x_1, x_2, \ldots, x_q\}\), a group effect \(\xi(\mathbf{w})\) is defined as a linear combination of their parameters,

\[ \xi(\mathbf{w}) = w_1\beta_1 + w_2\beta_2 + \cdots + w_q\beta_q, \]

where \(\mathbf{w} = (w_1, w_2, \ldots, w_q)^\intercal\) is a weight vector satisfying \(\sum_{j=1}^{q} |w_j| = 1\). Because of the constraint on the \(w_j\), \(\xi(\mathbf{w})\) is also referred to as a normalized group effect. A group effect \(\xi(\mathbf{w})\) has an interpretation as the expected change in \(y\) when the variables in the group \(x_1, x_2, \ldots, x_q\) change by the amounts \(w_1, w_2, \ldots, w_q\), respectively, at the same time with variables not in the group held constant. It generalizes the individual effect of a variable to a group of variables in that (i) if \(q = 1\), then the group effect reduces to an individual effect, and (ii) if \(w_i = 1\) and \(w_j = 0\) for \(j \neq i\), then the group effect also reduces to an individual effect. A group effect \(\xi(\mathbf{w})\) is said to be meaningful if the underlying simultaneous changes of the \(q\) variables by \((w_1, w_2, \ldots, w_q)^\intercal\) are probable.
Group effects provide a means to study the collective impact of strongly correlated predictor variables in linear regression models. Individual effects of such variables are not well-defined as their parameters do not have good interpretations. Furthermore, when the sample size is not large, none of their parameters can be accurately estimated by the least squares regression due to the multicollinearity problem. Nevertheless, there are meaningful group effects that have good interpretations and can be accurately estimated by the least squares regression. A simple way to identify these meaningful group effects is to use an all positive correlations (APC) arrangement of the strongly correlated variables, under which pairwise correlations among these variables are all positive, and to standardize all predictor variables in the model so that they all have mean zero and length one. To illustrate this, suppose that \(\{x_1, x_2, \ldots, x_q\}\) is a group of strongly correlated variables in an APC arrangement and that they are not strongly correlated with predictor variables outside the group. Let \(y'\) be the centred \(y\) and \(x_j'\) be the standardized \(x_j\). Then, the standardized linear regression model is

\[ y' = \beta_1' x_1' + \cdots + \beta_p' x_p' + \varepsilon. \]
Parameters in the original model, including \(\beta_0\), are simple functions of the \(\beta_j'\) in the standardized model. The standardization of variables does not change their correlations, so \(\{x_1', x_2', \ldots, x_q'\}\) is a group of strongly correlated variables in an APC arrangement, and they are not strongly correlated with other predictor variables in the standardized model. A group effect of \(\{x_1', x_2', \ldots, x_q'\}\) is

\[ \xi'(\mathbf{w}) = w_1\beta_1' + w_2\beta_2' + \cdots + w_q\beta_q', \]

and its minimum-variance unbiased linear estimator is

\[ \hat{\xi}'(\mathbf{w}) = w_1\hat{\beta}_1' + w_2\hat{\beta}_2' + \cdots + w_q\hat{\beta}_q', \]

where \(\hat{\beta}_j'\) is the least squares estimator of \(\beta_j'\). In particular, the average group effect of the \(q\) standardized variables is

\[ \xi_A = \frac{1}{q}\left(\beta_1' + \beta_2' + \cdots + \beta_q'\right), \]

which has an interpretation as the expected change in \(y'\) when all \(x_j'\) in the strongly correlated group increase by \((1/q)\)th of a unit at the same time with variables outside the group held constant. With strong positive correlations and in standardized units, variables in the group are approximately equal, so they are likely to increase at the same time and by similar amounts. Thus, the average group effect \(\xi_A\) is a meaningful effect. It can be accurately estimated by its minimum-variance unbiased linear estimator \(\hat{\xi}_A = \frac{1}{q}(\hat{\beta}_1' + \cdots + \hat{\beta}_q')\), even when individually none of the \(\beta_j'\) can be accurately estimated by \(\hat{\beta}_j'\).
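This contrast can be seen in a small simulation. The following sketch (hypothetical data, illustrative only) repeatedly draws two strongly correlated standardized predictors with equal true coefficients and compares the sampling spread of one individual least squares coefficient against that of the average group effect estimator.

```python
import numpy as np

# Hypothetical simulation: two strongly correlated standardized predictors
# with true coefficients beta_1' = beta_2' = 1. Each individual least
# squares coefficient is very noisy, while the average group effect
# estimator (1/2)(b1_hat + b2_hat) is stable across repeated samples.
rng = np.random.default_rng(1)

def standardize(v):
    """Centre to mean zero and scale to length one."""
    v = v - v.mean()
    return v / np.linalg.norm(v)

n, reps = 50, 500
b1_hats, avg_hats = [], []
for _ in range(reps):
    z = rng.normal(size=n)
    x1 = standardize(z + 0.05 * rng.normal(size=n))  # corr(x1, x2) near 1
    x2 = standardize(z + 0.05 * rng.normal(size=n))
    y = x1 + x2 + 0.5 * rng.normal(size=n)
    X = np.column_stack([x1, x2])
    beta = np.linalg.lstsq(X, y - y.mean(), rcond=None)[0]
    b1_hats.append(beta[0])       # individual coefficient estimate
    avg_hats.append(beta.mean())  # average group effect estimate

spread_individual = np.std(b1_hats)
spread_average = np.std(avg_hats)
```

Because the two columns are nearly collinear, `spread_individual` is far larger than `spread_average`: the data pin down the sum of the coefficients much more precisely than either coefficient alone.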
Not all group effects are meaningful or can be accurately estimated. For example, \(\beta_1'\) is a special group effect with weights \(w_1 = 1\) and \(w_j = 0\) for \(j \neq 1\), but it cannot be accurately estimated by \(\hat{\beta}_1'\). It is also not a meaningful effect. In general, for a group of \(q\) strongly correlated predictor variables in an APC arrangement in the standardized model, group effects whose weight vectors \(\mathbf{w}\) are at or near the centre of the simplex \(\sum_{j=1}^{q} w_j = 1\) (\(w_j \geq 0\)) are meaningful and can be accurately estimated by their minimum-variance unbiased linear estimators. Effects with weight vectors far away from the centre are not meaningful, as such weight vectors represent simultaneous changes of the variables that violate the strong positive correlations of the standardized variables in an APC arrangement. As such, they are not probable. These effects also cannot be accurately estimated.
Applications of the group effects include (1) estimation and inference for meaningful group effects on the response variable, (2) testing for "group significance" of the variables via testing \(H_0: \xi_A = 0\) versus \(H_1: \xi_A \neq 0\), and (3) characterizing the region of the predictor variable space over which predictions by the least squares estimated model are accurate.
A group effect of the original variables \(\{x_1, x_2, \ldots, x_q\}\) can be expressed as a constant times a group effect of the standardized variables \(\{x_1', x_2', \ldots, x_q'\}\). The former is meaningful when the latter is. Thus meaningful group effects of the original variables can be found through meaningful group effects of the standardized variables.