
Why multiple regression?

Before we start, you may download the data again and store it in a new folder, e.g. named “BAQM_MultipleRegression”. The best way to do this is to open an R project.

With simple regression, we used our data set housingrents and explained the rent variable with area - repeat this step below; the file is stored as the object data.

head(data)
mymodel1 <- lm(rent~area, data=data)
summary(mymodel1)

Motivation of the case

In the data, there is a dummy variable called nre, indicating whether an apartment is rented out by the nre real estate company. The company nre is confronted with the accusation of discriminating against its customers by charging too high rents.

We now calculate descriptive statistics for area, econage (measuring the economic age of the apartments), and rent, split by the factor variable nre.

desc <- data %>% group_by(nre) %>% 
  summarise(mean.rent=mean(rent,na.rm=T),
            mean.age=mean(econage,na.rm=T),
            mean.area=mean(area,na.rm=T))

desc

First, we test whether rents in the data set are significantly higher for nre apartments or not.

According to the strategy above, we test the rent variable for significant differences across nre.

by(data$rent,data$nre,shapiro.test)

Since both \(p\)-values fall below \(\alpha = 5\%\), we conclude that at least one sub-sample is not normally distributed and thus we apply a non-parametric test.

wilcox.test(data$rent~data$nre)

Since the \(p\)-value falls below \(\alpha=5\%\), we reject the null-hypothesis of equal means (actually medians) - rents for nre-apartments are significantly higher.

Remark - repeat the above procedure by testing the logarithm of data$rent.

by(log(data$rent),data$nre,shapiro.test)

Now, both \(p\)-values indicate a normal distribution of the log-transformed rent variable - if the log of a variable is normally distributed, the original variable follows a log-normal distribution. We could now also test for significant differences using a 2-sample t-test. Before doing so, we need to test for equality of variances:

var.test(log(data$rent)~data$nre)

The \(p\)-value is above \(\alpha=5\%\), so we use a 2-sample-t-test with equal variances:

t.test(log(data$rent)~data$nre, var.equal=TRUE)

Unsurprisingly, we obtain the same result and conclude using this strategy that there are significant differences in rents across nre and non-nre-apartments!

Of course, we need to take into account whether there are other systematic differences between nre and non-nre apartments - this task can be accomplished using multiple regression.

For this reason, test the following:

  1. Is the difference in size of nre and non-nre apartments significant?
  2. Is the difference in economic age of the apartments significant?
  3. What about the proportion of balconies of nre and non-nre apartments?

area:

by(data$area,data$nre,shapiro.test)
wilcox.test(data$area~data$nre)

econage:

by(data$econage,data$nre,shapiro.test)
var.test(data$econage~data$nre)
t.test(data$econage~data$nre,var.equal=T)

For both variables area and econage, we find a significant difference between nre and non-nre apartments. Also, we have already shown that area has a positive impact on rents; thus, a certain part of the rent difference can be explained by the fact that the nre company rents out larger apartments.

balcony: Since both variables are factor variables (categorical variables), we apply a \(\chi^2\)-test. The null hypothesis states that the relative frequency of flats with a balcony does not differ between nre and non-nre apartments.

t <- table(data$balcony,data$nre)
prop.table(t,2)
chisq.test(t, correct=F)

We observe a higher proportion of balconies for nre apartments, but the difference is not statistically significant.

Now, you may download the code below to your BAQM_MultipleRegression folder.

Applying multiple regression

Now we would first like to extend our model such that we explain rent not only with area but also with econage.

We first explore the three variables visually by applying the code below:

library(ggplot2)   # for ggplot()
library(GGally)    # for ggpairs()

my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) + 
    geom_point() + 
    geom_smooth(method=lm, fill="blue", color="blue", ...) 
    # geom_smooth(method=loess, fill="blue", color="blue", ...)
  p
}

mygraph <- ggpairs(data, columns = c(4,3,6), lower = list(continuous = my_fn))
mygraph

We learn that…

  • area is positively correlated with rent and both variables show a quite heavy right-skew (we already know…).

  • econage is negatively correlated with rent - older flats therefore seem to be less expensive (on average) - thus, we might expect the variable to be important for our model.

  • econage and area are negatively correlated; this means that older flats tend to be smaller, or smaller flats tend to be older.

Whether we include new variables in the model depends (at least) on the following:

  • is it plausible that the new variable affects the dependent variable in some economic sense?

  • does a goodness-of-fit measure significantly improve? We consider \(R_{adj}^2\), the Akaike Information Criterion \(AIC\) and the Bayesian Information Criterion \(BIC\). BIC is defined as \(BIC = n\cdot \ln(\frac{SSR}{n})+k\cdot \ln(n)\), with \(n\) being the number of observations, \(k\) the number of regressors and \(SSR\) the sum of squared residuals, \(\sum e_i^2\). Obviously, BIC also penalizes a model for taking on too many (unnecessary) variables. The lower BIC or AIC, the better (see the small check below).
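
As a small check of this formula, the textbook BIC can be computed by hand and compared with R's built-in BIC() (a sketch using mymodel1 from above; R's BIC() is based on the log-likelihood and therefore differs by an additive constant that depends only on \(n\), so model comparisons are unaffected):

n   <- nobs(mymodel1)
k   <- length(coef(mymodel1))      # number of estimated coefficients
SSR <- sum(residuals(mymodel1)^2)  # sum of squared residuals
n * log(SSR / n) + k * log(n)      # textbook BIC from the formula above
BIC(mymodel1)                      # R's built-in (log-likelihood based) BIC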

Below, calculate the BIC of our first model using the function BIC(mymodel1).

head(data)
mymodel1 <- lm(rent~area, data=data)
BIC(mymodel1)

Now, if we add a new variable, we want BIC to drop.

Now, we add the new variable to a model called mymodel2 and compare the two outputs. What can you say about coefficients, what about change in \(R_{adj}^2\) or change in \(BIC\)?

mymodel1 <- lm(rent~area, data=data)
BIC(mymodel1)
summary(mymodel1)

mymodel2 <- lm(rent~area+econage, data=data)
BIC(mymodel2)
summary(mymodel2)

What can we conclude?

In terms of changes in the goodness-of-fit measures, we see the following:

Both changes appear to be significant (according to the previous table). If we instead compare two univariate regressions - one of rent on area and one of rent on econage - we see that the regression with area performs better, but not significantly.

Beware that the above table is only a rule of thumb and that we sometimes do not know at which threshold to judge a change in a goodness-of-fit measure as significant.

Monte-Carlo Simulation (very briefly explained)

reps<-2^14
deltabic<-rep(0,reps)
deltaradj2<-rep(0,reps)

for(i in 1:reps){
  data$x <- rnorm(nrow(data),0,1)
  mcmodel <- lm(rent~area+x, data=data)   # use a new name so that mymodel2 is not overwritten
  deltabic[i] <- BIC(mymodel1) - BIC(mcmodel)
  deltaradj2[i] <- summary(mcmodel)$adj.r.squared - summary(mymodel1)$adj.r.squared
}

quantile(deltabic,c(0.95,0.99,0.999))
quantile(deltaradj2,c(0.95,0.99,0.999))
  • In row 1, we define how many times we want to simulate a change in \(BIC\) or \(R_{adj}^2\)
  • In rows 2-3, we create empty vectors of length reps, filled with zeros.
  • In row 5, we start a loop.
  • In row 6, we generate a random variable \(x\). This variable has nothing to do with our data (e.g. is not correlated to any variable) and therefore, adding this variable to our model should not have any impact.
  • In row 7, we calculate the new model and then, in rows 8-9, we calculate the change in \(BIC\) and the change in \(R_{adj}^2\) and store the results in the prepared vectors.
  • In rows 12-13, we calculate the upper quantiles and refer to them as the thresholds for no, weak and strong significance.

In case the server is too busy with the Monte-Carlo simulation, please use the code below for local simulation:

This supports the table which we have seen previously on significant changes in \(\Delta BIC\).

## 
## Call:
## lm(formula = rent ~ area + econage, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -976.62 -308.18  -44.52  207.44 2218.28 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)    
## (Intercept) 1400.187    232.462   6.023 0.0000000128 ***
## area           6.992      1.298   5.388 0.0000002719 ***
## econage      -26.844      5.156  -5.206 0.0000006299 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 508.1 on 149 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.443 
## F-statistic: 61.05 on 2 and 149 DF,  p-value: < 0.00000000000000022

We now want to interpret the output:

  • The coefficient of area amounts to 6.99 \(\frac{CHF}{m^2}\). It is significantly different from zero.
  • The coefficient of econage amounts to -26.84 \(\frac{CHF}{year}\).
  • Since area and econage are negatively correlated and since we initially neglected econage in our first model, we overestimated the coefficient of area substantially - why? Because larger flats tend to be newer (which also increases the rent), but this effect was attributed to the coefficient of area (see the quick check after this list).
  • From the change in \(R^2\) and \(BIC\), we can clearly see that this new model is superior to the previous one.
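
The size of this overestimation can be checked directly by comparing the fitted coefficients of the two models (a quick check, assuming mymodel1 and mymodel2 as defined above):

coef(mymodel1)["area"]   # area coefficient without controlling for econage
coef(mymodel2)["area"]   # area coefficient after controlling for econage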

Adding factor variables

We now want to extend the model with factor variables. For instance, we might consider using the rooms variable either as a numeric or as a categorical variable.

Below, generate a mymodel3 object that uses rooms (as numeric) and econage variables and test whether the model is superior compared to mymodel2.

mymodel3 <- lm(rent~rooms+econage, data=data)
summary(mymodel3)
BIC(mymodel3)
BIC(mymodel2)

What do we note? mymodel2 with the area variable as size indicator has a higher adjusted \(R^2\) and a lower BIC; therefore, we should not use the rooms variable as size indicator.

Why should we not use both size-indicators?

  • The two variables practically measure the same effect, the size of the apartments
  • The variables are highly correlated (\(r=0.869\); see the quick check below)
  • Using both variables leads to multi-collinearity
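
A quick check of this correlation (using the variables as stored in data):

cor(data$rooms, data$area, use="complete.obs")   # should be close to 0.869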

Below, generate a mymodelmc model with rooms, area and econage. Report the variance inflation factors using the vif() function from the car package. In general, VIF values above 4 hint at multi-collinearity; values above 10 are highly critical.

library(car)   # provides vif()

mymodelmc <- lm(rent~rooms+area+econage, data=data)
summary(mymodelmc)
vif(mymodelmc)

Note: In economic applications of regression, you often have variables that are strongly correlated and you should carefully check for multi-collinearity. In the example above, we would remove rooms from our model.

Using factor variables

Now, generate a mymodel4 object with the rooms variable as a factor variable. We can directly declare it as a factor when defining the model using factor(rooms).

mymodel4 <- lm(rent~factor(rooms)+econage, data=data)
summary(mymodel4)
BIC(mymodel4)
BIC(mymodel2)

What differences do we see compared to the previous case? Previously, we used rooms as a numeric variable. Its coefficient was 172.8, meaning that with each additional room an apartment's rent increases by this amount.

Now, we use rooms as a factor. This implies that R will generate binary variables for each factor level except one - as you saw in the result, the effect of 1-room apartments was not reported. The reason is that including all levels would create perfect multi-collinearity; in other words, the base level (1-room apartments) is absorbed by the intercept.
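
To see the binary variables R generates internally, we can inspect the first rows of the design matrix (a quick illustration; nothing new is estimated here):

# one column per room category except the base level (1 room), plus econage
head(model.matrix(~ factor(rooms) + econage, data = data))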

## 
## Call:
## lm(formula = rent ~ factor(rooms) + econage, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -923.30 -287.58  -46.32  198.07 2098.63 
## 
## Coefficients:
##                Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)    1671.275    227.006   7.362 0.0000000000125 ***
## factor(rooms)2  287.450    145.612   1.974        0.050273 .  
## factor(rooms)3  242.963    144.847   1.677        0.095622 .  
## factor(rooms)4  577.999    150.791   3.833        0.000188 ***
## factor(rooms)5  656.300    177.488   3.698        0.000308 ***
## factor(rooms)6  955.100    239.366   3.990        0.000104 ***
## econage         -28.497      5.425  -5.253 0.0000005239429 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 516.9 on 145 degrees of freedom
## Multiple R-squared:  0.4464, Adjusted R-squared:  0.4235 
## F-statistic: 19.49 on 6 and 145 DF,  p-value: < 0.00000000000000022
  • What we note again is that the BIC of this model is higher and the adjusted \(R^2\) is lower than when simply using area as the size indicator.
  • According to this model, 1-room apartments (with zero economic age) have an average rent of 1671.3 CHF per month. 2-room apartments, on average, cost 1671.3 + 287.4 CHF per month, 4-room apartments 1671.3 + 578 CHF per month, and so on.
  • We see that the model does not price an additional room equally across all apartments. Using the function linearHypothesis(), we can test whether this is the case or not, e.g. whether each additional room is priced by adding \(173\) CHF to the rent.
linearHypothesis(mymodel4,c("factor(rooms)2=173",
                            "factor(rooms)3=346",
                            "factor(rooms)4=519",
                            "factor(rooms)5=692",
                            "factor(rooms)6=865"))

From the above, we see that each additional room can be assumed to be equally priced and thus there is no need to add a factor variable! We can now save our code locally.

Housingrents by nre

As we know, the initial goal was to test whether apartments rented out by the nre company are more expensive. We have already seen that rents, area and econage differ significantly between nre and non-nre apartments, while the proportion of balconies does not differ significantly.

Calculate, by nre, the average rent, size and age of the apartments:

bynre <- data %>% group_by(nre) %>% summarise(rent=mean(rent,na.rm=TRUE),
                                              area=mean(area,na.rm=TRUE),
                                              econage=mean(econage,na.rm=TRUE))
bynre

Now, we declare nre as a factor variable (we could also just use factor(nre) directly in the model formula), but declaring factors explicitly is important:

data$nre <- factor(data$nre,levels=c(0,1),labels=c("non-nre","nre"))
table(data$nre)

Now, we extend our previous mymodel2 (with area and econage) by adding nre, and store the result in mymodel5. Can we still say that nre discriminates against its customers by charging too high rents?

# nre has already been declared a factor above, so we can use it directly
mymodel5 <- lm(rent~area+econage+nre,data=data)
summary(mymodel5)
summary(mymodel2)
BIC(mymodel5); BIC(mymodel2)

Interestingly, this model is also not superior to mymodel2.

  • After controlling for area and econage, nre apartments are on average only about 80 CHF per month more expensive, and this difference is not significant.

  • The coefficient can be explained as follows. We have \(\Delta area = 16.06\) and \(\Delta age = -4.52\). Thus, the rent difference attributable to these differences is \(6.88 \cdot 16.06 + (-26.19) \cdot (-4.52) = 228.9\) CHF. Since the total difference in average rents is \(309.16\) CHF, the remaining roughly \(80\) CHF correspond to the (insignificant) nre coefficient (see the quick check below).
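
This decomposition can be verified directly in R (a sketch, assuming the bynre table and mymodel5 defined above; diff() returns the nre minus non-nre difference):

d.area <- diff(bynre$area)       # difference in mean area
d.age  <- diff(bynre$econage)    # difference in mean economic age
b      <- coef(mymodel5)
explained <- b["area"] * d.area + b["econage"] * d.age
explained                        # roughly 229 CHF explained by area and econage
diff(bynre$rent) - explained     # roughly 80 CHF, matching the nre coefficient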

Adding the balcony variable has no effect either; thus, we have arrived at our model: lm(rent~area+econage).

Finished? Not quite…

Interaction effects and margins

So far, our favourite model is mymodel2. However, let's produce some marginal plots using ggpredict() from the ggeffects package:

# play around with shifting the variables positions
myps <- ggpredict(mymodel5, terms=c("area","nre","econage"))
myps
plot(myps)

Obviously, the regression lines are always parallel. In some situations, we might change this by adding interaction terms to the model.

Calculate and interpret the following model: mymodel6 <- lm(rent~I(area)*I(econage),data=data):

mymodel6 <- lm(rent~I(area)*I(econage),data=data)
summary(mymodel6)
BIC(mymodel6)
BIC(mymodel2)

Our model now is, obviously:

\(rent = 783.2+13.4\cdot area - 5.5\cdot age - 0.24\cdot area \cdot age\)

Thus, the marginal effects of \(area\) or \(age\) now are:

\(\frac{\partial rent}{\partial area} = 13.4 - 0.24\cdot age\)

and

\(\frac{\partial rent}{\partial age} = -5.5 - 0.24\cdot area\)
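
These marginal effects can be evaluated directly from the fitted coefficients, for instance at the sample means (a quick sketch; the coefficient names follow from the formula rent~I(area)*I(econage) used above):

b <- coef(mymodel6)
b["I(area)"]    + b["I(area):I(econage)"] * mean(data$econage)   # marginal effect of area at the mean economic age
b["I(econage)"] + b["I(area):I(econage)"] * mean(data$area)      # marginal effect of econage at the mean area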

So we can see that the positive effect of \(area\) on \(rent\) decreases with increasing economic age of the apartment - whether this makes sense or not has to be judged by a real estate expert (probably not). Moreover, such a complicated model should be avoided when the gain in adjusted \(R^2\) or the drop in \(BIC\) is that small!

The negative effect of \(age\) on \(rent\) gets stronger with increasing area of the apartment. Here too, of course, we can plot the marginal effects:

# play around with shifting the variables positions
myps <- ggpredict(mymodel6, terms=c("area","econage"))
myps
plot(myps)

In short, only use interaction effects when you can justify their existence from a domain-knowledge perspective with economic reasoning!

For instance, we could also test whether econage and area have a different impact on rents for nre apartments.

mymodel7 <- lm(rent~I(area)*I(nre)+I(econage)*I(nre), data=data)
summary(mymodel7)
BIC(mymodel7)

myps <- ggpredict(mymodel7, terms=c("area","econage","nre"))
myps
plot(myps)

As we see quite fast, the interaction effects are all insignificant.

Ridge/Lasso - Regression

We have now learnt that multiple regression is a technique that minimizes the following expression:

\(SSR = \sum_i(y_i-b_0-\sum_j b_j \cdot x_{j,i})^2\)

A frequently encountered problem, however, is that we sometimes have many variables but only few observations. In this case, the least-squares method underlying multiple regression may fail: the estimated parameters have a huge variance and the model overfits.

In other words, we are estimating too many parameters.

Thus, we are now minimizing a new function (Lasso-Regression):

\(SSR_L = SSR + \lambda \cdot \sum_{j=1}^{k}|b_j|\)

This means that every coefficient \(b_j\) that is estimated to be different from zero 'penalizes' the model, in proportion to its absolute size. This also means, in turn, that unimportant variables are set exactly to zero, which is done automatically by the algorithm.

Careful: Although this seems to be a very powerful technique, economic reasoning sometimes prescribes including certain variables even though they are not 'important'!

Note:

  • If \(\lambda=0\), we fall back to multiple regression.

  • If \(\lambda=\infty\), each coefficient will be set to zero (see the small illustration below).
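
A minimal illustration of these two extreme cases, using the glmnet package introduced just below (a sketch on the two quantitative regressors only; with \(\lambda\) close to zero the fit is essentially OLS, with a huge \(\lambda\) all slopes are shrunk to zero):

library(glmnet)
X   <- as.matrix(data[, c("area", "econage")])
fit <- glmnet(X, data$rent, alpha = 1, lambda = c(1e10, 0))
coef(fit, s = 0)       # essentially the OLS coefficients
coef(fit, s = 1e10)    # both slopes shrunk to zero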

We need to use the R packages glmnet and caret, the R machine-learning library.

For this reason, we need to transform our data into a data matrix in which the factor variables are one-hot encoded (i.e. each level is represented by a binary variable).

Below, try the following:

  1. Add each factor-variable (but not id) to a new dataframe mydata.
  2. Make sure that the variables are recognized by R as factor-variables.
  3. Use the function dummyVars() to perform onehot-encoding - read the help-file if necessary.
  4. Apply the onehot-encoding by using the object from step 3 and the function predict()
library(caret)   # provides dummyVars()

mydata <- data[,c("rooms","nre","balcony")]
is.factor(mydata$rooms); is.factor(mydata$nre); is.factor(mydata$balcony)

mydata$rooms <- factor(mydata$rooms, levels=1:6, labels=as.character(1:6))
if (!is.factor(mydata$nre)) {   # nre may already have been declared a factor above
  mydata$nre <- factor(mydata$nre, levels=0:1, labels=c("non-nre","nre"))
}
mydata$balcony <- factor(mydata$balcony)   # make sure balcony is a factor as well

onehot <- dummyVars(" ~ .", data = mydata, fullRank = FALSE)
mydata.lasso <- data.frame(predict(onehot, newdata = mydata))
head(mydata.lasso)

Now, we add the quantitative variables to mydata.lasso, but only econage and area, not the rent variable.

mydata.lasso$econage <- data$econage
mydata.lasso$area <- data$area
head(mydata.lasso)

Hyperparameter tuning

Now, we need to find an optimal parameter for \(\lambda\) - this procedure is called hyperparameter tuning. This is achieved by applying K-Fold-Cross-Validation (K-Fold-CV).

  • The data are split into \(K\) groups of equal size. In a first step, the first group is a test-data-set and the model is fit using groups \(2 \dots K\).

  • Then, a goodness of fit measure, e.g. the mean absolute error \(MAE\), is calculated using the test-data-set.

  • The procedure is repeated so that each of the remaining \(K-1\) groups serves once as the test data set.

  • At the end, we calculate the average of the goodness-of-fit measure (a small manual sketch is shown below).
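
To illustrate the principle, K-fold cross-validation can be sketched by hand for mymodel2 (an illustrative sketch only; cv.glmnet() below does this automatically for the Lasso model):

set.seed(1)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(data)))   # assign each observation to one of K folds
fold.mae <- numeric(K)
for (k in 1:K) {
  train <- data[folds != k, ]
  test  <- data[folds == k, ]
  fit   <- lm(rent ~ area + econage, data = train)
  fold.mae[k] <- mean(abs(test$rent - predict(fit, newdata = test)))
}
mean(fold.mae)   # average MAE over the K test folds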

  1. Below, we now define a vector grid that contains possible \(\lambda\)-values with \(0 < \lambda < \infty\).
  2. We calculate the Lasso model for all these \(\lambda\)-values: this is done with the function lasso.mod <- glmnet(x,y,alpha = 1, lambda = grid), where x is our data matrix mydata.lasso and y is our dependent variable.
  3. Using the function cv.out <- cv.glmnet(x,y, alpha = 1,type.measure = 'mae', nfolds = 10), we apply cross-validation for our model.
  4. We save the optimal \(\lambda\) value using cv.out$lambda.min
# Step 1
grid <- 10^seq(10,-2,length=100)

# Step 2
lasso.mod <- glmnet(data.matrix(mydata.lasso), data$rent, alpha = 1, lambda = grid)

# Step 3
set.seed(1)
cv.out <- cv.glmnet(data.matrix(mydata.lasso), data$rent, alpha = 1, 
                    type.measure = "mae", nfolds = 10)

# Step 4
# store the optimal lambda
bestlam <- cv.out$lambda.min

Finally, using \(\lambda = 82.606\), we predict our model and return coefficients:

lasso.pred.rent <- predict(lasso.mod, s = bestlam, newx = data.matrix(mydata.lasso))


coef_lasso <- predict(lasso.mod , s = bestlam, type = "coefficients")

# return coefficients
coefDF <- data.frame(coefNames = rownames(coef_lasso), coef = unname(coef_lasso[,1]))

coefDF %>% filter(round(as.numeric(abs(coef)),2) > 0)

Obviously, the Lasso model selection confirms our prior selection, mymodel2 - the coefficients, however, are not identical, because the function being minimized is different (here with \(\lambda = 82.61\)).

Comparing the two models

In the code below, calculate the predicted values for both models and calculate a measure of fit to contrast the two models against each other.

\(MAE = \frac{1}{n}\cdot \sum (|y_i - \hat{y}_i|)\)

lasso.pred.rent <- predict(lasso.mod, s = bestlam, newx = data.matrix(mydata.lasso))
ols.pred.rent <- predict(mymodel2,newdata=data)
rent <- data$rent

mae <- function(y,yhat){
  mean((abs(y-yhat)))
}

mae(rent,lasso.pred.rent)
mae(rent,ols.pred.rent)

Obviously, the \(MAE\) values do not differ too strongly.
