## Why multiple regression?

Before we start, you may download the data again and store it in a new folder, e.g. named “BAQM_MultipleRegression”. The best way to do this is to open an R project.

With single regression, we used our data set `housingrents` and explained the `rent` variable with `area` - repeat this step below; the file is stored as object `data`.

```
head(data)                             # inspect the data
mymodel1 <- lm(rent~area, data=data)   # simple regression of rent on area
summary(mymodel1)
```

### Motivation of the case

In the data, there is a dummy variable called `nre`, indicating whether an apartment is rented out by the `nre` real estate company. The company `nre` is confronted with the accusation of discriminating against its customers and charging excessive rents.

We now calculate descriptive statistics for `area`, `econage` (measuring the economic age of the apartments) and `rent`, split by the factor variable `nre`.

```
library(dplyr)   # for the pipe, group_by() and summarise()

desc <- data %>% group_by(nre) %>%
  summarise(mean.rent = mean(rent, na.rm = TRUE),
            mean.age  = mean(econage, na.rm = TRUE),
            mean.area = mean(area, na.rm = TRUE))
desc
```

First, we **test** whether `rents` in the data set are **significantly** higher for the `nre` company. According to the strategy above, we test the `rent` variable for significant differences across `nre`.

`by(data$rent,data$nre,shapiro.test)`

Since both \(p\)-values fall below \(\alpha = 5\%\), we conclude that at least one sub-sample is **not** normally distributed and thus we apply a non-parametric test.

`wilcox.test(data$rent~data$nre)`

Since the \(p\)-value falls below \(\alpha=5\%\), we **reject** the null hypothesis of equal means (actually medians) - rents for `nre`-apartments are **significantly higher**.

**Remark** - repeat the above procedure by testing the logarithm of `data$rent`.

`by(log(data$rent),data$nre,shapiro.test)`

Now, both \(p\)-values indicate a **normal** distribution of the logarithmised `rent` variable - if the `log` of a variable is **normally** distributed, then the original variable follows a **log-normal distribution**. Now we could also test for significant differences using a two-sample t-test. Before doing so, we need to test for equality of variances:

`var.test(log(data$rent)~data$nre)`

The \(p\)-value is above \(\alpha=5\%\), so we use a 2-sample-t-test with equal variances:

`t.test(log(data$rent)~data$nre, var.equal=TRUE)`

Unsurprisingly, we obtain the **same** result and conclude using this strategy that there are **significant differences** in rents across `nre` and `non-nre` apartments!
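
As a compact summary, the whole testing strategy can be wrapped into a small helper (a sketch; the function name `compare_groups` is our own, not from any package):

```
# Shapiro-Wilk per group, then Wilcoxon test or two-sample t-test depending
# on the outcome (the variance test decides the var.equal argument).
compare_groups <- function(y, g, alpha = 0.05){
  p.norm <- tapply(y, g, function(v) shapiro.test(v)$p.value)
  if(any(p.norm < alpha)) return(wilcox.test(y ~ g))  # non-parametric route
  eqvar <- var.test(y ~ g)$p.value > alpha            # equal variances?
  t.test(y ~ g, var.equal = eqvar)
}
compare_groups(data$rent, data$nre)        # Wilcoxon, as above
compare_groups(log(data$rent), data$nre)   # t-test with equal variances
```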

Of course, we need to take into account whether there are **other systematic differences** between `nre` and `non-nre` apartments - this task can be accomplished using the multiple regression technique.

For this reason, test the following:

1. Is the difference in size of `nre` and `non-nre` apartments significant?
2. Is the difference in economic age of the apartments significant?
3. What about the proportion of balconies of `nre` and `non-nre` apartments?

**area**:

```
by(data$area,data$nre,shapiro.test)
wilcox.test(data$area~data$nre)
```

**econage**:

```
by(data$econage,data$nre,shapiro.test)
var.test(data$econage~data$nre)
t.test(data$econage~data$nre,var.equal=T)
```

For **both** variables `area` and `econage`, we find a **significant difference** between `nre` and `non-nre` apartments. Also, we have already shown that `area` has a **positive impact** on rents, and thus a **certain part** of the rent difference of 309.16 CHF (see below) can be explained by the fact that the `nre` company rents out larger apartments.

**balcony**: Since both variables are factor variables (categorical variables), we apply a \(\chi^2\)-test. The null hypothesis states **no difference** between the relative frequencies of flats with balconies across `nre` and `non-nre` apartments.

```
t <- table(data$balcony,data$nre)   # contingency table
prop.table(t,2)                     # proportions within nre groups (columns)
chisq.test(t, correct=FALSE)
```

We observe a higher proportion of balconies for `nre`-apartments, but the difference is statistically **not significant**.

Now, you may download the code below to your BAQM-MultipleRegression folder.

## Applying multiple regression

Now we would first like to **extend** our model such that we do not only explain `rent` with `area` but also with `econage`.

We first explore the three variables visually by applying the code below:

```
library(ggplot2)
library(GGally)   # for ggpairs()

my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method = lm, fill = "blue", color = "blue", ...)
  # geom_smooth(method = loess, fill = "blue", color = "blue", ...)
  p
}
mygraph <- ggpairs(data, columns = c(4,3,6), lower = list(continuous = my_fn))
mygraph
```

We learn that…

- `area` is **positively** correlated with `rent`, and both variables show a quite heavy right-skew (as we already know).
- `econage` is **negatively** correlated with `rent` - older flats therefore seem to be less expensive (on average) - thus, we might expect the variable to be important for our model.
- `econage` and `area` are **negatively** correlated; this means that **older** flats are **smaller**, or **smaller** flats are older.

**Whether** we take new variables into the model depends (at least) on the following:

- Is it plausible that the new variable affects the dependent variable in some economic sense?
- Does a goodness-of-fit measure significantly increase? We talk about \(R_{adj}^2\), the Akaike Information Criterion \(AIC\) and the Bayesian Information Criterion \(BIC\). BIC is defined as \(BIC = n\cdot \ln\left(\frac{SSR}{n}\right)+k\cdot \ln(n)\), with \(n\) being the number of observations, \(k\) the number of regressors and \(SSR\) the sum of squared residuals, \(\sum e_i^2\). Obviously, BIC also penalizes a model for taking on too many (unnecessary) variables. The **lower** BIC or AIC, the better. A by-hand check of the formula is sketched below.
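
As a check of the formula, we can compute BIC by hand (a minimal sketch; `bic_by_hand` is our own helper name). Note that R's built-in `BIC()` is based on the Gaussian log-likelihood and therefore differs from the textbook formula by an additive constant (and by how \(k\) is counted), but differences between models fitted to the same data agree:

```
# Textbook BIC: n*ln(SSR/n) + k*ln(n).
bic_by_hand <- function(model){
  n   <- length(residuals(model))
  k   <- length(coef(model))
  ssr <- sum(residuals(model)^2)
  n * log(ssr / n) + k * log(n)
}
bic_by_hand(mymodel1)   # textbook value
BIC(mymodel1)           # log-likelihood-based value (shifted by a constant)
```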

Below, calculate the BIC of our first model using the function `BIC(mymodel1)`:

```
head(data)
mymodel1 <- lm(rent~area, data=data)
BIC(mymodel1)
```

Now, if we add a **new variable**, we want BIC to **drop**.

| Drop in BIC | Judgement |
|-------------|-----------|
| 0 - 2 | no significant change |
| 2 - 6 | weak to medium significant change |
| 6 - 10 | strongly significant change |

**Now**, we add the new variable to a model called `mymodel2` and compare the two outputs. What can you say about the coefficients, and what about the change in \(R_{adj}^2\) or the change in \(BIC\)?

```
mymodel1 <- lm(rent~area, data=data)
BIC(mymodel1)
summary(mymodel1)
mymodel2 <- lm(rent~area+econage, data=data)
BIC(mymodel2)
summary(mymodel2)
```

What can we **conclude**?

In terms of changes in **goodness-of-fit measures**, we see the following:

| Model | BIC | Adj. \(R^2\) |
|-------|-----|--------------|
| Model 1 | 2,362.9 | 0.346 |
| Model 2 | 2,342.5 | 0.443 |
| Change | -20.4 | 0.097 |

Both changes seem to be significant (according to the previous table). If we compared only the two univariate regressions (one with `rent` on `area` versus one with `rent` on `econage`), we would see that the regression with `area` performs better, but not significantly.

Beware that the above table is **only a rule of thumb** and that sometimes we do not know at which threshold to judge a change in a goodness-of-fit measure as significant.

### Monte-Carlo Simulation (very briefly explained)

```
reps <- 2^14                 # number of simulation runs
deltabic <- rep(0, reps)     # empty result vectors
deltaradj2 <- rep(0, reps)
for(i in 1:reps){
  data$x <- rnorm(nrow(data), 0, 1)   # pure noise variable
  mcmodel <- lm(rent ~ area + x, data = data)
  deltabic[i] <- BIC(mymodel1) - BIC(mcmodel)
  deltaradj2[i] <- summary(mcmodel)$adj.r.squared - summary(mymodel1)$adj.r.squared
}
quantile(deltabic, c(0.95, 0.99, 0.999))
quantile(deltaradj2, c(0.95, 0.99, 0.999))
```

- In line 1, we define how many times we want to simulate a change in \(BIC\) or \(R_{adj}^2\).
- In lines 2-3, we create empty vectors with zero entries of size `reps`.
- In line 4, we start a loop.
- In line 5, we generate a random variable \(x\). This variable has **nothing** to do with our data (i.e. it is not correlated with any variable) and therefore, adding it to our model should **not** have any impact.
- In line 6, we calculate the new model (stored under a new name so that our `mymodel2` from above is not overwritten), and in lines 7-8 we calculate the change in \(BIC\) and the change in \(R_{adj}^2\) and store the results in the prepared vectors.
- In lines 10-11, we calculate the upper quantiles and refer to them as the thresholds for **no**, **weak** and **strong** significance.

In case the server is too busy with the Monte-Carlo simulation, please use the code below for a **local** simulation.
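
A self-contained version could look as follows (a sketch; it simply repeats the simulation above and assumes the `housingrents` data is already loaded as `data`):

```
# Local, self-contained version of the Monte-Carlo simulation.
mymodel1 <- lm(rent ~ area, data = data)
reps <- 2^14
deltabic <- rep(0, reps)
deltaradj2 <- rep(0, reps)
for(i in 1:reps){
  data$x <- rnorm(nrow(data), 0, 1)
  mcmodel <- lm(rent ~ area + x, data = data)
  deltabic[i] <- BIC(mymodel1) - BIC(mcmodel)
  deltaradj2[i] <- summary(mcmodel)$adj.r.squared - summary(mymodel1)$adj.r.squared
}
quantile(deltabic, c(0.95, 0.99, 0.999))
quantile(deltaradj2, c(0.95, 0.99, 0.999))
```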

This supports the table on significant changes in \(\Delta BIC\) which we saw previously. For reference, here is the summary output of `mymodel2` again:

```
##
## Call:
## lm(formula = rent ~ area + econage, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -976.62 -308.18 -44.52 207.44 2218.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1400.187 232.462 6.023 0.0000000128 ***
## area 6.992 1.298 5.388 0.0000002719 ***
## econage -26.844 5.156 -5.206 0.0000006299 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 508.1 on 149 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.443
## F-statistic: 61.05 on 2 and 149 DF, p-value: < 0.00000000000000022
```

We now want to interpret the output:

- The coefficient of `area` amounts to 6.99 \(\frac{CHF}{m^2}\). It is significantly different from zero.
- The coefficient of `econage` amounts to -26.84 \(\frac{CHF}{year}\).
- Since `area` and `econage` are negatively correlated and since we initially neglected `econage` in our first model, we **overestimated** the coefficient of `area` substantially - why? Because larger flats tend to be less aged (which also increases the rent), but this effect was attributed to the coefficient of `area` (a numerical check follows below).
- From the change in \(R^2\) and \(BIC\), we can clearly see that this new model is **superior** to the previous one.
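
This omitted-variable effect can be checked numerically: for OLS, the slope of the short model equals the slope of the long model plus the long-model coefficient of `econage` times the slope of the auxiliary regression of `econage` on `area`. A minimal sketch:

```
# Omitted-variable-bias decomposition (a numerical check, not part of the
# original script): b_short = b_long + b_econage * gamma.
b_short <- coef(lm(rent ~ area, data = data))["area"]
b_long  <- coef(lm(rent ~ area + econage, data = data))
gamma   <- coef(lm(econage ~ area, data = data))["area"]   # auxiliary slope
b_long["area"] + b_long["econage"] * gamma   # equals b_short
b_short
```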

## Adding factor variables

We now want to extend the model with factor variables. For instance, we might consider using the `rooms` variable **either** as a numeric or as a categorical variable.

Below, generate a `mymodel3` object that uses the `rooms` (as numeric) and `econage` variables and test whether the model is superior compared to `mymodel2`.

```
mymodel3 <- lm(rent~rooms+econage, data=data)
summary(mymodel3)
BIC(mymodel3)
BIC(mymodel2)
```

What do we note? `mymodel2` with the `area` variable as size indicator has a higher adjusted \(R^2\) and a lower BIC; therefore, we should not use the `rooms` variable as size indicator.

Why should we **not** use both size indicators?

- The two variables practically measure the same effect, the **size** of the apartments.
- The variables are highly correlated (\(r=0.869\)).
- Using both variables leads to multi-collinearity.

Below, generate a `mymodelmc` model with `rooms`, `area` and `econage`. Report the **variance inflation factors** using the `vif()` function from the `car` package. In general, `vif` factors above 4 are a hint of multi-collinearity; factors above 10 are highly critical.

```
library(car)   # for vif()

mymodelmc <- lm(rent ~ rooms + area + econage, data = data)
summary(mymodelmc)
vif(mymodelmc)
```
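
For intuition, the variance inflation factor of a regressor is \(VIF_j = \frac{1}{1-R_j^2}\), where \(R_j^2\) stems from regressing that regressor on all the others. A minimal by-hand sketch (the helper name `vif_by_hand` is our own):

```
# What vif() computes, by hand: regress each predictor on the others and
# apply VIF_j = 1 / (1 - R_j^2).
vif_by_hand <- function(df, predictors){
  sapply(predictors, function(v){
    r2 <- summary(lm(reformulate(setdiff(predictors, v), response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}
vif_by_hand(data, c("rooms", "area", "econage"))
```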

**Note:** In economic applications of regression, you **often** have variables that are strongly correlated, and you should carefully check for multi-collinearity. In the example above, we would remove `rooms` from our model.

### Using factor variables

Now, generate a `mymodel4` object with the `rooms` variable as a **factor** variable. We can directly declare it as a factor when defining the model using `factor(rooms)`.

```
mymodel4 <- lm(rent~factor(rooms)+econage, data=data)
summary(mymodel4)
BIC(mymodel4)
BIC(mymodel2)
```

What differences do we see compared to the previous case? Previously, we used `rooms` as a **numeric** variable. Its coefficient was 172.8, meaning that with each additional room, an apartment's rent increases by this amount.

Now, we use `rooms` as a **factor**. This implies that `R` will generate binary variables for each factor level except one - as you saw in the result, the effect of 1-room apartments was not reported. The reason is multi-collinearity; in other words, the **base level** (1-room apartments) is measured by the **intercept**.

| rooms | rooms2 | rooms3 | rooms4 | rooms5 | rooms6 |
|-------|--------|--------|--------|--------|--------|
| 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 |
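
In `R`, you can inspect this dummy coding directly (a quick sketch using `model.matrix()`):

```
# Show the binary variables R generates for factor(rooms); the base level
# (1-room apartments) is absorbed into the intercept column.
head(model.matrix(rent ~ factor(rooms) + econage, data = data))
```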

```
##
## Call:
## lm(formula = rent ~ factor(rooms) + econage, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -923.30 -287.58 -46.32 198.07 2098.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1671.275 227.006 7.362 0.0000000000125 ***
## factor(rooms)2 287.450 145.612 1.974 0.050273 .
## factor(rooms)3 242.963 144.847 1.677 0.095622 .
## factor(rooms)4 577.999 150.791 3.833 0.000188 ***
## factor(rooms)5 656.300 177.488 3.698 0.000308 ***
## factor(rooms)6 955.100 239.366 3.990 0.000104 ***
## econage -28.497 5.425 -5.253 0.0000005239429 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 516.9 on 145 degrees of freedom
## Multiple R-squared: 0.4464, Adjusted R-squared: 0.4235
## F-statistic: 19.49 on 6 and 145 DF, p-value: < 0.00000000000000022
```

- What we note again is that the **BIC** of this model is **higher** and the adj. \(R^2\) is **lower** than simply using `area` as size indicator.
- According to this model, 1-room apartments (with zero economic age) have an average rent of 1671.3 CHF. 2-room apartments, on average, cost 1671.3+287.4 CHF per month, 4-room apartments cost 1671.3+578 CHF per month, and so on.
- We see that the model does not price the increase of one room equally across all apartments. Using the function `linearHypothesis()`, we can test whether this is the case, e.g. whether each additional room is priced by adding \(173\) CHF to the rent.

```
linearHypothesis(mymodel4, c("factor(rooms)2=173",
                             "factor(rooms)3=346",
                             "factor(rooms)4=519",
                             "factor(rooms)5=692",
                             "factor(rooms)6=865"))
```

From the above, we see that each additional room can be assumed to be **equally** priced and thus there is no need for a factor variable! We can now save our code locally:

## Housingrents by nre

As we know, the goal was initially to test whether apartments rented out by the **nre** company are more expensive. We have seen that rents for `nre`-apartments are significantly higher, but also that these apartments are significantly larger and younger.

Calculate, by `nre`, the average rent, size and age of the apartments:

```
bynre <- data %>% group_by(nre) %>%
  summarise(rent = mean(rent, na.rm = TRUE),
            area = mean(area, na.rm = TRUE),
            econage = mean(econage, na.rm = TRUE))
bynre
```

Now, we declare `nre` as a factor variable (we could also just use `factor(nre)` in the model formula), but declaring factors explicitly is important:

```
data$nre <- factor(data$nre,levels=c(0,1),labels=c("non-nre","nre"))
table(data$nre)
```

**Now**, we enhance our previous `mymodel2` (with `econage` and `area`) with `nre`. Store this in `mymodel5`. Can we still say that nre discriminates against its customers by charging excessive rents?

```
# nre was already declared as a factor above
mymodel5 <- lm(rent ~ area + econage + nre, data = data)
summary(mymodel5)
summary(mymodel2)
BIC(mymodel5); BIC(mymodel2)
```

Interestingly, this model is also **not** superior to `mymodel2`.

After controlling for `area` and `econage`, the `nre`-apartments are on average only about 80 CHF per month more expensive, and this result is **not** significant. The coefficient can be explained as follows: we have \(\Delta area = 16.06\) and \(\Delta age = -4.52\). Thus, the rent difference attributable to these differences is \(6.88 \cdot 16.06 + (-26.19) \cdot (-4.52) = 228.9\) CHF. Since we know that the total difference in average rents is \(309.16\) CHF, the remaining roughly 80 CHF is exactly what the `nre` coefficient picks up.
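
This decomposition can be verified numerically (a sketch; it assumes `bynre` and `mymodel5` from above, with the non-nre group in the first row of `bynre`):

```
# Numerical check of the rent-gap decomposition.
d_area <- diff(bynre$area)      # ~ 16.06
d_age  <- diff(bynre$econage)   # ~ -4.52
d_rent <- diff(bynre$rent)      # ~ 309.16 CHF raw rent gap
explained <- coef(mymodel5)["area"] * d_area + coef(mymodel5)["econage"] * d_age
explained            # ~ 228.9 CHF attributable to size and age
d_rent - explained   # ~ 80 CHF, the (insignificant) nre coefficient
```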

Adding the balcony variable has no effect; thus we have arrived at our model: `lm(rent~area+econage)`.

Finished? Not quite…

## Interaction effects and margins

So far, our favourite model is `mymodel2`. However, let's produce some marginal plots using `ggpredict`:

```
library(ggeffects)   # for ggpredict()

# play around with shifting the variables' positions
myps <- ggpredict(mymodel5, terms=c("area","nre","econage"))
myps
plot(myps)
```

Obviously, the regression lines are always parallel. In some situations, we might change this by adding **interaction terms** to the model.

Calculate and interpret the following model: `mymodel6 <- lm(rent~I(area)*I(econage),data=data)`:

```
mymodel6 <- lm(rent~I(area)*I(econage),data=data)
summary(mymodel6)   # the coefficients are discussed below
BIC(mymodel6)
BIC(mymodel2)
```

Our model now is, obviously:

\(rent = 783.2+13.4\cdot area - 5.5\cdot age - 0.24\cdot area \cdot age\)

Thus, the **marginal** effects of \(area\) or \(age\) now are:

\(\frac{\partial rent}{\partial area} = 13.4 - 0.24\cdot age\)

and

\(\frac{\partial rent}{\partial age} = -5.5 - 0.24\cdot area\)
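
To make the first expression concrete, we can evaluate the marginal effect of \(area\) at a few economic ages (a sketch using the fitted coefficients; the selected ages are arbitrary):

```
# Marginal effect of area on rent, evaluated at selected economic ages.
b <- coef(mymodel6)
ages <- c(0, 10, 20, 40)
setNames(b["I(area)"] + b["I(area):I(econage)"] * ages, paste0("age=", ages))
```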

So, we can see that the **positive effect** of \(area\) on \(rent\) **decreases** with **increasing** economic age of the apartment - whether this makes sense has to be judged by a real estate expert (probably not). Also, such a complicated model should be avoided when the gain in adjusted \(R^2\) or the drop in \(BIC\) is that small!

The **negative** effect of \(age\) on \(rent\) gets stronger with increasing area of the apartments. Also here, of course, we could draw marginal effects:

```
# play around with shifting the variables positions
myps <- ggpredict(mymodel6, terms=c("area","econage"))
myps
plot(myps)
```

In short, only use interaction effects when you can justify their existence from a domain-knowledge perspective with economic reasoning!

For instance, we could also test whether `econage` and `area` have a different impact on housing rents for `nre`-apartments.

```
mymodel7 <- lm(rent~I(area)*I(nre)+I(econage)*I(nre), data=data)
summary(mymodel7)
BIC(mymodel7)
myps <- ggpredict(mymodel7, terms=c("area","econage","nre"))
myps
plot(myps)
```

As we see quite fast, the interaction effects are all insignificant.

## Ridge/Lasso - Regression

We have now learnt that multiple regression is a technique that minimizes the following expression:

\(SSR = \sum_i(y_i-b_0-\sum_j b_j \cdot x_{j,i})^2\)

An often encountered problem, however, is that we sometimes have many variables but few observations. In this case, the least squares method of multiple regression may fail: estimated parameters will have a huge variance and we get **overfitting** in our model. In other words, we are estimating too many parameters.

Thus, we now minimize a new function (Lasso regression):

\(SSR_L = SSR + \lambda \cdot \sum_{j=1}^{k}|b_j|\)

This means that for each coefficient \(b_j\) which is estimated **not** to be zero, we 'penalize' the model. This also means, in turn, that unimportant variables are set to zero, which is done automatically by the algorithm.
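
For illustration only (our own helper, not part of `glmnet`), the penalized objective can be coded directly:

```
# Toy sketch of the lasso objective: SSR plus the L1 penalty on the slopes
# (the intercept b0 is conventionally not penalized).
ssr_lasso <- function(b0, b, X, y, lambda){
  sum((y - b0 - X %*% b)^2) + lambda * sum(abs(b))
}
# small made-up example
set.seed(1)
X <- matrix(rnorm(20), 10, 2); y <- rnorm(10)
ssr_lasso(0, c(0.5, -0.5), X, y, lambda = 1)
```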

**Careful:** Although this seems to be a very powerful technique, sometimes economic reasoning prescribes adding certain variables even though they are not 'important'!

Note:

- If \(\lambda=0\), we fall back to multiple regression.
- If \(\lambda=\infty\), each coefficient is set to zero.

We need the `R` packages `glmnet` and `caret`, the `R` machine-learning library.

For this reason, we need to transform our data into a **data matrix** where the factor variables are **one-hot encoded** (i.e. each level is represented by a binary variable).

Below, try the following:

1. Add **each** factor variable (but not `id`) to a new dataframe `mydata`.
2. Make sure that the variables are recognized by `R` as factor variables.
3. Use the function `dummyVars()` to perform one-hot encoding - read the help file if necessary.
4. Apply the one-hot encoding by using the object from step 3 and the function `predict()`.

```
library(caret)   # for dummyVars()

mydata <- data[,c("rooms","nre","balcony")]
is.factor(mydata$rooms); is.factor(mydata$nre); is.factor(mydata$balcony)
mydata$rooms <- factor(mydata$rooms, levels=1:6, labels=as.character(1:6))
# nre was already declared as a factor above
onehot <- dummyVars(" ~ .", data = mydata, fullRank = FALSE)
mydata.lasso <- data.frame(predict(onehot, newdata = mydata))
head(mydata.lasso)
```

Now, we add the quantitative variables to `mydata.lasso`, but only `econage` and `area`, **not** the `rent` variable.

```
mydata.lasso$econage <- data$econage
mydata.lasso$area <- data$area
head(mydata.lasso)
```

### Hyperparameter tuning

Now, we need to find an optimal parameter for \(\lambda\) - this procedure is called hyperparameter tuning. This is achieved by applying **K-Fold-Cross-Validation** (K-Fold-CV).

- The data are split into \(K\) groups of equal size. In a first step, the first group is the **test data set** and the model is fit using groups \(2 \dots K\).
- Then, a goodness-of-fit measure, e.g. the mean absolute error \(MAE\), is calculated on the **test data set**.
- The procedure is repeated using each of the remaining \(K-1\) groups (here, 9 groups) as the **test data set**.
- At the end, we calculate the average of the goodness-of-fit measure. A hand-rolled sketch of this procedure follows below.
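
For intuition, here is a hand-rolled version of 10-fold CV computing the \(MAE\) of our OLS model (a sketch; `cv.glmnet()` below does the analogous thing internally for every \(\lambda\)):

```
# Hand-rolled 10-fold cross-validation for the OLS model rent ~ area + econage.
set.seed(1)
K <- 10
fold <- sample(rep(1:K, length.out = nrow(data)))   # random fold assignment
maes <- sapply(1:K, function(k){
  fit  <- lm(rent ~ area + econage, data = data[fold != k, ])   # train on K-1 folds
  pred <- predict(fit, newdata = data[fold == k, ])             # predict held-out fold
  mean(abs(data$rent[fold == k] - pred))                        # MAE on test fold
})
mean(maes)   # average goodness of fit over the K folds
```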

- Below, we define a vector `grid` that contains possible \(\lambda\) values with \(0 < \lambda < \infty\): `grid <- 10^seq(10,-2,length=100)`.
- We calculate the Lasso model for all \(\lambda\) values in the grid: this is done with the function `lasso.mod <- glmnet(x,y,alpha = 1, lambda = grid)`, where `x` is our data matrix `mydata.lasso` and `y` is our dependent variable.
- Using the function `cv.out <- cv.glmnet(x,y, alpha = 1, type.measure = 'mae', nfolds = 10)`, we apply cross-validation to our model.
- We save the optimal \(\lambda\) value using `cv.out$lambda.min`.

```
library(glmnet)   # for glmnet() and cv.glmnet()

# Step 1
grid <- 10^seq(10, -2, length = 100)
# Step 2
lasso.mod <- glmnet(data.matrix(mydata.lasso), data$rent, alpha = 1, lambda = grid)
# Step 3
set.seed(1)
cv.out <- cv.glmnet(data.matrix(mydata.lasso), data$rent, alpha = 1,
                    type.measure = "mae", nfolds = 10)
# Step 4
# save the optimal lambda
bestlam <- cv.out$lambda.min
```

Finally, using \(\lambda = 82.606\), we predict our model and return the coefficients:

```
lasso.pred.rent <- predict(lasso.mod, s = bestlam, newx = data.matrix(mydata.lasso))
coef_lasso <- predict(lasso.mod , s = bestlam, type = "coefficients")
# return coefficients
coefDF <- data.frame(coefNames = rownames(coef_lasso), coef = unname(coef_lasso[,1]))
coefDF %>% filter(round(as.numeric(abs(coef)),2) > 0)
```

Obviously, the model selection confirms our **prior selection** according to `mymodel2` - the coefficients are, however, not identical because the minimization function is different (using \(\lambda = 82.61\)).

### Comparing the two models

In the code below, calculate the predicted values for both models and compute a measure of fit to contrast the two models against each other.

\(MAE = \frac{1}{n}\cdot \sum (|y_i - \hat{y}_i|)\)

```
lasso.pred.rent <- predict(lasso.mod, s = bestlam, newx = data.matrix(mydata.lasso))
ols.pred.rent <- predict(mymodel2, newdata = data)
rent <- data$rent
mae <- function(y, yhat){
  mean(abs(y - yhat))
}
mae(rent, lasso.pred.rent)
mae(rent, ols.pred.rent)
```

Obviously, the \(MAE\) does not vary too strongly between the two models.