
Introduction to the data

The concept of the tutorial

Our data

In this tutorial, we use a dataset called housingrents.csv. You can find the dataset in our Moodle course or, alternatively, download it here and store it in a folder, e.g. BAQM_SimpleRegression.

After generating a project in R, or after setting the working directory to the folder that contains the data, the data can be read in with the following code:

data <- read.csv("housingrents.csv",sep=";")

For the tutorial, the data is already available in the dataframe data. Display the head of the data below. Also, with the command str() you can investigate what kind of variables the data contains.

head(data)
str(data)

We have the following variables:

  • id, a unique identifier of the apartments in the dataset
  • rooms, number of rooms of the apartments (can be treated as a numeric variable or as a factor)
  • area, size of the apartments, in \(m^2\) (numeric variable)
  • rent, monthly rent fee of the apartments in \(CHF\) (numeric variable)
  • nre, dummy variable that indicates whether the apartment is rented out by company nre (factor variable)
  • econage, economic age of the apartment, years since the apartment was built or last renovated (numeric variable)
  • balcony, character indicating whether the apartment has a balcony or not (factor variable)

Our overall goal is to build a regression model that can estimate rents (the dependent variable), given a set of independent or explanatory variables.

Recall that categorical variables are factor variables. In R, they can be summarised using frequency tables (table()) or by building a barplot using the command barplot().

For numeric variables, we can provide a five-number summary (plus the mean) using the command summary(). For a simple graphical analysis of a numerical variable, we can use the commands hist() and boxplot(). In the code below, please provide a univariate analysis of all variables in the data.

# rooms
table(data$rooms)
summary(data$rooms)

# area
summary(data$area)
# rent
summary(data$rent)
# nre
summary(data$nre)
table(data$nre)
# econage
summary(data$econage)
# balcony
table(data$balcony)

Answer the following questions about our data:

# rent
summary(data$rent)
hist(data$rent)
boxplot(data$rent)

# rooms
table(data$rooms)
barplot(table(data$rooms))
length(which(is.na(data$rooms)))

# area
mean(data$area,na.rm=T)
quantile(data$area,c(0.1,0.9),type=2,na.rm=TRUE)

# age
length(which(is.na(data$econage)))
cor(data$econage,data$rent,use="pairwise.complete")

# balcony
table(data$balcony)
length(which(is.na(data$balcony)))

# nre
table(data$nre)

Quiz

Now, you may copy and save this part of code to a file 01_SimpleRegression.R.

  1. Store the code in the same folder as you previously stored the housingrents.csv file
  2. Then, open the code by double-clicking it.
  3. Navigate to File \(\rightarrow\) New Project and press Save when prompted.
  4. The New Project wizard will open; choose Existing Directory and navigate to the folder BAQM_SimpleRegression.
  5. Done - you have generated an R project.

Important Information

how we work...

Notes on a Data Science Project

Normally, a data-science project deals with at least these steps:

  1. Overview, data cleaning and matching (not necessary here)
  2. Descriptive Statistics and data wrangling - done (data wrangling not necessary here)
  3. Exploratory Analysis (bivariate and multivariate analysis of the data)
  4. Feature Engineering - create new features as predictors (new variables) - here, it might be interesting to add, e.g., the commuting time to the nearest railway station (if available)
  5. Model Selection (estimate many models and choose the best one according to a certain metric)
  6. Model deployment - deploy your model, e.g. as a web service, for other users

Our first regression

We have already seen that our data consist of \(n=152\) observations.

In order to start our introduction to regression analysis, we first only analyse a subset of the data.

This is not a train-test split as often done in machine learning, but rather a handy approach, since it allows you to follow the calculations more easily with fewer data points.

We want to choose \(n=10\) random data points. We set the random number generator with the code-snippet:

RNGkind(sample.kind="Rounding")
set.seed(33)

This ensures that everyone who executes the code will obtain the same ‘random’ data points.

RNGkind(sample.kind="Rounding")
set.seed(33)
subdata <- data[sample(1:nrow(data),size=10,replace=F),c("area","rent")]

Descriptive Statistics

Now, calculate descriptive statistics for rent and area in the subdata dataframe.

mx<-mean(subdata$area,na.rm=T)
my<-mean(subdata$rent,na.rm=T)
sx<-sd(subdata$area,na.rm=T)
sy<-sd(subdata$rent,na.rm=T)
r<-cor(subdata$area,subdata$rent,use="pairwise.complete")

We obtain the following descriptive statistics: \(\bar{x}=95.1\), \(\bar{y}=1181.7\), \(s_x=53.26\), \(s_y=807.23\) and \(r=0.923\).

Simple Regression

The objective is to calculate the best linear predictor for the dependent variable \(y\) (rent) with the independent variable \(x\) (area), thus:

\(y_i = a\cdot x_i + b + e_i\)

\(\hat{y_i}=a\cdot x_i + b\)

\(e_i\) is the error term or the difference between real values for rents \(y_i\) and predicted values \(\hat{y_i}\). Thus, \(e_i=y_i - \hat{y_i}\).

The main question is how we can estimate \(a\) and \(b\), the slope and the intercept of the regression equation.

\(\color{red}{\text{$\hat{y_1}=2.5\cdot x + 750$}}\)
\(\color{blue}{\text{$\hat{y_2}=14\cdot x - 150$}}\)
\(\color{green}{\text{$\hat{y_3}=20\cdot x + 750$}}\)

We draw a scatter plot (blue points, the cross indicates the mean coordinates) and try to decide which regression line \(\hat{y_i}\) is the best fit for our data.

Below, you find a scatter plot of the \(10\) data points, along with the three proposed regression lines. With the app below, you can move around the black line (which currently lies on the \(\color{red}{\text{red}}\) line). You can also draw the residuals - which line leads to the smallest sum of squared residuals?

\(SSR = \sum e_i^2\).
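To make this concrete, here is a small sketch (using the subdata dataframe from above; the helper objects x, y and ssr1 to ssr3 are ours) that computes the sum of squared residuals for each of the three proposed lines:

x <- subdata$area
y <- subdata$rent
ssr1 <- sum((y - (2.5*x + 750))^2)        # red line
ssr2 <- sum((y - (14*x - 150))^2)         # blue line
ssr3 <- sum((y - (20*x + 750))^2)         # green line
c(red = ssr1, blue = ssr2, green = ssr3)  # which line has the smallest SSR?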

The \(\color{blue}{\text{blue}}\) line represents the best fit.

The minimization process of \(SSR = \sum (y_i - a\cdot x_i - b)^2\) with respect to \(a\) and \(b\) leads to the following formulas:

\[a = r\cdot \frac{s_y}{s_x}\] \[b = \bar{y}-a\cdot \bar{x}\]
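Using the descriptive statistics mx, my, sx, sy and r computed above, these formulas translate directly into R - a short sketch:

a <- r*sy/sx    # slope: correlation times the ratio of the standard deviations
b <- my - a*mx  # intercept: the regression line passes through the point of means
c(a = a, b = b)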

We obtain: \(a=0.923 \cdot \frac{807.23}{53.26} = 13.99\).
and for the intercept: \(b=1181.7-13.99\cdot 95.1 \approx -148.46\)

Thus, the regression line is:

\[\hat{y_i}=13.99\cdot x -148.46\]

Where \(x\) stands for the area and \(\hat{y}\) for the predicted rent.

For instance, we see that the mean area of \(\bar{x}=95.1\) is not a data point, but the regression line allows a prediction for the rent of an apartment of this size:

\(\hat{y}=13.99\cdot 95.1 -148.46 = 1181.7\)

Obviously, the regression line hits the mean of the data perfectly.

What does the result mean? \(a=13.99\) is the slope of the regression line and measures the (linear) dependency of rent on area, thus, a \(1m^2\) increase in the size of a flat increases the rent by \(13.99\) CHF. It makes sense that the coefficient is positive. \(b\) is the predicted rent of a flat, when the area \(x\) is set to zero (\(x=0\)). We would expect a rent of 0 if area \(x\) is zero.

Regression with R

With the following R-code, we perform a regression:

mymodel <- lm(rent~area, data=subdata)
summary(mymodel)

The first line represents the specification of the regression model: rent, the dependent variable, is regressed on area, the independent variable. The second argument specifies that the data is within the dataframe subdata. The second line of code returns the output of the regression.

mymodel <- lm(rent~area, data=subdata)
summary(mymodel)

Now, you may copy and save the second part of the code to a file 02_SimpleRegression.R.

  1. Store the code in the same folder as you previously stored the housingrents.csv file

Interpret our first regression

We now want to interpret the regression results:

## 
## Call:
## lm(formula = rent ~ area, data = subdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -303.9 -245.5 -123.8  181.5  559.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -148.462    222.182  -0.668 0.522813    
## area          13.987      2.063   6.779 0.000141 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 329.7 on 8 degrees of freedom
## Multiple R-squared:  0.8517, Adjusted R-squared:  0.8332 
## F-statistic: 45.96 on 1 and 8 DF,  p-value: 0.0001407

Overall interpretation

  • The standard deviation of the residuals \(e_i\) is: \(s_e = 329.67\). Recall that the standard deviation of the rents was \(s_y = 807.23\).
  • The degrees of freedom (\(df\)) is \(n-k-1=10-1-1=8\). \(k\) stands for the number of regressors used, here \(k=1\).
  • \(R^2\) amounts to 0.8517. This means that our model, where we explain rents with the area variable, can explain 85.17% of the variation in the dependent variable, the rents.
  • We focus on adjusted \(R^2\): Formula: \(R_{adj}^2 = 1-(1-R^2)\cdot \frac{n-1}{n-k-1}\). This means: the more independent variables we use in our model, the more we get ‘penalized’ and the explained variation is reduced. Do not use ‘unnecessary’ independent variables in your model (here, \(n=10\), \(k=1\)). Adjusted \(R^2\) is: 83.32% (see the sketch below).
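As a quick check, the adjusted \(R^2\) reported above can be reproduced from this formula - a small sketch using the values from the regression output:

n <- 10; k <- 1; r2 <- 0.8517
1 - (1 - r2)*(n - 1)/(n - k - 1)   # approximately 0.8332, the adjusted R-squared from the output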

Further, the \(F\)-statistic tests whether the slope coefficients in the model (here only the coefficient of area) are jointly significantly different from zero. Our model from the sample is: \(y_i = a\cdot x_i +b+e_i\). With Greek letters, we refer to the true coefficients in the population (all apartments in Switzerland), thus \(y_i=\alpha \cdot x_i+\beta+\epsilon_i\). The null hypothesis is:

\(H_0\): \(\alpha=0\), e.g. at a confidence level of \(95\%\),
\(H_A\): \(\alpha \neq 0\), at a significance level of \(5\%\).

From the table above, we can see that the \(p\)-Value is below \(\alpha=5\%\). Thus, we reject \(H_0\) and say that the coefficients are (jointly) significantly different from zero - this is a result which we desire! Our sample allows us to conclude that there is a significant linear relationship between rents and area!
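If needed, the \(F\)-statistic and its \(p\)-value can also be extracted from the summary object - a minimal sketch (the object s is ours):

s <- summary(mymodel)
s$fstatistic                       # F value with numerator and denominator degrees of freedom
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)  # p-value of the F-test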

Coefficient-based interpretation for area

  • The coefficient \(a\) amounts to \(a=13.99\frac{CHF}{m^2}\). This means that a \(1m^2\) increase in the size of a flat increases the monthly rent by \(13.99CHF\).
  • The standard error of this coefficient is \(SE_a = 2.06\frac{CHF}{m2}\). The standard error measures the statistical uncertainty of the sample.
  • The \(t\)-statistic tells us how far the coefficient \(a\) is away from zero, expressed in numbers of standard errors. We obtain \(t=\frac{13.99}{2.06}=6.78\). This tests the null hypothesis \(H_0\) that the coefficient is zero and yields a \(p\)-value of about 0.01%.
    \(H_0\): \(\alpha = 0\), \(100-\alpha=95\%\)
    \(H_A\): \(\alpha \neq 0\), \(\alpha = 5\%\).
    Since the \(p\)-value falls below \(\alpha = 5\%\), we reject \(H_0\) and say that the coefficient of area is statistically significantly different from zero. Of course we desire this result - we expect area to have a positive impact on rents. (The calculation of \(t\) and the \(p\)-value is sketched after this list.)
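A minimal sketch of this calculation, based on the coefficient table of our model (the helper objects cf and t_area are ours):

cf <- summary(mymodel)$coefficients
t_area <- cf["area", "Estimate"]/cf["area", "Std. Error"]  # t = 13.99/2.06
2*pt(-abs(t_area), df = 8)                                 # two-sided p-value with n-k-1 = 8 degrees of freedom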

Coefficient-based interpretation for the intercept

  • The coefficient \(b\) amounts to \(b=-148.46CHF\). This means that an apartment of size \(0m^2\) is valued by the model at \(-148.46CHF\).
  • The standard error of this coefficient is \(SE_b = 222.18CHF\). The standard error measures the statistical uncertainty of the sample.
  • The \(t\)-statistic tells us how far the coefficient \(b\) is away from zero, expressed in numbers of standard errors. We obtain \(t=\frac{-148.46}{222.18}=-0.67\). This tests the null hypothesis \(H_0\) that the coefficient is zero and yields a \(p\)-value of 52.28%.
    \(H_0\): \(\beta = 0\), \(100-\alpha=95\%\)
    \(H_A\): \(\beta \neq 0\), \(\alpha = 5\%\).
    Since the \(p\)-value lies above \(\alpha = 5\%\), we cannot reject \(H_0\) and say that the intercept of the model is not statistically significantly different from zero. We are also ‘glad’ about this result: the predicted rent of a \(0m^2\) apartment should be zero.

Note on interpreting model coefficients from a regression…

  1. A coefficient exhibits strong significance (significantly different from \(0\)) if the \(p\)-value falls below \(1\%\), medium significance if the \(p\)-value falls below \(5\%\), and weak significance if the \(p\)-value falls below \(10\%\)
  2. Make sure you apply reasoning on the sign of the coefficient
  3. Also, check the \(F\)-statistic - it could be that its \(p\)-value reveals an overall significance of the model even if none of the coefficients are significant!
  4. Make sure you check adjusted \(R^2\) and not \(R^2\). Adjusted \(R^2\) penalizes a regression model when taking up too many explanatory variables.

Univariate Regression with full sample

Now, we take the full sample and have a closer look at regression results.

The dataframe data now contains the full sample. Below, run a univariate regression that explains rent with area and then answer the following questions.

mymodel <- lm(rent~area, data=data)
summary(mymodel)

Quiz

Inference after Univariate Regression

Predictions

With a regression model, we can on one hand quantify the effect that area has on rents. On the other hand, we also want to be able to make predictions.

The mymodel object stores two vectors: fitted.values and residuals. For the moment, we only look at the first observation with index 1: write R code that outputs the area and rent of this apartment, the predicted rent, and the residual. Also, try to replicate the predicted rent using the coefficients object within mymodel.

area1 <- data$area[1]
rent1 <- data$rent[1]
predicted.rent1 <- mymodel$fitted.values[1]
residuals.rent1 <- mymodel$residuals[1]

print(paste("area: ",area1," rent: ",rent1,"   predicted rent: ",predicted.rent1,"  residual:",residuals.rent1,sep=""))

a<-mymodel$coefficients[2]
b<-mymodel$coefficients[1]
a*area1+b

This was probably not too difficult. However, what if we wanted to make predictions at area values that we do not observe in the data, e.g. at hypothetical areas of c(0,30,87,300) square meters? To do this, we need the predict command.

The command is:
predict(object, newdata = , interval = c('confidence','prediction'), se.fit=c(TRUE,FALSE), level = 0.95)

  • The object is the previously estimated mymodel-object, or any other model
  • The option newdata is indicating that the model should make predictions at new data, the input should be of type data.frame
  • With the option interval, we indicate whether we want the standard error of the prediction at the mean (option confidence) or for an individual value (option prediction). What does this mean? Consider an apartment of size \(30m^2\). The option confidence calculates the standard error for an ‘average’ apartment of size \(30m^2\). As you know, there are many different \(30m^2\) apartments, e.g. with or without a balcony, newly built vs. 50-year-old apartments, and so on. Accordingly, the option confidence yields the standard error of the average rent for such apartments, whereas the option prediction yields the standard error of an individual rent for such an apartment - the first standard error will be smaller than the second.
  • se.fit = TRUE specifies to display the standard-error for each prediction.
  • level specifies the confidence-level \(1-\alpha\), which is set by default to \(0.95\).

Create a dataframe new with a column area=c(0,30,87,300). Then, make a prediction once for the option prediction and once for the option confidence - what do you note?

new <- data.frame(area=c(0,30,87,300))
# prediction at the mean
predict(mymodel, newdata = new, interval='confidence', se.fit=TRUE)
# individual prediction
predict(mymodel, newdata = new, interval='prediction', se.fit=TRUE)

Note on interpreting model coefficients from a regression…

  1. Note the coefficient of area: \(a = 10.64\). This coefficient is equal to \(a=r\cdot \frac{s_y}{s_x}\), where \(r\) is the correlation between the two variables and \(s_y\) and \(s_x\) are the standard deviations. Convince yourself of this equality in the code below.
r <- cor(data$area,data$rent)
sy <- sd(data$rent)
sx <- sd(data$area)
r*sy/sx
  2. Second, the regression line \(y=a\cdot x + b\) goes through the point of means \((\bar{x}|\bar{y})\). In other words, a prediction at the mean area of 86.84 is the mean value of rent, rent = 1046 - the regression line predicts the mean of the data perfectly (see the sketch after this list).
  3. Predictions have the lowest standard error at the mean of the data. The standard error increases when moving away from the mean. Therefore, be careful with predictions that are out of sample.
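A short check of this property with the full-sample model (a sketch, assuming the mymodel object estimated above):

predict(mymodel, newdata = data.frame(area = mean(data$area, na.rm = TRUE)))  # prediction at the mean area
mean(data$rent, na.rm = TRUE)                                                 # mean rent - the two values (approximately) coincide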

Testing Coefficient Hypotheses

Finally, we want to obtain Confidence-Intervals for our coefficients. Since our data just represent a sample, we do not know about the ‘true’ parameters \(\alpha\) and \(\beta\) in the population. We obtain confidence-intervals using the function confint.lm(mymodel,level=...) - Now, calculate a \(95\%\) and \(99\%\) confidence interval for our coefficients.

confint.lm(mymodel, level=0.95)
confint.lm(mymodel, level=0.99)

What does this mean?

E.g. we might test the null-hypothesis that:

\(H_0\): \(\alpha = 13 \frac{CHF}{m^2}\), at a confidence level of \(1-\alpha = 95\%\),
\(H_A\): \(\alpha \neq 13 \frac{CHF}{m^2}\), at \(\alpha = 5\%\)

Since \(\alpha = 13\frac{CHF}{m^2}\) is not within the 95%-confidence interval, we reject \(H_0\). The rent per \(m^2\) is significantly different from \(13\frac{CHF}{m^2}\).
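A minimal sketch of this check in R (the object ci is ours):

ci <- confint.lm(mymodel, level = 0.95)
ci["area", ]                               # 95% confidence interval for the area coefficient
13 >= ci["area", 1] & 13 <= ci["area", 2]  # FALSE: 13 CHF/m2 lies outside the interval, so H0 is rejected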

In addition, to test whether the coefficient equals \(13\), we can use the function linearHypothesis from the car package.

library(car)  # provides linearHypothesis()
linearHypothesis(mymodel, "area=13")

# other tests
linearHypothesis(mymodel,"(Intercept)=0")
linearHypothesis(mymodel,"1*(Intercept)+87*area = 1150")

Explanation:

  1. The \(p\)-value of the test falls below \(5\%\); therefore, we reject \(H_0\) that \(\alpha=13\).
  2. The \(p\)-value, of course, is the same as in the output of summary(mymodel) (look at the intercept) - the intercept is significantly different from zero.
  3. The last line represents a test for a prediction at area=87 (roughly the mean area). The fitted value is \(1240.3\) CHF per month, which is significantly different from \(1150\) CHF per month since the \(p\)-value falls below \(5\%\).

Now, you may copy and save the third part of the code to a file 03_SimpleRegression.R.

  1. Store the code in the same folder as you previously stored the housingrents.csv file

Model Diagnostics

A regression model relies on several theoretical assumptions which should be checked. If these assumptions are fulfilled, we call the regression estimator (the ordinary least squares or OLS estimator) BLUE (Best Linear Unbiased Estimator).

  • Best means that the estimator has the lowest possible variance (efficient estimator)
  • Unbiased means that the expected values of our coefficients are identical to the true values in the population, e.g. \(E(a) = \alpha\).

Assumption (1): Linearity

The regression model \(y_i = a\cdot x_i + b + e_i\) assumes a linear relationship between the dependent variable \(y\) and the independent variable \(x\).

  • The linearity can be visually checked by plotting the error terms \(e_i\) (residuals) over the predicted values of the model (\(\hat{y_i}\)).

  • There should not be any obvious functional dependency between the residuals and the fitted values (e.g. parabolic shape or seasonal pattern).

  • If linearity of a model is not fulfilled, adding new variables such as squared variables may solve the situation.

The first graph of the autoplot() function allows for visual inspection of linearity.

# consider first graph
library(ggfortify)  # autoplot() for lm objects is provided by the ggfortify package
autoplot(mymodel)

In the specific example, linearity seems to be fulfilled - the blue trend line is (more or less) horizontal.

A non-linear relationship tells us that our coefficient estimates are biased.

In addition, we might test for serial correlation in the error terms - serial correlation is present when the residuals show a recurring pattern, e.g. seasonality. We accomplish this with durbinWatsonTest() from the car package.

# Durbin-Watson test for serial correlation (car package)
durbinWatsonTest(mymodel)

Since the \(p\)-value lies above \(5\%\), we cannot reject the null hypothesis of no serial correlation.

Assumption (2): Normal distribution of the error terms

  • If the error terms \(e_i\) are normally distributed, the points lie on a 45°-line in a QQ-Plot, or at least only randomly fluctuate around that line.

  • If this is not the case, we see certain patterns around the 45°-line.

  • A non-normal distribution can potentially be corrected by transforming the variables in the model, e.g. by taking logarithms.

The second graph in autoplot() allows for investigating a QQ-Plot of the residuals \(e_i\). In addition, the command shapiro.test() applies a statistical test for normality of the residuals - if its \(p\)-value is above \(5\%\), normality can be assumed.

# consider second graph
autoplot(mymodel)
shapiro.test(mymodel$residuals)

Obviously, the upper quantiles of our residuals deviate from the 45°-line. This is a rather typical behavior for right-skewed data. Non-normality of residuals leads to unbiased but inefficient coefficients.

Now, re-estimate the model below by taking logarithms of rent and area. What do you notice?

mymodel2 <- lm(log(rent)~log(area), data = data)
summary(mymodel2)
# consider second graph
autoplot(mymodel2)
shapiro.test(mymodel2$residuals)
  • The \(p\)-value is still below \(5\%\) but at least above \(1\%\); accordingly, the problem is less severe.
  • The interpretation changes. Looking at the results, we interpret the coefficient of log(area) as follows: if the area of an apartment increases by 1%, the rent is expected to increase by 0.747%.
  • Note that \(R_{adj}^2\) has (quite) substantially increased, as the sketch below shows - therefore, taking logarithms would be an appropriate solution here (which we do not do for the moment, for ease of interpretation).
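A short sketch comparing the two models directly:

summary(mymodel)$adj.r.squared    # model in levels
summary(mymodel2)$adj.r.squared   # log-log model - noticeably higher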

Assumption (3) - Homoscedasticity

Homoscedasticity means that the variance of the residuals should be constant.

  • The variance of the error term is constant over the predicted values

  • In the scale-location plot, there should not be any obvious dependency between scale and location (of predictions).

  • Heteroscedasticity leads to inefficient coefficient estimates.

  • In the case of multiple regression, heteroscedasticity can be avoided by omitting certain variables. In practice, robust estimates of the variance-covariance matrix should be obtained (package sandwich)

  • The R function ols_test_breusch_pagan() from package olsrr tests the null hypothesis of homoscedasticity. If the p-value falls below 5%, the model suffers from heteroscedasticity.

# consider third graph
autoplot(mymodel)
library(olsrr)  # provides ols_test_breusch_pagan()
ols_test_breusch_pagan(mymodel)

In the third plot, we can see that the variance (more accurately, the square root of the absolute standardized residuals) increases with increasing fitted values \(\hat{y_i}\). This means that heteroscedasticity is present - which yields inefficient coefficient estimates. Also, the \(p\)-value from the Breusch-Pagan test indicates heteroscedasticity.

Below, re-run the model again with log-transformed variables and interpret the result of the Breusch-Pagan test.

mymodel2 <- lm(log(rent)~log(area), data = data)
summary(mymodel2)  
# consider third graph
autoplot(mymodel2)
ols_test_breusch_pagan(mymodel2)

Also here, we can clearly see that the log-transformation not only improves the non-normality of the residuals but also results in homoscedasticity. For ease of interpretation, we still stick with the original values in the model. If heteroscedasticity is present, you can also use a different variance-covariance matrix that accounts for heteroscedasticity, using the vcovHC() function from the sandwich package (later…).

Assumption (4) - no influential observations or outliers

Outliers are observations that are badly predicted by the model. They either have extremely large positive or extremely large negative residuals. We can identify outliers using the function outlierTest() from the car package.

  • Outliers are present if the limit of \(\pm 3\) for the standardised residuals is violated, or if the leverage is larger than \(\frac{2(k+1)}{n}\) (see the sketch after this list).

  • Outliers that are marked (with numbers) should be investigated more closely.

  • Outliers distort both the accuracy of the coefficient estimates (bias) and their efficiency!

  • Test your model against influential outliers.
    Possibility 1: Remove outliers from your model, recalculate the model, compare coefficients and other results. (A problem might be that new data points will be marked as outliers).
    Possibility 2: Transform the data in the model (e.g. by taking logarithms)
    Possibility 3: The package MASS enables you to calculate robust regressions with rlm(). The function uses a different weighting of observations than OLS does. Influential outliers are given less weight.
    Possibility 4: Apply quantile regression (e.g. regression not at the mean but regression at conditional quantiles – package quantreg, using rq()).
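A small sketch that applies the two rules of thumb from above to our model (the helper objects rs, lev, k and n are ours):

rs  <- rstandard(mymodel)                  # standardized residuals
lev <- hatvalues(mymodel)                  # leverage of each observation
k <- 1; n <- nrow(model.frame(mymodel))    # one regressor, number of observations used in the fit
which(abs(rs) > 3)                         # observations with |standardized residual| > 3
which(lev > 2*(k + 1)/n)                   # observations with high leverage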

Re-run the function autoplot() and consider the fourth graph.
In addition, use the function outlierTest() from the car package to identify potential outliers.

# consider fourth graph
autoplot(mymodel)

ot <- outlierTest(mymodel)
ot

Now, we might remove these outliers - but let's first take a short look at rows 98 and 151.

# consider fourth graph
autoplot(mymodel)

ot <- outlierTest(mymodel)
index <- sort(as.numeric(names(ot$rstudent)))
data[index,c("rent","area")]

We can see that the rent for both flats is surprisingly high - given that we estimated a value of \(a = 10.6 \frac{CHF}{m^2}\). We now remove the outliers from the data and recalculate our model according to Possibility 1 above.

# consider fourth graph
autoplot(mymodel)
ot <- outlierTest(mymodel)
index <- sort(as.numeric(names(ot$rstudent)))
data1 <- data[-index,]
mymodel.p1 <- lm(rent~area,data=data1)
summary(mymodel.p1)

outlierTest(mymodel.p1)
# we also might remove '63'
data2 <- data1[-which(rownames(data1)=='63'),]
mymodel2 <- lm(rent~area,data=data2)
outlierTest(mymodel2)

# we also might remove '102'
data3 <- data2[-which(rownames(data2)=='102'),]
mymodel3 <- lm(rent~area,data=data3)
outlierTest(mymodel3)
summary(mymodel3)

As we can see, adjusted \(R^2\) increases by approximately \(1\%\). We might therefore remove the two outliers, test again whether further outliers are identified, and keep removing observations as long as their Bonferroni \(p\)-values are below \(1\%\). We already tested possibility 2 from above - with log-transformed variables, no outliers are identified in the upper part of the distribution.
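One possible way to automate this iterative removal is sketched below (an illustration only - the object names are ours, and the studentized residuals with a Bonferroni correction mirror what outlierTest() reports, but this loop is not part of the original code):

d <- data
repeat {
  m    <- lm(rent ~ area, data = d)
  rst  <- rstudent(m)                                        # studentized residuals
  p    <- 2*pt(abs(rst), df = df.residual(m) - 1, lower.tail = FALSE)
  bonf <- pmin(p*length(rst), 1)                             # Bonferroni-corrected p-values
  bad  <- names(which(bonf < 0.01))                          # rows flagged as outliers at the 1% level
  if (length(bad) == 0) break                                # stop when no such outlier remains
  d <- d[!(rownames(d) %in% bad), ]                          # drop them and re-estimate
}
summary(lm(rent ~ area, data = d))

We now proceed with possibilities 3 and 4: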

# rlm()
mymodel.rlm <- MASS::rlm(rent~area,data=data)
summary(mymodel.rlm)

# rq()
mymodel.rq <- quantreg::rq(rent~area,data=data,tau = c(0.5))
summary(mymodel.rq)

In contrast to outliers, highly influential data points do not necessarily have large residuals, but rather uncommon combinations of predictor values.

Using the function influencePlot(), we can identify the outliers 98 and 151, as well as the influential observations 137 and 148 - interestingly, the latter are the observations with the two largest living areas of \(200m^2\) and \(250m^2\) in the data.

influencePlot(mymodel)

Now, you may copy and save the fourth part of the code to a file 04_SimpleRegression.R.

  1. Store the code in the same folder as you previously stored the housingrents.csv file

The fun-part - Homegate

The univariate regression explains a dependent variable \(y\) with an independent variable \(x\). We stay with the example of rents and want to explain monthly rents (in CHF) with living space (in \(m^2\)).

Consider, for instance, the municipality 8712 Stäfa (ZH). The variables are as follows:

Descriptive Statistics:

The goal of the univariate regression calculation is to find the parameters \(a\) and \(b\). The following applies:

\(y_i = a\cdot x_i + b + e_i\)

Where:

  • \(x_i\) and \(y_i\) are the coordinates \(x\) and \(y\) of an arbitrary data point \(i\)
  • \(a\) and \(b\) are the parameters (slope and intercept) of the regression line to be estimated.
  • We call the part \(a\cdot x_i+b\) the value predicted by the regression calculation \(\hat{y_i}\).
  • \(e_i\) denote the error terms or residuals. These are the prediction errors of the model \(\hat{y_i}=a\cdot x_i+b\).

Now, to determine the parameters \(a\) and \(b\), it is necessary to establish a quality measure based on the prediction errors and to minimize this measure.

This quality measure is the sum of the squared error terms.

With the app below, you can retrieve data from homegate for a date of your choice (today or in the past). Then, you may choose your home municipality by clicking into the box ‘choose PLZ’, deleting the entry with the backspace key, and typing the PLZ of your choice.

Afterwards, you may download the data and play with the app a bit - the goal is that you repeat our analysis with your own data.

Descriptive Statistics

What is the 'best' parameter choice?

R-Code

The code below assumes that the data exists as dataframe df:
mymodel <- lm(rent~area,data=df)
summary(mymodel)

Short explanation of R-Output

Calculate a prediction with R



Your turn! Download the data file for the specific municipality you have chosen. Then perform a simple regression of rents on area, make predictions, and run all kinds of specification tests and model diagnostics.
