---
title: "R_Hello-World_Regression"
author: "Robert McCulloch"
date: "1/24/2026"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,collapse=FALSE)
```

# Hello World Data Analysis in R

In this notebook we will do a basic data analysis in R.

Our goal is to see how the price of a used car depends on characteristics of the car (the features).  

We will  

* read in the data to an R data frame
* pull off price, mileage, and year  
* divide price and mileage by 1,000
* do some simple summaries  
* run some regressions to relate car price to car features and get the standard inference.
* learn how R handles a categorical variable


## Read in the Data and Get the Variables We Want

First we will read in the data from the file susedcars.csv.

```{r rdat, include=TRUE, echo=TRUE, cache=TRUE}
cd = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
```

cd is now a *variable* which refers to a data structure that holds the data read
from the file susedcars.csv.  

cd "is" an R data.frame.  
The basic data structures in R are: vector, list, and data.frame.  
We will see examples below.  

You can also read directly from a file on your machine but you need 
your *working directory* to be set to the directory (or folder) where the file is.  
You can use the functions getwd and setwd to check on your working directory.  
See also /Session/Set Working Directory in Rstudio.  

```{r}
getwd()  ## note also setwd
```

```{r,seecd1}
cat("cd is a: ",typeof(cd),"\n")
cat("is it a data frame:",is.data.frame(cd),"\n")
cat("number of rows and columns: ", dim(cd),"\n")
cat("the column names are:")
names(cd)
```

So, there are a variety of ways to see what an object is in R.  
One function that gives a lot of information about an object is `str`.  

```{r,dependson='rdat'}
str(cd)
```

```{r}
head(cd)
```

Each row is an observation and corresponds to a used car for which we have measured the variables.  
We want to see how price relates to the other variables, which are features of the car.  

Let's just use mileage and year to start simply.  

We will pull off the price, mileage and year columns.  
Note the use of the $ notation to pull off a variable from a data.frame.  
Our goal will be to see how price relates to mileage and year. 
We will divide both price and mileage by 1,000 to make the results easier to understand.   

```{r tdat,include=TRUE, echo=TRUE, cache=TRUE,dependson='rdat'}
cd = cd[,c("price","mileage","year")]
cd$price = cd$price/1000
cd$mileage = cd$mileage/1000
head(cd)
```


### vectors

The function c creates a vector.  
A vector is just a list of things but they all must have the same type.  
For example:

```{r}
temp = c("price","mileage","year")
print(length(temp))
print(temp[2])
print(temp[2:3])
print(temp[c(1,3)])

junk = c(1,'rob')
print(junk) # 1 is turned into a character string

stuff = c(1,5, 22)
typeof(stuff)
```
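Numeric vectors support element-wise arithmetic, which shows up everywhere in R (we used it above when dividing price and mileage by 1,000):

```{r}
x = c(1, 5, 22)
print(x/10)            # element-wise division
print(x + c(1, 1, 1))  # element-wise addition
print(1:5)             # the colon operator builds an integer sequence
print(sum(x))          # many functions reduce a vector to a single number
print(mean(x))
```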

### Back to our Data  

We can get a summary of our data.frame.  

```{r ,include=TRUE, echo=TRUE,dependson='tdat'}
summary(cd)
```

Since all our variables are numeric, let's use the pairwise correlations to get a quick feel
for the relationships.  

```{r ,include=TRUE, echo=TRUE,dependson='tdat'}
cor(cd)
```

Remember, a correlation is between -1 and 1.  

The closer the correlation is to 1, the stronger the linear relationship between the variables,
with a positive slope.  

The closer the correlation is to -1, the stronger the linear relationship between the variables,
with a negative slope. 

So it looks like the bigger the mileage is, the lower the price of the car.  
The bigger the year is, the higher the price of the car.  

Makes sense!!
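As a quick check on the correlation scale, here is a tiny made-up example (the vector x below is just for illustration):

```{r}
x = 1:10
cat("cor(x, 2*x + 5): ", cor(x, 2*x + 5), "\n")   # perfect positive linear relation gives 1
cat("cor(x, -3*x + 1):", cor(x, -3*x + 1), "\n")  # perfect negative linear relation gives -1
```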

Note that we can also pull off columns and rows using integer indices starting at 1.  

```{r}
cd1 = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/susedcars.csv")
cd1 = cd1[1:8,c(1,4,5)] #first 8 rows, columns 1, 4, and 5.
print(names(cd1))
cd1
```

## Functions, Data Structures, Help 

R basically works with functions and data structures.  
For example, to read in the data, we called the function read.csv and it returned a data.frame.  

To get help on a function use ?functionName as in ?read.csv.

Functions can have a lot of arguments, many of which will be optional with defaults.  
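To see how optional arguments with defaults work, here is a small user-defined function (not part of our analysis; the name rescale is made up for this example):

```{r}
rescale = function(x, divisor = 1000) {  # divisor is optional and defaults to 1000
  x/divisor
}
print(rescale(c(2000, 5000)))        # uses the default divisor
print(rescale(c(2000, 5000), 100))   # overrides it
```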

## Plot y versus each x

Let's get a histogram of y=price.  

```{r}
hist(cd$price,nclass=20,main='',xlab='price')  #nclass is the number of bins, it is an optional argument
```

Now let's plot mileage vs. price.


```{r pl-mileage-price, include=TRUE, echo=TRUE,dependson="tdat"}
plot(cd$mileage,cd$price,xlab="mileage",ylab="price")
title(main="mileage vs. price")
```

And year vs. price.  

```{r pl-year-price, include=TRUE, echo=TRUE, dependson="tdat"}
plot(cd$year,cd$price,xlab="year",ylab="price",col='blue',pch=16,cex=.5,cex.lab=1.5,cex.axis=.7)
title(main="year vs. price",cex.main = 2)
```

The `cex` arguments control the size of various things.  
`pch` controls the plot symbol.  

Clearly, price is related to both year and mileage.  
Clearly, the relationship is not linear !!!  

What we really want to **learn** is the joint relationship between *price* and the pair of variables
(*mileage*, *year*) !!!  

Essentially, modern statistical tools, or *Machine Learning*, enable us to learn the relationships
from data without making strong assumptions.

In the expression:
$$
price = f(mileage,year)
$$
we would like to know the function $f$.  

## Using the ggplot R-package

**Note that the vast number of available packages is a major reason for using R**.

Above I have used the graphics in "base R".  
``ggplot`` has become a very popular alternative graphics environment.  

Note that to use ggplot, you would have to first install the R package and then load the library
as below.  

```{r}
# install.packages("ggplot2")  # if not already installed
library(ggplot2)
p = ggplot(data=cd,aes(x=mileage,y=price)) + geom_point()
p
```

Note that in the bottom right frame in Rstudio there is a tab for managing R packages.  

Note that ggplot is part of the "tidyverse", a set of R packages which has become quite
popular.  See the book "R for Data Science" by Wickham.  

\

Another nice plot is

```{r}
pairs(cd)
```

Boxplots are super useful for plotting a numeric variable versus a categorical variable.   
How does price relate to trim?   

```{r}
cdtemp = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
cdtemp$trim = factor(cdtemp$trim)  #see next section for factor
cdtemp$price = cdtemp$price/1000
boxplot(price~trim,cdtemp)
```


## Categorical Variables in R, factors  

When looking at data, a fundamental distinction is between variables which are numeric and variables which
are categorical.  

For example, in our cars data, price is numeric and trim is categorical.  

*Both in building models and plotting, you have to think about categorical variables differently from numeric ones !!*  

R very explicitly allows you to make the distinction between numeric and categorical variables
and many important R functions adapt to the kind of variable you are using.  

To see a simple example, let's add color to our data.  

```{r}
cd1 = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
cd1 = cd1[,c(1,4,5,2)] #all rows, columns 1, 4, 5, and 2
print(names(cd1))
cd1$price = cd1$price/1000
cd1$mileage = cd1$mileage/1000
summary(cd1)
```


trim is treated differently from the numeric variables, but not too intelligently.  
We can fix this by telling R that trim is a *factor*.

```{r}
cd1$trim = as.factor(cd1$trim)
summary(cd1)
```

The variable `trim` is now treated as categorical.  
A categorical has a fixed set of possible categories or "levels".  

```{r}
levels(cd1$trim)
```

You can check "what" a variable is.  

```{r}
cat("is trim a factor: ",is.factor(cd1$trim),"\n")
cat("is price a factor: ",is.factor(cd1$price),"\n")
```
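Internally, a factor stores integer codes that index into the vector of levels:

```{r}
print(as.integer(cd1$trim[1:6]))                     # the underlying integer codes
print(levels(cd1$trim)[as.integer(cd1$trim[1:6])])   # indexing the levels recovers the labels
```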

## apply

```{r}
sapply(cd1,is.factor) # simple apply
```

There are a bunch of apply functions.

```{r}
print(apply(cd,2,mean)) ## apply function mean to columns of cd
print(c(mean(cd[,1]),mean(cd[,2]),mean(cd[,3])))
```

## Regression with just mileage 

Our goal is to relate y=price to x=(mileage,year).
Let's start simple and just use mileage.

Our *simple linear regression model* is

$$
price = \beta_0 + \beta_1 \, mileage + \epsilon
$$
$\epsilon$ represents the part of price we cannot learn from mileage.

Let's regress price on mileage to estimate the model: 

```{r}
#regress y=price on x=mileage, price and mileage are in the data.frame cd
lmmod1 = lm(price~mileage,cd)
cat("estimated coefficients are: ",lmmod1$coef,"\n")
summary(lmmod1)
```
So, the fitted model is

$$
price = 56.36 - .35 \, mileage
$$
$$

Let's plot the training data with the fitted line.

```{r}
plot(cd$mileage,cd$price,xlab="mileage",ylab="price",col="blue",pch=16)
abline(lmmod1$coef,col="red",lwd=2)
title(main="mileage vs. price with linear fit")
legend("topright",legend=c("training data","linear fit"),col=c("blue","red"),pch=c(16,NA),lty=c(NA,1),lwd=c(NA,2),bty="n")
```

Pretty bad !!  You better take an applied machine learning course!!


## Run the Regression of y=price on X=(mileage,year)

Let's run a linear regression of *price* on *mileage* and *year*.  

Our model is: 

$$
price = \beta_0 + \beta_1 \, mileage + \beta_2 \, year + \epsilon
$$
This model assumes a linear relationship.  

*We already know this may not be a good  idea* !!!  

Let's go ahead and *fit* the model.  

Fitting the model to data will give us estimates of the 
parameters $(\beta_0,\beta_1,\beta_2)$.  

The error term $\epsilon$ represents the part of price we cannot
know from (*mileage*,*year*).  

\ 

```{r reg-price-mileage-year, include=TRUE, echo=TRUE,dependson="tdat", cache=TRUE}
lmmod = lm(price~mileage+year,cd)
cat("the coefficients are:",lmmod$coefficients,"\n")
```
So, the fitted relationship is  


$$
price = -5365.49 - 0.154 \, mileage + 2.7 \, year
$$
Note the use of the formula ```price~mileage+year``` to specify the model.  
This is used extensively in R.
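The formula syntax supports more than just `+`. A few common variants, shown here only for illustration (we stick with the additive model in this notebook):

```{r}
## price ~ mileage * year  would add the mileage:year interaction
## price ~ .               uses all other columns of the data.frame as features
lm(price ~ ., cd)$coef    # same fit as price ~ mileage + year, since cd has only these columns
```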

### What is a list in R  

What is ```lmmod``` ?

First we need to know what a list is in R.
For example:

```{r}
tempL = list(a='rob',b=c(1,45,22))
cat('the list\n')
print(tempL)
cat('the components of the list\n')
print(names(tempL))
cat('just the b component\n')
print(tempL$b)
```

### Back to lmmod  

```{r,dependson='reg-price-mileage-year'}
cat("lmmod is a list: ",is.list(lmmod),"\n")
cat("lmmod has named components:\n ",names(lmmod),"\n")
## to get more information about a regression, get its summary
summary(lmmod)
```
There is useful information in the summary:  

```{r}
temp = summary(lmmod)
names(temp)
```



## Get and Plot the Fits  

Let's get the fitted values.  

For each observation in our data set the fits are
$$
\hat{price}_i = -5365.49  - 0.154 \, mileage_i + 2.7 \, year_i
$$


```{r get-fits, include=TRUE, echo=TRUE,dependson=c("tdat","reg-price-mileage-year")}
yhat = lmmod$fitted.values
cat("the type of yhat is:",typeof(yhat),"\n")
cat("is it a vector?:",is.vector(yhat),"\n")
cat("length of yhat is: ",length(yhat),"\n")
```
```{r}
plot(cd$price,yhat)
abline(0,1,col="red")
title(main='y=price vs yhat')
```

**Clearly, it is really bad !!**  

Machine Learning will enable us to get it right fairly automatically.

## Predictions

Many models in R get predictions by calling the predict function on the model data structure (object).

We have a fitted model object lmmod and we will call predict on it.

```{r lm-predict, include=TRUE, echo=TRUE,dependson=c("rdat","reg-price-mileage-year")}
xpdf = data.frame(mileage=c(40,100),year=c(2010,2004)) 
ypred = predict(lmmod,xpdf)
print(xpdf)
print(ypred)
```

Let's check it:

```{r}
sum(lmmod$coef * as.double(c(1,xpdf[1,])))
```
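The predict function can also return interval estimates; for example, 95% prediction intervals at the same two x vectors:

```{r}
predict(lmmod, xpdf, interval = "prediction", level = 0.95)
```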

## General problems and terminology in machine learning

The data we used to "fit" our model, is called the *training data*.  

When we look at predictions for observations in the training data
(as we did for ```yhat```)
we say we are looking at *in-sample* fits or the fit on the training data.  

When we predict at observations not in the training data (as we did for ```ypred```), then we are
predicting *out of sample*.  

Out of sample prediction is the name of the game in predictive modeling or *supervised learning*.  

Out of sample prediction is always a more interesting test since you are predicting data you have not already seen.  

A *huge* part of Machine Learning is about building models to obtain good out of sample predictions!!!!

In supervised learning, we learn the parameters of our model on training data and then predict on the out of sample test data.  
A good model makes accurate predictions on the out of sample test data.  
We predict the target variable y with the features x.  
As noted above, in our example y = price and x = (mileage, year).  
In statistics y would be called the dependent variable and x would be called the independent variables.  

There is also unsupervised learning, in which the goal is to understand structure in data without necessarily  
having a target variable y that we want to predict.  

For example, if I regress y=price on x=mileage, I am doing supervised learning.  
If I compute the correlation between y and x that is unsupervised learning.  

Another key distinction is classification versus regression.   
In regression, our target variable y is numeric. In classification, y is categorical.   
For example, if we were trying to predict if a used car sells in the next month, then y would be   
binary categorical sells/does not sell.   

Many of our key modelling approaches have versions for regression and classification   
where some of the basic ideas are the same but we need to change some basic details.   
For example we have linear regression for a numeric y and logistic regression for a binary y.   
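We do not have a sells/does-not-sell variable in our data, but as a sketch, classification with a binary y uses glm with family = binomial. The variable expensive below is made up just to show the call:

```{r}
cdb = cd
cdb$expensive = as.numeric(cdb$price > median(cdb$price))  # made-up binary target
glmb = glm(expensive ~ mileage + year, data = cdb, family = binomial)
summary(glmb)$coef
```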

## Standard Errors from Standard Regression Output

From our linear regression fit, we got estimates for the parameters.  

Or in Machine learning lingo, we learned the parameters from the training data.  

Often we want to know a lot more about the model fit.  

In particular,  we might want to know the *standard errors* associated with the parameter
estimates.  

```{r,dependson="reg-price-mileage-year"}
summary(lmmod)
```
  

The standard error associated with the estimate of the slope for *mileage* is .0083.  

The confidence interval for $\beta_1$, the *mileage* slope is:

```{r}
-.1537 + c(-2,2)*0.0083
```

The confidence interval is supposed to give us some indication about our uncertainty about
the coefficient given the information in the data:  
small interval: data tells you a lot  
big interval: data not that informative.  
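Rather than computing the interval by hand, the confint function does it for us (it uses the exact t quantile instead of the quick plus-or-minus 2 rule, so the numbers differ slightly):

```{r}
confint(lmmod, level = 0.95)
```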

In basic statistics, there is a big emphasis on quantifying uncertainty.  
In Machine Learning, it is more about prediction.  

Recall that $R^2$ is the square of the correlation between $y$ and $\hat{y}$:

```{r}
yyhat = cbind(cd$price,yhat)
dim(yyhat)
cor(yyhat)
```

The squared correlation is .91239^2 =  `r .91239^2`, which is the same as in the regression output.  

So, R-squared with just mileage is .66 but with year and mileage it is .83.
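You can pull the R-squared values directly from the regression summaries to compare the two fits:

```{r}
cat("R-squared, mileage only:    ", summary(lmmod1)$r.squared, "\n")
cat("R-squared, mileage and year:", summary(lmmod)$r.squared, "\n")
```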

## Let's try the variable trim  

Does knowing the trim of the car help us predict?  

Let's try it.  
How do we add trim to a linear regression model?

We already have cd1:

```{r}
head(cd1)
```

The lm function "automatically" handles a factor (a categorical variable):

```{r}
lmt = lm(price~mileage + year + trim,cd1)
print(summary(lmt))
```

According to $R^2$, trim did not do much!!  

How did lm put trim in the model?  

```{r}
summary(cd1$trim)
```

We add binary indicators for three of the four trim levels.

```{r}
head(model.matrix(lmt))
```

```{r}
cd1$trim[1:6]
```


```{r,fig.width=7,fig.height=6}
yhatt = lmt$fitted.values
cvec= rep('white',nrow(cd1))
cvec[cd1$trim=='430'] = 'black'
cvec[cd1$trim == '500'] = 'purple'
cvec[cd1$trim=='550'] = 'darkgrey'
cvec[cd1$trim == 'other'] = 'red'
plot(yhat,yhatt,col=cvec,cex=.5)  
```

Even though R-squared was not much bigger when we included trim, it does seem to make a bit   
of a difference as a practical matter.  

## Regression In Matrix Notation

Let's write our multiple regression model using vector/matrix notation and use 
basic matrix operations to check the predicted and fitted values.

The general multiple regression model is written:

$$
Y_i = \beta_0 + \beta_1  x_{i1} + \beta_2  x_{i2} + \ldots + \beta_p  x_{ip} + \epsilon_i, \; i=1,2,\ldots,n,
$$

where $i$ indexes observations and $x_{ij}$ is the value of the $j^{th}$ $x$ in the 
$i^{th}$ observation.
If we let  

\begin{equation}
x_i = \left[
\begin{array}{c}
1 \\
x_{i1} \\
x_{i2} \\
\vdots \\
x_{ip}
\end{array}
\right], \;
X= \left[
\begin{array}{c}
x_1' \\
x_2' \\
\vdots \\
x_n'
\end{array}
\right], \;\;
y = \left[
\begin{array}{c}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{array}
\right], \;\;
\epsilon = \left[
\begin{array}{c}
\epsilon_1 \\
\epsilon_2 \\
\vdots \\
\epsilon_n
\end{array}
\right], \;\;
\beta = \left[
\begin{array}{c}
\beta_0 \\
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_p
\end{array}
\right]
\end{equation},

then we can write the model in matrix form:

$$
y = X \beta + \epsilon.
$$
Let's check this.

```{r}
X = cbind(rep(1,nrow(cd)),as.matrix(cd[,c(2,3)]))
head(X)
```
```{r}
bhat = matrix(lmmod$coef,ncol=1)
yhatM = X %*% bhat
print(summary(yhatM - yhat))
```
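We can also compute the least squares estimates directly from the normal equations, $\hat{\beta} = (X'X)^{-1} X'y$, and check that they match the lm fit:

```{r}
y = cd$price
bhat2 = solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations for bhat
print(cbind(bhat2, lmmod$coef))        # the two sets of estimates agree
```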

