10 Regression
1 Regression to the mean
By the end of this lesson, you will be able to:
- Evaluate the question of simple regression
- Discuss the data and assumptions of simple regression
- Graph simple regression
- Perform tests and alternatives for simple regression
“The general rule is straightforward but has surprising consequences: whenever the correlation between two scores is imperfect, there will be regression to the mean.”
- Daniel Kahneman, Thinking, Fast and Slow (describing Galton's discovery)
One of the most common and powerful tools in the statistical toolbox is linear regression. The concept and basic toolset were created by Francis Galton in conjunction with his investigation of the heritable basis of resemblance between children and their parents (e.g. height).
Exemplary of one of the greatest traditions in science: a scientist identified a problem, created a tool to solve the problem, and then immediately shared the tool for the greater good. This is a slight digression from our purposes here, but you can learn more about it here:
2 The question of simple regression
The essential motivation for simple linear regression is to relate the value of a numeric variable to that of another variable. There may be several objectives to the analysis:
- Predict the value of a variable based on the value of another
- Quantify variation observed in one variable attributable to another
- Quantify the degree of change in one variable attributable to another
- Null Hypothesis Significance Testing for aspects of these relationships
2.1 A few definitions
Equation (1) is the classic linear regression model. (NB, here we make a distinction between the equation representing the statistical model, and the R formula that we will use to implement it.)

\[ y_i = \alpha + \beta x_i + \epsilon_i \tag{1} \]

- \(\alpha\) (alpha, intercept) and \(\beta\) (beta, slope) are the so-called regression parameters
- \(y\) and \(x\) are the dependent and predictor variables, respectively
- \(\epsilon\) (epsilon) represents the “residual error” (basically the error not accounted for by the model)

Equation (2) is our assumption for the residual error - Gaussian with a mean of 0 and a variance we estimate with our model.

\[ \epsilon_i \sim \mathcal{N}(0, \sigma^2) \tag{2} \]

Equation (3) is our sum of squares (SS) error for the residuals - the variance of the residuals is \(SS_{res}/(n-2)\), where \(n\) is our sample size.

\[ SS_{res} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{3} \]

Equation (4) gives \(\hat\beta\), our estimate of the slope, and Equation (5) gives \(\hat\alpha\), our estimate of the intercept.

\[ \hat\beta = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{4} \]

\[ \hat\alpha = \bar{y} - \hat\beta\,\bar{x} \tag{5} \]
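To make these estimators concrete, here is a minimal R sketch (the x and y vectors are made up purely for illustration) that computes \(\hat\beta\) and \(\hat\alpha\) by hand and checks them against lm():

# Tiny made-up example data (illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# Slope estimate: sum of cross-products over the sum of squared deviations of x
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept estimate: mean of y minus slope times mean of x
alpha_hat <- mean(y) - beta_hat * mean(x)
beta_hat; alpha_hat
# The same estimates from lm()
coef(lm(y ~ x))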
3 Data and assumptions
We will explore the simple regression model in R using the Kaggle fish market dataset.
# Download the fish data .xlsx file linked above and load it into R
# (I named my data object "fish"; for example, with the openxlsx package:
# library(openxlsx); fish <- read.xlsx("data/10-fish.xlsx"))
# Try this:
names(fish)
[1] "Species" "Weight" "Length1" "Length2" "Length3" "Height" "Width"
table(fish$Species)
Bream Parkki Perch Pike Roach Smelt Whitefish
35 11 56 17 20 14 6
# slice out the rows for Perch
fish$Species=="Perch" #just a reminder
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[97] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[109] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE
perch <- fish[fish$Species=="Perch" , ]
head(perch)
Species Weight Length1 Length2 Length3 Height Width
73 Perch 5.9 7.5 8.4 8.8 2.1120 1.4080
74 Perch 32.0 12.5 13.7 14.7 3.5280 1.9992
75 Perch 40.0 13.8 15.0 16.0 3.8240 2.4320
76 Perch 51.5 15.0 16.2 17.2 4.5924 2.6316
77 Perch 70.0 15.7 17.4 18.5 4.5880 2.9415
78 Perch 100.0 16.2 18.0 19.2 5.2224 3.3216
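A quick, optional sanity check that the subset worked as intended (the Perch count of 56 comes from the species table above):

# Sanity check: the subset should contain only the 56 Perch rows counted above
nrow(perch)
table(perch$Species)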
3.1 Assumptions
The principal assumptions of simple linear regression are:
- Linear relationship between variables
- Numeric continuous data for the dependent variable (y); numeric continuous (or numeric ordinal) data for the predictor variable (x)
- Independence of observations (we assume this for the different individual perch in our data)
- Gaussian distribution of residuals (NB this is not the same as assuming the raw data are Gaussian! We shall diagnose this)
- Homoscedasticity (this means the residual variance is approximately the same all along the x variable axis - we shall diagnose this)
4 Graphing
The traditional way to graph a simple linear regression is with a scatterplot, with the dependent variable on the y axis and the predictor variable on the x axis. The regression equation above can be used to estimate the line of best fit for the sample data, which gives the predicted value of y. Thus, prediction is one of the functions here (as in predicting the value of y for a given value of x if there were to be further data collection). This regression line is often incorporated in plots representing regression.
The simple regression function in R is lm() (for linear model). In order to estimate the line of best fit and the regression coefficients, we will make use of it.
# Try this:
# A simple regression of perch Height as the predictor variable (x)
# and Width as the dependent (y) variable
# First make a plot
plot(y = perch$Height, x = perch$Width,
ylab = "Height", xlab = "Width",
main = "My perch regression plot",
pch = 20, col = "blue", cex = 1)
# Does it look like there is a strong linear relationship?
# (it looks very strong to me)
# In order to draw on the line of best fit we must calculate the regression
# ?lm
# We usually would store the model output in an object
mylm <- lm(formula = Height ~ Width, # read y "as a function of" x
data = perch)
mylm # NB the intercept (0.30), and the slope (1.59)
Call:
lm(formula = Height ~ Width, data = perch)
Coefficients:
(Intercept) Width
0.2963 1.5942
# We use the abline() function to draw the regression line onto our plot
# NB the reg argument takes our fitted lm object
# ?abline
abline(reg = mylm) # Not bad
# Some people like to summarize the regression equation on their plot
# We can do that with the text() function
# y = intercept + slope * x
# ?text
text(x = 3, # x axis placement
y = 11, # y axis placement
labels = "y = 0.30 + (1.59) * x")
4.1 Testing the assumptions
The data scientist must take responsibility for the assumptions of their analyses, and for validating the statistical model. A basic part of Exploratory Data Analysis (EDA) is to formally test and visualize the assumptions. We will briefly do this in a few ways.
Before we begin, it is important to acknowledge that this part of the analysis is subjective and subtle, which is to say that it is hard to perform without practice. As much as we might wish that Null Hypothesis Significance Testing were totally objective, it is not, and the practice of data analysis requires experience.
Here, we will specifically test two of the assumptions mentioned above: Gaussian residual distribution and homoscedasticity. We will examine both graphically, and additionally we will formally test the assumption of Gaussian residuals.
To start with, let’s explicitly visualize the residuals. This is a step that might be unusual for a standard exploration of regression assumptions, but for our purposes here it will serve to be explicit about what the residuals actually are.
## Test assumptions ####
# Try this:
# Test Gaussian residuals
# Make our plot and regression line again
plot(y = perch$Height, x = perch$Width,
ylab = "Height", xlab = "Width",
main = "My perch RESIDUAL plot",
pch = 20, col = "blue", cex = 1)
abline(reg = mylm)
# We can actually "draw on" the magnitude of residuals
arrows(x0 = perch$Width,
x1 = perch$Width,
y0 = predict(mylm), # start residual line on PREDICTED values
y1 = predict(mylm) + residuals(mylm), # length of residual
length = 0) # makes arrowhead length zero (or it looks weird here)
Note the residuals are perpendicular to the x-axis. This is because residuals represent the DEVIATION of each OBSERVED y from the PREDICTED y for a GIVEN x.
The Gaussian assumption is that, relative to the regression line, the residual values should be, well, Gaussian (with a mean of 0 and a variance we estimate)! There should be more points with small residuals close to the regression line, and fewer residuals farther away.
4.2 Closer look at the residual distribution
Remember how we visually examine distributions? With a frequency histogram and possibly a q-q plot, right? Here we will do those for a peek, but we will also add a formal, objective test of deviation from normality. This part of exploratory data analysis is subtle and requires experience (i.e. it is hard), and there are many approaches. Our methods here are a starting point.
# The qqPlot() function comes from the car package
library(car)
Loading required package: carData
par(mfrow = c(1,2)) # Print graphs into 1x2 grid (row,column)
hist(residuals(mylm), main = "")
qqPlot(residuals(mylm))
124 118 
 52  46 
(qqPlot() labels the two most extreme points and returns their indices: here rows 124 and 118 of fish, which are observations 52 and 46 within the perch subset)
4.3 Diagnosis - take 1
- The histogram is “shaped a little funny” for Gaussian
- Slightly too many points in the middle, slightly too few between the mean and the extremes in the histogram
- Very slight right skew in the histogram
- Most points are very close to the line on the q-q plot, but there are a few at the extremes that veer off
- Two points are tagged as outliers a little outside the error boundaries on the q-q plot (rows 118 and 124, larger than expected observations)
4.4 Diagnosis - take 2
It is your job as a data scientist to be skeptical of data, assumptions, and conclusions. Do not pussyfoot around this.
It is not good enough to merely make these diagnostic graphs robotically; the whole point is to judge whether the assumptions have been violated. This is important (and remember, hard) because if the assumptions are not met, it is unlikely that the resulting statistical model is valid. Here, we can look a little closer at the histogram and the expected Gaussian distribution, and we can also perform a formal statistical test to help us decide.
## Gussie up the histogram ####
# Make a new histogram
hist(residuals(mylm),
xlim = c(-2, 2), ylim = c(0,.9),
main = "",
prob = T) # We want probability density this time (not frequency)
# Add a density line to just help visualize "where the data are"
lines( # lines() function
density(residuals(mylm)), # density() function
col = "green4", lty = 1, lwd = 3) # Mere vanity
# Make x points for theoretical Gaussian
x <- seq(-1,+1,by=0.02)
# Draw on theoretical Gaussian for our residual parameters
curve(dnorm(x, mean = mean(residuals(mylm)),
sd = sd(residuals(mylm))),
add = T,
col = "blue", lty = 3, lwd = 3) # mere vanity
# Draw on expected mean
abline(v = 0, # vertical line at the EXPECTED resid. mean = 0
col = "red", lty = 2, lwd = 3) # mere vanity
# Add legend
legend(x = .6, y = .9,
legend = c("Our residuals", "Gaussian", "Mean"),
lty = c(1,3,2),
col = c("green4", "blue","red"), lwd = c(3,3,3))
Diagnosis
- Near the mean, our residual density is slightly higher than expected under the theoretical Gaussian
- Between -0.5 and -1 and also between +0.5 and +1 our residual density is lower than expected under the theoretical Gaussian
- Overall the differences are not very extreme
- The distribution is mostly symmetrical around the mean
Finally, let’s perform a statistical test of whether there is evidence our residuals deviate from Gaussian. There are a lot of options for this, but we will only consider one here for illustration, in the interest of brevity. We will (somewhat arbitrarily) use the Shapiro-Wilk test for Gaussian.
Side note: Tests like this are a bit atypical within the NHST framework, in that usually when we perform a statistical test, we have a hypothesis WE BELIEVE TO BE TRUE that there is a difference (say between the regression slope and zero, or maybe between 2 means for a different test). In this typical case we are testing against the null of NO DIFFERENCE. When we perform such a test and examine the p-value, we compare the p-value to our alpha value.
The tyranny of the p-value
The rule we traditionally use is that we reject the null of no difference if our calculated p-value is lower than our chosen alpha (usually 0.05**). When testing assumptions like the one here, where we believe (and hope) the null of no difference is true, we still typically use the 0.05 alpha threshold. In this case, when p > 0.05, we can take it as a lack of evidence that there is a difference. NB this is slightly different from constituting EVIDENCE that there is NO DIFFERENCE!
**The good old p-value is sometimes misinterpreted, or relied on “too heavily”. Read more about this important idea in Altman and Krzywinski 2017.
## Shapiro test ####
# Try this:
shapiro.test(residuals(mylm))
(R output: Shapiro-Wilk normality test on residuals(mylm); the W and p values are reported below)
Reporting the test of assumptions
Reporting the evidence that the assumptions underlying statistical tests have been checked and are "OK" is often understated, even though it is a very important part of the practice of statistics. Based on the results of our Shapiro-Wilk test, we might report our findings in this way (in a Methods section), prior to reporting the results of our regression (in the Results section):
We found no evidence our assumption of Gaussian residual distribution was violated (Shapiro-Wilk: W = 0.97, n = 56, p = 0.14)
Diagnostic plots and heteroscedasticity
Despite being challenging to pronounce and spell (help pronouncing it here; strong opinion about spelling it here), the concept of heteroscedasticity is simple - it describes residual variance that is not constant across the predicted values. The assumption we want to hold is the opposite, homoscedasticity: constant residual variance. We usually examine this visually, which is easy to do in R.
## Heteroscedasticity ####
# Try this:
plot(y = residuals(mylm), x = fitted(mylm),
pch = 16, cex = .8)
# There is a lot hidden inside our regression object
summary(mylm)$sigma # Voila: The residual standard error
(uci <- summary(mylm)$sigma*1.96) # upper 95% interval (+1.96 residual SE)
(lci <- -summary(mylm)$sigma*1.96) # lower 95% interval (-1.96 residual SE)
# Add lines for mean and upper and lower 95% CI
abline(h = c(0, uci, lci),
lwd = c(2,2,2),
lty = c(2,3,3),
col = c("blue", "red", "red"))
What we are looking for in this graph, ideally, is an even spread of residuals across the x-axis representing our fitted values. Remember, the x axis here represents the fitted values (the predicted perch Height, which increases with Width), and each data point is the residual for a single observed perch Height. The blue reference line at zero corresponds to the predicted Height for each value of Width. The difference between each data point and this horizontal line is the residual difference, or residual error.
We are also looking for an absence of any systematic pattern in the data, that might suggest a lack of independence.
We see:
- There is not a perfect spread of residual variation across the whole length of the fitted values. Because our sample size is relatively small, it is a matter of opinion whether this is “okay” or “not okay”.
- There seem to be two groupings of values along the x-axis. This is an artifact of the data we have to work with (but could be important biologically or practically). For each of these groups, the residual spread appears similar.
- The left hand side of the graph appears to have very low residual variance, but there are only a few data points there, and we expect most of the points to be near the prediction anyway.
All things considered, one might be inclined to proceed, concluding there is no strong evidence of heteroscedasticity.
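For those who would like a formal check to accompany the visual one, a Breusch-Pagan test is one option. This is an optional aside, and it assumes the add-on lmtest package is installed (it is not used elsewhere in this lesson):

# Optional: formal test of heteroscedasticity (requires the lmtest package)
# install.packages("lmtest") # if not already installed
library(lmtest)
bptest(mylm) # null hypothesis: constant residual variance (homoscedasticity)
# As with the Shapiro-Wilk test, p > 0.05 is a lack of evidence of a violation
# Base R's plot(mylm) also draws a standard panel of regression diagnostic plots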
5 Test and alternatives
You have examined your data and tested the assumptions of simple linear regression, and are happy to proceed. Let’s look at the main results of the regression.
## Regression results ####
# Try this:
# Full results summary
summary(mylm)
Call:
lm(formula = Height ~ Width, data = perch)
Residuals:
Min 1Q Median 3Q Max
-1.23570 -0.28886 -0.02948 0.27910 1.55439
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.29630 0.20543 1.442 0.155
Width 1.59419 0.04059 39.276 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5342 on 54 degrees of freedom
Multiple R-squared: 0.9662, Adjusted R-squared: 0.9656
F-statistic: 1543 on 1 and 54 DF, p-value: < 2.2e-16
This full results summary is important to understand (NB the summary() function will produce different output depending on the class() and kind of object passed to it).
Call This is the R formula representing the simple regression statistical model.
Residuals These are summary statistics of the residuals. Nice, but typically we would go beyond this in our EDA, like we did above.
Coefficients In “ANOVA” table format. This has the estimate and standard error of each regression coefficient: the intercept (Intercept) and the slope for the predictor variable Width. Here, the y-intercept coefficient is 0.30 and the slope is 1.59. The P-values in simple regression are associated with the parameter estimates (i.e., are they different to zero?). If a P-value is extremely small, standard R output converts it to scientific notation. P-values are reported in the column called Pr(>|t|). The intercept P-value is 0.16 (greater than alpha = 0.05, so we conclude there is no evidence the intercept differs from 0). The slope P-value is output as <2e-16, which is 0.00..<11 more zeros>..002. We would typically report P-values less than 0.0001 as P < 0.0001.
Multiple R-squared The simple regression test statistic is typically reported as the R-squared value, which can be interpreted as the proportion of variance in the dependent variable explained by our model. This is very high for our model, 0.97 (i.e. 97% of the variation in perch Height is explained by perch Width).
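If you prefer to pull these quantities out programmatically rather than reading them off the printed summary, a short sketch using only base R accessors might look like this:

# Extract pieces of the fitted model and its summary
coef(mylm) # intercept and slope estimates
confint(mylm) # 95% confidence intervals for the coefficients
summary(mylm)$coefficients # estimates, SEs, t values, and P-values as a matrix
summary(mylm)$r.squared # multiple R-squared
summary(mylm)$fstatistic # F statistic and its degrees of freedom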
6 Reporting results
A typical way to report results for our regression model might be:
We found a significant linear relationship for Width predicting Height in perch (regression: R-squared = 0.97, df = 1,54, P < 0.0001).
Of course, this would be accompanied by an appropriate graph if important and relevant in the context of other results.
As usual, reporting copied and pasted results that have not been summarized appropriately is regarded as very poor practice, even for beginning students.
6.1 Alternatives to regression
There are actually a large number of alternatives to simple linear regression in case our data do not conform to the assumptions. Some of these are quite advanced and beyond the scope of this Bootcamp (like weighted regression, or specifically modelling the variance in some way). The most reasonable solutions to try first would be data transformation, or, if it were adequate merely to demonstrate a relationship between the variables, Spearman rank correlation. A final alternative, of intermediate difficulty, would be nonparametric regression, as implemented in Kendall-Theil-Siegel nonparametric regression.
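As a brief sketch of the two "try first" options, applied to the perch data from above (nothing here is required; it just shows where you might start):

# Option 1: Spearman rank correlation (only demonstrates a monotonic relationship)
cor.test(perch$Height, perch$Width, method = "spearman")
# Option 2: transform the data and refit the linear model
# (a log-log transformation is a common first choice for morphological measurements)
mylm_log <- lm(log(Height) ~ log(Width), data = perch)
summary(mylm_log)
# Re-check the residual diagnostics on the transformed model before relying on it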
7 Practice exercises
For the following exercises, use the dataset in the file 10-fish.xlsx. The dataset is in tidy format. Take advantage of this by looking at the terse information present in the data dictionary tab. The data are measurements of fish from a fish market, for a range of species. There will be some data handling as part of the exercises below, a practical and important part of every real data analysis.
7.1 Loading and Exploring the Fish Dataset
Read in the data from the file 10-fish.xlsx. Use the read.xlsx() function from the openxlsx package. Print the first few rows of the data, and check the structure of the data. Show your code.
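A minimal solution sketch (assuming the file has been saved to data/10-fish.xlsx, as in the later solutions) might be:

# Load the openxlsx package and read in the data
library(openxlsx)
fish <- read.xlsx('data/10-fish.xlsx')
# Print the first few rows
head(fish)
# Check the structure (variable names, types, and dimensions)
str(fish)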
7.2 Predicting Weight from Length
Create a linear model to predict fish weight from Length1 using the lm() function. Print the summary of the model, and interpret the coefficients. Show your code.
# Load the dataset
library(openxlsx)
fish <- read.xlsx('data/10-fish.xlsx')
# Create a linear model to predict Weight from Length1
weight_model <- lm(Weight ~ Length1, data = fish)
# Print the summary of the model
summary(weight_model)
# Create a scatter plot with regression line
plot(fish$Length1, fish$Weight,
xlab = "Length1 (cm)",
ylab = "Weight (g)",
main = "Fish Weight vs Length1",
pch = 20)
# Add the regression line
abline(weight_model, col = "red", lwd = 2)
# Add the regression equation to the plot
coef_values <- coef(weight_model)
eq_text <- sprintf("Weight = %.2f + %.2f × Length1", coef_values[1], coef_values[2])
text(20, 800, eq_text, pos = 4)
The linear model predicts fish weight based on Length1. The intercept represents the estimated weight when Length1 is zero (which is not biologically meaningful in this context). The slope coefficient indicates how much the weight increases for each one-unit (cm) increase in Length1.
The summary output provides:
- Coefficient estimates and their significance
- R-squared value showing how much variance in weight is explained by length
- F-statistic and p-value for the overall model significance
The relationship between fish length and weight is typically non-linear (following a cubic relationship), so a linear model may not be the best fit. However, it provides a useful starting point for understanding the relationship.
7.3 Assessing Model Assumptions
Assess whether the assumptions of linear regression are met for the model in the previous question. Use diagnostic plots and formal tests. Show your code and interpret the results.
# Load the dataset
library(openxlsx)
fish <- read.xlsx('data/10-fish.xlsx')
# Create the model
weight_model <- lm(Weight ~ Length1, data = fish)
# 1. Check for linearity and homoscedasticity using residual plots
plot(weight_model$fitted.values, weight_model$residuals,
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals vs Fitted Values",
pch = 20)
abline(h = 0, col = "red", lty = 2)
# 2. Check for normality of residuals
# Histogram
hist(weight_model$residuals,
main = "Histogram of Residuals",
xlab = "Residuals",
breaks = 20)
# Q-Q plot
qqnorm(weight_model$residuals)
qqline(weight_model$residuals, col = "red")
# Shapiro-Wilk test for normality
shapiro.test(weight_model$residuals)
# 3. Check for influential observations
plot(cooks.distance(weight_model),
type = "h",
main = "Cook's Distance",
ylab = "Cook's Distance",
xlab = "Observation Number")
abline(h = 4/length(weight_model$residuals), col = "red", lty = 2)
The diagnostic plots and tests reveal several issues with the linear model:
Linearity and Homoscedasticity: The residuals vs. fitted values plot shows a clear pattern (fan shape), indicating that the relationship between Length1 and Weight is not linear and the variance is not constant (heteroscedasticity).
Normality of Residuals: The histogram and Q-Q plot show that residuals are not normally distributed. The Shapiro-Wilk test confirms this with a p-value < 0.05, indicating significant deviation from normality.
Influential Observations: The Cook’s distance plot identifies several influential points that may be disproportionately affecting the model.
These violations suggest that a simple linear model is not appropriate for this data. Possible solutions include:
- Transforming the variables (e.g., log transformation)
- Using a non-linear model
- Considering separate models for different species of fish
7.4 Transforming Non-Linear Relationships
Plot perch$Weight ~ perch$Length2. The relationship is obviously not linear but curved. Devise and execute a solution to enable the use of linear regression, possibly by transforming the data. Show any relevant code and briefly explain your results and conclusions.
# Load the dataset
library(openxlsx)
fish <- read.xlsx('data/10-fish.xlsx')
# Extract perch data
perch <- fish[fish$Species == "Perch", ]
# Plot the original relationship
plot(perch$Length2, perch$Weight,
xlab = "Length2 (cm)",
ylab = "Weight (g)",
main = "Perch Weight vs Length2",
pch = 20)
# The relationship looks cubic (Weight ~ Length^3), which makes biological sense
# Try log transformation of both variables
# Log transform both variables
log_weight <- log(perch$Weight)
log_length <- log(perch$Length2)
# Plot the transformed relationship
plot(log_length, log_weight,
xlab = "log(Length2)",
ylab = "log(Weight)",
main = "Log-Transformed: Perch log(Weight) vs log(Length2)",
pch = 20)
# Fit a linear model to the transformed data
log_model <- lm(log_weight ~ log_length)
summary(log_model)
# Add regression line to the transformed plot
abline(log_model, col = "red", lwd = 2)
# Check the model assumptions on the transformed data
# Residual plot
plot(log_model$fitted.values, log_model$residuals,
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals vs Fitted (Log-Transformed Model)",
pch = 20)
abline(h = 0, col = "red", lty = 2)
# Q-Q plot for normality
qqnorm(log_model$residuals)
qqline(log_model$residuals, col = "red")
# Shapiro-Wilk test
shapiro.test(log_model$residuals)
The relationship between fish weight and length is typically described by the allometric growth equation: Weight = a × Length^b
Where ‘b’ is often close to 3 (cubic relationship), reflecting that as a fish grows in length, its weight increases proportionally to its volume.
By log-transforming both variables, we convert this non-linear relationship to a linear one: log(Weight) = log(a) + b × log(Length)
The transformed model shows:
1. A much more linear relationship (as seen in the scatter plot)
2. More evenly distributed residuals (homoscedasticity)
3. Residuals that are closer to normally distributed
The slope coefficient (b) in the log-transformed model is approximately 3, confirming the cubic relationship between length and weight. This is consistent with biological expectations: as fish grow, their weight increases proportionally to their volume (length³).
This transformation demonstrates how non-linear relationships can often be linearized through appropriate transformations, allowing us to use linear regression techniques.
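As an optional follow-up (not part of the original answer), one quick way to check how close the estimated exponent is to 3 is to look at its confidence interval:

# Confidence interval for the slope (the allometric exponent b)
confint(log_model)
# If the interval for log_length contains 3, the data are consistent with
# the expected cubic weight-length relationship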
7.5 Exploring Morphological Covariance
Explore the data for perch and describe the covariance of all of the morphological, numeric variables using all relevant means, while being as concise as possible. Show your code.
# Load the dataset
library(openxlsx)
fish <- read.xlsx('data/10-fish.xlsx')
# Extract perch data
perch <- fish[fish$Species == "Perch", ]
# Remove non-numeric columns for correlation analysis
perch_numeric <- perch[, c("Weight", "Length1", "Length2", "Length3", "Height", "Width")]
# Calculate correlation matrix
cor_matrix <- cor(perch_numeric)
round(cor_matrix, 3)
# Calculate covariance matrix
cov_matrix <- cov(perch_numeric)
round(cov_matrix, 3)
# Create a pairs plot to visualize relationships
pairs(perch_numeric,
main = "Relationships Between Perch Morphological Variables",
pch = 20,
col = "blue")
# Calculate summary statistics
summary(perch_numeric)
# Calculate coefficient of variation (CV) for each variable
cv <- function(x) {sd(x) / mean(x) * 100}
sapply(perch_numeric, cv)
The analysis of perch morphological variables reveals:
Strong Correlations: All morphological variables are highly positively correlated (r > 0.9), indicating that as one dimension increases, the others increase proportionally. This is expected in fish morphology where growth tends to be relatively proportional.
Length Measurements: The three length measurements (Length1, Length2, Length3) show extremely high correlations (r > 0.99), suggesting they capture essentially the same information about fish size.
Weight Relationships: Weight shows strong but slightly lower correlations with linear measurements, reflecting the non-linear (approximately cubic) relationship between length and weight.
Variability: The coefficient of variation (CV) is highest for Weight, indicating greater relative variability in this measurement compared to the linear dimensions.
Covariance Structure: The covariance matrix shows that Weight has the largest absolute covariance with other variables, which is expected given its different scale and cubic relationship with linear dimensions.
This exploration confirms that perch exhibit typical fish allometry, with proportional growth across different body dimensions and a predictable weight-length relationship. For modeling purposes, we could likely use any single length measurement to represent fish size, as they are highly redundant.
7.6 Create Your Own Regression Question
Write a plausible practice question involving the exploration or analysis of regression. Make use of the fish data from any species except for Perch.
Practice Question:
“Bream fish are commonly studied in fisheries research. Using the fish market dataset, investigate whether Height or Width is a better predictor of Weight for Bream. Specifically:
- Create separate simple linear regression models for Weight ~ Height and Weight ~ Width
- Compare the models using R-squared values and residual analysis
- Test whether combining both predictors in a multiple regression model (Weight ~ Height + Width) significantly improves prediction accuracy
- Create appropriate visualizations to support your findings
- Based on your analysis, which measurement(s) would you recommend fisheries researchers use to estimate Bream weight in the field?”
Solution Code:
# Extract Bream data
bream <- fish[fish$Species == "Bream", ]
# Create the two simple regression models
height_model <- lm(Weight ~ Height, data = bream)
width_model <- lm(Weight ~ Width, data = bream)
# Compare model summaries
summary(height_model)
summary(width_model)
# Create a multiple regression model with both predictors
combined_model <- lm(Weight ~ Height + Width, data = bream)
summary(combined_model)
# ANOVA to formally compare the models
anova(height_model, combined_model)
anova(width_model, combined_model)
# Create visualizations
par(mfrow = c(2, 2))
# Plot 1: Weight vs Height with regression line
plot(bream$Height, bream$Weight,
xlab = "Height (cm)",
ylab = "Weight (g)",
main = "Bream Weight vs Height",
pch = 20)
abline(height_model, col = "blue", lwd = 2)
# Plot 2: Weight vs Width with regression line
plot(bream$Width, bream$Weight,
xlab = "Width (cm)",
ylab = "Weight (g)",
main = "Bream Weight vs Width",
pch = 20)
abline(width_model, col = "red", lwd = 2)
# Plot 3: Residuals of Height model
plot(fitted(height_model), residuals(height_model),
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals: Height Model",
pch = 20)
abline(h = 0, col = "blue", lty = 2)
# Plot 4: Residuals of Width model
plot(fitted(width_model), residuals(width_model),
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals: Width Model",
pch = 20)
abline(h = 0, col = "red", lty = 2)
par(mfrow = c(1, 1))
# Create a 3D visualization of the multiple regression model
# (Note: In base R, we can approximate this with a custom function)
# Create a grid of Height and Width values
height_range <- seq(min(bream$Height), max(bream$Height), length.out = 10)
width_range <- seq(min(bream$Width), max(bream$Width), length.out = 10)
pred_grid <- expand.grid(Height = height_range, Width = width_range)
# Predict Weight for each combination
pred_grid$Weight <- predict(combined_model, newdata = pred_grid)
# Create a simple contour plot
contour(x = height_range,
y = width_range,
z = matrix(pred_grid$Weight, nrow = length(height_range)),
xlab = "Height (cm)",
ylab = "Width (cm)",
main = "Predicted Weight Contours (g)",
levels = seq(0, 1000, by = 100))
# Add the actual data points
points(bream$Height, bream$Width, pch = 20)
The analysis would reveal:
1. Both Height and Width are strong predictors of Bream weight
2. The R-squared values would show which single measurement explains more variance in weight
3. The multiple regression would likely show that using both measurements improves prediction accuracy
4. The residual plots would reveal whether the models meet regression assumptions
5. The recommendation would be based on both statistical performance and practical field measurement considerations