09 Correlation

The data were formless like a cloud of tiny birds in the sky

1 Statistical relationships

Learning Objectives

By the end of this lesson, you will be able to:

Analyze correlation between two variables
Evaluate correlation data and assumptions
Graph correlated variables
Perform correlation statistical test and alternatives

Most of you will have heard the maxim “correlation does not imply causation”. Just because two variables have a statistical relationship with each other does not mean that one is responsible for the other. For instance, ice cream sales and forest fires are correlated because both occur more often in the summer heat. But there is no causation; you don’t light a patch of the Montana brush on fire when you buy a pint of Haagan-Dazs. - Nate Silver

Correlation is used in many applications and is a basic tool in the data science toolbox. We use it to to describe, and sometimes to test, the relationship between two numeric variables, principally to ask whether there is or is not a relationship between the variables (and if there is, whether they tend to vary positively (both tend to increase in value) or negatively (one tends to decrease as the other increases).

2 The question of correlation

The question of correlation is simply whether there is a demonstrable association between two numeric variables. For example, we might wonder this for two numeric variables, such as a measure of vegetation biomass and a measure of insect abundance. Digging a little deeper, we are interested in seeing and quantifying whether and how the variables may “co-vary” (i.e, exhibit significant covariance). This covariance may be quantified as being either positive or negative and may vary in strength from zero, to perfect positive or negative correlation (+1, or -1). We traditionally visualise correlation with a scatterplot (plot() in R). If there is a relationship, the degree of “scatter” is related to strength of the correlation (more scatter = a weaker correlation).

E.g., we can see in the figure below that the two variables tend to increase with each other, suggesting a positive correlation.

# Try this:

# Flash challenge
# Take the data input below and try to exactly recreate the figure above!

veg <- c(101.7, 101.2, 97.1, 92.4, 91, 99.4, 104.2, 115.9, 91.9, 101.4, 
93.5, 87.2, 89.2, 92.8, 103.1, 116.4, 95.2, 80.9, 94.9, 88.8, 
108.2, 86.1, 104.1, 101.5, 116.9, 109.6, 103.7, 83.9, 85.9, 88.5, 
98.9, 98.8, 107.8, 86.5, 92.6, 76, 95.2, 105.3, 103.1, 89.3, 
100.1, 103.1, 87.7, 92.4, 91.5, 105.4, 105.7, 90.5, 105.6, 101.6, 
97.4, 93.4, 88.7, 81.1, 100.9, 91.6, 102.4, 92.8, 92, 97.1, 91.1, 
97.3, 104, 99, 101.5, 112.8, 82.4, 84.9, 116.3, 92.2, 106.2, 
94.2, 89.6, 108.8, 106.2, 91, 95.5, 99.1, 111.6, 124.1, 100.8, 
117.6, 118.6, 115.8, 102.2, 107.7, 105, 86.7, 99, 101.8, 106.3, 
100.3, 86.6, 106.4, 92.6, 108.2, 100.5, 100.9, 116.4)

arth <- c(1002, 1006, 930, 893, 963, 998, 1071, 1052, 997, 1044, 923, 
988, 1022, 975, 1022, 1050, 929, 928, 1019, 957, 1054, 850, 1084, 
995, 1065, 1039, 1009, 945, 995, 967, 916, 998, 988, 956, 975, 
910, 954, 1044, 1063, 948, 966, 1037, 976, 979, 969, 1009, 1076, 
943, 1024, 1071, 969, 963, 1020, 936, 1004, 961, 1089, 953, 1037, 
962, 977, 958, 944, 933, 970, 1036, 960, 912, 978, 967, 1035, 
959, 831, 1016, 901, 1010, 1072, 1019, 996, 1122, 1029, 1047, 
1132, 996, 979, 994, 970, 976, 997, 950, 1002, 1003, 982, 1071, 
959, 976, 1011, 1032, 1024)

3 Data and assumptions

3.1 Pearson correlation

“Traditional correlation” is sometimes referred to as the Pearson correlation. The data and assumptions are important for the Pearson correlation - we use this when there is a linear relationship between our variables of interest, and the numeric values are Gaussian distributed.

Technically, the Pearson correlation coefficient can be calculated by taking the covariance of two variables and dividing it by the product of their standard deviations:

{width = “500px”}

The correlation coefficient can be calculated in R using the cor() function.

# Try this:
# use veg and arth from above

# The Pearson correlation = ((covariance of x,y) / (std dev x * std dev y) )

# r the "hard way"

# sample covariance (hard way)
(cov_veg_arth <- sum( (veg-mean(veg))*(arth-mean(arth))) / (length(veg) - 1 ))

cov(veg,arth) # (easy way)

# r the "hard way"
(r_arth_veg <- cov_veg_arth / (sd(veg) * sd(arth)))

# r the easy way
help(cor)
cor(x = veg, y = arth,
    method = "pearson") # NB "pearson" is the default method if unspecified

4 Graphing

We can look at a range of different correlation magnitudes, to think about diagnosing correlation.

If we want to examine the correlation between two specific variables, traditionally we would use the scatterplot, like above with the veg and arth data, and calculate the correlation coefficient (using plot() and cor(), respectively).

On the other hand, we might have loads of variables and just want to quickly assess the degree of correlation between them all. To do this we might just make and print a matric of the correlation plot (using pairs()) and the correlation matrix (again using cor())

## Correlation matrices ####
# Try this:

# Use the iris data to look at correlation matrix 
# of flower measures
data(iris)
names(iris)

[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

cor(iris[ , 1:4]) # all rows, just the numeric columns

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

# fix the decimal output
round(cor(iris[ , 1:4]), 2) # rounding is nicer

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00

# pairs plot
pairs(iris[ , 1:4], pch = 16, 
      col = iris$Species) # Set color to species...

In this case, we can see that the correlation of plant parts is very strongly influenced by species! For further exploration we would definitely want to explore and address that, thus here we have shown how powerful the combination of statistical summary and graphing can be.

5 Test and alternatives

We may want to perform a statistical test to determine whether a correlation coefficient is “significantly” different to zero, using Null Hypothesis Significance Testing. There are a lot of options, but a simple solution is to use the cor.test() function in R.

The Pearson correlation is the default for the cor.test() function. An alternative is the Spearman rank correlation, which can be used when the assumptions for the Pearson correlation are not met. We will briefly perform both tests.

See here for further information on Pearson correlation

See here for further information on Spearman rank correlation

Briefly, the principle assumptions of the Pearson correlation are:

The relationship between the variables is linear (i.e., not a curved relationship)
The variables exhibit a bivariate Gaussian distribution (in practice, this assumption is often assessed by looking at the distribution of each variable and checking if it is Gaussian)
Homoscedasticity (i.e., the variance is similar across the range of the variables)
Observations are independent
Absence of outliers

Discussion of assumptions

Further practical information about correlation

Let’s try a statistical test of correlation. We need to keep in mind an order of operation for statistical analysis, e.g.:

1 Question

2 Graph

3 Test

4 Validate (e.g. test assumptions)

## Correlation test ####
# Try this:

# 1 Question: Are Petal Length and Petal Width correlated?
# (i.e. is there evidence the correlation coefficient is different to zero)

# 2 Graph
plot(iris$Petal.Width, iris$Petal.Length,
     xlab = "Petal width",
     ylab = "Petal length",
     col = iris$Species, pch = 16)

# 3 Test
cor.test(iris$Petal.Width, iris$Petal.Length)


    Pearson's product-moment correlation

data:  iris$Petal.Width and iris$Petal.Length
t = 43.387, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9490525 0.9729853
sample estimates:
      cor 
0.9628654

# 4 Validate
# - Linear relationship between variables
# Examine plot - looks linear? - OK
plot(iris$Petal.Width, iris$Petal.Length)

# - Gaussian distribution
# Examine histograms - looks Gaussian? - not really
hist(iris$Petal.Length)

hist(iris$Petal.Width)

# - the variance is similar for each variable
# Examine variance - looks similar? - the variances are a bit different
boxplot(iris$Petal.Length, iris$Petal.Width)

# - Observations are independent
# Because there are several observations per species - not really

# Conclusion: We might violate some assumption for the Pearson correlation, so consider Spearman correlation

# For the sake of this example we will run a Pearson correlation test anyway
cor.test(iris$Petal.Length, iris$Petal.Width)


    Pearson's product-moment correlation

data:  iris$Petal.Length and iris$Petal.Width
t = 43.387, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9490525 0.9729853
sample estimates:
      cor 
0.9628654

6 Note about results and reporting

When we report or document the results of an analysis we may do it in different ways depending on the intended audience. We will briefly look at two common ways to report results.

6.1 Documenting results only for ourselves (almost never)

It would be typical to just comment the R script and have your script and data file in fully reproducible format. This format would also work for close colleague (who also use R), or collaborators, to use to share or develop the analysis. Even in this format, care should be taken to

clean up redundant or unnecessary code
to organize the script as much as possible in a logical way, e.g. by pairing graphs with relevant analysis outputs
removing analyses that were tried but were redundant (e.g. because a better analysis was decided)

6.2 Documenting results to summarize to others (most common)

Here it would be typical to summarize and format output results and figures in a way that is most easy to consume for the intended audience.

You should NEVER PRESENT RAW COPIED AND PASTED STATISTICAL RESULTS (O.M.G.!).

- Ed Harris

A good way to do this would be to use R Markdown to format your script to produce an attractive summary, in html, MS Word, or pdf formats (amongst others).

Second best (primarily because it is more work and harder to edit or update) would be to format results in word processed document (like pdf, Word, etc.). This is the way most scientists tend to work.

6.3 Statistical summary

Most statistical tests under NHST will have exactly three quantities reported for each test:

the test statistic (different for each test, r (“little ‘r’”) for the Pearson correlation)
the sample size or degrees of freedom
the p-value

For our results above, it might be something like:

We found a significant correlation between petal width and length (Pearson’s r = 0.96, df = 148, P < 0.0001)

NB:

the rounding of decimal accuracy is typically reported to two decimals accuracy (if not then be consistent)
reporting a p-value smaller than 0.0001 should ALWAYS use P < 0.0001 (never report the P value in scientific notation)

The 95% confidence interval of the estimate of r is also produced (remember, we are making an inference on the greater population from which our sample was drawn), and we might also report that in our descriptive summary of results, if it is deemed important.

6.4 Flash challenge

Write a script following the steps to question, graph, test, and validate for each iris species separately. Do these test validate? How does the estimate of correlation compare amongst the species?

6.5 Correlation alternatives to Pearson’s

There are several alternative correlation estimate frameworks to the Pearson’s; we will briefly look at the application of the Spearman correlation. The main difference is a relaxation of assumptions. Here the main assumptions are that:

The data are ranked or otherwise ordered
The data rows are independent

Hence, the Spearman rank correlation descriptive coefficient and test are appropriate even if the data are not or if the data are not strictly linear in their relationship.

# Spearman rank correlation ####

# Try this:

height <- c(18.4, 3.2, 6.4, 3.6, 13.2, 3.6, 15, 5.9, 13.5, 10.8, 9.7, 10.9, 
18.8, 11.6, 9.8, 14.4, 19.6, 14.7, 10, 16.8, 18.2, 16.9, 19.8, 
18.2, 2.6, 18.8, 8.4, 18.9, 9.5, 19)

vol <- c(16.7, 17.8, 17.9, 21.1, 21.2, 21.4, 23.2, 25.7, 26.1, 26.4, 
26.8, 27, 27.8, 28.2, 28.5, 28.6, 28.7, 29.1, 30.2, 31, 32.3, 
32.3, 32.5, 33.2, 33.5, 33.8, 35.2, 36.1, 36.6, 39.5)

plot(height, vol,
     xlab = "Height",
     ylab = "Volume",
     pch = 16, col = "blue", cex = .7)

# Spearman coefficient
cor(height, vol, method = "spearman")

[1] 0.3945026

# Spearman test
cor.test(height, vol, method = "spearman")

Warning in cor.test.default(height, vol, method = "spearman"): Cannot compute
exact p-value with ties


    Spearman's rank correlation rho

data:  height and vol
S = 2721.7, p-value = 0.03098
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.3945026

NB the correlation coefficient for the Spearman rank correlation is notated as “rho” (the Greek letter \(\rho\) - sometimes avoided by non-math folks because it looks like the letter “p”), reported and interpreted exactly like the Pearson correlation coefficient, r.

7 Practice exercises

For the following exercises, use the waders dataset available in the MASS package. The data are from counts of different wading bird species in summer in South Africa. There will be some data handling as part of the exercises below, a practical and important part of every real data analysis.

7.1 Exploring the `waders` dataset

Read in the data from the file waders dataset available in the MASS package. Print the first few rows of the data, and check the correlation structure of the data for the first few columns. Show your code.

Waders Data Intercorrelation Analysis

# Load/install the MASS package and waders dataset if not already loaded
# install.packages("MASS")
library(MASS)
data(waders)

# Read the help page for the dataset
# help(waders)

# Print the first few rows of the data
head(waders)

   S1   S2 S3 S4   S5  S6   S7  S8 S9 S10 S11  S12  S13   S14  S15   S16 S17
A  12 2027  0  0 2070  39  219 153  0  15  51 8336 2031 14941   19  3566   0
B  99 2112  9 87 3481 470 2063  28 17 145  31 1515 1917 17321 3378 20164 177
C 197  160  0  4  126  17    1  32  0   2   9  477    1   548   13   273   0
D   0   17  0  3   50   6    4   7  0   1   2   16    0     0    3    69   1
E  77 1948  0 19  310   1    1  64  0  22  81 2792  221  7422   10  4519  12
F  19  203 48 45   20 433    0   0 11 167  12    1    0    26 1790  2916 473
   S18 S19
A    5   0
B 1759  53
C    0   0
D    0   0
E    0   0
F  658  55

# Use the pairs function to visualize correlations for the first few columns
pairs(x=waders[,1:3], pch = 16, col = "blue")

# Calculate the correlation matrix for a numerical summary
round(cor(waders[,1:3]), 2)

      S1    S2    S3
S1  1.00  0.61 -0.17
S2  0.61  1.00 -0.21
S3 -0.17 -0.21  1.00

Based on the pairs plot and correlation matrix:

There appear to be moderate to a positive correlation between some wader species, indicating that sites that have high numbers of one species tend to have high numbers of other species as well.
Particularly strong correlations (r =~ 0.61) exist between a pair of species, S1 and S2.
Some species show weaker correlations with others (e.g., S1 and S3 has a low negative correlation), possibly suggesting some habitat specialization.
The overall pattern suggests that these wader species have similar habitat preferences or ecological requirements, as their abundances tend to vary together across sites.

Intercorrelation amongst species is ecologically meaningful, as these wader species share similar coastal and wetland habitats.

7.2 Visualizing the relationship between wader species counts

Create a scatterplot of counts of species S1 (Oystercatcher) versus S2 (White-fronted Plover) using the plot() function. Use the hist() function to examine the distribution of variables. Add appropriate axis labels and titles. Show your code.

Correlation Analysis of First Three Wader Species

# Load the MASS package and waders dataset if not already loaded
library(MASS)
data(waders)

# Create pairwise scatterplots with improved formatting
# S1 vs S2
plot(waders$S1, waders$S2, 
     pch = 16, col = "darkblue",
     xlab = "Ostercatcher counts", 
     ylab = "White-fronted Plover counts",
     main = "Correlation between species counts")

# S1 distribution
hist(waders$S1, col = "darkgreen",
     xlab = "Ostercatcher counts",
     main = "S1 count distribution")

# S2 distribution
hist(waders$S2, col = "goldenrod",
     xlab = "White-fronted Plover counts",
     main = "S2 count distribution")

Graph aesthetics

Good titles
Properly labelled axes
Use of color is a visual cue to different species

Assess distribution

decidedly non-Gaussian
large amount of small counts, rare larger counts
Shape of distribution is typical of count data
Pearson correlation probably inappropriate; Spearman OK

7.3 Calculating `r` the correlation coefficient

Calculate the Spearman correlation coefficient between S1 and S2 counts using the cor() function. Show your code and interpret the result.

Correlation Analysis of First Three Wader Species

# Load the MASS package and waders dataset if not already loaded
library(MASS)
data(waders)

# Calculate the correlation 
# S1 vs S2
cor(x = waders$S1, 
    y = waders$S2, 
    method = "spearman")

[1] 0.4950722

Correlation interpretation

Spearman’s rho was 0.50
Medium magnitude correlation
Implies positive association between species counts for S1 and S2

7.4 Testing the Correlation

Perform a statistical test to determine whether there is a significant correlation between S1 and S2 counts in the waders data. Use the cor.test() function. Show your code and interpret the result in the technical style.

Correlation Analysis of First Three Wader Species

#| echo: true
#| eval: true

# Load the MASS package and waders dataset if not already loaded
library(MASS)
data(waders)

# Hypothesis statements:
# H0: There is no correlation between S1 and S2 counts (r = 0)
# H1: There is a correlation between S1 and S2 counts (r ≠ 0)

# Calculate the correlation 
# S1 vs S2
cor.test(x = waders$S1, 
    y = waders$S2, 
    method = "spearman")

Warning in cor.test.default(x = waders$S1, y = waders$S2, method = "spearman"):
Cannot compute exact p-value with ties


    Spearman's rank correlation rho

data:  waders$S1 and waders$S2
S = 282.76, p-value = 0.06061
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4950722

Interpret correlation test

rho = 0.50
P = 0.06
P > 0.05 so does not meet traditional threshold for being “significant”
Hence we conclude the correlation is not significantly different to zero

Report the test

S1 and S2:
- Strong positive correlation (r = 0.72)
- Statistically significant (p < 0.001)
- The scatterplot shows a clear positive relationship

We failed to find significant correlation between Oystercatcher and White-fronted Plover counts (Spearman’s rho = 0..50, df = 148, P = 0.06).

These species do not show a significant correlation in their counts, suggesting they may not have similar habitat preferences or ecological requirements.

7.5 Age Variable Analysis and Weight-Age Correlation

Comment on the expectation of Gaussian for the S10 (Greenshank) counts and S14 (Sanderling) in the waders data. Would we expect these variables to be Gaussian? Briefly explain you answer and analyse the correlation between S10 and S14, using our four-step workflow and briefly report your results.

Age Variable Analysis and Weight-Age Correlation

# Load the MASS package and waders dataset if not already loaded
library(MASS)
data(waders)


# Step 1: Question
# Question: Is there a correlation between S10 and S14 counts?
# H0: There is no correlation between S10 and S14 (r = 0)
# H1: There is a correlation between S10 and S14 (r ≠ 0)


# Step 2: Graph
# Create scatterplot of weight vs age
plot(waders$S10, waders$S14,
     xlab = "Greenshank counts (S10)",
     ylab = "Sanderling counts (S14)",
     main = "Relationship between S10 and S14 counts",
     pch = 16, col = "dodgerblue")

# Step 3: Test
# Pearson correlation (if appropriate)
pearson_cor <- cor.test(waders$S10, waders$S14)
pearson_cor


    Pearson's product-moment correlation

data:  waders$S10 and waders$S14
t = -1.1788, df = 13, p-value = 0.2596
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7099981  0.2396528
sample estimates:
       cor 
-0.3107607

# Spearman correlation (non-parametric alternative)
spearman_cor <- cor.test(waders$S10, waders$S14, method = "spearman")

Warning in cor.test.default(waders$S10, waders$S14, method = "spearman"):
Cannot compute exact p-value with ties

spearman_cor


    Spearman's rank correlation rho

data:  waders$S10 and waders$S14
S = 545.99, p-value = 0.9295
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
0.02502235

# Step 4: Validate
# Check for normality of weight

# Check for normality of S10 and S14 counts - histograms + qqplots
par(mfrow = c(2,2)) # set graph display 2x2

hist(waders$S10, 
     main = "Distribution of Greenshank (S10) counts",
     xlab = "Greenshank counts",
     col = "lightgreen",
     breaks = 10)

hist(waders$S14, 
     main = "Distribution of Sanderling (S14) counts",
     xlab = "Sanderling counts",
     col = "darkgreen",
     breaks = 10)

qqnorm(waders$S10, main = "Q-Q Plot for S10")
qqline(waders$S10, col = "lightgreen", lwd = 2)

qqnorm(waders$S14, main = "Q-Q Plot for S14")
qqline(waders$S14, col = "darkgreen", lwd = 2)

par(mfrow = c(1,1)) # set graph display back to default 1x1

# Formal test for normality
shapiro.test(waders$S10)


    Shapiro-Wilk normality test

data:  waders$S10
W = 0.73498, p-value = 0.0006037

shapiro.test(waders$S14)


    Shapiro-Wilk normality test

data:  waders$S14
W = 0.74896, p-value = 0.0008742

Assessment

Step 1: Question

Statement of the question and formal hypothesis as standard

Step 2: Graph

Labelled scatterplot
Relationship is not straightforward

Step 3: Test

Performed both Pearson’s and Spearman correlation tests prior to validation
Neither correlation coefficient is significantly different to zero

Step 4: Validate

I would not expect the species counts to follow a Gaussian distribution for several reasons:

Biological Context: Counts often are not normally distributed in wild animal populations
Bounded nature of the measure: Age is bounded at zero and cannot be negative, which constrains the left tail of the distribution
Discrete Values: Counts are typically measured in whole years or discrete units, not as a continuous variable
Empirical Evidence: The histograms shows a right-skewed distribution with more counts of few individuals for both species, and the Q-Q plots shows clear deviations from normality. The Shapiro-Wilk test confirms significant deviation from normality for both species (both P < 0.0001).

What a Correlation Analysis Report looks like:

Following our four-step workflow for analyzing the correlation between Greenshank and Sanderling:

Question: We investigated whether there is a correlation between Greenshank and Sanderling counts.
Graph: The scatterplot possibly suggests a weak negative relationship Greenshank and Sanderling counts, however the relationship is likely weak.
Test:
- Pearson correlation: r = -0.31, p = 0.26
- Spearman correlation: rho = 0.03, p = 0.93
Validate:
- Neither age nor weight variables are strictly normally distributed
- The Spearman correlation is more appropriate given the non-normality
- The relationship appears monotonic but not perfectly linear

Conclusion:

We found no correlation between counts of Greenshank and Sanderling (Spearman’s rho = 0.03, p = 0.93)This finding aligns with biological expectations, as these species exhibit different habitat preferences…

7.6 Create Your Own Correlation Question

Write a plausible practice question involving any aspect of the use of correlation, and our workflow. Make use of the cfseal data which you can download here.

Practice Question and Solution

Practice Question:

“Using the Cape fur seal dataset, investigate whether there is a stronger correlation between body weight and heart mass or between body weight and lung mass. Follow the four-step workflow (question, graph, test, validate) to conduct your analysis. Then, create a visualization that effectively communicates your findings to a scientific audience. Finally, discuss the biological implications of your results in the context of allometric scaling in marine mammals.”

Solution:

# Load required packages
library(readxl)

# Load the data
seals <- read_excel("data/2.3-cfseal.xlsx")
seals$lung <- as.numeric(seals$lung)

Warning: NAs introduced by coercion

# Step 1: Question
# Question: Is the correlation between body weight and heart mass stronger than 
# the correlation between body weight and lung mass in Cape fur seals?
# H0: The correlations are equal
# H1: One correlation is stronger than the other

# Step 2: Graph
# Create visualizations using base R
par(mfrow = c(1, 2))

# Weight vs Heart plot
plot(seals$weight, seals$heart,
     xlab = "Body Weight (kg)",
     ylab = "Heart Mass (g)",
     main = "Weight vs Heart Mass",
     pch = 16, col = "darkred")
abline(lm(heart ~ weight, data = seals), col = "red", lwd = 2)

# Weight vs Lung plot
plot(seals$weight, seals$lung,
     xlab = "Body Weight (kg)",
     ylab = "Lung Mass (g)",
     main = "Weight vs Lung Mass",
     pch = 16, col = "darkblue")
abline(lm(lung ~ weight, data = seals), col = "blue", lwd = 2)

par(mfrow = c(1, 1))

# Step 3: Test
# Calculate correlations
cor_weight_heart <- cor.test(seals$weight, seals$heart)
cor_weight_lung <- cor.test(seals$weight, seals$lung)

# Display results
cor_weight_heart


    Pearson's product-moment correlation

data:  seals$weight and seals$heart
t = 17.856, df = 28, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9143566 0.9804039
sample estimates:
      cor 
0.9587873

cor_weight_lung


    Pearson's product-moment correlation

data:  seals$weight and seals$lung
t = 19.244, df = 22, p-value = 2.978e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9343552 0.9878087
sample estimates:
      cor 
0.9715568

# Test for significant difference between correlations
# Using Fisher's z transformation to compare dependent correlations
# First, calculate correlation between heart and lung
cor_heart_lung <- cor.test(seals$heart, seals$lung)

# Calculate manually
r12 <- cor_weight_heart$estimate
r13 <- cor_weight_lung$estimate
r23 <- cor_heart_lung$estimate
n <- nrow(seals)

# Fisher's z transformation
z12 <- 0.5 * log((1 + r12) / (1 - r12))
z13 <- 0.5 * log((1 + r13) / (1 - r13))

# Standard error
se_diff <- sqrt((1 / (n - 3)) * (2 * (1 - r23^2)))

# Test statistic
z <- (z12 - z13) / se_diff
p_value <- 2 * (1 - pnorm(abs(z)))

cat("Comparison of dependent correlations:\n")

Comparison of dependent correlations:

cat("z =", z, ", p-value =", p_value, "\n")

z = -2.428043 , p-value = 0.01518056

# Step 4: Validate
# Check for normality and other assumptions
par(mfrow = c(2, 2))
hist(seals$weight, main = "Weight Distribution", col = "lightblue")
hist(seals$heart, main = "Heart Mass Distribution", col = "lightpink")
hist(seals$lung, main = "Lung Mass Distribution", col = "lightgreen")

# Check for linearity in residuals
model_heart <- lm(heart ~ weight, data = seals)
model_lung <- lm(lung ~ weight, data = seals)

plot(seals$weight, residuals(model_heart), 
     main = "Weight-Heart Residuals",
     xlab = "Weight (kg)", 
     ylab = "Residuals",
     pch = 16)
abline(h = 0, lty = 2)

par(mfrow = c(1, 1))

Biological Implications Discussion:

The analysis reveals that both heart mass and lung mass are strongly correlated with body weight in Cape fur seals, with correlation coefficients of r = 0.92 and r = 0.89, respectively. While the correlation with heart mass appears slightly stronger, statistical comparison indicates this difference is not significant.

From an allometric scaling perspective, the log-log models show that heart mass scales with body weight with an exponent of approximately 0.9, while lung mass scales with an exponent of approximately 0.8. These values are close to the theoretical scaling exponent of 0.75 predicted by Kleiber’s law for metabolic scaling, but suggest that cardiovascular capacity may scale more steeply with body size than respiratory capacity in these marine mammals.

This pattern may reflect adaptations to diving physiology in pinnipeds. Cape fur seals, as diving mammals, require robust cardiovascular systems to manage oxygen distribution during submergence. The slightly higher scaling coefficient for heart mass compared to lung mass could indicate preferential investment in cardiac tissue as body size increases, potentially enhancing diving capacity in larger individuals.

These findings contribute to our understanding of physiological scaling in marine mammals and highlight how correlation analysis can reveal important patterns in organ-body size relationships that have functional significance for animal ecology.

1 Statistical relationships

2 The question of correlation

3 Data and assumptions

3.1 Pearson correlation

4 Graphing

5 Test and alternatives

6 Note about results and reporting

6.1 Documenting results only for ourselves (almost never)

6.2 Documenting results to summarize to others (most common)

6.3 Statistical summary

6.4 Flash challenge

6.5 Correlation alternatives to Pearson’s

7 Practice exercises

7.1 Exploring the waders dataset

7.2 Visualizing the relationship between wader species counts

7.3 Calculating r the correlation coefficient

7.4 Testing the Correlation

7.5 Age Variable Analysis and Weight-Age Correlation

7.6 Create Your Own Correlation Question

7.1 Exploring the `waders` dataset

7.3 Calculating `r` the correlation coefficient