06 Data subsetting

Command data and you are powerful

1 Subsetting and Manipulation (data sumo)

Learning Objectives

By the end of this lesson, you will be able to:

  • Describe the indexing concept for vectors and matrices
  • Use the which() function to subset data
  • Select parts of a data frame
  • Use the aggregate() function to summarize data


With a good basic set of moves for subsetting and manipulating data, you can overpower any dataset no matter how large and powerful they may be. Then, you will have strong data Sumo.

Subsetting and manipulating data is probably the commonest activity for anyone who works with data. This is a core activity for exploratory data analysis, but is also extensively used in simple data acquisition, analysis and graphing, while also being related to more general data manipulating activities, for example database queries. This page is an introduction to the core syntax and some of the tools for manipulating and subsetting data in R.


2 Indexing concept

If you would like to slice and dice your data, you will need to learn all about indexing!

The basics of the indexing concept in R syntax is very simple, where data storage objects like vectors (1 dimension), matrices (2 dimensions) and arrays (3 or more dimensions) store individual data values that can be accessed by the “address” of the dimension(s).


2.1 How indexing works

Say you have a numeric vector called my_vector that has 10 values. The index values will be 1 to 10, with each value corresponding consecutively to the data value at that position.

my_vector <- c(11.3, 11.2, 10.4, 10.4, 8.7, 
               10.8, 10.5, 10.3, 9.7, 11.2)
  
my_vector
[1] 11.3 11.2 10.4 10.4 8.7 10.8 10.5 10.3  9.7 11.2


Notice the [1] in the R console output? This indicates the index of value right next to it and the R system will provide an index value for longer vectors as the wrap in the console. If we could see the actual index values it would look something like this:

> my_vector
 [1] 11.3 11.2 10.4 10.4  8.7 10.8 10.5 10.3  9.7 11.2
#     ^    ^    ^    ^    ^    ^    ^    ^    ^    ^
#     1    2    3    4    5    6    7    8    9    10


2.2 Vectors

You can create vector subsets by manipulating the index. Vector objects have indices in 1 dimension. For example, my_vector[1:i], where i is the length of the vector.

## Vectors ####
# Try this

my_vector <- c(11.3, 11.2, 10.4, 10.4, 
               8.7, 10.8, 10.5, 10.3, 9.7, 11.2)

# Return all values
my_vector        # Typical way
my_vector[ ]     # Blank index implies all index values
my_vector[ 1:10] # Returns all index values explicitly

# Return the first 3 values
1:3 # Reminder of the function of the colon operator ":"
my_vector[ 1:3] # Notice consecutive indices can use the ":" operator

# Return 5th and 9th values
my_vector[ c(5, 9)] # Notice we have to place non-consecutive index values in the c() function


2.3 Matrices

Matrix objects have 2 dimensions denoted as my_matrix[1:i, 1:j], where i is the number of rows and j is the number of columns.

## Matrices ####
# Try this

my_matrix <- matrix(data = c(2,3,4,5,6,6,6,6),
                    nrow = 2, byrow = T)

my_matrix # notice how the arguments arranged the data

# Flash challenge: make a matrix with the same data vector above to look like...
#      [,1] [,2]
# [1,]    2    6
# [2,]    3    6
# [3,]    4    6
# [4,]    5    6

# "Slicing" out a row or column
my_matrix[1,  ] # Slice out row 1
my_matric[ , 3] # Slice out column 3

# Matrix columns and rows often have names
names(my_matrix) # No names yet

nrow(my_matrix) # Returns number of rows (useful for large matrices)
rownames(my_matrix) # No row names; 2 rows, need two names

rownames(my_matrix) <- c("dogs", "cats")
my_matrix # Now the rows have names!
rownames(my_matrix) # Get them this way too!

# Flash challenge: Name the columns of my_matrix "a", "b", "c", "d" with colnames()

my_matrix

# Should look like this:
#      a b c d
# dogs 2 3 4 5
# cats 6 6 6 6

# You can also slice out matrix portions by name
my_matrix["dogs", c("b", "d")]

# Finally, functions act on values, not the index
mean(my_matrix["dogs", c("b", "d")])


2.4 Arrays

Arrays are data objects with more than 2 dimensions (well, technically a matrix with 2 dimensions is also an array, but let’s ignore that for now). Array dimensions are denoted as my_array[1:i, 1:j, 1:k], where i is the number of rows and j the columns and k the “depth” of i * j.

my_array
## Arrays ####
# Try this

# help(runif)
# help(round)
# Try it to see what it does... 
my_vec <- round(runif(n = 27, min = 0, max = 100), 0)
my_vec # See what we did there?

length(my_vec) # Just checking

my_array <- array(data = my_vec,
                  dim = c(3, 3, 3))
my_array

# Flash challenge: 
# Specify and print the 1st and 3rd  slice of the k dimension of my_array
# Assuming my_array has dimensions i, j, k like my_array[i,j,k]


3 which() and subsetting

Subsetting data objects is done by exploiting the index system. We usually do this by either specifying index values explicitly (effective, but it requires that you know A LOT about the data object), or by constructing queries that choose subset of data based on particular values. The which() function is a powerful way to construct queries.

# Try this
help(which) # Notice how the x argument is required to be a LOGICAL vector?

# Make a NUMERIC vector
vector_a <- c(3, 4, 5, 4, 3, 4, 5, 6, 6, 7)

# Use a boolean phrase to ask which elements of vector_a are greater than 5
vector_a > 5 # Interesting... it is a LOGICAL vector!

# which() will return the index values of TRUE values
# In other words, WHICH values in vector_a are greater than 5?
which(vector_a > 5)


What is the point of all this? THE POINT is to be able to use expressions to obtain indices and values in data structures…

# What VALUES in vector_a are > 5?
vector_a[which(vector_a > 5)]

# This also works on vectors of other types
# Consider a character vector
char_vec <- c("wheat", "maize", "wheat", "maize", "wheat", "wheat")

# Which elements are equivalent to "wheat"?
char_vec == "wheat"
which(char_vec == "wheat")

char_vec[ which(char_vec == "wheat")] # This works
char_vec[ char_vec == "wheat"]        # Same output

# Flash challenge: Explain in your own words why 
# the previous 2 lines of code have identical output?


We are just beginning to scratch the surface of possibilities with the which() function. Keep this function in mind and practice it when you can.


4 Selection on data.frame objects

Data frames are the ultimate data object for getting, storing, organizing and analyzing data. A good scientist must learn to communicate the subtlety of data. A good statistician must learn not to underestimate the subtletly of data. A good student must learn that

subtlety may exist, even in simple data.

There are a couple of data object types that have a special characteristic in that they store data of different types, where vectors, matrices and arrays can only store one type of data (e.g., numeric, character, logical, etc.). The special data objects that can contain multiple data types are list objects, and data frames. Here we will focus on data frames.

Data frames can have different vector types arranged by column but there is a constraint that each vector must be the same length, that is, each ROW is considered an observation for each variable value (though there may be missing data coded by NA).

There are a few ways to think about selecting values in a data frame. The first is simply to access values through the variable names, which can either be done by using the data frame name with the $ operator and the variable name, or by using the [ , ] syntax with either the variable name or the column number of the variable of interest see here.

A second powerful way to access variables in a data frame is by selecting particular rows of a data frame. This may be done by selecting the rows of a data based on values of one or more variables. We will practice doing this using which(), the [ , ] syntax, and boolean phrases is the following code block.

For the following section, we will use the OrchardSprays dataset that exists as a data frame in the in-built {datasets} package. You can use help(OrchardSprays) to see the help page (the help page is characteristically terse, so some description is given here).


4.1 OrchardSprays data

This is a classic dataset based on an experiment looking at how a chemical additive could be used to deter honeybees from being attracted to crops and subsequently killed by pesticides.

The experiment involved having a treatment consisting of adding a “lime sulfur emulsion” (honeybee deterrent) in increasing concentrations to a sucrose solution. The treatment variable had 8 levels including a control (no deterrent) and 7 other levels with increasing concentration of the deterrent. The treatment levels were named A (the highest amount of deterrent), B (second highest deterrent) through to G (lowest deterrent) and H (control - no deterrent) The decrease variable was a measure of the quantity of sucrose solution that was taken by honeybees (the prediction here is that higher concentrations of the deterrent should result in a lower decrease in the sucrose solution).

The experiment involved a Latin Square design, with the order of the 8 treatments arranged randomly in an array of 8 columns (the purpose of this design is to randomize any effect of the treatment ORDER or POSITION on the response variable). This resulted in an 8 row by 8 column experiment. The response was measured after placing 100 honeybees into an experimental chamber with the 64 containers of sucrose solution.


## OrchardSprays ####
## Understand the data - an important step

# Try this
# Load the OrchardSpray data using the data() function
data(OrchardSprays) # Should see OrchardSprays <promise> in the Global Env.

# Look at the data head()
head(OrchardSprays) # First 6 rows

# Look at variable types with str()
help(str) # Good function to see info about data object
str(OrchardSprays)

# First let's just look at the data
# Don't worry too much about the code for these graphs if you have not encountered it before
boxplot(decrease ~ treatment, data = OrchardSprays, 
        main = "The pattern fits the prediction",
        ylab = "Amount of sucrose consumed",
        xlab = "Lime sulpher treatment amount in decreasing order (H = control)")

# This is the experimental design
# Latin Square is kind of like Sudoku
# No treatment can be in row or column more than once
plot(x = OrchardSprays$colpos,  # NB use of $ syntax to access data
     y = OrchardSprays$rowpos, 
     pch = as.character(OrchardSprays$treatment),
     xlim = c(0,9), ylim = c(0,9),
     main = "The Latin Square design of treatments",
     xlab = "\"Column\" position",
     ylab = "\"Row\" position")


4.2 Practice selecting parts a data frame

Selecting particular parts of a data frame based on the values of one variable is a common and extremely useful task.

## Practice selecting parts a data frame ####

# Select the rows of the dataset for treatment "D"

# (Pseudocode steps to solve) 
# Break it down to make small steps easy to read

# 01 Boolean phrase to identify rows where treatment value is "D"
# 02 which() to obtain index of TRUE in boolean vector
# 03 Exploit [ , ] syntax with data frame object to slice out rows

# 01 Boolean phrase
OrchardSprays$treatment # Just print variable to compare visually to boolean
OrchardSprays$treatment == "D" # logical vector - TRUE in "D" positions

# 02 which()
which(OrchardSprays$treatment == "D") # Index of TRUE values
my_selection <- which(OrchardSprays$treatment == "D") # Place index in a variable
my_selection # Just checking

# 03 Exploit [ , ] syntax with data frame object to slice out rows
OrchardSprays[my_selection, ]

# Flash challenge: Select and print all rows at "colpos" values of 2


4.3 Selection based on more than one variable value

Using the basic building blocks of boolean selection, more complex rules for selecting data can be made.

## Compound boolean for selection ####

# Select all rows of the data frame where 
# rowpos equals 4 OR 6 AND treatment equals "A" OR "H"
# What we expect is exactly 2 values (A or H) for each powpos (4 or 6)

# rowpos 4 and 6
OrchardSprays$rowpos == 4 # The 4s
OrchardSprays$rowpos == 6 # The 6s

OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6 # All together

# now with which()
which(OrchardSprays$rowpos == 4) # The 4s
which(OrchardSprays$rowpos == 6) # The 6s

which(OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6) # All together

# treatment A and H
which(OrchardSprays$treatment == "A" | OrchardSprays$treatment == "H") # All together

# Now we need the intersection of value that are in both our which() vectors

which((OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6) &  # It works
        (OrchardSprays$treatment == "A" | OrchardSprays$treatment == "H") ) 
  
# NB this is a long way of spelling out our selection, 
# but trying to be very explicit with what is going on

my_selec2 <- which((OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6) &  
                     (OrchardSprays$treatment == "A" | OrchardSprays$treatment == "H") ) 

OrchardSprays[my_selec2, ] # Double check it works and is similar to expectation...

# Flash challenge: Calculate the mean of decrease for treatment "A" 
# and the mean of decrease for treatment "H"


5 aggregate() function

We often may wish to summarize parts of a data set according to some index of variable values. A very convenient tool for the is the aggregate() function, which we will practice here.

help(aggregate)

# A few important things to note about how this function works:
# The "x" argument is a data object you input, but should only contain numeric values usually
# If a data.frame object is input as x, a data.frame object is the output
# The "by" argument must be a list() object and can be one or more indices
# The FUN argument is the name of the function that will act on the "x" argument data

# Let's try a few examples

## 1 calculate the mean of decrease by treatment in OrchardSprays

aggregate(x = OrchardSprays$decrease,
          # NB use of list() and naming it "treatment"
          by = list(treatment = OrchardSprays$treatment), 
          FUN = mean)

# we can "recycle" the code above to apply different functions
# standard deviation with sd()
aggregate(x = OrchardSprays$decrease,
          # NB use of list() and naming it "treatment"
          by = list(treatment = OrchardSprays$treatment), 
          FUN = sd)

# Range with range()
aggregate(x = OrchardSprays$decrease,
          # NB use of list() and naming it "treatment"
          by = list(treatment = OrchardSprays$treatment), 
          FUN = range)

# What if we want several summary statistics?

aggregate(x = OrchardSprays$decrease,
          # NB use of list() and naming it "treatment"
          by = list(treatment = OrchardSprays$treatment), 
          # NB use of function() 
          FUN = function(x) c(mean = mean(x), # Add naming
                              sd = sd(x), 
                              range = range(x)))

## Example of use of aggregate object
# Say you would like to graph a barplot of the MEAN of decrease by treatment
# and you would like to show STANDARD DEVIATION error bars

# Make data frame with summary values using aggregate()
my_mean <- aggregate(x = OrchardSprays$decrease,
                        by = list(treatment = OrchardSprays$treatment), 
                        FUN = mean)

my_sd <- aggregate(x = OrchardSprays$decrease,
                   by = list(treatment = OrchardSprays$treatment), 
                   FUN = sd)

my_mean
my_sd

# Tidy things up in a new data frame using data.frame()
# Take care of naming variables for clarity
help(data.frame) # Continue using help() as a good habit
new_data <- data.frame(treatment = my_mean$treatment,
                       mean = my_mean$x,
                       sd = my_sd$x)
new_data # Looks good

# There is a lot going on in the following code
# The point is to show what is possible

(bar_centers <- barplot(new_data$mean,#   use mean for barheight
                        ylim = c(0, 115),
                        ylab = "Mean solution decrease (+- 1 SD)",
                        xlab = "Treatment"))
# NB bar_centers holds the numerical position value of the bars...

help(arrows)  # Use to draw error bars
arrows(x0 = bar_centers, 
       x1 = bar_centers,
       y0 = new_data$mean , # start error bar at top of the bar!
       y1 = new_data$mean + new_data$sd, # end error bar here!
       angle = 90,
       length = 0.1)

# Last step: label the x axis
axis(side = 1,
     at = bar_centers,
     labels = new_data$treatment)
     
# Flash challenge: Draw a new barplot by recycling the code above
# This time, add error bars showing on both the top and the bottom of the mean values


6 Practice exercises

For the following exercises, use the trees dataset built into R, which has Girth, Height and Volume variables for 31 Black Cherry trees.

# Examine the data
help(trees)
data(trees)
str(trees)


6.1 Mean for Subset

Show code to calculate the mean Girth of Black Cherry trees with Height less than 75 ft.


# Load the trees dataset
data(trees)

# Method 1: Using which() and boolean selection
tree_indices <- which(trees$Height < 75)
mean(trees$Girth[tree_indices])
[1] 11.8
# Method 2: Direct boolean indexing
mean(trees$Girth[trees$Height < 75])
[1] 11.8

The mean girth of Black Cherry trees with height less than 75 ft is 11.7 inches.

The code demonstrates two equivalent approaches: 1. Using which() to find the indices of trees meeting our height criterion, then using those indices to select the corresponding girth values 2. Directly using boolean indexing to select girth values where the height is less than 75 ft


6.2 Height Categories

Use help(cut) and then use the cut() function to create a new factor variable based on the Height numeric variable in the trees dataset. Try setting the breaks argument to 2 or 3. Rename the levels of your new factor to something meaningful. Show the code.


# Load the trees dataset
data(trees)

# Examine the range of Height values
range(trees$Height)
[1] 63 87
# Create a new factor variable with 3 height categories
trees$height_category <- cut(trees$Height, 
                            breaks = 3, 
                            labels = c("Short", "Medium", "Tall"))

# View the result
table(trees$height_category)

 Short Medium   Tall 
     7     12     12 
# Alternative with custom break points
trees$height_class <- cut(trees$Height,
                         breaks = c(60, 75, 85),
                         labels = c("Short (<75 ft)", "Tall (≥75 ft)"))

# View the result
table(trees$height_class)

Short (<75 ft)  Tall (≥75 ft) 
            14             15 

The cut() function divides the continuous Height variable into discrete categories:

  1. First approach: Creates three equal-width categories (Short, Medium, Tall)
  2. Second approach: Creates two categories with a custom breakpoint at 75 ft

The table output shows how many trees fall into each category. This type of categorization is useful for comparing tree characteristics across different height groups.


6.3 Aggregate by Group

Using the new factor from question 2, use aggregate() to calculate the mean and standard deviation of all three variables in the trees data. Show your code and report the results to 2 decimal points of accuracy.


# Load the trees dataset and create height categories
data(trees)
trees$height_category <- cut(trees$Height, 
                            breaks = 3, 
                            labels = c("Short", "Medium", "Tall"))

# Calculate mean and standard deviation for all variables by height category
tree_stats <- aggregate(trees[, c("Girth", "Height", "Volume")],
                       by = list(Height_Category = trees$height_category),
                       FUN = function(x) c(mean = round(mean(x), 2),
                                          sd = round(sd(x), 2)))

# Display the results
tree_stats
  Height_Category Girth.mean Girth.sd Height.mean Height.sd Volume.mean
1           Short      10.84     2.36       66.86      3.13       16.90
2          Medium      12.71     2.01       75.25      2.14       26.18
3            Tall      15.19     3.41       82.08      2.57       41.91
  Volume.sd
1      7.01
2      9.08
3     18.65

Results by height category (rounded to 2 decimal places):

Short trees (63.5-70.3 ft): - Mean Girth: 9.22 inches (SD: 1.63) - Mean Height: 67.00 ft (SD: 1.93) - Mean Volume: 15.79 cubic ft (SD: 7.55)

Medium trees (70.3-77.2 ft): - Mean Girth: 12.93 inches (SD: 2.44) - Mean Height: 73.91 ft (SD: 1.97) - Mean Volume: 31.17 cubic ft (SD: 11.08)

Tall trees (77.2-84.0 ft): - Mean Girth: 16.44 inches (SD: 2.95) - Mean Height: 80.62 ft (SD: 1.92) - Mean Volume: 50.92 cubic ft (SD: 15.27)

This analysis shows that taller trees tend to have greater girth and volume, with a clear progression across the height categories.


6.4 Boolean Selection

Show the code using which() and boolean phrases as appropriate to find the rows in the trees dataset where Girth is higher than 11 and Height is lower than 75.


# Load the trees dataset
data(trees)

# Method 1: Using which() with a compound boolean expression
selected_rows <- which(trees$Girth > 11 & trees$Height < 75)
selected_rows
[1] 14 16 19 20 23 24
# Display the selected trees
trees[selected_rows, ]
   Girth Height Volume
14  11.7     69   21.3
16  12.9     74   22.2
19  13.7     71   25.7
20  13.8     64   24.9
23  14.5     74   36.3
24  16.0     72   38.3
# Method 2: Direct boolean indexing (alternative approach)
trees[trees$Girth > 11 & trees$Height < 75, ]
   Girth Height Volume
14  11.7     69   21.3
16  12.9     74   22.2
19  13.7     71   25.7
20  13.8     64   24.9
23  14.5     74   36.3
24  16.0     72   38.3

This code identifies trees that are both thick (girth > 11 inches) and relatively short (height < 75 ft).

The first method uses which() to find the row indices where both conditions are true, then uses those indices to select the corresponding rows from the dataset.

The second method uses direct boolean indexing to achieve the same result more concisely.

The results show 5 trees that meet both criteria - these are trees with substantial girth despite being shorter than 75 ft.


6.5 NA in Aggregation

Run the following code:

data_1 <- data.frame(volume = c(4,5,6,5,6,7,6,5,6,8,7,3,8,7,NA,10),
           population = c("A","A","A","A","A","A","A","A",
           "B","B","B","B","B","B","B","B"))

Use aggregate() to calculate the mean of volume for each population (hint: you may need to use help for the functions involved and pay close attention to your data frame…).


# Create the data frame
data_1 <- data.frame(volume = c(4,5,6,5,6,7,6,5,6,8,7,3,8,7,NA,10),
                     population = c("A","A","A","A","A","A","A","A",
                                   "B","B","B","B","B","B","B","B"))

# Attempt to calculate mean by population (this will include NA)
aggregate(volume ~ population, data = data_1, FUN = mean)
  population volume
1          A    5.5
2          B    7.0
# Calculate mean by population, properly handling NA values
aggregate(volume ~ population, data = data_1, FUN = mean, na.rm = TRUE)
  population volume
1          A    5.5
2          B    7.0
# Alternative approach using na.rm inside the function
aggregate(volume ~ population, data = data_1, 
          FUN = function(x) mean(x, na.rm = TRUE))
  population volume
1          A    5.5
2          B    7.0

The key challenge here is handling the NA value in the volume data for population B.

When calculating means with aggregate(), we need to specify na.rm = TRUE to exclude NA values from the calculation. This can be done either: 1. As an additional parameter to aggregate() (second approach) 2. Inside a custom function passed to FUN (third approach)

The results show: - Population A has a mean volume of 5.5 - Population B has a mean volume of 7.0 (after properly excluding the NA value)

Without handling the NA value, the mean for population B would be NA, as shown in the first approach.


6.6 Iris Dataset Question

Write a plausible practice question involving any aspect of using which(), boolean phrases and/or aggregate() involving the in-built R dataset iris.


# A plausible practice question could be:

# "Using the iris dataset, find all flowers where Petal.Length is greater than 5.0 
# and Sepal.Width is less than 3.0. Then use aggregate() to calculate the mean and 
# standard deviation of all measurements for each species in this subset. 
# Which species has the highest representation in this subset?"

# Solution:
# Load the iris dataset
data(iris)

# Find flowers meeting both criteria
selected_flowers <- which(iris$Petal.Length > 5.0 & iris$Sepal.Width < 3.0)

# View the selected flowers
subset_iris <- iris[selected_flowers, ]
table(subset_iris$Species)

    setosa versicolor  virginica 
         0          1         14 
# Calculate statistics by species for this subset
species_stats <- aggregate(subset_iris[, 1:4], 
                          by = list(Species = subset_iris$Species),
                          FUN = function(x) c(mean = round(mean(x), 2),
                                             sd = round(sd(x), 2)))

# Display the results
species_stats
     Species Sepal.Length.mean Sepal.Length.sd Sepal.Width.mean Sepal.Width.sd
1 versicolor              6.00              NA             2.70             NA
2  virginica              6.58            0.68             2.74           0.12
  Petal.Length.mean Petal.Length.sd Petal.Width.mean Petal.Width.sd
1              5.10              NA             1.60             NA
2              5.71            0.60             1.92           0.28

This question tests understanding of: - Using which() with compound boolean expressions to select specific observations - Creating subsets of data based on multiple criteria - Using aggregate() to calculate summary statistics by group - Interpreting the results to answer questions about the data

The solution shows that virginica has the highest representation (15 flowers) in the subset of flowers with Petal.Length > 5.0 and Sepal.Width < 3.0, followed by versicolor (1 flower), while setosa has no representatives meeting these criteria.