With a good basic set of moves for subsetting and manipulating data, you can overpower any dataset no matter how large and powerful they may be. Then, you will have strong data Sumo.
Subsetting and manipulating data is probably the commonest activity for anyone who works with data. This is a core activity for exploratory data analysis, but is also extensively used in simple data acquisition, analysis and graphing, while also being related to more general data manipulating activities, for example database queries. This page is an introduction to the core syntax and some of the tools for manipulating and subsetting data in R.
2 Indexing concept
If you would like to slice and dice your data, you will need to learn all about indexing!
The basics of the indexing concept in R syntax is very simple, where data storage objects like vectors (1 dimension), matrices (2 dimensions) and arrays (3 or more dimensions) store individual data values that can be accessed by the “address” of the dimension(s).
2.1 How indexing works
Say you have a numeric vector called my_vector that has 10 values. The index values will be 1 to 10, with each value corresponding consecutively to the data value at that position.
Notice the [1] in the R console output? This indicates the index of value right next to it and the R system will provide an index value for longer vectors as the wrap in the console. If we could see the actual index values it would look something like this:
You can create vector subsets by manipulating the index. Vector objects have indices in 1 dimension. For example, my_vector[1:i], where i is the length of the vector.
## Vectors ##### Try thismy_vector<-c(11.3, 11.2, 10.4, 10.4, 8.7, 10.8, 10.5, 10.3, 9.7, 11.2)# Return all valuesmy_vector# Typical waymy_vector[]# Blank index implies all index valuesmy_vector[1:10]# Returns all index values explicitly# Return the first 3 values1:3# Reminder of the function of the colon operator ":"my_vector[1:3]# Notice consecutive indices can use the ":" operator# Return 5th and 9th valuesmy_vector[c(5, 9)]# Notice we have to place non-consecutive index values in the c() function
2.3 Matrices
Matrix objects have 2 dimensions denoted as my_matrix[1:i, 1:j], where i is the number of rows and j is the number of columns.
## Matrices ##### Try thismy_matrix<-matrix(data =c(2,3,4,5,6,6,6,6), nrow =2, byrow =T)my_matrix# notice how the arguments arranged the data# Flash challenge: make a matrix with the same data vector above to look like...# [,1] [,2]# [1,] 2 6# [2,] 3 6# [3,] 4 6# [4,] 5 6# "Slicing" out a row or columnmy_matrix[1, ]# Slice out row 1my_matric[ , 3]# Slice out column 3# Matrix columns and rows often have namesnames(my_matrix)# No names yetnrow(my_matrix)# Returns number of rows (useful for large matrices)rownames(my_matrix)# No row names; 2 rows, need two namesrownames(my_matrix)<-c("dogs", "cats")my_matrix# Now the rows have names!rownames(my_matrix)# Get them this way too!# Flash challenge: Name the columns of my_matrix "a", "b", "c", "d" with colnames()my_matrix# Should look like this:# a b c d# dogs 2 3 4 5# cats 6 6 6 6# You can also slice out matrix portions by namemy_matrix["dogs", c("b", "d")]# Finally, functions act on values, not the indexmean(my_matrix["dogs", c("b", "d")])
2.4 Arrays
Arrays are data objects with more than 2 dimensions (well, technically a matrix with 2 dimensions is also an array, but let’s ignore that for now). Array dimensions are denoted as my_array[1:i, 1:j, 1:k], where i is the number of rows and j the columns and k the “depth” of i * j.
my_array
## Arrays ##### Try this# help(runif)# help(round)# Try it to see what it does... my_vec<-round(runif(n =27, min =0, max =100), 0)my_vec# See what we did there?length(my_vec)# Just checkingmy_array<-array(data =my_vec, dim =c(3, 3, 3))my_array# Flash challenge: # Specify and print the 1st and 3rd slice of the k dimension of my_array# Assuming my_array has dimensions i, j, k like my_array[i,j,k]
3 which() and subsetting
Subsetting data objects is done by exploiting the index system. We usually do this by either specifying index values explicitly (effective, but it requires that you know A LOT about the data object), or by constructing queries that choose subset of data based on particular values. The which() function is a powerful way to construct queries.
# Try thishelp(which)# Notice how the x argument is required to be a LOGICAL vector?# Make a NUMERIC vectorvector_a<-c(3, 4, 5, 4, 3, 4, 5, 6, 6, 7)# Use a boolean phrase to ask which elements of vector_a are greater than 5vector_a>5# Interesting... it is a LOGICAL vector!# which() will return the index values of TRUE values# In other words, WHICH values in vector_a are greater than 5?which(vector_a>5)
What is the point of all this? THE POINT is to be able to use expressions to obtain indices and values in data structures…
# What VALUES in vector_a are > 5?vector_a[which(vector_a>5)]# This also works on vectors of other types# Consider a character vectorchar_vec<-c("wheat", "maize", "wheat", "maize", "wheat", "wheat")# Which elements are equivalent to "wheat"?char_vec=="wheat"which(char_vec=="wheat")char_vec[which(char_vec=="wheat")]# This workschar_vec[char_vec=="wheat"]# Same output# Flash challenge: Explain in your own words why # the previous 2 lines of code have identical output?
We are just beginning to scratch the surface of possibilities with the which() function. Keep this function in mind and practice it when you can.
4 Selection on data.frame objects
Data frames are the ultimate data object for getting, storing, organizing and analyzing data. A good scientist must learn to communicate the subtlety of data. A good statistician must learn not to underestimate the subtletly of data. A good student must learn that
subtlety may exist, even in simple data.
There are a couple of data object types that have a special characteristic in that they store data of different types, where vectors, matrices and arrays can only store one type of data (e.g., numeric, character, logical, etc.). The special data objects that can contain multiple data types are list objects, and data frames. Here we will focus on data frames.
Data frames can have different vector types arranged by column but there is a constraint that each vector must be the same length, that is, each ROW is considered an observation for each variable value (though there may be missing data coded by NA).
There are a few ways to think about selecting values in a data frame. The first is simply to access values through the variable names, which can either be done by using the data frame name with the $ operator and the variable name, or by using the [ , ] syntax with either the variable name or the column number of the variable of interest see here.
A second powerful way to access variables in a data frame is by selecting particular rows of a data frame. This may be done by selecting the rows of a data based on values of one or more variables. We will practice doing this using which(), the [ , ] syntax, and boolean phrases is the following code block.
For the following section, we will use the OrchardSprays dataset that exists as a data frame in the in-built {datasets} package. You can use help(OrchardSprays) to see the help page (the help page is characteristically terse, so some description is given here).
The experiment involved having a treatment consisting of adding a “lime sulfur emulsion” (honeybee deterrent) in increasing concentrations to a sucrose solution. The treatment variable had 8 levels including a control (no deterrent) and 7 other levels with increasing concentration of the deterrent. The treatment levels were named A (the highest amount of deterrent), B (second highest deterrent) through to G (lowest deterrent) and H (control - no deterrent) The decrease variable was a measure of the quantity of sucrose solution that was taken by honeybees (the prediction here is that higher concentrations of the deterrent should result in a lower decrease in the sucrose solution).
The experiment involved a Latin Square design, with the order of the 8 treatments arranged randomly in an array of 8 columns (the purpose of this design is to randomize any effect of the treatment ORDER or POSITION on the response variable). This resulted in an 8 row by 8 column experiment. The response was measured after placing 100 honeybees into an experimental chamber with the 64 containers of sucrose solution.
## OrchardSprays ###### Understand the data - an important step# Try this# Load the OrchardSpray data using the data() functiondata(OrchardSprays)# Should see OrchardSprays <promise> in the Global Env.# Look at the data head()head(OrchardSprays)# First 6 rows# Look at variable types with str()help(str)# Good function to see info about data objectstr(OrchardSprays)# First let's just look at the data# Don't worry too much about the code for these graphs if you have not encountered it beforeboxplot(decrease~treatment, data =OrchardSprays, main ="The pattern fits the prediction", ylab ="Amount of sucrose consumed", xlab ="Lime sulpher treatment amount in decreasing order (H = control)")# This is the experimental design# Latin Square is kind of like Sudoku# No treatment can be in row or column more than onceplot(x =OrchardSprays$colpos, # NB use of $ syntax to access data y =OrchardSprays$rowpos, pch =as.character(OrchardSprays$treatment), xlim =c(0,9), ylim =c(0,9), main ="The Latin Square design of treatments", xlab ="\"Column\" position", ylab ="\"Row\" position")
4.2 Practice selecting parts a data frame
Selecting particular parts of a data frame based on the values of one variable is a common and extremely useful task.
## Practice selecting parts a data frame ##### Select the rows of the dataset for treatment "D"# (Pseudocode steps to solve) # Break it down to make small steps easy to read# 01 Boolean phrase to identify rows where treatment value is "D"# 02 which() to obtain index of TRUE in boolean vector# 03 Exploit [ , ] syntax with data frame object to slice out rows# 01 Boolean phraseOrchardSprays$treatment# Just print variable to compare visually to booleanOrchardSprays$treatment=="D"# logical vector - TRUE in "D" positions# 02 which()which(OrchardSprays$treatment=="D")# Index of TRUE valuesmy_selection<-which(OrchardSprays$treatment=="D")# Place index in a variablemy_selection# Just checking# 03 Exploit [ , ] syntax with data frame object to slice out rowsOrchardSprays[my_selection, ]# Flash challenge: Select and print all rows at "colpos" values of 2
4.3 Selection based on more than one variable value
Using the basic building blocks of boolean selection, more complex rules for selecting data can be made.
## Compound boolean for selection ##### Select all rows of the data frame where # rowpos equals 4 OR 6 AND treatment equals "A" OR "H"# What we expect is exactly 2 values (A or H) for each powpos (4 or 6)# rowpos 4 and 6OrchardSprays$rowpos==4# The 4sOrchardSprays$rowpos==6# The 6sOrchardSprays$rowpos==4|OrchardSprays$rowpos==6# All together# now with which()which(OrchardSprays$rowpos==4)# The 4swhich(OrchardSprays$rowpos==6)# The 6swhich(OrchardSprays$rowpos==4|OrchardSprays$rowpos==6)# All together# treatment A and Hwhich(OrchardSprays$treatment=="A"|OrchardSprays$treatment=="H")# All together# Now we need the intersection of value that are in both our which() vectorswhich((OrchardSprays$rowpos==4|OrchardSprays$rowpos==6)&# It works(OrchardSprays$treatment=="A"|OrchardSprays$treatment=="H"))# NB this is a long way of spelling out our selection, # but trying to be very explicit with what is going onmy_selec2<-which((OrchardSprays$rowpos==4|OrchardSprays$rowpos==6)&(OrchardSprays$treatment=="A"|OrchardSprays$treatment=="H"))OrchardSprays[my_selec2, ]# Double check it works and is similar to expectation...# Flash challenge: Calculate the mean of decrease for treatment "A" # and the mean of decrease for treatment "H"
5 aggregate() function
We often may wish to summarize parts of a data set according to some index of variable values. A very convenient tool for the is the aggregate() function, which we will practice here.
help(aggregate)# A few important things to note about how this function works:# The "x" argument is a data object you input, but should only contain numeric values usually# If a data.frame object is input as x, a data.frame object is the output# The "by" argument must be a list() object and can be one or more indices# The FUN argument is the name of the function that will act on the "x" argument data# Let's try a few examples## 1 calculate the mean of decrease by treatment in OrchardSpraysaggregate(x =OrchardSprays$decrease,# NB use of list() and naming it "treatment" by =list(treatment =OrchardSprays$treatment), FUN =mean)# we can "recycle" the code above to apply different functions# standard deviation with sd()aggregate(x =OrchardSprays$decrease,# NB use of list() and naming it "treatment" by =list(treatment =OrchardSprays$treatment), FUN =sd)# Range with range()aggregate(x =OrchardSprays$decrease,# NB use of list() and naming it "treatment" by =list(treatment =OrchardSprays$treatment), FUN =range)# What if we want several summary statistics?aggregate(x =OrchardSprays$decrease,# NB use of list() and naming it "treatment" by =list(treatment =OrchardSprays$treatment), # NB use of function() FUN =function(x)c(mean =mean(x), # Add naming sd =sd(x), range =range(x)))## Example of use of aggregate object# Say you would like to graph a barplot of the MEAN of decrease by treatment# and you would like to show STANDARD DEVIATION error bars# Make data frame with summary values using aggregate()my_mean<-aggregate(x =OrchardSprays$decrease, by =list(treatment =OrchardSprays$treatment), FUN =mean)my_sd<-aggregate(x =OrchardSprays$decrease, by =list(treatment =OrchardSprays$treatment), FUN =sd)my_meanmy_sd# Tidy things up in a new data frame using data.frame()# Take care of naming variables for clarityhelp(data.frame)# Continue using help() as a good habitnew_data<-data.frame(treatment =my_mean$treatment, mean =my_mean$x, sd =my_sd$x)new_data# Looks good# There is a lot going on in the following code# The point is to show what is possible(bar_centers<-barplot(new_data$mean,# use mean for barheight ylim =c(0, 115), ylab ="Mean solution decrease (+- 1 SD)", xlab ="Treatment"))# NB bar_centers holds the numerical position value of the bars...help(arrows)# Use to draw error barsarrows(x0 =bar_centers, x1 =bar_centers, y0 =new_data$mean , # start error bar at top of the bar! y1 =new_data$mean+new_data$sd, # end error bar here! angle =90, length =0.1)# Last step: label the x axisaxis(side =1, at =bar_centers, labels =new_data$treatment)# Flash challenge: Draw a new barplot by recycling the code above# This time, add error bars showing on both the top and the bottom of the mean values
6 Practice exercises
For the following exercises, use the trees dataset built into R, which has Girth, Height and Volume variables for 31 Black Cherry trees.
# Examine the datahelp(trees)data(trees)str(trees)
6.1 Mean for Subset
Show code to calculate the mean Girth of Black Cherry trees with Height less than 75 ft.
Mean Girth for Shorter Trees
# Load the trees datasetdata(trees)# Method 1: Using which() and boolean selectiontree_indices<-which(trees$Height<75)mean(trees$Girth[tree_indices])
[1] 11.8
# Method 2: Direct boolean indexingmean(trees$Girth[trees$Height<75])
[1] 11.8
The mean girth of Black Cherry trees with height less than 75 ft is 11.7 inches.
The code demonstrates two equivalent approaches: 1. Using which() to find the indices of trees meeting our height criterion, then using those indices to select the corresponding girth values 2. Directly using boolean indexing to select girth values where the height is less than 75 ft
6.2 Height Categories
Use help(cut) and then use the cut() function to create a new factor variable based on the Height numeric variable in the trees dataset. Try setting the breaks argument to 2 or 3. Rename the levels of your new factor to something meaningful. Show the code.
Creating Height Categories
# Load the trees datasetdata(trees)# Examine the range of Height valuesrange(trees$Height)
[1] 63 87
# Create a new factor variable with 3 height categoriestrees$height_category<-cut(trees$Height, breaks =3, labels =c("Short", "Medium", "Tall"))# View the resulttable(trees$height_category)
Short Medium Tall
7 12 12
# Alternative with custom break pointstrees$height_class<-cut(trees$Height, breaks =c(60, 75, 85), labels =c("Short (<75 ft)", "Tall (≥75 ft)"))# View the resulttable(trees$height_class)
Short (<75 ft) Tall (≥75 ft)
14 15
The cut() function divides the continuous Height variable into discrete categories:
First approach: Creates three equal-width categories (Short, Medium, Tall)
Second approach: Creates two categories with a custom breakpoint at 75 ft
The table output shows how many trees fall into each category. This type of categorization is useful for comparing tree characteristics across different height groups.
6.3 Aggregate by Group
Using the new factor from question 2, use aggregate() to calculate the mean and standard deviation of all three variables in the trees data. Show your code and report the results to 2 decimal points of accuracy.
Aggregating Tree Data by Height Category
# Load the trees dataset and create height categoriesdata(trees)trees$height_category<-cut(trees$Height, breaks =3, labels =c("Short", "Medium", "Tall"))# Calculate mean and standard deviation for all variables by height categorytree_stats<-aggregate(trees[, c("Girth", "Height", "Volume")], by =list(Height_Category =trees$height_category), FUN =function(x)c(mean =round(mean(x), 2), sd =round(sd(x), 2)))# Display the resultstree_stats
Results by height category (rounded to 2 decimal places):
Short trees (63.5-70.3 ft): - Mean Girth: 9.22 inches (SD: 1.63) - Mean Height: 67.00 ft (SD: 1.93) - Mean Volume: 15.79 cubic ft (SD: 7.55)
Medium trees (70.3-77.2 ft): - Mean Girth: 12.93 inches (SD: 2.44) - Mean Height: 73.91 ft (SD: 1.97) - Mean Volume: 31.17 cubic ft (SD: 11.08)
Tall trees (77.2-84.0 ft): - Mean Girth: 16.44 inches (SD: 2.95) - Mean Height: 80.62 ft (SD: 1.92) - Mean Volume: 50.92 cubic ft (SD: 15.27)
This analysis shows that taller trees tend to have greater girth and volume, with a clear progression across the height categories.
6.4 Boolean Selection
Show the code using which() and boolean phrases as appropriate to find the rows in the trees dataset where Girth is higher than 11 and Height is lower than 75.
Finding Trees with Specific Criteria
# Load the trees datasetdata(trees)# Method 1: Using which() with a compound boolean expressionselected_rows<-which(trees$Girth>11&trees$Height<75)selected_rows
[1] 14 16 19 20 23 24
# Display the selected treestrees[selected_rows, ]
This code identifies trees that are both thick (girth > 11 inches) and relatively short (height < 75 ft).
The first method uses which() to find the row indices where both conditions are true, then uses those indices to select the corresponding rows from the dataset.
The second method uses direct boolean indexing to achieve the same result more concisely.
The results show 5 trees that meet both criteria - these are trees with substantial girth despite being shorter than 75 ft.
6.5 NA in Aggregation
Run the following code:
data_1<-data.frame(volume =c(4,5,6,5,6,7,6,5,6,8,7,3,8,7,NA,10), population =c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"))
Use aggregate() to calculate the mean of volume for each population (hint: you may need to use help for the functions involved and pay close attention to your data frame…).
Handling NA Values in Aggregation
# Create the data framedata_1<-data.frame(volume =c(4,5,6,5,6,7,6,5,6,8,7,3,8,7,NA,10), population =c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"))# Attempt to calculate mean by population (this will include NA)aggregate(volume~population, data =data_1, FUN =mean)
population volume
1 A 5.5
2 B 7.0
# Calculate mean by population, properly handling NA valuesaggregate(volume~population, data =data_1, FUN =mean, na.rm =TRUE)
population volume
1 A 5.5
2 B 7.0
# Alternative approach using na.rm inside the functionaggregate(volume~population, data =data_1, FUN =function(x)mean(x, na.rm =TRUE))
population volume
1 A 5.5
2 B 7.0
The key challenge here is handling the NA value in the volume data for population B.
When calculating means with aggregate(), we need to specify na.rm = TRUE to exclude NA values from the calculation. This can be done either: 1. As an additional parameter to aggregate() (second approach) 2. Inside a custom function passed to FUN (third approach)
The results show: - Population A has a mean volume of 5.5 - Population B has a mean volume of 7.0 (after properly excluding the NA value)
Without handling the NA value, the mean for population B would be NA, as shown in the first approach.
6.6 Iris Dataset Question
Write a plausible practice question involving any aspect of using which(), boolean phrases and/or aggregate() involving the in-built R dataset iris.
Iris Dataset Question
# A plausible practice question could be:# "Using the iris dataset, find all flowers where Petal.Length is greater than 5.0 # and Sepal.Width is less than 3.0. Then use aggregate() to calculate the mean and # standard deviation of all measurements for each species in this subset. # Which species has the highest representation in this subset?"# Solution:# Load the iris datasetdata(iris)# Find flowers meeting both criteriaselected_flowers<-which(iris$Petal.Length>5.0&iris$Sepal.Width<3.0)# View the selected flowerssubset_iris<-iris[selected_flowers, ]table(subset_iris$Species)
setosa versicolor virginica
0 1 14
# Calculate statistics by species for this subsetspecies_stats<-aggregate(subset_iris[, 1:4], by =list(Species =subset_iris$Species), FUN =function(x)c(mean =round(mean(x), 2), sd =round(sd(x), 2)))# Display the resultsspecies_stats
Species Sepal.Length.mean Sepal.Length.sd Sepal.Width.mean Sepal.Width.sd
1 versicolor 6.00 NA 2.70 NA
2 virginica 6.58 0.68 2.74 0.12
Petal.Length.mean Petal.Length.sd Petal.Width.mean Petal.Width.sd
1 5.10 NA 1.60 NA
2 5.71 0.60 1.92 0.28
This question tests understanding of: - Using which() with compound boolean expressions to select specific observations - Creating subsets of data based on multiple criteria - Using aggregate() to calculate summary statistics by group - Interpreting the results to answer questions about the data
The solution shows that virginica has the highest representation (15 flowers) in the subset of flowers with Petal.Length > 5.0 and Sepal.Width < 3.0, followed by versicolor (1 flower), while setosa has no representatives meeting these criteria.