04 Data objects

Data objects in ‘R’ space

1 Data objects in R

Learning Objectives

By the end of this lesson, you will be able to:

  • Describe basic data types
  • Use str() to inspect data objects
  • Describe data with factors
  • Use class() and convert variable types
  • Use vector and matrix functions to manipulate data


Imagine you are floating in space in the R Global Environment and any data object you can see, you can call on by name to manipulate with functions. Let’s call it R Space

The fundamental way to analyse data in R is to be able to manipulate it with code. In order to do that, we need a system of containers to store the data in. This page is about the rules used for storing data in data objects. We will examine basic data types at the same time as how R stores them. R is actually very flexible in the way it handles data, making it as easy as possible for non-programmers to get going.


2 Basic data types, str()

2.1 Storage containers types

The first things to examine are the way that R variables are named, and the organization system for storing the data. The organization part is particularly important, because it is used to actually access and use data.

Variable names in R are simple and do not need to be “declared” like in some computing languages, and they can be almost anything, but there are a few rules. The R system has a built-in error message and warning message system (also known as the Passive-Aggressive Butler), which will usually give a hint when some of these rules are violated.


2.2 The Global Environment

One of the things we notice when people begin using R, even if they are experienced in data analysis, is that they expect to “see” data and data sets, almost as they are physical things. This might be because of experience using Excel and seeing the visual representation of data in spreadsheets (indeed, a graphical representation of physical spreadsheets!).

The regular R system for interacting with data is a little more abstract which can be disconcerting to new users. Typical use is to create variables in code script files and, usually, the bring data into the Global Environment from external files on the local PC, or from the web. We will practice using the Global Environment is the main way to interact with data.

You can use the class() function to find out the variable type (using this this is a good idea, since R occasionally guesses the intended data type incorrectly).

# Try this

variable_1 <- c(4,5,7,6,5,4,5,6,7,10,3,4,5,6) # a numeric vector

variable_2 <- c(TRUE, TRUE, TRUE, FALSE) # a logical vector

variable_3 <- c("Peter Parker", "Bruce Wayne", "Groo the Wanderer") # a character vector

class(variable_1) # "numeric"
class(variable_2) # "logical"
class(variable_3) # "character"


Look at the upper right pane of your RStudio and you should see something like:

RStudio Global Environment


The Environment tab contains the Global Environment (labelled A in the picture above). There are some other tabs, but we will ignore these for now. The Global Environment itself contains information about the variables that are held in memory. If we think of this as “R Space”, a general rule is that if you can see a variable here in the Global Environment, you can manipulate it and work with it.

Notice that there is quite a lot of information in the Global Environment about the actual variables (B in the picture). There is a column with the variable NAME (variable_1, variable_2, etc.), a column with the variable TYPES (num, logi, etc.), a column with the variable dimensions ([1:14] is an index like a unique “street address” for each of the 14 numeric values contained in “variable_1”)


2.3 Naming conventions for variables

Variable names:

  • Can contain letters, numbers, some symbolic characters

  • Must begin with a letter

  • Must not contain spaces

  • Some forbidden characters like math operators, “@”, and a few others

  • Should be human-readable, consistent, and not too long

  • Case sensitive

# Try this

## Variable name rules ####

# Can contain letters, numbers, some symbolic characters
x1 <- 5  # OK

x2 <- "It was a dark and stormy night" # OK

my_variable_9283467 <- 1 # technically works, but hard to read

# Must begin with a letter 

varieties <- c("red delicious", "granny smith") # OK

x432 <- c("a", "b") # OK

22catch <- c(TRUE, TRUE, FALSE)  # nope

# Must not contain spaces
my_variable <- 3 # OK

my.variable <- 4 # OK

myVariable <- 5 # OK

my variable <- 6 # nope

"my variable" <- 7 # nope

# Must not contain forbidden characters like 
# math operators, "@", and a few others
my@var <- 1 # nope

my-var <- 1 # nope

my=var <- 1 # nope

# etc.

# Should be human-readable, consistent, and not too long

Diameter_Breast_Height_cm <- c(22, 24, 29, 55, 43) # legal but too long

DBH_cm <- c(22, 24, 29, 55, 43) # much better

#Case sensitive
height <- c(180, 164, 177) # OK

Height # Error: object 'Height' not found (notice capital H)

height # OK


3 Data with factor()

Sometimes you will need to analyze data that is a factor, where the different values are categories. Factors in R can be a cause for confusion, but there needn’t be problems if you understand them. The information here is a starting point, and we skip some complexities, but essentially

3.1 Two types of factors, non-ordered and ordinal

Non-ordered factors are simply categories and the levels are simply the names of the categories. By definition, non-ordered factor do not have a specific order! an example here might be plant varieties.

Ordered factors have a specific order, which can be important for analysis or for graphing. You usually have to specify the order explicitly to get this right. An example here might be the days of the week, where the order of day is important.


4 Class() and converting variables

We use the class() function to query what data category a variable is set too. R is pretty good at setting this correctly, but it is a good idea to check sometimes and occasionally you will have to manually set variable characteristics.


# try this

# non-ordered factor
variety <- c("short", "short", "short",
             "early", "early", "early",
             "hybrid", "hybrid", "hybrid")
class(variety)  # "character", but this is really a factor...
variety # Notice the character strings are just printed out

variety <- factor(variety) # use factor() to convert the character vector to a factor
class(variety)  # now variety is a "factor"
variety # notice the output has changed


Now, try some code manipulating ordered factors.

# Ordered factors
day <- c("Monday", "Monday", 
         "Tuesday", "Tuesday", 
         "Wednesday", "Wednesday", 
         "Thursday")
class(day) # character

#make day a factor
day <- factor(day)
class(day)
day # Notice the Levels: Monday Thursday Tuesday Wednesday

# To set the order explicitly we need to set them explicitly
help(factor) # notice the levels argument - it sets the order of the level names

day <- factor(x = day, levels = c("Monday", "Tuesday",
                         "Wednesday", "Thursday"))
day # Notice the level order now


5 Vector and Matrix fun

Vector and matrix data structures are two fundamental ways to arrange data in R. We have already looked at vectors, which store data in a single dimension.

There are actually a few different data organisation structures in R.

  • Vector - stores data in a single dimension from 1 to i, my_vec[i]

  • Matrix - stores data in two dimensions 1 to i rows, 1 to j columns my_mat[i, j]

  • Array - Three (or more) dimensions from 1 to to i, j, and k, my_array[i, j, k]

Vectors, Matrices and Arrays can only store the same TYPE of data.


5.1 Vector practice

# Try this

myvec1 <- c(1,2,3,4,5) # numeric vector
myvec1
class(myvec1) # see? I told ya!

myvec2 <- as.character(myvec1) #convert to character
myvec2 # notice the quotes
class(myvec2) # now character

myvec3 <- c(2, 3, "male")
myvec3 #notice the numbers now have quotes - forced to character...

myvec4 <- as.numeric(myvec3) #notice the warning
myvec4 # The vector element that could not be coerced to be a numeric was converted to NA


5.2 Matrices

Matrices can be quite useful - you can manipulate data into matrix form with the matrix() function. By default rows and columns are merely numbered, but they can be named as well.


# Try this

vec1 <- 1:16 # make a numeric vector with 16 elements
vec1 

help(matrix) #notice the ncol, nrow and byrow arguments

mat1 <- matrix(data = vec1, ncol = 4, byrow = FALSE) #byrow = FALSE is the default

mat1 # Notice the numbers filled in by columns
colnames(mat1) # The Columns and Rows have no names

colnames(mat1) <- c("A", "B", "C", "D") # Set the column names for mat1
colnames(mat1)

mat1 # Yep the columns shows names


5.3 Matrix challenge

# Challenge 1: Set the Row names for Mat1 using the rownames() function

# Challenge 2: make a matrix with 3 rows with the following vector, 
# so the the first COLUMN contains the numbers 2, 5, and 9, in the order,
# for rows 1, 2, and 3 respectively:

vec2 <- c(2,3,5,4,5,6,7,8,9,5,3,1)


6 Practice exercises


6.1 Vector Recycling

Create a vector named my_var1 that contains the following 6 integers: 3, 6, 12, 7, 5, 1. Create a Second vector called my_var2 that contains the following 2 integers: 2, 3. Evaluate the expression my_var1 + my_var 2. Explain the output in terms of R mechanics in your own words.


# Create the first vector with 6 integers
my_var1 <- c(3, 6, 12, 7, 5, 1)

# Create the second vector with 2 integers
my_var2 <- c(2, 3)

# Evaluate the expression
my_var1 + my_var2
[1]  5  9 14 10  7  4

When adding vectors of different lengths in R, the shorter vector is “recycled” (repeated) to match the length of the longer vector. In this case:

  • my_var1 has 6 elements: 3, 6, 12, 7, 5, 1
  • my_var2 has 2 elements: 2, 3
  • When adding them, my_var2 is recycled to become effectively: 2, 3, 2, 3, 2, 3
  • The addition is then performed element-wise: 3+2=5, 6+3=9, 12+2=14, 7+3=10, 5+2=7, 1+3=4

This recycling behavior is a fundamental aspect of R’s vector operations, allowing operations between vectors of different lengths.


6.2 Ordered Factor

Create a character vector with the names of the 12 months of the year. Convert the vector to a factor, with the month names in chronological order. Show your code.


# Create a character vector with month names
months_vec <- c("January", "February", "March", "April", "May", "June", 
                "July", "August", "September", "October", "November", "December")

# Check the class
class(months_vec)
[1] "character"
# Convert to factor with correct chronological order
months_factor <- factor(months_vec, levels = months_vec)

# Verify the result
class(months_factor)
[1] "factor"
months_factor
 [1] January   February  March     April     May       June      July     
 [8] August    September October   November  December 
12 Levels: January February March April May June July August ... December

This code creates a character vector containing the 12 month names, then converts it to a factor with the levels explicitly set to maintain the chronological order.

Without specifying the levels, R would arrange the factor levels alphabetically (April, August, December, etc.), which wouldn’t represent the natural chronological order of months.


6.3 Matrix Data Types

What is wrong with the following code? Describe, show the code, and justify a fix for the problem in your own words. mymat <- matrix(data = c( 12, 23, 45, “34”, “22”, “31”))


# The problematic code
mymat <- matrix(data = c(12, 23, 45, "34", "22", "31"))
mymat
     [,1]
[1,] "12"
[2,] "23"
[3,] "45"
[4,] "34"
[5,] "22"
[6,] "31"
# Check the class of the matrix elements
class(mymat[1,1])
[1] "character"

The problem with this code is that matrices in R can only contain one data type. When mixing numeric values (12, 23, 45) with character values (“34”, “22”, “31”), R coerces all elements to the most flexible type - character.

This means all numbers are converted to character strings, which prevents numerical operations on the matrix.

To fix this, we need to decide what type the data should be:

# Fix 1: If all values should be numeric
numeric_mat <- matrix(data = c(12, 23, 45, 34, 22, 31), ncol = 3)
numeric_mat
     [,1] [,2] [,3]
[1,]   12   45   22
[2,]   23   34   31
class(numeric_mat[1,1])
[1] "numeric"
# Fix 2: If all values should be character (for non-numeric purposes)
char_mat <- matrix(data = c("12", "23", "45", "34", "22", "31"), ncol = 3)
char_mat
     [,1] [,2] [,3]
[1,] "12" "45" "22"
[2,] "23" "34" "31"
class(char_mat[1,1])
[1] "character"

The appropriate fix depends on how you intend to use the matrix data.


6.4 3D Array

Use the array() function to make a 2 x 2 x 3 array to produce the following output: , , 1

 [,1] [,2]

[1,] 1 3 [2,] 2 4

, , 2

 [,1] [,2]

[1,] 5 7 [2,] 6 8

, , 3

 [,1] [,2]

[1,] 9 11 [2,] 10 12


# Create a vector with all the values
data_vec <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

# Create the 2x2x3 array
my_array <- array(data = data_vec, dim = c(2, 2, 3))

# Display the array
my_array
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 3

     [,1] [,2]
[1,]    9   11
[2,]   10   12

To create this 3D array:

  1. First, we create a vector with all 12 values in the correct order
  2. Then we use the array() function with:
    • data parameter set to our vector
    • dim parameter set to c(2, 2, 3), which specifies:
      • 2 rows
      • 2 columns
      • 3 “layers” or “slices”

The array fills by columns first, then rows, then layers. This creates the exact structure shown in the expected output.


6.5 Named Matrix

Show the code to make the following matrix: cat dog male 22 88 female 71 29


# Create the data vector
data_vec <- c(22, 71, 88, 29)

# Create the matrix, filling by column
pet_matrix <- matrix(data = data_vec, nrow = 2, ncol = 2, byrow = FALSE)

# Add row and column names
rownames(pet_matrix) <- c("male", "female")
colnames(pet_matrix) <- c("cat", "dog")

# Display the matrix
pet_matrix
       cat dog
male    22  88
female  71  29

To create this matrix:

  1. We create a vector with the four values in the correct order for column-wise filling
  2. We create a 2x2 matrix using matrix() with byrow = FALSE (the default)
  3. We assign row names “male” and “female”
  4. We assign column names “cat” and “dog”

The resulting matrix matches the requested format, showing counts for male/female cats/dogs.


6.6 Matrix Question

Write a plausible practice question involving the use of the matrix() and vector() functions.


# A plausible practice question could be:

# "Create a 3x3 matrix containing the numbers 1 through 9, filled by row. 
# Then extract the diagonal elements of this matrix into a vector. 
# Finally, calculate the mean of these diagonal elements."

# Solution:
# Create the matrix filled by row
my_matrix <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
my_matrix

# Extract the diagonal elements
diag_elements <- diag(my_matrix)
diag_elements

# Calculate the mean of the diagonal elements
mean(diag_elements)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[1] 1 5 9
[1] 5

This question tests understanding of: - Creating matrices with specific filling patterns - Extracting specific elements from matrices - Using the diagonal function - Performing calculations on vectors extracted from matrices

The diagonal elements are 1, 5, and 9, and their mean is 5.