Hello! Welcome to the R walkthrough. I wanted to clarify a few things at the outset.
Things this is not
Things this is
With that in mind, if anything does not make sense at the outset, please ask questions, but also do not worry too much about grasping everything at once. I am providing you with the code and the notes precisely so that you have a resource you can reference later.
R is an object oriented programming language. This means that the organization occurs around data or objects rather than functions or logic. Each object you see in R has a defined class and a type. When using R, you are manipulating objects, and errors occur most often because you apply a manipulation that is illegal in the language. We are going to briefly go over how information is stored in R to help you make sense of error messages when they occur.
RStudio is an integrated development environment (IDE). Think of it as an interface for the execution of R code.
R has some nice advantages:
* It is open-source and free.
* Since it is object oriented, it gets your foot in the door for learning other object-oriented programming languages (think Python).
* There is a strong and vocal community of users who regularly post resources and provide answers to common questions.
R objects have names, mode, and a length. Mode tells you what type of data is in the object: character, numeric, logical, function, or list.
This is also why if you use the function mode() in R, you will not get the statistical mode as your answer!
These objects come in two types: atomic and recursive. Atomic types are numeric, character, and logical. If the type is atomic, then the object contains only one mode.
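As a quick check of these ideas, the sketch below asks R for the mode of a few throwaway objects and tests whether each is atomic or recursive. The object names are made up for illustration.

```r
# Hypothetical example objects, one per mode
num_obj  <- c(1.5, 2, 3)
chr_obj  <- c("a", "b")
lgl_obj  <- c(TRUE, FALSE)
list_obj <- list(1, "a", TRUE)

mode(num_obj)    # "numeric"
mode(chr_obj)    # "character"
mode(lgl_obj)    # "logical"
mode(list_obj)   # "list"
mode(mean)       # "function"

# Atomic objects hold a single mode; recursive objects can mix modes
is.atomic(num_obj)       # TRUE
is.recursive(list_obj)   # TRUE
is.recursive(num_obj)    # FALSE
```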
To get a bit deeper here, let’s think about vectors and matrices. A vector is a one-dimensional object of length n. We can place a collection of items into a vector using c(), for concatenate.
fav_food = c('chocolate', 'peanut butter', 'chicken')
fav_food
## [1] "chocolate" "peanut butter" "chicken"
This is a collection of characters. We can make a bigger character matrix by adding a column.
food_mat = matrix(nrow = 3, ncol = 2, c('chocolate', 'peanut butter',
'chicken', 'yes', 'yes', 'no'))
food_mat
## [,1] [,2]
## [1,] "chocolate" "yes"
## [2,] "peanut butter" "yes"
## [3,] "chicken" "no"
Rows and columns are indexed, so that elements can be extracted. The first index is the row, and the second is the column.
food_mat[1,]
## [1] "chocolate" "yes"
food_mat[,1]
## [1] "chocolate" "peanut butter" "chicken"
food_mat[1,1]
## [1] "chocolate"
What if we tried to do this with a vector? We will get an error, because the dimensions will be incorrect.
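Here is a minimal sketch of that failure. It wraps the illegal subsetting in tryCatch() so the script keeps running and we can look at the message; the variable name msg is just for illustration.

```r
fav_food <- c('chocolate', 'peanut butter', 'chicken')

# A vector has a length but no dim attribute, so two-index subsetting fails
dim(fav_food)   # NULL

# Capture the error message instead of stopping the script
msg <- tryCatch(fav_food[1, 1], error = function(e) conditionMessage(e))
msg

# A single index is the legal way to subset a vector
fav_food[1]     # "chocolate"
```

In an English-language R session the message reads "incorrect number of dimensions".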
Coercion occurs when modes are forced to change to be the same within an atomic object. Consider the following matrix.
food_mat = matrix(nrow = 3, ncol = 2, c('chocolate', 'peanut butter',
'chicken', 1, 2, 0))
food_mat
## [,1] [,2]
## [1,] "chocolate" "1"
## [2,] "peanut butter" "2"
## [3,] "chicken" "0"
Why did this happen? Because a matrix is atomic, it needs one mode. Therefore, the numbers were coerced to the same mode as the words. Coercion follows a hierarchy: character, then numeric, then logical. When values of different modes are combined, R converts everything to whichever mode present sits highest in that order, so logical values become numeric when mixed with numbers, and anything mixed with a character string becomes character.
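A small sketch of the hierarchy in action, using c() to mix modes on purpose. The object names are made up for illustration.

```r
lgl_num <- c(TRUE, 2)     # logical + numeric   -> numeric
num_chr <- c(1, 'a')      # numeric + character -> character
lgl_chr <- c(TRUE, 'a')   # logical + character -> character

lgl_num         # 1 2
num_chr         # "1" "a"
lgl_chr         # "TRUE" "a"
mode(lgl_num)   # "numeric"
mode(num_chr)   # "character"

# Going back down the hierarchy only works when the text looks like a number;
# anything else becomes NA (R also raises a warning, suppressed here)
back <- suppressWarnings(as.numeric(c('1', 'two')))
back            # 1 NA
```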
Meanwhile, a recursive type is a function or a list. Recursive objects may include different modes. Think of a dataframe where you have a column of names and a column of salaries associated with those names. We can get the i-th element of a list with [].
n_item <- c(1,2,0)
fav_food <- c('chocolate', 'peanut butter', 'chicken')
our_list <- list(n_item, fav_food)
our_list[1]
## [[1]]
## [1] 1 2 0
We can extract elements from a list using either [] or [[]]. The difference is that [] returns the selected item still wrapped in a list, whereas [[]] returns the contents of the item. Let’s make a bigger list to see.
taste_good = c(1, 0, 1)
allergic = c(0, 0, 0)
second_list = list(taste_good, allergic)
our_big_list = list(our_list, second_list)
our_big_list[[1]][[1]]
## [1] 1 2 0
our_big_list is a list of lists: it has two elements, each of which is a list. The command [[1]][[1]] extracted the first element of the first list.
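To make the difference between [] and [[]] concrete, this sketch rebuilds the small list from above and checks the class of each result. The names single and double are only for illustration.

```r
n_item   <- c(1, 2, 0)
fav_food <- c('chocolate', 'peanut butter', 'chicken')
our_list <- list(n_item, fav_food)

single <- our_list[1]     # a list of length 1 that holds n_item
double <- our_list[[1]]   # the numeric vector n_item itself

class(single)   # "list"
class(double)   # "numeric"

# Arithmetic works on the contents, not on the wrapping list
sum(double)     # 3
# sum(single) would stop with an error, because a list is not a valid input
```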
Overall, a list as an object can include both numeric and character modes. Indeed, a dataframe is a type of list. Let’s make a dataframe.
our_df = data.frame(our_list)
colnames(our_df)[1:2] <- c('n.item', 'food.name')
mode(our_df);our_df
## [1] "list"
## n.item food.name
## 1 1 chocolate
## 2 2 peanut butter
## 3 0 chicken
A function is like a machine that takes inputs and makes outputs. There are plenty of canned R functions that are included in the base language. Take, for example, mean.
mode(mean);mean(n_item, na.rm = T)
## [1] "function"
## [1] 1
How do we name objects? It is best to use something to separate long titles, like an underscore or period. It may be good practice to use different separators for different object types (periods for functions and lists, underscores for vectors or matrices), but of course this is your prerogative.
We may use == for logic, <- for assignment, and = for function arguments. Technically = can also be used for assignment, which has some nice advantages (it is not whitespace dependent, takes fewer keystrokes, and is easy to read) but a potential drawback: inside a function call, R treats name = value as a named function argument rather than as an assignment.
a <- 'cobra bubbles'
b = 'cobra bubbles'
a
## [1] "cobra bubbles"
b
## [1] "cobra bubbles"
Here is an example of the different assignment operators in action. x is a persistent object in the first line, and a function argument in the second.
x = seq(1,10,1)
sample(x = seq(1,10,1), replace = FALSE, size = 2)
## [1] 2 6
What do we mean when we say logic? Let’s say we want to make a variable scored 1 when the string reads 'cobra bubbles'.
as.numeric(a == 'cobra bubbles', 1,0)
## [1] 1
Our operation here is logical: if it is TRUE that a equals 'cobra bubbles', the result is 1; otherwise it is 0. Strictly speaking, as.numeric() is simply converting TRUE to 1 and FALSE to 0, and the extra arguments 1 and 0 are ignored. As an exercise, see the error that turns up if you try to use a different operator.
Let’s see an example of this operation where we ignore the grammar and set the equality to be only one sign. Who can explain this error message?
Error in as.numeric(a = "cobra bubbles", 1, 0) : supplied argument name 'a' does not match 'x'
Let’s look at a preloaded dataset in R - Titanic.
data("Titanic")
Here are some useful functions for assessing objects. First is str(), which tells us the structure of the data.
str(Titanic)
## 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
Attributes
attributes(Titanic)
## $dim
## [1] 4 2 2 2
##
## $dimnames
## $dimnames$Class
## [1] "1st" "2nd" "3rd" "Crew"
##
## $dimnames$Sex
## [1] "Male" "Female"
##
## $dimnames$Age
## [1] "Child" "Adult"
##
## $dimnames$Survived
## [1] "No" "Yes"
##
##
## $class
## [1] "table"
nrow
nrow(Titanic)
## [1] 4
ncol
ncol(Titanic)
## [1] 2
colnames
colnames(Titanic)
## [1] "Male" "Female"
x = rnorm(1000)
y = runif(1000)
z = rbinom(1000, 1, .3)
x_times_y = x*y
x_divide_y = x/y
x_exp_y = x^y
x_plus_y = x+y
x_minus_y = x-y
x_vec = c(10, 4, 7)
y_vec = c(12, 1, 6)
z_vec = c(11, 0, 22)
We can combine vectors and matrices with rbind or cbind.
First, rbind will bind by rows. Each row of the result will be the corresponding vector.
rbind(x_vec,y_vec, z_vec)
## [,1] [,2] [,3]
## x_vec 10 4 7
## y_vec 12 1 6
## z_vec 11 0 22
Second, cbind will bind by columns. The first row will contain the first element of each vector.
cbind(x_vec,y_vec, z_vec)
## x_vec y_vec z_vec
## [1,] 10 12 11
## [2,] 4 1 0
## [3,] 7 6 22
We can add .data.frame to the function name (cbind.data.frame() or rbind.data.frame()) to get a dataframe instead of a matrix when we combine the elements.
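For instance, here is a short sketch that binds the three vectors above both ways and checks what comes back. The names mat_version and df_version are made up for illustration.

```r
x_vec <- c(10, 4, 7)
y_vec <- c(12, 1, 6)
z_vec <- c(11, 0, 22)

mat_version <- cbind(x_vec, y_vec, z_vec)              # a numeric matrix
df_version  <- cbind.data.frame(x_vec, y_vec, z_vec)   # a data frame

is.matrix(mat_version)      # TRUE
is.data.frame(df_version)   # TRUE

# Columns of a data frame can be pulled out by name
df_version$y_vec            # 12 1 6
```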
We can also perform matrix algebra in R. This is how we do regressions. We use %*% for matrix multiplication, solve() for matrix inversion, and t() for transpose.
x_mat = cbind(rep(1, 10), rbinom(10, 1, .3), rnorm(10), runif(10))
y_mat = rnorm(10)
b = solve(t(x_mat)%*%x_mat)%*%t(x_mat)%*%y_mat
As it turns out, the operation of multiplying the transpose of a matrix by another is very common, common enough that the function crossprod() performs the same operation. We will get the same answer:
b<-solve(crossprod(x_mat), crossprod(x_mat,y_mat))
We can also perform the linear regression of y on x and see that our operation provides the same numerical results.
lm(y_mat ~ x_mat[,2:4]) |>
summary()
##
## Call:
## lm(formula = y_mat ~ x_mat[, 2:4])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9715 -0.7036 0.1618 0.4811 0.9918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.58389 0.56503 1.033 0.341
## x_mat[, 2:4]1 -0.05985 0.64344 -0.093 0.929
## x_mat[, 2:4]2 0.27162 0.44156 0.615 0.561
## x_mat[, 2:4]3 -0.58541 0.93566 -0.626 0.555
##
## Residual standard error: 0.8795 on 6 degrees of freedom
## Multiple R-squared: 0.1391, Adjusted R-squared: -0.2914
## F-statistic: 0.3231 on 3 and 6 DF, p-value: 0.8091
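To confirm that the three routes agree, the sketch below runs them side by side on a fresh simulation. The seed, sample size, and object names are arbitrary choices made for illustration, not part of the example above.

```r
set.seed(42)   # arbitrary seed so the sketch is reproducible
n <- 100
X <- cbind(1, rnorm(n), runif(n))   # intercept plus two made-up regressors
y <- rnorm(n)

# OLS by hand: b = (X'X)^{-1} X'y
b_manual <- solve(t(X) %*% X) %*% t(X) %*% y
b_cross  <- solve(crossprod(X), crossprod(X, y))

# lm() adds its own intercept, so we drop the column of ones
b_lm <- coef(lm(y ~ X[, 2:3]))

# Compare the estimates up to numerical tolerance
all.equal(as.numeric(b_manual), as.numeric(b_cross))
all.equal(as.numeric(b_cross), unname(b_lm))
```

Both comparisons should return TRUE, since all three compute the same least-squares solution and differ only in the numerical method used.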
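lm() is the canned regression function. In practice, you will likely use more specialized packages for regression to compute standard errors (estimatr, fixest, and lfe, to name a few). One funny quirk of lm() is that / inside a formula is not division: x/z expands to x + x:z, so the model below includes x and the x:z interaction. To divide arithmetically, wrap the expression in I(), as in I(x/z).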
y = rnorm(1000)
x = rnorm(1000)
z = rnorm(1000)
lm(y ~ x/z) |>
summary()
##
## Call:
## lm(formula = y ~ x/z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.87685 -0.73976 0.01919 0.68222 3.06850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03446 0.03262 -1.057 0.291
## x 0.04704 0.03277 1.435 0.152
## x:z -0.02731 0.03166 -0.863 0.389
##
## Residual standard error: 1.03 on 997 degrees of freedom
## Multiple R-squared: 0.002827, Adjusted R-squared: 0.0008271
## F-statistic: 1.413 on 2 and 997 DF, p-value: 0.2438
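One way to see how R reads a formula, without fitting anything, is to ask terms() which terms it expands to. This is a small sketch; the object names are made up, and no data are needed because terms() only parses the formula.

```r
# terms() expands a formula symbolically; x, y, and z need not exist
slash_terms  <- attr(terms(y ~ x / z), "term.labels")
star_terms   <- attr(terms(y ~ x * z), "term.labels")
divide_terms <- attr(terms(y ~ I(x / z)), "term.labels")

slash_terms    # "x"   "x:z"
star_terms     # "x"   "z"   "x:z"
divide_terms   # "I(x/z)"
```

So x/z gives the x term plus the x:z interaction, while I(x/z) is the single regressor obtained by actually dividing x by z.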
We can also wrap a logical condition in I() to create dummies inside of lm().
lm(y ~ I(x>0)) |>
summary()
##
## Call:
## lm(formula = y ~ I(x > 0))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8756 -0.7267 0.0218 0.6819 3.0434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05708 0.04556 -1.253 0.211
## I(x > 0)TRUE 0.04086 0.06522 0.627 0.531
##
## Residual standard error: 1.031 on 998 degrees of freedom
## Multiple R-squared: 0.0003932, Adjusted R-squared: -0.0006084
## F-statistic: 0.3926 on 1 and 998 DF, p-value: 0.5311
If we wanted to do an interaction the normal way, we would use *, as in multiplication. This includes both linear terms as well as their interaction.
lm(y ~ x*z) |>
summary()
##
## Call:
## lm(formula = y ~ x * z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.86710 -0.74079 0.01939 0.68271 3.06337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03428 0.03264 -1.050 0.294
## x 0.04669 0.03281 1.423 0.155
## z 0.01017 0.03253 0.313 0.755
## x:z -0.02664 0.03175 -0.839 0.402
##
## Residual standard error: 1.031 on 996 degrees of freedom
## Multiple R-squared: 0.002925, Adjusted R-squared: -7.795e-05
## F-statistic: 0.974 on 3 and 996 DF, p-value: 0.4042
The : operator adds only the interaction term, without the linear terms. Adding x and z back in by hand reproduces the x*z model above:
lm(y ~ x:z + x + z) |>
summary()
##
## Call:
## lm(formula = y ~ x:z + x + z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.86710 -0.74079 0.01939 0.68271 3.06337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03428 0.03264 -1.050 0.294
## x 0.04669 0.03281 1.423 0.155
## z 0.01017 0.03253 0.313 0.755
## x:z -0.02664 0.03175 -0.839 0.402
##
## Residual standard error: 1.031 on 996 degrees of freedom
## Multiple R-squared: 0.002925, Adjusted R-squared: -7.795e-05
## F-statistic: 0.974 on 3 and 996 DF, p-value: 0.4042
The output of lm() is a list object.
model = lm(y ~ x:z + x + z)
model
##
## Call:
## lm(formula = y ~ x:z + x + z)
##
## Coefficients:
## (Intercept) x z x:z
## -0.03428 0.04669 0.01017 -0.02664
We can take objects from the list to make plots or save them for other analyses.
model$coefficients
## (Intercept) x z x:z
## -0.03427726 0.04669328 0.01016843 -0.02664422
The most common way to create variables is with as.numeric() and ifelse(). Both work from a logical condition and generate values based on it. as.numeric() can only map a condition into binary (0, 1) values, since it converts TRUE to 1 and FALSE to 0, but ifelse() can be used for more complex operations.
as.numeric(x_vec==10, 1,0)
## [1] 1 0 0
ifelse(x_vec == 10, x_vec*-10, x_vec*10)
## [1] -100 40 70
You may want to do something like the above ifelse statement if you need to make a continuous variable based on a logical condition. For example, you may want a variable to be scored negative or positive based on a condition D (as in regression discontinuity) or may want to create a treatment variable that is scored as years until an event for treated units and 0 for all controls (as in panel event study designs).
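As a rough sketch of the event-study case, suppose we had a tiny panel with two treated units and one control. Everything here (the unit labels, the years, and the event year of 2005) is hypothetical.

```r
# Hypothetical panel: units A and B are treated in 2005, unit C is a control
unit       <- rep(c('A', 'B', 'C'), each = 3)
year       <- rep(2003:2005, times = 3)
treated    <- ifelse(unit == 'C', 0, 1)
event_year <- 2005

# Years until the event for treated units, 0 for all controls
years_to_event <- ifelse(treated == 1, event_year - year, 0)
years_to_event   # 2 1 0 2 1 0 0 0 0
```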
We may want to write loops when we need to perform the same action many times. This is much faster to type than repeating the same thing by hand. Often, if something is going to happen more than 3 times, it is good practice to write a loop. In addition to saving time, you reduce the chances of mistakes and make them easier to fix if they do occur. There are two types of loops, for and while. We use a for loop when we know the number of iterations. We use a while loop when the number of iterations is unknown, and we plan to test a condition before each pass.
numbers <- c(1, 2, 3, 4, 10)
# Initialize an empty vector to store results
squared_numbers_for <- numeric(length(numbers))
# Use a for loop to calculate squares
for (i in seq_along(numbers)) {
  squared_numbers_for[i] <- numbers[i]^2
}
# Print the results
print(squared_numbers_for)
## [1] 1 4 9 16 100
Here is an example of a while loop.
# Initialize an empty vector to store results
squared_numbers_while <- numeric(length(numbers))
# Initialize the index variable
i <- 1
# Use a while loop to calculate squares
while (i <= length(numbers)) {
  squared_numbers_while[i] <- numbers[i]^2
  i <- i + 1
}
# Print the results
print(squared_numbers_while)
## [1] 1 4 9 16 100
Another silly/classic example of a while loop:
dice <- 1
while (dice <= 6) {
  if (dice < 6) {
    print("No Yahtzee")
  } else {
    print("Yahtzee!")
  }
  dice <- dice + 1
}
## [1] "No Yahtzee"
## [1] "No Yahtzee"
## [1] "No Yahtzee"
## [1] "No Yahtzee"
## [1] "No Yahtzee"
## [1] "Yahtzee!"
There are other ways to repeat processes. We may use sapply() or lapply(). Each of these takes a list or vector as an argument, applies a function to each element, and returns the output. The difference is that sapply() will try to simplify the output as much as possible (to a vector or matrix), while lapply() always returns a list.
names = c('CARMY', 'COUSIN', 'CLAIRE')
output_sapply = sapply(names, 'tolower')
output_lapply = lapply(names, 'tolower')
class(output_sapply);class(output_lapply)
## [1] "character"
## [1] "list"
Meanwhile, apply() can be used on the rows or columns of a matrix.
mat <- matrix(1:12, nrow = 3, byrow = TRUE)
row_sums <- apply(mat, 1, sum)
row_sums
## [1] 10 26 42
col_sums <- apply(mat, 2, sum)
This saves us a lot of time - a for loop here would look at each column/row i, and then apply the sum function for each one, store it, and then we would save the results. This gives us one simple line of code :).
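For comparison, here is a sketch of what the loop version of the row sums would look like next to apply(). The name row_sums_loop is made up for illustration.

```r
mat <- matrix(1:12, nrow = 3, byrow = TRUE)

# Loop version: preallocate, then fill in one row sum at a time
row_sums_loop <- numeric(nrow(mat))
for (i in seq_len(nrow(mat))) {
  row_sums_loop[i] <- sum(mat[i, ])
}

# apply() version: one line
row_sums_apply <- apply(mat, 1, sum)

# Same numbers either way
all(row_sums_loop == row_sums_apply)   # TRUE

# Base R also ships rowSums() and colSums() for this specific case
rowSums(mat)   # 10 26 42
```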
Recall that dataframes are just like lists or matrices. We can select rows and columns in the same way. Let’s make a silly dataset and mess around with it.
food = c('hotdog', 'hamburger', 'taco', 'pbj')
sandwitch = c('no', 'no', 'no', 'yes')
lunch = c('yes', 'yes', 'no', 'yes')
n_serve = c(100, 200,300, 400)
food_df = cbind.data.frame(food, sandwitch, lunch, n_serve)
#colnames(food_df) = c('food', 'sand', 'lunch', 'n_serve')
food_df_lunch = subset(food_df, lunch == 'yes')
food_df_lunch
## food sandwitch lunch n_serve
## 1 hotdog no yes 100
## 2 hamburger no yes 200
## 4 pbj yes yes 400
food_df_c1_c2 = food_df[,c(1,2)]
food_df_c1_c2
## food sandwitch
## 1 hotdog no
## 2 hamburger no
## 3 taco no
## 4 pbj yes
We set up our workspace first by setting a working directory. Those familiar with Stata should understand this concept: it sets the folder against which relative file paths are resolved when we load and save files.
setwd('~/Dropbox')
getwd()
## [1] "/Users/donaldgrasse/Dropbox"
While many things can be accomplished in base R, most of the time you will need to load packages. Packages are collections of functions with documentation that R users create to perform specialized tasks. To use a package, we must first install it. We do so using the install.packages() function, wrapping the package name in quotation marks.
We can check if a package is installed by using installed.packages(). Once installed, we can load packages with either library() or require().
#install.packages('estimatr')
#library(estimatr)
Often there will be times when you want to load many packages at once. You can place all of the package names you want into a character vector and load them in one call.
pack = c('ggplot2', 'dplyr')
lapply(pack, require, character.only = T)
## Loading required package: ggplot2
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
Sometimes you may be using a different machine, your coauthor is also using the code, or you want to make sure you can transition easily if you get a new computer. You may then want to set up your script so that it finds and installs packages that are needed but not installed.
pack = c('ggplot2', 'dplyr')
new.pack <- pack[!(pack %in% installed.packages()[,"Package"])]
if(length(new.pack)) install.packages(new.pack)
There is also a package called pacman that will do this for you. As an exercise, read about this package and attempt using it on your own: https://cran.r-project.org/web/packages/pacman/index.html. You should always put something like this at the very top of your script.
When you want to call a function from an installed package without loading the package, use packagename::function(). You may want to do this if loading the entire package would cause conflicts (a conflict occurs when two packages contain a function with the same name).
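A minimal sketch of the :: syntax, using the stats package because it ships with every R installation. The choice of functions is just for illustration.

```r
# Call a function by naming its package explicitly
stats::median(c(1, 5, 9))   # 5

# Naming the package removes ambiguity when two packages share a function
# name. For instance, once dplyr is loaded, filter() refers to dplyr's
# version, while stats::filter() still reaches the time-series filter.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))   # centered 3-point average
as.numeric(moving_avg)      # NA 2 3 4 NA
```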
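If you need help with a function, type ?functionname. If you do not know the exact name, ??term searches the documentation for a term, and help(package = "packagename") lists the documentation for a whole package.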
Perhaps the most used set of packages is the tidyverse, a collection of packages with a shared grammar for solving R-related data cleaning challenges. There is a bit of an ideological war between base-R and tidyverse users. I’d recommend making yourself adept at both; from your perspective there is not much to gain by drawing a hard line in the sand.
There are various functions for reading in data, whether native to R or foreign to it (Stata or Excel, for instance): read.csv(), readRDS(), read.delim(), haven::read_dta(), and readxl::read_xlsx(). All of these functions have a similar syntax: include the filepath for the data, and correctly specify auxiliary arguments.
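As a self-contained sketch of that syntax, the code below writes a toy dataset to a temporary folder and reads it back with two of the base R functions. The data and file names are made up, and tempdir() is used only so that nothing is left behind in your working directory.

```r
# Hypothetical toy data
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
csv_path <- file.path(tempdir(), "toy.csv")
rds_path <- file.path(tempdir(), "toy.rds")

# Save the same data in two formats
write.csv(toy, csv_path, row.names = FALSE)
saveRDS(toy, rds_path)

# First argument is the filepath; other arguments tune how the file is read
from_csv <- read.csv(csv_path, header = TRUE)   # plain text, types are re-guessed
from_rds <- readRDS(rds_path)                   # R's native format

str(from_csv)
str(from_rds)
```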
dplyr is a very popular tool for data manipulation in R that comes from tidyverse and is very similar to SQL. It may be used to reshape or summarize data.
data(presidential) # ships with ggplot2, which we loaded above
head(presidential)
## # A tibble: 6 × 4
## name start end party
## <chr> <date> <date> <chr>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Kennedy 1961-01-20 1963-11-22 Democratic
## 3 Johnson 1963-11-22 1969-01-20 Democratic
## 4 Nixon 1969-01-20 1974-08-09 Republican
## 5 Ford 1974-08-09 1977-01-20 Republican
## 6 Carter 1977-01-20 1981-01-20 Democratic
Say we only want the columns relating to the name and the party. We can use select in dplyr.
name_party <- presidential %>%
select(name, party)
What if we only wanted presidents who are Democrats? We may use filter.
name_party_dem <- presidential %>%
select(name, party) %>%
filter(party == 'Democratic')
We could do the same thing using != which is to be read as not equal.
name_party_dem <- presidential %>%
select(name, party) %>%
filter(party != 'Republican')
This dataset has a new (and slightly annoying) category of data - date! These can be tricky at times so it is good to review the properties.
FormatCode | Description | Example |
---|---|---|
%d | Day of the month as a decimal number (01-31) | 01 |
%m | Month as a decimal number (01-12) | 08 |
%b | Abbreviated month name (Jan, Feb, etc.) | Aug |
%B | Full month name (January, February, etc.) | August |
%y | Year without century (00-99) | 24 |
%Y | Year with century (e.g., 2024) | 2024 |
%H | Hour in 24-hour format (00-23) | 14 |
%I | Hour in 12-hour format (01-12) | 02 |
%M | Minute as a decimal number (00-59) | 45 |
%S | Second as a decimal number (00-59) | 30 |
%p | AM or PM designation | PM |
%a | Abbreviated weekday name (Mon, Tue, etc.) | Mon |
%A | Full weekday name (Monday, Tuesday, etc.) | Monday |
%j | Day of the year as a decimal number (001-366) | 230 |
%U | Week number of the year (Sunday as the first day of the week) | 34 |
%W | Week number of the year (Monday as the first day of the week) | 33 |
%Z | Time zone abbreviation | UTC |
%z | Time zone offset from UTC in the form ±HHMM | +0000 |
So the format code for our presidential data is %Y-%m-%d.
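Here is a short sketch of how the format codes are used in practice with base R's as.Date() and format(). The example date is arbitrary, and only numeric codes are used so that the results do not depend on your locale.

```r
d <- as.Date("2024-08-17")   # %Y-%m-%d strings parse without a format argument

format(d, "%d/%m/%Y")        # "17/08/2024"
format(d, "%j")              # "230", the day of the year

# Parsing any other layout requires the matching format codes
d2 <- as.Date("08/17/2024", format = "%m/%d/%Y")
d2 == d                      # TRUE

# Subtracting two dates gives the difference in days
term_days <- as.numeric(as.Date("1961-01-20") - as.Date("1953-01-20"))
term_days                    # 2922
```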
Let’s make a new column that is the number of years someone was president.
presidential <- presidential %>%
mutate(YearTerm = as.numeric(difftime(end, start, units = 'days')/365))
head(presidential)
## # A tibble: 6 × 5
## name start end party YearTerm
## <chr> <date> <date> <chr> <dbl>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01
## 2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84
## 3 Johnson 1963-11-22 1969-01-20 Democratic 5.17
## 4 Nixon 1969-01-20 1974-08-09 Republican 5.55
## 5 Ford 1974-08-09 1977-01-20 Republican 2.45
## 6 Carter 1977-01-20 1981-01-20 Democratic 4.00
Now let’s say we want the average term length by party. We can use group_by() to group the data by a variable, and summarise() to collapse each group into a single row.
presidential %>%
group_by(party) %>%
summarise(YearTerm = mean(YearTerm, na.rm = T))
## # A tibble: 2 × 2
## party YearTerm
## <chr> <dbl>
## 1 Democratic 5.60
## 2 Republican 5.72
What if we wanted the president that served the least amount of time?
presidential %>%
filter(YearTerm == min(YearTerm)) %>%
pull(name)
## [1] "Ford"
What if we wanted to rank them according to how long they served?
presidential %>%
arrange(desc(YearTerm)) %>%
mutate(serveID = row_number())
## # A tibble: 12 × 6
## name start end party YearTerm serveID
## <chr> <date> <date> <chr> <dbl> <int>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01 1
## 2 Reagan 1981-01-20 1989-01-20 Republican 8.01 2
## 3 Clinton 1993-01-20 2001-01-20 Democratic 8.01 3
## 4 Bush 2001-01-20 2009-01-20 Republican 8.01 4
## 5 Obama 2009-01-20 2017-01-20 Democratic 8.01 5
## 6 Nixon 1969-01-20 1974-08-09 Republican 5.55 6
## 7 Johnson 1963-11-22 1969-01-20 Democratic 5.17 7
## 8 Carter 1977-01-20 1981-01-20 Democratic 4.00 8
## 9 Bush 1989-01-20 1993-01-20 Republican 4.00 9
## 10 Trump 2017-01-20 2021-01-20 Republican 4.00 10
## 11 Kennedy 1961-01-20 1963-11-22 Democratic 2.84 11
## 12 Ford 1974-08-09 1977-01-20 Republican 2.45 12
An important concept is lagging values. We may want to lag a variable in order to calculate something like the change from the last time period. Generally we only lag in the time dimension, but we could do it across space as well. Note that here we are using time series data rather than a panel; the code would be different in the latter case.
presidential %>%
mutate(last.pres = lag(name, 1, order_by = start),
last.pres.party = lag(party, 1, order_by = start),
party.switch = ifelse(party != last.pres.party, 1,0)) %>%
select(name, last.pres, party, last.pres.party, party.switch)
## # A tibble: 12 × 5
## name last.pres party last.pres.party party.switch
## <chr> <chr> <chr> <chr> <dbl>
## 1 Eisenhower <NA> Republican <NA> NA
## 2 Kennedy Eisenhower Democratic Republican 1
## 3 Johnson Kennedy Democratic Democratic 0
## 4 Nixon Johnson Republican Democratic 1
## 5 Ford Nixon Republican Republican 0
## 6 Carter Ford Democratic Republican 1
## 7 Reagan Carter Republican Democratic 1
## 8 Bush Reagan Republican Republican 0
## 9 Clinton Bush Democratic Republican 1
## 10 Bush Clinton Republican Democratic 1
## 11 Obama Bush Democratic Republican 1
## 12 Trump Obama Republican Democratic 1
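For the panel case, the key difference is that the lag has to be taken within each unit. Below is a minimal base R sketch on a made-up two-unit panel; with dplyr, the same idea would usually be written by adding group_by(unit) before the mutate() call.

```r
# Hypothetical two-unit panel, already sorted by unit and then by year
panel <- data.frame(
  unit = rep(c('A', 'B'), each = 3),
  year = rep(2001:2003, times = 2),
  y    = c(1, 2, 4, 10, 20, 40)
)

# ave() applies the shift separately within each unit, so unit B's first
# year does not inherit unit A's last value
panel$y_lag <- ave(panel$y, panel$unit,
                   FUN = function(v) c(NA, head(v, -1)))
panel$y_change <- panel$y - panel$y_lag
panel
```

The first year of each unit gets NA for the lag, because there is no earlier observation for that unit.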
presidential %>%
group_by(YearTerm) %>%
summarise(n_pres = n()) %>%
filter(n_pres == max(n_pres)) %>%
pull(YearTerm) ; hist(presidential$YearTerm)
## [1] 8.005479
Data visualization is critical both for detecting problems with data (to assess its veracity) and for presenting results. Let’s simulate some data and do some visualizing.
x = rbinom(1, 100, .3) + rnorm(100)
z = runif(100)
y = 1 + .3*x + z*x + rnorm(100, 0, .3)
df = cbind.data.frame(y, x, z)
colnames(df)[1:3] = c('y', 'x', 'z')
We can call a column of a data.frame with a dollar sign. Note that the two plot() calls below produce the same scatterplot, apart from the labels.
plot(df$x, df$y, xlab = 'X', ylab = 'Y',
main = 'Our Plot', sub = 'Our Subtitle')
plot(df[,2], df[,1])
hist(df$x, main = 'Histogram of x')
In practice, you will almost never use base R plots. This is because they do not look very good and are a pain to customize. ggplot2 is the common plotting package in R.
library(ggplot2)
ggplot maps aesthetics from a dataframe into a plot object. ggplot objects are of mode list and of class ggplot.
ggplot(df, aes(x, y)) +
geom_point() +
xlab('X') + ylab('Y') +
ggtitle('Our Ggplot') +
theme_bw()
ggplot(df, aes(x)) +
geom_histogram() +
xlab('X') + ylab('Frequency') +
ggtitle('Our Second Ggplot') +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If you have any questions about the code, other coding issues, or generally want to talk about stats or IR stuff, please feel free to email me at dgrasse@cornell.edu. There is a lot we didn’t cover: how to write good functions, much more on data visualization, regular expressions, spatial data, web scraping, and a ton more left to uncover! Also, below are some resources in general for learning R.
I would also recommend checking out this site: https://rforpoliticalscience.com/2023/04/07/top-r-packages-for-downloading-political-science-and-economics-datasets/ where you can download a variety of R packages with preloaded data and documentation that you can play with on your own time to get more familiar with the language.