Intro

Hello! Welcome to the R walkthrough. I want to clarify a few things at the outset.

Things this is not

A lecture with an exam at the end that will determine your future
A tutorial where you are expected to have mastered and memorized every line of code by the end

Things this is

A chance to learn about the R programming language in its gory details.
An intentionally over-inclusive summary of what R is, with the goal of exposing you to as much material about it as possible to give you a reference when you troubleshoot on your own, debug, write code, and complete your coursework

With that in mind, if anything does not make sense at the outset, please ask questions, but do not worry too much about mastering everything at once. I am providing you with the code and the notes precisely so that you have a resource to consult later.

Primer on R

What is R?

R is an object oriented programming language. This means that the organization occurs around data or objects rather than functions or logic. Each object you see in R has a defined class and a type. When using R, you are manipulating objects, and errors occur most often because you apply a manipulation that is illegal in the language. We are going to briefly go over how information is stored in R to help you make sense of error messages when they occur.

What is RStudio?

RStudio is an integrated development environment (IDE). Think of it as an interface for the execution of R code.

Why use R?

R has some nice advantages:

It is open-source and free
Since it is object oriented, it gets your foot in the door for learning other object-oriented programming (think Python)
There is a strong and vocal community of users who regularly post resources and provide answers to common questions

Objects

R objects have a name, a mode, and a length. The mode tells you what type of data is in the object: character, numeric, logical, function, or list.

Character: Cobra Bubbles
Numeric: 1, 1.5555, -1
Logical: TRUE, FALSE
Function: log
List: Cobra Bubbles, Lilo, Stitch.

This is also why, if you use the function mode() in R, you will not get the statistical mode as your answer!
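
To make the list above concrete, here is a small sketch that asks R for the mode and length of a few objects. The object names are made up for illustration and do not appear elsewhere in these notes.

```r
# Hypothetical example objects, one per mode listed above
a_character <- 'Cobra Bubbles'
a_numeric   <- c(1, 1.5555, -1)
a_logical   <- c(TRUE, FALSE)
a_list      <- list('Cobra Bubbles', 1, TRUE)

mode(a_character)  # "character"
mode(a_numeric)    # "numeric"
mode(a_logical)    # "logical"
mode(log)          # "function"
mode(a_list)       # "list"
length(a_numeric)  # 3

# mode() reports how the values are stored, not the most frequent value
mode(c(1, 1, 2))   # "numeric", not 1
```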

These objects come in two types: recursive and atomic. The atomic modes are numeric, character, and logical. If an object is atomic, it holds only one mode.

To get a bit deeper here, let’s think about vectors and matrices. A vector is a one-dimensional object of length n. We can place a collection of items into a vector using c(), for concatenate.

fav_food = c('chocolate', 'peanut butter', 'chicken')
fav_food

## [1] "chocolate"     "peanut butter" "chicken"

This is a collection of characters. We can make a bigger character matrix by adding a column.

food_mat = matrix(nrow = 3, ncol = 2, c('chocolate', 'peanut butter', 
                  'chicken', 'yes', 'yes', 'no'))
food_mat

##      [,1]            [,2] 
## [1,] "chocolate"     "yes"
## [2,] "peanut butter" "yes"
## [3,] "chicken"       "no"

Rows and columns are indexed, so that elements can be extracted from them. The first index is the row, and the second is the column.

food_mat[1,]

## [1] "chocolate" "yes"

food_mat[,1]

## [1] "chocolate"     "peanut butter" "chicken"

food_mat[1,1]

## [1] "chocolate"

What if we tried this two-index subsetting on a vector such as fav_food? We would get an error, because a vector has only one dimension.
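
A minimal sketch of that error is below. It wraps the bad call in tryCatch() so the script keeps running; the name msg is just a placeholder for this illustration.

```r
fav_food <- c('chocolate', 'peanut butter', 'chicken')

# A plain vector has no dim attribute, so it has no rows or columns
dim(fav_food)   # NULL

# Matrix-style indexing on a vector raises an error; tryCatch() captures
# the error message as a string instead of stopping the script
msg <- tryCatch(fav_food[1, 1], error = function(e) conditionMessage(e))
msg             # in an English locale: "incorrect number of dimensions"

# A single index is the right tool for a vector
fav_food[1]     # "chocolate"
```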

Coercion occurs when modes are forced to change to be the same within an atomic object. Consider the following matrix.

food_mat = matrix(nrow = 3, ncol = 2, c('chocolate', 'peanut butter', 
                  'chicken', 1, 2, 0))
food_mat

##      [,1]            [,2]
## [1,] "chocolate"     "1" 
## [2,] "peanut butter" "2" 
## [3,] "chicken"       "0"

Why did this happen? Because a matrix is atomic, it can hold only one mode. Therefore, the numbers were coerced to the same mode as the words. Coercion follows the hierarchy character, numeric, logical: if any element is a character string, everything becomes character; otherwise, if any element is numeric, logical values become numeric (TRUE becomes 1 and FALSE becomes 0).
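
The hierarchy is easy to see with c(), which also builds an atomic object. The sketch below uses throwaway values and is only meant to illustrate the ordering.

```r
# Logical mixed with numeric: logicals become numbers
lgl_num <- c(TRUE, FALSE, 2)
lgl_num            # 1 0 2
mode(lgl_num)      # "numeric"

# Anything mixed with character: everything becomes character
all_chr <- c(TRUE, 2, 'chicken')
all_chr            # "TRUE" "2" "chicken"
mode(all_chr)      # "character"

# Forcing a conversion that cannot succeed gives NA (with a warning,
# silenced here so the example runs cleanly)
forced <- suppressWarnings(as.numeric(c('1', '2', 'chicken')))
forced             # 1 2 NA
```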

Meanwhile, a recursive type is a function or a list. Recursive objects may include different modes. Think of a dataframe where you have a column of names and a column of salaries associated with those names. We can get the i-th element of a list with [].

n_item <- c(1,2,0)
fav_food <- c('chocolate', 'peanut butter', 'chicken') 
our_list <- list(n_item, fav_food)
our_list[1]

## [[1]]
## [1] 1 2 0

We can extract elements from a list using either [] or [[]]. The difference is that [] returns a smaller list containing the selected item, whereas [[]] returns the contents of the item itself. Let’s make a bigger list to see.

taste_good = c(1, 0, 1)
allergic = c(0, 0, 0)
second_list = list(taste_good, allergic)
our_big_list = list(our_list, second_list)
our_big_list[[1]][[1]]

## [1] 1 2 0

our_big_list is a list of lists: it has two elements, each one being a list. Our command that used [[1]][[1]] extracted the first element of the first list.
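
The difference between the two bracket styles shows up most clearly in the class of what comes back. This is a small sketch that rebuilds our_list so it runs on its own; sub_list and contents are placeholder names.

```r
n_item   <- c(1, 2, 0)
fav_food <- c('chocolate', 'peanut butter', 'chicken')
our_list <- list(n_item, fav_food)

sub_list <- our_list[1]     # still a list, of length 1
contents <- our_list[[1]]   # the numeric vector stored inside

class(sub_list)   # "list"
class(contents)   # "numeric"

# Functions that expect a vector work on the contents
mean(contents)    # 1
```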

Overall, a list can include both numeric and character modes. Indeed, a dataframe is a type of list. Let’s make a dataframe.

our_df = data.frame(our_list)
colnames(our_df)[1:2] <- c('n.item', 'food.name')
mode(our_df);our_df

## [1] "list"

##   n.item     food.name
## 1      1     chocolate
## 2      2 peanut butter
## 3      0       chicken

A function is like a machine that takes inputs and makes outputs. There are plenty of canned R functions that are included in the base language. Take, for example, mean.

mode(mean);mean(n_item, na.rm = T)

## [1] "function"

## [1] 1

How do we name objects? It is best to separate the words in long names with an underscore or a period. It may be good practice to use different separators for different object types (periods for functions and lists, underscores for vectors or matrices), but of course this is your prerogative.

Assignment

We use == for logical comparison, <- for assignment, and = for function arguments. Technically = can also be used for assignment, which has some nice advantages (it’s not white-space dependent, takes fewer keystrokes, and is easy to read) but a potential drawback: inside a function call, R treats the name on the left of = as a named function argument rather than creating an object.

a <- 'cobra bubbles'
b = 'cobra bubbles'
a

## [1] "cobra bubbles"

b

## [1] "cobra bubbles"

Here is an example of the different assignment operators in action. x is a persistent object in the first line, and a function argument in the second.

x = seq(1,10,1)
sample(x = seq(1,10,1), replace = FALSE, size = 2)

## [1] 2 6
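
One way to see the distinction is to check whether an object exists after each kind of call. The sketch below is an illustration only; my_values and the other names are hypothetical.

```r
# '=' inside a call only names an argument of sum(); nothing is saved
total_1 <- sum(my_values = 1:3)
created_by_equals <- exists('my_values', inherits = FALSE)
created_by_equals   # FALSE

# '<-' inside a call creates my_values and then passes it to sum()
total_2 <- sum(my_values <- 1:3)
created_by_arrow <- exists('my_values', inherits = FALSE)
created_by_arrow    # TRUE

c(total_1, total_2) # both sums are 6
```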

What do we mean when we say logic? Let’s say we want to make a variable scored 1 when the string reads 'cobra bubbles'.

as.numeric(a == 'cobra bubbles', 1,0)

## [1] 1

Our operation here is logical: a == 'cobra bubbles' evaluates to TRUE, and as.numeric() converts TRUE to 1 and FALSE to 0. The trailing 1, 0 are not needed; as.numeric() ignores them, so as.numeric(a == 'cobra bubbles') gives the same result.

Let’s see an example of this operation where we ignore the grammar and use only one equals sign. Who can explain this error message?

Error in as.numeric(a = "cobra bubbles", 1, 0) : supplied argument name 'a' does not match 'x'

Assessing Objects

Let’s look at a preloaded dataset in R - Titanic.

data("Titanic")

Here are some useful functions for assessing objects. First is str(), which tells us the structure of the data.

str(Titanic)

##  'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"

Attributes

attributes(Titanic)

## $dim
## [1] 4 2 2 2
## 
## $dimnames
## $dimnames$Class
## [1] "1st"  "2nd"  "3rd"  "Crew"
## 
## $dimnames$Sex
## [1] "Male"   "Female"
## 
## $dimnames$Age
## [1] "Child" "Adult"
## 
## $dimnames$Survived
## [1] "No"  "Yes"
## 
## 
## $class
## [1] "table"

nrow

nrow(Titanic)

## [1] 4

ncol

ncol(Titanic)

## [1] 2

colnames

colnames(Titanic)

## [1] "Male"   "Female"

Mathematical Operations

Basic operations

x = rnorm(1000)
y = runif(1000)
z = rbinom(1000, 1, .3)

x_times_y = x*y
x_divide_y = x/y
x_exp_y = x^y
x_plus_y = x+y
x_minus_y = x-y

Vector/Matrix operations

x_vec = c(10, 4, 7)
y_vec = c(12, 1, 6)
z_vec = c(11, 0, 22)

We can combine vectors and matrices with rbind or cbind.

First, rbind will bind by the rows. Each row of the dataset will be the corresponding vector

rbind(x_vec,y_vec,  z_vec)

##       [,1] [,2] [,3]
## x_vec   10    4    7
## y_vec   12    1    6
## z_vec   11    0   22

Second, cbind will bind by columns. The first row will be the first elements of each vector.

cbind(x_vec,y_vec,  z_vec)

##      x_vec y_vec z_vec
## [1,]    10    12    11
## [2,]     4     1     0
## [3,]     7     6    22

We can use cbind.data.frame() or rbind.data.frame() in place of cbind() or rbind() to get a dataframe rather than a matrix when we combine the elements.
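
As a quick sketch of the difference, the block below binds the same three vectors both ways and checks what kind of object comes back. The names bound_mat, bound_df, and converted_df are placeholders.

```r
x_vec <- c(10, 4, 7)
y_vec <- c(12, 1, 6)
z_vec <- c(11, 0, 22)

bound_mat <- cbind(x_vec, y_vec, z_vec)             # numeric matrix
bound_df  <- cbind.data.frame(x_vec, y_vec, z_vec)  # data frame

is.matrix(bound_mat)      # TRUE
is.data.frame(bound_df)   # TRUE

# An existing matrix can also be converted after the fact
converted_df <- as.data.frame(bound_mat)
names(converted_df)       # "x_vec" "y_vec" "z_vec"
```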

We can also perform matrix algebra in R. This is how we do regressions. We use %*% for matrix multiplication, solve() for matrix inversion, and t() for transpose.

x_mat = cbind(rep(1, 10), rbinom(10, 1, .3), rnorm(10), runif(10))
y_mat = rnorm(10)
b = solve(t(x_mat)%*%x_mat)%*%t(x_mat)%*%y_mat

As it turns out, multiplying the transpose of a matrix by another matrix is common enough that the function crossprod() performs the same operation. We will get the same answer:

b<-solve(crossprod(x_mat), crossprod(x_mat,y_mat))

We can also perform the linear regression of y on x and see that our operation provides the same numerical results.

lm(y_mat ~ x_mat[,2:4]) |>
  summary()

## 
## Call:
## lm(formula = y_mat ~ x_mat[, 2:4])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9715 -0.7036  0.1618  0.4811  0.9918 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.58389    0.56503   1.033    0.341
## x_mat[, 2:4]1 -0.05985    0.64344  -0.093    0.929
## x_mat[, 2:4]2  0.27162    0.44156   0.615    0.561
## x_mat[, 2:4]3 -0.58541    0.93566  -0.626    0.555
## 
## Residual standard error: 0.8795 on 6 degrees of freedom
## Multiple R-squared:  0.1391, Adjusted R-squared:  -0.2914 
## F-statistic: 0.3231 on 3 and 6 DF,  p-value: 0.8091
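
To check the equivalence directly rather than by eye, the sketch below computes b = (X'X)^(-1) X'y three ways and compares the results. It simulates its own data with a fixed seed, and it swaps the random dummy column for a deterministic 0/1 pattern (an assumption made here so the design matrix is guaranteed to be invertible).

```r
set.seed(1)
n <- 10

# Design matrix: intercept, a deterministic 0/1 dummy, and two random columns
x_mat <- cbind(rep(1, n), rep(c(0, 1), length.out = n), rnorm(n), runif(n))
y_mat <- rnorm(n)

# 1. Textbook formula with explicit transpose and inverse
b_manual <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% y_mat

# 2. Same estimate via crossprod(), solving the system directly
b_cross <- solve(crossprod(x_mat), crossprod(x_mat, y_mat))

# 3. lm() adds its own intercept, so drop the first column of x_mat
b_lm <- coef(lm(y_mat ~ x_mat[, 2:4]))

all.equal(as.numeric(b_manual), as.numeric(b_cross))  # TRUE
all.equal(as.numeric(b_manual), unname(b_lm))         # TRUE
```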

Linear Regression

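lm() is the canned regression function. In practice, you will likely use more specialized packages for regression and for computing standard errors (estimatr, fixest, and lfe, to name a few). One funny quirk of lm() is that its formula interface does not treat / as division: y ~ x/z expands to x + x:z (z nested within x), as the output below shows. To divide inside a formula, wrap the expression in I(), as in I(x/z).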

y = rnorm(1000)
x = rnorm(1000)
z = rnorm(1000)
lm(y ~ x/z) |>
  summary()

## 
## Call:
## lm(formula = y ~ x/z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.87685 -0.73976  0.01919  0.68222  3.06850 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03446    0.03262  -1.057    0.291
## x            0.04704    0.03277   1.435    0.152
## x:z         -0.02731    0.03166  -0.863    0.389
## 
## Residual standard error: 1.03 on 997 degrees of freedom
## Multiple R-squared:  0.002827,   Adjusted R-squared:  0.0008271 
## F-statistic: 1.413 on 2 and 997 DF,  p-value: 0.2438

We can also use I(), which tells R to evaluate an expression as is, to create dummies inside of lm()

lm(y ~ I(x>0)) |>
  summary()

## 
## Call:
## lm(formula = y ~ I(x > 0))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8756 -0.7267  0.0218  0.6819  3.0434 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.05708    0.04556  -1.253    0.211
## I(x > 0)TRUE  0.04086    0.06522   0.627    0.531
## 
## Residual standard error: 1.031 on 998 degrees of freedom
## Multiple R-squared:  0.0003932,  Adjusted R-squared:  -0.0006084 
## F-statistic: 0.3926 on 1 and 998 DF,  p-value: 0.5311

If we wanted to do an interaction the normal way, we would use *, like in multiplication. The formula y ~ x*z expands to x + z + x:z.

lm(y ~ x*z) |> 
  summary()

## 
## Call:
## lm(formula = y ~ x * z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.86710 -0.74079  0.01939  0.68271  3.06337 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03428    0.03264  -1.050    0.294
## x            0.04669    0.03281   1.423    0.155
## z            0.01017    0.03253   0.313    0.755
## x:z         -0.02664    0.03175  -0.839    0.402
## 
## Residual standard error: 1.031 on 996 degrees of freedom
## Multiple R-squared:  0.002925,   Adjusted R-squared:  -7.795e-05 
## F-statistic: 0.974 on 3 and 996 DF,  p-value: 0.4042

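The : operator adds only the interaction term, without the linear terms. Here we add the linear terms back by hand, so the model is identical to the one above.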

lm(y ~ x:z + x + z) |> 
  summary()

## 
## Call:
## lm(formula = y ~ x:z + x + z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.86710 -0.74079  0.01939  0.68271  3.06337 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03428    0.03264  -1.050    0.294
## x            0.04669    0.03281   1.423    0.155
## z            0.01017    0.03253   0.313    0.755
## x:z         -0.02664    0.03175  -0.839    0.402
## 
## Residual standard error: 1.031 on 996 degrees of freedom
## Multiple R-squared:  0.002925,   Adjusted R-squared:  -7.795e-05 
## F-statistic: 0.974 on 3 and 996 DF,  p-value: 0.4042
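
A quick way to see how R reads the formula operators used in this section is to ask for the term labels of each formula. This is a sketch for inspection only; it does not fit any model and needs no data.

```r
# terms() expands a formula; term.labels lists the right-hand-side terms
labels_slash <- attr(terms(y ~ x / z), 'term.labels')
labels_star  <- attr(terms(y ~ x * z), 'term.labels')
labels_colon <- attr(terms(y ~ x:z), 'term.labels')
labels_asis  <- attr(terms(y ~ I(x / z)), 'term.labels')

labels_slash   # "x" "x:z"      nesting, not division
labels_star    # "x" "z" "x:z"  both linear terms plus the interaction
labels_colon   # "x:z"          the interaction only
labels_asis    # "I(x/z)"       arithmetic division, protected by I()
```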

The output of lm() is a list object.

model = lm(y ~ x:z + x + z) 
model

## 
## Call:
## lm(formula = y ~ x:z + x + z)
## 
## Coefficients:
## (Intercept)            x            z          x:z  
##    -0.03428      0.04669      0.01017     -0.02664

We can take objects from the list to make plots or save them for other analyses.

model$coefficients

## (Intercept)           x           z         x:z 
## -0.03427726  0.04669328  0.01016843 -0.02664422

Generating Variables

The most common way variables are created is with as.numeric() and ifelse(). Both work from a logical condition and generate indicators based on it. as.numeric() can only map a condition into a binary (0, 1) space, but ifelse() can be used for more complex operations.

as.numeric(x_vec == 10)

## [1] 1 0 0

ifelse(x_vec == 10, x_vec*-10, x_vec*10)

## [1] -100   40   70

You may want to do something like the above ifelse statement if you need to make a continuous variable based on a logical condition. For example, you may want a variable to be scored negative or positive based on a condition D (as in regression discontinuity) or may want to create a treatment variable that is scored as years until an event for treated units and 0 for all controls (as in panel event study designs).
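
As a rough sketch of the event-study case, the block below builds a tiny made-up panel and scores treated units by years relative to the event, with 0 for controls. All values and names (year, treated, event_year, rel_time) are hypothetical.

```r
# Hypothetical panel: three units observed 2003-2006; the first two are treated
year       <- rep(2003:2006, times = 3)
treated    <- rep(c(1, 1, 0), each = 4)
event_year <- 2005

# Years relative to the event for treated units, 0 for every control row
rel_time <- ifelse(treated == 1, year - event_year, 0)
rel_time   # -2 -1 0 1 -2 -1 0 1 0 0 0 0
```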

Iteration

We may want to write loops when we need to perform the same action many times. This is much faster to type than repeating the same thing by hand. Often, if something is going to happen more than 3 times, it is good practice to write a loop. In addition to saving time, you reduce the chances of mistakes and make them easier to fix if they do occur. There are two types of loops, for and while. A for loop is used when we know the number of iterations. A while loop is used when the number of iterations is unknown, and we test a condition before each pass through the loop.

numbers <- c(1, 2, 3, 4, 10)
# Initialize an empty vector to store results
squared_numbers_for <- numeric(length(numbers))

# Use a for loop to calculate squares
for (i in seq_along(numbers)) {
  squared_numbers_for[i] <- numbers[i]^2
}

# Print the results
print(squared_numbers_for)

## [1]   1   4   9  16 100

Here is an example of a while loop.

# Initialize an empty vector to store results
squared_numbers_while <- numeric(length(numbers))

# Initialize the index variable
i <- 1

# Use a while loop to calculate squares
while (i <= length(numbers)) {
  squared_numbers_while[i] <- numbers[i]^2
  i <- i + 1
}

# Print the results
print(squared_numbers_while)

## [1]   1   4   9  16 100

Another silly/classic example of a while loop

dice <- 1
while (dice <= 6) {
  if (dice < 6) {
    print("No Yahtzee")
  } else {
    print("Yahtzee!")
  }
  dice <- dice + 1
}

## [1] "No Yahtzee"
## [1] "No Yahtzee"
## [1] "No Yahtzee"
## [1] "No Yahtzee"
## [1] "No Yahtzee"
## [1] "Yahtzee!"

There are other ways to repeat processes. We may use sapply() or lapply(). Each takes a vector or list as an argument, applies a function to each element, and returns the output. sapply() will try to simplify the output as much as possible (here, to a character vector), while lapply() always returns a list.

names = c('CARMY', 'COUSIN', 'CLAIRE')
output_sapply = sapply(names, 'tolower')
output_lapply = lapply(names, 'tolower')
class(output_sapply);class(output_lapply)

## [1] "character"

## [1] "list"

Meanwhile, apply() can be used on the rows or columns of a matrix.

mat <- matrix(1:12, nrow = 3, byrow = TRUE)
row_sums <- apply(mat, 1, sum)
row_sums

## [1] 10 26 42

col_sums <- apply(mat, 2, sum)

This saves us a lot of time - a for loop here would look at each column/row i, and then apply the sum function for each one, store it, and then we would save the results. This gives us one simple line of code :).
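
For comparison, here is roughly what the for-loop version of the row sums would look like, along with a check that it matches apply(). The name row_sums_loop is a placeholder; rowSums() is a base R shortcut for the same result.

```r
mat <- matrix(1:12, nrow = 3, byrow = TRUE)

# Loop version: visit each row, sum it, and store the result
row_sums_loop <- numeric(nrow(mat))
for (i in seq_len(nrow(mat))) {
  row_sums_loop[i] <- sum(mat[i, ])
}
row_sums_loop                                  # 10 26 42

# The one-line versions give the same numbers
all.equal(row_sums_loop, as.numeric(apply(mat, 1, sum)))  # TRUE
all.equal(row_sums_loop, rowSums(mat))                    # TRUE
```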

Subsetting Data Frames

Recall that dataframes are just like lists or matrices. We can select rows and columns in the same way. Let’s make a silly dataset and mess around with it.

food = c('hotdog', 'hamburger', 'taco', 'pbj')
sandwitch = c('no', 'no', 'no', 'yes')
lunch = c('yes', 'yes', 'no', 'yes')
n_serve = c(100, 200,300, 400)

food_df = cbind.data.frame(food, sandwitch, lunch, n_serve)
# Optional: rename the columns
# colnames(food_df) = c('food', 'sand', 'lunch', 'n_serve')

food_df_lunch = subset(food_df, lunch == 'yes')
food_df_lunch

##        food sandwitch lunch n_serve
## 1    hotdog        no   yes     100
## 2 hamburger        no   yes     200
## 4       pbj       yes   yes     400

food_df_c1_c2 = food_df[,c(1,2)]
food_df_c1_c2

##        food sandwitch
## 1    hotdog        no
## 2 hamburger        no
## 3      taco        no
## 4       pbj       yes

Workspace Management

We set up our workspace first by setting a working directory. Those familiar with Stata should understand this concept: it sets the directory against which relative file paths are resolved when we load and save files.

setwd('~/Dropbox')
getwd()

## [1] "/Users/donaldgrasse/Dropbox"

While many things can be accomplished in base R, most of the time you will need to load packages. Packages are collections of functions with documentation that R users create to perform specialized tasks. To use a package, we must first install it. We do so using the install.packages() function, wrapping the package name in quotation marks.

We can check if a package is installed by using installed.packages(). Once installed, we can load packages with either library() or require(). The difference is that library() stops with an error if the package is missing, whereas require() returns FALSE with a warning.

#install.packages('estimatr')
#library(estimatr)

Often there will be times when you want to load many packages at once. You can place all of the package names you want into a character vector and load each one with lapply().

pack = c('ggplot2', 'dplyr')
lapply(pack, require, character.only = T)

## Loading required package: ggplot2

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE

Sometimes you may be using a different machine, your coauthor may also be running the code, or you may want to transition easily to a new computer. You may then want to set up your script so that it finds and installs packages that are needed but not installed.

pack = c('ggplot2', 'dplyr')
new.pack <- pack[!(pack %in% installed.packages()[,"Package"])]
if(length(new.pack)) install.packages(new.pack)

There is also a package called pacman that will do this for you. As an exercise, read about this package and try using it on your own (see https://cran.r-project.org/web/packages/pacman/index.html). You should always put something like this at the very top of your script.

When you want to call a function from an installed package without loading the package, you may use packagename::function(). You may want to do this if loading the entire package would cause conflicts (a conflict occurs if a function has the same name in two packages).
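
A small sketch of the :: syntax follows. It uses stats, which ships with R, so it runs anywhere; the dplyr line is guarded and only runs if dplyr happens to be installed. The filter conflict is the one reported in the package loading messages above.

```r
# Call sd() through its package name; stats does not need to be attached
spread <- stats::sd(c(10, 4, 7))
spread   # 3

# filter() exists in both stats and dplyr, so naming the package removes
# any ambiguity about which one is meant
if (requireNamespace('dplyr', quietly = TRUE)) {
  dplyr::filter(data.frame(v = 1:3), v > 1)
}
```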

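If you need help with a function, type ?functionname (for example, ?lm). To search the documentation for a keyword, use ??keyword. For an overview of a package, use help(package = 'packagename').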

Perhaps the most used set of packages is the tidyverse, a collection of packages that share a common design for data cleaning and analysis. There is a bit of an ideological war between base-R and tidyverse users. I’d recommend making yourself adept at both; from your perspective there is not much to gain by drawing a hard line in the sand.

Reading data

There are various functions for reading in data, whether native to R or foreign to it (Stata or Excel, for instance): read.csv(), readRDS(), read.delim(), haven::read_dta(), and readxl::read_xlsx(). All of these functions have a similar syntax: supply the file path for the data and correctly specify any auxiliary arguments.
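
Here is a minimal round trip with read.csv(). So that it runs anywhere, the sketch first writes a small made-up dataset to a temporary file; tmp_path and food_in are placeholder names, and in practice you would pass the path to your own file.

```r
# Write a tiny hypothetical dataset to a temporary .csv file
tmp_path <- tempfile(fileext = '.csv')
write.csv(data.frame(food = c('taco', 'pbj'), n_serve = c(300, 400)),
          tmp_path, row.names = FALSE)

# Read it back: the file path comes first, then auxiliary arguments
food_in <- read.csv(tmp_path, header = TRUE, stringsAsFactors = FALSE)
str(food_in)

# Clean up the temporary file
unlink(tmp_path)
```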

Data Management and Manipulation using Tidyverse

dplyr is a very popular tool for data manipulation in R that comes from tidyverse and is very similar to SQL. It may be used to reshape or summarize data.

data(presidential)  # ships with ggplot2, which was loaded above
head(presidential)

## # A tibble: 6 × 4
##   name       start      end        party     
##   <chr>      <date>     <date>     <chr>     
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Kennedy    1961-01-20 1963-11-22 Democratic
## 3 Johnson    1963-11-22 1969-01-20 Democratic
## 4 Nixon      1969-01-20 1974-08-09 Republican
## 5 Ford       1974-08-09 1977-01-20 Republican
## 6 Carter     1977-01-20 1981-01-20 Democratic

Say we only want the columns relating to the name and the party. We can use select() in dplyr.

name_party <- presidential %>%  
  select(name, party)

What if we only wanted presidents who are Democrats? We may use filter.

name_party_dem <- presidential %>%  
  select(name, party) %>% 
  filter(party == 'Democratic')

We could do the same thing using != which is to be read as not equal.

name_party_dem <- presidential %>%  
  select(name, party) %>% 
  filter(party != 'Republican')

This dataset has a new (and slightly annoying) category of data - date! These can be tricky at times so it is good to review the properties.

Date Format Codes
FormatCode	Description	Example
%d	Day of the month as a decimal number (01-31)	01
%m	Month as a decimal number (01-12)	08
%b	Abbreviated month name (Jan, Feb, etc.)	Aug
%B	Full month name (January, February, etc.)	August
%y	Year without century (00-99)	24
%Y	Year with century (e.g., 2024)	2024
%H	Hour in 24-hour format (00-23)	14
%I	Hour in 12-hour format (01-12)	02
%M	Minute as a decimal number (00-59)	45
%S	Second as a decimal number (00-59)	30
%p	AM or PM designation	PM
%a	Abbreviated weekday name (Mon, Tue, etc.)	Mon
%A	Full weekday name (Monday, Tuesday, etc.)	Monday
%j	Day of the year as a decimal number (001-366)	230
%U	Week number of the year (Sunday as the first day of the week)	34
%W	Week number of the year (Monday as the first day of the week)	33
%Z	Time zone abbreviation	UTC
%z	Time zone offset from UTC in the form ±HHMM	+0000

So the format code for our presidential data is '%Y-%m-%d'.
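
The codes in the table are used in two directions: as.Date() reads a string using a format, and format() writes a date back out in another one. The sketch below sticks to numeric codes so the results do not depend on the locale; the object names and the second date string are illustrative only.

```r
# Parse strings into Date objects by describing their layout
inaug <- as.Date('1953-01-20', format = '%Y-%m-%d')
other <- as.Date('08/20/2024', format = '%m/%d/%Y')
class(other)                 # "Date"

# Write dates back out in a different layout
format(other, '%Y-%m-%d')    # "2024-08-20"
format(inaug, '%d/%m/%y')    # "20/01/53"

# Dates support arithmetic: the difference is measured in days
days_between <- as.numeric(other - inaug)
days_between
```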

Let’s make a new column that is the number of years someone was president.

presidential <- presidential %>%  
  mutate(YearTerm = as.numeric(difftime(end, start, units = 'days')/365))
head(presidential)

## # A tibble: 6 × 5
##   name       start      end        party      YearTerm
##   <chr>      <date>     <date>     <chr>         <dbl>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican     8.01
## 2 Kennedy    1961-01-20 1963-11-22 Democratic     2.84
## 3 Johnson    1963-11-22 1969-01-20 Democratic     5.17
## 4 Nixon      1969-01-20 1974-08-09 Republican     5.55
## 5 Ford       1974-08-09 1977-01-20 Republican     2.45
## 6 Carter     1977-01-20 1981-01-20 Democratic     4.00

Now let’s say we want the average term length by party. We can use group_by(), which groups the data by a variable, followed by summarise(), which collapses each group to a single row.

presidential %>%  
  group_by(party) %>%  
  summarise(YearTerm = mean(YearTerm, na.rm = T))

## # A tibble: 2 × 2
##   party      YearTerm
##   <chr>         <dbl>
## 1 Democratic     5.60
## 2 Republican     5.72

What if we wanted the president that served the least amount of time?

presidential %>%  
  filter(YearTerm == min(YearTerm)) %>% 
  pull(name)

## [1] "Ford"

What if we wanted to rank them by how long they served?

presidential %>%  
  arrange(desc(YearTerm)) %>% 
  mutate(serveID = row_number())

## # A tibble: 12 × 6
##    name       start      end        party      YearTerm serveID
##    <chr>      <date>     <date>     <chr>         <dbl>   <int>
##  1 Eisenhower 1953-01-20 1961-01-20 Republican     8.01       1
##  2 Reagan     1981-01-20 1989-01-20 Republican     8.01       2
##  3 Clinton    1993-01-20 2001-01-20 Democratic     8.01       3
##  4 Bush       2001-01-20 2009-01-20 Republican     8.01       4
##  5 Obama      2009-01-20 2017-01-20 Democratic     8.01       5
##  6 Nixon      1969-01-20 1974-08-09 Republican     5.55       6
##  7 Johnson    1963-11-22 1969-01-20 Democratic     5.17       7
##  8 Carter     1977-01-20 1981-01-20 Democratic     4.00       8
##  9 Bush       1989-01-20 1993-01-20 Republican     4.00       9
## 10 Trump      2017-01-20 2021-01-20 Republican     4.00      10
## 11 Kennedy    1961-01-20 1963-11-22 Democratic     2.84      11
## 12 Ford       1974-08-09 1977-01-20 Republican     2.45      12

An important concept is lagging values. We may want to lag a variable to calculate something like the change from the last time period. Generally we only lag in the time dimension, but we could do it across space as well. Note that here we are using time series data rather than a panel; in a panel, we would need to lag within each unit.

presidential %>%  
  mutate(last.pres = lag(name, 1, order_by = start), 
         last.pres.party = lag(party, 1, order_by = start), 
         party.switch = ifelse(party != last.pres.party, 1,0)) %>%  
  select(name, last.pres, party, last.pres.party, party.switch)

## # A tibble: 12 × 5
##    name       last.pres  party      last.pres.party party.switch
##    <chr>      <chr>      <chr>      <chr>                  <dbl>
##  1 Eisenhower <NA>       Republican <NA>                      NA
##  2 Kennedy    Eisenhower Democratic Republican                 1
##  3 Johnson    Kennedy    Democratic Democratic                 0
##  4 Nixon      Johnson    Republican Democratic                 1
##  5 Ford       Nixon      Republican Republican                 0
##  6 Carter     Ford       Democratic Republican                 1
##  7 Reagan     Carter     Republican Democratic                 1
##  8 Bush       Reagan     Republican Republican                 0
##  9 Clinton    Bush       Democratic Republican                 1
## 10 Bush       Clinton    Republican Democratic                 1
## 11 Obama      Bush       Democratic Republican                 1
## 12 Trump      Obama      Republican Democratic                 1
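
For the panel case, the lag has to restart within each unit so that one unit's last value does not leak into the next unit's first row. Below is a base R sketch on a made-up two-unit panel (all names and values are hypothetical); in dplyr the usual approach is to add group_by(unit) before the mutate() step.

```r
# Hypothetical panel: two units observed over three years
panel <- data.frame(unit = rep(c('A', 'B'), each = 3),
                    year = rep(2001:2003, times = 2),
                    y    = c(1, 2, 4, 10, 10, 13))

# Sort by unit and time so "previous row" means "previous year"
panel <- panel[order(panel$unit, panel$year), ]

# Shift y down one row within each unit; the first year of a unit gets NA
panel$y_lag <- ave(panel$y, panel$unit,
                   FUN = function(v) c(NA, head(v, -1)))
panel$y_change <- panel$y - panel$y_lag
panel
```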

We can also find the most common term length and plot a histogram of all term lengths.

presidential %>% 
  group_by(YearTerm) %>%  
  summarise(n_pres = n()) %>%  
  filter(n_pres == max(n_pres)) %>% 
  pull(YearTerm) ; hist(presidential$YearTerm)

## [1] 8.005479

Data Visualization

Data visualization is critical both for detecting problems with data and for presenting results. Let’s simulate some data and do some visualizing.

x = rbinom(1, 100, .3) + rnorm(100) 
z = runif(100)
y = 1 + .3*x + z*x + rnorm(100, 0, .3)
df = cbind.data.frame(y, x, z)
colnames(df)[1:3] = c('y', 'x', 'z')

We can extract a column of a data.frame as a vector with a dollar sign, or by position with square brackets. Note that the two plot() calls below draw the same points; only the labels differ.

plot(df$x, df$y, xlab = 'X', ylab = 'Y', 
     main = 'Our Plot', sub = 'Our Subtitle')

plot(df[,2], df[,1])

hist(df$x, main = 'Histogram of x')

In practice, you will almost never use base R plots. This is because they do not look very good and are a pain to customize. ggplot2 is the common plotting package in R.

library(ggplot2)

ggplot maps aesthetics from a dataframe into a plot object. ggplot objects are of mode list and of class ggplot.

ggplot(df, aes(x, y)) + 
  geom_point() + 
  xlab('X') + ylab('Y') + 
  ggtitle('Our Ggplot') + 
  theme_bw()

ggplot(df, aes(x)) +  
  geom_histogram() + 
  xlab('X') + ylab('Frequency') + 
  ggtitle('Our Second Ggplot') + 
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If you have any questions about the code, other coding issues, or generally want to talk about stats or IR stuff, please feel free to email me at dgrasse@cornell.edu. There is a lot we didn’t cover: how to write good functions, much more on data visualization, regular expressions, spatial data, and web scraping. There is a ton left to uncover!

I would also recommend checking out this site: https://rforpoliticalscience.com/2023/04/07/top-r-packages-for-downloading-political-science-and-economics-datasets/ where you can download a variety of R packages with preloaded data and documentation that you can play with on your own time to get more familiar with the language.

R Programming Tutorial

Donald Grasse

2024-08-20