1 Intro to R

In Quarto documents like this one, we can write comments by just using plain text. In contrast, code needs to be within code blocks, like the one below. To execute a code block, you can click on the little “Play” button or press Cmd/Ctrl + Shift + Enter while your cursor is inside the code block.

2 + 2

[1] 4

That was our first R command, a simple math operation. Of course, we can also do more complex arithmetic:

12345 ^ 2  / (200 + 25 - 6 * 2) # this is an inline comment, see the leading "#"

[1] 715488.4
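To see where this number comes from, it can help to run the pieces separately. The sketch below relies on R's usual order of operations (exponentiation first, then multiplication and division, then addition and subtraction); the split into three lines is ours, purely for illustration.

```r
12345 ^ 2          # exponentiation is evaluated first: 152399025
200 + 25 - 6 * 2   # inside the parentheses, 6 * 2 comes before + and -: 213
152399025 / 213    # dividing the two pieces reproduces the result above
```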

To create a code block, you can press Cmd/Ctrl + Alt + I or click on the little green “+C” icon at the top of the script.

Exercise

Create your own code block below and run a math operation.

1.1 Objects

A huge part of R is working with objects. Let’s see how they work:

my_object <- 10 # opt/alt + minus sign will make the arrow

my_object # to print the value of an object, just call its name

[1] 10

We can now use this object in our operations:

2 ^ my_object

[1] 1024

Or even create another object out of it:

my_object2 <- my_object * 2

my_object2

[1] 20

You can delete objects with the rm() function (for “remove”):

rm(my_object2)
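One way to confirm that an object is really gone is the base R function exists(), which takes the object's name as a string. A minimal sketch, where temp_object is a throwaway name made up for this illustration:

```r
temp_object <- 5        # hypothetical object, created only for this example
exists("temp_object")   # TRUE while the object is in the environment
rm(temp_object)
exists("temp_object")   # FALSE once it has been removed
```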

1.2 Vectors and functions

Objects can be of different types. One of the most useful ones is the vector, which holds a series of values. To create one manually, we can use the c() function (for “combine”):

my_vector <- c(6, -11, my_object, 0, 20)

my_vector

[1]   6 -11  10   0  20

One can also define vectors by sequences:

3:10

[1]  3  4  5  6  7  8  9 10

We can use square brackets to retrieve parts of vectors:

my_vector[4] # fourth element

[1] 0

my_vector[1:2] # first two elements

[1]   6 -11
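Square brackets accept more than positions. As a sketch that goes slightly beyond the examples above, negative numbers drop elements, and a logical condition keeps only the elements for which it is TRUE. The vector is redefined with the same values so the block runs on its own.

```r
my_vector <- c(6, -11, 10, 0, 20)  # same values as above

my_vector[-1]             # everything except the first element
my_vector[c(1, 5)]        # first and fifth elements
my_vector[my_vector > 0]  # only the positive values
```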

Let’s check out some basic functions we can use with numbers and numeric vectors:

sqrt(my_object) # square root

[1] 3.162278

log(my_object) # logarithm (natural by default)

[1] 2.302585

abs(-5) # absolute value

[1] 5

mean(my_vector)

[1] 5

median(my_vector)

[1] 6

sd(my_vector) # standard deviation

[1] 11.53256

sum(my_vector)

[1] 25

min(my_vector) # minimum value

[1] -11

max(my_vector) # maximum value

[1] 20

length(my_vector) # length (number of elements)

[1] 5
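These functions also combine naturally. As a small check, using the same values as my_vector above, the mean is just the sum divided by the number of elements:

```r
my_vector <- c(6, -11, 10, 0, 20)  # redefined so the block runs on its own

sum(my_vector) / length(my_vector)  # 25 / 5
mean(my_vector)                     # same value
```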

Notice that if we wanted to save any of these results for later, we would need to assign them:

my_mean <- mean(my_vector)

my_mean

[1] 5

These functions are quite simple: they take one object and do one operation. Many functions are a bit more complex: they take multiple objects or accept options. For example, see the sort() function, which by default sorts a vector in increasing order:

sort(my_vector)

[1] -11   0   6  10  20

If we instead want to sort our vector in decreasing order, we can use the decreasing = TRUE argument. T also works as an abbreviation for TRUE, but T is an ordinary object that can be overwritten, so spelling out TRUE is safer.

sort(my_vector, decreasing = TRUE)

[1]  20  10   6   0 -11

Tip

If you supply the argument values in the order the function expects, you can omit the argument names (see below). This is sometimes useful, but it can also lead to confusing code, so use it with caution.

sort(my_vector, T)

[1]  20  10   6   0 -11
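By contrast, when arguments are named, their order no longer matters. A brief sketch using the same sort() call, whose first argument is called x (my_vector is redefined so the block is self-contained):

```r
my_vector <- c(6, -11, 10, 0, 20)

sort(decreasing = TRUE, x = my_vector)  # named arguments can go in any order
sort(my_vector, decreasing = TRUE)      # equivalent, and usually easier to read
```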

A useful function to create vectors in sequence is seq(). Notice its arguments:

seq(from = 30, to = 100, by = 5)

 [1]  30  35  40  45  50  55  60  65  70  75  80  85  90  95 100
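seq() accepts other arguments too. As a short sketch, length.out asks for a fixed number of evenly spaced values instead of a fixed step, and a negative by counts downwards:

```r
seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
seq(from = 10, to = 1, by = -3)        # 10 7 4 1
```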

To check the arguments of a function, you can examine its help file: look the function up in the “Help” panel in RStudio or use a command like the following: ?sort.
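For a quicker look that stays in the console, base R also offers args() and formals(). This is only a sketch of one possible workflow; the help file remains the fuller source, since it also explains what each argument means.

```r
args(sort)            # prints the function's signature, including default values
names(formals(sort))  # just the argument names, as a character vector
```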

Exercise

Examine the help file of the log() function. How can we compute the base-10 logarithm of my_object? Your code:

Other than numeric vectors, character vectors are also useful:

my_character_vector <- c("Apple", "Orange", "Watermelon", "Banana")

my_character_vector[3]

[1] "Watermelon"

nchar(my_character_vector) # count number of characters

[1]  5  6 10  6
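A few other base R functions work on character vectors in the same element-by-element way. The selection below is ours, meant only as a small sketch; the vector is redefined so the block runs on its own.

```r
my_character_vector <- c("Apple", "Orange", "Watermelon", "Banana")

toupper(my_character_vector)         # convert every element to upper case
paste(my_character_vector, "juice")  # append a word to every element
sort(my_character_vector)            # sort() also works, alphabetically
```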

1.3 Data frames and lists

Another useful object type is the data frame. Data frames can store multiple vectors in a tabular format. We can manually create one with the data.frame() function:

my_data_frame <- data.frame(fruit = my_character_vector,
                            calories_per_100g = c(52, 47, 30, 89),
                            water_per_100g = c(85.6, 86.8, 91.4, 74.9))

my_data_frame

       fruit calories_per_100g water_per_100g
1      Apple                52           85.6
2     Orange                47           86.8
3 Watermelon                30           91.4
4     Banana                89           74.9

Now we have a little 4x3 data frame of fruits with their calorie counts and water composition. We gathered the nutritional information from the USDA (2019).

We can use the data_frame$column construct to access the vectors within the data frame:

mean(my_data_frame$calories_per_100g)

[1] 54.5
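The $ construct is not the only way in. As a sketch of two alternatives, double square brackets with a column name return the same vector, and single square brackets with a row and a column position pick out rows or individual cells. The data frame is rebuilt with the same values so the block is self-contained.

```r
my_data_frame <- data.frame(fruit = c("Apple", "Orange", "Watermelon", "Banana"),
                            calories_per_100g = c(52, 47, 30, 89),
                            water_per_100g = c(85.6, 86.8, 91.4, 74.9))

my_data_frame[["calories_per_100g"]]  # same vector as my_data_frame$calories_per_100g
my_data_frame[2, ]                    # second row, all columns
my_data_frame[2, "fruit"]             # one cell: row 2 of the fruit column
```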

Exercise

Obtain the maximum value of water content per 100g in the data. Your code:

Some useful commands to learn attributes of our data frame:

dim(my_data_frame)

[1] 4 3

nrow(my_data_frame)

[1] 4

names(my_data_frame) # column names

[1] "fruit"             "calories_per_100g" "water_per_100g"
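A few more base R commands in the same spirit, shown as a sketch on a rebuilt copy of the data frame: ncol() complements nrow(), str() gives a compact overview of each column and its type, and head() shows the first rows.

```r
my_data_frame <- data.frame(fruit = c("Apple", "Orange", "Watermelon", "Banana"),
                            calories_per_100g = c(52, 47, 30, 89),
                            water_per_100g = c(85.6, 86.8, 91.4, 74.9))

ncol(my_data_frame)     # number of columns
str(my_data_frame)      # one line per column, with its type and first values
head(my_data_frame, 2)  # first two rows only
```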

We will learn much more about data frames in our next module on data analysis.

After talking about vectors and data frames, the last object type that we will cover is the list. Lists are super flexible objects that can contain just about anything:

my_list <- list(my_object, my_vector, my_data_frame)

my_list

[[1]]
[1] 10

[[2]]
[1]   6 -11  10   0  20

[[3]]
       fruit calories_per_100g water_per_100g
1      Apple                52           85.6
2     Orange                47           86.8
3 Watermelon                30           91.4
4     Banana                89           74.9

To retrieve the elements of a list, we need to use double square brackets:

my_list[[1]]

[1] 10
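The difference between double and single brackets is easy to miss, so here is a small sketch on a simplified list. Double brackets return the element itself, while single brackets return a shorter list. List elements can also be given names, which then work with $; the names number and values below are made up for illustration.

```r
my_list <- list(10, c(6, -11, 10, 0, 20))  # simplified version of the list above

class(my_list[[2]])  # "numeric": double brackets return the element itself
class(my_list[2])    # "list": single brackets return a smaller list

# Hypothetical element names, chosen only for this example
named_list <- list(number = 10, values = c(6, -11, 10, 0, 20))
named_list$values    # named elements can be accessed with $
```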

Lists are sometimes useful due to their flexibility, but are much less common in routine data analysis compared to vectors or data frames.

1.4 Packages

The R community has developed thousands of packages, which are specialized collections of functions, datasets, and other resources. To install one, you should use the install.packages() command. Below we will install the tidyverse package, a suite for data analysis that we will use in the next modules. You just need to install packages once, and then they will be available system-wide.

install.packages("tidyverse") # this can take a couple of minutes

If you want to use an installed package in your script, you must load it with the library() function. Some packages, as shown below, will print descriptive messages once loaded.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Warning

Remember that install.packages("package") needs to be executed just once, while library(package) needs to be in each script in which you plan to use the package. In general, never include install.packages("package") as part of your scripts or Quarto documents!
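If you want a script to react sensibly when a package is missing, one option is to check for it without installing anything. This is only a sketch of that idea: requireNamespace() is a base R function, while the pkg_available name and the wording of the message are our own.

```r
# Check whether the package is installed, without installing or attaching it
pkg_available <- requireNamespace("tidyverse", quietly = TRUE)

if (pkg_available) {
  library(tidyverse)
} else {
  # Tell the user what to do instead of installing from inside the script
  message("tidyverse is not installed; run install.packages(\"tidyverse\") once in the console.")
}
```

This keeps the installation step a deliberate, one-time action in the console while the script itself only loads what is already there.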

Arel-Bundock, Vincent, Nils Enevoldsen, and CJ Yetman. 2018. “Countrycode: An R Package to Convert Country Names and Country Codes.” Journal of Open Source Software 3 (28): 848. https://doi.org/10.21105/joss.00848.

Aronow, Peter M, and Benjamin T Miller. 2019. Foundations of Agnostic Statistics. Cambridge University Press.

World Bank. 2023. “World Bank Open Data.” https://data.worldbank.org/.

Baydin, Atılım Güneş, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2017. “Automatic Differentiation in Machine Learning: A Survey.” The Journal of Machine Learning Research 18 (1): 5595–5637.

Coppedge, Michael, John Gerring, Carl Henrik Knutsen, Staffan I. Lindberg, Jan Teorell, David Altman, Michael Bernhard, et al. 2022. “V-Dem Codebook V12.” Varieties of Democracy (V-Dem) Project. https://www.v-dem.net/dsarchive.html.

Dahlberg, Stefan, Aksel Sundström, Sören Holmberg, Bo Rothstein, Natalia Alvarado Pachon, Cem Mert Dalli, and Yente Meijers. 2023. “The Quality of Government Basic Dataset, Version Jan23.” University of Gothenburg: The Quality of Government Institute. https://www.gu.se/en/quality-government. https://doi.org/10.18157/qogbasjan23.

FiveThirtyEight. 2021. “Tracking Congress In The Age Of Trump [Dataset].” https://projects.fivethirtyeight.com/congress-trump-score/.

Imai, Kosuke, and Nora Webb Williams. 2022. Quantitative Social Science: An Introduction in Tidyverse. Princeton; Oxford: Princeton University Press.

Moore, Will H., and David A. Siegel. 2013. A Mathematics Course for Political and Social Research. Princeton, NJ: Princeton University Press.

Pontin, Jason. 2007. “Oppenheimer’s Ghost.” MIT Technology Review, October 15, 2007. https://www.technologyreview.com/2007/10/15/223531/oppenheimers-ghost-3/.

Robinson, David. 2020. Fuzzyjoin: Join Tables Together on Inexact Matching. https://github.com/dgrtwo/fuzzyjoin.

Rossi, Hugo. 1996. “Mathematics Is an Edifice, Not a Toolbox.” Notices of the AMS 43 (10): 1108.

Smith, Danny. 2020. Survey Research Datasets and R. https://socialresearchcentre.github.io/r_survey_datasets/.

U. S. Department of Agriculture [USDA], Agricultural Research Service. 2019. “Department of Agriculture Agricultural Research Service.” https://fdc.nal.usda.gov/.

Whittinghill, Dexter C, and Robert V Hogg. 2001. “A Little Uniform Density with Big Instructional Potential.” Journal of Statistics Education 9 (2).

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.

Wickham, Hadley, Danielle Navarro, and Thomas Lin Pedersen. 2023. Ggplot2: Elegant Graphics for Data Analysis. 3rd ed. https://ggplot2-book.org/.