9  Text analysis

9.1 String manipulation with stringr

R stores text as strings, i.e., sequences of characters that can contain letters, numbers, and symbols.
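
For example, we can assign a string to an object and check its type (the object name here is just illustrative):

my_string <- "I am a string!"
class(my_string)
[1] "character"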

Often we want to manipulate strings in different ways, which is when the stringr package from the tidyverse comes in handy.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We can combine strings with str_c(), using any separator we want:

str_c("Last name", "First name", "Address", sep = ", ")
[1] "Last name, First name, Address"

Or we can split strings with str_split_1():

str_split_1("Last name, First name, Address", pattern = ", ")
[1] "Last name"  "First name" "Address"   

Other stringr functions modify capitalization:

str_to_title("joe biden")
[1] "Joe Biden"

Or remove unnecessary spaces:

str_squish("    Joe    Biden   ")
[1] "Joe Biden"

We encourage you to check the stringr cheatsheet for more string manipulation functions.
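
For instance, str_length() counts the characters in a string, and str_sub() extracts a substring by position:

str_length("Joe Biden")
[1] 9
str_sub("Joe Biden", start = 1, end = 3)
[1] "Joe"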

Let’s create a somewhat messy data frame of companies:

companies <- data.frame(
  id = c("A-20-322", "A-10-231", "B-20-865", "C-20-800", "A-20-900", "C-10-022",
         "B-10-822", "C-20-029", "A-20-116"),
  company = c("Pulse Solutions Co.", "Apex Engineering LLC", "NovaTech INC", 
              "BetterPetFood Ltd", "Proxima Inc.", "MakerMind Studios LLC", 
              "TerraVerde Co.", "PulsePlay Productions Ltd", "Kinetix Design Co"),
  year_estab = c("c. 1990", "1995", "2000 APP", "1980", "2011", "circa 1950", 
                 "1976 approx", "2010", "2016 appr")
)

Perhaps we want to detect, extract, or replace the letter A in “id”:

companies |> 
  mutate(new = str_detect(id, "A"))
        id                   company  year_estab   new
1 A-20-322       Pulse Solutions Co.     c. 1990  TRUE
2 A-10-231      Apex Engineering LLC        1995  TRUE
3 B-20-865              NovaTech INC    2000 APP FALSE
4 C-20-800         BetterPetFood Ltd        1980 FALSE
5 A-20-900              Proxima Inc.        2011  TRUE
6 C-10-022     MakerMind Studios LLC  circa 1950 FALSE
7 B-10-822            TerraVerde Co. 1976 approx FALSE
8 C-20-029 PulsePlay Productions Ltd        2010 FALSE
9 A-20-116         Kinetix Design Co   2016 appr  TRUE
companies |> 
  mutate(new = str_extract(id, "A"))
        id                   company  year_estab  new
1 A-20-322       Pulse Solutions Co.     c. 1990    A
2 A-10-231      Apex Engineering LLC        1995    A
3 B-20-865              NovaTech INC    2000 APP <NA>
4 C-20-800         BetterPetFood Ltd        1980 <NA>
5 A-20-900              Proxima Inc.        2011    A
6 C-10-022     MakerMind Studios LLC  circa 1950 <NA>
7 B-10-822            TerraVerde Co. 1976 approx <NA>
8 C-20-029 PulsePlay Productions Ltd        2010 <NA>
9 A-20-116         Kinetix Design Co   2016 appr    A
companies |> 
  mutate(new = str_replace(id, "A", "Z"))
        id                   company  year_estab      new
1 A-20-322       Pulse Solutions Co.     c. 1990 Z-20-322
2 A-10-231      Apex Engineering LLC        1995 Z-10-231
3 B-20-865              NovaTech INC    2000 APP B-20-865
4 C-20-800         BetterPetFood Ltd        1980 C-20-800
5 A-20-900              Proxima Inc.        2011 Z-20-900
6 C-10-022     MakerMind Studios LLC  circa 1950 C-10-022
7 B-10-822            TerraVerde Co. 1976 approx B-10-822
8 C-20-029 PulsePlay Productions Ltd        2010 C-20-029
9 A-20-116         Kinetix Design Co   2016 appr Z-20-116
Exercise

Filter the dataset to only get companies that have the “-20-” tag in their ID. Your code:

A very useful tool in string manipulation is regular expressions (or regex), which allow you to search for patterns in text.
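
As a quick illustration, str_view() prints strings with the regex matches highlighted (here, any two consecutive digits):

str_view(c("A-20-322", "B-10-822"), "\\d{2}")

This would highlight “20” and “32” in the first string, and “10” and “82” in the second.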

For example, let’s extract the uppercase letter from “id” using the “[:upper:]” regular expression. NB: “[:lower:]” would pick up a lowercase letter and “[:alpha:]” would pick up any letter.

companies |> 
  mutate(new = str_extract(id, "[:upper:]"))
        id                   company  year_estab new
1 A-20-322       Pulse Solutions Co.     c. 1990   A
2 A-10-231      Apex Engineering LLC        1995   A
3 B-20-865              NovaTech INC    2000 APP   B
4 C-20-800         BetterPetFood Ltd        1980   C
5 A-20-900              Proxima Inc.        2011   A
6 C-10-022     MakerMind Studios LLC  circa 1950   C
7 B-10-822            TerraVerde Co. 1976 approx   B
8 C-20-029 PulsePlay Productions Ltd        2010   C
9 A-20-116         Kinetix Design Co   2016 appr   A

Or extract the actual number from “year_estab”. The following pattern stands for “a digit, four consecutive times”:

companies |> 
  mutate(new = str_extract(year_estab, "\\d{4}"))
        id                   company  year_estab  new
1 A-20-322       Pulse Solutions Co.     c. 1990 1990
2 A-10-231      Apex Engineering LLC        1995 1995
3 B-20-865              NovaTech INC    2000 APP 2000
4 C-20-800         BetterPetFood Ltd        1980 1980
5 A-20-900              Proxima Inc.        2011 2011
6 C-10-022     MakerMind Studios LLC  circa 1950 1950
7 B-10-822            TerraVerde Co. 1976 approx 1976
8 C-20-029 PulsePlay Productions Ltd        2010 2010
9 A-20-116         Kinetix Design Co   2016 appr 2016
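
Since str_extract() returns a string, you will often want to convert the extracted year to a number, e.g.:

companies |> 
  mutate(new = as.numeric(str_extract(year_estab, "\\d{4}")))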

Or detect the companies which are LLCs/Ltds:

companies |> 
  mutate(new = str_detect(company, "LLC|Ltd"))
        id                   company  year_estab   new
1 A-20-322       Pulse Solutions Co.     c. 1990 FALSE
2 A-10-231      Apex Engineering LLC        1995  TRUE
3 B-20-865              NovaTech INC    2000 APP FALSE
4 C-20-800         BetterPetFood Ltd        1980  TRUE
5 A-20-900              Proxima Inc.        2011 FALSE
6 C-10-022     MakerMind Studios LLC  circa 1950  TRUE
7 B-10-822            TerraVerde Co. 1976 approx FALSE
8 C-20-029 PulsePlay Productions Ltd        2010  TRUE
9 A-20-116         Kinetix Design Co   2016 appr FALSE
Exercise
  1. Discuss: how would you identify companies for which there’s uncertainty in the year of establishment? What’s the pattern in them?

  2. Filter the dataset to only keep observations with uncertainty. Hint: you could use the “[:alpha:]” regular expression.

9.2 Tidy text analysis

We can use the tidytext package to conduct some basic text analysis using tidy data principles. Remember that in tidy data (Wickham 2014):

-   Each variable is a column.
-   Each observation is a row.
-   Each type of observational unit is a (separate) table.

Here our observational unit will be the token, i.e., a unit of text that’s meaningful on its own. In the simplest case, we’ll use words as tokens.

9.2.1 Getting text data to a tidy format

Let’s say we have some text as lines (very common for speech, etc.):

lyrics_lines <- data.frame(line = c("I hate every ape I see", 
                                    "From chimpan-A to chimpan-Z",
                                    "Oh my God, I was wrong",
                                    "It was Earth all along",
                                    "You finally made a monkey",
                                    "Yes you finally made a monkey out of me"))
lyrics_lines
                                     line
1                  I hate every ape I see
2             From chimpan-A to chimpan-Z
3                  Oh my God, I was wrong
4                  It was Earth all along
5               You finally made a monkey
6 Yes you finally made a monkey out of me

We break the text into individual tokens (tokenization) using tidytext’s unnest_tokens() function. Note that, by default, it also lowercases tokens and strips punctuation.

library(tidytext)
lyrics_words <- lyrics_lines |> 
  unnest_tokens(output = "word", input = "line", # column names in output and input
                token = "words")
lyrics_words
      word
1        i
2     hate
3    every
4      ape
5        i
6      see
7     from
8  chimpan
9        a
10      to
11 chimpan
12       z
13      oh
14      my
15     god
16       i
17     was
18   wrong
19      it
20     was
21   earth
22     all
23   along
24     you
25 finally
26    made
27       a
28  monkey
29     yes
30     you
31 finally
32    made
33       a
34  monkey
35     out
36      of
37      me
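
unnest_tokens() can also split text into other units. For example, a sketch tokenizing by bigrams (pairs of consecutive words):

lyrics_lines |> 
  unnest_tokens(output = "bigram", input = "line", token = "ngrams", n = 2)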

9.2.2 Counts

Once we have our tidy structure, we can then perform very simple tasks such as finding the most common words in our text as a whole.

lyrics_words |> 
  count(word, sort = T)
      word n
1        a 3
2        i 3
3  chimpan 2
4  finally 2
5     made 2
6   monkey 2
7      was 2
8      you 2
9      all 1
10   along 1
11     ape 1
12   earth 1
13   every 1
14    from 1
15     god 1
16    hate 1
17      it 1
18      me 1
19      my 1
20      of 1
21      oh 1
22     out 1
23     see 1
24      to 1
25   wrong 1
26     yes 1
27       z 1

Since this is just a data frame, we can use all the tools we’ve learned. For example, let’s make a ranking plot for words appearing at least twice:

lyrics_words |> 
  count(word, sort = T) |> 
  filter(n >= 2) |> 
  ggplot(aes(x = n, y = fct_reorder(word, n))) +
    geom_col()

Exercise

Look up the lyrics to your favorite song at the moment (no guilty pleasures here!). Then, follow the process described above to count the words in the chorus: store the text as a line-by-line dataset, tokenize by words, and count/plot.

If you are curious about the repetitiveness of lyrics in pop music over time, we recommend checking out this fun article and analysis by Colin Morris at The Pudding.

9.2.3 A richer corpus

Let’s use the text from a classic book: “A Vindication of the Rights of Woman” by Mary Wollstonecraft (1792). This and other classics are available for download from Project Gutenberg, and there’s an R package for doing just that: gutenbergr.
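
We have already saved the text to a CSV, but a sketch of downloading it directly would look like this (assuming gutenbergr is installed; the ID below is our lookup for this title, so confirm it by searching gutenberg_metadata):

library(gutenbergr)
rights_of_women_raw <- gutenberg_download(3420) # ID assumed; verify before use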

rights_of_women <- read_csv("data/rights_of_women.csv")
Rows: 8238 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): author, book, text
dbl (3): gutenberg_id, chapter, line

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We can tokenize the text by words:

rights_of_women_words <- rights_of_women |> 
  unnest_tokens(output = "word", input = "text", # column names in output and input
                token = "words")

And count the number of words:

rights_of_women_words |> 
  count(word, sort = T)
# A tibble: 7,767 × 2
   word      n
   <chr> <int>
 1 the    5059
 2 of     3713
 3 to     3270
 4 and    2468
 5 a      1844
 6 that   1367
 7 in     1305
 8 is     1182
 9 be     1040
10 it      842
# ℹ 7,757 more rows

9.2.4 Preprocessing

We might want to do a bit of preprocessing and remove these “stop words”. tidytext comes with a little dictionary of them:

stop_words_smart <- stop_words |> 
  filter(lexicon == "SMART")
stop_words_smart
# A tibble: 571 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
# ℹ 561 more rows

So you can do something like:

rights_of_women_words_cl <- rights_of_women_words |> 
  filter(!word %in% stop_words_smart$word)
rights_of_women_words_cl |> 
  count(word, sort = T)
# A tibble: 7,386 × 2
   word       n
   <chr>  <int>
 1 women    445
 2 man      308
 3 men      299
 4 reason   264
 5 mind     232
 6 virtue   198
 7 woman    190
 8 love     173
 9 life     170
10 nature   147
# ℹ 7,376 more rows
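
Equivalently, you could use dplyr’s anti_join(), which drops every row with a match in the stop word list:

rights_of_women_words |> 
  anti_join(stop_words_smart, by = "word")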

9.2.5 Counts by document

We might want to get the most common words in each “document,” e.g., chapters in this book.

count_by_chapter <- rights_of_women_words_cl |> 
  # count words by chapter
  count(word, chapter) |> 
  # get the top 10 in each chapter
  slice_max(n = 10, order_by = n, by = chapter)
count_by_chapter
# A tibble: 152 × 3
   word      chapter     n
   <chr>       <dbl> <int>
 1 women           0    31
 2 woman           0    19
 3 men             0    16
 4 rights          0    14
 5 chapter         0    13
 6 sex             0    13
 7 character       0    12
 8 human           0    12
 9 reason          0    12
10 society         0    12
# ℹ 142 more rows
ggplot(count_by_chapter, aes(x = n, 
                             y = reorder_within(word, n, chapter))) +
  geom_col() +
  facet_wrap(~chapter, scales = "free") + 
  scale_y_reordered()

9.2.6 Most distinctive terms by document

Another way to quantify what a document is about is to use TF-IDF (term frequency - inverse document frequency; Silge and Robinson, 2017, ch. 3).

The idea is to balance two things:

-   TF: the relative frequency of a term
-   IDF: how common/uncommon the term is across documents

\[
\begin{aligned}
TFIDF_{i,d} &= TF_{i,d} \cdot IDF_i \\
TFIDF_{i,d} &= \frac{n_{i \text{ in } d}}{n_{\text{total in } d}} \cdot \ln\left(\frac{n_{\text{docs}}}{n_{\text{docs containing } i}}\right)
\end{aligned}
\]

For example, let’s imagine we have 6 documents and we’re trying to determine the TF-IDF of terms in a document with 100 total terms:

(10 / 100) * # term accounts for 10% of the terms in the doc
  log(6 / 6) # term present in all documents
[1] 0
(10 / 100) * # term accounts for 10% of the terms in the doc
  log(6 / 3) # term present in half of the documents
[1] 0.06931472
(10 / 100) * # term accounts for 10% of the terms in the doc
  log(6 / 1) # term present in just one document
[1] 0.1791759

The bind_tf_idf() function adds TF-IDFs to a token count by document:

tfidf_by_chapter <- rights_of_women_words_cl |> 
  # count words by chapter
  count(word, chapter, sort = T) |> 
  # add TF-IDF
  bind_tf_idf(term = word, document = chapter, n = n) |> 
  # get the top 10 in each chapter
  slice_max(n = 10, order_by = tf_idf, by = chapter)
tfidf_by_chapter
# A tibble: 206 × 6
   word            chapter     n       tf   idf  tf_idf
   <chr>             <dbl> <int>    <dbl> <dbl>   <dbl>
 1 polygamy              4     4 0.000853 2.64  0.00225
 2 prince                4     4 0.000853 2.64  0.00225
 3 thirty                4     4 0.000853 2.64  0.00225
 4 middle                4     5 0.00107  1.95  0.00208
 5 condition             4     7 0.00149  1.25  0.00187
 6 created               4     9 0.00192  0.847 0.00163
 7 accomplishments       4     7 0.00149  1.03  0.00154
 8 clothes               4     4 0.000853 1.54  0.00131
 9 sensation             4     4 0.000853 1.54  0.00131
10 twenty                4     4 0.000853 1.54  0.00131
# ℹ 196 more rows
ggplot(tfidf_by_chapter, aes(x = tf_idf, 
                             y = reorder_within(word, tf_idf, chapter))) +
  geom_col() +
  facet_wrap(~chapter, scales = "free") + 
  scale_y_reordered()

Exercise

The “data/books.csv” dataset contains the text of two classics in political theory: Hobbes’ “Leviathan” (1651) and Mill’s “On Liberty” (1859). (Both come from Project Gutenberg as well).

Make a plot with the most distinctive terms in each book, according to TF-IDF. Hint: think of what “documents” will be in this case (previously we used chapters).