Basics of Text Mining in R

Posted on Jan 27, 2022

Text in R can be represented in several ways, but generally it is a character vector (a vector of strings). Reading a text file usually yields either a single long character vector, or text broken into variables and observations as a data frame, as with comma-separated (CSV) files. In this blog tutorial, I will download one of Jane Austen’s books and perform some basic analysis to understand how these text functions work.
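
For a quick base-R illustration (the strings below are made up for this example, not from the book), a character vector holds one string per element:

texts = c("It is a truth universally acknowledged,",
          "that all text in R is just strings.")
nchar(texts)   # number of characters in each element
length(texts)  # number of elements in the vector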

Packages

The common packages for text mining in R are stringr, tidytext, tidyverse and quanteda. I will also use gutenbergr to download the book for analysis.

library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(quanteda)
## Package version: 3.2.4
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
library(gutenbergr)

# changing default ggplot theme to minimal
theme_set(theme_minimal())

Downloading the Book

Once I have the required functions in my namespace, I can download the book using gutenberg_download(). gutenberg_works() gives a list of works that can be downloaded. (gutenberg_metadata will give a list of all books in Project Gutenberg, but we only need the ones that can be downloaded.)

gutenberg_works(title == "Persuasion")
## # A tibble: 1 × 8
##   gutenberg_id title     author gutenberg_author_id language gutenberg_bookshelf
##          <int> <chr>     <chr>                <int> <chr>    <chr>              
## 1          105 Persuasi… Auste…                  68 en       <NA>               
## # ℹ 2 more variables: rights <chr>, has_text <lgl>

I am looking for Persuasion, Jane Austen’s last book. R tells me the rights to the book are public and it has text, so it works for my purpose. Downloading the book requires its gutenberg_id, which, as seen in the output above, is 105 for Persuasion.

book = gutenberg_download(105)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

I can download more than one book at a time and do many other fancy things; check gutenbergr’s vignette for more information.
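
For example, passing a vector of ids downloads several books at once, and meta_fields keeps the titles attached so the rows stay distinguishable. (A sketch based on gutenbergr’s vignette; 1342 is the id for Pride and Prejudice.)

austen = gutenberg_download(c(105, 1342), meta_fields = "title")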

Exploring the Book

Let’s see what we have in book.

book
## # A tibble: 8,357 × 2
##    gutenberg_id text            
##           <int> <chr>           
##  1          105 "Persuasion"    
##  2          105 ""              
##  3          105 ""              
##  4          105 "by Jane Austen"
##  5          105 ""              
##  6          105 "(1818)"        
##  7          105 ""              
##  8          105 ""              
##  9          105 ""              
## 10          105 ""              
## # ℹ 8,347 more rows

The book object has two variables: gutenberg_id and text. Unless you are downloading multiple books, text is the only useful variable.

Also note that there are 8,357 rows in the dataset. However, this text is not in tidytext format, where each row holds one token and each column is a variable. (An easy way to remember the format is to repeat out loud “One Token Per Row” as often as you can.)

To convert it into tidytext format, I will use the unnest_tokens() function from the tidytext package.

book %>% 
  unnest_tokens(word, text)
## # A tibble: 83,707 × 2
##    gutenberg_id word      
##           <int> <chr>     
##  1          105 persuasion
##  2          105 by        
##  3          105 jane      
##  4          105 austen    
##  5          105 1818      
##  6          105 contents  
##  7          105 chapter   
##  8          105 i         
##  9          105 chapter   
## 10          105 ii        
## # ℹ 83,697 more rows

unnest_tokens() as used here takes two arguments: what you want to convert into and what you want to convert from. First comes the output column name that will be created as the text is unnested into it (word, in this case), then the input column that the text comes from (text, in this case).

The function also did some other operations in the background. It removed all the punctuation marks from the text, and it converted everything to lower case (which can be toggled off with to_lower = FALSE in unnest_tokens()). The function also has a token argument to specify what kind of token to split into. words is the default and worked for our case; other options are characters, character_shingles, ngrams, skip_ngrams, sentences, lines, paragraphs, regex, tweets and ptb.
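
To illustrate one of the alternatives (a small sketch, not part of the analysis below), tokenising into bigrams only needs the token and n arguments:

# split the raw text into two-word sequences instead of single words
book %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)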

Exploring Words

We can run several manipulations to get insights about the words. For example, how many four-letter words did she use? How many words with fewer than four letters? Or with more than ten?

book = book %>% 
  unnest_tokens(word, text)

# Four Letter Words
book %>% 
  filter(str_length(word) == 4)
## # A tibble: 15,513 × 2
##    gutenberg_id word 
##           <int> <chr>
##  1          105 jane 
##  2          105 1818 
##  3          105 viii 
##  4          105 xiii 
##  5          105 xvii 
##  6          105 xxii 
##  7          105 xxiv 
##  8          105 hall 
##  9          105 took 
## 10          105 book 
## # ℹ 15,503 more rows
# Less than four letter words
book %>% 
  filter(str_length(word) < 4)
## # A tibble: 37,900 × 2
##    gutenberg_id word 
##           <int> <chr>
##  1          105 by   
##  2          105 i    
##  3          105 ii   
##  4          105 iii  
##  5          105 iv   
##  6          105 v    
##  7          105 vi   
##  8          105 vii  
##  9          105 ix   
## 10          105 x    
## # ℹ 37,890 more rows
# More than ten letters
book %>% 
  filter(str_length(word) > 10)
## # A tibble: 1,640 × 2
##    gutenberg_id word         
##           <int> <chr>        
##  1          105 somersetshire
##  2          105 consolation  
##  3          105 contemplating
##  4          105 information  
##  5          105 respectable  
##  6          105 representing 
##  7          105 parliaments  
##  8          105 handwriting  
##  9          105 presumptive  
## 10          105 infatuation  
## # ℹ 1,630 more rows

We see that there are 15,513 words that have exactly four letters. 37,900 have fewer than four letters (that includes tokens such as the Roman numerals ii and iv). And there are 1,640 words with more than ten letters.
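
As an aside, the three counts can be collected in a single summarise() call instead of three separate filters (a minimal sketch):

book %>% 
  summarise(
    exactly_four = sum(str_length(word) == 4),
    under_four   = sum(str_length(word) < 4),
    over_ten     = sum(str_length(word) > 10)
  )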

Words that Start or End with …

We can also find words that start or end with a particular string. For example, I wonder how often Jane Austen uses the V4 form of a verb, the one ending in “ing”. We can use str_ends() from the stringr package.

book %>% 
  filter(str_ends(word, "ing"))
## # A tibble: 2,638 × 2
##    gutenberg_id word         
##           <int> <chr>        
##  1          105 contemplating
##  2          105 arising      
##  3          105 adding       
##  4          105 inserting    
##  5          105 serving      
##  6          105 representing 
##  7          105 forming      
##  8          105 concluding   
##  9          105 handwriting  
## 10          105 beginning    
## # ℹ 2,628 more rows

She uses words ending in “ing” 2,638 times. I’m curious: what are their frequencies? I only need to add count() at the end.

book %>% 
  filter(str_ends(word, "ing")) %>% 
  count(word, sort = T)
## # A tibble: 549 × 2
##    word           n
##    <chr>      <int>
##  1 being        220
##  2 nothing      139
##  3 having        92
##  4 going         65
##  5 something     64
##  6 morning       59
##  7 evening       54
##  8 anything      49
##  9 looking       45
## 10 everything    43
## # ℹ 539 more rows

“Being” and “nothing” are used most often (no pun intended). What about words that start with “h”? I can use str_starts() from the stringr package for this.

book %>% 
  filter(str_starts(word, "h")) %>% 
  count(word, sort = T)
## # A tibble: 215 × 2
##    word        n
##    <chr>   <int>
##  1 her      1201
##  2 had      1185
##  3 he        959
##  4 his       658
##  5 have      589
##  6 him       466
##  7 herself   159
##  8 how       125
##  9 has       100
## 10 himself    95
## # ℹ 205 more rows

They’re mostly pronouns and forms of “to have”. How many times does “gh” appear in her text, and in which words? (If I recall correctly, “gh” is probably one of the most common letter pairs in English.)

book %>% 
  filter(str_detect(word, fixed("gh"))) %>% 
  count(word, sort = T)
## # A tibble: 100 × 2
##    word        n
##    <chr>   <int>
##  1 might     165
##  2 though    117
##  3 thought    90
##  4 enough     71
##  5 ought      52
##  6 right      42
##  7 through    34
##  8 brought    33
##  9 high       26
## 10 night      24
## # ℹ 90 more rows

I did this using the str_detect() function from stringr. This function normally interprets its pattern as a regular expression. Since I was looking for a fixed string (“gh”), I wrapped it in fixed() to tell R exactly what I was looking for: it will not do pattern matching, only exact fixed matches. I’m fairly naive with regular expressions, but a good starting guide is the Strings chapter of Hadley Wickham’s R for Data Science.
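
To see the difference (a toy example, unrelated to the book), a period matches any character as a regular expression but only a literal dot inside fixed():

str_detect("night", "n.ght")         # TRUE: "." matches any character
str_detect("night", fixed("n.ght"))  # FALSE: looks for a literal "n.ght"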

I can also look for words that both start and end with certain letters. How? Just add another condition to the filter() statement. Let’s look for words Jane used that start and end with “t”.

book %>% 
  filter(str_starts(word, "t") & str_ends(word, "t")) %>% 
  count(word, sort = T)
## # A tibble: 18 × 2
##    word            n
##    <chr>       <int>
##  1 that          874
##  2 thought        90
##  3 tenant         13
##  4 trust           4
##  5 throat          2
##  6 torment         2
##  7 transient       2
##  8 treat           2
##  9 taught          1
## 10 temperament     1
## 11 tempt           1
## 12 tenderest       1
## 13 test            1
## 14 thickest        1
## 15 throughout      1
## 16 tight           1
## 17 trent           1
## 18 trustiest       1

The most common such word is “that”, followed by “thought”.

Frequency Distribution Plots

We saw how adding count(word, sort = T) created the frequency distribution. We can also visualise the counts.

Frequency Table

book %>% 
  count(word, sort = T) %>%
  head(20) %>% 
  mutate(word = reorder(word, n))
## # A tibble: 20 × 2
##    word      n
##    <fct> <int>
##  1 the    3327
##  2 to     2809
##  3 and    2800
##  4 of     2570
##  5 a      1594
##  6 in     1389
##  7 was    1337
##  8 her    1201
##  9 had    1185
## 10 she    1146
## 11 i      1124
## 12 it     1038
## 13 he      959
## 14 be      950
## 15 not     933
## 16 that    874
## 17 as      810
## 18 for     707
## 19 but     664
## 20 his     658

Frequency Plot

I will have to reorder the word factor by its count before plotting; count() only counts, and without reordering ggplot2 would arrange the words alphabetically on the axis.

book %>% 
  count(word, sort = T) %>%
  head(20) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(x = n, y = word)) +
  geom_col() +
  xlab("Count") +
  ylab("Word")

Finding Hapaxes

Hapaxes (hapax legomena) are words that occur only once in the text. Nothing complicated; I will first count the occurrences and then filter for the rows where the count is 1.

book %>% 
  count(word, sort = T) %>% 
  filter(n == 1)
## # A tibble: 2,578 × 2
##    word      n
##    <chr> <int>
##  1 15        1
##  2 16        1
##  3 1760      1
##  4 1784      1
##  5 1785      1
##  6 1787      1
##  7 1789      1
##  8 1791      1
##  9 1800      1
## 10 1803      1
## # ℹ 2,568 more rows

These are all numbers. What about words?

book %>% 
  count(word, sort = T) %>% 
  filter(n == 1) %>% 
  filter(!str_detect(word, "[0-9]"))
## # A tibble: 2,560 × 2
##    word               n
##    <chr>          <int>
##  1 _arrangé_          1
##  2 _au                1
##  3 _captain_          1
##  4 _did_              1
##  5 _disagreeable_     1
##  6 _gentleman_        1
##  7 _given_            1
##  8 _great_            1
##  9 _had_              1
## 10 _high_             1
## # ℹ 2,550 more rows

I have used a regular expression here to identify all the words that don’t contain any numerals.
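
The pattern [0-9] matches any single digit, so str_detect() flags a string as soon as it contains one (a quick demonstration on made-up input):

str_detect(c("1818", "chapter", "2nd"), "[0-9]")  # TRUE FALSE TRUE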

Distribution of Word Lengths

Some writers have a habit of writing long words. How long were the words Jane used, and how often did each length occur?

book %>% 
  mutate(length = str_length(word)) %>% 
  count(length, sort = T)
## # A tibble: 16 × 2
##    length     n
##     <int> <int>
##  1      3 19961
##  2      4 15513
##  3      2 15202
##  4      5  8425
##  5      6  6494
##  6      7  5712
##  7      8  3452
##  8      9  2910
##  9      1  2737
## 10     10  1661
## 11     11   823
## 12     12   486
## 13     13   231
## 14     14    71
## 15     15    25
## 16     16     4

Three-letter words are the most common, followed by four-letter and two-letter ones. I first calculated the length of each word using mutate() and str_length(), then counted by length.

I can also plot them.

book %>% 
  mutate(length = str_length(word)) %>% 
  count(length, sort = T) %>% 
  mutate(length = reorder(length, n)) %>% 
  ggplot(aes(x = length, y = n)) +
  geom_col() +
  xlab("Length of Word") +
  ylab("Count")

That was all! See you next week, when I will try some harder text analysis tools.


P.S. I have used the words “word(s)” and “token(s)” quite liberally. They are not always the same; as the token argument in unnest_tokens() suggests, there are many options besides words that can serve as tokens.
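
For instance (a small sketch, starting again from the raw download, since our book object now holds single words), tokenising the text into sentences:

gutenberg_download(105) %>% 
  unnest_tokens(sentence, text, token = "sentences")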