## 20.33 Quantitative Analysis of Text

The qdap package provides an extensive suite of functions to support the quantitative analysis of text.

We can obtain simple summaries of a list of words. To illustrate, we use the terms from our Document Term Matrix dtm. We first extract the shorter terms into one long word list: we convert dtm into a matrix, extract the column names (the terms), and retain those shorter than 20 characters.

words <- dtm %>%
  as.matrix() %>%
  colnames() %>%
  (function(x) x[nchar(x) < 20])
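
Because dtm is built elsewhere, a toy matrix can stand in for it to show the same steps in isolation. This is a minimal sketch only: the matrix toy_dtm, its terms, and the cut-off of 10 characters are assumptions made for illustration, and plain base R is used in place of the pipe.

```r
# Hypothetical stand-in for a document term matrix: rows are documents,
# columns are terms, and cells are counts.
toy_dtm <- matrix(c(1, 0, 2,
                    0, 3, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2"),
                                  c("data", "algorithm", "internationalisation")))

# The terms are the column names of the matrix.
terms <- colnames(toy_dtm)

# Retain only the terms shorter than 10 characters (an illustrative cut-off).
short_terms <- terms[nchar(terms) < 10]
short_terms
```

Here the 20-character term is dropped while the two shorter terms are kept, which mirrors the filter applied to dtm with a limit of 20.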

We can then summarise the word list. Notice, in particular, the use of qdap::dist_tab() to generate frequencies and percentages.

length(words)
## [1] 6456
head(words, 15)
##  [1] "abstract"  "academi"   "accur"     "accuraci"  "acnntex"   "acsi"
## [13] "advers"    "affect"    "algorithm"
summary(nchar(words))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.000   5.000   6.000   6.644   8.000  19.000
table(nchar(words))
##
##    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18
##  579  867 1044 1114  935  651  397  268  200  138   79   63   34   28   22   21
##   19
##   16
dist_tab(nchar(words))
##    interval freq cum.freq percent cum.percent
## 1         3  579      579    8.97        8.97
## 2         4  867     1446   13.43       22.40
## 3         5 1044     2490   16.17       38.57
## 4         6 1114     3604   17.26       55.82
## 5         7  935     4539   14.48       70.31
## 6         8  651     5190   10.08       80.39
## 7         9  397     5587    6.15       86.54
## 8        10  268     5855    4.15       90.69
## 9        11  200     6055    3.10       93.79
## 10       12  138     6193    2.14       95.93
## 11       13   79     6272    1.22       97.15
## 12       14   63     6335    0.98       98.13
## 13       15   34     6369    0.53       98.65
## 14       16   28     6397    0.43       99.09
## 15       17   22     6419    0.34       99.43
## 16       18   21     6440    0.33       99.75
## 17       19   16     6456    0.25      100.00
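
The columns reported by dist_tab() can be related to one another with a few lines of base R. The following is a sketch under stated assumptions rather than qdap's own implementation: toy_words is a hypothetical word list, and the rounding to two decimal places is chosen to resemble the output above.

```r
# Hypothetical word list used only for illustration.
toy_words <- c("data", "mine", "text", "model", "train", "corpus")

# Frequency of each word length.
freq <- table(nchar(toy_words))
n    <- as.vector(freq)

# Assemble columns with the same names as those in the dist_tab() output.
tab <- data.frame(
  interval    = names(freq),
  freq        = n,
  cum.freq    = cumsum(n),
  percent     = round(100 * n / sum(n), 2),
  cum.percent = round(100 * cumsum(n) / sum(n), 2)
)
tab
```

The cumulative percentage is computed from the cumulative frequency and then rounded, rather than by summing the rounded percentages. This appears consistent with the table above, where 3604 of 6456 words gives 55.82 even though the rounded percentages to that point sum to 55.83.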
