21.19 Exploring the Document Term Matrix

We can obtain the term frequencies as a named vector by converting the document term matrix into an ordinary matrix and summing the counts in each column:

freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 6508
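
To see what the column sums represent, here is a minimal sketch using a small hand-made matrix in place of the real document term matrix. The object `toy` and its counts are invented purely for illustration and are not part of the corpus used in this chapter.

```r
# Toy document term matrix: rows are documents, columns are terms.
# The counts are made up purely for illustration.
toy <- matrix(c(2, 0, 1,
                1, 3, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"),
                              c("data", "mine", "use")))

# Summing each column gives the total count of each term
# across all documents: data = 3, mine = 3, use = 1.
toy_freq <- colSums(toy)
toy_freq

# One entry per term, so the length is the number of distinct terms.
length(toy_freq)
```

The result is a numeric vector named by term, which is why the length of freq above equals the number of distinct terms in the corpus.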

By ordering the frequencies we can list the most frequent terms and the least frequent terms:

ord <- order(freq)

# Least frequent terms.
freq[head(ord)]
##      acnntex        dmitl  microsystem       ventur         adra attributeori 
##            1            1            1            1            1            1

Notice that these terms appear just once and are probably not of interest to us. Indeed, they are likely to be spurious terms introduced when the original documents were converted from PDF to text.
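
One way to gauge how many such single-occurrence terms there are is to count them and, if desired, set them aside. The following is only a sketch on a toy vector: `freq_toy` and its counts are hypothetical, and the same expressions could be applied to freq.

```r
# Hypothetical term frequencies, for illustration only.
freq_toy <- c(acnntex = 1, data = 12, dmitl = 1, mine = 7)

# How many terms appear exactly once?
n_single <- sum(freq_toy == 1)
n_single

# Keep only the terms that appear more than once.
freq_kept <- freq_toy[freq_toy > 1]
freq_kept
```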

# Most frequent terms.
freq[tail(ord)]
##     can dataset pattern     use    mine    data 
##     709     776     887    1366    1446    3101

These terms are much more likely to be of interest to us. Not surprisingly, given the choice of documents in the corpus, the most frequent terms are data, mine, use, pattern, dataset, and can.
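
The indexing with order(), head(), and tail() can be tried out on a small named vector. This is a minimal sketch: `toy_freq` and its counts are made up for illustration, and sort() is shown as an alternative way to list terms from most to least frequent rather than as the approach used above.

```r
# Hypothetical term frequencies, for illustration only.
toy_freq <- c(can = 4, data = 9, rare = 1, mine = 7, odd = 2)

# order() returns the positions that would sort the vector ascending.
ord <- order(toy_freq)

# The first positions index the least frequent terms.
least <- toy_freq[head(ord, 2)]
least

# The last positions index the most frequent terms.
most <- toy_freq[tail(ord, 2)]
most

# Alternatively, sort the whole vector from most to least frequent.
top <- sort(toy_freq, decreasing = TRUE)
top
```

Note that tail(ord) lists the most frequent terms in ascending order, so the single most frequent term appears last, as with data in the output above.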



Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0