20.19 Exploring the Document Term Matrix
We can obtain the term frequencies as a vector by converting the document term matrix into a matrix and summing the column counts:
freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 6508
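To see what the column sum is doing, here is a minimal sketch on a hand-built matrix standing in for `as.matrix(dtm)`. The matrix `m`, its document names, and its counts are made up for illustration and are not taken from the corpus used in this chapter.

```r
# Toy stand-in for as.matrix(dtm): rows are documents, columns are terms.
# The names and counts here are illustrative assumptions only.
m <- matrix(c(2, 0, 1,
              1, 3, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("doc1", "doc2"),
                            Terms = c("data", "mine", "pattern")))

# Summing each column gives the total count of each term across documents.
freq_toy <- colSums(m)
freq_toy

# One entry per distinct term.
length(freq_toy)
```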
By ordering the frequencies we can list the most frequent terms and the least frequent terms:
ord <- order(freq)

# Least frequent terms.
freq[head(ord)]
##      acnntex        dmitl  microsystem       ventur         adra attributeori
##            1            1            1            1            1            1
Notice that these terms appear just once and are probably of little interest to us. Indeed, they are likely to be spurious terms introduced through the conversion of the original document from PDF to text.
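One way to gauge how many such terms there are is to count those with a frequency of one. The following is a sketch on a made-up frequency vector (the names and counts in `freq_toy` are hypothetical); the same expressions should apply to `freq`.

```r
# Hypothetical term frequencies, standing in for freq.
freq_toy <- c(acnntex = 1, data = 10, dmitl = 1, mine = 5, use = 4)

# Number of terms that occur exactly once.
n_once <- sum(freq_toy == 1)
n_once

# Their names, for inspection.
once <- names(freq_toy)[freq_toy == 1]
once
```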
# Most frequent terms.
freq[tail(ord)]
##     can dataset pattern     use    mine    data
##     709     776     887    1366    1446    3101
These terms are much more likely to be of interest to us. Not surprisingly, given the choice of documents in the corpus, the most frequent terms are: can, dataset, pattern, use, mine, data.
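As an alternative sketch, `sort()` can combine the ordering and indexing steps. The vector `freq_toy` below is made up for illustration; only its shape (a named numeric vector) matches `freq`.

```r
# Hypothetical term frequencies, standing in for freq.
freq_toy <- c(can = 7, data = 31, mine = 14, adra = 1, use = 13)

# order() returns index positions, smallest count first, so the
# most frequent terms sit at the tail.
ord_toy <- order(freq_toy)
bottom_up <- freq_toy[tail(ord_toy, 3)]
bottom_up

# sort() returns the values themselves; decreasing = TRUE puts the
# most frequent terms first.
top <- head(sort(freq_toy, decreasing = TRUE), 3)
top
```

Both approaches select the same terms; they differ only in whether the most frequent term is listed last or first.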
Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.