21.23 Identifying Frequent Items and Associations

Often one of the first things we want to do is to get an idea of the most frequent terms in the corpus. We use tm::findFreqTerms() to do this. Here we limit the output to those terms that occur at least 1,000 times:

findFreqTerms(dtm, lowfreq=1000)
## [1] "data" "mine" "use"

That lists only a few terms. We can list more by lowering the frequency threshold:

findFreqTerms(dtm, lowfreq=100)
##   [1] "accuraci"    "acsi"        "advers"      "algorithm"   "also"       
##   [6] "approach"    "associ"      "australia"   "australian"  "build"      
##  [11] "call"        "classif"     "common"      "comput"      "consist"    
##  [16] "csiro"       "data"        "databas"     "dataset"     "develop"    
##  [21] "discoveri"   "effici"      "explor"      "graham"      "howev"      
##  [26] "implement"   "import"      "inform"      "kdd"         "knowledg"   
##  [31] "larg"        "may"         "mine"        "network"     "neural"     
##  [36] "page"        "particular"  "perform"     "problem"     "provid"     
##  [41] "requir"      "research"    "rule"        "scienc"      "search"     
##  [46] "signific"    "structur"    "system"      "techniqu"    "technolog"  
##  [51] "time"        "tool"        "train"       "univers"     "use"        
##  [56] "weight"      "william"     "within"      "work"        "allow"      
##  [61] "analysi"     "appli"       "applic"      "area"        "attribut"   
##  [66] "base"        "can"         "case"        "chang"       "class"      
##  [71] "classifi"    "cluster"     "collect"     "combin"      "compar"     
##  [76] "condit"      "confer"      "contain"     "decis"       "defin"      
##  [81] "describ"     "differ"      "discov"      "discuss"     "distanc"    
##  [86] "distribut"   "domain"      "estim"       "event"       "exampl"     
##  [91] "exist"       "featur"      "find"        "first"       "follow"     
##  [96] "form"        "function"    "general"     "generat"     "given"      
## [101] "group"       "hybrid"      "identifi"    "includ"      "increas"    
## [106] "interest"    "intern"      "interv"      "lead"        "learn"      
## [111] "level"       "like"        "link"        "make"        "mani"       
## [116] "mean"        "measur"      "method"      "model"       "multipl"    
## [121] "need"        "new"         "number"      "object"      "observ"     
## [126] "occur"       "often"       "one"         "paper"       "pattern"    
## [131] "period"      "point"       "predict"     "present"     "proceed"    
## [136] "process"     "propos"      "record"      "refer"       "regress"    
## [141] "relat"       "report"      "repres"      "result"      "sampl"      
## [146] "section"     "select"      "sequenc"     "set"         "similar"    
## [151] "singl"       "stage"       "state"       "statist"     "step"       
## [156] "studi"       "subset"      "support"     "target"      "task"       
## [161] "tempor"      "three"       "transact"    "tree"        "two"        
## [166] "type"        "understand"  "user"        "valu"        "variabl"    
## [171] "well"        "will"        "year"        "age"         "avail"      
## [176] "averag"      "care"        "claim"       "consid"      "cost"       
## [181] "day"         "detect"      "drug"        "effect"      "episod"     
## [186] "error"       "expect"      "expert"      "fig"         "figur"      
## [191] "health"      "hospit"      "http"        "indic"       "individu"   
## [196] "intellig"    "journal"     "machin"      "medic"       "node"       
## [201] "packag"      "patient"     "popul"       "servic"      "show"       
## [206] "small"       "sourc"       "tabl"        "test"        "total"      
## [211] "unexpect"    "angioedema"  "current"     "evalu"       "high"       
## [216] "interesting" "order"       "ratio"       "reaction"    "risk"       
## [221] "unit"        "usual"       "visual"      "entiti"      "experi"     
## [226] "hot"         "insur"       "map"         "nugget"      "open"       
## [231] "polici"      "size"        "spot"        "random"      "vector"     
## [236] "outlier"     "pmml"        "rank"        "rnn"         "window"     
## [241] "adr"         "oper"        "forest"      "subspac"     "rattl"      
## [246] "utar"

We can also find terms associated with a particular word, specifying a minimum correlation through corlimit.

findAssocs(dtm, "data", corlimit=0.6)
## $data
##         mine       induct     challeng         know       answer         need 
##         0.90         0.72         0.70         0.65         0.64         0.63 
## statistician      general      foundat        major         mani        boost 
##         0.63         0.62         0.62         0.61         0.61         0.61 
##         come 
##         0.60

If two terms always appear together across the documents then the correlation will be 1.0, while terms that seldom appear in the same documents will have a correlation close to 0.0. The correlation is thus a measure of how closely associated the terms are within the corpus.
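
To get a feel for what is being measured, a minimal sketch (assuming both terms are present in dtm) is to correlate the per-document counts of two terms ourselves using cor(). The result for data and mine should be close to the 0.90 reported by findAssocs() above:

# Dense matrix of per-document term counts.
m <- as.matrix(dtm)
# Pearson correlation between the counts of "data" and "mine".
cor(m[, "data"], m[, "mine"])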


