21.3 Corpus Sources and Readers

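The tm package (Feinerer and Hornik 2024) supports a variety of document sources. A source describes where the documents come from, such as a directory of files, a character vector, or a data frame. We can list the available sources with tm::getSources().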

tm::getSources()
## [1] "DataframeSource" "DirSource"       "URISource"       "VectorSource"   
## [5] "XMLSource"       "ZipSource"
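
As a minimal sketch of how a source is used, the following builds a small in-memory corpus from a character vector with VectorSource(). The two example strings and the variable names docs and corpus are made up for illustration.

```r
library(tm)

# A small, made-up character vector; each element becomes one document.
docs <- c("Text mining with R.",
          "Sources describe where documents live.")

# VectorSource() wraps the vector; VCorpus() builds an in-memory corpus from it.
corpus <- VCorpus(VectorSource(docs))

length(corpus)        # number of documents in the corpus
content(corpus[[1]])  # text of the first document
```

The other sources follow the same pattern: construct the source object and pass it to VCorpus().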

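Besides coming from different kinds of sources, our documents for text analysis will be stored in many different formats. A reader parses one such format into a document, and tm (Feinerer and Hornik 2024) provides readers for a variety of them. We can list the available readers with tm::getReaders().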

tm::getReaders()
##  [1] "readDataframe"           "readDOC"                
##  [3] "readPDF"                 "readPlain"              
##  [5] "readRCV1"                "readRCV1asPlain"        
##  [7] "readReut21578XML"        "readReut21578XMLasPlain"
##  [9] "readTagged"              "readXML"
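
A reader is typically paired with a source through the readerControl argument when the corpus is created. The sketch below assumes two small plain text files, which it writes to a temporary directory purely for illustration. The directory name tm_demo, the file names, and their contents are hypothetical.

```r
library(tm)

# Write two tiny text files to a temporary directory (illustrative data only).
tmp_dir <- file.path(tempdir(), "tm_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines("First plain text document.", file.path(tmp_dir, "a.txt"))
writeLines("Second plain text document.", file.path(tmp_dir, "b.txt"))

# DirSource() says where the documents are; readPlain says how to parse them.
corpus <- VCorpus(DirSource(tmp_dir, pattern = "\\.txt$"),
                  readerControl = list(reader = readPlain, language = "en"))

length(corpus)        # number of documents read from the directory
content(corpus[[1]])  # text of the first document
```

Swapping readPlain for another reader, such as readPDF for PDF files, changes how each file is parsed while the source stays the same. Some readers rely on external tools being installed.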

References

Feinerer, Ingo, and Kurt Hornik. 2024. tm: Text Mining Package. https://tm.r-forge.r-project.org/.


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0