21.8 Preparing the Corpus

Before analysis, the text data usually needs some pre-processing. Typical transformations include converting the text to lower case, removing numbers and punctuation, removing stop words, stemming, and identifying synonyms. The basic transformations are all available within tm (Feinerer and Hornik 2025).

tm::getTransformations()

## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"

The function tm::tm_map() is used to apply one of these transformations across all documents within a corpus. Other transformations can be implemented using R functions and wrapped within tm::content_transformer() to create a function that can be passed through to tm::tm_map(). We will see an example of that in the next section.
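As a brief preview, here is a minimal sketch of both usages. It assumes the tm package is installed. The two toy documents and the helper name toSpace are invented for illustration and are not part of tm.

```r
library(tm)

# Two toy documents, invented here purely for illustration.
docs <- VCorpus(VectorSource(c("Hello World 2024!",
                               "Text   Mining/Analysis, with R.")))

# Built-in transformations are passed to tm_map() directly.
docs <- tm_map(docs, removeNumbers)

# A hypothetical custom transformation: replace each match of a
# pattern with a space. content_transformer() wraps the plain R
# function so that it operates on the content of each document.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")

docs <- tm_map(docs, removePunctuation)

# tolower() is a base R function rather than a tm transformation,
# so it is also wrapped with content_transformer().
docs <- tm_map(docs, content_transformer(tolower))

# Collapse runs of whitespace into a single space.
docs <- tm_map(docs, stripWhitespace)

# Extract the transformed text of each document as a character vector.
result <- unname(vapply(docs, content, character(1)))
print(result)
```

Additional arguments given to tm::tm_map() after the function, such as the "/" pattern here, are passed on to the transformation itself. The order of the steps matters: the slash is replaced with a space before punctuation is removed, so that the two words it separates are not joined together.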

In the following sections we will apply each of the transformations, one by one, to clean up the text.

References

Feinerer, Ingo, and Kurt Hornik. 2025. tm: Text Mining Package. https://tm.r-forge.r-project.org/.

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0