21.5 PDF Documents
If, instead of text documents, we have a corpus of PDF documents, then we can use the tm::readPDF() reader function to convert the PDFs into text and have that loaded as our Corpus.
docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF))
This will use, by default, the pdftotext command from xpdf to convert the PDF into text format. The xpdf application needs to be installed for tm::readPDF() to work.
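Because the conversion depends on an external program, it can help to check for that program before building the corpus. The following is a minimal sketch rather than a definitive recipe. The helper names have_pdftotext() and read_pdf_corpus() are hypothetical, as is the restriction to files ending in .pdf. It also assumes a version of tm in which readPDF() accepts an engine argument. The default engine may differ between tm versions, so the sketch requests xpdf explicitly.

```r
# Return TRUE if a pdftotext executable is found on the PATH.
have_pdftotext <- function() {
  unname(nzchar(Sys.which("pdftotext")))
}

# Load the PDF files found in `dir` into a tm corpus, stopping early
# with a readable message if a prerequisite is missing.
# `dir` is a folder of PDF files, playing the role of cname above.
read_pdf_corpus <- function(dir) {
  if (!requireNamespace("tm", quietly = TRUE)) {
    stop("The tm package is not installed.")
  }
  if (!have_pdftotext()) {
    stop("pdftotext was not found on the PATH; install xpdf first.")
  }
  tm::Corpus(
    # Assumption: only files whose names end in .pdf should be read.
    tm::DirSource(dir, pattern = "\\.pdf$", ignore.case = TRUE),
    # Ask for the xpdf engine explicitly instead of relying on the default.
    readerControl = list(reader = tm::readPDF(engine = "xpdf"))
  )
}
```

With these helpers defined, docs <- read_pdf_corpus(cname) would build the corpus as before, but a missing pdftotext is reported up front rather than as a failure part way through the conversion.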
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0
