20.5 PDF Documents

If instead of text documents we have a corpus of PDF documents, then we can use the tm::readPDF() reader function to convert each PDF into text and load the result as our Corpus.

docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF))

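This will use, by default, the pdftotext command from xpdf to convert each PDF into text format. The xpdf application needs to be installed, with pdftotext on the system path, for tm::readPDF() to work this way. Note that more recent versions of tm default to the pdftools package instead; the engine= argument of readPDF() selects the converter explicitly.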
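As a minimal sketch of the above, the following checks that pdftotext can be found before building the corpus, and requests the xpdf engine explicitly rather than relying on the default. The helper have_pdftotext() and the folder in cname are hypothetical names chosen for illustration, and the pattern= filter is an assumption that the folder may also hold non-PDF files.

```r
# Hypothetical helper: TRUE if the pdftotext utility is found on the PATH.
have_pdftotext <- function() unname(nzchar(Sys.which("pdftotext")))

# Hypothetical location of the PDF files; adjust to suit.
cname <- file.path(".", "corpus", "pdf")

if (requireNamespace("tm", quietly=TRUE) && have_pdftotext() && dir.exists(cname)) {
  library(tm)
  # Only pick up files ending in .pdf, and ask for the xpdf engine explicitly.
  docs <- Corpus(DirSource(cname, pattern="\\.pdf$"),
                 readerControl=list(reader=readPDF(engine="xpdf")))
  print(docs)
} else {
  message("tm, pdftotext, or the PDF folder is missing; corpus not built.")
}
```

Naming the engine keeps the behaviour the same across tm versions whose default converter differs.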

