21 Text Mining

20190221

Text Mining (or Text Analytics) applies analytic tools to analyse and learn from collections of text data. Text data might include social media posts, books, newspapers, emails, and research papers. The goal is similar to that of a human learning by reading such material. Using automated algorithms we can learn from massive amounts of text, far more than any human could read, and with the advent of large language models the power of this learning is evident. Such models are trained on very large collections of text gathered from the Internet, in some cases including text transcribed from videos, amounting to a vast quantity of text data.

With any corpus of text material (e.g., today’s newspapers) we might begin by summarising the main themes and identifying those of most interest to us. Or we might monitor social media feeds to identify emerging topics that we need to act upon as they emerge.

You can download a small selection of txt files to form your corpus for exploration. From Togaware download corpus_papers.zip. Once downloaded, unzip the archive into corpus/txt/. We will use this corpus to illustrate text mining.
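The unzip and load steps can be sketched in Python as follows. This is only an illustrative sketch, not the book's own code: it assumes corpus_papers.zip has already been downloaded into the current working directory, and the helper names extract_corpus and load_corpus are hypothetical, introduced here for the example.

```python
from pathlib import Path
import zipfile


def extract_corpus(archive, target="corpus/txt"):
    """Unzip the archive into the target directory, creating it if needed."""
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


def load_corpus(directory="corpus/txt"):
    """Read every .txt file under directory into a {relative path: text} dict."""
    directory = Path(directory)
    docs = {}
    for path in sorted(directory.rglob("*.txt")):
        # Undecodable bytes are replaced rather than raising an error,
        # since the encoding of the corpus files is an assumption (UTF-8).
        key = path.relative_to(directory).as_posix()
        docs[key] = path.read_text(encoding="utf-8", errors="replace")
    return docs


if __name__ == "__main__":
    # Assumed location of the downloaded archive (hypothetical).
    archive = Path("corpus_papers.zip")
    if archive.exists():
        extract_corpus(archive)
        docs = load_corpus()
        print(f"Loaded {len(docs)} documents")
```

Holding the corpus as a plain dictionary keyed by file name keeps each document's source visible, which helps when later summaries need to be traced back to a particular file.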

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0