21.41 LDA

Topic models such as Latent Dirichlet Allocation (LDA) have been popular for text mining over the past 15 years, applied with varying degrees of success. A collection of documents is fed into LDA to extract the topics underlying the text. Well known examples include the Associated Press (AP) corpus and the Science corpus 1880-2002 (Blei and Lafferty 2009).

When is LDA applicable? It will fail on some data, we need to choose the number of topics to find, and we need to know how many documents are required. How do we know that the topics learned are the correct topics?
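
One pragmatic, if rough, approach to choosing the number of topics is to fit models over a range of candidate values and compare the resulting log likelihoods (a held-out measure is preferable when available). The following is a minimal sketch only, assuming the lda package and a corpus already prepared as docs and vocab (placeholder names, e.g. as produced by lexicalize()) in the format expected by lda.collapsed.gibbs.sampler(), as illustrated later in this section.

library(lda)

## Placeholder corpus: 'docs' and 'vocab' are assumed to be in the format
## expected by lda.collapsed.gibbs.sampler().
candidate.K <- c(5, 10, 20, 40)

fits <- lapply(candidate.K, function(k)
  lda.collapsed.gibbs.sampler(docs, k, vocab,
                              num.iterations=100,
                              alpha=0.1, eta=0.1,
                              compute.log.likelihood=TRUE))

## Final full log likelihood for each candidate K (first row of the
## 2 x num.iterations matrix returned when compute.log.likelihood=TRUE).
## A value that stops improving as K grows suggests a reasonable choice.
data.frame(K=candidate.K,
           logLik=sapply(fits, function(f) tail(f$log.likelihoods[1, ], 1)))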

Two fundamental papers, discovered independently: Blei, Ng, and Jordan (NIPS 2001), with some 11,000 citations, and Pritchard, Stephens, and Donnelly (Genetics, June 2000), with some 14,000 citations. The models are essentially the same apart from minor differences, with topics in one corresponding to population structures in the other.

There is no theoretical analysis as such: how do we guarantee the correct topics are learned, and how efficient is the learning procedure?

Observations:

LDA won't work on a large collection of very short documents, such as tweets, nor on a very small number of long documents.

We should not liberally over-fit the LDA with too many redundant topics.

Limiting factors:

We should use as many documents as we can, but short documents of fewer than 10 words won't work even if there are many of them; sufficiently long documents are needed.

A small Dirichlet parameter helps, especially if we over-fit (see Long Nguyen's keynote at PAKDD 2015 in Vietnam); a sketch follows this list.

The number of documents is the most important factor.

Document length plays a useful role too.

Avoid over-fitting: with too many topics we do not really learn anything, and a human then has to cull the redundant topics.
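
To illustrate the point about small Dirichlet parameters, the following sketch (again with placeholder docs and vocab objects, and illustrative hyperparameter values not taken from the source) deliberately asks for more topics than we expect while keeping alpha (the document-topic prior) and eta (the topic-word prior) small. Topics that attract very little of the probability mass are then easy to identify for culling.

library(lda)

## Placeholder corpus as before. Ask for a generous number of topics but
## keep the Dirichlet hyperparameters small so the fitted distributions
## stay sparse (illustrative values only).
fit <- lda.collapsed.gibbs.sampler(docs, 50, vocab,
                                   num.iterations=200,
                                   alpha=0.02,  ## document-topic Dirichlet
                                   eta=0.02,    ## topic-word Dirichlet
                                   compute.log.likelihood=TRUE)

## Share of all word assignments going to each topic; topics with a
## negligible share are candidates for culling.
round(sort(fit$topic_sums / sum(fit$topic_sums), decreasing=TRUE), 3)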

New work detects new topics as they emerge.

The following example, using the lda package, builds a ten-topic model of the cora corpus of computer science abstracts (included with the package) and plots the topic proportions for a random sample of documents.

library(lda)        ## lda.collapsed.gibbs.sampler() and the cora corpus
library(ggplot2)    ## plotting
library(reshape2)   ## melt()

data(cora.documents)
data(cora.vocab)
theme_set(theme_bw())
set.seed(8675309)
K <- 10 ## Num clusters
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                       K,  ## Num clusters
                                       cora.vocab,
                                       25,  ## Num iterations
                                       0.1, ## alpha: document-topic Dirichlet
                                       0.1, ## eta: topic-word Dirichlet
                                       compute.log.likelihood=TRUE)
## Get the top words in the cluster
top.words <- top.topic.words(result$topics, 5, by.score=TRUE)
## Number of documents to display
N <- 10

## Normalise the per-document topic counts into proportions.
topic.proportions <- t(result$document_sums) / colSums(result$document_sums)

## Randomly select N documents to display.
topic.proportions <-
   topic.proportions[sample(1:dim(topic.proportions)[1], N),]

## Documents with no assigned words get uniform topic proportions.
topic.proportions[is.na(topic.proportions)] <-  1 / K

## Label each topic by its top five words.
colnames(topic.proportions) <- apply(top.words, 2, paste, collapse=" ")

## Reshape to long format for plotting.
topic.proportions.df <- melt(cbind(data.frame(topic.proportions),
                                   document=factor(1:N)),
                             variable.name="topic",
                             id.vars = "document")

## Plot the topic proportions for each sampled document.
ggplot(topic.proportions.df, aes(x=topic, y=value, fill=topic)) +
    geom_bar(stat="identity") +
    theme(axis.text.x = element_text(angle=45, hjust=1, size=7),
          legend.position="none") +
    coord_flip() +
    facet_wrap(~ document, ncol=5)
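
Since compute.log.likelihood=TRUE was set above, the sampler also returns a two-row matrix of per-iteration log likelihoods, the first row being the full log likelihood. A quick trace plot, as sketched below, indicates whether 25 iterations were enough or whether the chain is still climbing.

## Trace of the full log likelihood across Gibbs iterations; a flattening
## curve suggests the sampler has more or less converged.
plot(result$log.likelihoods[1, ], type="l",
     xlab="Gibbs iteration", ylab="Full log likelihood")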

