Python code for the CorEx topic model is available via pip:
pip install corextopic
The CorEx topic model takes a document-by-word matrix as input.
import numpy as np
import scipy.sparse as ss
from corextopic import corextopic as ct

# Define a matrix where rows are samples (docs) and columns are features (words)
X = np.array([[0,0,0,1,1],
              [1,1,1,0,0],
              [1,1,1,1,1]], dtype=int)
# Sparse matrices are also supported
X = ss.csr_matrix(X)
# Word labels for each column can be provided to the model
words = ['dog', 'cat', 'fish', 'apple', 'orange']
# Document labels for each row can be provided
docs = ['fruit doc', 'animal doc', 'mixed doc']

# Train the CorEx topic model
topic_model = ct.Corex(n_hidden=2)  # Define the number of latent (hidden) topics to use.
topic_model.fit(X, words=words, docs=docs)
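In practice the matrix usually has to be built from raw text first. The following is a minimal sketch of one way to do that with the standard library only. The toy corpus, the whitespace tokenization, and the choice of 0/1 entries (mirroring the example matrix above) are assumptions made for illustration, not requirements stated here.

```python
# Hypothetical toy corpus keyed by document label; any tokenizer could be substituted.
raw_docs = {
    'fruit doc': 'apple orange apple',
    'animal doc': 'dog cat fish',
    'mixed doc': 'dog cat fish apple orange',
}

# One column per distinct word, sorted so the column order is reproducible.
words = sorted({token for text in raw_docs.values() for token in text.split()})
column = {word: j for j, word in enumerate(words)}

docs = list(raw_docs)
# Binary entries: 1 if the word occurs in the document at all, 0 otherwise.
X = [[0] * len(words) for _ in docs]
for i, doc in enumerate(docs):
    for token in raw_docs[doc].split():
        X[i][column[token]] = 1

print(words)
for doc, row in zip(docs, X):
    print(doc, row)
```

The nested list can be converted with np.array(X, dtype=int) and passed to fit() together with words and docs.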
Once the model is trained, the topics can be accessed through the get_topics() function. Words are ranked according to their mutual information with the topic.
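topics = topic_model.get_topics()
for topic_n, topic in enumerate(topics):
    # Each entry starts with the word. Index it instead of unpacking into
    # "words", which would overwrite the column labels defined earlier.
    topic_words = [entry[0] for entry in topic]
    topic_str = str(topic_n+1) + ': ' + ','.join(topic_words)
    print(topic_str)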
Similarly, the most probable documents for each topic can be accessed through the get_top_docs() function.
top_docs = topic_model.get_top_docs()
for topic_n, topic_docs in enumerate(top_docs):
    # Use a new name so the document labels in "docs" are not overwritten
    doc_labels, probs = zip(*topic_docs)
    topic_str = str(topic_n+1) + ': ' + ','.join(doc_labels)
    print(topic_str)
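To make the ranking of words by mutual information more concrete, here is a small sketch that computes the textbook empirical mutual information between a binary word column and a binary topic label, then sorts words by it. The function name, the four-document data, and the use of nats are assumptions for illustration. The estimate CorEx computes internally may differ.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two equal-length sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * log(count * n / (px[x] * py[y]))
    return mi

# Hypothetical data: whether each of four documents expresses a topic,
# and whether each document contains a given word.
topic_labels = [1, 1, 0, 0]
word_columns = {
    'apple': [1, 0, 1, 0],  # independent of the topic
    'fish':  [1, 0, 0, 0],  # partly informative
    'dog':   [1, 1, 0, 0],  # tracks the topic exactly
}

scores = {w: mutual_information(col, topic_labels) for w, col in word_columns.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A word that matches the topic label exactly gets the highest score, and a word that is statistically independent of the label scores zero.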
For the CorEx topic model, topics are binary latent factors that are either expressed or not in each document. We can model the topics hierarchically by building a document-by-topic matrix from a trained model and using it as input to another CorEx topic model. Repeating this step builds a hierarchical representation of topics.
# Train the first layer
topic_model = ct.Corex(n_hidden=100)
topic_model.fit(X)

# Train successive layers
tm_layer2 = ct.Corex(n_hidden=10)
tm_layer2.fit(topic_model.labels)

tm_layer3 = ct.Corex(n_hidden=1)
tm_layer3.fit(tm_layer2.labels)
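The layer-by-layer pattern above can be wrapped in a loop. The sketch below uses a hypothetical helper, fit_hierarchy, that only assumes each model exposes fit() and a labels attribute as in the code above. So that the example runs without the library installed, it is exercised with StandInModel, a placeholder class that is not CorEx and whose grouping rule is purely illustrative.

```python
def fit_hierarchy(X, layer_sizes, make_model):
    """Fit one model per layer, feeding each layer's labels to the next layer."""
    layers = []
    layer_input = X
    for n_hidden in layer_sizes:
        model = make_model(n_hidden)
        model.fit(layer_input)
        layers.append(model)
        layer_input = model.labels  # document-by-topic matrix for the next layer
    return layers

class StandInModel:
    """Placeholder exposing fit() and labels like the models above (not CorEx)."""
    def __init__(self, n_hidden):
        self.n_hidden = n_hidden
        self.labels = None

    def fit(self, X):
        # Toy rule: input column c feeds output c % n_hidden, and an output
        # is on if any of its input columns is on.
        self.labels = [
            [int(any(v for c, v in enumerate(row) if c % self.n_hidden == j))
             for j in range(self.n_hidden)]
            for row in X
        ]
        return self

X = [[0, 0, 0, 1, 1], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
layers = fit_hierarchy(X, [2, 1], StandInModel)
for depth, model in enumerate(layers, start=1):
    print(depth, len(model.labels), 'docs x', len(model.labels[0]), 'topics')
```

With the library available, the same helper could be called as fit_hierarchy(X, [100, 10, 1], lambda n: ct.Corex(n_hidden=n)), which reproduces the three layers shown above.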
Anchored CorEx allows a user to anchor words to topics in a semi-supervised fashion to uncover otherwise elusive topics. If words is initialized, anchoring is straightforward:
topic_model.fit(X, words=words, anchors=[['dog','cat'], 'apple'], anchor_strength=2)
This anchors “dog” and “cat” to the first topic, and “apple” to the second topic. You can anchor in many creative ways. For example:
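# Anchor one group of words to a single topic
topic_model.fit(X, words=words, anchors=[['snow', 'cold', 'avalanche']], anchor_strength=4)
# Anchor a single word to several topics: "protest" to the first three, "riot" to the next three
topic_model.fit(X, words=words, anchors=['protest', 'protest', 'protest', 'riot', 'riot', 'riot'], anchor_strength=2)
# Anchor overlapping groups that share "mountain" to two different topics
topic_model.fit(X, words=words, anchors=[['bernese', 'mountain', 'dog'], ['mountain', 'rocky', 'colorado']], anchor_strength=2)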
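Because anchors are given as words, it can help to confirm that every anchor word actually appears in the words list before fitting. The sketch below is a precaution under that assumption. missing_anchor_words is a hypothetical helper, not part of corextopic, and nothing here describes how the library itself treats unknown anchor words.

```python
def missing_anchor_words(anchors, words):
    """Return anchor words absent from the vocabulary, in first-seen order."""
    vocabulary = set(words)
    missing = []
    for anchor in anchors:
        # An anchor is either a single word or a list of words for one topic.
        group = [anchor] if isinstance(anchor, str) else anchor
        for word in group:
            if word not in vocabulary and word not in missing:
                missing.append(word)
    return missing

words = ['dog', 'cat', 'fish', 'apple', 'orange']

ok = missing_anchor_words([['dog', 'cat'], 'apple'], words)
bad = missing_anchor_words([['bernese', 'mountain', 'dog'],
                            ['mountain', 'rocky', 'colorado']], words)
print(ok)
print(bad)
```

An empty result means every anchor word maps to a column of the input matrix. Anything else lists the words to fix or drop before calling fit().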
For a more detailed tutorial of how to run the CorEx topic model, extract information from it, choose the number of topics, and run hierarchical and semi-supervised models, see this Jupyter notebook.