
Article Id: WHEBN0028934119
Author: World Heritage Encyclopedia
Language: English
Subject: Natural Language Processing, Gensim, Concentration parameter, Digital Humanities Summer Institute, Explicit semantic analysis
Publisher: World Heritage Encyclopedia

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
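The 10%/90% intuition above can be sketched as a toy generative process. The word distributions and topic proportions below are invented for illustration and are not drawn from any fitted model:

```python
import random
from collections import Counter

# Hypothetical two-topic model: each topic is a distribution over words,
# and a document mixes topics in fixed proportions (10% cats, 90% dogs).
topics = {
    "cats": {"cat": 0.45, "meow": 0.35, "the": 0.10, "is": 0.10},
    "dogs": {"dog": 0.45, "bone": 0.35, "the": 0.10, "is": 0.10},
}
doc_topic_mix = {"cats": 0.1, "dogs": 0.9}

def generate_document(n_words, rng):
    """Generate a document word by word: first pick a topic from the
    document's topic mixture, then pick a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topic_mix),
                            weights=doc_topic_mix.values())[0]
        dist = topics[topic]
        words.append(rng.choices(list(dist), weights=dist.values())[0])
    return words

rng = random.Random(0)
counts = Counter(generate_document(10_000, rng))
# "dog" should outnumber "cat" by roughly 9 to 1, matching the mixture.
print(counts.most_common())
```

A topic-modeling algorithm runs this process in reverse: given only the word counts, it infers the topic distributions and each document's mixture.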

Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.


History

An early topic model was described by Papadimitriou, Raghavan, Tamaki, and Vempala in 1998.[1] Another, called probabilistic latent semantic indexing (PLSI), was created by Thomas Hofmann in 1999.[2] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002; it allows documents to contain a mixture of topics.[3] Other topic models are generally extensions of LDA, such as pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics.

Case studies

Templeton's survey of work on topic modeling in the humanities grouped previous work into synchronic and diachronic approaches. Synchronic approaches identify topics at a single point in time; for example, Jockers used topic modeling to classify 177 bloggers writing on the 2010 'Day of Digital Humanities' and to identify the topics they wrote about that day. Meeks modeled 50 texts in the humanities computing/digital humanities genre to identify self-definitions of scholars working on digital humanities and to visualize networks of researchers and topics. Drouin examined Proust to identify topics and show them as a graphical network.

Diachronic approaches track topics over time. Block and Newman determined the temporal dynamics of topics in Martha Ballard's diary, identifying thematic trends across its 27-year span. Mimno used topic modeling on 24 journals in classical philology and archaeology spanning 150 years to examine how topics in the journals changed over time and how the journals became more similar to or different from one another.


Algorithms

In practice, researchers attempt to fit appropriate model parameters to the data corpus using one of several heuristics for maximum-likelihood fit. A 2012 survey by Blei describes this suite of algorithms.[4] Several groups of researchers, starting with Papadimitriou et al.,[1] have attempted to design algorithms with provable guarantees: assuming that the data were actually generated by the model in question, they try to design algorithms that provably recover the model that was used to create the data. Techniques used here include singular value decomposition (SVD), the method of moments, and, more recently, an algorithm based on non-negative matrix factorization (NMF); the last also generalizes to topic models that allow correlations among topics.[5]
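As a minimal illustration of the NMF route (a toy sketch with plain Lee-Seung multiplicative updates, not the provably-correct algorithm cited above), two topics can be recovered from a small term-document count matrix. The vocabulary and counts here are invented:

```python
import numpy as np

# Toy term-document count matrix V: rows are terms, columns are documents.
# The first two documents are about dogs, the last two about cats.
terms = ["dog", "bone", "cat", "meow"]
V = np.array([
    [4.0, 5.0, 0.0, 0.0],   # "dog"
    [3.0, 4.0, 0.0, 0.0],   # "bone"
    [0.0, 0.0, 5.0, 4.0],   # "cat"
    [0.0, 0.0, 3.0, 4.0],   # "meow"
])

rng = np.random.default_rng(0)
k = 2                                   # number of topics to extract
W = rng.random((V.shape[0], k)) + 0.1   # term-topic weights
H = rng.random((k, V.shape[1])) + 0.1   # topic-document weights

eps = 1e-9  # avoid division by zero
for _ in range(200):
    # Lee-Seung multiplicative updates for the Frobenius objective V ~ W @ H;
    # they keep W and H non-negative at every step.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# Each column of W is a topic: its largest entries name the topic's words.
for t in range(k):
    top = [terms[i] for i in np.argsort(W[:, t])[::-1][:2]]
    print(f"topic {t}: {top}")
```

Because the toy matrix is block-structured, the two recovered topics separate the dog words from the cat words; real corpora are far noisier, which is why the provable-guarantee work above matters.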

See also

Software / Libraries

  • Mallet (software project)
  • Stanford Topic Modeling Toolkit
  • Gensim - Topic Modeling for Humans


References

  1. ^ a b Papadimitriou, Christos; Raghavan, Prabhakar; Tamaki, Hisao; Vempala, Santosh (1998). "Latent Semantic Indexing: A probabilistic analysis" (Postscript). Proceedings of ACM PODS. 
  2. ^ Hofmann, Thomas (1999). "Probabilistic Latent Semantic Indexing" (PDF). Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval. 
  3. ^ Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003). "Latent Dirichlet Allocation". Journal of Machine Learning Research 3: 993–1022. 
  4. ^ Blei, David M. (April 2012). "Introduction to Probabilistic Topic Models" (PDF). Comm. ACM 55 (4): 77–84.  
  5. ^ Sanjeev Arora; Rong Ge; Ankur Moitra (April 2012). "Learning Topic Models—Going beyond SVD". arXiv:1204.1956.

External links

  • Mimno, David. "Topic modeling bibliography". 
  • Templeton, Clay. "Topic Modeling in the Humanities: An Overview". Maryland Institute for Technology in the Humanities. 
  • Brett, Megan R. "Topic Modeling: A Basic Introduction". Journal of Digital Humanities. 
  • Topic Models Applied to Online News and Reviews. Video of a Google Tech Talk presentation by Alice Oh on topic modeling with LDA.
  • Modeling Science: Dynamic Topic Models of Scholarly Research. Video of a Google Tech Talk presentation by David M. Blei.
  • Automated Topic Models in Political Science. Video of a presentation by Brandon Stewart at the Tools for Text Workshop, 14 June 2010.
  • Shawn Graham, Ian Milligan, and Scott Weingart "Getting Started with Topic Modeling and MALLET". The Programming Historian. 

Sourced from World Heritage Encyclopedia™ licensed under CC BY-SA 3.0