Topic Modelling

From Clarin K-Centre
Toposcope

Toposcope detects topics in unstructured text data. It provides annotations and visualizations of the detected topics, including changes in topic frequency over time. The tool offers four algorithms: BERTopic (Grootendorst, 2022), Top2Vec (Angelov, 2020), Non-negative Matrix Factorization (Choo et al., 2013), and Latent Dirichlet Allocation (Blei et al., 2003). Users can adjust a selection of topic model parameters and apply built-in preprocessing steps such as lemmatization and stopword removal. The input format is identical to the Styloscope format: users can upload a local corpus (CSV/ZIP) or use a Hugging Face dataset. The output includes visualizations of the topic-document clusters (as shown below) and the most important keywords per topic. The raw results include, among other things, annotations, a topic-document matrix, and a topic-term matrix. Topic diversity and topic coherence are also computed to help users evaluate the results.
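To make two of the outputs mentioned above more concrete, the following is a minimal Python sketch, not Toposcope's own code. It assumes topic diversity is defined in the commonly used way, as the fraction of unique words among the top keywords of all topics, and it assumes topic frequency over time is a simple count of documents per topic and period. The function names and the toy data are hypothetical; the source does not state which exact formulas Toposcope uses.

```python
from collections import Counter, defaultdict


def topic_diversity(topic_keywords, top_k=10):
    """Fraction of unique words among the top_k keywords of all topics.

    1.0 means no keyword is shared between topics; values near 0
    indicate heavily overlapping (redundant) topics.
    """
    top_words = [word for words in topic_keywords for word in words[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


def topic_frequency_over_time(assignments):
    """Count documents per topic for each time period.

    `assignments` is an iterable of (period, topic_id) pairs,
    one pair per document.
    """
    counts = defaultdict(Counter)
    for period, topic_id in assignments:
        counts[period][topic_id] += 1
    return {period: dict(c) for period, c in sorted(counts.items())}


# Hypothetical toy data for illustration only (not Toposcope output).
topics = [
    ["election", "vote", "party", "minister"],  # keywords of topic 0
    ["match", "goal", "team", "party"],         # keywords of topic 1
]
docs = [(2020, 0), (2020, 1), (2021, 0), (2021, 0)]  # (year, topic_id)

diversity = topic_diversity(topics, top_k=4)
frequency = topic_frequency_over_time(docs)

print(diversity)  # 0.875 ("party" appears in both topics: 7 unique / 8 total)
print(frequency)  # {2020: {0: 1, 1: 1}, 2021: {0: 2}}
```

A low diversity score on real output would suggest that several topics share the same keywords, which may be a reason to lower the number of topics or extend the stopword list.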