Optional Settings Documentation

The following is a first draft of documentation for the “Optional Settings” window of the Topic Modeling Tool. Expect this document to change.

These settings fall under three categories:

  1. Always put significant thought into this (num topics, stoplist) ***
  2. You may need to modify this in some cases (threads, case folding, alpha) **
  3. Defaults are fine for almost all cases (beta) *

(TODO: Change “density” to “smoothing” here and in the tool)

(TODO: Add the --num-icm-iterations option here and in the tool)

Metadata File ***

A table of comma separated values (CSV) containing metadata to be included in the output. Each row represents a single document, and each column represents a metadata feature. The first column must be the filename. Many of the most interesting applications of topic modeling involve analyzing topic distributions in relation to metadata categories such as date of publication, genre, author, or publisher.
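For illustration, a minimal metadata file might look like the following (the filenames and column names are made up; any metadata columns may follow the filename):

```
filename,author,year
emma.txt,Austen,1815
dracula.txt,Stoker,1897
```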

Stopword File ***

A file listing words to exclude from analysis, one word per line. Samples for a few languages are available in the MALLET repository.
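For illustration, the first few lines of an English stopword file might read:

```
a
an
the
of
for
```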

Many users find that the word lists generated by the tool are easier to interpret when very common words are excluded (e.g. for English-language models, “a,” “the,” “of,” “for,” and so on). In some cases, excluding other kinds of words can be beneficial. For example, topic models of fiction can be skewed by character names, which may be shared between works that are otherwise unrelated.

Some researchers, especially linguists, argue that excluding common words throws away important information. For users interested primarily in “content words,” these exclusions probably won’t hurt, and may help. Users interested in low-level linguistic phenomena should try running the tool without any exclusions at least once to see how it affects the output.

Remove default English stopwords ***

For convenience, the topic modeling tool includes a list of common English stopwords, which it excludes by default. Disable this option to keep these words in the analysis.

Preserve case **

By default, the topic modeling tool ignores capitalization. Generally speaking, typographical and orthographical details such as capitalization are not relevant to topic models, which pay no attention to word order, and may be skewed by small inconsistencies. However, particular research questions may call for models that are sensitive to capitalization. Users may sometimes also use capitalization as a form of annotation or markup that the topic modeling tool should be aware of. Those users should enable this option.

Generate HTML output **

The topic modeling tool can generate a browsable set of HTML documents summarizing the results. These are useful for informal inspection and evaluation, but users who wish to import their results into another tool (such as Excel) needn’t leave this option enabled.

These files can take up a large amount of disk space. Users who are analyzing very large corpora should plan accordingly, or disable this option.

Preserve raw MALLET output **

Although the CSV output that the topic modeling tool generates can be read by many tools, some applications require all of the data generated by MALLET, the topic modeling tool’s underlying command-line application. These files can take up a large amount of disk space, so this option is disabled by default.

Tokenize with regular expression **

Topic modeling uses a “bag of words” model, which means that input texts are divided up into unordered collections of words before further processing. To determine where boundaries between words should fall, the topic modeling tool uses a kind of search string called a regular expression. The default regular expression here should work well for many languages that use whitespace to mark boundaries between words.

Users working with languages that do not separate words with whitespace may need to preprocess their text using specialized tools. Other users may have texts that denote word boundaries in more complex ways. This field allows those users to specify alternative word separation schemes.
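To make the idea concrete, here is a minimal sketch of regex-based tokenization. The pattern below is an illustration, not the tool’s actual default expression: it matches runs of letters, digits, and underscores, so word boundaries fall at whitespace and punctuation.

```python
import re

def tokenize(text, pattern=r"\w+"):
    # Split text into a "bag of words" by finding every match of the
    # pattern; everything between matches (spaces, punctuation) is a
    # word boundary.
    return re.findall(pattern, text.lower())

print(tokenize("The quick brown fox."))
# ['the', 'quick', 'brown', 'fox']
```

A user whose texts mark word boundaries differently would supply a different pattern in this field rather than changing code.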

Number of iterations **

The process that generates a topic model is surprisingly simple. The tool begins with a random guess about how to divide the corpus into topics. It then makes a second guess that is slightly more likely to be a good guess; then a third; then a fourth. Later guesses are more and more likely to be good guesses, but eventually the gain levels off, and there’s no point in continuing after that.

“Iterations” in this context just means “guesses,” and this parameter dictates the total number of guesses to make.

The best value to enter here depends mostly on corpus size. The default, 400 guesses, is probably reasonable for a corpus that contains a few hundred articles. For a corpus that contains many thousands of articles, or a few hundred novels, 1000 might be a better number. For very large corpora, even larger numbers might be appropriate.

Users who are using this tool for more than exploratory purposes should try a few different values to see how the results change.

Number of training threads **

This dictates the number of threads to run simultaneously. If you have a multicore processor, you can set this to a number that takes full advantage of all cores. Two threads per core is a common rule of thumb.

If you do not have a multicore processor, the default value, 4, will not make things faster, but it probably won’t make things slower either.

Number of topic words to print **

Every topic in a topic model is associated with a ranked list of words. The list contains every word in the corpus, but the words at the top of the list are much more strongly associated with the topic. This determines how far down the list to go. It’s common to print just the first ten or twenty words, but users sometimes prefer to print more. Some researchers have argued that printing more is important because words further down the list may provide important information, or may show when a topic is a combination of two groups of mostly unrelated words – a common phenomenon.
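This setting amounts to truncating a ranked list. A minimal sketch, using a hypothetical topic with made-up word weights:

```python
def top_words(topic, n):
    # Sort the topic's words by weight, highest first, and keep
    # only the first n for display.
    return sorted(topic, key=topic.get, reverse=True)[:n]

topic = {"river": 0.12, "bank": 0.10, "water": 0.08, "money": 0.002}
print(top_words(topic, 3))
# ['river', 'bank', 'water']
```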

Alpha & Beta optimization frequency *

This value controls the way the topic modeling tool adjusts its prior assumptions about the distributions of words and topics in a corpus. The default value, 10, tells the tool to update its assumptions once every ten iterations. Most users will not need to change this value. However, under some circumstances, users may prefer not to allow the tool to adjust its prior assumptions at all. Those users should set this value to 0, which disables adjustments entirely.

For more about these prior assumptions, see below.

Topic density parameter (Alpha) ** & Word density parameter (Beta) *

These values are intimately connected to the underlying algorithm, LDA, and most users will not need to change them. They control the kinds of guesses the tool makes about how topics are associated with documents, and how words are associated with topics. They do so by controlling how smooth these distributions are in the model. From one of the creators of MALLET, David Mimno:

The short version: Set Alpha to 5.0 and Beta to 0.01 and don’t worry about it. I strongly suggest allowing Mallet to optimize these parameters. Topics will look cleaner and be more stable, although you may need to watch out for very small, not-very-good topics.

The long version: I think of these as “smoothing” parameters. The algorithm works by assigning the tokens in documents to individual topics. A good allocation of tokens uses relatively few topics per document, and relatively few words per topic. Finding a good allocation of tokens to topics is hard since everything depends on everything else, and we need to resort to randomized search.

In the innermost loop, the algorithm looks at every token and decides whether to shift it to a different topic. The chance that a word token w in a document d belongs to topic k is the proportion of tokens in d that are currently assigned to k times the proportion of words in k that are of the same type. In other words, topics that currently have many tokens in that document are more likely than topics that don’t, and topics that include the same word many times are more likely than topics that don’t.

So what if a topic never occurs in the document, or the topic has no other instances of that word? We want the topic to be unlikely, but we don’t want that topic to be impossible! Our search for a good allocation is somewhat randomized, and we don’t want to cut off possibilities prematurely. So we pretend that we saw a fake “word” in every topic in the document and in every topic. To give ourselves a bit of extra flexibility we can even make our fake word count for just a fraction of a word. And that’s all the alpha and beta parameters are: the weight of the fraction of a word we reserve for every topic. So if beta is 0.01, that means that a word that currently doesn’t occur in that topic isn’t impossible, it’s just 101x less probable than a word that occurs exactly once in the topic (1.01 vs 0.01).

In practice, beta = 0.01 is a good value for natural language in almost any language, and optimizing that parameter rarely changes it. Setting alpha is more complicated. If you use larger values, documents will have more uniform topic distributions, since “random” topics have more weight. Smaller values lead to more concentrated distributions.
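The arithmetic in the quoted example can be checked directly. A minimal sketch of the fractional “fake word”: adding beta to every count means a word unseen in a topic is unlikely but never impossible.

```python
beta = 0.01

def smoothed(count, beta):
    # Every word count gets the fractional fake word added to it.
    return count + beta

# A word seen once in the topic vs. a word never seen in it:
ratio = smoothed(1, beta) / smoothed(0, beta)
print(round(ratio))  # 101, i.e. 1.01 vs 0.01, as in the example above
```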

Random number generator seed **

When the topic modeling tool makes guesses, it does so using a partially random process. The source of randomness for this process is a sequence of random numbers. This setting allows you to fix a starting point in that sequence, a “seed” value. Every time the tool runs with the same seed, it will use the same sequence of random numbers. If all the other settings remain the same, and the corpus remains the same, then the output will be identical (in theory). Users who want their model to be perfectly repeatable will need to set this to a value other than 0, and report the value they used.
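The principle can be sketched with Python’s standard random number generator (the tool itself passes the seed to MALLET, but the behavior is analogous): two generators started from the same seed produce the same sequence of “random” numbers.

```python
import random

# Two generators seeded identically make identical sequences, so a
# run that depends on them is repeatable.
rng_a = random.Random(42)
rng_b = random.Random(42)
assert [rng_a.random() for _ in range(5)] == [rng_b.random() for _ in range(5)]
```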

Divide input into n-word chunks **

Users who would like to analyze a few very large documents may find that they get better results by dividing the documents into smaller chunks. This setting does that automatically, splitting each document into consecutive chunks of the specified number of words.

This strategy may sometimes backfire: the topic modeling tool may determine that each of the large documents corresponds to one single topic, and assign the overwhelming majority of each chunk to the topic of the document it came from. If you have a small number of large documents, this might help improve the results you get, but it might not.
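The chunking itself is straightforward. A minimal sketch of what the tool does conceptually (the helper name is hypothetical):

```python
def chunk_words(words, n):
    # Consecutive n-word chunks; the last chunk may be shorter.
    return [words[i:i + n] for i in range(0, len(words), n)]

words = "one two three four five six seven".split()
print(chunk_words(words, 3))
# [['one', 'two', 'three'], ['four', 'five', 'six'], ['seven']]
```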

Metadata CSV delimiter & Output CSV delimiter **

These settings simply determine how column boundaries are marked in the input and output files. For example, if your metadata file happens to use tabs instead of commas, you can enter \t as your metadata delimiter, and your file will be interpreted correctly. Similarly, if you plan to pass the CSV output to a tool that expects columns to be separated by tabs instead of commas, you can enter \t as your output delimiter, and you won’t have to do any conversion.
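For example, here is how a tab-separated metadata file would be read once \t is given as the delimiter. This is a sketch using Python’s csv module; the filenames and columns are hypothetical.

```python
import csv
import io

# Tab-separated metadata, parsed by passing delimiter="\t".
data = "filename\tauthor\tyear\ndoc1.txt\tAusten\t1813\n"
rows = list(csv.reader(io.StringIO(data), delimiter="\t"))
print(rows[1])  # ['doc1.txt', 'Austen', '1813']
```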