Datasets for Text Mining

Think about concrete research questions.
List the kinds of texts that could help answer them.
Think about proxy variables for phenomena of interest.
Select a subset of texts with many similarities.
Identify precise differences among those texts that relate to your proxy variables.
Conceive an ideal. Then use available datasets to find a compromise.

Contrasting Reviewed and Unreviewed Poetry

Research Question: How do the standards governing literary prestige change?
Kind of Text: Historical works of literature charted across time.
Proxy Variable: The existence of reviews in prestigious periodicals (easy to measure) implies prestige (hard to measure).
Similarity: Focus on works in one genre (books of poetry).
Difference: Divide the set into books of poetry that were reviewed, and books of poetry that were not reviewed.
Available Data: Assume that books randomly drawn from a large pool of data weren't reviewed.

Three steps:
- Select data using process described before.
- Download data using HTTrack or wget.
- Clean data using Notepad++ or TextWrangler.

The quick version (to be expanded):
- Download and install HTTrack: https://www.httrack.com/page/2/en/index.html
- Step through the settings pages by pressing the "Next" button; adjust settings when necessary.
- Start the download process. When it's complete, you'll have a set of HTML files.

Download and install TextWrangler: http://www.barebones.com/products/textwrangler/download.html
Use TextWrangler's find-and-replace functions to clean up the text.
Unlike word, TextWrangler can use regular expressions. They are very powerful! For example...

In the top text field (labeled "Find"), type "<[^<>]*>".
The expression has four parts. It will find every sequence of characters that
- < -- Begins with a '<'.
- [^<>] -- Continues with characters that are neither '<' or '>'.
- * -- Has zero or more of those characters.
- > -- Ends with a '>'.