Machine Learning for Genre Classification (2/2)

In my previous post, I talked about how we leveraged scikit-learn's classification algorithms for a genre classification problem, and showed that our classifier was substantially more effective at classifying some novel genres than others. You might have noticed that, for the most part, our high-performance genres were “genre fiction” genres: “science fiction,” “fantasy,” and “mystery” are not just literary-historical categories but marketing ones. Publishers use these categories to stabilize demand: readers who have enjoyed science fiction novels in the past will, presumably, purchase additional books marketed under that label in the future. Novels are written with the genre and its rules in mind: the category is a self-conscious one from the point of view of authors, editors, and readers, who intentionally imitate the style of previous works. Domestic fiction, by contrast, is a more ambiguous category. It might be, then, that classification algorithms are better at detecting “genre” in this explicit marketing sense.

One reason for this is that the novels in these genres might themselves be more formulaic or standardized, given that they are designed to meet the demands of an existing readership. Or, at least, they might be more lexically formulaic. Remember that our classifier is basing its decisions only on relative word frequencies and receives no explicit information about plot, style, grammar, syntax, irony, and so on. Genres like detective fiction and fantasy might simply have more standardized vocabularies than “domestic” fiction. Luckily, scikit-learn lets us access the individual features—in this case, words—that a model uses for classification decisions, which can give us some insight into the criteria it uses for each genre.

Top 25 positive and negative coefficients for the detective genre

The results are informative. Top words used for identifying a novel as detective include predictable words like detective, murderer, killer, homicide, and alibi. For fantasy, we see words like wizard, kingdom, and necromancer. Science fiction features words like planet, mars, humanity, and scientist, or, more amusingly, meters and kilometers, which presumably reflects the fact that the metric system sounds exotic and scientific to American readers. Westerns talk about ranches and saloons, while religious fiction includes discussion of repentance, forgiveness, scripture, and jesus (it seems that “religious” mostly means “Protestant Christian,” in this corpus). Romances talk about intimacy, using words like caress and breathlessly. Negative results can be interesting, too: obscenities and sexual language such as shit, breasts, ass, and goddamn are among the top words for identifying a work as not religious fiction.

If we turn to our low-performing genres, however, the lists of top words are more ambiguous and often make less intuitive sense. Domestic fiction has words like wedding, father, grandchild, and birthday, which share a common theme, but also italy and socialism, which are harder to rationalize. This could simply mean that the set of novels that the Library of Congress classes as “domestic fiction” are more lexically heterogeneous than the set labeled “detective” or “fantasy.” Whereas fantasy has unambiguous markers like, say, wizards and elves, domestic fiction is a grab-bag category that can be used to describe any novel that features family or home-life as a central theme. If the texts under a specific label don’t share a consistent vocabulary, then obviously a classifier based on term-frequencies will struggle with them.

Another obstacle for this method was the problem of negative results. In a very early iteration of this project, the classifier assigned no genre at all to more than 2,000 novels. Such a low rate of positives immediately suggested we might have a problem with overfitting: a classifier that sticks too closely to the training data and thus fails to generalize to new data. Sure enough, when I first opened up the list of features, I was amused to find that the top word for adventure fiction was just the name "Tarzan." I quickly realized that this was due to the corpus itself, which contains more than a dozen of Edgar Rice Burroughs’ original Tarzan novels. The adventure fiction training set was so dominated by these novels that the classifier was unable to find a vocabulary that would generalize to other novels in the class. To a lesser degree, the same was true of religious fiction, which was dominated by the name Winslow because of Gilbert Morris’ House of Winslow series.

One way we mitigated this was by adding additional genre metadata from Harvard University Library, in order to pad out some of the underrepresented classes. Additionally, we tried classifying on a version of the corpus stripped of all proper names. The intuition here was that character names are almost invariably not generalizable information about a genre. Each new instance of a particular genre features new characters with new names; rather than attempt to find the perfect term-frequency threshold to eliminate proper names, we simply stripped them all. We accomplished this using spaCy’s named-entity recognition (NER) system, which automatically labels entities such as persons, places, and organizations. Ideally, we would have liked to keep some proper names: Biblical names like “John” and “Paul” actually convey meaningful information about religious fiction, just as the names of historical figures like “Roosevelt” are actually useful for classifying something as a work of historical fiction. Adding more data from Harvard University Library caused the biggest improvement in performance, but stripping names did modestly improve recall for a few classes: religious went up by 6%, for instance, and political by 7%.

NER for the first line of Charlene Weir’s 1993 novel Consider the Crows

Despite all this tinkering, the final version of the classifier still left about 900 novels without a genre tag. Many of these are simply novels that don’t fit into any of the categories used by the Library of Congress. Some of them are experimental novels that intentionally evade generic categorization: Michael Chabon’s The Amazing Adventures of Kavalier and Clay; Colson Whitehead’s The Intuitionist; Susan Sontag’s Death Kit; or David Foster Wallace’s Infinite Jest. It’s reasonable enough to argue that the classifier was correct to leave these novels unlabeled. The negative result is an accurate account of the fact that they do not lexically resemble the texts from any of our genre classes.

Still, 8,100 of 9,000 novels is more than enough to make genre-specific corpus searches useful. So for our purposes, machine learning classification has proven effective. In future projects, we may experiment with using similar methods to classify texts in one of our many other corpuses. Most of the other corpuses hosted by the lab are composed of texts much older than the ones in the US Novel Corpus, which means metadata is often even more sparse. This makes the payoff of machine classification potentially greater; however, it introduces additional challenges, since training data may not be readily available.

Jordan Pruett

Machine Learning for Genre Classification (1/2)

This is the first of two blog posts describing how we used machine learning classification to produce genre metadata for our US novel corpus, a collection of over 9,000 novels published in the United States between 1880 and 2000. In this first post, I’ll be talking more about the technical side of the project: how we approached the problem, how the classifier works, and how we evaluated its performance. In the second post, I’ll dive a little deeper into what this experiment can teach us about machine learning and text classification in general.

First, a little about the corpus with which we were working. It’s a broad corpus of mostly contemporary American novels, featuring every novel published in the United States between 1880 and 2000 that fulfills two criteria: 1) it is in the holdings of at least two American libraries, as listed in Worldcat, and 2) it is available in a digital format. The corpus was designed as a snapshot of the novels that contemporary librarians think are worth making available to their patrons, which skews its holdings towards the present.

Like all corpuses at the lab, the US novel corpus is searchable via Philologic, which allows users to search for concordances—shared snippets of text—that appear in multiple works within that corpus. Searches can be restricted by metadata such as author or place of publication. It would be useful to be able to search for concordances by categories that group together texts from multiple authors, such as genre. But here we run into a problem. Unlike publication information, genre metadata is not always readily available, especially for older texts. The information might simply not exist, or might not be reliable. And on a more theoretical level, genre designations lack the rigorous facticity of something like an authorship attribution or a city of publication. Genre is only weakly objective, a structure of family resemblances that is intrinsically more open to nuance and debate than the location of a publishing house. Some texts ride the boundaries of genre, and some readers disagree about genre classifications in the first place.

For these reasons, we decided to treat genre classification as a supervised machine learning problem. Machine learning classification takes family resemblances as its premise: classification algorithms like Multinomial Naive Bayes or a Support Vector Machine (SVM) group observations into categories based on patterns learned from training data. In natural language processing, machine learning classification is commonly used for tasks like email spam filtering. More recently, some scholars have experimented with adopting these classification methods for historical research on literary genres. The most common classification algorithms are all implemented in the open-source Python library scikit-learn, which we used for this project.

Supervised learning requires that you have access to some pre-labeled data that can serve as the algorithm’s training set. For this corpus, we used genre metadata from the Library of Congress, the entire catalog of which is publicly available for download, and Harvard University Library, which makes its catalog available via an API. Previous scholarship on genre classification of novels suggests substantially diminishing returns after a sample size of between 50 and 80. As such, we limited our training data to those genre categories which had at least 50 instances in our corpus. With this method, we were able to pull genre information for ~4,800 of our 9,090 novels. Twelve Library of Congress genre categories passed the threshold, which are listed below along with their counts in the training data.

Classification algorithms require that text data first be converted into a numerical format. For this purpose, we used a simple bag-of-words method with term frequency-inverse document frequency (TF-IDF) weighting. This involves representing each text as a long vector composed of scores for every word that appears in any of the 9,090 novels in the corpus. A novel’s score for each word increases proportionally to the number of times that word appears in that novel, offset by the total number of novels in the corpus that contain that word. Thus, words that appear in many novels are down-weighted, which is important, since especially common words will not be “distinctive” and so will be less useful for classification. Finally, we limited our vocabulary to only the top 20,000 terms by document frequency. Any higher proved superfluous in testing, which makes sense considering that the average adult speaker of English has an active vocabulary of only around 20,000 words.

Genres are not exclusive categories; a novel can have one, several, or even zero genres, if its genre is not one of the twelve genres we selected for this problem. In machine learning parlance, this means that genre categorization is a multilabel classification problem. For each text, the classifier assigns a positive or negative result for each of the twelve genres. Under the hood, it is actually training twelve separate binary classifiers, one for each genre. An unfortunate side effect of this method is that some of the genres which were poorly represented in the training data received very few positive results. Likewise, about 900 of the 9,090 texts received no genre designation at all. This problem cannot be avoided without a cost to the precision of the classifier. But we decided that accurate results for the majority of our texts would be more useful to users than dubious results for all of them.
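In scikit-learn, this one-binary-classifier-per-genre setup is what `OneVsRestClassifier` provides, with the genre lists converted to an indicator matrix by `MultiLabelBinarizer`; a text whose scores fall below every threshold gets an all-zero row, i.e. no genre. A toy sketch with invented texts and labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

texts = [
    "a detective traced the killer across the planet colony",
    "the wizard of the kingdom",
    "a homicide detective and an alibi",
    "a starship left the planet for mars",
]
genres = [["detective", "science fiction"], ["fantasy"],
          ["detective"], ["science fiction"]]

# Turn the genre lists into a binary matrix: one column per genre.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# One binary LinearSVC per genre under the hood.
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
pred = clf.predict(vec.transform(["the detective checked the alibi"]))
print(mlb.inverse_transform(pred))  # an empty tuple would mean "no genre"
```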

In testing, we tried four of the most popular algorithms for text classification: multinomial naive Bayes, regularized logistic regression, K-nearest neighbors, and a linear support vector machine (SVM). Linear SVM consistently performed the best, with logistic regression coming in a close second place. This result was largely expected. Linear SVM is known to work especially well for text classification, because it is well-suited for tasks involving sparse vectors, or vectors with many 0 values. Document vectors are usually sparse, since any one document will contain a comparatively small number of all the words contained in the total vocabulary of the corpus.

Just how well did it perform? We cross-validated by repeating the experiment five times and shuffling the training and the testing data each time. In each version of the experiment, 80% of the texts served as training data and 20% of the texts served as test data. The Linear SVM classifier achieved an average precision score of about 80%, with a margin of error of 4%. That’s pretty good! That means that about 80% of the positive results assigned by the classifier were correct results. For recall, the score averaged 72%, with a much larger margin of error of 16%. This lower recall score means that there are a fair number of false negatives, or genre designations that the classifier fails to pick up on. In any classification problem, precision and recall involve a trade-off. For this task, we decided that an incorrect negative result is more acceptable than an incorrect positive result: we care more whether something is incorrectly labeled “detective” than if it is incorrectly labeled “not detective.”
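In scikit-learn terms, this evaluation is five shuffled 80/20 train/test splits, each scored on precision and recall. A sketch on an invented toy corpus, using `StratifiedShuffleSplit` as a stand-in for the shuffling described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.svm import LinearSVC

# Toy corpus: 10 "detective" texts and 10 "fantasy" texts.
texts = ([f"detective story {i} with a killer and an alibi" for i in range(10)]
         + [f"fantasy story {i} with a wizard and a kingdom" for i in range(10)])
labels = [1] * 10 + [0] * 10

X = TfidfVectorizer().fit_transform(texts)

# Five shuffled 80/20 splits, scored on precision and recall each time.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = cross_validate(LinearSVC(), X, labels, cv=cv,
                        scoring=["precision", "recall"])
print(scores["test_precision"].mean(), scores["test_recall"].mean())
```

Averaging the five per-split scores, and looking at their spread, gives exactly the precision and recall figures with margins of error reported above.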

Let’s delve into these results a little deeper and see where the classifier performs well and where it struggles. Does it do better with some genres than with others? Since the classifier is really just a series of binary classifiers, we can easily get performance metrics for each genre label. Here we can clearly see that the classifier seems to struggle with a few genres. Horror, war, and political fiction each have relatively low recall scores, which means that the classifier gives them negative results when they should be positive. A bias towards negative results for these particular classes isn’t a huge problem, since we know they appear relatively few times in the corpus. The classifier will be technically right most of the time if it simply says “no” to these classes whenever it’s asked about them.
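Since the predictions and labels share the same binary indicator format, per-genre precision and recall come straight out of scikit-learn's `classification_report`. A toy illustration of how low recall for an underrepresented class shows up (the matrices here are invented):

```python
import numpy as np
from sklearn.metrics import classification_report

genres = ["detective", "fantasy", "war"]
# Toy indicator matrices: one row per novel, one column per genre.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],   # "war" missed: a false negative
                   [0, 1, 0],
                   [1, 0, 0],   # "war" missed again
                   [0, 1, 0]])

report = classification_report(y_true, y_pred, target_names=genres,
                               output_dict=True, zero_division=0)
print(report["war"]["recall"])        # 0.0: every war novel was missed
print(report["detective"]["recall"])  # 1.0
```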

But for a few genres, the classifier achieved really remarkable results. For detective and mystery fiction, its precision score is 91%; for science fiction, 91%; for religious, 90%; and for fantasy, 89%. Why do we see such a dramatic difference in quality in these different genres? Part of the issue has to do with unbalanced classes: the training data includes almost two thousand detective novels but only 72 war stories. The classifier will inevitably perform better at classifying the genres for which it has seen the most examples. But this doesn’t fully explain the divergence; we have more cases of “domestic” than “religious,” but the classifier does worse with the former than the latter. In my next blog post, I’ll delve into the question of why some genres are easier to classify by this method than others, and what this teaches us about both machine learning and literary history.

Read Part Two Here

Jordan Pruett