What does Mark Twain Talk About?
Overview
This webpage details the work that I did as part
of my midterm for the Hacking the Humanities course
at Carleton in Fall Term of 2022. For the project, I analyzed
a dataset containing speeches given by Mark Twain. I performed
textual analysis on this data to extrapolate some of the topics
that Mark Twain talks about in his speeches. The following
sections detail more about the outcomes and process involved.
Sources
For this project, I used a dataset containing speeches of Mark Twain. While I got this dataset from
a folder shared by the class, the data actually came from Project Gutenberg. A link to the original
data on the Project Gutenberg website can be found
here
Process
Once I obtained the data, I began some preliminary exploration of it
using Voyant Tools. By using Voyant Tools, I could get a sense of some
of the most frequently used words amongst all of Mark Twain's speeches.
Voyant Tools has the added benefit of automatically detecting stopwords
for you (e.g. 'the', 'and', 'that', 'a'). An embedding of Voyant Tools output is
attached below:
However, I wasn't content with how effective Voyant was in figuring out the topics that
Mark Twain would talk about in his speeches. I also felt that I need to dive deeper into
the process of data cleaning to get better results. As such, I moved towards Python and
Natural Language Understanding techniques. In this case, emphasizing topic modeling
using a technique called Latent Dirichlet Allocation (LDA). There are many explanations
of LDA on the web, and many resources that show how it can be done in Python (all of which I used
as LDA is quite novel to me as well). In particular, I found this resource useful
for understanding LDA and this resource useful for learning how to implement
it using Python code.
I had to do a lot of preprocessing before I could use the data with LDA models and get somewhat coherent topics.
This preprocessing involved a lot of things within the field of natural language processing (NLP) such as tokenization,
lemmatization, and other more familiar things like removing stopwords. After cleaning the data, I used a Python library called
gensim to do the topic modeling. Here is a picture of some of the results I got from doing this topic modeling. Specifically,
what the results show is 4 potential abstract topics from the text, and it is our job to place labels on these abstract topics.
Further discussion of these results and other images depicting the process and outcomes can be found upon clicking the 'Images' tab.
Presentation
In regards to the presentation of my project (which is exactly what is seen on this website),
I decided to built a site completely from scratch using HTML and CSS. Doing so allows me a
lot of flexibility with what I can do and I found it to be a great excercise of my competency
in web technologies.
Significance
Using topic modeling on the speeches of Mark Twain, I was able to better understand the topics that
Mark Twain would talk about in his speeches. Specific topics I converged upon were 'speech-related', 'business',
'contemporary america', 'and 'nationalism'.