Linguistics 460: Textual Data Analysis with R
UNC-Chapel Hill Linguistics
Fall 2025
Elliott Moreton
Week 17, 2025.12.12.Fr: FINAL EXAM, 4-7 p.m.
Week 16, 2025.12.02.Tu
Week 15, 2025.11.25.Tu
Week 14, 2025.11.20.Th
Week 14, 2025.11.18.Tu
Week 13, 2025.11.13.Th
Week 13, 2025.11.11.Tu
Week 12, 2025.11.06.Th
Week 12, 2025.11.04.Tu
Week 11, 2025.10.30.Th
Week 11, 2025.10.28.Tu
Week 10, 2025.10.23.Th
Week 10, 2025.10.21.Tu
Week 9, 2025.10.14.Tu
Week 8, 2025.10.09.Th
Week 7, 2025.10.02.Th
MIDTERM
Week 7, 2025.09.30.Tu
Review day for midterm
Topics: Midterm review
Class:
- Start Zoom.
- Go over questions about course content using Midterm Syllabus (on Canvas)
- Readings?
- Labs?
- Any unanswered questions from the quizzes?
Week 6, 2025.09.25.Th
Topics: LASSO regression for document classification.
Class:
- Still using the IMDb database in the textdata
library. (Can download imdb_jr.csv from the Canvas
site)
- Linear models (preview)
- Logistic models for binary responses
- Fit by maximum likelihood
- Fit by LASSO
- How does performance compare on the database?
- Midterm syllabus (midterm is 10/02 Th)
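The contrast between the two fitting methods listed above can be sketched on simulated data. This is only an illustration, not the in-class IMDb analysis: the data, the feature names (term1 to term5), and the seed are all made up. The LASSO part runs only if the glmnet package is installed.

```r
set.seed(460)

# Hypothetical toy data: 200 "reviews" described by 5 term-count features.
n <- 200
x <- matrix(rpois(n * 5, lambda = 2), nrow = n,
            dimnames = list(NULL, paste0("term", 1:5)))

# By construction, only term1 and term2 affect the label; term3 to term5 are noise.
p <- plogis(-0.5 + 1.2 * x[, "term1"] - 1.2 * x[, "term2"])
y <- rbinom(n, size = 1, prob = p)

# Logistic regression fit by maximum likelihood (base R).
ml_fit <- glm(y ~ ., data = data.frame(y = y, x), family = binomial)
ml_coef <- coef(ml_fit)
print(round(ml_coef, 2))

# Logistic regression fit by LASSO: the L1 penalty can shrink the
# coefficients of uninformative terms exactly to zero.
if (requireNamespace("glmnet", quietly = TRUE)) {
  cv_fit <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 1)
  print(coef(cv_fit, s = "lambda.1se"))
}
```

Comparing the two printed coefficient tables shows the practical difference: maximum likelihood estimates every coefficient, while LASSO may drop some terms entirely.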
Week 6, 2025.09.23.Tu
Topics: Evaluating classifier performance.
Changing parameters to optimize performance
Before class: Please install the naivebayes
and psych packages.
Class:
- Still using the IMDb database in the textdata
library. (Can download imdb_jr.csv from the Canvas
site)
- Fitting the Naive Bayes classifier
- Using the classifier to make predictions
- Evaluating model performance: Cohen's kappa
- Adjusting model parameters to optimize performance
- Varying document-term-matrix sparsity
- Choice of weighting schemes: binary term frequency,
term frequency, tf-idf
- Overfitting
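As a rough illustration of the evaluation step above, Cohen's kappa can be computed by hand from a confusion table: it compares the observed agreement between predictions and true labels with the agreement expected by chance. The function name cohen_kappa and the toy labels are made up for this sketch; in practice the psych package provides cohen.kappa().

```r
# Cohen's kappa from two vectors of labels (character vectors assumed).
cohen_kappa <- function(truth, predicted) {
  lv <- union(unique(truth), unique(predicted))
  tab <- table(factor(truth, levels = lv), factor(predicted, levels = lv))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                       # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Hypothetical results for 100 reviews: 85 classified correctly.
truth     <- c(rep("pos", 50), rep("neg", 50))
predicted <- c(rep("pos", 40), rep("neg", 10),   # the 50 truly positive reviews
               rep("pos", 5),  rep("neg", 45))   # the 50 truly negative reviews

kappa <- cohen_kappa(truth, predicted)
print(kappa)  # accuracy is 0.85, chance agreement is 0.50, so kappa is about 0.7
```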
Assignment:
- Read Navarro 2017, Ch. 15, to the end of Section 15.3. This will
preview linear regression, the basis for the next classification model
we will encounter. (We'll cover linear regression in more detail in the
second half of the course.)
Reminder: Midterm on Thursday, October 2. (Midterm
syllabus available on Thursday, September 25.)
Week 5, 2025.09.18.Th
Topics: Naive Bayes classifiers for document
classification.
Class:
- Start Zoom
- Comments on Lab 04 (Dante): stop words and sentiment
- Continued from last time: Naive Bayes classifiers
- Building and training a Naive Bayes classifier in R
- Making predictions with the classifier
- Start on Lab 05.
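To make the idea behind the classifier concrete, here is a minimal hand-written sketch of a Bernoulli Naive Bayes classifier over a binary document-term matrix, using base R only. It is not the package implementation used in class; the function names (train_nb, predict_nb), the toy matrix, and the add-one smoothing choice are assumptions made for illustration.

```r
# Toy binary document-term matrix: rows = documents, 1 = term present.
dtm <- matrix(c(1, 0, 1,
                1, 0, 0,
                0, 1, 1,
                0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("great", "boring", "plot")))
labels <- c("pos", "pos", "neg", "neg")

# Training: estimate class priors and P(term present | class),
# with add-one (Laplace) smoothing so no probability is exactly 0 or 1.
train_nb <- function(dtm, labels, laplace = 1) {
  classes <- sort(unique(labels))
  prior <- sapply(classes, function(k) mean(labels == k))
  p_term <- t(sapply(classes, function(k) {
    rows <- dtm[labels == k, , drop = FALSE]
    (colSums(rows) + laplace) / (nrow(rows) + 2 * laplace)
  }))
  list(classes = classes, prior = prior, p_term = p_term)
}

# Prediction: pick the class with the highest log posterior, treating
# terms as independent given the class (the "naive" assumption).
predict_nb <- function(model, newdoc) {
  log_post <- sapply(model$classes, function(k) {
    p <- model$p_term[k, ]
    log(model$prior[[k]]) + sum(newdoc * log(p) + (1 - newdoc) * log(1 - p))
  })
  names(which.max(log_post))
}

model <- train_nb(dtm, labels)
pred_a <- predict_nb(model, c(great = 1, boring = 0, plot = 1))
pred_b <- predict_nb(model, c(great = 0, boring = 1, plot = 0))
print(c(pred_a, pred_b))
```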
Assignments:
- For 09/23 Tu: Lab 05, classifying patents
- For 09/23 Tu: Please install the "glmnet" package,
which we will need for LASSO regression next week.
Announcement: The midterm will be on
Thursday, October 2. A midterm syllabus will be available on
Thursday, September 25.
Week 5, 2025.09.16.Tu
Topics: Document classification using
supervised learning. Document-term matrices. Intro to
Naive Bayes classifiers.
Before class: Please make sure these packages are installed:
Class:
- Introduction to classification models (slides, Canvas):
- Document classification
- Supervised learning
- Example: Internet Movie Database (IMDb) reviews (Maas et al, 2011)
- Document-term matrices
- Naive Bayes classifiers: theory
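A document-term matrix has one row per document and one column per vocabulary term. The sketch below builds one in base R from three made-up "reviews", then derives a binary version and a tf-idf version. The tf-idf formula used here (raw count times log(N / document frequency)) is one common variant and is an assumption, not necessarily the weighting used by any particular package.

```r
# Three hypothetical mini-documents.
docs <- c(d1 = "a great great movie",
          d2 = "a boring movie",
          d3 = "great acting boring plot")

# Tokenize on anything that is not a lowercase letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document: rows = documents, columns = terms.
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
dimnames(dtm) <- list(names(docs), vocab)
print(dtm)

# Binary weighting: 1 if the term occurs in the document at all.
dtm_binary <- (dtm > 0) * 1

# tf-idf weighting: terms found in fewer documents get more weight.
doc_freq <- colSums(dtm > 0)
idf      <- log(nrow(dtm) / doc_freq)
dtm_tfidf <- sweep(dtm, 2, idf, `*`)
```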
Assignments:
- For 09/18 Th: Read Navarro 2017, Ch. 17, through the end of
Section 17.2, about Bayesian statistics...
- ... and do Quiz 05 on that reading
- For 09/18 Th: Install the "e1071" package.
- For 09/18 Th: Please try to install the "glmnet"
package, and let me know in class whether the installation succeeds.
Week 4, 2025.09.11.Th
Topics: Sentiment analysis.
Class:
- Discuss Lab 03 on tidy text
- Finish sentiment-analysis slides and in-class problem from last time
- Start Lab 04 on sentiment analysis
- (Time permitting:) Set up for next week's classification models.
Assignments:
- For 9/16 Tu: Lab 04 on sentiment analysis.
Week 4, 2025.09.09.Tu
Topics: Sentiment analysis.
Before class:
- Load the following libraries: tidyverse, tidytext, janeaustenr, stringr
Class:
- Start Zoom
- Sentiment
- The "bag-of-words" model
- Sentiment lexicons
- Using inner_join() to look text words up in a sentiment lexicon
- Checking the sentiment lexica against each other.
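The lexicon lookup described above can be sketched with dplyr as follows. Both lexica here are tiny made-up tables, not the real sentiment lexica; the point is only that inner_join() keeps the rows whose word appears in both tables.

```r
library(dplyr)

# One row per token, as produced by tokenizing a text (toy example).
words <- tibble(word = c("a", "happy", "ending", "after", "a", "sad", "start"))

# Hypothetical lexicon A; real lexica are far larger.
lexicon_a <- tibble(word = c("happy", "sad", "awful"),
                    sentiment = c("positive", "negative", "negative"))

# Look the text words up: words missing from the lexicon are dropped.
scored <- inner_join(words, lexicon_a, by = "word")
sentiment_counts <- count(scored, sentiment)
print(sentiment_counts)

# Checking two lexica against each other: join them on the shared words
# and see how often the labels agree.
lexicon_b <- tibble(word = c("happy", "sad", "awful", "start"),
                    sentiment = c("positive", "negative", "negative", "positive"))
shared <- inner_join(lexicon_a, lexicon_b, by = "word", suffix = c("_a", "_b"))
agreement <- mean(shared$sentiment_a == shared$sentiment_b)
```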
Assignment for 9/11 Th:
- Read Chapter 2 of Silge & Robinson's book
- Do Quiz 04 on that reading
Week 3, 2025.09.04.Th
Topics: "Tidy" data and tidy text.
Before class:
Class:
- Start Zoom for 1 student (MS)
- UNDERLING slide
- Finish "tidy data, tidy text" (slides from Tuesday)
- bind_rows(), pivot_wider(), pivot_longer()
- Start on Lab 03, on tidy text
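As a small illustration of the reshaping functions above, the following sketch stacks two word-count tables and converts between long and wide layouts. The book names and counts are invented for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical word counts from two books.
book_a <- tibble(book = "A", word = c("love", "pride"), n = c(3, 1))
book_b <- tibble(book = "B", word = c("love", "sense"), n = c(2, 4))

# bind_rows(): stack the tables; the result is tidy (one row per book-word pair).
long <- bind_rows(book_a, book_b)

# pivot_wider(): one row per word, one column per book; absent words get 0.
wide <- pivot_wider(long, names_from = book, values_from = n, values_fill = 0)
print(wide)

# pivot_longer(): back to one row per book-word pair (now including the zeros).
back <- pivot_longer(wide, cols = c(A, B), names_to = "book", values_to = "n")
```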
Assignment for 9/09 Tu:
Week 3, 2025.09.02.Tu
Topics: "Tidy" data and tidy text.
Before class:
- Install the "janeaustenr" package, and any dependencies
that R asks you to install.
- Same with "gutenbergr".
- Same with "stringr".
Class:
- Tidy data, tidy text (slides).
- "Tidy" data format
- Examples from janeaustenr and gutenbergr
- unnest_tokens(), group_by(), anti_join()
- bind_rows(), pivot_wider(), pivot_longer()
We most likely won't get all the way to the end.
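A minimal sketch of the tidy-text workflow named above, on two made-up sentences rather than the janeaustenr or gutenbergr texts. The two-word stop list is a stand-in invented for this example; tidytext also ships a much larger stop_words table.

```r
library(dplyr)
library(tidytext)

# Toy text: one row per line.
text_df <- tibble(line = 1:2,
                  text = c("The cat sat on the mat.",
                           "The dog sat on the cat!"))

# unnest_tokens(): one row per word, lowercased, punctuation stripped.
tidy_words <- unnest_tokens(text_df, word, text)

# anti_join(): drop every word that appears in the stop list.
toy_stops <- tibble(word = c("the", "on"))
content_words <- anti_join(tidy_words, toy_stops, by = "word")

# group_by() + summarize(): count how often each remaining word occurs.
word_counts <- content_words %>%
  group_by(word) %>%
  summarize(n = n()) %>%
  arrange(desc(n), word)
print(word_counts)
```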
Assignment for 9/04 Th:
- Read Silge & Robinson (2017) Ch 1, "Tidy text"
- Do Quiz 03.
Week 2, 2025.08.28.Th
Topics: Regular expressions.
Before class:
- Make sure you can find the word-frequency database Excel file that
we used last Thursday (8/21) for in-class examples.
- Install the "rvest" package, and any dependencies that
R asks you to install. (This will be needed for the
lab.)
Class:
- Finish the slides from last time on regular expressions.
- Counting and displaying categorical data (more slides).
- Start on Lab 02, regular expressions.
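A few regular-expression basics can be tried in base R before starting the lab. The word list below is made up for illustration and is not the lab data.

```r
# Hypothetical word list.
words <- c("pickup", "roadster", "truck", "coupe", "Silverado", "spider")

# Anchors: $ matches the end of the string, ^ the beginning.
ends_in_vowel <- grepl("[aeiou]$", words)
starts_with_s <- grepl("^s", words, ignore.case = TRUE)

# Character classes with gsub(): delete every lowercase vowel.
no_vowels <- gsub("[aeiou]", "", words)

print(words[ends_in_vowel])
print(words[starts_with_s])
print(no_vowels)
```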
Assignment for 9/02 Tu:
- Lab 02, regular expressions (pickups vs. sports cars)
Week 2, 2025.08.26.Tu
Topics: Data filtering and transformation. Strings.
Before class:
- Install the "babynames" package, and any dependencies
that R asks you to install.
Class:
- Structure of a research project.
- Illustration: liquid co-occurrence in U.S. baby names (Martin, 2007)
- Transforming a data frame in a data pipeline with mutate(), arrange(), filter()
- Using regular expressions to describe strings
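The pipeline verbs above can be sketched on a toy table. This is loosely inspired by the baby-names illustration but does not reproduce it: the names and counts are invented, and counting the letters l and r as "liquids" is a simplifying assumption made for this example.

```r
library(dplyr)

# Hypothetical stand-in for the babynames data (made-up counts).
names_df <- tibble(name = c("Larry", "Lily", "Rory", "Anna", "Carl"),
                   n    = c(120, 300, 80, 500, 150))

liquids <- names_df %>%
  # mutate(): add a column counting the letters l and r in each name.
  mutate(n_liquids = lengths(regmatches(
    name, gregexpr("[lr]", name, ignore.case = TRUE)))) %>%
  # filter(): keep names with at least two liquids.
  filter(n_liquids >= 2) %>%
  # arrange(): most frequent names first.
  arrange(desc(n))

print(liquids)
```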
Assignment for 8/28 Th:
- Read Wickham et al. 2023, Ch. 14 ("Strings"), Sections
1, 2, and 5, and Ch. 15 ("Regular expressions"), through the
end of Section 15.4.3. (We'll get to group_by() and
summarize() next week; don't worry about them now.)
- Do Quiz 02.
Week 1, 2025.08.21.Th
Topics: Data frames and data visualization.
Before class:
- Instructor: Start Zoom
- Students: Download "lemmas.csv" from Canvas: Modules: Course materials
Class:
- Comments on: Lab 00, AI use policy
- Data frames (slides)
- Data visualization (slides).
- Start on Lab 01, "Replicating a scatterplot"
Assignment for 8/26 Tu:
- Finish Lab 01, "Replicating a scatterplot".
Week 1, 2025.08.19.Tu
Topics: Course organization. R and RStudio.
Class:
Assignment for 8/21 Th:
- Read Freeman & Ross 2019, Chapter 5: "Introduction to R".
- Read Wickham et al. 2023, Chapter 1: "Data visualization".
- Do Quiz 01, "Course organization and data visualization" (Canvas, Quizzes)
- Do Lab 00, "Hello, World!" (Canvas, Assignments)
Week 0: Before the semester even starts
Assignment for 8/19 Tu:
- Read Sections 1.4 and 1.5 of Technical Foundations of Informatics,
by Michael Freeman and Joel Ross,
about setting up your computer for R and RStudio. (Note:
This textbook also says things about a course at a
different university; please ignore that. For instance,
it says students in that course are supposed to be turning
in assignments using GitHub. That does not apply to LING
460 at UNC-Chapel Hill.)
- Following those instructions, try to install R and RStudio
on your computer.
- Please bring your computer, with R and RStudio on it,
to class on the first day.