Linguistics 460: Textual Data Analysis with R
UNC-Chapel Hill Linguistics
Fall 2024
Elliott Moreton
2024.12.07.Sat
- Final written version of the project due, 4 p.m.
- Please consult the Final Projects handout of
October 7th for details.
- Please review what the Syllabus says about the
use of generative AI. If you used it, that fact needs to be
acknowledged in the final version of the project, following
the guidelines pointed to by the Syllabus.
- FINAL EXAM, 4--7 p.m.
Week 16, 2024.12.04.W
Project presentations III
Week 16, 2024.12.02.M
Project presentations II
Week 15, 2024.11.25.M
Project presentations I
Week 14, 2024.11.20.W
Topics: Final-exam review
Class:
- HO on Canvas about final-project presentations
- Randomize order of presentations.
- Go over questions about course content
- Readings?
- Labs?
- Final-exam syllabus (on Canvas)?
Assignments:
- For your group's presentation day, 2 p.m.: Upload presentation slides to Canvas
- For 12/7 (Sat), 4 p.m.: Revise project draft in response to feedback.
Week 14, 2024.11.19.Tu (not a class day)
Assignments:
- Today: First draft of project due
Week 14, 2024.11.18.M
Topics: Project clinic in class.
Class:
- How to hand in the first draft of the project (Canvas, Assignments, by "project groups")
- Work together in groups on project while instructor circulates.
- Await visit from HVAC specialist.
Assignments:
- Press ahead on projects.
- Tomorrow: First draft of project due
Week 13, 2024.11.13.W
Topics: LASSO regression for document classification.
Class:
- Still using the IMDb database in the textdata
library. (Can download imdb_jr.csv from the Canvas
site)
- Linear models (reprise from 10/2)
- Logistic models for binary responses
- Fit by maximum likelihood
- Fit by LASSO
- How does performance compare on the database?
Assignments:
- Respond to email feedback about initial draft of
project proposal if you haven't already.
- Press ahead on projects.
- For 11/19 Tu (note change!): First draft of project due.
Week 13, 2024.11.11.M
Topics: Evaluating classifier performance.
Changing parameters to optimize performance
Class:
- Start Zoom
- Still using the IMDb database in the textdata
library. (Can download imdb_jr.csv from the Canvas
site)
- Fitting the Naive Bayes classifier
- Using the classifier to make predictions
- Evaluating model performance: Cohen's kappa
- Adjusting model parameters to optimize performance
- Varying document-term-matrix sparsity
- Choice of weighting schemes: binary term frequency, term frequency, tf-idf
- Overfitting
- In-class project clinic
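A minimal base-R sketch of Cohen's kappa, for reference. The labels below are made up for illustration (they are not IMDb results), and the variable names are arbitrary. Kappa compares the observed agreement between predicted and actual classes with the agreement expected by chance from the row and column totals of the confusion matrix.

```r
# Hypothetical labels for ten reviews; not taken from the IMDb data.
actual    <- c("pos", "pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg")
predicted <- c("pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos", "pos")

# Give both vectors the same factor levels so the confusion matrix is square.
classes   <- c("neg", "pos")
confusion <- table(actual    = factor(actual, levels = classes),
                   predicted = factor(predicted, levels = classes))

n          <- sum(confusion)
p_observed <- sum(diag(confusion)) / n                              # proportion of agreements
p_chance   <- sum(rowSums(confusion) * colSums(confusion)) / n^2    # agreement expected by chance
kappa_hat  <- (p_observed - p_chance) / (1 - p_chance)

print(confusion)
print(kappa_hat)
```

Here 7 of 10 predictions agree, but half that agreement would be expected by chance, so kappa is well below the raw accuracy.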
Assignments:
- For 11/13 (W): Review earlier reading (9/30) of Navarro
2017, Ch. 15, through end of Section 6, on linear regression
- For 11/13 (W): Install the glmnet package
- Respond to email feedback about initial draft of
project proposal if you haven't already.
- Press ahead on projects.
- For 11/18 M: First draft of project due.
Week 12, 2024.11.06.W
Topics: Naive Bayes classifiers for document
classification.
Class:
- Start Zoom
- Still using the IMDb database in the textdata
library. (Can download imdb_jr.csv from the Canvas
site)
- The Inverse Fallacy again (from 9/18)
- Bayes's Rule
- Naive Bayes classifiers: theory
- Building and training a Naive Bayes classifier in R
- Making predictions with the classifier
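A small worked example of Bayes's Rule and the Inverse Fallacy, in base R. All probabilities here are invented for illustration; the word "awful" and the 10% prior are assumptions, not figures from the IMDb data.

```r
# Hypothetical probabilities (assumptions for illustration only).
p_neg            <- 0.10   # prior: P(negative review)
p_word_given_neg <- 0.20   # likelihood: P("awful" appears | negative)
p_word_given_pos <- 0.02   # likelihood: P("awful" appears | positive)

# Total probability of seeing the word at all.
p_word <- p_word_given_neg * p_neg + p_word_given_pos * (1 - p_neg)

# Bayes's Rule: P(negative | word) = P(word | negative) * P(negative) / P(word)
p_neg_given_word <- p_word_given_neg * p_neg / p_word

print(p_word_given_neg)   # P(word | negative)
print(p_neg_given_word)   # P(negative | word): a different quantity
```

The Inverse Fallacy is treating these two conditional probabilities as interchangeable; the prior is what makes them differ.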
Assignments:
- Respond to email feedback about initial draft of project proposal.
- Press ahead on projects.
- For 11/18 M: First draft of project due.
Week 12, 2024.11.04.M
Topics: Document classification using
supervised learning. Document-term matrices.
Class:
- Introduction to classification models (slides, Canvas):
- Document classification
- Supervised learning
- Example: Internet Movie Database (IMDb) reviews (Maas et al., 2011)
- Document-term matrices
- Check in about final projects.
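To make the idea of a document-term matrix concrete, here is a toy version built by hand in base R. The three "documents" are invented, and whitespace splitting is a simplifying assumption; a dedicated text-mining package would handle tokenization and sparse storage for real data.

```r
# Three made-up documents (illustration only).
docs <- c(d1 = "a great great film",
          d2 = "a dull film",
          d3 = "great acting dull plot")

# Split each document into lower-case words on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary is every distinct word across all documents.
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term, cells hold counts.
dtm <- matrix(0L, nrow = length(docs), ncol = length(vocab),
              dimnames = list(names(docs), vocab))
for (i in seq_along(tokens)) {
  counts <- table(tokens[[i]])
  dtm[i, names(counts)] <- as.integer(counts)
}

print(dtm)
```

Each row is one document's bag of words: word order is gone, and only the counts remain.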
Assignments:
- Respond to email feedback about initial draft of project proposal.
- For 11/06 W: Read Navarro
2017, Ch. 17, through the end of Section 17.2, about
Bayesian statistics.
- For 11/06 W: Install the "e1071" package.
- For 11/06 W: Please try to install the "glmnet"
package, and let me know in class if it works.
- For 11/18 M: First draft of project due.
Announcement: The Divine Comedy sentiment assignment
was the last lab assignment in this class.
- Energy should now be going into
the final project.
- TA office hours for final projects: M 9-9:50, Th 1-2 in Smith Bldg. 104
Week 11, 2024.10.30.W
Topics: Sentiment analysis.
Class:
- Finish sentiment-analysis slides and in-class problem from last time
- Discuss Lab 8 on tidy text
- Start Lab 9 on sentiment analysis
Assignments:
- For 11/04 M: Lab 9 on sentiment analysis.
- Respond to email feedback about initial draft of project proposal.
- For 11/18 M: First draft of project due.
Week 11, 2024.10.28.M
Topics: Sentiment analysis.
Class:
- Sentiment
- The "bag-of-words" model
- Sentiment lexicons
- Using inner_join () to look text words up in a sentiment lexicon
- Checking the sentiment lexica against each other.
- (Time permitting:) Check-in about project proposals
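A sketch of the lexicon lookup using only base R. Here merge () stands in for dplyr's inner_join (): with its default settings it likewise keeps only the rows whose word appears in both tables. The word list and the four-entry lexicon are invented for illustration and are not drawn from any published sentiment lexicon.

```r
# Made-up text, one word per row (the "tidy" layout).
words <- data.frame(word = c("i", "love", "this", "awful", "wonderful", "film"))

# Made-up miniature lexicon (illustration only).
lexicon <- data.frame(
  word      = c("love", "awful", "wonderful", "hate"),
  sentiment = c("positive", "negative", "positive", "negative")
)

# Keep only the words found in both tables, as an inner join does.
matched <- merge(words, lexicon, by = "word")

# Bag-of-words sentiment: count the matches in each category.
sentiment_counts <- table(matched$sentiment)

print(matched)
print(sentiment_counts)
```

Words absent from the lexicon ("i", "this", "film") drop out, and so do lexicon entries absent from the text ("hate").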
Assignments:
- For 10/30 W: Read Chapter 2 of Silge & Robinson's book
- For 10/30 W: Do Quiz 9 on that reading
- For 10/30 W (3 p.m.): Submit an initial proposal for the final project
Announcement: Tuesday's office hours will be
earlier than usual this week: 1-2 instead of 2-3.
Week 10, 2024.10.23.W
Topics: "Tidying" text.
Class:
- Finish examples from Chapter 1 of Silge & Robinson's book, in great detail.
- Important functions: unnest_tokens (), group_by (),
anti_join (), bind_rows (), pivot_wider (), pivot_longer ().
- Start Lab 8 ("tidy" text).
Assignments:
- For 10/28 M: Lab 8 on "tidy" text.
- For 10/30 W: Submit an initial proposal for the final
project (via the "Assignments" area on Canvas).
Week 10, 2024.10.21.M
Topics: "Tidying" text
Class:
- How can we break a text document down into individual words?
- Go through examples from Chapter 1 of Silge & Robinson's book, in great detail.
- Important functions: unnest_tokens (), group_by (),
anti_join (), bind_rows (), pivot_wider (), pivot_longer ().
- (Last 10--15 minutes) Go over
- Lab 7 on regular expressions.
- Question 12 on the midterm (the Old Faithful geyser eruptions)
Assignment:
- Before 10/30 W: Introduce yourself in the Discussions area of Canvas.
This is optional, but it could help with the next assignment:
- For 10/30 W: Find two or three partners for your final-project group.
- For 10/30 W: Submit an initial proposal for the final project.
Week 9, 2024.10.16.W
MIDTERM
- In this room, at this time, with computer
- Please see Midterm Syllabus for details
Assignment:
- For 10/21 M: Lab 7, regular expressions.
- For 10/21 M: Read Chapter 1 of
Silge & Robinson's Text Mining with R: a tidy approach
- For 10/21 M: Install the gutenbergr package.
- Before 10/30 W: Introduce yourself in the Discussions area of Canvas.
This is optional, but it could help with the next assignment:
- Before 10/30 W: Find two or three partners for your final-project group.
- For 10/30 W: Submit an initial proposal for the final project.
Week 9, 2024.10.14.M
Topics: Midterm review
Class:
- Start Zoom.
- Go over questions about course content
- Readings?
- Labs?
- Any unanswered questions from the quizzes?
Week 8, 2024.10.09.W
Topics: Regular expressions.
Class:
- Continue slides on Canvas: Liquid co-occurrence in U.S. baby
names (Martin, 2007, Ch. 3), exemplifying use of regular expressions
to describe strings
- Some places to look for data sets:
- Visit to Odum Institute.
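A tiny illustration of using a regular expression to describe strings, in base R. This is not the analysis from the slides: the names are a made-up sample rather than the babynames data, and the patterns match the letters l and r in spelling, which is only a rough stand-in for liquid sounds.

```r
# Made-up sample of names (illustration only).
names_sample <- c("Laura", "Riley", "Emma", "Carl", "Noah", "Mary")

# At least one liquid letter anywhere in the name.
has_liquid <- grepl("[lr]", names_sample, ignore.case = TRUE)

# Two liquid letters, with anything (or nothing) in between.
two_liquids <- grepl("[lr].*[lr]", names_sample, ignore.case = TRUE)

print(names_sample[has_liquid])
print(names_sample[two_liquids])
```

"Mary" matches the first pattern but not the second, since it has only one liquid letter.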
Assignment:
- For 10/14 M: Read over the Midterm Syllabus on Canvas.
- For 10/16 W: Prepare for midterm.
- For 10/21 M: Lab 7, regular expressions.
- For 10/21 M: Read Chapter 1 of
Silge & Robinson's Text Mining with R: a tidy approach
- For 10/21 M: Install the gutenbergr package.
- Before 10/30 W: Introduce yourself in the Discussions area of Canvas.
This is optional, but it could help with the next assignment:
- Before 10/30 W: Find two or three partners for your final-project group.
- For 10/30 W: Submit an initial proposal for the final project.
Announcement: I've updated the "Final Projects" handout on Canvas to
correct the dates.
Week 8, 2024.10.07.M
Topics: Final projects. Regular expressions.
Before class:
- (Instructor:) Start Zoom.
- (Students:) Install babynames library. There may be a lot of dependencies;
please let RStudio install them too.
Class:
- Start slides on Canvas: Liquid co-occurrence in U.S. baby
names (Martin, 2007, Ch. 3)
- Structure of a final project
- Introduction to the tidyverse
- Time permitting: Introduction to regular expressions
- Handout on final projects (on Canvas)
Assignment:
- For 10/09 W: Read the chapter on regular expressions in
Wickham et al. 2023.
- Before 10/30 W: Introduce yourself in the Discussions area of Canvas.
This is optional, but it could help with the next assignment:
- For 10/30 W: Find two partners for your final-project group.
- For 11/06 W: Submit an initial proposal for the final project.
Week 7, 2024.10.02.W
Topics: Linear regression
Class:
- Linear associations
- The simple linear regression model (y = mx + b)
- Model fitting: ordinary least-squares regression
- Interpreting the fitted model
- Start Lab 6, linear regression.
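A minimal sketch of fitting y = mx + b by ordinary least squares in base R. The five data points are invented for illustration. The fit from lm () is compared with the textbook formulas: slope = cov(x, y) / var(x), and intercept = mean(y) - slope * mean(x).

```r
# Made-up data with a roughly linear trend (illustration only).
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 8.8, 11.0)

# Fit the simple linear regression model.
fit <- lm(y ~ x)
print(coef(fit))   # "(Intercept)" is b, "x" is m

# The same estimates from the least-squares formulas.
m_hat <- cov(x, y) / var(x)
b_hat <- mean(y) - m_hat * mean(x)
print(c(b = b_hat, m = m_hat))
```

Interpreting the fitted model: each one-unit increase in x goes with an estimated increase of m in y, and b is the predicted y when x is 0.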
Assignment for 10/7 (M):
Announcement: Office hours tomorrow (Thursday the 3rd), 2:30-3:30, instead of Friday.
(That's just for this week.)
Week 7, 2024.09.30.M
Topics: Tests of association between categorical variables
Class:
- Categori(c)al vs. scalar variables
- Association and independence
- Mosaic plots
- The chi-squared test of association
- Fisher's exact test
- Time permitting: Discuss Lab 5
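A short sketch of both tests on a 2-by-2 table of counts, in base R. The counts and the labels "A", "B", "yes", and "no" are invented for illustration.

```r
# Made-up counts: two groups cross-classified by a yes/no outcome.
counts <- matrix(c(30, 10,
                   15, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))

# Chi-squared test of association. For 2-by-2 tables, chisq.test ()
# applies Yates's continuity correction by default.
chi <- chisq.test(counts)
print(chi$expected)   # counts expected if group and outcome were independent
print(chi$p.value)

# Fisher's exact test on the same table; useful when expected counts are small.
fisher <- fisher.test(counts)
print(fisher$p.value)
```

Calling mosaicplot (counts) on the same matrix draws the corresponding mosaic plot.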
Assignment:
- For 10/02 W: Reading: Navarro
2017, Chapter 15, through the end of Section 6
- For 10/02 W: Reading Quiz 8, on Canvas
Week 6, 2024.09.25.W
Topics: Testing hypotheses about a population mean.
Class:
- Start Zoom.
- General caveats about null-hypothesis significance testing (from last time's slides)
- Hypotheses about a population mean: t-tests
- Assumptions
- Worked example
- In-class problem
- Start on Lab 5.
Assignment for 9/30 M: Lab 5, comparison of reaction times in a synonymy-choice task
Announcement: Schedule for Midterm:
- 10/14 M: Midterm review day, for questions about course content
- 10/16 W: MIDTERM, in class.
Week 5, 2024.09.18.W
Topics: Null-hypothesis significance testing
Class:
- Overview of NHST
- Logic of NHST
- Work through an example (binomial test)
- Do an in-class problem (chi-squared test for equal proportions)
- Discuss caveats
Assignment:
- For 9/25 W: Please read Navarro
2017, Chapter 5, Section 6, and Chapter 13, through
the end of Section 3
- For 9/25 W: Reading Quiz 7.
Announcement: Office hours on Friday, 9/20,
will be 1:30-2:30 instead of 2:30-3:30.
Reminder: Midterm is the week of 10/14-10/16 (one of those two days).
Week 5, 2024.09.16.M
Topics: Confidence intervals
Class:
- Computing a z or t confidence
interval for a sample from a normally-distributed
population.
- Frequentist interpretation of confidence intervals
- Some misinterpretations of CIs.
- Factors determining the width of CIs. How much influence does the experimenter have?
- What if the population isn't normally-distributed?
- Other kinds of CIs.
- Uses of CIs
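A minimal sketch of computing 95% confidence intervals for a mean by hand, in base R. The eight observations are invented for illustration. The t interval uses the sample standard deviation and a critical value from the t distribution with n - 1 degrees of freedom; the z interval shown for comparison substitutes the normal critical value, which is strictly appropriate only when the population standard deviation is known.

```r
# Made-up sample (illustration only).
x <- c(4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 5.2)

n      <- length(x)
mean_x <- mean(x)
se     <- sd(x) / sqrt(n)          # estimated standard error of the mean

# 95% t interval: critical value from the t distribution, n - 1 df.
t_crit <- qt(0.975, df = n - 1)
ci_t   <- c(lower = mean_x - t_crit * se, upper = mean_x + t_crit * se)

# 95% z interval: critical value from the standard normal distribution.
z_crit <- qnorm(0.975)
ci_z   <- c(lower = mean_x - z_crit * se, upper = mean_x + z_crit * se)

print(ci_t)
print(ci_z)
print(t.test(x)$conf.int)   # built-in check; should match ci_t
```

The t interval is wider than the z interval, and the gap shrinks as n grows. Sample size and confidence level are the main levers the experimenter controls.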
Assignment:
- No lab this week (and no class on Monday 9/23).
- Reading: Navarro 2017
- For 9/18 W: Chapter 11, through the end of Section 7
- For 9/25 W: (Chapter 5, Section 6, and) Chapter 13, through the end of Section 3
- For 9/18 W: Reading Quiz 6.
Week 4, 2024.09.11.W
Topics: Sampling theory.
Class:
- Statistical inference (finish slides from last time)
- Sample statistics and the sampling distribution of the mean
- The Central Limit Theorem
- Start on Lab 4, sampling distributions and confidence intervals.
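A small simulation of the sampling distribution of the mean, in base R. The choice of an exponential population, the sample size of 30, the 5000 replications, and the seed are all arbitrary assumptions made for illustration.

```r
set.seed(460)   # arbitrary seed, so the simulation is repeatable

n_obs  <- 30     # size of each sample
n_reps <- 5000   # number of samples drawn

# Population: exponential with rate 1 (mean 1, sd 1), which is strongly skewed.
sample_means <- replicate(n_reps, mean(rexp(n_obs, rate = 1)))

# The Central Limit Theorem says the sample means are approximately normal,
# centered on the population mean, with sd close to sigma / sqrt(n).
print(mean(sample_means))
print(sd(sample_means))
print(1 / sqrt(n_obs))
```

Running hist (sample_means) shows a roughly bell-shaped histogram even though the population itself is far from normal.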
Assignment for 9/16 M:
- Read Navarro 2017, Ch. 10.5,
which will introduce confidence intervals.
There is no associated quiz.
- Do Lab 4, sampling distributions and confidence intervals
Week 4, 2024.09.09.M
Topics: Frequentist vs. Bayesian statistics. Probability distributions. Samples.
Class:
- Lab 3, descriptive statistics
- Deciphering authors' description of what they did
- dplyr not needed yet (coming in 2nd half)
- Inferential vs. descriptive statistics
- Frequentist vs. Bayesian inference
- Probability distributions. Examples:
- Parameters and inference
Assignment for 9/11 W:
- Read Navarro 2017, Ch. 10,
through the end of 10.4.
- Reading Quiz 5
Week 3, 2024.09.04.W
Topics: Descriptive statistics: Linear relationships and correlation.
Class:
- Comments on Lab 2
- Need to follow authors' account of what they did with the data. What observations
did they exclude?
- Shape of RStudio graphics window can affect shape of saved graphic.
- Useless libraries loaded by a number of people. (Why?)
- Use line breaks, indentation, mnemonic variable names, and in-line comments to
make the code more human-readable (and hence more maintainable later on).
- Describing relations between two numeric variables
- Linear relationships
- Correlation coefficients: Pearson's, Spearman's
- Cautionary tale: Anscombe's Quartet
- Start on Lab 3 (descriptive statistics).
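For reference, Anscombe's Quartet ships with R as the built-in data frame anscombe (columns x1 to x4 and y1 to y4), so the cautionary tale can be reproduced in a few lines of base R. The helper name cor_for_set is just for this sketch.

```r
# Correlation between x_i and y_i for one of the four data sets.
cor_for_set <- function(i, method) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]], method = method)
}

pearson  <- sapply(1:4, cor_for_set, method = "pearson")
spearman <- sapply(1:4, cor_for_set, method = "spearman")

print(round(pearson, 3))    # nearly identical across the four sets
print(round(spearman, 3))   # rank-based, so these can differ
```

The four Pearson coefficients are nearly identical even though scatterplots of the four sets look nothing alike, which is why it pays to plot the data before trusting a correlation.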
Assignment for 9/9 M:
- Lab 3 (descriptive statistics)
- Read Navarro 2017, Ch. 9
through the end of 9.5. There is no associated quiz.
Week 2, 2024.08.28.W
Topics: Data frames and basic visualization.
Class:
- Intro to data graphics in R (continued from last time; slides on Canvas)
- Saving graphics to a file
- A couple of useful data-summarization functions.
- Start on descriptive statistics (slides on Canvas)
- Descriptive vs. inferential statistics
- Measures of central tendency
- Measures of variability
- Describing data grouping using formulas
- Collapsing within groups using aggregate ()
- Start on Lab 2 (data frame and scatterplot).
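A minimal sketch of collapsing within groups using aggregate () with a formula, in base R. The data frame, its column names, and the reaction times are invented for illustration.

```r
# Made-up reaction times in two conditions (illustration only).
rt <- data.frame(
  condition = c("easy", "easy", "easy", "hard", "hard", "hard"),
  rt_ms     = c(510, 530, 520, 640, 660, 650)
)

# The formula reads "rt_ms grouped by condition"; FUN is applied within each group.
mean_rt <- aggregate(rt_ms ~ condition, data = rt, FUN = mean)
print(mean_rt)

# The same grouping works with other summaries, e.g. a measure of variability.
sd_rt <- aggregate(rt_ms ~ condition, data = rt, FUN = sd)
print(sd_rt)
```

The result is itself a data frame with one row per group, so it can be plotted or passed on like any other.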
Assignment for 9/4 Wed:
- Lab 2 (data frame and scatterplot). (This is due on
Wednesday, rather than Monday, because of the
holidays.)
- Read Navarro
2017, Chapter 5, Sections 5.1, 5.2, 5.4, 5.5,
5.7--5.10, and do Reading Quiz 4 (on Canvas)
Week 2, 2024.08.26.M
Topics: Data frames. Start visualization.
Class:
- Data frames (continued from last time; slides on Canvas Modules)
- Intro to data visualization (slides; Canvas Modules)
Assignment: Read Navarro 2017,
Chapter 6, through the end of Section 6.2, about base R graphics ("traditional" R graphics). Do Quiz 3 on the
reading.
Week 1, 2024.08.21.W
Topics: R data types
Class:
- Comments on Lab 1
- Functions are reusable code, so you can save effort by reusing them.
- As we start seeing more complex code in our textbook, watch how it is formatted and commented to make it
easier for humans to follow.
- Data types in R (slides, Canvas Modules).
- Time permitting: Start data frames (slides, Canvas Modules).
Assignment for 8/26 M:
Week 1, 2024.08.19.M
Topics: Course organization. R and RStudio.
Class:
Assignment for 8/21 W:
- Reading, from Learning Statistics with R (one of our free on-line textbooks):
- Ch. 1, Sections 1.1 and 1.2, on what statistics does for us
- Ch. 3, Sections 3.1 through 3.9, about R
- Quiz 1 (Canvas, Quizzes)
- Lab 1 (Canvas, Assignments)