Linguistics 460: Textual Data Analysis with R
UNC-Chapel Hill Linguistics
Spring 2024
Elliott Moreton
2024.05.03.F, 4-7 p.m.
FINAL PROJECT PRESENTATIONS
2024.04.29.M
Topics: MIDTERM 2
- In this room, at this time, with computer, just like Midterm 1
2024.04.24.W
Topics: Midterm 2 review
Class:
- HO on Canvas about final-project presentations
- Go over questions about course content
- Readings?
- Labs?
- Midterm review guide (on Canvas)?
Assignments:
- Today, 3 p.m.: First draft of project due
- For 5/3 (F), 2 p.m.: Upload presentation slides to Canvas
- For 5/3 (F), 4 p.m.: Revise project draft in response to feedback.
2024.04.22.M
Topics: Project clinic in class.
Class:
- How to hand in the first draft of the project (Canvas, Assignments, by "project groups")
- Work together in groups on project while instructor circulates.
Assignments:
- Press ahead on projects.
- For 4/24 W, 3 p.m.: First draft of project due.
2024.04.17.W
Topics: LASSO regression for document classification.
Class:
- Still using the IMDb database in the textdata library. (Can download imdb_jr.csv from the Canvas site)
- Linear models (reprise from 2/28)
- Logistic models for binary responses
- Fit by maximum likelihood
- Fit by LASSO
- How does performance compare on the database?
Assignments:
- Respond to email feedback about initial draft of project proposal if you haven't already.
- Press ahead on projects.
- For 4/24 W, 3 p.m.: First draft of project due.
2024.04.15.M
Topics: Evaluating classifier performance.
Changing parameters to optimize performance
Class:
- Still using the IMDb database in the textdata library. (Can download imdb_jr.csv from the Canvas site)
- Fitting the Naive Bayes classifier
- Using the classifier to make predictions
- Evaluating model performance: Cohen's kappa
- Adjusting model parameters to optimize performance
- Varying document-term-matrix sparsity
- Choice of weighting schemes: binary term frequency, term frequency, tf-idf
- Overfitting
- In-class project clinic
Assignments:
- For 4/17 (W): Review earlier reading (2/28) of Navarro 2017, Ch. 15, through end of Section 6, on linear regression
- For 4/17 (W): Install the glmnet package
- Respond to email feedback about initial draft of project proposal if you haven't already.
- Press ahead on projects.
- For 4/24 W, 3 p.m.: First draft of project due.
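Note: A minimal sketch of how Cohen's kappa can be computed by hand from a confusion table. The labels below are invented for illustration (they are not IMDb data), and the variable names are hypothetical.

```r
# Hypothetical true and predicted labels for 10 reviews (invented data).
labels    <- c("neg", "pos")
truth     <- factor(c("pos", "pos", "pos", "pos", "pos",
                      "neg", "neg", "neg", "neg", "neg"), levels = labels)
predicted <- factor(c("pos", "pos", "pos", "pos", "neg",
                      "neg", "neg", "neg", "pos", "pos"), levels = labels)

# Confusion table: rows are true classes, columns are predicted classes.
confusion <- table(truth, predicted)
n <- sum(confusion)

# Observed agreement: proportion of reviews on the diagonal.
p_observed <- sum(diag(confusion)) / n

# Agreement expected by chance, from the row and column totals.
p_expected <- sum(rowSums(confusion) * colSums(confusion)) / n^2

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible.
cohens_kappa <- (p_observed - p_expected) / (1 - p_expected)
print(cohens_kappa)
```

A kappa of 0 means the classifier agrees with the true labels no more often than chance would predict; a kappa of 1 means perfect agreement.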
2024.04.10.W
Topics: Naive Bayes classifiers for document classification.
Class:
- Resume with the IMDb database in the textdata library.
- The Inverse Fallacy returns (from 2/14)
- Bayes's Rule
- Naive Bayes classifiers: theory
- Building a Naive Bayes classifier in R
- Making predictions with the classifier
Assignments:
- Respond to email feedback about initial draft of project proposal.
- Press ahead on projects.
- For 4/24 W, 3 p.m.: First draft of project due.
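Note: A minimal worked example of Bayes's Rule, and of why the Inverse Fallacy is a fallacy. All probabilities below are invented for illustration; they are not estimates from the IMDb data.

```r
# Hypothetical probabilities (invented for illustration).
p_pos            <- 0.50   # prior: P(review is positive)
p_word_given_pos <- 0.30   # P(review contains "great" | positive)
p_word_given_neg <- 0.10   # P(review contains "great" | negative)

# Total probability of seeing the word in any review.
p_word <- p_word_given_pos * p_pos + p_word_given_neg * (1 - p_pos)

# Bayes's Rule:
#   P(positive | word) = P(word | positive) * P(positive) / P(word)
p_pos_given_word <- p_word_given_pos * p_pos / p_word

# The Inverse Fallacy would treat these two numbers as interchangeable.
print(c(p_word_given_pos = p_word_given_pos,
        p_pos_given_word = p_pos_given_word))
```

A Naive Bayes classifier applies the same rule with many words at once, under the simplifying ("naive") assumption that words occur independently of one another within a class.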
2024.04.08.M
Topics: Document classification using supervised learning. Document-term matrices.
Class:
- Introduction to classification models (slides, Canvas):
- Document classification
- Supervised learning
- Example: Internet Movie Database (IMDb) reviews (Maas et al., 2011)
- Document-term matrices
- Check in about final projects.
Assignments:
- Respond to email feedback about initial draft of project proposal.
- For 4/10 W: Read Navarro 2017, Ch. 17, through the end of Section 17.2, about Bayesian statistics.
- For 4/10 W: Install the "e1071" package.
- For 4/10 W: Please try to install the "glmnet" package, and let me know in class if it works.
- For 4/24 W, 3 p.m.: First draft of project due.
2024.04.03.W
Topics: Sentiment analysis.
Class:
- Finish sentiment-analysis slides and in-class problem from last time
- Discuss Lab 6 on tidy text
- Start Lab 7 on sentiment analysis
Assignments:
- For 4/8 M, 3 p.m.: Lab 7 on sentiment analysis.
- Respond to email feedback about initial draft of project proposal.
- For 4/24 W, 3 p.m.: First draft of project due.
2024.04.01.M
Topics: Sentiment analysis.
Class:
- Sentiment
- The "bag-of-words" model
- Sentiment lexicons
- Using inner_join () to look text words up in a sentiment lexicon
- Checking the sentiment lexica against each other.
- (Time permitting:) Check-in about project proposals
Assignments:
- For 4/2 T (11 a.m.): Lab 6 on "tidy" text. Note unusual due date!
- For 4/3 W: Read Chapter 2 of Silge & Robinson's book
- For 4/3 W: Do Quiz 5 on that reading
- For 4/3 W (3 p.m.): Submit an initial proposal for the final project
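Note: A minimal sketch of looking text words up in a sentiment lexicon with inner_join (), assuming the dplyr package is installed. The sentence and the three-word lexicon are invented for illustration; real lexicons are far larger.

```r
library(dplyr)

# Hypothetical tokenized text: one word per row, in the one-token-per-row
# shape that unnest_tokens () produces.
text_words <- data.frame(
  word = c("the", "movie", "was", "wonderful", "but",
           "the", "ending", "was", "awful"),
  stringsAsFactors = FALSE
)

# Hypothetical miniature sentiment lexicon.
toy_lexicon <- data.frame(
  word      = c("wonderful", "awful", "happy"),
  sentiment = c("positive", "negative", "positive"),
  stringsAsFactors = FALSE
)

# inner_join () keeps only rows whose word appears in both tables,
# so words missing from the lexicon are dropped.
scored_words <- inner_join(text_words, toy_lexicon, by = "word")

# Bag-of-words summary: how many words carry each sentiment?
sentiment_counts <- count(scored_words, sentiment)
print(scored_words)
print(sentiment_counts)
```

Because the bag-of-words model ignores word order, "not wonderful" would be scored the same as "wonderful" here.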
2024.03.27.W
Topics: "Tidying" text.
Class:
- Finish examples from Chapter 1 of Silge & Robinson's book, in great detail.
- Important functions: unnest_tokens (), group_by (), anti_join (), bind_rows (), pivot_wider (), pivot_longer ().
- Start Lab 6 ("tidy" text).
Assignments:
- For 4/2 T (11 a.m.): Lab 6 on "tidy" text. Note unusual due date!
- For 4/3 W (3 p.m.): Submit an initial proposal for the final project.
2024.03.25.M
Topics: "Tidying" text
Class:
- How can we break a text document down into individual words?
- Go through examples from Chapter 1 of Silge & Robinson's book, in great detail.
- Important functions: unnest_tokens (), group_by (), anti_join (), bind_rows (), pivot_wider (), pivot_longer ().
- (Last 10--15 minutes) Go over Lab 5 on regular expressions.
Assignments:
- Before 3/27 W: Introduce yourself in the Discussions area of Canvas. This is optional, but it could help with the next assignment:
- For 3/27 W: Find two or three partners for your final-project group.
- For 4/3 W: Submit an initial proposal for the final project.
2024.03.20.W
Topics: Regular expressions.
Class:
- Slides on Canvas: Liquid co-occurrence in U.S. baby names (Martin, 2007, Ch. 3), exemplifying:
- Structure of a final project
- Introduction to the tidyverse
- Regular expressions for describing strings
- Examples of projects from last semester's LING 460.
- Some places to look for data sets:
- Start on Lab 5, regular expressions.
Assignments:
- For 3/25 M: Lab 5, regular expressions.
- For 3/25 M: Read Chapter 1 of Silge & Robinson's Text Mining with R: a tidy approach
- For 3/25 M: Install the gutenbergr package.
- Before 3/27 W: Introduce yourself in the Discussions area of Canvas. This is optional, but it could help with the next assignment:
- For 3/27 W: Find two or three partners for your final-project group.
- For 4/3 W: Submit an initial proposal for the final project.
2024.03.18.M
Topics: Final projects. Regular expressions.
Class:
- Slides on Canvas: Liquid co-occurrence in U.S. baby names (Martin, 2007, Ch. 3), exemplifying:
- Structure of a final project
- Introduction to the tidyverse
- Regular expressions for describing strings
(This will likely take more than the whole period.)
- Go over the handout on final projects (on Canvas)
Assignment:
- For 3/20 W: Read the chapter on regular expressions in Wickham et al. 2023.
- Before 3/27 W: Introduce yourself in the Discussions area of Canvas. This is optional, but it could help with the next assignment:
- For 3/27 W: Find two or three partners for your final-project group.
- For 4/3 W: Submit an initial proposal for the final project.
2024.03.06.W
MIDTERM 1
- In this room, at this time, with computer
- Please see Midterm 1 Syllabus for details (Canvas > Modules > Course materials > "02.28.W-460MT1Syllabus2024Sp.pdf")
Assignment for 3/18 (M): Read Chapters 1--3 of Handling and Processing Strings in R, by Gaston Sanchez.
2024.03.04.M
Topics: Midterm review
Class:
- Go over questions about course content
- Readings?
- Labs?
- Any unanswered questions from the quizzes?
2024.02.28.W
Topics: Linear regression
Class:
- Linear associations
- The simple linear regression model (y = mx + b)
- Model fitting: ordinary least-squares regression
- Interpreting the fitted model
Announcement: Midterm on Wednesday, March 6. A midterm syllabus will be unveiled on Canvas at the end of class.
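Note: A minimal sketch of fitting the simple linear regression model y = mx + b by ordinary least squares with lm (). The data points are invented for illustration (roughly y = 2x + 1 with small deviations).

```r
# Hypothetical data (invented): roughly y = 2x + 1 plus small deviations.
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 8.8, 11.0)

# Fit by ordinary least squares. The formula y ~ x reads "y modeled by x";
# lm () adds the intercept term automatically.
fit <- lm(y ~ x)

# Extract the fitted intercept (b) and slope (m).
b <- coef(fit)[["(Intercept)"]]
m <- coef(fit)[["x"]]
print(c(slope = m, intercept = b))

# Interpreting the fit: the predicted y at a new x value is m * x + b.
predicted_at_6 <- predict(fit, newdata = data.frame(x = 6))
print(predicted_at_6)
```

The slope m is the estimated change in y for a one-unit increase in x; summary (fit) gives standard errors and R-squared for the same model.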
2024.02.26.M
Topics: Tests of association between categorical variables
Class:
- Categori(c)al vs. scalar variables
- Association and independence
- Mosaic plots
- The chi-squared test of association
- Fisher's exact test
- Time permitting: Discuss Lab 4
Assignment:
- Reading: Navarro 2017
- For 2/28 W: Chapter 15, through the end of Section 6
- For 2/28 W: Quiz 4, on Canvas
Announcement: Office hours on 2/27 Tues. will be later than usual (4:15--5:15 instead of 2-3) and by Zoom (link is on syllabus).
2024.02.21.W
Topics: Testing hypotheses about a population mean.
Class:
- General caveats about null-hypothesis significance testing (from last time's slides)
- Hypotheses about a population mean: t-tests
- Assumptions
- Worked example
- In-class problem
- Start on Lab 4.
Assignment for 2/26 M:
- Do Lab 4, comparison of reaction times in a synonymy-choice task
2024.02.19.M
Topics: Null-hypothesis significance testing
Class:
- Overview of NHST
- Logic of NHST
- Work through an example (binomial test)
- Do an in-class problem (chi-squared test for equal proportions)
- Discuss caveats
Assignment:
- Reading: Navarro 2017
- For 2/21 W: Chapter 13, through the end of Section 3
2024.02.14.W
Topics: Confidence intervals
Class:
- Computing a z or t confidence interval for a sample from a normally-distributed population.
- Frequentist interpretation of confidence intervals
- Some misinterpretations of CIs.
- Factors determining the width of CIs. How much influence does the experimenter have?
- What if the population isn't normally-distributed?
- Other kinds of CIs.
- Uses of CIs
Assignment:
- No lab this week.
- Reading: Navarro 2017
- For 2/19 M: Chapter 11, through the end of Section 7
- For 2/21 W: Chapter 13, through the end of Section 3
- For 2/19 M: Quiz 3 (on Canvas -- will be up on Friday)
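Note: A minimal sketch of computing a 95% t confidence interval for a population mean, first by hand and then with t.test (). The sample values are invented for illustration and are assumed to come from a normally distributed population.

```r
# Hypothetical sample of 8 measurements (invented data).
x <- c(4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9)

n       <- length(x)
x_bar   <- mean(x)
std_err <- sd(x) / sqrt(n)         # estimated standard error of the mean
t_crit  <- qt(0.975, df = n - 1)   # two-sided 95% critical value of t

# Interval by hand: sample mean plus or minus t_crit standard errors.
ci_by_hand <- c(x_bar - t_crit * std_err, x_bar + t_crit * std_err)

# t.test () reports the same interval.
ci_builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(ci_by_hand)
print(ci_builtin)
```

The width depends on the sample size (through std_err and the degrees of freedom), the sample variability, and the chosen confidence level; the experimenter directly controls only the first and the last.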
2024.02.07.W
Topics: Sampling theory.
Class:
- Statistical inference
- Sample statistics and the sampling distribution of the mean
- The Central Limit Theorem
- Start on Lab 3
Assignment for 2/14 W:
- Read Navarro 2017, Ch. 10.4 and 10.5.
There is no associated quiz.
- Do Lab 3, sampling distributions and confidence intervals
Note: If the illustrations are not showing up for you in the Web-based version of Navarro 2017, a pdf can be found here.
2024.02.05.M
Topics: Frequentist vs. Bayesian statistics. Probability distributions. Samples.
Class:
- Inferential vs. descriptive statistics
- Frequentist vs. Bayesian inference
- Probability distributions. Examples:
- Parameters and inference
Assignment: Read Navarro 2017, Ch. 10 through the end of 10.4. There is no associated quiz.
Note: If the illustrations are not showing up for you in the Web-based version of Navarro 2017, a pdf can be found here.
2024.01.31.W
Topics: Descriptive statistics: Variability
Class:
- Comments on Lab 1
- Need to follow authors' account of what they did with the data. What observations did they exclude?
- Use line breaks, indentation, mnemonic variable names, and in-line comments to make the code more human-readable (and hence more maintainable later on).
- Linear relationships and correlation (slides from last time)
- Start on Lab 2
Assignment for 2/5 M:
- Lab 2 (Canvas, Assignments)
- Read Navarro 2017, Ch. 9 through the end of 9.5. There is no associated quiz.
2024.01.29.M
Topics: Descriptive statistics
Class:
- Descriptive vs. inferential statistics
- Measures of central tendency
- Measures of variability
- Describing data grouping using formulas
- Collapsing within groups using aggregate ()
- Linear relationships and correlation
Assignment: Reading, from Navarro 2017: 5.1, 5.2, 5.4, 5.5, 5.7--5.10. Also: Quiz 2 (on Canvas).
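Note: A minimal sketch of describing a grouping with a formula and collapsing within groups using aggregate (). The data frame, its column names, and its values are invented for illustration.

```r
# Hypothetical toy data: reaction times (ms) for two word classes.
rt_data <- data.frame(
  word_class = c("noun", "noun", "verb", "verb", "verb"),
  rt_ms      = c(500, 540, 610, 590, 600)
)

# The formula rt_ms ~ word_class reads "rt_ms, grouped by word_class".
# FUN is applied separately to the rt_ms values within each group.
group_means <- aggregate(rt_ms ~ word_class, data = rt_data, FUN = mean)
group_sds   <- aggregate(rt_ms ~ word_class, data = rt_data, FUN = sd)

print(group_means)   # one row per group: a measure of central tendency
print(group_sds)     # one row per group: a measure of variability
```

Adding a second grouping variable on the right-hand side (for example rt_ms ~ word_class + speaker, with a hypothetical speaker column) would collapse within each combination of groups.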
2024.01.24.W
Topics: Data frames and basic visualization.
Class:
- Intro to data graphics in R (continued from last time; slides on Canvas)
- Just a few graphical parameters.
- Time permitting: A couple of useful data-summarization functions.
- Start on Lab 1.
Assignment: Lab 1
2024.01.22.M
Topics: Data frames. Start visualization.
Class:
- Data frames (continued from last time; slides on Canvas Modules)
- Intro to data visualization (slides; Canvas Modules)
Announcement: Office hours are T 2-3 and F 2:30-3:30
2024.01.17.W
Topics: R data types
Class:
- Comments on Lab 0
- Functions are reusable code, so you can save effort by reusing them.
- As we start seeing more complex code in our textbook, watch how it is formatted and commented to make it easier for humans to follow.
- Data types in R (slides, Canvas Modules).
- Time permitting: Start data frames (slides, Canvas Modules).
Assignment for 1/22 M:
2024.01.10.W
Topics: Course organization. R and RStudio.
Class:
Assignment for 1/17 W:
- Reading, from Learning Statistics with R (our free on-line textbook):
- Ch. 1, Sections 1.1 and 1.2, on what statistics does for us
- Ch. 3, Sections 3.1 through 3.9, about R
- Quiz 0 (Canvas, Quizzes)
- Lab 0 (Canvas, Assignments)