CSE 447: Natural Language Processing, Fall 2024
MWF 3:30-4:20pm, CSE2 G20 (Gates, ground floor)
Teaching Assistant: Khushi Khandelwal
OH: Mon 11:00-12:00pm, CSE1 3rd Floor Breakout
Announcements
Project 3 is out! [Handout PDF][Handout LaTeX Source][Part A Notebook][Part B Notebook]
Project 2 is out! [Handout PDF][Handout LaTeX Source][Part A Notebook][Part B Notebook]
Project 1 is out! [Handout PDF][Handout LaTeX Source][Part A Notebook][Part B Notebook]
Project 0 is out! [Instructions] [Notebook]
Summary
This course will explore foundational statistical techniques for the automatic analysis of natural (human) language text. Towards this end, the course will introduce pragmatic formalisms for representing structure in natural language, and algorithms for annotating raw text with those structures. The dominant modeling paradigm is corpus-driven statistical learning, covering both supervised and unsupervised methods. This is a lab-based course: instead of homeworks and exams, you will mainly be graded on three hands-on coding projects.
This course assumes a good background in basic probability and a strong ability to program in Python. Experience with numerical libraries such as NumPy and neural network libraries such as PyTorch is a plus. Prior experience with machine learning, linguistics, or natural languages is helpful, but not required. There will be a lot of statistics, algorithms, and coding in this class.
Calendar
The calendar is tentative and subject to change; more details will be added as the quarter continues.
| Week | Date | Topics | Readings | Homeworks |
|---|---|---|---|---|
| 1 | 9/27 | Logistics [slides] | Course website, syllabus | |
| 2 | 9/30 | Introduction [slides] | Optional reading: J&M (2nd ed) 1 | |
| | 10/02 | Introduction [slides] | Optional reading: NYT Interview with Yejin Choi | |
| | 10/04 | Introduction [slides] | Eis 2; J&M III 4 | |
| 3 | 10/07 | Text classification [slides] | Eis 2; J&M III 4; Ng & Jordan, 2001 | HW1 out |
| | 10/09 | Text classification [slides] | Eis 2; J&M III 5; Pang et al. 2002 | HW1 overview |
| | 10/11 | Text classification [slides] | Eis 2; J&M III 5 | in-class quiz 1 |
| 4 | 10/14 | Text classification [slides] | Eis 2; J&M III 5 | |
| | 10/16 | Text classification [slides] | Eis 2; J&M III 5 | |
| | 10/18 | Text classification [slides] | J&M III 3; Eis 6.1-6.2, 6.4 | in-class quiz 2 |
| 5 | 10/21 | Language modeling [slides] | J&M III 3; Eis 6.1-6.2, 6.4 | |
| | 10/23 | Language modeling [slides] | J&M III 6; Eis 14 | |
| | 10/25 | Language modeling and Lexical semantics [slides] | J&M III 6; Eis 14 | in-class quiz 3 |
| 6 | 10/28 | Lexical semantics [slides] | J&M III 6 | |
| | 10/30 | Lexical semantics [slides] | J&M III 6 | HW1 due; HW2 out |
| | 11/1 | Research highlight: Political biases in LLMs [slides] | Political bias in LLMs; its effect on LLM users | in-class quiz 4 |
| 7 | 11/4 | Neural networks [slides] | J&M III 7; Optional: Eis 6.3, 6.5; J&M III 7.5; Goldberg 10; Collobert et al. 2011 | |
| | 11/6 | Neural networks [slides] | J&M III 8 | |
| | 11/8 | Neural networks: Transformers [slides] | J&M III 9; Optional but recommended: Annotated Transformer, Illustrated Transformer; Recorded Lecture | in-class quiz 5 |
| 8 | 11/11 | Veterans Day (Canceled) | | |
| | 11/13 | LLMs - Pretraining I [slides] | J&M III 11; BERT; Sentence-BERT | |
| | 11/15 | LLMs - Pretraining II [slides] | J&M III 10; T5; GPT-2; Curious Case of Neural Text Degeneration; How to generate text (HF blog) | in-class quiz 6 |
| 9 | 11/18 | LLMs - Prompting, Chain of Thought (CoT) [slides] | Schulhoff et al., 2024 | |
| | 11/20 | LLMs + Reasoning [slides] | Survey on Knowledge Distillation; Jung et al., 2024 | HW2 due |
| | 11/22 | LLMs + Reasoning: grand challenges [slides] | Reasoning Survey | in-class quiz 7; HW3 out |
| 10 | 11/25 | Applications: Summarization [slides] | Gehrmann et al., 2018 | |
| | 11/27 | NLP in Industry: Recommender systems and online training (recorded lecture) [slides] | Lecture Recording | |
| | 11/29 | Thanksgiving (Canceled) | | |
| 11 | 12/2 | LLM Safety [slides] | Risks of LLMs; SafetyPrompts; The Art of Saying No | |
| | 12/4 | AI ethics [slides] | ACM Code of Ethics; NeurIPS Code of Ethics | in-class quiz 8 |
| | 12/6 | Conclusion and Q&A | | HW3 due |
Resources
- Readings
- Ed discussion board
- Canvas
- Gradescope
Assignments/Grading
- Project 0 (Python and PyTorch Tutorial / Review): Optional, extra 2% credit. [Instructions] [Notebook]
- Project 1 (Text Classification and N-gram language models): 30% [Handout PDF][Handout LaTeX Source][Part A Notebook][Part B Notebook]
- Implementing Logistic Regression for text classification
- Training, evaluating, and sampling from n-gram language models (see the short sketch after this list)
- Project 2 (Neural Text Classification and Neural Language Modeling)*: 30% [Handout PDF][Handout LaTeX Source][Part A Notebook][Part B Notebook]
- Training feed-forward neural networks for text classification using word2vec and sentence-transformer representations
- Project 3 (Transformers and Natural Language Generation)*: 30% [Handout PDF][Handout LaTeX Source][Part A Notebook][Part B Notebook]
- Implementing multi-head self-attention from scratch and training transformer-based language models
- Decoding algorithms for text generation
- Knowledge Distillation using synthetic data for summarization
- Quizzes: 10%
- Starting from the 3rd week, we will have quizzes on Fridays (unless announced otherwise).
- There will be 8 quizzes in total.
- Quizzes will be given during the first 10 minutes of class.
- Your 5 best quizzes will count toward your final score; each counted quiz is worth 2% of the final grade.
- Participation: 6% bonus
*Subject to change based on factors like class performance, compute feasibility, and topics covered during the course.
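To give a concrete sense of the kind of implementation work Project 1 asks for, here is a minimal sketch of estimating and sampling from a bigram language model. The toy corpus, function names, and absence of smoothing are illustrative assumptions; the handout defines the actual data, interface, and requirements.

```python
import random
from collections import Counter, defaultdict

# Toy corpus with explicit sentence boundaries (illustrative only).
corpus = [
    "<s> the cat sat on the mat </s>",
    "<s> the dog sat on the log </s>",
]

# Count how often each word follows each context word.
bigram_counts = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def prob(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev); the project adds smoothing."""
    context = bigram_counts[prev]
    total = sum(context.values())
    return context[curr] / total if total else 0.0

def sample(max_len=20):
    """Generate a sentence token by token from the bigram distribution."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        context = bigram_counts[tokens[-1]]
        words = list(context.keys())
        weights = list(context.values())
        tokens.append(random.choices(words, weights=weights)[0])
    return " ".join(tokens)

print(prob("the", "cat"))  # 0.25 on this toy corpus
print(sample())
```

The same counting-and-normalizing idea extends to higher-order n-grams by using tuples of previous tokens as the context key.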
Policies
- Late policy. Each student will be granted 5 late days to use over the duration of the quarter. You can use a maximum of 3 late days on any one project. Weekends and holidays are also counted as late days. Late submissions are automatically considered as using late days. Using late days will not affect your grade. However, projects submitted late after all late days have been used will receive no credit. Be careful!
- Academic honesty. Homework assignments are to be completed individually. Verbal collaboration on homework assignments is acceptable, as is re-implementation of relevant algorithms from research papers, but everything you turn in must be your own work, and you must note the names of anyone you collaborated with on each problem and cite resources that you used to learn about the problem. The project proposal is to be completed by a team. Suspected violations of academic integrity rules will be handled in accordance with UW guidelines on academic misconduct.
- On ChatGPT, Copilot, and other AI assistants (adopted from Greg Durrett): Understanding the capabilities of these systems and their boundaries is a major focus of this class, and there's no better way to do that than by using them!
  - We strongly encourage you to use ChatGPT to understand concepts in AI and machine learning. You should see it as another tool, like web search, that can supplement your understanding of the course material.
  - You are allowed to use ChatGPT and Copilot for programming assignments. However, usage of ChatGPT must be limited in the same way as usage of other resources like websites or other students. You should come up with the high-level skeleton of the solution yourself and use these tools primarily as coding assistants.
  - You are permitted to use ChatGPT for conceptual questions on assignments, but discouraged from doing so. It will get some of these questions right and some of them wrong. These questions are meant to deepen your understanding of the course content. Heavily relying on ChatGPT for your answers will negatively impact your learning.
    An example of a good question is, "Write a line of Python code to reshape a PyTorch tensor x of [batch size, seqlen, hidden dimension] to be a 2-dimensional tensor with the first two dimensions collapsed" (see the short sketch after this list). A similar invocation of Copilot will probably be useful as well. An example of a bad question would be to feed in a large chunk of the assignment code and copy-paste the problem specification from the assignment PDF. This is much less likely to be useful, and it makes subtle bugs harder to spot. As a heuristic, you should be able to explain what each line of your code is doing. If you have code in your solution that is only included because ChatGPT told you to put it there, then it is no longer your own work in the same way.
- Accommodations. If you have a disability and have an accommodations letter from the Disability Resources office, I encourage you to discuss your accommodations and needs with me as early in the quarter as possible. I will work with you to ensure that accommodations are provided as appropriate. If you suspect that you may have a disability and would benefit from accommodations but are not yet registered with the office of Disability Resources for Students, I encourage you to apply here.
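For reference, the "good question" in the AI-assistant policy above comes down to a one-line reshape. A minimal sketch (the tensor name and sizes are assumptions, not from any assignment):

```python
import torch

batch_size, seqlen, hidden_dim = 8, 16, 64
x = torch.randn(batch_size, seqlen, hidden_dim)

# Collapse the first two dimensions: [batch, seqlen, hidden] -> [batch * seqlen, hidden]
x_flat = x.reshape(batch_size * seqlen, hidden_dim)  # equivalently: x.view(-1, x.size(-1))
print(x_flat.shape)  # torch.Size([128, 64])
```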
Note to Students
Take care of yourself! As a student, you may experience a range of challenges that can interfere with learning, such as strained relationships, increased anxiety, substance use, feeling down, difficulty concentrating and/or lack of motivation. All of us benefit from support during times of struggle. There are many helpful resources available on campus and an important part of having a healthy life is learning how to ask for help. Asking for support sooner rather than later is almost always helpful. UW services are available, and treatment does work. You can learn more about confidential mental health services available on campus here. Crisis services are available from the counseling center 24/7 by phone at +1 (866) 743-7732 (more details here).