2025-10-21 – UVM Alumni House Silver Pavilion
We will discuss fundamental linguistics and data science concepts that underpin the ability to extract signal from text. This talk brings theoretical context to general data science and NLP approaches. Topics will include the linguistic grounding of large language models (LLMs), basic NLP methods, and common pitfalls in textual analysis. We will also present some tools developed by our lab that can act as powerful lenses for textual data. Some examples we will use to approach these topics include: word frequency and distributions, Zipf’s law, the Distributional Hypothesis, allotaxonometry, sentiment, time series, and scale.
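To make the rank-frequency idea concrete, here is a minimal Python sketch (illustrative only, not the lab's tooling; the input text is a placeholder) of the pattern behind Zipf's law, where a word's frequency tends to fall off roughly as a power of its rank:

    # Rank-frequency sketch: under Zipf's law, frequency ~ C / rank,
    # so log(frequency) falls roughly linearly in log(rank).
    from collections import Counter
    import math

    text = "the cat and the dog and the bird saw the cat"  # placeholder corpus
    counts = Counter(text.lower().split())

    # Rank 1 is the most frequent word.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        print(f"{rank}  {word:<6} freq={freq}  "
              f"log r={math.log(rank):.2f}  log f={math.log(freq):.2f}")

On a real corpus, plotting log-frequency against log-rank makes the approximately straight Zipfian line easy to see.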
Takeaways from this talk will be theoretical background and tools that support a holistic approach to extracting signal from text, empowering attendees to engage critically with NLP applications in the wild and to deploy NLP approaches responsibly and creatively.
NLP can be strengthened by fundamental concepts from linguistics. We will give examples of how computational methods, NLP approaches, and linguistics come together to support text-related goals, covering:
 1. Basic methods and tools and their theoretical underpinnings,
 2. How to decide which method to apply, and
 3. An example: trying to understand “anger” in a text.

In more detail, these sections would include (time permitting) discussion of: the Distributional Hypothesis; fast-mapping and other aspects of language acquisition; distributional semantics; static vs. dynamic word embeddings; n-grams; time series of features; sliding-window calculations; allotaxonometry; large language models (LLMs); the (treacherous yet powerful) use of frequency as a proxy for truth and as a language model; how different training objectives yield different latent spaces (e.g., masked language modeling vs. auto-regressive); named entity recognition (NER); syntax trees; dependency parsing and coreference; sentiment, emotions, and lexicon/wordlist (dictionary) methods; how orthogonality and nonlinearity affect measurement; important distinctions such as fine-grained vs. coarse-grained, type vs. token, use vs. mention, dynamic vs. static, and aggregate vs. utterance; scale; context; and pragmatics, including Grice’s maxims. For the “anger” example, we will look for specific linguistic features that correlate with anger or intense emotion, such as intensifiers, affective typographic features (word-stretching to signal sarcasm or being nonplussed, non-standard punctuation, all caps), discourse markers like "how dare", emphatic negation, indirect speech, and rhetorical questions. A few illustrative sketches of these ideas follow below.
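As one illustration of the Distributional Hypothesis, a toy sketch of our own (not the lab's code; real systems use learned embeddings): count-based context vectors already place words with similar neighbors close together.

    # Toy distributional semantics: build static count vectors from a
    # symmetric context window, then compare words by cosine similarity.
    from collections import Counter, defaultdict
    import math

    corpus = "the cat sat on the mat the dog sat on the rug".split()
    window = 2
    vectors = defaultdict(Counter)

    # Count each word's neighbors within the window on either side.
    for i, word in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                vectors[word][corpus[j]] += 1

    def cosine(u, v):
        """Cosine similarity between two sparse count vectors (Counters)."""
        dot = sum(u[k] * v[k] for k in u)
        nu = math.sqrt(sum(c * c for c in u.values()))
        nv = math.sqrt(sum(c * c for c in v.values()))
        return dot / (nu * nv)

    # "cat" and "dog" share contexts ("sat on the ..."), so similarity is high.
    print(round(cosine(vectors["cat"], vectors["dog"]), 3))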
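For time series of features via a sliding window, a lexicon-based sentiment score can be computed per window. A hedged sketch: the tiny word list below is hypothetical, standing in for a curated lexicon such as the labMT-style happiness scores.

    # Sliding-window sentiment: average lexicon score within each window,
    # skipping words the lexicon does not cover.
    tiny_lexicon = {"love": 8.0, "happy": 7.5, "fine": 5.5,
                    "angry": 2.5, "hate": 1.8}  # hypothetical scores

    def window_scores(tokens, size=100, step=50):
        """Return the mean lexicon score for each window of the token list."""
        series = []
        for start in range(0, max(1, len(tokens) - size + 1), step):
            window = tokens[start:start + size]
            scored = [tiny_lexicon[w] for w in window if w in tiny_lexicon]
            series.append(sum(scored) / len(scored) if scored else None)
        return series

    tokens = "i love this but now i hate how angry it makes me".split()
    print(window_scores(tokens, size=5, step=2))

Window size and step trade off temporal resolution against noise: small windows track local shifts but rest on few scored words, while large windows smooth the series.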
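And for the “anger” example, surface cues like all caps, word-stretching, and emphatic punctuation can be pulled out with simple patterns. This is a rough, non-exhaustive sketch; the patterns here are our illustrations, not a validated feature set.

    # Surface cues that can correlate with anger or intense emotion.
    import re

    PATTERNS = {
        "all_caps": re.compile(r"\b[A-Z]{3,}\b"),
        # A letter repeated 3+ times, e.g. "soooo"; findall reports the
        # stretched letter, since the pattern captures it in a group.
        "stretching": re.compile(r"\b\w*(\w)\1{2,}\w*\b"),
        "emphatic_punct": re.compile(r"[!?]{2,}"),
        "how_dare": re.compile(r"\bhow dare\b", re.IGNORECASE),
    }

    def anger_cues(text):
        """Map each cue name to the matches found in the text."""
        return {name: pat.findall(text) for name, pat in PATTERNS.items()}

    print(anger_cues("How DARE you?! That is soooo NOT okay!!!"))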
Julia Witte Zimmerman and Ashley Fehr are members of the Computational Story Lab at the Vermont Complex Systems Institute (UVM). Julia is a Postdoctoral Associate in Artificial Intelligence and Computational Social Science, and Ashley is a PhD candidate. Their research interests include stories, conversation, and meaning construction at all linguistic scales.