[This article was first published on Pablo Bernabeu, and kindly contributed to R-bloggers].

In computational linguistics, word meanings are shaped by their contexts. As the British linguist John Rupert Firth put it in 1957, ‘You shall know a word by the company it keeps’ (see Brunila & LaViolette, 2022, for a re-examination of the intellectual history). It sounds almost like life advice, but Firth meant something technical: words that habitually appear alongside each other tend to share semantic territory. The adjective ‘good’, for instance, is far more likely to appear near ‘kind’, ‘genuine’, ‘fair’ and ‘quality’ than near ‘broken’ or ‘fraud’ – and a model that tracks those neighbours can learn what ‘good’ means without ever being told. The principle extends to polysemy: ‘bank’ means something entirely different in the company of ‘river’ and ‘fishing rod’ than in the company of ‘overdraft’ and ‘mortgage’. Context is everything.

This deceptively simple insight is the bedrock on which generative AI was built.
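Firth's principle can be sketched in a few lines of base R. The snippet below counts which words share a sentence with a target word, using a handful of invented toy sentences (the sentences and the `neighbours()` helper are mine, purely for illustration):

```r
# Toy sentences invented for illustration: two 'bank' senses, two 'good' contexts
corpus <- c(
  "the bank raised the overdraft fee on my mortgage",
  "we sat on the river bank with a fishing rod",
  "a good and kind offer at a fair price",
  "a genuine and good quality product"
)
tokens <- strsplit(tolower(corpus), "\\s+")

# Count the words that co-occur (share a sentence) with a target word
neighbours <- function(target) {
  hits <- tokens[vapply(tokens, function(s) target %in% s, logical(1))]
  context <- unlist(hits)
  sort(table(context[context != target]), decreasing = TRUE)
}

neighbours("good")  # 'kind', 'genuine', 'fair', 'quality' appear among the neighbours
neighbours("bank")  # mixes 'river'/'fishing' with 'overdraft'/'mortgage'
```

Function words like ‘a’ and ‘the’ dominate the raw counts, which is precisely why real pipelines apply a weighting such as TF-IDF before drawing conclusions from the neighbourhoods.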
The earliest computational implementations of Firth’s principle – distributional semantic models such as Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) and the Hyperspace Analogue to Language (Lund & Burgess, 1996) – were modest by today’s standards: a matrix of word co-occurrence counts, a few hundred latent dimensions and a vocabulary of perhaps tens of thousands of words. Yet even these pocket-sized models captured real-world structure with startling fidelity. Louwerse and Zwaan (2009) showed that the frequency with which city names co-occur in English text predicts their actual geographical distances: cities close together on a map tend to be mentioned together more often, and an LSA model trained on text alone can reconstruct approximate maps of the United States without ever seeing one. Louwerse (2011) extended this further, showing that text statistics encode not just geography but sensory properties, emotional associations and conceptual relationships across a wide range of domains. Indeed, distributional language statistics may track some sensorimotor properties of concepts (Bernabeu, 2022; Louwerse & Connell, 2011; cf. Xu et al., 2025), especially after fine-tuning on human sensorimotor ratings (Wu et al., 2026). In short, language does not merely label the world – it encodes its structure, and even a simple co-occurrence model can read that encoding back.

We can see this for ourselves. The R code included below (click ‘Expand’ to view it) applies LSA – one of the simplest distributional models – to three text collections, projects the resulting word vectors into two dimensions via PCA (principal component analysis) and plots them.
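The logic behind the Louwerse and Zwaan result can be illustrated with a toy calculation. The distances below are approximate great-circle figures in kilometres; the co-occurrence counts are invented numbers, chosen only to show how the correlation would be computed on real corpus counts:

```r
# Toy illustration: if nearby cities are mentioned together more often,
# co-occurrence should correlate negatively with geographical distance.
# distance_km: approximate real distances; cooccur: hypothetical counts.
pairs <- data.frame(
  pair        = c("NYC-Chicago", "NYC-Houston", "NYC-LA",
                  "Chicago-Houston", "Chicago-LA", "Houston-LA"),
  distance_km = c(1150, 2280, 3940, 1510, 2800, 2200),
  cooccur     = c(120, 60, 30, 90, 45, 70)
)

# Rank correlation between distance and co-occurrence frequency
rho <- cor(pairs$distance_km, pairs$cooccur, method = "spearman")
rho  # negative: greater distance, fewer co-mentions (by construction here)
```

With real newswire counts the correlation is of course weaker than in this contrived example, but Louwerse and Zwaan's point is exactly this inverse relationship, scaled up to thousands of city pairs.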
In brief, LSA builds a term-document matrix (a large table recording how often each word appears in each document), weights it with TF-IDF (term frequency–inverse document frequency, which highlights words distinctive to particular documents rather than ubiquitous everywhere) and then compresses it via truncated SVD (singular value decomposition, a form of dimensionality reduction). Each corpus is split into two groups: the most distinctive words per group (selected by the difference in mean TF-IDF weight between groups) are plotted in the group’s colour, while the most frequent shared words appear in purple. Words that co-occur in similar contexts cluster together; words from different domains drift apart.

PCA works by finding new axes – principal components – that capture the maximum variance in the data. Each word receives a loading on each component: a number ranging from −1 to +1 that indicates how strongly that word contributes to that axis of variation (a gentle introduction to PCA in R is available in an earlier post on this blog). High absolute loadings on a component mean that the word is a strong marker of the distinction that component captures.

How are the thematic groups decided? The code computes the mean TF-IDF weight of every word in each group of documents and then takes the difference. Words whose weight is much higher in group A than in group B are classified as distinctive to A, and vice versa. The top 15 words at each extreme become the coloured labels in the plot, while the most frequent words that do not belong to either extreme are labelled ‘Shared’. The grouping is therefore entirely data-driven: no human decides which words are ‘finance’ or ‘energy’ – the corpus statistics do. Above each plot, a table shows the mean loading of each thematic group on the first two principal components, with the highest positive loading per group highlighted in bold.
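The whole pipeline – counts, TF-IDF, truncated SVD, group distinctiveness – fits in a short base-R sketch. The six-document corpus below is invented (three ‘finance’ documents, three ‘energy’), and the code is a condensed stand-in for the fuller version in this post, not a copy of it:

```r
# Invented toy corpus: three 'finance' documents, three 'energy' documents
docs <- c(
  "bank merger shares acquisition stake",
  "shares profit dividend bank stake",
  "acquisition offer bank profit shares",
  "crude oil barrel prices opec",
  "oil prices barrel output opec",
  "crude output barrel refinery oil"
)
group <- rep(c("finance", "energy"), each = 3)

# 1. Term-document matrix of raw counts (one row per word, one column per doc)
tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))
tdm    <- sapply(tokens, function(d) table(factor(d, levels = vocab)))

# 2. TF-IDF weighting: down-weight words that occur in many documents
idf   <- log(ncol(tdm) / rowSums(tdm > 0))
tfidf <- tdm * idf

# 3. Truncated SVD: keep k latent dimensions (the 'LSA space')
k  <- 2
sv <- svd(tfidf)
word_vectors <- sv$u[, 1:k] %*% diag(sv$d[1:k])  # one row per word

# 4. Distinctiveness: difference in mean TF-IDF weight between the groups
diff_w <- rowMeans(tfidf[, group == "finance"]) -
          rowMeans(tfidf[, group == "energy"])
sort(diff_w, decreasing = TRUE)  # positive = finance-leaning, negative = energy

# 5. PCA over the word vectors: each word gets a position on each component
pca <- prcomp(word_vectors, scale. = TRUE)
pc1 <- setNames(pca$x[, 1], vocab)  # word positions on the first component
```

Because the two toy topics share no vocabulary, the sign of `diff_w` recovers the grouping perfectly; on real corpora the separation is noisier, which is why the post takes only the top 15 words at each extreme.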
A high absolute loading tells us that a given group of words is strongly aligned with that component – in other words, that the component captures precisely the distinction between those groups. When one group loads heavily on PC1 while another does not, the first principal component is essentially the axis that separates them.

Reuters Newswire: Finance vs Energy

The first corpus uses two classic newswire collections from the tm package (Feinerer et al., 2008): acq (50 Reuters articles on corporate acquisitions) and crude (20 articles on crude oil markets). Both have been standard NLP benchmarks since the 1980s (Lewis, 1997). The code builds a TF-IDF weighted term-document matrix, reduces it to a 20-dimensional LSA space via truncated SVD, and computes pairwise cosine similarities – a standard measure of how close two word vectors sit, on a scale from –1 (opposite) to +1 (identical) – using LSAfun::Cosine() (Günther et al., 2016). The PCA loadings table and word-vector plot below show the results.

pkgs