A multimodal transformer-based tool for automatic generation of concreteness ratings across languages


Introduction

Concreteness evaluates the degree to which the concept denoted by a word refers to a perceptible entity. The variable first came to the foreground of psychology in Paivio’s dual-coding theory1,2. According to this theory, human cognition operates with two distinct classes of mental representation: (1) verbal representations, which encode linguistic statistical regularities, and (2) mental images, which encode perceived experiences. Opposed to this view is the idea that all content of human cognition emerges in some way from mental “simulations” of experience3. More recent views have argued that both types of mental representation interact with each other, placing more or less emphasis on the encoding of linguistic distributional patterns4 or of contextualized, real-world experience5.

Underlying this ongoing theoretical debate is a continuous stream of behavioral and neurobiological work trying to decipher the mechanisms of human conceptual processing6,7,8. Much of this highly influential work has made use of concreteness ratings collected in large-scale experiments with human participants. One such corpus, created by Brysbaert et al.9, employed over 4000 human raters and contains concreteness ratings for almost 40,000 words; these ratings are widely used in this line of research. For example, one developmental study used them to examine how abstract world knowledge develops in children10. Another, neurobiological study dissociating the cortical networks processing verbs and nouns used the ratings to match verbs and nouns for concreteness (a potentially confounding variable)11. Related work has used these ratings to examine contextual modulation of abstract and concrete encodings in the brain during naturalistic processing12. Finally, a more computationally focused use case is Cunha et al.
who created an algorithm for automatic emoji creation (given a textual prompt) and used the ratings to evaluate whether more concrete text prompts were easier to “emojize”. Overall, the corpus generated by Brysbaert et al. has been cited in more than 2000 peer-reviewed journal papers to date, so it is safe to say that concreteness ratings have become an important tool in cognitive science, neuroscience, psycholinguistics and even computational applications. Considerable efforts have therefore gone into collecting human concreteness ratings (e.g., see Brysbaert et al.9 and Muraki et al.13).

There are several issues with these and related corpora of human concreteness ratings. First, they include only a limited number of words. For example, the 40,000 English words covered in Brysbaert et al.9 amount to less than a quarter of the words currently listed in the Oxford Dictionary. Furthermore, this corpus excludes many compound words and multiword expressions. Muraki et al.13 did go through the effort of collecting multiword ratings, yet compared to the countless possible combinations of multiword expressions, the number of items they collected still appears small. Second, collecting human ratings is costly and time-intensive: Brysbaert et al.9 required an estimated 17,000 hours of human labor. Third, both single-word and multiword databases are rated out of context, yet words are polysemous and known to have different meanings in different contexts; it would be impossible for any number of human raters to cover all of these. Finally (and likely due to the cost of collecting data), only a few comprehensive concreteness rating corpora exist for languages aside from English and Estonian, all of which are either small14,15,16,17,18 or focused on Western-European languages19. It is a well-known issue that much research in psychology is focused on WEIRD societies (Western, educated, industrialized, rich and democratic).
Therefore, making concreteness ratings easily available in other languages and for research in other types of societies is a desideratum.

For these reasons, an automated method for generating on-demand concreteness ratings, one that could easily transfer from English into other languages, is bound to be a useful tool for current and future research in cognitive science, neuroscience and psycholinguistics. Additionally, the notion of concreteness is gaining significance in semantically oriented natural language processing tasks. For example, Turney et al.20 present a supervised model that exploits concreteness to correctly classify 79% of adjective-noun pairs as having literal or non-literal meaning. Tsvetkov et al.21 exploit both concreteness and imageability to perform metaphor detection on subject-verb-object and adjective-noun relations, correctly classifying 82% and 86% of instances, respectively. Because of the aforementioned benefits of automatically generating reliable concreteness ratings, a few methods already exist.

Previous work

First, Kwong22 used a dependency parsing procedure defined by Johansson and Nugues23 on WordNet24. The underlying intuition was to exploit the regularity exhibited in definitions: the more concrete a word, the more conveniently and convincingly it can be explained with reference to its superordinate concept and distinguishing features. However, this hypothesis lacks independent empirical justification. The results presented by Kwong22 are also limited in scope, covering only a set of 100 nouns. Indeed, the dependency parsing procedure presented is inherently limited to nouns because it takes “genus” as a defining feature of concreteness.
Since only nouns have genera, this method cannot extrapolate to other types of words.

A second method, employed by both Ljubešić et al.25 and Charbonnier and Wartena26, is to use word embeddings from a large model based on global distributional co-occurrence statistics to train a regression model that predicts concreteness scores. Charbonnier and Wartena use “fastText”27 and “GoogleNews”28 embeddings to train a support vector classifier on the Brysbaert et al.9 corpus discussed above. They evaluate their method on a little less than 4000 English words. As part of their predictive features, they selected frequent suffixes to identify word types, the main reason being their assumption that nouns are mostly concrete. This assumption is unwarranted, as many nouns are in fact abstract29. Because their test set includes mostly nouns, including this feature likely boosted performance on their evaluation, and it is therefore questionable to what extent this methodology can extrapolate to other word types.

A third methodology, which includes transfer to a foreign language, is employed by Ljubešić et al.25, who use the “fastText” model architecture trained on Wikipedia dumps with embedding spaces aligned between languages. They combine this approach with a linear transformation learned via singular value decomposition30 on a bilingual dictionary (Croatian and English) of 500 out of the 1000 most frequent English words, obtained via the Google Translate API. They also train a support vector machine on two sets of training corpora. For their “within-language” experiment, they use a training corpus of almost 45,000 English words and evaluate on three separate data sets consisting of concreteness ratings for around 3000 words each. The main point of this methodology is not to report state-of-the-art performance on matching human concreteness ratings, however.
Rather, they conduct the within-language experiment because their use of a linear transformation on bilingual data allows them to also conduct an “across-language” experiment, which adds to the training data a corpus of 3000 Croatian words (which they themselves created in an online experiment for the purpose of their study). Their results suggest that the linear transformation successfully transfers concreteness ratings from one language to the other with a loss of roughly 15% accuracy.

While the possibility of easy and fairly lossless transfer of concreteness ratings between languages using word embeddings is certainly intriguing, the use of embeddings based on co-occurrence statistics has two important drawbacks, which may well have a negative impact on performance. It is well known in the cognitive and neuroscience literature on conceptual processing that abstract words vary significantly in meaning depending on their linguistic context31,32. However, embeddings from models based on distributional co-occurrence statistics are static and cannot account for contextual variance. Thus, embeddings for abstract words in particular are unlikely to be a good feature for any classifier (regression-based or otherwise). This latter point is also illustrated by results in Hill and Korhonen33, who used an approach based on distributional co-occurrence statistics to predict concreteness ratings and found that it performed significantly worse for abstract words.

A fourth interesting approach to predicting concreteness ratings is presented in Haagsma and Bjerva34, who, rather than relying on the richness of embeddings from modern language models, use the linguistic concept of “selectional preference” as a feature for predicting concreteness. Selectional preferences are defined as the tendency of predicates to impose semantic restrictions on the realizations of their complements, i.e., on co-occurrence in a syntactic predicate-argument relationship35.
For example, the word “eat” places the selectional preference of an edible thing as its direct object (if not used metaphorically, sentences that violate this principle, such as “I eat my education”, are nonsensical). Selectional preference may be a good feature for predicting concreteness because abstract words are likely to co-occur with predicates that impose weaker selectional preferences (for example, “think about x”, where x could be almost anything). This intuition is closely connected to the fact, discussed above, that abstract concepts are more variable than concrete ones. However, this method suffers from a similar flaw as the distribution-based approaches reviewed above: the selectional preference of a word does not change based on context, so the method inherits the problems of static embedding methods, namely the inability to capture the semantic variability of abstract words in different contexts. Indeed, when presenting their results, the authors admit that the accuracy of their predicted concreteness ratings drops significantly for abstract words.

The best-performing methods for automatically generating concreteness ratings so far remain limited. Ljubešić et al.25 achieved correlations of 0.72 with human ratings when tested on an extended set of several thousand English words, with even poorer performance (r = 0.61) for cross-lingual transfer. Selectional preference approaches34 achieve correlations of only 0.68. Finally, while Charbonnier and Wartena26 fared better correlation-wise, reaching a ceiling of around 0.90, their approach was limited to nouns.

We submit that these shortcomings stem from a fundamental limitation of current approaches: they fail to incorporate key insights from cognitive science about how different types of words are grounded in human experience.
Research has shown a crucial distinction: while concrete words are primarily grounded in sensory-perceptual experience, abstract words tend to be grounded in emotional and introspective experience8,36 (but see ref. 37 for a contrasting account). For example, the meaning of concrete words like “table” or “pencil” is largely determined by their visual-sensory properties, while abstract concepts like “freedom” or “anxiety” derive their meaning primarily from emotional and experiential associations. Current methods based on distributional statistics or selectional preferences cannot capture either type of grounding: they neither incorporate the visual-sensory information crucial for concrete concepts nor the emotional content essential for abstract ones. This disconnect from how humans actually represent and process different types of concepts significantly limits these methods’ ability to capture the full semantic depth of concreteness ratings. An optimal approach would need to incorporate both types of information, creating an embedding space that captures both the visual-sensory grounding of concrete words and the emotional grounding of abstract ones.

Our approach

To overcome the limitations of current methods, we leverage recent advances in multimodal, transformer-based architectures that offer richer, context-sensitive language representations. Transformer models learn to predict each subsequent token in a sequence by attending selectively to other tokens, enabling them to dynamically incorporate linguistic context into their embeddings38. Through this attention mechanism, transformers can, for example, differentiate the meaning of homographs like “bank” across distinct contexts (whether it refers to a financial institution or a scheduled public holiday), an ability that static embedding models lack39,40,41.

Modern multimodal transformers push these capabilities further by integrating both textual and visual information into a unified embedding space42.
Such models are trained on large-scale image-text datasets43, allowing them to capture not only distributional statistics from language but also visual-semantic relationships, including object appearance and contextual cues. Empirical evidence indicates that incorporating visual features enhances the predictive accuracy of semantic models, particularly for concepts tied closely to perceptible attributes33. These multimodal embeddings thus hold substantial promise for capturing the concreteness continuum, from purely abstract notions to vividly tangible concepts.

Yet an additional dimension, emotional grounding, is critical for modeling abstract concepts effectively. While concrete words often derive meaning from visual-perceptual features, abstract words may rely more on emotional and affective associations8,36. To incorporate this dimension, we build upon newly available large-scale datasets of emotionally annotated images44. By fine-tuning a multimodal model like CLIP (Contrastive Language-Image Pretraining) on this affective data, we can enrich the model’s representation space to reflect both the perceptual grounding of concrete words and the affective information essential for abstract ones45.

In sum, our approach directly addresses the shortcomings of static, distribution-based methods by combining four key elements: (1) transformer-based contextual embeddings for dynamic, context-sensitive semantic representations; (2) multimodal training to integrate visual information that underpins concreteness; (3) emotion-aware fine-tuning to capture the affective content often central to abstract meanings; and (4) zero-shot generalization to new languages and expression types.
As shown in the following sections, this integrated, context-sensitive, multimodal, and emotionally grounded framework yields high predictive accuracy, with correlations exceeding 0.90 for single English words, 0.85 for English multi-word expressions, and a robust 0.68 for an entirely distinct language (Estonian; r = 0.80 after post-hoc item exclusion). These results mark a substantial step forward in automated concreteness rating generation.

Methods and data

We developed an approach to automatically generate concreteness ratings by leveraging recent advances in multimodal transformers and emotion-aware language models. The tool can be freely accessed at concreteness.eu. Our methodology comprises four main components: (1) a dual-embedding model architecture that combines visual-linguistic and emotional information, (2) a training procedure utilizing large-scale human-annotated datasets, (3) comprehensive evaluation metrics to assess prediction accuracy, and (4) a general prediction system capable of generating reliable concreteness ratings across both single words and multi-word expressions in multiple languages. Figure 1 shows the pipeline for our approach, which is detailed in section “Model architecture” below.

Fig. 1: System architecture for generating concreteness ratings. The process begins with input text (a single word, multi-word expression or sentence) undergoing language detection. For non-English inputs, a cross-lingual path uses translation and cleaning steps before embeddings are generated. Embeddings from CLIP and CLIP-Emotion are integrated using a deep regressor to produce the final concreteness score. If the input is a sentence, a concreteness rating is generated for each word in the input.

Model architecture

We developed a model that integrates multimodal transformers with emotion-aware language models through a dual-embedding approach, aiming to enhance the prediction of concreteness ratings for words.
The architecture comprises three main components: a base visual-language model, an emotion-aware language model, and a deep regressor (explained below) that combines embeddings from both models.

The base visual-language model is the Contrastive Language-Image Pre-training (CLIP) model42, which is designed to learn joint representations of images and text by aligning them in a shared embedding space. Specifically, CLIP comprises two main components: a transformer-based text encoder, which processes textual inputs using a standard transformer architecture, and a vision encoder, which can be either a ResNet or a Vision Transformer (ViT). In our implementation, we used the ViT-B/32 variant for the vision encoder, which utilizes a Vision Transformer architecture with an image patch size of 32. CLIP was trained on ~400 million image-text pairs sourced primarily from publicly available, web-scale datasets42.

To incorporate emotional context into our model, we fine-tuned CLIP on the Affection dataset44, a collection of 85,007 emotionally annotated images and 526,749 emotional reactions from 6283 unique participants. The dataset includes images sourced from various public repositories, such as the International Affective Picture System (IAPS)46, and each image is associated with one or more emotional labels. The fine-tuning process presented the text encoder of CLIP with the emotional labels as text inputs alongside the corresponding images, employing a contrastive loss similar to the original CLIP training but focused on aligning images with their emotional descriptions. This emotion-aware variant of CLIP allows the model to capture emotional information in language and imagery.
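The CLIP-style contrastive objective used in this fine-tuning step treats each image and its emotional label as a matched pair, pulling matched pairs together in the embedding space while pushing mismatched pairs apart. The following is a minimal NumPy sketch of the symmetric contrastive loss, not the authors' implementation: the function and variable names are illustrative, the temperature value is the one commonly used in CLIP-style training, and the embeddings here would in practice come from the vision and text encoders (512-dimensional for ViT-B/32).

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss in the style of CLIP.

    img_emb, txt_emb: (batch, dim) arrays where row i of each forms a
    matched image/emotion-label pair. Embeddings are L2-normalized, a
    cosine-similarity matrix is built, and cross-entropy is applied in
    both directions so matched pairs score higher than mismatched ones.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (batch, batch) similarities
    targets = np.arange(len(logits))         # i-th image matches i-th text

    def cross_entropy(l, t):
        # numerically stable log-softmax over rows
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(t)), t].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, targets)
                  + cross_entropy(logits.T, targets))
```

Minimizing this loss over image/label batches is what aligns images with their emotional descriptions; correctly paired batches yield a much lower loss than misaligned ones.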
We compared performance metrics for CLIP, CLIP-Emotion and a combination of both (see Results) and found that combining both embeddings enhanced the model’s ability to capture concreteness perception.

The deep regressor is a neural network designed to combine the embeddings from the base CLIP model and the emotion-aware CLIP model to predict the concreteness of words. The input layer receives the concatenated embeddings, resulting in a combined feature vector of 1024 dimensions (512 from each model). This is followed by two fully connected hidden layers with 128 and 64 units, respectively, each using ReLU activation functions to capture complex nonlinear relationships. Dropout regularization with a rate of 0.2 is applied after each hidden layer to prevent overfitting. The output layer consists of a single unit with a linear activation function that predicts the concreteness score for a given word.

Training procedure

We used the concreteness ratings dataset compiled by Brysbaert et al.9, which provides human-annotated concreteness ratings for 37,058 English words and 2896 two-word expressions. Each word was rated on a 5-point Likert scale, where 1 indicates high abstractness and 5 indicates high concreteness. To prepare the dataset for training, we performed data cleaning to remove entries with missing values or non-standard characters, while homographs and polysemous words were retained to maintain linguistic diversity. The dataset was randomly split into a training set (N = 36,058) and a test set (N = 1000). Stratified sampling was used to maintain the distribution of concreteness ratings in both sets, ensuring that both concrete words (ratings ≥ 2.88) and abstract words (ratings