Introduction

The acronym CEFR stands for the Common European Framework of Reference for Languages. The CEFR is a general framework developed by the Council of Europe and designed to provide a universal basis for outlining language learning objectives and outcomes (Council of Europe, 2001). Since its inception, the CEFR has grown to serve as an important resource in the area of language education. Although its uptake has not been exactly global or uniform, it is now used to inform language policies and teaching practices in a number of countries (Negishi et al., 2013; Hiranburana et al., 2017; Yu, 2017; Foley, 2019; Savski and Prabjandee, 2022; Soliman and Familiar, 2024). The CEFR approach is action-oriented: it relies on so-called ‘can-do’ descriptors. As such, it has informed curricula, teaching, learning, and assessment in a wide variety of educational contexts, and, as a consistent framework, it facilitates their mutual alignment. Its recent update includes a greater emphasis on plurilingualism (Little and Figueras, 2022).

The CEFR framework is not without controversy. It has received some criticism, for example that its top-down approach has led to confusion and misconceptualizations, although these are being addressed through efforts to develop practical guidelines for CEFR-informed learning, teaching, and assessment (Weir, 2005; Alderson, 2007; Widdowson, 2015; Schmidt et al., 2019).

One of the practical outcomes of the CEFR framework is graded vocabulary lists. These are lists of words—normally lemmas rather than word forms—graded with CEFR levels, which signal their appropriateness for the different levels of language learner proficiency. Such graded vocabulary lists can then inform the creation of teaching and learning materials such as textbooks, readers, tests, etc. They can also provide useful material for language-learning apps, used with or without teacher supervision. It should not go unmentioned that, like the entire CEFR framework, CEFR-graded vocabulary lists are sometimes considered problematic. For example, authors have pointed out the incompleteness of the lists and the occasional use of higher-level words in teaching materials intended for lower learner levels (Graën et al., 2020). Another issue mentioned is the extraction, selection, and ranking of multi-word expressions (Laposhina et al., 2024).

While CEFR-graded word lists may be useful in many ways, their practical application is limited because they are often non-transparent in terms of how they were created, or not available for re-use due to licensing restrictions. This seems like a missed opportunity to use a comprehensive trans-national framework to the full benefit of language learners worldwide. In view of this, we propose an approach to supplementing CEFR levels for vocabulary items by imputing CEFR levels for new words based on an existing (but incomplete) vocabulary list, and additional word-level data on dictionary views, corpus frequency, part of speech, and polysemy. We test this approach for the English language, utilizing the widely used English Vocabulary Profile (Cambridge University Press, 2015; Capel, 2015).

Relevant work relating CEFR vocabulary levels to other variables

The frequency with which words appear in texts (spoken as well as written) is a basic factor that determines how often speakers encounter these specific words.
Such repeated encounters, in turn, translate into the strength of lexical representations in the brain through a process of neural reinforcement (as shown by, i.a., Shtyrov et al., 2011, who demonstrate that the amplitude of brain responses is modulated by word frequency). Words that are encountered more often tend to be more salient and thus more ‘important’ (Laurence, 2021, 108). Frequent words are generally acquired earlier (Coffey et al., 2024), taught earlier (Baumann and Sekanina, 2022), and — more relevant in the context of the present study — tend to occupy the more basic (lower) CEFR levels (Capel, 2012). Linguists learn about textual frequency by counting word occurrences in corpora, which these days are large digital collections of text.

Text frequency is a well-established measure of word relevance on the ‘supply’ side, as it were. Another view of word relevance is to approach it from the ‘demand’ side — for example, by examining how often words are looked up in a dictionary. Information of this type may be culled from those dictionaries which collect and make available systematic records of user visits. As it happens, not many dictionaries meet these conditions, but for English a good option exists in the form of the English Wiktionary, a free and very large crowdsourced dictionary (Footnote 1). Comprehensive logs of user visits are available from the Wikimedia Foundation (for details, see Lew and Wolfer, 2024b).

The latter work looked at entry views from English Wiktionary logs and explored the degree to which interest in words, as evidenced in dictionary views, can be accounted for by a number of theoretically motivated lexical variables: frequency, age of acquisition, word prevalence, and the word having more than a single sense (= polysemy). It turns out that lexical variables can usefully predict user interest in words, as revealed through the tendency to look them up in the dictionary. Recent work (Lew and Wolfer, 2024a) also added the CEFR vocabulary level to the mix, and found that CEFR levels (though obviously correlated with corpus frequency as well as some other variables) carry predictive power above and beyond the other lexical variables. It is this finding that inspired the present idea: if dictionary views for English words can be usefully predicted using their CEFR levels, why not exploit this relationship to predict CEFR levels from dictionary views, with extra help from a few other easy-to-obtain lexical variables?

The value of such an imputation procedure would lie (on top of theoretical insights into how lexical information is related) in opening up an opportunity to expand existing CEFR-graded vocabulary lists, which are typically quite incomplete, even for the major languages. At the same time, the input variables for the imputation should be easy to obtain if they are to be available for vocabulary items outside the existing CEFR lists. With this in mind, in the present study we have opted not to use (unlike Lew and Wolfer, 2024a) data on prevalence or age of acquisition, because such data are difficult to obtain and, where available at all, cover only a relatively small set of words. We have, however, included word frequency information, views in the English Wiktionary, polysemy status, and whether the word is a verb, a noun, an adjective, or a function word.

Research questions

RQ1: In general terms, our main research question was to verify whether the approach outlined above was feasible, using English as a test case.
At a minimum, a useful machine learning model should perform considerably better than a random baseline.

RQ2: Should machine-learning models perform above the random baseline (RQ1), the second research question would be to see whether there are considerable differences in performance between the individual models.

RQ3: In the next step, we wanted to see whether one of the best-performing machine learning models from the previous step would return plausible CEFR levels for words that are not in the initial CEFR-graded wordlist used for training and testing the models.

The performance-related questions, RQs 1 and 2, are of a quantitative nature and can be addressed numerically, utilizing existing CEFR lists and appropriate train-test splits. There is, however, also a more qualitative question (RQ3), and an important practical one: do the imputed words, along with their predicted CEFR levels, actually make linguistic sense?

Materials and Methods

Lexical frequency information came from the SUBTLEX-US frequency list (see Brysbaert et al., 2018; Brysbaert and New, 2024). Further, we sourced information on user views in the English Wiktionary from Wikimedia logs (Wikimedia Foundation, n.d.). In doing so, we followed the procedure documented in detail in Lew and Wolfer (2024a). Polysemy information was based on the number of senses in the same English Wiktionary. In addition, we noted the syntactic classes (popularly known as parts of speech, PoS) listed for Wiktionary words, and entered four PoS-based binary indicator variables, depending on whether a word was listed (not necessarily exclusively) as a verb, a noun, an adjective, or a function word (Footnote 2).

CEFR-graded English vocabulary list

For English, we were able to locate two respectable and sufficiently large CEFR-rated vocabulary lists: the English Vocabulary Profile (Capel, 2012, 2015; Cambridge University Press, 2015) and the Oxford Learner’s Word Lists (Oxford University Press, n.d.). Due to copyright restrictions, we were unable to use the latter. In view of this, we adopted the English Vocabulary Profile (EVP) as a source of CEFR levels for English words in this research. In a minority of cases, the wordlist included multiple senses with varying CEFR levels. In such cases, we selected the lowest (i.e., most basic) CEFR level, because the senses so marked tend to be the most frequent and essential ones, and are thus likely to account for the largest portion of dictionary consultations for the respective lemma. Another, more technical, reason to assign one CEFR level per word (rather than per sense) is the structure of the corpus frequency and Wiktionary look-up data: both are available at the word level. For the Wiktionary look-up data, this in turn is determined by the website structure of the English Wiktionary itself. Basically, look-up data is pageview data, and one page in the English Wiktionary corresponds to one word listing (potentially) multiple senses for that word—a standard arrangement in many online dictionaries.

The CEFR-level data in the EVP are not perfect—as no language-teaching product is, be it human-made or automatically generated. For example, the two spelling variants of the lemma colo(u)r, treated as both nouns and verbs, reveal an inconsistency in the assignment of CEFR levels. The North American spelling color is listed as A1 for both the noun and the verb, whereas the UK/Australian variant colour has A1 for the noun, but only B1 (i.e., two grades higher) for the verb. Given that the EVP list originated in Cambridge, UK, it is unlikely that the UK spelling of the verb should be deemed that much less important than the US one. Rather, our best guess is that the difference may have arisen in the course of data processing, where the US spelling was added as an afterthought in an automated process, copying the CEFR level from the UK variant. However, this process may have missed the CEFR-level distinction between the noun and verb uses in the original UK spelling.
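For illustration, the data preparation described above (one CEFR level per lemma, plus word-level feature variables) can be expressed in a few lines of R. This is only a sketch, not the code used in the study; it assumes hypothetical data frames evp (one row per sense, with columns word and cefr) and feats (one row per word, with frequency, view and sense counts), and an unspecified log transformation:

library(dplyr)

cefr_order <- c("A1", "A2", "B1", "B2", "C1", "C2")

# One CEFR level per lemma: keep the lowest (most basic) level listed for any sense
evp_word <- evp |>
  mutate(cefr = factor(cefr, levels = cefr_order, ordered = TRUE)) |>
  group_by(word) |>
  summarise(cefr = min(cefr), .groups = "drop")

# Word-level predictors; the PoS indicator columns are assumed to exist in feats already
model_data <- feats |>
  mutate(log_freq   = log10(freq_per_million + 1),  # SUBTLEX-US frequency (transformation assumed)
         log_views  = log10(views + 1),             # Wiktionary page views
         polysemous = n_senses > 1) |>              # one vs. more senses
  left_join(evp_word, by = "word")                  # missing cefr = candidate for imputation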
Method

Our study uses machine learning (ML) methods (Footnote 3). We present various ML algorithms (or models, as we refer to them below) with a classification task. In general terms, ML classifiers have the task of predicting labels (in our case, the CEFR level) for a set of items (in our case, words). To do so, the models use so-called features (also called feature variables or predictors). In our case, these feature variables are the corpus frequency of the word, the number of views in the Wiktionary, the number of meanings (one vs. more meanings), and the four PoS indicator variables. Note that this is a rather parsimonious set of feature variables; we explain in more detail why we chose such a parsimonious approach in the Limitations section.

We used the 5778 words with CEFR-level information in the EVP to create 25 train-test splits by repeated random sampling, with each split comprising 80% training data and 20% testing data. Each train-test split contains all 5778 CEFR-graded words, but these are distributed differently over the training and test sets. We created several of these train-test splits because some splits might be better suited than others to the classification task at hand. For each of these train-test pairs, we employed a diverse set of machine-learning models to predict the CEFR levels based on the known lexical features, as listed in the following section. Hence, we end up with 25 classifications per model that differ only in the underlying train-test split. As we describe in more detail below, we let the models classify each lexical item into three coarser CEFR levels (A/B/C) and then into five levels (A1/A2/B1/B2/C). This increases the total number of classifications per model to 50. One might wonder why we conflated C1 and C2 into a single level in the five-level distinction. This was a pragmatic decision: the C1 and C2 levels are relatively poorly represented in the EVP (each holding fewer vocabulary items than either of the B levels).

Model classes investigated

All computations and analyses were carried out using the statistical software environment R (R Core Team, 2024). A random baseline serves as a control to benchmark the performance of the more sophisticated machine learning models. The random baseline assigns CEFR levels to the words based on their distribution observed in the training set. We tested and compared the following machine-learning models and their versions (a brief code sketch illustrating how such models can be fitted follows the list):

1. Regression Trees: using the rpart package (Therneau and Atkinson, 2023), we constructed regression trees with the following predictors: log frequency, polysemy (binary, true/false), log Wiktionary views, and four binary part-of-speech indicator variables, each marking whether a word is tagged as a verb, noun, adjective, and/or function word in the English Wiktionary.
A given word can have more (or fewer) than one such part-of-speech indicator at the same time, if it features in more than one syntactic class.

A regression tree recursively splits the training data into subsets based on values of the feature variables (corpus frequency, number of views in Wiktionary, number of senses, and part-of-speech indicator variables) that best separate the target variable (CEFR level). The aim of the tree is to reduce uncertainty or ‘impurity’ in the data at each split. In the test (and subsequently the imputation) phase, the resulting tree uses the decision rules established during the training phase to assign a CEFR level to each word. Note that no explicit interaction terms between feature variables are specified in this kind of regression tree.

2. Ordinal Logistic Regression: here we fitted ordinal logistic models using the MASS package (Venables and Ripley, 2002) with the same predictors as above, but also two-way interactions between log frequency and polysemy, log frequency and log views, and polysemy and log views, as well as a three-way interaction between the three predictors.

Ordinal logistic regression models the relationship between feature variables and CEFR levels by predicting the probability of a word belonging to a certain level. The ordered nature of the levels is taken into account. In the training phase, the model “learns” threshold values for each level (e.g., A2) representing the point at which a word is more likely to belong to a neighbouring category (e.g., level A1 or B1). A trained model can then be used to categorise words it was not trained on (test phase) or to predict the CEFR level of words for which the model only has information about the feature variables (imputation phase).

3. Simplified Ordinal Logistic Regression: a variant of the Ordinal Logistic Regression models without any interactions and without the part-of-speech predictors. This simplified version of 2. functions in the same way but uses less information (no part-of-speech indicators) and assumes fewer relationships between the remaining feature variables (no interactions).

4. Random Forests with 1000 trees: we used the randomForest package (Liaw and Wiener, 2002) to develop a model with 1000 decision trees and the same predictor structure as for the regression trees (in 1. above).

Random Forests extend the idea of regression trees (1. above) by growing multiple (1000) decision trees and combining their predictions to improve accuracy and robustness. Each tree in the forest is trained on a random subset of the training data, and at each split, a random subset of feature variables is considered. This procedure is intended to avoid overfitting. The forest then aggregates the predictions from all trees to determine the most likely CEFR level by majority voting. Just as for regression trees, no explicit interaction terms between feature variables are specified in Random Forests.

5. Random Forests with 2000 trees: an extension of the smaller Random Forest, with 2000 decision trees and an identical predictor structure.

6. Naïve Bayes: finally, we fitted Naïve Bayes models using the e1071 package (Meyer et al., 2024), again with the same predictor structure, without interactions.

Using Bayes’ theorem, these models estimate the likelihood of a word belonging to each CEFR level based on how frequently the word’s features appear in the training data for that level, assuming all features are conditionally independent, i.e., each feature variable is considered independently from the others. This approach is computationally efficient but can lead to inaccuracies if the features are in fact dependent on each other. The model assigns the CEFR level with the highest posterior probability.
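To make the model classes above concrete, the following sketch shows how they might be fitted in R on one training split. It is not the authors’ original code: the data frame train, the ordered-factor response cefr3, and the predictor names are hypothetical (with the indicator variables coded as 0/1).

library(rpart)          # classification trees
library(MASS)           # polr() for ordinal logistic regression
library(randomForest)   # Random Forests
library(e1071)          # naiveBayes()

base_formula <- cefr3 ~ log_freq + log_views + polysemous +
  is_verb + is_noun + is_adj + is_func

# 1. Classification tree
m_tree <- rpart(base_formula, data = train, method = "class")

# 2. Ordinal logistic regression with interactions among frequency, views and polysemy
m_olr <- polr(cefr3 ~ log_freq * log_views * polysemous +
                is_verb + is_noun + is_adj + is_func,
              data = train, Hess = TRUE)

# 3. Simplified ordinal logistic regression: no interactions, no PoS indicators
m_olr_simple <- polr(cefr3 ~ log_freq + log_views + polysemous,
                     data = train, Hess = TRUE)

# 4. and 5. Random Forests with 1000 and 2000 trees
m_rf1000 <- randomForest(base_formula, data = train, ntree = 1000, importance = TRUE)
m_rf2000 <- randomForest(base_formula, data = train, ntree = 2000, importance = TRUE)

# 6. Naive Bayes
m_nb <- naiveBayes(base_formula, data = train)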
Results

Model performance evaluation

Each of the models listed above (under “Model classes investigated”) was tasked with predicting the CEFR levels using the training sets, which is a classification task. This was always done twice: once for a broader three-level distinction (A/B/C), and again for a more fine-grained five-level distinction (A1/A2/B1/B2/Cx, where Cx stands for the collapsed levels C1 and C2). We then compared the test-set performance of each model class, expressed as the proportion of test-set items predicted correctly by the model: its prediction accuracy. The whole exercise was repeated for each of the 25 train-test splits. The outcome of the comparison is shown in Fig. 1 (for three-level distinctions) and Fig. 2 (for five-level distinctions). Additional descriptives for both distinctions are available in Table 1, and the raw values for all train-test splits (including timings for the calculations) are available in the Supplementary Material. The comparison revealed that the Naïve Bayes models performed consistently worse than the other model classes (Footnote 4). Although all other model classes differ very little in their average accuracy, Random Forests with 2000 trees are, on average, the most accurate models for both 3- and 5-level classifications. In order not to overcomplicate the presentation of our analyses, we have therefore decided (as suggested by one anonymous reviewer) to use only the results from Random Forests with 2000 trees in what follows. In addition, in the Supplementary Material we also show an ‘ensemble approach’ analysis, where we consider the classifications of all models (except the underperforming Naïve Bayes models).

Fig. 1: Model comparison for three-level distinctions. Each black dot represents one result for one of the train-test pairs. The lines connect models based on the same train-test pair. Red points (and the labels) represent the mean accuracy over all train-test pairs for the respective model. Random baseline performance is given in the title.

Fig. 2: Model comparison for five-level distinctions. Plot organization is equivalent to Fig. 1.

Table 1: Model accuracy descriptives for 3-level (columns 2 to 4) and 5-level (columns 5 to 7) distinctions over all train-test splits.

To evaluate whether the machine learning models were at all useful, we compared their performance with the random baseline. Note that the random baseline does not simply perform with an accuracy of 1/3 (for the three-level distinctions) or 1/5 (for the five-level distinctions), because the CEFR levels are not equally distributed over the test sets: the computation of the random baseline uses the empirical relative frequencies of the CEFR levels as sampling probabilities, and some levels are more frequent than others. In fact, for the three-level distinctions, the random baseline performance lies (on average) at 0.366 (as indicated in the title of Fig. 1), and for the 5-level distinctions at 0.228 (as shown in the title of Fig. 2). To contrast the random baseline with the machine learning models, we compared the model that performed best on each train-test pair with the random baseline for this pair.
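As a minimal illustration of this baseline and of the accuracy comparison (for a single 80/20 split; the study repeats this over 25 splits), again using the hypothetical objects model_data and base_formula from the sketches above:

set.seed(1)

graded <- subset(model_data, !is.na(cefr3))       # words with a known CEFR level
idx    <- sample(nrow(graded), size = round(0.8 * nrow(graded)))
train  <- graded[idx, ]                            # 80% training data
test   <- graded[-idx, ]                           # 20% test data

# Random baseline: sample levels according to their distribution in the training set
level_probs  <- prop.table(table(train$cefr3))
baseline     <- sample(names(level_probs), size = nrow(test),
                       replace = TRUE, prob = level_probs)
acc_baseline <- mean(baseline == test$cefr3)

# Fit the model of interest on this split and evaluate it on the same test set
m_rf   <- randomForest(base_formula, data = train, ntree = 2000, importance = TRUE)
acc_rf <- mean(predict(m_rf, newdata = test) == test$cefr3)

acc_rf / acc_baseline                              # gain over the random baseline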
For the 3-level distinctions, the best model performed (on average) 1.77 times better than the random baseline. For the 5-level distinctions, the gain vis-à-vis the baseline, as measured by the average ratio, is higher, at 2.10. This means that the best machine learning model performs, on average, at about double the accuracy of the random baseline.

Following the suggestion of an anonymous peer reviewer, we decided to use a Random Forest with 2000 trees as our imputation model to reduce complexity (but we present an ensemble approach in the Supplementary Material). For this model class, there are 25 models for each of the 3-level and 5-level splits. These models differ in the underlying train-test splits. We use the trained model that shows the highest accuracy in the generalisation to the test set (in Figs. 1 and 2, these particular models are represented by the highest-value data points in the column labelled “Random Forest (2k trees)”). For the three-level distinction, the accuracy of the best Random Forest with 2000 trees is 0.657 (1.79 times better than the random baseline for this particular train-test split); for the five-level distinction, it is 0.489 (2.02 times better than the random baseline). Using these two models, we impute the CEFR levels for all 24,069 words from our dataset that had no associated CEFR-level information in the EVP list. The imputation results were saved as probability values. Thus, each of the 24,069 words received three probability values, one for each CEFR level on the three-level scale, and another five probability values for the five-way distinction. The level with the highest probability was then selected as the final imputed CEFR level.

Having arrived at these imputed CEFR levels, let us next examine how the different values of the input variables in the models translate into the imputed CEFR levels. We shall do that for the three-way distinctions first, and then for the five-way distinctions.

Distribution of input variables over imputed CEFR levels

In what follows, examining the role of the input variables in arriving at specific best CEFR levels in our imputation system, we shall use complex multi-panel profile plots. As there are as many as seven such input variables, we shall consider them in two portions, first for the three-way imputation and then for the five-way imputation.

Three-way imputation

In Fig. 3, we show the values of corpus frequency, Wiktionary views, and number of senses associated with imputed CEFR levels A, B, and C in the three-way imputation. Note the logarithmic scale in Panels A and B. First, in Panel A, the distribution of corpus frequencies is shown (as boxplots with the usual settings) for the words classified as A, B, and C. It will be seen that the three CEFR levels are associated with rather distinct ranges of frequencies. Predictably, the most basic level A is associated with the highest word frequencies (with a median of 84.9 per million words), while C corresponds to the lowest corpus frequency ranges (median of 0.18 per million words). Level B is found in between the two extremes, with a median of 3.9 occurrences per million running words. The differences between the levels are thus quite striking.

Fig. 3: Frequency/views/polysemy distributions for three-level distinctions. Distribution of corpus frequency (A), Wiktionary views (B), and polysemy (C) for imputed CEFR levels, three-level distinctions.
Note the logarithmic scale in (A, B).

Panel B presents a similar picture, here showing the distributions of Wiktionary views over the three imputed CEFR levels. Again, as for corpus frequency, more Wiktionary views are associated with more basic levels, and the boxes (by convention, each box contains half the observations) are still separated, but the spread of values within the levels is rather greater than for corpus frequency, and the ranges are closer to each other, with more overlap. Indirectly, this suggests that of these two factors, corpus frequency had more of an impact on the choice of imputed level. We will test this more directly in the next section, where we assess the importance of the predictor variables.

In Panel C, the relationship between polysemy (i.e., whether a word has one or more senses) and imputed level is shown. It will be seen that nearly all CEFR-level A and B words are polysemous (94.1% and 94.6%, respectively), compared to only about two-thirds (65.1%) in the imputed C category. One thing to note in terms of the efficiency of our imputation approach is that, since the majority of the 24,069 words are polysemous anyway, this criterion would not be sufficiently discriminating on its own: it has an auxiliary role in the classification. Again, see the next section for a direct assessment of predictor importance.

Moving on to part-of-speech information in Fig. 4, we note the clear relationship of verb status with imputed CEFR level. Verbs make up 77% of imputed A-level words, compared to 68% of B-level words and 23% of C-level words (note that a word may have more than one syntactic class label). By contrast, no such clear relationship emerges for noun status, adjective status, or function-word status. Noun status (Panel B) only seems to matter for imputed CEFR level C, where about a quarter (27%) of the words are not nouns, as opposed to roughly one-tenth for level A, while for level B some 95% of the words are nouns. Adjective status (Panel C) appears to be most common at the middle level B, and less so at A or C. Function words are a closed lexical class and thus their numbers are limited, but with this in mind, we can see in Panel D that function-word status is strongly predictive of level A.

Fig. 4: Part-of-speech distributions for three-level distinctions. Distributions of the part-of-speech indicator variables (A: verbs, B: nouns, C: adjectives, D: function words) for the imputed CEFR levels, three-level distinctions.

Five-way imputation

The same analysis was repeated for the five-way imputation, in which our models were tasked with arriving at the best CEFR level out of A1, A2, B1, B2, and C (also represented as Cx to remind the reader that it covers C1 and C2). As in the three-way imputation, the more basic CEFR levels tend to be associated with higher corpus frequencies (Panel A in Fig. 5) as well as higher counts for Wiktionary views (Panel B). However, due to the finer distinctions (five levels rather than three), the values are characterized by greater overlap. In terms of the number of senses (polysemy, Panel C), the vast majority of words assigned to A1 through B2 are polysemous (ranging from 94.7% for B2 to 97.1% for A2). At the other extreme, only two-thirds of level Cx words are polysemous.

Fig. 5: Frequency/views/polysemy distributions for five-level distinctions. Distribution of corpus frequency (A), Wiktionary views (B), and polysemy (C) for imputed CEFR levels, five-level distinctions.
Note the logarithmic scale in (A, B).

Looking at parts of speech (Fig. 6), A1, B1, and B2 show roughly the same proportion of verbs (Panel A), with an intriguing increase for A2: only around every tenth word in A2 is not used as a verb. Cx clearly sets itself apart from the other levels, with only around a quarter of the words used as verbs. Noun status (Panel B), in turn, does not distinguish well between levels A1, A2, and B2. Here, B1 and Cx group together, with roughly 75% nouns at both levels. Adjectives (Panel C) peak at B2 (as was also the case for B in the three-way imputation). Function words (Panel D) are almost exclusively A1 or A2, which also tallies with the results for the three-way imputation.

Fig. 6: Part-of-speech distributions for five-level distinctions. Distributions of the part-of-speech indicator variables (A: verbs, B: nouns, C: adjectives, D: function words) for the imputed CEFR levels, five-level distinctions.

The importance of input variables

As mentioned above, the distributions of the predictors across the imputed CEFR levels can be used to draw tentative conclusions about the importance of the individual predictor variables when solving the classification task. However, the apparent discriminatory power, e.g., of corpus frequency or Wiktionary views (Panels A and B in Figs. 3 and 5), is only an indirect measure of variable importance. For Random Forest models, though, there are metrics available that assess the importance of predictor variables more directly. Here, we use the mean decrease in accuracy extracted via the importance() function from the {randomForest} R package (Liaw and Wiener, 2002). The intuition behind this measure is to see how much the overall performance of the model decreases when it no longer has the information from a predictor variable available. For this purpose, the respective predictor variable is randomly shuffled, and the accuracy of the model is compared to the unshuffled variant. The higher this difference, the more important the role that the predictor variable plays in the classification task.

Figure 7 shows the importance of the predictor variables and the distribution of importance values over the 25 train-test splits. We highlighted the importance values for the model used in the actual classification task with a red square, i.e., these are the values for the model with the highest accuracy. It is immediately clear that corpus frequency is the most important predictor in the classification task, followed by Wiktionary views. Polysemy (one vs. more senses) is relatively unimportant in comparison. Among the part-of-speech indicators, only the indicator variable for function words stands out to some extent from the others.

Fig. 7: Variable importance for the Random Forest with 2000 trees. Mean decrease in accuracy (importance) of predictors for three-level (A) and five-level (B) distinctions. Each boxplot summarizes 25 values (one value per train-test split). The red square indicates the value of the model used in the classification task (the Random Forest with the highest accuracy). Higher values indicate higher importance.

We want to go one step further and look at the importance of the predictors in the assignment of the individual CEFR levels. The logic of the calculation is the same as for the overall model performance, except that the performance of the model is considered for individual CEFR levels only. The results of this analysis are visualised in Fig. 8.
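Both the overall and the per-level importance values can be read directly off a fitted randomForest object, provided it was trained with importance = TRUE; a brief sketch using the hypothetical model object m_rf from the earlier sketches:

library(randomForest)

# Permutation importance: mean decrease in accuracy averaged across all classes
imp_overall <- importance(m_rf, type = 1)

# The full importance matrix also contains one column per CEFR level,
# i.e., the mean decrease in accuracy when predicting that particular level
imp_full      <- importance(m_rf)
imp_per_level <- imp_full[, setdiff(colnames(imp_full),
                                    c("MeanDecreaseAccuracy", "MeanDecreaseGini"))]
imp_per_level   # rows: predictors, columns: CEFR levels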
Fig. 8: Mean decrease in accuracy (importance) of predictors in the 2000-tree Random Forest model with the highest accuracy (red squares in Fig. 7). Three- and five-level distinctions are distributed over panels (see Fig. 7). The bars within each group distinguish the importance of the respective predictor in the assignment of each CEFR level. Higher values indicate higher importance. Negative values (below the dashed line) suggest that this predictor may not be useful (i.e., it might introduce noise or redundancy that ‘confuses’ the model) when assigning this CEFR level.

Interestingly, we see that although corpus frequency is still the most important predictor overall, this is particularly true for the extreme levels (A vs. C or A1 vs. Cx). The same holds, albeit at a generally lower level, for Wiktionary views. The number of senses, on the other hand, is more important for assigning intermediate levels. For both three- and five-level distinctions, the number of senses even has slightly negative values for the C/Cx level. As indicated in the caption of Fig. 8, this suggests that this predictor increases noise or introduces redundancy into the classification task for the highest level. Intuitively speaking, the models tend to be ‘confused’ by the number of meanings when assigning the C/Cx level. Finally, we see an interesting pattern for the function-word indicator variable. Again, this predictor is important for the most extreme levels: if a word is a function word, C/Cx is dispreferred (see the D panels in Figs. 4 and 6), while the opposite is true for A/A1/A2; for B/B1/B2, function-word status is uninformative.

A look at some candidate words

So far, we have looked in some detail at the imputation procedure as well as the distribution and importance of individual factors in predicting CEFR levels. However, the practical output of this procedure is a list of specific lexical items as candidates to supplement existing CEFR lists (see the Supplementary Material for the complete list). We shall therefore conclude the presentation of results with a closer look at such candidate words that our procedure produced for English. The Random Forest model assigns a probability to each level of each word, indicating how likely it considers this level to be the true level of the word. If the probability mass is shifted considerably towards a certain level, we can say, on an intuitive level, that the model is ‘more confident’ about its classification. In what follows, we will contrast words for which the Random Forest model assigned CEFR levels with high confidence with words where the model was ‘less sure’. We will, again, do so for three- and five-level distinctions.

Figure 9 shows classification probabilities for two words per level, one with a rather ‘high confidence’ classification and one where the overall probability mass is distributed more evenly over the levels (a ‘low confidence’ classification). For “jerk”, the Random Forest model assigned almost equal values to CEFR levels A and B in the three-level distinction (Panel A, bottom row), but the slightly higher value for A (50.0% vs. 49.7%) determined the word’s classification as A. In the five-level distinction, “jerk” has been assigned to B1 (42.4%), again with an only slightly higher value than A2 (40.7%). So, although the switch from A to B1 might seem odd at first, an examination of the underlying probability values shows that “jerk” is clearly a borderline case between levels A/A2 and B/B1. On the other hand, “true” (second row from the bottom in Panel A) has been clearly assigned to level A (94.0%), and the model decides on A1 in the five-level distinction with reasonably high confidence (61.6%).
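Probability values of the kind shown in Fig. 9 can be obtained directly from the fitted forest; a minimal sketch, assuming the hypothetical model m_rf and a data frame new_words holding the feature variables of words without an EVP level:

# Class probabilities for each candidate word (one column per CEFR level)
probs <- predict(m_rf, newdata = new_words, type = "prob")

# The imputed level is the one with the highest probability;
# that probability can also serve as a rough 'confidence' value
imputed    <- colnames(probs)[max.col(probs)]
confidence <- apply(probs, 1, max)

head(data.frame(word = new_words$word, imputed, confidence))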
Fig. 9: Probability values for selected example words. A: three-level distinction, B: five-level distinction. The final imputed CEFR level is indicated in parentheses and always corresponds to the highest probability (printed in bold).

Another interesting pair of words is “nonce” and “heartfelt”. The latter is a very clear case of a word classified as C/Cx. The former, however, only just gets imputed as C in the three-level distinction model (36.0% vs. 35.2% for B and 28.8% for A). The five-level distinction model is more confident about its classification as Cx (51.6%), but the relatively high probabilities assigned to A1/A2 (29.5% in total) are still noticeable. As we established in the previous section, corpus frequency and Wiktionary views are the most important variables in the imputation process. Interestingly, “nonce” was looked up quite often (306,359 times) in our observation period, approximately 40 times more often than “heartfelt” (7884 times). This biases the model towards lower CEFR levels. However, the corpus frequency of “nonce” is so low (0.06 per million words) that it ultimately gets classified as C/Cx. We could say that the model has to integrate conflicting information about “nonce” when assigning its CEFR level. This is not the case for “heartfelt”, hence its high probability values for C/Cx.

This example also shows that the classification probabilities can indeed be worth investigating further if one were actually to use the ML-generated lists to supplement existing CEFR-graded word lists. Such insights can not only lead to a better understanding of why ML models arrive at certain classifications, but also enable targeted human intervention in borderline cases.

Discussion

In this study, we proposed a novel approach to supplementing CEFR-graded vocabulary lists by imputing CEFR levels for words that are missing from existing lists. Leveraging a combination of word-level data for English (corpus frequency, dictionary views, part-of-speech information, and polysemy) and a starting set of CEFR levels from the English Vocabulary Profile, we tested several machine learning models of four general classes (partition trees, Ordinal Logistic Regression, Random Forests, Naïve Bayes) to predict CEFR levels for new list candidates. Our models demonstrated substantial predictive accuracy, consistently outperforming random baselines, which suggests our approach is viable. Since all models (except Naïve Bayes) showed comparable performance on the classification task, we focussed on Random Forests with 2000 trees and tried imputation into three as well as five CEFR levels.

The results confirm that certain lexical features, particularly corpus frequency, are useful indicators of the elusive concept of word “importance”, which can then be translated into a word’s CEFR level. Higher-frequency words are generally associated with lower CEFR levels, corresponding to more basic vocabulary. It is to be expected that corpus frequency plays an important role, as the EVP data used to train the models is also based, at least in part, on “frequency information from first language corpora” (Capel, 2012, 2).
However, it must also be mentioned that frequency information from different sources (e.g., dictionaries and academic word lists) was incorporated into the creation of the EVP on several levels (for a more detailed description, see Capel, 2012). The subtitle corpus frequency measure used in the present study seems to capture these procedures, at least to a certain extent.

Dictionary views also emerged as a strong predictor, although their impact was slightly less pronounced than that of corpus frequency. The inclusion of polysemy and part-of-speech indicators further refined the models, enabling more accurate predictions across different CEFR levels.

Beyond the comparison of performance to baseline guessing, it is relevant to ask if the imputations make good sense linguistically, given what we know about the nature of the lexicon. In that regard, it certainly appears reasonable that the vast majority of function words receive the basic CEFR levels of A/A1/A2. More generally, it makes sense that highly frequent words tend to go into lower levels. Further, the step-wise pattern for verbs in Fig. 4 tallies well with Lyons’s (1977, 438–452) distinction between first-order, second-order, and third-order words, extended to lexicography by Piotrowski (1989, 80–81): nouns, being prototypical first-order words, are the concrete words that most directly relate to the external world and tend to be semantic centres in texts. As such, they would quite naturally dominate the more basic CEFR levels.

It should be pointed out that language-independent factors may also come into play when setting CEFR levels for vocabulary items. In particular, there are vocabulary classes which, although frequent, are generally not deemed appropriate in didactic vocabulary lists, at least for children. These include words that some people find objectionable, such as swear words or racial/ethnic slurs. In English, there are a number of swear words that are very frequent in less formal discourse and very generally known, and yet conspicuously absent from teaching materials, even though one might argue that familiarity with them, at least passive, could be of benefit to (adult) language learners. For example, our procedure returned bastard, bitch, and fuck as A-level words (all A1 for 5-level splits), and moron (B2 for 5-level splits), cock (B1), and jackass (B2) as B-level words. If the output word list is to follow the traditional approach of avoiding such words, then a human expert should vet the word set to exclude them manually. In general, some sort of sanity check by an expert human would be recommended.

Thus, our semi-automatic approach offers a practical solution to the limitations of existing CEFR lists, providing a framework for expanding these lists in a systematic and data-driven manner. However, our findings also reveal the importance of human oversight in the process, particularly in filtering out words that may not be appropriate for educational contexts. We should stress that our approach will not produce perfect wordlists ready for classroom use; however, it is certainly a promising way of adding to the lists words that may have been overlooked in the past.

Limitations and further work

An obvious limitation is the overall performance of the ML models. Although it was consistently and clearly above the random baseline, the accuracy measures are certainly not in a range that would justify immediate productive use in the field of language learning.
Hence, the proposed method may need enhancements before it can be used even in a semi-automatic fashion. An obvious extension would be the inclusion of additional feature variables that could make the classification more accurate. Good candidates here would be indicators of lexical sophistication such as abstractness/concreteness, familiarity, prevalence, or age of acquisition (for an automatic approach to assessing many such variables, see Kyle and Crossley, 2015). For the present study, we deliberately omitted additional lexical sophistication indicators, as they are sometimes difficult to operationalise (especially for lower-resource languages). We initially wanted to use only a restricted set of variables that are relatively easy to obtain as soon as a corpus (for frequency) and a sufficiently large and well-frequented dictionary (for visits, polysemy, and part of speech) are available. From this perspective, our contribution should be seen as a proof of concept that needs to be refined in future research.

The merging of the C levels is another restriction that we had to accept for pragmatic reasons. But even if we were to differentiate between C1 and C2, further distinctions between C-level words might be desirable in an advanced-level EFL (English as a foreign language) classroom, as an anonymous reviewer correctly pointed out. Here, one might want to distinguish between C-level words that are still frequent enough to be potentially pedagogically useful, and the more typical long tail of the frequency distribution containing very rare and often very specialized vocabulary. Whether ML models can distinguish these two types of C-level words, and which feature variables would be decisive, remains a question for future research.

A minor limitation concerns some isolated problems with word frequency figures in SUBTLEX-US. Late in the analysis, it came to our attention that the words don and haven have unexpectedly high frequencies. After some investigation and contacting the creators (Brysbaert, personal communication), we concluded that this irregularity comes from incorrect parsing of the frequent forms don’t and haven’t in the SUBTLEX-US frequency lists. We considered substituting another frequency list, but in the end decided to retain SUBTLEX-US in the interest of interoperability, since this is the resource used in previous modelling of Wiktionary look-up behaviour employing CEFR levels (Lew and Wolfer, 2024a). Another consideration in our choice was that a subtitle corpus typically better reflects more informal language (Levshina, 2017), which makes it pedagogically attractive. To mitigate the don/haven problem, we manually filtered out these problematic items (see Supplementary Material). As far as we could ascertain, there were no other problems of this type.

Having completed the process of CEFR-level imputation for English, the main extension of our work that we envisage would be to apply it to other languages. In assessing the feasibility of undertaking such an effort for a given language, we would recommend that the interested parties start by considering the following two guiding questions:

1. Does the candidate language already have a sufficiently complete and up-to-date CEFR-graded vocabulary list? If not, are potential stakeholders interested in having one? Would it be useful and used?

2. Does the language have the minimum set of data needed to traverse our CEFR-imputation process?
This would include a starting CEFR-graded list (to train the models); a decent-sized Wiktionary, or log files from another dictionary, to check visits, as well as PoS information and polysemy (both of which could be extracted from the dictionary itself); and a corpus-based frequency list.

The present authors are happy to offer opinions or advice to those willing to replicate the process. Further down the line, it would be interesting to compare the imputation accuracies achieved for other languages to that obtained in the present study for English. One detail that may set English apart from other languages is that, because of the unique status of English as an international language in many domains, dictionaries of English are used by many learners of English. However, that might not be the case to the same degree for dictionaries of languages not so widely spoken by non-native speakers. For example, a quick check reveals that for the German Wiktionary, 78% of the look-ups in June 2024 came from Germany, Austria, and Switzerland, whereas only 50% of the English Wiktionary visits came from one of the USA, UK, Canada, Nigeria, and Australia (Footnote 5). Of course, this is not to say that visitors from a German-speaking country may not be language learners, or vice versa, but this difference in the geo-lectal distribution of visits may signal a qualitative difference between English and other languages.

Regarding possible refinements to the imputation process, another source of data that might be considered for inclusion is Wiktionary editing activity. One might hypothesize that words that attract more editor engagement are, other things being equal, of more interest. While our current imputation process uses entry views as a proxy for user interest, active Wiktionary editors tend to be people with a special interest in the lexicon; thus, their interest might be at least partially complementary to the interest of ordinary, passive dictionary users.
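As a rough illustration of how such an editor-engagement measure might be collected, the sketch below counts the revisions of a Wiktionary entry via the standard MediaWiki API. It is merely illustrative and not part of the procedure used in this study; in particular, entries with more than 500 revisions would require paging through the API’s continuation mechanism.

library(jsonlite)

# Count up to 500 revisions of an English Wiktionary entry as a crude proxy
# for editor engagement with that word
count_revisions <- function(word) {
  url <- paste0("https://en.wiktionary.org/w/api.php",
                "?action=query&prop=revisions&rvprop=ids&rvlimit=500",
                "&format=json&formatversion=2&titles=",
                URLencode(word, reserved = TRUE))
  res  <- fromJSON(url)
  revs <- res$query$pages$revisions[[1]]
  if (is.null(revs)) 0L else nrow(revs)
}

count_revisions("heartfelt")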