A hybrid deep learning and fuzzy logic framework for feature-based evaluation of English language learners


Introduction

Natural Language Processing (NLP) is a dynamic and rapidly evolving field at the intersection of computer science, linguistics, and artificial intelligence, dedicated to enabling machines to understand, interpret, and generate human language in meaningful ways1. As the world becomes increasingly digital and interconnected, NLP has emerged as a crucial technology powering applications that range from real-time translation and virtual assistants to sentiment analysis2 and automated content moderation3. Its continuous advancements are not only reshaping how we interact with technology but are also opening new frontiers for research and innovation across disciplines, making human-computer communication more natural, accessible, and intelligent than ever before4. The global prominence of English, both as an academic lingua franca and as the predominant medium of international communication, underscores the critical importance of accurate and effective language evaluation tools5. As educational institutions worldwide increasingly adopt English as the primary language of instruction, the demand for robust, precise, and scalable assessment systems grows correspondingly6. These assessment tools not only streamline evaluative processes but also provide valuable feedback to learners, teachers, and institutions, aiding in curriculum design, instructional strategies, and personalized education planning7. To address this evolving landscape, AI-driven tools are increasingly adopted by educational institutions to perform real-time evaluation and multi-class classification and to provide adaptive feedback to learners8. These advanced tools integrate deep learning techniques, linguistic analytics, and demographic and cognitive profiling, creating comprehensive evaluations tailored to diverse learner populations9.
Teachers and students respond positively to such innovations, as AI-driven assessments offer precise diagnostics of learning challenges and strengths, thus fostering a responsive and engaging educational environment10. Within the specific context of English language learning (ELL), AI models have been effectively employed to evaluate learners across multiple skill areas, such as speaking, writing, reading comprehension, and listening11. Deep learning architectures, including recurrent neural networks (RNN), convolutional neural networks (CNN), and transformers, have shown remarkable capabilities in accurately predicting learner proficiency and identifying key linguistic and cognitive factors that influence language acquisition12. Concurrently, fuzzy logic approaches have demonstrated strength in capturing the inherent uncertainty and subjective aspects of human evaluations, offering nuanced and interpretable assessment models that resonate closely with expert judgments13. Despite these advancements, challenges remain in creating evaluation systems that simultaneously provide accuracy, transparency, and practical interpretability14. Existing AI-based assessment frameworks often exhibit shortcomings such as limited interpretability, inadequate handling of linguistic and demographic variations, and an inability to effectively incorporate human-like reasoning and subjective judgment15. The motivation for this study, therefore, lies in addressing these gaps by proposing a novel, hybrid framework that combines deep learning-based feature ranking and fuzzy logic techniques. This hybrid approach aims to offer robust evaluations that are not only precise but also interpretable and sensitive to diverse learner backgrounds. In this study, we introduce a comprehensive hybrid evaluation framework designed specifically for ELL.
Our approach integrates deep learning-based feature ranking methodologies to identify the most influential linguistic, cognitive, and demographic factors contributing to learner proficiency. Complementing this, we employ fuzzy logic techniques to construct an interpretable evaluation model capable of handling the uncertainty and subjectivity inherent in language assessments. Through this integrated method, we aspire to achieve a balanced solution that enhances the accuracy of automated evaluations while maintaining interpretability and practical applicability in educational settings. The main contributions of this work are as follows:

Proposed a novel fusion hybrid model (DeBERTa + Metadata + LSTM) that integrates structured and unstructured data using attention mechanisms, achieving the highest classification accuracy of 93% for ELL evaluation.

Applied feature ranking and importance techniques to identify the most influential attributes for both modeling and rule construction.

Implemented a fuzzy logic technique to extract interpretable rules for trait-based learner classification, using ranked features to enhance the transparency and reliability of decisions.

Employed explainable AI (XAI), alongside statistical significance tests, to interpret and validate model performance, further strengthening trust in the system’s outputs.

The remainder of this paper is structured as follows: Section II presents an extensive review of related work, exploring both deep learning and fuzzy logic methodologies applied to language and educational evaluations. Section III outlines the detailed methodology, including the proposed hybrid framework, deep learning-based feature ranking approach, and fuzzy inference system. Section IV describes the experimental setup and dataset characteristics, followed by a discussion of the results and their implications in Section V.
Finally, Section VI concludes the paper by summarizing key findings, discussing the contributions and limitations of the research, and providing directions for future investigations.

Related work

The dynamic evolution of educational assessment has recently seen major advances through the use of deep learning and fuzzy logic techniques in language learning analytics16. In this section, we review the state-of-the-art research on the combination of these methods, paying attention to current deep learning techniques and fuzzy frameworks, to motivate the hybrid modeling strategy we advocate in the context of ELL assessment.

Deep learning-based ELL evaluation

Recent works apply deep learning (DL) models with feature selection or ranking to assess ELLs and student performance, aiming to enhance accuracy and interpretability simultaneously. In automated writing evaluation (AWE), deep neural networks have been utilized to score essays, with explainable AI techniques (SHAP values) applied to discover which linguistic features affect the scores17. In a subsequent work, deep models were used to predict fine-grained rubric scores and holistic scores simultaneously, leading to high agreement (QWK 0.78) with human raters, better than existing methods (QWK ~ 0.53). The authors noted, however, that the “black box” nature of DL made scores difficult to explain, supporting the need for feature importance analysis for interpretability18. Reviews of DL-based AES systems pointed out that, although end-to-end DL models achieve high accuracy, they cannot give feedback effectively because of incomprehensible feature representations19. Faseeh et al. combined deep features with handcrafted linguistic features and a light ensemble (XGBoost) for essay scoring, achieving higher accuracy than deep features alone20. Another study used an RNN model to measure oral pronunciation fluency and accuracy in English; their system, which included domain-specific features as input to the RNN, achieved > 90% recognition agreement with human judgments.
This provides evidence of the capability of DL to cope with speech relationships in data21, although such models can break down given a lack of training data or noisy input. A related approach is based on a graph convolutional network with an evolutionary optimization algorithm (ESA-NEGCN-NBOA)22; it was demonstrated to be ~ 20–28% more accurate than previous machine learning methods, with significantly decreased evaluation time. This method operates on unique feature extraction and optimization techniques, which is its strength, but it comes with high complexity23. Similarly, a deep CNN was combined with a virus colony search optimizer to assess EFL classroom teaching quality from audio/video data; the authors proposed a complete feature framework and adopted a CNN to fuse the features24. The adopted meta-heuristic tuning improved the accuracy and robustness of the CNN for multi-criteria teacher evaluation, outperforming traditional evaluation approaches in terms of accuracy and consistency25. Another important use-case, in the wider context of educational systems, is predicting students’ academic success at different levels of study using DL with feature ranking. Recent works use deep networks that learn feature representations automatically but retain feature selection for interpretability26. A further step introduced a model-agnostic framework without manual feature engineering, in which raw data are fed into an interpretable model that emphasizes the significant features through post-hoc analysis for online courses27. Based on the Open University dataset, the authors used a random forest as well as a multi-layer perceptron, and demonstrated that removing the feature-engineering step did not hurt accuracy, attributing this in part to the model’s ability to learn good representations internally28. Researchers have also aggregated deep models in ensembles and hybrid modalities to improve the performance of educational prediction.
One study combined traditional learners with a custom 1D CNN to classify students as “weak” or “strong” learners, carrying out a “multiparametric analysis” that considered a variety of factors including demographics, past grades, and online behavior29. The ensemble CNN model exceeded the precision and recall of single models by 2–16%. The authors also noted that a high-variance variable-selection technique enhanced interpretability by restricting the modeled effect to the most influential factors30. Another work composed several deep networks trained with different optimizers; this robust architecture obtained lower error (e.g., RMSE) on student grade datasets than single models31. In a later study, an attention-augmented DL model was proposed for performance prediction in MOOCs; the authors performed feature-elimination pre-processing and employed SHAP values for model explanation32. The tradeoff here is model complexity, but they offer a compelling blueprint for resolving the tension between predictive power and transparency33.

Fuzzy logic-based ELL education evaluation

Fuzzy logic has been commonly used to represent the inherent uncertainty and subjectivity in educational assessments, especially in ELL testing and learning outcomes. A study introduced a fuzzy logic system for the continual assessment of Chinese students’ English proficiency at different levels34. The model developed fuzzy sets for objective language-learning measures and utilized a fuzzy comprehensive evaluation to identify learning deficits, with the benefit of providing a more detailed assessment than a pass/fail or numeric score35. An IF-AHP model was proposed to evaluate college English teaching quality, embedded in online game-based learning (OGBL).
The authors set up a multi-level evaluation system, determined the influence degree of various quality factors (e.g., student engagement, teacher knowledge, and so forth) based on AHP, and made an aggregate evaluation based on fuzzy comprehensive evaluation (FCE)36. Furthermore, a distinct application of fuzzy logic in adaptive learning is presented in the context of an English game-based learning environment, where student characteristics are fuzzified to adjust game difficulty and feedback37. Similarly, fuzzy logic was used for the assessment of academic progress in medical education. Though not directly in ELL, this application demonstrates the flexibility of fuzzy logic; the FIS incorporated fuzzy exam marks, practical skills, and attendance as inputs to a composite student performance index38. Another fertile field is fuzzy logic in e-learning acceptance and satisfaction measurements. A fuzzy logic-based method assessed the e-learning effectiveness of computer science students during COVID-19; the authors defined fuzzy scales (e.g., high, medium, low) for perceived usefulness, ease of use, and learning outcomes, and derived overall acceptance through fuzzy aggregation39. A Fuzzy Comprehensive Assessment Model (FCAM) was used to evaluate the quality of English translation. Criteria such as accuracy, fluency, and stylistic adequacy were rated on a fuzzy scale and used by the FCAM to derive an overall translation quality score. This fuzzy set-based decision made it possible not only to judge slight translation errors in context but also to express translation quality as a group-based appraisal40. Furthermore, a multi-feature fuzzy evaluation model was developed to evaluate teaching methods for college physical education. Natural language input was fuzzified and organized under three dimensions, and the weights of all parameters in the fuzzy model were fine-tuned by an improved cuckoo search optimization.
The model demonstrated better ratings for evaluating instructional effectiveness and student satisfaction (95–97%) than the traditional evaluative system41. Additionally, these systems showed fuzzy rules to be better at early and accurate detection of at-risk students than threshold-based systems. In North Africa, a study employed fuzzy logic to forecast e-learning engagement indicators, permitting the system to avoid heavy computation and rely on expert rules for student activity patterns42. So far, fuzzy logic-based approaches have improved ELL and educational assessments by incorporating graded judgment and expertise into the evaluation process. Key benefits include robustness against noisy inputs (e.g., partially correct answers or fair engagement) and human-like reasoning in the scores (e.g., “good”, “fair”, and “needs improvement” in place of mere numerical grades). Fuzzy models succeeded where traditional models failed, and studies attained fairness in evaluation, student satisfaction, and early problem recognition. The principal criticism is that the design and validation of the fuzzy rules and membership functions is usually a laborious and costly task that requires expert personnel in the field. Nevertheless, the trend is clear: fuzzy logic, in some cases in conjunction with deep learning or optimization algorithms, is emerging as a useful toolkit for educational assessment research, complementing pure deep learning works with interpretable, flexible evaluation frameworks applicable not only to English language learning but also to other areas.

Proposed research methodology

This section describes the complete methodology used to investigate ELL assessment through a combination of advanced artificial intelligence and fuzzy concepts. The study revolves around systematic data preprocessing, sound feature engineering, and the usage of both rule-based and machine learning models.
With a combination of transformer-based textual embeddings, structured learner metadata, and interpretable fuzzy rule mining, the research seeks to model the multidimensionality of language proficiency assessment. Detailed procedures for model construction, feature ranking, and interpretability analysis, shown in Fig. 1, are documented to ensure replicability, transparency, and a robust comparison of traditional and state-of-the-art (SOTA) evaluation approaches.

Fig. 1 Framework of the proposed research methodology.

Data collection

The dataset used in this study was obtained from the publicly available “English-Language Learners-Evaluation” repository, which stores responses, scores, and learner profile data for English language learning research. The dataset is of a mixed type that includes structured metadata, such as demographic and contextual information and psycholinguistic and cognitive features, as well as unstructured text data in the form of written language answers; the attributes are displayed in Table 1. The data is split into training and testing sets using an 80–20 ratio. This rich diversity allows for a comprehensive examination of proficiency and learning conditions.

Table 1 Attributes distribution based on traits.

Data preprocessing

The data preprocessing stage is instrumental in achieving data quality and uniformity before modeling. The raw set was cleaned to remove duplicates and to handle missing values through imputation and validation. Text in the full-text column was cleaned through a standard text cleaning process (lowercasing, removal of special characters, whitespace, and stop words, and lemmatization) to standardize the input for transformer-based language models. To encode the categorical variables (native language and learning environment), label encoding and one-hot encoding were implemented to make them accessible to the machine learning algorithms43.
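As a minimal sketch of this encoding step, the snippet below applies one-hot and label encoding to the two categorical fields named above; the category values themselves are illustrative assumptions, not taken from the actual dataset:

```python
def one_hot_encode(records, field):
    """One-hot encode a categorical field across a list of dict records."""
    categories = sorted({r[field] for r in records})
    for r in records:
        for c in categories:
            r[f"{field}={c}"] = 1 if r[field] == c else 0
    return categories

def label_encode(records, field):
    """Map each category of a field to a stable integer label."""
    mapping = {c: i for i, c in enumerate(sorted({r[field] for r in records}))}
    for r in records:
        r[f"{field}_label"] = mapping[r[field]]
    return mapping

# Toy learner metadata rows (invented values for illustration only).
learners = [
    {"native_language": "Spanish", "learning_environment": "online"},
    {"native_language": "Mandarin", "learning_environment": "classroom"},
    {"native_language": "Spanish", "learning_environment": "classroom"},
]
one_hot_encode(learners, "learning_environment")
lang_map = label_encode(learners, "native_language")
```

In a production pipeline the category-to-index mapping would be fitted on the training split only and reused on the test split, mirroring the train/test protocol described above.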
The value ranges of numerical characteristics were made consistent with standard scaling to improve model convergence. Outliers in score-dependent features were identified and controlled to minimize their impact on the statistical and machine learning analysis. Finally, features were clustered based on domain relationships (contextual, cognitive, linguistic), serving as contexts both for mining trait-based rules and for fusing mixed models. In the preprocessing pipeline, trait features are engineered as follows: numerical fields are imputed using the median and then z-normalized; categorical fields are imputed using the mode and one-hot encoded; rubric sub-scores are modeled as ordinal numbers. Age and selected cognitive measures are discretized using thresholds obtained from the training set (quantile thresholds on age; tertiles on attention/working-memory; Likert binning on motivation). Linguistic and cognitive composites are derived by summing standardized sub-scores (or computed with SHAP-normalized weights, which are stated separately), as displayed in Table 2. Demographics are kept as separately encoded variables. All transformation statistics are recorded and reused during assessment to ensure reproducibility. This systematic pre-processing resulted in a more reliable, robust, and well-structured dataset, which served as the foundation for the subsequent AI- and fuzzy-based evaluation of English language learners.

Table 2 Trait construction and quantization.

Feature ranking and importance

Feature ranking and importance assessment are ingrained in every interpretable machine learning or rule-based modeling pipeline and become particularly important in challenging domains such as language learning analytics.
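Before ranking, the trait quantization described in the preprocessing step above (median imputation, z-normalization, training-set-derived tertile cut points) can be sketched as follows; the example scores and bin labels are illustrative assumptions:

```python
import statistics

def impute_and_zscore(values):
    """Median-impute missing numeric values, then z-normalize.
    The returned statistics would be stored and reused at evaluation time."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    filled = [med if v is None else v for v in values]
    mu, sd = statistics.mean(filled), statistics.pstdev(filled)
    return [(v - mu) / sd for v in filled], {"median": med, "mean": mu, "std": sd}

def tertile_bins(train_values):
    """Derive the two Low/Medium/High cut points from the training split only."""
    return statistics.quantiles(train_values, n=3)

def discretize(value, cuts):
    """Map a raw value onto a linguistic label via the stored cut points."""
    if value <= cuts[0]:
        return "Low"
    if value <= cuts[1]:
        return "Medium"
    return "High"

scores = [55, 60, None, 70, 80, 90]          # toy trait column with one missing value
z, stats_used = impute_and_zscore(scores)
cuts = tertile_bins([v for v in scores if v is not None])
```

Persisting `stats_used` and `cuts` and reapplying them to the test split is what makes the transformation reproducible, as noted above.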
Through systematic measurement and quantification of the relevance of each feature with respect to a prediction or classification task, these techniques assist researchers in determining the most salient features, improving model performance, and facilitating transparent, interpretable decision making. Among the numerous techniques, Principal Component Analysis (PCA) loadings and permutation importance (combined with ensemble methods such as random forests) are seen as powerful tools for capturing both linear and nonlinear associations in high-dimensional educational data.

Information gain (IG)

IG quantifies the reduction in uncertainty about the target variable \(Y\) when a given feature \(X\) is known. Features with higher IG are more informative and are prioritized in model training44. This method is particularly useful for identifying variables that are highly predictive of the outcome, defined as in Eq. 1.

$$IG\left(Y,X\right)=H\left(Y\right)-H\left(Y|X\right)$$(1)

Where \(H\left(Y\right)\) represents the entropy of the target variable \(Y\), defined using Eq. 2.

$$H\left(Y\right)=-\sum_{i=1}^{\left|C\right|}p\left({y}_{i}\right){\text{log}}_{2}p\left({y}_{i}\right)$$(2)

\(H\left(Y|X\right)\) denotes the conditional entropy of \(Y\) given \(X\), expressed as in Eq. 3.

$$H\left(Y|X\right)=-\sum_{j=1}^{\left|X\right|}p\left({x}_{j}\right)\sum_{i=1}^{\left|C\right|}p\left({y}_{i}|{x}_{j}\right){\text{log}}_{2}p\left({y}_{i}|{x}_{j}\right)$$(3)

Thus, the information gain is calculated as in Eq.
4 for ranking features against the target variable.

$$IG\left(Y,X\right)=-\sum_{i=1}^{\left|C\right|}p\left({y}_{i}\right){\text{log}}_{2}p\left({y}_{i}\right)+\sum_{j=1}^{\left|X\right|}p\left({x}_{j}\right)\left[\sum_{i=1}^{\left|C\right|}p\left({y}_{i}|{x}_{j}\right){\text{log}}_{2}p\left({y}_{i}|{x}_{j}\right)\right]$$(4)

This captures the degree to which a higher-ranking feature \(X\) reduces the uncertainty associated with predicting \(Y\).

Gain ratio (GR)

GR is an extension of IG that addresses its bias towards features with many unique values. It normalizes the IG by the intrinsic information of the feature, which measures its overall variability45. This adjustment prevents the over-selection of features with many distinct values, ensuring a more balanced selection of features that genuinely contribute to the target prediction, computed as in Eq. 5.

$$GR\left(Y,X\right)=\frac{IG\left(Y,X\right)}{H\left(X\right)}$$(5)

Where the intrinsic information \(H\left(X\right)\) is defined using Eq. 6.

$$H\left(X\right)=-\sum_{j=1}^{\left|X\right|}p\left({x}_{j}\right){\text{log}}_{2}p\left({x}_{j}\right)$$(6)

Thus, the gain ratio is calculated as in Eq. 7.

$$GR\left(Y,X\right)=\frac{-\sum_{i=1}^{\left|C\right|}p\left({y}_{i}\right){\text{log}}_{2}p\left({y}_{i}\right)+\sum_{j=1}^{\left|X\right|}p\left({x}_{j}\right)\left[\sum_{i=1}^{\left|C\right|}p\left({y}_{i}|{x}_{j}\right){\text{log}}_{2}p\left({y}_{i}|{x}_{j}\right)\right]}{-\sum_{j=1}^{\left|X\right|}p\left({x}_{j}\right){\text{log}}_{2}p\left({x}_{j}\right)}$$(7)

This metric allows a fairer comparison among features by adjusting for intrinsic variability.

Entropy

Entropy is a measure of the randomness or unpredictability of information. In feature selection, it is used to assess how well a feature can separate the classes of the target variable \(Y\), with possible classes \(\{{y}_{1},{y}_{2},{y}_{3},\dots,{y}_{\left|C\right|}\}\)46.
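The entropy, IG, and GR measures of Eqs. 1–7 can be computed directly; a minimal sketch on toy (invented) categorical data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum p(y) log2 p(y), as in Eq. 2."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(Y, X) = H(Y) - H(Y|X), with H(Y|X) weighted by p(x), as in Eqs. 1 and 3."""
    n = len(ys)
    cond = 0.0
    for x in set(xs):
        subset = [y for xi, y in zip(xs, ys) if xi == x]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(ys) - cond

def gain_ratio(xs, ys):
    """GR(Y, X) = IG(Y, X) / H(X), as in Eq. 5."""
    hx = entropy(xs)
    return information_gain(xs, ys) / hx if hx > 0 else 0.0

# Toy ranking: a perfectly predictive feature vs. an uninformative one.
y  = ["high", "high", "low", "low"]
x1 = ["online", "online", "classroom", "classroom"]   # predicts y exactly
x2 = ["a", "b", "a", "b"]                             # unrelated to y
```

Here `x1` attains the maximal IG of 1 bit while `x2` attains 0, so a ranking by IG or GR would place `x1` first, which is exactly how the ranked features feed the later rule construction.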
Features that significantly reduce entropy, thereby increasing IG, are considered more valuable for prediction; for a continuous variable, the differential form is computed as in Eq. 8 (compare the discrete case of Eq. 2). By choosing features that lower the overall entropy, models can achieve better classification performance.

$$H\left(Y\right)=-\lim_{\epsilon\to 0}\int_{\epsilon}^{1-\epsilon}p\left(y\right){\text{log}}_{2}p\left(y\right)\,dy$$(8)

A lower entropy value indicates that a feature is more informative, as it results in a greater reduction in uncertainty for the classification task.

Principal component analysis

PCA reduces the dimensionality of the input features to a new feature space of uncorrelated variables, the principal components, for which the new covariance matrix is diagonal. The “loading” of a feature onto a principal component conveys the relative importance or contribution of that feature to the component47. Large absolute loading values identify the features that matter most in explaining variance in the data. Feature importance via PCA loadings is often evaluated by examining the absolute loadings of features on the first principal components (PCs), which capture the largest variance in the data, as defined in Eq. 9. This technique can be used to discover which variables are most significant in determining the underlying structure of learner traits.

$$\text{Loading}_{j}^{\left(1\right)}=\frac{1}{\sqrt{\lambda_{1}}}\sum_{i=1}^{n}{x}_{ij}{u}_{i1}$$(9)

Where \({x}_{ij}\) is the value of feature \(j\) for observation \(i\),
\({u}_{i1}\) is the \(i\)-th entry of the first eigenvector of the covariance matrix \({X}^{T}X\), and \(\lambda_{1}\) is the largest eigenvalue, associated with \(PC1\).

Permutation importance with random forest

Permutation importance is a model-agnostic technique that estimates the importance of a feature as the increase in the model’s prediction error after permuting the feature’s values. In the context of a random forest, this is done by randomly permuting the values of one feature across the samples and evaluating the corresponding change in accuracy of the score prediction \(f\left(X\right)\). If rearranging the values of a feature increases the loss \(L\left(y,f\left(X\right)\right)\) against the true labels \(y\), the feature is said to be important, as computed using Eq. 10. This approach is especially good at accounting for non-linearity and interaction effects, both of which are relevant to complex educational datasets.

$$PI_{j}=\frac{1}{K}\sum_{k=1}^{K}\left[L\left({y}^{\left(k\right)},f\left({X}_{\text{pr}\left(j\right)}^{\left(k\right)}\right)\right)-L\left({y}^{\left(k\right)},f\left({X}^{\left(k\right)}\right)\right)\right]$$(10)

Where \(PI_{j}\) is the permutation importance of feature \(j\), \(K\) is the number of permutation repetitions, \({X}^{\left(k\right)}\) is the feature matrix in the \(k\)-th repetition, and \({X}_{\text{pr}\left(j\right)}^{\left(k\right)}\) is \({X}^{\left(k\right)}\) with feature \(j\) permuted.

Fuzzy rule-based classification technique

Once the input variables \(\varvec{x}={\left[{x}_{1},{x}_{2},\dots,{x}_{n}\right]}^{\text{T}}\in{R}^{n}\) were fuzzified, a Fuzzy Inference System (FIS) was designed to model the decision-making logic in all three trait contexts, following the flow shown in Fig. 2.
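Under a 0/1 loss, Eq. 10 reduces to the mean accuracy drop after shuffling one feature column. The sketch below illustrates this with a hand-made toy "fitted" classifier standing in for a trained random forest (model, data, and seed are illustrative assumptions):

```python
import random

def accuracy(model, X, y):
    """Fraction of correct predictions (the complement of 0/1 loss)."""
    return sum(model(row) == yi for row, yi in zip(X, y)) / len(y)

def permutation_importance(model, X, y, j, repeats=30, seed=0):
    """Mean accuracy drop after shuffling column j (Eq. 10 with 0/1 loss);
    `repeats` plays the role of K."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    drops = []
    for _ in range(repeats):
        col = [row[j] for row in X]
        rng.shuffle(col)
        Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        drops.append(base - accuracy(model, Xp, y))
    return sum(drops) / repeats

# Toy 'fitted' classifier: predicts from feature 0 only, so shuffling
# feature 0 should hurt accuracy while shuffling feature 1 should not.
model = lambda row: 1 if row[0] > 0.5 else 0
X = [[0.1, 5.0], [0.2, 1.0], [0.9, 3.0], [0.8, 2.0]] * 5
y = [0, 0, 1, 1] * 5
pi0 = permutation_importance(model, X, y, j=0)
pi1 = permutation_importance(model, X, y, j=1)
```

Because the toy model ignores feature 1 entirely, its importance is exactly zero, while feature 0 shows a large positive drop; ranking features by this quantity is what feeds the later rule construction.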
A fuzzy inference system processes a collection of \(M\) if-then rules \(R={\left\{{R}_{j}\right\}}_{j=1}^{M}\), defined on fuzzy sets, and produces an output via a reasoning mechanism that imitates human cognitive reasoning48. A fuzzy rule in the framework is expressed using Eq. 11.

$$R_{j}:\text{IF}\ x_{1}\ \text{is}\ A_{1}^{j}\wedge x_{2}\ \text{is}\ A_{2}^{j}\wedge\dots\wedge x_{n}\ \text{is}\ A_{n}^{j}\ \text{THEN}\ y\ \text{is}\ B^{j}$$(11)

The firing strength \(\alpha_{j}\) of each \(R_{j}\in R\) is computed using a \(t\)-norm operator \(T:[0,1]^{n}\to[0,1]\), here the product \(t\)-norm, defined as in Eq. 12.

$$\alpha_{j}\left(x\right)=T\left(\mu_{A_{1}^{\left(j\right)}}\left(x_{1}\right),\mu_{A_{2}^{\left(j\right)}}\left(x_{2}\right),\dots,\mu_{A_{n}^{\left(j\right)}}\left(x_{n}\right)\right)=\prod_{i=1}^{n}\mu_{A_{i}^{j}}\left(x_{i}\right)$$(12)

Each activated rule contributes a fuzzy output set scaled by its firing strength, as in Eq. 13.

$$\tilde{B}_{j}\left(y\right)=\alpha_{j}\cdot\mu_{B^{j}}\left(y\right)$$(13)

Fig. 2 Working of the fuzzy rule-based classification technique.

The overall aggregated fuzzy output \(\mu_{\tilde{B}}\left(y\right)\) is obtained using the maximum operator over all activated rules, as in Eq. 14.

$$\mu_{\tilde{B}}\left(y\right)=\bigvee_{j=1}^{M}\tilde{B}_{j}\left(y\right)=\max_{j}\left[\alpha_{j}\left(x\right)\cdot\mu_{B^{j}}\left(y\right)\right]$$(14)

The crisp output \(y^{*}\in\mathbb{R}\) is then computed using the centroid of area method, also known as the center of gravity, defined as in Eq.
15.

$$y^{*}=\frac{\int_{Y}y\cdot\mu_{\tilde{B}}\left(y\right)\,dy}{\int_{Y}\mu_{\tilde{B}}\left(y\right)\,dy}$$(15)

where \(Y\subset\mathbb{R}\) is the output universe of discourse. For systems with multiple outputs, \(Y=[{y}_{1}^{*},{y}_{2}^{*},\dots,{y}_{m}^{*}]\), the system extends to vector-valued outputs using Eq. 16.

$$y^{*}=\left[y_{1}^{*},y_{2}^{*},\dots,y_{m}^{*}\right],\quad y_{k}^{*}=\frac{\int y_{k}\cdot\mu_{\tilde{B}_{k}}\left(y_{k}\right)\,dy_{k}}{\int\mu_{\tilde{B}_{k}}\left(y_{k}\right)\,dy_{k}}$$(16)

This generalization is particularly relevant for multi-criteria decision-making in trait selection, where both impact and development growth are derived from the outputs. By formalizing expert knowledge into fuzzy rules and leveraging linguistic reasoning, the FIS design enables soft decision boundaries and interpretable evaluations in complex educational domains.

Predictive modelling

The model development phase of the study incorporates a range of potentially suitable models, selected for their respective capacities to handle the complex, multivariate nature of ELL assessment data. As ML baselines, SVM, DT, RF, and CatBoost are implemented, as displayed in Table 3. SVM classifies data points well, especially in high-dimensional feature spaces, but it may lack the power to detect non-linear effects49. The Decision Tree model gives a highly interpretable structure, which can capture decision boundaries using straightforward if-then rules, yet it is often susceptible to overfitting. As an ensemble of decision trees, Random Forest reduces this risk to some extent, as it generalizes and stabilizes a forest of trees.
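As a hedged, single-input sketch of the Mamdani-style pipeline in Eqs. 11–15 (the membership shapes, universes, and rule labels below are illustrative assumptions, not the paper's actual rule base):

```python
def tri(a, b, c):
    """Triangular membership function on [a, c] peaking at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Antecedent sets over a 0-100 'linguistic score' input universe and
# consequent sets over a 0-10 'proficiency' output universe (all invented).
low, high = tri(0, 20, 60), tri(40, 80, 100)
out_poor, out_good = tri(0, 2, 6), tri(4, 8, 10)
rules = [(low, out_poor), (high, out_good)]     # Eq. 11: IF score is X THEN y is B

def infer(x, steps=1000):
    """Fire rules (Eq. 12), scale consequents (Eq. 13), aggregate with max
    (Eq. 14), and defuzzify by the discretized centroid of area (Eq. 15)."""
    ys = [10 * k / steps for k in range(steps + 1)]
    agg = [max(r_in(x) * r_out(y) for r_in, r_out in rules) for y in ys]
    num = sum(y * m for y, m in zip(ys, agg))
    den = sum(agg)
    return num / den if den else 0.0

score = infer(80)   # high(80) = 1, so the output follows the 'good' consequent
```

A score of 80 fully fires only the second rule, so the crisp output lands at the centroid of the `out_good` triangle (about 7.3), while an input of 20 lands near the `out_poor` centroid; intermediate inputs blend both rules, which is the soft decision boundary described above.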
CatBoost, a gradient boosting algorithm, greatly improves prediction accuracy, particularly when categorical variables and complex data distributions are involved50. For DL-based methods, Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) networks are used. These models are well suited for capturing temporal and sequential dependencies within learner responses, enabling more complex, contextual representations of text data. LSTM is especially useful for sequences in which order and context play a crucial role, while BiLSTM further improves on this by learning previous and future contexts simultaneously, deepening the understanding of the language51. Transformer-based modeling is represented by DeBERTa, a modern architecture known for its strong self-attention mechanism and deep contextual embeddings. DeBERTa can model intricate syntactic and semantic relationships in textual responses, thus learning strong, fine-grained language representations that enable highly accurate classification52. Finally, the hybrid DeBERTa-LSTM model takes full advantage of both transformer-based contextual encoding and recurrent sequence modeling. This model first adopts the transformer-based DeBERTa53 to generate dense, context-aware representations of the learner text; these representations are then fed into LSTM layers to extract sequential patterns and relationships. The result is an integrated predictive framework that readily accommodates unstructured data, providing enhanced prediction and interpretability within the ELL trait continuum.
This methodological variety allows for comprehensive benchmarking and demonstrates the benefits of advanced hybrid architectures over traditional non-hybrid models.

Table 3 Analysis of baseline model selection based on their hyperparameter tuning strength.

Proposed model

We trained a model, Fusion DBML, to capture fine-grained relationships between language features and learner attributes; it combines the power of advanced language embeddings, structured metadata, and temporal modeling, with the architecture shown in Fig. 3. The architecture starts by passing the raw textual responses of learners through a pretrained DeBERTa transformer to obtain dense, context-aware embeddings via deep self-attention operations. These embeddings represent the intricate syntactic and semantic structures contained in the learners’ written language and correspond to a high-dimensional feature vector \({Z}_{text}\in{\mathbb{R}}^{{d}_{1}}\), with \({d}_{1}\) the dimension of the embeddings generated by DeBERTa54.

Input processing

The fusion model processes two types of input data.

Textual input: Each student’s textual response is tokenized and passed into a pretrained DeBERTa model to extract rich, context-aware embeddings. Let the sequence of tokens be \(T=\{{t}_{1},{t}_{2},\dots,{t}_{n}\}\).

Metadata input: Structured learner data (contextual, demographic, psycholinguistic, and cognitive features) are normalized and encoded into a numerical vector. Let this vector be \({X}_{meta}\in{\mathbb{R}}^{{d}_{2}}\).

DeBERTa embedding layer

This embedding layer adds positional and encoder embeddings for the transformer blocks. Each token \({t}_{i}\) is mapped to an embedding \({e}_{i}\), and DeBERTa computes contextualized embeddings using stacks of transformer layers, defined using Eq.
17:

$$\:{H}^{\left(l\right)}=\text{Transforme}{\text{r}}^{\left(\text{l}\right)}\left({H}^{\left(l-1\right)}\right)$$(17)

Where \(\:{H}^{\left(0\right)}\:=\:[{e}_{1},\dots\:,{e}_{n}]\) is the sequence of token embeddings and \(\:l\) denotes the layer index. Each attention mechanism uses Eq. 18:

$$\:\text{Attention}\left(Q,K,V\right)=\text{softmax}\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}+R\right)V$$(18)

Where \(\:R\) adds a relative position bias.

Pooling layer

For the sequence representation, the [CLS] token embedding or a pooled output is typically taken, as in Eq. 19:

$$\:{z}_{\text{text}}=h\left[CLS\right]\:\because\:\:{Z}_{text}\:\epsilon\:{\mathbb{R}}^{{d}_{1}}$$(19)

Feature fusion layer

The contextual text embedding and metadata feature vector are concatenated to form a single input vector, computed using Eq. 20:

$$\:{z}_{\text{fusion}}=\left[{z}_{\text{text}},\:{z}_{\text{meta}}\right]\:\because\:{z}_{\text{fusion}}\:\epsilon\:{\mathbb{R}}^{{d}_{1}+{d}_{2}}$$(20)

Fig. 3 Architecture of DeBERTa Model.

LSTM integration layer

The fused vector \(\:{z}_{\text{fusion}}\) is passed through one or more LSTM layers to capture non-linear, sequential interactions between text and structured features. The LSTM cell is defined by Eq. 21 (forget gate), Eq. 22 (input gate), Eq. 23 (output gate), Eq. 24 (candidate cell state), Eq. 25 (cell-state update), and Eq.
26 (hidden state):

$$\:{f}_{t}={\upsigma\:}\left({W}_{f}\left[{z}_{\text{fusion}},{h}_{t-1}\right]+{b}_{f}\right)$$(21)

$$\:{i}_{t}={\upsigma\:}\left({W}_{i}\left[{z}_{\text{fusion}},{h}_{t-1}\right]+{b}_{i}\right)$$(22)

$$\:{o}_{t}={\upsigma\:}\left({W}_{o}\left[{z}_{\text{fusion}},{h}_{t-1}\right]+{b}_{o}\right)$$(23)

$$\:\stackrel{\sim}{{c}_{t}}=\text{tanh}\left({W}_{c}\left[{z}_{\text{fusion}},{h}_{t-1}\right]+{b}_{c}\right)$$(24)

$$\:{c}_{t}={f}_{t}\odot\:{c}_{t-1}+{i}_{t}\odot\:\stackrel{\sim}{{c}_{t}}$$(25)

$$\:{h}_{t}={o}_{t}\odot\:\text{tanh}\left({c}_{t}\right)$$(26)

Where \(\:{f}_{t},\:{i}_{t},\:{o}_{t}\) are the forget, input, and output gates, \(\:\stackrel{\sim}{{c}_{t}}\) the candidate cell state, \(\:{c}_{t}\) the cell state, \(\:{h}_{t}\) the hidden state, \(\:{\upsigma\:}\) the sigmoid function, and \(\:\odot\:\) element-wise multiplication.

Dense and output layer

The final hidden state from the LSTM, \(\:{h}_{T}\), is input to a dense (fully connected) layer for multi-class classification. The output prediction \(\:\widehat{y}\) is obtained via the softmax activation using Eq. 27:

$$\:\widehat{y}=\text{softmax}\left({W}_{\text{out}}{h}_{T}+{b}_{\text{out}}\right)$$(27)

Fig. 4 Proposed model architecture diagram.

Layer-wise model pipeline, also shown in Fig. 4:

Input 1: Raw learner text → DeBERTa → \(\:{z}_{\text{text}}\).
Input 2: Metadata features → Embedding/Encoding → \(\:{z}_{\text{meta}}\).
Fusion: Concatenation to form \(\:{z}_{\text{fusion}}\).
Hidden: Dense + LSTM layers for deep fusion and sequence modeling \(\:{h}_{T}\).
Output: Dense (softmax) layer for multi-class trait classification \(\:\widehat{y}\).

Hyperparameter settings

The proposed model unites high-quality pretrained language representations and structured learner metadata with temporal sequence modeling using LSTM layers. The model employs a stacked LSTM architecture with dropout, which captures temporal dependencies while avoiding overfitting.
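The stacked LSTM computation described in Eqs. 21–26, together with a softmax head as in Eq. 27, can be traced with a toy numpy implementation of a single step over the fused vector. Dimensions, random weights, and the omitted output bias are assumptions for illustration, not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def lstm_step(z, h_prev, c_prev, W, b):
    """One LSTM step over the fused vector z (Eqs. 21-26)."""
    x = np.concatenate([z, h_prev])           # [z_fusion, h_{t-1}]
    f = sigmoid(W["f"] @ x + b["f"])          # Eq. 21: forget gate
    i = sigmoid(W["i"] @ x + b["i"])          # Eq. 22: input gate
    o = sigmoid(W["o"] @ x + b["o"])          # Eq. 23: output gate
    c_tilde = np.tanh(W["c"] @ x + b["c"])    # Eq. 24: candidate cell state
    c = f * c_prev + i * c_tilde              # Eq. 25: cell-state update
    h = o * np.tanh(c)                        # Eq. 26: hidden state
    return h, c

rng = np.random.default_rng(1)
d_fusion, d_hidden, n_classes = 6, 4, 3       # toy sizes
W = {g: 0.1 * rng.standard_normal((d_hidden, d_fusion + d_hidden)) for g in "fioc"}
b = {g: np.zeros(d_hidden) for g in "fioc"}

z = rng.standard_normal(d_fusion)             # stand-in fused text+metadata vector
h, c = lstm_step(z, np.zeros(d_hidden), np.zeros(d_hidden), W, b)

W_out = 0.1 * rng.standard_normal((n_classes, d_hidden))
y_hat = softmax(W_out @ h)                    # Eq. 27 head (bias omitted)
print(h.shape, float(y_hat.sum()))
```

In practice these operations are provided by a framework's LSTM layer; the sketch only makes the gate algebra of the equations concrete.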
Training uses an adaptive optimizer well suited to transformer-based models, with the learning rate and batch size tuned for stable convergence; the hyperparameter tuning is shown in Table 4. The maximum input sequence length is chosen as a compromise between computational efficiency and the amount of context captured. Overfitting is handled through dropout, weight decay, label smoothing, early stopping, gradient clipping, batch normalization, class weighting, and calibration, while data leakage is prevented through hold-out splits, fold-aware preprocessing, rare-category handling, de-duplication, and a leakage audit. A fuzzy logic module is also introduced to enrich the feature-based measures with rule-based inference over learner characteristics.

Table 4 Analysis of hyperparameter tuning.

Statistical analysis

A variety of statistical tests, displayed in Table 5, are applied to assess the significance of performance differences and the relationships between categorical outcomes in trait classification55. Here, \(\:\text{S}{\text{S}}_{\text{between}}\) and \(\:\text{S}{\text{S}}_{\text{within}}\) denote the between- and within-group sums of squares, k the number of groups, and N the total number of observations. For each test, a significance level of α = 0.05 is used, and p-values are adjusted (e.g., Bonferroni correction) when multiple comparisons are made.

Table 5 Analysis of applied statistical test.

XAI SHAP and DeepSHAP

Explainable Artificial Intelligence (XAI) techniques such as SHAP (SHapley Additive exPlanations) and its deep learning extension, DeepSHAP, are key to interpreting the predictions of complex machine-learning models. SHAP is a cooperative game theory-based method that assigns a prediction-specific importance value to each feature, quantifying how much each feature contributes (positively or negatively) to the model's output.
This allows users to go beyond black-box predictions and obtain transparent, personalized explanations that are essential for informed decision making in educational applications. DeepSHAP extends SHAP to deep learning models, estimating SHAP values through forward and backward passes with a modified DeepLIFT procedure56. For models such as DBML, DeepSHAP disentangles the prediction score into contributions from the input text embeddings and from the structured metadata features, computed as in Eq. 28, providing a fine-grained understanding of which traits the model relies on for each learner.

$$\:{{\upphi\:}}_{i}={\sum\:}_{S\subseteq\:F\setminus\:\left\{i\right\}}\frac{\left|S\right|!\hspace{0.17em}\left(\left|F\right|-\left|S\right|-1\right)!}{\left|F\right|!}\left[{f}_{S\cup\:\left\{i\right\}}\left({x}_{S\cup\:\left\{i\right\}}\right)-{f}_{S}\left({x}_{S}\right)\right]$$(28)

Here \(\:F\) is the set of all features, \(\:S\) a subset of \(\:F\) not containing \(\:i\), \(\:{f}_{S}\) the model prediction when only the features in \(\:S\) are present, and \(\:{{\upphi\:}}_{i}\) the SHAP value for feature \(\:i\). This expresses the marginal contribution of feature \(\:i\) averaged over all possible feature combinations, making SHAP both theoretically principled and practically robust for interpreting hybrid and deep models in ELL assessment.

Performance evaluation measures

Performance evaluation in this study relies on a comprehensive suite of classification metrics, displayed in Table 6, to ensure robust and fair model assessment. Accuracy, the proportion of correctly classified samples, provides an overall picture of a model's success. Precision measures the percentage of positive predictions that are correct, indicating the model's ability to avoid false alarms.
Recall (or sensitivity) is the percentage of true positives identified among all actual positives, stressing the model's ability to identify positive cases.

Table 6 Performance evaluation measures.

The F1-score balances precision and recall through their harmonic mean, summarizing both false positives and false negatives. Moreover, AUC-ROC (Area Under the Receiver Operating Characteristic Curve) assesses the model's capacity to separate classes at different probability thresholds, providing a threshold-free measure of separability and ranking power with direct applicability to multi-class settings57. In combination, these metrics offer a multifaceted view of predictive performance and corroborate the validity and interpretability of our hybrid and fuzzy-based models for the classification of ELLs.

Results and discussion

In the following, the empirical results obtained by traditional machine learning as well as advanced fuzzy rule-based methods on the ELL dataset are comprehensively presented. Through a set of comprehensive analyses, including descriptive statistics, visual exploratory data analysis, feature importance estimation, and model evaluation, important relationships and patterns in the data are revealed. The comparative performance of different feature categories, the interpretability of rule-based models, and insights into linguistic proficiency in terms of contextual, psycholinguistic, and cognitive features are addressed. The results of each methodological step are presented below, combining numerical results with relevant visualizations to demonstrate the efficiency and explanatory power of the proposed hybrid framework.

Exploratory data analysis

Insights into feature relationships and the distribution of learner performance are derived from the EDA of the ELL dataset. The feature correlation heatmap in Fig.
5 reveals that the linguistic constructs (labeled Cohesion, Syntax, Vocabulary, Phraseology, Grammar, and Conventions) have substantial positive correlations with one another and with the overall score. This clustering of large correlation coefficients (most around 0.65) indicates that these linguistic measures co-vary strongly, in line with the finding that advanced linguistic proficiency is multi-dimensional and strongly interrelated. In contrast, contextual, demographic, and cognitive features are weakly correlated with each other and with the language traits, implying that they are more independent and play a subsidiary role, relative to language, in measuring learner competence.

Fig. 5 Correlational heatmap of all attributes within dataset.

This is confirmed by focused views of selected features in Fig. 6 (Cohesion, Syntax, Vocabulary, Overall). The correlation heatmap here is even more interrelated (0.81 in some cases), specifically between Syntax and Overall or Cohesion and Overall, showing that improvements in these features translate directly into better overall language use. This close correspondence supports their use as predictors in both basic and advanced models. The distribution of the overall score is shown as a histogram in Fig. 7 with a kernel density estimate; the bell-shaped histogram peaks at 3. This implies that learner proficiency is evenly distributed across the dataset, with most students at an intermediate proficiency level and fewer at the extremes. An associated bar plot of the grade distribution indicates that the dataset primarily consists of students in grades 11 and 12, with sample sizes that remain quantitatively significant for grades 8, 9, and 10.
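The pairwise Pearson correlations summarized in these heatmaps can be reproduced with `numpy.corrcoef`; the synthetic data below merely imitates two strongly related rubric traits and is not drawn from the paper's dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-ins for two related rubric traits (illustrative only).
syntax = rng.normal(3.0, 0.8, 200)
overall = 0.8 * syntax + rng.normal(0.6, 0.4, 200)

r = np.corrcoef(syntax, overall)[0, 1]   # Pearson correlation coefficient
print(round(r, 2))
```

Stacking all rubric columns into one matrix and calling `np.corrcoef` on it yields the full correlation matrix that such a heatmap visualizes.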
This demographic distribution generates confidence that the analysis represents an overall, although slightly upper-level-skewed, portrait of secondary school students.

Fig. 6 Linguistic feature analysis using heatmap.

Fig. 7 Distributions of scores and grades based on responses.

Finally, the multivariate density plot in Fig. 8, covering all feature pairs standardized from 0 to 6, conveys the multi-modal character of learner profiles. Linguistic peaks (in particular Cohesion, Syntax, and Vocabulary) occur at mid-to-high levels, evidencing their influential role and well-developed state in the sample. In contrast, attributes such as prior experience, motivation level, and cognitive load are spread more broadly, with disproportionately large density at smaller values, corroborating the earlier observation that they play a complementary, but no less informative, role. This comprehensive EDA highlights the heterogeneity of the learner groups and the complex nature of language proficiency, providing a robust empirical ground for advanced, interpretable models such as fuzzy logic and deep learning fusion.

Fig. 8 Density plot distributions of all attributes within dataset.

The numeric-statistics chart shows that the rubric scores (Cohesion, Syntax, Vocabulary, Phraseology, Grammar, Conventions) are tightly confined to the 1–5 range, with means congregating around the midpoint and medians of 3, indicating reasonably symmetric distributions with no extreme outliers; grade occupies a much broader scale and has the greatest spread, and hence must be z-scaled to avoid dominating the fused model, as shown in Fig. 9. The categorical top-category-share plot in Fig. 10 indicates a fair skew in some features: a single category of motive accounted for ~ 50% of samples, and others (e.g., Vocabulary/Overall/Syntax) each had a top-category share of ~ 30–40%.
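The z-scaling recommended above for the grade variable is the usual standardization to zero mean and unit variance; a minimal sketch with made-up grade values:

```python
import numpy as np

# Hypothetical grade values (illustrative, not the paper's data).
grades = np.array([8, 9, 10, 11, 11, 12, 12, 12], dtype=float)
z = (grades - grades.mean()) / grades.std()   # zero mean, unit variance
print(z.round(2))
```

After this transform, grade sits on the same scale as the standardized rubric scores, so it cannot dominate the fused feature vector purely by magnitude.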
This combination suggests sufficient variation to learn from, yet also an element of possible imbalance; hence, stratified CV, class/feature balancing, rare-category merging (e.g., into an "other" bucket), and missingness flags are recommended in the pipeline to avoid spurious associations and keep the model focused on the linguistic cues.

Fig. 9 Bar plot distributions analysis of numeric attributes statistics.

Fig. 10 Top categorical metadata analysis.

Feature ranking and fuzzy rule

The use of feature selection (in the form of feature ranking) as a preliminary step to rule-based fuzzy modeling is crucial to the effectiveness, interpretability, and practical impact of the resulting fuzzy decision-making system. To this end, feature ranking strategies including Information Gain, Gain Ratio, Gini Index, principal component loadings, and permutation importance collectively detect the most informative and distinctive attributes within each category: contextual and demographic features, psycholinguistic and cognitive features, and linguistic features. This multivariate testing grounds the rules generated in the fuzzy system both linguistically and statistically. For the contextual and demographic characteristics, relative importance is higher for native language, learning environment, and age than for prior experience. These findings suggest that the development of fuzzy rules should be predicated on the more subtle conditions related to language background and educational context rather than on experience, which may not always successfully distinguish learner outcomes. In the psycholinguistic and cognitive domain, the working memory index and cognitive load are the most relevant features under most ranking methods, while attention span and motivation level are ranked lower (yet still relevant).
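Of the ranking criteria listed above, information gain is simply the entropy reduction a feature induces over the class labels; a self-contained sketch on a made-up binned feature (the values are illustrative, not the paper's data):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_vals, labels):
    """IG(Y; X) = H(Y) - sum_v p(X = v) * H(Y | X = v)."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_vals):
        subset = [y for x, y in zip(feature_vals, labels) if x == v]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

# Hypothetical binned syntax scores vs. proficiency class (illustrative).
syntax_level = ["low", "low", "med", "med", "high", "high"]
labels = ["B", "B", "B", "A", "A", "A"]
ig = information_gain(syntax_level, labels)
print(round(ig, 3))
```

Gain Ratio follows by normalizing this value by the entropy of the feature itself, which penalizes attributes with many rare categories.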
Thus, the fuzzy rule base may emphasize cognitive performance and load management in decision-making, resulting in more robust and focused rule activation for those learner characteristics. The most remarkable result is found in the linguistic feature subdomain, where all ranking criteria coincide in selecting phraseology, syntax, vocabulary, and cohesion as the preeminent predictors of language proficiency. The uniformly strong scores of these attributes across all methods imply that they are sound foundations for building reliable fuzzy rules. For example, rules including thresholds or fuzzy sets on syntax or phraseology are much more likely to produce robust, generalizable classifications. Grammar and conventions, though ranked lower, still play a significant role in forming secondary or supportive rules. Because the fuzzy rule base is grounded in empirically ranked features, the proposed fuzzy inference system is transparent and well controlled. Rules are neither arbitrary nor merely expert-authored; they closely reflect the underlying discriminative structure of the data. This cooperation between statistical feature importance and fuzzy logic strengthens rule activation, reduces ambiguity in class assignment, and improves the accuracy and interpretability of the whole system. Ultimately, such a technique bridges data-driven discovery and human-understandable logic, making fuzzy-based e-assessment more robust, interpretable, and practically beneficial. This approach to assessing ELLs demands strong feature selection that draws on both statistical measures (e.g., Information Gain, Gain Ratio, Gini Index) and model-based algorithms (Random Forest) to understand which learner features are most predictive of language proficiency.
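Permutation importance, one of the model-based criteria used alongside the statistical measures, scores a feature by the accuracy lost when its column is shuffled. The sketch below is model-agnostic; the toy classifier merely stands in for the trained Random Forest, and all data and names are illustrative:

```python
import numpy as np

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled;
    a larger drop marks a more important feature."""
    rng = np.random.default_rng(seed)
    base = np.mean(model(X) == y)
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])             # break the feature-label link
            drops.append(base - np.mean(model(Xp) == y))
        importances.append(float(np.mean(drops)))
    return importances

# Toy stand-in for a trained classifier: it uses only column 0.
model = lambda X: (X[:, 0] > 0.5).astype(int)
rng = np.random.default_rng(1)
X = rng.random((200, 2))
y = model(X)                                  # labels explained by column 0
imp = permutation_importance(model, X, y)
print([round(v, 2) for v in imp])
```

As expected, shuffling the column the model depends on produces a large accuracy drop, while the ignored column scores near zero.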
The integration of these two methods, tabular rankings and visual feature importance, provides dual evidence of every attribute's importance, supporting interpretable rule-based modeling as well as robust predictive analytics.

Contextual and demographic variables

The rankings in Table 7 suggest that, among the contextual and demographic features, L1 and environment show larger Information Gain and Gain Ratio, allowing them (as single attributes, considering individual splits) to reduce entropy and give useful splits with respect to learner class. Although age carries less weight under Information Gain, the Gini Index indicates that it produces relatively pure class separations. When feature importance from a Random Forest is plotted, age emerges as the most dominant feature, affirming the Gini Index viewpoint.

Table 7 Contextual & demographic feature ranking.

Fig. 11 Feature importance for contextual & demographic trait based on RF.

This inconsistency illustrates that entropy- and information-based statistics tend to favor attributes with highly mixed value distributions (as these are the most informative for single splits), whereas model-based algorithms like RF can reveal non-linear and interaction effects, making variables such as age particularly important, as shown in Fig. 11. This twofold evidence indicates that native language and learning context matter for rule formation, but age must also be considered to ensure the best predictive power, favoring rules that capture maturational and experiential differences in language learning.

Psycholinguistic and cognitive factors

Table 8 shows the feature ranking for this category: the working memory index and cognitive load appear as the top two features, with the larger IG and GI values respectively, which means they are significant for discriminating ELL proficiency levels.
Similarly, the attention span score shows a high PC1 loading, highlighting its relevance in principal component-based analyses. These observations are visually consolidated in the Random Forest plot in Fig. 12, where the working memory index and span test score are the most important attributes, followed by cognitive load and motivation level. The agreement between the statistical and model-based values justifies adopting these cognitive traits to define rules in fuzzy systems and to feed input nodes in advanced machine learning models. It also aligns with existing psychological research suggesting that memory and attentional control are critical for learning, particularly in distinguishing subtle learner profiles.

Table 8 Psycholinguistic & cognitive feature ranking.

Fig. 12 Feature importance for psycholinguistic & cognitive trait based on RF.

Linguistic features

The most compelling and consistent results appear in the linguistic feature set, shown in Table 9. Alongside the model-based importances, all ranking metrics (Information Gain, Gain Ratio, Gini Index) converge on phraseology and syntax as the most important predictors. These characteristics obtain the highest values under each metric and are therefore excellent candidates for rule-based fuzzy systems and for feature selection in machine learning models. The magnitude of each feature in the RF plot in Fig. 13 makes this point clear: phraseology and syntax are by far the most important features, with diminishing contributions from cohesion, vocabulary, grammar, and conventions.
Moreover, such a consensus among multiple analytic perspectives confirms the primacy of advanced linguistic features in the achievements of ELLs, revealing that proficiency in these constructs best predicts overall language proficiency and should thus form the core of both rule bases and predictive models. By combining these ranking results with visual inspection of the Random Forests, a variety of empirical and theoretical conclusions can be drawn. Empirical feature rankings for the rule-based fuzzy systems prevent rules from being generated arbitrarily; instead, rules rest on data-driven evidence for the significance of attributes. AI-based models trained on high-ranking features (phraseology and syntax for linguistic rules; age or the working memory index for contextual and cognitive rules), using interval values, form the core of potent, intelligible fuzzy inference systems. From a machine learning perspective, this provides a sound balance between traditional statistical ranking and model-based importance, guaranteeing that the final models are statistically sound, empirically reliable, and generalizable.

Table 9 Linguistic feature ranking.

Fig. 13 Feature importance for linguistic trait based on RF.

Joint models that adaptively combine statistical feature selection methods with Random Forest importance measures can take advantage of both kinds of method. This approach maximizes the degree to which interpretable (i.e., actionable) rules as well as non-interpretable machine learning models leverage the most informative, reliable, and predictive learner characteristics, ultimately facilitating more accurate, explainable, and actionable assessment systems. The use of a fuzzy rule-based classification model is important for progress in the nuanced identification of ELLs.
In contrast to rigid, threshold-based systems, fuzzy logic allows a soft, intuitive way of dealing with the inherent uncertainty and loose structure of human language learning. By utilizing empirically ranked features in context, cognition, and language, learners are projected into rules in which their relative position along the proficiency continuum is reflected in the rules articulated. This allows educators and researchers to account for the complex interactions of varied learner dimensions, contributing to a more comprehensive and personalized assessment. The results of the fuzzy system are then discussed in terms of the support they provide for the ELL framework, particularly regarding how combining a diverse set of learner features can yield more precise, interpretable, and action-oriented learner profiles.

Fig. 14 Membership ratio among contextual and demographic traits.

The fuzzy membership plots offer a fine-grained, intuitive visualization of the distribution and interpretability of trait attributions across the contextual, psycholinguistic, cognitive, and linguistic dimensions for English learners. Such plots vividly demonstrate the strength of fuzzy systems in describing the nuanced, gradual nature of learner characteristics, something crisp, fully thresholded models cannot imitate. For contextual and demographic features, Fig. 14 illustrates how attributes such as age, native language, learning environment, and prior experience are distributed among low, medium, and high fuzzy sets.
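Low/medium/high fuzzy sets of this kind can be realized with simple triangular membership functions. The breakpoints below are illustrative choices on a 0–5 rubric-style scale, not those used in the paper:

```python
def tri(x, a, b, c):
    """Triangular membership: rises from a to b, falls from b to c, 0 outside."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Illustrative breakpoints for a 0-5 scale (assumed, not from the paper).
def memberships(score):
    return {"low":    tri(score, -1.0, 0.0, 2.5),
            "medium": tri(score, 1.0, 2.5, 4.0),
            "high":   tri(score, 2.5, 5.0, 6.0)}

m = memberships(3.0)
print(m)
```

A score of 3.0 belongs partially to both the medium and high sets at once, which is exactly the graded, non-exclusive assignment the membership plots visualize.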
The soft degrees of membership indicate that learners do not necessarily belong to distinct (all-or-none) categories. For instance, many learners hold moderate-to-high membership in the medium-age and medium-native-language-proficiency sets, because both age and native language proficiency usually affect learning along a gradient rather than in discrete states. This granularity enables the fuzzy system to develop rules that account for slight demographic differences, adapting to mixed or intermediate profiles rather than fixing arbitrary dividing lines. The psycholinguistic and cognitive characteristics (attention span, working memory, motivation, and cognitive load) show even more varied membership distributions. Figure 15 illustrates how learners can score moderately to highly on more than one cognitive factor. For example, a student may have high working memory but only average cognitive load, a combination a crisp rule system could miss but which fuzzy logic captures naturally. This richer representation permits the fuzzy classifier to fire rules in parallel and weigh them by the actual membership degrees, better reflecting human judgment in educational domains.

Fig. 15 Membership ratio among psycholinguistic and cognitive traits.

In the area of language, the plots in Fig. 16 for cohesion, syntax, vocabulary, phraseology, grammar, and conventions are particularly telling. Here the membership degrees for the "high" sets often show broader, more pronounced distributions, indicating that high linguistic proficiency is a more discrete, discriminating feature at advanced learner levels. Nevertheless, the presence of learners with moderate or low memberships alongside those with high memberships in these features demonstrates the diversity of abilities within any cohort of learners.

Fig.
16 Membership ratio among linguistic traits.

The fuzzy system's power is also evidenced by its ability to aggregate a large number of high or medium linguistic memberships simultaneously, increasing the interpretability and robustness of its classification. In sum, these observations highlight an advantage of the fuzzy approach in trait-based ELL contexts. Because membership is visualized and modeled as a continuum, fuzzy if-then rules can handle a subtlety of reasoning that binary logic cannot entertain. Learner evaluation thereby becomes more flexible, fair, and actionable, supporting more personalized instruction and assessments that more truly reflect the nuances of human learning. Furthermore, regarding supporting mean values, Fig. 17 shows the histogram of mean rule activation strengths for the three categories (Contextual & Demographic, Psycholinguistic & Cognitive, and Linguistic) in the fuzzy rule-based classifier. The figures show that most mean rule activations fall within a common range for all three trait types, while the activation peaks are slightly higher for the linguistic feature set. This reflects the observation that linguistic features induce stronger, more decisive rule activations than contextual and cognitive features. In other words, the fuzzy rules built over linguistic features activate more strongly and robustly, underlining their pivotal role in distinguishing EFL learners of different proficiency. The slightly higher and wider distribution for linguistic features means that the corresponding rules more often dominate the final classification, as was also observed in the feature ranking analysis.
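Rule activation strength of this kind is conventionally computed as a t-norm (minimum or product) of the antecedent membership degrees; a minimal sketch with hypothetical degrees for one learner:

```python
import math

def rule_activation(memberships, antecedents, t_norm=min):
    """Activation strength = t-norm of the antecedents' membership degrees."""
    degrees = [memberships[feat][level] for feat, level in antecedents]
    return t_norm(degrees)

# Hypothetical membership degrees for one learner (illustrative).
m = {"syntax": {"medium": 0.7}, "motivation": {"high": 0.9}}

# IF syntax is medium AND motivation is high THEN ...
rule = [("syntax", "medium"), ("motivation", "high")]

print(rule_activation(m, rule))                     # min t-norm
print(rule_activation(m, rule, t_norm=math.prod))   # product t-norm
```

The min t-norm lets the weakest antecedent cap the rule's strength, while the product t-norm discounts the rule for every partially satisfied condition; either way, multiple rules can fire in parallel with graded strengths.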
For the contextual and demographic as well as the psycholinguistic and cognitive domains, the narrower spreads and smaller peaks in the violin plots indicate that rule activations are milder and more evenly distributed. This accords with the expectation that these characteristics play a more supportive or moderating role in fuzzy inference. Their rules can fire alongside (or even supplement) linguistic rules, adding nuance or contextual information to the decision without dictating the classification result. The visual aggregation of membership strengths and the small differences detected between trait categories show that fuzzy rule-based systems can be both interpretable and fine-grained. Instead of a simple binary threshold, the fuzzy classifier uses degrees of rule activation, making it possible to compute an overall, finely graded learner classification from multiple partial contributions. This aligns more closely with the diversity of real-world learners and also shows educators which attributes were most important in a particular assessment and to what extent each dimension contributed to the learner categorization.

Fig. 17 Fuzzy rule activation distribution among each class.

The plot in Fig. 18 shows how rule support (strength) is distributed across the most important attributes and categories in the fuzzy classification model. For all of them, from contextual and cognitive to linguistic, the degree and consistency with which their associated fuzzy rules hold over a group of samples are assessed. The layered color attribution of categories shows that rule support is not uniform but is distributed by feature level rather than globally, revealing which intervals of the various features contribute most robustly to classification.
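The rule support plotted here can be read as the distribution of a rule's activation strength over the sample; a sketch with hypothetical membership degrees for three learners (all values are illustrative):

```python
def activation(m, rule):
    """min t-norm over the rule's antecedent membership degrees."""
    return min(m[feat][level] for feat, level in rule)

def rule_support(rule, samples):
    """Mean activation strength of one rule across a set of learners."""
    strengths = [activation(m, rule) for m in samples]
    return sum(strengths) / len(strengths)

rule = [("syntax", "medium"), ("cohesion", "medium")]
samples = [  # hypothetical membership degrees for three learners
    {"syntax": {"medium": 0.8}, "cohesion": {"medium": 0.6}},
    {"syntax": {"medium": 0.4}, "cohesion": {"medium": 0.9}},
    {"syntax": {"medium": 0.0}, "cohesion": {"medium": 0.7}},
]
print(round(rule_support(rule, samples), 3))
```

Plotting the per-learner `strengths` rather than only their mean yields exactly the kind of per-rule support distribution shown in the violin plots.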
Notably, the high density of support values at the bottom of each violin indicates that the fuzzy system is sensitive to a wide array of learner profiles, while the higher outliers reflect rarer cases that cause strong rule activations in specific instances. This multidimensional visualization illustrates the flexibility and explanatory power of the fuzzy approach, showing, for example, how traditionally marginalized features such as prior experience or motivation level can still play a significant role in learner assessment when their rule support corresponds to certain categories.

Fig. 18 Distribution of rule support by attributes among score values.

The bar chart in Fig. 19 shows the mean values of the important features of the ELL dataset and directly supports the fuzzy-based analysis by demonstrating which characteristics are most prominent or essential in the learner group. The linguistic features (Vocabulary, Cohesion, Phraseology, Conventions, Grammar, and Syntax) cluster, as a group, at the upper end of the rubric scale, with means just above 3. This concentration at the high end indicates that the linguistic traits are not only well developed on average but also exert the greatest effect on rule activation and learner differentiation within the fuzzy system. This is consistent with the fuzzy rule analysis, in which the linguistic features were consistently characterized by higher rule activations and membership degrees, signaling that they are the most decisive in distinguishing advanced language learners.

Fig. 19 Analysis of mean values of attributes.

The mean values of the contextual and psycholinguistic-cognitive features (learning environment, cognitive load, motivation level, prior experience) are lower.
This indicates that although these factors are crucial for a nuanced understanding and for secondary rule creation, they are less dominant in the empirical profile of the data. Their lower means also demonstrate the fuzzy system's capacity for partial membership, which prevents these features from being discarded entirely; instead they modulate the linguistically driven classification results. Thus, this mean-value analysis supports and extends the fuzzy approach. It suggests that the fuzzy classification is well calibrated to the empirical structure of the data, highlighting the most valuable linguistic features while remaining sensitive to the distinct contributions of contextual and cognitive assets. The result is more balanced and interpretable learner evaluation, confirming the applicability of fuzzy logic to real educational assessment. The rule-based visualizations in Fig. 20 provide a holistic view of how fuzzy logic exploits empirically important features to deliver understandable and powerful classification for ELL assessment. For example, the heatmap of the top 15 fuzzy rules shows the rules activated most frequently and most strongly in the dataset. Notably, the top rules usually stack intermediate values of age and attention span score with a high level of motivation, reflecting the complexity of language learning, in which moderate and high traits reinforce each other to define learner profiles. The strong activation of these rules suggests that, within this community, moderate cognitive and demographic characteristics coupled with high motivation are mainly associated with successful language learning outcomes. This again emphasizes the power of the fuzzy mechanism in modeling multi-attribute dependencies rather than relying on any binary indicator.

Fig.
20Top 15 fuzzy rules with activation strength support.Full size imageBased on class-specific bar plots, different trait interactions that seem to drive different types of learners. The highest activation is obtained on rules that mix medium syntax and age such as for the Contextual & Demographic class or high motivation and medium syntax, in Fig. 21. This demonstrates that above all, including demographic aspects, language ability and motivation are key in the contextual assessment of learner achievement. Results for the Psycholinguistic & Cognitive class are mainly rules associating medium syntax and high cognitive load or cohesion with working memory—supporting a psychological hypothesis that cognitive resources, in tandem with syntactic awareness, together facilitate advanced learning, shown in Fig. 22. In this case, the rules also demonstrate symmetry where the order of the attributes (for example, syntax and cognitive load) can be switched and activated with the same degree, which shows the generality and flexibility of fuzzy logic.For the Linguistic Features class, as shown in Fig. 23, activations to rules with the maximum mean are consistently around syntax and cohesion, but specifically about medium levels for both. This indicates that, for linguistic classification, a good command of these associated features is essential. The higher mean activations of these rules as compared to the ones in the other classes seem to indicate that once trigged, the linguistic rules have a higher overall effect on the categorization of the learners—a result that aligns with our analysis of the strengths of the rule activations and the importance of the features.Fig. 21Top 5 rules for contextual and demographic class.Full size imageFig. 22Top 5 rules for psycholinguistic and cognitive class.Full size imageFig. 
23Fuzzy Top 5 rules for linguistic class.Full size imageTaken together, these rule sets illustrate how fuzzy systems can expand beyond binary thresholds and leverage the combination of multiple moderate-to-high trait values for the complex, context-dependent assessment of learners. The visualization of both global and category-specific rules not only increase the interpretability of the system but provides a tool for design of pedagogical strategies—educators can ascertain which combinations of features are most relevant and adapt instruction accordingly. At the end, we thus present a data-driven, transparent personalized ELL-support roadmap, a strategy applied to pinpoint and personalize support, which capitalizes on the strength of fuzzy logic to mirror the real complexity of human learning.Predictive modelling resultsThe results of our predictive experiments highlight the capability of sophisticated ML, DL and hybrid architectures in assessing the performance of English language learners. Most importantly, the fusion-based model with DeBERTa in combination with structured metadata and LSTM achieved the highest performance than all the baseline and previous SOTA models that can provide reliable accuracy with the ability to generalize to the wide variety of learners. This holistic approach demonstrates the benefit of integrating rich text embeddings with cognitive and demographic features, achieving new state-of-the-art predictive analytics for educational assessment.Machine and ensemble learningThe analysis of machine learning models reveals interesting deviating trends in predictive accuracy in classifying ELLs according to contextual, psycholinguistic, cognitive, and linguistic features. Both overall accuracy and class-wise performance metrics show increasing difficulty of prediction as transition from linear or shallow learners to ensemble methods. 
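As a concrete illustration of the rule machinery behind the fuzzy analysis above, the following sketch computes the activation and support of a single rule. The paper does not specify the membership shapes or the conjunction operator, so triangular memberships and min-conjunction are assumptions here, and the feature values are random stand-ins for the real learner data.

```python
# Hedged sketch of fuzzy rule activation and support (assumed operators).
import numpy as np

def tri(x, a, b, c):
    """Triangular membership: rises over [a, b], falls over [b, c]."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Memberships for attributes scored on a 1-5 scale
def medium(x):
    return tri(x, 1.5, 3.0, 4.5)

def high(x):
    return tri(x, 3.0, 5.0, 7.0)  # saturates toward the top of the scale

rng = np.random.default_rng(0)
syntax = rng.uniform(1, 5, 500)       # stand-ins for real learner features
motivation = rng.uniform(1, 5, 500)

# Rule "IF syntax is medium AND motivation is high": min-conjunction per sample
activation = np.minimum(medium(syntax), high(motivation))
support = activation.mean()           # average rule strength over the dataset
print(f"rule support: {support:.3f}")
```

Ranking rules by this support value over the whole dataset is one straightforward way to produce heatmaps and bar plots of the kind shown in Figs. 20, 21, 22 and 23.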
Among the benchmarked models, CatBoost and Random Forest deliver the best performance (86% and 85% accuracy), with higher precision, recall, and F1-scores than SVM and Decision Tree, which return lower and quite similar results (78–80%), as displayed in Table 10. The confusion matrices in Fig. 24 further elucidate these results. The SVM model shows a strong bias toward misclassifying psycholinguistic and cognitive instances as contextual and demographic: for example, 139 samples from class 1 are misclassified as class 0, and most of the linguistic-attribute cases are scattered across other categories.
Table 10 ML and EL based results analysis (%).
Fig. 24 Confusion matrix of ML and EL models.
The Decision Tree model, although slightly more accurate than SVM, still confuses classes substantially (66 samples from class 1 are misclassified as class 0, and 52 from class 1 are placed in class 2). These results suggest that linear and single-tree models cannot capture the fine, overlapping boundaries that arise when contextual, psycholinguistic, and linguistic variables jointly influence language-proficiency outcomes. The superior predictive capability of Random Forest and CatBoost is reflected in higher true positives across all class values and a considerable decrease in misclassifications. Both models show good symmetry between sensitivity and specificity, particularly for the psycholinguistic/cognitive and linguistic categories. Random Forest correctly predicts 168 samples from class 1 and 176 from class 2, while CatBoost improves on these figures, correctly predicting 177 and 173, respectively.
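A benchmark loop of this kind can be sketched as follows. This is a hypothetical illustration, not the paper's exact pipeline: a synthetic three-class dataset stands in for the ELL features, and CatBoost is omitted so the sketch depends only on scikit-learn.

```python
# Hypothetical sketch of the per-model confusion-matrix and AUC comparison.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three ELL trait classes
X, y = make_classification(n_samples=1500, n_features=12, n_informative=8,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)

models = {
    "SVM": SVC(kernel="rbf", probability=True, random_state=42),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    cm = confusion_matrix(y_te, clf.predict(X_te))   # rows: true class, cols: predicted
    auc = roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovr")
    print(name, "AUC = %.2f" % auc)
    print(cm)
```

Off-diagonal entries of each `cm` correspond to the misclassification counts discussed above (e.g., class 1 samples assigned to class 0).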
This improved prediction can be attributed to the ensemble models' ability to exploit feature interactions and nonlinear relations, which are prevalent in educational contexts with diverse learner characteristics. CatBoost in particular not only leads on the performance measures but also achieves sound separation of all three classes, reducing potential confusion and further enhancing the model's robustness for future applications. Comparing the ROC curves of the machine learning models provides a clear visual representation of their discriminative ability. SVM and CatBoost both achieve an impressive AUC of 0.94, followed closely by Random Forest at 0.93, indicating that these models discriminate well between classes across a range of decision thresholds, keeping the true positive rate high while the false positive rate stays low. The Decision Tree performs far worse at 0.79, suggesting that it fails to separate the classes as well, either because it overfits or because it cannot model the complex relationships in the dataset. The overlapping cluster of ROC curves for SVM, Random Forest, and CatBoost in Fig. 25 indicates the substantial robustness and dependability of these systems for ELL evaluation, with ensemble and kernel methods delivering strong predictive performance across the test samples.
Furthermore, the decision boundaries of the machine learning models provide a comparative view of how these algorithms partition the English language learner profiles across the feature domains, shown in Fig. 26. After projecting the multi-dimensional learner data onto a two-dimensional (X, Y) space defined by the first two principal components (PCA), the spread of the actual samples is plotted against the classification regions for the three classes: Contextual & Demographic, Psycholinguistic & Cognitive, and Linguistic Features.
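The projection and decision-region construction described above can be sketched as follows; the dataset is again a synthetic stand-in, and the mesh resolution and classifier choice are illustrative assumptions rather than the paper's settings.

```python
# Hypothetical sketch of the PCA projection and decision regions behind Fig. 26.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=900, n_features=12, n_informative=8,
                           n_classes=3, random_state=0)

# Project to the first two principal components
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)

# Fit the classifier in the projected space, then evaluate it on a mesh grid;
# each mesh prediction colors one pixel of the decision-region plot
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X2, y)
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))
regions = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
print(X2.shape, regions.shape)  # (900, 2) (200, 200)
```

Plotting `regions` as a filled contour with the projected samples scattered on top reproduces the style of boundary plot discussed next.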
The SVM model produces a smooth, curved separator, reflecting its ability to find an optimal classifier in the transformed feature space. Samples with contextual and demographic features cluster naturally in the left corner, and SVM correctly predicts most of these class 0 samples. Some overlap and misclassification remain for samples near the margins, particularly for students whose feature profiles blend psycholinguistic and linguistic traits.
Fig. 25 Combined model AUC-ROC analysis.
The boundary plot of the Decision Tree reflects a very different partitioning of the input space. Its splits appear as vertical, axis-aligned decisions, consistent with the tree's threshold-based splits on individual attributes. This yields well-defined partitions when the classes are clearly separated, but produces a stepped, sometimes broken boundary that cannot represent the finer, multidimensional structure of the data. Students whose feature values lie close to the decision boundaries can therefore be classified inconsistently, especially when those values sit at the split points between two ranges. The ensemble methods, Random Forest and CatBoost, exhibit visibly more complex and tighter decision boundaries. They allocate tighter, more class-balanced regions that better approximate the distribution of student samples. As an ensemble of trees, Random Forest produces a smoother, more adaptable class separation than a single tree, and CatBoost goes further with a more advanced boosting implementation, yielding boundaries that are not only well defined but also robust to noisy or overlapping feature sets.
For English proficiency learners, this means that students with diverse and overlapping characteristics, for example those performing well linguistically while being cognitively driven, are matched more reliably, reducing misclassification and giving a more realistic representation of real-world learner diversity. Collectively, these decision boundary plots highlight the respective capabilities and shortcomings of each model structure. Kernel models such as SVM offer smooth transitions but may struggle with overlapping class regions; simple tree-based models are explainable but lack the flexibility to handle complex educational data. Ensemble methods such as Random Forest and CatBoost achieve the best performance because they map nuanced combinations of features at different levels onto learner categories. This translates into more accurate and actionable student classification for targeted interventions and personalized support in language learning environments.
In conclusion, these results confirm the need for ensemble learning methods in multi-layered, high-dimensional educational data tasks. The gap between SVM and Decision Tree on one side and Random Forest and CatBoost on the other highlights the significance of model choice in both research and application. By making good use of contextual, psycholinguistic, cognitive, and linguistic features, the advanced models provide useful information for learner profiling, educational assessment, and personalized intervention, and prove to be the better choice for language learner evaluation frameworks.
Fig. 26 Decision boundary region of all three traits based on PCA.
Deep learning and transformer-based model
The progression of scores through deep language models, transformer-based models, and hybrids paints a clear picture of the growing sophistication and effectiveness of modern ELL-scoring approaches.
Table 11 below summarizes the performance of the baseline deep learning models: LSTM and BiLSTM achieve only moderate accuracy (61% and 63%, respectively). Their relatively low precision, recall, and F1-scores illustrate the inadequacy of purely sequence-based modeling when the rich structure and diverse features of educational assessments are not fully exploited. The confusion matrices of these models in Fig. 27 show a strong tendency for both to confuse psycholinguistic/cognitive and linguistic characteristics, namely between class 1 and class 2 samples. This implies that a model with access to complex features, but without knowledge of how those features interact with one another and with lower-level signals, cannot solve this fine-grained classification task. Performance is boosted substantially by the transformer model DeBERTa, whose accuracy reaches 83%, with corresponding improvements in most of the refined metrics. Thanks to its self-attention mechanism, DeBERTa captures richer contextual representations from the input text, improving its ability to differentiate intricate proficiency levels. Its confusion matrix confirms this better discrimination: correct predictions increase significantly for all classes, although some confusion between the psycholinguistic/cognitive and linguistic categories remains. A further improvement is observed when DeBERTa is combined with an LSTM layer (DeBERTa + LSTM), reaching 86% accuracy with correspondingly improved precision and recall. This hybrid couples the transformer's context-encoding ability with the LSTM's sequential modeling, yielding a model better suited to capturing temporal or ordered patterns within language.
Fig. 27 Confusion matrix analysis of deep models.
Proposed model results
The proposed fusion model, DBML, is the culmination of these developments, delivering the best metrics observed: an accuracy of 93%, precision of 90%, recall of 93%, and F1-score of 92%, as detailed in Table 11. This model concatenates the dense semantic embeddings of DeBERTa (which capture nuanced text representations) with structured metadata features describing contextual, demographic, psycholinguistic, and cognitive factors. The enriched input is subsequently processed by an LSTM layer, so the network exploits both the depth of the language representation and the breadth of learner-specific information.
The outcome is a model that not only represents the fine-grained distinctions between learner categories with superior performance, but also accommodates complex, multi-attribute input with remarkable flexibility. This is confirmed by the confusion matrix of DBML, in which all classes are correctly predicted far more often than not, notably for the psycholinguistic/cognitive and linguistic feature cases, with a sharp decrease in misclassification. The model's fusion mechanism thus captures the best of both the structured and unstructured data, bringing together the learner profile and the understanding of the student's language.
Table 11 Deep learning-based models' results analysis (%).
In conclusion, the results emphasize the crucial importance of multimodal and hybrid models for language learner assessment. The proposed DBML model, which combines transformer-based text embeddings with the associated structured features, achieves the best accuracy and F1-score.
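A minimal sketch of such a late-fusion head is shown below, assuming PyTorch. Random tensors stand in for the DeBERTa token embeddings (the real model would feed the transformer's last hidden states here), and the layer sizes, dropout rate, and metadata width are illustrative assumptions rather than the exact published configuration.

```python
# Hedged sketch of a DBML-style late-fusion head (PyTorch assumed).
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, emb_dim=768, meta_dim=10, n_classes=3):
        super().__init__()
        # 2-layer LSTM over the token embeddings (256 hidden units)
        self.lstm = nn.LSTM(emb_dim, 256, num_layers=2, batch_first=True)
        # Small fusion MLP (128 -> 64) over [LSTM state ; metadata]
        self.mlp = nn.Sequential(
            nn.Linear(256 + meta_dim, 128), nn.BatchNorm1d(128),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, token_emb, metadata):
        _, (h, _) = self.lstm(token_emb)             # h: (layers, batch, 256)
        fused = torch.cat([h[-1], metadata], dim=1)  # late fusion with metadata
        return self.mlp(fused)

model = FusionHead()
tokens = torch.randn(4, 256, 768)  # batch of 4, 256 tokens, DeBERTa-base width
meta = torch.randn(4, 10)          # structured learner features
logits = model(tokens, meta)
print(logits.shape)                # torch.Size([4, 3])
# Trainable-parameter count of the kind used for overhead analysis:
print(sum(p.numel() for p in model.parameters()))
```

Because the fusion happens after the encoder, the head adds only a small parameter and compute overhead on top of the text backbone, which is the design trade-off discussed in the efficiency analysis below.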
Its ability to reduce confusion between neighboring classes makes it particularly attractive for educational uses where high interpretability and precision are necessary. This model establishes a new standard for data-informed, student-centered assessment, with the potential to inform more responsive and differentiated education. Furthermore, the training and validation accuracy curves of the proposed DBML model in Fig. 28 demonstrate strong learning dynamics and good generalization over epochs. Splitting the training procedure into epochs 1–25, 26–50, 51–75, and 76–100 makes the first plot series the most detailed view of the development of both accuracy metrics. Accuracy fluctuates early on, as expected from the model's rapid adjustment to the data, but both training and validation accuracy level off and trend upward after the middle of training. The colored marks on the accuracy curves denote the model's confidence; their concentration at higher values as training progresses shows the model learning steadily and reducing its prediction variance over time.
This trajectory is maintained in the later epochs, with only small, infrequent deviations between the validation and training sets. The close alignment of the two curves suggests that the model is not overfitting but is instead learning underlying patterns that generalize to new data. The persistent improvement followed by convergence not only attests to the hybrid model's performance but also shows that it fully taps the predictive potential of both text and structured metadata features.
Fig. 28 Training and validation accuracy analysis of proposed model over sets of epochs.
The classic accuracy and loss curves presented in Fig. 29 further corroborate the stability and efficiency of the model.
The left plot shows training and validation accuracy increasing consistently, reaching almost perfect values around epoch 100. On the right, the corresponding loss curves descend toward zero for both training and validation, with a series of small bumps early in training, the usual signature of early exploration and parameter tuning. Beyond this initial phase, the loss stabilizes and remains low until the end of training, indicating that the model minimizes error without instability or vanishing/exploding gradients.
Overall, these training and validation graphs demonstrate the stability and practicality of the proposed model. The close overlap of the accuracy and loss curves, the absence of overfitting, and the similarity of training and validation performance together indicate that the model is not only well fitted to the data but also generalizes well. This underscores the model's suitability for real-world educational settings, where predictive accuracy and reliability are crucial in assessing English language learners.
Fig. 29 Overall model loss and accuracy analysis.
The training curves, together with the GPU and memory utilization, indicate that the combined deep learning and transformer approach can be used in practice in an efficient, resource-aware way. Over 100 epochs the model not only achieves good accuracy on the training and validation sets (considering the complexity of the architecture) but also has remarkably low hardware requirements, as shown in Fig. 30. A GPU usage of 8.02 GB and memory consumption of 5.9% demonstrate a favorable trade-off between high performance and resource utilization. This is especially important in educational technology, where deployment environments are often resource-poor.
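Resource figures of this kind can be collected during training. The sketch below is a library-agnostic assumption, not the paper's instrumentation: it reads peak resident memory from the standard library and reports GPU memory only if PyTorch and a CUDA device happen to be available.

```python
# Hypothetical sketch of collecting the resource figures reported in Fig. 30.
import sys

def peak_rss_gb():
    """Peak resident set size of this process, in GB (Unix only)."""
    import resource
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in kilobytes on Linux, bytes on macOS
    scale = 1 if sys.platform == "darwin" else 1024
    return rss * scale / 1e9

def peak_gpu_gb():
    """Peak CUDA memory allocated, in GB; 0.0 when no GPU is present."""
    try:
        import torch
        if torch.cuda.is_available():
            return torch.cuda.max_memory_allocated() / 1e9
    except ImportError:
        pass
    return 0.0

print(f"peak RSS: {peak_rss_gb():.2f} GB, peak GPU: {peak_gpu_gb():.2f} GB")
```

Calling these at the end of each epoch gives the time series from which a memory-consumption plot can be drawn.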
To keep the hybrid design lightweight, we make strategic architecture choices, from relying on DeBERTa's efficient attention mechanisms, to optimizing the LSTM layers for sequence modeling, to integrating the metadata space-efficiently with as little overhead as possible. The hybrid DBML model keeps DeBERTa-base as the main backbone and adds a lightweight 2-layer LSTM (256 hidden units) and a small fusion MLP (128 → 64 with batch normalization and dropout). The overhead over DeBERTa-base is minimal: parameters increase by approximately +23 M (2.33%), and the forward-pass cost at 256 tokens by approximately +134 GFLOPs (4.95%). These numbers were obtained by counting trainable parameters (sum(p.numel() for p in model.parameters())) and profiling FLOPs (ptflops/torchprofile) with the same sequence length, batch size, and precision as the baselines. Because the fusion is late, inference cost is not appreciably higher than with DeBERTa alone, yet it yields the largest accuracy/AUC improvement; GPU memory consumption does not increase during training, and throughput drops only slightly (≈ 5–8%). Relative to more capacity-heavy alternatives (larger transformers or jointly encoded multimodal models), DBML offers better accuracy per unit of compute: essentially the same speed as DeBERTa, with a small parameter/FLOP premium to unlock the full hybrid potential.
The result is a fast, efficient solution that is scalable and deployable on modern GPUs and in memory-limited environments, enabling advanced English language learner assessment across a broader spectrum of institutions and educators without compromising predictive ability or interpretability.
Fig. 30 Memory consumption analysis during training.
The correctness plot in Fig. 31 graphically represents the distribution of predicted versus actual class assignments across all samples for the three ELL categories of the hybrid model.
Correct predictions are visualized as green dots, whereas mismatched predictions are shown in red. The tight clustering and general greenness within each class band show that the hybrid approach has high predictive strength and accurately maps complex, multi-dimensional input data to the correct learner categories. The relatively low concentration of red dots likewise indicates strong generalization ability and high accuracy. Notably, the stability of accurate predictions across the three class bands demonstrates the hybrid model's balanced sensitivity to a mix of learner traits, validating its usefulness not only for advanced linguistically oriented features but also for contextual and cognitive types. This visualization strongly supports the argument that fusing deep textual embeddings with structured metadata produces a robust, holistic ELL assessment tool that captures and predicts learner diversity accurately and understandably.
Fig. 31 Actual (red) and predicted (green) analysis based on proposed model.
The proposed hybrid model is modality-independent and could be tailored to vision by replacing the text encoder with a vision backbone (e.g., ViT/Swin or a CNN), optionally inserting an LSTM/ConvLSTM to model temporal sequences (multi-frame or video), and retaining the late-fusion MLP to inject non-image metadata (illumination level, sensor ISO, GPS, weather, time).
This parallels hybrid and deep modeling in computer vision, e.g., transformer-based plant disease recognition incorporating learned features and auxiliary signals, variational nighttime dehazing with hybrid regularization49, and prior-query transformers for haze removal53. These works suggest two immediate applications: (i) plant pathology, since growth-stage sequences and field metadata enhance discrimination58, and (ii) mitigation of adverse visibility, since capture conditions condition the model. Our fuzzy-rule layer could encode interpretable priors (e.g., contrast/color-cast thresholds) to adjust borderline predictions, and Grad-CAM/vision-SHAP could provide saliency-level explanations. A full vision benchmark is left to future work, which will also verify whether the training recipe, leakage controls, and fusion strategy transfer accordingly.
Using the DeBERTa text-only baseline (Acc 83, F1 89) as reference, adding the LSTM alone yields a small increase in accuracy (+3) and a small decrease in F1 (−3): sequence modeling raises the overall hit rate but slightly worsens class balance or calibration. By contrast, adding only the metadata improves accuracy more (+4) with a smaller F1 drop (−2), indicating that learner/context features are informative but not sufficient to fix hard boundary cases, as displayed in Table 12.
The maximum lift (93 accuracy, 92 F1) comes from combining both signals (DeBERTa + LSTM + Metadata, i.e., DBML), where recall reaches 93%, indicating that temporal and demographic/behavioral context are complementary: the former captures dependencies in time sequences, while the latter reduces ambiguity in near-boundary samples.
Table 12 Ablation study analysis based on DeBERTa variants.
Applying a fuzzy-rule post-filter on top of the fused model produces fractionally smaller aggregate metrics (e.g., Acc 91, F1 91) than DBML, in exchange for highly constrained, interpretable decision boundaries while retaining strong predictive scores. In practice, this variant stabilizes edge scores at the cost of a small compromise in headline accuracy. Overall, the ablation shows that each component contributes: the metadata improves calibration, the LSTM models temporal structure, and their fusion produces the best and most generalizable performance, justifying the proposed hybrid design. The primary contributor to performance is DeBERTa: the text-only baseline reaches 83% accuracy/0.95 AUC. Adding the LSTM alone gives a smaller improvement (86%), learning sequence regularities and smoothing out token-level noise, while metadata alone introduces complementary context (e.g., learner/background cues), as shown in the combined analysis of Fig. 32. The combination is synergistic (DBML: 93% accuracy/0.98 AUC), since DeBERTa provides well-informed linguistic representations, the LSTM encodes temporal/dependency patterns at the sequence level, and the disentangled metadata resolves borderline cases that text alone cannot settle. This mixture is accordingly supported by complementary error plots in the ablation: transformer > transformer + {LSTM|metadata} (minor gains)