Introduction

As globalization surges forward, cross-cultural communication is intensifying, making the mastery and application of language an increasingly vital skill (Syzenko and Diachkova, 2020; Xia et al., 2024). In traditional language education, the cultivation of intercultural communicative competence is often marginalized, with a predominant focus on imparting language knowledge such as grammar and vocabulary. This educational model has led to significant challenges for students in real-world intercultural communication. In the context of today’s increasingly globalized society, marked by diverse cultural backgrounds, one of the foremost difficulties encountered by adolescents lies in navigating and understanding cultural differences. For instance, many adolescents struggle with communication barriers when interacting with individuals from different cultures, often due to language obstacles or cultural misunderstandings. Moreover, the enhancement of language skills in adolescents must be pursued alongside a deeper understanding of cultural contexts. An exclusive emphasis on language training fails to address the practical challenges faced in intercultural communication. Consequently, fostering intercultural communicative competence in adolescents has become a critical objective in contemporary language education (Tsang, 2022). Traditional classroom instruction, for one, often falls short in offering personalized learning experiences, struggling to cater to the diverse needs of individual learners. Additionally, the typical language learning process is limited by low interactivity and practical application, creating hurdles for learners when attempting to apply their skills in real-world contexts (Aririguzoh, 2022; Tafazoli et al., 2020). The advent of artificial intelligence (AI) technology, however, ushers in a transformative opportunity for language education.
Intelligent language learning systems, equipped with personalized learning pathways and real-time feedback mechanisms, hold the promise of drastically enhancing both the efficiency and effectiveness of the learning process (Huang et al., 2023). Through these advanced systems, learners can experience a more tailored and interactive approach, paving the way for greater mastery of language and improved cross-cultural communication skills.

In intelligent language learning systems, the advancement of deep learning technologies has been a pivotal force in driving innovation. Models such as the Transformer (Yang et al., 2023) and Bidirectional Encoder Representations from Transformers (BERT) (Gu et al., 2021) have demonstrated superior performance in natural language processing tasks, particularly in sequence-to-sequence translation and language comprehension. The Transformer model, featuring an encoder-decoder architecture, optimizes both language understanding and generation. In parallel, the BERT model leverages pre-training and fine-tuning mechanisms to enhance the accuracy and flexibility of language models. Additionally, considerable progress in speech recognition and synthesis technologies, including automatic speech recognition (ASR) and text-to-speech (TTS), has empowered intelligent language learning systems with robust voice interaction capabilities (Weng et al., 2023). The design of personalized learning pathways and the integration of real-time feedback mechanisms (Xu et al., 2024; Wang et al., 2022) provide foundational support for improving learner engagement and learning outcomes. Despite advancements in deep learning, speech recognition, and personalized learning strategies, several limitations persist in existing systems. Many studies focus on isolated technological applications, overlooking the potential synergies of integrating multiple technologies.
Moreover, a lack of empirical validation for theoretical models complicates the verification of their practical effectiveness. While personalized learning paths and real-time feedback mechanisms are emphasized, their full integration into existing systems remains a challenge.

The impact of AI on language learning has also garnered increasing attention. Research by Yu et al. highlighted that AI-based personalized learning systems could dynamically adjust content and pacing according to learners’ language proficiency, learning styles, and interests, thereby enhancing language acquisition efficiency (Yu et al., 2022). Liu et al. extended this by proposing that AI’s integration of natural language processing and speech recognition could provide real-time feedback, aiding learners in refining pronunciation, grammar, and expression to bolster practical language application skills (Liu and Quan, 2022). Further, the enhancement of intercultural communication competence through AI has emerged as a significant research area. Shadiev et al. explored the role of AI in developing intercultural communication competence, positing that cross-cultural communication simulations using virtual reality (VR) and augmented reality (AR) technologies immersed learners in authentic cultural contexts, thereby fostering cultural understanding and communication proficiency (Shadiev et al., 2021). This study underscored the value of integrating AI into the cultivation of intercultural competence, proposing that such an approach enriched the language learning experience. Additionally, Bin and Sun discussed the complex dynamics, including personal interests, cultural backgrounds, and linguistic abilities, that influence young people’s interactions on social platforms (Bin and Sun, 2022).
While research suggests that AI holds substantial potential to improve language learning and intercultural communication, many studies still focus on isolated skills such as grammar, vocabulary, or speaking, without exploring the holistic development of language abilities and intercultural competence.

This study aims to construct an intelligent language learning system that leverages advances in AI to enhance learners’ language proficiency and intercultural communication competence. By designing personalized learning pathways and implementing real-time feedback mechanisms, the study explores how deep learning technologies, the BERT model, the Transformer architecture, and an enhanced Agent-Object-Relationship model Based on Consciousness (AORBCO) can be integrated to develop an innovative learning platform. This platform is designed to provide a highly efficient, personalized language learning experience while simultaneously fostering cultural understanding and the development of intercultural communicative skills. Although generative large language models (LLMs), such as GPT and LLaMA, have demonstrated outstanding performance in general language tasks, they exhibit three key limitations in cross-cultural language learning scenarios. First, the generative mechanism of LLMs lacks controllability and may produce outputs that conflict with cultural facts (e.g., confusing the cultural significance of bowing angles across East Asian societies). Second, the inference latency caused by models with tens of billions of parameters hinders real-time interaction. Third, LLMs struggle to capture fine-grained cultural distinctions—for instance, failing to differentiate between the direct questioning strategies typical in Anglo-American business negotiations and the indirect suggestion strategies preferred in Japanese and Korean contexts.
In contrast, this study adopts the BERT model for its core advantage: a bidirectional encoding architecture that enables deep contextualized semantic representations, thereby providing interpretable cultural feature vectors for the AORBCO model. Furthermore, BERT’s lightweight design supports millisecond-level response times on standard Graphics Processing Unit servers, and its deterministic output mechanism aligns seamlessly with the belief–plan logic of AORBCO. This integration mitigates the risk of pedagogical misguidance often introduced by the stochastic nature of LLMs. This technical choice ensures the system’s reliability in terms of cultural accuracy, real-time responsiveness, and educational safety.

The proposed approach incorporates multiple AI techniques to create a comprehensive solution. By combining the BERT model with the Transformer architecture, the system improves upon traditional language learning platforms by enhancing their ability to process long sequences and multimodal data. Furthermore, the refinement of the AORBCO model equips the system with greater flexibility and adaptability, allowing it to dynamically adjust feedback content based on individual learning progress, thereby offering more robust support for personalized instruction. Lastly, the system design places particular emphasis on the simulation and feedback of intercultural communication scenarios, creating a learning environment that closely mirrors real-world cultural contexts. This facilitates the development of learners’ cultural sensitivity and communication strategies in practical exchanges.

The intelligent language learning system developed here aims to address the limitations of traditional language instruction—namely, the lack of personalized guidance, delayed feedback, and insufficient training in cross-cultural contexts.
By integrating deep learning techniques with an enhanced AORBCO model, the system seeks to achieve coordinated optimization of both linguistic proficiency and intercultural communicative competence. The system’s core functionalities are structured around three key components. First, the dynamic generation of personalized learning pathways leverages learners’ language proficiency levels, interests, and historical learning data to produce targeted training plans. Second, the real-time multimodal feedback mechanism employs speech recognition, semantic analysis, and behavioral tracking technologies to provide immediate correction of pronunciation errors, grammatical deviations, and culturally inappropriate expressions, while dynamically adjusting task difficulty. Third, the cross-cultural context simulation utilizes virtual scenarios and a multimedia resource library to construct authentic cultural interaction environments, enabling learners to practice flexible language strategies in diverse cultural settings. To ensure pedagogical effectiveness, the system is designed to meet three critical requirements: efficient processing of long-sequence language data and multimodal inputs; adaptive alignment with the evolving trajectories of learners’ abilities; and deep integration of language skills training with cultural competence development. Through this design, the system aims to overcome the linguistic and cultural barriers learners commonly face in authentic intercultural communication.

Methodology

System architecture design

The intelligent language learning system architecture outlined in this study addresses the growing demand for modern educational technologies that enhance both language learning outcomes and intercultural communication competence. This design seeks to fill existing gaps in the literature while drawing upon a deep understanding of current language education and intercultural communication theories.
By integrating cutting-edge AI technologies, the system aims to improve students’ language skills and prepare them for effective engagement in a globalized, multicultural communication environment. The primary goal of the system architecture is to cater to learners’ diverse needs at various stages of their educational journey through personalized learning paths and real-time feedback mechanisms. This approach facilitates the simultaneous enhancement of language proficiency and intercultural communication capabilities. In contrast to traditional language learning methods, which often face challenges such as limited instructional resources, fixed teaching models, and delayed feedback, the proposed system utilizes intelligent technologies to deliver real-time adjustments and tailored learning resources. These adaptations are driven by students’ progress, language skills, and intercultural communication requirements, optimizing the learning process and outcomes. The architecture consists of three key modules: the learning content module, the feedback mechanism module, and the cultural context simulation module. The learning content module develops personalized learning plans based on the learner’s proficiency and interests, ensuring a targeted approach. The feedback mechanism module employs real-time feedback, speech recognition, and automatic grading systems, providing immediate evaluations and guidance to reinforce learning. The cultural context simulation module immerses students in various intercultural scenarios, allowing them to practice language use and communication strategies in diverse cultural settings. The seamless integration of these modules significantly enhances students’ language proficiency and intercultural communication competence. 
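As a hedged sketch of this three-module decomposition, the interfaces might look as follows; all class names, rules, and data here are illustrative assumptions, not the authors' implementation:

```python
# Illustrative decomposition of the three architecture modules.
# Every rule and data value below is a hypothetical placeholder.

class LearningContentModule:
    """Selects material from proficiency and interests."""
    def plan(self, proficiency: int, interests: list[str]) -> dict:
        # Toy rule: target one difficulty band above current proficiency,
        # drawing topics from the learner's top interests.
        return {"difficulty": proficiency + 1, "topics": interests[:2]}

class FeedbackModule:
    """Scores a response and returns immediate guidance."""
    def evaluate(self, expected: str, actual: str) -> dict:
        correct = expected.strip().lower() == actual.strip().lower()
        return {"correct": correct,
                "hint": None if correct else f"Expected: {expected}"}

class CulturalContextModule:
    """Serves a scenario matching the learner's target culture."""
    SCENARIOS = {"ja": "Exchanging business cards with a bow",
                 "en": "Small talk before a meeting"}
    def scenario(self, culture: str) -> str:
        return self.SCENARIOS.get(culture, "Generic greeting exchange")

# Wiring the modules together for one learning step.
content = LearningContentModule().plan(2, ["travel", "business", "food"])
feedback = FeedbackModule().evaluate("Konnichiwa", "konnichiwa")
scene = CulturalContextModule().scenario("ja")
```

In the actual system these interfaces would be backed by the BERT, Transformer, and AORBCO components described in the following sections; the sketch only shows how the three modules divide responsibility.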
This architectural framework not only aligns with established pedagogical principles but also incorporates the latest advancements in AI-powered personalized education, offering a novel solution to the challenges of language learning and intercultural communication. The system’s architecture is illustrated in Fig. 1.

Fig. 1. Architecture of the intelligent language learning system.

As illustrated in Fig. 1, the modules within the system do not operate in isolation but are orchestrated and scheduled within a unified processing pipeline governed by the AORBCO model. In this system, learners input language materials either through speech or text. These inputs are first converted into normalized textual data via the speech recognition module or text preprocessing module. Subsequently, the system invokes the BERT-based semantic understanding module to perform deep semantic modeling, generating contextual vectors enriched with cultural contextual features. These semantic vectors are then fed into the encoder of a Transformer model for sequential language modeling, with the decoder producing candidate outputs in the target language. The translation results generated at this stage are transmitted to the core AORBCO module. Within the AORBCO module, the system conducts strategy scheduling and output optimization based on the learner’s current ability state, task goals (Desires), belief data (Beliefs), and semantic structure. For instance, when the system detects cultural discrepancies in the input, the AORBCO model accesses its cultural knowledge base and strategy library to implement culturally equivalent substitutions, embed annotations, or perform pragmatic reconstructions of the translation. At the same time, based on feedback from the Ability module (e.g., speech recognition accuracy) and the learning pathway formulated by the Plan module, the system dynamically adjusts the linguistic difficulty, speech rate, and interaction modality of the output.
The optimized results are then delivered to the learner via text, speech synthesis, or virtual interactive interfaces. Moreover, the AORBCO module continuously logs learner performance in real time, updating the learner’s competence profile and autonomously adjusting the content and difficulty of subsequent tasks. This enables a truly closed-loop, personalized learning experience.

To enhance system efficiency, the AORBCO model adopts a multithreaded management mechanism. The semantic understanding module, based on BERT, operates in parallel with the Sequence-to-Sequence (Seq2Seq) translation module, which utilizes the Transformer architecture. Asynchronous communication between these modules reduces latency and improves processing efficiency. The speech recognition and synthesis module further optimizes the decoding process through the implementation of Beam Search and the Viterbi Algorithm, thereby ensuring smooth and responsive real-time interaction. A core strength of the AORBCO model lies in its ability to integrate multimodal data—including text, speech, and images. For example, within cultural context simulations, the system can simultaneously analyze a learner’s vocal tone (via ASR), body movements (captured via camera), and textual responses (processed through semantic understanding), enabling a comprehensive assessment of intercultural communication competence and the provision of targeted feedback.
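The asynchronous hand-off between the understanding stage and the translation stage can be sketched as a queue-based producer/consumer pipeline; the module bodies below are placeholder stubs for illustration, not the system's actual BERT or Transformer components:

```python
import queue
import threading

def semantic_module(texts, out_q):
    # Stub for the BERT-based understanding stage: emit a placeholder
    # "embedding" per input text, then a sentinel marking end of work.
    for t in texts:
        out_q.put((t, [float(len(t))]))  # toy one-dimensional embedding
    out_q.put(None)

def translation_module(in_q, results):
    # Stub for the Transformer-based Seq2Seq stage: consume items as
    # they arrive, so the two stages overlap instead of running serially.
    while True:
        item = in_q.get()
        if item is None:
            break
        text, _vec = item
        results.append(f"translated({text})")

q = queue.Queue()
results = []
producer = threading.Thread(target=semantic_module, args=(["hello", "world"], q))
consumer = threading.Thread(target=translation_module, args=(q, results))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

The queue decouples the two threads, which is the essence of the asynchronous communication described above; a production system would add bounded queue sizes and error handling.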
Technical details of the AORBCO model, including code architecture and multithreaded scheduling algorithms, are elaborated in “Enhanced AORBCO model”.

To further elucidate the contributions of individual modules within the system architecture, the subsequent sections provide detailed discussions:

“Semantic parsing and contextual representation based on BERT” explores the principles underpinning the semantic understanding and representation module.

“Seq2Seq translation task” analyzes how Seq2Seq-based translation modules enhance the efficiency and precision of language conversion.

“Enhanced AORBCO model” examines the augmented capabilities of the improved AORBCO model in handling multimodal data.

“Speech recognition and synthesis” highlights the contributions of speech recognition and synthesis technologies in delivering natural and fluent interaction experiences.

By integrating these modules into a cohesive architecture, the system effectively addresses the challenges of language learning in globalized, multicultural environments.

Semantic parsing and contextual representation based on BERT

This module transforms natural language input into machine-processable intermediate representations (Mu et al., 2021; Xue et al., 2023). The semantic representation understandable by machines refers to the vectorized semantic space generated by the BERT model. This representation draws on the concepts of semantic networks and knowledge graphs, mapping linguistic units such as words, phrases, and sentences into a high-dimensional vector space to form a semantic vector network. Within this network, each word or phrase is represented as a vector, and the distances between vectors reflect their semantic similarity. For example, in the semantic vector space, the vectors for “apple” and “fruit” are positioned relatively close, indicating high semantic relatedness, whereas the distance between “apple” and “car” is comparatively greater, reflecting lower semantic similarity.
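The vector-distance intuition above can be illustrated with cosine similarity over toy vectors; the three-dimensional numbers are invented for illustration, whereas actual BERT vectors are 768-dimensional:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|); closer to 1 means
    # the vectors point in more similar directions.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Invented 3-d stand-ins for the semantic vectors of three words.
apple = [0.9, 0.8, 0.1]
fruit = [0.8, 0.9, 0.2]
car   = [0.1, 0.2, 0.9]

# "apple" lies nearer to "fruit" than to "car" in this toy space.
assert cosine(apple, fruit) > cosine(apple, car)
```

In the real system the same comparison would be made between model-produced contextual vectors rather than hand-written ones.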
This transformation is made possible by the BERT model, a pre-trained language model rooted in the Transformer architecture, celebrated for its profound capacity to grasp contextual nuances (Zhu et al., 2023). Through its sophisticated bidirectional encoder, the BERT model excels in capturing the intricate relationships and contextual depth between words, enabling a holistic semantic modeling of the text (Paganelli et al., 2022). This study constructs semantic representations based on the pretrained weights of the BERT model. The BERT model, having been trained on a large-scale text corpus, captures general semantic features of language. On this foundation, the model is further fine-tuned using domain-specific corpora to better adapt to the requirements of language learning and intercultural communication scenarios. In constructing semantic representations, the approach draws upon standard methodologies from semantic network construction, such as analyzing synonymy, antonymy, and hierarchical (hypernym–hyponym) relationships between lexical items to enrich the semantic vector representations. However, this method places greater emphasis on leveraging deep learning models to automatically learn semantic features, rather than relying entirely on manually defined semantic rules or pre-structured semantic networks.

The structure of the BERT model is illustrated in Fig. 2.

Fig. 2. Intricate architecture of the BERT model.

As shown in Fig. 2, the input text is first converted into word embedding vectors and then augmented with positional encoding before being passed into the Transformer encoder layer. This layer utilizes a self-attention mechanism to compute contextual dependency weights among tokens, ultimately producing a 768-dimensional contextualized vector for each token.
For example, in a culturally embedded sentence such as “The leader made the final decision” (original: “领导拍板决策”), BERT not only captures the literal meaning of “拍板” (“to strike the board”) but also, by analyzing its co-occurrence with terms like “leader” and “decision,” constructs a semantic representation in the vector space that reflects its culturally nuanced meaning of “authoritative decision-making,” rather than mechanically interpreting it as a physical action. These contextualized vectors serve as input features for downstream tasks, enabling the system to recognize and interpret implicit cultural presuppositions embedded in language—capabilities that traditional rule-based or knowledge-graph-driven methods often struggle to achieve. It is important to note that this process does not rely on any predefined ontology or semantic network standard. Instead, semantic association patterns are learned automatically from large-scale corpora through pretraining and fine-tuning, resulting in a dynamic vector space capable of supporting cross-cultural semantic understanding.

The BERT model, at its core, is distinguished by its bidirectional Transformer encoder. Initially, text input undergoes transformation into word embedding vectors, which are subsequently fed into the Transformer encoder following positional encoding. This encoder leverages a self-attention mechanism to refine these word embeddings, producing contextually enriched representations. BERT’s training unfolds in two distinct phases: pre-training and fine-tuning. During pre-training, the model engages in unsupervised learning across a vast corpus, encompassing tasks such as Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In the MLM task, a portion of words in the input sequence is masked at random, and the model’s challenge is to predict these obscured words. The NSP task involves predicting whether a given sentence logically follows another.
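The MLM masking step can be sketched as a toy routine; the 15% rate mirrors the commonly reported BERT default, and the whitespace tokenization is a deliberate simplification:

```python
import random

def mask_tokens(tokens, rate=0.15, seed=1):
    # Randomly replace a fraction of tokens with [MASK]; the model's
    # pre-training objective is to recover the originals (`targets`).
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            masked.append("[MASK]")
            targets[i] = tok  # ground truth the model must predict
        else:
            masked.append(tok)
    return masked, targets

tokens = "the leader made the final decision".split()
masked, targets = mask_tokens(tokens)
```

Real BERT pre-training additionally replaces some selected tokens with random words or leaves them unchanged rather than always inserting [MASK]; the sketch shows only the basic masking idea.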
Transitioning to the fine-tuning stage, the pre-trained BERT model undergoes supervised training tailored to specific downstream applications, such as language comprehension and translation within the realm of intelligent language learning systems (Pavlick, 2022; Kades et al., 2021).

In intelligent language learning systems, the semantic understanding and representation module is critical for converting natural language inputs into machine-readable semantic representations using the BERT model. This transformation establishes a robust semantic foundation for subsequent operations, such as language generation and translation, facilitating seamless interaction among system components. Through its bidirectional encoder, BERT effectively captures contextual relationships between words, enabling the system to comprehend and process intricate semantics within input text. Further details on utilizing these semantic representations for language translation tasks are elaborated in “Seq2Seq translation task”, which discusses their integration within the language learning workflow.

Seq2Seq translation task

The Seq2Seq translation task stands as a cornerstone in natural language processing, characterized by its pivotal role in converting one sequence—such as text in a source language—into another sequence, typically in a target language (Shi et al., 2021). At the heart of this transformation, the Transformer model emerges as a sophisticated tool, offering a robust framework for efficient Seq2Seq translation. Unlike traditional models such as Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks, which rely on recursive structures, the Transformer model employs self-attention mechanisms and multi-head attention mechanisms. These mechanisms allow the model to analyze input sequences in parallel, substantially improving training efficiency and overall performance.
By focusing on relationships between words across an entire sequence, the Transformer model enhances the precision and scalability of language translation, making it a cornerstone in modern natural language processing tasks.

The Transformer model, depicted in Fig. 3, features a dual-component architecture: the encoder and the decoder (Yilmaz et al., 2024). This model’s distinction lies in its exclusive reliance on the Attention Mechanism, a design choice that facilitates superior sequence modeling and translation efficiency (Grechishnikova, 2021). By eschewing recurrent structures and leveraging self-attention mechanisms, the Transformer model addresses the complexities inherent in Seq2Seq tasks with remarkable efficacy, advancing the field of natural language processing with its innovative approach.

Fig. 3. Architecture of the Transformer model.

Figure 3 illustrates the intricate architecture of the Transformer encoder unit, a quintessential component of the model. Each encoder unit is meticulously structured into three principal segments: Self-Attention, Add & Norm, and Feed Forward. In an intelligent language learning system, the primary function of the Transformer model is to enable efficient and precise language translation. Through the encoding and decoding of source and target language sequences, the Transformer facilitates the conversion from one language to another. The self-attention mechanism plays a crucial role by allowing each word in the input sequence to reference the entire context when generating the corresponding output word, ensuring that translations are both smooth and accurate. During the translation process, the Transformer first encodes the input source sentence, then decodes it into the target language.
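This step-by-step generation of the target sentence can be sketched as a toy greedy decoder; the probability table below is invented for illustration and, unlike a real decoder, conditions only on the previous token rather than the full prefix:

```python
def greedy_decode(cond_prob, start="<s>", end="</s>", max_len=10):
    # At each step, pick the token with the highest conditional
    # probability given the previous token (a toy first-order table).
    seq = [start]
    while len(seq) < max_len:
        candidates = cond_prob.get(seq[-1], {end: 1.0})
        seq.append(max(candidates, key=candidates.get))
        if seq[-1] == end:
            break
    return seq[1:-1]

# Hypothetical conditional-probability table for illustration only.
table = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"leader": 0.7, "cat": 0.3},
    "leader": {"decides": 0.8, "</s>": 0.2},
    "decides": {"</s>": 0.9, "again": 0.1},
}
result = greedy_decode(table)  # picks "the", then "leader", then "decides"
```

A real Transformer decoder computes these conditional probabilities from the encoder's context vectors and the full generated prefix; the greedy selection rule itself is the same one formalized later in Eq. (1).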
The self-attention mechanism not only captures contextual relationships but also handles linguistic complexities such as synonyms, polysemy, and variations in word order, resulting in translations that are both grammatically correct and contextually appropriate.

When compared to traditional sequence models, the Transformer model stands out for its parallel computation capabilities and its ability to handle long sequences effectively. Unlike models that rely on recursive structures, the Transformer efficiently processes sequences, significantly accelerating training, particularly when dealing with large-scale corpora. Moreover, the multi-head attention mechanism enhances the model’s capacity by considering multiple layers of relationships between words during translation, thereby increasing accuracy. Within intelligent language learning systems, the Transformer model finds broad application across various translation tasks, including language alignment, vocabulary selection, and syntactic adjustments. Through these functions, the Transformer aids learners in better grasping language structures, ultimately advancing their cross-cultural communication competence.

This study employs the standard Transformer-based greedy decoding strategy within the Seq2Seq translation module. The core mechanism of this strategy involves selecting the token with the highest conditional probability at each decoding step to generate the target sequence. Specifically, the decoder generates tokens in the target language progressively by leveraging the context vector output from the encoder. This process is facilitated by the self-attention mechanism and multi-head attention, which enable the decoder to capture both global and local dependencies within the input sequence. In the translation generation phase, the decoder employs Greedy Search to produce the target language sequence, as depicted in Fig. 4.

Fig. 4. Greedy Search process.

The crux of Greedy Search lies in its iterative approach: at each stage, the option that appears to be the most advantageous is selected, with the aspiration of ultimately arriving at a globally optimal solution. The specific mechanics of Greedy Search are encapsulated in Eq. (1):

$${w}_{t}={{\rm{argmax}}}_{w\in V}P({w|}{w}_{1},{w}_{2},\cdots ,{w}_{t-1})$$(1)

In Eq. (1), \({w}_{t}\) represents the t-th word generated in the sequence, V denotes the vocabulary set, and \(P({w|}{w}_{1},{w}_{2},\cdots ,{w}_{t-1})\) signifies the conditional probability of the word w based on the preceding context of words \({w}_{1},{w}_{2},\cdots ,{w}_{t-1}\). The Greedy Search mechanism thus strives to maximize this probability at each step, constructing the sequence one token at a time based on local optimality.

Enhanced AORBCO model

Functioning as the system’s core, the AORBCO model addresses the inherent challenges in processing long sequences and multimodal data encountered in traditional approaches. By incorporating attention mechanisms and leveraging optimized algorithms, this model enhances the system’s ability to handle complex data inputs. The integration of diverse information sources, including text, speech, and images, enables precise language comprehension and generation. This capability is particularly critical in facilitating cross-cultural communication and refining personalized learning path design. Additionally, the AORBCO model’s robust handling of long-sequence data supports the system’s overall performance, delivering significant improvements in both learning outcomes and system reliability.

The AORBCO model employs attention mechanisms and optimized algorithms to prioritize and process complex data, effectively identifying critical features and dynamically adapting to contextual variations.
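The attention-based weighting invoked here can be illustrated with a pure-Python scaled dot-product attention step, softmax(QK^T / sqrt(d_k)) V, on a toy two-token example; real models operate on learned, high-dimensional projections:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: each query attends over all keys,
    # and the output is the weight-averaged value vectors.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two toy tokens with 2-d queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, weighted by how strongly the corresponding query matches each key; this is the mechanism that lets every position reference the whole sequence.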
Unlike traditional architectures such as RNNs or LSTM networks, which frequently encounter challenges such as gradient vanishing or explosion during long-sequence processing, the AORBCO model leverages self-attention mechanisms to evaluate the contextual importance of each element in the input. This approach minimizes information loss and ensures more reliable handling of extended sequences.

In the domain of multimodal data processing, the AORBCO model exhibits significant strengths. By simultaneously integrating inputs from text, speech, and other modalities, it enhances the system’s capacity to process cross-modal data effectively. This feature is critical for tasks that require the synthesis of multiple data types. For example, in speech recognition and translation scenarios, the AORBCO model processes speech and textual data concurrently, achieving improved translation accuracy. Moreover, the self-attention mechanism embedded within the model captures intricate relationships and interdependencies among diverse data types, allowing for a comprehensive understanding of multimodal inputs. Such capabilities are particularly vital in addressing the complexities of cross-cultural communication and the nuanced expressions inherent in advanced language tasks. By optimizing the processing of multimodal and long-sequence data, the AORBCO model contributes to enhanced language comprehension and application, particularly in settings requiring precision and adaptability.

The AORBCO model functions as an integral component within the intelligent language learning system, facilitating advancements in several key areas:

1. Speech Recognition and Translation Tasks: Leveraging advanced attention mechanisms, the AORBCO model extracts critical features from lengthy speech or text sequences, ensuring precise translation and comprehension.
This functionality proves particularly advantageous in multilingual translation and spoken language training, supporting learners in improving both linguistic accuracy and cross-cultural communication capabilities.

2. Personalized Learning Path Design: By analyzing learning progress and contextual needs, the AORBCO model synthesizes data from speech, text, and other modalities to design individualized learning trajectories. These trajectories are dynamically adjusted to match the learner’s evolving proficiency, preferences, and challenges, creating a tailored educational experience that promotes sustained engagement and progress. Personalized learning pathways play a crucial role in enabling adaptive educational experiences. By analyzing learners’ behavioral data—such as study duration, task completion rates, and frequency of feedback interactions—along with information on their cultural background and language proficiency, the system employs machine learning algorithms to construct preliminary learner profiles. Based on these profiles, the system can dynamically adjust the difficulty and type of learning content, thereby delivering customized learning trajectories for each individual. For instance, for learners who exhibit high cultural sensitivity in intercultural communication simulations but demonstrate lower fluency in language expression, the system increases the proportion of language production exercises and provides more challenging linguistic materials. This process does not rely on complex deep learning models; rather, it is grounded in simple rule-based matching and statistical analysis. The aim is to establish a foundational framework for the design of personalized learning pathways.

3. Real-Time Feedback Mechanism: Within the real-time feedback framework, the AORBCO model assesses learner performance continuously, offering immediate and context-sensitive evaluations.
This mechanism allows learners to identify and address errors promptly, enhancing learning efficiency and enabling continuous improvement in language acquisition.

The AORBCO model’s ability to process long-sequence and multimodal data sets it apart from traditional approaches. Its robust attention mechanisms and algorithmic optimizations contribute to more accurate semantic understanding, efficient translation, and enhanced interaction experiences within the system. These advancements provide critical technical support for improving language comprehension and production in diverse learning contexts.

The enhanced AORBCO model is engineered upon the Abstract Window Toolkit (AWT), a robust Graphical User Interface (GUI) library, which underpins the core framework of advanced intelligent systems. AWT’s provision of a visually intuitive interface facilitates seamless interaction with, and monitoring of, the agent’s status, thereby significantly bolstering development efficiency. Integration with the Java Agent Development Framework (JADE) amplifies the platform’s capability, allowing for comprehensive simulation of the AORBCO model’s functions through a Multi-Agent System. The architecture of the platform is meticulously crafted with scalability in mind, incorporating a diverse array of controller interfaces. This design accommodates the complex demands of system expansion and adaptation. Within this infrastructure, four principal management threads are delineated, each dedicated to overseeing the Belief, Ability, Desire, and Plan components of the Ego model. A specialized behavior control mechanism (BCM) thread orchestrates the interplay between these components, ensuring that their operations remain harmonious and effective. The sophisticated technologies and methodologies embedded in this platform confer precise simulation capabilities for the Ego’s functionalities.
This, in turn, provides a formidable foundation for advanced exploration and development of intelligent systems based on the Ego model.

In the enhanced AORBCO model proposed in this study, the “Ego Model” functions as the central intelligent decision-making module, enabling coordinated processing of cross-modal data through a four-dimensional BDP framework. Specifically, when the system receives input involving complex cultural contexts (e.g., “In Western cultures, handshaking is a common form of greeting, whereas in Japan, bowing is more typical”), the Belief Module is first activated to retrieve information from the cultural knowledge graph. This module integrates a cross-cultural communication database containing over 2,000 etiquette rules and contextual tags, and automatically identifies key cultural elements (e.g., associating “handshaking” with Western cultural tags and “bowing” with Japanese cultural tags). Subsequently, the Desire Module dynamically adjusts the prioritization of translation strategies based on the learner’s target language (e.g., Japanese) and current cultural sensitivity score, which is calculated from historical interaction data. If the learner is at an early stage of cross-cultural communication proficiency, the system prioritizes literal translation supplemented with cultural annotations. Conversely, if the learner demonstrates a foundational understanding of cultural norms, the system adopts culturally equivalent substitution strategies to enhance contextual appropriateness.

In this process, the Ability Module continuously evaluates the system’s available technological resources. Using ASR, the module detects potential cultural prosodic deviations in the learner’s pronunciation, such as the frequency of honorific usage in Japanese.
Simultaneously, the semantic understanding component, based on BERT, analyzes implicit cultural presuppositions embedded in the source sentence—for instance, the relativity of the term “common” within Western cultural contexts. Ultimately, the Plan Module generates a multi-stage processing strategy. This includes cultural tag injection, whereby cultural feature weights are enhanced in the attention layer of the Seq2Seq translation module—for example, by enriching the contextual vector of the term “bow” with Japanese etiquette attributes. In cases where the target language lacks a direct equivalent—such as culturally specific idioms—the model invokes the AORBCO cultural equivalence lexicon to substitute appropriate terms and inserts synthesized explanatory narration via text-to-speech. Additionally, multimodal feedback is provided: along with the translated text output, a 3D etiquette animation is presented through an AR interface to reinforce cultural comprehension.

As illustrated in Fig. 5, the file directory structure of the AORBCO model development platform exhibits a well-organized schema. The platform’s architecture is delineated into four principal packages: Belief, Ability, Desire, and Plan. Each of these packages is tasked with executing its specific functions independently. Notably, these modules do not engage in direct communication with one another; rather, they are orchestrated through a central BCM. The BCM serves as the critical hub for managing interactions within the system. It handles the data influx from the front-end interface and directs operations in alignment with the control flow chart. BCM is instrumental in mediating between the threads of Belief, Ability, Desire, and Plan, ensuring that each component performs its designated function without being encumbered by extraneous communication tasks. This strategic decoupling enhances system stability and operational efficiency.
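The Belief-then-Desire sequence described above, in which cultural elements are first identified and a translation strategy is then prioritized by sensitivity score, can be sketched as follows. The tag names, the sensitivity threshold, and both function names are hypothetical illustrations, not the AORBCO implementation:

```python
# Hypothetical sketch of the Belief/Desire interplay described in the text.
# Tags and the 0.5 threshold are illustrative stand-ins.

CULTURAL_TAGS = {  # toy stand-in for the cultural knowledge graph
    "handshaking": "Western greeting etiquette",
    "bowing": "Japanese greeting etiquette",
}

def identify_cultural_elements(text):
    """Belief Module step: map surface terms to cultural tags."""
    return {term: tag for term, tag in CULTURAL_TAGS.items() if term in text}

def choose_translation_strategy(sensitivity_score, threshold=0.5):
    """Desire Module step: prioritize a strategy from the learner's
    cultural sensitivity score (early-stage learners get annotations)."""
    if sensitivity_score < threshold:
        return "literal translation + cultural annotations"
    return "culturally equivalent substitution"

elements = identify_cultural_elements(
    "In Western cultures, handshaking is a common form of greeting, "
    "whereas in Japan, bowing is more typical")
print(elements)
print(choose_translation_strategy(0.3))
```

The point of the sketch is the ordering: belief retrieval supplies the cultural facts before the desire stage commits to a strategy.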
Furthermore, the effective scheduling and management facilitated by the BCM not only synchronize the threads but also foster automatic operation and intelligent oversight of the platform. This streamlined approach enhances the overall coherence and effectiveness of the system’s operations.

Fig. 5: File directory structure of the AORBCO model development platform.

In the intricate architecture of the AORBCO model development platform, the Ego package and the GUI package play pivotal roles. The Ego package orchestrates system startup and configuration file loading, while the GUI package oversees user interface functionalities, ensuring seamless user interactions. Within the Model package, diverse entities and their interrelationships are meticulously categorized. This package includes elements such as Node, Relation, and Customer Map, which are essential for comprehensive system modeling. Complementing these core packages, various utility packages and a Resources folder are integrated into the platform. The utility packages facilitate the development process, while the Resources folder houses system configuration files and other related assets. The harmonious integration of these modules and files constructs a robust and cohesive AORBCO model component. This integration significantly enhances the platform’s efficiency, streamlining both development and application phases for developers.

Upon initiating the JADE development platform, the EgoStart() method within the model startup class undertakes the creation of multiple threads, each tasked with registering and activating the primary container for modules such as belief and desire. Each thread operates with its own dedicated message queue, responsible for receiving request messages from other threads. Upon receipt, the system identifies the appropriate controller to process the message, effectively sidestepping direct inter-module communication.
This architectural choice diminishes system complexity, fosters modular isolation, and enhances the stability of message transmission. This configuration endows the JADE platform with formidable concurrent processing capabilities and minimizes coupling between components. Consequently, developers gain a more refined ability to orchestrate and manage interactions among various threads, paving the way for the construction of a more intelligent and efficient system.

The AORBCO communication model is intricately founded on a sophisticated knowledge structure. This model utilizes description language as a tool for semantic articulation of underlying knowledge while employing communication language to delineate the rules governing knowledge exchange. The overarching aim is to forge an intelligent model for semantic communication, with a focus on facilitating both knowledge exchange and reasoning processes. At the foundational level of the model lies the message transport layer, an element directly attuned to practical application and governed by computer network protocols. Depending on the specific application needs, the system may opt for either the Transmission Control Protocol/Internet Protocol (TCP/IP) or the Hypertext Transfer Protocol. In the case of the AORBCO communication model, the TCP/IP protocol was selected for its robustness. Ascending to the middle layer, the communication language layer plays a pivotal role. It ensures that both Ego and acquaintances are able to comprehend and accurately relay messages. This layer meticulously defines the communication protocol for messages, which encompass semantic content to facilitate contextual understanding. Messages exchanged between Ego and acquaintances may include a range of semantic functions such as requests, inquiries, notifications, and responses. The communication layer is further dissected into two distinct segments: the pragmatic layer and the semantic layer.
The pragmatic layer outlines the rules governing the application of pragmatic terms, while the semantic layer provides clarity on the meanings of semantic terms. At the pinnacle of the model, the semantic understanding layer serves as the core component responsible for parsing and reconstructing cross-module semantic flows. When a learner inputs a statement, this layer first deconstructs the utterance into its pragmatic labels and semantic parameters, initiating a multi-level processing sequence. It then accesses the cultural strategy library within the Belief module to identify culturally equivalent expressions for acts such as “indirect refusal” in the target culture. Based on the real-time load status provided by the Ability module, it allocates semantic transformation tasks to the Transformer or activates the VR module to generate an appropriate negotiation scenario. Finally, by integrating stage-specific strategies from the Plan module, the system produces culturally annotated feedback tailored to the learner’s communicative context. Figure 6 depicts the specific structure of the AORBCO communication model.

Fig. 6: AORBCO communication model.

The AORBCO model functions as a central component in enhancing language acquisition and cross-cultural communication capabilities. Revisions to the original model focused on addressing limitations that hindered the system’s adaptability and personalization. Initially, the static nature of the learning path design failed to fully accommodate individual learner differences. To address this shortcoming, a dynamic adjustment mechanism was introduced, enabling real-time performance tracking and feedback. This mechanism allows the system to autonomously adapt the learning content and task difficulty, ensuring that each learner receives a personalized learning experience tailored to their progress, thereby optimizing learning efficiency.
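The dynamic adjustment mechanism described above can be sketched as a simple rule, consistent with the earlier statement that the learning-path design relies on rule-based matching rather than deep models. The completion-rate thresholds and level names here are hypothetical stand-ins:

```python
# Hypothetical rule-based difficulty adjustment; the 0.85 / 0.50
# thresholds and the level names are illustrative, not from AORBCO.

def adjust_difficulty(completion_rate, current_level,
                      levels=("beginner", "intermediate", "advanced")):
    """Move the learner up or down one difficulty level based on a
    simple task-completion-rate rule (no deep learning involved)."""
    i = levels.index(current_level)
    if completion_rate > 0.85 and i < len(levels) - 1:
        return levels[i + 1]   # performing well: harder material
    if completion_rate < 0.50 and i > 0:
        return levels[i - 1]   # struggling: easier material
    return current_level       # otherwise keep the current level

print(adjust_difficulty(0.9, "beginner"))   # promotes one level
print(adjust_difficulty(0.4, "advanced"))   # demotes one level
```

A real system would fold in more signals (study duration, feedback frequency, cultural-sensitivity scores), but the control structure stays the same: profile statistics in, content level out.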
Furthermore, the original model’s approach to cross-cultural communication lacked sufficient cultural diversity and complexity. To rectify this, the model was enhanced with diverse cultural context simulations. These include language-switching scenarios, conflict resolution strategies, and culturally specific communication tactics, enabling learners to engage in a broader range of real-world communication situations. Such inclusivity strengthens cross-cultural communication skills and prepares learners to navigate various intercultural interactions. To amplify immersion and interactivity, multimodal feedback mechanisms were integrated into the model. These include speech recognition, speech synthesis, and visual feedback, which offer learners a comprehensive interaction with the system. In addition to textual and linguistic feedback, these enhancements evaluate speech fluency, pronunciation accuracy, and other aspects of language production, providing a richer, more practical learning experience. In traditional language learning systems, feedback is often fixed, unable to adapt to the specific performance of learners. The revised AORBCO model incorporates an adaptive feedback mechanism, which tailors the feedback process to the learner’s real-time performance. For example, when a learner makes a grammatical error during spoken practice, the system not only identifies the mistake but also recommends follow-up tasks suited to the learner’s proficiency level, facilitating targeted improvement. This real-time adjustment ensures that learners receive the most relevant and effective support at every stage of their development.

The AORBCO model employs a multi-thread management framework, facilitating the coordinated operation of diverse modules within the intelligent language learning system. Central to this framework, the BERT model performs semantic analysis on the input text.
Utilizing its bidirectional self-attention mechanism, BERT extracts and processes contextual relationships within the text, forming a robust semantic foundation for subsequent translation tasks. The AORBCO model orchestrates the interaction between the BERT model and other integral system components, such as the Seq2Seq translation module, through its specialized BCM. Upon receiving user input, the AORBCO model first leverages the BERT model to derive a semantic representation of the source language text. This representation is then transmitted to the AORBCO model, which updates the internal states of its Belief, Desire, Ability, and Plan modules. These updates are driven by the semantic analysis of the input, guiding the system in determining the appropriate next steps. In this process, the AORBCO model not only achieves semantic comprehension of the input but also dynamically adapts the task’s priority and translation approach based on the current states of the Belief and Desire modules. This ensures that the generated translation aligns with both the semantic requirements of the context and the task-specific goals.

The Seq2Seq translation module, integral to the conversion of languages, operates through the Transformer architecture to facilitate automatic translation from the source to the target language. Within the AORBCO model framework, this module’s input and output processes are managed through the BCM. The AORBCO model selects optimal translation strategies and task scheduling protocols based on the specific demands of the translation task, ensuring that the semantic coherence of the translation is preserved while maintaining the fluency of the target language. Guided by the AORBCO model’s scheduling capabilities, the Seq2Seq translation module goes beyond basic translation tasks by dynamically adjusting the tone and phrasing in response to contextual cues.
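Both the BERT module and the Transformer-based Seq2Seq module rest on the same scaled dot-product attention primitive. A minimal NumPy sketch of that computation (illustrative only, not the system's actual code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention:
    softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))  # each row of weights sums to 1
```

Because every token attends to every other token in one step, long-range dependencies are captured without the recurrent bottleneck that causes vanishing gradients in RNNs or LSTMs.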
When cross-cultural contextual elements are detected within the input, the AORBCO model directs the Seq2Seq module to focus on the nuanced expression of cultural differences during translation. This enables the model to enhance cross-cultural communication and improve the precision and appropriateness of the translated output.

The AORBCO model distinguishes itself through its modular architecture and robust thread management, enabling seamless collaboration across multiple system modules. In a cross-cultural language learning system, AORBCO ensures that linguistic variations across cultures are effectively identified and addressed by leveraging its task scheduling and BCMs. In particular, when handling complex sentences within cultural contexts, the AORBCO model guides the Seq2Seq translation module to select culturally appropriate expressions, promoting smoother, more natural communication across different cultures. By aligning BERT’s semantic understanding with Seq2Seq translation capabilities, the AORBCO model not only refines translation accuracy but also augments the system’s adaptability to multicultural contexts through its intelligent feedback system. This allows for the delivery of more tailored and precise learning materials while enhancing the system’s ability to navigate cross-cultural nuances effectively.

For sentences with complex cultural contexts, such as “In the West, a handshake is a common greeting, but in Japan, bowing is more typical,” the AORBCO model first employs the BERT model to thoroughly analyze the semantics, recognizing key cultural elements (e.g., handshake in Western culture, bowing in Japan). This cultural context is subsequently relayed to the Seq2Seq translation module. When the aforementioned example is processed by the system, the AORBCO model first identifies the cultural attributes of the terms “handshake” and “bowing” using BERT.
Subsequently, within the Belief Module, the model retrieves culturally specific distinctions relevant to the Japanese context, such as differentiating between “eshaku” (a 15-degree bow) and “saikeirei” (a 45-degree bow). Based on the learner’s current proficiency level (assumed to be intermediate), the Plan Module determines that the cultural contrast structure of the source language should be preserved in the translation output. Additionally, during the TTS phase, the system appends an explanatory narration to enhance the learner’s understanding of the cultural nuances. Within this process, the AORBCO model adjusts the translation strategy to retain the cultural essence of the original sentence, while ensuring modifications align with the target culture’s linguistic norms. Consequently, the translation generated by the Seq2Seq model preserves the sentence’s core meaning and integrates cultural nuances relevant to the target culture’s social customs.

The improved AORBCO model is built upon the AWT and the JADE Framework, providing robust multi-agent system support for the intelligent language learning system. This module enhances the processing capability of complex long sequences and multimodal data through an advanced attention mechanism, working in tandem with other modules in the system to construct a more intelligent and adaptable language learning platform. Building on the semantic understanding and translation modules, the AORBCO model further strengthens the system’s cross-cultural communication abilities, ensuring that the system can address diverse language learning needs.
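The BCM pattern described earlier, in which each module thread consumes only its own message queue and a central hub performs all routing, can be sketched with Python's standard threading and queue modules. The actual platform is Java/JADE; the message contents below are hypothetical:

```python
import queue
import threading

# Hypothetical sketch of the BCM routing pattern: each module thread owns
# a message queue; modules never talk to each other directly, only via
# the BCM dispatcher.

modules = {name: queue.Queue()
           for name in ("Belief", "Ability", "Desire", "Plan")}
results = []

def module_worker(name, inbox):
    """Each module thread consumes only its own dedicated queue."""
    while True:
        msg = inbox.get()
        if msg is None:          # shutdown signal
            break
        results.append(f"{name} handled {msg!r}")

def bcm_route(target, payload):
    """The BCM is the single hub that dispatches messages to modules."""
    modules[target].put(payload)

threads = [threading.Thread(target=module_worker, args=(n, q))
           for n, q in modules.items()]
for t in threads:
    t.start()

bcm_route("Belief", "update cultural tags")
bcm_route("Plan", "schedule translation task")

for q in modules.values():       # orderly shutdown
    q.put(None)
for t in threads:
    t.join()
print(results)
```

Decoupling the modules behind one dispatcher is what lets each thread run without blocking on, or even knowing about, its siblings.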
In the subsequent “Speech Recognition and Synthesis” section, the role of the speech recognition and synthesis modules will be discussed, focusing on how deep neural network technology is utilized to provide learners with a realistic and smooth language interaction experience.

Speech recognition and synthesis

Speech recognition and synthesis technologies are fundamental components within the intelligent language learning system, playing a crucial role in enhancing learner interaction by converting spoken input into text and generating natural speech from text output. This module integrates with both the system’s semantic understanding and representation as well as the Seq2Seq translation modules, facilitating seamless transitions between speech and text. In speech recognition, ASR technology, driven by deep neural networks, captures minute distinctions in speech, providing precise input for subsequent language processing. Similarly, TTS technology produces fluent, natural-sounding speech output, improving the learner’s immersion in the learning environment. ASR and TTS technologies are integral to the evolution of advanced language learning systems, driving their sophistication and functionality.

Within the ASR module, the Viterbi algorithm is employed during the decoding process. This algorithm, a staple in dynamic programming, excels in determining the optimal path when a model’s structure is predefined. In speech recognition applications, the Viterbi algorithm integrates the outputs from both the acoustic and language models, selecting the most probable text sequence corresponding to the spoken input. Despite its effectiveness, the Viterbi algorithm encounters limitations when processing lengthy sentences or large vocabularies.
The need to compute the probability of every potential path and subsequently select the optimal one results in exponential growth in computational demands with longer sequences and larger vocabularies, ultimately reducing its efficiency.

Several optimization techniques have been implemented to address the computational challenges associated with speech recognition. Pruning techniques strategically eliminate unlikely paths during computation, thus reducing the computational burden. By setting predefined thresholds, paths are pursued only when their probability surpasses a critical value, effectively minimizing redundant calculations. Another widely utilized optimization method is Beam Search, which limits the number of optimal paths considered at any given time, as opposed to evaluating all potential paths. This approach significantly accelerates the decoding process. While Beam Search does not always guarantee the discovery of the optimal path, its computational efficiency surpasses that of the traditional Viterbi algorithm, particularly in contexts involving long sentences and extensive vocabulary sizes. Despite the computational bottlenecks inherent in the Viterbi algorithm for large-scale data and complex scenarios, its global optimality and higher accuracy made it the preferred choice in this study. However, exploring the integration of the Viterbi algorithm with pruning or Beam Search techniques could offer a pathway to optimizing the decoding process and enhancing real-time system performance.

In the implementation of the speech recognition and synthesis modules, the study adheres to a classical framework that integrates signal processing with deep learning techniques. The ASR component is constructed using a hybrid Hidden Markov Model–Deep Neural Network (HMM-DNN) architecture (Shahin et al., 2021).
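The combination of Viterbi decoding with beam-width pruning discussed above can be sketched on a toy two-state HMM. All transition and emission probabilities below are illustrative:

```python
import math

def viterbi_beam(obs, states, log_start, log_trans, log_emit, beam=10):
    """Viterbi decoding with beam pruning: at each step keep only the
    `beam` highest-scoring states instead of all of them."""
    # scores: state -> (log probability, best path so far)
    scores = {s: (log_start[s] + log_emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        # prune: retain the top-`beam` states from the previous step
        kept = dict(sorted(scores.items(), key=lambda kv: kv[1][0],
                           reverse=True)[:beam])
        scores = {}
        for s in states:
            # best surviving predecessor for state s
            prev, (lp, path) = max(
                kept.items(),
                key=lambda kv: kv[1][0] + log_trans[kv[0]][s])
            scores[s] = (lp + log_trans[prev][s] + log_emit[s][o],
                         path + [s])
    return max(scores.values(), key=lambda v: v[0])[1]

# Toy 2-state HMM with illustrative probabilities
lg = math.log
states = ["V", "C"]                       # vowel-like / consonant-like
log_start = {"V": lg(0.5), "C": lg(0.5)}
log_trans = {"V": {"V": lg(0.3), "C": lg(0.7)},
             "C": {"V": lg(0.6), "C": lg(0.4)}}
log_emit = {"V": {"a": lg(0.8), "t": lg(0.2)},
            "C": {"a": lg(0.1), "t": lg(0.9)}}
print(viterbi_beam(["t", "a", "t"], states, log_start,
                   log_trans, log_emit, beam=2))
```

With `beam` at least as large as the state set, this reduces to exact Viterbi; shrinking `beam` trades the global-optimality guarantee for the speedup the text describes.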
Acoustic features are extracted from speech signals using MFCCs (Saxena et al., 2022) through the following process:

(1) Pre-emphasis: Applied to compensate for high-frequency attenuation, using the formula \({s}^{{\prime} }\left(t\right)=s\left(t\right)-0.97\cdot s(t-1)\);

(2) Framing and Windowing: The speech signal is segmented using a 25 ms frame length and a 10 ms frame shift, followed by the application of a Hamming window \(\omega \left(n\right)=0.54-0.46\cos \left(\frac{2\pi n}{N-1}\right)\);

(3) Fourier Transform: The power spectrum of each frame is calculated using \(P\left(k\right)={\left|{FFT}({s}_{\omega }\left(t\right))\right|}^{2}\);

(4) Mel Filter Bank Processing: A set of 40 triangular filters maps linear frequency to the Mel scale, producing log-energy outputs \(E\left(m\right)=\sum _{k}P(k)\cdot {H}_{m}(k)\);

(5) Discrete Cosine Transform: The cepstral coefficients are extracted using \(C\left(n\right)=\mathop{\sum }\nolimits_{m=1}^{M}\log \left(E\left(m\right)\right)\cdot \cos \left(\frac{\pi n\left(m-0.5\right)}{M}\right),n=1,\cdots ,13\).

Decoding optimization is performed using the Viterbi Algorithm (Rowshan and Viterbo, 2021), which employs dynamic programming to determine the most probable state sequence. The recurrence relation is given by \({\delta }_{t}(j)={\max }_{i}[{\delta }_{t-1}(i)\cdot {a}_{{ij}}]\cdot {b}_{j}({o}_{t})\), where \({a}_{{ij}}\) denotes the state transition probability, and \({b}_{j}({o}_{t})\) represents the observation likelihood. To further balance computational efficiency and recognition accuracy, Beam Search (Libralesso et al., 2022) is incorporated for pruning, retaining only the top \(k=10\) most probable paths at each time step.

The ASR module serves a pivotal role by transforming spoken language into written text. This process hinges on several critical phases: feature extraction, acoustic modeling, language modeling, and decoding.
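The five-step MFCC pipeline above can be sketched in NumPy. The 25 ms/10 ms frame parameters, 40 filters, and 13 coefficients follow the text; the 16 kHz sampling rate and 512-point FFT are assumed for illustration, and this is a simplified extractor rather than production code:

```python
import numpy as np

def mfcc_sketch(signal, sr=16000, n_filters=40, n_ceps=13):
    """Simplified MFCC extraction following the five steps in the text."""
    # (1) Pre-emphasis: s'(t) = s(t) - 0.97 * s(t-1)
    s = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # (2) Framing (25 ms frames, 10 ms shift) and Hamming windowing
    flen, fshift = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(s) - flen) // fshift
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = s[idx] * np.hamming(flen)
    # (3) Power spectrum of each frame
    nfft = 512
    P = np.abs(np.fft.rfft(frames, nfft)) ** 2
    # (4) Mel filter bank: 40 triangular filters, then log energies
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    H = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        H[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    E = np.log(P @ H.T + 1e-10)
    # (5) Discrete cosine transform -> first 13 cepstral coefficients
    n = np.arange(1, n_ceps + 1)[:, None]
    m = np.arange(1, n_filters + 1)[None, :]
    return E @ np.cos(np.pi * n * (m - 0.5) / n_filters).T

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
feats = mfcc_sketch(x)
print(feats.shape)  # one 13-coefficient vector per 10 ms frame
```

One second of audio yields 98 frames of 13 coefficients, which then serve as the observation sequence for the HMM-DNN acoustic model.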
In the initial phase of feature extraction, the ASR system dissects the speech signal to derive its core features using MFCCs, which are extracted through the five-step process described earlier, including pre-emphasis, framing/windowing, the Fourier transform, Mel filter bank processing, and the discrete cosine transform.

The acoustic model functions as the bridge between the extracted features and their corresponding phonemes or words. Within the realm of acoustic modeling, techniques such as Hidden Markov Models (HMMs) and Gaussian Mixture Models are frequently employed. In the context of this study, the decision is made to utilize an acoustic modeling approach grounded in Hidden Markov Models. This choice is particularly suited for handling the temporal dynamics inherent in speech signals with efficiency and precision. Following this, the language model takes center stage, tasked with calculating the likelihood of the recognized text. This step is crucial for enhancing the overall accuracy of speech recognition. The n-gram model was selected to capture and represent word sequences, with the probability of each sequence determined by Eq. (2):

$$P({w}_{1},{w}_{2},\cdots ,{w}_{N})=\mathop{\prod }\limits_{i=1}^{N}P({w}_{i}|{w}_{i-1},\cdots ,{w}_{i-n+1})$$

(2)

The final stage, decoding, involves synthesizing the outputs from both the acoustic model and the language model. The Viterbi algorithm is employed to identify the optimal path, effectively navigating through the possible sequences to find the one that best matches the spoken input. The decoding score is calculated using Eq. (3):

$${Score}({w}_{1},\cdots ,{w}_{N})=\mathop{\sum }\limits_{i=1}^{N}\log P({w}_{i}|{w}_{i-1},\cdots ,{w}_{i-n+1})$$

(3)

TTS technology transforms written text into speech that is both natural and fluent. The TTS module is implemented using the end-to-end Tacotron 2 model (Aziz et al., 2023). This model extracts textual features through character embedding and convolutional layers.
Alignment between text and acoustic features is achieved using location-sensitive attention, which enhances the model’s ability to associate linguistic input with temporal acoustic patterns. Spectrogram prediction is performed using a stack of five convolutional layers followed by two LSTM layers, resulting in an 80-dimensional Mel-spectrogram. For waveform generation, the system employs a WaveNet vocoder (Dorado Rueda et al., 2021), which converts the predicted spectrogram into a 16 kHz audio waveform. The WaveNet vocoder utilizes a dilated convolutional neural network architecture to capture long-range temporal dependencies, thereby improving the naturalness and intelligibility of the synthesized speech.

During the text analysis phase, the input text undergoes a transformation into a format optimized for speech synthesis. Key tasks in this phase include word segmentation, part-of-speech tagging, and phonetic tagging. To enhance the naturalness of the synthesized speech, this study adopts a rule-based text analysis approach, breaking down the text into fundamental phonetic units such as syllables, ensuring a granular level of detail.

Next, the process advances to the speech synthesis stage, where a concatenative synthesis method is employed. This method translates the text into a series of speech units, like phonemes, and then meticulously concatenates these units into a seamless speech signal. The synthesis can be mathematically represented as Eq. (4):

$${Speech}(t)=\mathop{\sum }\limits_{i=1}^{N}{w}_{i}{{Unit}}_{i}(t)$$

(4)

In Eq. (4), \({{Unit}}_{i}(t)\) denotes the time function of the i-th phonetic unit, while \({w}_{i}\) represents the corresponding weight coefficient, highlighting the intricacies involved in assembling the final speech output.

Finally, in the voice output phase, the synthesized speech signal is audibly rendered through a loudspeaker.
To guarantee both the naturalness and clarity of the output, high-quality audio coding technology is applied, followed by post-processing steps aimed at eliminating noise and echo, thus ensuring that the final auditory experience is both crisp and lifelike.

Design of empirical experiment

Experimental subjects and grouping

The experimental sample consists of 262 undergraduate students majoring in language-related disciplines. Voluntary participation is obtained from all subjects, with informed consent acquired prior to the study. To ensure the sample’s representativeness, participants are randomly selected from a cohort with diverse academic backgrounds. The age range of participants is 18 to 25 years, with uniformity in English proficiency levels and no presence of language disorders or other conditions that might interfere with language acquisition. In this context, the term “English proficiency level” specifically refers to learners who have all attained B1-level certification under the Common European Framework of Reference for Languages (CEFR), with a score range of 45–60 out of a maximum of 100 points. The standard deviation (SD) is maintained within ±5 points (pre-test total score SD = 6.2–7.1), meeting the definition of “homogeneity” as outlined in educational statistics. However, the CEFR B1 level itself allows for reasonable variation across individual language skills (e.g., speaking, writing). For instance, a learner may demonstrate stronger listening skills (B1+) and comparatively weaker writing skills (B1−), which aligns with the case study’s description of “intermediate initial proficiency” and “localized difficulties.”

The 262 students are randomly assigned to either an experimental group or a control group, each comprising 131 participants. The experimental group utilizes an AI-powered intelligent language learning system, while the control group engages with traditional language learning methods.
A Matching Assignment technique is employed to account for baseline differences, pairing participants based on language proficiency, learning style, gender, and age to ensure balance across these variables. To minimize external influences, rigorous controls are established during the experiment to maintain a consistent learning environment, reducing the potential impact of environmental factors on the results. To further address individual differences, a pre-experiment language proficiency assessment is administered. Participants are subsequently categorized into proficiency levels, ensuring a balanced distribution of students across both groups. These precautions aim to mitigate the influence of individual variability on the experimental outcomes, thereby enhancing the validity and reliability of the study’s findings.

Data collection and analysis

Prior to the commencement of the formal experiment, a pilot study is conducted to evaluate the effectiveness of the optimized AORBCO model. This pilot study is carried out independently from the main experiment’s experimental group (which utilizes the optimized AORBCO model) and control group (which employs traditional methods). The primary objective is to compare performance differences among the same group of learners (n = 50, not involved in the main experiment) before and after the model optimization. The participants are native Chinese speakers learning English as their target language. The study spans four weeks and is divided into two phases: the baseline testing phase (Weeks 1–2), during which the original AORBCO model—without integration of the attention mechanism or cultural knowledge base—is applied; and the optimized testing phase (Weeks 3–4), which employs the enhanced AORBCO model incorporating multimodal data processing and a dynamic cultural strategy module.

At the commencement of the experiment, an extensive assessment is conducted to establish baseline data for each participant.
This initial phase includes a multifaceted language proficiency evaluation encompassing four key components: listening, speaking, reading, and writing. Additionally, participants engage in an assessment of their cross-cultural communication abilities, where their performance and strategies for navigating simulated cross-cultural scenarios are systematically recorded and analyzed.

The evaluation of cross-cultural communication ability is based on the following criteria:

1. Language Fluency: This criterion gauges the fluidity of language usage in communication, particularly the participant’s facility in utilizing language across diverse cultural contexts. Aspects such as coherence, pronunciation clarity, and linguistic naturalness are central to the assessment.

2. Cultural Understanding: This measure focuses on the participant’s comprehension of cultural nuances, specifically their awareness of and respect for differing cultural practices within communication. Evaluation emphasizes the participant’s capacity to accurately identify cultural differences and their ability to adjust language and behavior in accordance with the cultural context of the interlocutor.

3. Communication Strategies: This aspect assesses the participant’s competence in employing effective communication techniques within cross-cultural interactions, with particular attention to adaptability in complex cultural situations.
The evaluation seeks to determine whether participants can effectively address communication barriers and dynamically adjust strategies to facilitate successful exchanges.

Each evaluation criterion is quantified using a five-point scale, defined as follows:

1 point: Failure of the communication task; the participant is unable to complete it.

2 points: The communication task is completed with significant difficulty, reflecting inadequate cultural understanding or communication strategies.

3 points: Completion of the communication task with notable deficiencies in language fluency or cultural understanding.

4 points: Successful task completion with some cultural understanding and effective communication strategies.

5 points: Fluent completion of the communication task, demonstrating a high level of cultural understanding and effective communication strategies.

To ensure the reliability of assessments, evaluators undergo standardized training, fostering a consistent understanding of the scoring criteria. Training focuses on clarifying the standards, conducting consistency testing, and mitigating subjective bias. The goal is to ensure that evaluators apply scores objectively and consistently, thereby improving the reliability of the results.

(1) Experimental Intervention Phase (12 weeks): The core phase of the experiment spans 12 weeks, during which the experimental and control groups pursue different learning paths. The experimental group utilizes an intelligent language learning system designed for personalized learning, with an adaptive feedback mechanism that dynamically adjusts the approach to enhance both language proficiency and cross-cultural communication skills. In contrast, the control group adheres to traditional learning methods, including textbook study, classroom instruction, and periodic teacher guidance, without the support of an intelligent system.

(2) Post-test Phase: Following the intervention phase, participants undertake a post-test to assess language proficiency and cross-cultural communication abilities.
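The consistency testing mentioned for evaluator training is commonly operationalized with an inter-rater agreement statistic such as Cohen’s kappa. A minimal sketch over the five-point rubric follows; the ratings are illustrative placeholders, not study data, and the study does not state which agreement statistic it used:

```python
# Evaluator consistency check: unweighted Cohen's kappa between two raters
# scoring the same performances on the 1-5 rubric. Sample ratings are
# illustrative, not study data.
from collections import Counter

def cohens_kappa(r1: list[int], r2: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters over the same items."""
    assert len(r1) == len(r2) and r1
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n     # raw agreement
    c1, c2 = Counter(r1), Counter(r2)
    # chance agreement from each rater's marginal score distribution
    expected = sum(c1[k] * c2[k] for k in set(r1) | set(r2)) / (n * n)
    return (observed - expected) / (1 - expected)

rater_a = [5, 4, 4, 3, 2, 5, 3, 4, 1, 2]
rater_b = [5, 4, 3, 3, 2, 5, 3, 4, 2, 2]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")  # values above ~0.6 are conventionally read as substantial agreement
```

In practice, raters would be retrained and the test repeated until kappa clears a pre-agreed threshold, which is one concrete way to "mitigate subjective bias" before scoring begins.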
The post-test aims to directly measure the impact of the intervention, comparing the results to the pre-test to evaluate improvements in both language and communication skills.

(3) Delayed Test: A delayed test is administered four weeks after the completion of the experiment to assess the durability of the learning effects. The content mirrors the post-test, aiming to determine whether the gains from the intervention were sustained over time and to provide comparative data.

The experimental design of this study adheres strictly to established scientific standards in cross-cultural language learning research. All participants are native speakers of Mandarin Chinese. The experimental group engages in English language learning via the intelligent system, while the control group receives instruction through traditional classroom-based English teaching. English is selected as the target language due to its status as a primary medium of global cross-cultural communication and its inclusion of diverse culturally embedded contexts—such as greeting customs and metaphorical expressions—making it a suitable medium for evaluating the system’s capacity to address cultural sensitivity.

During the 12-week instructional period, the experimental group engaged in tasks designed to simulate culturally nuanced communicative strategies using ASR and TTS technologies. For instance, learners practiced interruption strategies commonly used in Anglo-American business meetings—such as the intonational patterns of “I’d like to add…”—in contrast to the silence-as-respect convention observed in Mandarin Chinese interactions. Culturally embedded dialogues were generated using TTS, highlighting contrasts between individualistic expressions in English (e.g., “I prefer…”) and collectivist formulations in Chinese (e.g., “我们觉得…”). VR scenarios were integrated to enhance cross-cultural training.
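The intonation-sensitive feedback used in these tasks can be approximated, at its simplest, by a slope check on an utterance-final pitch contour. The sketch below is a toy stand-in, not the system’s actual prosody pipeline; the frame rate, contour values, and rise threshold are all assumptions:

```python
# Illustrative stand-in for the "sharp pitch increase" check used in the
# training tasks: fit a least-squares line to an utterance-final pitch
# contour and flag steep rises. A real system would extract pitch from
# ASR/prosody features rather than take a hand-written contour.

def pitch_slope(contour_hz: list[float], frame_s: float = 0.01) -> float:
    """Least-squares slope (Hz per second) of a pitch contour."""
    n = len(contour_hz)
    xs = [i * frame_s for i in range(n)]
    mx = sum(xs) / n
    my = sum(contour_hz) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, contour_hz))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def sounds_aggressive(final_contour_hz: list[float],
                      rise_threshold_hz_per_s: float = 400.0) -> bool:
    """Flag a steep utterance-final rise (threshold is an assumption)."""
    return pitch_slope(final_contour_hz) > rise_threshold_hz_per_s

calm = [180, 182, 181, 179, 178, 176, 175, 174]    # gently falling contour
sharp = [180, 195, 210, 228, 247, 265, 284, 300]   # rapid rise
print(sounds_aggressive(calm), sounds_aggressive(sharp))
```

When the flag fires, a system of this kind would surface a softer reformulation, as in the “Could we discuss extending the timeline…” prompt described below for the negotiation module.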
In one scenario, learners adopted the role of a Chinese employee explaining the concept of “face” to a British client. Voice Emotion Analysis (VEA) was used to assess whether the learner’s tone conveyed appropriate politeness. Based on this analysis, the system consulted a cultural knowledge base to suggest culturally equivalent expressions, such as replacing a literal translation of “giving face” with “mutual respect.” A detailed example from Week 6’s “Cross-Cultural Business Negotiation” module illustrates the application of these technologies. When responding to the statement “Your deadline is unrealistic” from an American business partner, the system used ASR to detect potentially aggressive intonation (e.g., a sharp pitch increase) and prompted learners to revise the phrasing to “Could we discuss extending the timeline to ensure quality?” In another example, when translating the Chinese sentence “这件事需要领导拍板,” the AORBCO model injected cultural context into the translation, producing “Final approval requires consensus from our senior management,” rather than a literal and culturally ambiguous version such as “leaders clapping boards,” thereby mitigating the risk of misinterpretation.

To ensure the reliability and scientific validity of the experimental results, several variables are meticulously controlled across all phases of the experiment:

1. Individual Differences: Random assignment of participants to experimental and control groups minimizes the potential for systematic bias due to individual differences.

2. Experimental Time and Frequency: Both experimental and control groups participate in learning activities with identical frequency within the same time frame, ensuring that differences in learning duration and frequency do not affect the results.

3. Environmental Control: The experiments are conducted in a controlled laboratory setting, utilizing identical equipment and materials, thereby eliminating environmental factors that could impact learning
outcomes.

These rigorous controls are implemented to uphold the scientific integrity of the experimental design and provide a robust foundation for the reliability of the findings.

The data collection process is methodically structured around the pre-test, post-test, and delayed test stages, capturing various dimensions of the participants’ learning journey.

i. Language Ability Test Data: In each stage, detailed records are kept of participants’ scores in listening, speaking, reading, and writing. These scores are calibrated on a 100-point scale and ultimately synthesized into a comprehensive language proficiency score. The granularity of these scores provides a clear view of individual and collective progress throughout the experiment.

ii. Intercultural Communication Skills Assessment Data: Participants’ performances during simulated intercultural communication scenarios are rigorously evaluated. Criteria such as language fluency, cultural understanding, and the effectiveness of communication strategies are employed, with each participant being rated on a five-level scale. These assessments offer critical insights into how participants navigate and respond to intercultural challenges.

iii. Learning Process Data: For the experimental group, an additional layer of data is collected, focusing on their interactions with the intelligent language learning system. Key metrics such as total learning time, the number of tasks completed, and the frequency of system feedback responses are meticulously recorded.
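The synthesis of the four 100-point component scores into a comprehensive language proficiency score might look as follows; equal weighting is an assumption here, since the weighting scheme is not specified:

```python
# Sketch of the score-synthesis step: combine the four 100-point component
# scores into one comprehensive score. Equal weights are an assumption.

def comprehensive_score(listening: float, speaking: float,
                        reading: float, writing: float,
                        weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted average of the four component scores, on the same 100-point scale."""
    parts = (listening, speaking, reading, writing)
    assert all(0 <= s <= 100 for s in parts)
    assert abs(sum(weights) - 1.0) < 1e-9
    return round(sum(w * s for w, s in zip(weights, parts)), 1)

print(comprehensive_score(86, 84, 88, 86))  # → 86.0
```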
This data serves as a vital resource for analyzing how the system’s personalized learning path and real-time feedback mechanism influenced the learning outcomes. The resulting data sets form a comprehensive picture, enabling in-depth analysis and understanding of the various factors at play in the language learning and intercultural communication processes.

Results and discussion

Results of empirical analysis

The empirical analysis reveals distinct patterns when comparing pre-test and post-test language proficiency scores (average values, with a full score of 100), as depicted in Fig. 7. Significant differences emerge between the experimental and control groups. The experimental group, following their engagement with the intelligent language learning system, demonstrates substantial improvements across all assessed domains. Post-test scores in listening, speaking, reading, and writing rise by 17.2, 17.3, 16.1, and 16.2 points, respectively. These gains culminate in a comprehensive score increase of 16.7 points, reaching 85.7. The magnitude of these improvements underscores the potential impact of the personalized learning pathways and real-time feedback mechanisms embedded within the system. Conversely, the control group, which adheres to traditional language learning methods, exhibits more modest gains. Its scores in listening, speaking, reading, and writing increase by 5.3, 4.5, 3.8, and 3.5 points, respectively, with a comprehensive score increase of just 4.2 points. The disparity in outcomes highlights the significant advantage conferred by the intelligent language learning system, suggesting that its tailored approach and dynamic feedback play a pivotal role in enhancing language proficiency.

Fig. 7 Comparison of scores before and after the language proficiency test (S1–S5 represent the average scores for listening, speaking, reading, and writing, and the comprehensive score, respectively).

The comparison of pre-test and post-test scores for intercultural communication skills (out of 100) is illustrated in Fig. 8. The results reveal a pronounced difference in performance between the experimental and control groups, particularly in the realm of intercultural communication. In the experimental group, which utilizes the intelligent language learning system, post-test improvements are substantial. Language fluency, cultural understanding, communication strategies, and cultural sensitivity see increases of 18.9, 18.9, 19.4, and 19.2 points, respectively. The cumulative effect is an 18.9-point rise in the comprehensive score. These gains suggest that the system’s tailored approach not only bolsters linguistic skills but also significantly enhances learners’ ability to navigate and respond to diverse cultural contexts. In contrast, the control group, adhering to conventional learning methods, shows more modest advancement. Improvements in language fluency, cultural understanding, and communication strategies are limited to 8.2, 8.2, and 8.1 points, respectively, with a comprehensive score increase of just 8.2 points. This stark contrast underscores the profound impact of the intelligent language learning system on fostering both linguistic proficiency and intercultural competence, highlighting its potential to transform the way learners engage with and understand diverse cultural paradigms.

Fig. 8 Scores before and after the assessment of intercultural communication skills (R1–R5 represent language fluency, cultural understanding, communication strategy, cultural sensitivity, and the comprehensive score, respectively).

This study employs an independent samples t-test to compare the improvements in linguistic proficiency and intercultural communication competence between the experimental and control groups. Table 1 presents the pre- and post-intervention scores for both groups (n = 131 per group) over the 12-week period, including mean scores, standard deviations (SD), t-values, p-values, and effect sizes (Cohen’s d).

Table 1 Comparison of gains in language proficiency and intercultural competence between experimental and control groups (maximum score = 100).

As shown in Table 1, the experimental group exhibits statistically significant improvements across all assessed dimensions—listening, speaking, reading, writing, and overall intercultural competence (p
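The comparison underlying Table 1 is the standard independent-samples t-test with a Cohen’s d effect size. A minimal sketch on synthetic per-participant gain scores follows (the values are placeholders, not the study’s data, which has n = 131 per group):

```python
# Sketch of the Table 1 analysis: Student's t statistic (pooled variance)
# and Cohen's d on per-participant gain scores. Gain values are synthetic.
from statistics import mean, stdev
from math import sqrt

def t_and_cohens_d(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, Cohen's d) for two independent samples."""
    na, nb = len(a), len(b)
    sa, sb = stdev(a), stdev(b)                      # sample SDs (n - 1)
    pooled = sqrt(((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2))
    d = (mean(a) - mean(b)) / pooled                 # standardized mean difference
    t = d / sqrt(1 / na + 1 / nb)                    # equivalent to the usual t formula
    return t, d

experimental_gain = [15.1, 17.8, 16.4, 18.0, 16.9, 17.5, 15.8, 16.2]
control_gain = [3.9, 4.8, 4.1, 5.0, 3.6, 4.4, 4.7, 4.0]
t, d = t_and_cohens_d(experimental_gain, control_gain)
print(f"t = {t:.2f}, d = {d:.2f}")  # compare |t| with the critical value for df = na + nb - 2
```

The p-value would then be read from the t distribution with na + nb − 2 degrees of freedom (e.g., via scipy.stats); by the usual conventions, d values near 0.8 or above are interpreted as large effects.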