Introduction

Translation plays a pivotal role in bridging linguistic and cultural divides, facilitating the global exchange of ideas, knowledge, and values (House, 2015). As globalization accelerates, this role becomes increasingly central: the demand for skilled translation has grown rapidly, with profound implications for geopolitics, economic systems, educational practices, and cultural production. In this context, translation is far more than a conduit for communication; it operates as a dynamic sociocultural practice that (re)constructs cultural identities, negotiates power dynamics in public discourse, and mediates policy formulation. Its multifaceted impact is evident across domains: it underpins the dissemination of scientific innovation (Olohan, 2007), enables multinational corporate operations (Wang et al., 2023), and ensures the cross-cultural transmission and preservation of literary heritage (Wu and Li, 2022a).

Despite its critical role in global communication, translation is frequently misrepresented as a superficial, mechanical word-for-word transfer between languages. In reality, it constitutes a dynamic and multifaceted process that entails negotiation across divergent linguistic systems, cultural norms, and sociocultural constraints (Bassnett, 2013). Given its centrality to cross-cultural mediation, research into the nature of translated language (as distinct from non-translated language) is imperative for optimizing the efficacy of translation practice. Research in translation studies has thus focused on the systematic analysis of linguistic and discursive patterns in translated texts, particularly their implications for translation quality, communicative equivalence, and cross-cultural communication strategies (Laviosa, 2002; Wu and Li, 2022b; Huang and Li, 2023). Such investigations not only enhance our understanding of translation as a unique mode of communication but also provide valuable insights for refining translation theory and practice in an increasingly interconnected world.

Scholars in the field have identified specific linguistic characteristics that consistently appear in translated texts across language pairs (Baker, 1993; Chou et al., 2023; Su and Liu, 2022; Su et al., 2023; Wang and Liu, 2024), as well as in the interpreting process (Li et al., 2022; Xu and Liu, 2023, 2024). These distinctive features, known as translation universals, are defined as “the features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems” (Baker, 1993, p. 243). The concept of “translation universals” bears a close resemblance to that of “translationese” (Gellerstam, 1986). Originally employed as a somewhat pejorative term for unnatural or deficient language use, translationese has since been redefined as a neutral, descriptive concept within translation studies, denoting the linguistic features or “fingerprints” characteristic of translated texts (Toury, 1995). Translation universals have been the subject of extensive scholarly inquiry. Baker (1993) systematically identified several such universals, including simplification, explicitation, and normalization, while Toury (1995) further advanced the theoretical framework by introducing the laws of standardization and interference.
Teich (2003) conducted a corpus-based investigation into translation universals, notably exploring the phenomenon of “shining-through”, whereby specific features of the source language remain visible in the target text, preserving source-language patterns. Unlike interference, “shining-through” is considered a systematic pattern rather than a translation error.

Translation universals are understood as inherent patterns and tendencies intrinsic to the translation process itself, independent of the linguistic systems of either the source or target languages (Fan and Jiang, 2019; Liu and Afzaal, 2021). Recent applications of digital humanities approaches to translation studies have provided new perspectives on these patterns, yielding insights that have significantly enriched both translation theory and practice (Gu, 2023; Sun and Li, 2020). This progress has prompted critical examinations of the intricacies of language transfer during translation, underscoring the complexity of the process. Consequently, there is a pressing need for further research into the linguistic complexities embedded within translated texts. Such investigations are essential to advancing our understanding of the translation process and, ultimately, to fostering more effective cross-cultural communication.

Extensive research on the distinctive characteristics of translated texts has primarily focused on four key translation universals: explicitation, simplification, normalization, and levelling-out. Simplification, defined as “the idea that translators subconsciously simplify the language, or message, or both” (Baker, 1996, p. 176), stands as a particularly debated hypothesis within the study of translation universals (Liu and Afzaal, 2021). Explicitation, on the other hand, refers to “an overall tendency to spell things out rather than leave them implicit” (Baker, 1996, p. 180). Normalization, also known as “conventionalization” (Mauranen, 2007), involves exaggerating “features of the target language to conform to its typical patterns” (Baker, 1996, p. 183). Consequently, translated texts appear more conventional than non-translated native texts in the target language (Xiao, 2010). This normative influence is notably evident in the overuse of clichés, typical grammatical structures of the target language, genre-specific features, punctuation adjustments, and the treatment of dialects in dialogs within the source texts (Xiao, 2010, pp. 10–11). Levelling-out refers to translation’s tendency to “steer a middle course between any two extremes, converging toward the centre” (Baker, 1996, p. 184). Numerous studies have sought to investigate these universals (Liu and Afzaal, 2021; Wang et al., 2023; Wang et al., 2024c; Xiao and Dai, 2014); however, the findings remain inconclusive, and no definitive consensus has been reached regarding their existence.

As Liu and Afzaal (2021) observe, the majority of earlier studies have relied on isolated and simplistic linguistic features to substantiate the existence of translation universals in translated texts. For instance, average sentence length has often been employed as an indicator of simplification, as noted by Laviosa (2002). However, this measure has proven inconsistent across different language pairs, such as Spanish-English (Pym, 2008) and Chinese-English (Xiao and Yue, 2009), highlighting the limitations of relying on individual linguistic features for conclusive evidence.
Additionally, early research on translation universals predominantly focused on literary texts (e.g., Blum-Kulka and Levenston, 1983; Laviosa, 1998). Yet genre has since been identified as a critical variable influencing translational language features, necessitating its consideration in studies on translation universals (Delaere et al., 2012; Liu and Afzaal, 2021). To address these methodological challenges, researchers have increasingly turned to information-theoretic measures such as entropy (Liu et al., 2022a; Lin and Liang, 2023; Wang et al., 2024a) and tree-based dependency measures (Xu and Liu, 2023, 2024) to quantify the complexity of translated texts. These approaches have yielded new insights, providing a more nuanced understanding of the translation process.

The present study builds on this body of research by incorporating the concept of information entropy from information theory, utilizing balanced corpora across multiple genres to investigate whether simplification occurs in translated English. Entropy, which quantifies the complexity or uncertainty inherent in a linguistic system, serves as a robust analytical tool for examining the linguistic features of translated texts. By applying entropy as a measure, this study aims to offer a more precise understanding of the degree of simplification or complexification in translated English, contributing a novel perspective on the linguistic characteristics of translational language. Furthermore, this study extends prior research (Liu et al., 2022a; Liu et al., 2022b) by exploring the application of computational methods to deepen our understanding of translation’s impact on language.

Related work

Simplification in translation

Simplification is one of the most extensively studied translation universals in the field of translation studies. The process typically involves the use of simpler lexical, grammatical, and syntactic structures while maintaining the intended meaning. Prior to the advent of corpora and computational tools, text collection and analysis were primarily conducted manually, with research often limited to small samples of source and target texts. In this context, Blum-Kulka and Levenston (1983) investigated translations from Hebrew into English and identified lexical simplification in translations, evidenced by the use of fewer words to convey equivalent meanings. This early research highlighted the phenomenon of simplification in translation, establishing a foundation for the more comprehensive studies later enabled by modern computational methods and larger corpora. Vanderauwera (1985) expanded the investigation of simplification to include syntactic aspects in English translations of Dutch texts, observing that translated texts frequently replaced finite clauses with non-finite ones, thereby reducing syntactic complexity. Additionally, Vanderauwera (1985) examined simplification from a stylistic perspective, noting how translators modified stylistic elements to enhance the accessibility of the text. However, due to the restricted sample sizes and the absence of advanced statistical techniques, the findings of these early studies lacked generalizability.

With advancements in corpora and computational linguistics, quantitative approaches have increasingly been adopted to examine the manifestation of translation universals from a statistical perspective.
The assessment of simplification has typically been conducted using a comparable corpus approach, which allows for comparisons between translated and original texts (Baker, 1993). Malmkjær (1997) found that English texts translated from Danish tend to employ stronger punctuation, such as replacing commas with semicolons or periods, and to substitute the complex syntactic structures of source texts with shorter, less complicated clauses. Laviosa (1998) examined the linguistic features of English-translated narrative prose by analyzing criteria such as lexical density, average sentence length, and the frequency of commonly used words. By analyzing a section of the English Comparable Corpus (ECC), a monolingual, multi-source-language English corpus, Laviosa (1998) identified four key aspects in which translated English differs from native English. Firstly, translated texts exhibited lower lexical density and a higher proportion of grammatical words compared to original texts. Secondly, they made greater use of high-frequency words. Thirdly, these high-frequency words were repeated more often in translated texts. Lastly, the most commonly used words in translated texts comprised fewer lemmas than those found in original texts. Olohan (2004) employed lexical diversity as a metric to compare English fiction translated from German with native English fiction, revealing that translated works tended to use fewer synonyms for colors than their native counterparts. Similarly, Pastor et al. (2008) explored simplification using natural language processing tools, readability formulas, and other indices, finding that non-translated Spanish texts exhibited higher lexical density and richness than Spanish texts translated from English. Collectively, these studies provide empirical support for the simplification hypothesis. The adoption of quantitative methods and statistical analysis in such research has enabled a more systematic and rigorous examination of simplification in translation. By measuring diverse linguistic features and comparing translated texts with their native counterparts, researchers have identified consistent patterns of simplification, thereby advancing our understanding of translation universals.

While simplification has been extensively studied as a potential translation universal, it remains a contentious topic within translation studies. Some research findings challenge the traditional view of simplification, suggesting that the phenomenon is more nuanced than previously thought. Lexical complexity, often associated with vocabulary richness and diversity, has been studied extensively (Baker, 1996; Blum-Kulka and Levenston, 1983; Kruger, 2019; Laviosa, 2002). Early studies, such as Blum-Kulka and Levenston’s (1983) comparison of Hebrew and English, suggested that translated texts tend to exhibit simplified lexical choices. However, more recent investigations have challenged this view. For instance, Kruger (2019) found that English texts translated from Afrikaans display higher lexical diversity than comparable non-translated English texts.

Syntactic complexity in translation has been examined through various approaches. Vanderauwera (1985) found reduced syntactic complexity in Dutch-English translations. This finding was supported by Pastor et al. (2008), who showed that sentences in translated Spanish texts were statistically shorter than those in non-translated Spanish texts. However, Wang et al.
(2023) found that chairman’s statements translated from Chinese into English exhibited higher syntactic complexity than non-translated English chairman’s statements, as evidenced by longer production units, a higher frequency of coordinate phrases, and more complex nominal structures.

The relationship between text type and complexity has also received considerable attention. Biber and Conrad (2019) demonstrated how different genres exhibit distinct linguistic patterns. In the field of translation studies, Liu and Afzaal (2021) showed that text type significantly influences syntactic complexity in English texts translated from Chinese. Translation status and its impact on text complexity have likewise been investigated through various methodological approaches. Studies employing machine learning techniques have successfully distinguished between translated and non-translated texts based on their linguistic features (Baroni and Bernardini, 2006; Volansky et al., 2015; Wang et al., 2024a; Wang et al., 2024b). Additionally, multidimensional analyses have shown that translated texts often display distinct linguistic patterns compared to non-translated texts (Wang and Liu, 2024).

The exploration of translation universals, however, has largely been conducted from a Eurocentric perspective, as noted by Xiao and Dai (2014). This focus is reflected in the significant number of studies concentrating on European languages (De Camargo, 2016; Laviosa, 2002), particularly in their examination of lexical and syntactic features. The variations in these features within European languages may not be as pronounced as those observed in more linguistically distant language pairs, such as English and Chinese, where the differences in vocabulary, grammar, and sentence structure are considerably more substantial (Xiao and Dai, 2014). These substantial linguistic disparities pose unique challenges and opportunities for translators, underscoring the need for a nuanced analysis of translation universals that carefully considers such distinctions. The characteristics of translated texts often vary significantly across language pairs, necessitating a broader and more inclusive perspective in translation studies. Notably, prior research comparing Chinese-English translations with original English texts has identified compelling evidence of syntactic variation between translated and non-translated texts (Liu and Afzaal, 2021). These variations are shaped not only by linguistic factors but also by cultural, social, and contextual elements that influence the translation process. The current lack of a coherent picture in research on translation universals can, in part, be attributed to the selective use of linguistic indicators chosen to support specific hypotheses (Liu et al., 2022a).

To address this gap, we draw on the methodologies proposed by Liu et al. (2022a) and Lin and Liang (2023), employing entropy as a quantitative measure of linguistic complexity. Entropy, introduced by Shannon (1948), quantifies the information content of a source and measures the degree of uncertainty or randomness in a given dataset. Essentially, entropy provides a numerical representation of the average information, or “surprise”, associated with an event or observation.
The calculation involves taking the logarithms of the probabilities assigned to each potential outcome or message in a random event, with these probabilities serving as weights. Outcomes with higher probabilities contribute less to the overall entropy than those with lower probabilities. Shannon’s formula for calculating entropy is as follows:

$$H = -\sum_{i=1}^{n} P_i \log_2 P_i$$

In this formula, H denotes the total entropy, the aggregate measure over all elements present in a message; \(P_i\) is the probability of a specific element occurring, computed as its relative frequency within the dataset; and n is the total number of distinct elements in the message. The formula thus provides a quantitative assessment of the overall uncertainty, or information content, of a given dataset. For example, a source in which two outcomes each occur with probability 0.5 has an entropy of 1 bit, whereas a source dominated by a single near-certain outcome has an entropy approaching 0 bits.

Entropy, widely used in fields such as ecology, computational linguistics, and information science (Ben-Naim, 2019; Bentz et al., 2017; Cushman, 2021), also offers a systematic and objective method for analyzing linguistic phenomena. Among the diverse quantitative approaches available for analyzing translational simplification, entropy-based measures are particularly advantageous because they can capture both lexical and syntactic complexity by quantifying the predictability and distribution patterns of linguistic elements (Liu et al., 2022a). While traditional analytical methods often focus on isolated features such as lexical density, sentence length, or type-token ratio, entropy measures offer a comprehensive assessment of text complexity by integrating information content, structural organization, and the probabilistic nature of language patterns (Wang et al., 2024a). This methodological advancement toward sophisticated quantitative analysis, specifically through the application of entropy measures, enables researchers to precisely quantify the degree of simplification in translated texts. The shift from potentially subjective qualitative evaluations to mathematically grounded metrics not only enhances the reliability and replicability of findings but also facilitates cross-linguistic comparisons in translation research.
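As a toy illustration of the formula (our own example, not part of the original study), the following Python snippet computes H for a few simple probability distributions, showing how entropy rises as a distribution becomes more even and less predictable:

```python
import math

def shannon_entropy(probs):
    """H = -sum(p * log2(p)) over outcomes with probability p > 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # 1.0 bit: a maximally uncertain coin
print(shannon_entropy([0.9, 0.1]))   # ~0.469 bits: a more predictable source
print(shannon_entropy([0.25] * 4))   # 2.0 bits: four equally likely outcomes
```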
Entropy in language and translation research

The concept of entropy has been extensively applied in language research, highlighting its versatility and significant contributions to linguistic analysis. Genzel and Charniak (2002) introduced a foundational principle of language generation known as the entropy-rate constancy principle. Their research demonstrated that, when sentences are analyzed out of context, sentence entropy increases with sentence number, a finding that aligns with this principle and provided new insights into linguistic patterns associated with entropy, particularly in written English. Specifically, Genzel and Charniak focused on the entropy of text, conceptualizing each word as a random variable whose distribution depends on all preceding words, and observed that the average entropy of these variables remains constant throughout a text. Building on these findings, Tanaka-Ishii (2005) expanded the understanding of entropy in language by showing that the uncertainty of tokens following a sequence is crucial for determining context boundaries. That study assessed the uncertainty of successive tokens using the concept of branching entropy, underscoring the importance of entropy in shaping linguistic contexts. Juola (2008) further applied entropy to quantify linguistic complexity at the levels of lexicon, morphology, and syntax, demonstrating its utility across diverse linguistic dimensions. Mehri and Darooneh (2011) explored entropy’s role in word ranking, emphasizing its systematic application in text mining and its relevance to understanding word patterns. More recently, Yang et al. (2013) advanced this line of research by introducing a new entropy-based metric for evaluating word relevance, finding that significant words closely reflect the author’s intent, which distinguishes them from irrelevant, randomly distributed words. Analyzing word distribution is thus pivotal for understanding text complexity and for discerning how meaningful words are shaped by an author’s objectives.

In the field of speech therapy, entropy has been applied to evaluate word complexity, with a particular focus on individual verbs and verb paradigms (van Ewijk and Avrutin, 2016). These researchers employed inflectional entropy as a measure, framing the informational complexity of a word as a combination of the information inherent in the target word itself and the information embedded within its morphological paradigms. Bentz et al. (2017) explored convergence points in word entropy, enabling quantitative language comparisons and enhancing translation systems, thereby underscoring entropy’s fundamental role in comparative linguistic studies. Lowder et al. (2018) applied entropy-reduction techniques to enhance lexical prediction, extending entropy’s applications to predictive text analysis. They used entropy reduction as a complexity metric to quantify the amount of information gained with each word: if entropy decreases from one word to the next, communicative uncertainty is reduced, indicating active information processing by the reader. Friedrich et al. (2020) introduced a neural language model to measure word ambiguity in legal texts, finding lower entropy in the German Federal Court of Justice corpus due to the frequent use of technical language, thereby emphasizing entropy’s relevance in legal language analysis. Friedrich (2021) continued this exploration, investigating various linguistic features of legal language and highlighting its high complexity and low entropy, which present challenges for non-experts seeking to understand specialized language domains. The consistent application of entropy across this wide spectrum of research underscores its critical importance and broad implications for language analysis and computational linguistics.

Entropy has also gained considerable attention within translation research due to its demonstrated relevance to cognitive and linguistic processes. Carl and Schaeffer (2017) proposed a noisy channel model of the translation process, demonstrating how entropy measurements can reveal cognitive processing patterns during translation. Their work showed that information-theoretic measures such as entropy are proportional to cognitive processing effort, particularly when combined with other metrics.
Further developing this line of research, Carl (2021) investigated the first universal translational response, finding that the cognitive effort required for initial translation responses correlates with information content and cross-linguistic similarity. His study revealed that typing pauses, which indicate cognitive processing time, are influenced both by the information content of the source text and by the degree of literal translation possible between language pairs. Wei (2022) introduced two important metrics, surprisal (ITra) and entropy (HTra), which approximate cognitive load. ITra was identified as a more accurate predictor of translation production time, while HTra was found to be more effective in predicting the reading time of the source text. More recently, Deilen et al. (2023) examined cognitive aspects of compound translation, using entropy measurements alongside other metrics to quantify cognitive effort. Their research provides empirical evidence that information-theoretic translation (ITra) measures, when combined with other indicators, can effectively predict cognitive processing load during translation tasks. These findings highlight the utility of entropy-based indicators in accounting for the cognitive processes underlying translation activities.

Chen et al. (2017) explored the linguistic profiles of different text types using entropy. By examining word and part-of-speech entropy, as well as the entropy of aspect markers in Chinese and English corpora, they revealed distinct distribution patterns in the two languages. Notably, Chinese exhibited higher entropy in aspect markers, underscoring grammatical differences from English and aiding in distinguishing between narrative and expository texts. Expanding the application of entropy in translation studies, Yerkebulan et al. (2021) developed an entropy-based methodology to detect patterns in multilingual texts and identify translations. Their approach, which involved calculating text proximity using centers of parametric means, proved effective in distinguishing genuine translations from “pseudo” ones based on distance measures, underscoring the potential of entropy in translation verification and analysis. Recent studies by Liu et al. (2022a) and Liu et al. (2022b) further demonstrate the effectiveness of entropy-based approaches in examining translation universals, particularly in Chinese texts translated from English. These researchers suggest that further exploration of entropy-based methods could yield valuable insights when applied to translated languages beyond Chinese.

Entropy has been employed as an inverse indicator of simplification: higher entropy values indicate lower levels of simplification and increased linguistic complexity, while lower values denote greater simplification and reduced complexity (Liu et al., 2022a; Liu et al., 2022b; Wang et al., 2024a). When translated texts exhibit lower entropy values than non-translated texts, this supports the simplification hypothesis, indicating less diverse lexical choices and more predictable language structures. Conversely, higher entropy values suggest greater lexical diversity and less predictable patterns, pointing to a lower degree of simplification.

Research questions

The use of entropy-based metrics to analyze translational simplification has gained traction, as evidenced by Liu et al. (2022a), who demonstrated their effectiveness in revealing translation phenomena in Chinese texts.
Building on this work, our study extends the investigation to translated English, aiming to provide comparative data that enhance our understanding of translational simplification across languages. We employ entropy-based metrics to quantitatively assess the complexity and simplification of texts, identifying patterns of simplification across various linguistic structures. This research draws on a typologically distant language pair, English and Chinese, which Xiao and Dai (2014) suggest could lead to robust and insightful results. Our comparative analysis seeks to determine whether translated English, derived from Chinese sources, exhibits simplification compared to native English in four distinct genres. This approach advances our understanding of cross-linguistic complexity patterns and reveals how genre-specific features are preserved or simplified during the Chinese-to-English translation process. The following research questions guide our investigation:

RQ1: Are there significant differences in the lexical and syntactic complexity of texts based on translation status?

RQ2: Are there significant differences in the lexical and syntactic complexity of texts based on text type?

RQ3: Is there an interaction effect between translation status and text type on the lexical and syntactic complexity of texts?

Materials and methods

Corpora

The study utilizes two primary corpora: the Freiburg-LOB Corpus of British English (FLOB), where LOB stands for Lancaster-Oslo/Bergen (Hundt et al., 1998), and the English component of the Corpus of Chinese into English (COCE). FLOB, compiled in the early 1990s, was designed as a one-million-word parallel counterpart to the original LOB and Brown corpora, reflecting the diversity of written British English of that period. It includes a variety of genres, such as press reportage, editorials, religion, hobbies, popular lore, biographies, and academic prose, all from 1991, providing a dataset comparable to the original LOB Corpus from 1961 (Hundt et al., 1998). COCE, designed to closely match FLOB in size and genre composition, is a parallel corpus containing Chinese source texts and their English translations, aligned at the sentence level for enhanced representativeness (Liu and Afzaal, 2021). The focus of our research is the English translations within COCE, which comprise 500 texts averaging 2,000 words each. These texts cover four main genres and 15 subgenres, offering a broad spectrum for analysis. Further details about the English component of COCE, including genre distribution, text types, and token counts, are summarized in Table 1. This corpus setup allows for a rigorous comparative analysis of translational simplification between native and translated English across distinct genres.

It is worth noting that all translators in this study were native Chinese speakers with advanced English proficiency, rather than native English speakers. While this is common practice in China, it differs from many translation studies in which translators are typically native speakers of the target language. This characteristic of our translator pool represents a methodological limitation that should be considered when interpreting our results, as non-native English translators may process and produce translations differently from native speakers.
Their linguistic backgrounds might influence the translation outcomes in ways that native English translators’ work would not.

Table 1 Details of the English part of COCE.

Following Baker’s (1993) comparable corpus approach, the present study compares the English translations in COCE with the original English texts in FLOB. This comparative analysis is instrumental in comprehensively examining the degree of simplification using the entropy-based approach.

Calculation of wordform entropy and POS entropy

Previous research by Shi and Lei (2020) has shown that text length significantly affects text entropy, underscoring the importance of maintaining consistent text lengths for precise analysis. Although each text in our corpora was expected to average 2,000 words, a detailed examination revealed discrepancies in text lengths between FLOB and COCE. To address this, we standardized text lengths across both corpora. In line with the methodology proposed by Liu et al. (2022a), we set a maximum of 1,500 words per text, excluding punctuation; texts exceeding this limit were truncated to meet the criterion. This adjustment preserves the original structure of the corpora, each of which comprises 500 texts, and ensures that our entropy measurements are not skewed by variations in text length, facilitating a more reliable comparison of translational simplification between the corpora. For POS tagging, we employed the Penn Treebank POS tagger as implemented in the Natural Language Toolkit (NLTK). We then computed wordform entropy and POS entropy values for both FLOB and COCE using a Python program. During this computation, all punctuation marks were omitted to eliminate potential confounding effects caused by variations in punctuation usage between the corpora. This standardized approach to text length and preprocessing ensures a rigorous and comparable analysis of word and POS entropy across the two corpora.

The calculation of wordform entropy for a text proceeds as follows:

1. Count the occurrences of each word: tabulate the frequency of every word in the text.

2. Calculate the probability of each word: the probability of a word, \(p(w_i)\), is its frequency divided by the total number of words in the text.

$$p(w_i) = \frac{\text{frequency of } w_i}{\text{total number of words in the text}}$$

3. Calculate entropy using the Shannon entropy formula:

$$H(W) = -\sum_{i=1}^{n} p(w_i) \log_2 p(w_i)$$

where H(W) is the entropy of the word distribution, \(p(w_i)\) is the probability of word i, and n is the number of unique words.

The POS entropy of a text is calculated analogously:

1. Tag each word with its part of speech: use a POS tagger to label each word in the text with its corresponding part of speech; punctuation is removed prior to this step.

2. Count the occurrences of each POS tag.

3. Calculate the probability of each POS tag: the probability of a POS tag, \(p(t_j)\), is its frequency divided by the total number of POS tags in the text.

$$p(t_j) = \frac{\text{frequency of } t_j}{\text{total number of POS tags in the text}}$$

4. Calculate POS entropy:

$$H(T) = -\sum_{j=1}^{n} p(t_j) \log_2 p(t_j)$$

where H(T) is the entropy of the POS distribution, \(p(t_j)\) is the probability of POS tag j, and n is the number of unique POS tags.
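To make the pipeline concrete, the following Python sketch illustrates the truncation, tagging, and entropy steps described above. It is an illustrative reconstruction, not the study’s actual script; tokenization details and lowercasing, which the paper does not specify, are our own simplifying assumptions.

```python
import math
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize

# One-time model downloads for the tokenizer and the Penn Treebank tagger:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def entropy(items):
    """Shannon entropy (in bits) of the distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def text_entropies(raw_text, max_words=1500):
    """Return (wordform entropy, POS entropy) for a single text.

    Punctuation is removed and the text is truncated to max_words tokens,
    mirroring the preprocessing described above. Lowercasing is our own
    simplifying assumption.
    """
    tokens = [t.lower() for t in word_tokenize(raw_text) if t.isalnum()]
    tokens = tokens[:max_words]
    pos_tags = [tag for _, tag in nltk.pos_tag(tokens)]  # Penn Treebank tagset
    return entropy(tokens), entropy(pos_tags)

h_word, h_pos = text_entropies("The cat sat on the mat. A dog barked at the cat.")
print(f"wordform entropy: {h_word:.3f} bits; POS entropy: {h_pos:.3f} bits")
```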
The use of wordform entropy and part-of-speech (POS) entropy in evaluating textual complexity is well established in prior research (Liu et al., 2022b; Shi and Lei, 2020). Wordform entropy, calculated from the distribution of each word’s occurrences within a text, quantifies the unpredictability or randomness of word usage. High wordform entropy signifies a text that uses a broad vocabulary in a relatively even distribution, suggesting linguistic richness and a higher level of lexical sophistication (Liu et al., 2022a). Such texts may be more challenging to comprehend due to the wide range of vocabulary involved. Conversely, low wordform entropy, indicating repetitive use of certain words, denotes simpler texts that are often more suitable for beginner readers or contexts where clarity and repetition are valued. Wordform entropy thus captures both the depth and the utilization of a text’s vocabulary, reflecting the lexical complexity that influences reader comprehension and engagement (Liu et al., 2022b; Shi and Lei, 2020). Similarly, POS entropy assesses the diversity of grammatical structures by analyzing the distribution of parts of speech within a text (Liu et al., 2022a). Using a probability-based entropy formula, it measures the uncertainty in the choice of grammatical categories. High POS entropy reflects the use of a variety of grammatical constructs, indicating complex sentence formations that contribute to syntactic richness. This complexity can increase the cognitive load on readers, as intricate grammatical structures often involve variations in sentence length, clause embedding, or innovative uses of language, characteristics typical of advanced texts. Conversely, lower POS entropy indicates a predominance of specific parts of speech, leading to uniform and potentially simpler syntactic structures; such simplicity may limit the expressive breadth and syntactic diversity of the language. Taken together, these measures provide a nuanced view of textual complexity by integrating both lexical and grammatical dimensions, which is essential for analyzing language use across different types of texts.

Data analysis

To address the research questions, two two-way ANOVAs were conducted. The first assessed the influence of translation status and genre on lexical complexity, the dependent variable, quantified through wordform entropies. Translation status was categorized into translated and non-translated texts, and genre was divided into four categories: press, general prose, academic prose, and fiction. This ANOVA aimed to identify any main effects of translation status and genre, as well as any interaction between these factors, on lexical complexity. Following the identification of an interaction effect, a Tukey post hoc test was performed to pinpoint specific differences between the groups.

Similarly, the second two-way ANOVA explored the effects of translation status and genre on syntactic complexity, measured by POS entropies. Translation status was again categorized into translated and non-translated texts, while genre comprised press, general prose, academic prose, and fiction. This test aimed to determine whether translation status and genre independently or interactively influenced syntactic complexity. As in the first analysis, upon detecting significant interaction effects, a Tukey post hoc analysis was conducted to elucidate specific differences between the groups. This methodical approach ensures a thorough examination of how translation status and genre may shape syntactic structures across text types.
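As an illustration of how such an analysis can be run, here is a minimal sketch using pandas and statsmodels. The input file name (entropies.csv) and column names (entropy, corpus, genre) are hypothetical choices of ours, not artifacts of the study, and the Tukey grouping shown is one reasonable way to follow up a significant interaction.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical input: one row per text, with its entropy value, corpus
# ("FLOB" = non-translated, "COCE" = translated), and genre label.
df = pd.read_csv("entropies.csv")

# Two-way ANOVA with an interaction term (translation status x genre).
model = ols("entropy ~ C(corpus) * C(genre)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Tukey HSD post hoc comparisons across the corpus-genre combinations.
groups = df["corpus"] + ":" + df["genre"]
print(pairwise_tukeyhsd(df["entropy"], groups))
```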
Results

Descriptive statistics for wordform entropy and POS entropy across the two corpora are presented in Tables 2 and 3 and Fig. 1. We calculated wordform entropy and POS entropy for the four genres, encompassing 15 text types, in both FLOB and COCE. The findings reveal that native English texts (FLOB) consistently exhibit lower average wordform entropy values than translated texts (COCE) across all four genres. Conversely, translated texts display lower mean POS entropy values than non-translated texts in general prose, academic prose, and fiction, while registering higher mean POS entropy values in news.

Table 2 Descriptive statistics for wordform entropy of FLOB and COCE.

Table 3 Descriptive statistics for POS entropy of FLOB and COCE.

Fig. 1: Boxplots of wordform entropy and POS entropy of FLOB and COCE. Boxplots compare wordform entropy (left) and POS entropy (right) across four genres (News, Prose, Academic, Fiction) in the FLOB and COCE corpora. COCE is represented in turquoise and FLOB in orange. Medians are indicated by lines within the boxes, with outliers shown as dots.

To analyze the statistical differences in wordform entropy between the two corpora, a two-way ANOVA was conducted with corpus and genre as the independent variables. The results revealed a significant main effect of corpus, indicating a statistically significant difference in wordform entropy between the two corpora (F(1, 992) = 42.918; p