Background & Summary

The recognition of non-native accents in spoken language has gained significant attention in speech processing research, particularly with the advent of deep learning techniques. Proper identification of an accent increases the efficiency of speech recognition, improves human-computer interaction, and supports sound linguistic study. The ArL2Eng dataset [1] is designed to meet the demand for a robust and varied collection of Arabic-accented English speech to support the development of machine learning models for automated fluency prediction. The dataset was originally used in [2], where it served to evaluate the efficiency of deep learning techniques for accent recognition. Another application of the ArL2Eng dataset is the assessment of L2 English fluency, which is becoming crucial in language teaching. Traditional methods for assessing fluency depend on subjective judgments from a human examiner; such processes are prone to inconsistency, which raises reliability issues. For this reason, it is important to introduce a consistent dataset for determining English fluency among Arabic speakers.

Methods

The ArL2Eng dataset is a collection of audio files from Arabic-accented L2 speakers of English from all over the world, recorded with high-quality microphones in controlled environments. The dataset ships with a set of pre-extracted features relevant to machine learning applications: MFCCs [3], which capture the main phonetic characteristics of the speech and are used to assess the level of fluency, and PCA [4], a dimensionality reduction applied to the MFCCs to improve training as well as computational efficiency.

The data is primarily a new collection, to which a set of existing recordings is added. The existing data come from [5] and [6], while the newly collected data are available in [1]. We record the samples in two ways: directly (in person) or online. In both cases, we use a questionnaire [7] to collect the audio recordings.

To capture how Arabic speakers pronounce English words, including potential transfer errors from their native language, the following practical tasks are carried out:

- Provide the "Please call Stella" paragraph to the participant to read aloud.
- Inform the participant and obtain consent for audio data collection, particularly for sensitive or identifiable content.
- Record the audio sequence.
- Request a repetition from the participant if the sequence is not audible or noise is noticed.
- Analyse phonetic features (stress patterns, intonation, vowel length, rhythm) using the processing pipeline (MFCC and PCA).

MFCCs are selected as the primary features because of their extensive use in phonetic analysis. Reducing their dimensionality with PCA is justified by improved model generalization and computational efficiency. The overall process for collecting and processing data is described in Fig. 1, which illustrates the design of ArL2Eng. First, data are collected. Then, data are processed using MFCC and PCA. Afterwards, the fluency of candidate recordings is assessed by human raters. Next, models are trained and tested (using machine learning and deep learning). Finally, the prediction and classification of the accent is performed.

Fig. 1 Design of the ArL2Eng.
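To make the MFCC step above concrete, the following is a minimal sketch of per-utterance feature extraction, assuming the librosa package with its default frame settings; the file path is hypothetical, and the authors' exact extraction parameters may differ.

```python
# Sketch: extract 13 MFCCs plus delta and delta-delta streams for one recording.
import librosa

def extract_mfcc_streams(path, sr=16000, n_mfcc=13):
    """Return the MFCC, delta, and delta-delta matrices, each of shape (13, n_frames)."""
    y, sr = librosa.load(path, sr=sr)                        # mono waveform, resampled
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # static coefficients
    delta = librosa.feature.delta(mfcc)                      # first-order dynamics
    delta2 = librosa.feature.delta(mfcc, order=2)            # second-order dynamics
    return mfcc, delta, delta2

if __name__ == "__main__":
    streams = extract_mfcc_streams("01 All Accent detection data/example.mp3")  # hypothetical file
    for name, matrix in zip(("mfcc", "delta", "delta-delta"), streams):
        print(name, matrix.shape)
```

The three streams are later pooled into one fixed-size vector per utterance, as described in the PCA section below.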
Ethical approval

The ethical approval process at the University of Tabuk is overseen by the Scientific Research Ethics Committee under approval number UT-IRB-2021-190. This committee ensures that all research involving human or animal subjects adheres to national and international ethical standards. The institutional review board (IRB) committee comprises ten members with diverse backgrounds, including scientific and non-scientific individuals, to provide comprehensive oversight. The ethical approval is based on the consent text presented to participants before they fill in the form [7], which reads as follows: "Hello. What is required in this scientific research is to read the following paragraph and send the audio recording. The paragraph is a sentence approved within the framework of a scientific research at Tabuk University. Correct pronunciation in English is not important at all, as the research aims to develop an artificial intelligence program that predicts a person's region or country based on how they pronounce some English words. Filling out this questionnaire is an acknowledgement of the right of the data collector to use and republish it. The recordings will be used for scientific purposes without recording any personal data."

PCA dimensionality reduction and evaluation

Each utterance is handled as a sequence of 13-dimensional MFCC vectors extracted from short overlapping frames of the audio signal. This variable-length sequence is converted into a fixed-size feature vector appropriate for modelling by computing statistical descriptors, namely the mean and standard deviation of each MFCC coefficient across all frames. As shown in Table 1, the resulting 78-dimensional feature vector per utterance (13 coefficients × 2 statistics × 3 streams: MFCC, delta, delta-delta) captures both spectral and dynamic speech properties.

Table 1 MFCC feature description and processing for PCA.

To enhance computational efficiency and minimize redundancy, PCA is applied to the 78-dimensional utterance-level feature vectors as specified in Table 2. PCA retained 95.2% of the cumulative explained variance using only nine components. This dimensionality reduction preserves the most important phonetic information used to train the models while keeping complexity and training time to a minimum.

Table 2 Results of PCA evaluation and feature retention.

A Kaiser-Meyer-Olkin (KMO) test was performed to evaluate the sampling adequacy for PCA. The KMO value of 0.84 confirms the suitability of PCA for this dataset. In practice, PCA improved the validation accuracy by 4.5% and reduced the training time by 22%, owing to reduced overfitting. Fig. 2 shows the explained variance of the first 15 PCA components. The elbow point occurs at the ninth component, beyond which components contribute only marginally to the total variance.

Fig. 2 Explained Variance Ratio by PCA Components.
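The pooling and dimensionality-reduction steps described above can be sketched as follows, assuming scikit-learn and the per-utterance streams from the previous sketch; standardizing the 78-dimensional vectors before PCA is an assumption, since the exact preprocessing is not specified here.

```python
# Sketch: pool each utterance to a 78-dim vector (13 coefficients x 2 statistics
# x 3 streams), then project all utterances onto the first nine principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pool_utterance(mfcc, delta, delta2):
    """Mean and standard deviation of each coefficient in each stream -> shape (78,)."""
    stats = []
    for stream in (mfcc, delta, delta2):          # each of shape (13, n_frames)
        stats.append(stream.mean(axis=1))         # 13 means
        stats.append(stream.std(axis=1))          # 13 standard deviations
    return np.concatenate(stats)

def reduce_features(X, n_components=9):
    """Standardize the utterance-level vectors and apply PCA."""
    X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance per feature
    pca = PCA(n_components=n_components).fit(X_std)
    print("cumulative explained variance:", pca.explained_variance_ratio_.sum())
    return pca.transform(X_std)                   # shape (n_utterances, n_components)

# Usage: build X with one pooled row per utterance, then reduce it.
# X = np.vstack([pool_utterance(*extract_mfcc_streams(p)) for p in audio_paths])
# X_reduced = reduce_features(X)
```

The nine-component setting mirrors the elbow reported in Fig. 2; on other data the retained variance should be checked rather than assumed.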
Data Records

Access

The ArL2Eng dataset is licensed under the Creative Commons Attribution 4.0 International license, allowing wide distribution in academic and educational research. The 238 files taken from [8] and integrated into our dataset are also provided by their source under the Creative Commons Attribution 4.0 International license.

Documentation

Comprehensive documentation, including detailed descriptions of the dataset and instructions for use, is provided alongside the dataset.

Source of data

The ArL2Eng dataset is available at [1]. It is composed of audio recordings from native Arabic speakers of L2 English with varying proficiency levels from different regions, mainly North Africa/ArabMagrib (Tunisia, Algeria, Morocco, Libya), Gulf (Saudi Arabia, Qatar, UAE, Oman, Kuwait, Bahrain, etc.), Levant (Jordan, Lebanon, Palestine, Syria), Iraq, Egypt-Sudan, and Other (Yemen, Somalia, USA, UK, Indonesia). Participants were asked to read a predefined English passage (the "Please call Stella" text [8]), ensuring a consistent phonetic structure across recordings.

Compared to other datasets involving Arabic speakers of English as a second language, our dataset contains more recordings: the dataset in [8] includes 238 participants, all of whom are included in our dataset (among the 640 participants of ArL2Eng, as indicated in Table 3).

Table 3 Participants involved in ArL2Eng and existing in [8].

One hundred of these 238 participants' files also exist in [5]. Another dataset [6], none of whose files are included in our dataset, is presented in Table 4. This dataset [6] involves only 29 Arabic speakers of L2 English and uses a paragraph different from the "Please call Stella" one.

Table 4 Participants not involved in ArL2Eng and existing in [6].

Dataset

The content of the dataset repository is composed of the following items:

- A first folder named "01 All Accent detection data" containing all audio files without classification.
- A second folder named "02 Classified data fluent-non fluent" containing classified audio files. This folder contains two sub-folders named "fluent" and "non fluent", both of which contain the sub-folders "ArabMagribMP3", "IraqMP3", "JordanMP3", and "SaudiMP3".
- The spoken paragraph in "Paragraph P1.txt".
- A spreadsheet with metadata named "ArL2Eng_data_description v2.xlsx" that describes the audio files (a loading sketch follows this list). This spreadsheet contains the following columns: "speakerid", "age", "age of English onset", "sex", "country", "filename", "our sample?", "name in [6]", "name in [5]", "birth place", "native language", "country of English residence", "length of English residence (years)", "Fluency Score", "Audio file quality", "Place/environment".
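A minimal sketch of reading the metadata spreadsheet with pandas is given below; it assumes the file sits in the working directory, the openpyxl engine is installed, and the column headers match the listing above. The exact sheet layout of the published file may differ.

```python
# Sketch: load the ArL2Eng metadata spreadsheet and inspect a few columns.
import pandas as pd

meta = pd.read_excel("ArL2Eng_data_description v2.xlsx")

print(len(meta), "metadata rows")                       # one row per described audio file
print(meta["country"].value_counts())                   # regional coverage
print(meta["Fluency Score"].value_counts())             # distribution of fluency ratings
print(meta[["speakerid", "filename", "age", "sex"]].head())
```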
Sampling method

To ensure a wide representation of age, sex, and linguistic backgrounds, participants were recruited from language learning programs, educational institutions, and various places outside the academic environment (such as public places and community groups in industry, agriculture, medicine, commerce, sport, and culture). The dataset includes both male and female speakers, with ages ranging from 10 to 70 years, to capture a broad spectrum of L2 English fluency levels. The sociodemographic distribution of the participants is shown in Tables 5 and 6: Table 5 covers the 407 participants whose data we collected ourselves, and Table 6 covers all 640 participants (our data plus the data from [5,6]).

Table 5 Sociodemographic distribution of new participants according to their activity fields.

Table 6 Distribution of all participants according to their ages.

A summary of statistics is as follows:

- Total recordings: 640 audio records.
- Average recording length: 28.45 seconds.
- Sex distribution: 326 male (50.94%), 314 female (49.06%).
- Accent distribution: North Africa/ArabMagrib (121), Gulf (170), Levant (136), Iraq (92), Egypt-Sudan (101), and Other (20).

Tools and instruments

Audio recordings were captured using high-quality microphones to ensure clarity and minimize background noise from noisy environments. The first recordings collected for ArL2Eng were saved in different formats. Afterwards, all files (fluent and non-fluent) were converted to MP3 format with a 128 kbps bit rate, a setting commonly used for speech processing tasks that preserves sufficient resolution for acoustic analysis.

Structure and format of data

The ArL2Eng dataset consists of two main components:

- Audio files: a collection of 640 audio recordings, each corresponding to a unique speaker reading the passage.
- Annotation file: all audio records are described in a metadata file named ArL2Eng_data_description.xlsx. This spreadsheet includes:
  - Fluency metric: assigned by human raters, either fluent or non-fluent. Fluency is defined as the perceived smoothness and flow of speech, as assessed by human experts, considering features such as hesitation, rhythm, articulation clarity, speech rate, and pause length.
  - Speaker information: speakerid, age, age of English onset, sex, country, filename, our sample, name in Kaggle (if it exists), name in accent.gmu.edu (if it exists), birthplace, native language, country of English residence, length of English residence (years).

ArL2Eng is organized into folders by country, with each folder containing audio files named according to a consistent naming convention (Table 7). A classification of the audio records according to the fluency of the English speech is also provided (Table 8).

Table 7 All ArL2Eng data classified according to country and source.

Table 8 ArL2Eng data classified according to fluency.

The fluency evaluation was carried out by three linguists using a holistic rubric grounded in the fluency research literature [9,10]. The focus is on the following criteria: speech rate, defined by the number of words per minute; smoothness and hesitation, indicated by the frequency of filler words; articulation clarity; and intelligibility, reflecting the ease of understanding. Each rater gave a binary classification (either fluent or non-fluent) based on these criteria, and the final classification was determined by majority consensus, as sketched below.
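A minimal sketch of the majority-consensus rule is shown below; the label strings are illustrative assumptions about how the three binary ratings might be encoded.

```python
# Sketch: collapse three independent binary fluency ratings into one final label.
from collections import Counter

def majority_label(ratings):
    """Return the label assigned by at least two of the three raters."""
    label, _votes = Counter(ratings).most_common(1)[0]
    return label

print(majority_label(["fluent", "non fluent", "fluent"]))      # -> fluent
print(majority_label(["non fluent", "non fluent", "fluent"]))  # -> non fluent
```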
Example entries

The first 560 audio files are described in the Excel metadata file ArL2Eng_data_description.xlsx (as indicated before, information about the remaining audio records is to be added). Table 9 illustrates an example of the metadata.

Table 9 Two participants from the ArL2Eng data.

Some audio files (21 of the 640 files) belong to Somali, Kurdish, Kabyle, and Amazigh persons. According to [5] and [6], these persons were born in Arab communities/countries and are either living there or lived there long enough to be considered native Arabic speakers.

Technical Validation

Validation methods

The dataset underwent rigorous validation to ensure the accuracy of the fluency metric and the quality of the audio recordings. A panel of three expert human raters independently assigned fluency values, and inter-rater reliability was calculated to ensure consistency. Since the ArL2Eng fluency ratings are continuous numeric scores given by human experts, the suitable test is the Intraclass Correlation Coefficient (ICC), which evaluates the agreement and consistency between experts. The ICC takes a value between 0 (unreliable) and 1 (highest reliability); an ICC above 0.75 indicates good reliability, and above 0.9 excellent reliability. In our case, the ICC assesses how consistently the spoken audio sequences are rated by the different experts. Our measured ICC of 0.88 indicates good (near excellent) agreement between the human raters and thus consistency of their fluency assessments.
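A minimal sketch of the ICC computation is given below, assuming the pingouin package and a long-format table with one row per (recording, rater) pair; the recordings, rater names, and scores are illustrative, not taken from ArL2Eng.

```python
# Sketch: intraclass correlation for a fixed panel of three raters.
import pandas as pd
import pingouin as pg

recordings = [f"rec{i}" for i in range(1, 6)]
scores = {"A": [4.0, 2.0, 3.5, 5.0, 1.5],
          "B": [4.5, 2.5, 3.0, 4.5, 2.0],
          "C": [4.0, 2.0, 3.5, 5.0, 1.5]}

ratings = pd.DataFrame(
    [(rec, rater, scores[rater][i])
     for i, rec in enumerate(recordings) for rater in ("A", "B", "C")],
    columns=["recording", "rater", "score"],
)

# pingouin reports several ICC variants; the two-way models (ICC2/ICC3) are the
# usual choices when the same panel of raters scores every recording.
icc = pg.intraclass_corr(data=ratings, targets="recording",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC"]])
```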
Accuracy and reliability

The accuracy of the accent labelling was validated through manual review by linguistic experts. A subset of the dataset was cross-validated with automatic speech recognition (ASR) systems to verify the reliability of the fluency annotations. The strong correlation between the fluency metric and the ASR performance metrics confirmed the validity of the annotations.

Usage Notes

Possible applicative research fields

The ArL2Eng dataset is appropriate for assessing models in multilingual speech recognition, speaker identification, and accent recognition. In particular, ArL2Eng can serve a variety of applications:

- Fluency prediction: training and validation of machine learning and deep learning models that predict a fluency value from acoustic features.
- Language assessment tools: development of automated tools enabling language educators to track the progress of English L2 learners.
- Speech processing: for linguistic studies, investigating the relation between perceived fluency and the phonetic features of Arabic accents in English speech.

Comparative analyses against other tools and software

Tools developed using this dataset can be compared to services such as the speaker recognition API of Azure AI Speech [11] from Microsoft, which is used to build multilingual AI frameworks with customized speech models.

Comparisons with other datasets

ArL2Eng can be compared to other datasets such as:

- Google AudioSet [12,13]: this dataset consists of an expanding ontology of 635 audio event classes, along with a collection of over 2 million ten-second sound clips drawn from YouTube videos. Human labellers added metadata, context, and content analysis.
- LibriSpeech ASR corpus [14,15]: a free public corpus of over one thousand hours of English speech taken from audiobooks. The recordings rely on texts from Project Gutenberg.

However, these datasets contain English speech from speakers who are not native Arabic speakers, which is not the case for our dataset.

Limitations

The "fluent vs. non-fluent" classification, which stems from second-language testing frameworks, is a binary model relying on a perceptual judgment of fluency. This dichotomy aligns with simplified fluency evaluation in the second-language literature [9,10]. Despite its reliability and simplicity, the dichotomy is limited to binary values. The dataset might be enhanced by a finer-grained scaled assessment with fluency scores rated by human experts. For better granularity, this future enhancement could use a multi-point rating (e.g., a scale from zero to five), as adopted in the CEFR fluency levels and the IELTS speaking band descriptors [16].

Moreover, although the dataset provides a diverse range of Arabic accents, some regional accents, such as the Egyptian accent, are underrepresented due to the unavailability of speakers. Adding more audio files for those regional accents would enrich the dataset.

Best practices

Researchers are advised to consider the demographic information of speakers when training models, to account for potential biases. It is also recommended to use the PCA-transformed MFCC features to reduce the computational load and improve model performance.

Code availability

For the fluent/non-fluent sub-dataset, the collected audio sequences were converted to MP3 format using standard, non-custom commands based on the ffmpeg software. We use a code (Supplementary Material 1) to calculate the quality of each audio file based on the word error rate (WER) as follows: quality, on a scale from 0 to 100%, equals (1 - WER) × 100%. Another code (Supplementary Material 2) is used to recognize new accent sequences. The results of the WER quality assessment are given in Supplementary Material 3 ("SM3 whisper_WER_results.xlsx"). To obtain the three-level classification (excellent, decent, poor), we can either apply thresholds that split the WER quality into three categories or have experts rate the files manually. The threshold-based classification is fully automated and can be reproduced across larger datasets; however, it can misclassify recordings with atypical speech patterns or accents that affect the WER independently of the actual audio quality. On the other hand, manual rating by human experts captures nuances in intelligibility and discounts background noise in a way that WER-based scoring cannot; however, it introduces subjectivity, does not scale to large datasets, and takes more time than automatic methods. The classification of accents into (poor, decent, excellent) uses additional speech features such as vowel formant patterns, consonant insertion or substitution (for example /p/ and /b/), and intonation and pitch contours. These features are embedded within the MFCCs and reduced in dimension using PCA; the accents are then classified with machine learning models (such as LSTM or CNN) on region-labelled data (for example Levant, Gulf, or North Africa). The output of the initial model is verified by human experts for consistency, especially where regional accents overlap. The transcriptions are provided in Supplementary Material 4 ("SM4 whisper_transcriptions.csv").
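As a complement to the description above, the following is a minimal sketch of the WER-based quality score (quality = (1 - WER) × 100%), assuming the openai-whisper and jiwer packages; the authors' Supplementary Material 1 may differ in model choice and text normalization, and the audio path is hypothetical.

```python
# Sketch: transcribe one recording with Whisper and score it against the read passage.
import whisper
from jiwer import wer

# Reference text: the passage shipped with the dataset.
with open("Paragraph P1.txt", encoding="utf-8") as f:
    REFERENCE = f.read().strip().lower()

def audio_quality(path, model_name="base"):
    """Map the recording's WER against the passage to a 0-100% quality score."""
    model = whisper.load_model(model_name)
    hypothesis = model.transcribe(path)["text"].strip().lower()
    error = wer(REFERENCE, hypothesis)
    return max(0.0, (1.0 - error) * 100.0)   # WER can exceed 1, so clamp at 0

if __name__ == "__main__":
    print(audio_quality("01 All Accent detection data/example.mp3"))  # hypothetical file
```

A fixed pair of thresholds on this score would then give the automated excellent/decent/poor split discussed above.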
References

1. Mnasri, S. ArL2Eng dataset to recognize Arabic accents from English speech. Figshare https://doi.org/10.6084/m9.figshare.27893778 (retrieved 29 April 2025).
2. Habbash, M. et al. Recognition of Arabic accents from English spoken speech using deep learning approach. IEEE Access 12, 37219–37230, https://doi.org/10.1109/ACCESS.2024.3374768 (2024).
3. Sahidullah, M. & Saha, G. Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition. Speech Commun. 54(4), 543–565, https://doi.org/10.1016/j.specom.2011.11.004 (2012).
4. Ratnovsky, A. et al. EMG-based speech recognition using dimensionality reduction methods. J. Ambient Intell. Human. Comput. 14, 597–607, https://doi.org/10.1007/s12652-021-03315-5 (2023).
5. Kaggle speech accent dataset. https://www.kaggle.com/rtatman/speech-accent-archive/version/2?select=speakers_all.csv (accessed 17 April 2025).
6. Dialects archive speech accent dataset. https://www.dialectsarchive.com (accessed 17 April 2025).
7. Our online questionnaire. https://forms.gle/9SAgPGqjPMQaCTNj9 (accessed 17 April 2025).
8. Weinberger, S. Speech Accent Archive. George Mason University. http://accent.gmu.edu (2015, accessed 17 April 2025).
9. Lennon, P. Investigating fluency in EFL: a quantitative approach. Language Learning 40(3), 387–417, https://doi.org/10.1111/j.1467-1770.1990.tb00669.x (1990).
10. Derwing, T. M., Rossiter, M. J., Munro, M. J. & Thomson, R. I. Second language fluency: judgments on different tasks. Language Learning 54(4), 655–679, https://doi.org/10.1111/j.1467-9922.2004.00282.x (2004).
11. Azure AI Speech. https://azure.microsoft.com/en-us/products/ai-services/ai-speech (accessed 17 April 2025).
12. Gemmeke, J. F. et al. Audio Set: an ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776–780, https://doi.org/10.1109/ICASSP.2017.7952261 (New Orleans, LA, USA, 2017).
13. Google AudioSet. https://research.google.com/audioset (accessed 17 April 2025).
14. Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210, https://doi.org/10.1109/ICASSP.2015.7178964 (South Brisbane, QLD, Australia, 2015).
15. LibriSpeech. https://www.openslr.org/12 (accessed 17 April 2025).
16. Dashti, L. & Razmjoo, S. A. An examination of IELTS candidates' performances at different band scores of the speaking test: a quantitative and qualitative analysis. Cogent Education 7(1), https://doi.org/10.1080/2331186X.2020.1770936 (2020).

Acknowledgements

The authors extend their appreciation to the Deanship of Scientific Research at the University of Tabuk for funding this work through Research no. S-1442-0198. The authors would like to thank all the contributors who participated in the dataset. Special thanks to the linguistic experts who assisted in the validation process.

Author information

Authors and Affiliations

Applied College, University of Tabuk, Tabuk, 47512, Saudi Arabia: Manssour Habbash, Sami Mnasri, Mansoor Alghamdi, Malek Alrashidi & Ahmad B. Hassanat

Faculty of Information Technology, Mutah University, Kerak, 61710, Jordan: Ahmad S. Tarawneh

College of Computer Science, University of Tabuk, Tabuk, 47512, Saudi Arabia: Abdullah Gumair
Contributions

Data collection, verification, and curation: Habbash M., Hassanat A., Mnasri S., Alghamdi M., Alrashidi M., Gumair A. Formal analysis and resources: Habbash M., Mnasri S., Hassanat A., Tarawneh A., Alrashidi M. Data annotation, preprocessing, and technical validation: Mnasri S., Hassanat A., Habbash M., Alghamdi M. Software: Hassanat A., Mnasri S., Tarawneh A. Writing the manuscript: Mnasri S., Alghamdi M., Alrashidi M., Habbash M.

Corresponding authors

Correspondence to Manssour Habbash or Sami Mnasri.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

- Classification code
- Fluency recognition code and detailed results
- whisper_transcriptions
- whisper_WER_results

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.