Voices of the formerly enslaved: A new text corpus of narratives by formerly enslaved persons


Background & Summary

The aim of this research project is to study the living conditions of enslaved and/or recently emancipated persons who resided in the United States of America (and, to a small extent, in the Caribbean). This research will be performed using an annotated corpus of texts that consists of different parts, or subcorpora. The data presented in this article were derived from two of these parts: autobiographies by and interviews with such persons.

The current state-of-the-art research on historical living standards relies on quantitative methods for analysing large samples of historical populations1. This approach has generally overlooked the testimonies of enslaved individuals, despite their significant value. These first-hand accounts offer crucial insights into the social history of slavery, detailing the experiences of enslaved people both during and after their enslavement. While some historians have qualitatively analysed evidence of these experiences2,3, a systematic analysis of a large dataset documenting the lived experiences of formerly enslaved individuals has not yet been conducted. These testimonies also matter linguistically, since they can shed light on the language used by this social stratum. Previous language studies, such as Schneider4, have used only a small portion of these data. Future research, however, could leverage the full dataset to yield more robust findings.

Due to common interests, a collaboration with the Baquaqua: The Afro-Diasporic Text Corpus Project (https://library.morgan.edu/aatc) has been initiated, and discussions on best tool practice have been held both online and on site at Morgan State University Library. For our joint purposes, one subcorpus, AATC, in our dataset5 consists of public domain texts they are also working on.
All the corpora (the whole and its parts), made available for open access, have the additional advantage that they can be put to use for a wide range of other research, addressing a variety of research questions in both the Humanities and the Social Sciences.

The sources of data used for this research are interviews with and autobiographies by formerly enslaved persons, in previous research often called “slave narratives” more generally. In such records, the formerly enslaved individuals describe how they remember their own lived experience of slavery. Such records can be found in various historical archives or in published form, from various countries and in various languages6,7. All sources used for this article and our corpus version 0.15 are openly accessible.

Two key sets of sources were used for the pilot corpus, as seen in Table 1. The first set of records comprised autobiographies published by individuals who had previously experienced enslavement. These autobiographies had previously been collated and digitised in a separate project, entitled Documenting the American South (here abbreviated DocSouth), by the University Library at the University of North Carolina at Chapel Hill (https://docsouth.unc.edu/index.html). It is estimated that more than 200 such autobiographies were published during or in the aftermath of the period when slavery was legal. In this study, 188 out of 446 texts from the DocSouth collection have been included: all the first-person narratives by African-Americans in this collection. These texts are from various states within the USA and from some parts of the Caribbean. The metadata currently rely on the geographical metadata from the DocSouth project. Exact details of the currently known geography are found in the Metadata file, in the different variables named Geography.

Table 1: Source material size and status for corpus version 0.1.

The second set of records was interviews conducted by the Federal Writers’ Project (FWP)8.
These interviews were conducted in the 1930s with individuals who had been enslaved during their childhood or early adulthood. It is estimated that the total number of interviews amounts to several thousands4. In future releases, therefore, the intention is to add more interviews. The metadata are described in more detail in the Methods section. For the pilot study and corpus version 0.15 described in this article, 33 volumes collected in 17 states (as seen in Table 2) were included, with a total of 2243 interviews.

Table 2: The publishing US state and number of interviews per FWP volume included in corpus 0.1.

As seen in Table 2, the Arkansas interview volumes comprise a large part, 7 of the 33 volumes, with Georgia, South Carolina and Texas each represented by 4 volumes. Different volumes include different numbers of interviews. Volume 7, compiled in Kentucky, is a composite book that incorporates interview transcripts alongside extraneous material. For the present release (v0.1) we extract and present the individual interviews from that volume; the so-called combined interviews will be included in v1.0. Material that is not interview-based is excluded from the project.

The present metadata file, for the FWP volumes in version 0.1 of the corpus, only includes the US state and the names of the interviewees, except for one representative volume. For this volume (Vol04_03 in Table 2), we manually added a more specific place, recorded by the interviewer or editor on the first page of an interview, as well as details on the language variety on each page. In later versions of the corpus, however, there will be such metadata for all volumes, and geographical metadata for both datasets will be improved by adding places mentioned in the texts after a Named Entity Recognition analysis.

Due to the age of these autobiographies, copyright protection for the texts has expired.
As for the FWP records, output by federal employees in the USA is not covered by copyright. The original records collected for the corpus are thus all in the public domain, available for research purposes. The copyright holders of the digitised versions of these DocSouth records have granted us permission to include them in the corpus for research purposes. The records from the FWP had likewise been digitised, by the U.S. Library of Congress, and made available online for research in their “Born in Slavery” collection of documents (https://www.loc.gov/collections/slave-narratives-from-the-federal-writers-project-1936-to-1938/about-this-collection).

Methods

The creation of the corpus and its resources is a progressive process. In certain cases, steps are taken to test a specific workflow, and these tests may be removed if the results are deemed unsatisfactory or not useful. To ensure the robustness of this process, it has been configured to operate on an iterative basis, with each step having its own version. The process thus entails the progressive enhancement of the project’s corpus resources, using additional components and workflows as required. The different editions will be versioned during the process, to facilitate their use. This exercise was first initiated using the modest AATC corpus and a part of our own data, with the objective of surveying the potential workflows and existing tools to create our corpus. This approach was informed by our prior experiences with open standards and XML tooling.

The processing of all corpus data, inclusive of the automatically extractable metadata, was conducted utilising the Sparv 5.3.1 pipeline9. Manually extractable metadata will be included in the process later during the project.
All extractions, including the OCR step to obtain TEI-XML for further processing, were facilitated by Tesseract (v5.4.1, https://github.com/tesseract-ocr) as a Sparv plugin in the pipeline. Furthermore, comparisons were conducted to reveal a) whether or not our automated analyses improved, and b) whether or not a specific step in the process worked better than others. These comparisons were partly conducted to identify improvements within individual analyses, and partly to find possible trends in which analysis works better for our different text types.

One annotator manually proofread the first iteration, incorporating variables such as a corrected part-of-speech (PoS) tag and a corrected lemma. The annotator also provided commentary on instances where the PoS could be considered uncertain or where a single token might possess multiple lemmata. An example is the token “de”, which may have either the lemma “the” or the lemma “they”. A more prominent example is the lemma “master”, a typical keyword for these texts, which was found to have about 20 different token forms in the first round, including the standardised spelling in capital letters only or in plural form. These corrections were then fed back to train a new annotation model. The pilot project annotations are available in the “Annotations” folder, in different formats such as csv, conllu and xml (https://doi.org/10.23695/P5HW-DR52).

The original texts (pdfs) were subjected to a manual examination to identify and categorise pages based on their linguistic characteristics. Metadata curation, encompassing both descriptive and content-related data, constituted a significant phase of the research methodology. The implementation of unique identifiers streamlined subsequent processes, including data validation, entry, extraction, and testing.
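The lemma–token consolidation described above (for example the roughly 20 token forms annotated with the lemma “master”) can be illustrated with a short script. This is a minimal sketch, not the project’s actual Sparv tooling; it assumes the standard 10-column CoNLL-U layout mentioned among the export formats, and the sample sentence is invented.

```python
from collections import defaultdict

def lemma_variants(conllu_text):
    """Group surface token forms by their annotated lemma.

    Assumes the standard 10-column CoNLL-U layout: FORM in column 2,
    LEMMA in column 3. Comment lines start with '#'.
    """
    variants = defaultdict(set)
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            continue
        variants[cols[2].lower()].add(cols[1])
    return variants

# Invented sample in the corpus's transcribed spoken variety.
sample = (
    "# text = Marster say dey gwine\n"
    "1\tMarster\tmaster\tNOUN\t_\t_\t0\t_\t_\t_\n"
    "2\tsay\tsay\tVERB\t_\t_\t0\t_\t_\t_\n"
    "3\tdey\tthey\tPRON\t_\t_\t0\t_\t_\t_\n"
    "4\tgwine\tgo\tVERB\t_\t_\t0\t_\t_\t_\n"
)
print(sorted(lemma_variants(sample)["master"]))  # → ['Marster']
```

Inverting such a mapping over a whole subcorpus surfaces exactly the kind of spelling clusters the annotator reviewed manually.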
A distinction was made between pages such as colophons (included in the variable named Other_pages), pages employing a standardised linguistic register (included in the variable Standard_variety_pages), and those featuring transcribed spoken language. The latter exhibited orthographic variations not present in standard English dictionaries, exemplified by spellings such as “mout” for “might” and “gwine” for “going”. (Although Schneider4 considers the term ‘dialect’ to be neutral, we consider the term ‘variety’ to better describe these discrepancies.) Another illustration of this phenomenon can be observed in phrases such as “Dey bilt dem a house” for “They built themselves a house”. In most cases, this metadata variable, called Spoken_variety_pages, refers to pages with a transcribed spoken variety, where vernacular speech is cited by the interviewer or editor of the text. It is important to note that this speech is mediated by the interviewer/editor, and that this process of mediation sometimes entailed substantial editing of the transcribed speech10. A third spelling-related variable pertains to instances where, in a volume of otherwise mainly standardised-variety English, a part of a page is composed of speech or dialogue, transcribed to emulate the African-American vernacular English of the era. This metadata variable is called Utterance_pages.

Subsequently, the Sparv pipeline employs the lemma annotations to generate input for, e.g., word embedding model generation. The Named Entity Recognition (NER) annotations are used as input for subsequent steps.

The steps explained above are taken not only to create a corpus that includes as many of these narratives as possible, but also because they are necessary to create a corpus that can be used through various corpus apps. Initially, the outputs are primarily created for Corpus Workbench-based tools, for example CQPweb, but there are also csv files that may be used in other tools.
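The page categorisation described above (Other_pages, Standard_variety_pages, Spoken_variety_pages) was done manually, but pages could in principle be pre-sorted with a dictionary-lookup heuristic before review. The sketch below is purely illustrative — the threshold, the toy wordlist, and the fallback behaviour are all our own assumptions, not the project’s procedure.

```python
import re

# Toy standard-English wordlist; in practice a full dictionary would be used.
STANDARD_WORDS = {
    "they", "built", "themselves", "a", "house", "the", "master",
    "was", "kind", "to", "all", "of", "us",
}

def classify_page(page_text, wordlist=STANDARD_WORDS, threshold=0.2):
    """Heuristic pre-sorting of pages by spelling profile.

    Pages whose share of out-of-dictionary tokens reaches `threshold`
    (an invented cut-off) are flagged as transcribed spoken variety;
    pages with no tokens at all fall back to Other_pages.
    """
    tokens = re.findall(r"[a-z']+", page_text.lower())
    if not tokens:
        return "Other_pages"
    unknown = sum(1 for t in tokens if t not in wordlist)
    if unknown / len(tokens) >= threshold:
        return "Spoken_variety_pages"
    return "Standard_variety_pages"

print(classify_page("Dey bilt dem a house"))           # → Spoken_variety_pages
print(classify_page("They built themselves a house"))  # → Standard_variety_pages
```

A heuristic like this could only rank pages for human inspection; the manual pass remains necessary, since OCR errors also produce out-of-dictionary tokens.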
An example is shown in Fig. 1, where an early version of the FWP subcorpus, the transcribed interviews, has been opened in AntConc11. This concordance example shows not only that there are different spellings of one and the same word (“master”), but also that the other words have plenty of issues, such as the previous OCR having converted letters to numbers.

Fig. 1: The first OCR version of the FWP interviews as seen in AntConc.

In Fig. 1, we see a search for mas*, meaning all tokens that start with the letter combination mas, to include more spellings of the word “master”. In addition, throughout all of these concordance lines, the token “be” should be interpreted as different forms of the verb “to be”, most often “is”, but it could likely also be “was”, “used to be” or even “has been”. These two words illustrate two of the data issues – lemmatisation and misinterpreted OCR – that we aim to solve with this project, one step at a time, before we start analysing the texts themselves. Files to be used in stand-alone software such as this are found in the “Sources” folder. The folders there contain source material used to perform the analyses found in the “Analyses” folder. Instructions are available, and will be updated during the rest of the project, in the “Documentation” folder in our repository (https://doi.org/10.23695/P5HW-DR52).

For future releases of the corpus, we are adding text-type classifier models, as well as topic modelling data, semantic and frame data, and sentiment annotations. We expect to have SQL tables and other structured data for use within different tools before 2027.
There will be information on syntactic relations to use in, for example, Korp’s “word pictures” (word sketches), but also resources in other formats than the current ones, for exploration in other Corpus Workbench-based tools.

Data Records

The dataset6 and related resources are available at a Språkbanken Text repository (https://spraakbanken.gu.se/en/resources/votfe-pilot), with this section being the primary source of information on the availability and content of the data being described. The data belonging to corpus version 0.1 have the following structure:

Analyses material:
- Annotations, mostly token and sentence level, inlined for all parts of the entire corpus:
  - A frequency list for each subset of the corpus.
  - Resources in vrt format for exploration in Corpus Workbench-based tools.
- Stand-alone output:
  - The whole, fully annotated corpus in different file formats (xml etc.).
  - The fully annotated sub-corpus files.
  - The whole corpus with fewer annotations in relevant formats (txt, conllu, csv etc.).
  - The less annotated sub-corpus files.

Documentation:
- Videos, for example on how to open a (sub)corpus file in different corpus tools.
- Written instructions.
- Illustrations, figures and pictures that we have used or will use in articles and presentations.

Metadata, semi-automatically extracted and manually curated, as well as manually produced. The metadata file includes a codebook data sheet, where all variables are explained.

License: CC-BY-SA 4.0.

Data Overview

The dataset and code in the project’s repository5 are version 0.1. An overview of the data is found in Fig. 2.

Fig. 2: Data flowchart overview.

Technical Validation

The digitised source texts for the corpus have been restructured at different times. The different pdf and xml files stored at the DocSouth and FWP websites have previously been restructured by their respective projects. However, it was found that the digitisation contained a substantial number of errors.
With advancements in technology since the initial digitisation, state-of-the-art optical character recognition (OCR) will now be employed to create a more accurate and consistent corpus. We have conducted new OCR tests on these records, and will use the advantages of the different test versions to create a more robust and consistent final corpus. Manual annotation has been employed to train classifiers, with inter-annotator agreement measured using Fleiss’ κ. This manual annotation has especially improved the lemmatisation process. The statistical significance of samples is determined through manual evaluation of automatically annotated parts. A comparison of different versions is then performed to identify any additions, removals, or omissions. This pilot project has revealed a substantial number of inconsistent normalisations.

The text excerpt in Fig. 3 is from volume 4, book 3 of the Federal Writers’ Project collection (https://www.loc.gov/resource/mesn.043/?sp=117&st=text&r=-0.301,-0.08,1.603,1.603,0, digital id https://hdl.loc.gov/loc.mss/mesn.043).

Fig. 3: Comparison images showing different sources of text and a gold transcription. 1) A pdf page from the FWP project’s website. Paragraphs are here marked by handwritten lines. 2) Gold transcription, manually corrected, of the text from 1). 3) FWP text variant 1, taken from the FWP website tab “Images with text”. 4) FWP text variant 2, as embedded in the website’s pdf version. 5) Tesseract baseline OCR using its default settings (Tesseract v5.4.1). 6) Gutenberg “translation”, where the rows were aligned by us to match the other images in this figure. 7) Gutenberg translation with original row numbering (note the difference from the image and other sources’ row numbering).

In images 3), 4), 5), and 6) of Fig.
3, the Character Error Rate (CER) and Word Error Rate (WER) scores are overlaid on the text images. The term “character error” means that two single characters differ, and “word error” that two tokens differ between two images.

The Tesseract baseline, image 5), with no adaptation (using the default settings), gave a CER of 94.92 percent and a WER of 91.35 percent. Compared to the versions from FWP, images 3) and 4), this tells us that such a modern OCR tool is useful for most cases, even without modification. The OCR from FWP in images 3) and 4) also contains chunks of text in the wrong order, as well as text from other paragraphs. Examples of this in image 3) are lines 4, 5 and 6. This is less obvious in image 4), but lines 3, 4 and 5 are jumbled. As these texts extracted from the PDFs exhibit similar behaviour but at the same time contain different errors, we can probably expect further error variants within other text samples, perhaps because they were produced with different engines and at different times. We have therefore prepared to improve the OCR using the Tesseract plugin.

When we compare this initial outcome with the Gutenberg curated versions added later, images 6) and 7), we can see that the WER score is 100 percent. However, their versions have been proofread, so a word accuracy of 100 percent is somewhat expected. Unfortunately, the Gutenberg versions are also inconsistently normalised. In these three paragraphs, the text happens to be identical, whereas in other samples there may be different corrections and even comments. This, on the other hand, gives us the opportunity to treat the FWP archive versions, which are also available from Gutenberg, as “translations” of each other, or comparable texts, for the whole sub-corpus. We expect far fewer Gutenberg versions to be available for the future work leading to corpus version 1.0, since they have not included all volumes and pages in their collection.
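For reference, CER and WER are conventionally computed from the edit (Levenshtein) distance between an OCR hypothesis and a gold transcription, at character and token level respectively. The sketch below shows this standard computation; it is not the authors’ evaluation code, and the example strings are invented. Note that as error rates, 0 means a perfect match, so whether a given figure reports error or accuracy (100 minus error) should be checked against the figure legend.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between
    two sequences, computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis, reference):
    """Character Error Rate: character edits per reference character."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

def wer(hypothesis, reference):
    """Word Error Rate: token edits per reference token."""
    hyp, ref = hypothesis.split(), reference.split()
    return levenshtein(hyp, ref) / max(len(ref), 1)

gold = "Dey bilt dem a house"
ocr = "Dey b1lt dem a honse"  # invented OCR errors
print(round(wer(ocr, gold), 2))  # → 0.4  (2 of 5 tokens differ)
```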
However, we have only used them as a reference, to note the difference between our OCR and theirs.

The evaluation of metadata variables is conducted through consistency tests, or positive and negative tests, for variables such as the interviewee’s or author’s year of birth. These tests are used to solve problems such as whether the data can be extracted automatically. A negative test can determine whether any values are empty when they should not be. The equivalent positive test confirms that the value is not only present, but also the same as in the source file. While birth years may be uncertain, the concept of (un)certainty is not yet a factor in the pilot study’s metadata assessment.

The content can be explored through word rain12 analyses, as Figs. 4 and 5 show. These semantic vector analyses were conducted on the DocSouth texts, to get an idea of the content. The models for these are available at our repository (https://doi.org/10.23695/P5HW-DR52). In word rains, words that have a similar semantic meaning are located in close proximity to each other. A preliminary finding from the vector analysis is the presence of a discernible semantic distinction between the nouns “man” and “person” in two of the different subsets, both originating from Documenting the American South.

Fig. 4: Word rain of a Word2vec analysis of the DSC first-person narrative sub-corpus: both “man” and “person” can be seen used for different people; they are in the same vertical space as both “white_man” and “colored_person”.

Fig. 5: Word rain of a Word2vec analysis of the DSC other narratives sub-corpus: “man” and “person” are used differently for different people; they are on opposite sides, with “white_man” in the same vertical space as “man” and “colored_person” in the same space as “person”.

As illustrated in Fig.
4, in the first-person narratives of formerly enslaved persons, the nouns “man” and “person” can be used to refer to any person, irrespective of their origin. As illustrated in Fig. 5, which is based on other narratives from individuals of diverse backgrounds, the noun “man” most often describes white persons, whereas “person” most often describes people of African-American descent.

Usage Notes

Corpus version 0.15 is based on the data in our pilot project, and later versions will contain more data and more analyses. On the referenced page, there are more detailed usage notes, and included in the dataset are “Read me” files. There is also a specific folder for instructional files and films in the repository folder “Documentation”.

Data availability

The dataset and other resources are available at Språkbanken Text’s resource site (https://doi.org/10.23695/P5HW-DR52).

Code availability

The code and data for the Voices of the formerly Enslaved Corpus are publicly available under a Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0). This is to make them as available as possible to others, since the data are part of the public domain and human cultural heritage.

References

How Was Life? Global Well-Being since 1820. (OECD Publishing, 2014).
Blassingame, J. W. The Slave Community. (Oxford University Press, New York, 1972).
Genovese, E. D. Roll, Jordan, Roll: The World the Slaves Made. (Vintage, New York, 1976).
Schneider, E. W. American Earlier Black English: Morphological and Syntactic Variables. (University of Alabama Press, Tuscaloosa, 1989).
Språkbanken Text. Corpus Voices of the formerly Enslaved. Språkbanken Text https://doi.org/10.23695/P5HW-DR52 (2026).
Blassingame, J. W. Slave Testimony: Two Centuries of Letters, Speeches, Interviews, and Autobiographies. (Louisiana State University Press, Baton Rouge, 1977).
Fogleman, A. S. & Hanserd, R.
Five Hundred African Voices: A Catalog of Published Accounts by Africans Enslaved in the Transatlantic Slave Trade, 1586–1936. (American Philosophical Society Press, Philadelphia, 2022).
Schwartz, M. J. The WPA Narratives as Historical Sources. In The Oxford Handbook of the African American Slave Narrative (ed. Ernest, J.) 89–100, https://doi.org/10.1093/oxfordhb/9780199731480.013.007 (Oxford University Press, 2014).
Hammarstedt, M., Schumacher, A., Borin, L. & Forsberg, M. Sparv 5 User Manual. (2022).
Lawrence, K. Introduction. In The American Slave: A Composite Autobiography, Supplement Series I (eds Rawick, G. P., Hillegas, J. & Lawrence, K.) xci–xcvi (Greenwood Press, Westport, Connecticut, 1977).
Anthony, L. AntConc. Waseda University (2024).
Skeppstedt, M., Ahltorp, M., Kucher, K. & Lindström, M. From word clouds to Word Rain: Revisiting the classic word cloud to visualize climate change texts. Information Visualization 23, 217–238 (2024).

Acknowledgements

This research has been funded by the Åke Wiberg Science Foundation (H24-0275), the Helge Ax:son Johnson Foundation (F25-0072) and the Swedish Science Foundation (2025-01228).

Funding

Open access funding provided by University of Gothenburg.

Author information

Authors and Affiliations

Department of Swedish, Multilingualism, Language Technology, University of Gothenburg, Box 100, SE-405 30, Gothenburg, Sweden
Irene Elmerot & Leif-Jöran Olsson

Department of Economy and Society, University of Gothenburg, Box 100, SE-405 30, Gothenburg, Sweden
Klas Rönnbäck

Contributions

Irene Elmerot has co-edited the metadata and annotations, distributed the texts to the other authors, annotated the frequency lists with corrections and comments, created Fig.
1 and co-created Fig. 2, as well as written the bulk of this article and re-written parts of it after peer review. Leif-Jöran Olsson has co-edited the metadata and annotations; run and rerun the texts through OCR and annotation pipelines, consistency checks, curation etc.; written the parts of the article that regard those issues and created Figs. 4–6; given the token numbers for Table 1, as well as co-created Fig. 2 and Table 2. Olsson has also compiled the final dataset and the repository site. Klas Rönnbäck has been in contact with the original text repository holders regarding copyright issues, given the source numbers for Table 1, and prepared the background section in this descriptor.

Corresponding author

Correspondence to Irene Elmerot.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.