Extracting toponyms from OpenStreetMap and other gazetteers: comparing representational accuracy in multilingual contexts

Introduction

Geographic Information Science (henceforth: GIS) and Linguistics may initially appear as unrelated disciplines. In highly schematic terms, GIS investigates various types of phenomena from the perspective of their geographical and spatial distribution (Fotheringham and Wilson, 2007). Linguistics, instead, investigates languages and their distinctive properties (Fasold and Connor-Linton, 2014). However, linguistics includes sub-disciplines such as geo-linguistics, dialectology, and toponomastics, which respectively study the geographical distribution of languages, dialects, and place names or toponyms (e.g. Perono Cacciafoco and Cavallaro, 2023). Conversely, several sub-disciplines in Geography and GIS study toponyms, their use and socio-cultural status across different cultures and languages (e.g. Alderman, 2022; Gnatiuk and Melnychuk, 2020). Thus, GIS and linguistics overlap in their domains of enquiry and research methodologies when focusing on spatial data, broadly defined. They then seem to converge in their focus on toponyms as a source of data regarding human understanding of spatial information, also broadly defined.

An open question is whether these disciplines can also share information sources from which they can extract their toponymic data. Recent linguistically oriented works have offered a preliminary positive answer by using OpenStreetMap (henceforth: OSM; see Footnote 1) to extract a large set of toponyms and analyse their grammatical properties (Ursini and Samo 2023). However, that work does not offer a detailed analysis of how information about toponyms and their properties appears in OSM. Therefore, this and other similar works do not address the theoretical and methodological problems that emerge once researchers extract, process, and manage data from this source, and compare these data with data from official sources (e.g. gazetteers and land registry data).

The goal of this paper is to analyse OSM as a research tool and data source for linguistics and GIS, thus comparing this source with official sources. In so doing, we also aim to show that OSM can be a useful inter-disciplinary source for linguistic and geographic research on toponyms (e.g. respectively, Perono Cacciafoco and Cavallaro, 2023; Rose-Redwood, Alderman and Azaryahu, 2018). We present two case studies in which we carried out toponym extraction and analysis in multi-lingual contexts defined at different scales and densities of geographic distribution (city level, Macao; regional and national level, Italy). We analyse the methodological problems emerging from this extraction procedure, the type of linguistic data and the degree of empirical coverage of OSM when compared with other sources. We thus aim to show that several disciplines studying toponyms can amply benefit from using the OSM source, in combination with other accessible sources.

To reach this goal, we organise our paper as follows. We first offer an overview of OSM and previous OSM-based works on toponyms, thus introducing three research questions (Section 'Literature review: previous research on OSM and current challenges'). We then present our methodology and materials (Section 'Methodology and materials'), and then the specific studies and results by which we answer each research question (Section 'Results'). Section ‘Discussion’ offers a discussion as a general answer to our research questions; Section ‘Conclusions’ concludes.

Literature review: previous research on OSM and current challenges

OSM is an online platform that offers 'a free, editable map of the world' (Curran et al., 2012, 2013; Keßler, 2017). Since its online appearance in 2004, OSM has provided open-source, easily editable maps of increasing detail and definition to all users and contributors (Arsanjani et al., 2015a; Mooney et al., 2017).
OSM founders based the platform on a philosophy known as volunteered geographic information (henceforth: VGI; Antoniou and Skopeliti, 2017; Goodchild, 2007; Keßler et al., 2009; Sui and Goodchild, 2011). Any registered user can become a contributor by inserting and editing information regarding locations and the objects occupying these locations. OSM has thus emerged as an important source of knowledge for researchers in GIS and other disciplines focusing on geo-spatial information (e.g. urban planning, data mining), due to its flexibility and ease of management.

Contributors can enrich OSM maps with spatial information, and the platform's core geographical objects work as follows (Almendros-Jiménez et al., 2021; Rajšp et al., 2021). Registered contributors can edit information based on their knowledge of locations. Contributions centre on the geographical objects shaping maps: nodes representing locations, ways representing connections among locations, and relations between nodes and/or ways. Each object has tags, labels indexing attributes ('keys') and values associated with locations (e.g. coordinates, altitude, shape, type of location). Tags represent objects on maps via a dual visual and textual format. Visually, tags are icons for objects on maps; textually, tags are key-value matrices of the type found in linguistics, GIS, and other computational disciplines (Gamerschlag et al., 2015; Sag, 2012). Tags are therefore unique multi-modal (i.e. visual and textual) indexes, as illustrated in Fig. 1 (left panel):

Fig. 1: Tags for the Colosseum in Rome. The visual tag (i.e. the Colosseum’s location) is the grey oval shape in the centre of the figure.

Contributors can introduce icons, keys and values according to their knowledge of a location and the objects that occupy this location; guidelines and graphical tools can streamline this process.
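
The object model described above (nodes, ways and relations, each carrying key-value tags) can be sketched in Python as follows. This is only an illustrative sketch, not OSM's actual storage format: the class names, the `toponym` helper, the identifiers and the tag values are assumptions made for the example and are not transcribed from Fig. 1. The only OSM conventions it relies on are the `name` key and language-specific `name:<code>` keys.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A single location with coordinates and key-value tags."""
    id: int
    lat: float
    lon: float
    tags: dict = field(default_factory=dict)


@dataclass
class Way:
    """An ordered list of node ids (e.g. a street or a building outline)."""
    id: int
    node_ids: list
    tags: dict = field(default_factory=dict)


@dataclass
class Relation:
    """A group of nodes and/or ways; members are (type, id, role) tuples."""
    id: int
    members: list
    tags: dict = field(default_factory=dict)


def toponym(element, lang=None):
    """Return the toponym of an element (hypothetical helper).

    Looks up 'name:<lang>' when a language code is given and falls back
    to the plain 'name' key otherwise.
    """
    key = f"name:{lang}" if lang else "name"
    return element.tags.get(key, element.tags.get("name"))


# Illustrative object: ids and tag values are made up for this sketch.
colosseum = Way(
    id=1,
    node_ids=[101, 102, 103, 101],
    tags={"name": "Colosseo", "name:en": "Colosseum", "tourism": "attraction"},
)

print(toponym(colosseum), "/", toponym(colosseum, "en"))
```

In this sketch the toponym is simply one value among the tags, which is why a multi-lingual comparison can be phrased as a comparison of `name` and `name:<code>` values for the same object.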
For instance, local contributors from a neighbourhood can insert information pertaining to two buildings that are still unreported in OSM. They can create two new objects and tags, fill the tags with sets of keys describing the buildings, and select 'building' icons (i.e. visual tags; Salvucci and Salvati, 2022; Zhou et al., 2022). Contributors can also insert keys and values for ways (e.g. streets connecting to these buildings), and the informational content of relations. For instance, buildings can operate as habitations for citizens; streets may be quite busy during rush hour, and so on. Contributors can continually update tags that represent the physical-geographical properties of objects, but also the possible relations between these objects and the individuals interacting with the objects (Mayer et al., 2022).

OSM tags can therefore offer information about places: geographical objects in which humans perform activities and to which they can possibly develop forms of social, cognitive, and psychological attachment (Cresswell, 2014; Malpas, 2018; Tuan, 1977). Toponyms can consequently act as names that carry this complex, partially subjective information via their semantic content and ability to refer to places and their attachment relations to human individuals (Blair and Tent, 2015, 2021; Perono Cacciafoco and Cavallaro, 2023). Nodes and ways can be objects representing places of increasing complexity: from buildings to cities and regions, and from streets to highway networks. Irrespective of this complexity, they can represent places and the rich informational content that contributors associate with places. OSM can thus operate as a multi-modal map integrating both spatial and 'platial', i.e. place-based information (Arsanjani et al., 2015; Mayer et al., 2022).

The focus on platial information has allowed GIS researchers to use OSM as a data source for several topics. OSM maps can include places as small as ATMs, trees, and benches (Touya et al., 2017).
Maps can also provide real-time information regarding risks affecting places (e.g. natural disasters: Cerri et al., 2021; Hecht et al., 2013; Seto, 2022; epidemic diffusion: Mooney et al., 2021; Mooney and Juhász, 2020). OSM maps can then provide online information about real-time updates that contributors perform on objects (e.g. fires in buildings: Novack et al., 2022; Schäfer and Kieslinger, 2016; Senaratne et al., 2017). Recent works have thus suggested that OSM maps are becoming increasingly spatially, temporally, and platially accurate. They therefore offer a dynamic view of the places they represent (Mocnik, 2022; Romm and McKenzie, 2023), to the benefit of casual users and geo-spatial scholars alike.

The fact that OSM can provide dynamic information updates and management tools has, however, raised certain research questions. Traditionally, research in GIS has centred on gazetteers, ordnance survey maps and other representational sources produced by official geographic institutions, i.e. authoritative geographic information sources (henceforth: AGI; Bortolini and Camboim, 2019; Fize et al., 2021). Geographic institutions mostly adopt top-down practices of information collection and management. For instance, officials from the land registry can register information about buildings and streets from local authorities owning these places. Local citizens may never be involved in these procedures, even if they have knowledge pertaining to the aforementioned places. OSM stakeholders, instead, have mostly adopted bottom-up management practices. Contributors are usually citizens interested in inserting information about their local places, often addressing coverage gaps cropping up in AGI sources (Keßler, 2017; Keßler et al., 2009).

Overall, the accuracy of contributors’ knowledge can vary considerably, thus casting a shadow on the accuracy of VGI sources.
Several studies have, however, shown that the soundness of spatial information in VGI sources strongly correlates with contributors’ formal education, motivation, and commitment to professional-like data insertion (Garba et al., 2022; Holthaus and Thiemermann, 2022; Jaljolie et al., 2023). Furthermore, several AGI sources have become freely accessible to the public, and thus accessible to OSM contributors for systematic 'information dumps', i.e. massive imports from other sources (Bravo and Sluter, 2022; Wu et al., 2022). Corporations and NGOs have also begun to hire professional contributors to perform large data set imports, since businesses and citizens’ communities consider the support of OSM beneficial (Anderson and Sarkar, 2020; Sarkar and Anderson, 2022). Hence, OSM is becoming a 'multi-source' model of information management, in which contributors reconcile top-down and bottom-up philosophies via carefully documented data (Hu et al., 2022).

The importance of this multi-source model in platial analysis becomes clear when one focuses on toponyms. In highly schematic terms, OSM tags usually include a 'name' key (i.e. attribute), among their many keys. The specific value for this key is usually the toponym assigned to a given place. In our example involving the new buildings, we can have an OSM tag for each building, including near-identical values, in case the same company built both buildings. However, we can have one building called 'Joseph Joestar', and the other 'Jotaro Kujo': toponyms are usually unique, distinctive labels for places. Citizens interacting with these buildings and OSM will likely exploit this platial uniqueness to refer to either building. For humans, toponyms are more cognitively accessible than pure spatial information (e.g. coordinates, lists of features: Perdana and Ostermann, 2018).
This is the case because they are the key language category that allows humans to talk about places (Alderman, 2022; Perono Cacciafoco and Cavallaro, 2023; Rose-Redwood et al., 2018).

One can thus define the central role of toponyms in OSM as follows (cf. again Mocnik, 2022, among others). Toponyms act as prime keys leading to the access of the complex semantic information defining place descriptions. Any contributor can insert an official or unofficial toponym for a place, and provide references/sources from which this information originates (e.g. personal knowledge, other volunteer-based sources, and authoritative sources). Other contributors can question the reliability of this information via their own sources, and one can solve any disputes via cross-referencing and online discussions aimed at avoiding 'editing wars'. Information about toponyms can receive updates in real time, as in the case of information about places. Toponyms in OSM are therefore platial data types that are as important as geographic data types, due to their immediate cognitive appeal to general users and researchers alike.

In recent times, several studies have used OSM for toponym analysis, exploiting its multi-source model of data integration (Ahmadian and Pahlavani, 2022; Hall and Jones, 2022; Kaisar Ahmed, 2022; Machado et al., 2021). For instance, contributors to the Paris toponym database have integrated grassroots knowledge with information from public gazetteers (Antoniou et al., 2016). The Jerusalem database includes coverage of Hebrew toponyms hinging on gazetteers’ imports, but coverage of Palestinian toponyms has gained momentum, even if Palestinian users cannot rely on AGI sources (Carraro, 2021). Notably, OSM contributors tend to focus on Europe and North America. However, coverage of other countries is increasing at a dramatic, if uneven, pace (e.g. Brazil: Kaisar Ahmed, 2022; China: Qian et al., 2016; Kenya: Daniel and Mátyás, 2022).
Contributors generally work intensely on the task of assigning names to each object perceived as a place.

OSM is thus becoming a reliable, even if still partially unbalanced, resource for GIS studies. Its role in linguistic research, however, appears to involve two apparently distinct problems. The first problem can be pre-theoretically defined as a problem of heterogeneous distribution. Recent publications show that coverage of platial information tends to be denser at smaller, local scales (Westerholt, 2019a, 2019b). Contributors to VGI sources may offer coverage of the districts or cities they live in. However, the distribution of platial information at a regional and national level may include vast regions of missing information. For instance, rural zones and underdeveloped urban zones tend to correlate with poor toponym coverage (Daniel and Mátyás, 2022; Elias et al., 2023; Qian et al., 2016). Toponyms’ spatio-temporal density in OSM can therefore reflect a region’s salience or irrelevance for societies and populations. A consequent question is how OSM compares to AGI sources, with respect to toponyms’ distribution.

The second problem can be formulated as a problem of heterogeneous multi-lingual representation. Toponomastics and critical toponymy have shown that different communities may decide to name places via different socio-cultural practices (e.g. Cavallaro et al., 2019; Stolz and Warnke, 2018). These practices may, however, follow conflicting guidelines and involve subtle power conflicts in multi-lingual contexts (Alderman, 2022; Azaryahu, 2011; Gnatiuk and Melnychuk, 2020; Rose-Redwood et al., 2010). For instance, Uluru is the sacred toponym that local Australian Aboriginal communities use for the monolith in the centre of Australia. For most people, however, the English Australian name Ayers Rock may be more familiar.
Both are eponymous names: however, the former is based on a divinity’s name while the latter is based on the name of a previous governor of the South Australia state. Crucially, the Jerusalem case suggests that co-existing linguistic communities can have uneven access to OSM, due to language-external pressures on accessibility (Carraro, 2021). A consequent question is how AGI sources and OSM differ in multi-lingual coverage, due to these accessibility asymmetries stemming from socio-linguistic and geo-linguistic factors.

As matters stand, these two intertwined problems of heterogeneity lead to the emergence of three compound research questions. The first research question arises from the distribution problem; the second question, from the multi-lingual problem; the third question, from their theoretical implications. The three research questions can receive the following formulations:

RQ1: How many toponyms can one find in OSM, and how accurate and homogeneous is this coverage? How does OSM compare with authoritative sources?

RQ2: What asymmetries can one find in OSM, when one analyses the coverage of toponyms across the multiple languages used in a delimited region? Where do these asymmetries emerge, and at what scales of analysis? How does OSM compare with AGI sources?

RQ3: How can results based on OSM inform GIS research, toponomastics, and other sciences studying toponyms and their properties?

We answer RQ1 from a mostly quantitative perspective. We thus analyse the distribution of toponyms at three levels of geographical scale and density: the city, regional and national levels. We propose two case studies: Macao, in China (city level), plus Italy and its 20 regions (national, regional levels). We then discuss how these databases include toponym information from AGI sources (e.g. official gazetteers). We answer RQ2 from a mostly qualitative perspective.
We thus analyse the differences in multi-lingual coverage across the official languages of each target study (e.g. Chinese and Portuguese in Macao), and the toponyms attested in these regions. We then discuss where and at what scales these asymmetries arise. We answer RQ3 by integrating these two perspectives into one model, and by discussing how one can compare OSM data to AGI sources’ data. We subsequently discuss how OSM data can find applications in linguistics, GIS, toponomastics, and other sciences focusing on platial information.

Methodology and materials

We used one language-general methodological approach to data extraction and processing; however, we discuss language-specific adjustments in section 'Results'. Portions of each study appeared in previous works that analysed toponyms from a toponomastic perspective (Xie et al. 2023, Samo and Ursini 2023). In this study, we present a broader range of geographic and geo-linguistic data to address the first two research questions and a more in-depth meta-analysis of the data to address the third question (cf. Ursini and Samo 2023). We acknowledge that a focus on one geographical domain (e.g. Italy) could have been a more practical and empirically coherent choice. A methodological goal motivated this less practical choice, however. By using our two previous studies as a baseline, we can indirectly confirm that other researchers can replicate our methodology irrespective of the geographical region and scale under discussion. We thus trade higher data cohesiveness for evidence about the repeatability of the procedure.

We accessed OSM data through the platform overpass-turbo (see Footnote 2), restricting our search to the relevant geographical areas. We queried the platform with the script in Fig. 2 to extract the relevant data (in the example, to extract toponyms in the 'Abruzzo' Italian region). We proceeded with data analysis following the flowchart in Fig. 3:

Fig. 2: An example of the algorithm used for data extraction. Our query retrieves data from OSM within the specified geocode area (in the first line of our example, the Italian region of 'Abruzzo'), within the specified timeout (in seconds). It then looks for toponyms via the 'highway' label, and for their tags (second line). The query extracts the toponyms (third line), and outputs the results in alphabetical order in CSV format (fourth line). The extraction procedure covers not only ways/highway toponyms, as the label seems to suggest, but also other toponym types within a given area. See the main text for discussion on the types of toponyms extracted in the studies.

Fig. 3: Flowchart representing the methodology used in each study.

As the flowchart shows, the data extraction step generates a raw data file that we transformed into .csv format ('OSM Turbopass' step). The output .csv file (i.e. the materials) easily supports statistical data analysis, visual data representation, and linguistic categorisation. The additional step in this paper with respect to the two previous works is that we focus solely on generic terms in toponyms. We define generic terms as the terms that describe/classify a place carrying a given name (e.g. the term vicolo 'alley’ in King’s alley: Blair and Tent, 2015, 2021). This type of analysis provides crucial details on the geographical distribution of items and thus yields quantitative evidence, as we show in Section ‘Results’. Figure 4 illustrates how we first converted the raw data files to .csv files, how we extracted generic terms, and then how we prepared the visual maps for the data:

Fig. 4: An example of the process for toponyms’ extraction from the second study, which investigates the distribution of local toponyms in Italy for given regions. The script’s output is a .csv file, and the data are then plotted in maps (the rightmost panel shows the distribution of the number of toponym tokens across the regions of Italy; map created with Datawrapper v.1.25.0, Lorenz et al., 2012). Please consult the lists of scripts/queries used for data extraction in the supplementary files, 'List of Queries A' and 'List of Queries B' files.

Once the .csv file was ready ('Data cleaning' step), we compared these data with the data from selected AGI sources for each study ('Comparison with AGIs' step). We then plotted the distribution and density of data ('Geographical distribution analysis' step), and performed an analysis of the grammatical and typographic properties of toponym sets in each study ('Linguistic analysis' step). From these different components of the study, we created a description of the data ('Data description' step), which forms the section 'Results'. We therefore use mostly geographic data to answer RQ1, linguistic data to answer RQ2, and their combined model to answer RQ3. We clarify further intermediate steps in the linguistic and geographic distribution analyses once we present each case study.

Before we move to the description(s) of the data, we offer two methodological clarifications and one clarification about the materials. First, the comparison of the OSM data with AGI data achieves a form of what psychologists define as triangulation, i.e. the analysis of the same dataset(s) via multiple methods and/or sources (Damico and Tetnowski, 2014; Rothbauer, 2008).
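
As an aside on the generic-term step described above, the following Python fragment gives a minimal sketch of how generic terms could be counted from an exported .csv file. It is not the authors' script. It assumes that the export has a single `name` column and that the generic term is the first word of the toponym, which holds for many Italian and Portuguese urban toponyms (e.g. 'Via', 'Rua') but not for Chinese ones, where the generic term comes last. The sample rows and the function name `count_generic_terms` are invented for illustration.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for a .csv export with one `name` column.
SAMPLE_CSV = """name
Via Roma
Via Garibaldi
Vicolo Stretto
Piazza Duomo
"""


def count_generic_terms(csv_text):
    """Count the first word of each toponym, lower-cased.

    Assumption: the first word is the generic term (e.g. 'via', 'piazza').
    Rows with an empty name are skipped.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get("name") or "").strip()
        if name:
            counts[name.split()[0].lower()] += 1
    return counts


generic_counts = count_generic_terms(SAMPLE_CSV)
for term, n in generic_counts.most_common():
    print(term, n)
```

A real pipeline would need a language-specific rule in place of the first-word assumption, and possibly a curated list of multi-word generic terms; the counts per region could then feed the maps mentioned in the caption of Fig. 4.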
In GIS, there is also a growing awareness that toponym retrieval and analysis studies must involve multi-source methods implementing forms of cross-verification (Hu, Al-Olimat, et al., 2022; Hu et al., 2022). We thus assume that by implementing a form of methodological triangulation, we increase the reliability of our findings, and properly compare OSM as a VGI source with AGI sources. Second, we focus on toponyms as one data type. When relevant, however, we explain how toponyms map onto the other data types forming tags (e.g. spatial coordinates).

The clarification about the materials pertains to the sub-type of toponyms we extracted in each study. We extracted toponyms describing places in urban administrative zones, known in the literature as 'urbanonyms' (Vannieuwenhuyze, 2007; David, 2011; Way, 2019; Xie et al. 2023, Samo and Ursini 2023). Hundreds of works have studied toponyms for streets (hodonyms), toponyms for squares (agoranyms), and other sub-types of urban toponyms, especially in critical toponymy (e.g. Alderman, 2022; Azaryahu, 2011; Rose-Redwood et al., 2010, 2018; Gnatiuk and Melnychuk 2020). We cannot possibly review all the relevant literature here, but one can find recent state-of-the-art overviews in Coates (2007), Basik (2020, 2021), and Walkowiak (2024). In this paper, the tokens forming the analysis are toponyms for streets, squares, parks, points of interest (e.g. monuments), and other places forming the urban zones under consideration. We further clarify relevant study-specific details along with the results. We then address the methodological and theoretical import of these results in the discussion section.

Results

We present the results of each study and our study-specific answers to RQ1 and RQ2 in this section.
The data for the studies are in the supplementary materials (Supplementary files A and B for the first study; Supplementary file C for the second study).

First study: Macao, China (locally bilingual, city level)

In the first study, we analysed the urban toponyms from Macao, a city and special administrative region (SAR) in South-East China. Macao has a centuries-long tradition of multi-lingualism, as a former Portuguese colony. European settlers introduced Portuguese as the official language of their rule; this language co-existed with Cantonese, the Sinitic language spoken in the Guangdong province (Yee, 2014). In modern Macao, Portuguese has remained as an official language, though only 1% of the population speaks it natively. Cantonese and other Sinitic languages (e.g. Mandarin and Hakka) are prevalent and English is slowly becoming a de facto lingua franca in Macao, for economic purposes (Botha and Moody, 2021). The first study thus provided a complex multi-lingual environment at a city level/scale.

Gazetteers include toponyms in Chinese and Portuguese, the two official languages. Chinese toponyms are written in traditional Chinese characters, and are thus intelligible to speakers of any Sinitic language (e.g. Chinese/Mandarin, Cantonese and Hakka). Portuguese toponyms are written in Macanese Portuguese, which is nearly identical to standard European Portuguese. The authors analysed the grammatical and lexical properties of each token toponym and corresponding generic term. One author was a native speaker and reader of Mandarin, and another author had a high degree of fluency in Portuguese. Chinese generic terms provided a minor challenge when they appeared as two-character compounds (e.g. 公園 gung1 jyun4 ‘public garden’). The author who is a native speaker of Mandarin solved this challenge by assessing whether such compounds would jointly describe a distinct place type. See Supplementary file A for the data regarding the results from the linguistic analysis performed in Xie et al.
(2023).

We compared the resulting sets against an official gazetteer from the Macao government in CD-ROM form, as our AGI source (Cartography and Cadastre Bureau of Macau SAR, 2021). This gazetteer includes names for streets, squares, parks, POIs, and other places within the urban administrative territory of Macao. Thus, these toponyms can qualify as urban toponyms irrespective of the type of place they name (Xie et al. 2023). The gazetteer represents an AGI source because the Macao government handles administrative matters regarding Macanese toponyms, and updates online and off-line maps (e.g. CD-ROM releases approximately at bi-annual intervals). For the current study, we collected further data with respect to Xie et al. (2023) to address geo-distributional and geo-linguistic aspects. The three key results are as follows.

First, we obtained two lists of 1394 toponyms from both sources (OSM, CD-ROM). We compared the two sources by using the Jaccard Index of similarity (from 0 to 1: the closer to 1, the more similar the two populations; Jaccard, 1901), and obtained 0.989 as a result. OSM toponyms only presented minor spelling variants in Portuguese that stem from contributors’ mistakes (e.g. accent omission: Rua de Santo Antonio instead of Rua de Santo António 'Saint Anthony Street’). Notably, the etymology of toponyms often differed from Portuguese to Chinese, due to different naming practices in each linguistic community (45% of the total). For instance, 龍鬚街 lung4 sou1 gaai1 ‘Dragon Beard Street’ and Rua Central ‘Central Street’ are the two toponyms for a key street in Macao’s old town. Portuguese authorities named this street after its location and social function; Chinese speakers, after the imagined appearance of a local temple.

Second, we compared the two data sets on a place-by-place basis.
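
The Jaccard comparison reported above, J(A, B) = |A ∩ B| / |A ∪ B|, and the effect of accent omission on it can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the procedure used in the study: the function names are hypothetical, the two three-item lists are toy data ('Rua Exemplo' is a placeholder; only the Rua de Santo António variant and Rua Central come from the text), and accent folding via Unicode decomposition is one possible way to treat spelling variants as matches.

```python
import unicodedata


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; taken as 1.0 for two empty sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def strip_accents(text):
    """Drop combining marks so that 'António' and 'Antonio' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy lists for illustration only.
gazetteer = ["Rua de Santo António", "Rua Central", "Rua Exemplo"]
osm = ["Rua de Santo Antonio", "Rua Central", "Rua Exemplo"]

exact = jaccard(gazetteer, osm)
folded = jaccard(map(strip_accents, gazetteer), map(strip_accents, osm))
print(f"exact match: {exact:.3f}, accent-insensitive: {folded:.3f}")
```

With exact string matching, one accent variant counts as two different toponyms and lowers the index; after accent folding, the two toy lists coincide. Whether such variants should count as matches or as errors is a design choice that depends on the research question.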
We thus analysed the geographic distribution of these toponyms on the Macao territory, while also analysing the density of urban constructions and agglomerates across Macanese districts. Our conjecture was that zones with higher numbers of human-built places (e.g. buildings, streets and squares) would also feature a higher number of toponyms (cf. Hecht et al., 2013; Salvucci and Salvati, 2022). We then verified that each Portuguese-Chinese toponym pair uniquely corresponded to one place via an analysis of the toponyms’ coordinates. We present the geographical distribution of these toponyms in Fig. 5 (extracted from Xie et al. 2023):

Fig. 5: Distribution of near-equivalent terms. (a) Near-equivalent terms (see details in Xie et al. 2023), namely 1-to-1 translations of the generic terms in Chinese and Portuguese. (b) As a control group, the distribution of dissimilar terms, i.e. instances in which the Portuguese and the Chinese terms are not near-equivalent translations. Extracted from Xie et al. (2023, p. 37), Fig. 1. See also the 'List of Macau Streets' file in the supplementary file section for the full list of urban toponyms used in this analysis.

As the map indirectly shows, the geographical distribution of toponyms correlates with the degree of urban development. For instance, the old town has the highest density of toponyms because it includes the highest number of places with key social functions in Macanese society (e.g. the Senate building). Instead, the southern island of Coloane ('Ilha de Coloane', in the map) is still mostly composed of historical spots (e.g. temples), natural reserves and scenic views. One can find toponyms along its coasts, where these spots and views are located. Therefore, Macanese toponyms may have a heterogeneous distribution as a reflection of places’ size and spatial prominence (cf.
Elias et al., 2023; Kaisar Ahmed, 2022).

Third, we compared the OSM data with other AGI sources, and analysed whether OSM updates had occurred since the time of the original study. We first consulted the APP place directory for Macao provided by the ministry for tourism (see Footnote 3). The APP implements data from official gazetteers for Portuguese and Chinese. It also includes English toponyms as direct calques (i.e. translations) from Portuguese. The 1633 toponyms thus also include toponyms for casinos, historical buildings and other 'points of interest' (POIs). For instance, English A-Má Temple is a translation of Portuguese Templo de A-Má, and Chinese 媽閣廟 maa5 gok3 miu6. English features in this APP because this language is slowly emerging in official toponymy (e.g. in street plaques in the old town: Botha and Moody, 2021). The APP thus seems to reflect the increasing relevance of this third language.

Table 1 offers an overview of the extra English toponyms one can find in the APP by listing the generic terms only found for the toponyms exclusive to this APP:

Table 1: English generic terms and their Portuguese and Chinese counterparts (adapted from Xie et al. 2023, page 38, Table 4).

The map in Fig. 6 shows that OSM includes almost all the casino toponyms attested in the APP: four toponyms appear missing (cf. OSM’s 35 units vs. the APP’s 39). Their respective toponyms mostly occur in the old town and in Taipa, the northern and more urbanised side of the southern island; no casinos exist in Coloane. The data presented via Figs. 5 and 6 therefore suggest that places and hence toponyms tend to occur in the densely urbanised parts of Macao, i.e. the city’s zones with a more intense human presence. As we now have a broad set of data at our disposal, we can answer our first two research questions with respect to the city level of platial distribution.

Fig. 6: Map of Casinos. As we explain in the main text, casinos mostly lie in the more urbanised parts of the city.
Coloane lacks any of these places, being mostly composed of green areas. Footnote: The screenshots are from OSM. The link is available at: https://osmand.net/map/#13/22.1743/113.5612, last access 12.11.2024.

Regarding RQ1, OSM now includes a slightly higher quantity of data than the official CD-ROM gazetteer. However, the quality of this data is marginally inferior for the Portuguese dataset. An analysis of the updates’ history suggests that one update occurred since the date of the original study. This update brought OSM’s accuracy above the CD-ROM’s level, but below the tourist office APP’s level. Furthermore, the density of toponyms on the Macao maps from all three sources (OSM, official gazetteer, tourist office APP) supports the correlation between human presence and the heterogeneous distribution of places. Zones with dense human presence attract dense, potentially homogeneous clusters of toponyms; zones with scarce human presence attract rarefied, potentially heterogeneous clusters. Thus, OSM appears to offer a slightly more homogeneous and accurate coverage of toponyms than one AGI source (the CD-ROM gazetteer), but a less accurate coverage than another AGI source (the tourist office APP).

Regarding RQ2, OSM reveals multi-lingual information about toponyms in Portuguese and Chinese. Crucially, this multi-lingual coverage is as accurate as that of the official CD-ROM gazetteer. The tourist office APP also includes English toponyms, since English works as a gateway language for tourists visiting the city, and the APP aims to capture the growing relevance of this language in Macanese society. Thus, the APP acts as an AGI source that presents a broader, more detailed multi-lingual mapping of Macanese toponyms than OSM. Nevertheless, one can assume that OSM contributors are already implementing information from this source in ongoing updates (cf. the casino data).
These data overall confirm that OSM can provide relevant evidence for linguistic analyses due to its multi-lingual status. However, OSM can lack data for specific toponym types (e.g. POI toponyms) at local scales of analysis. AGI sources may still provide more homogeneous and accurate coverage of toponyms.

Second study: Italy and its regions (multi-lingual, regional and national level)

In the second study, we investigated the potential dialectal origins of Italian urban toponyms via OSM (Samo and Ursini 2023). In Italy, standard Italian co-exists with geographical dialects that are considerably different from this language. Differences involve socio-linguistic aspects (e.g. Berruto, 2012), as well as grammatical, phonological and lexical features (e.g. Bossong, 2016; Samo and Ursini 2023). For this study, two basic considerations play a role. First, linguistic research considers dialects such as Neapolitan in the South or Piedmontese in the North as distinct languages, since their linguistic features are different enough from Italian to warrant this status. Italian and most of these languages/dialects, however, belong to the Romance branch of Indo-European languages. As closely related languages, they have interacted over the centuries. Second, toponyms may nevertheless originate from languages other than Italian (e.g. German, French: Cassi and Marcaccini, 1998; Marcato, 2009). The second study and the results presented here thus provide evidence for a situation of nuanced multi-lingualism.

We offer a concise overview. Italian toponyms often originate in the languages of the pre-Italic populations that once inhabited Italy (e.g. Etruscan), but also in the local dialects (e.g. Florentine in Florence; Chiappinelli, 2013; Cassi, 2015). The legislation for Italian 'odonimi' (i.e. urban toponyms; Mastrelli, 2005) establishes that local administrations (e.g. municipalities) assign toponyms to urban places (e.g. streets), also via the consultation of local citizens.
In regions with special administrative status (Valle D’Aosta, Trentino-Alto Adige), parts of the population have an official status as 'linguistic minorities'. These communities form less than half of the population and speak a different language from the official language (Mastrelli, 2005). In these regions, toponyms occur in both official languages (Italian and French, Italian and German), with the minority language preceding Italian (e.g. Bozen/Bolzano, German/Italian, for the administrative seat of Alto Adige).

Building on these premises, Samo and Ursini (2023) used OSM to analyse urban toponyms and their generic terms. The study showed that toponyms including generic terms not attested in standard dictionaries of Italian or glossaries of geographic terms (e.g. Calafiore, 1975; Gasca Queirazza et al., 1990; De Mauro, 2020) may be dialectal in origin. Such urban toponyms often entered the modern Italian language via a process of spelling standardisation and lexical absorption. For instance, in the city of Genoa, one can find terms such as crosa in toponyms for the crimson alleys traversing the city’s quarters. However, the Genoese term was originally creusa (e.g. Italian Crosa del Mare from Genoese Creusa de Ma’). Crucially, Genoese ('Zeneize', in the original language) is a Gallo-Romance (i.e. Francophone) dialect/language mostly spoken in the Liguria region (Bossong, 2016). The study thus showed that Italian includes hundreds of generic terms, and the toponyms they occur in, from local geographical dialects/languages.

The second study concentrated on toponyms for places that are part of urban administrative zones (Way, 2019). Thus, its tokens include toponyms for e.g. streets and squares, and range from toponyms belonging to urbanised villages (e.g. Chiusi, in Tuscany) to toponyms from the capital city, Rome.
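The dictionary-based screening described above (flagging generic terms that are not attested in standard Italian dictionaries or glossaries) could be sketched as follows. This is a minimal illustration, not the procedure of Samo and Ursini (2023): the small `STANDARD_GENERICS` set is a hypothetical stand-in for the dictionaries and glossaries cited, the function names are invented, the sample list mixes the attested Crosa del Mare with made-up toponyms, and the assumption that the generic term is the first word of an urban toponym is made only for illustration.

```python
from collections import Counter

# Hypothetical mini-lexicon standing in for standard Italian
# dictionaries and glossaries of geographic terms.
STANDARD_GENERICS = {"via", "piazza", "corso", "viale", "vicolo", "largo"}


def generic_term(toponym: str) -> str:
    """Return the first word of a toponym, case-folded.

    Assumption: in urban toponyms the generic term precedes the
    specific one (e.g. 'Via' in 'Via Roma').
    """
    parts = toponym.split()
    return parts[0].casefold() if parts else ""


def candidate_dialectal_terms(toponyms):
    """Count generic terms that are absent from the standard lexicon."""
    counts = Counter(generic_term(t) for t in toponyms if t.strip())
    return {term: n for term, n in counts.items()
            if term not in STANDARD_GENERICS}


# Illustrative sample; only 'Crosa del Mare' comes from the text.
sample = ["Via Roma", "Crosa del Mare", "Piazza Dante", "Crosa Larga"]
candidates = candidate_dialectal_terms(sample)
print(candidates)
```

Terms returned by such a filter would only be candidates: whether a given term is dialectal in origin still requires the qualitative linguistic analysis described in the text.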
For the present study, we extracted the set of generic terms attested in urban toponyms at two different dates (433,574 attested entries on July 6, 2022; 455,383 entries on December 7, 2023). We then analysed their spelling, their linguistic properties and their geographic distribution to identify their dialectal roots. We also incorporated the temporal dimension to observe trends in the growth of linguistic data in OSM and pinpoint the specific locations where such growth occurs. We can thus discuss three key novel results (N.B. we present the data for the figures in the Appendix, while all the other relevant data are available in Supplementary files B and C).

First, OSM offers a higher number of toponyms than some AGI sources, at least with respect to urban toponyms. We addressed this aspect by extracting toponyms from the YellowPages online directory (see Footnote 5). A clarification on this directory as an AGI source is necessary before we proceed. The YellowPages directory has commercial purposes, since it offers addresses and locations of various commercial activities that are willing to buy this service. Thus, the lists of places with commercial functions are not necessarily exhaustive. However, the maps and place directories are based on official gazetteers provided by local and national administrations. Crucially for our purposes, the YellowPages directory includes urban toponyms, and it includes data of AGI origin not offered by volunteers. It thus approximates an AGI source (see Footnote 6).

We obtained 213,218 toponyms, i.e. less than half of the toponyms extracted via OSM. Crucially, the YellowPages directory includes directories for minor urban centres, villages, and hamlets. However, these province-specific directories tend to offer lower-resolution maps than those for major urban centres (e.g. 1:5000 against 1:3000, and against OSM’s 1:1000 scale).
They can overlook villages and hamlets that may be too small to appear at these resolutions, and therefore report lower numbers of toponyms. Upon calculating the Jaccard index for these lists (score: 0.0088), we also confirmed that the YellowPages directory only includes a part of the toponyms found in OSM. We can thus conclude that OSM can currently offer a higher quantity of toponyms and a better coverage of their distribution on the Italian territory than the AGI source YellowPages.

Second, the geographical distribution of urban toponyms, irrespective of their linguistic origins, appears heterogeneous. However, at a regional scale, a more nuanced picture emerges that involves different types of homogeneous distributions. Figure 7a, b shows the distributions in terms of tokens and percentages associated with each region in the two time-spans under investigation. We have retrieved data in two different periods to detect the dynamic evolution of the source. This evolution (i.e. increase of instances) appears in the lower panel of Fig. 7, again in terms of tokens and percentages:

Fig. 7: Distribution (in tokens) and increase of instances (in tokens and percentages) in the target time span. Panel a shows that some regions include the highest percentages of toponyms due to their size and their higher number of urban centres (e.g. Lombardy in the North, Lazio in the centre, Sicily in the South). Panel b shows that some regions (e.g. Lombardy, Sicily) also received updates of considerable size in the target time span. We also use panel a to introduce the names of the 20 administrative regions. Footnote: The link is available at: https://it.wikipedia.org/wiki/File:Regions_of_Italy_with_official_names.png, last access 12.11.2024.

Given the size and relevance of these updates, we can infer that most contributors concentrated on these regions to remove perceived or real gaps in toponyms’ coverage.
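As a rough sketch of two quantities used so far in this subsection, the Jaccard index between two toponym lists and the relative growth between the two extraction dates can be computed as follows. The function names, the normalisation step (case-folding and whitespace collapsing) and the sample lists are illustrative assumptions; the text does not state how toponyms were normalised before the comparison.

```python
def normalise(name: str) -> str:
    """Case-fold and collapse whitespace so trivial variants match.

    Assumption: this light normalisation is for illustration only.
    """
    return " ".join(name.casefold().split())


def jaccard_index(list_a, list_b) -> float:
    """Size of the intersection divided by size of the union."""
    set_a = {normalise(x) for x in list_a}
    set_b = {normalise(x) for x in list_b}
    union = set_a | set_b
    if not union:  # both lists empty: define the index as 0
        return 0.0
    return len(set_a & set_b) / len(union)


def relative_growth(old: int, new: int) -> float:
    """Percentage increase from an earlier count to a later one."""
    return (new - old) / old * 100


# Made-up sample lists standing in for the OSM and YellowPages extractions.
osm = ["Via Roma", "Piazza Garibaldi", "Crosa del Mare", "Vico Lungo"]
yellow_pages = ["via roma", "Piazza  Garibaldi", "Corso Italia"]

score = jaccard_index(osm, yellow_pages)  # 2 shared names out of 5 distinct
print(score)
```

With the entry counts reported above, `relative_growth(433574, 455383)` gives an overall increase of about 5.03% between July 2022 and December 2023.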
A further finding is that some regions include urban centres that cover most toponyms, and thus indirectly determine heterogeneous distributions. Other regions, instead, feature more spatially homogeneous distributions of toponyms. This is the case irrespective of the number and size of urban centres. We can analyse this pattern via quantitative and qualitative insights, using the most recent dataset as a reference (December 7, 2023). The quantitative insights are as follows. The impact of the most prominent urban centre within each region, automatically retrieved by inputting the name in Italian, appears in Fig. 8a.

Fig. 8: Distribution of geocodes. a Regions with administrative seats including the highest number of geocodes are in dark blue. These regions are: Liguria, Trentino-Alto Adige (North); Umbria, Lazio, Abruzzo, Molise (Centre); Basilicata (South). b Distribution of tokens/surface ratio. Lighter-shaded regions have low numbers of tokens per square kilometre; darker-shaded regions have higher numbers. Lighter-shaded regions are Valle D'Aosta (North); Sardinia (Island, Centre), Molise (Centre); Basilicata, Calabria (South).

The Northern region of Lombardy comprises 11.56% of the total toponyms; its administrative capital and global economic hub, Milan, covers 8.96% (4715 tokens) of the regional total. The central regions Lazio and Molise instead comprise 8.20% and 2.98% of the total, respectively. Lazio includes the national capital city, Rome (45.09% of the regional total; 16843 tokens). Molise is a small region to the East of Lazio, with Campobasso being its most important urban centre. The dark blue colour for Molise indicates that Campobasso includes the majority of this region’s tokens. The light blue colour for Lombardy indicates that Milan does not include most tokens. Lazio’s shade represents an example of intermediate to high concentration. More generally, most Italian regions feature homogeneous distributions of toponyms (i.e.
no urban centre covers most tokens: lighter blue shades). However, regions with heterogeneous distributions also occur (darker blue shades).

The qualitative insights are as follows. Basilicata is a mostly mountainous region in the South of the country, including national parks and a few minor cities (dark blue shade in Fig. 8a). Its administrative seat, Matera, and the other urban centres appear scattered across this territory. These centres correspond to clusters of urban toponyms in spaces mostly devoid of these toponyms (light yellow shade in Fig. 8b). Liguria, on the other hand, is a coastal region in the North-West of the country, and offers a clear example of urbanisation in a limited space. That is, the region has a wealth of small urban centres, and only one centre, its administrative seat Genoa, covers a relative majority of toponyms. However, these centres appear (relatively) dense and evenly distributed.

We present these patterns in Fig. 8b, in which we plotted the ratio of the number of toponyms to the surface in km² (see Footnote 8). Aggregating all regions, we observe a strong correlation between the number of tokens and the size of the surface in km² (Pearson’s r = 0.82, p