Generative AI models trained on internet data lack exposure to vast domains of human knowledge that remain undigitized or underrepresented online. English dominates Common Crawl, accounting for 44% of its content. Hindi accounts for just 0.2% of the data despite being spoken by 7.5% of the global population, and Tamil represents 0.04% despite having 86 million speakers worldwide. Approximately 97% of the world's languages are classified as "low-resource" in computing, and a 2020 study found that 88% of languages face such severe neglect in AI technologies that bringing them up to speed would require herculean efforts.

Research on medicinal plants in North America, northwest Amazonia, and New Guinea found that more than 75% of 12,495 distinct uses of plant species were unique to just one local language.

Large language models amplify dominant patterns through what researchers call "mode amplification." The phenomenon narrows the scope of accessible knowledge as AI-generated content increasingly fills the internet and becomes training data for subsequent models.