LLMs, Data Dysphoria, and the Global Regulatory Response

In the fast-moving world of artificial intelligence (AI), large language models (LLMs) like ChatGPT have emerged as revolutionary tools capable of generating human-like text, solving complex problems, and assisting with a myriad of tasks. However, these models are not without their challenges, particularly when it comes to governance. The proliferation of LLMs has sparked what some analysts are calling "data dysphoria": a state of unease regarding the use and governance of the vast datasets that power these models.

The Rise of LLMs and the Data Conundrum

Large language models are built on massive datasets, often scraped from the web. These include everything from public posts on social media to proprietary content on websites. Such datasets are the lifeblood of LLMs, making it possible for them to learn patterns in language and generate coherent responses to user queries. However, the practice of collecting data through web scraping has raised significant concerns among policymakers, data protection authorities, and content creators.

On August 24, 2023, the Office of the Australian Information Commissioner (OAIC) and 11 international data protection counterparts released a joint statement warning about increasing incidents of data scraping, particularly from social media. The statement reflects growing discomfort with the practices of LLM developers who rely on scraped data, often without the explicit consent of the data owners.

The Governance Challenges

The governance challenges posed by LLMs are multifaceted. First, there is the issue of data provenance. The data used to train these models often come from a variety of sources, some of which may not have given explicit permission for their data to be used. This raises questions about the legality and ethics of data scraping, as well as the rights of content creators and data subjects.

Second, the quality and validity of the data are critical issues. Since LLMs learn from the data they are fed, any inaccuracies, biases, or gaps in that data can lead to flawed outputs. This is particularly concerning when LLMs are used in high-stakes settings such as healthcare or legal advice.

Third, there is the issue of transparency. Many LLMs operate as black boxes, with little to no information provided to users about how they were trained or what data they were trained on. This lack of transparency makes it difficult for policymakers to regulate these models effectively and for users to trust the outputs they generate.

Policy Responses and Global Perspectives

Around the world, policymakers are grappling with how to regulate LLMs and address the data governance challenges they present. The European Union's AI Act, for example, aims to establish a comprehensive regulatory framework for AI, including provisions for transparency and accountability in the data supply chain. Meanwhile, Japan has revised its copyright laws to facilitate AI development while ensuring that the use of copyrighted material is fair and transparent.

In contrast, the United States has taken a more piecemeal approach, with ongoing legal battles over the use of copyrighted data by LLM developers. As of August 2023, several lawsuits are pending against major AI companies such as OpenAI and Google, highlighting the growing tension between innovation and intellectual property rights. Canada implemented the Directive on Automated Decision-Making in 2019 to govern AI systems procured by the government, ensuring data relevance, accuracy, and traceability.
Canada's proposed AI and Data Act, under review as of August 2023, would require organizations to keep records of how they manage anonymized data, but it mainly deals with how to identify, assess, and mitigate the harms of AI in general. It remains disconnected from the governance of LLMs and the data supply chain, leaving gaps in the regulatory framework.

The Road Ahead

As LLMs continue to evolve and become more integrated into daily life, the governance challenges will only become more pronounced. Policymakers must adopt a systemic approach that considers the entire lifecycle of data, from collection and processing to use and disposal. This will require new regulations, innovative governance frameworks, and ongoing dialogue among governments, AI developers, and civil society.

Moreover, LLM developers must take proactive steps to ensure that their models are trained on high-quality, ethically sourced data. This includes obtaining explicit consent from data subjects, compensating content creators when appropriate, and being transparent about their data practices.

Conclusion

The rise of large language models has ushered in a new era of AI, one filled with both promise and peril. As we navigate the complexities of data governance and seek to address the challenges posed by these models, it is crucial that we do so with an eye toward protecting individual rights, fostering innovation, and maintaining trust in the digital ecosystem. Only then can we fully harness the potential of LLMs while mitigating the risks they pose.

Call to Action

As this debate continues to unfold, it is essential for all stakeholders, including policymakers, AI developers, and the public, to engage in discussions about the future of AI governance. By working together, we can ensure that the benefits of LLMs are realized while minimizing their potential harms. Share your thoughts in the comment section below!

This article is based on the following research: Susan Ariel Aaronson, 2023. "The Governance Challenge Posed by Large Learning Models," Working Papers 2023-07, The George Washington University, Institute for International Economic Policy.