Anthropic Shredded Millions of Physical Books to Train its AI

Today in schnozz-smashing on-the-nose metaphors for the AI industry's rapacious destruction of the arts: exactly how Anthropic gathered the data it needed to train its Claude AI model. As Ars Technica reports, the Google-backed startup didn't just crib from millions of copyrighted books, a practice that's ethically and legally fraught on its own. No, it cut the book pages out from their bindings, scanned them to make digital files, then threw away all those millions of pages of the original texts. To say that the AI "devoured" these books wouldn't merely be colorful language.

This practice was revealed in a copyright ruling on Monday, which turned out to be a major win for Anthropic and the data-voracious tech industry at large. The judge presiding over the case, US District Judge William Alsup, found that Anthropic can train its large language models on books that it bought legally, even without authors' explicit permission.

It's a decision that owes, in part, to Anthropic's method of destructive book scanning, which it's far from the first company to use, according to Ars, but which is notable for its massive scale. In sum, it takes advantage of a legal concept known as the first-sale doctrine, which allows a buyer to do what they want with their purchase without the copyright holder intervening. This rule is what allows the secondhand market to exist; otherwise a book's publisher, for example, might demand a cut or prevent their books from being resold.

Leave it to AI companies, though, to use this in bad faith. According to the court filing, Anthropic hired Tom Turvey, the former head of partnerships for Google's book-scanning project, in February 2024 to obtain "all the books in the world" without running into "legal/practice/business slog," as Anthropic CEO Dario Amodei described it, per the filing. Turvey came up with a workaround. By buying physical books, Anthropic would be protected by the first-sale doctrine and would no longer have to obtain a license.
Stripping the pages out allowed for cheaper and easier scanning. Since Anthropic only used the scanned books internally and tossed out the paper originals afterwards, the judge found this process to be akin to "conserv[ing] space," Ars noted, meaning it was transformative. Ergo, it's legally OK.

It's a specious workaround and flagrantly hypocritical, of course. When Anthropic first got up and running, the startup went the even more unscrupulous route of downloading millions of pirated books to feed its AI. Meta did this with millions of pirated books, too, for which it is currently getting sued by a group of authors.

It's also lazy and careless. As Ars notes, plenty of archivists have pioneered various approaches for scanning books en masse without having to destroy or alter the originals, including the Internet Archive and Google's own Google Books (which not too long ago was also the subject of its own major copyright battle).

But anything to save a few bucks, and to get that all-too-precious training data. Indeed, the AI industry is running out of high-quality sources of food to feed its AI, not least of all because it's short-sightedly spent this whole time crapping where it eats, so screwing over some authors and sending some books to the shredder is, for Big Tech, a small price to pay.

More on AI: Microsoft Is Having an Incredibly Embarrassing Problem With Its AI

The post Anthropic Shredded Millions of Physical Books to Train its AI appeared first on Futurism.