I'm really confused by the FSF statement here. The court ruled that the use of copyrighted works for training is fair use; the offense was that Anthropic pirated (obtained illegally) copyrighted work. FSF books are free to download, store, and redistribute. The license says: "This is a free license allowing use of the work for any purpose without payment." So how can the FSF claim its rights were infringed when the court ruled that the problem was the illegal downloading of copyrighted work? It's impossible to illegally download an FSF book.
The framing of 'share your weights freely' as a remedy is interesting but underspecified. The FSF's argument is essentially that training on copyrighted code without permission is infringement, and the remedy should be open weights. But open weights don't undo the infringement -- they just make a potentially infringing artifact publicly available. That's not how copyright remedies work. What they're actually asking for is more like a compulsory license, which Congress would have to create. The demand for open weights as a copyright remedy is a policy argument dressed up as a legal one.
A related topic I've thought about in the past: whether LLM-derived code would have to be released under a copyleft license because of the training data. I've never seen a cogent analysis explaining why or why not this is the case, beyond the practical observation that models have already been used in closed-source codebases…
> It is a class action lawsuit… the parties agreed to settle instead of waiting for the trial…
It would be nice if members of the class could vote to force a case to trial. For the typical token settlement amount, I’m sure many would rather have the precedent-setting case instead.
The issue is that every CS master's student & AI researcher knows how to build a SOTA LLM (see the sketch after this comment).
But only a few companies have the resources.
The process:
(1) steal as much data from the internet as possible (data is everything)
(2) raise incomprehensible amounts of money
(3) find a location where you can take over the energy grid for training
(4) put a black box around it so nobody can see the weights
(5) charge users $$$ to use
(6) retrain models with user session data (opted in by default)
(7) peek at how users are using the product, (maybe) change policies to stop them from using it that way, and (maybe) rapidly develop features for that use case.
(Sorry, that last one is jaded and not fair - just included to give you a picture of what could be happening with this sort of tech.)
…
The entire premise of the product is “built on the backs of any & everyone who has ever published a work”
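To make the "every grad student knows the recipe" point concrete, here is a minimal sketch of the core technique: next-token prediction with a tiny transformer. Every name, size, and the placeholder corpus below are illustrative assumptions, not anyone's actual stack; the point is that the loop itself is commodity knowledge, and the moat is the data and compute poured into it.

```python
# Toy version of the "public recipe": byte-level next-token prediction
# with a small transformer. All sizes are illustrative, not production.
import torch
import torch.nn as nn

VOCAB, DIM, CTX = 256, 128, 64  # byte vocab, embedding width, context length

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(CTX, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, x):
        t = x.size(1)
        h = self.embed(x) + self.pos(torch.arange(t, device=x.device))
        # causal mask: each position may only attend to earlier positions
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.head(self.blocks(h, mask=mask))

# Step (1) in miniature: the "dataset" is whatever bytes you collected.
corpus = b"placeholder for text scraped off the internet... " * 200
data = torch.tensor(list(corpus), dtype=torch.long)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(100):
    idx = torch.randint(0, len(data) - CTX - 1, (8,))          # random windows
    x = torch.stack([data[i : i + CTX] for i in idx])          # inputs
    y = torch.stack([data[i + 1 : i + CTX + 1] for i in idx])  # shifted targets
    loss = nn.functional.cross_entropy(model(x).reshape(-1, VOCAB), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

Scale this same loop up by many orders of magnitude in parameters, tokens, and GPUs, and you have where steps (2) and (3) get spent.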
> Among the works we hold copyrights over is Sam Williams and Richard Stallman's Free as in freedom: Richard Stallman's crusade for free software, which was found in datasets used by Anthropic as training inputs for their LLMs.
This is the reason why AI companies won't let anyone inspect which content was in the training set. It turns out the suspicions of many copyright holders (including the FSF) were true (of course).
Anthropic and others will never admit it, which is why they wanted to settle rather than risk going to trial. AI boosters will obviously continue to gaslight copyright holders into believing nonsense like "It only scraped the links, so the AI didn't directly train on your content!", or "AI can't see like humans; it only sees numbers, binary, or digits", or "The AI didn't reproduce exactly 100% of the content, just like humans do when tracing from memory!".
They will not share the dataset used to train Claude, even if it was trained on AGPLv3 code.
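For context on how a specific book gets "found in datasets used by Anthropic": with a published corpus dump, anyone can simply scan it for the title. A hypothetical sketch, assuming a JSON-lines dump with a "text" field (the file name and schema are made up; a closed training set is exactly what can't be audited this way):

```python
# Hypothetical sketch: scan a published training-corpus dump for a title.
# "corpus.jsonl" and its "text" field are an assumed layout, not a real artifact.
import json

TITLE = "free as in freedom"

with open("corpus.jsonl", encoding="utf-8") as dump:
    for lineno, line in enumerate(dump, start=1):
        record = json.loads(line)
        # case-insensitive substring match against the record's text
        if TITLE in record.get("text", "").lower():
            print(f"possible match in record {lineno}")
```

This only works for corpora that are public, which is the commenter's point: a dataset nobody is allowed to inspect can't be checked at all.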
What weak, counter-productive messaging. This is like a bully punching you in the face and you responding with "hey man, I'm not going to do anything about this, I'm not even going to tell an adult, but I'd urge you to consider not punching me in the face". Great news for the bully! You just removed one concern from their mind, essentially giving them permission to be as bad to you as they want.
> It would be nice if members of the class could vote to force a case to trial…
The hero we need, but not the hero we deserve..