[This post is authored by Bharathwaj Ramakrishnan. Bharathwaj is a 3rd year LLB Student at RGSOIPL, IIT Kharagpur, and loves books and IP. His previous posts can be accessed here.]
In the first part of this post, I dealt with the relevant background, where I argued that it is useful to see any GenAI model as being located within an AI supply chain, and discussed the first two issues. In this post, I will discuss the last two issues and the two contentions I referred to in the first part.
Is Training of GenAI Models Fair Use? We Might Know Soon
The third issue framed by the Court deals with fair use: whether using the plaintiff’s data or copyrighted content constitutes fair use under Section 52 of the Copyright Act. On the issue of fair use/fair dealing, as Akshat and Sneha Jain discuss here, the transformative use defence is not specifically available in the statute, even though it is available under judge-made law. Based on the existing line of cases recognizing transformative use, there is an argument to be made that training a GenAI model is transformative, since the work is used for a purpose different from that of the copyrighted work: the training aims to build a GenAI model that can respond accurately to prompts, a purpose distinct from those for which any copyrighted work might be created. This argument also ties in with the long-standing understanding under US copyright law that non-expressive use of copyrighted works is transformative (see here and here).
Additionally, it has been pointed out that the storage of copies for training purposes might itself fall under Section 52(1)(b) and (c) if the copying is transient or incidental; still, this question will turn on the facts of how the data or copyrighted content is stored and for how long. Another interesting question is whether Section 52(1)(a)(iii) will rescue OpenAI on the output side. Section 52(1)(a)(iii) extends fair dealing protection to the “reporting of current events and current affairs”, and even if OpenAI’s outputs are found to be infringing, it would be interesting to see whether the Court is willing to extend protection under this provision to GenAI models. However, it must also be noted that a wide interpretation of fair dealing in favour of GenAI model proprietors might leave news publishers with very narrow protection over their copyrighted works, an issue raised in the past with regard to academic publishers in the DU photocopy case.
The Question of Jurisdiction:
The question of jurisdiction, or the lack thereof, might apply differently in the context of training and in the context of infringing outputs. OpenAI has argued that it has no permanent office in India, that its servers are not located in India, and that the training itself did not happen in India; thus, it argues, the Delhi HC lacks jurisdiction. On the training and storage question identified in issue 1, the jurisdiction challenge might be serious, and it would be interesting to see how the Court decides it. On the infringing-outputs issue, however, it is clear that ChatGPT as a service is available all over India, including in areas falling within the jurisdiction of the Delhi HC. The jurisdictional issue might yet play spoilsport in the entire dispute: if the Court holds that it lacks jurisdiction, it may be prevented from dealing with the substantive issues at all.
Finally, I wish to raise two additional issues, outside those framed by the Court, that I think will be recurring themes in this copyright litigation: one is the question of how a model trains and learns from its training data, and the other is the issue of the adversarial user.
How Does a Model Learn from its Training Data?
The plaintiffs and the defendants might present differing contentions on how a model trains and learns from the data. As this paper by Cooper and Grimmelmann points out, there are differing understandings of how a GenAI model learns. OpenAI might emphasize, as pointed out earlier, that the model, while training, merely learns the patterns and connections between items of training data, or only the meta-information, and thus learns the idea and not the expression. The plaintiff might stress that models can memorize random aspects of their training data, and use this to argue that the model does not merely learn statistical patterns but can, to an extent, learn expressions and not just ideas (see the memorization paper, which tries to provide a different understanding of how models memorize, in contrast to the understanding that the model merely learns meta-information or statistical correlations in the training data). Even though GenAI models are considered black boxes, the question of how the model trains on and learns from its training data will be disputed, with the defendants arguing that the model merely learns patterns and meta-information and the plaintiff arguing that it does something more.
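To make the "patterns versus expression" distinction concrete, here is a minimal toy sketch. It uses a character-level n-gram model, which is nothing like the transformer architecture behind ChatGPT, and the corpus, function names and parameters are invented purely for illustration. The point is only that the same counting procedure can yield generic recombination at one setting and verbatim reproduction of training text at another.

```python
# Toy illustration (not how GPT-class models actually work): a character-level
# n-gram model trained by counting which character follows each context of
# length `order` in the training text.
import random
from collections import defaultdict

def train_ngram(text, order):
    """Count, for every context of `order` characters, the characters that follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - order):
        context, nxt = text[i:i + order], text[i + order]
        counts[context][nxt] += 1
    return counts

def generate(counts, seed, length, order):
    """Generate text by repeatedly sampling the next character given the last `order` characters."""
    out = seed
    for _ in range(length):
        followers = counts.get(out[-order:])
        if not followers:
            break
        chars, weights = zip(*followers.items())
        out += random.choices(chars, weights=weights)[0]
    return out

corpus = ("the quick brown fox jumps over the lazy dog. "
          "the slow brown dog walks under the tall tree. ")

# A low-order model learns only short-range statistical patterns and tends to
# produce novel (if garbled) recombinations of the training text.
low = train_ngram(corpus, order=2)
print(generate(low, "th", 60, order=2))

# A high-order model trained on a tiny corpus has effectively one continuation
# per context, so it reproduces training sentences verbatim: a crude analogue
# of "memorizing the expression" rather than learning the idea.
high = train_ngram(corpus, order=12)
print(generate(high, "the quick br", 60, order=12))
```

The sketch is deliberately simplistic, but it captures why the same training procedure can be described by one party as "learning statistical correlations" and by the other as "copying expression", depending on which behaviour is foregrounded.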
Regurgitation and Extraction as Bugs and the Phantom of the Adversarial User:
As discussed in the memorization paper mentioned earlier, the parties will have differing arguments on how frequently a model memorizes its training data and in what instances it produces memorized training data as output. The plaintiff would like to argue that models memorize more often than GenAI companies are willing to admit, and that even innocent prompts might make the model regurgitate, or spit out, potentially copyright-infringing training data as outputs. OpenAI would argue that memorization and regurgitation are bugs rather than features and should be treated as rare phenomena. It might also attempt to shift the blame to an adversarial user who deliberately prompts the model to produce infringing outputs. The argument would thus be that infringing outputs are not provided to ordinary users in the normal functioning of the model; rather, adversarial users, through their prompts, engage in extraction with the intention of making the model produce potentially infringing output.
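Again purely as an illustration of what "extraction" means in this debate, the sketch below reuses the same toy n-gram idea. The corpus, prompts and helper names are invented for the example and have nothing to do with OpenAI's actual systems; it only shows how an adversarial prompt seeded with a verbatim prefix of the training text pulls out memorized text, while an ordinary prompt does not.

```python
# Toy "extraction" sketch (illustrative only): an ordinary prompt versus an
# adversarial prompt that is a verbatim prefix of the training text.
import random
from collections import defaultdict

ORDER = 12
corpus = ("exclusive report: the committee approved the draft on monday. "
          "officials said the final vote is expected later this week. ")

# Train by counting followers of each 12-character context.
counts = defaultdict(lambda: defaultdict(int))
for i in range(len(corpus) - ORDER):
    counts[corpus[i:i + ORDER]][corpus[i + ORDER]] += 1

def complete(prompt, length=80):
    out = prompt
    for _ in range(length):
        followers = counts.get(out[-ORDER:])
        if not followers:
            break
        chars, weights = zip(*followers.items())
        out += random.choices(chars, weights=weights)[0]
    return out

# An ordinary prompt that matches no training context returns nothing memorized.
print(complete("tell me the latest"))

# An adversarial prompt seeded with a verbatim prefix of the training text
# "extracts" the memorized continuation character for character.
extraction = complete(corpus[:ORDER])
print(extraction)
print("verbatim regurgitation?", extraction in corpus)
```

In the litigation, the dispute would be over how representative the second kind of prompt is of real-world use: the plaintiff would say regurgitation also follows innocent prompts, while OpenAI would say it takes a deliberate extraction attempt of this sort.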
In conclusion, the ANI vs OpenAI litigation opens up a vast canvas of questions that will shape the future of GenAI and copyright law in India.
I wish to thank Swaraj for his valuable input on the posts.