Discussing the Hamburg Regional Court’s landmark decision in Robert Kneschke v LAION e.V. on text and data mining and its treatment as an exception to copyright infringement, SpicyIP Fellowship applicant Tanishka Goswami highlights the Court’s key findings and explores the probable lessons for Indian policymakers in harmonising copyright with AI and innovation. Tanishka is an advocate at the High Court of MP. She graduated from National Law University, Delhi in 2023 and enjoys reading and writing on copyright law. Her previous post can be accessed here.
Taming the ‘LAION’: Lessons for Harmonising AI and Copyright Law
By Tanishka Goswami
Carrying significant implications for the use of copyright-protected material in developing AI training data, the Hamburg Regional Court’s (“the Court”) recent decision in Robert Kneschke v LAION e.V. merits serious consideration by IP and AI enthusiasts alike. LAION e.V. (“Large-scale Artificial Intelligence Open Network”), a non-profit organisation enabling public access to large-scale machine learning models, was accused by seasoned German photographer Robert Kneschke of using his photographic images in its dataset without obtaining the requisite consent. Ruling in favour of LAION e.V., the Court identified the use of the photographs as “scientific research” within the meaning of the EU Directive on Copyright in the Digital Single Market (“DSM Directive”). The exception under Article 3 of the Directive permits text and data mining (“TDM”) for the purposes of scientific research, while Article 4 extends TDM to other purposes, subject to a reservation of rights by rightsholders.
In this context, I examine: firstly, the functioning of multi-modal AI models and the German Copyright Act’s (“the Act”) treatment of TDM by such models; secondly, the Court’s approach towards the “opt-out” safeguard for rightsholders against the “scientific research exception”; and thirdly, lessons for Indian policymakers in harmonizing copyright with AI and innovation.
Readers may kindly note that the scope of this article is limited to potential copyright infringements by open datasets that are utilized to train AI models, and not the generative AI platforms/models themselves. The latter has been discussed in a previous SpicyIP post, here, and can be a great follow-up read to the current one.
How Does the LAION 5B Dataset Operate?
The dataset in contention, LAION 5B, comprises around five billion images paired with text descriptions, in a bid to democratise research around “large-scale multi-modal model training”. Multi-modal models are AI systems that process and integrate information from multiple sources, including text, audio, video, and image data. Examples include Stable Diffusion, DALL-E 3, ImageBind by Meta AI, and Google’s Multimodal Transformer. Given the variety of objectives these models serve in healthcare, education, and research in science and technology, it is evident that they depend on enormous amounts of data.
A snippet from the LAION 5B Dataset
Against this backdrop, ending up as a part of the LAION 5B dataset meant that the plaintiff’s copyrighted works were open to use by millions of AI researchers, without any concrete pathway for him to secure compensation. How the German court dealt with these circumstances by invoking the national copyright statute and European AI jurisprudence is discussed below.
Is the Court ‘Opting Out’ of its Duty to Defend Authors?
- Text and Data Mining under §44b of the German Copyright Act (“the Act”)
§44b acknowledges TDM as a limitation on an author’s copyright, thereby constituting use permitted by law. The provision defines TDM as the automated analysis of one or more digital or digitised works for information-gathering purposes, specifically for identifying “patterns, trends, and correlations”. The Court’s analysis of the TDM exception focused on: firstly, the nature of the reproduction undertaken by the defendant LAION; and secondly, the relevance of the InfoSoc Directive, which aims to harmonise the rights of copyright holders with technological developments.
On the application of §44b, the Court observed that the reproduction undertaken in the LAION 5B dataset enabled information extraction about “correlations” surrounding the uploaded images. What was being correlated in this case? According to the Court, the content of the downloaded images was correlated with pre-existing image descriptions stored in the LAION 5B dataset. Hence, the protection under §44b was attracted.
Additionally, the Court referred to Article 5(5) of the InfoSoc Directive, which enshrines the three-step test that any exception to copyright, including TDM, must satisfy. The Court recognised that the reproduction identified in the present case (made to establish the “correlation” noted above) would not conflict with the normal exploitation of the plaintiff’s works. Further, the future possibility of training AI systems on datasets such as LAION 5B was not deemed an impairment of the plaintiff’s rights either.
- Would LAION 5B constitute “Scientific Research” under §60d?
This provision permits TDM for the purpose of “scientific research” by research organisations, libraries and museums, and individual researchers. The Court noted that: firstly, a broad conception of “scientific research” must be adopted so that protection extends to steps aimed at achieving knowledge gains in the future; secondly, the creation of datasets like LAION 5B furthers such a conception of “scientific research” by serving as a basis for the training of AI systems, and therefore for knowledge acquisition in the future; and thirdly, the operations of the defendant were purely non-commercial.
Readers may note that the EU AI Act and emerging global practices place the onus of reserving rights on copyright holders. Hence, rightsholders need to “opt out” if they do not want their works to be open to use by other creators, developers, and publishers. The plaintiff, in this case, asserted his opt-out right: the website hosting his photographic works prohibits “downloading, scraping, or caching” of any content. However, the Court peculiarly placed the defendant’s right of reproduction under §60d on a higher pedestal than the right under §44b, since §60d does not allow rightsholders to opt out. Why this was done is discussed below.
- Balancing the Differing “Opt-Out” Mechanisms under §44b and §60d
Opt-out formalities aim to strengthen right holders’ position by empowering them to negotiate licensing deals with AI and technology companies, while obtaining necessary remuneration for their creative works. However, Article 3 of the DSM Directive obliges EU member states to permit TDM for “scientific research” without making it subject to any such opt-out. In line with this, §60d of the German Copyright Act contains no opt-out provision. Hence, while it has been pointed out that the judgment failed to discuss other valid opt-out provisions in the Act, excluding the opt-out was essential to further TDM for scientific research. Simply put, a reservation/opt-out by a copyright holder (as under Article 4(3) of the Directive) does not hamper the exception for “scientific research”.
On this aspect, the Court also focused on the intent behind Article 53 of the EU AI Act, which requires providers of general-purpose AI models to identify and comply with the “opt-out” reservation discussed under Article 4(3) of the DSM Directive. Article 4(3), in turn, provides that the TDM limitation on an author’s copyright applies only if the use of their works has not been expressly “reserved” by the right holders in an appropriate manner. However, as discussed above, the “scientific research” exception supersedes such reservations.
It becomes important to ask: One, is the opt-out mechanism a sufficient safeguard of creators’ rights? Two, how can copyright law further AI and innovation alongside protecting authors? Three, what can Indian policymakers learn and adopt from these emerging debates?
Mining Concerns Surrounding the Opt-Out Mechanism
Datasets that train AI will inevitably continue to download and scrape images they do not own, albeit for the larger goal of democratising and enabling research and innovation. To what extent should they be regulated? A few considerations are important in answering this.
Firstly, while the opt-out model may find broad support among leading AI companies, it is not convenient to implement. There is limited clarity on whether creators have to provide opt-outs to every entity that trains AI models, and on whether such opt-outs would be standardised or model-specific. This uncertainty invites inconsistent judicial outcomes. Hence, while the Hamburg court observed (obiter) that opt-outs in natural language can be as effective as those expressed in machine-readable formats, the requirement may be applied differently across jurisdictions.
Secondly, on the flip side, aggressive opting out by creators may add to licensing and related costs for AI system providers, especially if the concerned data are critical to the research. If these costs are too high, developers may train on narrower, less representative datasets, and the development of biased models would be a likely result. Varying contractual arrangements with different creators on remuneration/compensation for use will also impact the growth of AI-based creativity.
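To make the contrast between machine-readable and natural-language opt-outs concrete, the following is a minimal Python sketch. It assumes, purely for illustration, that a machine-readable reservation is expressed through a robots.txt file and that a natural-language reservation appears in a website’s terms of use. The crawler name, the sample texts, and the keyword list are all hypothetical, and whether either form satisfies Article 4(3) of the DSM Directive remains a legal question that code cannot settle.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name, used only for illustration.
CRAWLER_NAME = "example-dataset-bot"

# Hypothetical keywords; a real natural-language check would need far more care.
RESERVATION_KEYWORDS = ("scraping", "downloading", "caching")


def is_reserved_by_robots(robots_txt: str, url: str, user_agent: str = CRAWLER_NAME) -> bool:
    """Return True if the robots.txt text disallows this agent from fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def mentions_reservation(terms_text: str) -> bool:
    """Crude heuristic: does the terms-of-use text mention any reservation keyword?"""
    lowered = terms_text.lower()
    return any(keyword in lowered for keyword in RESERVATION_KEYWORDS)


# Sample inputs (assumed, not taken from any real website).
SAMPLE_ROBOTS = """\
User-agent: example-dataset-bot
Disallow: /

User-agent: *
Allow: /
"""

SAMPLE_TERMS = "Users may not use automated programs for downloading, scraping, or caching any content."

if __name__ == "__main__":
    image_url = "https://example.com/photos/1.jpg"
    print(is_reserved_by_robots(SAMPLE_ROBOTS, image_url))
    print(mentions_reservation(SAMPLE_TERMS))
```

The machine-readable check gives a crawler an unambiguous answer for a named agent, whereas the keyword scan can only flag that a reservation might exist, which illustrates why a standardised opt-out protocol would reduce uncertainty for both creators and dataset builders.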
Lessons for the Emerging AI Landscape in India
Unlike the legal position in the EU, the Copyright Act, 1957 does not mention TDM as an exception to copyright. Instead, as discussed on this blog here, TDM is likely to be covered under section 52(1)(a), which treats “research” as a fair dealing exception to copyright infringement. Against this uncertain backdrop, the exercise initiated by the Ministry of Electronics & IT to draft a law on AI in India becomes significant.
How can the law address TDM activities? I argue that any such policy must steer away from an “opt-out” mechanism. India’s AI sector is set to reach $22 billion in the coming years, with leading industry experts hailing the country for treading the path to becoming a global AI hub. This growth can not only help address India’s pressing societal needs, but also reinforce the cycle of growth and innovation. Hence, in the interests of greater access to knowledge and of democratised research and creativity, reference to the American “fair use” approach (analysed w.r.t. TDM by Dr. Arul Scaria, here) may be ideal for the time being.
The Robert Kneschke judgment is the first European case to examine the legality of using copyright-protected works to create datasets for AI training. In the coming years, we will witness greater clarity on: one, the interpretation of “scientific research” under the DSM Directive; two, the difference in the application of law towards datasets that index images and those that actively scrape them (see, Getty Images v Stability AI); and three, standardised protocols for authors to “opt out” of AI training datasets. Until then, stimulating the growth of young innovators and developers by shielding them from apprehensions about licensing formalities and infringement fines would be the way forward.