Gen AI: copyright claims, data poisoning, and regulation

With AI companies facing a slew of lawsuits, Bisman Kaur of Remfry & Sagar considers the arguments and how initiatives such as ‘data poisoning’ and new legislation could help address the intellectual property issues involved

AI technologies are proving transformative not just for businesses developing autonomous vehicles or advanced medical diagnostics but for individuals performing everyday tasks such as unlocking a phone using face recognition tools. Lately, one may even have experimented with generating a poem on ChatGPT or creating an image using text inputs on DALL-E. ChatGPT and DALL-E are generative AI (Gen AI) platforms designed to create text or other forms of media based on patterns and data they have been trained on. But where does this data come from?

Data for training Gen AI platforms is typically ‘scraped’ from information that is publicly available. And therein lies the rub.

A bevy of lawsuits demanding compensation from AI companies have been filed recently. Prominent among them is the suit by The New York Times (The Times) for copyright infringement against OpenAI and Microsoft in December 2023. The Times alleges that millions of its articles have been used to train OpenAI’s chatbot and other technology to develop a competing source of reliable information that mimics its style and recites content verbatim. Such unauthorised use (and unjust enrichment) on the part of OpenAI violates The Times’ intellectual property, it is argued. Significantly, OpenAI and Microsoft were approached by The Times earlier in 2023 to discuss copyright concerns, but the parties never reached an agreement.

OpenAI, along with Microsoft and its subsidiary GitHub, has also been sued for software piracy in a November 2022 class action lawsuit. The allegation is that the parties are violating copyright law by allowing Copilot, a code-generating AI system trained on billions of lines of public code, to reproduce licensed code snippets without providing credit. GitHub repositories contain code posted under open-source licences that require attribution of the author’s name and copyright by users. Violation of the attribution requirements and the removal of copyright management information is alleged to infringe the legal rights of a large community of coders.

Closer to home for the legal industry, in a case scheduled to be heard by a US court in 2024, information services company Thomson Reuters has accused Ross Intelligence of illegally copying thousands of ‘headnotes’ – short summaries of points of law that appear in opinions – from its well-known Westlaw legal research platform to train a competing AI-based legal search engine.

Meanwhile, an open letter from US organisation the Authors Guild that has been signed by more than 15,000 authors – including Margaret Atwood, Dan Brown, and Jodi Picoult – has urged technology companies responsible for Gen AI applications, such as ChatGPT and Bard, to cease using their works without proper authorisation, credit, or compensation.

Drawing parallels

This brings to mind the Authors Guild’s previous run-in with technology, when it (along with other parties) sued Google for copyright infringement in 2005. Under the Google Book Search Library Partner project, Google had begun cooperating with the world’s largest research libraries to create digital copies of books. Because it digitally scanned copyrighted books without the permission of rights holders, Google was accused by authors and publishers of violating the exclusive rights of reproduction, distribution, and public display that are the sole preserve of copyright owners.

The electronic library catalogue developed by Google allowed users to search for keywords and obtain a list of books in which those words were found. Based on the search results, users could choose a book and view up to three excerpts.

According to the court, Google’s actions were highly transformative as the underlying books were being used for a new purpose (namely, the creation of a searchable online database), the public display of text was limited, and the excerpts revealed did not provide a significant market substitute for the protected aspects of the originals. It was also felt that the public benefits of creating a large-scale searchable online database of scanned books and expanding access to books outweighed any potential harm to the rights holders. Of note was the fact that Google only made excerpts available for viewing and did not benefit commercially through advertising related to this project or receiving incentives for providing links to other websites where books could be purchased.

Thus, the case established that digitising books without permission was legal as long as the books were not made available online in their entirety and there was no evidence of direct commercialisation by the entity that digitised the book.

In the case of Gen AI tools, could one claim exemption under the transformative effect principle? One may argue that Gen AI platforms trained on copyrighted works do not infringe proprietary rights since these models are not copying the training data. Rather, they are designed to learn the associations between the elements of the data, such as words and pixels, and the results produced tend not to resemble any of the training inputs.

In the GitHub litigation mentioned above, the court agreed with the defendants that the plaintiffs had failed to “identify any instance of Copilot reproducing Plaintiffs’ licensed code”. Substantiating an allegation that an AI system has infringed copyright or trademarks can be difficult, as such systems are trained on an enormous cache of data and there is no way of tracking the manner in which deep learning systems combine training data to answer a particular query (the ‘black box problem’).

Also, copyright law protects original works by granting exclusive rights to creators, such as the right to reproduce, distribute, perform, and display the work. However, one cannot apply copyright to facts, ideas, concepts, or style. And when it comes to comparing the output generated by Gen AI versus copyrighted material, it is important to differentiate between inspiration and infringement.

Developing approaches

Ultimately, the greater threat of liability may stem not from how Gen AI tools are trained or the outputs they produce but from how the technology fares in the hands of an end user. Selective prompting strategies may enable users to create text, images, or video that violate copyright and other laws – deepfakes being one example. Could the entities that own the Gen AI tools be held liable for contributory infringement in such a case?

Microsoft argues (in the GitHub suit) that the answer lies in the 1984 Betamax ruling by the US Supreme Court (Sony Corp. of America v Universal City Studios). When Sony first started selling videotape recorders (VTRs), rights holders argued that it was facilitating mass copyright infringement and that a machine manufacturer could be held liable under copyright law for how others used that machine.

The court ruled in favour of Sony. First, it observed that where copyright law had not kept up with technological innovation, courts should be wary of expanding copyright protections on their own. Second, it borrowed the concept of ‘substantial non-infringing uses’ from patent law to say that Sony would be liable only if it were shown that the VTRs were a mere tool for infringement in the hands of consumers. However, VTRs were found to be ‘capable of substantial non-infringing uses’ as customers were making private, non-commercial recordings that qualified as lawful fair use.

This ruling proved to be an innovation enabler and arguments abound in favour of viewing disruptive Gen AI tools through the prism of Sony’s VTRs – tools capable of substantial non-infringing uses for the public good.

Even so, some outfits are adopting mitigating measures. Stability AI plans to allow artists to opt out of the data set used to train the next version of Stable Diffusion (an open-source text-to-image generation model). And while DALL-E assigns rights around generated art to users, it also warns them of ‘similar content’ and encourages due diligence to avoid infringement. What this may translate to is AI users adopting risk management frameworks that would include greater awareness of the terms of use of Gen AI products and carrying out due diligence, such as reverse image searches, on generated works intended to be used commercially.

Some legal experts also suggest that the copyright infringement suits ought to be directed not at the creators of these AI systems but at the party responsible for compiling the images used to train them. Others call for AI-generated material to include a digital watermark so that it can be easily tracked.

The dark side

Meanwhile, a section of artists and businesses have sought to remedy the issue of unauthorised use of copyrighted content by embedding digital noise or imperceptible patterns to ‘poison’ artworks. These alterations, invisible to the human eye, are designed to be detected by AI algorithms. Training on such embedded anomalies diminishes the accuracy and, ultimately, functionality of an AI model. The term for intentional modification of data to disrupt the learning process of AI models is ‘data poisoning’.

Ben Zhao, a professor at the University of Chicago who has led the team behind a data poisoning tool named Nightshade, says it is meant as a powerful deterrent against disrespect of copyright and intellectual property. When the tool was tested by researchers using Stable Diffusion, it was found that after training on 50 poisoned images, the model began generating dogs with distorted features – the entire concept of ‘dogs’ had been poisoned within the model. After 100 samples, the model began producing cats instead of dogs, and 300 samples later, the cat images were near perfect.

Since Gen AI models cluster similar concepts, Nightshade was also able to trick the model into generating a cat when prompted with words related to a dog, such as “husky” and “wolf”. The same team has also developed a tool called Glaze, which, through a subtle change in the pixels of images, enables artists to mask their personal style and prevent it from being scraped and manipulated by machine-learning models.
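
To make the mechanism concrete, the following is a deliberately simplified Python sketch; it is not Nightshade’s actual method. It assumes a toy ‘model’ that merely learns the average of a single made-up numeric feature for each label. The feature values, the number of poisoned samples, and the function names `train` and `looks_like` are all hypothetical and chosen only for illustration. The point it shows is that a handful of samples that resemble one concept but carry another concept’s label can drag the learned notion of ‘dog’ towards ‘cat’.

```python
from statistics import mean


def train(samples):
    """Learn one mean feature value per label (a toy stand-in for a model)."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def looks_like(reference, value):
    """Name the label in `reference` whose learned mean is closest to `value`."""
    return min(reference, key=lambda label: abs(reference[label] - value))


# Hypothetical clean data: dogs cluster near 1.0, cats near 5.0.
clean = [(0.5, "dog"), (1.0, "dog"), (1.5, "dog"),
         (4.5, "cat"), (5.0, "cat"), (5.5, "cat")]

# Hypothetical poisoned samples: cat-like features carrying the label "dog".
poison = [(5.0, "dog")] * 6

clean_model = train(clean)
poisoned_model = train(clean + poison)

# Ask each model for its idea of a "dog" and judge the result against the
# clean reference clusters.
dog_before = looks_like(clean_model, clean_model["dog"])
dog_after = looks_like(clean_model, poisoned_model["dog"])
print(dog_before, dog_after)
```

In this toy setting the clean model’s idea of a dog sits in the dog cluster, whereas the poisoned model’s idea of a dog lands closer to the cat cluster. Real image generators are vastly more complex, so the sketch should be read as an analogy for the effect the researchers reported rather than a reproduction of it.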

The intent behind developing such tools is not to break AI systems; rather, the team hopes its technology can persuade AI training companies to license image data sets, respect crawler restrictions, and conform to opt-out requests.
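
One common form of crawler restriction is a website’s robots.txt file, and the short Python sketch below shows how a crawler could check it before collecting anything. It rests on several assumptions: that ‘crawler restrictions’ here means robots.txt rules, that the file contents shown are what a site owner might publish, and that `ExampleAIBot` and `may_fetch` are made-up names used purely for illustration. The parsing itself relies on Python’s standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site blocks one AI crawler and allows the rest.
# "ExampleAIBot" is an illustrative name, not a real product.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit this agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


image_url = "https://example.com/gallery/artwork-1.png"
ai_bot_allowed = may_fetch(ROBOTS_TXT, "ExampleAIBot", image_url)
other_bot_allowed = may_fetch(ROBOTS_TXT, "OtherBot", image_url)
print(ai_bot_allowed, other_bot_allowed)
```

Compliance with robots.txt is voluntary on the crawler’s part, which is why tools of the kind described above are pitched as an incentive to honour such signals rather than as a replacement for them.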

In the future

If businesses are unable to lawfully train models on freely available materials, it may set back AI technology. A balanced approach would lie in developing legislation to guide development. Also, given the global reach of the internet, ideally, national legislation ought to be modelled on global benchmarks.

The EU’s Artificial Intelligence Act offers a valuable template. Approved by the European Parliament on March 13 2024, and expected to come into force in May or June 2024, it emphasises transparency and traceability. From a Gen AI perspective, AI systems must be trained, designed, and developed in such a way that there are state-of-the-art safeguards against the generation of content that violates EU laws. This will include informing users of a tool’s exact functionality. Detailed documentation, a publicly available summary of the copyrighted data used for training, and compliance with stronger transparency obligations are also required. These obligations directly address the infringement of intellectual property rights, particularly copyright.

The Indian government has also announced recently that it is looking to frame a new piece of legislation on AI. Keeping in mind the broader aim of promoting innovation, the law would balance the rights of various parties, including content creators and AI technology enablers such as large language models, the text-generating part of Gen AI.

With regard to transformative technologies, the law must scramble to catch up. New AI legislation is one approach. However, answers may also arise from fresh interpretations of existing statutes – a development likely to emerge once courts in different countries start assessing the various lawsuits filed against AI companies.

Meanwhile, businesses can mitigate risk by vetting data sets for permissive licences and seeking indemnities against intellectual property infringement from AI providers wherever possible. They could also consider documenting how an AI model was trained, educating staff to minimise the likelihood of producing infringing outputs, and adopting measures to check for infringements before using outputs.
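
As a rough illustration of what vetting a data set for permissive licences might involve, the Python sketch below splits training records into accepted and rejected groups according to a licence allowlist. Everything in it is an assumption made for the example: the `Record` structure, the sample sources, and the choice of SPDX-style identifiers placed on the allowlist. Which licences count as acceptable for a given use is a legal judgement that code of this kind cannot make.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: an illustrative allowlist of SPDX licence identifiers.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause", "CC0-1.0"}


@dataclass(frozen=True)
class Record:
    source: str             # hypothetical path or URL of the training item
    licence: Optional[str]  # SPDX identifier, or None if no licence was found


def vet(records):
    """Split records into (accepted, rejected) using the allowlist."""
    accepted, rejected = [], []
    for record in records:
        if record.licence in PERMISSIVE:
            accepted.append(record)
        else:
            # Unknown or missing licences are rejected for manual review.
            rejected.append(record)
    return accepted, rejected


# Hypothetical data set entries.
records = [
    Record("repo-a/util.py", "MIT"),
    Record("blog/post-17.html", None),
    Record("repo-b/core.c", "GPL-3.0-only"),
    Record("photo-archive/001.jpg", "CC0-1.0"),
]

accepted, rejected = vet(records)
print([r.source for r in accepted])
print([r.source for r in rejected])
```

Passing such a filter does not end the matter. Licences such as MIT and Apache-2.0 still require that copyright notices be retained, so a record of each item’s source and licence remains useful alongside the documentation of how a model was trained.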