AI bots: Copyright lawsuits claim some 'artificially intelligent' programs are actually plagiarising humans – Gina Helfrich

ChatGPT, DALL-E, Midjourney, Stable Diffusion. If you recognise these names, it’s because they’ve made a big splash, to the point that some are predicting a major transformation of the working world as a result.

They are what’s known as generative AI tools: they generate outputs, whether text or image, on the basis of prompts input by the user. Boosters of this technology point to its potential for increased productivity. Need to build a website? ChatGPT can help you write the content. Midjourney can help you quickly create the artwork, and Copilot can accelerate your ability to write the code. Behind these seemingly benign time-savers, however, lurks the risk of plagiarism and copyright violation as a result of the way the AI models were trained.

While people often refer to ‘AI’ as though it’s a monolith, not every AI model is the same. Part of what distinguishes one AI model from another is the dataset used to train it. ‘Training’ a model means inputting the data upon which it will base its outputs. For example, an AI model for identifying lung cancer will be trained on many, many images of lung cancer patients’ lungs. Since generative AI models are used for more generic purposes, their training data will be less specialised, encompassing all kinds of text and images.


To train ChatGPT, OpenAI scraped huge swathes of the internet, including sites like Wikipedia, Reddit, and Stack Overflow. Stable Diffusion is trained on images scraped from sites like Flickr, DeviantArt, and Shutterstock. Each of these websites is filled with content whose copyright belongs to individuals and organisations, sometimes with specific licensing requirements for its reuse. Image-generation AI tools often produce misshapen Getty Images watermarks, for example – a clear indication that the model was trained on copyrighted images.

Data provenance issues like these are already causing problems for the companies behind these generative AI tools. Microsoft, GitHub, and OpenAI are being sued for alleged copyright violations because their AI-powered coding assistant, Copilot, appears to reproduce sections of other people's code without proper attribution. Artists are suing Stability AI, the makers of Stable Diffusion, and Midjourney in a class-action lawsuit over alleged copyright infringement for using their artistic work to train AI models without permission.

As a result, some companies see a competitive advantage in creating generative AI tools that take copyright and intellectual property seriously. Adobe, for example, is launching its own AI image generator, called Firefly, which it claims has been trained only on images that are out of copyright, licensed for training, or part of Adobe's own stock library.

DuckDuckGo has implemented an AI-powered search assistant called DuckAssist that summarises articles from Wikipedia and Encyclopaedia Britannica and – key to avoiding plagiarism – cites its reference sources. You.com is a new AI-assisted search engine with an interface similar to a chatbot, but one which, unlike ChatGPT, provides source citations within its text outputs.

In trying to discern the right approach to generative AI, respect for artists’ and creators’ intellectual property is an important piece of the puzzle.

Dr Gina Helfrich is Baillie Gifford programme manager for the Centre for Technomoral Futures at Edinburgh Futures Institute, University of Edinburgh