First published by LexisNexis New Law Journal, this article considers the law on web scraping and the unlawful appropriation of copyrighted works, with reference to existing precedent and the ongoing Getty litigation, and argues that bridging the gap between law and technological practice will require targeted legal reform and clear regulatory guidance.
In December 2024, the UK government launched a consultation proposing reforms that would help it deliver a copyright framework that ‘rewards human creativity, incentivises innovation and provides the legal certainty required for long-term growth in both sectors’. Among other measures, it proposed allowing artificial intelligence (AI) companies to mine publicly available content for commercial training purposes, unless rights holders expressly opted out. While echoing abandoned 2022 proposals, the suggested reforms mark a substantial departure from the traditional framework of UK copyright law, which has long emphasised authorial control and consent.
The government’s suggestions, however, have drawn criticism from both the sectors it seeks to serve. The creative industries argue the reforms threaten their livelihoods, while AI developers criticise their technical unworkability and legal ambiguity. With little consensus emerging, and AI deployments accelerating, what lessons can be drawn from existing copyright law for AI model training, and how might the proposed reforms bridge the widening gulf between law and practice?
How copyright affects training
In the context of AI model training, particularly for large language models (LLMs) such as ChatGPT and Gemini, enormous bodies of materials are copied and processed to ‘teach’ the model about language, context and meaning. One way this training can occur is through web scraping, where computer programs search the Internet for content, processing the text and data they encounter. This process is referred to as text and data mining (TDM).
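For illustration, the short Python sketch below shows the kind of copying a single scraping step involves. It assumes the third-party requests and beautifulsoup4 libraries and a hypothetical URL; real TDM pipelines perform the same operation across millions of pages.

```python
# Minimal sketch of one web-scraping step in a TDM pipeline.
# Assumes the third-party `requests` and `beautifulsoup4` libraries;
# the URL is hypothetical.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/article"  # hypothetical publicly accessible page
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parsing the HTML and extracting its visible text: at this point the
# page's content has already been copied into memory for analysis.
soup = BeautifulSoup(response.text, "html.parser")
text = soup.get_text(separator=" ", strip=True)
print(text[:200])  # first 200 characters of the scraped text
```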
TDM does not necessarily require the permanent retention of these materials, though it usually requires their temporary storage and reproduction for analysis during training. The patterns and textual representations extracted from them are generally not stored verbatim, but fragments can become embedded in the model, giving rise to the risk that copyrighted content is later reproduced.
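A toy example makes the point concrete. The sketch below (plain Python, purely illustrative, and nothing like how a production LLM is built) stores only word-to-word transition counts rather than the source sentence itself, yet sampling from those counts can still reproduce short verbatim fragments of the training material.

```python
# Toy illustration: a bigram model stores statistics, not the text itself,
# yet can still emit verbatim fragments of its training data.
import random
from collections import defaultdict

training_text = "the quick brown fox jumps over the lazy dog"
words = training_text.split()

# "Training": count which word follows which; the original sentence is
# not stored, only these transition counts.
transitions = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    transitions[current].append(nxt)

# "Generation": sample a chain of words from the learned statistics.
random.seed(0)
word, output = "the", ["the"]
for _ in range(6):
    if word not in transitions:
        break
    word = random.choice(transitions[word])
    output.append(word)

# The output can contain verbatim runs such as "the lazy dog".
print(" ".join(output))
```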
Whether materials are processed as inputs during TDM or as outputs when an AI tool responds to a user’s prompt, the same legal concern arises: have copyrighted works been unlawfully appropriated or reproduced in substance?
The legal framework
The foundational statute governing copyright in England and Wales is the Copyright, Designs and Patents Act 1988 (CDPA 1988). Copyright subsists automatically in original works under s 1, provided those works fall within a protected category (such as literary, dramatic, musical or artistic works) and meet the threshold of originality. Content published on publicly accessible web pages, such as blog posts, articles, artwork or source code, which may be ‘scraped’ during TDM, automatically qualifies for protection if it involves sufficient skill, labour and judgement in its creation.
Section 29A of CDPA 1988 introduces a limited statutory exception for TDM, but only for non-commercial research. Inserted by the Copyright and Rights in Performances (Research, Education, Libraries and Archives) Regulations 2014 (SI 2014/1372), it permits a person with lawful access to a work to copy it for the purpose of ‘computational analysis’.
Because s 29A is narrowly drawn to exclude commercial research, it is of little use to most AI firms. This limitation perhaps explains the UK government’s observation that difficulties navigating UK copyright law are ‘undermining investment in and adoption of AI technology’, prompting its renewed push for reform.
It also reflects the broader legal context of the ongoing high-profile litigation in Getty Images (US) Inc and others v Stability AI Ltd [2023] EWHC 3090 (Ch), in which Getty alleges that Stability AI infringed its intellectual property (IP) rights by copying and processing millions of its copyrighted images to train an AI model.
The Getty case marks the UK courts’ first opportunity to address AI training and infringement, and will doubtless be watched closely by all IP lawyers. Past decisions, however, reveal tensions between the law as it stands and the ongoing policy debate, as well as the principles the High Court may apply in Getty.
Lessons from precedent
One of the defining features of LLMs is that they do not produce predetermined outputs. Responses are instead generated by the AI system itself, probabilistically derived from patterns in the model’s training data as shaped by a user’s prompts. As the outputs do not emanate from the pen of a system’s operator, rights holders may struggle to demonstrate any intention on the part of the AI operator to reproduce their copyrighted works.
In Francis, Day and Hunter Ltd v Bron [1963] Ch 587, [1963] 2 All ER 16, however, decided under the Copyright Act 1956, the Court of Appeal confirmed that copyright infringement can arise whether a work is copied consciously or unconsciously. Where a substantial part of a protected work is reproduced by an AI tool, therefore, an infringement may occur even if the reproduction is incidental or automated.
The question of whether a substantial part of a work has been taken was firmly addressed in Designers Guild Ltd v Russell Williams (Textiles) Ltd [2000] 1 WLR 2416, [2001] 1 All ER 700, where the House of Lords confirmed that substantiality is to be assessed in terms of the quality and importance of what is copied, rather than quantity.
Newspaper Licensing Agency Ltd v Meltwater Holding BV [2011] EWCA Civ 890, [2011] All ER (D) 248 (Jul) further embedded this principle, illustrating that copyright protection turns on originality, not length: even short elements may qualify as protected works if they reflect the author’s own intellectual creation. In Meltwater, the Court of Appeal confirmed that even article headlines could constitute literary works, and that short extracts could qualify as substantial parts of longer works under CDPA 1988, s 16(3). As the use of such works without a licence constitutes an infringement under s 16(2), even fragments of a copyrighted work used during AI training or in AI outputs could trigger liability.
Meltwater also emphasised the narrow construction of the transient or incidental copying defence under CDPA 1988, s 28A. The defence permits certain temporary reproductions (such as caching or buffering), though only if the use of the original work is itself lawful. The fact that a web page is accessible does not mean it is free to use, and browsing alone does not automatically confer a right to make a temporary copy. By analogy, transient copying during TDM for AI training is likely to be an infringement under s 17.
The courts have resisted broadening copyright exceptions absent clear legislative authority, which may make them reluctant to do so in Getty. In Ashdown v Telegraph Group Ltd [2001] EWCA Civ 1142, [2001] 4 All ER 666, for example, the Court of Appeal rejected a fair dealing defence under CDPA 1988, s 30, despite arguments based on public interest. The court found that the defendant’s substantial reproductions of copyrighted materials were motivated by commercial purposes and criticised the defendant’s extensive verbatim copying, suggesting instead that limited quotations within a report on current events might arguably have been justified.
Ashdown illustrates that commercial use weighs heavily against a finding of fair dealing, casting doubt on whether training AI on protected material for commercial purposes could qualify. However, it leaves broader questions about transformative use in copyright law unsettled. At what point might transformations enable a fair dealing defence, for example?
Courts increasingly consider the nature and purpose of secondary use when assessing fairness in copyright claims. In the context of AI, where model training results in the statistical abstraction of underlying data, a defendant might therefore seek to argue that such abstraction is sufficiently transformative to avoid liability.
However, case law casts doubt on the strength of such a defence. In Temple Island Collections Ltd v New English Teas Ltd [2012] EWPCC 1, the court emphasised that copying extends not only to literal replication but also to the appropriation of elements reflecting the author’s intellectual creation. Its approach followed the Court of Justice of the European Union’s decision in Infopaq International A/S v Danske Dagblades Forening (Case C-5/08) [2010] FSR 20, where the reproduction of just 11 words from a newspaper article was held capable of infringement if it expressed the author’s intellectual creation. In Temple Island, the defendant’s photograph was therefore found to infringe, despite important and visually significant differences, because a causal connection remained between the claimant’s and defendant’s works.
The implications for AI training here are significant. Even if AI models statistically abstract features from training data, the replication of expressive elements may still amount to infringement. When read alongside SAS Institute Inc v World Programming Ltd [2013] EWCA Civ 1482, [2013] All ER (D) 254 (Nov), which reaffirmed that ideas and functionality are not protected, but expression is, it becomes clear that AI operators are unlikely to avoid liability through transformation alone if the expressive elements of the training data are embedded in the AI model or are otherwise reproduced.
These cases show that UK copyright law takes a broad and nuanced approach to infringement, highlighting the considerable risks faced by companies that use protected works without permission.
It is against this backdrop that pressure has mounted for legislative reform to clarify the legal position and better balance the interests of rights holders and AI developers.
Proposed reforms
The government’s suggestions for an opt-out model for TDM for AI would substantially alter the current legal position, which treats the unlicensed use of protected material as a potential infringement. It would require rights holders to actively object to the use of their works, reversing the default position of requiring prior authorisation. Although intended as a middle ground, the government’s proposal has proved divisive. Creatives argue it undermines their control, while some in the tech sector view it as technically unworkable and legally uncertain.
Indeed, the legal infrastructure needed to support such a model does not yet exist. CDPA 1988 makes no provision for opt-out declarations, and implementing such a framework would require primary legislation capable of defining and operationalising this mechanism. It may also necessitate statutory definitions for abstract terms such as ‘artificial intelligence’, ‘training’ and ‘model’ to ensure the boundaries of any exception are clear. The difficulty of legislating with precision in a field marked by rapid technological change and conceptual ambiguity should not be underestimated.
Further ambiguity surrounds enforcement. Might the breach of an opt-out amount to a regulatory contravention or an actionable infringement? If the former, creators may lack adequate redress. If the latter, litigation costs could deter creators from enforcing their rights where the value of their work is relatively low. Either outcome could embolden AI operators to risk non-compliance, treating the misuse of protected content as a commercially tolerable cost rather than a legal wrong.
Enforcement poses other practical challenges. A central registry may be needed to make opt-outs operationally meaningful, and questions remain as to whether the burden of monitoring might fall disproportionately on individual creators.
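No statutory registry yet exists, though informal machine-readable signals do: many sites already use robots.txt files to turn away known AI crawlers. The sketch below, using Python's standard library, a hypothetical site and the real 'GPTBot' crawler user agent, shows how a scraper might check such a signal; whether honouring or ignoring that file carries any legal consequence is precisely what remains unresolved.

```python
# Checking an informal, machine-readable opt-out signal via robots.txt.
# The site URL is hypothetical; "GPTBot" is a real AI-crawler user agent.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

page = "https://example.com/article"
if parser.can_fetch("GPTBot", page):
    print("No robots.txt objection to crawling", page)
else:
    print("Site has signalled that this crawler should not fetch", page)
```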
Without greater transparency in AI training datasets and outputs, it may be impossible for rights holders to detect when and how their works have been used, leaving enforcement contingent on speculative claims and asymmetric litigation.
The opt-out model could also introduce conceptual inconsistency into copyright law. It risks creating a system where a reproduction is lawful when executed by a machine but unlawful when done by a human. This could complicate and bifurcate established doctrines of authorisation, reproduction and fair dealing.
The Getty case will likely bring many of these tensions to the fore, offering the courts their first opportunity to confront the legality of large-scale ingestion for AI training. Yet a decisive ruling is unlikely to resolve the broader structural dilemmas at the intersection of copyright law and AI development.
Bridging the gap between law and technological practice will ultimately require targeted legal reform and clear regulatory guidance. A statutory TDM licensing scheme, modelled on extended collective licensing principles, could offer an efficient mechanism to support AI innovation while respecting copyright.
Complementary measures, such as enhanced transparency obligations or a centralised rights management system (opt-out or even opt-in), backed by enforceable remedies, could further align incentives between rights holders and developers.
Absent coherent reform, however, the misalignment between law and practice will only deepen, frustrating innovation, undermining the interests of rights holders and jeopardising the sustainable growth of both the creative industries and AI sector.
This article was first published in the LexisNexis New Law Journal on 6th June 2025.