OpenAI’s copyright conundrum pits fair use precedent against an ‘impossible’ hurdle
It’s not stealing if it’s innovating.
That’s one prickly way of describing the position of AI companies that rely on the internet’s copyrighted works to inspire their models.
This week, OpenAI, the company behind the culture-shifting AI chatbot ChatGPT, elaborated on its public case for rethinking intellectual property in the age of AI.
In response to the New York Times’ copyright infringement lawsuit against it and Microsoft (MSFT), OpenAI sought to clarify its business and motives, writing in a blog post: “Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.”
In a submission to a UK parliamentary inquiry late last year, the company wrote: “Because copyright today covers virtually every sort of human expression — including blogposts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today’s leading AI models without using copyrighted materials.”
What makes OpenAI’s arguments interesting and consequential is the novelty of the debate.
It’s unclear to what extent existing copyright law speaks to AI and the process of ingesting existing material to train powerful models that aim to generate and capture new types of value.
But in a tech industry move that by now seems familiar, AI companies are acting as if their permissive interpretation of the law is the natural mode of engagement and as if restrictions don’t apply to them until they are proven wrong.
The maneuver resembles social media companies shirking real moderation responsibilities while reaping the rewards of publishing other people’s content. It also brings to mind the early days of ride-sharing and the gig economy, when popular apps rushed to claim market share while operating in a legal void.
And with both industries continuing to thrive while the law remains unsettled, AI companies must ask: Why tread lightly when inevitability is on your side?
To calm concerns over infringement, OpenAI is ramping up efforts to partner with more publishers. CNN, Fox Corp., and Time are among the outlets currently in talks with the AI company to share access to their content, Bloomberg reported Thursday.
But a licensing model introduces an array of obstacles on top of potentially onerous costs.
Sasha Luccioni, a research scientist at Hugging Face, an artificial intelligence startup, said imposing a new paradigm on AI companies would require a massive overhaul in how technologists train and deploy their models.
So far, AI companies have largely taken the path of hoovering up the internet to train large language models, or LLMs, without thinking too deeply about copyright, filtering, and licensing. Reorienting that work around meticulous curation, consent, and disclosure is essentially incompatible with how those models are built today.
Luccioni said pursuing a more careful approach isn’t impossible, but it would be a huge undertaking. “It essentially would be back to the drawing board for LLMs,” she said.
If LLMs are generally powered by massive amounts of data of dubious provenance, a new path would force companies to rethink how they train AI, using much smaller pools of data: small language models, if you will.
Jack Stilgoe, a professor of science and technology policy at University College London, said OpenAI’s response highlights a classic tension among professed tech disruptors: To earn the public’s trust, new entrants have to prove they are playing by the rules while also styling themselves as rule-breakers, blazing a path to innovation.
Stilgoe said AI companies likely recognize the incongruence but see the technology moving so fast that the law simply can’t keep up. That’s what makes the legal cases so important. If copyright holders continue to press their challenges, they threaten the whole structure of LLM systems. “It could bring down the whole house of cards,” he said.
But applying a traditional interpretation of copyright law to the novel use of AI could unleash other perverse effects. If AI models are walled off from the most authoritative sources, like trustworthy news outlets or big science publications, future LLMs could be even less trustworthy and reliable, degraded by exposure to inferior sources.
Those risks are amplified by existing concerns over misinformation and “hallucinations,” in which AI tools present false information as fact with all the confidence of an all-knowing, anthropomorphized computer.
“In a world where asymmetries of information matter more than ever,” Stilgoe said, “you can imagine those concerns only growing as LLMs are mediating and accelerating people’s access to information.”