RSS co-creator launches new protocol for AI data licensing - TechCrunch

In the wake of Anthropic’s $1.5 billion copyright settlement, the AI industry is coming to terms with its training data problem. There are as many as 40 other pending cases that seek damages for unlicensed data — including one that takes Midjourney to court for creating images of Superman.
Without some kind of licensing system, AI companies could face an avalanche of copyright lawsuits that some worry will set the industry back permanently.
Now a group of technologists and web publishers has launched a system that would enable data licensing at massive scale, provided AI companies take them up on it. Called Really Simple Licensing (RSL), the system is already being backed by major web publishers like Reddit, Quora, and Yahoo. The question now is whether that momentum will be enough to bring major AI labs to the bargaining table.
According to RSL co-founder Eckart Walther, who also co-created the RSS standard, the goal was to create a training-data licensing system that could scale across the internet. “We need to have machine-readable licensing agreements for the internet,” Walther told TechCrunch. “That’s really what RSL solves.”
For years, groups like the Dataset Providers Alliance have been pushing for clearer collection practices, but RSL is the first attempt at a technical and legal infrastructure that could make it work in practice. On the technical side, the RSL Protocol lays out specific licensing terms a publisher can set for their content, whether that means AI companies need a custom license or to adopt Creative Commons provisions. Participating websites will include the terms as part of their “robots.txt” file in a prearranged format, making it straightforward to identify which data falls under which terms.
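To make the robots.txt mechanism concrete, here is a minimal Python sketch of how a crawler might look up a site's licensing terms before collecting data. It assumes the terms are referenced by a `License:` line pointing to a terms file; the directive name, the function name `find_license_urls`, and the example URL are illustrative assumptions, not details confirmed by the article or taken from the RSL specification.

```python
from typing import List

# Assumption: a site references its licensing terms with a "License:" line
# in robots.txt. The directive name is hypothetical, chosen for illustration.
LICENSE_DIRECTIVE = "license"


def find_license_urls(robots_txt: str) -> List[str]:
    """Return the values of every license line found in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        # Split only on the first colon so the URL's own "://" survives.
        field, value = line.split(":", 1)
        if field.strip().lower() == LICENSE_DIRECTIVE and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt for a participating publisher.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /
License: https://example.com/license.xml  # hypothetical terms file
"""

if __name__ == "__main__":
    print(find_license_urls(EXAMPLE_ROBOTS_TXT))
```

A crawler built this way would fetch the referenced terms file and decide whether its intended use is covered before ingesting any pages. How those terms are actually encoded is defined by the RSL Protocol itself and is not shown here.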
On the legal side, the RSL team has established a collective licensing organization, the RSL Collective, that can negotiate terms and collect royalties, similar to ASCAP for musicians or MPLC for films. As in music and film, the goal is to give licensees a single point of contact for paying royalties and provide rights holders a way to set terms with dozens of potential licensees at once.
A host of web publishers have already joined the collective, including Yahoo, Reddit, Medium, O’Reilly Media, Ziff Davis (owner of Mashable and CNET), Internet Brands (owner of WebMD), People Inc., and The Daily Beast. Others, like Fastly, Quora, and Adweek, are supporting the standard without joining the collective.
The RSL Collective includes some publishers that already have licensing deals, most notably Reddit, which receives an estimated $60 million a year from Google for use of its training data. There’s nothing stopping companies from cutting their own deals within the RSL system, just as Taylor Swift can set special terms for licensing while still collecting royalties through ASCAP. But for publishers too small to draw their own deals, RSL’s collective terms are likely to be the only option.
But while it’s easy enough to determine when a song has been played, AI models pose unique challenges when it comes to figuring out when royalties are due for a specific piece of training data. The issue is simplest for a product like Google’s AI Search Abstracts, which draw data from the web in real time and maintain strict attribution for each fact.
But if training isn’t logged when it occurs, it can be nearly impossible to confirm that a given document was ingested into an LLM. It’s particularly challenging if publishers ask to be paid per inference rather than receiving a blanket fee, an option offered by one of the stock RSL licenses.
Still, RSL’s creators believe AI companies will be able to manage the difficulty. “Some of the licensing agreements they’ve already done have required them to be able to report on it, so it’s possible,” says Doug Leeds, a co-founder of RSL and former CEO of IAC Publishing. “It doesn’t have to be perfect. It just has to be good enough to get people paid.”
The bigger question is whether AI companies will embrace the system. As the success of companies like Scale AI and Mercor shows, frontier labs have no problem paying for data, but the web has traditionally been seen as a source for cheap, low-quality data. With datasets like the Common Crawl already available, it may be a challenge to extract royalties from something labs are used to getting for free. And as the recent dustup between Cloudflare and Perplexity shows, it’s not straightforward to tell the difference between web-scraping and machine-enhanced browsing.
When I put the question to Leeds, he pointed to recent comments from AI leaders calling for a system like RSL — most notably from Sundar Pichai at last year’s Dealbook Summit. Whether the calls for a licensing system are earnest or not, the RSL team plans to hold them to it. “They have said outwardly to everyone, something like this needs to exist,” Leeds told me. “We need a protocol. We need a system.”
Now they may get one.