New research shows that AI could consume the entire supply of publicly available, high-quality web data before the end of the decade. That looming shortage compounds an uptick in domains blocking bots from crawling their content, which forces AI firms to pay for access to valuable information. These licenses have proven quite expensive, potentially preventing smaller AI startups (those without access to large pools of investor cash) from building models that can keep up with better-funded programs backed by big tech.

Further, tech giants like Alphabet and Meta, which own sprawling social media platforms, can leverage their access to an ocean of private data extracted from users’ posts and interactions to train popular large language models. The creation of synthetic data may also prove to be increasingly valuable in the years to come, as this could offset the seemingly inevitable exhaustion of public web data.

Related ETF: Roundhill Generative AI & Technology ETF (CHAT)

The widening spectrum of AI startups may soon find data to be an increasingly expensive and scarce commodity as web domains crack down on free-riding developers hoping to build lucrative machine learning software on the backs of authors and other content creators. MRP has noted several times this year that enterprise AI developers have been forced to start paying significant sums to license data from publishers and social media sites. We posited that the windfall from the ongoing wave of AI adoption would flow not only to the companies developing and utilizing AI applications, but also to brokers of critical data that facilitate the training of complex algorithms and underlie the trillions of parameters in modern machine learning architectures.

To name a few examples, OpenAI, the hulking startup behind ChatGPT, has inked high-profile deals just this year with media outlets including News Corp, Vox Media, Dotdash Meredith, Time, The Atlantic, and Financial Times, as well as with social media platform Reddit. Though the terms of these deals were not disclosed to the public, reports have suggested they could net the outlets hundreds of millions of dollars. According to Reuters sources, Reddit struck a similar deal with Google in February to license the use of its data at an annual rate of $60 million. At the same time, OpenAI faces ongoing lawsuits from the New York Times and from several newspapers owned by hedge fund Alden Global Capital, including the New York Daily News, Chicago Tribune, Orlando Sentinel, and San Jose Mercury News, among others, accusing the company of “unlawful copying” of their works.

Data licensing to a burgeoning AI industry presents a new and lucrative way for publishers to earn revenue, but it threatens to put a hefty financial strain on rapidly growing machine learning startups that leverage large language models (LLMs) that need to…
