Data Warfare

Back in 2012, I used the term "data warfare" to describe what was happening between companies competing for information advantages. At the time, the concept felt a little dramatic. Companies had always competed on intelligence. What was new was the scale and speed. Data was becoming a weapon you could point at competitors, at markets, and at customers. You could use it to predict what a rival would do before they did it. You could use it to undercut prices in real time. You could use it to lock in customers so thoroughly they'd never leave.

I thought we were at the beginning of something. We were, but the arms race that followed was wilder than I expected.

Data as a Competitive Weapon in 2012

Our original thesis was straightforward. Companies that collected more data and analyzed it faster would outcompete companies that didn't. We pointed to examples from retail, finance, and advertising where data advantages translated directly into market share. Walmart was already using point-of-sale data to optimize inventory at a level competitors couldn't match. High-frequency trading firms were spending billions on infrastructure to shave microseconds off data processing times. Google was using search data to identify market trends before traditional market research could.

The idea of a "data moat" was just starting to enter the business vocabulary. If your company had a dataset that competitors couldn't replicate, you had a structural advantage that was hard to overcome. We argued that data was becoming as important as capital, talent, or technology in determining competitive outcomes.

What we didn't fully appreciate in 2012 was how quickly data would become not just a competitive advantage but a competitive weapon. The difference matters. An advantage helps you win. A weapon helps you destroy.

In 2012, data was a competitive advantage. By 2026, it had become a weapon, a hostage, and a trade commodity all at once.

The API Lockdowns

The most dramatic shift happened between 2023 and 2025, when companies that had been relatively open with their data slammed the doors shut.

Reddit announced in April 2023 that it would charge for API access, pricing out the third-party apps that millions of users relied on. The real target wasn't app developers. It was AI companies. Reddit's corpus of human conversations and Q&A threads had become some of the most valuable training data on the internet, and companies like OpenAI and Google had been scraping it for free. Reddit's IPO filing in 2024 explicitly cited data licensing as a revenue stream, and the company signed deals with Google and OpenAI reportedly worth tens of millions annually.

Twitter, under Elon Musk, took an even more aggressive approach. In February 2023, the company announced it was ending free API access, then introduced paid tiers whose enterprise plans reportedly started at $42,000 per month, pricing out most researchers. Musk publicly accused AI companies of training illegally on Twitter data and threatened to sue. The platform also started blocking web crawlers, breaking links from search engines, and requiring login to view any content.

Stack Overflow signed a deal with Google reportedly worth $6 million to license its Q&A database for AI training. News organizations from the New York Times to the Associated Press either sued AI companies for unauthorized use of their content or signed licensing deals. The New York Times lawsuit against OpenAI and Microsoft, filed in December 2023, became the bellwether case for whether training AI on copyrighted content constitutes fair use.

Synthetic Data and the New Arms Race

As organic data became harder and more expensive to acquire, companies turned to synthetic data. The concept is simple: use AI to generate training data for other AI models. Gartner had predicted that by 2024, 60% of the data used to develop AI and analytics projects would be synthetically generated. Companies like Mostly AI, Gretel, and Syntheticus built entire businesses around generating realistic but artificial datasets.

The implications are strange. If you can generate unlimited training data synthetically, does the original data moat still matter? The answer is mostly yes, at least for now. Synthetic data works well for structured problems but struggles to replicate the messy, unpredictable patterns in real human behavior. Models trained exclusively on synthetic data tend to develop subtle biases and blind spots. The companies with the best real-world data still have an edge, but the gap is narrowing.
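To make the blind-spot problem concrete, here is a minimal sketch in Python. It is a toy, not how commercial synthetic data vendors work: it assumes a deliberately naive generator that fits an independent Gaussian to each column of a tiny, made-up dataset. The dataset, the function names, and the choice of generator are all hypothetical. The synthetic rows reproduce each column's average well, but the relationship between the columns, which is the pattern a real business would care about, disappears.

```python
import random
import statistics

# Hypothetical "real" records: (items in basket, order total in dollars).
# The two columns are strongly related in the real data.
REAL_ROWS = [(1, 12.0), (2, 19.5), (3, 33.0), (4, 41.0), (5, 48.5), (6, 62.0)]


def fit_columns(rows):
    """Estimate a (mean, standard deviation) pair for each column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows, sampling every column independently.

    Independence is the simplifying assumption that causes the blind spot:
    per-column statistics survive, cross-column structure does not.
    """
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]


def pearson(rows):
    """Pearson correlation between the two columns of a list of pairs."""
    xs, ys = zip(*rows)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    synthetic = sample_synthetic(fit_columns(REAL_ROWS), 5000)
    print("real correlation:     ", round(pearson(REAL_ROWS), 3))
    print("synthetic correlation:", round(pearson(synthetic), 3))
```

Better generators narrow this gap by modeling the joint structure of the data, but they can only reproduce patterns that were present in the real data they learned from, which is why the original dataset keeps its value.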

What Enterprise Leaders Should Know

The data warfare concept I wrote about in 2012 has become a board-level concern. Companies need to think about their data assets the way they think about intellectual property: what do we have, who wants it, and how do we protect it?

The organizations getting this right are treating data strategy as a core business function, not an IT project. They're auditing what data they hold, understanding its value for AI training, and making deliberate decisions about licensing, sharing, and protecting it. They're also investing in governance systems that track data lineage so they can prove where their training data came from when regulators or litigators ask.
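For readers who want a picture of what tracking data lineage can mean in practice, the sketch below shows one possible shape in Python. It is an illustration under stated assumptions, not a description of any particular governance product: the LineageRecord and LineageRegistry names, their fields, and the example datasets are all hypothetical. Each dataset is registered with its source, license terms, acquisition date, and a content hash, and derived datasets point back to their parents so the full chain can be produced on request.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LineageRecord:
    """One entry in a hypothetical lineage registry."""
    dataset_id: str
    source: str
    license_terms: str
    acquired_on: date
    sha256: str                 # fingerprint of the exact bytes that were registered
    derived_from: tuple = ()    # dataset_ids of parent datasets


class LineageRegistry:
    def __init__(self):
        self._records = {}

    def register(self, dataset_id, content, source, license_terms,
                 acquired_on, derived_from=()):
        """Record a dataset; every parent must already be registered."""
        for parent in derived_from:
            if parent not in self._records:
                raise KeyError(f"unknown parent dataset: {parent}")
        record = LineageRecord(
            dataset_id=dataset_id,
            source=source,
            license_terms=license_terms,
            acquired_on=acquired_on,
            sha256=hashlib.sha256(content).hexdigest(),
            derived_from=tuple(derived_from),
        )
        self._records[dataset_id] = record
        return record

    def provenance(self, dataset_id):
        """Return the dataset's record followed by all of its ancestors."""
        seen, stack, chain = set(), [dataset_id], []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            record = self._records[current]
            chain.append(record)
            stack.extend(record.derived_from)
        return chain


if __name__ == "__main__":
    registry = LineageRegistry()
    registry.register("support-tickets-raw", b"ticket text ...",
                      source="internal CRM export",
                      license_terms="first-party, customer consent on file",
                      acquired_on=date(2024, 3, 1))
    registry.register("support-tickets-clean", b"cleaned ticket text ...",
                      source="derived: PII removed",
                      license_terms="inherits from parent",
                      acquired_on=date(2024, 3, 15),
                      derived_from=("support-tickets-raw",))
    for record in registry.provenance("support-tickets-clean"):
        print(record.dataset_id, "|", record.source, "|", record.license_terms)
```

The design choice worth noting is the content hash: it lets an organization show that the data used in training is byte-for-byte the data it has license records for, rather than relying on file names alone.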

The Reddit and Twitter examples show what happens when companies wake up to the value of their data too late. They had to take drastic action because they hadn't built operational frameworks for managing data as a strategic asset from the start.