Teaching the Discipline of the Data Industry

In October 2012, I wrote about a problem that was driving me crazy: the data industry was drowning in confusion. Terms like "big data," "data discovery," and "data mining" meant different things to different people. Sellers couldn't explain their products. Buyers reached for solutions before understanding their problems. R "Ray" Wang had just published a blog post arguing that organizations should "focus on the questions to ask, not the answers."

I argued that the industry needed better education. Standardized terminology. Clear frameworks. We needed to teach this stuff properly so that business leaders could actually use data instead of just talking about it.

Fourteen years later, we did build an education ecosystem. It's enormous. And we still have a talent gap.

The Education Explosion

When I wrote the original piece, data science education barely existed as a formal discipline. A handful of universities offered analytics programs. There was no widely accepted curriculum. People who worked with data came from statistics, computer science, physics, or economics backgrounds and mostly taught themselves the rest.

The change was rapid. Harvard Business Review declared data scientist "the sexiest job of the 21st century" in October 2012, the same month I published my article. Universities scrambled to launch programs. By 2015, there were over 60 master's programs in data science in the U.S. alone. By 2020, there were hundreds worldwide.

MOOCs democratized access. Andrew Ng's machine learning course on Coursera has been taken by over 5 million people, and AI education is now a thriving ecosystem. Bootcamps promised to compress a graduate degree into 12 weeks. Some delivered value. Others were expensive shortcuts that left graduates underprepared.

In 2012, there were almost no formal data science programs. By 2026, there are thousands. The talent gap persists anyway, because the field keeps moving faster than the education system can follow.

The Curriculum Kept Shifting

Here's the problem I didn't anticipate: the thing we needed to teach changed every few years.

2012-2015: The focus was on Hadoop, MapReduce, and big data infrastructure. Schools taught people to manage distributed systems and write batch processing jobs.

2015-2018: Machine learning took over. TensorFlow, scikit-learn, random forests, neural networks. The curriculum pivoted to modeling and algorithms.

2018-2021: MLOps and production engineering became important. Companies realized they had plenty of people who could build models in notebooks but too few who could deploy them reliably. The gap between data science and data engineering became a major industry problem.

2021-2023: Deep learning, transformers, and generative AI rewrote everything. Understanding how GPT-style models work became essential overnight. Schools that had just finished building ML curricula had to add LLM coursework.

2024-2026: Prompt engineering, AI application development, and AI operations management emerged as distinct disciplines. The idea that you'd spend years learning to build ML models from scratch started to feel like learning to build a car engine when you could just drive. The tooling abstracted away much of the low-level work.

The Persistent Talent Gap

Despite all this education, the talent gap persists. Why?

First, the field moved faster than curricula could update. By the time a university approved a new course, trained instructors, and graduated students, the industry had moved on to the next thing. Universities that spent years building Hadoop programs watched Hadoop fall out of favor.

Second, the gap shifted. There's no shortage of people who can train a model. There's a massive shortage of people who can integrate AI into business operations and align AI capabilities with business strategy.

Third, AI itself is changing what matters. When ChatGPT can write Python code, what does "knowing Python" mean? What's the irreducible human skill that remains?

What Actually Needs Teaching Now

I think the answer is something close to what I argued for in 2012, just at a higher level. The industry still needs people who understand the questions, not just the tools. People who can look at a business problem and determine what data would help, what analysis makes sense, what biases might distort the results, and whether the output can be trusted.

The technical skills change every few years. The thinking skills compound over a career: knowing what question to ask, evaluating whether an answer is reasonable, understanding the limits of your data.

Ray Wang's 2012 advice remains the best starting point for any data professional: "Focus on the questions to ask, not the answers." The tools for finding answers are better than ever. Knowing which questions matter is still the hard part.