---
title: "The Fusion Project"
slug: "fusion-project"
description: "Data fusion from 2012 public datasets to modern AI-powered data integration pipelines."
datePublished: "2012-09-16"
dateModified: "2026-03-15"
category: "Data Strategy"
tags: ["data fusion", "data integration", "public data", "AI pipelines"]
tier: 3
originalUrl: "http://www.applieddatalabs.com/fusion-project"
waybackUrl: "https://web.archive.org/web/20120916061836/http://www.applieddatalabs.com:80/fusion-project"
---
The Fusion Project
In 2012, we launched the Fusion Project with a simple thesis: combining public datasets creates exponentially more value than keeping them separate. We used US Census data, Lending Club statistics, Pew Research surveys, and World Bank data -- stitching them together with statistical methods and distributed analysis. The goal was to unlock insights hiding in the spaces between datasets. We were right about the idea. We were just fourteen years early on the tooling.
What the Fusion Project Was
The core concept was data augmentation. Take your own data and enrich it with public datasets to get a fuller picture. Census data gave you demographics for a location. Lending Club data told you about financial behavior in that area. Pew Research surveys added information about technology adoption, media consumption, and social attitudes. The World Bank provided an international lens.
We built the system on distributed analysis technology -- essentially early versions of what would become standard data engineering practice. The idea was to make all of these datasets explorable through a single interface, with the joins and statistical matching handled behind the scenes.
The hard part wasn't getting the data. It was making the connections between datasets that didn't share common keys. Census data is organized by geography. Lending Club data has zip codes. Pew surveys have demographics. Making them talk to each other required probabilistic matching, the kind of work that software engineers trained on exact joins found deeply uncomfortable.
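The flavor of that matching problem can be sketched in a few lines. This is not the original system's code; the records, place names, and threshold below are invented for illustration, and a simple edit-distance ratio stands in for the heavier statistical models we actually used:

```python
from difflib import SequenceMatcher

# Hypothetical records: census rows keyed by official place names,
# survey rows keyed by free-text location strings. No shared key exists.
census = [
    {"place": "Naperville city, Illinois", "median_age": 39.4},
    {"place": "Aurora city, Illinois", "median_age": 33.8},
]
surveys = [
    {"location": "Naperville, IL", "broadband": 0.91},
    {"location": "Aurora IL", "broadband": 0.78},
]

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_join(left, right, threshold=0.5):
    """Pair each survey row with its best-scoring census row,
    keeping the pair only if the score clears the threshold."""
    out = []
    for s in right:
        best = max(left, key=lambda c: similarity(c["place"], s["location"]))
        score = similarity(best["place"], s["location"])
        if score >= threshold:
            out.append({**best, **s, "match_score": round(score, 2)})
    return out

for row in fuzzy_join(census, surveys):
    print(row["place"], "<-", row["location"], row["match_score"])
```

The threshold is where the discomfort lives: unlike an exact join, every choice of cutoff trades missed matches against false ones, and nothing in the data tells you which trade is right.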
In 2012, joining two messy public datasets together felt like cutting-edge research. In 2026, any junior data engineer can do it before lunch with off-the-shelf tools.
From Manual Data Fusion to AI Integration
The Fusion Project's ideas won. Its implementation became obsolete.
Today, data integration is a solved problem at the infrastructure level. Tools like Fivetran, Airbyte, and dbt handle the plumbing of pulling together hundreds of data sources. Cloud data warehouses like Snowflake and BigQuery make it trivial to query across datasets that would have required custom distributed systems in 2012. Databricks' lakehouse architecture does what our Fusion Project aspired to -- combining structured and unstructured data in one queryable system -- but at scales we couldn't have imagined.
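The pattern those warehouses commoditized is plain SQL over co-located datasets. As a toy stand-in (in-memory SQLite rather than Snowflake or BigQuery, with invented zip codes and figures), the cross-dataset question that once needed a custom system is now one join:

```python
import sqlite3

# Two small public-style tables loaded into one queryable store.
# All table names and numbers here are made up for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE census (zip TEXT, median_income INTEGER)")
con.execute("CREATE TABLE loans (zip TEXT, default_rate REAL)")
con.executemany("INSERT INTO census VALUES (?, ?)",
                [("60540", 98000), ("60505", 61000)])
con.executemany("INSERT INTO loans VALUES (?, ?)",
                [("60540", 0.03), ("60505", 0.07)])

# The fusion question: does lending behavior track demographics?
rows = con.execute("""
    SELECT c.zip, c.median_income, l.default_rate
    FROM census c JOIN loans l ON c.zip = l.zip
    ORDER BY c.median_income DESC
""").fetchall()

for zip_code, income, rate in rows:
    print(zip_code, income, rate)
```

In 2012 the equivalent query meant standing up and operating a distributed system first; today the join itself is the whole job.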
The real successor to data fusion isn't a product, though. It's the entire modern data stack. What used to require a dedicated research project is now a standard architecture pattern: ingest from multiple sources, transform in a warehouse, serve through APIs. Companies like Census (the data activation platform, not the government) and Hightouch sync enriched data directly into operational systems.
And then AI changed the game again. Large language models can now perform entity resolution -- matching "Jason Kolb" in one dataset to "J. Kolb" in another -- with accuracy that our 2012 statistical methods couldn't touch. Vector embeddings let you find semantic similarities between records that share no common fields at all. The probabilistic matching that was our biggest challenge is now a commodity capability.
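The embedding idea reduces to comparing directions in vector space. In the sketch below the vectors are hand-made toys standing in for real model output, chosen so that records describing the same person point the same way; the names and numbers are assumptions, not data:

```python
import math

# Toy embeddings: in practice these come from an embedding model.
# Records for the same entity are given nearby vectors by construction.
vectors = {
    "Jason Kolb, Naperville IL": [0.9, 0.1, 0.2],
    "J. Kolb, Naperville":       [0.85, 0.15, 0.25],
    "Jane Doe, Seattle WA":      [0.1, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = "Jason Kolb, Naperville IL"
matches = sorted(
    ((cosine(vectors[query], v), name)
     for name, v in vectors.items() if name != query),
    reverse=True,
)
best_score, best_name = matches[0]
print(best_name, round(best_score, 3))
```

Note what the records share: no key, no exact string, only meaning. That is the capability 2012-era statistical matching could not offer at any threshold.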
Public data has exploded too. Data.gov alone has over 300,000 datasets. The EU's open data portal, academic data repositories, and commercial data marketplaces like Dewey Data and Datarade make the four datasets we started with look quaint. The challenge has shifted from finding data to deciding which data actually adds signal versus noise.
The Operational AI Connection
The Fusion Project's central insight -- that combining data sources creates compound value -- is exactly what modern data infrastructure is built to support. But more data also means more governance complexity. Every dataset you integrate brings its own quality issues, bias risks, and compliance requirements. The organizations getting this right have strong data governance that treats data quality as a first-class concern, not an afterthought. Operational AI starts with trustworthy data.