Data in AI

Hi all,

As we embark on Pundi’s own AI journey, let’s look into the often-overlooked third pillar of AI: data.

In the current phase of AI development, everyone talks about compute and models as the engines of AI, and rightly so. Look at how compute leader Nvidia became the world’s most valuable company, and how large language model leader OpenAI became the industry darling.

The other equally important but less glamorous engine of AI is data. The quality, quantity, and diversity of data directly determine the final outcomes of the models. A significant challenge in the AI industry is the scarcity of vertical datasets—clean, filtered, and up-to-date information specific to particular sectors.

In addition to the continued high demand for data in general, there is a significant and unmet demand for domain-specific data distinguished by human language, specific content, and audio-visual media in vertical domains, and even for country-specific sovereign data. As Jensen Huang put it to UAE Minister Omar Al Olama, “Sovereign AI brings together your culture, social wisdom, common sense, and history. Your data is unique.” This demand for data, especially domain-specific and state-invested data, is expected to surge starting in 2024.

In other words: data, data, and data.

In fact, I’ll go so far as to argue that of the two engines of AI, data and compute, data is the larger competitive differentiator for anyone who isn’t Google or Apple. Train long enough with enough GPUs and you’ll get similar results with any modern model. It’s not about the model; it’s about the data it was trained on.

James Betker, a researcher at OpenAI, makes the same point in his own words:

“In that time, I’ve trained a lot of generative models. More than anyone really has any right to train. As I’ve spent these hours observing the effects of tweaking various model configurations and hyperparameters, one thing that has struck me is the similarities between all the training runs. It’s becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. What that means is not only that they learn what it means to be a dog or a cat, but the interstitial frequencies between distributions that don’t matter, like what photos humans are likely to take or words humans commonly write down. What this manifests as is – trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Sufficiently large diffusion conv-unets produce the same images as ViT generators. AR sampling produces the same images as diffusion. This is a surprising observation! It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else. Everything else is a means to an end in efficiently delivering compute to approximating that dataset.”

Trend Research

  • Unstructured Data: A survey reveals that 69% of respondent organizations rely on unstructured data such as text, images, and audio-visual media to train models.
  • Investment Plans: More than 72% of organizations plan to invest in AI models and related technologies within the next three years.
  • Data Quality and Labeling: Among respondents, 35% consider data quality the biggest challenge. To address it, 55% of organizations use internal labeling teams, 50% employ specialized data-labeling services, and 29% utilize crowdsourcing. Organizations are expanding the scale of their data-labeling efforts, with increasing demand for specialized, vertical-domain labeled data.
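To make the data-quality challenge above a little more concrete, here is a minimal sketch, in Python, of the kind of basic screening raw text records might go through before labeling or training. The function name and thresholds are hypothetical illustrations, not anyone's production pipeline:

```python
import hashlib

def filter_text_records(records, min_words=5, max_words=2000):
    """Screen raw text records for basic quality before labeling or training.

    Keeps records that are non-empty, fall within a word-count range,
    and are not exact duplicates of earlier records.
    Thresholds are illustrative only.
    """
    seen_hashes = set()
    kept = []
    for text in records:
        text = text.strip()
        if not text:
            continue  # drop empty records
        n_words = len(text.split())
        if n_words < min_words or n_words > max_words:
            continue  # drop records that are too short or too long
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # drop exact duplicates
        seen_hashes.add(digest)
        kept.append(text)
    return kept

raw = [
    "",
    "too short",
    "a dataset is only as good as its cleaning pipeline",
    "a dataset is only as good as its cleaning pipeline",
]
print(filter_text_records(raw))
# keeps only the one well-formed, deduplicated record
```

Real pipelines layer far more on top (language detection, toxicity filters, fuzzy deduplication), which is exactly why the labeling and curation services mentioned above are in demand.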

These trends highlight the growing recognition of AI’s potential and the rising importance of data. In light of them, I believe good data will play an outsized role in the AI revolution, and that it is an opportunity startups like ours can take advantage of.