The Future Of AI Training Data Is Human. The Question Is How

Two months ago, Wired ran an account of the working conditions at the data-labeling startup Mercor. In it, the author described a culture of confusion, stress, and incompetence, as contractors competed for work that had to be completed under near-impossible deadlines. And despite these conditions, Mercor might be one of the better data-labeling companies; employees at the Kenyan company Sama were paid as little as $1.50 per hour to analyze data from Meta’s smartglasses, which included images of nudity.

AI is only as good as the metadata that powers it, and if the people creating it are laboring under poor conditions, they’re probably not doing their best work or behaving as they would in a natural environment. The world-building platform VLGE and data firm Protege imagine something different and better. They recently announced a partnership that will see VLGE provide data from its spatial worlds to help better understand behavioral signals, including movement trajectories, hesitation loops, exploration patterns, object interactions, spatial decision making, and contextual commerce behavior.

“Those data sets are just super biased because they're only as good as however you ask people to do it... versus a data set like VLGE where people are kind of going and pursuing their own goals. Everyone will take that in a slightly different way. So it's more telling about human behavior and less biased overall,” says Grant Murphy-Herndon, the General Manager of Protege.

“AI systems need to understand not only what humans say, but how humans build, move, hesitate, explore, compare, and decide within environments,” adds Evelyn Mora, founder and CEO of VLGE. “The future of human intelligence will come from scalable living behavioral systems and spatially contextualized human interaction.”

As more AI use cases move from two dimensions to three, this type of data will be critical for people building real-world applications. A lot of money and attention have been focused on using this data to train robots, but there are plenty of applications available today, including designing store layouts and shopping experiences.

“People are like, hey, we don’t know why, but this rack sells so much. We want to figure out why it sells so much? How can we maximize and tap into that?” says Mora. Most stores, of course, already track and analyze customer behavior in person, but capturing that experience in a spatial world can reveal far more. For instance, Mora points to a “hesitation score” that can be captured in a virtual experience but is much harder to discern by just pointing cameras at people in a store or taking surveys.

This type of data collection will likely result in better shopping experiences or safer robots, but how will people feel about providing this type of information and feedback? On one hand, we consent to this whenever we enter many stores; cameras might be there for security at first, but they’re also collecting a trove of data that is being fed back to headquarters. Every time we interact with a robot now, that data is also being collected and used to build additional training models. We are simply swimming in a sea of data already; this is just another step forward.

Mora points out that everyone using VLGE consents to having their data captured and shared. “We want to build a human-centered and human-nurtured or sustainable future of AI,” she says. “We focus on making sure the individual’s rights and data are protected. Otherwise, it is just like social media all over again, where you sign up to use a platform for free but in exchange, all of your data is collected and monetized and you might not see a benefit.”

As we move towards needing more spatial data, the way it is collected will change. While the conditions for humans working on data labeling aren’t great, they are at least paid something. In the future, smartglasses could track everything from walking patterns to eye movements, with all that data being fed back to train models and inform designers and decision makers. Ideally, people would be compensated on some level for this, even if only through hardware subsidies that make the devices more affordable.

Mora frames the stakes in the broadest possible terms. The “oxygen and fuel for the success and longevity and true expansion of AI,” she says, “will be human data” — data that “requires a very deep understanding of humans.” If she's right, then the question isn't whether human behavior becomes the raw material for the next era of AI; it's whether the people generating it are treated as labor to be minimized or as participants to be respected. The companies betting on the latter are making a wager that ethics and quality point in the same direction.
