May 8, 2024

Datology: Data curation is the missing piece of the AI strategy

Viviana Faga

Astasia Myers

Tobi Coker

What are the ingredients of a successful AI product? There are three: talent, compute, and data. While the war for talent and the thirst for compute have been widely publicized, data’s importance has been largely overlooked.

The warning signs are beginning to show. Increasingly, the limiting factor in training AI models is the availability of data. By some estimates, we could run out of high-quality data to train models on by 2026. The cost of running AI at scale has become a stark reality for most startups. Outside of inference, the largest expense for these businesses is the cost of training models, which requires data. A standard narrative has been that the bigger the model, the more performant it is, and that better performance means higher customer satisfaction.

However, models are only as good as the data they are trained on. Businesses can’t have an AI strategy without a data strategy.

Data as the foundation

The Datology team said it best: models are what they eat. Models are a reflection of the data used to train them. Yet determining which data are vital is a daunting task. To date, this process has been manual. Companies send their data to a data labeling service that recruits subject-matter experts to label data by hand. Then the companies must explore large volumes of labeled data and select which data to use for training. Automating this process is difficult, but it represents the most critical challenge to AI performance to date.

Less data, better performance

The vast majority of the tech industry has operated under the mindset that more data equates to better performance. What if that isn’t the case? What if, instead, better data equates to better performance? This implies that not all data are created equal. This counter-positioning narrative is exactly the type of business opportunity that gets us excited at Felicis (similar to the MotherDuck team’s approach in the database market).

This proposition has intrigued us since we first met Ari, Bogdan, and Matthew. Their hypothesis, and the underpinning of DatologyAI, is that one can achieve more efficient pre-training and training of models by selectively removing less important data. This can lower training costs, increase operational efficiency, and improve model performance.

AI’s microprocessor moment

Intel pioneered the microprocessor. Their technical and manufacturing innovations significantly reduced the cost of developing a microprocessor, which had ramifications for every consumer electronic product. Within 20 years, seemingly every product, from the refrigerator in your kitchen to the computer in your office, was powered by a microprocessor.

What are the possibilities if it becomes cheap enough for every company to train and deploy their own AI models?

When we imagine a world where every company has a fine-tuned model tailored to its specific needs, we expect data curation to be fully automated and powered by a platform that employs advanced algorithms and ML techniques to streamline AI data curation and data management. This not only reduces the manual workload but also enhances the overall quality of the curated data. By automating repetitive tasks, AI developers can focus on refining their models and addressing more complex challenges. This is the kind of future Datology enables.

Creating value

Data curation has a tangible ROI. Companies can save significant time spent working with manual data labelers, improve models' accuracy and scalability, and realize substantial cost savings.

Automated data curation is a game-changer for AI tech stacks and offers a streamlined approach to managing and optimizing data for model training. As the AI landscape continues to evolve, organizations that leverage Datology’s tech will be positioned to innovate faster, make informed decisions, and stay ahead of the competition.

The Datology team has stellar experience in AI data and infrastructure, which makes them uniquely positioned. Ari received Outstanding Paper awards at NeurIPS and ICLR for his groundbreaking research on data pruning and deduplication; Matthew was the head of data research at MosaicML; and Bogdan built language and search infrastructure at Twitter. We are incredibly excited to work with Ari, Bogdan, Matthew, and the rest of the Datology team to build toward a future where automated data curation unlocks the full potential of AI.

Models are about to eat so much better.

If you want to learn more about Datology, you can sign up for the waitlist here. If you’re interested in joining the team, see the open roles here.