
D2 Product Preview: The Multimodal Shift - Rethinking Data Infra for the Age of Vision, A/V & LLMs
As AI applications increasingly combine vision, audio, video, and LLMs, teams are drowning in fragmented infrastructure. A typical production AI workflow requires stitching together 5+ services: cloud storage for raw files, a separate metadata database, an orchestrator like Airflow, embedding services, vector databases for similarity search, and hundreds of lines of glue code to keep everything in sync. Every time a model improves or business requirements change, you're rewriting pipelines and re-processing entire datasets. This isn't building AI - it's maintaining infrastructure.
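To make the "glue code" concrete, here is a sketch of what a single ingest step often looks like in that fragmented setup. The bucket name, table schema, DAG id, endpoints, and credentials are all illustrative, and this is only the happy path, with no retries, backfills, or consistency handling:

```python
# Typical "glue" for ingesting one video in a fragmented stack (happy path only).
# Bucket, schema, DAG id, and endpoints are illustrative placeholders.
import boto3
import psycopg2
import requests

def ingest_video(local_path: str, key: str) -> None:
    # 1. Raw file goes to object storage.
    boto3.client("s3").upload_file(local_path, "raw-videos", key)

    # 2. Metadata row goes to a separate relational database.
    conn = psycopg2.connect("dbname=media user=etl")
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO videos (s3_key, status) VALUES (%s, %s)",
            (key, "uploaded"),
        )
    conn.close()

    # 3. An orchestrator run is triggered to extract audio, transcribe,
    #    embed, and upsert into the vector database (Airflow REST API).
    requests.post(
        "http://airflow:8080/api/v1/dags/video_pipeline/dagRuns",
        json={"conf": {"s3_key": key}},
        auth=("airflow", "airflow"),
        timeout=10,
    ).raise_for_status()
```

Each of these systems can drift out of sync with the others, and a model upgrade usually means editing the DAG and reprocessing everything it has already produced.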
In this session, you'll see how a declarative approach to AI data infrastructure eliminates the glue code and maintenance burden. Instead of writing pipelines, you define transformations as computed columns and tables.
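To make the computed-column idea concrete, here is a toy, self-contained sketch (not any product's actual API): a column is declared once as a function of other columns, and the table evaluates it for every existing and future row, so there is no separate pipeline to maintain.

```python
# Toy illustration of the computed-column idea (not a real product API).
class Table:
    def __init__(self, columns):
        self.columns = list(columns)   # base column names
        self.computed = {}             # name -> (fn, input column names)
        self.rows = []                 # list of dicts

    def add_computed_column(self, name, fn, inputs):
        """Declare `name` as fn(*inputs); backfill all existing rows."""
        self.computed[name] = (fn, inputs)
        for row in self.rows:
            row[name] = fn(*(row[c] for c in inputs))

    def insert(self, **values):
        """Insert a row; every computed column is evaluated automatically."""
        row = dict(values)
        for name, (fn, inputs) in self.computed.items():
            row[name] = fn(*(row[c] for c in inputs))
        self.rows.append(row)

# Stub transformations standing in for real audio extraction and ASR.
def extract_audio(video_path):
    return video_path.replace(".mp4", ".wav")

def transcribe(audio_path):
    return f"<transcript of {audio_path}>"

# Declare the transformations once; every insert picks them up automatically.
videos = Table(columns=["video_path"])
videos.add_computed_column("audio_path", extract_audio, ["video_path"])
videos.add_computed_column("transcript", transcribe, ["audio_path"])
videos.insert(video_path="talk.mp4")
print(videos.rows)
```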
We'll then walk through a real multimodal workflow: starting with raw video, extracting the audio track, transcribing it with Whisper, generating embeddings with CLIP, and building semantic search on top. Finally, we'll show what happens when you update upstream data or add new transformations.
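For reference, here is roughly what that same workflow looks like as plain imperative Python using common open-source tools (ffmpeg, openai-whisper, and the CLIP model shipped with sentence-transformers). File names, model choices, and the example query are illustrative:

```python
# Imperative version of the demo workflow using common open-source tools.
import subprocess
from pathlib import Path

import numpy as np
import whisper                                          # pip install openai-whisper
from PIL import Image
from sentence_transformers import SentenceTransformer   # pip install sentence-transformers

# 1. Extract the audio track and one frame per second from the raw video.
Path("frames").mkdir(exist_ok=True)
subprocess.run(["ffmpeg", "-y", "-i", "talk.mp4", "-vn", "-ar", "16000", "-ac", "1", "talk.wav"], check=True)
subprocess.run(["ffmpeg", "-y", "-i", "talk.mp4", "-vf", "fps=1", "frames/%05d.jpg"], check=True)

# 2. Transcribe the audio with Whisper.
asr = whisper.load_model("base")
transcript = asr.transcribe("talk.wav")["text"]

# 3. Generate CLIP embeddings for each extracted frame.
clip = SentenceTransformer("clip-ViT-B-32")
frames = sorted(Path("frames").glob("*.jpg"))
frame_vecs = clip.encode([Image.open(f) for f in frames], normalize_embeddings=True)

# 4. Semantic search: embed a text query and rank frames by cosine similarity.
query = clip.encode(["a slide with an architecture diagram"], normalize_embeddings=True)[0]
scores = frame_vecs @ query
best = int(np.argmax(scores))
print(transcript[:200])
print(f"best match: {frames[best].name} at ~{best}s (score {scores[best]:.3f})")
```

Every step above is also a step that has to be re-run by hand when the source video changes or a better model comes along, which is exactly the contrast the session draws with the declarative version.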
This session is most relevant for ML Engineers, AI App Developers, and technical leaders who are:
- Building production AI applications that combine images, video, audio, or documents with LLMs
- Currently managing multiple services (S3 + Postgres + Airflow + Vector DB + custom scripts)
- Hitting scaling pain points: slow iteration, complex model updates, or lack of reproducibility
- Looking to reduce infrastructure complexity without sacrificing capability