Data Infrastructure Engineer
Job Description:
Delta Labs uses AI to simulate and predict consumer behaviour at scale. We build Elaiia, a simulation engine that generates AI Twins: intelligent synthetic agents that mirror real consumer populations. Our clients use Elaiia to test decisions before committing to them: pricing strategies, product launches, campaign messaging, channel allocation. We replace surveys, focus groups, and intuition with simulation-based evidence.
We’re a small, focused team and we intend to stay that way. We give people ownership, trust, and the autonomy to do their best work. We work with urgency and expect new team members to match our pace. Elaiia is live, paying customers use it daily, and the problems ahead are about scaling what works — not figuring out if it works.
What you’ll do
• Own the data layer that feeds Elaiia’s simulation engine — from raw ingestion through to the structured, enriched datasets the engine consumes
• Design scalable ingestion for heterogeneous sources: survey data, CRM exports, panel data, census data, and public datasets. Formats vary, quality varies, and that’s your problem to solve
• Build enrichment pipelines that join behavioural signals with demographic profiles to produce complete, simulation-ready records
• Handle statistical imputation where data is incomplete — you’ll work with longitudinal and cross-sectional datasets where missingness is the norm, not the exception
• Build and maintain the embedding infrastructure and vector stores that the simulation engine retrieves from — you own storage, indexing, and performance; the simulation team owns retrieval logic
• Ensure data quality through monitoring, validation, and automated testing. Schema drift, stale sources, and silent failures are your enemies (a flavour of this kind of check is sketched just after this list)
• Onboard new data sources as clients and use cases expand — scoping what’s available, assessing quality, and integrating it into the pipeline
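To give a flavour of the data-quality work above, here is a minimal sketch of a schema-drift check, assuming incoming batches arrive as pandas DataFrames. The column names, dtypes, and expected schema are invented for illustration, not Elaiia’s actual data contract:

# Minimal sketch: compare an incoming batch's columns and dtypes against
# the schema we expect, and fail loudly instead of ingesting bad data
# silently. All names here are hypothetical.
import pandas as pd

EXPECTED_SCHEMA = {          # illustrative contract for one source
    "respondent_id": "int64",
    "age": "int64",
    "region": "object",
    "spend_last_30d": "float64",
}

def check_schema(batch: pd.DataFrame) -> list[str]:
    """Return human-readable schema violations (empty list = clean batch)."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in batch.columns:
            problems.append(f"missing column: {col}")
        elif str(batch[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {batch[col].dtype}")
    for col in batch.columns:
        if col not in EXPECTED_SCHEMA:
            problems.append(f"unexpected column: {col}")  # possible drift
    return problems

# e.g. a batch where 'age' arrived as strings and a column went missing
batch = pd.DataFrame({"respondent_id": [1, 2], "age": ["34", "51"], "region": ["UK", "DE"]})
for problem in check_schema(batch):
    print(problem)

In production this logic would live inside monitoring and automated tests rather than a one-off script, but the shape of the check is the same.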
What you need
• Experience building ETL/ELT pipelines for heterogeneous, messy, real-world data — different formats, different schemas, different levels of quality
• Strong SQL skills and experience designing data models for analytical workloads
• Familiarity with statistical imputation methods and comfort working with longitudinal and cross-sectional datasets (a small example of the kind of imputation involved follows this list)
• Solid Python skills for pipeline development and data processing
• Experience with Azure data services (Data Factory, Blob Storage, Synapse, or similar)
• You treat data quality as a first-class concern — monitoring, validation, and documentation are not afterthoughts
• You use AI tools as a core part of your development workflow and consider AI-assisted engineering the standard, not optional
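As a small illustration of the imputation point above, here is a sketch of one everyday move on a longitudinal panel: within-person interpolation across survey waves. The table and column names are invented for the example, assuming respondents are observed repeatedly:

# Minimal sketch of within-person imputation on a hypothetical panel,
# where each respondent is observed across several survey waves and
# missingness is routine. All names are illustrative.
import pandas as pd

panel = pd.DataFrame({
    "respondent_id": [1, 1, 1, 2, 2, 2],
    "wave":          [1, 2, 3, 1, 2, 3],
    "income":        [40_000, None, 42_000, None, 55_000, 56_000],
})

# Flag imputed values so downstream consumers can weigh them accordingly
panel["income_was_imputed"] = panel["income"].isna()

# Longitudinal structure lets us borrow a respondent's own trajectory:
# interpolate within each person across waves, filling edge gaps too
panel["income_imputed"] = (
    panel.sort_values("wave")
         .groupby("respondent_id")["income"]
         .transform(lambda s: s.interpolate(limit_direction="both"))
)
print(panel)

Real pipelines reach for richer methods where the missingness pattern demands it, but the principle of imputing transparently and flagging what was filled carries over.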
Nice to have
• Experience working with survey, panel, or market research data
• Background in census, demographic, or behavioural datasets
• Experience building or maintaining embedding pipelines and vector databases
• Familiarity with distributed processing frameworks (Spark, Dask) for larger-scale workloads