HyredData

Data Engineer (Data Pipelines & RAG)

Remote. Posted today.

A versatile Data & AI Engineer role at a fast-growing Property Tech AI company, focusing on building and maintaining data pipelines for Gen AI applications, with responsibilities spanning data modeling, AI integration, observability, and automation.

Location: Remote

Responsibilities

Automate data ingestion from diverse sources including unstructured documents, tables, charts, and drawings.
Own the chunking strategy, embedding, and indexing of data for retrieval by RAG/agent systems.
Build, test, and maintain robust ETL/ELT workflows using Spark (batch & streaming).
Define and implement logical and physical data models and schemas; develop schema mappings and data dictionaries.
Instrument data pipelines to surface real-time context into LLM prompts.
Implement prompt engineering and RAG for workflows within the real estate (RE) and construction industry vertical.
Implement monitoring, alerting, and logging for data quality, latency, and errors.
Apply access controls and data privacy safeguards (e.g., Unity Catalog, IAM).
Develop automated testing, versioning, and deployment using Azure DevOps, GitHub Actions, and Prefect/Airflow.
Maintain reproducible environments with infrastructure as code (Terraform, ARM templates).

Requirements

5 years in Data Engineering or a similar role, including 12-24 months of experience building pipelines for unstructured data extraction, document processing with OCR, cloud-native solutions, and chunking and indexing for RAG/Gen AI applications.
Proficiency in Python, dlt for ETL/ELT pipelines, DuckDB or equivalent tools, and DVC for large file management.
Solid SQL skills and experience with relational databases; familiarity with non-relational column-based databases.
Familiarity with Prefect or similar orchestration tools (e.g., Azure Data Factory).
Proficiency with Azure ecosystem and services in production.
Familiarity with RAG indexing, chunking, and storage across file types.
Strong DevOps and CI/CD experience (CircleCI / Azure DevOps).
Experience deploying ML artifacts using MLflow, Docker, or Kubernetes.