Thinkahead · 4 months ago

GenAI Data ETL Engineer

Hyderabad · Mid-level


Overview

AHEAD builds platforms for digital business. By weaving together advances in cloud infrastructure, automation and analytics, and software delivery, we help enterprises deliver on the promise of digital transformation.

At AHEAD, we prioritize creating a culture of belonging, where all perspectives and voices are represented, valued, respected, and heard. We create spaces to empower everyone to speak up, make change, and drive the culture at AHEAD. 

We are an equal opportunity employer, and do not discriminate based on an individual's race, national origin, color, gender, gender identity, gender expression, sexual orientation, religion, age, disability, marital status, or any other protected characteristic under applicable law, whether actual or perceived. 

We embrace all candidates that will contribute to the diversification and enrichment of ideas and perspectives at AHEAD. 

We are seeking a GenAI Data Engineer – Data Integration & Retrieval to design, build, and operate the data pipelines that power our LLM‑based applications, agents, and analytics. This role sits at the intersection of data engineering and generative AI, with a focus on turning messy, distributed enterprise data into high‑quality context for retrieval‑augmented generation (RAG), copilots, and intelligent automation.
You will partner closely with the Platform and Use Cases Teams, GenAI/ML engineers, and business stakeholders to deliver robust, observable, and future‑proof data flows that keep us ahead of where the industry is going.
GenAI / RAG Data Pipeline Development
  • Design, develop, and maintain ETL/ELT pipelines that ingest structured and unstructured data (databases, documents, tickets, logs, wikis, APIs, SaaS apps) into vector stores, search indexes, and feature tables that power GenAI use cases.
  • Implement document and record transformations including chunking, metadata enrichment, normalization, deduplication, and PII redaction for safe and high‑quality LLM context.
  • Build and evolve semantic data models that reflect how LLMs consume context (e.g., knowledge domains, entities, relationships, access controls) rather than only traditional star schemas.
  • Optimize pipelines for performance, reliability, and cost (incremental loads, CDC, partitioning, caching, adaptive refresh strategies) in support of low‑latency GenAI experiences.
  • Implement data quality checks and evaluations tailored to GenAI workloads (e.g., coverage of knowledge domains, freshness, retrieval accuracy, hallucination risk signals).
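The chunking, metadata enrichment, and PII redaction steps described above can be sketched roughly as follows. This is a minimal illustration, not a prescribed design: the chunk sizes, the email-only redaction regex, and the record fields are all invented for the example.

```python
import re

# Illustrative document-prep step for RAG ingestion: redact simple PII
# patterns (emails only, for brevity), split text into overlapping
# chunks, and wrap each chunk with metadata for indexing.

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_pii(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def prepare(doc_id: str, source: str, text: str) -> list[dict]:
    """Redact, chunk, and attach metadata; redaction runs before chunking."""
    clean = redact_pii(text)
    return [
        {"doc_id": doc_id, "source": source, "chunk_index": i, "text": c}
        for i, c in enumerate(chunk(clean))
    ]

records = prepare("kb-001", "wiki", "Contact alice@example.com for details. " * 20)
```

Real pipelines would typically chunk on token or sentence boundaries and apply a policy-driven redaction service rather than a single regex; the shape of the flow is the point here.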
LLM & Integration Engineering
  • Design and implement system‑to‑system integrations that consolidate context for GenAI from SaaS platforms and internal systems (CRM, ITSM/ticketing, ERP, knowledge bases, collaboration tools).
  • Work with GenAI engineers to wire data pipelines into LLM orchestration flows (e.g., RAG, tools/agents, workflows), ensuring clean interfaces and robust contracts.
  • Build and maintain prompt/response logging, retrieval traces, and feedback capture to enable experimentation, evaluation, and continuous improvement.
  • Ensure integrations and pipelines are secure, auditable, and compliant, including access controls, row/column‑level permissions, and policy‑driven redaction for LLM consumption.
  • Collaborate with application and platform teams to define SLAs, schemas, and APIs for data contracts that support GenAI services.
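The prompt/response and retrieval-trace logging mentioned above might look something like this minimal sketch, which appends one JSON line per interaction so traces can be replayed for evaluation. The field names and schema are assumptions for illustration, not a fixed contract.

```python
import io
import json
import time

# Illustrative retrieval-trace logger for a RAG flow: record the query,
# the ids of the retrieved chunks, and the model response as one JSON
# line, so experiments can be compared and failures replayed later.

def log_trace(sink, query: str, retrieved_ids: list[str], response: str) -> dict:
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved_ids": retrieved_ids,
        "response": response,
    }
    sink.write(json.dumps(record) + "\n")
    return record

# Usage: in production the sink would be a log stream or table, not a buffer.
buf = io.StringIO()
log_trace(buf, "reset my VPN", ["kb-001:2", "kb-007:0"], "Open the VPN client and follow the reset steps.")
parsed = json.loads(buf.getvalue().strip())
```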
Operations, Monitoring, and Documentation
  • Set up scheduling, orchestration, and workflow management for GenAI data pipelines (e.g., Airflow, Prefect, Dagster, cloud‑native orchestrators).
  • Implement observability for data and retrieval: pipeline health, data freshness, vector store/index stats, retrieval coverage, and failure modes that impact LLM behavior.
  • Diagnose and resolve pipeline and integration issues, performing root‑cause analysis across data sources, transformations, and downstream GenAI applications.
  • Maintain clear documentation of data flows, lineage, schemas, mappings, and runbooks, with a focus on how they support specific GenAI use cases.
  • Partner with data governance and architecture to enforce naming standards, lineage, and metadata practices that enable safe and explainable GenAI.
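A data-freshness check of the kind listed under observability could be sketched as below: compare each source's last successful load time against its SLA and flag stale feeds that would degrade retrieval quality. The SLA thresholds and source names are invented for the example.

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness check: sources whose last successful load is
# older than their SLA are returned for alerting. Unknown sources fall
# back to a default 24-hour SLA.

SLAS = {"tickets": timedelta(hours=1), "wiki": timedelta(hours=24)}

def stale_sources(last_loaded: dict, now: datetime) -> list[str]:
    """Return sources whose last load breaches their freshness SLA."""
    return sorted(
        src for src, ts in last_loaded.items()
        if now - ts > SLAS.get(src, timedelta(hours=24))
    )

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
loads = {
    "tickets": now - timedelta(hours=3),  # breaches the 1h SLA
    "wiki": now - timedelta(hours=2),     # within the 24h SLA
}
alerts = stale_sources(loads, now)
```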
Minimum Required
  • Bachelor’s degree in Computer Science, Information Systems, or similar.
  • 5+ years of experience in data engineering, ETL/ELT development, or data integration roles.
  • Strong SQL skills (complex joins, window functions, performance tuning) across analytical and operational workloads.
  • Hands‑on experience with at least one modern data pipeline / transformation framework (e.g., dbt, Airflow/Prefect/Dagster, cloud‑native ETL, or custom Python/SQL pipelines).
  • Experience building and maintaining data pipelines on cloud data platforms (e.g., Snowflake, BigQuery, Redshift, Synapse, or equivalent).
  • Proficiency in Python (preferred) or another programming language commonly used in data workflows (e.g., Java, Scala), including working with APIs and JSON.
  • Experience working with REST APIs, webhooks, JSON, CSV, and other common integration formats.
  • Solid understanding of data modeling and integration concepts (relational modeling, denormalization, CDC, event‑driven or log‑based ingestion).
  • Familiarity with version control (Git) and standard software engineering practices (code review, branching strategies, CI/CD basics).
  • Demonstrated exposure to LLMs / GenAI, whether through personal projects, pilots, or production systems.
  • Experience with LLM‑centric data patterns, such as retrieval‑augmented generation (RAG), semantic search, or document intelligence.
  • Hands‑on experience with vector databases or search technologies (e.g., Pinecone, Weaviate, pgvector, OpenSearch, Elasticsearch, Vespa).
  • Experience with workflow orchestration tools (e.g., Apache Airflow, Prefect, Dagster, Azure Data Factory, AWS Glue workflows).
  • Exposure to message‑based or streaming integrations (e.g., Kafka, Kinesis, Pub/Sub, EventBridge) for near real‑time data and event feeds into GenAI systems.
  • Experience in data quality and observability (e.g., Great Expectations, Monte Carlo, Soda, or custom checks/alerts).
  • Knowledge of at least one cloud platform (AWS, Azure, GCP) and its data/AI services (e.g., object storage, serverless compute, managed warehouses, managed LLMs or embeddings).
  • Familiarity with security and compliance concepts: data classification, encryption, access controls, secrets management, and safe handling of PII/regulated data.
  • Experience partnering with ML/GenAI teams, including feature pipelines, evaluation datasets, or MLOps practices.
  • Experience with BI / analytics tools (e.g., Power BI, Tableau, Looker) and understanding how analytical needs intersect with GenAI use cases.
  • Background with data catalogs, lineage tools, or knowledge graphs that help organize enterprise knowledge for GenAI.
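For candidates less familiar with the vector-search pattern behind tools like Pinecone, pgvector, or OpenSearch, a toy in-memory version of the core idea looks like this. The embeddings and document ids are hand-made for illustration; real systems use learned embeddings and approximate-nearest-neighbour indexes.

```python
import math

# Toy nearest-neighbour lookup over a tiny "index" of hand-made vectors,
# ranking documents by cosine similarity to a query vector.

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

INDEX = {
    "vpn-howto": [0.9, 0.1, 0.0],
    "expense-policy": [0.0, 0.2, 0.9],
    "vpn-troubleshoot": [0.8, 0.3, 0.1],
}

def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    """Return ids of the k documents most similar to the query vector."""
    ranked = sorted(
        ((doc, cosine(query_vec, vec)) for doc, vec in INDEX.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]

hits = top_k([1.0, 0.2, 0.0])
```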
Listing Details

    Posted: December 8, 2025
    First seen: March 26, 2026
    Last seen: April 22, 2026

    Thinkahead
    Employees: 5
    Founded: 2020
