ShyftLabs | 1mo ago

Senior ML Engineer

Noida | Full-Time | Senior

Other | Machine Learning Engineer | Data

Position Overview:

We are looking for an experienced Machine Learning Engineer (5–7 years) to serve as the critical bridge between our Data Engineering and Data Science teams. While our Data Engineers own the core data infrastructure and our Data Scientists develop and validate ML models, the ML Engineer owns the full operationalization layer — turning research prototypes into reliable, monitored, production-grade systems at scale.

ShyftLabs is a growing data product company, founded in early 2020, that works primarily with Fortune 500 companies. We deliver digital solutions that help accelerate business growth across industries, with a focus on creating value through innovation.

Responsibilities

Own the full ML lifecycle end-to-end — from translating ambiguous business problems into well-defined ML problem formulations, through model development, deployment, monitoring, and retraining — with minimal guidance.

Lead the design and architecture of production-grade ML systems: define service contracts, data contracts, infrastructure patterns, and failure-handling strategies rather than only implementing them.

Build and maintain ML pipelines for batch and real-time use cases using orchestration frameworks such as Apache Airflow, Prefect, or cloud-native equivalents (Cloud Composer, AWS MWAA, Azure Data Factory).

Operationalize models developed by the Data Science team: package, containerize, version, and deploy to scalable serving infrastructure (managed endpoints, Kubernetes, serverless) with latency and cost awareness.

Implement robust MLOps practices: CI/CD for ML, automated model evaluation gates, shadow deployments, canary rollouts, and experiment tracking (MLflow, Weights & Biases, DVC, or equivalent).

Design and manage feature pipelines with strong understanding of train-serve skew, feature freshness, data leakage prevention, and feature store patterns (Feast, Tecton, cloud-native feature stores).

Build model observability systems: monitor input distribution drift, prediction drift, latency (P50/P99), and correlation to business KPIs; define and automate retraining triggers.

Design and deploy LLM-powered systems where applicable — RAG pipelines, prompt versioning, fine-tuning workflows, vector database integration, and LLMOps tooling.

Collaborate with Data Engineers on consuming data warehouse datasets and with Data Scientists on understanding model requirements; write clear ML design documents, Architecture Decision Records (ADRs), and technical specs.

Mentor junior and mid-level engineers, lead ML code reviews, and raise the engineering bar across the team.

Partner with product and business stakeholders to define success metrics before building, and communicate ML system trade-offs in non-technical terms to client leads.

Requirements

5–7 years of hands-on ML engineering experience, with at least 3 end-to-end production ML deployments that you personally owned — from problem framing through live monitoring. [Non-Negotiable]

Production-grade Python engineering: clean, modular, testable code with type hints, structured error handling, logging, and meaningful unit/integration test coverage. Notebook-only candidates will not meet the bar. [Non-Negotiable]

Hands-on MLOps: CI/CD for ML, automated evaluation pipelines, model versioning, experiment management, and reproducible workflows. Treating MLOps as an engineering default — not a post-deployment afterthought. [Non-Negotiable]

Deep expertise with at least one major cloud ML platform: AWS (SageMaker, EMR, Lambda), GCP (Vertex AI, Cloud Composer, Dataflow), or Azure (Azure ML, Azure Databricks, ADF). Architectural fluency across services — not just usage. [Non-Negotiable]

Expert SQL proficiency for complex data transformation, window functions, CTEs, and query optimization on a cloud data warehouse (BigQuery, Snowflake, Redshift, Databricks SQL, or equivalent). [Non-Negotiable]

Feature engineering expertise in production: feature stores, train-serve skew mitigation, data leakage prevention, and feature freshness trade-offs in high-throughput systems. [Non-Negotiable]

Model monitoring and observability: instrumenting input drift and prediction drift detection, P50/P99 latency tracking, alerting pipelines, and automated retraining trigger logic. [Non-Negotiable]

Containerization with Docker and experience deploying models via Kubernetes (EKS, GKE, AKS), managed endpoints, or serverless runtimes. Understanding of resource allocation, autoscaling, and rolling deployments. [Non-Negotiable]

Architectural leadership: ability to write ML design documents and Architecture Decision Records (ADRs), lead design reviews, and communicate system trade-offs to both technical teams and non-technical client stakeholders. [Non-Negotiable]

Deep ML fundamentals: model selection, bias-variance trade-offs, evaluation design, calibration, and reasoning about failure modes — not just framework-level usage. [Non-Negotiable]

LLMOps and generative AI production experience: RAG pipeline design, vector database integration (Pinecone, Weaviate, Chroma, pgvector), prompt versioning, fine-tuning workflows (LoRA, QLoRA), and LLMOps tooling such as LangSmith, LangGraph, or equivalent. [Good to Have]

Real-time feature engineering: Kafka, Apache Flink, Spark Streaming, or cloud-native streaming services (Kinesis, Pub/Sub, Event Hubs) for sub-second feature computation and serving. [Good to Have]

Distributed training frameworks: Horovod, Ray Train, DeepSpeed, or cloud-native distributed training for large-scale model training jobs. [Good to Have]

dbt, Dataform, or an equivalent transformation framework for ML feature pipelines on cloud data warehouses. [Good to Have]

Infrastructure as Code (Terraform, Pulumi, or CloudFormation) for provisioning and managing ML infrastructure. [Good to Have]

Model optimization techniques: quantization, knowledge distillation, ONNX export, TensorRT, and pruning for cost-efficient high-volume serving. [Good to Have]

Responsible AI and AI safety practices: bias evaluation, fairness auditing, explainability (SHAP, LIME, Captum), and governance frameworks for production ML systems. [Good to Have]

Cloud ML certification: AWS ML Specialty, GCP Professional ML Engineer, or Azure AI Engineer Associate. [Good to Have]