saltsquare~2mo ago

AI/ML Architect/Lead

Bosnia and Herzegovina · Tuzla/Sarajevo · Lead

Architect · Construction & Real Estate

Quick Summary

Overview

Salt Square is a growing outsourcing company providing high-quality software development services to clients across a wide range of industries. Our team is composed of skilled and dedicated professionals delivering innovative solutions that meet and exceed client expectations.

Technical Tools

airflow, aws, dbt, github-actions, huggingface, jenkins, jira, langchain, python, pytorch, terraform, ab-testing, ci-cd, data-analysis, etl, machine-learning, mentoring, networking, project-management

Our team is seeking an AI/ML Architect/Lead to own the architecture, infrastructure, and delivery of an end-to-end AI/ML ecosystem. The ideal candidate is a technically deep, strategically minded engineer who has built and scaled production-grade AI/ML systems on AWS, someone who can design for long-term success while guiding implementation and delivery. This role sits at the intersection of platform engineering, data science, and applied AI, requiring hands-on expertise across the full ML lifecycle from model development through production observability.

Responsibilities

Reporting directly to the VP, Data Science & Analytics, the AI/ML Architect/Lead will serve as the technical authority for the client’s AI/ML platform. You will design and own the infrastructure that powers our machine learning and NLP capabilities, establish MLOps standards and practices, lead the deployment and monitoring of production AI/ML systems, and provide architectural guidance and technical mentorship to the AI/ML engineering team.

The following responsibilities are considered essential functions of this position and are not intended to be an exhaustive list of all duties.

AI/ML Infrastructure Architecture (AWS)
Architect, deploy, and manage scalable AWS infrastructure for AI/ML workloads including EC2, Lambda, ECS/EKS, S3, SageMaker, and related services
Design and maintain VPC networking, security group configurations, and IAM roles and policies governing all AI/ML platform components, in partnership with the infrastructure team
Define and enforce infrastructure-as-code standards and CI/CD practices for AI/ML platform components
Monitor infrastructure health, optimize compute performance, and drive cost efficiency across model training and inference workloads
Ensure high availability, fault tolerance, and disaster recovery posture for all production AI/ML systems
Manage multi-account AWS environments with IAM roles, environment-specific security boundaries (dev, test, production), and secure access patterns

Architect and own the end-to-end ML lifecycle: experimentation, training, evaluation, deployment, monitoring, and retraining pipelines
Establish and enforce MLOps best practices including model versioning, experiment tracking, reproducibility standards, and deployment automation
Design and implement model serving infrastructure for both real-time inference and batch scoring, optimizing for latency, throughput, and cost
Build pipeline orchestration frameworks for ML workflows using tools such as Airflow, Step Functions, or equivalent
Implement model monitoring and observability frameworks to detect data drift, model degradation, and production anomalies
Own CI/CD pipelines for model and infrastructure promotion across development, testing, and production environments

Architect and manage Qdrant (or equivalent vector database) deployments for semantic search, similarity retrieval, and RAG (Retrieval-Augmented Generation) applications
Design embedding pipelines that transform patient-generated text into vector representations for downstream AI/ML applications
Optimize vector index configurations for query performance, recall, and storage efficiency at scale
Integrate vector retrieval layers with LLM-based applications and NLP pipelines

Partner with the Senior Data Engineers to ensure seamless integration between the Redshift data warehouse and AI/ML feature pipelines
Design and build feature stores and feature engineering pipelines that source structured and unstructured data for model training
Establish data contracts and quality standards between data engineering and AI/ML platform layers
Build ELT/ETL patterns tailored to AI/ML workloads including incremental feature computation, backfill strategies, and schema evolution handling

Own AI/ML platform security including IAM policies, encryption at rest and in transit, network access controls, and secure model artifact storage
Ensure compliance with HIPAA, SOC 2, and applicable healthcare data privacy regulations as they apply to AI/ML systems and model outputs
Design PII anonymization and de-identification pipelines to provision safe, production-representative training data to development and test environments
Implement model governance standards including audit logging, lineage tracking, and access controls over model artifacts and inference endpoints

Serve as the technical authority and escalation point for the AI/ML engineering team, providing architectural guidance and hands-on support
Collaborate with Data Analytics, Data Engineering, Product, and Software Engineering teams to align AI/ML platform capabilities with roadmap priorities
Contribute to architecture reviews, technology evaluations, and build-vs-buy decisions across the AI/ML tooling landscape
Build and maintain architecture documentation, runbooks, and operational playbooks for all production AI/ML systems
Mentor AI/ML engineers on platform best practices, code quality standards, and production engineering principles

Qualifications

Cloud Infrastructure: 5+ years of hands-on experience architecting and managing AWS data and AI/ML infrastructure (EC2, Lambda, ECS/EKS, SageMaker, S3, IAM, VPC, CloudWatch)
MLOps & ML Lifecycle: 5+ years of experience building and operating production ML systems, including training pipelines, model serving, CI/CD automation, and monitoring
Model Deployment: Deep experience with model serving patterns (real-time inference, batch scoring, A/B testing, canary deployments) and frameworks such as SageMaker Endpoints, TorchServe, or equivalent
Vector Databases: Hands-on experience with Qdrant or equivalent vector databases (Pinecone, Weaviate, pgvector), including index design, embedding pipelines, and RAG architecture
Python: Strong Python development skills for ML pipeline engineering, including frameworks such as PyTorch, HuggingFace Transformers, LangChain, and AWS SDK (boto3)
Data Engineering: Experience integrating AI/ML platforms with cloud data warehouses (Redshift preferred) and building feature pipelines using tools such as dbt, Airflow, or Spark

Infrastructure as Code: Experience with Terraform, CloudFormation, or equivalent IaC tools for reproducible infrastructure provisioning
Source Control & CI/CD: Proficiency with Git including branching strategies, pull request workflows, and integration with CI/CD platforms (Jenkins, GitHub Actions, or equivalent)

Strong systems thinking with the ability to design AI/ML platforms for scalability, reliability, and long-term maintainability
Demonstrated experience with ML experiment tracking, model versioning, and reproducibility frameworks (MLflow, Weights & Biases, or equivalent)
Experience with NLP and large language model (LLM) applications, including fine-tuning, prompt engineering, and RAG patterns
Ability to translate business and product requirements into architectural decisions and phased implementation plans

Experience with healthcare data, HIPAA regulations, and patient data privacy requirements
Experience working with a US-based product team

Excellent verbal and written communication skills, with the ability to translate complex technical concepts for both technical and non-technical stakeholders
Proven ability to work cross-functionally and drive technical decisions collaboratively
Experience with project management and issue tracking software such as Jira