Founding Platform & Reliability Engineer
🎨 About OpenArt
OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide. We’re building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters, and stories with unprecedented speed and imagination. We believe the future of creativity is AI-native, and we're shaping that future.
About the Role
We’re looking for a Founding Platform & Reliability Engineer who can own the design, scalability, and reliability of our entire infrastructure stack end-to-end, from high-level architecture decisions to hands-on implementation, observability, and cost optimization.
This is NOT a role for traditional operators or narrow DevOps specialists. You should be comfortable working across cloud infrastructure, distributed systems, backend services, and developer tooling, making pragmatic decisions that balance product velocity, system reliability, and cost efficiency—especially in a fast-evolving, AI-native environment.
You will work closely with the founders and product engineers to design and evolve the platform that powers OpenArt, shaping key decisions such as serverless vs. containerized architecture, multi-provider AI reliability, and scaling systems to millions of users—while acting as a force multiplier for the entire engineering team.
Responsibilities
- Define and operationalize SLOs/SLIs across critical user journeys (generation, editing, payments/credits, uploads, etc.), and use them to drive prioritization (including error budgets).
- Participate in an on-call rotation and lead incident response improvements (alert quality, runbooks, escalation paths). Establish blameless postmortems and ensure action items are implemented.
- Implement reliability patterns at external boundaries, and build mechanisms for per-vendor “health” measurement and routing/fallback policies.
- Stand up end-to-end observability: structured logs, metrics, traces, and dashboards that let engineers answer “what broke” and “why now” quickly.
- Build deploy safety practices: automated rollbacks, canarying, feature-flag patterns, and reliable CI/CD gates.
- Own the direction of our infrastructure architecture, including defining when serverless is the right approach versus when we should evolve toward containerized or more managed systems, and guiding the team through those transitions as we scale.
- Build cost observability and cost-control primitives: per-request cost attribution, caching strategies, capacity planning, and budget alerts.
- Act as a senior technical voice, influencing architecture, tooling, engineering best practices, and raising the overall engineering bar.
Requirements
5+ years building and operating production systems where reliability and scaling are core.
Strong software engineering skills (you can ship production code, not just configure tools).
Cloud-native experience (AWS or GCP), ideally with serverless/event-driven systems and at least one container path (Fargate/ECS/Cloud Run/Kubernetes).
Deep knowledge of observability practices: dashboards, alerting, distributed tracing, and incident response maturity.
Ability to design resilient interactions with external dependencies (timeouts, retries/backoff/jitter, circuit breakers, idempotency).
Ability to communicate tradeoffs clearly to non-infra peers.
Ability to operate with ambiguity and define problems before solving them.
Nice to Have
Have designed an internal platform abstraction (e.g., API gateway / workflow engine / job orchestration) that enabled multiple product teams to ship faster with fewer incidents.
Have shipped concrete reliability outcomes: e.g., reduced MTTR, improved SLO attainment, lowered p95 latency, or reduced infra/unit costs.
Prior startup experience or experience owning large surface-area features.
GCP, Cloud Run, Modal, Upstash, Sentry, Amplitude, Firebase, Redis, React / Next.js, Node.js, TypeScript, Python, etc.
Location & Eligibility
Bay Area preferred (hybrid allowed)
Visa sponsorship available
We’ll consider remote
Listing Details
- Posted: March 26, 2026
- First seen: May 6, 2026
- Last seen: May 8, 2026