Site Reliability Engineer (AI Forms Platform)
Job Description
Job Summary
We are seeking a Site Reliability Engineer (SRE) to build and maintain the production infrastructure for a new, mission-critical forms engine. You will be responsible for ensuring high availability, implementing SOC-aligned security controls, and managing the CI/CD pipelines that enable rapid iteration. This is a builder role where you will define the architecture for a new product line. We expect you to be an AI-augmented engineer, utilizing modern AI tools to automate infrastructure coding (IaC), troubleshoot incidents faster, and optimize system performance.
Responsibilities
Infrastructure as Code: Architect and deploy secure, scalable infrastructure using Terraform, CloudFormation, or similar tools to support the new Forms Platform.
Availability & Uptime: Ensure the platform meets strict SLA requirements for enterprise clients, minimizing downtime and "P1 incidents".
Observability: Implement comprehensive monitoring, logging, and alerting (Datadog, New Relic, etc.) to provide deep visibility into AI model performance and system health.
Security & Compliance: Design architecture that aligns with SOC standards and ensures proper handling of PII/PHI data and audit trails for model outputs.
Release Engineering: Build and maintain efficient CI/CD pipelines to support the "tapering" of legacy systems and the rapid deployment of new features.
Incident Response: Lead incident response efforts for the Forms Platform and conduct post-mortems to drive continuous improvement.
Automation: Aggressively automate manual operations tasks using scripting (Python/Go) and AI tools to reduce toil.
Qualifications
Bachelor’s degree in Computer Science, Computer Engineering, or a related field.
3+ years of SRE or DevOps experience, specifically in high-availability production environments.
Cloud Proficiency: Deep expertise in the AWS or Azure ecosystem, including container orchestration (Kubernetes/Docker).
Security Mindset: Experience implementing security best practices (SOC2, HIPAA) in a cloud environment.
Scripting: Proficiency in Python, Go, or Bash for automation.
Agile/Scrum: 1 to 3 years of experience with scrum/agile development methodologies.
AI Adaptability: Willingness and ability to use AI/LLMs to accelerate infrastructure development and debugging.
Communication: Excellent verbal and written communication skills to document architecture and incident reports.