We are an IT and Engineering staffing agency recruiting a Senior Site Reliability Engineer for one of our clients. This client is a leading manufacturer of mission-critical components and precision systems used by global OEMs in the aerospace, turbine engine, marine, and defense industries. Their technology powers next-generation aircraft, space vehicles, submarines, and ground-based platforms. Its manufacturing and R&D facilities depend on high-availability compute clusters, secure cloud services, and complex data pipelines, so the organization’s engineering and production teams rely heavily on robust, scalable infrastructure.

As they continue to expand digital capabilities across engineering, ERP systems, and real-time monitoring of production and test environments, the Site Reliability Engineering (SRE) team is vital to maintaining operational uptime, automation, and security across all platforms.

Typical Duties and Responsibilities

We are seeking a Senior Site Reliability Engineer (SRE) to lead system performance monitoring, cloud infrastructure optimization, CI/CD automation, and operational resilience efforts. This role requires deep experience in distributed systems, infrastructure-as-code (IaC), and proactive incident management across DevOps and production systems.

Key responsibilities include:

  • Design, implement, and maintain scalable and reliable infrastructure for high-performance computing (HPC), test automation, and digital twins used in precision engineering workflows
  • Support and optimize CI/CD pipelines for software tools that interface with design, simulation, and manufacturing systems
  • Build observability platforms that include real-time monitoring, logging, and alerting with tools such as Prometheus, Grafana, ELK Stack, and Datadog
  • Develop and manage Infrastructure-as-Code (IaC) templates using Terraform, Ansible, Puppet, or Chef
  • Ensure secure and compliant deployment of cloud workloads across AWS, Azure, or hybrid on-premises environments, supporting defense-related requirements
  • Automate system health checks, failover testing, and response playbooks using Python, Go, shell scripting, or PowerShell
  • Collaborate with security teams to enforce compliance with NIST SP 800-171, CMMC, DFARS, and ITAR requirements for data protection
  • Own availability, latency, change management, and capacity planning for mission-critical internal and external services
  • Lead postmortems, root cause analyses, and continuous improvement cycles after high-severity incidents
  • Support container orchestration and microservices environments using Kubernetes (K8s) and Docker

Education

  • Bachelor’s degree in Computer Science, Software Engineering, Information Systems, or a related technical field
  • Master’s degree or equivalent military systems experience is a plus

Required Skills and Experience

  • 10+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles in complex production environments
  • Proficiency in modern infrastructure platforms including:
    • AWS (EC2, Lambda, VPC, CloudFormation)
    • Azure (Azure DevOps, AKS, Virtual Machine Scale Sets)
    • Kubernetes, Docker, Helm
  • Strong scripting and automation skills using Python, Bash, Go, or PowerShell
  • Experience managing Linux (RHEL, Ubuntu) and Windows Server infrastructure at scale
  • Expertise with observability and performance monitoring platforms:
    • Grafana, Prometheus, New Relic, Datadog, Splunk
    • ELK Stack, Fluentd, Graylog
  • Familiarity with CI/CD tools:
    • Jenkins, GitLab CI, CircleCI, ArgoCD, Azure Pipelines
  • Deep understanding of high availability (HA), load balancing, horizontal scaling, and disaster recovery strategies
  • Ability to collaborate with cross-functional teams including mechanical, electrical, and software engineers in a manufacturing environment

Preferred Qualifications

  • Certifications such as:
    • AWS Certified DevOps Engineer – Professional
    • Certified Kubernetes Administrator (CKA)
    • Microsoft Certified: Azure Solutions Architect Expert
    • HashiCorp Certified: Terraform Associate
  • Experience working in or supporting IT systems in defense, aerospace, or manufacturing organizations
  • Familiarity with regulatory compliance frameworks including:
    • CMMC Level 2 or 3
    • NIST SP 800-171 / SP 800-53
    • ISO/IEC 27001
    • ITAR / DFARS compliance
  • Experience integrating with or supporting PLM, ERP, MES, or digital twin platforms in production settings
  • Background in edge computing or real-time systems for aerospace or turbine testing is a strong plus