Are you excited to work alongside talented and dedicated individuals in a fast-paced, innovation-driven environment? Join our Infrastructure team as a Site Reliability Engineer (SRE) and help shape the backbone of our engineering systems through automation, observability, and industry best practices.
What You’ll Do:
Collaborate with a passionate team to guide and enhance the infrastructure that supports our engineering organization.
Champion automation and observability, enabling reliable, scalable services and improving the maturity of our systems.
Identify and implement software improvements through your expertise in scalable system design, complexity analysis, and software development.
Build and maintain a world-class APM and observability platform to proactively detect and resolve performance deviations.
Design and contribute to our cloud architecture to ensure reliability, scalability, and developer-friendly systems for application and platform teams.
Define and implement CI/CD strategies across virtual machines, containers, and serverless infrastructures.
Elevate the developer experience by building intuitive APIs and integrating best-in-class infrastructure tools.
Lead the adoption of SRE (DevOps) best practices across engineering teams and collaborate with architects and developers to deliver robust and reliable services.
Drive automation and orchestration of SaaS platforms and technical solutions with a focus on creativity, scalability, and operational excellence.
Contribute to and lead the design, development, and maintenance of infrastructure on AWS or GCP.
Embrace an agile approach, managing multiple projects simultaneously while ensuring quality and attention to detail.
Participate in on-call rotations, proactively managing incidents and ensuring system stability and availability.
Investigate operational issues and collaborate across teams to continuously improve reliability and support workflows.
Key Responsibilities:
Participate in the SRE on-call rotation to handle incident response, troubleshooting, and resolution.
Manage and scale cloud-native infrastructure using Terraform and Infrastructure as Code (IaC) principles.
Administer and maintain Kubernetes clusters on AWS, GCP, or hybrid environments.
Design and improve CI/CD pipelines using CircleCI, Jenkins, and GitHub Actions.
Drive and implement GitOps practices using tools like ArgoCD or FluxCD.
Monitor infrastructure and applications using Prometheus, Grafana, and alerting systems.
Automate workflows and operational tasks using Python and shell scripts.
Ensure robust release and deployment management processes are in place.
Apply a working understanding of MySQL and NoSQL databases.
Perform Linux system administration, including patching, tuning, and hardening.
Collaborate with development, QA, and security teams to deliver reliable and secure systems.
Required Skills & Experience:
Deep knowledge of Kubernetes and container orchestration in production.
Hands-on experience with Terraform and IaC best practices.
Strong expertise in either AWS or GCP cloud environments (multi-cloud a plus).
Proficiency in Python and scripting languages for automation.
Experience managing CI/CD pipelines with CircleCI, Jenkins, and GitHub Actions.
Strong grasp of GitOps workflows and tools like ArgoCD or FluxCD.
Solid understanding of observability principles and tools like Prometheus and Grafana.
Experience with MySQL and Linux system-level troubleshooting.
Familiarity with release and deployment processes in a fast-paced environment.
Git proficiency and experience with infrastructure and code versioning.
Exposure to Argo Workflows, Helm, or Kustomize.
Experience in log monitoring.
Knowledge of container security and Kubernetes network policies.
Familiarity with compliance and audit requirements in production environments.
Educational Qualifications:
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.