Lead Engineer (Site Reliability and DevOps)
We're seeking an experienced Senior Site Reliability Engineer (SRE) to join the Customer Communications organization within Delta IT. SRE is what happens when you ask a software engineer to design an operations function. We are looking for someone who is driven by the idea that engineering a system to be more resilient and efficient frees up time to build more valuable capabilities. The SRE will work within the agile DevOps squad to proactively improve the reliability and resiliency of the applications and services it supports. You will design, develop, test, debug, and automate tasks for software systems. As an SRE, you will be expected to leverage tooling and custom applications to monitor and optimize performance. The SRE will troubleshoot incidents to address failure patterns, analyze root causes, automate remediation through runbooks, and document application optimization. Our ideal candidate is well-versed in modern cloud-based and on-premises architecture and experienced in designing systems for reliability, as well as in implementing monitoring, alerting, and ops automation.
- Build and maintain the deployment architecture to meet the development and maintenance requirements of systems and platforms.
- Work with development teams to evaluate the health, stability, and reliability of applications.
- Identify and act on opportunities to improve the reliability, cost-effectiveness, and ease of use of the deployment architecture.
- Empower engineers to build, test, deploy, and monitor services by themselves.
- Define and implement solutions that eliminate recurring escalations.
- Research and analyze trends and behavioral data to identify opportunities for improvements and new initiatives.
- Use monitoring, alerts, dashboards, and management tools to ensure the availability, reliability, and performance of applications and services.
- Continually improve and implement automation of application tasks.
- Provide technical support for systems and platforms according to application SLAs.
- Design and develop resiliency in the application code, troubleshoot incidents, engage with squads to address failure patterns, and participate in incident management.
What you need to succeed (minimum qualifications)
- High School Diploma, GED, or High School Equivalency.
- Embraces diverse people, thinking, and styles. Consistently makes safety and security of self and others the priority.
- 7 or more years of experience as an application developer or SRE.
- 2 or more years of experience with ops automation using a scripting language or automation tool such as Python or Ansible.
- Site Reliability Engineering: Knowledge of the theories and methodologies of reliability engineering; ability to design, develop, and support various tools, services, and applications to maintain a reliable application environment.
- Performance Measurement and Tuning: Knowledge of system performance, testing, and programming; ability to monitor, measure, and optimize system performance and network communication.
- CI/CD Pipeline: Knowledge of the concepts, values, and tools applied in building Continuous Integration (CI), Continuous Delivery, and Continuous Deployment (CD) pipelines; ability to design, build, implement, and maintain CI/CD pipelines to automate the software delivery process (AWS, Azure, Git).
- Software Release Management: Knowledge of strategies, practices, and tools for managing versions and distribution of software products and enhancements; ability to evaluate and improve release management practices and tools.
- Application Maintenance: Knowledge of production applications; ability to monitor application functions and resolve issues to maintain optimal conditions for system applications.
- Software Engineering: Knowledge of software engineering; ability to deliver new or enhanced software products.
- Agile Development: Knowledge of agile methodologies and the agile development lifecycle; ability to utilize formal agile methodologies, disciplines, practices, and techniques for the delivery of new and enhanced applications.
- Containers: Knowledge of the concepts, functions, and capabilities of container tools and techniques; ability to effectively apply containers in various IT business environments.
- Cloud Platform: Knowledge of the products and services regarding cloud platforms; ability to utilize related tools and technologies to develop cloud solutions and deploy applications on cloud platforms.
- Where permitted by applicable law, must have received or be willing to receive the COVID-19 vaccine by date of hire to be considered for a U.S.-based job, if not currently employed by Delta Air Lines, Inc.
What makes you stand out?
- AWS Certified SysOps Administrator or AWS Certified DevOps Engineer certification is preferred.
- Experience configuring, operating, and optimizing services offered by Azure is a plus.
- Experience with an APM tool such as Dynatrace, New Relic, AppDynamics, or Datadog is preferred.
- Experience with airline applications and infrastructure technology is a plus.
- Experience developing ops automation in Tekton pipelines is a plus.
- Experience developing applications and/or automation running in Red Hat OpenShift, AWS is a plus.
- Sound knowledge of one or more programming languages (Java, Python, C#) and of performance tuning.