Posted 10 days ago
Description
Responsibilities
- Provide SRE and production support with an emphasis on observability to proactively identify issues and drive incident response.
- Act as incident commander to diagnose complex issues and actively drive incident calls with technical teams, product SMEs, and Tier 2 SREs.
Qualifications
- Bachelor’s degree or foreign equivalent from an accredited institution required. Three years of progressive experience in the specialty will also be considered in lieu of each year of education.
- At least 10 years of Information Technology experience.
- SRE mindset in production support with proactive issue identification using observability tools.
- Skilled in using monitoring and observability tools to track system performance.
- Experience with Splunk (including Splunk APM and Splunk O11y) and AppDynamics; with databases, networking, Linux/Unix, and Kubernetes; and with APM, NMON, and Wireshark usage and analysis.
- Experience in production support activities including proactive issue identification leveraging observability tools and correlating inputs from dashboards and tools to drive resolution.
- Able to identify probable failure points through analysis of logs, observability dashboards, recent application changes, and infrastructure and network changes.
- Basic troubleshooting across the stack (Application, Database, Infra including container platforms, and Network).
- Experience in setting up observability dashboards based on Splunk logs.
Preferred Qualifications
- Production support expertise with SRE observability experience, including proactive issue identification using observability tools and tracking system performance.
- Experience in production support activities involving correlating inputs from dashboards and tools to drive resolution.
- Ability to swiftly identify probable failure points through analysis of multiple inputs (logs, observability dashboards, and recent application, infrastructure, and network changes).
- Strong troubleshooting across all layers of the tech stack (Application, Database, Infra including container platforms, and Network).
Communication
- Excellent communicator, capable of leading the triage of proactively identified issues and incidents on calls where senior leadership may be present.
- Able to lead triage calls and direct the team's actions.
- Automation – experience in toil identification and automation.
Technical expertise
- Analysis of issues via Splunk (including Splunk APM and Splunk O11y), AppDynamics, Grafana, RedMetrics, and ThousandEyes.
- Debugging issues in VMs, load balancers, firewalls, API gateways, databases, networks, and Linux/Unix systems.
- Debugging in containerized environments (Docker, Kubernetes) and on AWS, PCF, and Azure.
- Analysis of issues using APM, NMON, and Wireshark.
- Database performance monitoring and analysis.
- Experience in UEM and synthetic monitoring setup.
- Experience in heap dump analysis, memory leak analysis, and resource optimization.