Description
Work Arrangement:
Hybrid: This role is categorized as hybrid. The successful candidate is expected to report to the innovation center in either Austin, TX or Atlanta, GA three times per week.
The Role:
The Software Engineering Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of software systems. Their job profile includes:
- System Monitoring and Troubleshooting: Monitoring the performance and availability of software systems, identifying and resolving issues, and implementing proactive measures to prevent future incidents.
- Automation and Infrastructure: Developing and maintaining automation tools and infrastructure to streamline software deployment, configuration management, and system monitoring.
- Performance Optimization: Analyzing system performance, identifying bottlenecks, and implementing optimizations to improve the efficiency and scalability of software systems.
- Incident Response and Root Cause Analysis: Responding to incidents, conducting root cause analysis, and implementing corrective actions to prevent similar incidents in the future.
- Collaboration with Development Teams: Collaborating with software development teams to ensure that reliability and scalability considerations are incorporated into the software design and implementation.
- Continuous Improvement: Identifying opportunities for process improvement, implementing best practices, and driving initiatives to enhance the reliability and performance of software systems.
What You'll Do
- Implement a scalable, reliable, and secure SRE and observability platform to monitor the health of our production systems and provide a holistic view of the environment.
- Deliver tools/software to improve the reliability, scalability and operability of services.
- Collaborate with engineering teams to analyze and provide input on architecture, infrastructure resources, and observability to achieve reliability and scalability goals.
- Collaborate with engineering teams on production readiness reviews, deployment, operation, and refinement.
- Partner with stakeholders to ensure data and observability tools are effectively integrated with other systems and processes.
- Partner with stakeholders to identify, measure and monitor availability, latency and overall service health.
- Participate in on-call engineering duty to support production.
- Instill Site Reliability best practices through automation, data insights, and observability.
- Perform initial incident root cause analysis with engineers and carry out incident postmortems.
- Build runbooks and tooling to carry out production support activities.
- Actively participate in technical discussions and deep dives with the Architectural group.
Your Skills & Abilities (Required Qualifications)
- 7+ years of hands-on SRE experience (software development, systems monitoring) with at least one of the public cloud providers – Azure (strongly preferred), AWS, GCP
- Experience operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring, defining alerts, writing runbooks, establishing dashboards, etc.
- Experience with monitoring and log aggregation frameworks, such as Azure Monitor/Sentinel, Datadog (preferred), Dynatrace, Elasticsearch, Kibana, Logstash.
- Strong working knowledge of Docker, Kubernetes, Terraform, Chef or Ansible
- Experience troubleshooting JVM based applications.
- Chaos engineering implementation experience is a big plus.
- Extensive knowledge of the infrastructure-as-code tool Terraform.
- Extensive knowledge of trace monitoring, including installation and configuration of OpenTelemetry.
- Strong experience in scripting/programming – Python, Java, Go, PowerShell, Bash.
- Experience with configuration and management of SSO and Big Data/NoSQL in cloud infrastructure.
- Knowledge of CI/CD automation frameworks (Jenkins, Azure DevOps).
- Strong understanding of public cloud networking components.
- You have a story to tell about how you led and influenced a cross-organization effort to improve uptime to at least 99.99%.
- Working experience with source control management tools, such as GitHub (preferred), Azure DevOps.
- Experience with an IoT stack is a big plus.
- BS/MS in Computer Science/Engineering preferred
This job may be eligible for relocation benefits.
A company vehicle will be provided for this role with successful completion of a Motor Vehicle Report review.
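Diversity Information
General Motors is committed to being a workplace that is not only free of unlawful discrimination but also one that genuinely fosters inclusion and belonging. We believe that in a diverse environment our employees can thrive and develop better products for our customers. We therefore encourage anyone interested in joining us to review the key responsibilities and qualifications for each position and to apply for any position that matches their skills and abilities. Applicants must pass a role-related assessment (where applicable) and/or pre-employment screening during the hiring process. For more information, refer to GM's guide to the hiring process.
Equal Employment Opportunity Statement (U.S.)
General Motors is proud to be an equal opportunity employer. Qualified applicants are considered for employment regardless of race, color, sex, sexual orientation, gender identity, national origin, disability, or protected veteran status.
Accommodations (U.S. and Canada)
General Motors offers employment opportunities to all job seekers, including individuals with disabilities. If you need a reasonable accommodation to assist with your job search or application for employment, email us at [email protected] or call us at 800-865-7580. In your email, please include a description of the specific accommodation you are requesting, along with the job title and requisition number of the position for which you are applying.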