Sr Site Reliability Engineer, GPU Clusters
United States, Texas, Austin
7171 Southwest Parkway
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences - the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
THE TEAM: AMD's Data Center GPU organization is transforming the industry with our AI-based graphics processors. Our primary objective is to design exceptional products that drive the evolution of computing experiences, serving as the cornerstone for enterprise data centers, Artificial Intelligence (AI), HPC, and embedded systems. If this resonates with you, come and join our Data Center GPU organization, where we are building amazing AI-powered products with amazing people.
THE ROLE: We are looking for a dynamic, energetic Lead / Principal Systems Design Engineer to join our growing team. As a key contributor to the success of AMD's products, you will be part of a leading team that drives and improves AMD's ability to deliver the highest-quality, industry-leading technologies to market. The Systems Design Engineering team fosters and encourages continuous technical innovation to showcase successes and to facilitate continuous career development.
The Site Reliability Engineer (SRE) position is for the Cluster Platform Engineering (CPE) team in the Data Center Cluster Solutions (DCCS) organization, as part of the AMD Data Center GPU (DCGPU) business unit. The SRE will help create and automate the processes that bring up deployed GPU cluster systems and keep them running.
This position will focus on the operational aspects of large-scale GPU-accelerated AI (Artificial Intelligence) and HPC (High Performance Computing) cluster systems within AMD. The SRE will work closely with the CPE Platform Engineering (PE) and Data Center Operations (DC Ops) teams as internal and external systems are brought up for customers. They will use software tools to replace manual processes with automation for tasks such as systems management and application monitoring. They will also work with the CPE Release Engineering (RE) team to develop and automate reliable processes for applying updates to cluster systems. This position is an opportunity to help build a platform and create a world-class operation in support of this exciting growth area for AMD and for the industry. This position reports to the Senior Manager of the Site Reliability Engineering team within the Cluster Platform Engineering group.
THE PERSON: As a leader in Systems Design Engineering, you will drive balanced, scalable, and automated solutions. In this high-visibility position, your software systems engineering expertise will be essential to product development, definition, and root cause resolution.
KEY RESPONSIBILITIES: This SRE role will primarily involve learning the AMD GPU cluster systems, assisting in the bring-up of these systems, and developing automation to keep them operational, as well as working with the various other DCGPU teams to incorporate requirements and address any issues on the systems. Specific responsibilities of this position include:
PREFERRED EXPERIENCE:
ACADEMIC CREDENTIALS: Bachelor's or Master's degree in electrical engineering, computer engineering, or computer science
#LI-RW1 #LI-HYBRID
At AMD, your base pay is one part of your total rewards package. Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. You may be eligible for incentives based upon your role, such as an annual bonus or a sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD's Employee Stock Purchase Plan. You'll also be eligible for competitive benefits described in more detail here.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.