Manager, Site Reliability Engineering

Massachusetts Medical Society
paid holidays, sick time, tuition assistance, 401(k)
United States, Massachusetts, Waltham
860 Winter Street
Sep 19, 2025

Category: Information Technology
Job Location: 860 Winter St, Waltham, Massachusetts
Tracking Code: 8320
Position Type: Full-Time/Regular

The Massachusetts Medical Society (MMS) is the statewide professional association for physicians and medical students, supporting 25,000 members. We are dedicated to educating and advocating for the physicians of Massachusetts and patients locally and nationally. A leadership voice in health care, the MMS contributes physician and patient perspectives to influence health-related legislation at the state and federal levels, works in support of public health, provides expert advice on physician practice management, and addresses issues of physician well-being. Under the auspices of NEJM Group, the MMS extends our mission globally by advancing medical knowledge from research to patient care through the New England Journal of Medicine, NEJM Evidence, NEJM AI, NEJM Catalyst, NEJM Journal Watch, and through our accredited and comprehensive continuing medical education programs.


The world has changed, and so has the way we work. The MMS has adopted a flexible work model that allows most employees to choose where they work - at home, onsite in our Waltham office, or a combination of the two - based on their preferences and our business needs. Because what matters is the work we do, not where we do it.


The Manager, Site Reliability Engineering leads a team of approximately seven SREs accountable for ensuring the reliability, scalability, and cost-effectiveness of MMS platforms across a hybrid physical data center and AWS environment. This role combines people leadership with hands-on technical decision-making.


Key outcomes include clear prioritization in a high-volume, ambiguous environment; disciplined incident response and learning; and continuous reduction of toil through automation and self-service. The manager owns intake management, capacity planning, and delivery of reliability roadmaps while collaborating closely with product and other technical teams.


The successful candidate thrives in a fast-paced, dynamic environment where priorities shift quickly and complex challenges require creative solutions. They are agile, resourceful, and comfortable alternating between guiding technical problem-solving, shaping long-term architecture, and addressing urgent operational needs. Above all, they provide clarity and direction, helping the team stay focused on the right priorities and ensuring initiatives advance efficiently without unnecessary process overhead.


Responsibilities:


Strategic Responsibilities:



  • Drive execution of roadmaps for observability, CI/CD, security hardening, and platform upgrades, and communicate status and risks.
  • Collaborate with Cloud Architects to ensure the adoption of resilient, self-healing, scalable design patterns to support delivery and testing of our highly available multi-tier applications.
  • Establish and evolve frameworks for reliability, incident response, and continuous learning to drive operational excellence.
  • Partner with security teams to strengthen practices and embed robust guardrails aligned with industry standards.
  • Own incident response and communications; ensure postmortems are completed and action items are tracked to closure against measurable KPIs.
  • Define SLIs/SLOs and manage error budgets with partner teams; align SLAs with business impact.


People and Project Management:



  • Manage, coach, and develop ~7 SREs; run 1:1s, growth plans, and performance reviews.
  • Own intake triage, backlog hygiene, and capacity planning to balance reliability, feature enablement, and operational demands.
  • Collaborate with stakeholders to define and prioritize objectives, ensuring alignment with business goals.
  • Guide team to develop self-service frameworks that empower development teams while maintaining operational standards.
  • Manage on-call program health for 24/7 support of our global sites and services: rotation design, coverage, runbooks, escalation paths, paging policy, and after-hours expectations.
  • Oversee release planning, define service level agreements, and foster the migration of legacy applications to modern CI/CD pipelines.
  • Foster a culture of collaboration, accountability, and continuous improvement.
  • Other responsibilities as assigned.


Qualifications:



  • Bachelor's degree in a related field with 6+ years of experience in software development or DevOps, or equivalent education and experience, is required.
  • 2+ years directly managing SRE/DevOps teams of 5-10 engineers in a dynamic environment.
  • Hands-on expertise with hybrid cloud architectures, Linux systems (Amazon Linux), and Windows systems.
  • Strong experience in CI/CD pipeline design (GitHub Actions or Jenkins) and IaC (Terraform or CloudFormation).
  • Hands-on proficiency with observability tools (Datadog, New Relic, or Prometheus).
  • Experience implementing security best practices across compliance, vulnerability management, and identity/access management.
  • Excellent communication and project management skills; proficiency with tools such as Jira and Confluence.
  • Proven problem-solving skills, with the ability to learn quickly and adapt solutions creatively.
  • Demonstrated ability to work cooperatively and communicate effectively in an Agile team environment.


  • Self-motivated with the ability to operate independently and set priorities in ambiguous situations.
  • Experience with containerization and orchestration (Docker and Kubernetes).
  • Previous exposure to an API management tool (MuleSoft preferred).
  • Experience with self-healing system design and automated failure recovery strategies.
  • Scripting proficiency in Python, Bash, or PowerShell.


Benefits:


Our generous benefits offerings include: 3 weeks of paid vacation, 6 personal days, 12 sick days, 13 paid holidays, medical and dental plans, 401(k) plans with company match, backup childcare assistance, tuition assistance and more!


The MMS has been named one of the Top Places to Work in Massachusetts by The Boston Globe for the past 15 years in a row! The Globe surveys employees about company leadership, benefits, ethics, values, and culture, and recognizes the companies that receive high marks from their employees.


The MMS is an Equal Opportunity Employer, committed to providing opportunities to veterans and people with disabilities and a work environment that is welcoming to all.


