Site Reliability Engineering: How Google Runs Production Systems

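Authors: Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy

Publisher: O'Reilly Media

Publication date: 2016-04

ISBN: 9781491929124

Binding: Other

Format: Other

Paper: Other

Category: Foreign-language used and rare books > English books > Computers and Internet

Bought by 5 people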

Synopsis:
The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?
In
About the authors:
Betsy Beyer
Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.
Chris Jones
Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google’s advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He’s also a licensed professional engineer.
Jennifer Petoff
Jennifer Petoff is a Program Manager for Google’s Site Reliability Engineering team and is based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.
Niall Richard Murphy
Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland’s peering hub. He is the author or coauthor of a number of technical papers and/or books, including "IPv6 Network Administration" for O’Reilly, and a number of RFCs. He is currently cowriting a history of the Internet in Ireland, and is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.
Table of contents:
Chapter 1. Introduction
The Sysadmin Approach to Service Management
Google’s Approach to Service Management: Site Reliability Engineering
Tenets of SRE
The End of the Beginning
Chapter 2. The Production Environment at Google, from the Viewpoint of an SRE
Hardware
System Software That “Organizes” the Hardware
Other System Software
Our Software Infrastructure
Our Development Environment
Shakespeare: A Sample Service
Principles
Chapter 3. Embracing Risk
Managing Risk
Measuring Service Risk
Risk Tolerance of Services
Motivation for Error Budgets
Chapter 4. Service Level Objectives
Service Level Terminology
Indicators in Practice
Objectives in Practice
Agreements in Practice
Chapter 5. Eliminating Toil
Toil Defined
Why Less Toil Is Better
What Qualifies as Engineering?
Is Toil Always Bad?
Conclusion
Chapter 6. Monitoring Distributed Systems
Definitions
Why Monitor?
Setting Reasonable Expectations for Monitoring
Symptoms Versus Causes
Black-Box Versus White-Box
The Four Golden Signals
Worrying About Your Tail (or, Instrumentation and Performance)
Choosing an Appropriate Resolution for Measurements
As Simple as Possible, No Simpler
Tying These Principles Together
Monitoring for the Long Term
Conclusion
Chapter 7. The Evolution of Automation at Google
The Value of Automation
The Value for Google SRE
The Use Cases for Automation
Automate Yourself Out of a Job: Automate ALL the Things!
Soothing the Pain: Applying Automation to Cluster Turnups
Borg: Birth of the Warehouse-Scale Computer
Reliability Is the Fundamental Feature
Recommendations
Chapter 8. Release Engineering
The Role of a Release Engineer
Philosophy
Continuous Build and Deployment
Configuration Management
Conclusions
Chapter 9. Simplicity
System Stability Versus Agility
The Virtue of Boring
I Won’t Give Up My Code!
The “Negative Lines of Code” Metric
Minimal APIs
Modularity
Release Simplicity
A Simple Conclusion
Practices
Chapter 10. Practical Alerting from Time-Series Data
The Rise of Borgmon
Instrumentation of Applications
Collection of Exported Data
Storage in the Time-Series Arena
Rule Evaluation
Alerting
Sharding the Monitoring Topology
Black-Box Monitoring
Maintaining the Configuration
Ten Years On…
Chapter 11. Being On-Call
Introduction
Life of an On-Call Engineer
Balanced On-Call
Feeling Safe
Avoiding Inappropriate Operational Load
Conclusions
Chapter 12. Effective Troubleshooting
Theory
In Practice
Negative Results Are Magic
Case Study
Making Troubleshooting Easier
Conclusion
Chapter 13. Emergency Response
What to Do When Systems Break
Test-Induced Emergency
Change-Induced Emergency
Process-Induced Emergency
All Problems Have Solutions
Learn from the Past. Don’t Repeat It.
Conclusion
Chapter 14. Managing Incidents
Unmanaged Incidents
The Anatomy of an Unmanaged Incident
Elements of Incident Management Process
A Managed Incident
When to Declare an Incident
In Summary
Chapter 15. Postmortem Culture: Learning from Failure
Google’s Postmortem Philosophy
Collaborate and Share Knowledge
Introducing a Postmortem Culture
Conclusion and Ongoing Improvements
Chapter 16. Tracking Outages
Escalator
Outalator
Chapter 17. Testing for Reliability
Types of Software Testing
Creating a Test and Build Environment
Testing at Scale
Conclusion
Chapter 18. Software Engineering in SRE
Why Is Software Engineering Within SRE Important?
Auxon Case Study: Project Background and Problem Space
Intent-Based Capacity Planning
Fostering Software Engineering in SRE
Conclusions
Chapter 19. Load Balancing at the Frontend
Power Isn’t the Answer
Load Balancing Using DNS
Load Balancing at the Virtual IP Address
Chapter 20. Load Balancing in the Datacenter
The Ideal Case
Identifying Bad Tasks: Flow Control and Lame Ducks
Limiting the Connections Pool with Subsetting
Load Balancing Policies
Chapter 21. Handling Overload
The Pitfalls of “Queries per Second”
Per-Customer Limits
Client-Side Throttling
Criticality
Utilization Signals
Handling Overload Errors
Load from Connections
Conclusions
Chapter 22. Addressing Cascading Failures
Causes of Cascading Failures and Designing to Avoid Them
Preventing Server Overload
Slow Startup and Cold Caching
Triggering Conditions for Cascading Failures
Testing for Cascading Failures
Immediate Steps to Address Cascading Failures
Closing Remarks
Chapter 23. Managing Critical State: Distributed Consensus for Reliability
Motivating the Use of Consensus: Distributed Systems Coordination Failure
How Distributed Consensus Works
System Architecture Patterns for Distributed Consensus
Distributed Consensus Performance
Deploying Distributed Consensus-Based Systems
Monitoring Distributed Consensus Systems
Conclusion
Chapter 24. Distributed Periodic Scheduling with Cron
Cron
Cron Jobs and Idempotency
Cron at Large Scale
Building Cron at Google
Summary
Chapter 25. Data Processing Pipelines
Origin of the Pipeline Design Pattern
Initial Effect of Big Data on the Simple Pipeline Pattern
Challenges with the Periodic Pipeline Pattern
Trouble Caused By Uneven Work Distribution
Drawbacks of Periodic Pipelines in Distributed Environments
Introduction to Google Workflow
Stages of Execution in Workflow
Ensuring Business Continuity
Summary and Concluding Remarks
Chapter 26. Data Integrity: What You Read Is What You Wrote
Data Integrity’s Strict Requirements
Google SRE Objectives in Maintaining Data Integrity and Availability
How Google SRE Faces the Challenges of Data Integrity
Case Studies
General Principles of SRE as Applied to Data Integrity
Conclusion
Chapter 27. Reliable Product Launches at Scale
Launch Coordination Engineering
Setting Up a Launch Process
Developing a Launch Checklist
Selected Techniques for Reliable Launches
Development of LCE
Conclusion
Management
Chapter 28. Accelerating SREs to On-Call and Beyond
You’ve Hired Your Next SRE(s), Now What?
Initial Learning Experiences: The Case for Structure Over Chaos
Creating Stellar Reverse Engineers and Improvisational Thinkers
Five Practices for Aspiring On-Callers
On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
Closing Thoughts
Chapter 29. Dealing with Interrupts
Managing Operational Load
Factors in Determining How Interrupts Are Handled
Imperfect Machines
Chapter 30. Embedding an SRE to Recover from Operational Overload
Phase 1: Learn the Service and Get Context
Phase 2: Sharing Context
Phase 3: Driving Change
Conclusion
Chapter 31. Communication and Collaboration in SRE
Communications: Production Meetings
Collaboration within SRE
Case Study of Collaboration in SRE: Viceroy
Collaboration Outside SRE
Case Study: Migrating DFP to F1
Conclusion
Chapter 32. The Evolving SRE Engagement Model
SRE Engagement: What, How, and Why
The PRR Model
The SRE Engagement Model
Production Readiness Reviews: Simple PRR Model
Evolving the Simple PRR Model: Early Engagement
Evolving Services Development: Frameworks and SRE Platform
Conclusion
Conclusions
Chapter 33. Lessons Learned from Other Industries
Meet Our Industry Veterans
Preparedness and Disaster Testing
Postmortem Culture
Automating Away Repetitive Work and Operational Overhead
Structured and Rational Decision Making
Conclusions
Chapter 34. Conclusion
Appendix A. Availability Table
Appendix B. A Collection of Best Practices for Production Services
Fail Sanely
Progressive Rollouts
Define SLOs Like a User
Error Budgets
Monitoring
Postmortems
Capacity Planning
Overloads and Failure
SRE Teams
Appendix C. Example Incident State Document
Appendix D. Example Postmortem
Lessons Learned
Timeline
Supporting information:
Appendix E. Launch Coordination Checklist
Appendix F. Example Production Meeting Minutes