Best practices for post-incident reviews

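How you approach a post-incident review is just as important as the tasks you need to complete. Tensions can run high when an incident occurs. People need to feel psychologically safe before they can engage with the process in good faith and be ready to tackle difficult problems.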

Post-incident review best practices

Establish a blameless culture

Allow people involved in an incident to account for all their actions, their impact, and what they knew and when, without fear of punishment or retribution. This approach is key to making sure your teams openly share information and get to the root cause of an incident. If anyone fears rebuke, they may hold back information or try to redirect blame. When this happens, people lose trust in each other.

Avoid pointing fingers

In your post-incident review meeting—and in the subsequent write-ups—avoid language that singles out individuals as personally responsible for the incident. Instead, focus on actions, results, and impact.

Keep critique constructive

While it’s important to keep the conversation safe and objective, getting to the root cause of the incident is critical to resolving it. Make sure the room doesn’t steer away from an uncomfortable truth or settle for an easy consensus. You can use a technique called ‘The 5 Whys’ in your meeting to uncover the deeper factors contributing to the problem. Read how to run a ‘5 Whys Analysis’ with the Atlassian playbook.

Review every post-incident review

An unreviewed post-incident review might as well not have been written. Once a post-incident review has been drafted, it’s important to review it to close out any unresolved work items, capture ideas to consider in the future, and finalize the report. It’s a good idea to schedule a recurring meeting with engineering (and anyone else who may have an interest, like customer support or account managers), at least monthly, to review your post-incident reviews. You can choose to look over recent reviews or older reports and share any relevant lessons.

Create a post-incident review plan

For post-incident reviews to be effective, and to help you build a culture of continuous improvement, implement a simple, repeatable process that everyone can participate in. How you do this will depend on your culture and your team, but the key to conducting post-incident reviews that improve your team and systems is to have a process and stick to it. Read how Atlassian runs its post-incident review process.

Here is some guidance to get started.

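1. Decide which incidents need a review

Incidents in your organization should have clear, measurable severity levels. These severity levels are used to trigger the post-incident review process. For example, any incident of severity 1 or higher triggers a post-incident review, while reviews can be optional for less severe incidents. Consider giving team leads and management the opportunity to request a post-incident review for any incident they decide needs one.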
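The trigger rule described above can be sketched in a few lines. This is only an illustration, not part of Jira Service Management: the `Incident` class, the `review_requested` flag, and the threshold constant are hypothetical names, and it assumes that severity 1 is the most severe level and that larger numbers mean less severe incidents.

```python
from dataclasses import dataclass

# Assumption: severity 1 is the most severe level; larger numbers are less severe.
REVIEW_REQUIRED_SEVERITY = 1


@dataclass
class Incident:
    key: str
    severity: int
    # Hypothetical flag: set when a team lead or manager asks for a review.
    review_requested: bool = False


def review_required(incident: Incident) -> bool:
    """Return True when the incident must go through a post-incident review."""
    meets_threshold = incident.severity <= REVIEW_REQUIRED_SEVERITY
    return meets_threshold or incident.review_requested
```

With this rule, a severity 3 incident skips the review unless someone explicitly requests one, which keeps the mandatory workload focused on the most severe incidents.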

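2. Draft the review within 2 days of the incident

Once the incident is resolved, it's important to take a break and get some rest. Even so, avoid putting off writing the post-incident review. If you wait too long, important details can be lost or forgotten. Ideally, draft the review right after meeting with the incident team, within 24 to 48 hours of the incident being resolved (and no later than 5 business days).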
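As a rough sketch, the two deadlines above can be computed from the resolution time. The function names are hypothetical, and the business-day logic assumes a Monday to Friday work week with no public holidays.

```python
from datetime import datetime, timedelta


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole business days, skipping Saturday and Sunday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def draft_deadlines(resolved_at: datetime) -> tuple[datetime, datetime]:
    """Return (target, latest) deadlines for the review draft.

    target: 48 hours after resolution (the end of the ideal window).
    latest: 5 business days after resolution (the hard limit).
    """
    target = resolved_at + timedelta(hours=48)
    latest = add_business_days(resolved_at, 5)
    return target, latest
```

For an incident resolved on a Friday morning, the target falls on the weekend while the hard limit lands on the following Friday, so a team might reasonably aim for the first working day after the target.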

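3. Assign roles and owners

Hold a meeting to work through the details that will be recorded in the review. We recommend delegating the drafting of the review to a specific person, ideally someone who is familiar with the incident and has the technical and organizational knowledge needed to understand its causes and mitigations.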

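4. Work from a template

A template helps you capture every important detail, and it's a great way to keep your post-incident reviews consistent. For one way to get started, see the post-incident review template.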

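5. Include a timeline

A timeline is extremely helpful when documenting an incident. It's often the first thing readers look at when they want to quickly understand what happened. Use the incident's activity feed to see what happened and when. Record times as clearly and specifically as possible. For example, write "11:14 AM Pacific Standard Time" instead of "around 11".

Key times to include:

The first alert or ticket
The first communication announcement (internal or external)
When the status page was updated
When each remediation attempt was made (such as a code rollback)
When the incident was resolved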
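One way to keep timeline entries precise is to store them as time-zone-aware timestamps and render them in a fixed format. The sketch below is illustrative only: `TimelineEntry` and `render_timeline` are hypothetical names, and it assumes times are recorded in Pacific Standard Time (UTC-8), as in the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: times are recorded in Pacific Standard Time (UTC-8).
PST = timezone(timedelta(hours=-8), "PST")


@dataclass
class TimelineEntry:
    at: datetime
    event: str


def render_timeline(entries: list[TimelineEntry]) -> list[str]:
    """Sort entries chronologically and format each with an exact, zoned time."""
    # Reject vague entries first: every timestamp must carry a time zone.
    for entry in entries:
        if entry.at.tzinfo is None:
            raise ValueError(f"timestamp for {entry.event!r} has no time zone")
    ordered = sorted(entries, key=lambda e: e.at)
    return [f"{e.at.strftime('%Y-%m-%d %H:%M %Z')}  {e.event}" for e in ordered]


# Hypothetical example data covering some of the key times listed above.
timeline = render_timeline([
    TimelineEntry(datetime(2024, 3, 1, 12, 2, tzinfo=PST), "Code rollback started"),
    TimelineEntry(datetime(2024, 3, 1, 11, 14, tzinfo=PST), "First alert fired"),
    TimelineEntry(datetime(2024, 3, 1, 11, 25, tzinfo=PST), "Status page updated"),
])
```

Rejecting timestamps without a time zone forces whoever records the timeline to be specific, which is the same discipline the step above asks of the written review.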

6. Add as many details as possible

Leaving out details is a quick path to writing post-incident reviews that are unhelpful and unclear. Add as many details as possible about what happened and what was done during the incident. Instead of “then public comms went out,” say “We sent the initial public comms announcing the incident on our public status page and Twitter account.” Include as many links as possible to work items, status updates, documentation and monitoring charts, and don’t be afraid to attach relevant screenshots.

7. Capture incident metrics

When you capture metrics in your post-incident reviews, you apply hard data to the incidents and their impact. These data points help you determine whether your team is headed in the right direction: reducing the number of incidents, their severity, and downtime. With metrics measured consistently, you can take a step back and look at incident trends over time.

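Metrics to consider:

Minutes of downtime. You can track whether this number goes up or down.
Incident severity. This helps you gauge the relative reliability of your systems.
Mean time to resolution (MTTR). This measures the average time from when an incident is first reported until it is resolved.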
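The three metrics above can be derived from a simple list of incident records. The following is a minimal sketch under stated assumptions: `IncidentRecord` and its field names are hypothetical, and downtime is assumed to be recorded separately from the time between report and resolution.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    severity: int
    reported_at: datetime
    resolved_at: datetime
    downtime_minutes: int


def mean_time_to_resolution(incidents: list[IncidentRecord]) -> timedelta:
    """Average time from first report to resolution across the incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.resolved_at - i.reported_at for i in incidents), timedelta())
    return total / len(incidents)


def total_downtime_minutes(incidents: list[IncidentRecord]) -> int:
    """Sum of recorded downtime, in minutes."""
    return sum(i.downtime_minutes for i in incidents)


def incidents_by_severity(incidents: list[IncidentRecord]) -> dict[int, int]:
    """Number of incidents at each severity level."""
    return dict(Counter(i.severity for i in incidents))
```

Running these over the incidents from each month or quarter gives comparable numbers, which is what makes it possible to see trends over time.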
