VictorOps Loves Atlassian Part 3: HipChat

It’s time for the final installment of this three-part blog series. If you’re interested, you can catch up with Part 1 about Jira and Part 2 about Confluence.

Let’s go over the integration between VictorOps and HipChat. The integration is bi-directional: anything entered in the VictorOps native chat shows up inside HipChat, and vice versa.

Tribal Knowledge

It’s a best practice to have your team members chat all of their activity when triaging and resolving an IT incident. Doing so captures the thinking process behind how the teams begin to work on the incident. One team member may chat about which monitoring tools they dove into, while another may bring up which servers they have already reset.

If there are questions being asked between teams, it is good to have those on the record as well. This gives the teams more context as to why some actions were taken. All of this tribal knowledge is important when you are trying to improve your Incident Response Strategy.

[Note: Although it is hard to replace face-to-face communication, it is also hard to capture it. Leveraging HipChat as a platform to collaborate will get the teams started on recording their activities.]

Timelining

So you have your teams chatting each other up in HipChat, now what?

You record it! Then you put it in timeline order against your alerts, which are also in timeline order. Fortunately, this is easy to do when using the VictorOps and HipChat integration.

Timeline order you say?

Yes, let’s think about that for a moment. When you have your monitoring alerts and your chats intertwined in sequential order, you accomplish two things. First, you record both the technological events that occur within the stack and the human events that occur in real life. Second, you put these recorded events in order, which paints a better picture of what happened.

Here’s a simple example (most recent first):

[Screenshot: example timeline with alerts and chat messages interleaved, most recent first]
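
To make the idea of timeline order concrete, here is a minimal Python sketch that interleaves alerts and chat messages by timestamp, most recent first. It does not call the VictorOps or HipChat APIs; the `Event` class, the `build_timeline` function and the sample messages are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One timeline entry (hypothetical shape, not a VictorOps or HipChat type)."""
    at: datetime
    source: str  # "alert" or "chat"
    text: str


def build_timeline(alerts, chats):
    """Interleave alerts and chats in a single list, most recent first."""
    return sorted([*alerts, *chats], key=lambda event: event.at, reverse=True)


# Invented sample data for illustration only.
alerts = [
    Event(datetime(2016, 3, 15, 16, 1), "alert", "CRITICAL: api-01 latency high"),
    Event(datetime(2016, 3, 15, 16, 9), "alert", "RECOVERY: api-01 latency normal"),
]
chats = [
    Event(datetime(2016, 3, 15, 16, 3), "chat", "DNS looks good, checking the load balancer"),
    Event(datetime(2016, 3, 15, 16, 6), "chat", "Restarting api-01"),
]

timeline = build_timeline(alerts, chats)
for event in timeline:
    print(f"{event.at:%H:%M} [{event.source}] {event.text}")
```

Read from the bottom up, the output tells the story of the incident: the alert fires, someone rules out DNS, someone restarts the server, and the recovery arrives.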

Process Improvement

This is where the what meets the why. Why do we do all of this? When you have this information recorded in timeline order, you can begin to pick up feedback, insights and patterns that show how to improve the way your team responds to incidents. The ultimate goal here is to reduce downtime by responding to alerts more efficiently and effectively.

The feedback you receive comes from seeing what works and what doesn’t. If a particular action never resolves the issue, you can deprioritize that action for this particular alert and spend those precious seconds on actions that work. You wouldn’t reset the server over and over again if the alert keeps firing, so your team should be checking other places.

The same goes for the actions that do work. If the issue resolves, that’s great: you now have the actions that led to the resolution captured, and you’d want to prioritize those. This is the feedback cycle that helps the teams know what’s working and what’s not. Otherwise, you keep burning time repeating ineffective actions again and again.
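
As a rough illustration of that feedback cycle, the sketch below tallies how often each action was followed by a resolution for one type of alert, then ranks the actions by success rate. The `rank_actions` function, the action names and the outcomes are hypothetical; in practice you would pull them out of your recorded timelines by hand or with your own tooling.

```python
from collections import defaultdict


def rank_actions(attempts):
    """Rank actions by how often they were followed by a resolution.

    `attempts` is an iterable of (action, resolved) pairs for one alert type.
    Returns a list of (action, success_rate) tuples, best first.
    """
    tried = defaultdict(int)
    worked = defaultdict(int)
    for action, resolved in attempts:
        tried[action] += 1
        if resolved:
            worked[action] += 1
    rates = {action: worked[action] / tried[action] for action in tried}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Invented history for a single alert type.
history = [
    ("restart server", False),
    ("restart server", False),
    ("restart server", False),
    ("clear cache", True),
    ("clear cache", False),
    ("fail over database", True),
]

ranking = rank_actions(history)
for action, rate in ranking:
    print(f"{action}: {rate:.0%}")
```

With only a handful of incidents these rates are just a conversation starter, but they make it obvious when an action such as restarting the server has never once helped.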

However, finding the resolution quickly is not always an option. Sometimes you don’t get a resolution, but you get clues. This is where insights kick in. When you find a clue, you want to chat (record) into HipChat what you did to find that clue: “I’ve looked at the DNS and it looks good but this is weird…”

The team begins to leave behind a trail of how they eventually found the resolution. You can turn these insights into actionable steps in your response plan (runbook) or you can have discussions as to the motivation behind these actions and educate the team. The insights become a key factor in improving your processes since they will allow you to see new incident-response tactics.

When the clues lead nowhere, this is where patterns become useful. Over time, you will begin to see multiple alerts with individual responses from the teams. Looking at the HipChat activity and the alerts through an overarching “super lens” across all of the incidents will give you a different view of how your teams are responding to the alerts.

You may be able to see that alerts are handled better at different times of the day or when different people are involved. You can use this pattern data to compare and contrast one set of incidents with another set that involved different teams. We hope you don’t have so many incidents that you end up with big sample sizes of data, and in a perfect world incidents would never exist. But they happen, and if one happens more than once, you want to capture what the teams are doing each time so you can look for patterns.
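
One simple way to look for such patterns is to group incidents by time of day or by team and compare how long each group took to resolve. The sketch below assumes you have exported incidents as dictionaries with `opened`, `resolved` and `team` fields; that shape, the `shift` boundaries (08:00 to 20:00 counts as day) and the sample incidents are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_minutes_by(incidents, key):
    """Group incidents with `key` and return the median minutes to resolve per group."""
    groups = defaultdict(list)
    for incident in incidents:
        minutes = (incident["resolved"] - incident["opened"]).total_seconds() / 60
        groups[key(incident)].append(minutes)
    return {group: median(values) for group, values in groups.items()}


def shift(incident):
    """Label an incident 'day' or 'night' by the hour it opened (assumed boundaries)."""
    return "day" if 8 <= incident["opened"].hour < 20 else "night"


# Invented incidents for illustration only.
incidents = [
    {"opened": datetime(2016, 3, 1, 10, 0), "resolved": datetime(2016, 3, 1, 10, 20), "team": "web"},
    {"opened": datetime(2016, 3, 2, 14, 0), "resolved": datetime(2016, 3, 2, 14, 40), "team": "web"},
    {"opened": datetime(2016, 3, 3, 2, 0), "resolved": datetime(2016, 3, 3, 3, 30), "team": "db"},
    {"opened": datetime(2016, 3, 5, 23, 0), "resolved": datetime(2016, 3, 6, 0, 10), "team": "web"},
]

by_shift = median_minutes_by(incidents, shift)
by_team = median_minutes_by(incidents, lambda incident: incident["team"])
print(by_shift)
print(by_team)
```

The median is used rather than the mean so that one unusually long incident does not dominate a small group. A gap between groups is a prompt to go back and read those timelines, not a verdict on the people involved.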

So there it is. You want to have your team members share their tribal knowledge by chatting their actions. Those chats should be recorded alongside the alerts, in timeline order. These timelines should be reviewed to give you feedback, insights and patterns. That information should guide your discussions on how you’ll lower downtime by getting to the resolution faster.

Use Chat!

The post VictorOps Loves Atlassian Part 3: HipChat appeared first on VictorOps.

