Time for #DevDiscuss
Tonight's topic is: Being on-call
- How does your team schedule and deal with "on call" scenarios?
- What tech does your team use to monitor/alert/notify?
- How do you prepare for being on call?
- Any stories about difficult on-call situations?
I've never worked for a company with formalized on call, which has kind of resulted in me feeling like I'm on call all the time at points in my career to be totally honest. I would prefer concrete on call hours to feeling stressed all the time. #DevDiscuss
We do weekly on-call rotations, and it usually consists of restarting crashed machines or switching out bad hardware in our data center. A lot of our stuff is automated to alert us via PagerDuty #devdiscuss
First off, having an on call schedule is critical for a team because it allows the burden to be passed around. If you ever feel like you're burned out from being on call, then you may be missing a good schedule and the team is not incrementally improving on call #DevDiscuss
I've worked during "vacation" time (a lot) and on big holidays like Thanksgiving and Christmas Eve to be totally honest. And I've worked a huge number of weekends.
I can handle a huge workload, but I need that workload to be planned out instead of spontaneous.
#DevDiscuss
We currently use OpsGenie to manage on call schedules and they are directly integrated into our calendars which is awesome. Also using the mobile app makes it very easy to Ack alerts as they come up. #DevDiscuss
Thankfully my team is pretty laid back with on call...if it's between 5pm and 9am and not causing a problem, we usually just handle the ticket the next day #DevDiscuss
My worst on call story involves a monthly cron job failing over the weekend. I was new and unable to debug it, neither were the other two teammates I called. We didn’t solve it until Tuesday of that week. It was incredibly stressful #devdiscuss
I have never been on call *knocks on wood*. I have gotten texts from team members when my product owner wants something fixed ASAP on my way to work. Those are never fun and I spend my whole train ride into work anxious. #devdiscuss
We just talked about some of the holiday schedule today.
I honestly think getting this right is going to be one of the hardest parts of scaling https://t.co/lWepprFqmW. There is still too much locked in my head if a real emergency were to occur and I was unreachable. #DevDiscuss
Being a fully remote team, Slack is critical for alerting and ops issues. But that is only as good as the alerts being correctly tuned. Nothing is more annoying than an alarm that isn't actually a big deal; delete those now. Alarms should mean that it needs your atte. #DevDiscuss
We have an "Engineer on Duty" system. The assigned coder is in charge of intercepting new tickets, deciding who will fix them, or if they're issues that need to be fixed at all.
We actually don't have a real "on call" system for things outside working hours. #DevDiscuss
Keeping a run book for common support issues and how to quickly triage them is an important resource. However, keeping it up to date is always a challenge, keep that in mind. #DevDiscuss
On-call should not be painful.
It starts with observability. Teams who have proper _actionable_ metrics reduce pain points for devs who are on call. Spend time and energy building out the dashboards and alerts that make sense. #DevDiscuss
I’ve never had a project with on call support since then. But it taught me that throwing a new hire into the deep end just to save the senior resources the time is a horrible practice. Borders on hazing. Support your people so they can support your system. #devdiscuss
Tribal knowledge is always an issue with on call stuff. Get as many things documented as possible and get everyone the context they need. This often means it's going to hurt a little bit at first, but the team is going to be better in the long run #DevDiscuss
One time I had to take down a machine for routine maintenance and forgot to gracefully shut down the services running in it, so like 3 teams got critical alerts saying their stuff was down when I shut it down....Oops #DevDiscuss
Two things:
a) working for a company where everything was expected ASAP no matter the time of day/whether you were on or off.
b) me being a super perfectionist and needing to get things done for my sanity.
#DevDiscuss
I totally agree with this. I'm the most junior coder in my office and I'm only tapped for on-call issues if they fall into my specific expertise. Broader ones go to the senior staff first. #DevDiscuss
Metrics, observability, and the ability to quickly ship fixes (i.e. CI/CD) are an important cycle for on call. You should be continuously improving these things as they will degrade over time and need sprucing up. #DevDiscuss
There's definitely a certain old school "no pain, no gain" mentality with ops.
Limiting the pain is easier said than done, but it definitely feels like the pain is allowed and accepted more than it could/should be. #DevDiscuss
Dear lord that first point is the worst. My old job often had designers deciding product requirements and due dates regardless of dev input later on. The result was so much stress it resulted in much extra work time, formally on-call or not. #DevDiscuss
In reply to
@ASpittel, @MattHutchison43, @ThePracticalDev
I have some friends in the New York City online journalism scene. Talk about being on-call 24/7.
They're basically expected to be on their laptop within ten minutes of anything especially newsworthy breaking. 😰
#DevDiscuss
An apparent benefit of my front-end focus is I get called for on-call emergencies much less. The big ones are server failures or blackouts, and I'm rarely needed for those while I'm debugging IE.
That's not always good though. When they've hit me, I was unprepared... #DevDiscuss
The situation was particularly bad. The main/only maintainer was out of the country. They’d just changed a fundamental certificate and hadn’t tested properly. I was taught the system without working on it through a few meetings to do on call. #devdiscuss
As someone who majored in journalism, I know exactly how this feels. I remember people reporting on school blackouts and even shootings right after they happened. Editors were at the computers all night sometimes... #DevDiscuss
If anyone had any idea how to monitor effectively on the frontend, there might be more on-call frontend devs.
We're all living in blissful ignorance of most of the frontend chaos. 😄
#DevDiscuss
I do know how to at least start/stop things on the server if and when things go wrong. I try to be useful on that side of things, even though I don’t know a whole lot. #DevDiscuss
If I had a dollar every time there was a front-end fire people didn't notice for weeks due to not being able to reproduce the specific conditions that caused it, I'd have exactly $57, give or take $100,000 more #DevDiscuss
I think a lot of this fits into the "rockstar developer" trope that gets tweeted about a lot. I think that oftentimes it's not the developer's fault that they become the single point of failure; they aren't given the time to make themselves duplicatable. #DevDiscuss
If you have something similar to a style guide or pattern library, building out front-end components with different content setups can help prepare you and avoid the on-call "it's not working in X scenario" nightmares #DevDiscuss
The idea of "oh this huge complicated thing is easy for that one dev to do" isn't too far away from "oh this terrifying fire on the live site is easy for that dev to fix," sadly. #DevDiscuss
I've noticed our on-call nightmare scenarios have decreased quite a bit since we hired a full-time DevOps Manager. We saw the value in having someone whose full job was keeping things running smooth, and it's a good investment for all companies to make. #DevDiscuss
A lot of my "on call" hasn't even been due to downtime or a bug, it's due to a new feature, or a task that needs to be added, or a job that needs to be run manually. Or a student question that needs to be answered. #DevDiscuss
It is where we work at least. When I started I merged a rock-stupid simple change on Friday, and it was reverted simply because there were never merges after 12 on Friday. Lots of weekends saved from that, I'm guessing. #DevDiscuss
The first time I was on call, nine years ago, I was terrified, but I learned a lot and teammates helped me out. That’s how I discovered that oncall is good for the soul. I love it now. #DevDiscuss
Maybe slightly off topic - I've never had the privilege of being on call, but from a DevRel standpoint, we are always on.
Tweets, Stack Overflow, Comment response, conferences etc. Less of an "on call" fire response but it's similar.
Less pressure for sure. #DevDiscuss
That seems like a failure of company culture. No features or tasks should ever be beyond normal work schedules. That means planning wasn’t done properly. On call “should” be reserved for unscheduled outages and major issues. #devdiscuss
That is part of their job yes. They keep other senior staff in the loop about what's being done and how they've fixed previous problems, so it helps reinforce things overall. #DevDiscuss
All of those community engagement functions are definitely work and a full time job. They are valuable, but it’s good to remind ourselves how much time they truly take and the fact that we can’t be expected to be a part of everything 24/7. #devdiscuss
I got a call once from my boss while I was on-call as he shamefully admitted he had accidentally powered off the production SAN and needed help recovering the environment. #DevDiscuss
That's a good point. Not being in the loop for many of these things has made it harder to potentially avoid these kinds of mistakes if they appear again later. I get a rundown after, but never first-hand experience fixing it. #DevDiscuss
I have been on teams where on call is an absolute misery. Alarms going off all the time, no one really wants to help, and no time to actually improve things. Those are the teams that give on call a very bad name. It's not supposed to be that way. #DevDiscuss
We used PagerDuty in the past, but felt like we used it wrong. It got too noisy. We’ve since built a small internal system to relay critical issues to us.
Too many false positives defeat the purpose of monitoring and make on-call almost worthless. #DevDiscuss
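One way to picture a small relay that only passes along critical issues is the sketch below. It is not the internal system mentioned above; the `Alert` fields, the severity names, and the five-minute dedup window are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical severity levels; real monitoring tools define their own.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    source: str       # name of the system that raised the alert (assumed field)
    severity: str     # one of the keys in SEVERITY_RANK
    timestamp: float  # seconds since epoch


def alerts_to_page(alerts, min_severity="critical", dedup_window=300.0):
    """Return only the alerts worth paging someone for.

    Drops alerts below min_severity and suppresses repeats from the same
    source that arrive within dedup_window seconds of the last page.
    """
    threshold = SEVERITY_RANK[min_severity]
    last_paged = {}  # source -> timestamp of the last alert we paged on
    paged = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if SEVERITY_RANK[alert.severity] < threshold:
            continue  # not severe enough to interrupt anyone
        previous = last_paged.get(alert.source)
        if previous is not None and alert.timestamp - previous < dedup_window:
            continue  # same source paged recently; treat as noise
        last_paged[alert.source] = alert.timestamp
        paged.append(alert)
    return paged
```

Raising `min_severity` or widening `dedup_window` trades missed pages against noise, so the right values depend on the team.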
I was on-call when I was working as a system admin for my college department. It was a half-time job with no concrete on-call schedule. No systematic way to monitor either. We were really naive back then with no backup plan, no rotation. #DevDiscuss
We rotate through the team 1 week each. We have monitoring set up to send alerts to email, so we check at least once an hour. I usually send my boss this Beyonce meme when it starts and this MTM meme when it's over. I try to block the difficult ones. #DevDiscuss
Dunno if this counts, but as a junior dev I always feel "on call" with my learning. I feel like if I don't spend at least a little time outside of work studying or practicing something new, I'm not meeting my obligations. Not as tough, but still takes a toll! #DevDiscuss
That’s reasonable and an unfair burden this industry can have. We should always support learning on the job and we can’t expect everyone to know the newest thing when no team uses it in real development yet. #devdiscuss
If you are in one of these teams: start keeping track of how much time support is draining the team and then point to the $$$ figures associated with that. That's how much dinero the company is losing by not giving support a priority in the backlog. #DevDiscuss
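A rough sketch of that dollar figure, assuming you already log support hours. The function name, the flat hourly rate, and the example numbers are all made up for illustration.

```python
def support_cost(hours_per_week, hourly_rate, engineers=1, weeks=52):
    """Estimate what unplanned support work costs over a period.

    hours_per_week: average support hours per engineer per week
    hourly_rate: assumed loaded cost of one engineer-hour
    """
    if min(hours_per_week, hourly_rate, engineers, weeks) < 0:
        raise ValueError("inputs must be non-negative")
    return hours_per_week * hourly_rate * engineers * weeks


# Made-up example: 5 engineers each losing 6 hours a week at $75/hour.
yearly = support_cost(6, 75, engineers=5)
print(f"${yearly:,} per year")  # prints "$117,000 per year"
```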
Pattern I'm noticing - the less structure or rules a company has with who is "on call" and when, the worse it is. Maybe no official "on call" guidelines makes it harder for devs to know when they can say no to requests outside of working hours without consequences? #DevDiscuss
I got called about an outage half-way through recording this podcast interview.
It was pretty hilarious (in hindsight)
#DevDiscuss https://t.co/XEygQWwBwz
IMO the goal should be to reduce impact of on-call incidents (hard to do). Many things can be proactively planned for if you can see what’s going on.
Checkups can help with this. Here’s a great talk about them: https://t.co/sIG8AANMl4 #DevDiscuss
Oh, this is really interesting -- I think this can extend to community involvement as well. I got aggressively pinged over and over again including follow ups on Thanksgiving day this year because I wasn't responding to emails/dms fast enough. #DevDiscuss
Yeah, it's always a tough balance. A shortcut I've found to useful knowledge is, after my teammates did an interview, asking what the potential should have known but didn't 😛 #DevDiscuss
With Sentry, you can selectively relay issues via webhooks. This helps you focus on key areas.
Pingdom can monitor any endpoint on your server.
Combined this provides robust monitoring. Both have webhooks to relay data to you. #DevDiscuss
totally -- these were companies looking for sponsored blog posts! which I don't even do!
Also, think tonight has gotten me overly heated, sorry haha.
#DevDiscuss
I totally agree. If someone wants your time and attention, they should be willing to accommodate your availability. Otherwise it's not networking, it's just making unreasonable demands of someone #DevDiscuss
Yeah, it's actually part of why I'm grateful I'm still classified as a "junior" developer at my job. It makes it more natural for me to pair for advice or take some extra time in the day to research something #DevDiscuss
Most memorable: NYE a few years ago, we discovered a bug in our recently released Billing service which would have mistakenly charged every customer for a full year of service at midnight! (regardless of their subscription) We were able to mitigate it, just in time. #DevDiscuss
Don’t be sorry. That’s not acceptable. However, I think you and I have a similar Achilles heel, we can’t handle unread messages. It’s ok to ignore rude people on your vacation! #devdiscuss
Haha no need to apologize. I've seen you tweet about frustrating messages from people (many different types) and you've definitely earned some venting. #DevDiscuss
I think we should all do that more and have it be acceptable. After all, aren’t we all sharing knowledge and constantly learning? We’re better off when we do. #devdiscuss
I've never thought about how we can be "on call" to our communities as well, but considering how often I check my Slack and Twitter feeds, I kinda already know what that's like haha #DevDiscuss
I’ve started muting entire slack groups and then I don’t check in at all. Needed a middle ground so I get very few notifications on specific channels. #devdiscuss
We rotate every week, and are on call alongside a traditional Operations team. We found that most outages require the domain knowledge of an application developer, and the infrastructure knowledge of a dev/ops person. #DevDiscuss
My rule of thumb is if I don't post anything in a Slack for a week, I leave it no matter what's being posted. There's enough information feeds out there, got to narrow out the more useful/engaging ones. #DevDiscuss
I can understand that. However, many of the communities I engage with are to support others and share resources. So it’s always going to be easier to sit back and focus on my stuff, but I want to do more than that, when I’m able. #devdiscuss
Unrelated thought I just had related to some topics here: does @ThePracticalDev have a Slack group? Because there's lots of community members here I'd love to chat with in that format too. #DevDiscuss
Same -- I super tailor Slack groups, and only get notifications from two! I don't do most email or twitter notifications either which helps! #DevDiscuss
When I was on call I lived 70 miles from the DC. Came in on a call at 2am. While there for 15 minutes my car was towed because the company had no parking at the location. Had to pay out of my pocket. Didn’t even get paid extra for my time. #DevDiscuss
I'll be in New Orleans over the holidays. Unless there's an emergency loud enough to be heard over the parties at Bourbon Street, it ain't my problem! #DevDiscuss
we (I led the engineering team) wanted to be included in the on-call rotation because it allowed us to get a sense of how real customers were using our product in real situations #DevDiscuss
there would definitely be mundane stuff, but often if there were repeat cases, that was a great opportunity to add another dashboard or diagnostic tool #DevDiscuss
It brings up an important point though. Tools alone can’t help your monitoring and on-call situation. You’ll have to identify your goals first. Then choose tools to help you meet those goals. #devdiscuss
Big enterprise, so we have lots of rotations. And lots of systems.
- Most level 1/ level 2 is handled by vendors. They'll execute playbooks then triage/ escalate to the appropriate level 3 or 4 teams as needed.
- Generally, app teams are level 4.
#DevDiscuss
we experimented with different scheduling scenarios - each had pluses & minuses - but what we overall liked best is week-long assignments (scheduled months in advance) with 8-hour shifts - so 3 ppl at any one time to cover our worldwide customer commitments 24/7 #DevDiscuss
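A schedule like that (week-long assignments, three 8-hour shifts, planned months ahead) could be generated along these lines. This is only a sketch: the shift boundaries, the round-robin order, and the names are assumptions, not how the team above actually did it.

```python
from datetime import date, timedelta
from itertools import cycle

# Three 8-hour shifts covering a full day; these boundaries are assumed.
SHIFTS = [("00:00", "08:00"), ("08:00", "16:00"), ("16:00", "24:00")]


def build_rotation(engineers_by_shift, start, weeks):
    """Assign one engineer per shift for each week, round-robin.

    engineers_by_shift: one list of names per entry in SHIFTS
    start: date on which the first week begins
    weeks: how many weeks to plan ahead
    """
    if len(engineers_by_shift) != len(SHIFTS):
        raise ValueError("need one list of engineers per shift")
    if any(len(names) == 0 for names in engineers_by_shift):
        raise ValueError("every shift needs at least one engineer")
    pickers = [cycle(names) for names in engineers_by_shift]
    schedule = []
    for week in range(weeks):
        week_start = start + timedelta(weeks=week)
        # One person per shift, so three people are on call in any given week.
        assignments = {shift: next(picker) for shift, picker in zip(SHIFTS, pickers)}
        schedule.append((week_start, assignments))
    return schedule


# Hypothetical team, grouped by the shift that matches their timezone.
team = [["Ana", "Bo"], ["Cy"], ["Di", "Ed", "Flo"]]
for week_start, assignments in build_rotation(team, date(2019, 1, 7), weeks=3):
    print(week_start, assignments)
```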