Difference between revisions of "IncidentLog"
(Not having someone in command is a risk whilst we're not ticking over without worry)
[[User:Mrcoolbp|mrcoolbp]] ([[User talk:Mrcoolbp|talk]]) 23:24, 11 March 2014 (UTC)
Revision as of 22:44, 16 March 2014
This is a log of incidents that occurred and how they were remedied.
Begin 22:18 PST, end c. 23:50
martyp reports same problem as before on main site, 0 comment counts, audioguy agrees to have a look.
Slashd not running.
Attempted to start slashd, which started but complained that it could not connect to the database. Decided to restart Apache, because of its strong dependencies on mod_perl.
Apache refused to start, citing lack of connection to database. Stopped slashd, in order to try bringing up Apache first, then slash. Apache still refused to start, unable to connect to the database.
At this point NCommander appeared, and audioguy handed all incident response authority to him.
NCommander had some difficulty connecting due to an extremely poor connection at his location (ssh being blocked), but was finally able to get in through the Web interface at Linode. (Interesting but incidental fact: an attempt was made to get in through audioguy's account on slashcott, which is on an alternate port, but ssh access was thwarted even on this alternate port.)
NCommander determined that the problem was an expired SSL cert on the database. The cert had been set up, for unknown reasons, with a very short expiration time by someone who is no longer here.
NCommander was able to generate a new SSL cert and bring things back up.
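An expired cert like this one can be caught ahead of time with a periodic check. The following is only a minimal sketch, not the procedure used during the incident: the function names and the 30-day warning threshold are assumptions, and it expects the cert's notAfter date as a string in the format printed by `openssl x509 -enddate -noout` (without the `notAfter=` prefix).

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return whole days until the cert's notAfter time (negative if expired).

    not_after is a string such as "Jan  5 09:34:43 2018 GMT".
    now is a Unix timestamp; it defaults to the current time.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires - now) // 86400)


def needs_renewal(not_after, warn_days=30, now=None):
    """True if the cert expires in fewer than warn_days days.

    warn_days=30 is an arbitrary threshold chosen for illustration.
    """
    return days_until_expiry(not_after, now) < warn_days
```

Run from cron against the database cert, a check like this would give some days of warning instead of an outage.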
A short discussion with NCommander afterward brought out the following facts:
The database was confirmed to be on a different machine, to which audioguy had no access. Audioguy would not have been able to solve this problem, having access only to the main server, not the database server. The database uses ssl because of this remote connection.
The main server cannot be brought up by a simple reboot in an emergency. The Apache server must be started by hand. (It is located in /srv/soylentnews.org/apache, an unusual location.)
NCommander has no idea why Nginx is running on the slash machine.
A brief discussion about documenting System Administration was held. This is to be a high priority once NCommander returns from vacation. NCommander received a reprieve from various possible flogging and keel-hauling due to his timely appearance to save the day.
After Action Recommendations:
- The same people who have access to the main server should also have access to the database server, since the system cannot be brought up from a dead stop without this access.
- The system will not start itself from a cold reboot at present. This is non-standard behavior, and is easily fixed. While I am not familiar with the specifics of how this is done on Ubuntu, every other unix style system I am aware of has a means to provide a local start and local stop function in the startup sequence as a way to automate special cases. In addition, it seems likely that Ubuntu would have a startup script for apache 1.3, perhaps an old one, that could be modified to work. Generally systems have some way to specify start order as well, which we need. This should be done.
- Someone needs to have a look at nginx and see exactly what it is doing. If nothing useful, it should be shut down. It is unusual to have two separate httpd servers running on the same machine, particularly if that machine is a production machine designed to handle one application. Unless it has a clear use, it is simply an additional load and possible security risk at this point.
- The start and stop procedures for slash need to be clearly documented somewhere. If no one objects, I will begin a new section on the wiki that has at least the very basics, at a level that someone not very familiar with the system can understand.
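The start-order point in the recommendations above can be illustrated with a short sketch. It assumes, as the incident suggested, that the database must be reachable before Apache starts and that Apache should be up before slashd. The function names, host, port and commands are all hypothetical placeholders, not the site's actual configuration.

```python
import socket
import subprocess


def database_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to the database host can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_in_order(steps, run=subprocess.call):
    """Run each (name, argv) step in sequence, stopping at the first failure.

    run must return the command's exit status (0 means success).
    Returns the names of the steps that started successfully.
    """
    started = []
    for name, argv in steps:
        if run(argv) != 0:
            break
        started.append(name)
    return started
```

A local start script could call `database_reachable("db.example.invalid", 3306)` first and, only if that succeeds, pass `start_in_order` a list such as `[("apache", ["/path/to/apachectl", "start"]), ("slashd", ["/path/to/slashd-init", "start"])]`. Both paths are placeholders.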
March 11, 2014
Summary. There was a report of a problem at http://soylentnews.org; logged-in users saw a comment count of zero (0) for the three newest stories on the main page. When a user went to the actual story, there were comments there. Separately, non-logged-in users failed to even see the most-recent 3 stories. In light of recent developments, credentials had been removed for many people; the one person who had access to everything was unavailable.
What went well:
- Generally followed procedures outlined in ICS (Incident Command System)
- Used private channel on http://irc.sylnt.us/ to communicate in real time.
- Incident Commander performed a .op and prefixed nick with "cmdr_" to clearly identify role.
- The main site stayed up as we attempted to solve the problem.
- Staff jumped in and offered their services.
- Provided some updates to the community via IRC on channel: #Soylent
- Focused on gathering data to identify problem and outline possible solutions.
- Performed confirmation after it appeared the problem was fixed, to ensure it actually was fixed.
- Requested feedback on lessons learned.
- Incident Commander admitted mistakes.
- There were some ruffled feathers, but all-in-all people worked well together.
- Established a follow-up to find underlying cause of problem and to document solution.
What did not go well:
- Did not use staff mailing list to inform all staff there was a problem.
- Incident Commander failed to recognize when a technical lead role was needed and to delegate a task leader for it.
- Key people who had domain knowledge and access were unavailable.
- Other people with the know-how to diagnose and fix the problem lacked the credentials they needed to do so.
- Lacked alternative means to contact key people. (e.g. phone numbers)
There is a fine core of dedicated professionals who genuinely want to see the site succeed. They rose to the occasion and strove to work together. We successfully diagnosed and solved the problem without causing further damage. We learned what happened when there was a failure to identify when a task leader was needed and to delegate appropriately. We successfully coordinated efforts from people who were distributed in multiple locations and time zones.
After Action Report
Upon looking at the slashd log, Dev staff confirmed that slashd died at almost the same time that the DNS was taken down in the previous issue. Dev staff believe that, because Slash relies on the fully qualified domain name for many URLs, it did not handle the loss of DNS very well and slashd died because of it. Our recommendation is to add the FQDN to the hosts file and to add some watchdog to the system to make sure slashd and other systems stay running. Given that this is an Ubuntu system, it should be possible to get Upstart to respawn it as a system service. It also seems probable that slashd died at the hands of the OOM killer, yet we are lacking the kernel logs for that, which might tell us if it was slashd or some other task that was the reason for the OOM state.
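The two recommendations above (an FQDN entry in the hosts file, and a watchdog for slashd) can be sketched roughly as follows. This is an illustration only: the function names are made up here, and a hand-rolled watchdog is a stand-in for, not a replacement of, having Upstart respawn the service.

```python
def hosts_has_entry(hosts_text, fqdn):
    """Return True if fqdn appears as a hostname in hosts-file text.

    Comments (anything after '#') are ignored, and the first field of
    each line is treated as the IP address.
    """
    wanted = fqdn.lower()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and wanted in (f.lower() for f in fields[1:]):
            return True
    return False


def watchdog_pass(is_running, restart):
    """One watchdog iteration: call restart() if is_running() is false.

    Both arguments are callables supplied by the caller. Returns True
    if a restart was attempted.
    """
    if is_running():
        return False
    restart()
    return True
```

In use, `hosts_has_entry(open("/etc/hosts").read(), "soylentnews.org")` would confirm the entry is in place, and `watchdog_pass` would be called from cron or a loop with callables that check for and start the slashd process.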
Summary: In migrating the service linode and remaining credit from the other 2 unused linodes to NCommander's account, the original account had to be closed. After closing the account and transferring over the svc. linode, somehow the DNS zone record was lost and nearly everything went down.
- Mailed staff via mailing list to notify of site going down
- Got in touch with dev team members via IRC
- Discovered no one available had access to the Linode manager interface
- Devs tried rebooting the production linode via root access
- This led to 503 errors
After much confusion and digging, we figured out what the problem was (the DNS zone records had gone missing with the account closure). I called Linode, had the file(s) transferred to NCommander's account and reinstated.
This caused problems with slashd (see above)
Devs: feel free to add to this log what we learned, how to prevent in the future.
Lack of Commander Incidents
Sat 2014-03-16:
17:59 <@janrinok> Sorry guys, but I've got to go look after my other half.
17:59 <@janrinok> .deop
17:59 -!- mode/#staff [-o janrinok] by what-if-i
18:04 -!- kobach [~nope@Soylent/Staff/IRC/kobach] has left #staff [nope]
18:05 -!- janrinok [~janrinok@Soylent/Staff/Editor/janrinok] has quit [Quit: leaving]
18:41 <+FatPhil> dammit
18:41 <+FatPhil> .op
Sun 2014-03-17:
00:18 <@FatPhil> "21:52 * FatPhil has itchy feet" That was 2 and a half hours ago - I'm cracking open a beer now
00:23 <@FatPhil> .deop
00:43 -!- You have been marked as being away