Difference between revisions of "EmergencyProcedures"

From SoylentNews
Revision as of 22:11, 18 March 2014

SystemAdministration - parent

Why This is Here

In ALL emergency situations, the most qualified person to fix them should be sought out. These instructions are here to cover the hopefully VERY RARE case where that is not possible.

They are deliberately written at a very basic level, probably more basic than anyone with access to the servers would actually need. However, it is still useful to have a sort of checklist. The times when services are down tend to be stressful, and humans sometimes do not operate at their best when stressed. Plus, who knows? Maybe someday someone will have to walk the janitor through this...

What you need to do anything useful

You need ssh access to the machines affected, with the ability to get root privileges. Without this, there is no way to do anything useful to fix any problem with the system.

You must ssh to the appropriate machine, typically to a user account, and then sudo su - to root.

(This ability must be obtained in advance. For obvious reasons, such information cannot be placed on a public wiki.)

Slash basic description

The slash system consists of three separate components:

  • A MySQL database server - The database engine which holds most of the data such as articles, users, etc.
  • The Apache web server - which handles the web interface
  • The slashd daemon, which acts as a timed batch processor for various slash events and tasks.

In addition, the main server runs an http caching program called varnish and a cache program for the database called memcached. Both must also be running and functional.

  • Varnish - an http cache.
  • memcached - A database cache

For slash to work, all three components must be running, plus varnish and memcached if they are in use.

To tell if they are running, use the command 'pstree' which will show, in condensed form, the tasks that are running.

Here is what the result of the pstree command looks like on the main server:

# pstree
init─┬─atd
     ├─cron
     ├─dbus-daemon
     ├─httpd─┬─10*[httpd]
     │       └─sh
     ├─landscape-clien─┬─landscape-broke───2*[{landscape-broke}]
     │                 ├─landscape-manag
     │                 └─landscape-monit
     ├─linode-longview
     ├─login───bash
     ├─master─┬─pickup
     │        ├─qmgr
     │        └─tlsmgr
     ├─memcached───5*[{memcached}]
     ├─nginx───4*[nginx]
     ├─ntpd
     ├─rsyslogd───3*[{rsyslogd}]
     ├─slapd───4*[{slapd}]
     ├─sshd───sshd───sshd───bash───sudo───su───bash───pstree
     ├─sudo───slashd───3*[slashd]
     ├─udevd───2*[udevd]
     ├─upstart-socket-
     ├─upstart-udev-br
     └─varnishd───varnishd───20*[{varnishd}]

If you get 'garbage' as the output of the pstree command, use this form instead:

[root@slashcode init.d]# pstree -A
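
On a busy machine the full tree is long, and under stress it is easy to overlook a line. As a rough sketch, assuming the process names shown on this page, the ASCII output can be filtered down to the slash-related lines. The function name slash_related is made up for illustration and is not part of slash:

```shell
# Keep only the lines of a process tree that mention slash-related services.
# Reads pstree output on stdin and prints the matching lines unchanged.
slash_related() {
    grep -E 'httpd|slashd|varnishd|memcached|mysqld'
}

# Example (run on the server):
#   pstree -A | slash_related
```

This is a plain text match, so any other process whose name happens to contain one of these strings will show up as well.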

We can see that apache is running; it shows up as httpd:

     ├─httpd───10*[httpd]

And we can see that slashd is running:

     |-su---slashd---2*[slashd]

This one is slightly easy to miss, because it is running as a child of the su command (sudo in the full listing above), so slashd is not the first name on its line.

On the main server, Varnish must also be running:

|─varnishd───varnishd───20*[{varnishd}]

As well as memcached:

     ├─memcached───5*[{memcached}]

On the database machine for the main site, we would see that mysql is running. It shows up like this:

     ├─mysqld_safe───mysqld───37*[{mysqld}]


The exact numbers (10*[httpd] etc) are not important.

This is what a properly running system should look like on the main site. The database is on one machine, which must have mysql running, and the other machine must have apache (httpd), varnish, memcached, and slashd running.
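
The checklist above can also be sketched as a small shell function that reads pstree output and prints the name of every required process it cannot find. This is only an illustration: the function name missing_processes is hypothetical, and it does a simple substring match on the names used on this page.

```shell
# Print each required process name that does not appear in the
# process tree supplied on stdin. Prints nothing if all are present.
# Usage: pstree -A | missing_processes name [name...]
missing_processes() {
    tree=$(cat)                      # read the whole pstree output
    for name in "$@"; do
        case "$tree" in
            *"$name"*) ;;            # found, nothing to report
            *) printf '%s\n' "$name" ;;
        esac
    done
}

# Example on the main web machine:
#   pstree -A | missing_processes httpd slashd varnishd memcached
# Example on the database machine:
#   pstree -A | missing_processes mysqld
```

An empty result means every listed name was found in the tree. It does not prove that the service is actually answering requests.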

Important note - at the present time, you cannot fix either system by a reboot. Apache MUST be started by hand.

Primary site down

Log in to the main slash machine and obtain root.

Check process list:

root@soylent-www:/etc/varnish# pstree -A

You should see these in the process list, among other unrelated processes:

├─httpd───10*[httpd]

|-su---slashd---2*[slashd]

|─varnishd───varnishd───20*[{varnishd}]

Make sure that slashd, apache, and varnish are running, as described above.

If only varnishd is not running:

Restart varnish:

# /etc/init.d/varnish start

Check that it is actually running after a few seconds by looking at the task list again. (pstree)

If varnish will not restart, you will likely need to get help from someone familiar with its configuration.
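
The "start it, wait a few seconds, look again" step can be sketched as a small retry helper. This is an illustration only: the name wait_until is made up, and the example assumes the pgrep utility is installed on the machine.

```shell
# Retry a check once per second until it succeeds or the attempts run out.
# Returns 0 as soon as the check succeeds, 1 if it never does.
# Usage: wait_until <attempts> <command> [args...]
wait_until() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        if [ "$attempts" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# Example: start varnish, then give it up to ten seconds to appear.
#   /etc/init.d/varnish start
#   wait_until 10 pgrep -x varnishd && echo "varnishd is running"
```

The same helper works for the other services by swapping in the start command and process name given in the sections below.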

If only Apache is not running:

Start apache:

# /srv/soylentnews.org/apache/bin/apachectl start

Confirm it is running after a few seconds by looking at the process list again. (pstree)

If you get an error message about apache being unable to connect to the database, you will need to ssh into the database machine and restart the database, then come back and restart apache. Apache cannot start unless it can make a database connection.
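
Before switching machines, it may be worth checking from the web machine whether the database answers at all. The sketch below assumes the MySQL client tools (mysqladmin) are installed there. The function name db_answers and the host name db.example.org are placeholders, not the real configuration, and connection options such as a user or password may be needed depending on the local setup.

```shell
# Report whether the MySQL server on a given host answers a ping.
# MYSQLADMIN may be overridden; it defaults to the standard mysqladmin client.
db_answers() {
    host=$1
    if "${MYSQLADMIN:-mysqladmin}" --host="$host" ping >/dev/null 2>&1; then
        echo "database on $host is answering"
    else
        echo "database on $host is NOT answering"
        return 1
    fi
}

# Example, with a placeholder host name:
#   db_answers db.example.org
```

If the database does not answer, follow the steps under "If the database server is not running" before trying apache again.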

If only slashd is not running

Slashd cannot start unless Apache is running.

To start slashd:

# /etc/init.d/slash start

Confirm it is running after a few seconds by looking at the process list again. (pstree)

Note that the command above ends in slash, not slashd. Slash is the name of the script; it starts the daemon known as slashd.

If slashd will not start, make sure apache and the database are running.

You may see some errors when starting slashd. At the present time this is 'normal'. Provided it starts and seems to operate correctly, there is little that can be done about this until some code fixes are done.

If memcached is not running

Restart it:

# /etc/init.d/memcached start

Confirm it is running after a few seconds by looking at the process list again. (pstree)


If the database server is not running

You will need to ssh to the database machine.

Start mysql:

# /etc/rc.d/init.d/mysqld start

Confirm it is running after a few seconds by looking at the process list again. (pstree)

├─mysqld_safe───mysqld───37*[{mysqld}]

If it will not start, you will need additional help.

You -may- need to restart Apache after restarting the db.

On the main machine:

# /srv/soylentnews.org/apache/bin/apachectl stop
 wait about ten seconds, or check the process list to confirm that apache is fully stopped
# /srv/soylentnews.org/apache/bin/apachectl start

None of the above fixes work, or multiple problems

In this case it may be worth trying to bring all slash-related processes down and do a cold restart, in the proper order.

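Processes are usually stopped in the reverse of the order in which they start. On the main site the database is on a separate machine, so the mysqld command must be run there. To bring everything down: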

# /etc/init.d/slash stop
# /srv/soylentnews.org/apache/bin/apachectl stop
# /etc/init.d/varnish stop
# /etc/init.d/memcached stop
# /etc/init.d/mysqld stop

And then to bring them all back up:

# /etc/init.d/mysqld start
# /etc/init.d/memcached start
# /etc/init.d/varnish start
# /srv/soylentnews.org/apache/bin/apachectl start
# /etc/init.d/slash start

If none of this works, you will need help from the assigned sysadmin for the machine.
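
The ordering rule above (stop in the reverse of the start order) can be sketched as a script that only prints the commands, so the plan can be read before anything is run. The paths are the main-site paths listed on this page. The names START_ORDER, service_cmd and restart_plan are made up for illustration.

```shell
# Services in the order they must be started on the main site.
START_ORDER="mysqld memcached varnish apache slash"

# Print the command line for one service and one action (start or stop).
service_cmd() {
    case "$1" in
        apache) echo "/srv/soylentnews.org/apache/bin/apachectl $2" ;;
        *)      echo "/etc/init.d/$1 $2" ;;
    esac
}

# Print the full cold-restart plan: stops in reverse order, then starts.
# Nothing is executed; the output is meant to be read or pasted by hand.
restart_plan() {
    reversed=""
    for svc in $START_ORDER; do
        reversed="$svc $reversed"     # prepend to build the reverse order
    done
    for svc in $reversed; do
        service_cmd "$svc" stop
    done
    for svc in $START_ORDER; do
        service_cmd "$svc" start
    done
}

# Example:
#   restart_plan
```

Keeping the start order in one place means the stop order cannot drift out of step with it. On the main site the mysqld lines still have to be run on the database machine.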

Slashcott Development site down

There are just two differences between this machine and the main site:

  • Things are in different locations
  • The mysql database server is on the same machine as the site

This is a non-critical machine, so I am just going to list the commands for reference.

Stop Services:

# /etc/rc.d/init.d/slash stop
# /usr/local/apache/bin/apachectl stop
# /etc/rc.d/init.d/varnish stop
# /etc/rc.d/init.d/mysqld stop

And then to bring them all back up:

# /etc/rc.d/init.d/mysqld start
# /etc/rc.d/init.d/varnish start
# /usr/local/apache/bin/apachectl start
# /etc/rc.d/init.d/slash start