THIS PAGE IS NOW OBSOLETE

  • It was too specific to the old Solaris 8 environment running an old version of University of Washington IMAP.

Resolving a type of crisis on mail.cs.uwaterloo.ca

The mail.cs server provides both an SMTP service for clients to send messages and IMAP/POP services for clients to fetch their email.

Once in a while, mail.cs gets overwhelmed with IMAP processes and stops accepting messages. This state should be corrected as soon as it is noticed. A suggested procedure for documenting and correcting the problem is:

  • record current load average and process listing in ST (see Tools below for a good procedure)
    • of particular interest are large numbers of processes attributed to a single userid, and multiple processes (determined by fuser) accessing files under /var/mail
  • disable IMAP and POP services (after making a backup copy, edit /etc/inetd.conf to comment out the four imap/pop3d lines, then HUP the inetd process, e.g. sudo killall -s HUP inetd or sudo kill -HUP `ps -C inetd -o pid=`)
  • wait for the system to quiesce (return to low single digit load average)
    • kill off IMAP and POP processes only if the system does not recover in a few minutes
  • enable IMAP and POP services (restore /etc/inetd.conf from the backup copy made earlier, then HUP the inetd process)
  • monitor the system for several minutes, looking for a recurrence of the problem
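
The disable and restore steps above can be sketched as a pair of shell functions. This is only a sketch under stated assumptions: the function names are hypothetical, and it assumes the four service lines in /etc/inetd.conf begin with "imap" or "pop", which should be checked against the real file first. Sending HUP to inetd is left as a manual step, as described in the procedure.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical; the "^imap" / "^pop"
# patterns are an assumption about how the service lines begin.

# Back up an inetd.conf-style file, then comment out its IMAP/POP lines.
disable_mail_services() {
    conf=$1
    cp -p "$conf" "$conf.bak" || return 1   # backup copy first
    # Read from the backup, write the edited version over the original.
    sed -e 's/^imap/#&/' -e 's/^pop/#&/' "$conf.bak" > "$conf"
}

# Put the backup copy back in place.
restore_mail_services() {
    conf=$1
    cp -p "$conf.bak" "$conf"
}

# Possible usage (run as root, then HUP inetd by hand each time):
#   disable_mail_services /etc/inetd.conf
#   ... wait for the load average to drop ...
#   restore_mail_services /etc/inetd.conf
```

Keeping the HUP separate means the edited file can be inspected (for example with diff against the .bak copy) before inetd rereads it.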

As a follow-up, contact the people who had active connections at the time of the SMTP outage, particularly those with a large inbox (/var/mail/userid) at the time of the problem, to see whether they need help managing their inbox or with ThunderbirdConnectionTimeout.

Looking for what happened to a particular piece of mail

Log files

On most systems, logs are kept in /var/log/syslog. Older copies of the logs are usually kept as compressed files in the same directory, e.g. syslog.2.gz.

Note: Outbound mail is logged on mail.cs; inbound mail is logged on mx100.cs.

To look for email to or from a particular userid, e.g. jdoe, use grep for the current log and zgrep for the compressed ones:

# ssh mail.cs (or mx100.cs for inbound email)
# cd /var/log
# grep jdoe syslog*
# zgrep jdoe *.gz
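
The two searches above could be combined into one helper that walks both the current and the rotated logs. The following is a minimal sketch: the function name search_mail_logs is hypothetical, and it assumes the logs are named syslog* in the given directory, with rotated copies compressed by gzip.

```shell
#!/bin/sh
# Sketch only. search_mail_logs is a hypothetical helper name.
# Usage: search_mail_logs USERID [LOGDIR]

search_mail_logs() {
    user=$1
    logdir=${2:-/var/log}
    for f in "$logdir"/syslog*; do
        [ -f "$f" ] || continue
        # Choose how to read the file: decompress .gz, cat anything else.
        case $f in
            *.gz) reader='gzip -dc' ;;
            *)    reader='cat' ;;
        esac
        # Print a header naming the file, then its matching lines, if any.
        if matches=$($reader "$f" | grep -- "$user"); then
            printf '== %s\n%s\n' "$f" "$matches"
        fi
    done
}

# Possible usage, on mail.cs (or mx100.cs for inbound email):
#   search_mail_logs jdoe
```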

Tools

Sorted count of processes per user

mail.cs% ps -adelf | awk '{print $3}' | sort | uniq -c | sort -nr
  71 root
   6 mgandalf
   5 ptook
   4 fbaggins
   4 mbrandyb
   4 sgamgee
[...]

In a crisis situation, you will sometimes see tens of processes running for a single user.
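
To pick out only the suspicious users from the listing above, the count can be filtered against a threshold. This is a sketch: the function name users_over_threshold and the default limit of 10 are assumptions, not values taken from this page. It reads ps -adelf style output on standard input and, like the command above, takes the userid from the third column.

```shell
#!/bin/sh
# Sketch only. users_over_threshold and the default limit are hypothetical.
# Reads "ps -adelf"-style output on stdin and prints "count userid" for
# every userid owning more than LIMIT processes, largest count first.

users_over_threshold() {
    limit=${1:-10}
    awk -v limit="$limit" '
        NR > 1 { n[$3]++ }                  # skip the header, count per userid
        END    { for (u in n) if (n[u] > limit) print n[u], u }
    ' | sort -nr
}

# Possible usage:
#   ps -adelf | users_over_threshold 10
```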

Sorted count of multiple accesses per mailbox

For each user with active processes, run the "fuser" command on that user's mailbox, and show the output only for mailboxes that more than one process has open.

mail.cs% cd /var/mail
mail.cs% ps -adelf | awk '{print $3}' | sort | uniq | \
           sh -c 'xargs /usr/sbin/fuser 2>&1' | \
           grep -v ': No such file or directory' | \
           grep '  *[^ ][^ ]*  *[^ ][^ ]*'
mgandalf:    10632o    1808o
fbaggins:    11225o   28655o

In a crisis situation, you will nearly always see very many processes reading a single mailbox.

arpepper:    12160o   12159o   12158o   12157o   12156o   12155o   12154o   12153o
             12152o   12151o   12150o   12149o   12148o   12147o   12146o   12145o
             12144o   12143o   12142o   12141o   12140o   12139o   12138o   12137o
             12136o   12135o   12134o   12133o   12132o   12131o   12130o   12129o
             12128o   12127o   12126o   12125o   12124o   12123o   12122o   12121o
             12120o   12119o   12118o   12117o   12116o   12115o   12114o   12113o
             12112o   12111o

It can be reassuring to watch the list decline.

cscf.cs% cd /var/mail
cscf.cs% fuser arpepper
arpepper:    12160o   12159o   12158o   12157o   12156o   12155o   12154o   12153o
             12152o   12151o   12150o   12149o   12148o   12147o   12146o   12145o
             12144o   12143o   12142o   12141o   12140o   12139o   12138o   12137o
             12136o   12135o   12134o   12133o   12132o   12131o
cscf.cs% fuser arpepper
arpepper:    12160o   12159o   12158o   12157o   12156o   12155o   12154o   12153o
             12152o   12151o   12150o   12149o   12148o   12147o   12146o   12145o
             12144o   12143o   12142o   12141o   12140o   12139o   12138o   12137o
cscf.cs% 

In some cases, it can take a few minutes for each process to disappear.
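
Rather than rerunning fuser by hand, the decline can be watched with a small loop. This is a sketch with hypothetical function names and an assumed default interval of 10 seconds. It relies on fuser writing the process IDs to standard output and the file name and access letters to standard error, as POSIX specifies, so counting the words on standard output gives the number of processes.

```shell
#!/bin/sh
# Sketch only. count_openers and watch_mailbox are hypothetical names.

# Print how many processes have the given file open.
count_openers() {
    # Some fuser versions exit non-zero when nothing has the file open;
    # that exit status is ignored here.
    { fuser "$1" 2>/dev/null || true; } | wc -w | tr -d ' '
}

# Report the count every INTERVAL seconds until at most one process remains.
watch_mailbox() {
    mailbox=$1
    interval=${2:-10}
    while :; do
        n=$(count_openers "$mailbox")
        printf '%s %s processes on %s\n' "$(date '+%H:%M:%S')" "$n" "$mailbox"
        [ "$n" -le 1 ] && break
        sleep "$interval"
    done
}

# Possible usage:
#   cd /var/mail && watch_mailbox arpepper 10
```

Note that fuser may need root privileges to see processes owned by other users.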

When you re-enable IMAP, watch to see whether the problem starts to recur for the same mailboxes.

How to find out if a user has been using IMAP

For example, on mail.cs, looking for user imauser:

cscf.cs>2# ssh -x mail.cs
Last login: Fri Jun 17 11:32:20 2011 from cscf.cs.uwaterl

services116.cs# cd /var/log

services116.cs# zgrep user=imauser syslog* | egrep 'ipop3d|imapd' | wc
       0       0       0

services116.cs# zgrep imauser syslog* | wc
       0       0       0 

Topic revision: r12 - 2013-10-01 - AdrianPepper
 