====== Orphaned QLOGIN Sessions ======

===== The Problem =====

A user connects to a cluster head node using SSH.  The user then opens an interactive session with a compute node by means of the ''qlogin'' command:

<code bash>
$ ssh mills.hpc

............................................................

    Mills cluster (mills.hpc.udel.edu)

    This computer system is maintained by University of
    Delaware IT.  Links to documentation and other online
    resources can be found at:

      http://docs.hpc.udel.edu/

    For support, please contact consult@udel.edu

............................................................

[(it_css:traine)@mills it_css]$ qlogin
Your job 69114 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 69114 has been successfully scheduled.
Establishing /opt/shared/GridEngine/local/qlogin_ssh session to host n010 ...
[traine@n010 it_css]$
</code>

This is illustrated in the following cartoon:

{{ :technical:gridengine:qlogin-connected.png |User establishes QLOGIN session}}

While this ''qlogin'' session is active, the user's computer somehow becomes disconnected from the cluster head node.  Some ways in which this can happen include:

  * user accidentally closes the SSH window
  * user puts his/her laptop to sleep
  * the network infrastructure between the user and the cluster experiences issues (e.g. wireless access point goes down)

This severs the interactive session between the user's computer and the cluster head node.  However, the ''qlogin'' session still exists between the cluster head node and compute node -- its network connection has not been disrupted.  Though the user cannot reestablish control of the ''qlogin'' session, the session will remain active indefinitely:

{{ :technical:gridengine:qlogin-disconnected.png |SSH connection broken, QLOGIN session orphaned}}

Orphaned ''qlogin'' sessions may be detrimental since they block queue slots and may consume system resources.


===== The Solution =====

UD IT has implemented a service that periodically checks for orphaned ''qlogin'' sessions.  Any new orphaned sessions that are identified result in an email being sent to the user who owns the session:

<code>
To: traine@udel.edu
Date: Fri, 24 Aug 2012 14:20:00 -0400 (EDT)
Subject: [Mills] Orphaned QLOGIN sessions detected
Reply-to: consult@udel.edu


This is an automated message from the Mills cluster.  You currently have 2 orphaned QLOGIN sessions running on the cluster head node.  These QLOGIN sessions will be automatically killed 5 days from now to reclaim the queue slots they occupy.  For more information on what an "orphaned QLOGIN session" means, please visit


 http://mills.hpc.udel.edu/wiki/technical/gridengine/qlogin-orphan


You can dispose of these QLOGIN sessions yourself using the method outlined on the cited webpage and the following information:


  ssh pid    GridEngine job information
 ---------- ------------------------------------------------------------
  27571      Job id: 62048 (shepherd pid: 32427 on n018)
  23081      Job id: 64143 (shepherd pid:  3925 on n045)


If you have an orphaned QLOGIN session running an active computational task and you do NOT want it to be killed automatically, reply to this message immediately.  Be sure to identify the session's process identifier (pid).  The message will be sent to the IT Support Center (consult@udel.edu).
</code>

Five days from the time this message was sent -- in this example, any time after Wednesday, August 29, 2:20 p.m. -- the two processes listed will be automatically killed and their resources reclaimed by Grid Engine.  The user can also manually kill any of the orphaned sessions before five days passes using the ''kill'' command:

<code bash>
[(it_css:traine)@mills it_css]$ kill -9 27571 23081
</code>

==== What's Running in an Orphan? ====

Users who receive an email like the one above may not know what programs are running within each orphaned session.  Ideally, when using ''qlogin'' to run a long computational task the ''-N'' flag can be used to set the job name to a short description of the task (versus the default name, ''QLOGIN''):
<code sh>
[(it_css:traine)@mills it_css]$ qlogin -N 'Gaussian, C6H6 opt+freq'
Your job 75136 ("Gaussian, C6H6 opt+freq") has been submitted
  :
[traine@n014 ~]$
</code>
Subsequent use of ''qstat'' will show the leading part of the assigned job name.  The full name can be displayed using
<code sh>
[(it_css:traine)@mills it_css]$ qstat -j 75136 | grep job_name
job_name:                   Gaussian, C6H6 opt+freq
</code>
to allow user to confirm the job's identity.

If the user neglected to leverage job naming, then the "shepherd pid" can be used //on the compute node// to display the processes associated with the orphan.  Assume that the user received an email notification which listed an orphan with the GridEngine information:  ''Job id: 75136 (shepherd pid:  3925 on n014)''.  Logging in to ''n014'' and using the ''pstree'' command on process 3925 shows all child processes associated with the shepherd: 
<code sh>
[(it_css:traine)@mills it_css]$ ssh n014 pstree -a -p 3925
sge_shepherd,3925 -bg
  `-sshd,3926
      `-sshd,3927
          `-bash,3929 /home/software/GridEngine/6.2u7/default/spool/n014/job_scripts/75136
              `-g09,3930 c6h6-opt-freq.com
</code>
Thus, this particular orphan is running the ''g09'' program with a single command line argument (''c6h6-opt-freq.com'').

==== Mitigating Orphaning using Screen ====

Users who must run a lengthy computation task via ''qlogin'' should mitigate the chance of their session's being orphaned by issuing the ''qlogin'' command inside a ''screen'' process:

{{ :technical:gridengine:qlogin-screened.png |Protecting a QLOGIN session inside SCREEN}}

The ''screen'' command acts as the controlling terminal for the ''qlogin'' and its SSH session and allows the user to disconnect from and reconnect to them at will.

<note important>If you did not use ''screen'' to protect your ''qlogin'' session and it contains an active computational task that you would prefer NOT be killed automatically, follow the directions in the last paragraph of the message and reply to it.  IT will mark the sessions you indicate so that they are not killed automatically; be sure to manually kill each once computation has completed!  If after two weeks any of the sessions are still active you will again be notified and must again request that the session not be killed, etc.</note>

==== Implementation Details ====

=== Identifying Orphans ===

Orphaned ''qlogin'' sessions are identified by their child ''/usr/bin/ssh'' process with the following criteria:

  * A parent process id (ppid) of 1
  * The optional presence of the X11-tunneling arguments, "-X -Y" 
  * The presence of a "-p [0-9]+" argument
  * A final argument of a compute node, "n[0-9]{3}"

All matching processes on the head node are deemed orphans.  Each process has a unique identifier constructed for it using its vital statistics (in hexadecimal):

<code>
uid:gid:port:pid:pgrp:sessionid:starttime

e.g.

  49B:49B:132D:100A:1009:1008:5F43D0
</code>

The chances of two ''qlogin'' orphans having all seven data points in common is extremely low.  Even should user X in group Y somehow manage to get the same TCP/IP ''port'', ''pid'', ''pgrp'', and ''sessionid'' assigned to the ''ssh'' process, the ''starttime'' monotonically increases between subsequent ''qlogin'' invocations and should be unique.  The ''starttime'' is relative to the system boot time, so it will roll over when the head node is rebooted -- but rebooting the head node will kill all ''qlogin'' sessions, anyway.

=== Maintaining State ===

An SQLite3 database (''/opt/etc/qlogin-orphan.sqlite3db'') is constructed which holds the data for orphans in the following table:

<code sql>
CREATE TABLE orphans (
  orphanId        CHARACTER VARYING(64) UNIQUE NOT NULL,
  uid             INTEGER NOT NULL,
  gid             INTEGER NOT NULL,
  tcpIpPort       INTEGER NOT NULL,
  pid             INTEGER NOT NULL,
  pgrp            INTEGER NOT NULL,
  sessionId       INTEGER NOT NULL,
  startTime       INTEGER NOT NULL,
  node            CHARACTER(4),
  jobId           TEXT,
  noAutoKill      INTEGER DEFAULT 0,
  created         TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  notified        TIMESTAMP,
  killed          TIMESTAMP
);
</code>

  * The ''noAutoKill'' is set by IT-NSS iff the owner indicates that the orphan is actively computing and should not be killed.
  * The ''notified'' timestamp is set once the owner has been emailed to notify him/her that the process will be killed in 5 days.
  * The ''killed'' timestamp is set once the ''notified'' timestamp is greater-than 5 days old AND the process has been issued a kill signal
  * Tuples in the table are removed after 14 days; any orphan not killed after 14 days will thus re-enter the cycle.

A session is marked for no automatic removal by setting its ''noAutoKill'' field in the database, e.g.

<code sql>
sqlite> UPDATE orphans SET noAutoKill=1 WHERE orphanId = '49B:49B:132D:100A:1009:1008:5F43D0'
</code>