<booktoc>
====== GECO Core Library ======

At the core of GECO is a shared library containing APIs used to:

  * Create and manage per-job CGroup containers
  * Manage GECO state information for running jobs
  * Load, serialize, and deserialize per-job granted resource information (e.g. core counts, memory limits, coprocessor hardware)
  * Manage a "runloop" that handles timed execution of tasks and I/O from one or more sources
  * Send/receive "quarantine" messages
  * Output timestamped, classed informational messages (à la syslog)

This document provides a rudimentary summary of these services.  The source code header files include more extensive API documentation.

===== GECOCGroup =====

Functions in this API are used to initialize/deinitialize GECO's top-level CGroup containers, e.g.

  * ''/cgroup/cpuset/GECO''
  * ''/cgroup/memory/GECO''

The subsystems included in this list can vary based upon which subsystems are configured to be "managed" by GECO.  By default, the ''cpuset'' and ''memory'' subsystems are managed.

This API also includes functions to intelligently allocate //N// CPU cores to a job.  The [[https://www.open-mpi.org/projects/hwloc/|hwloc]] library is used to select cores, and GECOCGroup internally maintains a bitmap indicating processor core assignment/availability.  The API also has the ability to reset that bitmap by scanning all active job ''cpuset'' CGroup containers.
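
The bookkeeping described above can be sketched in a few lines of Python.  This is only an illustration and not the GECOCGroup implementation: the names ''CoreMap'', ''parse_cpu_list'' and ''format_cpu_list'' are hypothetical, and the "lowest-numbered free cores" policy is a stand-in for the topology-aware selection that hwloc provides.  The sketch shows how an availability map can be rebuilt from the ''cpuset.cpus'' text of each active job and then used to hand out //N// cores.

```python
def parse_cpu_list(text):
    """Parse a cpuset.cpus style list such as "0-3,8" into a set of core ids."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores


def format_cpu_list(cores):
    """Format a set of core ids back into the compact range form."""
    ordered = sorted(cores)
    ranges = []
    i = 0
    while i < len(ordered):
        j = i
        # Extend the current run while the next id is consecutive.
        while j + 1 < len(ordered) and ordered[j + 1] == ordered[j] + 1:
            j += 1
        ranges.append(str(ordered[i]) if i == j else f"{ordered[i]}-{ordered[j]}")
        i = j + 1
    return ",".join(ranges)


class CoreMap:
    """Tracks which cores exist and which are assigned to jobs (hypothetical)."""

    def __init__(self, available):
        self.available = parse_cpu_list(available)
        self.allocated = set()

    def rescan(self, job_cpu_lists):
        """Rebuild the allocated set from the cpuset.cpus text of each active job."""
        self.allocated = set()
        for text in job_cpu_lists:
            self.allocated |= parse_cpu_list(text)

    def allocate(self, n):
        """Reserve the n lowest-numbered free cores; return None if too few remain."""
        free = sorted(self.available - self.allocated)
        if len(free) < n:
            return None
        chosen = set(free[:n])
        self.allocated |= chosen
        return format_cpu_list(chosen)
```

A caller would construct ''CoreMap("0-19")'' for a 20-core node, call ''rescan()'' with the contents of each job's ''cpuset.cpus'' file, and write the string returned by ''allocate()'' into the new job's container.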

===== GECOResource =====

This API scans the output from ''qstat -xml -j #.#'' to produce an object containing job resource properties of interest to GECO.  This includes:

  * Per-node
    * Slot (core) count
    * Memory limits (physical, virtual)
    * List of granted coprocessors (GPU, Phi)
  * Owner (uid/gid)
  * Working directory
  * Array job? Standby job?
  * Trace level (for logging GECO messages to ''/opt/geco'')
  * Walltime limit

The API examines ''qstat'' output read from a file descriptor; a helper function is included to execute the ''qstat'' command internally.

A pair of functions define a serialization mechanism whereby the resulting GECOResource object can be written to a file and later reconstituted in memory.  This feature eliminates the need for repeated execution of expensive ''qstat'' queries and significantly increases the efficiency of the GECO tools.  The files use a simple fixed, typed format that is human-readable but easily parsed by the runtime environment.  The fields are enclosed in curly braces that follow a "magic" prefix containing the version of the format:
<code>
GECOResourceSet_v1{...}
</code>
Inside the curly braces, fields are separated by commas and are typed according to the prefixes:
^Prefix^Data type^
|''i''|Standard-width (32-bit) integer|
|''li''|Long (64-bit) integer|
|''lf''|Double-precision (long) floating-point|
|''b''|Binary (1 or 0)|
|''s#:''|Fixed-length string of bytes; the ''#'' is an integer indicating the length of the byte string|
Global fields (applying to every node in the job) appear first and in this order:
^Type^Description^
|''li''|Grid Engine job id|
|''li''|Grid Engine task id|
|''lf''|Walltime limit (in seconds)|
|''b''|Job is standby?|
|''lf''|Per-slot virtual memory limit|
|''i''|Trace level|
|''i''|Node count|
|''b''|Job is an array job?|
|''b''|Phi should be booted for job owner access?|
|''s#:''|Grid Engine job name|
|''s#:''|Job owner user name|
|''s#:''|Job owner group name|
|''s#:''|Working directory|
Following these global fields are resource lists for each participating node:
<code>
...,s4:n000{...}
</code>
Inside the per-node curly braces, fields are (in order):
^Type^Description^
|''b''|Is slave node?|
|''i''|Slot (core) count|
|''lf''|Physical memory limit|
|''lf''|Virtual memory limit|
|''s#:''|GPU coprocessor list|
|''s#:''|Phi coprocessor list|
For example, a job running across two nodes using 32 cores in total (20 on ''n000'' and 12 on ''n003'') and a single Phi on each node might look like:
<code>
GECOResourceSet_v1{li3324,li1,lf0,b0,lf1000000000,i0,i2,b0,b1,s7:My test,s4:frey,s6:it_nss,s10:/home/1001,s4:n000{b0,i20,lf2000000000,lf20000000000,s0:,s4:mic0},s4:n003{b1,i12,lf1200000000,lf12000000000,s0:,s4:mic1}}
</code>
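
To make the field layout concrete, here is a minimal Python sketch of a reader for the format described above.  It is not the library's deserializer: ''parse_resource_set'' and the dictionary key names are hypothetical, and the sketch assumes ASCII content so that string lengths in bytes and in characters coincide.

```python
GLOBAL_FIELDS = ["job_id", "task_id", "walltime", "is_standby", "per_slot_vmem",
                 "trace_level", "node_count", "is_array", "boot_phi",
                 "job_name", "owner", "group", "workdir"]
NODE_FIELDS = ["is_slave", "slots", "mem_limit", "vmem_limit", "gpus", "phis"]


def parse_resource_set(text):
    """Parse a GECOResourceSet_v1{...} string into nested dicts (illustrative)."""
    magic = "GECOResourceSet_v1{"
    if not text.startswith(magic):
        raise ValueError("unrecognized format or version")
    pos = len(magic)

    def expect(ch):
        nonlocal pos
        if text[pos:pos + 1] != ch:
            raise ValueError(f"expected {ch!r} at offset {pos}")
        pos += 1

    def read_field():
        nonlocal pos
        if text.startswith("s", pos):
            # s#: is length-prefixed, so the value may contain commas or braces.
            colon = text.index(":", pos)
            length = int(text[pos + 1:colon])
            value = text[colon + 1:colon + 1 + length]
            pos = colon + 1 + length
            return value
        end = pos
        while end < len(text) and text[end] not in ",{}":
            end += 1
        token, pos = text[pos:end], end
        if token.startswith("li"):
            return int(token[2:])
        if token.startswith("lf"):
            return float(token[2:])
        if token.startswith("b"):
            return token[1:] == "1"
        if token.startswith("i"):
            return int(token[1:])
        raise ValueError("unknown field type: " + token)

    job = {}
    for i, key in enumerate(GLOBAL_FIELDS):
        if i:
            expect(",")
        job[key] = read_field()

    job["nodes"] = {}
    for _ in range(job["node_count"]):
        expect(",")
        name = read_field()          # node name, e.g. s4:n000
        expect("{")
        node = {}
        for i, key in enumerate(NODE_FIELDS):
            if i:
                expect(",")
            node[key] = read_field()
        expect("}")
        job["nodes"][name] = node
    expect("}")
    return job
```

Because strings carry an explicit length, the reader never has to escape or search for delimiters inside a job name or working directory, which is presumably why the format uses ''s#:'' rather than quoting.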

===== GECOQuarantine =====

The GECOQuarantine API implements a socket-based protocol for the ''LD_PRELOAD'' library to request process quarantine.  The server- and client-side functionality are implemented together, not in separate APIs.

The kind of socket used by the API is communicated to the API textually.  The socket type leads the string just as a scheme leads off a URL:
^Scheme^Socket type^Description of payload^
|''service:''|IPv4, 127.0.0.1|A named TCP/IP service, e.g. "''nameserver''" or "''smtp''"|
|''port:''|IPv4, 127.0.0.1|Numerical TCP/IP port, e.g. "''9091''"|
|''path:''|Unix domain|File path, e.g. "''/var/run/gecod.s''"|
| |Inferred|A number (e.g. "''9091''") produces an IPv4 loopback socket on that port; a string starting with "''/''" produces a Unix domain socket at the given path|

The following socket descriptions are all equivalent and produce a socket associated with port 8080 on the loopback interface:
  * ''service:http-alt''
  * ''service:8080''
  * ''port:8080''
  * ''8080''
Likewise for Unix domain sockets, the following are equivalent:
  * ''/var/run/gecod.s''
  * ''path:/var/run/gecod.s''
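
The scheme rules in the table can be expressed as a short Python sketch.  The function name ''parse_socket_description'' is hypothetical and the real API is written in C; the sketch only mirrors the documented behaviour, resolving named services through the system services database and assuming a POSIX platform where Unix domain sockets exist.

```python
import socket

LOOPBACK = "127.0.0.1"


def parse_socket_description(desc):
    """Return (address family, address) for a textual socket description."""
    if desc.startswith("path:"):
        return socket.AF_UNIX, desc[len("path:"):]
    if desc.startswith("port:"):
        return socket.AF_INET, (LOOPBACK, int(desc[len("port:"):]))
    if desc.startswith("service:"):
        name = desc[len("service:"):]
        # A numeric service is taken as a port; otherwise look the name up.
        port = int(name) if name.isdigit() else socket.getservbyname(name, "tcp")
        return socket.AF_INET, (LOOPBACK, port)
    # No scheme given: infer the socket type from the payload itself.
    if desc.startswith("/"):
        return socket.AF_UNIX, desc
    if desc.isdigit():
        return socket.AF_INET, (LOOPBACK, int(desc))
    raise ValueError("unrecognized socket description: " + desc)
```

The returned pair can be passed to ''socket.socket(family, socket.SOCK_STREAM)'' and then to ''bind()'' or ''connect()'', so the same description string serves both the server and the client side.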

The API includes functions that generate opaque objects representing the messages to be sent between the two processes.  Messages are cryptographically signed to better ensure their integrity.  There are currently two messages defined.

==== Job started ====

The ''LD_PRELOAD'' library sends a //job started// message containing a pid and the Grid Engine job and task id associated with it.

==== Acknowledge job started ====

On receipt of a //job started// message, ''gecod'' returns to the ''LD_PRELOAD'' library an //acknowledge job started// message containing the Grid Engine job and task id and a boolean indicating whether or not quarantine was successful.
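
The following Python sketch illustrates how a signed //job started// message could be built and verified.  Everything about the wire layout here is an assumption made for illustration: the command code, the field widths, the byte order, the shared key, and the use of HMAC-SHA256 are not taken from the GECOQuarantine source.  The sketch only demonstrates the general technique of attaching a message authentication code so the receiver can reject altered messages.

```python
import hashlib
import hmac
import struct

JOB_STARTED = 1  # hypothetical command code


def pack_job_started(key, pid, job_id, task_id):
    """Build header + payload + MAC for a job-started message (assumed layout)."""
    payload = struct.pack("!qqq", pid, job_id, task_id)
    header = struct.pack("!II", JOB_STARTED, len(payload))
    mac = hmac.new(key, header + payload, hashlib.sha256).digest()
    return header + payload + mac


def unpack_job_started(key, message):
    """Return (pid, job_id, task_id), or None if the message fails verification."""
    header, rest = message[:8], message[8:]
    if len(header) < 8:
        return None
    command, length = struct.unpack("!II", header)
    payload, mac = rest[:length], rest[length:]
    expected = hmac.new(key, header + payload, hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if command != JOB_STARTED or not hmac.compare_digest(mac, expected):
        return None
    return struct.unpack("!qqq", payload)
```

An acknowledgement message could be built the same way with a different command code and a payload holding the job id, task id, and success flag.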

===== GECOLog =====

The GECO tools are written to provide as much information as possible to the user or system administrator using them.  Throughout the code there are messages of varying //verbosity level// produced.  Any GECO program can alter the maximum verbosity of the GECOLog API and thus filter out messages at higher levels.  The levels (in increasing order of verbosity) are:

  * Quiet
  * Error
  * Warn
  * Info
  * Debug

At the lowest level (Quiet) the GECOLog API will not output any messages.  At the Info and Debug levels the API will produce extensive messages that are meant to aid in tracing execution.  The API defaults to the Error verbosity level.

The format of the log lines can be modified at any time to include any of the following:

  * Timestamp
  * Process pid
  * Verbosity level label

The API can also be configured to send messages to the syslog facility.

An example that includes all three fields:

<code>
2015-11-19T13:56:46-0500 [28152|WARN ]: !! All resource information must be pre-populated for jobs since qstat use is disabled !!
2015-11-19T13:56:46-0500 [28152|WARN ]: !! Per-job data should be serialized to /opt/geco/resources/<jobid>.<taskid> using geco-rsrcinfo !!
  :
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOQuarantine.c:856) received quarantine command = { command=1, dataLen=24, mac=DFFAAC2B1B8ACE1AF1580291D0C182926B035A4E30CC60ADBFDBD791D3FBFF6F }
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOJob.c:395) loading resource information for 366474.27 from /opt/geco/resources/366474.27
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:456) created /cgroup/cpuset/GECO/366474.27
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1068) succeeded reading available cpuset.cpus from /cgroup/cpuset/GECO/cpuset.cpus
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1073) scanning /cgroup/cpuset/GECO for existing cpuset.cpus allocations
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1104)   allocated cpuset.cpus = 
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1109)   available cgroup.cpus = 0-19
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1212) optimal cgroup.cpus calculated as 0 (1)
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOJob.c:721) 1 core allocated to 366474.27
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOJob.c:724)   => 0
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1333) copied /cgroup/cpuset/GECO/cpuset.mems to /cgroup/cpuset/GECO/366474.27/cpuset.mems
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1358) set /cgroup/cpuset/GECO/366474.27/cpuset.cpu_exclusive to 1
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOJob.c:740) 366474.27 successfully bound to allocated cpuset
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:456) created /cgroup/memory/GECO/366474.27
</code>
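
The level filtering and the optional line fields can be approximated with a short Python sketch.  The class ''Log'' and its parameters are hypothetical, not the GECOLog API; the padded ''WARN '' and ''INFO '' labels follow the sample output above, while the ''ERROR'' and ''DEBUG'' labels and the layout used when fields are disabled are assumptions.

```python
import os
import sys
import time

QUIET, ERROR, WARN, INFO, DEBUG = range(5)
LABELS = {ERROR: "ERROR", WARN: "WARN ", INFO: "INFO ", DEBUG: "DEBUG"}


class Log:
    """Verbosity-filtered logger with optional timestamp, pid, and level label."""

    def __init__(self, level=ERROR, timestamp=True, pid=True, label=True):
        self.level = level
        self.timestamp = timestamp
        self.pid = pid
        self.label = label

    def enabled(self, level):
        """A message is emitted only if its level is within the current maximum."""
        return QUIET < level <= self.level

    def format(self, level, message, when=None, pid=None):
        prefix = []
        if self.timestamp:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S%z", when or time.localtime())
            prefix.append(stamp)
        tags = []
        if self.pid:
            tags.append(str(os.getpid() if pid is None else pid))
        if self.label:
            tags.append(LABELS[level])
        if tags:
            prefix.append("[" + "|".join(tags) + "]")
        return (" ".join(prefix) + ": " if prefix else "") + message

    def write(self, level, message, stream=sys.stderr):
        if self.enabled(level):
            print(self.format(level, message), file=stream)
```

With the default ''ERROR'' maximum, a call such as ''Log().write(INFO, "...")'' produces no output, which matches the filtering behaviour described for the verbosity levels.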

===== GECORunloop =====

The GECO daemon uses a //runloop// to handle the multiplexing of event/data sources.  The daemon will typically be waiting for data coming from:

  * Patched ''exec()'' functions in an ''sshd'' or ''sge_shepherd'' daemon requesting process quarantine
  * Kernel netlink notification of process state changes (fork/exec/exit)
  * Kernel notification of out-of-memory process kills

Each one of these //polling sources// is a separate socket interface which must be checked for data.  The daemon registers each socket with the runloop, together with a set of callback functions that react to events occurring on it.

A runloop can also have an arbitrary number of //observers// registered with it.  An //observer// is a function that will be called each time the runloop transitions from one state to another.  The state changes that trigger a call are configured per-observer when it is registered with the runloop:

  * entering runloop
  * before entering polling
  * after polling exits
  * before processing polling sources
  * after processing polling sources
  * exiting runloop

By abstracting the socket-polling mechanism, the daemon becomes a single-threaded loop that invokes ''epoll_wait()'' with an appropriate timeout (the //granularity// of the runloop, which defaults to 60 seconds).
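
The structure of such a loop can be sketched in Python.  This is not the GECORunloop API: the class ''Runloop'', its method names, and the state constants are hypothetical, and the standard ''selectors'' module is used in place of a direct ''epoll_wait()'' call (on Linux its default selector is backed by epoll).  The sketch shows polling sources with callbacks, observers filtered by a per-observer state mask, and the granularity used as the polling timeout.

```python
import selectors
import socket

# One bit per runloop state so an observer can subscribe to any combination.
ENTRY, BEFORE_POLL, AFTER_POLL, BEFORE_SOURCES, AFTER_SOURCES, EXIT = (
    1 << i for i in range(6)
)


class Runloop:
    def __init__(self, granularity=60.0):
        self.selector = selectors.DefaultSelector()
        self.granularity = granularity   # polling timeout in seconds
        self.observers = []
        self.running = False

    def add_source(self, fileobj, callback):
        """Register a polling source; callback(runloop, fileobj) runs when readable."""
        self.selector.register(fileobj, selectors.EVENT_READ, callback)

    def add_observer(self, states, callback):
        """Register callback(runloop, state) for the states selected by the mask."""
        self.observers.append((states, callback))

    def _notify(self, state):
        for states, callback in self.observers:
            if states & state:
                callback(self, state)

    def stop(self):
        self.running = False

    def run(self):
        self.running = True
        self._notify(ENTRY)
        while self.running:
            self._notify(BEFORE_POLL)
            ready = self.selector.select(timeout=self.granularity)
            self._notify(AFTER_POLL)
            self._notify(BEFORE_SOURCES)
            for key, _events in ready:
                key.data(self, key.fileobj)   # key.data holds the source callback
            self._notify(AFTER_SOURCES)
        self._notify(EXIT)
```

Because every source is serviced from the same loop, callbacks never run concurrently and need no locking; the trade-off is that a slow callback delays all other sources until it returns.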