
GECO Core Library

At the core of GECO is a shared library containing APIs used to:

  • Create and manage per-job CGroup containers
  • Manage GECO state information for running jobs
  • Load, serialize, and deserialize per-job granted resource information (e.g. core counts, memory limits, coprocessor hardware)
  • Manage a "runloop" that handles timed execution of tasks and I/O from one or more sources
  • Send/receive "quarantine" messages
  • Output timestamped, classed informational messages (à la syslog)

This document provides a rudimentary summary of these services. The source code header files include more extensive API documentation.

Functions in the GECOCGroup API are used to initialize and deinitialize GECO's top-level CGroup containers, e.g.

  • /cgroup/cpuset/GECO
  • /cgroup/memory/GECO

The containers in this list vary according to which CGroup subsystems are configured to be "managed" by GECO. By default, the cpuset and memory subsystems are managed.

This API also includes functions to intelligently allocate N CPU cores to a job. The hwloc library is used to select cores, and GECOCGroup internally maintains a bitmap indicating processor core assignment/availability. The API also has the ability to reset that bitmap by scanning all active job cpuset CGroup containers.
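The bitmap bookkeeping described above can be sketched in a few lines. This is only an illustration, not the library's code: the function names are hypothetical, Python sets stand in for the bitmap, and the hwloc-driven core selection is replaced by a naive "lowest-numbered free cores" rule. The only real format used is the kernel's cpuset list notation (e.g. "0-3,8").

```python
def parse_cpu_list(text):
    """Parse a kernel cpuset list such as "0-3,8" into a set of core ids."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores


def format_cpu_list(cores):
    """Format a set of core ids back into the compact "0-3,8" notation."""
    runs = []
    for core in sorted(cores):
        if runs and core == runs[-1][1] + 1:
            runs[-1][1] = core          # extend the current run
        else:
            runs.append([core, core])   # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


def rebuild_allocation(parent_cpus, job_cpus):
    """Recompute (allocated, available) from the parent list and per-job lists.

    parent_cpus: contents of the top-level cpuset.cpus file
    job_cpus:    contents of each active job's cpuset.cpus file
    """
    allocated = set()
    for text in job_cpus:
        allocated |= parse_cpu_list(text)
    available = parse_cpu_list(parent_cpus) - allocated
    return allocated, available


def allocate_cores(available, count):
    """Assumption: stand-in for hwloc selection; takes the lowest free cores."""
    if count > len(available):
        raise ValueError("not enough free cores")
    return set(sorted(available)[:count])
```

For instance, with a parent list of "0-19" and active jobs holding "0-3" and "8", the rebuilt availability is "4-7,9-19", and a two-core request under the naive rule receives cores 4 and 5.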

The GECOResource API scans the output of qstat -xml -j #.# to produce an object containing the job resource properties of interest to GECO. These include:

  • Per-node
    • Slot (core) count
    • Memory limits (physical, virtual)
    • List of granted coprocessors (GPU, Phi)
  • Owner (uid/gid)
  • Working directory
  • Array job? Standby job?
  • Trace level (for logging GECO messages to /opt/geco)
  • Walltime limit

The API examines qstat output read from a file descriptor; a helper function is included to execute the qstat command internally.

A pair of functions define a serialization mechanism whereby the resulting GECOResource object can be written to a file and later reconstituted in memory. This feature eliminates the need for repeated execution of expensive qstat queries and significantly increases the efficiency of the GECO tools. The files use a simple fixed, typed format that is human-readable but easily parsed by the runtime environment. The fields are enclosed by a "magic" prefix that contains the version of the format:

GECOResourceSet_v1{...}

Inside the curly braces, fields are separated by commas and are typed according to the prefixes:

Prefix   Data type
i        Standard-width (32-bit) integer
li       Long (64-bit) integer
lf       Double-precision (long) floating-point
b        Binary (1 or 0)
s#:      Fixed-length string of bytes; the # is an integer indicating the length of the byte string

Global fields (applying to every node in the job) appear first and in this order:

Type   Description
li     Grid Engine job id
li     Grid Engine task id
lf     Walltime limit (in seconds)
b      Job is standby?
lf     Per-slot virtual memory limit
i      Trace level
i      Node count
b      Job is an array job?
b      Phi should be booted for job owner access?
s#:    Grid Engine job name
s#:    Job owner user name
s#:    Job owner group name
s#:    Working directory

Following these global fields are resource lists for each participating node:

...,s4:n000{...}

Inside the per-node curly braces, fields are (in order):

Type   Description
b      Is slave node?
i      Slot (core) count
lf     Physical memory limit
lf     Virtual memory limit
s#:    GPU coprocessor list
s#:    Phi coprocessor list

For example, a job running across two nodes using 32 cores in total (20 on one node, 12 on the other) and a single Phi on each node might look like:

GECOResourceSet_v1{li3324,li1,lf0,b0,lf1000000000,i0,i2,b0,b1,s7:My test,s4:frey,s6:it_nss,s10:/home/1001,s4:n000{b0,i20,lf2000000000,lf20000000000,s0:,s4:mic0},s4:n003{b1,i12,lf1200000000,lf12000000000,s0:,s4:mic1}}
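To make the field layout concrete, here is a minimal reader for the format as described above. It is a sketch, not the library's deserializer: the dictionary key names are hypothetical labels for the documented fields, and string lengths are counted in characters, which equals the byte count only for ASCII content.

```python
# Hypothetical key names for the documented global and per-node fields.
GLOBAL_KEYS = ["job_id", "task_id", "walltime_limit", "is_standby",
               "per_slot_vmem_limit", "trace_level", "node_count",
               "is_array", "phi_boot", "job_name", "owner_user",
               "owner_group", "workdir"]
NODE_KEYS = ["is_slave", "slots", "phys_mem_limit", "virt_mem_limit",
             "gpu_list", "phi_list"]
MAGIC = "GECOResourceSet_v1{"


def read_number(text, pos, convert):
    """Read characters up to the next ',' or '}' and convert them."""
    end = pos
    while text[end] not in ",}":
        end += 1
    return end, convert(text[pos:end])


def parse_fields(text, pos):
    """Parse typed fields from pos up to the closing '}'.

    Returns (fields, position just past the '}'). A string immediately
    followed by '{' is treated as a named sub-record: (name, fields).
    """
    fields = []
    while True:
        if text.startswith("li", pos):
            pos, value = read_number(text, pos + 2, int)
        elif text.startswith("lf", pos):
            pos, value = read_number(text, pos + 2, float)
        elif text[pos] == "i":
            pos, value = read_number(text, pos + 1, int)
        elif text[pos] == "b":
            value, pos = text[pos + 1] == "1", pos + 2
        elif text[pos] == "s":
            colon = text.index(":", pos)
            length = int(text[pos + 1:colon])
            value = text[colon + 1:colon + 1 + length]
            pos = colon + 1 + length
            if pos < len(text) and text[pos] == "{":
                sub, pos = parse_fields(text, pos + 1)
                value = (value, sub)
        else:
            raise ValueError(f"unknown type prefix at offset {pos}")
        fields.append(value)
        if text[pos] == ",":
            pos += 1
        elif text[pos] == "}":
            return fields, pos + 1
        else:
            raise ValueError(f"expected ',' or '}}' at offset {pos}")


def parse_resource_set(text):
    """Turn one serialized record into a dictionary of named fields."""
    if not text.startswith(MAGIC):
        raise ValueError("unrecognized magic prefix or format version")
    fields, _ = parse_fields(text, len(MAGIC))
    result = dict(zip(GLOBAL_KEYS, fields[:len(GLOBAL_KEYS)]))
    result["nodes"] = {name: dict(zip(NODE_KEYS, sub))
                       for name, sub in fields[len(GLOBAL_KEYS):]}
    return result
```

Because every string carries its length, a job name or working directory may itself contain commas or braces without confusing the reader.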

The GECOQuarantine API implements a socket-based protocol for the LD_PRELOAD library to request process quarantine. The server- and client-side functionality are implemented together, not in separate APIs.

The kind of socket the API should use is specified as a text string. The socket type leads the string, just as a scheme leads off a URL:

Scheme     Socket type       Description of payload
service:   IPv4, 127.0.0.1   A named TCP/IP service, e.g. "nameserver" or "smtp"
port:      IPv4, 127.0.0.1   Numerical TCP/IP port, e.g. "9091"
path:      Unix domain       File path, e.g. "/var/run/gecod.s"
(none)     Inferred          A number (e.g. "9091") produces an IPv4 loopback socket on that port; a string starting with "/" produces a Unix domain socket at the given path

The following socket descriptions are all equivalent and produce a socket associated with port 8080 on the loopback interface:

  • service:http-alt
  • service:8080
  • port:8080
  • 8080

Likewise for Unix domain sockets, the following are equivalent:

  • /var/run/gecod.s
  • path:/var/run/gecod.s
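The scheme rules above could be interpreted roughly as follows. The function names are hypothetical, and the handling of a numeric payload after service: is inferred from the equivalence list rather than from the library source; named services are resolved with the standard socket.getservbyname() call, which depends on the local services database.

```python
import socket


def parse_socket_description(text):
    """Return ("inet", port) or ("unix", path) for a socket description."""
    if text.startswith("path:"):
        return ("unix", text[len("path:"):])
    if text.startswith("port:"):
        return ("inet", int(text[len("port:"):]))
    if text.startswith("service:"):
        name = text[len("service:"):]
        if name.isdigit():
            # Assumption: a numeric "service" is taken as a port number.
            return ("inet", int(name))
        return ("inet", socket.getservbyname(name, "tcp"))
    # No scheme given: infer the socket type from the payload itself.
    if text.startswith("/"):
        return ("unix", text)
    if text.isdigit():
        return ("inet", int(text))
    raise ValueError(f"cannot interpret socket description: {text!r}")


def socket_address(text):
    """Map a description to an (address family, address) pair."""
    kind, value = parse_socket_description(text)
    if kind == "unix":
        return socket.AF_UNIX, value
    return socket.AF_INET, ("127.0.0.1", value)
```

The returned pair can be passed to socket.socket(family, socket.SOCK_STREAM) followed by bind() or connect(), depending on whether the caller is the server or the client side.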

The API includes functions that generate opaque objects representing the messages to be sent between the two processes. Messages are cryptographically signed to better ensure their integrity. There are currently two messages defined.

The LD_PRELOAD library sends a job started message containing a pid and the Grid Engine job and task id associated with it.

On receipt of a job started message, gecod returns to LD_PRELOAD an acknowledge job started message containing the Grid Engine job and task id and a boolean indicating whether or not quarantine was successful.
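The wire format and signing algorithm are not documented on this page, so the following is only an illustration of how a signed "job started" message could be built and checked. Everything specific here is an assumption: the HMAC-SHA256 construction, the shared key, the command code, and the field widths are all hypothetical.

```python
import hashlib
import hmac
import struct

SHARED_KEY = b"example-key"   # hypothetical; not GECO's key handling
CMD_JOB_STARTED = 1           # hypothetical command code


def pack_job_started(pid, job_id, task_id, key=SHARED_KEY):
    """Build header + payload + MAC for a job started message."""
    payload = struct.pack("<qqq", pid, job_id, task_id)        # 3 x 64-bit
    header = struct.pack("<II", CMD_JOB_STARTED, len(payload))
    mac = hmac.new(key, header + payload, hashlib.sha256).digest()
    return header + payload + mac


def unpack_job_started(message, key=SHARED_KEY):
    """Verify the MAC, then return (command, (pid, job_id, task_id))."""
    header, rest = message[:8], message[8:]
    command, data_len = struct.unpack("<II", header)
    payload, mac = rest[:data_len], rest[data_len:]
    expected = hmac.new(key, header + payload, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("message authentication failed")
    return command, struct.unpack("<qqq", payload)
```

The receiver recomputes the MAC over the header and payload and compares it in constant time, so a message altered in transit or signed with a different key is rejected before its contents are used.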

The GECO tools are written to provide as much information as possible to the user or system administrator using them. Throughout the code, messages are produced at varying verbosity levels. Any GECO program can alter the maximum verbosity of the GECOLog API and thus filter out messages at higher levels. The levels (in increasing order of verbosity) are:

  • Quiet
  • Error
  • Warn
  • Info
  • Debug

At the lowest level (Quiet) the GECOLog API will not output any messages. At the Info and Debug levels the API will produce extensive messages that are meant to aid in tracing execution. The API defaults to the Error verbosity level.

The format of the log lines can be modified at any time to include any of the following:

  • Timestamp
  • Process pid
  • Verbosity level label

The API can also be configured to send messages to the syslog facility.

An example that includes all three fields:

2015-11-19T13:56:46-0500 [28152|WARN ]: !! All resource information must be pre-populated for jobs since qstat use is disabled !!
2015-11-19T13:56:46-0500 [28152|WARN ]: !! Per-job data should be serialized to /opt/geco/resources/<jobid>.<taskid> using geco-rsrcinfo !!
  :
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOQuarantine.c:856) received quarantine command = { command=1, dataLen=24, mac=DFFAAC2B1B8ACE1AF1580291D0C182926B035A4E30CC60ADBFDBD791D3FBFF6F }
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOJob.c:395) loading resource information for 366474.27 from /opt/geco/resources/366474.27
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:456) created /cgroup/cpuset/GECO/366474.27
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1068) succeeded reading available cpuset.cpus from /cgroup/cpuset/GECO/cpuset.cpus
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1073) scanning /cgroup/cpuset/GECO for existing cpuset.cpus allocations
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1104)   allocated cpuset.cpus = 
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1109)   available cgroup.cpus = 0-19
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1212) optimal cgroup.cpus calculated as 0 (1)
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOJob.c:721) 1 core allocated to 366474.27
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOJob.c:724)   => 0
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1333) copied /cgroup/cpuset/GECO/cpuset.mems to /cgroup/cpuset/GECO/366474.27/cpuset.mems
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:1358) set /cgroup/cpuset/GECO/366474.27/cpuset.cpu_exclusive to 1
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOJob.c:740) 366474.27 successfully bound to allocated cpuset
2015-11-19T13:57:14-0500 [28152|INFO ]:(GECOCGroup.c:456) created /cgroup/memory/GECO/366474.27
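The level filtering and optional line prefix described in this section can be imitated with a short sketch. The function names and keyword arguments are hypothetical and the layout only approximates the sample output above; it is not the GECOLog implementation.

```python
import os
import sys
import time

# Levels in increasing order of verbosity, as listed above.
LEVELS = ["QUIET", "ERROR", "WARN", "INFO", "DEBUG"]


def should_log(level, max_level):
    """True if a message at `level` passes the configured maximum verbosity."""
    return 0 < LEVELS.index(level) <= LEVELS.index(max_level)


def format_line(level, message, *, timestamp=True, pid=True, label=True,
                now=None):
    """Build one log line with the selected optional prefix fields."""
    prefix = []
    if timestamp:
        prefix.append(time.strftime("%Y-%m-%dT%H:%M:%S%z",
                                    time.localtime(now)))
    tags = []
    if pid:
        tags.append(str(os.getpid()))
    if label:
        tags.append(f"{level:<5}")      # pad labels to a common width
    if tags:
        prefix.append("[" + "|".join(tags) + "]:")
    return " ".join(prefix + [message])


def log(level, message, max_level="ERROR", stream=sys.stderr, **fmt):
    """Write the message only if it passes the verbosity filter."""
    if should_log(level, max_level):
        print(format_line(level, message, **fmt), file=stream)
```

With the default maximum of Error, a call such as log("WARN", "...") prints nothing, while raising max_level to "INFO" lets Error, Warn, and Info messages through.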

The GECO daemon uses a runloop to handle the multiplexing of event/data sources. The daemon will typically be waiting for data coming from:

  • Patched exec() functions in an sshd or sge_shepherd daemon requesting process quarantine
  • Kernel netlink notification of process state changes (fork/exec/exit)
  • Kernel notification of out-of-memory process kills

Each of these polling sources is a separate socket interface that must be checked for data. The daemon registers each socket with the runloop, together with a set of callback functions that react to events occurring on that socket.

A runloop can also have an arbitrary number of observers registered with it. An observer is a function that will be called each time the runloop transitions from one state to another. The state changes that trigger a call are configured per-observer when it is registered with the runloop:

  • entering runloop
  • before entering polling
  • after polling exits
  • before processing polling sources
  • after processing polling sources
  • exiting runloop

By abstracting the socket-polling mechanism, the daemon becomes a single-threaded loop that invokes epoll_wait() with an appropriate timeout (the granularity of the runloop, which defaults to 60 seconds).
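The structure described in this section (polling sources with callbacks, observers keyed to state changes, and a timed poll) can be sketched as below. The class, method, and state names are hypothetical, and Python's selectors module stands in for direct epoll_wait() calls; on Linux its default selector is epoll-based.

```python
import selectors

# Hypothetical names for the six state changes listed above.
STATES = ("entry", "before_poll", "after_poll",
          "before_sources", "after_sources", "exit")


class Runloop:
    def __init__(self, granularity=60.0):
        self.selector = selectors.DefaultSelector()
        self.granularity = granularity   # poll timeout in seconds
        self.observers = []
        self.running = False

    def add_source(self, fileobj, callback):
        """Register a polling source; callback(runloop, fileobj) on input."""
        self.selector.register(fileobj, selectors.EVENT_READ, callback)

    def add_observer(self, states, callback):
        """Register callback(runloop, state) for the given state changes."""
        self.observers.append((set(states), callback))

    def stop(self):
        self.running = False

    def _notify(self, state):
        for states, callback in self.observers:
            if state in states:
                callback(self, state)

    def run(self):
        self.running = True
        self._notify("entry")
        while self.running:
            self._notify("before_poll")
            # Blocks until a source is readable or the timeout expires.
            events = self.selector.select(timeout=self.granularity)
            self._notify("after_poll")
            self._notify("before_sources")
            for key, _mask in events:
                key.data(self, key.fileobj)   # dispatch to the callback
            self._notify("after_sources")
        self._notify("exit")
```

A source callback or an observer can call stop() to end the loop after the current pass; an observer registered for "after_poll" is a natural place for periodic housekeeping, since it fires at least once per granularity interval.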

  • Last modified: 2016-01-04 16:02