technical:whitepaper:rjava-gcc-optimization

rJava: when compilers get too smart

In the midst of installing the thousands of packages available in the CRAN library, a subset of the packages displayed a curious behavior: when R attempted to load the package to verify installation, the R subprocess would hang indefinitely. With timeouts in place, the installation of the package eventually fails.

An initial check of the hung R subprocess showed it to be stuck in a function named RinitJVM_with_padding():

#0  0x00007f2954334740 in RinitJVM_with_padding () from /opt/shared/r/add-ons/r4.1.3/cran/20220517/rJava/libs/rJava.so
#1  0x00007f295433491d in RinitJVM_jsw () from /opt/shared/r/add-ons/r4.1.3/cran/20220517/rJava/libs/rJava.so
#2  0x00007f2961899364 in do_External () from /opt/shared/r/4.1.3/lib64/R/lib/libR.so
#3  0x00007f29618d66bd in bcEval () from /opt/shared/r/4.1.3/lib64/R/lib/libR.so
#4  0x00007f29618f0340 in Rf_eval.localalias () from /opt/shared/r/4.1.3/lib64/R/lib/libR.so
     :

The goal of the function is to grow the stack by some distance JSW_PADDING (defaults to 2 MiB) before starting the Java interpreter — guard pages to better-insulate the Java VM from the C code in front of it. The RinitJVM_jsw() function calls RinitJVM_with_padding() and then RinitJVM_with_padding() calls itself recursively until its local variable is at least JSW_PADDING bytes away from where RinitJVM_jsw() was placing its local variables.

The correct execution of the RinitJVM_with_padding() chain relies on the stack's predictably growing on each invocation. There are several comments surrounding the C source code that mention the need to try to fool the C compiler so it does not optimize-out those local variables; this includes mention of their having tried other stack-based allocation schemes (e.g. via alloca()) to no avail.

The odd part about the GDB stack trace was that it never went any deeper into a RinitJVM_with_padding() recursion, execution just stayed in the first invocation. The assembly dump for the region through which execution was looping indefinitely looked like this:

0x7f2954334730 <RinitJVM_with_padding+80>       sub    %r10,%rax
0x7f2954334733 <RinitJVM_with_padding+83>       imul   %r8,%rax
0x7f2954334737 <RinitJVM_with_padding+87>       sub    %rax,%rcx
0x7f295433473a <RinitJVM_with_padding+90>       mov    %rcx,%rdx
0x7f295433473d <RinitJVM_with_padding+93>       test   %rcx,%rcx
0x7f2954334740 <RinitJVM_with_padding+96>       jle    0x7f2954334770 <RinitJVM_with_padding+144>
0x7f2954334742 <RinitJVM_with_padding+98>       sub    %r11,%rdx
0x7f2954334745 <RinitJVM_with_padding+101>      test   %rdx,%rdx
0x7f2954334748 <RinitJVM_with_padding+104>      jle    0x7f2954334770 <RinitJVM_with_padding+144>
0x7f295433474a <RinitJVM_with_padding+106>      sub    %rsi,%rdx
0x7f295433474d <RinitJVM_with_padding+109>      test   %rdx,%rdx
0x7f2954334750 <RinitJVM_with_padding+112>      jle    0x7f2954334770 <RinitJVM_with_padding+144>
0x7f2954334752 <RinitJVM_with_padding+114>      sub    %rdi,%rdx
0x7f2954334755 <RinitJVM_with_padding+117>      test   %rdx,%rdx
0x7f2954334758 <RinitJVM_with_padding+120>      jle    0x7f2954334775 <RinitJVM_with_padding+149>
0x7f295433475a <RinitJVM_with_padding+122>      mov    %rdx,%rcx
0x7f295433475d <RinitJVM_with_padding+125>      mov    %rbx,%rax
0x7f2954334760 <RinitJVM_with_padding+128>      sub    %rbp,%rcx
0x7f2954334763 <RinitJVM_with_padding+131>      test   %rcx,%rcx
0x7f2954334766 <RinitJVM_with_padding+134>      jg     0x7f2954334730 <RinitJVM_with_padding+80> 

First and foremost note that nothing is ever being pushed onto the stack in this segment. Working-out what each register holds:

RegisterPurposeInitial Value
r10Base stack pointerX
raxCurrent stack pointerX + 4
r8Stack growth direction, ±1+1
rcxBytes remaining to be allocatedC
rdxTemporary byte count
r11Pointer decrement-1
rsiPointer decrement-1
rdiPointer decrement-1
rdpPointer decrement-1
rbxSaved starting value for current stack pointerX + 4

For C > 4, the code determines (rax - r10) to be 4 and calculates C = C - 4. It then does in seuquence subtraction of registers r11, rsi, rdi, and rdp off C, exiting the function iff C drops to zero or negative. But subtraction of a -1 means we are doing four C = C + 1 operations, so by the end of the segment of code C has been restored to its initial value (which is still C > 4) and the loop continues with rax being restored to X + 4. The compiler has apparently transformed the original source code into a loop that plays with the same base pointer over and over again because the stack is never being grown. This seems to imply that the compiler's recursion analysis and elimination logic does not consider the implicit stack growth side effect that makes the local dummy variable in each invocation unique.

The following test code was used to explore the issue. It is a modified variant on the RinitJVM_with_padding() function that is still in use in rJava:

jsw.c
#include <unistd.h>
#include <stdint.h>
#include <limits.h>
#include <stdio.h>
 
#define JSW_PADDING 2*1024*1024
#define C_Stack_Direction 1
extern void * __libc_stack_end;
 
#if defined(NO_OPT)
#pragma GCC push_options
#pragma GCC optimize ("O0")
#endif
#if defined(NO_RETURN)
#   define RinitJVM_with_padding_RETURN(RVAL) { RVAL; }
void
#else
#   define RinitJVM_with_padding_RETURN(RVAL) { return RVAL; }
intptr_t
#endif
RinitJVM_with_padding(
    char        *dumdum,
    intptr_t    padding,
    char        *last
)
{
    intptr_t    dp;
    char        dummy[1];
    dummy[0] = (char)(uintptr_t)&dummy;
    padding -= (dp = (last - dummy) * C_Stack_Direction);
#if !defined(NO_PRINTF)
    printf("%p %p: padding = %p ; dp = %p\n", __builtin_frame_address(0), dummy, padding, dp);
#endif
    if ( padding > 0 )
        RinitJVM_with_padding_RETURN( RinitJVM_with_padding(dumdum, padding, dummy) )
    else
        RinitJVM_with_padding_RETURN( (uintptr_t)dummy );
 
}
#if defined(NO_OPT)
#pragma GCC pop_options
#endif
 
 
int
main()
{
    char dummy[1], dummy2[45];
    uintptr_t R_CStackStart = (uintptr_t) __libc_stack_end;
    intptr_t R_CStackLimit = 8192 * 1024;
 
    intptr_t padding = 0;
    {
        int dummy;
        intptr_t usage = (R_CStackStart - (uintptr_t)&dummy);
        usage += JSW_PADDING + 512; /* 512 is a buffer for C recursive calls */
        if(R_CStackLimit == -1 || usage < ((intptr_t) R_CStackLimit))
            padding = JSW_PADDING;
    }
 
    /* reduce the risk that dummy will be optimized out */
    dummy[0] = (char) (uintptr_t) &dummy;
    printf("padding = %p\n", padding);
#if defined(NO_RETURN)
    RinitJVM_with_padding(dummy, padding, dummy);
#else
    padding = RinitJVM_with_padding(dummy, padding, dummy);
    printf("FINAL:  %p - %p = %p\n", dummy, padding, (intptr_t)dummy - padding);
#endif
    return 0;
}

Various GCC versions (4.8, 7.3, 10.1, 11.2) all transformed RinitJVM_with_padding() into a form that yielded an infinite loop (gcc -DNO_PRINTF -DNO_RETURN -o jsw jsw.c -g -O2).

For GCC 4.8 and 7.3, inserting a printf() statement into the function to show the stack pointer, padding remaining, and padding change on each invocation yielded a working function (external dependency on the local variable disallows its being optimized-out). The address of the local variable moved monotonically downward from the base stack pointer from when RinitJVM_with_padding() was first called, by 48 bytes for each printf() (gcc -DNO_RETURN -o jsw jsw.c -g -O2).

padding = 0x200000
0x7ffd59ed1cc0 0x7ffd59ed1c9f: padding = 0x1fffc3 ; dp = 0x3d
0x7ffd59ed1c80 0x7ffd59ed1c5f: padding = 0x1fff83 ; dp = 0x40
0x7ffd59ed1c40 0x7ffd59ed1c1f: padding = 0x1fff43 ; dp = 0x40
0x7ffd59ed1c00 0x7ffd59ed1bdf: padding = 0x1fff03 ; dp = 0x40
0x7ffd59ed1bc0 0x7ffd59ed1b9f: padding = 0x1ffec3 ; dp = 0x40
0x7ffd59ed1b80 0x7ffd59ed1b5f: padding = 0x1ffe83 ; dp = 0x40
   :
0x7ffd59cd1e00 0x7ffd59cd1ddf: padding = 0x103 ; dp = 0x40
0x7ffd59cd1dc0 0x7ffd59cd1d9f: padding = 0xc3 ; dp = 0x40
0x7ffd59cd1d80 0x7ffd59cd1d5f: padding = 0x83 ; dp = 0x40
0x7ffd59cd1d40 0x7ffd59cd1d1f: padding = 0x43 ; dp = 0x40
0x7ffd59cd1d00 0x7ffd59cd1cdf: padding = 0x3 ; dp = 0x40
0x7ffd59cd1cc0 0x7ffd59cd1c9f: padding = 0xffffffffffffffc3 ; dp = 0x40

For GCC 10.1 and 11.2, the printf() dependency also produced a working function. However, the stack pointer shifted both upwards and downwards as the recursion progressed: the stack pointer would drop by 64 bytes, then through two subsequent printf() calls rise by 1 byte. In each case, the function's stack pointer stayed the same.

padding = 0x200000
0x7ffdd834d1d0 0x7ffdd834d1ad: padding = 0x1fffc1 ; dp = 0x3f
0x7ffdd834d1d0 0x7ffdd834d1ae: padding = 0x1fffc2 ; dp = 0xffffffffffffffff
0x7ffdd834d1d0 0x7ffdd834d1af: padding = 0x1fffc3 ; dp = 0xffffffffffffffff
0x7ffdd834d190 0x7ffdd834d16d: padding = 0x1fff81 ; dp = 0x42
0x7ffdd834d190 0x7ffdd834d16e: padding = 0x1fff82 ; dp = 0xffffffffffffffff
0x7ffdd834d190 0x7ffdd834d16f: padding = 0x1fff83 ; dp = 0xffffffffffffffff
0x7ffdd834d150 0x7ffdd834d12d: padding = 0x1fff41 ; dp = 0x42
0x7ffdd834d150 0x7ffdd834d12e: padding = 0x1fff42 ; dp = 0xffffffffffffffff
0x7ffdd834d150 0x7ffdd834d12f: padding = 0x1fff43 ; dp = 0xffffffffffffffff
   :
0x7ffdf339f290 0x7ffdf339f26d: padding = 0x41 ; dp = 0x42
0x7ffdf339f290 0x7ffdf339f26e: padding = 0x42 ; dp = 0xffffffffffffffff
0x7ffdf339f290 0x7ffdf339f26f: padding = 0x43 ; dp = 0xffffffffffffffff
0x7ffdf339f250 0x7ffdf339f22d: padding = 0x1 ; dp = 0x42
0x7ffdf339f250 0x7ffdf339f22e: padding = 0x2 ; dp = 0xffffffffffffffff
0x7ffdf339f250 0x7ffdf339f22f: padding = 0x3 ; dp = 0xffffffffffffffff
0x7ffdf339f210 0x7ffdf339f1ed: padding = 0xffffffffffffffc1 ; dp = 0x42

Appearances are that the printf() dependency prevented full flattening of the recursion into a single contained loop, with a recursion of loops being the result. Each invocation extends the stack by 64 bytes, generating multiple local variables (versus just one) that the code loops over. Because the stack pointer appears to drop AND rise, the padding value is not accurately altered: the actual delta between the final stack location and the first is 0x1B05201D bytes, about 216 times the desired 0x200000 bytes (for 4.8 and 7.3, the delta was 0x200021 bytes).

Having a printf() present to force a dependency on the local variable is not a viable solution; likewise, introducing calls to some other dummy function are not a great solution to the problem. Since GCC 4.4, C source code can affect the compiler behavior to enforce, for example, no optimization of segments of code. The code delimited by the NO_OPT macro in jsw.c (above) informs GCC that the RinitJVM_with_padding() function should not be subject to compiler optimizations. A subsequent compilation (gcc -DNO_OPT -DNO_RETURN -o jsw jsw.c -g -O2) produces the following behavior in each version of GCC:

padding = 0x200000
0x7ffd8a647550 0x7ffd8a647547: padding = 0x1fffdb ; dp = 0x25
0x7ffd8a647510 0x7ffd8a647507: padding = 0x1fff9b ; dp = 0x40
0x7ffd8a6474d0 0x7ffd8a6474c7: padding = 0x1fff5b ; dp = 0x40
0x7ffd8a647490 0x7ffd8a647487: padding = 0x1fff1b ; dp = 0x40
0x7ffd8a647450 0x7ffd8a647447: padding = 0x1ffedb ; dp = 0x40
0x7ffd8a647410 0x7ffd8a647407: padding = 0x1ffe9b ; dp = 0x40
   :
0x7ffd8a447690 0x7ffd8a447687: padding = 0x11b ; dp = 0x40
0x7ffd8a447650 0x7ffd8a447647: padding = 0xdb ; dp = 0x40
0x7ffd8a447610 0x7ffd8a447607: padding = 0x9b ; dp = 0x40
0x7ffd8a4475d0 0x7ffd8a4475c7: padding = 0x5b ; dp = 0x40
0x7ffd8a447590 0x7ffd8a447587: padding = 0x1b ; dp = 0x40
0x7ffd8a447550 0x7ffd8a447547: padding = 0xffffffffffffffdb ; dp = 0x40

The final test is to remove the printf() dependency (gcc -DNO_OPT -DNO_PRINTF -o jsw jsw.c -g -O2):

padding = 0x200000
FINAL:  0x7ffe31fa0d3c - 0x7ffe31da0d17 = 0x200025

The rJava package source was downloaded and unpacked, and the src/init.c file was patched to place the GCC pragmas:

#pragma GCC push_options
#pragma GCC optimize ("O0")

and

#pragma GCC pop_options

around the RinitJVM_with_padding() function. The install.packages() R function was invoked on the patched source directory and installed without issue. Subsequent installation of rJava-dependent packages no longer hung when attempting to load the Java VM.

  • technical/whitepaper/rjava-gcc-optimization.txt
  • Last modified: 2022-06-17 14:58
  • by frey