
Smart Code, Dumb Behavior – Solving a Java Runtime Memory Problem

One of our applications, imaginatively named “Photo Pull”, has a fairly simple purpose in life:

  • Pull contact photos from various third-party sources.
  • Rescale the photos to a desired thumbnail size.
  • Push the result to S3.
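For context, the hot path is essentially a decode, rescale, re-encode step built on javax.imageio. The sketch below is illustrative rather than our production code (the class and method names are made up, and the S3 upload is omitted):

import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ThumbnailSketch {
    /** Decode raw photo bytes, scale them to the target size, and re-encode as JPEG. */
    static byte[] toThumbnail(byte[] photoBytes, int width, int height) throws IOException {
        BufferedImage source = ImageIO.read(new ByteArrayInputStream(photoBytes));
        if (source == null) {
            throw new IOException("unrecognized image format");
        }
        Image scaled = source.getScaledInstance(width, height, Image.SCALE_SMOOTH);
        BufferedImage thumbnail = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = thumbnail.createGraphics();
        g.drawImage(scaled, 0, 0, null);
        g.dispose();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(thumbnail, "jpg", out);
        // The real worker pushes these bytes to S3; that part is omitted here.
        return out.toByteArray();
    }
}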

It seems like such an application would be almost trivial to run stably. On the contrary, it has been one of our most problematic applications to operate, not because of our code, but because of the code inside the Java runtime itself.

Smart Code, Dumb Behavior

The first problem we ran into was continuous growth in Photo Pull's memory usage. Over a period of a few hours, the process would consume all memory on the system, until it was finally killed by the Linux OOM killer and restarted by Storm.

The normal diagnostic tools for Java memory usage were not much help. All we learned was that gigabytes of non-heap memory were being leaked, while heap size remained around a couple hundred megabytes.
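For instance, the numbers the JVM reports about itself (shown here via MemoryMXBean; jmap and jstat tell a similar story) only cover memory the JVM manages, so memory malloc'd by native code never shows up in them:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class MemoryReport {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        // "Non-heap" here means JVM-managed pools (PermGen, code cache, and so on),
        // not memory allocated with malloc() inside native code, which is why these
        // numbers stayed small while the process itself kept growing.
        System.out.println("heap used:     " + memory.getHeapMemoryUsage().getUsed());
        System.out.println("non-heap used: " + memory.getNonHeapMemoryUsage().getUsed());
    }
}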

There is, however, a statistical approach to determining whose fault a memory leak is. The component which leaks memory is the one most likely to push memory usage over any arbitrary threshold. Therefore, if we can get a stack trace for an OutOfMemoryError multiple times, we can obtain high certainty of where the memory leak is. The only problem now is that the process is getting killed by Linux, rather than discovering for itself that memory is exhausted.

The trick is to use ulimit to set an arbitrary memory ceiling for the process, but that ceiling needs to be significantly lower than system memory. For example:

ulimit -v $((1024*1024))

This caps the process's virtual address space at 1GB (ulimit takes the limit in kilobytes), so allocations start failing inside the process instead of the OOM killer shooting it from outside. With the memory limit set, it was just a matter of letting the process run for around an hour until it finally crashed.
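Because the failure now surfaces inside the JVM as an OutOfMemoryError with a Java stack trace, capturing it is straightforward. A minimal sketch of the idea (the wrapper is ours, purely for illustration):

public final class OomLogging {
    /** Run a unit of work, logging any OutOfMemoryError before letting the process die. */
    static void runLoggingOom(Runnable work) {
        try {
            work.run();
        } catch (OutOfMemoryError oom) {
            // The trace shows which frame failed to allocate; the frames common to
            // several crashes point at the leaking component.
            oom.printStackTrace();
            throw oom; // still let the process die so the supervisor restarts it
        }
    }
}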

The resulting stack trace led us into a native stack frame:

com.sun.imageio.plugins.jpeg.JPEGImageReader.initJPEGImageReader().

The source for this function is at jdk/src/share/native/sun/awt/image/jpeg/imageioJPEG.c in the OpenJDK 6 source. A brief investigation reveals the problem:

    /* ... snip to line 1450 */

    /* We use our private extension JPEG error handler. */
    jerr = malloc (sizeof(struct sun_jpeg_error_mgr));

    /* ... snip to line 1476 */

    /* Establish the setjmp return context for sun_jpeg_error_exit to use. */
    if (setjmp(jerr->setjmp_buffer)) {
        /* If we get here, the JPEG code has signaled an error. */
        char buffer[JMSG_LENGTH_MAX];
        (*cinfo->err->format_message) ((struct jpeg_common_struct *) cinfo,
                                       buffer);
        JNU_ThrowByName(env, "javax/imageio/IIOException", buffer);
        return 0;
    }

In other words, the C code allocates space for its error handler on line 1452, but never frees it in the case where it throws an IIOException. It is unclear whether cinfo is leaked as well.

Examination of the same file in OpenJDK 7 reveals that the bug was fixed in that version. Our code ported to Java 7 fairly easily, and no longer leaked memory after running for a few hours.

Segmentation Fault

A few hours after proclaiming victory over the memory leak, one of the workers silently crashed. Then another. Storm dutifully restarted them, but there was no obvious evidence for why they were dying — no log messages, no memory growth, no evidence of being killed by Storm. We resorted to running the process locally again, eventually getting the following message:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fb120713242, pid=11629, tid=140394149959424
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libpthread.so.0+0xe242]  sem_post+0x12

The resulting hs_err_pid*.log file lacked any Java stack trace, and the native stack dump produced by the JVM unfortunately included only the bottom-most frame. Nonetheless, we started by investigating what could make the POSIX threads library segfault. Disassembling sem_post, 18 bytes (offset 0x12) into the function finds us:

e230: mov    eax,DWORD PTR [rdi]          # load the semaphore value
e232: cmp    eax,0x7fffffff
e237: je     e26c                         # value at SEM_VALUE_MAX -> EOVERFLOW
e239: lea    esi,[rax+0x1]
e23c: lock cmpxchg DWORD PTR [rdi],esi    # atomically increment the value
e240: jne    e232                         # retry if another thread raced us
e242: cmp    QWORD PTR [rdi+0x8],0x0      # Here: check the waiter count
e247: je     e262                         # no waiters -> return 0
e249: mov    eax,0xca                     # futex syscall number
e24e: mov    esi,0x1                      # FUTEX_WAKE ...
e253: or     esi,DWORD PTR [rdi+0x4]      # ... OR'd with the semaphore's private flag
e256: mov    edx,0x1                      # wake at most one waiter
e25b: syscall
e25d: test   rax,rax
e260: js     e265
e262: xor    eax,eax
e264: ret                                 # success: return 0
e265: mov    eax,0x16                     # EINVAL
e26a: jmp    e271
e26c: mov    eax,0x4b                     # EOVERFLOW
e271: mov    rdx,QWORD PTR [rip+0x209d08]
e278: mov    DWORD PTR fs:[rdx],eax       # set errno
e27b: or     eax,0xffffffff
e27e: ret                                 # failure: return -1

This point of failure is, in and of itself, rather surprising. An access through rdi succeeded only a few instructions earlier, at e230, albeit at an address eight bytes lower. But the JVM's register dump places rdi at 0x00007fb11046e000, which is page-aligned, so those extra eight bytes cannot possibly cross a page boundary; both accesses touch the same page.

Fortunately, the JVM dump also gives us a memory map. It turns out that this address falls into a gap between two mappings, just past the end of a native shared library:

7fb11046d000-7fb11046e000 rw-p 00007000 ca:01 77 /lib/x86_64-linux-gnu/libnss_dns-2.15.so
7fb11046f000-7fb110473000 r--s 0008a000 ca:01 396307 /opt/jdk1.7.0/jre/lib/jsse.jar

rdi points to the beginning of a 1-page hole in the memory map. This suggests that something was mapped there when e230 executed, but had been unmapped by the time e242 executed. It is also possible that the caller never intended to use this address at all: rdi happens to sit around 2^32 bytes above the current stack frame, which hints at 32-bit arithmetic going wrong somewhere. Either way, we still don't know whose fault this problem is.

After reproducing the crash around ten times, we finally got a slightly different one. Again, a garbage pointer had been passed into a pthreads function, but this time we also got a Java stack trace in the dump.

It turns out that the JRE code simply delegates to its native color-management module, LCMS (Little CMS), which seems to be a bit buggy when handling certain color spaces. It sometimes corrupted pthreads state directly, or clobbered memory in a way that caused the JVM to do so later.

How We Fixed It

Unfortunately, the final solution wound up being somewhat less exciting than the investigation: simply switching from the Java ImageIO library to Apache Commons Imaging solved our problems. But, as is often the case, solving a complex problem sometimes requires far more investigation than actual repair.
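The change itself was small: the decode step moves off javax.imageio and onto Commons Imaging, which does its work in pure Java. Roughly, and only as an illustration (the class below is ours, and exact Commons Imaging signatures vary a little between releases):

import java.awt.image.BufferedImage;
import java.io.IOException;

import org.apache.commons.imaging.ImageReadException;
import org.apache.commons.imaging.Imaging;

public class PhotoDecoder {
    /** Decode photo bytes without touching the JRE's native JPEG/CMM code paths. */
    static BufferedImage decode(byte[] photoBytes) throws ImageReadException, IOException {
        // Previously: ImageIO.read(new ByteArrayInputStream(photoBytes));
        return Imaging.getBufferedImage(photoBytes);
    }
}

Because the decoding now happens entirely in Java, a malformed photo can at worst throw an exception; it can no longer leak native memory or scribble over another thread's state.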
