Why We Needed to Compile Our Own NodeJS

Assaf, one of our main developers, noticed some crashes in our NodeJS processes. Here’s the crash log:

FATAL ERROR: invalid array length Allocation failed - JavaScript heap out of memory
 1: 0x9da7c0 node::Abort() [node]
 2: 0x9db976 node::OnFatalError(char const*, char const*) [node]
 3: 0xb39f1e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xb3a299 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xce5635  [node]
 6: 0xcc24a5 v8::internal::Factory::CopyWeakArrayListAndGrow(v8::internal::Handle<v8::internal::WeakArrayList>, int, v8::internal::AllocationType) [node]
 7: 0xebe86a v8::internal::WeakArrayList::EnsureSpace(v8::internal::Isolate*, v8::internal::Handle<v8::internal::WeakArrayList>, int, v8::internal::AllocationType) [node]
 8: 0xebeb3b v8::internal::PrototypeUsers::Add(v8::internal::Isolate*, v8::internal::Handle<v8::internal::WeakArrayList>, v8::internal::Handle<v8::internal::Map>, int*) [node]
 9: 0xe88054 v8::internal::JSObject::LazyRegisterPrototypeUser(v8::internal::Handle<v8::internal::Map>, v8::internal::Isolate*) [node]
10: 0xeb1384 v8::internal::Map::GetOrCreatePrototypeChainValidityCell(v8::internal::Handle<v8::internal::Map>, v8::internal::Isolate*) [node]
11: 0xd68e5c v8::internal::LoadHandler::LoadFromPrototype(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Map>, v8::internal::Handle<v8::internal::JSReceiver>, v8::internal::Handle<v8::internal::Smi>, v8::internal::MaybeObjectHandle, v8::internal::MaybeObjectHandle) [node]
12: 0xd70e67 v8::internal::LoadIC::ComputeHandler(v8::internal::LookupIterator*) [node]
13: 0xd77bcd v8::internal::LoadIC::UpdateCaches(v8::internal::LookupIterator*) [node]
14: 0xd7824c v8::internal::LoadIC::Load(v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Name>) [node]
15: 0xd7cf01 v8::internal::Runtime_LoadIC_Miss(int, unsigned long*, v8::internal::Isolate*) [node]
16: 0x1374fd9  [node]
Aborted (core dumped)

This is a native crash: the NodeJS binary itself just stops working. The error is really bizarre. What does “invalid array length Allocation failed - JavaScript heap out of memory” even mean? At first, we thought NodeJS had reached its 2GB heap limit, but that was not the cause; our memory usage was quite stable and well within the normal range. After researching the issue further, Assaf found out that JS has a limit on array size, and once you reach it, this error occurs. Why didn’t the error say so instead? Ask NodeJS, because I have no idea.
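
Just to illustrate that limit (this is not our crash; ours happened on an internal V8 array, which is why the whole process aborted instead of throwing), you can see the user-level version of it from the shell:

# Regular JS arrays are capped at 2^32 - 1 elements in V8; asking for more
# throws a catchable RangeError with a similar "Invalid array length" message.
node -e 'try { new Array(2 ** 32); } catch (e) { console.log(e.constructor.name, "-", e.message); }'
# Prints something like: RangeError - Invalid array length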

This specific crash kept happening after we upgraded to NodeJS 12. However, we did not suspect it was an issue with Node 12, as we had had crashes before, and we assumed the error message was simply different this time. It even seemed like the issue could be on our end, in that we might have been creating an array that was too large.

Over time, this specific crash happened more frequently, and the stack trace started to show many different locations inside the code, but always within the same set of processes. One of those processes/services was my responsibility, and that is why I am here telling you this story 🙂

I benchmarked everything in my code to find what could be causing this. I did memory and CPU profiling on a running system. I worked on this issue for days without finding anything wrong with my code: nothing that would lead to a huge array being created, and nothing was leaking.

(Image: what I felt at this point)

At this point, I assumed the problem was not in my code, so I revisited the crashed systems to find the common denominator. In our system, we have what we call a Task, which is a basic container for a request. This module is used widely in our system and is heavily tested, which is why I didn’t suspect it was causing the crashes. However, every crashed system had run millions of Tasks to completion, and that was the only common denominator I found. I asked Assaf to benchmark Tasks serially, accumulate millions of completed Tasks, and tell me what happened. And to our surprise, we saw the same crash!

Upon further investigation, nothing had leaked in the Tasks code on our side, so Assaf continued to eliminate code until he found the issue. It turned out to be a bug in NodeJS 12’s handling of anonymous generator functions, which apparently made it into the release unnoticed, causing some internal V8 array to grow until the process crashes. My code used a lot of anonymous generator functions, which is why it kept crashing.

We wanted to contact NodeJS and submit a bug report. In order to do so, Assaf wrote this simple block of code which recreates the crash:

"use strict";
const co = require('co');

function* test() {
    for (let i = 0; i < 1000000000; i++) {
        function* a() {}
        yield* a(); 
        if (i % 1000000 == 0) {
            console.log("Cycle", i, process.memoryUsage().heapUsed);
        }
    }
}

co(function*(){
    yield* test();
})
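
If you want to try it yourself, co is the only dependency; something like the following (the file name repro.js is just an assumption) should eventually hit the same fatal error on an affected NodeJS 12 build:

# Install the repro's only dependency, then run it.
npm install co
node repro.js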

We reported this issue to NodeJS, which then redirected us to the V8 team.

The awesome team at V8 quickly found the issue and fixed it. Great! But when would V8 merge it into the production branches? And when would NodeJS backport it into a release? These fixes could take months, but our services were crashing now. We had already implemented code based on the Node 12 feature set, so we couldn’t really go back at this point.

We decided to compile our own NodeJS version with this fix backported into it. It was just a small fix, a few lines, and it seemed unlikely to affect anything else. We followed the build instructions from NodeJS; roughly, the process looked like the sketch below.
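
This is only a minimal sketch, not our exact procedure: the branch name, patch file path, and core count are placeholders, and the official BUILDING.md in the nodejs/node repository is the authoritative reference.

# Fetch the NodeJS sources for our release line (branch name is a placeholder).
git clone --branch v12.x --depth 1 https://github.com/nodejs/node.git
cd node

# Backport the upstream V8 fix. v8-fix.patch stands in for the actual change;
# a cherry-pick of the upstream commit works just as well.
git apply /path/to/v8-fix.patch

# Standard NodeJS build, as described in BUILDING.md.
./configure
make -j"$(nproc)"

# The freshly built binary lands in out/Release/.
out/Release/node --version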

I backported the fix, compiled it, and sent Assaf the NodeJS binary for testing. He ran all our tests against it, and it seemed perfect: very stable, with no more crashes after testing it in our lab. Cool.

The real issue was that I couldn’t just ship my binary to our customers: they use a variety of OSes, and my binary might not be compatible with all of them. We first needed to understand the compatibility level of the official NodeJS binary and how they compile their code, so we could do the same. You would think this kind of information would be widely available, but apparently not. There is no single place that tells you how the NodeJS release servers are set up; they just call them the “Release Servers”. Not very helpful!

To understand the compatibility level of a binary in a “black-box” fashion, we can use the objdump -T node_binary command. This dumps the dynamic symbol table of the binary, and we are going to look at the GLIBC, GLIBCXX, and CXXABI symbols. The output will look something like this:

/usr/bin/node:     file format elf64-x86-64
DYNAMIC SYMBOL TABLE:
0000000000000000  w   D  *UND*  0000000000000000              _ITM_addUserCommitAction
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 dlerror
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 setbuf
0000000000000000      DF *UND*  0000000000000000  GLIBCXX_3.4 _ZNSs7replaceEmmPKcm
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 strdup
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 readv
0000000000000000      DF *UND*  0000000000000000  GLIBCXX_3.4 _ZSt19__throw_logic_errorPKc
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 uname
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 feof
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 lchown
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.3.2 epoll_ctl
0000000000000000      DF *UND*  0000000000000000  CXXABI_1.3  __cxa_begin_catch
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 fdopen
0000000000000000      DF *UND*  0000000000000000  GLIBCXX_3.4 _ZNSt9exceptionD2Ev
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 symlink
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 select
...
...
...

Now we need to find the maximum version the binary requires for GLIBC, GLIBCXX, and CXXABI. This tells us the compatibility level. We could do it manually using grep and our own eyes:

objdump -T /usr/bin/node | grep -i ABI_
objdump -T /usr/bin/node | grep -i CXX_
objdump -T /usr/bin/node | grep -i GLIBC_
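
Or, to save some squinting, a one-liner along these lines (assuming GNU grep and GNU sort, which supports version ordering via -V) prints only the highest GLIBC version the binary requires; swap the pattern for GLIBCXX_ or CXXABI_ to check the others:

# Extract every GLIBC_x.y.z version tag, version-sort them, keep the highest.
objdump -T /usr/bin/node | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -n 1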

Or, focusing on the GLIBC part as an example, we could use an automated tool like this one:

#!/bin/bash

# This script lets you check which minimum GLIBC version an executable requires.
# Simply run './glibc-check.sh path/to/your/binary'
#
# You can set `MAX_VER` however low you want, although I (fasterthanlime)
# feel like `2.13` is a good target (For reference, Ubuntu 12.04 has GLIBC 2.15)
MAX_VER=2.13

SCRIPTPATH=$( cd $(dirname $0) ; pwd -P )
BINARY=$1

# Version comparison function in bash
vercomp () {
    if [[ $1 == $2 ]]
    then
        return 0
    fi
    local IFS=.
    local i ver1=($1) ver2=($2)
    # fill empty fields in ver1 with zeros
    for ((i=${#ver1[@]}; i<${#ver2[@]}; i++))
    do
        ver1[i]=0
    done
    for ((i=0; i<${#ver1[@]}; i++))
    do
        if [[ -z ${ver2[i]} ]]
        then
            # fill empty fields in ver2 with zeros
            ver2[i]=0
        fi
        if ((10#${ver1[i]} > 10#${ver2[i]}))
        then
            return 1
        fi
        if ((10#${ver1[i]} < 10#${ver2[i]}))
        then
            return 2
        fi
    done
    return 0
}

IFS="
"
VERS=$(objdump -T $BINARY | grep GLIBC | sed 's/.*GLIBC_\([.0-9]*\).*/\1/g' | sort -u)

for VER in $VERS; do
  vercomp $VER $MAX_VER
  COMP=$?
  if [[ $COMP -eq 1 ]]; then
    echo "Error! ${BINARY} requests GLIBC ${VER}, which is higher than target ${MAX_VER}"
    echo "Affected symbols:"
    objdump -T $BINARY | grep GLIBC_${VER}
    echo "Looking for symbols in libraries..."
    for LIBRARY in $(ldd $BINARY | cut -d ' ' -f 3); do
      echo $LIBRARY
      objdump -T $LIBRARY | fgrep GLIBC_${VER}
    done
    exit 27
  else
    echo "Found version ${VER}"
  fi
done

Source: https://gist.github.com/fasterthanlime/17e002a8f5e0f189861c.

After determining the compatibility level, we need to find a matching machine. I started by trying to match the GLIBC version, and I used Docker for my testing as it was the easiest option. I found out that CentOS 6 with the devtoolset-7 package produces the binary most compatible with the official NodeJS releases.

https://hub.docker.com/r/conanio/gcc7-centos6/

I first compared the GLIBC version (it matched), then compiled NodeJS inside the container and compared the compatibility of the resulting binary. It was exactly the same as the official release.
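
For reference, the workflow was roughly the following. This is a sketch rather than our exact commands: it assumes the NodeJS sources are already checked out in ./node on the host, and depending on the image you may need to enable the devtoolset toolchain first (e.g. via scl).

# Start the CentOS 6 + devtoolset-7 image with the NodeJS sources mounted in.
docker run --rm -it -v "$(pwd)/node:/node" conanio/gcc7-centos6 bash

# Inside the container: build as usual.
cd /node
./configure
make -j"$(nproc)"

# Then verify the result has the same compatibility level as the official binary.
objdump -T out/Release/node | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -n 1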

At this point, I wanted to create a tar.gz archive of our patched NodeJS version:

make -j <CORES> binary ARCH=x64

The “ARCH” part is very important: NodeJS has a bug where, without it, the compilation fails. The command outputs an archive file we can use, laid out exactly like an official NodeJS package.
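
The resulting tarball can then be handled just like one downloaded from nodejs.org; the exact filename depends on the version and platform, so the one below is only an example:

# Unpack the archive produced by "make binary" and sanity-check the binary
# (the real filename will differ per version and platform).
tar -xzf node-v12.x.y-linux-x64.tar.gz
node-v12.x.y-linux-x64/bin/node --version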

In conclusion, we learned the hard way that NodeJS can be unstable and that basic core functionality can break when the framework is updated. We also learned how to determine the compatibility level of a binary and, most of all, how to compile NodeJS with our own fix built into it. I hope you find this useful if you ever face a similar issue. Have fun compiling 🙂