November 16, 2023

Deep JS. In memory of data and types

Level: Senior, Senior+

We were all taught that JavaScript has primitive and reference data types. Comprehensive information is available on MDN, and the Internet is full of articles on the subject.

Theory is theory, but JS code is executed in practice. More precisely, it is compiled and executed by a JS engine. There are several such engines, developed by different people for different purposes, and it would be naive to assume that they are all identical. So it's time to figure out how specific data is actually stored by a specific JS engine. As a test subject, let's take one of the most widely used engines to date: V8 from Google.

But before we get to the analysis, let's remember the main theoretical points.

Primitive data types are immutable values represented directly at the lowest level of the language. In JavaScript, every type except Object is primitive, namely:

  • Number
  • BigInt
  • String
  • Boolean
  • Null
  • Undefined
  • Symbol

Reference data types, also known as objects, are memory areas of indeterminate size that are accessed through an identifier (a reference to the memory area). In JavaScript, there is only one such type: Object. There is also a separate Function structure, which is in fact an Object too. Object is the only mutable data type in JavaScript. A variable does not store the object value itself, only the reference to it. Any manipulation of an object changes the value directly in that memory area, while the reference stays the same until we reassign it explicitly or implicitly. The object remains in memory as long as there is an active reference to it. Once the reference is deleted or no longer used in the script, the object will soon be destroyed by the garbage collector, but more on that next time.
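The difference between the two groups can be illustrated with a minimal sketch. The variable names here are made up for the example. It only shows the language-level behavior described above, not how any engine stores the values.

```javascript
// Primitives: assignment copies the value, so the copy changes independently.
let originalNumber = 1;
let copiedNumber = originalNumber;
copiedNumber += 1; // originalNumber is not affected

// Objects: assignment copies the reference, so both names see the same object.
const firstRef = { count: 1 };
const secondRef = firstRef;
secondRef.count += 1; // the change is visible through firstRef as well

// Reassigning a variable changes only the reference it holds.
let thirdRef = firstRef;
thirdRef = { count: 2 }; // a new object with equal content, but a different reference

console.log(originalNumber, copiedNumber);          // 1 2
console.log(firstRef.count, firstRef === secondRef); // 2 true
console.log(thirdRef === firstRef);                  // false
```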

So, we have refreshed the theory. Let's now see whether everything is as unambiguous in practice. The experiments are carried out on V8 version 12.1.138 from November 15, 2023, the latest at the time of the research.

Let's start the analysis with the most understandable, it seems, type. For digital systems, there is nothing more natural than numbers.

Number

According to the documentation, the Number type in JavaScript is a 64-bit double-precision floating-point number in the IEEE 754 format.

const number = 1;

// expected value in memory (IEEE 754 double, 0x3FF0000000000000)
//
// 0011 1111 1111 0000 0000 0000 0000 0000
// 0000 0000 0000 0000 0000 0000 0000 0000

Let's look at this number in V8 in debug mode. To do this, use the engine's system helper %DebugPrint (d8 must be started with the --allow-natives-syntax flag for the % helpers to be available).

d8> const number = 1;
%DebugPrint(number);
DebugPrint: Smi: 0x1 (1)

1

We see a simple value 0x1 with a type called Smi. But shouldn't there be a Number type here, as the ECMAScript specification says? Unfortunately, the official documentation of the engine does not answer such questions, so let's turn directly to the source code.

Smi

/src/objects/smi.h

// Smi represents integer Numbers that can be stored in 31 bits.
// Smis are immediate which means they are NOT allocated in the heap.
// The ptr_ value has the following format: [31 bit signed int] 0
// For long smis it has the following format:
//     [32 bit signed int] [31 bits zero padding] 0
// Smi stands for small integer.
class Smi : public AllStatic {

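Thus, Smi (Small Integer) is a 31-bit signed integer (32-bit in the "long smi" layout mentioned in the comment).

The maximum value of such a number is +(2**30 - 1), and the minimum is -(2**30). Both values printed below fall within this range.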

d8> %DebugPrint(2**30 - 1)
DebugPrint: Smi: 0x3fffffff (1073741823)

1073741823

d8> %DebugPrint(-(2**30 - 1))
DebugPrint: Smi: 0xc0000001 (-1073741823)

-1073741823
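The tagging scheme from the smi.h comment ([31 bit signed int] followed by a 0 bit) can be sketched in plain JavaScript. This is only an illustration of the idea for the 31-bit layout. The function names are hypothetical and do not correspond to V8 internals.

```javascript
// Range of a 31-bit signed integer.
const SMI_MIN = -(2 ** 30);
const SMI_MAX = 2 ** 30 - 1;

// A value can be a Smi only if it is an integer inside the range.
// -0 is excluded because an integer encoding cannot tell it apart from +0.
function fitsInSmi(value) {
  return Number.isInteger(value) &&
    !Object.is(value, -0) &&
    value >= SMI_MIN &&
    value <= SMI_MAX;
}

// Encode: shift left by one so that the lowest bit (the tag) is 0.
function tagSmi(value) {
  if (!fitsInSmi(value)) {
    throw new RangeError(`${value} does not fit in a 31-bit Smi`);
  }
  return value << 1;
}

// A tagged word is a Smi when its lowest bit is 0.
function isSmiWord(word) {
  return (word & 1) === 0;
}

// Decode: arithmetic shift right restores the signed value.
function untagSmi(word) {
  return word >> 1;
}

console.log(fitsInSmi(2 ** 30 - 1)); // true
console.log(fitsInSmi(2 ** 30));     // false
console.log(untagSmi(tagSmi(-(2 ** 30 - 1)))); // -1073741823
```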

OK, but the specification says that the Number type stores 64-bit numbers, while Smi can only work with 31-bit ones. What about the rest? Well, let's see.

HeapNumber

/src/objects/heap-number.h

Let's take a number that is 1 greater than the maximum Smi.

d8> %DebugPrint(2**30)
DebugPrint: 0x36ac0011c291: [HeapNumber] in OldSpace
- map: 0x36ac00000789 <Map[12](HEAP_NUMBER_TYPE)>
- value: 1073741824.0
0x36ac00000789: [Map] in ReadOnlySpace
- map: 0x36ac000004c5 <MetaMap (0x36ac0000007d <null>)>
- type: HEAP_NUMBER_TYPE
- instance size: 12
- elements kind: HOLEY_ELEMENTS
- enum length: invalid
- stable_map
- back pointer: 0x36ac00000061 <undefined>
- prototype_validity cell: 0
- instance descriptors (own) #0: 0x36ac000006d9 <DescriptorArray[0]>
- prototype: 0x36ac0000007d <null>
- constructor: 0x36ac0000007d <null>
- dependent code: 0x36ac000006b5 <Other heap object (WEAK_ARRAY_LIST_TYPE)>
- construction counter: 0

1073741824

It turns out that a 64-bit number in V8 is represented by an object of the HeapNumber type. According to the IEEE 754 standard, such double-precision numbers consist of three parts: a sign (1 bit), an exponent (11 bits) and a mantissa (52 bits). In memory, this structure occupies two 32-bit words: one holds the lower 32 bits of the mantissa, and the other holds the sign, the exponent and the remaining 20 bits of the mantissa. To optimize performance, V8 implements the handling of such numbers itself, which is why it describes them with a dedicated class.
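The split into sign, exponent and mantissa can be checked from JavaScript itself. The sketch below uses the standard DataView API to read the two 32-bit words of a double. The helper name decodeDouble is made up for this example and says nothing about how V8 reads the value internally.

```javascript
// Splits a JS number into the IEEE 754 double-precision fields.
function decodeDouble(value) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, value);       // DataView writes big-endian by default
  const high = view.getUint32(0);  // sign, exponent, top 20 bits of mantissa
  const low = view.getUint32(4);   // lower 32 bits of mantissa
  return {
    sign: high >>> 31,                // 1 bit
    exponent: (high >>> 20) & 0x7ff,  // 11 bits, biased by 1023
    mantissaHigh: high & 0xfffff,     // 20 bits
    mantissaLow: low,                 // 32 bits
  };
}

console.log(decodeDouble(1));
// { sign: 0, exponent: 1023, mantissaHigh: 0, mantissaLow: 0 }

console.log(decodeDouble(2 ** 30).exponent - 1023); // 30
```

For 2**30 the mantissa bits are all zero and only the exponent changes, which shows that even an integer just outside the Smi range is stored in the same floating-point layout as 0.1.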

The same representation is, naturally, used for fractional numbers.

d8> %DebugPrint(0.1)
DebugPrint: 0x36ac0011c605: [HeapNumber] in OldSpace
- map: 0x36ac00000789 <Map[12](HEAP_NUMBER_TYPE)>
- value: 0.1
0x36ac00000789: [Map] in ReadOnlySpace
- map: 0x36ac000004c5 <MetaMap (0x36ac0000007d <null>)>
- type: HEAP_NUMBER_TYPE
- instance size: 12
- elements kind: HOLEY_ELEMENTS
- enum length: invalid
- stable_map
- back pointer: 0x36ac00000061 <undefined>
- prototype_validity cell: 0
- instance descriptors (own) #0: 0x36ac000006d9 <DescriptorArray[0]>
- prototype: 0x36ac0000007d <null>
- constructor: 0x36ac0000007d <null>
- dependent code: 0x36ac000006b5 <Other heap object (WEAK_ARRAY_LIST_TYPE)>
- construction counter: 0

0.1

The difference between Smi and HeapNumber can be seen visually by taking a Heap Snapshot in a running environment. To do this, create a small script that stores two numbers in memory.

/* We will enclose the values in the context of the function */
function V8Snapshot() {
  this.number1 = 1;     // Smi
  this.number2 = 2**30; // HeapNumber
}

// Next, create two instances of the same class,
// thus, we will have 4 references to 2 values
const v8Snapshot1 = new V8Snapshot();
const v8Snapshot2 = new V8Snapshot();

Let's use the standard browser toolkit, Chrome Dev Tools -> Memory, and take a Heap Snapshot.

In the snapshot we see two instances of the V8Snapshot class, both store pointers to the numbers number1 and number2.

It is noteworthy that in both instances number1 points to the same entry, @233347, whereas number2 has a different address in each instance, so two identical number2 values are stored in memory at this moment. This is the fundamental difference between Smi and HeapNumber. A Smi is an immediate value: as the comment in smi.h says, it is not allocated in the heap, so equal small integers are never duplicated as separate objects. A HeapNumber, on the other hand, is a heap-allocated object. To reuse a previously stored value, the engine would first have to look it up, which would negate the benefit of reuse.

Conclusion

Internally, the V8 engine does not have a single Number type. Instead, it has two representations:

  • Smi - integers in the range -(2**30) ... +(2**30 - 1), represented in memory as a 31-bit immediate value
  • HeapNumber - integers outside the Smi range and fractional numbers, represented in memory as a specialized internal heap object

With numbers, it seems to be clear. And what about the other types?

String

/src/objects/string.h

// The String abstract class captures JavaScript string values:
//
// Ecma-262:
//  4.3.16 String Value
//    A string value is a member of the type String and is a finite
//    ordered sequence of zero or more 16-bit unsigned integer values.
//
// All string values have a length field.
class String : public TorqueGeneratedString<String, Name> {

Let's see what happens in practice.

d8> %DebugPrint("")
DebugPrint: 0x25800000099: [String] in ReadOnlySpace: #
0x258000003d5: [Map] in ReadOnlySpace
- map: 0x0258000004c5 <MetaMap (0x02580000007d <null>)>
- type: INTERNALIZED_ONE_BYTE_STRING_TYPE
- instance size: variable
- elements kind: HOLEY_ELEMENTS
- enum length: invalid
- stable_map
- non-extensible
- back pointer: 0x025800000061 <undefined>
- prototype_validity cell: 0
- instance descriptors (own) #0: 0x0258000006d9 <DescriptorArray[0]>
- prototype: 0x02580000007d <null>
- constructor: 0x02580000007d <null>
- dependent code: 0x0258000006b5 <Other heap object (WEAK_ARRAY_LIST_TYPE)>
- construction counter: 0

""

We see an object of the String type with a variable size. According to the specification, a String is a finite ordered sequence of 16-bit values, so its size is not known in advance. Although the specification says that String is one of the primitive types, inside V8 it is a full-fledged heap object with all the attributes inherent in objects, except mutability. The engine developers deliberately made String objects immutable, as the specification requires.

As with numbers, let's look at the memory snapshot.

/*
 * For the purity of the experiment,
 * let's take an empty string and not an empty one
 */
function V8Snapshot() {
  this.emptyString = '';
  this.string = 'JavaScript';
}

const v8Snapshot1 = new V8Snapshot();
const v8Snapshot2 = new V8Snapshot();

Here we see that both instances use the same string pointers. Moreover, if we run the script several times, we will see the same addresses every time. This is achieved through the so-called String Pool concept, used in many programming languages. V8 marks such pooled strings as internalized, which is what the INTERNALIZED_ONE_BYTE_STRING_TYPE in the output above indicates. In simple terms, a string is a sequence of characters, and from this sequence it is easy to compute a hash of the entire object. This hash then serves as the key under which the string instance is stored in a HashMap. When the engine receives a string, it computes the hash and checks whether a string with that hash is already in the pool. If there is one, the engine returns a pointer to it; otherwise, it writes the new string to the pool.
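The pool lookup described above can be sketched as follows. This is a simplified illustration, not V8's string table: the StringPool class, the polynomial hash and the frozen entry objects are all assumptions made for the example.

```javascript
// Hypothetical string pool: hash -> list of pooled entries with that hash.
class StringPool {
  constructor() {
    this.buckets = new Map();
  }

  // Simple 32-bit polynomial hash over the UTF-16 code units.
  static hash(text) {
    let h = 0;
    for (let i = 0; i < text.length; i++) {
      h = (Math.imul(h, 31) + text.charCodeAt(i)) | 0;
    }
    return h;
  }

  // Returns the pooled entry for `text`, creating it on first use.
  intern(text) {
    const h = StringPool.hash(text);
    let bucket = this.buckets.get(h);
    if (!bucket) {
      bucket = [];
      this.buckets.set(h, bucket);
    }
    // Different strings may share a hash, so the content is compared too.
    let entry = bucket.find((candidate) => candidate.value === text);
    if (!entry) {
      entry = Object.freeze({ value: text, hash: h }); // immutable, like a string
      bucket.push(entry);
    }
    return entry;
  }
}

const pool = new StringPool();
const pooledA = pool.intern('JavaScript');
const pooledB = pool.intern('JavaScript');
const pooledEmpty = pool.intern('');

console.log(pooledA === pooledB);     // true: the same pooled instance
console.log(pooledA === pooledEmpty); // false
```

The bucket list handles hash collisions: two different strings with the same hash end up as separate entries under one key.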

Boolean, Null, Undefined

In theory, Boolean can take only two values, true or false. For this, as a rule, 1 bit is enough, where 0 = false and 1 = true. Let's see if this is the case in V8.

Boolean

d8> %DebugPrint(true)
DebugPrint: 0x36ac000000c1: [Oddball] in ReadOnlySpace: #true
0x36ac0000053d: [Map] in ReadOnlySpace
- map: 0x36ac000004c5 <MetaMap (0x36ac0000007d <null>)>
- type: ODDBALL_TYPE
- instance size: 28
- elements kind: HOLEY_ELEMENTS
- enum length: invalid
- stable_map
- non-extensible
- back pointer: 0x36ac00000061 <undefined>
- prototype_validity cell: 0
- instance descriptors (own) #0: 0x36ac000006d9 <DescriptorArray[0]>
- prototype: 0x36ac0000007d <null>
- constructor: 0x36ac0000007d <null>
- dependent code: 0x36ac000006b5 <Other heap object (WEAK_ARRAY_LIST_TYPE)>
- construction counter: 0

true

An unexpected twist. It turns out that a Boolean inside V8 is also an object, almost the same as HeapNumber, but of the Oddball type. We will look at what Oddball is a little further down; for now, note that the same structure can be observed in other simple types.

Null

d8> %DebugPrint(null)
DebugPrint: 0x36ac0000007d: [Oddball] in ReadOnlySpace: #null
0x36ac00000515: [Map] in ReadOnlySpace
- map: 0x36ac000004c5 <MetaMap (0x36ac0000007d <null>)>
- type: ODDBALL_TYPE
- instance size: 28
- elements kind: HOLEY_ELEMENTS
- enum length: invalid
- stable_map
- undetectable
- non-extensible
- back pointer: 0x36ac00000061 <undefined>
- prototype_validity cell: 0
- instance descriptors (own) #0: 0x36ac000006d9 <DescriptorArray[0]>
- prototype: 0x36ac0000007d <null>
- constructor: 0x36ac0000007d <null>
- dependent code: 0x36ac000006b5 <Other heap object (WEAK_ARRAY_LIST_TYPE)>
- construction counter: 0

null

Undefined

d8> %DebugPrint(undefined)
DebugPrint: 0x25800000061: [Oddball] in ReadOnlySpace: #undefined
0x258000004ed: [Map] in ReadOnlySpace
- map: 0x0258000004c5 <MetaMap (0x02580000007d <null>)>
- type: ODDBALL_TYPE
- instance size: 28
- elements kind: HOLEY_ELEMENTS
- enum length: invalid
- stable_map
- undetectable
- non-extensible
- back pointer: 0x025800000061 <undefined>
- prototype_validity cell: 0
- instance descriptors (own) #0: 0x0258000006d9 <DescriptorArray[0]>
- prototype: 0x02580000007d <null>
- constructor: 0x02580000007d <null>
- dependent code: 0x0258000006b5 <Other heap object (WEAK_ARRAY_LIST_TYPE)>
- construction counter: 0

undefined

Oddball

/src/objects/oddball.h

// The Oddball describes objects null, undefined, true, and false.
class Oddball : public PrimitiveHeapObject {

As you can see, Oddball is a class that extends the abstract class PrimitiveHeapObject, just like HeapNumber, which we discussed a little earlier. PrimitiveHeapObject is the base for the structures that implement the data types the specification calls primitive.

static const uint8_t kFalse = 0;
static const uint8_t kTrue = 1;
static const uint8_t kNotBooleanMask = static_cast<uint8_t>(~1);
static const uint8_t kNull = 3;
static const uint8_t kUndefined = 4;

From the comment and the constants, it is clear that this object describes 4 possible values: null, undefined, true and false. But these values are extremely simple. Why do we need such complexity?

It is a matter of optimization and performance. These 4 values are effectively constants, and during script execution they can occur thousands of times. It would be extremely wasteful to allocate a new memory area for every variable that holds one of them. Therefore, V8 reserves these 4 values in advance, before script execution even begins. From then on, whenever it encounters one of them, the engine operates with a simple pointer to a preloaded immutable object.
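The idea of preallocated constants can be modeled in a few lines. In the sketch below, the kind values are copied from the oddball.h excerpt above, while the ODDBALLS map, oddballFor and isBooleanKind are hypothetical names. The way the mask is applied is an assumption based on its name, not a quote from the V8 source.

```javascript
// Kind constants from the oddball.h excerpt.
const kFalse = 0;
const kTrue = 1;
const kNotBooleanMask = ~1 & 0xff; // static_cast<uint8_t>(~1), i.e. 0xFE
const kNull = 3;
const kUndefined = 4;

// Four immutable objects created once, before any "script" code runs.
const ODDBALLS = new Map([
  [false, Object.freeze({ kind: kFalse })],
  [true, Object.freeze({ kind: kTrue })],
  [null, Object.freeze({ kind: kNull })],
  [undefined, Object.freeze({ kind: kUndefined })],
]);

// Every use of one of the four values resolves to the same shared object.
function oddballFor(value) {
  if (!ODDBALLS.has(value)) {
    throw new TypeError('not an oddball value');
  }
  return ODDBALLS.get(value);
}

// Assumed use of the mask: only kinds 0 and 1 have no bits set outside bit 0.
function isBooleanKind(kind) {
  return (kind & kNotBooleanMask) === 0;
}

console.log(oddballFor(null) === oddballFor(null)); // true: no new allocation
console.log(isBooleanKind(oddballFor(true).kind));  // true
console.log(isBooleanKind(oddballFor(null).kind));  // false
```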

Let's look at the memory snapshot.

function V8Snapshot() {
  this.true = true;
  this.false = false;
  this.null = null;
  this.undefined = undefined;
}

const v8Snapshot1 = new V8Snapshot();
const v8Snapshot2 = new V8Snapshot();

Here we see that all 4 values are Oddball and have permanent system addresses defined even before the script is run.

Conclusion

So, we looked under the hood of the V8 engine and examined how the main data types are arranged in it. The research showed that the practical implementation does not always match the theoretical basis behind it. This, of course, does not mean that the ECMAScript specification is incorrect or that the engine developers did not follow it. It is important to understand that the specification is an abstract logical layer that defines general concepts and principles. Actually implementing an engine according to the specification is a lower-level story. In addition to meeting the basic requirements, developers must take care of many performance and optimization issues and, at the same time, account for the features of different architectures and operating systems.

As we can see, almost all data types in the V8 engine, except Smi, are object types, and variables are pointers to them.

In general, the concepts of "primitive" and "object" in JavaScript remain as they were laid down in the specification. But when working with data types, it should be understood that these concepts are logical rather than physical. The physical implementation of a particular type at the engine level may differ and have its own characteristics.

My Telegram channels:

EN - https://t.me/frontend_almanac
RU - https://t.me/frontend_almanac_ru

Russian version: https://blog.frontend-almanac.ru/5688-ygxnVD