August 20, 2024

V8. Working with Strings. Expanding Vocabulary

What is a String

To better understand what happens under the hood of V8, it is useful to recall some theory first.

The ECMA-262 specification states:

The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”) up to a maximum length of 2**53 - 1 elements.

The String type is a set of all ordered sequences of zero or more 16-bit unsigned integers (“elements”) with a maximum length of 2**53 - 1 elements.

In practice, it is necessary to store additional information along with strings in machine memory to be able to determine the end of the string in the general heap. There are two approaches for this purpose. The first is an array of characters: a structure that represents a sequence of elements and a separate field for the length of this sequence. The second is the null-terminator method, where a special character indicating the end of the string is placed at the end of the sequence of elements. Depending on the system, the terminator character can be the byte 0xFF or, for example, an ASCII code character. However, the byte 0x00 is the most commonly used. Strings with this terminating element are referred to as null-terminated strings. Both methods have their own advantages and disadvantages, which is why in the modern world, both methods are often combined, and strings in memory can simultaneously be null-terminated and store their length. This approach is used, for instance, in the C++ language starting with version 11.

In addition to the general definition of a string, the ECMA-262 specification has several other requirements alongside the other data types in the JavaScript language. I wrote in detail about types and their representation within the V8 engine in my article "Deep JS: In Memory of Types and Data." In that article, we learned that all types have their own hidden class, which stores all necessary attributes, service information, and encapsulates the logic of working with that type. It is obvious that for working with strings, the standard C++ library std::string is insufficient, and they are transformed into an internal class v8::String.

Types of Strings

To meet all the specification requirements and due to various optimizations, strings within V8 are divided into different types.

One-byte / Two-byte

Despite the specification directly defining all strings as sequences of 16-bit elements, this is quite wasteful from an optimization standpoint. After all, not all characters require 2 bytes for representation. The standard ASCII table contains 128 elements, including Arabic numerals, Latin alphabet letters in uppercase and lowercase without diacritical marks, basic punctuation, mathematical symbols, and control characters. This entire table implies coding with only 8 bits, and the set of characters is widely used in practice. Therefore, the V8 developers decided to encode strings that contain only 8-bit characters separately from the standard 16-bit ones. This allows for significant memory savings.

Internalized / Externalized Strings

Besides bit-depth, strings differ in their storage location. After being converted into an internal structure, the string gets placed into the so-called StringTable, which we will discuss a bit later. For now, note that there are three such tables. The StringTable itself stores strings within a single Isolate (essentially, within a single browser tab). Additionally, the engine allows strings to be stored outside the heap and even outside the engine itself, in external storage. Pointers to such strings are placed in a separate table called the ExternalStringTable. Furthermore, there is another table — the StringForwardingTable — for the purposes of garbage collection. Yes, strings, like objects, participate in garbage collection, and since they are stored in separate tables, they require specific mechanisms for this process. We will discuss this in detail a bit later.

Strings by purpose

The JavaScript language is quite flexible. During code execution, strings can be transformed multiple times and participate in related processes. For better performance, several functional types are allocated to them. These include: SeqString, ConsString, ThinString, and SlicedString and ExternalString. We will discuss each type in detail later. For now, let's summarize everything mentioned above.

This is the general classification of string types in V8. Besides the main types, there are also several derived types. For example, SeqString and ExternalString can be placed in a common shared heap, where their types may be SharedSeq<one|two>ByteString and SharedExternal<one|two>ByteString, respectively. Additionally, ExternalString can be uncachable and have additional types like UncachableExternal<one|two>ByteString and even SharedUncachableExternal<one|two>ByteString.

AST

Before a string takes its final form in the depths of V8, the engine must first read and interpret it. This is done using the same mechanism that interprets all other syntactic units of the language. Specifically, this involves constructing an abstract syntax tree (AST), which I discussed in the article Deep JS. Scopes of darkness or where variables live.

Upon receiving a string node through a string literal or some other means, V8 creates an instance of the AstRawString class. If the string is a concatenation of two or more other strings, it creates an AstConsString. These classes are designed to store strings outside V8's heap, in the so-called AstValueFactory. Later, these strings will be assigned a type and will be internalized, i.e., moved into V8's heap.

The length of the string

The size of each string in memory depends on its length. The longer the string, the more memory it occupies. But are there any limitations on the maximum string size imposed by the engine? The specification explicitly states that a string cannot exceed 2**53 - 1 elements. However, in practice, this number can be different. After all, a maximum one-byte string would weigh in at 8,192 TB (8 PB), while a two-byte string would correspondingly be 4,096 TB (4 PB). Clearly, no personal computer has that much RAM, which is why JavaScript engines have their own limits that are significantly stricter than the requirements of the specification. Specifically in V8 (the version of V8 at the time of writing this article is 12.8.325), the maximum string length depends on the system architecture.

/include/v8-primitive.h#126

static constexpr int kMaxLength =
    internal::kApiSystemPointerSize == 4 ? (1 << 28) - 16 : (1 << 29) - 24;

For 32-bit systems, this number is 2**28 - 16, and for 64-bit systems, it's slightly higher: 2**29 - 24. These limitations prevent a one-byte string from occupying more than 256 MB in 32-bit systems, and a one-byte string from occupying more than 512 MB. In 64-bit systems, the maximum values are no more than 512 MB for one-byte strings and 1024 MB for two-byte strings. If a string exceeds the length limit, the engine will return an Invalid string length error.

const length = (1 << 29) - 24;
const longString = '"' + new Array(length - 2).join('x');

String.prototype.link(longString); // RangeError: Invalid string length

To be objective, it is also important to note that the maximum heap size plays an indirect role, as it is not a constant and can be adjusted by the system or engine configuration.

// d8 --max-heap-size=128

const length = (1 << 27) + 24; // 129 MB
const longString = '"' + new Array(length - 2).join('x');

String.prototype.link(longString);

// Fatal JavaScript out of memory: Reached heap limit

StringTable

During the execution of a JavaScript program, it can manipulate a large number of both strings and variables that reference them. As we have seen, each string can consume a significant amount of memory. Moreover, strings within a JS program can be cloned, modified, concatenated, and transformed in various ways multiple times. As a result, duplicates of string values can end up in memory. Storing multiple copies of the same sequence of characters would be extremely wasteful. Therefore, in the world of system programming, the practice of storing string values in a separate structure that ensures there is no duplication of values, and that all string variables reference this structure, has long been actively applied. In Java and .NET, for example, such a structure is called a "StringPool", and in JavaScript, it is called a StringTable.

The essence of this is that once a string is included in the AST, a hash key is generated for it based on its type and value. The generated key is saved in the string object, and it is then checked against a special table of strings located in the heap. If such a key exists in the table, the corresponding JS variable will receive a reference to the existing string object. Otherwise, a new string object will be created, and the key will be placed in the table. This process is called internalization.

InternalizedString

Just above, I mentioned that in the internal representation of V8, there are many types of strings. Internalized strings, in their most basic form, receive the type InternalizedString, which indicates that the string is located in the StringTable.

As usual, let's take a look at the structure of the string inside the engine by compiling V8 in debug mode and running it with the flag --allow-natives-syntax.

d8> %DebugPrint("FrontendAlmanac")

DebugPrint: 0x28db000d9b99: [String] in OldSpace: #FrontendAlmanac
0x28db000003d5: [Map] in ReadOnlySpace
 - map: 0x28db000004c5 <MetaMap (0x28db0000007d <null>)>
 - type: INTERNALIZED_ONE_BYTE_STRING_TYPE
 - instance size: variable
 - elements kind: HOLEY_ELEMENTS
 - enum length: invalid
 - stable_map
 - non-extensible
 - back pointer: 0x28db00000061 <undefined>
 - prototype_validity cell: 0
 - instance descriptors (own) #0: 0x28db00000701 <DescriptorArray[0]>
 - prototype: 0x28db0000007d <null>
 - constructor: 0x28db0000007d <null>
 - dependent code: 0x28db000006dd <Other heap object (WEAK_ARRAY_LIST_TYPE)>
 - construction counter: 0

The first thing we see here is the type INTERNALIZED_ONE_BYTE_STRING_TYPE, which means that the string is one-byte and placed in the StringTable. Let's try to create another string literal with the same value.

d8> const string2 = "FrontendAlmanac";
d8> %DebugPrint(string2);

DebugPrint: 0x28db000d9b99: [String] in OldSpace: #FrontendAlmanac
0x28db000003d5: [Map] in ReadOnlySpace
 - map: 0x28db000004c5 <MetaMap (0x28db0000007d <null>)>
 - type: INTERNALIZED_ONE_BYTE_STRING_TYPE
 - instance size: variable
 - elements kind: HOLEY_ELEMENTS
 - enum length: invalid
 - stable_map
 - non-extensible
 - back pointer: 0x28db00000061 <undefined>
 - prototype_validity cell: 0
 - instance descriptors (own) #0: 0x28db00000701 <DescriptorArray[0]>
 - prototype: 0x28db0000007d <null>
 - constructor: 0x28db0000007d <null>
 - dependent code: 0x28db000006dd <Other heap object (WEAK_ARRAY_LIST_TYPE)>
 - construction counter: 0

Notice that the constant string2 refers to the exact same string instance at the address 0x28db000d9b99.

We can observe a similar situation in the Chrome browser.

// Замкнем значения в функции
function V8Snapshot() {
  this.string1 = "FrontendAlmanac";
  this.string2 = "FrontendAlmanac";
}

const v8Snapshot = new V8Snapshot();

Both V8Snapshot properties refer to the same address @61559. The contents of the entire StringTable can be seen right there in the snapshot in the (string) object.

Here, we can find our string and see which variables refer to it.

You have probably noticed that in addition to our string, there are over 4k other values in the table. The thing is, V8 also stores its internal strings here, including keywords, different names of system methods and functions, and even texts of JavaScript scripts.

SeqString

You have probably noticed that in addition to our string, there are over 4k other values in the table. The thing is, V8 also stores its internal strings here, including keywords, different names of system methods and functions, and even texts of JavaScript scripts.

The concept of InternalizedString appeared relatively recently, in 2018, as part of the TurboFan optimizing compiler. Before this, plain strings had the type SeqString. Technically, InternalizedString differs from SeqString in its internal structure. Specifically, the classes SeqOneByteString and SeqTwoByteString contain a pointer to a character array chars and a number of methods that can interact with it. The implementation of the SeqString class looks literally as follows:

/src/objects/string.h#733

// The SeqString abstract class captures sequential string values.
class SeqString : public String {
 public:
  // Truncate the string in-place if possible and return the result.
  // In case of new_length == 0, the empty string is returned without
  // truncating the original string.
  V8_WARN_UNUSED_RESULT static Handle<String> Truncate(Isolate* isolate,
                                                       Handle<SeqString> string,
                                                       int new_length);
                                                       
  struct DataAndPaddingSizes {
    const int data_size;
    const int padding_size;
    bool operator==(const DataAndPaddingSizes& other) const {
      return data_size == other.data_size && padding_size == other.padding_size;
    }
  };
  DataAndPaddingSizes GetDataAndPaddingSizes() const;
  
  // Zero out only the padding bytes of this string.
  void ClearPadding();
  
  EXPORT_DECL_VERIFIER(SeqString)
};

One of its descendants, SeqOneByteString, looks like this:

/src/objects/string.h#763

// The OneByteString class captures sequential one-byte string objects.
// Each character in the OneByteString is an one-byte character.
V8_OBJECT class SeqOneByteString : public SeqString {
 public:
  static const bool kHasOneByteEncoding = true;
  using Char = uint8_t;
  
  V8_INLINE static constexpr int32_t DataSizeFor(int32_t length);
  V8_INLINE static constexpr int32_t SizeFor(int32_t length);
  
  // Dispatched behavior. The non SharedStringAccessGuardIfNeeded method is also
  // defined for convenience and it will check that the access guard is not
  // needed.
  inline uint8_t Get(int index) const;
  inline uint8_t Get(int index,
                     const SharedStringAccessGuardIfNeeded& access_guard) const;
  inline void SeqOneByteStringSet(int index, uint16_t value);
  inline void SeqOneByteStringSetChars(int index, const uint8_t* string,
                                       int length);
                                       
  // Get the address of the characters in this string.
  inline Address GetCharsAddress() const;
  
  // Get a pointer to the characters of the string. May only be called when a
  // SharedStringAccessGuard is not needed (i.e. on the main thread or on
  // read-only strings).
  inline uint8_t* GetChars(const DisallowGarbageCollection& no_gc);
  
  // Get a pointer to the characters of the string.
  inline uint8_t* GetChars(const DisallowGarbageCollection& no_gc,
                           const SharedStringAccessGuardIfNeeded& access_guard);
                           
  DataAndPaddingSizes GetDataAndPaddingSizes() const;
  
  // Initializes padding bytes. Potentially zeros tail of the payload too!
  inline void clear_padding_destructively(int length);
  
  // Maximal memory usage for a single sequential one-byte string.
  static const int kMaxCharsSize = kMaxLength;
  
  inline int AllocatedSize() const;
  
  // A SeqOneByteString have different maps depending on whether it is shared.
  static inline bool IsCompatibleMap(Tagged<Map> map, ReadOnlyRoots roots);
  
  class BodyDescriptor;
  
 private:
  friend struct OffsetsForDebug;
  friend class CodeStubAssembler;
  friend class ToDirectStringAssembler;
  friend class IntlBuiltinsAssembler;
  friend class StringBuiltinsAssembler;
  friend class StringFromCharCodeAssembler;
  friend class maglev::MaglevAssembler;
  friend class compiler::AccessBuilder;
  friend class TorqueGeneratedSeqOneByteStringAsserts;
  
  FLEXIBLE_ARRAY_MEMBER(Char, chars);
} V8_OBJECT_END;

For comparison, the InternalizedString class looks like this:

/src/objects/string.h#758

V8_OBJECT class InternalizedString : public String{

  // TODO(neis): Possibly move some stuff from String here.
} V8_OBJECT_END;

As you can see, there is no implementation here at all, as the concept of InternalizedString was introduced merely for the sake of terminology. In fact, all the logic related to memory allocation, encoding, decoding, comparing, and modifying strings is handled in the base class String. Characters are stored not as an array but directly as a sequence of 32-bit or 64-bit character codes in memory. The class itself only has a system pointer to the beginning of the corresponding memory area. This structure is referred to as "FlatContent," and the strings are thus called flat strings.

V8_OBJECT class String : public Name {
  ...
 private:
  union {
    const uint8_t* onebyte_start;
    const base::uc16* twobyte_start;
  };
  ...
}

So, what is SeqString in practice?

d8> const seqString = [
  "F", "r", "o", "n", "t", "e", "n", "d",
  "A", "l", "m", "a", "n", "a", "c"
].join("");
d8> 
d8> %DebugPrint(seqString);

DebugPrint: 0x2353001c94e1: [String]: "FrontendAlmanac"
0x235300000105: [Map] in ReadOnlySpace
 - map: 0x2353000004c5 <MetaMap (0x23530000007d <null>)>
 - type: SEQ_ONE_BYTE_STRING_TYPE
 - instance size: variable
 - elements kind: HOLEY_ELEMENTS
 - enum length: invalid
 - stable_map
 - non-extensible
 - back pointer: 0x235300000061 <undefined>
 - prototype_validity cell: 0
 - instance descriptors (own) #0: 0x235300000701 <DescriptorArray[0]>
 - prototype: 0x23530000007d <null>
 - constructor: 0x23530000007d <null>
 - dependent code: 0x2353000006dd <Other heap object (WEAK_ARRAY_LIST_TYPE)>
 - construction counter: 0

In the example above, the string was created by combining elements of an array. Since the array had to be created regardless during this operation, and the resulting string cannot be checked in the StringTable until it has been combined, the variable is assigned the type SeqString.

Let’s modify the previous example a bit.

function V8Snapshot() {
  this.string1 = "FrontendAlmanac";
  this.string2 = [
    "F", "r", "o", "n", "t", "e", "n", "d",
    "A", "l", "m", "a", "n", "a", "c"
  ].join("");
}

const v8Snapshot = new V8Snapshot();

How will the string table look in this case?

Since string1 and string2 are two different objects with different types of strings, each of them has its own address. Moreover, each of them will have its own unique hash key. Therefore, in the table, we can see two identical values with different keys. In other words, there are duplicate strings. Generally, all such duplicates can be seen by applying the Duplicated strings filter in the snapshot.

Furthermore, if we attempt to create several SeqString objects with the same value, their duplicates also will not appear in the table.

function V8Snapshot() {
  this.string1 = "FrontendAlmanac";
  this.string2 = [
    "F", "r", "o", "n", "t", "e", "n", "d",
    "A", "l", "m", "a", "n", "a", "c"
  ].join("");
  this.string3 = [
    "F", "r", "o", "n", "t", "e", "n", "d",
    "A", "l", "m", "a", "n", "a", "c"
  ].join("");
}

const v8Snapshot = new V8Snapshot();

ConsString

Let's consider another example.

d8> const consString = "Frontend" + "Almanac";
d8> %DebugPrint(consString);

DebugPrint: 0xf0001c952d: [String]: c"FrontendAlmanac"
0xf000000155: [Map] in ReadOnlySpace
 - map: 0x00f0000004c5 <MetaMap (0x00f00000007d <null>)>
 - type: CONS_ONE_BYTE_STRING_TYPE
 - instance size: 20
 - elements kind: HOLEY_ELEMENTS
 - enum length: invalid
 - non-extensible
 - back pointer: 0x00f000000061 <undefined>
 - prototype_validity cell: 0
 - instance descriptors (own) #0: 0x00f000000701 <DescriptorArray[0]>
 - prototype: 0x00f00000007d <null>
 - constructor: 0x00f00000007d <null>
 - dependent code: 0x00f0000006dd <Other heap object (WEAK_ARRAY_LIST_TYPE)>
 - construction counter: 0

The variable consString is formed by concatenating two string literals. Such strings are given the type ConsString.

/src/objects/string.h#916

V8_OBJECT class ConsString : public String {
 public:
  inline Tagged<String> first() const;
  inline void set_first(Tagged<String> value,
                        WriteBarrierMode mode = UPDATE_WRITE_BARRIER);
  
  inline Tagged<String> second() const;
  inline void set_second(Tagged<String> value,
                         WriteBarrierMode mode = UPDATE_WRITE_BARRIER);
  
  // Doesn't check that the result is a string, even in debug mode.  This is
  // useful during GC where the mark bits confuse the checks.
  inline Tagged<Object> unchecked_first() const;
  
  // Doesn't check that the result is a string, even in debug mode.  This is
  // useful during GC where the mark bits confuse the checks.
  inline Tagged<Object> unchecked_second() const;
  
  V8_INLINE bool IsFlat() const;
  
  // Dispatched behavior.
  V8_EXPORT_PRIVATE uint16_t
  Get(int index, const SharedStringAccessGuardIfNeeded& access_guard) const;
  
  // Minimum length for a cons string.
  static const int kMinLength = 13;
  
  DECL_VERIFIER(ConsString)
  
 private:
  friend struct ObjectTraits<ConsString>;
  friend struct OffsetsForDebug;
  friend class V8HeapExplorer;
  friend class CodeStubAssembler;
  friend class ToDirectStringAssembler;
  friend class StringBuiltinsAssembler;
  friend class maglev::MaglevAssembler;
  friend class compiler::AccessBuilder;
  friend class TorqueGeneratedConsStringAsserts;
  
  friend Tagged<String> String::GetUnderlying() const;
  
  TaggedMember<String> first_;
  TaggedMember<String> second_;
} V8_OBJECT_END;

The ConsString class has two pointers to other strings, which can in turn be of any type, including ConsString. As a result, this type can form a binary tree of ConsString nodes, where the leaves are not ConsString. The final value of such a string is obtained by concatenating the leaves of the tree from left to right, from the deepest node to the first one.

function V8Snapshot() {
  this.string1 = "FrontendAlmanac";
  this.string2 = "Frontend" + "Almanac";
  this.string3 = "Frontend" + "Almanac";
}

const v8Snapshot = new V8Snapshot();

Just like with SeqString, each concatenated string has its own instance of the class and, accordingly, its own hash key. These strings can also be duplicated in the string table. However, in the table, we can also find the internalized leaves of this concatenation.

It is important to make a disclaimer. Not every concatenation operation results in the creation of a ConsString.

d8> const string = "a" + "b";
d8> %DebugPrint(string);

DebugPrint: 0x1d48001c9ec5: [String]: "ab"
0x1d4800000105: [Map] in ReadOnlySpace
 - map: 0x1d48000004c5 <MetaMap (0x1d480000007d <null>)>
 - type: SEQ_ONE_BYTE_STRING_TYPE
 - instance size: variable
 - elements kind: HOLEY_ELEMENTS
 - enum length: invalid
 - stable_map
 - non-extensible
 - back pointer: 0x1d4800000061 <undefined>
 - prototype_validity cell: 0
 - instance descriptors (own) #0: 0x1d4800000701 <DescriptorArray[0]>
 - prototype: 0x1d480000007d <null>
 - constructor: 0x1d480000007d <null>
 - dependent code: 0x1d48000006dd <Other heap object (WEAK_ARRAY_LIST_TYPE)>
 - construction counter: 0

In the example above, the string is represented by the type SeqString. The fact is that the procedures for reading and writing to the ConsString structure are not free. While the structure itself is considered efficient in terms of memory optimization, all the advantages become evident only with relatively long strings. In the case of short strings, the overhead of maintaining a binary tree nullifies these advantages. Therefore, the V8 developers empirically determined a critical string length below which the ConsString structure is inefficient. This number is 13.

/src/objects/string.h#940

// Minimum length for a cons string.
static const int kMinLength = 13;

SlicedString

d8> const parentString = " FrontendAlmanac FrontendAlmanac ";
d8> const slicedString1 = parentString.substring(1, 16);
d8> const slicedString2 = parentString.slice(1, 16);
d8> const slicedString3 = parentString.trim();
d8> 
d8> %DebugPrint(slicedString1);

DebugPrint: 0x312b001c9509: [String]: "FrontendAlmanac"
0x312b000001a5: [Map] in ReadOnlySpace
 - map: 0x312b000004c5 <MetaMap (0x312b0000007d <null>)>
 - type: SLICED_ONE_BYTE_STRING_TYPE
 - instance size: 20
 - elements kind: HOLEY_ELEMENTS
 - enum length: invalid
 - stable_map
 - non-extensible
 - back pointer: 0x312b00000061 <undefined>
 - prototype_validity cell: 0
 - instance descriptors (own) #0: 0x312b00000701 <DescriptorArray[0]>
 - prototype: 0x312b0000007d <null>
 - constructor: 0x312b0000007d <null>
 - dependent code: 0x312b000006dd <Other heap object (WEAK_ARRAY_LIST_TYPE)>
 - construction counter: 0

This type is assigned to strings formed as a result of being a substring of another string. Such operations include substring(), slice() and trimming methods trim(), trimStart(), and trimEnd().

The SlicedString class stores only a pointer to the parent string, along with the offset and length of the sequence within the parent string. This significantly reduces memory usage and improves performance when working with such strings. However, the same rule applies here as with ConsString: the length of the substring must not be less than 13 characters. Otherwise, this optimization loses its value, as SlicedString requires unpacking the parent string and adding the offset to the starting address of the sequence.

Another important feature is that SlicedString cannot be nested.

function V8Snapshot() {
  this.parentString = " FrontendAlmanac FrontendAlmanac ";
  this.slicedString1 = this.parentString.trim();
  this.slicedString2 = this.slicedString1.substring(0, 15);
}

const v8Snapshot = new V8Snapshot();

In the example above, we attempt to create a SlicedString from another SlicedString. However, slicedString1 cannot be the parent string because it is itself a SlicedString.

Therefore, the parent string for both properties will be "parentString".

ThinString

There are instances when it is necessary to internalize a string, but for some reason, it cannot be done immediately. In such cases, a new string object is created and internalized, while the original string is converted into a ThinString type, which essentially serves as a reference to its internalized version. ThinString is very similar to ConsString, but it only contains a single reference.

d8> const string = "Frontend" + "Almanac"; // create ConsString
d8> const obj = {};
d8> 
d8> obj[string]; // convert to ThinString
d8> 
d8> %DebugPrint(string);

DebugPrint: 0x335f001c947d: [String]: >"FrontendAlmanac"
0x335f00000425: [Map] in ReadOnlySpace
 - map: 0x335f000004c5 <MetaMap (0x335f0000007d <null>)>
 - type: THIN_ONE_BYTE_STRING_TYPE
 - instance size: 16
 - elements kind: HOLEY_ELEMENTS
 - enum length: invalid
 - stable_map
 - non-extensible
 - back pointer: 0x335f00000061 <undefined>
 - prototype_validity cell: 0
 - instance descriptors (own) #0: 0x335f00000701 <DescriptorArray[0]>
 - prototype: 0x335f0000007d <null>
 - constructor: 0x335f0000007d <null>
 - dependent code: 0x335f000006dd <Other heap object (WEAK_ARRAY_LIST_TYPE)>
 - construction counter: 0

In the example above, we first create a ConsString. We then use this string as a key for an object. To find the key in the object's properties array, the string needs to be flat; however, at this point, it is represented as a binary tree with two nodes. In this situation, the engine is forced to compute a flat value from the ConsString, create a new string object, intern it, and convert the original string into a ThinString. This particular case was mentioned in an article about string cleansing on Habr, although it did not explain why this occurs.

Now, let's take a look at DevTools. We won't be able to see the ThinString here, but we can observe that the property string is represented as InternalizedString. This is because ThinString, I reiterate, is merely a reference to the internalized version of the string.

ExternalString

Another type of string is the ExternalString. The V8 engine provides the capability to create and store strings outside the engine itself. To facilitate this, the ExternalString type and the corresponding API were introduced. References to these strings are stored in a separate table called the ExternalStringTable in the heap. Typically, such strings are created by the engine consumer for their specific needs. For instance, browsers might store external resources in this way. Additionally, the consumer can fully control the lifecycle of these resources, with one caveat: the consumer must ensure that such strings are not deallocated while the ExternalString object is alive in the V8 heap.

The screenshot above shows one of these strings created by the browser. However, we can also create our own. To do this, we can use the internal API method externalizeString (V8 must be started with the --allow-natives-syntax flag).

//> d8 --allow-natives-syntax --expose-externalize-string
d8> const string = "FrontendAlmanac";
d8> 
d8> externalizeString(string);
d8> 
d8> %DebugPrint(string);

DebugPrint: 0x7d6000da08d: [String] in OldSpace: #FrontendAlmanac
0x7d600000335: [Map] in ReadOnlySpace
 - map: 0x07d6000004c5 <MetaMap (0x07d60000007d <null>)>
 - type: EXTERNAL_INTERNALIZED_ONE_BYTE_STRING_TYPE
 - instance size: 20
 - elements kind: HOLEY_ELEMENTS
 - enum length: invalid
 - stable_map
 - non-extensible
 - back pointer: 0x07d600000061 <undefined>
 - prototype_validity cell: 0
 - instance descriptors (own) #0: 0x07d600000701 <DescriptorArray[0]>
 - prototype: 0x07d60000007d <null>
 - constructor: 0x07d60000007d <null>
 - dependent code: 0x07d6000006dd <Other heap object (WEAK_ARRAY_LIST_TYPE)>
 - construction counter: 0

Garbage Collection

Strings, like any other variables, participate in the garbage collection process. I've covered this topic in detail in my article Garbage Collection in V8, so I won't go into it here. More interestingly, the StringTable itself also participates in this process. To be precise, any transformations and deletions of strings in the StringTable occur during Full GC. For this, a temporary table called the StringForwardingTable is used, into which only relevant strings are placed during garbage collection. After this, the reference to the StringTable is updated to point to the new table.

Conclusion

So, we have familiarized ourselves with the organization of strings within the V8 engine. We have learned more about the string table and the different types of strings.

Here are some key points and conclusions.

  • Strings can be one-byte or two-byte. Two-byte strings require approximately twice as much memory, as each character in such a string is encoded with two bytes, regardless of whether it is an ASCII character or not. Therefore, if you have a choice of which string to use, a one-byte string is preferable in most cases.
const myMap = {
  "Frontend Almanac": undefined, // one-byte key,
  "Frontend Альманах": undefined, // two-byte key
}
  • Despite the specified number 2**53 - 1 in the specification, the actual length of strings on real systems is much lower. On 32-bit systems, V8 allows storing strings no longer than 2**28 - 16 characters, whereas on 64-bit systems, it is no more than 2**29 - 24 characters.
  • Seq, Cons, and Sliced strings can be duplicated in the string table.
const string1 = "FrontendAlmanac"; // creates internalized string
const string2 = "FrontendAlmanac"; // doesn't create new string
const string3 = "Frontend" + "Almanac"; // creates duplicated string
  • SlicedStrings cannot be nested. They always refer to the original internalized string. The parent string remains alive as long as at least one of its child SlicedStrings exists.
let parentString = "FrontendAlmanac"; // parent string
let slicedString1 = parentString.slice(1); // refers to the parentString
let slicedString2 = slicedString1.slice(1); // also refers to the parentString
let slicedString3 = slicedString2.slice(1); // and this one too

slicedString1 = undefind;
slicedString2 = undefind;
// parentString still not garbage collected
// as the slicedString3 is stil alive
  • If a string needs to be internalized but cannot be done "in place" — for example, if it is a Cons or Sliced string — the engine will calculate the flat value, create a new internalized string object, and convert the reference to this internalized version.
const string = "Frontend" + "Almanac"; // ConsString

const obj = {};
obj[string]; // ThinString

// string is not ConsString anymore, now it's ThinString and refers to
// a flat internalized valued in the StringTable
  • The previous example can be used to eliminate duplicate strings in the StringTable, as suggested in the article about string cleansing.
const string1 = "FrontendAlmanac";
const string2 = "Frontend" + "Almanac"; // creates duplicate
const string3 = "Frontend" + "Almanac"; // creates duplicate

const obj = {};
obj[string2]; // ThinSting
obj[string3]; // ThinString

// Now the string2 and string3 are not ConsString anymoer and they refer
// to their internalized version, in this case to the string1,
// which is already in the StringTable.

I have described only a few cases of working with strings. The article's format is not sufficient to cover all possible features and special cases. Therefore, I invite everyone to share their experiences and ask questions in the comments on my Telegram channel.


My telegram channels:

EN - https://t.me/frontend_almanac
RU - https://t.me/frontend_almanac_ru

Русская версия: https://blog.frontend-almanac.ru/v8-strings