String Attributes
String attributes are attributes that can hold string values.
   
String Values: Byte or Unicode string values (variable, fixed, or optimized size)
Logical Type: String
Settings and Options (see note 1):
  Encoding: Byte, Utf8, Utf16, or Utf32
  Storage: Variable, Fixed, or Optimized
  FixedLength: n
Quick Look:
lastName: String
 
lastName: String {
  Encoding: Utf32,
  Storage: Variable
}
 
state: String {
  Encoding: Byte,
  Storage: Fixed,
  FixedLength: 2
}
1. You can omit settings for default options, which are indicated in boldface.
Discussion 
For general syntax information, see About Attribute Structures.
For examples of string values, see String Tokens.
A string value represents a sequence of characters in text. The type characteristics of a string attribute determine both the character encoding and the allocation of space for string values it can store.
A string value is implicitly terminated by a null character. Storage for the terminating null character is managed automatically, so, for example, you do not need to count it explicitly when specifying the maximum size of fixed-capacity strings.
For considerations when designing a string attribute, see Guidelines for Choosing a Character Encoding and Guidelines for Choosing a Storage Option.
Specifying the Attribute Type
A string data attribute has the following logical type:
Logical Type: String
Description: Sequences of characters in text.
Specifying Type Characteristics
A string attribute uses the following settings to specify detailed type characteristics:
Encoding: Specifies the character encoding for string values. Options:
  Byte: Represents characters as byte-length (8-bit) numeric values according to a standard such as ASCII or ISO Latin-1.
  Utf8: Represents characters in the Unicode UTF-8 encoding format.
  Utf16: Represents characters in the Unicode UTF-16 encoding format.
  Utf32: Represents characters in the Unicode UTF-32 encoding format.
Storage: Specifies the space allocated for string values. Options:
  Variable: Allocation for strings of any length.
  Fixed: Allocation for non-extensible strings of a particular maximum size.
  Optimized: Allocation for strings that are optimized for (but not limited to) a particular size.
FixedLength: Specifies the maximum size of fixed-capacity strings, or the optimal size of optimized strings. Options:
  n: Positive integer number of characters (in Byte or Utf32 encoded strings) or code units (in Utf8 or Utf16 encoded strings). Do not count the terminating null character.
About Character Encodings
The component characters of a string value are represented as numeric values according to the character encoding that is part of the string’s data type. A character encoding governs not only the mapping of characters to numeric values, but also the way these numeric values are represented in storage. Depending on the encoding, a single character’s numeric value may be represented as one or more code units, where a code unit is a minimal combination of bits. Consequently, a code unit may, but need not, correspond to a complete character (also called a code point).
Two general kinds of character encodings are supported:
Byte-string encoding, which represents each character as an 8-bit numeric value according to a standard such as ASCII or ISO Latin-1.
The byte-string encoding uses a single 8-bit code unit to represent each character.
Unicode encoding, which represents characters as numeric values according to the Unicode Standard.
A Unicode encoding further specifies an encoding format (UTF-8, UTF-16, or UTF-32) to indicate the size of the code units that represent the numeric values.
The following table summarizes the details of the supported character encodings:
 
Byte: 8-bit code units; exactly 1 code unit per character. Byte-string encoding.
Utf8: 8-bit code units; 1, 2, 3, or 4 code units per character. Unicode UTF-8 encoding format.
Utf16: 16-bit code units; 1 or 2 code units per character. Unicode UTF-16 encoding format. A pair of code units representing a single character is called a surrogate pair.
Utf32: 32-bit code units; exactly 1 code unit per character. Unicode UTF-32 encoding format.
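The code-unit counts in the table can be checked with a short Python sketch. Python is used here only for illustration: the codec names are Python's, and mapping "latin-1" to the Byte encoding is an assumption, not something the DO language defines.

```python
# Count the code units needed for one character in each encoding.
# Python codec names stand in for the DO Encoding options (an assumption
# made for illustration; "latin-1" models the Byte encoding).
CODECS = {
    "Byte": ("latin-1", 1),     # 8-bit code units
    "Utf8": ("utf-8", 1),       # 8-bit code units
    "Utf16": ("utf-16-le", 2),  # 16-bit code units
    "Utf32": ("utf-32-le", 4),  # 32-bit code units
}

def code_units(char: str, encoding: str) -> int:
    """Return the number of code units used to encode `char`."""
    codec, bytes_per_unit = CODECS[encoding]
    return len(char.encode(codec)) // bytes_per_unit

# 'A', e-acute, the euro sign, and an emoji outside the Basic Multilingual Plane.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    counts = {enc: code_units(ch, enc) for enc in ("Utf8", "Utf16", "Utf32")}
    print(f"U+{ord(ch):04X}", counts)
```

The emoji is the only character in this sample that needs a surrogate pair in UTF-16, and the only one that needs four UTF-8 code units.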
About Space Allocation for String Values
Space is allocated for string values according to one of three general storage options:
Fixed-Capacity Strings
Variable Strings
Optimized Strings
These storage options govern not only the amount of space that is allocated, but also the way that space is structured.
When considering the storage options for a string attribute, you must consider the typical size of the string values to be stored by that attribute. A string’s size is the number of code units in it, and may differ from the string’s length, which is the number of whole characters it contains:
For a string in the Byte or Utf32 encoding, the size is equal to the length, because every character is represented as exactly one code unit.
For a string in the Utf8 or Utf16 encoding, the size can be larger than the length, because a given character may be represented by two or more code units.
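As a rough illustration of the difference between size and length, the following Python sketch measures both for one value. The helper name and the use of Python codecs to stand in for the DO encodings are assumptions made for this example.

```python
def size_and_length(text: str, codec: str, bytes_per_unit: int) -> tuple:
    """Return (size in code units, length in characters) for `text`."""
    size = len(text.encode(codec)) // bytes_per_unit
    length = len(text)  # a Python str counts whole characters (code points)
    return size, length

name = "Zo\u00eb"  # three characters, one of them outside ASCII

print("Byte ", size_and_length(name, "latin-1", 1))
print("Utf8 ", size_and_length(name, "utf-8", 1))
print("Utf16", size_and_length(name, "utf-16-le", 2))
print("Utf32", size_and_length(name, "utf-32-le", 4))
```

Only the Utf8 case reports a size (4) larger than the length (3), so a size limit of 3 code units would be enough for this value in every encoding except Utf8.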
Fixed-Capacity Strings
A fixed-capacity string is a sequence of characters that cannot be extended beyond a specified maximum size, or capacity. Fixed-capacity storage is appropriate for an attribute that will store strings of up to (but not exceeding) the maximum size. For example, a string attribute with a fixed capacity of 2 can accommodate strings consisting of either 1 or 2 code units.
When a schema class has a fixed-capacity string attribute, storage for a string of the specified capacity N is allocated within each object of the class. (Additional space is automatically allocated for a terminating null character, so the actual allocation is for N+1 code units.) When you assign a string to the attribute, the string’s characters are embedded directly in the object, along with the terminating null character.
When defining a fixed-capacity string attribute, you should set the capacity N to the size of the longest string value that will be assigned to the attribute. Note:
If an assigned string has fewer than N code units, the excess space beyond the terminating null is left uninitialized.
If an assigned string has more than N code units, an exception is thrown.
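The behavior described above can be modeled in a few lines of Python. This is a toy model for reasoning about capacity, not the product's implementation: the class name, the list used as storage, and the choice of ValueError for the exception are all hypothetical.

```python
class FixedCapacityString:
    """Toy model of a fixed-capacity string (hypothetical, for illustration)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        # N + 1 slots: N code units plus the terminating null.
        self.slots = [None] * (capacity + 1)

    def assign(self, code_units: list) -> None:
        if len(code_units) > self.capacity:
            # The text says an exception is thrown; ValueError is a stand-in.
            raise ValueError(
                f"{len(code_units)} code units exceed capacity {self.capacity}"
            )
        self.slots[:len(code_units)] = code_units
        self.slots[len(code_units)] = 0  # terminating null
        # Slots after the null keep whatever they held before.

state = FixedCapacityString(2)   # models FixedLength: 2
state.assign(list(b"CA"))        # two Byte code units
print(state.slots)               # [67, 65, 0]
```

Assigning three code units to this model raises the stand-in exception, which mirrors the rule for strings longer than N.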
Variable Strings
A variable string is a sequence of any number of characters. Variable storage is appropriate for an attribute that will store strings whose lengths are not known or are known to vary widely.
Structurally, a variable string is a compound object consisting of a reference and the extensible vector of elements (code units) that it refers to. The reference occupies a fixed amount of space. When a schema class has a variable string attribute, storage for the reference portion of the string is embedded in each object of the class, and storage for the vector portion is allocated outside the object. The vector occupies a variable amount of space and may be relocated by certain operations. Elements in the vector are guaranteed to be contiguous within virtual memory.
Optimized Strings
An optimized string is a sequence of any number of characters, with storage that is optimized for strings containing up to a specified number of code units N. Optimized storage is appropriate for an attribute that will store strings that are known to generally be of a certain size, but are not limited to that size. An optimized string attribute provides very efficient storage and character access for strings whose size you can predict, while still providing the flexibility of a variable-length string if an occasional large string needs to be stored.
Structurally, an optimized string is a compound object consisting of a fixed-capacity string of a specified capacity N, plus a reference to an extensible vector of elements (code units). When a schema class has an optimized string attribute, storage for the fixed part and for the reference is embedded in each object of the class. (Additional space is automatically allocated for a terminating null character, so the actual allocation for the fixed part is for N+1 code units.) Storage for the vector is allocated only as needed when you assign a string to the attribute:
If the assigned string contains N or fewer code units, these characters and a terminating null are embedded in the fixed part, and the vector is not allocated.
If the assigned string has more than N code units, the vector is allocated, and all of the characters are stored in it. The fixed part remains unused.
The vector’s storage is allocated outside the object. The vector occupies a variable amount of space and may be relocated by certain operations. Elements in the vector are guaranteed contiguous within virtual memory.
For example, assume an optimized string attribute has a fixed capacity of 2.
An assigned string of size 1 or 2 is stored in the fixed part.
An assigned string of size 3 or more is stored in the vector.
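The fixed-part-or-vector rule can be sketched as a toy Python model. The class and method names are hypothetical, and the model assumes the vector is released when a short string is assigned later, which the text does not state.

```python
class OptimizedString:
    """Toy model of an optimized string (hypothetical, for illustration)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.fixed = [None] * (capacity + 1)  # N code units + terminating null
        self.vector = None                    # allocated only when needed

    def assign(self, code_units: list) -> None:
        if len(code_units) <= self.capacity:
            # Fits: embed the code units and a terminating null in the fixed part.
            self.fixed[:len(code_units)] = code_units
            self.fixed[len(code_units)] = 0
            self.vector = None  # assumption: the vector is not kept around
        else:
            # Too large: store every code unit in the vector; fixed part unused.
            self.vector = list(code_units)

    def stored_in(self) -> str:
        return "fixed part" if self.vector is None else "vector"

nickname = OptimizedString(2)
nickname.assign([72, 105])       # size 2
print(nickname.stored_in())      # fixed part
nickname.assign([72, 101, 121])  # size 3
print(nickname.stored_in())      # vector
```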
Using the fixed part of the attribute instead of the vector:
Provides more efficient storage, because the stored string’s characters are embedded directly in the containing object, without the overhead of the vector and the wasted space of the allocated but unused fixed part.
Provides better performance for accessing the stored string’s characters, because those characters can be accessed directly, without the dereference operation that must be performed to obtain the vector.
In general, when specifying the capacity N for an optimized string attribute, you should minimize the use of the vector as much as possible, by choosing N so that a high percentage (for example, 90%) of the anticipated string values will fit within the fixed part of the attribute. However, you should avoid setting N too high. If many of the assigned strings are significantly shorter than N, the attributes containing the shorter strings will also contain wasted space.
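One way to apply this guideline is to measure the sizes of representative values and pick the smallest N that covers the target share. The Python sketch below assumes you already have a sample of sizes in code units; the function name and the sample data are hypothetical.

```python
def suggest_capacity(sizes, coverage_percent=90):
    """Smallest N such that at least coverage_percent of the sizes are <= N."""
    if not sizes:
        raise ValueError("need at least one sample size")
    ordered = sorted(sizes)
    # Ceiling of coverage_percent * count / 100, using integer arithmetic.
    rank = (coverage_percent * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical sizes (in code units) of ten anticipated string values.
sample_sizes = [4, 5, 5, 6, 6, 7, 7, 8, 9, 40]

n = suggest_capacity(sample_sizes)
fits = sum(1 for size in sample_sizes if size <= n)
print(f"N = {n}: {fits} of {len(sample_sizes)} values fit in the fixed part")
```

With this sample the suggestion is N = 9, so the single outlier of size 40 goes to the vector instead of forcing every object to reserve 40 code units.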
Guidelines for Choosing a Character Encoding
A string attribute’s character encoding determines the efficiency with which the string values are stored and accessed. For optimal efficiency, use the following guidelines to choose the character encoding.
Choose the Byte-String Encoding Whenever Possible
If the ISO/IEC 8859-1 standard (Latin-1 character set) provides all of the characters you will ever need, use the byte-string encoding, unless you want Unicode support for some other reason. The Latin-1 character set generally encompasses Western European languages.
The byte-string encoding uses a single 8-bit code unit for the numeric value of each character in the Latin-1 character set, including characters in the non-ASCII subset. Therefore, the entire 8 bits of space allocated for a code unit is used, with no wasted space, and character access is very efficient because every character maps to a single code unit.
Consider the Tradeoffs Between the UTF Encoding Formats
If you need to store characters outside the Latin-1 character set, or you want Unicode support for the Latin-1 character set, you must choose one of the three character encodings corresponding to UTF encoding formats. A Unicode representation is required for the characters of an Asian, Eastern European, Middle Eastern, or other non-Western-European language.
When choosing among Unicode encoding formats, you should consider which format provides the best storage-space usage and character access for the majority of the characters to be stored. Although any character can be stored in any of the three UTF encoding formats, the formats differ in the amount of space that must be reserved for each character and in the number of code units used for each character. For example:
The numeric value for an ASCII character such as ‘A’ fits within 8 bits and is stored as a single code unit in each of the three encoding formats. No space is wasted for the character in the UTF-8 format. However, in the UTF-16 format, 8 bits are wasted, and in the UTF-32 format, 24 bits are wasted.
The numeric value for an uncommon character outside the Basic Multilingual Plane might require up to 21 bits. Such a character can be stored in 4 UTF-8 code units, 2 UTF-16 code units, or 1 UTF-32 code unit. Roughly the same amount of space is unused (or used for overhead) in each of the encoding formats. However, character access is less efficient in the UTF-8 or UTF-16 encoding, where multiple code units must be combined to access the character’s numeric value.
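The two cases above can be reproduced with a small Python sketch that compares the bits a character's numeric value needs with the bits each format occupies. The function name is hypothetical, and Python's codecs are used only to count bytes.

```python
def storage_bits(char: str) -> dict:
    """Bits occupied by one character in each UTF encoding format."""
    return {
        "UTF-8": 8 * len(char.encode("utf-8")),
        "UTF-16": 8 * len(char.encode("utf-16-le")),
        "UTF-32": 8 * len(char.encode("utf-32-le")),
    }

# 'A' and the highest Unicode code point, U+10FFFF.
for ch in ("A", "\U0010FFFF"):
    needed = ord(ch).bit_length()  # minimum bits for the numeric value
    print(f"U+{ord(ch):04X} needs {needed} bits:", storage_bits(ch))
```

For 'A' the occupied space grows from 8 to 16 to 32 bits across the three formats, while the 21-bit value occupies 32 bits in all of them.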
For optimal efficiency, you should base your choice on the actual string values to be stored. If the majority of the characters in your string data will fit within 8 bits, you should consider Unicode UTF-8, regardless of the requirements of the overall character set, because larger code units allocated for the 8-bit characters will contain wasted space.
The following table summarizes the general tradeoffs among the three Unicode character encodings:
Unicode UTF-8:
  Most efficient storage-space usage for the Latin-1 character set.
  Least efficient character access for non-Latin-1 character sets, or for Latin-1 characters in the non-ASCII subset, because many of these characters require multiple code units.
Unicode UTF-16:
  Most efficient storage-space usage for Asian languages.
  A generally good balance between efficient storage and efficient character access.
Unicode UTF-32:
  Most efficient character access for any character set, but generally wastes a lot of space.
  Recommended only for character sets outside the Basic Multilingual Plane.
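To base the choice on actual data, you could total the space a representative sample would occupy in each encoding. The Python sketch below is one hedged way to do that; the function name and sample values are hypothetical, and the totals ignore terminating nulls and any per-string overhead.

```python
def encoding_costs(samples):
    """Total bytes needed to store `samples` in each candidate encoding."""
    costs = {
        "Utf8": sum(len(s.encode("utf-8")) for s in samples),
        "Utf16": sum(len(s.encode("utf-16-le")) for s in samples),
        "Utf32": sum(len(s.encode("utf-32-le")) for s in samples),
    }
    # Byte is only a candidate when every character is within Latin-1.
    if all(ord(c) <= 0xFF for s in samples for c in s):
        costs["Byte"] = sum(len(s) for s in samples)
    return costs

western = ["Smith", "Garc\u00eda", "M\u00fcller"]  # Latin-1 names
japanese = ["\u6771\u4eac", "\u5927\u962a"]        # two-character place names

print(encoding_costs(western))
print(encoding_costs(japanese))
```

For the Latin-1 sample, Byte is smallest and Utf8 is the smallest Unicode option; for the Japanese sample, Utf16 is smallest and Byte is not available.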
Guidelines for Choosing a Storage Option
When choosing whether to make an attribute fixed-capacity, variable, or optimized, you should consider your application’s data and performance requirements, and make tradeoffs as follows:
 
Variable:
  Recommended: When the strings to be stored vary widely in size or if their size cannot be predicted.
  Tradeoff: Has extra storage overhead and less efficient character access.
Fixed:
  Recommended: When the strings to be stored are of a known size, or vary only a little in size.
  Tradeoff: Cannot accommodate strings with more than the allocated number of code units; attempting to assign a longer string throws an exception. Can waste a significant amount of space if string values vary widely in size.
Optimized:
  Recommended: When the strings to be stored are known to be generally smaller than a certain size, but are not limited to that size.
  Tradeoff: Has extra storage overhead, even if the vector is never used.