From campbell@bloodandcoffee.net Sun Jul 24 19:38:09 2005 -0700
Date: Sun, 24 Jul 2005 19:38:09 -0700 (PDT)
From: Taylor Campbell <campbell@bloodandcoffee.net>
To: srfi-75@srfi.schemers.org
Subject: an alternative possibility
Fcc: sent-mail
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

It is clear that there are many conflicting goals among the SRFI and its
discussants.  Much of this stems from two desires: to retain the model
used by the old R5RS string API, or at least some semblance of
compatibility therewith; and to prescribe Unicode in some form or
another.

Over the past few days, I and several others have been working on a
radically different design for a string API, among whose differences
from R5RS's model are the use of an opaque string cursor mechanism for
indexing, the removal of the notion of a separate character data type,
and the immutability of all strings (not just literals).  The goal of
this new string design is to provide a higher-level API that is much
more flexible in underlying implementation decisions and encodings.
Indeed, the result is entirely agnostic to whether the Scheme system
supports full Unicode, purely 7-bit ASCII, or some other encoding[s].

This may not be the right forum to discuss the alternative string API,
but I felt it was relevant to mention it here.  (So far all discussion
of it has taken place in the #scheme IRC channel on irc.freenode.net.)
The rest of this mail will be occupied with discussions of some of the
design decisions and the rationale therefor.

First, a sketch of a specification for the API can be found at:

(This is, of course, only a sketch, not a full specification.  There's
a great deal left unspecified but implied between its designers; if
there's anything unclear in it, feel free to ask for clarification.)
I'll now expound on several of the significant design decisions:

* String units versus a separate character type

In the alternative string API, there is no separate 'character' data
type.  Instead, strings are sequences of some unspecified units.  In
Latin-1-only or similar systems, these units will merely be 8-bit
characters; in systems that support Unicode, they will be grapheme
clusters, encoded by any Unicode mechanism -- it doesn't matter whether
the individual code points are encoded as UTF-8, UTF-16, &c.  The
length of a string, which STRING-LENGTH returns, is the number of such
units.  STRING-REF simply returns a string of length one.

There are many advantages to this approach, both inherent and as
possibilities for exploitation.  The first is that encoding can be
swept under the rug of the implementation; programmers deal with a much
higher-level notion of units in strings.  (I'll get to how actual
binary encoding works later -- it suffices for now to say that there is
a separate facility for encoding strings into SRFI 74 blobs.)  Because
of this, not only do we no longer need to worry about exactly what a
character is and whether it should include bare Unicode surrogates, but
the encoding of strings need not be prescribed at all.  Strings can be
processed at a higher level than that of individual code points --
programs can work with the higher-level notion of characters in strings
-- and Unicode normalization (in systems where it matters, i.e. those
that support Unicode at all) is an issue that can be left to the binary
encoding facility.

All of this would still hold true if there were a separate character
data type that held more data than simply a code point; but if a
character were more than just a code point -- a sequence of them,
rather -- there would be little point in distinguishing it from a
string at all.  Operations on units in strings can still be defined
simply by passing a cursor (e.g., STRING-UNIT=?
in the specification sketch) if the cost of allocation incurred by
STRING-REF is deemed too high -- it is rare that characters are
operated on alone, outside the context of strings, even in existing
code -- or unit strings of a single code point can be represented just
as characters often were in old systems, with a new primitive immediate
type tag.

* Opaque string cursors versus natural number indexing

The notion of string cursors has been brought up here before, but only
in passing, and without considering the possibility of also removing
natural number indexing.  Using cursors for strings encoded as UTF-8 or
with some other variable-length encoding is obviously necessary for any
reasonable algorithmic complexity.  It is made even more necessary by
the model suggested above, wherein the basic unit of strings is the
grapheme cluster (in Unicode-supporting Scheme systems).  However, the
string cursor abstraction need not be a heavy-weight one; this is
discussed in more detail in the specification sketch.

For the rare applications in which indexing by natural numbers is
necessary -- and I doubt whether any of these occur in tight loops: if
they did, they would be inapplicable to variable-length encodings like
UTF-8 anyway -- STRING-CURSOR->INDEX, INDEX->STRING-CURSOR, and
STRING-CURSOR-DIFFERENCE are provided.  They all have trivial
definitions in terms of the two basic cursor arithmetic and comparison
operators, too.

* Low-level binary encoding of strings

As mentioned earlier in this mail, no internal encoding of strings is
prescribed; not even Unicode is required.  Binary encoding of strings
is well-partitioned from the rest of the API: there are several
operations to convert strings to and from blobs, parameterized by a
first-class text encoding descriptor.
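Before moving on to the details of text encodings, a quick illustration
of the cursor claim above: STRING-CURSOR->INDEX, for example, has a
trivial (if linear-time) definition that just walks cursors from the
start of the string.  This is only a hedged sketch -- the names
STRING-CURSOR-START, STRING-CURSOR-NEXT, and STRING-CURSOR=? are my
assumed spellings of the basic cursor operators, following the
specification sketch rather than quoting it:

    (define (string-cursor->index string cursor)
      ;; Count how many basic cursor steps lie between the start of
      ;; STRING and CURSOR; that count is the natural number index.
      (let loop ((c (string-cursor-start string)) (i 0))
        (if (string-cursor=? c cursor)
            i
            (loop (string-cursor-next string c) (+ i 1)))))

The linear cost here is exactly why such conversions are relegated to
rare, non-inner-loop uses.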
The TEXT-ENCODING procedure returns such a descriptor given the name of
an encoding; for instance, to get a blob containing the UTF-8
representation of some string, you would write:

    (string->blob string (text-encoding 'utf-8))

(If the system doesn't support UTF-8, TEXT-ENCODING would return false,
and that code would be an error, so real code would, of course, have to
be written more robustly than that; however, this string API is meant
not to prescribe particular encodings, not even Unicode, so, for
example, lightweight Scheme systems for microcontrollers would be
neither weighed down by Unicode nor marked as non-conforming.)

This text encoding mechanism can settle in one fell swoop the whole
matter of normalization in a Unicode instantiation of the API: for the
internal encoding of strings, one normalization can simply be chosen
and always used, and there can be text encodings named for particular
normalizations; for instance, an encoding of UTF-8 in normal form D
might be represented by the name UTF-8-NORMAL-D.  If the Scheme system
supports differing encodings per string, each Unicode encoding can
simply record what kind of normalization it uses, and other procedures
can work appropriately with that.  Operations on string units will work
'as expected,' because you needn't worry about getting a non-normalized
character out of one string and trying to compare it with a normalized
sequence from another: grapheme clusters encompass the boundaries of
what normalization can affect, and, if the system's procedures
normalize string units (grapheme clusters) on demand, this is
completely transparent to the programmer.
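As a final illustration, here is a hedged sketch of the sort of 'more
robustly written' code alluded to above, falling back to plain ASCII
when UTF-8 is unavailable; only STRING->BLOB and TEXT-ENCODING are from
the sketch, and the fallback policy itself is just one arbitrary
choice:

    (define (string->portable-blob string)
      ;; Try UTF-8 first; if TEXT-ENCODING returns false for it, fall
      ;; back to US-ASCII, and signal an error if even that is absent.
      (let ((encoding (or (text-encoding 'utf-8)
                          (text-encoding 'us-ascii))))
        (if encoding
            (string->blob string encoding)
            (error "no supported text encoding"))))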