From campbell@bloodandcoffee.net Sun Jul 24 19:38:09 2005 -0700
Date: Sun, 24 Jul 2005 19:38:09 -0700 (PDT)
From: Taylor Campbell <campbell@bloodandcoffee.net>
To: srfi-75@srfi.schemers.org
Subject: an alternative possibility
Fcc: sent-mail
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

It is clear that there are many conflicting goals among the SRFI and its
discussants.  Much of this stems from two desires: to retain the model
used by the old R5RS string API, or at least some semblance of
compatibility therewith; and to prescribe Unicode in some form or
another.

Over the past few days, I and several others have been working on a
radically different design for a string API, among whose differences
from R5RS's model are the use of an opaque string cursor mechanism for
indexing, the removal of the notion of a separate character data type,
and the immutability of all strings (not just literals).  The goal of
this new string design is to provide a higher-level API that is much
more flexible in underlying implementation decisions and encodings.
Indeed, the result is entirely agnostic to whether the Scheme system
supports full Unicode, purely 7-bit ASCII, or some other encoding[s].

This may not be the right forum to discuss the alternative string API,
but I felt it was relevant to mention it here.  (So far all discussion
of it has taken place in the #scheme IRC channel on irc.freenode.net.)
The rest of this mail will be occupied with discussions of some of the
design decisions and the rationale therefor.

First, a sketch of a specification for the API can be found at:

(This is, of course, only a sketch, not a full specification.  There's
a great deal left unspecified but implied between its designers; if
there's anything unclear in it, feel free to ask for clarification.)
I'll now expound on several of the significant design decisions:

* String units versus a separate character type

In the alternative string API, there is no separate 'character' data
type.  Instead, strings are sequences of some unspecified units.  In
Latin-1-only or similar systems, these units will merely be 8-bit
characters; in systems that support Unicode, they will be grapheme
clusters, encoded by any Unicode mechanism -- it doesn't matter whether
the individual code points are encoded as UTF-8, UTF-16, &c.  The
length of a string, which STRING-LENGTH returns, is the number of such
units.  STRING-REF simply returns a string of length one.

There are many advantages to this approach, both inherent and as
possibilities for exploitation.  The first is that encoding can be
swept under the rug of the implementation; programmers deal with a much
higher-level notion of units in strings.  (I'll get to how actual
binary encoding works later -- it suffices for now to say that there is
a separate facility for encoding strings into SRFI 74 blobs.)  Because
of this, not only do we no longer need to worry about exactly what a
character is and whether it should include bare Unicode surrogates, but
the encoding of strings need not be prescribed at all.  Strings can be
processed at a higher level than that of individual code points --
programs can work with the higher-level notion of characters in strings
-- and Unicode normalization (in systems where it matters, i.e. those
that support Unicode at all) is an issue that can be left to the binary
encoding facility.

All of this would still hold true if there were a separate character
data type that held more data than simply a code point; but if a
character were more than just a code point -- a sequence of them,
rather -- there would be little point in distinguishing it from a
string at all.  Operations on units in strings can still be defined
simply by passing a cursor (e.g., STRING-UNIT=?
in the specification sketch) if the cost of allocation incurred by
STRING-REF is deemed too high -- it is rare that characters are
operated on alone, outside the context of strings, even in existing
code -- or unit strings of a single code point can be represented just
as characters often were in old systems, with a new primitive immediate
type tag.

* Opaque string cursors versus natural number indexing

The notion of string cursors has been brought up here before, but only
in passing, and without considering the possibility of also removing
natural number indexing.  Using cursors for strings encoded as UTF-8 or
with some other variable-length encoding is obviously necessary for any
reasonable algorithmic complexity.  It is made even more necessary by
the model suggested above, wherein the basic unit of strings is the
grapheme cluster (in Unicode-supporting Scheme systems).  However, the
string cursor abstraction need not be a heavy-weight one; this is
discussed in more detail in the specification sketch.

For the rare applications in which indexing by natural numbers is
necessary -- and I doubt whether any of these occur in tight loops: if
they did, they would be inapplicable to variable-length encodings like
UTF-8 anyway -- STRING-CURSOR->INDEX, INDEX->STRING-CURSOR, and
STRING-CURSOR-DIFFERENCE are provided.  They all have trivial
definitions in terms of the two basic cursor arithmetic and comparison
operators, too.

* Low-level binary encoding of strings

As mentioned earlier in this mail, no internal encoding of strings is
prescribed; not even Unicode is required.  Binary encoding of strings
is well-partitioned from the rest of the API: there are several
operations to convert strings to and from blobs, parameterized by a
first-class text encoding descriptor.
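Before moving on to the details of text encodings, a quick illustration
of the cursor claim above: STRING-CURSOR->INDEX, for example, has a
trivial (if linear-time) definition that just walks cursors from the
start of the string.  This is only a hedged sketch -- the names
STRING-CURSOR-START, STRING-CURSOR-NEXT, and STRING-CURSOR=? are my
assumed spellings of the basic cursor operators, following the
specification sketch rather than quoting it:

    (define (string-cursor->index string cursor)
      ;; Count how many basic cursor steps lie between the start of
      ;; STRING and CURSOR; that count is the natural number index.
      (let loop ((c (string-cursor-start string)) (i 0))
        (if (string-cursor=? c cursor)
            i
            (loop (string-cursor-next string c) (+ i 1)))))

The linear cost here is exactly why such conversions are relegated to
rare, non-inner-loop uses.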
The TEXT-ENCODING procedure returns such a descriptor given the name of
an encoding; for instance, to get a blob containing the UTF-8
representation of some string, you would write:

    (string->blob string (text-encoding 'utf-8))

(If the system doesn't support UTF-8, TEXT-ENCODING would return false,
and that code would be an error, so real code would, of course, have to
be written more robustly than that; however, this string API is meant
not to prescribe particular encodings, not even Unicode, so, for
example, lightweight Scheme systems for microcontrollers would be
neither weighed down by Unicode nor marked as non-conforming.)

This text encoding mechanism can settle in one fell swoop the whole
matter of normalization in a Unicode instantiation of the API: for the
internal encoding of strings, one normalization can simply be chosen
and always used, and there can be text encodings named for particular
normalizations; for instance, an encoding of UTF-8 in normal form D
might be represented by the name UTF-8-NORMAL-D.  If the Scheme system
supports differing encodings per string, each Unicode encoding can
simply record what kind of normalization it uses, and other procedures
can work appropriately with that.  Operations on string units will work
'as expected,' because you needn't worry about getting a non-normalized
character out of one string and trying to compare it with a normalized
sequence from another: grapheme clusters encompass the boundaries of
what normalization can affect, and, if the system's procedures
normalize string units (grapheme clusters) on demand, this is
completely transparent to the programmer.
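As a final illustration, here is a hedged sketch of the sort of 'more
robustly written' code alluded to above, falling back to plain ASCII
when UTF-8 is unavailable; only STRING->BLOB and TEXT-ENCODING are from
the sketch, and the fallback policy itself is just one arbitrary
choice:

    (define (string->portable-blob string)
      ;; Try UTF-8 first; if TEXT-ENCODING returns false for it, fall
      ;; back to US-ASCII, and signal an error if even that is absent.
      (let ((encoding (or (text-encoding 'utf-8)
                          (text-encoding 'us-ascii))))
        (if encoding
            (string->blob string encoding)
            (error "no supported text encoding"))))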