A New String API                                                -*- outline -*-

####                  This document is obsolete!                           ####
#### Please refer to <http://mumble.net/~campbell/proposals/new-text.txt>. ####

Strings are sequences -- not necessarily arrays -- of some unspecified unit.
(In ASCII, this unit is the 7-bit character; in Unicode, this unit will likely
be the grapheme cluster.)  String units are indexed by opaque cursor objects,
_not_ natural numbers as in most systems.  Binary representation & encoding is
very carefully separated from the bulk of the API: to operate on a string's
binary representation, one must operate on a SRFI 74 blob that contains the
encoded string; direct access to a string's internal encoding is not provided.

Separate specifications would define exactly what units are in strings.  For
instance, a basic ASCII specification would specify that the basic unit of
strings is the ASCII character; a Unicode specification would define the units
of strings as Unicode grapheme clusters.

* Base String Operations

(STRING? obj) -> boolean

(STRING-LENGTH string) -> natural number
(STRING-EMPTY? string) -> boolean
    = (ZERO? (STRING-LENGTH string))
(UNIT-STRING? obj) -> boolean
    = (AND (STRING? obj) (= 1 (STRING-LENGTH obj)))

Explicit string units are represented as strings themselves of length one.  The
following invariant will always hold for any string and valid cursor:

  (STRING-LENGTH (STRING-REF string cursor)) = 1

(STRING-REF string cursor) -> unit-string
    = (STRING-SLICE string cursor (STRING-CURSOR+ cursor string))
  Note: STRINGS ARE IMMUTABLE!!!  There is *no* STRING-SET!.  Also, STRING-REF
  may allocate on the heap, because it returns a string.

(STRING-SLICE string start [end]) -> string
(STRING-COPY string [start [end]]) -> string
  STRING-SLICE is guaranteed to be an amortized constant-time operation, in
  that it may not copy all input strings, but must use some mechanism of shared
  storage.  Also, any string cursors into a string returned by STRING-SLICE are
  valid cursors into the original input string to STRING-SLICE.  STRING-COPY is
  guaranteed *not* to share any storage, so that what storage the whole input
  string occupied can be recycled.  Cursors into the string returned by
  STRING-COPY are *not* guaranteed to be valid cursors into the input string,
  and almost certainly will not be.

  STRING-SLICE has only an optional end cursor argument because it would be
  pointless with two optional arguments, neither supplied -- (STRING-SLICE
  <string>) might as well be written as simply <string> --, but STRING-COPY is
  still useful with only one argument, as a sort of recycling operator.
  STRING-SLICE is allowed to return an object that is identical, in the sense
  of EQ?, to its string parameter; STRING-COPY is not.

  (What about STRING-TAKE, STRING-DROP, and a required-three-argument
  STRING-SLICE, to simplify the single-optional-argument wart?)

(STRING-APPEND string ...) -> string
(STRING-JOIN strings prefix infix suffix) -> string
(STRING-DISPLACE string start end substring) -> string
    = (STRING-APPEND (STRING-SLICE string (STRING-START-CURSOR string)
                                   start)
                     substring
                     (STRING-SLICE string end
                                   (STRING-END-CURSOR string)))

(STRING-CASE-FOLD string) -> string
(STRING-UPCASE string) -> string
(STRING-DOWNCASE string) -> string
(STRING-TITLECASE string) -> string
  Case operations on strings.  These are locale-_insensitive_.  Separate
  locale-sensitive operations can be defined by a separate specification that
  also defines operations on locales; e.g., (STRING-TITLECASE/LOCALE string
  [locale]), where the locale argument defaults to some representation of the
  current locale.

  (I hate the name STRING-CASE-FOLD, but the other options I can think of
  aren't much better: STRING-FOLDCASE, FOLD-STRING-CASE, FOLDCASE-STRING,
  CASE-FOLD-STRING, ...)

(STRING-SEARCH string pattern [cache]) -> [match-start match-end] or [#F #F]
(STRING-SEARCH* string pattern [cache]) -> start/end list
(COMPUTE-STRING-SEARCH-CACHE pattern) -> string-search-cache
  STRING-SEARCH searches for the first occurrence of the supplied pattern in
  the supplied string.  If such an occurrence is found, it returns two cursors,
  one at the start of the pattern (inclusive) and one at the end of the pattern
  (exclusive) in the string.  If no such occurrence is found, it returns
  (VALUES #F #F).  STRING-SEARCH* returns a list of pairs representing the
  spans of occurrences of the pattern in the input string: each element's car
  is a cursor for the start of an occurrence of the pattern, inclusive, and its
  cdr is a cursor for the end, exclusive.  Occurrences may overlap, but the
  start cursors are guaranteed to be ordered increasingly and likewise the end
  cursors.

  The optional cache argument allows a cache, such as a KMP restart vector or a
  set of Boyer-Moore search functions, to be precomputed preceding the actual
  search, often if the search is repeated.  Exactly what the cache represents
  is unspecified; implementations are encouraged to experiment with this.  It
  is an error to supply a pattern and a cache such that the cache was not
  created by a call to COMPUTE-STRING-SEARCH-CACHE with the same pattern.

(STRING-SEARCH/CI string pattern [cache]) -> [match-start match-end] or [#F #F]
(STRING-SEARCH*/CI string pattern [cache]) -> start/end list
(COMPUTE-STRING-SEARCH-CACHE/CI pattern) -> string-search-cache
  Equivalents of the above three procedures, but insensitive to the case of the
  strings.

* String Cursors

String cursors are a disjoint type, likely to be a primitive immediate type.
They represent the locations of individual units relative to a certain string.
(Cursors are tied to specific strings (no pun intended); however, in the
interest of allowing efficient implementations (which is a critical goal of
this specification), strings must be passed along with cursors to them in order
to allow cursors to have lightweight representations.)  The 'end cursor' of a
string does not represent a particular unit: it is instead the upper bound of a
string.  (STRING-CURSOR- (STRING-END-CURSOR string) string) returns a cursor on
the last unit in the string, or it is an error if the string is empty.

String cursors are required to be disjoint -- most notably from fixnums --
because it would be too easy to confuse them with natural number indices of
the underlying string units (or some other notion of characters that an easily
confused programmer might have) when they actually represent, say, octet
offsets.  The *only* operations that should be well-defined on cursors are
those that maintain alignment with the string's underlying representation of
units.  (Consider the havoc wrought if someone were to perform arbitrary fixnum
arithmetic on a cursor to a string with lots of combining codepoints...encoded
internally with UTF-8, where cursors are represented with octet offsets.)
String cursors can still, however, be represented efficiently by allocating a
primitive immediate type tag, just like characters often had in older systems.

(STRING-CURSOR? cursor [string]) -> boolean
  The optional string argument here tests whether CURSOR is a valid cursor into
  the string.  Otherwise this simply tests whether CURSOR is a string cursor.

(STRING-START-CURSOR string) -> cursor
(STRING-END-CURSOR string) -> cursor

(STRING-CURSOR-AT-START? cursor string) -> boolean
    = (STRING-CURSOR=? cursor (STRING-START-CURSOR string) string)
(STRING-CURSOR-AT-END? cursor string) -> boolean
    = (STRING-CURSOR=? cursor (STRING-END-CURSOR string) string)

(STRING-CURSOR=? cursor-a cursor-b string) -> boolean
(STRING-CURSOR<? cursor-a cursor-b string) -> boolean
    .
    .
    .
    &c.
  (Should these take the string argument first so they can be n-ary?  Are n-ary
   versions of these predicates useful?)

(STRING-CURSOR+ cursor string [addend]) -> cursor
(STRING-CURSOR- cursor string [subtrahend]) -> cursor
  ADDEND & SUBTRAHEND default to 1.

  (jcowan wants codepoint-granular cursors.  I have not yet thought this out
   well enough to decide whether or not to include them.)

** Conversion between Cursors and Natural Number Indices

(STRING-CURSOR->INDEX cursor string) -> natural number
    = (STRING-LENGTH (STRING-SLICE string (STRING-START-CURSOR string) cursor))
(INDEX->STRING-CURSOR index string) -> cursor
    = (STRING-CURSOR+ (STRING-START-CURSOR string) string index)
(STRING-CURSOR-DIFFERENCE cursor1 cursor2 string) -> integer
    = (STRING-LENGTH (STRING-SLICE string cursor1 cursor2))
        if cursor1 < cursor2 (i.e. points to an earlier unit), or
      (- (STRING-LENGTH (STRING-SLICE string cursor2 cursor1)))
        if the converse is true.
    = (- (STRING-CURSOR->INDEX cursor1 string)
         (STRING-CURSOR->INDEX cursor2 string))
  STRING-CURSOR->INDEX & INDEX->STRING-CURSOR convert between string cursors
  and natural number indices of units in strings.  STRING-CURSOR-DIFFERENCE
  returns the number of string units between the two cursors, inclusive on the
  lower bound and exclusive on the upper bound.

  Warning!  These may require O(n) time, where n is the index of the grapheme
  clusters, to complete!  Do not use these lightly!

* String Comparison and Ordering

(STRING=? string ...) -> boolean
(STRING<? string ...) -> boolean
(STRING>? string ...) -> boolean
(STRING<=? string ...) -> boolean
(STRING>=? string ...) -> boolean

(STRING-CI=? string ...) -> boolean
(STRING-CI<? string ...) -> boolean
(STRING-CI>? string ...) -> boolean
(STRING-CI<=? string ...) -> boolean
(STRING-CI>=? string ...) -> boolean
  = (STRING...? (STRING-CASE-FOLD string) ...)

These procedures all impose a total ordering on strings and units of strings.
The only restriction on this total ordering is that it be a globally consistent
ordering within a given Scheme system; it may _not_ be locale-sensitive.
Locale-sensitive collation is left to other specifications.

* String Literals

String literals may be _reliably_ written in plain ASCII, with the double-quote
character given meaning as a string terminator and backslash given meaning as
the escape character; \\ and \" are defined to mean a single backslash or
double-quote character, respectively, and \x is defined for literal codepoint
insertion.  \x escapes are of the form "\x(" <hex codepoint> (<whitespace> <hex
codepoint>)* ")"; e.g., "\x(5A)" is equivalent to "Z".  Only ASCII characters
may be portably encoded using \x, and so only one codepoint, of at most seven
bits, may be written in a \x(...) escape; however, in contexts known to use
other encodings, larger codepoints and sequences of codepoints may be used.
For example, "\x(03A5 0301)" in Unicode is a unit string, i.e. the one grapheme
cluster GREEK UPSILON WITH ACUTE AND HOOK SYMBOL, composed of two codepoints.

Scheme systems are encouraged to support a mechanism by which string constant
tables can be defined in separate modules, to support internationalization; in
those modules, the strings may reliably use encodings beyond ASCII.

* Encoding Strings in Binary Buffers (SRFI 74 Blobs)

(TEXT-ENCODING symbolic-name) -> text encoding descriptor or #F
  This mechanism allows querying of the available encodings without prescribing
  a large set of them (bound to top-level variables, for example).  Standard
  ones would include ASCII, LATIN-1, UTF-8, &c.; but note that implementations
  would not be _required_ to support any of the 'standard' encodings: those
  standard encodings are simply standard symbolic names that implementations
  should support if they _do_ support the corresponding encoding.  This allows,
  for example, portable code to be written that deals with UTF-8 if it is
  available or ASCII if it is not, without having to recompile it depending on
  current availability in the system (based on loaded libraries or something).

(STRING->BLOB string encoding) -> blob

(STRING->BLOB! string encoding blob [blob-start blob-end])
    -> [octets cursor]
  Returns the number of octets filled in the blob and the cursor of the first
  string unit that was not written into BLOB.  Only complete units of strings
  are written into the blob.  If the resultant cursor is an end cursor of
  STRING, in the sense of STRING-CURSOR-AT-END?, the string was completely
  copied into the blob.  If the resultant cursor is a start cursor of STRING,
  in the sense of STRING-CURSOR-AT-START?, none of the string was copied into
  the blob: there was not enough room even for the first unit (or either of the
  blob and the string was empty).

(BLOB->STRING blob encoding [blob-start blob-end]) -> string

  (What if BLOB is mutated?  Should there be a variant that assumes it will
   never be, so that some of the blob can be shared?)

* Examples

(define (string-substitute string old new)
  (receive (start end) (string-search string old)
    (if (and start end)
        (string-displace string start end new)
        string)))

(define (string-substitute* string pattern substitute)
  (let ((cache (compute-string-search-cache pattern)))
    (receive (start end) (string-search string pattern cache)
      (if (not (and start end))
          string
          (call-with-string-output-port ; Some sort of way to collect a
            (lambda (output-port)       ; string -- string ports work.
              (let loop ((previous-start (string-start-cursor string))
                         (match-start start)
                         (match-end end))
                (write-string output-port
                              ;; Collect between here (START) and the start of
                              ;; the pattern match.
                              (string-slice string previous-start match-start))
                (write-string output-port substitute)
                (receive (start end)
                         (string-search (string-slice string match-end)
                                        pattern cache)
                  (if (and start end)
                      (loop match-end start end))))))))))