A New Text API for Scheme                              -*- outline -*-

* Copying

Copyright (C) 2007 Taylor R Campbell.
All rights reserved.

This document is a draft.  Please do not copy or distribute it, except
to identify it by URI.  It will remain indefinitely at
<http://mumble.net/~campbell/proposals/new-text.txt>.

* Issues

** Binary Formats

Procedures are needed for encoding and decoding the binary
representations of text to and from octet vectors of some nature.

** Bounds on Running Time of Indexing

Will Clinger wants stronger guarantees for the running times of the
indexing operations.  I am not willing to prohibit space-efficient
internal representations, like UTF-8, for texts by offering these
guarantees.  This will be an issue of some contention, I predict.

** Counting Code Points

Should there be a TEXT-LENGTH procedure, which returns the number of
code points in the text?  I hope for a better name, if so.

** Character Data Type

John Cowan wants to use the data type that Scheme calls `characters'
as simply a wrapper for code points (and consequently to specify the
domain of characters to be that of the Unicode code points (or scalar
values)).  I am not sure whether I want this, or whether I'd rather
simply use exact, non-negative integers in a particular range.

** `Code Point' versus `Scalar Value'

Throughout this document I use the term `code point', not `scalar
value'.  The Unicode standard uses `scalar value' only when discussing
Unicode encoding forms, and nowhere else.  I do not, however, want
texts to be permitted to contain surrogate code points exposed to
programs.  (Internal representations based on UTF-16 are permissible,
however, as long as TEXT-FORWARD skips over two code units if they are
surrogates, and TEXT-REF always returns the composition of a surrogate
pair, not either of the two surrogate code points.)

** Grapheme Cluster Motion

There should be procedures for grapheme cluster motion of positions
(TEXT-FORWARD-DGC and TEXT-BACKWARD-DGC).

** Input

Input of text from ports, and text input ports (analogous to SRFI 6:
Basic String Ports), should be specified.

** Comparison and Binary versus Canonical Equivalence

Text comparison is currently defined by binary equivalence.  I am
passionately ambivalent about this: on the one hand, I want
programmers to be able to write programs that reliably compare text
for canonical equivalence; on the other hand, binary equivalence is
useful and more primitive.  What I absolutely *don't* want to do is to
leave it unspecified whether the text comparison routines work with
canonical equivalence or not.

However, I am not sure I want to require programmers to construct
intermediate canonical decompositions of texts that might be very
large only to compare them.  Avoiding this requires many operations to
be duplicated for binary versus canonical equivalence.

** Read Syntax

We need a read syntax for texts.  #"...", perhaps?

In this syntax, we also need syntax for code point escapes.  I think
that R6RS's -- \xHEXDIGITS; -- is probably worth using; e.g., a CRLF
text would be #"\xC;\xA;", and `foo' and `bar' in separate paragraphs
would be #"foo\x2029;bar".

** Existing Data Types

Perhaps there should be procedures for converting to and from Scheme
strings.

** Normalization

Normalization is a major issue.  It will have to wait until the next
draft of this document for an explanation of all the gruesome, subtle
details.  For these reasons which I have yet to explain, there are not
presently any TEXT-NORMALIZE-* routines (for * in {NFD, NFC, NFKD,
NFKD}).

** Collation

Should anything regarding collation be included?  (Real collation,
that is, not just the locale-independent lexicographic ordering
procedures that we already have here.)

** Implementation

The implementation should be finished, and possibly written in a more
portable way.  (Doing so is not hard given an octet vector
abstraction.)

** Naming Issues

- POSITION-IN-TEXT? reads well, but sets of all sorts of inconsistency
  bells in my head.

- TEXT-REF is very concise, but I am worried that it is too close to
  STRING-REF and VECTOR-REF.  Also, it does not indicate that it
  obtains the code point *after* the position; perhaps there ought to
  be another operation for obtaining the code point before the
  position, and the two ought to be given symmetric names.  Symmetric
  names would probably be much less concise than TEXT-REF, but then
  TEXT-REF may be too terse.

- TEXT-FORWARD and TEXT-BACKWARD are not very descriptive.  My intent
  was that they would be basic operations that everyone would learn
  and not need to remember by giving them long names, so concision and
  flexibility for appending higher-level units to the name -- e.g.,
  TEXT-FORWARD-DGC (default grapheme cluster), TEXT-FORWARD-WORD, &c.
  -- would be more important, but I'm not sure whether this is a good
  idea.

- TEXT-FORWARD-SEARCH versus TEXT-SEARCH-FORWARD.  MIT Scheme uses the
  latter (STRING-SEARCH-FORWARD).  The former makes it more consistent
  with the convention I imagined for text position motion, but it's
  not actually a text motion operation.  Also, MIT Scheme puts the
  pattern first in the arguments; at the moment, the procedures in
  this draft put it second, and the string to search through first.

- I have never been able to think of a pattern by which to name the
  case-mapping routines so that mapping to upper, lower, and title
  case follows the same pattern as folding the case, without giving a
  clumsy name to either the case-folding routine or the other
  case-mapping routines.  Here are some possibilities, the first of
  which follows R6RS's naming scheme (which Will Clinger recommends):

  TEXT-UPCASE      TEXT-DOWNCASE    TEXT-TITLECASE   TEXT-FOLDCASE
  TEXT-UPPERCASE   TEXT-LOWERCASE   TEXT-TITLECASE   TEXT-FOLDEDCASE
  TEXT-UPPER-CASE  TEXT-LOWER-CASE  TEXT-TITLE-CASE  TEXT-FOLDED-CASE
  TEXT-CASE-UP     TEXT-CASE-DOWN   TEXT-CASE-TITLE  TEXT-CASE-FOLD

  TEXT-UPPERCASE?  TEXT-LOWERCASE?  TEXT-TITLECASE?  TEXT-FOLDEDCASE?
  TEXT-UPPER-CASE? TEXT-LOWER-CASE? TEXT-TITLE-CASE? TEXT-FOLDED-CASE?
  TEXT-UPPER-CASE? TEXT-LOWER-CASE? TEXT-TITLE-CASE? TEXT-CASE-FOLDED?

* Specification

Two new data types are introduced:

- A /text/ is an object that contains text represented as sequences of
  discrete /code points/.  Texts are immutable, and disjoint from all
  existing data types in Scheme.

- A /text position/ is an object that identifies the position between
  two code points in text, or one at the start or the end of text.
  Text positions are *not* necessarily disjoint from existing data
  types in Scheme; they may be identical to exact, non-negative
  integers, for example.

Internally, code points in texts may not be laid out in uniform-width
cells in memory.  Consequently, random access to code points by their
natural number in the sequence may not run in constant time.
Positions in texts are therefore identified by objects whose
representation is unspecified.  Following the convention of text
editors, positions identify not the sequence numbers of code points,
but positions between code points, or before or after all code points
in the text.  The latter positions are known as the start and end
positions of the text.

Texts may be `sliced up' and seen in parts with the TEXT-SLICE
procedure or its derivatives.  The only distinction between a text
that is a slice of another text is the start and end positions; if a
text is derived by taking a slice of a larger text, each refers to the
same underlying textual structure, but simply narrower or broader
views of it.  Because operating on a slice of a text is such a common
operation, the running time of TEXT-SLICE is guaranteed to be
sublinear in the number of code points within that slice of text;
programmers are encouraged to use TEXT-SLICE.

Any positions in the larger text that fall between the slice's start
and end positions will identify the analogous positions in the slice,
and any identify positions in the slice will also identify analogous
positions in the larger text.  A text position is said to be /valid in
a text/ if and only if the position was derived from a slice of the
text, or the position was derived from a larger text of which the text
in question is a slice, and falls between the text's start and end
positions.  Any text position is valid in an arbitrary number of
related texts, all of which are slices of one another; there is
therefore no unique text associated with a text position, and no
specified way to obtain the text from which a text position was
derived.

The consequences of using a text position with a text in which it is
not valid are unspecified.  Some implementations may associate with
every text a common identifying datum, such as a pointer to the
internal storage that a set of texts are slices of, and associate this
with every text position derived from any of these texts.  Other
implementations may represent text positions as tagged pointers into
the contents of the common internal storage.  These implementations
can detect misuses of text positions with texts in which they are
invalid, at the expense of some performance because text positions
require heap storage and extra time to fetch information from them, or
at the expense of some implementation complexity.

Still other implementations may not associate this information with
text positions, so they may not be able to detect misuses text
positions with texts in which they are invalid; instead they may
simply yield unpredictable results.  Such implementations are allowed
to represent text positions as exact, non-negative integers denoting
byte offsets from the start of some internal storage, which may fit in
machine registers and require no heap space, and which consequently
may operate much faster.  For example, a text might be a triple of a
pointer to a byte vector, a starting byte offset, and an ending byte
offset, and a text position might be a byte offset from the start of
that byte vector.

(This list of implementation strategies is neither exhaustive nor
normative; it is only illustrative of the certain pragmatic
implications of different strategies, and may be useful to
implementors.)

The space of code points in this interface is a subset of the set of
Unicode code points.  Implementations must support at least the
US-ASCII subset of code points, and are encouraged, but not required,
to support the full space of Unicode code points.  Programs should not
be written to assume that random access to code points by natural
number index will run in constant time (using TEXT-POSITION->INDEX and
INDEX->TEXT-POSITION), as may be the case in implementations that
support only US-ASCII, or an 8-bit superset thereof; these programs
may perform badly in implementations that support Unicode and use
variable-width encodings internally for text, such as UTF-8 or UTF-16.

** Procedures

*** Texts

  (TEXT? <object>) -> boolean

    Disjoint type predicate for text objects.

  (TEXT-EMPTY? <text>) -> boolean

    Returns true if there are no code points in <text>, and false if
    otherwise.  This is equivalent to

      (TEXT-POSITION=? (TEXT-START-POSITION <text>)
                       (TEXT-END-POSITION <text>)
                       <text>),

      (TEXT-POSITION-AT-START? (TEXT-END-POSITION <text>) <text>),

    and

      (TEXT-POSITION-AT-END? (TEXT-START-POSITION <text>) <text>).

  (EMPTY-TEXT) -> text

    Returns a text containing no code points.  It is not specified
    whether this yields an identical object for each invocation.

      (let ((text (empty-text)))
        (text-position=? (text-start-position text)
                         (text-end-position text)
                         text))
      ;Value: #T

  (TEXT-COPY <text>) -> text

    Returns a copy of <text> that shares no internal structure with
    <text>, and no extra, invisible internal storage occupied by any
    sibling slices of <text>.  Positions that were valid in <text> are
    not valid in copies of it.

    In implementations where text is represented without any invisible
    internal storage, such as implementations based on trees of
    content, this may simply return <text>.

  (TEXT-START-POSITION <text>) -> position

    Returns the first position in <text>, before which there are no
    code points.

  (TEXT-END-POSITION <text>) -> position

    Returns the last position in <text>, after which there are no code
    points.

*** Text Positions

  (POSITION-IN-TEXT? <position> <text>) -> boolean

    Returns true if <position> is a valid position in <text>, and
    false if not.  <Position> must be a text position in either case.

    This procedure might not precisely discriminate validity:
    positions may appear valid in some text from which they were not
    derived, and the value may vary from implementation to
    implementation.

  (TEXT-POSITION=? <position-a> <position-b> <text>) -> boolean

    Returns true if <position-a> and <position-b> are the same
    position in <text>.  Returns false if they are different
    positions.  It is an error if either of <position-a> or
    <position-b> is not a valid position in <text>.

  (TEXT-POSITION<?  <position-a> <position-b> <text>) -> boolean
  (TEXT-POSITION>=? <position-a> <position-b> <text>) -> boolean
  (TEXT-POSITION>?  <position-a> <position-b> <text>) -> boolean
  (TEXT-POSITION<=? <position-a> <position-b> <text>) -> boolean

    Procedures imposing a total ordering on text positions.  It is an
    error if <position-a> and <position-b> are not valid positions in
    <text>.

  (TEXT-POSITION-AT-START? <position> <text>) -> boolean

    Returns true if <position> is the position before all code points
    in <text>, or false if not.  It is an error if <position> is not a
    valid position in <text>.

      (define (text-position-at-start? position text)
        (text-position=? position (text-start-position text)))

  (TEXT-POSITION-AT-END? <position> <text>) -> boolean

    Returns true if <position> is the position after all code points
    in <text>, or false if not.  It is an error if <position> is not a
    valid position in <text>.

      (define (text-position-at-end? position text)
        (text-position=? position (text-end-position text)))

  (TEXT-REF <text> <position>) -> code point

    Returns the first code point in <text> after <position>.  Signals
    an error if there are no code points after <position>.  It is an
    error if <position> is not a valid position in <text>.  TEXT-REF
    runs in time sublinear in the size of <text>.

  (TEXT-FORWARD <position> <text> [<count>]) -> position or false

    Returns a text position in <text> that is <count> code points
    after <position>, or false if there are fewer than <count> code
    points after <position> in <text>.  If <count> is not supplied,
    its default value is 1.  It is an error if <position> is not a
    valid position in <text>.  TEXT-FORWARD runs in time sublinear in
    the size of <text>.

  (TEXT-BACKWARD <position> <text> [<count>]) -> position or false

    Returns a text position in <text> that is <count> code points
    before <position>, or false if there are fewer than <count> code
    points before <position> in <text>.  If <count> is not supplied,
    its default value is 1.  It is an error if <position> is not a
    valid position in <text>.  TEXT-BACKWARD runs in time sublinear in
    the size of <text>.

*** Text Slices

  (TEXT-SLICE <text> <start-position> <end-position>) -> text

    Returns a text that contains the sequence of code points in <text>
    between <start-position> and <end-position>.  This slice of the
    text may refer to storage shared by <text>.  The running time of
    TEXT-SLICE on average must be sublinear in the number of code
    points between <start-position> and <end-position>.

    For example, TEXT-SLICE may run in (amortized) constant time, if a
    text is a triple of internal storage, a start bound, and an end
    bound, and TEXT-SLICE need only construct a new triple with
    tighter bounds; or TEXT-SLICE may run in logarithmic time, if a
    text is structured as a tree of content.  TEXT-SLICE may *not*
    simply copy all of the code points in <text>; programmers may rely
    on its performance, and should not be afraid to use it frequently.

  (TEXT-PREFIX <text> <end-position>) -> text
  (TEXT-SUFFIX <text> <start-position>) -> text

    TEXT-PREFIX returns a slice of <text> that contains the sequence
    of code points before <end-position>.  TEXT-SUFFIX returns a slice
    of <text> that contains the sequence of code points after
    <start-position>.

      (define (text-prefix text end-position)
        (text-slice text (text-start-position text) end-position))

      (define (text-suffix text start-position)
        (text-slice text start-position (text-end-position text)))

  (TEXT-SPLIT-AT <text> <position>) -> [text text]

    Returns two values: the prefix of <text> before <position> and the
    suffix of <text> after <position>.

      (define (text-split-at text position)
        (values (text-prefix text position)
                (text-suffix text position)))

*** Code Point Indexing

  Because text is composed of a sequence of code points, each code
  point in the sequence may be assigned a natural number index
  starting from 0.  There is an isomorphism between these code point
  indices and text positions.  The following procedures deal with this
  isomorphism, where indices are exact, non-negative integers.

  (TEXT-POSITION->INDEX <position> <text>) -> index

    Returns the number of code points in <text> that precede
    <position>, or the number of iterated applications of TEXT-FORWARD
    to the start position of <text> necessary to find a position
    equal to <position>.  This operation may run in time proportional
    to the value of the index.  It is an error if <position> is not a
    valid position in <text>.

      (define (text-position->index position text)
        (do ((index 0 (+ index 1))
             (position* (text-start-position text)
                        (text-forward position* text)))
            ((text-position=? position* position)
             index)))

  (INDEX->TEXT-POSITION <index> <text>) -> position

    Returns a text position after the <index>th code point in <text>,
    or iteratively applies TEXT-FORWARD <index> times to the start
    position of <text>.  This operation may run in time proportional
    to the value of the index.  It is an error if <index> exceeds the
    number of code points in <text>.

      (define (index->text-position index text)
        (do ((index index (- index 1))
             (position (text-start-position text)
                       (text-forward position text)))
            ((zero? index)
             position)))

  (TEXT-POSITION-DIFFERENCE <position-a> <position-b> <text>)
      -> integer difference

    Returns the number of code points after <position-b> and before
    <position-a>.  If <position-a> precedes <position-b> in <text>,
    this is equal to the additive inverse of the number of code points
    after <position-a> and before <position-b>, i.e.
    (- (TEXT-POSITION-DIFFERENCE <position-b> <position-a> <text>)).

      (define (text-position-difference position-a position-b text)
        (- (text-position->index position-a text)
           (text-position->index position-b text)))

      (define (text-position-difference position-a position-b text)
        (let ((distance
               (text-position-distance position-a position-b text)))
          (if (text-position<? position-a position-b text)
              (- 0 distance)
              distance)))

  (TEXT-POSITION-DISTANCE <position-a> <position-b> <text>)
      -> absolute value

    Returns the number of code points between <position-a> and
    <position-b> in <text>.  This is the absolute value of the
    difference between <position-a> and <position-b>.

      (define (text-position-distance position-a position-b text)
        (abs (text-position-difference position-a position-b text)))

*** Text Comparison

  (TEXT=? <text> ...) -> boolean

    Returns true if each of the texts supplied as arguments contain
    identical sequences of code points; returns false if otherwise.

      (define (binary-text=? text-a text-b)
        (receive (start end) (text-search-forward text-a text-b)
          (and 
           start end
           (text-position=? start (text-start-position text-a) text-a)
           (text-position=? end (text-end-position text-a) text-a))))

  (TEXT-CI=? <text> ...) -> boolean

    Returns true if each of the texts supplied as arguments contain
    identical case-folded sequences of code points; returns false if
    otherwise.  This is equivalent to

      (TEXT=? (TEXT-CASE-FOLD <text>) ...).

      (define (binary-text-ci=? text-a text-b)
        (receive (start end) (text-search-forward/ci text-a text-b)
          (and 
           start end
           (text-position=? start (text-start-position text-a) text-a)
           (text-position=? end (text-end-position text-a) text-a))))

  (TEXT-PREFIX? <prefix> <text>) -> boolean
  (TEXT-SUFFIX? <suffix> <text>) -> boolean

    These return true if <prefix> or <suffix> contains a sequence of
    code points identical to some prefix or suffix, respectively, of
    the sequence of code points in <text>.  <Prefix>, <suffix>, and
    <text> must all be texts.

      (define (text-prefix? prefix text)
        (receive (start end) (text-search-forward text prefix)
          (and start end
               (text-position=? start
                                (text-start-position text)
                                text))))

  (TEXT-PREFIX-CI? <prefix> <text>) -> boolean
  (TEXT-SUFFIX-CI? <suffix> <text>) -> boolean

    Case-insensitive variants of the above.

      (define (text-prefix-ci? prefix text)
        (text-prefix? (text-case-fold prefix)
                      (text-case-fold text)))

  (TEXT-MISMATCH <text-a> <text-b>)
      -> [position-a position-b] or [#F #F]

    Compares the code points in <text-a> and <text-b>.  If the two
    texts contain identical sequences of code points, returns the two
    values #F and #F; otherwise, returns two positions in <text-a> and
    <text-b>, respectively, preceding the first code point that
    differs between the two texts, or after which one text's sequence
    of code points ends but the other's does not.

  (TEXT-MISMATCH-CI <text-a> <text-b>)
      -> [position-a position-b] or [#F #F]

    Case-insensitive variant of the above procedure.

**** Lexicographic Ordering

  (TEXT<? <text> ...) -> boolean
  (TEXT>=? <text> ...) -> boolean
  (TEXT>? <text> ...) -> boolean
  (TEXT<=? <text> ...) -> boolean

    These impose a total lexicographic ordering on texts by the
    sequence of code points in them.

  (TEXT-CI<? <text> ...) -> boolean
  (TEXT-CI>=? <text> ...) -> boolean
  (TEXT-CI>? <text> ...) -> boolean
  (TEXT-CI<=? <text> ...) -> boolean

    These impose a total lexicographic ordering on texts by the
    sequence of code points in them, insensitive to case differences.
    These are equivalent to passing the results of case-folding the
    input texts to the case-sensitive routines above.

  (TEXT-COMPARE <text-a> <text-b> <if-eq> <if-lt> <if-gt>)
      -> values

    Compares the sequence of code points in <text-a> and <text-b>, and
    calls either <if-lt>, <if-eq>, or <if-gt> in a tail position with
    respect to the call to TEXT-COMPARE.  TEXT-COMPARE calls <if-eq>
    with zero arguments if the two texts contain identical sequences
    of code points.  If the two texts contain distinct sequences of
    code points, then TEXT-COMPARE calls either <if-lt> or <if-gt>
    with two arguments: two positions in <text-a> and <text-b>,
    respectively, preceding the first code point that differs between
    the two texts, or after which one text ends but the other does
    not.  TEXT-COMPARE will call <if-lt> if both texts have more code
    points and the first code point in <text-a> that differs from the
    corresponding code point in <text-b> is numerically less than its
    its counterpart in <text-b>, or if <text-a> is shorter than
    <text-b>; and TEXT-COMPARE will call <if-gt> otherwise.

      (define (text-mismtach text-a text-b)
        (define (if-match)
          (values #f #f))
        (define (if-mismatch position-a position-b) 
          (values position-a position-b))
        (text-compare text-a text-b if-match if-mismatch if-mismatch))

      (define (text-compare text-a text-b if= if< if>)
        (receive (position-a position-b) (text-mismatch text-a text-b)
          (if (not (and position-a position-b))
              (if=)
              ((cond ((text-position-at-end? position-a) if<)
                     ((text-position-at-end? position-b) if>)
                     (else
                      (if (< (text-ref text-a position-a)
                             (text-ref text-b position-b))
                          if<
                          if>)))
               position-a
               position-b))))

      (define (binary-text<? text-a text-b)
        (text-compare text-a text-b
          (lambda () #f)
          (lambda (position-a position-b) #t)
          (lambda (position-a position-b) #f)))

*** Case Mapping

  (TEXT-UPPER-CASE <text>) -> text
  (TEXT-LOWER-CASE <text>) -> text
  (TEXT-TITLE-CASE <text>) -> text
  (TEXT-CASE-FOLD <text>) -> text

    These map <text> to their upper-case, lower-case, title-case, or
    case-folded equivalents, respectively, according to the Unicode
    Standard, Chapter 3, Section 13, `Default Case Operations', `Case
    Conversion of Strings'.

  (TEXT-UPPER-CASE? <text>) -> boolean
  (TEXT-LOWER-CASE? <text>) -> boolean
  (TEXT-TITLE-CASE? <text>) -> boolean
  (TEXT-CASE-FOLDED? <text>) -> boolean

    These test whether <text> is entirely upper-case, lower-case,
    title-case, or case-folded, respectively, according to the Unicode
    Standard, Chapter 3, Section 13, `Default Case Operations', `Case
    Detection for Strings'.

*** Text Search

  (TEXT-SEARCH-FORWARD <text> <pattern>)
      -> [start-position end-position] or [#F #F]
  (TEXT-SEARCH-BACKWARD <text> <pattern>)
      -> [start-position end-position] or [#F #F]

    TEXT-SEARCH-FORWARD searches forward through <text> for the first
    occurrence of the pattern; TEXT-SEARCH-BACKWARD searches backward
    through <text> for the last occurrence of <pattern>.  If an
    occurrence is found, returns the start and end positions of the
    occurrence of <pattern> in <text>; if no occurrence is found,
    returns (VALUES #F #F).

    <Pattern> may be a text or a cache as returned by
    TEXT-FORWARD-SEARCH-CACHE or TEXT-BACKWARD-SEARCH-CACHE,
    respectively.

  (TEXT-SEARCH-FORWARD* <text> <pattern>) -> occurrence list
  (TEXT-SEARCH-BACKWARD* <text> <pattern>) -> occurrence list

    TEXT-SEARCH-FORWARD* searches forward through <text> for all
    occurrences of <pattern>, from start to end; TEXT-SEARCH-BACKWARD*
    searches backward through <text> for all occurrences of <pattern>,
    from end to start.  The occurrences are returned as a list of
    pairs whose cars are the starting positions of the occurrences and
    whose cdrs are the ending positions of the occurrences.
    Occurrences may overlap.

    <Pattern> may be a text or a cache as returned by
    TEXT-FORWARD-SEARCH-CACHE or TEXT-BACKWARD-SEARCH-CACHE,
    respectively.

      (define (text-search-forward* text pattern)
        (let ((cache (text-forward-search-cache pattern)))
          (let loop ((text text) (matches '()))
            (receive (start end) (text-search-forward text cache)
              (if (and start end)
                  ;; To implement non-overlapping patterns, we could
                  ;; write (TEXT-SUFFIX TEXT END), rather than
                  ;; (TEXT-SUFFIX TEXT (TEXT-FORWARD START TEXT))
                  ;; here.
                  (loop (text-suffix text (text-forward start text))
                        (cons (cons start end) matches))
                  (reverse matches))))))

  (TEXT-FORWARD-SEARCH-CACHE <pattern>) -> search cache
  (TEXT-BACKWARD-SEARCH-CACHE <pattern>) -> search cache

    If <pattern> is a search cache as returned by one of these
    procedures, they return <pattern>.  If <pattern> is a text, they
    perform any useful calculations to search for <pattern> in texts,
    and return an object caching the results of these calculations,
    suitable for passing to any of the above text search routines.
    For example, an implementation of text searching that uses the KMP
    algorithm might compute a restart vector for the pattern.
    TEXT-FORWARD-SEARCH-CACHE may be used only for the forward search
    routines; TEXT-BACKWARD-SEARCH-CACHE, for the backward search
    routines.

    These procedures must be idempotent.  This allows programs to
    insert a sort of optional optimization annotation when they will
    use a pattern multiple times for search only, while permitting
    their callers to use the same annotations.  See the above sample
    definition of TEXT-SEARCH-FORWARD* for an example.

    Text search caches are independent of the instance of searching;
    they may be reused arbitrarily many times with the above search
    procedures.

    The precise nature of the object returned by these procedures is
    unspecified.  The following is a perfectly valid, though
    discouraged, implementation of these procedures:

      (define (text-forward-search-cache pattern) pattern)
      (define (text-backward-search-cache pattern) pattern)

  (TEXT-SEARCH-FORWARD/CI <text> <pattern>)
      -> [start-position end-position] or [#F #F]
  (TEXT-SEARCH-BACKWARD/CI <text> <pattern>)
      -> [start-position end-position] or [#F #F]
  (TEXT-SEARCH-FORWARD*/CI <text> <pattern>) -> occurrence list
  (TEXT-SEARCH-BACKWARD*/CI <text> <pattern>) -> occurrence list
  (TEXT-FORWARD-SEARCH-CACHE/CI <pattern>) -> search cache
  (TEXT-BACKWARD-SEARCH-CACHE/CI <pattern>) -> search cache

    Case-insensitive variants of the preceding six procedures.

*** Text Editing: Concatenation, Replacement, Insertion, and Deletion

  (TEXT-APPEND <text> ...) -> text
  (TEXT-CONCATENATE <list-of-texts>) -> text
  (TEXT-JOIN <list-of-texts> <prefix> <infix> <suffix>)
      -> text

    These return concatenations of text.  TEXT-APPEND returns the
    concatenation of each <text>.  TEXT-CONCATENATE returns the
    concatenation of each element of <list-of-texts>.  TEXT-JOIN
    returns a text composed by concatenating <prefix>, each of the
    elements of <list-of-texts> with <infix> between each element, and
    <suffix>.  If <list-of-texts> is empty, the result will be that of
    (TEXT-APPEND <prefix> <suffix>); any other behaviour must be
    written explicitly by the programmer.

    Text positionss that were valid in the input texts are not valid
    in the output texts.  Implementations are permitted to return the
    sole argument passed to TEXT-APPEND, or the sole element of unary
    lists passed to TEXT-CONCATENATE, but programs may not rely on
    this.

      (define (text-append . texts)
        (text-concatenate texts))

      (define (text-concatenate texts)
        (text-join texts (empty-text) (empty-text) (empty-text)))

  (TEXT-REPLACE <text> <start-position> <end-position> <replacement>)
      -> text

    Returns a text composed of the sequence of code points in <text>
    before <start-position>, followed by the sequence of code points
    in <replacement>, followed by the sequence of code points in
    <text> after <end-position>.  It is an error if either of
    <start-position> and <end-position> is not a valid position in
    <text>.

      (define (text-replace text start-position end-position
                            replacement)
        (receive (text* position*)
                 (text-delete* text start-position end-position)
          (text-insert text* position* replacement))

      (define (text-substitute text pattern substitution)
        (receive (start end) (text-search-forward text pattern)
          (if (and start end)
              (text-replace text start end substitution)
              text)))

      ;;; The following definition of TEXT-SUBSTITUTE* may perform
      ;;; well in an implementation of text based on balanced trees,
      ;;; where TEXT-REPLACE can run in logarithmic time, but is
      ;;; unlikely to perform well in an implementation based on
      ;;; pointers into contiguous blocks of memory where TEXT-REPLACE
      ;;; runs in time linearly proportional to the size of the text.
      ;;;
      ;;; Further below, there is another definition, which uses ports
      ;;; to accumulate the output text, and which is likely to
      ;;; perform well in all systems.

      (define (text-substitute* text pattern substitution)
        (let ((cache (text-forward-search-cache text pattern)))
          (let loop ((text text)
                     (position (text-start-position text)))
            (receive (start end)
                     (text-search-forward (text-suffix text position)
                                          cache)
              (if (not (and start end))
                  text
                  (loop (text-replace text start end substitution)
                        end))))))

  (TEXT-INSERT <text> <position> <insertion>) -> text

    Returns a text composed of the sequence of code points in <text>
    before <position>, followed by the sequence of code points in
    <insertion>, followed by the sequence of code points in <text>
    after <position>.  It is an error if <position> is not a valid
    position in <text>.

      (define (text-insert text position insertion)
        (text-replace text position position insertion))

  (TEXT-DELETE <text> <start-position> <end-position>) -> text

    Returns a text composed of the sequence of code points in <text>
    before <start-position> followed by the sequence of code points in
    <text> after <end-position>.  It is an error if either of
    <start-position> and <end-position> is not a valid position in
    <text>.

      (define (text-delete text start-position end-position)
        (text-replace text start-position end-position (empty-text)))

  (TEXT-DELETE* <text> <start-position> <end-position>)
      -> [text position]

    Returns two values: a text composed by deleting the sequence of
    code points in <text> between <start-position> and <end-position>,
    as if with TEXT-DELETE; and the position in the resultant text
    where the first text was deleted.

  (TEXT-EXTRACT-AND-DELETE <text> <start-position> <end-position>)
      -> [text extraction position]

    Returns three values: a text composed by deleting the sequence of
    code points in <text> between <start-position> and <end-position>,
    as if with TEXT-DELETE; a text composed of the sequence of code
    points between <start-position> and <end-position> in <text>, i.e.
    the text that was deleted; and the position in the first text
    where the second text was deleted, as with TEXT-DELETE*.

*** Text I/O

  (WRITE-TEXT <text> <output-port>)

    Writes the sequence of code points in <text> to <output-port>,
    according to the particular encoding associated with
    <output-port>.  <Output-port> may impose limitations on the code
    points that it can encode, and consequently WRITE-TEXT may signal
    an error if <text> contains code points outside those limitations.

  (OPEN-OUTPUT-TEXT) -> output-text-port

    Returns an output port that accumulates the text written to it.
    The accumulated text may be subsequently retrieved by passing the
    port to GET-OUTPUT-TEXT or GET-OUTPUT-TEXT!.  There are no
    limitations on the text that the port can accumulate.

  (GET-OUTPUT-TEXT <output-text-port>) -> text

    Returns the text accumulated by <output-text-port> so far.  This
    does not destroy <output-text-port>; that is, it may continue to
    accumulate text.

  (GET-OUTPUT-TEXT! <output-text-port>) -> text

    Returns the text accumulated by <output-text-port> and destroys
    <output-text-port>; the consequences of writing further data to
    it, or passing it to GET-OUTPUT-TEXT or GET-OUTPUT-TEXT! again,
    are unspecified.

      (define (call-with-output-text procedure)
        (let ((output-port (open-output-text)))
          (procedure output-port)
          (get-output-text! output-port)))

      (define (text-substitute* text pattern substitution)
        (let ((cache (text-forward-search-cache pattern)))
          (receive (start end) (text-search-forward text cache)
            (if (not (and start end))
                text
                (call-with-output-text
                  (lambda (output-port)
                    (let loop ((text text)
                               (match-start start)
                               (match-end end))
                      (write-text (text-prefix text match-start)
                                  output-port)
                      (write-text substitution output-port)
                      (let ((text (text-suffix text match-end)))
                        (receive (start end)
                                 (text-search-forward text cache)
                          (if (and start end)
                              (loop text start end)))))))))))