A New String API -*- outline -*- #### This document is obsolete! #### #### Please refer to . #### Strings are sequences -- not necessarily arrays -- of some unspecified unit. (In ASCII, this unit is the 7-bit character; in Unicode, this unit will likely be the grapheme cluster.) String units are indexed by opaque cursor objects, _not_ natural numbers as in most systems. Binary representation & encoding is very carefully separated from the bulk of the API: to operate on a string's binary representation, one must operate on a SRFI 74 blob that contains the encoded string; direct access to a string's internal encoding is not provided. Separate specifications would define exactly what units are in strings. For instance, a basic ASCII specification would specify that the basic unit of strings is the ASCII character; a Unicode specification would define the units of strings as Unicode grapheme clusters. * Base String Operations (STRING? obj) -> boolean (STRING-LENGTH string) -> natural number (STRING-EMPTY? string) -> boolean = (ZERO? (STRING-LENGTH string)) (UNIT-STRING? obj) -> boolean = (AND (STRING? obj) (= 1 (STRING-LENGTH obj))) Explicit string units are represented as strings themselves of length one. The following invariant will always hold for any string and valid cursor: (STRING-LENGTH (STRING-REF string cursor)) = 1 (STRING-REF string cursor) -> unit-string = (STRING-SLICE string cursor (STRING-CURSOR+ cursor string)) Note: STRINGS ARE IMMUTABLE!!! There is *no* STRING-SET!. Also, STRING-REF may allocate on the heap, because it returns a string. (STRING-SLICE string start [end]) -> string (STRING-COPY string [start [end]]) -> string STRING-SLICE is guaranteed to be an amortized constant-time operation, in that it may not copy all input strings, but must use some mechanism of shared storage. Also, any string cursors into a string returned by STRING-SLICE are valid cursors into the original input string to STRING-SLICE. STRING-COPY is guaranteed *not* to share any storage, so that what storage the whole input string occupied can be recycled. Cursors into the string returned by STRING-COPY are *not* guaranteed to be valid cursors into the input string, and almost certainly will not be. STRING-SLICE has only an optional end cursor argument because it would be pointless with two optional arguments, neither supplied -- (STRING-SLICE ) might as well be written as simply --, but STRING-COPY is still useful with only one argument, as a sort of recycling operator. STRING-SLICE is allowed to return an object that is identical, in the sense of EQ?, to its string parameter; STRING-COPY is not. (What about STRING-TAKE, STRING-DROP, and a required-three-argument STRING-SLICE, to simplify the single-optional-argument wart?) (STRING-APPEND string ...) -> string (STRING-JOIN strings prefix infix suffix) -> string (STRING-DISPLACE string start end substring) -> string = (STRING-APPEND (STRING-SLICE string (STRING-START-CURSOR string) start) substring (STRING-SLICE string end (STRING-END-CURSOR string))) (STRING-CASE-FOLD string) -> string (STRING-UPCASE string) -> string (STRING-DOWNCASE string) -> string (STRING-TITLECASE string) -> string Case operations on strings. These are locale-_insensitive_. Separate locale-sensitive operations can be defined by a separate specification that also defines operations on locales; e.g., (STRING-TITLECASE/LOCALE string [locale]), where the locale argument defaults to some representation of the current locale. (I hate the name STRING-CASE-FOLD, but the other options I can think of aren't much better: STRING-FOLDCASE, FOLD-STRING-CASE, FOLDCASE-STRING, CASE-FOLD-STRING, ...) (STRING-SEARCH string pattern [cache]) -> [match-start match-end] or [#F #F] (STRING-SEARCH* string pattern [cache]) -> start/end list (COMPUTE-STRING-SEARCH-CACHE pattern) -> string-search-cache STRING-SEARCH searches for the first occurrence of the supplied pattern in the supplied string. If such an occurrence is found, it returns two cursors, one at the start of the pattern (inclusive) and one at the end of the pattern (exclusive) in the string. If no such occurrence is found, it returns (VALUES #F #F). STRING-SEARCH* returns a list of pairs representing the spans of occurrences of the pattern in the input string: each element's car is a cursor for the start of an occurrence of the pattern, inclusive, and its cdr is a cursor for the end, exclusive. Occurrences may overlap, but the start cursors are guaranteed to be ordered increasingly and likewise the end cursors. The optional cache argument allows a cache, such as a KMP restart vector or a set of Boyer-Moore search functions, to be precomputed preceding the actual search, often if the search is repeated. Exactly what the cache represents is unspecified; implementations are encouraged to experiment with this. It is an error to supply a pattern and a cache such that the cache was not created by a call to COMPUTE-STRING-SEARCH-CACHE with the same pattern. (STRING-SEARCH/CI string pattern [cache]) -> [match-start match-end] or [#F #F] (STRING-SEARCH*/CI string pattern [cache]) -> start/end list (COMPUTE-STRING-SEARCH-CACHE/CI pattern) -> string-search-cache Equivalents of the above three procedures, but insensitive to the case of the strings. * String Cursors String cursors are a disjoint type, likely to be a primitive immediate type. They represent the locations of individual units relative to a certain string. (Cursors are tied to specific strings (no pun intended); however, in the interest of allowing efficient implementations (which is a critical goal of this specification), strings must be passed along with cursors to them in order to allow cursors to have lightweight representations.) The 'end cursor' of a string does not represent a particular unit: it is instead the upper bound of a string. (STRING-CURSOR- (STRING-END-CURSOR string) string) returns a cursor on the last unit in the string, or it is an error if the string is empty. String cursors are required to be disjoint -- most notably from fixnums -- because it would be too easy to confuse them with natural number indices of the underlying string units (or some other notion of characters that an easily confused programmer might have) when they actually represent, say, octet offsets. The *only* operations that should be well-defined on cursors are those that maintain alignment with the string's underlying representation of units. (Consider the havoc wrought if someone were to perform arbitrary fixnum arithmetic on a cursor to a string with lots of combining codepoints...encoded internally with UTF-8, where cursors are represented with octet offsets.) String cursors can still, however, be represented efficiently by allocating a primitive immediate type tag, just like characters often had in older systems. (STRING-CURSOR? cursor [string]) -> boolean The optional string argument here tests whether CURSOR is a valid cursor into the string. Otherwise this simply tests whether CURSOR is a string cursor. (STRING-START-CURSOR string) -> cursor (STRING-END-CURSOR string) -> cursor (STRING-CURSOR-AT-START? cursor string) -> boolean = (STRING-CURSOR=? cursor (STRING-START-CURSOR string) string) (STRING-CURSOR-AT-END? cursor string) -> boolean = (STRING-CURSOR=? cursor (STRING-END-CURSOR string) string) (STRING-CURSOR=? cursor-a cursor-b string) -> boolean (STRING-CURSOR boolean . . . &c. (Should these take the string argument first so they can be n-ary? Are n-ary versions of these predicates useful?) (STRING-CURSOR+ cursor string [addend]) -> cursor (STRING-CURSOR- cursor string [subtrahend]) -> cursor ADDEND & SUBTRAHEND default to 1. (jcowan wants codepoint-granular cursors. I have not yet thought this out well enough to decide whether or not to include them.) ** Conversion between Cursors and Natural Number Indices (STRING-CURSOR->INDEX cursor string) -> natural number = (STRING-LENGTH (STRING-SLICE string (STRING-START-CURSOR string) cursor)) (INDEX->STRING-CURSOR index string) -> cursor = (STRING-CURSOR+ (STRING-START-CURSOR string) string index) (STRING-CURSOR-DIFFERENCE cursor1 cursor2 string) -> integer = (STRING-LENGTH (STRING-SLICE string cursor1 cursor2)) if cursor1 < cursor2 (i.e. points to an earlier unit), or (- (STRING-LENGTH (STRING-SLICE string cursor2 cursor1))) if the converse is true. = (- (STRING-CURSOR->INDEX cursor1 string) (STRING-CURSOR->INDEX cursor2 string)) STRING-CURSOR->INDEX & INDEX->STRING-CURSOR convert between string cursors and natural number indices of units in strings. STRING-CURSOR-DIFFERENCE returns the number of string units between the two cursors, inclusive on the lower bound and exclusive on the upper bound. Warning! These may require O(n) time, where n is the index of the grapheme clusters, to complete! Do not use these lightly! * String Comparison and Ordering (STRING=? string ...) -> boolean (STRING boolean (STRING>? string ...) -> boolean (STRING<=? string ...) -> boolean (STRING>=? string ...) -> boolean (STRING-CI=? string ...) -> boolean (STRING-CI boolean (STRING-CI>? string ...) -> boolean (STRING-CI<=? string ...) -> boolean (STRING-CI>=? string ...) -> boolean = (STRING...? (STRING-CASE-FOLD string) ...) These procedures all impose a total ordering on strings and units of strings. The only restriction on this total ordering is that it be a globally consistent ordering within a given Scheme system; it may _not_ be locale-sensitive. Locale-sensitive collation is left to other specifications. * String Literals String literals may be _reliably_ written in plain ASCII, with the double-quote character given meaning as a string terminator and backslash given meaning as the escape character; \\ and \" are defined to mean a single backslash or double-quote character, respectively, and \x is defined for literal codepoint insertion. \x escapes are of the form "\x(" ( )* ")"; e.g., "\x(5A)" is equivalent to "Z". Only ASCII characters may be portably encoded using \x, and so only one codepoint, of at most seven bits, may be written in a \x(...) escape; however, in contexts known to use other encodings, larger codepoints and sequences of codepoints may be used. For example, "\x(03A5 0301)" in Unicode is a unit string, i.e. the one grapheme cluster GREEK UPSILON WITH ACUTE AND HOOK SYMBOL, composed of two codepoints. Scheme systems are encouraged to support a mechanism by which string constant tables can be defined in separate modules, to support internationalization; in those modules, the strings may reliably use encodings beyond ASCII. * Encoding Strings in Binary Buffers (SRFI 74 Blobs) (TEXT-ENCODING symbolic-name) -> text encoding descriptor or #F This mechanism allows querying of the available encodings without prescribing a large set of them (bound to top-level variables, for example). Standard ones would include ASCII, LATIN-1, UTF-8, &c.; but note that implementations would not be _required_ to support any of the 'standard' encodings: those standard encodings are simply standard symbolic names that implementations should support if they _do_ support the corresponding encoding. This allows, for example, portable code to be written that deals with UTF-8 if it is available or ASCII if it is not, without having to recompile it depending on current availability in the system (based on loaded libraries or something). (STRING->BLOB string encoding) -> blob (STRING->BLOB! string encoding blob [blob-start blob-end]) -> [octets cursor] Returns the number of octets filled in the blob and the cursor of the first string unit that was not written into BLOB. Only complete units of strings are written into the blob. If the resultant cursor is an end cursor of STRING, in the sense of STRING-CURSOR-AT-END?, the string was completely copied into the blob. If the resultant cursor is a start cursor of STRING, in the sense of STRING-CURSOR-AT-START?, none of the string was copied into the blob: there was not enough room even for the first unit (or either of the blob and the string was empty). (BLOB->STRING blob encoding [blob-start blob-end]) -> string (What if BLOB is mutated? Should there be a variant that assumes it will never be, so that some of the blob can be shared?) * Examples (define (string-substitute string old new) (receive (start end) (string-search string old) (if (and start end) (string-displace string start end new) string))) (define (string-substitute* string pattern substitute) (let ((cache (compute-string-search-cache pattern))) (receive (start end) (string-search string pattern cache) (if (not (and start end)) string (call-with-string-output-port ; Some sort of way to collect a (lambda (output-port) ; string -- string ports work. (let loop ((previous-start (string-start-cursor string)) (match-start start) (match-end end)) (write-string output-port ;; Collect between here (START) and the start of ;; the pattern match. (string-slice string previous-start match-start)) (write-string output-port substitute) (receive (start end) (string-search (string-slice string match-end) pattern cache) (if (and start end) (loop match-end start end))))))))))