* The Stubber                                        -*- outline -*-

This is the reference manual for the Scheme Stubber, a language, and
its implementation, for describing the invocation of C libraries from
Scheme.

** Introduction

The purpose of the stubber is to describe in Scheme programs how a C
program would invoke a C library, and how Scheme and C data should be
translated from one language to the other. A Scheme programmer ought
to need to know only how a C programmer would invoke a C library, and
how the Scheme is to look -- Scheme programmers ought not to concern
themselves with the representation of Scheme objects, magic constants
found in C header files, the layouts of C data structures, the ABI and
C calling convention of the local operating system and machine, &c.
Descriptions of Scheme interfaces to C libraries should not change
when the Scheme implementation changes its object representations,
garbage collection technique, or primitive C interface. It should
also be easy to write robust interfaces to C libraries, so that a
keyboard interrupt at an inopportune moment, when the programmer wants
to enter a debugger, will not easily leak foreign resources such as
memory, database handles, or file descriptors.

A C stub description begins with the special form (BEGIN-C-STUB
<name>) and ends with the special form (END-C-STUB) at the top level.
The <name> in a BEGIN-C-STUB form is used to identify the stub
description, and may be used, for example, to name files containing C
source code or object code or dynamically loadable libraries. The
BEGIN-C-STUB form establishes compile-time state used by the stubber
macros; the END-C-STUB form disestablishes this state. Failure to
include (END-C-STUB) leads to unpredictable consequences, so let this
be the first warning:

  Warning:  DO NOT FORGET TO (END-C-STUB)

[The reader will shortly be inundated with warnings, most of which are
concerned with the danger of dealing with C at all.
The reader will be expected to heed these warnings, and will be
expected not to grow weary of the author's constant stream of them.]

Within the BEGIN-C-STUB / END-C-STUB bracket may occur arbitrary
Scheme forms, and a number of stub-related macros at the top level.
Many stub descriptions will begin by naming C header files to include
with the C-SYSTEM-INCLUDE and C-INCLUDE special operators:

  (c-system-include "errno.h")

corresponds with the C code

  #include <errno.h>

and

  (c-include "pathnames.h")

with

  #include "pathnames.h"

These could also have been written with the more general C-DECLARE
special operator, in which arbitrary C code is written; for example,

  (c-declare
    "#include <errno.h>        "
    "#include \"pathnames.h\"  ")

Each string in a C-DECLARE form is given its own line as C code. The
closing quote marks are aligned to set them apart from the C code.

The DEFINE-C special operator defines a Scheme procedure that, when
called, runs C code. For example, this DEFINE-C form defines a Scheme
procedure that calls the Unix strerror(3) library routine to find a
string associated with a system error code:

  (define-c (error-code->string (error-code (c-integral "int")))
    (c-copied-asciz-string "strerror (error_code)"))

The first part, specifying the name of and the parameters of the
procedure,

  (error-code->string (error-code (c-integral "int")))

says that we are defining a procedure named ERROR-CODE->STRING with
one parameter, named ERROR-CODE, whose argument must be an integer.
The procedure will convert the integer into a C value of type `int',
which will be named `error_code' for future reference. The name
`error_code' is generated from the name ERROR-CODE by mapping letters
to lowercase and hyphens to underscores.
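
The name mapping just described is simple enough to state as code.
The following C sketch only illustrates the rule (letters to
lowercase, hyphens to underscores). The function name
`scheme_name_to_c' is hypothetical and is not part of the stubber,
which presumably performs the mapping in Scheme when the macros are
expanded. How characters other than letters and hyphens are treated
is not specified here, so the sketch copies them unchanged.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of the stubber: maps a Scheme
   parameter name such as "ERROR-CODE" to a C identifier such as
   "error_code".  Letters are lowercased and hyphens become
   underscores; every other character is copied unchanged (an
   assumption).  Returns 0 on success, or -1 if the buffer of SIZE
   bytes is too small to hold the result and its terminating NUL.  */
static int
scheme_name_to_c (const char *scheme_name, char *c_name, size_t size)
{
  size_t i;
  for (i = 0; scheme_name[i] != '\0'; i++)
    {
      unsigned char c = (unsigned char) scheme_name[i];
      /* Leave room for this character and the terminating NUL.  */
      if ((i + 1) >= size)
        return (-1);
      c_name[i] = ((c == '-') ? '_' : ((char) (tolower (c))));
    }
  if (i >= size)
    return (-1);                /* Only reachable when SIZE is 0.  */
  c_name[i] = '\0';
  return (0);
}
```

Given "ERROR-CODE" and a sufficiently large buffer, the helper stores
"error_code" and returns 0.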

The second part,

  (c-copied-asciz-string "strerror (error_code)")

says that when ERROR-CODE->STRING is called, after the arguments have
been converted into C data, control should enter the C code

  strerror (error_code)

which must return a pointer to a zero-terminated array of bytes.
Writing C-COPIED-ASCIZ-STRING tells the stubber to interpret these
bytes as US-ASCII code points and to create a Scheme string from them,
which ERROR-CODE->STRING then returns.

C-INTEGRAL and C-COPIED-ASCIZ-STRING are names for conversions. Some
conversions convert data only from Scheme to C, some convert only from
C to Scheme, and some convert in both directions. Both C-INTEGRAL and
C-COPIED-ASCIZ-STRING are bidirectional conversions.

Procedures defined with DEFINE-C can also declare local C variables,
execute C statements ignoring their values, and return multiple values
from C to Scheme. For example, a first approximation of an interface
to the `pipe' system call might be

  (define-c (open-pipe-fds)
    (c-declare "int fds [2];")
    (c-begin
      "(void) pipe (fds);"
      (c-values (c-integral "int" "fds [0]")
                (c-integral "int" "fds [1]"))))

Within DEFINE-C, C-DECLARE adds local declarations in the scope of the
statements and expressions elsewhere in the DEFINE-C form. C-BEGIN
first executes the statements leading up to the final subform, and
then yields the value, or values, of the final subform. C-VALUES
returns multiple values to Scheme. Note that C-INTEGRAL here is used
to convert C data to Scheme, the syntax for which includes a C
expression, unlike its earlier use to specify a parameter for
DEFINE-C.

** Managing and Releasing External Resources

In Scheme, every object has unlimited extent. In implementations with
finite amounts of memory, a garbage collector reclaims the storage
occupied by objects that are no longer referenced, so that the storage
may be reused by new objects.
Memory is not the only resource that should be released when no longer
needed; others include file descriptors, database connections, and
external memory not managed by the Scheme implementation.

Some systems allow procedures to be associated with objects so that
the garbage collector calls the procedures when the objects are about
to be collected as garbage; such procedures are sometimes named
`finalization procedures'. For example, if DATABASE is an object
describing a connection to a database, which should be released when
no longer referenced, one might write

  (call-when-unreferenced! database
    (lambda (database)
      (close-database-connection database)))

(The finalization procedure does not remember the object in a free
variable, but instead is passed the object as an argument; if the
finalization procedure remembered the object in a free variable, then
that would be an extant reference, and prevent the finalization
procedure from ever being called!)

This approach requires the garbage collector to support the
association of finalization procedures with objects, which in turn
requires care to implement, because a finalization procedure of an
object might create new references to the object, and prevent it from
having its memory reclaimed.

An alternative approach -- which requires of the garbage collector
only support for weak references -- is to use one object for
reference, and another, known as a descriptor, to store all
information needed to release the resource. When there cease to be
(strong) references to the object, the finalization procedure will be
passed the descriptor to release the resource, and no new references
to the object can be created.

*** Annotated Example of Finalization: File Descriptors

On Unix, I/O devices are represented by small integers called `file
descriptors'.
On the Scheme side, we want the garbage collector to close
unreferenced I/O devices, so we must represent I/O devices, `files',
by objects whose storage the garbage collector will reclaim, such as
records.

  (define-record-type <file>
      (%make-file descriptor)
      file?
    (descriptor file.descriptor set-file.descriptor!))

The `descriptor' field contains an alien, a collection of bytes
storing a C object, namely an integer for a file descriptor. We could
have stored a Scheme integer in the `descriptor' field, but allocating
space for a C integer beforehand will let us more robustly call C
routines that open I/O devices.

  (define-c (allocate-file-descriptor)
    (c-alien "int" "-1"))

  (define-c (close-file-descriptor (descriptor (c-alien "int")))
    (c-begin
      "if (descriptor >= 0)          "
      "  (void) close (descriptor);  "
      (c-unspecific)))

Next we create a finalizer to store the open files and a procedure for
releasing the resources associated with their descriptors. We also
supply a predicate for files, so that non-files cannot accidentally be
stored in the finalizer, and procedures for reading from and writing
to the descriptor field of files.

  (define file-finalizer
    (make-default-finalizer (lambda (descriptor)
                              (close-file-descriptor descriptor))
                            (lambda (object)
                              (file? object))
                            (lambda (file descriptor)
                              (set-file.descriptor! file descriptor))))

Now we may allocate space for a descriptor and record it in the
finalizer using ALLOCATE-FILE-DESCRIPTOR and FINALIZER/ADD-OBJECT!.

  (define (make-file)
    (let* ((descriptor (allocate-file-descriptor))
           (file (%make-file descriptor)))
      (finalizer/add-object! file-finalizer file descriptor)
      file))

When the finalizer finalizes an object, it calls whatever procedure we
passed to release the object's associated resource -- in this case,
CLOSE-FILE-DESCRIPTOR -- and sets the object's descriptor field to #F.
We can force a file to be finalized with FINALIZER/REMOVE-OBJECT!, and
we can test whether it has been finalized, i.e.
whether the file is closed, by testing whether its descriptor field is
#F.

  (define (close-file file)
    (finalizer/remove-object! file-finalizer file))

  (define (file-closed? file)
    (not (file.descriptor file)))

Finally, to open a file, we create a file record and pass its newly
allocated descriptor to C, after the record is safely stored in the
file finalizer.

  (define (open-file pathname flags mode)
    (let ((file (make-file)))
      (%open-file pathname flags mode (file.descriptor file))
      file))

  (define-c (%open-file (pathname c-immutable-asciz-unix-pathname)
                        (flags (c-integral "int"))
                        (mode (c-integral "mode_t"))
                        ((file-descriptor "fd_pointer")
                         (c-alien-pointer "int")))
    (c-void "(*fd_pointer) = (open (pathname, flags, mode));"))

But what happens if `open' fails? This leads us to the subject of
invoking Unix system calls and handling their failure, for which we
shall briefly discard the `file' abstraction and consider just
wrappers around the system calls themselves, using integers to
represent file descriptors.

** Unix System Calls: Errors, Interruption, and Blocking

When invoking Unix system calls, many things can go wrong. The system
might detect invalid arguments, which should be reported by signalling
an error in Scheme. A signal might be delivered during the system
call, which Scheme may need to know about; for instance, a timer
interrupt might tell Scheme that it is time to switch threads.

Earlier we ignored the return value of the `close' system call, but
this may cause errors to be masked. When a C procedure called from
Scheme discovers that a system call has failed, it should invoke the
cpp macro `SCHEME_SYSCALL_FAILURE' with the name of the system call,
the error code with which the system call failed, and a command that
restarts the system call (e.g., by jumping to the start of a loop
invoking the system call).
`SCHEME_SYSCALL_FAILURE' expands to code that may perform a non-local
exit to signal a system call error, or may handle the failure in some
way or another (e.g., by doing nothing) and then execute a command to
restart the system call.

[`A non-local exit?', you ask. `Surely that is prone to error, and
apt to leak file descriptors or malloc'd memory referenced only by the
local variables of my C procedure!', you argue. You are correct. At
present, this manual documents the way things do work, not the way
things ought to work. See notes.txt for notes on an alternative
approach, which I hope to implement when I find the time, at which
point I shall update the manual to reflect the better approach.]

Using `SCHEME_SYSCALL_FAILURE', we can define CLOSE-FILE-DESCRIPTOR
thus:

  (define-unix-syscall (close-file-descriptor (fd (c-integral "int")))
    (c-declare "int status;")
    (c-begin
      "if (fd >= 0)                                             "
      "  {                                                      "
      "  loop:                                                  "
      "    status = (close (fd));                               "
      "    if (status < 0)                                      "
      "      SCHEME_SYSCALL_FAILURE (close, errno, goto loop);  "
      "  }                                                      "
      (c-unspecific)))

(Here we used DEFINE-UNIX-SYSCALL, rather than DEFINE-C. The
difference is that DEFINE-UNIX-SYSCALL alerts the Scheme system that
the C procedure thereby defined will perform system calls and may
observe system call failures, so that the Scheme system will be
prepared to handle `SCHEME_SYSCALL_FAILURE'.)

This approach is preferable to restarting only on `EINTR', because it
covers more conditions. For instance, `ENFILE' or `EMFILE' might
trigger a garbage collection in an attempt to close any unreachable
open files, before restarting the system call as if it had been
interrupted.

The cpp macro `SCHEME_VOID_SYSCALL' abstracts this pattern for system
calls such as `close', which return -1 to indicate failure and 0 to
indicate success, and aside from an error code return no useful value
(hence `void').
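
The shape of the pattern that `SCHEME_VOID_SYSCALL' abstracts can be
sketched in plain C, without the Scheme runtime. The sketch below is
an illustration under stated assumptions, not the macro's actual
expansion: the names `VOID_SYSCALL_RETRY' and `fake_close' are
hypothetical, and the loop restarts only on `EINTR', whereas
`SCHEME_SYSCALL_FAILURE' may also signal an error by a non-local exit
or recover from other conditions before restarting.

```c
#include <errno.h>

/* Hypothetical stand-in for the retry pattern; NOT the real
   SCHEME_VOID_SYSCALL.  Evaluates CALL repeatedly until it succeeds
   or fails with an error other than EINTR, leaving the final return
   value in STATUS.  */
#define VOID_SYSCALL_RETRY(status, call)                \
  do                                                    \
    {                                                   \
      (status) = (call);                                \
    }                                                   \
  while (((status) < 0) && (errno == EINTR))

/* A fake `system call' for demonstration: it fails with EINTR the
   first two times it is called, and succeeds thereafter.  */
static int fake_close_calls = 0;

static int
fake_close (int fd)
{
  (void) fd;                    /* The argument is ignored.  */
  fake_close_calls += 1;
  if (fake_close_calls <= 2)
    {
      errno = EINTR;
      return (-1);
    }
  return (0);
}
```

After `VOID_SYSCALL_RETRY (status, fake_close (3));', `status' is 0
and `fake_close' has been called three times.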

With `SCHEME_VOID_SYSCALL', CLOSE-FILE-DESCRIPTOR looks like:

  (define-unix-syscall (close-file-descriptor (fd (c-integral "int")))
    (c-begin
      "if (fd >= 0) SCHEME_VOID_SYSCALL (close, (close (fd)));"
      (c-unspecific)))

More concise still is the DEFINE-UNIX-VOID-SYSCALL form, which assumes
that the system call takes all the same arguments as the procedure
enclosing it:

  (define-unix-void-syscall (unix-close (fd (c-integral "int")))
    "close")

Some system calls return -1 to indicate failure, and otherwise a
nonnegative value that carries meaning beyond the mere fact that the
system call was successful. For example, we might wrap the `open'
system call with

  (define-unix-syscall (unix-open (pathname c-immutable-asciz-string)
                                  (flags (c-integral "int"))
                                  (mode (c-integral "mode_t")))
    (c-declare "int fd;")
    (c-begin
      "loop:                                                 "
      "  fd = (open (pathname, flags, mode));                "
      "  if (fd < 0)                                         "
      "    SCHEME_SYSCALL_FAILURE (open, errno, goto loop);  "
      (c-integral "int" "fd")))

but it can be expressed more concisely using `SCHEME_UINT_SYSCALL',
just like `SCHEME_VOID_SYSCALL':

  (define-unix-syscall (unix-open (pathname c-immutable-asciz-string)
                                  (flags (c-integral "int"))
                                  (mode (c-integral "mode_t")))
    (c-declare "int fd;")
    (c-begin
      "SCHEME_UINT_SYSCALL (open, fd, (open (pathname, flags, mode)));"
      (c-integral "int" "fd")))

or with DEFINE-UNIX-UINT-SYSCALL, just like DEFINE-UNIX-VOID-SYSCALL:

  (define-unix-uint-syscall
      (unix-open (pathname c-immutable-asciz-string)
                 (flags (c-integral "int"))
                 (mode (c-integral "mode_t")))
    "open")

However, this is not a robust way to wrap `open', because an
interruption can cause file descriptors to leak. For instance, if,
while running a program from a REPL, one types ^C to stop the
computation, a file descriptor may be lost on its way from the call to
`open' to the Scheme code that calls UNIX-CLOSE, or even between when
C calls `open' and when Scheme returns its value from UNIX-OPEN.
Instead, using the `file' abstraction from the previous section,
%OPEN-FILE should be defined as follows:

  (define-unix-syscall
      (%open-file (pathname c-immutable-asciz-unix-pathname)
                  (flags (c-integral "int"))
                  (mode (c-integral "mode_t"))
                  (fd-pointer (c-alien-pointer "int")))
    (c-begin
      "SCHEME_UINT_SYSCALL                                       "
      "  (open, (*fd_pointer), (open (pathname, flags, mode)));  "
      (c-unspecific)))

For a more complete implementation of a file descriptor abstraction,
supporting many operations to take and to cede responsibility for file
descriptors, see examples/unix-fd.scm.

** Copying

Copyright (c) 2009, Taylor R. Campbell.

Verbatim copying and distribution of this entire article are permitted
worldwide, without royalty, in any medium, provided this notice, and
the copyright notice, are preserved.