BitterSuite3 Proposal
NOTE: Several suggested features have not been implemented. Please see the official documentation for the actual features of this testing code.
This page contains ideas for the implementation of BitterSuite3, and in particular which design goals should differ from those of BitterSuite2. There are two main potential gains. First, a course could try out an implementation of a new language without needing to port the BitterSuite base code around. Second, testing would occur at leaf nodes instead of at test files, so a test file could potentially be read once and used for several tests (for example, a compiled C program that differentiates tests by command-line arguments instead of by different compiled constants).
Of course, reference should be made frequently to ImprovePit#BitterSuite while this is being implemented.
General notes
Overall Behaviour
Options will be passed around via a hash table keyed by symbols. (Should this use make-custom-hash instead of make-hash to optimize symbol comparisons? make-hasheq, which compares keys with eq?, may be a simpler fit for symbol keys; in any case make-hash will work initially.) Every time the directory hierarchy is descended, a copy of the hash table (taken after all processing at the current level has been done) will be passed to the child directory, so that values can be overwritten in the child without affecting the parent.
In the current version, sections of the struct that is passed around are automatically cleared when the next directory is entered. This cannot happen with the hash table. Values entered into the hash table should be functions, evaluators, etc., so that they are already in place for use in lower-level tests.
The problem is that this may require a large number of open ports, and there is a strict upper limit on how many can be open at once. This may mean that we test on demand as we recurse through the directory tree instead of building a table beforehand, and nest custodians in every recursive call so that any ports opened in a given directory are cleaned up automatically once its child directories are complete.
There is currently some dependency on environment variables in the Scheme testing code. This should be eliminated entirely, with the exception of explicitly exporting some appropriate values to some of the external evaluators. This can be done with semi-complex command-line parsing at the entry point, followed by parameterization of those values that are undesirable to pass from function to function.
Should options be processed in a predetermined order; i.e., always prioritize language? Likely not, for simplicity in the backend; the ordering from options.ss can always be used regardless of context.
Definition Loading
Interpret "language" by looking for a module first in $course/bittersuite_languages/$language/definitions.ss, and then by checking $suitedir/languages/$language/definitions.ss; if it's not found in either location, then fail to change language. Note that it should be possible to have a nested directory structure in $language; for example, this could specify scheme/beginner or scheme/beginner-abbr etc., both of which are dependent on shared code in scheme/common (which has helper modules, but no definitions.ss to load).
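The lookup order described above can be sketched as a small shell function. This is only an illustration, written in bash to match the default diff script later on this page: the function name find_definitions and its argument order are assumptions, and the real lookup would presumably live in the Scheme entry point.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the definitions.ss that would be
# loaded for a language, or return 1 if the language cannot be changed.
# Arguments: course directory, suite directory, language (may be nested,
# e.g. scheme/beginner).
find_definitions() {
    local course=$1 suitedir=$2 language=$3
    local candidate
    # The course-specific location is checked before the suite's own.
    for candidate in \
        "$course/bittersuite_languages/$language/definitions.ss" \
        "$suitedir/languages/$language/definitions.ss"
    do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1   # not found in either location: fail to change language
}
```

Under this sketch, scheme/common is never selected as a language because it contains no definitions.ss, while scheme/beginner resolves to the course copy when one exists and to the suite copy otherwise.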
The file definitions.ss must provide the following functions:
- (initialize ht) - Consumes the hash-table ht for the current testing level, and then initializes any language-specific keys to appropriate default values. Return value is ignored.
- (parse-option ht key . values) - For the given key and values from options.ss, verify that this is an acceptable key (returning #f otherwise), do appropriate type-checking and conversions on the values (returning 'bad-value if this is not possible), then insert an appropriate key-value pair into the hash table ht and return a non-false value. Note that if key matching fails, the default BitterSuite options.ss parser should take over (for the standard behaviour of value, etc. options); however, this should not be done on value failure, as that indicates a failure in the preconditions of the language in question.
- (interpret-file ht file) - For any file that's discovered besides options.ss, pass it to the tester to determine if it knows how to handle it. Also provide the hash-table in case anything needs to be added to or read from it. This should return a non-false value if the file was handled and #f otherwise. NB: The interpret-file hook can be used to implement non-leaf "tests", i.e. testing code which is not at a leaf node. This can be used, for example, to use a single driver to test several different inputs, where the test file would reside in the same level as several directories (each containing just an input file and nothing else).
- (run-test ht) - We've hit a leaf directory, which signals that a test must be done. All files in the directory (options-related or otherwise) will be processed first. Naturally, the hash-table is required for any options that should be passed to the test. This should return (values (union number 'defer) string), where the first value is either 'defer (for a diff test later) or a number representing the percentage of marks that should be earned for this test, and the string is a status message for the autotesting output given to the student.
There was some consideration given to a fifth method, (cleanup ht). The intent was that this would be called when leaving the current directory, after all child directories have been processed, providing a chance to clean up anything that will no longer be accessible from the parent directory; for example, code evaluators created at the current level. However, as every directory level is managed by a custodian, and this custodian is directed to shut down everything it manages, it's not clear this is necessary. Even subsidiary processes should be able to be designed such that they shut down once any ports tied to them are closed... but it may turn out there's a need for this anyway.
The following test module meets the above function requirements:
#lang scheme
(provide initialize parse-option interpret-file run-test)

; This defines a test-runner that passes 1/3 of the time, gets half marks 1/3 of the time,
; and fails 1/3 of the time. Half of these are direct, and half are diff tests.

; Leave hash-table parsers as default cascaders
(define (parse-option . args) #f)
(define (interpret-file . args) #f)

(define num-states 6)
(define state (void)) ; Initialized via initialize

; Advance to the next of the six outcomes, wrapping around
(define (next-state)
  (set! state (remainder (add1 state) num-states)))

; Start at the last state so that the first test run lands on state 0
(define (initialize ht)
  (set! state (sub1 num-states)))

(define (run-test ht)
  (next-state)
  (cond
    ; States 0-2: direct results
    [(= state 0) (values 100 "Test passed completely")]
    [(= state 1) (values 50 "Test was only half correct")]
    [(= state 2) (values 0 "Miserable failure")]
    ; States 3-5: write output to be diffed later
    [else
     (let ([out-file (open-output-file (hash-ref ht 'base-output))])
       (write-string (let ([secs (current-seconds)])
                       (cond
                         [(= state 3) (format "pass~npass")]
                         [(= state 4) (format "pass~n~a" secs)]
                         [(= state 5) (format "~a~n~a" secs secs)]))
                     out-file)
       (close-output-port out-file)
       (values 'defer "Deferred for file test..."))]))
On the implementation side, when language is read by the default options parser, it should call a function similar to the one below and then place the wrapper function it returns into the hash table (with key 'language?) for use in processing all other options:
(define (language-fn req-file)
  (parameterize ([current-namespace (make-base-namespace)])
    (namespace-require `(file ,req-file))
    (let ([e-ns (current-namespace)])
      (lambda (arg)
        (eval arg e-ns)))))
It loads the module code from definitions.ss into a dedicated namespace, and returns a function that allows the functions listed above to be called via eval.
Note: if we can get away with using make-empty-namespace instead of make-base-namespace because of the simplicity of the calls made to it (most of the functionality being provided by separate modules), then we should.
Note that namespace creation is expensive! The creation should be done once, and the testing suite should then store a mapping from language key to language evaluator so that the evaluator can be recalled quickly if the language is specified again. This could happen, for example, on an assignment that has several questions testing Scheme and several other questions testing Python.
The fact that we're allowing languages to be specified independently means it is no longer possible to have a centralized handles-and-messages. However, we should be able to provide a module that supplies custom exceptions (some of which will add appropriate prefixes to exn-message, as is done with some of the strings in handles-and-messages) or, alternatively, string-building functions. It should also be possible to have mzscheme include the directory containing this file in the module search path so that language-module writers can make use of it.
A note on diff
As the suite now supports a percentage value rather than pass/fail, a standard diff no longer suffices for I/O tests. The new default diff is given below.
#!/usr/bin/env bash
# A friendly default diffing program
# The remainder of any status message should be written to stdout.
# The percentage earned should be written to fd 3.
# If the percentage earned is not 100, then the calling program
# should generate a side-by-side comparison.
if diff -ibB -q "$1" "$2" > /dev/null 2>&1; then
    echo 'passed.'
    echo '100' >&3
else
    echo 'FAILED. See output comparison below for details.'
    echo '0' >&3
fi
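For reference, one way a caller could collect both outputs of such a diff program is sketched below. The names run_diff, sample_differ, DIFF_MESSAGE and DIFF_PERCENT are made up for this example, and sample_differ is only a stand-in that follows the same protocol as the default script; the actual suite would invoke the program named by the diff option, presumably from the Scheme side.

```shell
#!/usr/bin/env bash
# Stand-in differ obeying the protocol: status message on stdout,
# percentage earned on fd 3.
sample_differ() {
    if diff -ibB -q "$1" "$2" > /dev/null 2>&1; then
        echo 'passed.'
        echo '100' >&3
    else
        echo 'FAILED. See output comparison below for details.'
        echo '0' >&3
    fi
}

# Hypothetical caller: run a differ on two files, leaving the message in
# DIFF_MESSAGE and the percentage in DIFF_PERCENT.
run_diff() {
    local differ=$1 expected=$2 actual=$3
    local pct_file
    pct_file=$(mktemp) || return 1
    # stdout is captured by the command substitution; fd 3 goes to a file.
    DIFF_MESSAGE=$("$differ" "$expected" "$actual" 3> "$pct_file")
    DIFF_PERCENT=$(< "$pct_file")
    rm -f "$pct_file"
    # Per the protocol, anything below 100 calls for a side-by-side view.
    if [ "$DIFF_PERCENT" != 100 ]; then
        DIFF_MESSAGE+=$'\n'$(diff -y "$expected" "$actual" || true)
    fi
    return 0
}
```

Sending the percentage through a temporary file keeps it separate from the status message, so a differ is free to print any text it likes on stdout.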
Language Implementation Pointers
Tags such as equal, expected, and result have so far been considered common to the core of the language, but this will no longer be the case. The reason is that they imply an interpreter context, and are not relevant to languages such as C and "script". The tags exposed to the end user from a common set will be language (indicates a language evaluator should be loaded), value (total marks for a given test), desc / description (a description of the purpose of the test), timeout (test time limit in seconds), memory (test memory limit in megabytes), and diff (specifying a program to use to compare output). Timeout and memory will require direct language support, as different mechanisms may be appropriate in different language contexts. Functions that provide default entries to the hash table for all of these are provided in default-language-handler.ss.
Languages such as Scheme and Python should ideally reference a common module that defines some behaviour for these tags. Result should be a single function call; however, expected should accept multiple values. It would make sense, given the part-marks behaviour, to allow a corresponding multiple-valued expected-weight (defaulting to 100) that specifies the percentage weights assigned to different correct answers. So, for example, a combination of (expected 3 4 5) and (expected-weight 100) would mean that any of those answers would receive a mark of 100%, whereas (expected 3 4 5) and (expected-weight 100 80) would mean that 3 gets a mark of 100% but 4 and 5 only get marks of 80%. Finally, (expected 3 4 5) and (expected-weight 100 80 60 40) would mean that 3 gets a mark of 100%, 4 gets a mark of 80%, and 5 gets a mark of 60%; the surplus weight of 40 goes unused.
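The weighting rule implied by these examples (the last weight given is reused for any remaining expected values, and surplus weights are ignored) can be sketched as follows. It is written in bash only to stay consistent with the diff script on this page. The function name mark_for_result, its space-separated argument format, and the mark of 0 for an unmatched result are all assumptions; a real implementation would belong in the common Scheme module and compare values with the language's own notion of equality.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the percentage earned by a result.
#   $1  the result produced by the student's code
#   $2  space-separated expected values, e.g. "3 4 5"
#   $3  space-separated weights, e.g. "100 80" (defaults to "100")
mark_for_result() {
    local result=$1
    local -a expected weights
    read -r -a expected <<< "$2"
    read -r -a weights <<< "${3:-100}"
    local i weight
    for i in "${!expected[@]}"; do
        # Reuse the last weight once the weights list runs out.
        if [ "$i" -lt "${#weights[@]}" ]; then
            weight=${weights[i]}
        else
            weight=${weights[${#weights[@]}-1]}
        fi
        if [ "$result" = "${expected[i]}" ]; then
            printf '%s\n' "$weight"
            return 0
        fi
    done
    printf '0\n'   # assumed: a result matching no expected value earns 0
}
```

For instance, mark_for_result 5 "3 4 5" "100 80" prints 80, matching the second example in the text.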
There should also be mark-explanation tags to specify the format of the message to return from run-test under different scenarios. Different possibilities are available for this, but it could involve special keys. For example, there could be defaults of (mark-explanation 100 "Passed") and (mark-explanation 0 "FAILED -- saw " 'result " but expected " 'expected) where the back-end code is able to do appropriate substitution when it sees the keys 'result and 'expected.
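The substitution step could look roughly like the sketch below, again in bash for consistency with the diff script rather than as a proposal for the actual back end. The function name explain_mark and the convention of passing the keys as the literal strings 'result and 'expected are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a status message from mark-explanation parts.
#   $1      the result that was seen
#   $2      the expected value
#   $3...   message parts; the literal parts 'result and 'expected are
#           replaced by the corresponding values
explain_mark() {
    local result=$1 expected=$2
    shift 2
    local part message=
    for part in "$@"; do
        case $part in
            "'result")   message+=$result ;;
            "'expected") message+=$expected ;;
            *)           message+=$part ;;
        esac
    done
    printf '%s\n' "$message"
}

# Example: the default failure message with a result of 7 and an expected value of 3.
explain_mark 7 3 "FAILED -- saw " "'result" " but expected " "'expected"
# prints: FAILED -- saw 7 but expected 3
```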