BitterSuite Suggested Improvements

Creation of testing conventions

The language modules in particular should all have a rigorous series of tests written for them, likely using the schemeunit framework introduced in version 4.2.5 of PLT Scheme.

Moving away from command-line configuration

Right now, runTests and computeMarks must have command-line arguments to differentiate behaviour, such as the verbosity of test runs or the use of nroff on the marking schemes.

This puts an unnecessary and unclear burden on course staff.

Instead, these settings should live in a configuration file in the test suite directory (something like BitterSuiteConfig.ss?) which contains a series of S-expressions, for example:

(verbosity 4)
(use-nroff? #t)

This will be easier for tutors to manage, will be much clearer, and will be much more apparent in the Mac Finder than a potentially hidden .rstrc file.
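As a rough illustration of how little parsing such a file would need, here is a minimal sketch of a reader for flat (key value) forms. BitterSuite itself would presumably use Scheme's own reader; the Python below is only an illustration, and the function name parse_config and the conversion rules (#t/#f to booleans, digit strings to integers) are assumptions rather than part of any existing BitterSuite code.

```python
import re


def parse_config(text):
    """Parse flat (key value) S-expressions into a dict.

    Hypothetical helper: handles only one key and one atom per form,
    which is all the example configuration above requires.
    """
    settings = {}
    # Drop Scheme-style line comments before matching forms.
    text = re.sub(r";[^\n]*", "", text)
    for key, raw in re.findall(r"\(\s*([^\s()]+)\s+([^\s()]+)\s*\)", text):
        if raw == "#t":
            value = True
        elif raw == "#f":
            value = False
        elif re.fullmatch(r"-?\d+", raw):
            value = int(raw)
        else:
            value = raw  # leave any other atom as a string
        settings[key] = value
    return settings


config = parse_config("(verbosity 4)\n(use-nroff? #t)\n")
print(config)
```

Nested forms and string literals are deliberately out of scope here; a real implementation would need a proper S-expression reader.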

Scheme language error

CS 135 in Winter 2010 tripped over an apparent problem with the sandbox library. Suite 0old of assignment a07b would sometimes trigger memory errors in sandboxed evaluators. This did not appear to be a standard "memory limit reached" exception that could be handled by BitterSuite (like the timeouts that are caught); instead, it appeared to corrupt the evaluator state, automatically failing all subsequent tests. Raising the memory limit on the sandbox did not appear to make a difference.

The "fix" in suite 0 was to reset the evaluator state on every test; while this did not fix the error entirely, it at least allowed subsequent tests to execute properly. This was unnecessarily ugly, and added significant run time for each student's tests (at least, on Solaris; it's possibly less noticeable on Linux).

This should be investigated. The submitted code and testing suites should be in the cs135 account's archives.

Python language error

The Python language module seems to work properly for the most part. However, in Winter 2009, there was one assignment (a9) where some students were given errors saying that their code failed to load, even though it would load normally via python on the command line and worked properly in the old BitterSuite 2 tester.

The reasons have still not been tracked down, but the error is not known to have recurred.

runTests

It should be possible to run tests that do not contribute to automarking. Such tests would exist for their side effects, such as creating special files that are operated on later (for handmarking, for example), and their results would not be written to the marking output file. This will require a new return value: in addition to a number or 'defer, something like 'no-mark.
Any directory with a file (not subdirectory, in case it's a logical test name) called something like “skip” should be pruned from testing, as a way to facilitate quick testing of particular subsets of the entire testing suite.
Whenever evaluator creation fails, the potentially verbose error is repeated for every test that uses that evaluator. Instead, dump it to a file which is keepFile'd, and just say "see information below" in the test output?
Instead of timing out only in terms of real/effective time, it would also be handy to be able to specify other metrics. For example, (timeout 45) or (timeout real 45) could set the real-time timeout to 45 seconds, whereas (timeout user 30) could set the user-time timeout to 30 seconds. See SchemeModuleLimitUserTime for a currently very limited Scheme-level user-time timeout mechanism, and /fsys1/.software/arch/drscheme-4.1.3/distribution/collects/scheme/sandbox.ss starting at line 209 for how proper timeout threading is done in the sandbox limits. This should be feasible, with the watcher thread running essentially "ps -p$subprocid -otime" at intervals of half the time remaining before the kill. A script-level mechanism is already in place, in a runTests-side script called timeout.
We should be trying to automate Scheme test coverage detection, using (sandbox-coverage-enabled #t) and then reading the syntax objects with get-uncovered-expressions. Ideally, this would be done via the testing suite in a reasonably clean fashion...
It may seem reasonable for Python evaluators to handle standalone programs. That is, the ideal I/O test behaviour (flagged by the lack of an expected expression in test.*) might be: (1) always rebind input when evaluating the result; (2) if a result expression exists, discard and rebind output; (3) if it does not, keep the output generated by the student code, under the assumption that it's a self-contained program. This is not feasible because it changes the semantics of loading code (for example, ignoring raised exceptions and discarding any exception-causing code), so these tests should instead be relegated to external, shell-scripted tests. (However, it's now done for Scheme, so could it be simulated in Python as a special option too...?)
Add information about the -v command-line flag to the man page.
Fix code that is spitting errors of the form 1 /u2/isg/u/tavaskor/working/bittersuite3/runTests-files/timeout: line 111 : kill: (22) - No such process
Scheme exception handlers should not catch exn:break, since doing so blocks ctrl+c handlers.
Verbosity should be configurable as an option in addition to on the command line; this would allow verbosity to be raised temporarily on a particular subtree of directories for debugging purposes.
Extend the stderr handling capabilities of C to all languages? It would be useful for any test that returns 'defer as a mark.
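The half-of-remaining-time polling proposed above for a user-time watcher can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the ps TIME column is assumed to have the form [[dd-]hh:]mm:ss with whole seconds, and the minimum delay is an added safeguard (not in the original proposal) so the watcher does not poll ever faster as the limit approaches.

```python
def parse_ps_time(text):
    """Convert a ps TIME value ([[dd-]hh:]mm:ss) to whole seconds.

    Assumes no fractional seconds; some platforms format TIME differently.
    """
    text = text.strip()
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(piece) for piece in text.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours (and minutes) with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def next_poll_delay(limit, used, minimum=0.5):
    """Return how long to sleep before the next ps check.

    Returns None once the limit is reached, meaning the caller should
    kill the subprocess. 'minimum' is a hypothetical floor on the delay.
    """
    remaining = limit - used
    if remaining <= 0:
        return None
    return max(remaining / 2.0, minimum)


used = parse_ps_time("00:00:10")
print(next_poll_delay(30, used))
```

With a 30-second user-time limit and 10 seconds consumed, the watcher would sleep for 10 seconds before checking again.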

Implementation details

Language evaluators should be cached to avoid expensive re-creation (particularly in the case of running on Solaris?).
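One possible shape for such a cache is sketched below. The class name, the factory callback, and the idea of keying on a (language, options) pair are all assumptions made for illustration; nothing here reflects how BitterSuite currently creates evaluators.

```python
class EvaluatorCache:
    """Reuse evaluators instead of re-creating them for every test.

    'factory' is a hypothetical callable (language, options) -> evaluator.
    """

    def __init__(self, factory):
        self._factory = factory
        self._cache = {}

    def get(self, language, options=()):
        # Evaluators are shared only when both language and options match.
        key = (language, tuple(options))
        if key not in self._cache:
            self._cache[key] = self._factory(language, options)
        return self._cache[key]

    def discard(self, language, options=()):
        # Drop an evaluator whose state can no longer be trusted, so the
        # next get() builds a fresh one.
        self._cache.pop((language, tuple(options)), None)
```

A discard operation matters because a cached evaluator whose state has been corrupted (as in the Scheme sandbox problem described earlier) would otherwise fail every later test that shares it.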

computeMarks

Diff tests on empty files should result in automatic failures; all real output tests should generate something.
Diff testing requirements can be relaxed in two ways:
- If nothing is provided to fd3, then supply a mark of 100 if the command has exit status 0 and supply a mark of 0 otherwise.
- If nothing is provided to fd4, then supply default output messages.
Long values in expected and result (i.e., large tree structures) do not appear to be truncated in the Autotesting output, resulting in a lot of wasted printing; this needs to be investigated and corrected if necessary.
Fix the check on environment variables in computeMarks, and possibly verify the corresponding check in runTests.
In RST output, only include line numbers for files where they make sense. For example, Scheme language load errors are currently dumped with line numbers, which is not helpful information.
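The two relaxations of diff testing proposed above could behave roughly as in the following sketch. The function name, the string representation of marks, and the wording of the default messages are assumptions for illustration only; the rule that an empty fd3 yields 100 on exit status 0 and 0 otherwise is taken from the proposal.

```python
def resolve_diff_result(exit_status, fd3_text, fd4_text):
    """Fill in defaults when a diff test writes nothing to fd3 or fd4.

    fd3_text: whatever the test wrote to fd3 (the mark), possibly empty.
    fd4_text: whatever the test wrote to fd4 (the message), possibly empty.
    Returns a (mark, message) pair of strings.
    """
    passed = exit_status == 0

    mark = fd3_text.strip()
    if not mark:
        # Nothing on fd3: derive the mark from the exit status alone.
        mark = "100" if passed else "0"

    message = fd4_text.strip()
    if not message:
        # Nothing on fd4: hypothetical default wording.
        message = "Test passed." if passed else "Test failed."

    return mark, message


print(resolve_diff_result(0, "", ""))
```

Output that a test does supply on either descriptor is passed through unchanged, so existing tests would keep their current behaviour.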

general?

Multiline descriptions should not cause bizarre failures as they currently do. Using the proper ASCII characters for record and field separators will rectify this.
When doing Scheme output, the suite still really needs to report errors such as "the function name cannot be found."
Add information about the intermediate files (and the format of them) in the man page.
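To illustrate why dedicated separator characters would fix the multiline-description problem noted above, here is a minimal sketch. It assumes the ASCII record separator (0x1E) between records and the unit separator (0x1F) between fields; the function names and the three-field record layout are hypothetical and do not describe the current intermediate file format.

```python
RECORD_SEP = "\x1e"  # ASCII RS, between records
FIELD_SEP = "\x1f"   # ASCII US, between fields within a record


def encode_records(records):
    """Join a list of field lists into one string."""
    for fields in records:
        for field in fields:
            if RECORD_SEP in field or FIELD_SEP in field:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(FIELD_SEP.join(fields) for fields in records)


def decode_records(data):
    """Split an encoded string back into a list of field lists."""
    if not data:
        return []
    return [record.split(FIELD_SEP) for record in data.split(RECORD_SEP)]


records = [
    ["test1", "first line\nsecond line", "100"],  # multiline description
    ["test2", "single line", "0"],
]
assert decode_records(encode_records(records)) == records
```

Because newlines are no longer used as delimiters, a description can span several lines without being mistaken for the start of a new record.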

Topic revision: r7 - 2010-04-30 - TerryVaskor

Instructional Support Group, David R. Cheriton School of Computer Science, University of Waterloo