This is the artifact for the POPL 2022 paper "A formal foundation for symbolic evaluation with merging". A copy of the paper is included in the artifact.
The artifact is packaged as a Docker image. See the hardware requirements and installation sections for more details. The repository is simply the source of the Docker image.
The paper uses Rosette and Rosette* to refer to versions 3 and 4 of the Rosette system, respectively. The artifact may refer to them as Rosette 3 and Rosette 4.
- List of claims
- Hardware requirements
- Installation and sanity testing
- Evaluation instructions
- Additional artifact description
We mechanized our symbolic semantics and proved its correctness in Lean. We also created a reference interpreter named Leanette, which is proven correct. These live in the subdirectory `lean`.

The following table links sections of the paper with files under the `lean` directory.
File | Description | Note |
---|---|---|
`lean/src/cs/lang.lean` | Syntax and concrete semantics of λ_c | Section 3.1 |
`lean/src/cs/detc.lean` | Determinism of λ_c | Theorem 1 |
`lean/src/cs/sym.lean` | Symbolic factory interface | Section 4.1 |
`lean/src/cs/svm.lean` | Evaluation rules for S_c | Section 4.2 |
`lean/src/cs/thm.lean` | Properties and correctness of S_c | Theorems 2–8 |
`lean/src/interp/sym/defs.lean` | Leanette implementation | Section 6.1 |
`lean/src/interp/sym/thm.lean` | Correctness of Leanette | Theorem 9 |
`lean/src/sym_factory/sym_factory.lean` | Symbolic factory F_L | Section 6.1 |
`lean/src/sym_factory/red.lean` | Reducibility and correctness of F_L | Theorem 10 |
We also created a Racket implementation, known as Rosette*, which implements the symbolic semantics with an optimized symbolic factory. The subdirectory `rosette-4/rosette` (or this link, if you wish to view it outside the Docker image) contains the implementation of Rosette*, as described in Section 6.2.

To make sure that Rosette* is implemented correctly, we tested it via solver-aided differential testing. The subdirectory `diff-testing` contains the differential testing setup that tests Leanette against Rosette*, as described in Section 6.3 (and also Figure 7, Table 1, and Table 2).
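To illustrate the idea behind solver-aided differential testing, here is a minimal, self-contained sketch (our own illustration, not the actual `tester.rkt`): the solver searches for an input on which two implementations of the same function disagree.

```racket
#lang rosette/safe

; Simplified sketch of solver-aided differential testing (hypothetical
; example, not part of the artifact): impl-b is a buggy variant of impl-a,
; and `verify` asks the solver for an input exposing the disagreement.
(define (impl-a x) (* 2 x))     ; reference implementation
(define (impl-b x) (+ x x 1))   ; buggy variant, off by one

(define-symbolic x integer?)
(verify (assert (= (impl-a x) (impl-b x))))
; => a counterexample such as (model [x 0])
```

The actual oracle compares the full symbolic results of Leanette and Rosette* on generated programs, rather than two Racket functions.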
We compared the interfaces of Rosette* and Rosette by porting benchmarks written in Rosette to Rosette*. The two subdirectories `rosette-benchmarks-3` and `rosette-benchmarks-4` contain the benchmarks described in Sections 7.1.1 and 7.1.2. The directory `interface` contains tools that read these benchmarks to generate Table 3.
We tested that the performance of Rosette* matches that of Rosette on the benchmarks mentioned above. The subdirectory `perf` contains a tool that tests the performance of Rosette* and Rosette on the SymPro benchmarks (which are in `rosette-benchmarks-3` and `rosette-benchmarks-4`), as described in Section 7.2 (and Table 4). The following table lists the SymPro benchmarks.
Benchmark | ID | Entry file |
---|---|---|
Bagpipe | `bagpipe` | `bagpipe/setups/textbook/run.rkt` |
Bonsai | `bonsai` | `bonsai/nanoscala.rkt` |
Cosette | `cosette` | `cosette/cidr-benchmarks/oracle-12c-bug.rkt` |
Ferrite | `ferrite` | `ferrite/rename.rkt` |
Fluidics | `fluidics` | `fluidics/ex2.rkt` |
GreenThumb | `greenthumb` | `greenthumb/GA/output/0/driver-0.rkt` |
IFCL | `ifcl` | `ifcl/test.rkt` |
Memsynth | `memsynth` | `memsynth/case-studies/synthesis/ppc/ppc0.rkt` |
Neutrons | `neutrons` | `neutrons/filterWedgeProp.rkt` |
Nonograms | `nonograms` | `nonograms/puzzle/src/run-batch-learn-rules.rkt` |
Quivela | `quivela` | `quivela/test-inc-arg.rkt` |
RTR | `rtr` | `rtr/benchmarks/all.rkt` |
SynthCL | `synthcl` | `synthcl/examples/sobelFilter/test.rkt` |
Wallingford | `wallingford` | `wallingford/tests/all-tests.rkt` |
WebSynth | `websynth` | `websynth/test/all-tests.rkt` |
Separately, the subdirectory `jitterbug-benchmarks` contains a tool that tests the performance of Rosette* and Rosette on the Jitterbug benchmarks.
To use this artifact, you will need a machine capable of running Docker, with at least 8GB of free disk space and at least 16GB of RAM. Note that this Docker image is known not to work on ARM-based systems such as Apple M1 machines.
For reference, we tested this artifact on a machine running Linux 5.4.0 and Docker 20.10.8 with an Intel Core i7-7700K CPU @ 4.20GHz, configured with 16GB of RAM. The results you obtain may vary from ours depending on your particular machine setup, including CPU, available RAM, concurrently running processes, etc.
The only required installation is Docker. See https://docs.docker.com/install/ for details on how to install Docker. If you are using macOS (non-ARM-based), Docker For Mac is a good solution.
After installing Docker, you can download and run the Docker image by running:
# Download image (~2GB download)
$ docker pull unsat/leanette-popl22-artifact:latest
# Run image
$ docker run -it --name leanette-artifact unsat/leanette-popl22-artifact:latest
This will drop you into a shell inside the container, in the `/workspace` directory, which is the main directory containing the source code and experiments. We have installed `vim` in the container for convenience in editing and viewing files; you can also use Docker to copy files into and out of the container, as described at https://docs.docker.com/engine/reference/commandline/cp/.
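For example, once the differential-testing oracle has produced its log file (see the evaluation instructions below), you can copy it out of the container to your host with:

$ docker cp leanette-artifact:/workspace/diff-testing/workspace/log.rktd .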
If you leave the container and wish to get back, you will need to restart it with the command `docker start -ia leanette-artifact`. If you wish to instead start from a fresh copy of the image, run `docker rm leanette-artifact` to remove the container and then follow the instructions for the first run of the container above.
To test that Leanette works, run:
$ cd /workspace/lean
$ lean src/workspace/load.lean
This should output:
"(ans (state #t #t) (term_bool #t))"
To test that Rosette* works, run:
$ cd /workspace
$ raco cross -q --workspace /workspace/rosette-4 racket examples/correct-ex-1.rkt
This should produce the output `(unsat)`.
Next, run:
$ raco cross -q --workspace /workspace/rosette-4 racket examples/incorrect-ex-1.rkt
This should produce a model, for example:
(model
[y 0])
Compared to Rosette, Rosette* adds the `assume` statement to the language. Therefore, `correct-ex-1.rkt`, which uses the `assume` statement, should fail to run in Rosette. The command

$ raco cross -q --workspace /workspace/rosette-3 racket examples/correct-ex-1.rkt

should fail with output beginning with `examples/correct-ex-1.rkt:12:4: assume: unbound identifier`. However, the program `examples/rosette-3.rkt`, which does not use the `assume` statement, should succeed when run with Rosette:
$ raco cross -q --workspace /workspace/rosette-3 racket examples/rosette-3.rkt
This should produce a model, for example:
(model
[y -1])
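For reference, here is a minimal sketch in the spirit of these examples (an assumption on our part; the actual `examples/correct-ex-1.rkt` may differ):

```racket
#lang rosette/safe

; Minimal sketch of an assume/verify interaction (hypothetical program,
; not the actual examples/correct-ex-1.rkt): under the assumption y > 0,
; the assertion holds for every y, so `verify` prints (unsat).
(define-symbolic y integer?)

(verify
 (begin
   (assume (> y 0))           ; `assume` is the statement Rosette* adds
   (assert (>= (+ y 1) 2))))  ; valid given the assumption

; Dropping the assume admits counterexamples such as (model [y 0]).
```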
Test the oracle on a predefined program that passes the differential testing (Figure 7b in the paper) by running:
$ cd /workspace/diff-testing
$ ./reset.sh
$ raco cross -q --workspace /workspace/rosette-4 racket tester.rkt ./predefined/fig-7b.rkt
You should see the following output:
======= Program ./predefined/fig-7b.rkt =======
Computed in X.XXs
Rosette*:
(halt (state #t #f))
Rosette* evaluation is trivial
Rosette* state size: 2
Computed in X.XXs
Leanette:
(ans (state #t #f) (undef))
Leanette evaluation is non-trivial
Leanette state size: 46
Leanette and Rosette* agree
Note: it may appear that Leanette's state is trivial, but that's because the output shown above is also reduced by Rosette's symbolic factory. The actual state size is 46, which is not trivial.
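As a small aside (a hypothetical illustration of this kind of reduction, unrelated to the program above), Rosette's symbolic factory simplifies terms with concrete subparts as it builds them:

```racket
#lang rosette/safe

; Small illustration (hypothetical) of factory reduction: terms with
; concrete subparts are simplified away, while genuinely symbolic
; terms are kept.
(define-symbolic a b boolean?)
(&& a #f)   ; reduced to #f by the factory
(&& a b)    ; kept as the symbolic term (&& a b)
```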
Next, test the reporter by running:
$ raco cross -q --workspace /workspace/rosette-4 racket report.rkt
You should see the following output:
┌─────┬─────┬───┬────────────────────┬───┬───┬────────────────────┬──────┬──────┬───────────────────┬───┬──────────────────┬───┬──────────┐
│Range│Count│Avg│(Leanette state) Sym│Max│Avg│(Rosette* state) Sym│Max │Avg │(Leanette time) Max│Avg│(Rosette time) Max│Avg│Validated?│
├─────┼─────┼───┼────────────────────┼───┼───┼────────────────────┼──────┼──────┼───────────────────┼───┼──────────────────┼───┼──────────┤
│9 - 9│1 │9 │1 │46 │46 │0 │+nan.0│+nan.0│X.X │X.X│X.X │X.X│✓ │
└─────┴─────┴───┴────────────────────┴───┴───┴────────────────────┴──────┴──────┴───────────────────┴───┴──────────────────┴───┴──────────┘
Similarly, test the oracle on a predefined program that does not pass the differential testing (Figure 7a in the paper) by running:
$ cd /workspace/diff-testing
$ ./reset.sh
$ raco cross -q --workspace /workspace/rosette-4 racket tester.rkt ./predefined/fig-7a.rkt
You should see the following output:
======= Program ./predefined/fig-7a.rkt =======
Computed in X.XXs
Rosette*:
(ans (state #t #t) #t)
Rosette* evaluation is trivial
Rosette* state size: 2
Leanette runs out of fuel
Leanette evaluation is trivial
Leanette state size: 0
Next, test the reporter by running:
$ raco cross -q --workspace /workspace/rosette-4 racket report.rkt
You should see the following output:
program at ./predefined/fig-7a.rkt fails differential testing because Leanette runs out of fuel
Note that `report.rkt` is the script used to generate Table 1. Because sanity-test runs contain only a single trivial data point, many of the values do not make sense and are reported as `+nan.0`.
Run the following command:
$ cloc --version
It should produce the output `1.86`.
There is no sanity test for this step (the previous sanity tests already verify that Rosette* and Rosette can be run).
In general, output related to time durations will not exactly match the paper due to differences in hardware and machine configuration. However, we hope it won't deviate significantly, and that the data you obtain will maintain relative order with respect to other data points. For example, if a table in the paper indicates that Rosette* is faster than Rosette on a particular benchmark, Rosette* should still be faster than Rosette in your run, even though the measured times will not exactly match.
Expected time: less than 1 minute
To check all proofs, run the following commands:
$ cd /workspace/lean
$ ./reset.sh
$ leanproject build
The script `reset.sh` resets the `lean` directory (e.g., clears compiled files). The command `leanproject build` checks the proofs in every file and warns about any use of `sorry` (i.e., an unchecked proof).
There should be exactly one warning, as follows:
lean/src/test_sorry.lean:1:0: warning: declaration 'test_sorry' uses sorry
This is an intentional `sorry` inserted into the project to ensure that `leanproject build` can detect unchecked proofs. The file `test_sorry.lean` is not otherwise used. To validate this, you can run:
$ git rm src/test_sorry.lean
$ git commit -m "remove test_sorry.lean" # we require that the worktree is clean, so commit the change
$ ./reset.sh
$ leanproject build
To check the proofs in a specific file, run `lean <file.lean>`. Unless the file is `test_sorry.lean`, there should be no warnings or errors.
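For reference, `test_sorry.lean` presumably looks something like the following (a sketch on our part; the actual contents may differ):

```lean
-- Plausible sketch of lean/src/test_sorry.lean (actual contents may
-- differ): a single declaration proved with `sorry`, so that
-- `leanproject build` reports exactly one warning.
theorem test_sorry : true := sorry
```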
Expected time: N/A
You can find the implementation in `/workspace/rosette-4/rosette` (or this link if you wish to view it outside the Docker image).
If you wish, you may create your own Rosette* files and run them. See the reference for more details.
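For instance, you could save the following program as `examples/my-ex.rkt` (a hypothetical name) and run it with `raco cross -q --workspace /workspace/rosette-4 racket examples/my-ex.rkt`:

```racket
#lang rosette/safe

; Hypothetical example file: ask the solver for an integer x with 2x = 10.
(define-symbolic x integer?)
(solve (assert (= (* x 2) 10)))   ; should print (model [x 5])
```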
More thorough evaluation of Rosette* is described in the next sections of this guide (solver-aided differential testing and performance evaluation).
Expected time: less than 4 hours
The results of Figure 7 should already be reproduced as a part of the sanity testing process. This section details differential testing of randomly generated programs, which consists of multiple steps:
- Compile Leanette so that it can readily run tests
- Generate tests
- Run the oracle on the generated tests
- Check the quality of the tests (coverage) by seeing if the tests can detect mistakes in intentionally incorrect interpreters.
Expected time: less than 1 minute
Simply run:
$ cd /workspace/lean
$ leanproject build
If you have already done this as a part of the previous step (and have not run `reset.sh` since), you can skip this step.
Expected time: less than 1 minute
There are two possible choices here. You do not need to do both; only one suffices.
Generate 10,000 test files as described in the paper by running:
$ cd /workspace/diff-testing
$ ./reset.sh
$ raco cross -q --workspace /workspace/rosette-4 racket generation.rkt
The script `reset.sh` resets the `diff-testing` directory (e.g., clears previously generated test files and results). The script `generation.rkt` generates 10,000 test files in the directory `generation` with seed 42. Feel free to inspect them if you wish. If you want to reproduce Table 1, we recommend keeping the default test generation setting; otherwise, you can change the seed to other numbers. See the reference for how to do so.
If you don't have enough time to run the oracle on all 10,000 test files, you can generate fewer programs by running:
$ cd /workspace/diff-testing
$ ./reset.sh
$ raco cross -q --workspace /workspace/rosette-4 racket generation.rkt --sub-iterations <sub> --iterations <main>
The above command will generate `<sub>` * `<main>` test files. In the default test generation setting, `<sub>` is 20 and `<main>` is 500 (20 * 500 = 10,000). We recommend lowering `<sub>` first.
For example, the command:
$ raco cross -q --workspace /workspace/rosette-4 racket generation.rkt --sub-iterations 1 --iterations 500
will generate 500 test files. On our machine with four threads, the oracle can check all 500 programs in under 10 minutes.
Note that any setting that deviates from the default will not reproduce Table 1; the differential testing should still pass, however (see the test report section for more details).
Expected time: depends on how many tests you generate. For 10,000 tests:
- Expected time with one thread: less than 16 hours
- Expected time with four threads: less than 4 hours
Run the oracle:
$ cd /workspace/diff-testing
$ raco cross -q --workspace /workspace/rosette-4 racket tester.rkt ./generation
This command runs the oracle with 4 threads and an 80s timeout, and outputs the results to the screen. It also creates a log file at `/workspace/diff-testing/workspace/log.rktd` that the reporter will read. Due to hardware differences, you may need to adjust these parameters; see the reference for more details. (Note that in the paper we used a timeout of 40s, but inside Docker we found that 40s is not enough, so we bumped the timeout up to 80s.)
Expected time: less than 1 minute
Report the result by running:
$ cd /workspace/diff-testing
$ raco cross -q --workspace /workspace/rosette-4 racket report.rkt
The output consists of two parts: the programs that do not pass the differential testing and a table similar to Table 1 in the paper.
First, for programs that do not pass the differential testing, there are 5 possible reasons:
- Leanette and Rosette* disagree: this case should never happen, as it indicates that Rosette* and Leanette actually disagree. If you found one, please let the authors know!
- Leanette has a fatal error: this case should never happen. It's likely that there is a technical error somewhere. If you found one, please let the authors know!
- Leanette runs out of fuel: this case is possible when the generated program contains an unreachable infinite loop, but Leanette's symbolic factory is not strong enough to deduce the unreachability. The case is accounted for in our framework (see Section 5 and Example 9) and should not be counted as a mistake. It would be very rare for this case to happen. In our evaluation in the paper, this case does not arise from randomly generated programs on our machine.
- Rosette* timeout: either the program takes a really long time to evaluate or it contains an infinite loop. It would be very rare for this case to happen. In our evaluation in the paper, this case does not arise from randomly generated programs on our machine.
- Leanette timeout: the program takes a really long time to evaluate. It would be very rare for this case to happen. In our evaluation in the paper, this case does not arise from randomly generated programs on our machine.
You can re-run the oracle on each program that fails the differential testing to see more details. If you encounter timeouts under the default test generation setting, note that all generated programs are known to terminate, so you can re-run the specific program with a higher timeout.
Second, for the table, if you use the default test generation setting, the result should be similar to Table 1 with minor differences in evaluation time.
┌─────────┬─────┬───┬────────────────────┬───────┬─────┬────────────────────┬───┬───┬───────────────────┬───┬──────────────────┬───┬──────────┐
│Range │Count│Avg│(Leanette state) Sym│Max │Avg │(Rosette* state) Sym│Max│Avg│(Leanette time) Max│Avg│(Rosette time) Max│Avg│Validated?│
├─────────┼─────┼───┼────────────────────┼───────┼─────┼────────────────────┼───┼───┼───────────────────┼───┼──────────────────┼───┼──────────┤
│1 - 12 │1,006│6 │220 │635 │41 │88 │20 │4 │3.0 │2.7│0.3 │0.1│✓ │
├─────────┼─────┼───┼────────────────────┼───────┼─────┼────────────────────┼───┼───┼───────────────────┼───┼──────────────────┼───┼──────────┤
│13 - 24 │1,021│19 │349 │1,142 │77 │174 │36 │6 │2.9 │2.6│1.2 │0.1│✓ │
├─────────┼─────┼───┼────────────────────┼───────┼─────┼────────────────────┼───┼───┼───────────────────┼───┼──────────────────┼───┼──────────┤
│25 - 38 │1,042│32 │430 │2,855 │102 │224 │34 │6 │2.8 │2.7│0.3 │0.1│✓ │
├─────────┼─────┼───┼────────────────────┼───────┼─────┼────────────────────┼───┼───┼───────────────────┼───┼──────────────────┼───┼──────────┤
│39 - 50 │1,036│44 │433 │16,979 │242 │187 │34 │7 │3.0 │2.7│0.4 │0.1│✓ │
├─────────┼─────┼───┼────────────────────┼───────┼─────┼────────────────────┼───┼───┼───────────────────┼───┼──────────────────┼───┼──────────┤
│51 - 65 │1,040│58 │450 │7,523 │186 │227 │30 │8 │2.9 │2.7│0.3 │0.1│✓ │
├─────────┼─────┼───┼────────────────────┼───────┼─────┼────────────────────┼───┼───┼───────────────────┼───┼──────────────────┼───┼──────────┤
│66 - 77 │1,040│72 │459 │30,386 │474 │239 │45 │8 │3.6 │2.7│0.3 │0.1│✓ │
├─────────┼─────┼───┼────────────────────┼───────┼─────┼────────────────────┼───┼───┼───────────────────┼───┼──────────────────┼───┼──────────┤
│78 - 88 │1,053│83 │454 │19,835 │426 │231 │36 │8 │3.3 │2.7│0.5 │0.1│✓ │
├─────────┼─────┼───┼────────────────────┼───────┼─────┼────────────────────┼───┼───┼───────────────────┼───┼──────────────────┼───┼──────────┤
│89 - 102 │1,042│95 │427 │45,893 │482 │190 │41 │9 │4.4 │2.7│0.3 │0.2│✓ │
├─────────┼─────┼───┼────────────────────┼───────┼─────┼────────────────────┼───┼───┼───────────────────┼───┼──────────────────┼───┼──────────┤
│103 - 119│1,009│111│425 │61,439 │686 │218 │45 │9 │5.2 │2.7│0.4 │0.2│✓ │
├─────────┼─────┼───┼────────────────────┼───────┼─────┼────────────────────┼───┼───┼───────────────────┼───┼──────────────────┼───┼──────────┤
│120 - 158│711 │129│324 │293,171│2,001│159 │36 │9 │40.0 │2.7│0.4 │0.2│✓ │
└─────────┴─────┴───┴────────────────────┴───────┴─────┴────────────────────┴───┴───┴───────────────────┴───┴──────────────────┴───┴──────────┘
The following table details our injected mistakes:

Patch file | Description |
---|---|
`diff-testing/patch/incorrect_and.patch` | Make and(b_a, b_b) = b_a, where b_a and b_b are non-literal symbolic booleans |
`diff-testing/patch/incorrect_or.patch` | Make or(b, b) = tt, where b is a non-literal symbolic boolean |
`diff-testing/patch/incorrect_lt.patch` | Make op(<, z_a, z_b) = z_b < z_a, where either z_a or z_b is a non-literal symbolic integer |
`diff-testing/patch/incorrect_call_op.patch` | In CallOp, skip strengthen |
`diff-testing/patch/incorrect_call.patch` | In Call, change the guards for evaluation of the body to be tt |
`diff-testing/patch/incorrect_call_bad.patch` | In CallBad, use the input symbolic state without asserting γ |
`diff-testing/patch/incorrect_let.patch` | In Let, do not bind any variable |
`diff-testing/patch/incorrect_if_sym.patch` | In IfSym, swap not(truth(v)) and truth(v) when calling merge_R(., .) |
`diff-testing/patch/incorrect_if_true_false.patch` | In IfTrue and IfFalse, swap the conditions |
`diff-testing/patch/incorrect_error.patch` | In Err, do Abort |
`diff-testing/patch/incorrect_abort.patch` | In Abort, do Err |
There are two possible choices here. You do not need to do both; only one suffices. We recommend the default test generation setting, as it is much quicker and matches the paper.
Expected time: less than 5 minutes
We have already identified test programs in `generation` that cause faultily patched versions of Leanette (BadLeanette) to produce incorrect results.
First, run:
$ cd /workspace/lean
$ leanproject build
$ cd /workspace/diff-testing
$ ./reset.sh
$ raco cross -q --workspace /workspace/rosette-4 racket generation.rkt
$ git status
These commands will generate 10,000 test files and prepare Leanette. `git status` should report that the worktree is clean (no "Untracked files" or "Changes"). If it is not clean, run `git clean -df` and `git reset --hard` from `/workspace` to clean the worktree.
Second, run:
$ ./test-coverage.sh
The script `test-coverage.sh` patches Leanette with each patch file in turn and runs the oracle on known test programs that cause the patched interpreters to fail differential testing. The expected output is that the oracle reports failure on all of them:
program at ./generation/prog23160718406451015591104819237807520888040421765198.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog100208336541399653250221187751391297369.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog478473500332049125701102687487886134892618785332689813519393109671586439862663588789828131845030156.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog16803.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog666831569393266787587089860792333636466692132130.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog19436053114232394.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog15.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog1331343393596762924043919245445877806201864368967369509032193564215325013490794460521988648842748377343143771160948246706431109369.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog271706286423030047271490064405043091576078507211894322485583089278608494173230103175221675458186681751625690187786390.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog735983.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog100026410668776740492900673318090171927233255361720922983271273849413253676502727166331182369672360102050306966877856993674940497.rkt fails differential testing because Leanette and Rosette* disagree
Note that Lean will also emit many errors and warnings. This is expected, since the proofs are no longer valid after patching. However, Lean should still be able to run Leanette without problems.
Expected time: depends on various factors, but in general it could take an extremely long time, as you will run the test oracle on all generated tests 11 times (once per patch).
If you don't want to use the default test generation setting, you can instead run:
$ git status # should report the directory is clean
$ raco cross -q --workspace /workspace/rosette-4 racket patcher.rkt ./patch/<file.patch> ./generation
$ git reset --hard
which will apply the given patch to Leanette and run the oracle on the programs in `generation`. You can interrupt the process (Ctrl-C) once you notice that the oracle reports that Rosette* and (Bad)Leanette disagree.
If you know Lean, you can also modify Leanette to create your own BadLeanette and run the oracle on `generation` yourself. Note that the modification must make the interpreter incorrect (it is possible to modify it without compromising its correctness, so you need to be careful). The expected result is that, for every patch, one of the generated tests in `generation` catches the discrepancy between Rosette* and BadLeanette, provided that enough tests are generated.
Expected time: less than 20 minutes
To generate Table 3, run:
$ cd /workspace/interface
$ ./reset.sh
$ ./run.sh # (~17 minutes)
The script `reset.sh` removes any previously generated outputs. The script `run.sh` counts the number of lines by going through module dependencies, performs line diffing, and reports the result. It prints intermediate results as it runs, ending with a table that matches Table 3 from the paper:
┌───────────────────────────────────────────┬───────────┬────────────┬────────┬────┬─────────┐
│Benchmark │Rosette LoC│Rosette* LoC│LoC diff│ │Solver │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│Bagpipe~\cite{weitz:bagpipe} │2,643 │2,643 │+2 │-2 │Z3 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│Bonsai~\cite{chandra:bonsai} │437 │437 │+1 │-1 │Boolector│
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│Cosette~\cite{chu:cosette} │774 │774 │+0 │-0 │Z3 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│Ferrite~\cite{ferrite} │348 │348 │+2 │-2 │Z3 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│Fluidics~\cite{willsey:puddle} │98 │98 │+0 │-0 │Z3 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│GreenThumb~\cite{phothilimthana:greenthumb}│3,554 │3,556 │+13 │-11 │Boolector│
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│IFCL~\cite{rosette:pldi} │483 │483 │+13 │-13 │Boolector│
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│MemSynth~\cite{memsynth} │13,303 │13,307 │+8 │-4 │Z3 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│Neutrons~\cite{pernsteiner:neutrons} │37,174 │37,174 │+2 │-2 │Z3 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│Nonograms~\cite{butler:nonograms} │3,368 │3,374 │+11 │-5 │Z3 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│Quivela~\cite{aws:quivela} │1,010 │1,009 │+10 │-11 │Z3 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│RTR~\cite{kazerounian:rtr} │1,649 │1,640 │+12 │-21 │CVC4 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│SynthCL~\cite{rosette:pldi} │2,257 │2,256 │+42 │-43 │Boolector│
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│Wallingford~\cite{borning:wallingford} │2,532 │2,533 │+12 │-11 │Z3 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│WebSynth~\cite{rosette:pldi} │1,181 │1,189 │+14 │-6 │Z3 │
├───────────────────────────────────────────┼───────────┼────────────┼────────┼────┼─────────┤
│Jitterbug~\cite{jitterbug} │29,280 │28,935 │+478 │-823│Boolector│
└───────────────────────────────────────────┴───────────┴────────────┴────────┴────┴─────────┘
Expected time: less than 1 hour
Run the following command:
$ cd /workspace/perf
$ ./reset.sh
$ ./perf.sh sympro
The output should roughly match Table 4a and the SymPro row of Table 4c in the paper, with minor differences in columns about time duration.
┌───────────┬───────────────┬────┬───────┬─────┬────────────────┬────┬───────┬─────┐
│Benchmark │(Rosette) Total│Eval│Solving│Terms│(Rosette*) Total│Eval│Solving│Terms│
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│Bagpipe │25 │24 │2 │47 │26 │24 │2 │48 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│Bonsai │23 │21 │3 │664 │46 │42 │3 │1,222│
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│Cosette │13 │9 │5 │131 │14 │10 │5 │154 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│Ferrite │23 │13 │10 │34 │32 │13 │19 │40 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│Fluidics │20 │7 │13 │284 │25 │8 │18 │308 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│GreenThumb │56 │8 │48 │239 │6 │2 │4 │74 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│IFCL │154 │11 │143 │383 │119 │12 │107 │438 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│MemSynth │24 │22 │2 │61 │27 │25 │2 │163 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│Neutrons │44 │43 │<\,1 │444 │13 │12 │<\,1 │172 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│Nonograms │17 │3 │14 │51 │23 │4 │19 │73 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│Quivela │35 │33 │2 │1,113│39 │34 │4 │1,340│
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│RTR │353 │332 │21 │741 │135 │101 │34 │1,106│
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│SynthCL │252 │16 │236 │726 │247 │18 │229 │732 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│Wallingford│12 │<\,1│11 │4 │12 │<\,1│11 │5 │
├───────────┼───────────────┼────┼───────┼─────┼────────────────┼────┼───────┼─────┤
│WebSynth │17 │12 │5 │1,012│24 │19 │5 │778 │
└───────────┴───────────────┴────┴───────┴─────┴────────────────┴────┴───────┴─────┘
┌────────────┬─────┬────┬───────────┬─────┬────┬──────────────┬─────┬────┬────────────┬─────┬────┐
│(Total) Best│Worst│Avg │(Eval) Best│Worst│Avg │(Solving) Best│Worst│Avg │(Terms) Best│Worst│Avg │
├────────────┼─────┼────┼───────────┼─────┼────┼──────────────┼─────┼────┼────────────┼─────┼────┤
│0.11 │1.95 │0.85│0.26 │2.06 │0.90│0.09 │1.90 │1.00│0.31 │2.65 │1.06│
└────────────┴─────┴────┴───────────┴─────┴────┴──────────────┴─────┴────┴────────────┴─────┴────┘
Also see the reference for how to run a specific benchmark or fewer benchmarks.
Expected time: less than 24 hours
This step describes how to generate the results described in Section 7.2 that compare the performance of the Rosette version of Jitterbug to that of the Rosette* port.
First, switch to the directory containing the Jitterbug benchmarks.
$ cd /workspace/jitterbug-benchmarks
Generating the performance results will take several hours, as the benchmarks are run single-threaded for consistency across runs.
The script will output the raw performance data to the console and to the files `jitterbug-rosette3-data.csv` and `jitterbug-rosette4-data.csv` in this directory. You may choose to skip this step and instead use the pre-computed copies of the data to see the formatted data without waiting for the benchmarks to complete.
To run the benchmarks and generate fresh versions of the data, run the following:
$ ./run-jitterbug-benchmarks.sh # ~19 hours
Note that running the benchmarks will overwrite the pre-computed copies of the data. You can revert to the pre-computed versions with `git checkout '*.csv'`.
After the command completes (or if you are using the pre-computed copies of the data), run the following command to format the data as it appears in Table 4b and the Jitterbug row of Table 4c in the paper:
$ ./show-performance-table.sh # <1 second
If you are using the pre-computed data, this will produce the output below. If you ran the benchmarks to generate the data yourself, the results may vary depending on your particular machine (e.g., CPU, amount of available RAM, etc.).
Table 4b:
| Total (s.) || Evaluation (s.) || Solving (s.) || Terms (*10^3) ||
+ mean + median + max ++ mean + median + max ++ mean + median + max ++ mean + median + max ++
Rosette | 50 | 22 | 9,963 || 2 | 1 | 69 || 48 | 20 | 9,894 || 119 | 15 | 1,678 ||
Rosette* | 38 | 20 | 4,100 || 1 | 1 | 73 || 36 | 19 | 4,027 || 120 | 23 | 1,837 ||
Performance results for the 668 ported and original Jitterbug benchmarks.
Table 4c (Jitterbug results only)
| Total || Evaluation || Solving || Terms ||
+ best + worst + average ++ best + worst + average ++ best + worst + average ++ best + worst + average ++
Jitterbug | 0.23 | 6.03 | 0.91 || 0.33 | 2.18 | 0.87 || 0.22 | 6.48 | 0.91 || 0.79 | 8.12 | 1.09 ||
Performance ratios between the ported and original code for SymPro and Jitterbug benchmarks.
See the reference for additional artifact description.