* To run a quick set of tests:
python setup.py build_ext -i
ZS_QUICK_TEST=1 nosetests --all-modules zs
* To run a complete set of tests with HTML coverage output:
nosetests --all-modules --with-cov --cov-conf .coveragerc --cov-report term-missing --cov-report html zs
* Windows compilation
https://github.com/cython/cython/wiki/64BitCythonExtensionsOnWindows
https://matthew-brett.github.io/pydagogue/python_msvc.html
* Next
** to alleviate multiprocessing flakiness, and speed things up in general: when parallelism="guess", delay spawning the worker pool until we hit the third block within a single read thread (see the sketch below)
** Windows wheels? (and also for backports.lzma?)
** once RTD is straightened out, make backports.lzma a proper requirement
** figure out how to get Cython coverage (CYTHON_TRACE?)
** check whether the 50 MiB/s/worker we get from both deflate and lzma is because we are bottlenecked on unpack_data_records
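A minimal standalone sketch of the lazy-spawn idea in the first item above; decompress_block, read_blocks, and SPAWN_THRESHOLD are made-up names for illustration, not reader.py's actual structure:
import multiprocessing

SPAWN_THRESHOLD = 3  # spawn the pool once we reach the third block

def decompress_block(raw_block):
    # hypothetical stand-in for the real per-block decompress/unpack work
    return raw_block

def read_blocks(raw_blocks, parallelism="guess"):
    pool = None
    try:
        for i, raw in enumerate(raw_blocks):
            if parallelism == "guess" and pool is None and i + 1 >= SPAWN_THRESHOLD:
                pool = multiprocessing.Pool()
            if pool is None:
                # short reads never pay the pool start-up cost
                yield decompress_block(raw)
            else:
                # a real implementation would keep several blocks in flight
                # (e.g. via imap) rather than round-tripping one at a time
                yield pool.apply(decompress_block, (raw,))
    finally:
        if pool is not None:
            pool.terminate()
            pool.join()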
* LZMA
** proper tuning
using the 3gram "sk" file (the actual file, not dump --prefix="sk"), I generated files with varying codec and block-size options and measured the dump time
three runs each, on truffle, so -j8 and local disks
| alg  | bs (KiB) | size (MiB) | run1 (s) | run2   | run3   | mean   | sd     |
|------+----------+------------+----------+--------+--------+--------+--------|
| bz2  | 32       | 473        | 76.890   | 57.132 | 58.520 | 64.181 | 11.028 |
|      | 64       | 463        | 42.122   | 38.923 | 38.457 | 39.834 | 1.995  |
|      | 128      | 466        | 29.905   | 29.740 | 29.715 | 29.787 | 0.103  |
|      | 256      | 475        | 33.126   | 32.906 | 29.835 | 31.956 | 1.840  |
|      | 384      | 484        | 33.657   | 34.556 | 36.592 | 34.935 | 1.504  |
|      | 512      | 490        | 34.477   | 34.826 | 33.886 | 34.396 | 0.475  |
|      | 768      | 500        | 36.331   | 37.478 | 35.336 | 36.382 | 1.072  |
|------+----------+------------+----------+--------+--------+--------+--------|
| lzma | 128      | 411        | 25.245   | 28.328 | 27.970 | 27.181 | 1.686  |
| 0e   | 256      | 397        | 23.853   | 20.536 | 22.477 | 22.289 | 1.667  |
|      | 384      | 391        | 23.147   | 22.018 | 19.267 | 21.477 | 1.996  |
|      | 512      | 388        | 19.505   | 21.185 | 21.543 | 20.744 | 1.088  |
|      | 768      | 385        | 21.649   | 18.444 | 18.058 | 19.384 | 1.971  |
|------+----------+------------+----------+--------+--------+--------+--------|
| lzma | 128      |            |          |        |        | 0.000  | 0.000  |
| 1e   | 256      | 397        |          |        |        | 0.000  | 0.000  |
|      | 384      | 391        |          |        |        | 0.000  | 0.000  |
|      | 512      | 388        |          |        |        | 0.000  | 0.000  |
|      | 768      | 385        |          |        |        | 0.000  | 0.000  |
|      | 1024     | 382        |          |        |        | 0.000  | 0.000  |
#+TBLFM: $7=vmean($4..$6);%.3f::$8=vsdev($4..$6);%.3f
Observations:
- for bz2
  - 128k is obviously the speed sweet spot -- larger block sizes make things slower (because BWT complexity grows with the size of the data, I guess)
  - weirdly, it is also approximately the sweet spot on file size -- larger block sizes also produce larger files (very weird! I have no explanation for this)
- for lzma
  - going from 128 -> 256 makes things much faster (~20%?!), with smaller but possibly reliable gains after that
  - larger block sizes make files smaller: a big jump from 128 -> 256, a smaller one from 256 -> 384, and even smaller after that (256 is the lzma context size at 0e, so this also makes some sense)
  - 1e has basically no size advantage -- at 768k, the actual sizes are 402655885 (1e) versus 403639270 (0e), so a little smaller, but by less than 0.5%. the gap does grow with increasing block size, but it's still negligible.
For 1e we might expect more of a win at larger block sizes, at least on file size.
Files were generated with commands like:
for BS in 384 512 768; do for CODEC in lzma bz2; do gunzip -c google-books-eng-us-all-20120701/3gram/sorted-googlebooks-eng-us-all-3gram-20120701-sk.gz | zs make '{}' - 3gram-sk-${BS}k-${CODEC}.zs --approx-block-size=$(($BS * 1024)) --codec $CODEC; done; done
and the dump times were measured with:
import time

from zs import ZS

for codec in ["bz2"]:
    for s in [32, 128, 64]:
        times = []
        for i in range(3):
            z = ZS("3gram-sk-%sk-%s.zs" % (s, codec))
            start = time.time()
            with open("/dev/null", "wb") as out:
                z.dump(out)
            times.append(time.time() - start)
            z.close()
        print("| %s | %s | | %0.3f | %0.3f | %0.3f | | |"
              % (codec, s, times[0], times[1], times[2]))
** general benchmarks
on the 'old/sorted' test vector, using branna and Python 3
| alg      | size (bytes) | compress time | decompress time |
|----------+--------------+---------------+-----------------|
| bz2 -1   | 1036222      | 774 ms        | 153 ms          |
| bz2 -9   | 1140258      | 799 ms        | 208 ms          |
| lzma -0  | 1122836      | 340 ms        | 101 ms          |
| lzma -1  | 1109110      | 396 ms        | 95.7 ms         |
| lzma -0e | 810799       | 4520 ms       | 104 ms          |
| lzma -1e | 805229       | 5900 ms       | 103 ms          |
| deflate  | 1441904      | 323 ms        | 20.6 ms         |
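A sketch of how numbers in this ballpark can be reproduced with the stdlib codecs; bench is a made-up helper, the test-vector path is a placeholder, and the exact codec settings behind the table may differ:
import bz2, lzma, zlib, time

def bench(name, compress, decompress, data):
    # time one compress and one decompress of the whole test vector
    t0 = time.time(); c = compress(data); t1 = time.time()
    d = decompress(c); t2 = time.time()
    assert d == data
    print("%-10s %9d bytes  compress %6.0f ms  decompress %6.1f ms"
          % (name, len(c), (t1 - t0) * 1e3, (t2 - t1) * 1e3))

with open("old/sorted", "rb") as f:  # placeholder path for the test vector
    data = f.read()

bench("bz2 -1", lambda d: bz2.compress(d, 1), bz2.decompress, data)
bench("bz2 -9", lambda d: bz2.compress(d, 9), bz2.decompress, data)
bench("lzma -0", lambda d: lzma.compress(d, preset=0), lzma.decompress, data)
bench("lzma -1", lambda d: lzma.compress(d, preset=1), lzma.decompress, data)
bench("lzma -0e", lambda d: lzma.compress(d, preset=0 | lzma.PRESET_EXTREME),
      lzma.decompress, data)
bench("lzma -1e", lambda d: lzma.compress(d, preset=1 | lzma.PRESET_EXTREME),
      lzma.decompress, data)
bench("deflate", zlib.compress, zlib.decompress, data)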
using old/smalltest:
the dump times here are highly variable...
| alg     | blocksize | size (bytes) | make walltime | dump walltime |
|---------+-----------+--------------+---------------+---------------|
| lzma 0e | 128 KiB   | 27663364     | 42 s          | 2.387 s       |
| lzma 0e | 1024 KiB  | 27197990     | 50 s          | 1.993 s       |
| lzma 0e | 2048 KiB  | 27173018     | 52 s          |               |
for single zpayloads on my laptop, lzma.decompress takes 16 ms for a 1 MiB uncompressed block, and 2.5 ms for a 128 KiB uncompressed block.
maybe 512 KiB -> ~10 ms -> about the same as a disk seek?
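A quick way to sanity-check those per-block numbers; the block contents here are synthetic, so absolute times will differ from real zpayloads:
import lzma
import os
import timeit

for size in (128 * 1024, 512 * 1024, 1024 * 1024):
    # synthetic stand-in for one uncompressed block; real zpayloads
    # (sorted n-gram records) will compress differently
    block = os.urandom(size // 2) * 2
    compressed = lzma.compress(block, preset=0 | lzma.PRESET_EXTREME)
    per_call = min(timeit.repeat(lambda: lzma.decompress(compressed),
                                 number=20, repeat=3)) / 20
    print("%4d KiB block: %.2f ms per decompress" % (size // 1024, per_call * 1e3))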
* Better parallelism
** Solving the control-C problem (and friends) by writing a better process pool
reader.py's process pool is terrible, especially on py2:
- if you are using the command-line tools, control-C tends to just blow up
- if you are in ipython, then control-C destroys your ZS object
a better pool would:
- spawn our own processes with custom loops, a la writer.py
- for these children, set SIGINT to SIG_IGN, so we just don't have to deal with it.
- maybe have them poll the parent process occasionally to make sure that it's still alive. having the parent process unconditionally kill them on exit gets us a good part of the way, but can we do even better? heartbeats? keep a pipe around?
- some options of varying cleverness and portability: https://stackoverflow.com/questions/284325/how-to-make-child-process-die-after-parent-exits
- most reliable (but not quite mentioned there) seems to be to pass the parent's PID into the child, and then have the child poll getppid().
the easiest way to do this would be for each worker to spawn a thread that just does this and calls os._exit() if the parent goes away (see the sketch after this list).
- windows approach: https://stackoverflow.com/a/12942797/1925449
http://msdn.microsoft.com/en-us/library/ms684161%28VS.85%29.aspx
requires win32api and some way to get a process handle for multiprocessing child
- and if a worker just dies (which should never happen), then shut down the parent too, because who knows what got lost, and this should be a very unusual event. but we do need to watch for it when blocking on get() in the main thread.
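A minimal sketch of that getppid()-polling watchdog (POSIX only); _parent_watchdog and worker_main are hypothetical names, the parent would pass its own os.getpid() to each child at spawn time, and the actual work loop is elided:
import os
import signal
import threading
import time

def _parent_watchdog(parent_pid, interval=1.0):
    # runs in a daemon thread inside each worker process
    while True:
        if os.getppid() != parent_pid:
            # parent died and we got reparented: bail out hard, skipping
            # cleanup, since there is nobody left to report to anyway
            os._exit(1)
        time.sleep(interval)

def worker_main(parent_pid, work_loop):
    # ignore control-C; only the parent decides when workers die
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    watchdog = threading.Thread(target=_parent_watchdog, args=(parent_pid,))
    watchdog.daemon = True
    watchdog.start()
    work_loop()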
** A wackier alternative
if we spend a lot of time in IPC for streaming reads, it's possible we should switch to threads instead of processes.
this wouldn't be too hard -- the trick would be to write a tiny circular queue in Cython, start threads using the standard Python APIs, and have all the threads execute a Cython loop that drops the GIL and then pulls char*'s off the queue and puts back decompressed char*'s.
we might even get away with using the GIL to serialize access to the queue, and just using Cython to get a GIL-dropping interface to lzma_decompress etc. Actually I guess the lzma module already has per-compressor locking, it's just zlib and bz2 that are unhelpful this way.
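A pure-Python approximation of that shape, using queue.Queue in place of a Cython circular queue; decompress_pipeline is a made-up name, and swapping lzma.decompress for a GIL-dropping Cython wrapper around zlib/bz2 is exactly the part the note above is about:
import lzma
import queue
import threading

_STOP = object()

def decompress_pipeline(compressed_blocks, num_threads=4):
    # yields decompressed blocks in their original order
    in_q = queue.Queue(maxsize=2 * num_threads)
    out_q = queue.Queue()

    def worker():
        while True:
            item = in_q.get()
            if item is _STOP:
                break
            seq, payload = item
            out_q.put((seq, lzma.decompress(payload)))

    workers = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_threads)]
    for t in workers:
        t.start()

    def feeder():
        count = 0
        for payload in compressed_blocks:
            in_q.put((count, payload))
            count += 1
        for _ in workers:
            in_q.put(_STOP)
        out_q.put(("done", count))

    threading.Thread(target=feeder, daemon=True).start()

    # hand blocks back out in file order even if workers finish out of order
    pending, next_seq, total = {}, 0, None
    while total is None or next_seq < total:
        seq, value = out_q.get()
        if seq == "done":
            total = value
            continue
        pending[seq] = value
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1
Usage would look like "for block in decompress_pipeline(compressed_payload_iter): ...", with the payload iterator coming from whatever the existing read path produces.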