test upstream ophyd changes #3
Since the previous test (a440337) for "another set() in progress" ended with a read timeout that likely would be fixed by auto-monitoring the PV, let's repeat that test exactly, now with the combination of three branches.
restart the test
But first, use a fresh mongo database and set the default write timeout.
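For reference, a write timeout can be supplied when an `EpicsSignal` is created (assuming an ophyd version that accepts these keyword arguments; the PV name below is hypothetical):

```python
from ophyd import EpicsSignal

# Hypothetical PV name. write_timeout (seconds) bounds how long a set()/put()
# waits for the write to complete before the returned Status is marked failed.
example_signal = EpicsSignal("XXX:example:value", name="example_signal", write_timeout=10.0)
```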
Make the new database:
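(MongoDB creates a database lazily on first write; a minimal sketch with pymongo, where the host, port, and database name are hypothetical:)

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)        # hypothetical mongod host/port
db = client["bluesky_test_2020_10"]             # hypothetical new database name
db["notes"].insert_one({"purpose": "fresh database for this test"})
print(client.list_database_names())
```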
started the test, 236 iterations so far
11:47 pm, 1228 iterations, no problems observed
test failed with timeout during iteration 1378:
Need to make sure auto_monitor is True!
Now there's the problem.
This setting must happen in the upstream package (apstools) where the Component is defined. That's a lot of work.
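For illustration, this is the kind of change needed where the Component is defined (the device and PV suffixes here are hypothetical, not the actual apstools devices):

```python
from ophyd import Component, Device, EpicsSignal, EpicsSignalRO

class ExampleDevice(Device):
    # auto_monitor=True makes each signal use CA monitor callbacks for reads,
    # rather than issuing a fresh caget every time the value is needed.
    setpoint = Component(EpicsSignal, "Setpoint", auto_monitor=True)
    readback = Component(EpicsSignalRO, "Readback", auto_monitor=True)
```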
Re-start testing with apstools branch: 450-auto_monitor
startup bluesky ...
Added code to set a default
Things were progressing just fine:
But then this happened in the middle of a scan (ending everything):
Nothing in the logs describes what happened.
Checked the system logs; they are read-restricted.
The system has 16 GB of RAM, 6 GB of which are free at this time.
Also note that at 00:57 this morning, the run paused for about 9 minutes and then resumed with no logs or other info shown on the ipython console. Anything in the system logs? Most of those are readable only by root.
Need to add memory diagnostics before repeating that same test. No observation of the problems we were expecting to find!
Getting the total number of bytes used by any Python object is a complex task and is not generally supported. The psutil package reports memory use for the whole process instead.
See https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info for what these terms mean. Here is the value at the start of a bluesky session:
rss: aka "Resident Set Size", this is the non-swapped physical memory a process has used (on UNIX it matches top's RES column). References:
MB in use:
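A minimal sketch of how such a number can be obtained with psutil (the conversion to MB is only for display):

```python
import os
import psutil

process = psutil.Process(os.getpid())
rss_bytes = process.memory_info().rss      # resident set size of this process, in bytes
print(f"MB in use: {rss_bytes / 1024**2:.1f}")
```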
Track memory use per iteration:
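One possible way to do this (not necessarily what was used here) is a RunEngine subscription that samples rss every time a run closes:

```python
import os
import psutil
from bluesky import RunEngine

process = psutil.Process(os.getpid())
rss_history = []      # one sample per completed run

def record_rss(name, doc):
    # Called for every emitted document; sample memory when a run closes.
    if name == "stop":
        rss_history.append(process.memory_info().rss)

RE = RunEngine({})
RE.subscribe(record_rss)
```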
Definitely leaking memory somewhere:
again
Disable
Run 1000 iterations; projected to use about 4 GB of RAM (~4 MB/iteration).
BTW, this measurement will span the summer (CDT) to winter (CST) time change. |
Next, evaluate the memory leak rate using a bare RunEngine, with no bec and minimal RE.md additions.
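The actual code and console output are referenced below; as a hedged sketch, such a bare test could look like this, using simulated hardware from ophyd.sim and an illustrative iteration count:

```python
from bluesky import RunEngine
from bluesky.plans import count, scan
from ophyd.sim import det, motor           # simulated detector and motor, no EPICS needed

RE = RunEngine({})
RE.md["purpose"] = "memory-leak test"      # minimal metadata, no bec subscribed

for i in range(1000):
    RE(count([det], num=5))
    RE(scan([det], motor, -1, 1, 21))
```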
bare example code: bp.count()
console output
code: bp.scan()
console output
versions:
Create a new, clean environment for testing:
Includes:
Same code as above.
output: bp.count()
output: bp.scan()
Observations:
Use tracemalloc:
import tracemalloc

tracemalloc.start()
my_complex_analysis_method()  # placeholder for the code being measured
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()
Show differences across execution of a function:
import tracemalloc

tracemalloc.start()  # tracing must already be active when the first snapshot is taken
snapshot1 = tracemalloc.take_snapshot()
# ... call the function leaking memory ...
snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
print("[ Top 10 differences ]")
for stat in top_stats[:10]:
    print(stat)
Re-run the test with tracemalloc enabled.
These are the files and lines of code at the top of the lists:
The appearance of
As predicted above (#3 (comment)), about 4 GB of RAM was consumed by about 1000 iterations. Not one bit of that seems to have been garbage collected.
Plot of total rss memory (bytes) used by ipython after each iteration:
Plot of new rss memory (bytes) used by ipython since the previous iteration:
This issue records testing at APS, but not on a beamline computer.
USAXS & XPCS have experienced chronic problems that limit continued, uninterrupted collection with Bluesky. At the heart of the problems are random, rare, and recurrent incidents where a PV becomes temporarily unresponsive. In Bluesky, the incidents result in a timeout that terminates the Bluesky RunEngine. When a timeout has not been set for a write operation but the target PV is unresponsive, this leaves an ophyd Status object waiting for a response that is not coming. The next time the PV is to be written, the RunEngine terminates with "Another call to set() is in progress."
Several fixes have been proposed to address various aspects of these problems. The 2020-10-test branch has been created to collate these propositions so they can be tested together. (Initial tests with just one branch proved that the different problems can occur together, interrupting the evaluation of any one particular solution.) These branches were combined here for testing the complete suite:
Recent commits cover the initial testing of some of these branches individually.
Report from the latest testing (which prompted this branch to combine the different propositions):
2020-10-28: a440337
After thousands of scans, finding and centering on a randomly-placed peak while collecting lots of additional baseline data, one scan started but failed (at the end) with this console trace: