Flyscan hanging for 3 hours with no obvious reason #417
And it hung again at 12:07, found out at 12:50. D Thu-12:06:05 - make_filename: /mnt/share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/muk_31_0414.jpg. Restarting BS. This is unusable; I will finish this set of samples and switch to spec, I cannot burn beamtime this way...
Tragically, this error message:
says it failed when trying to write the status to a PV (ophyd object).
I'm assuming that, like the PV timeout problems we had last fall, this is another piece of fundamental ophyd infrastructure that needs rethinking.
These are the last lines from the log (also shows the restart):
FYI, normal process (with no errors) looks like this:
We could add code in the except block here: https://github.com/APS-USAXS/ipython-usaxs/blob/02a1d9c0af05628a084a2a392b98c3a39b32b44a/profile_bluesky/startup/instrument/devices/usaxs_fly_scan.py#L208-L213 to clear the stuck set() call.
Such as (from bluesky/ophyd#757 (comment)):
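A minimal sketch of that idea, assuming (per the upstream discussion) that the blocking state lives in the signal's private `_set_thread` attribute; this is an ophyd implementation detail, so treat it as a workaround rather than a supported API, and the `user_data.state` name is only illustrative:

```python
def clear_stuck_set(signal):
    """Forcibly clear a Signal whose earlier set() never finished, so later
    writes stop raising 'Another set() call is still in progress'."""
    # _set_thread is private to ophyd.Signal and may change between releases
    if getattr(signal, "_set_thread", None) is not None:
        signal.log.warning("clearing unfinished set() on %s", signal.name)
        signal._set_thread = None

# e.g. inside the fly scan's except block:
#     clear_stuck_set(user_data.state)
```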
Whatever is going to be a reasonable fix.
We have timeout handling already.
But yes, it appears this was due to a timeout when writing the instrument state. The timeout left a set() call still marked as in progress, which blocked every later write to that signal.
This draft PR proposes a way out of the problem: bluesky/ophyd#901
We set a default timeout of 10 seconds: https://github.com/APS-USAXS/ipython-usaxs/blob/bf50e57de6849961db89e845a9f4e96e03288d9d/profile_bluesky/startup/instrument/framework/initialize.py#L124 The timeout value could be increased, BUT it is not obvious that this will address the problem that brought us here. Retrying the write, or not caring and just moving on, are two different possible implementations. Talked with the bluesky developers' team today about this in our weekly conference call. @danielballan suggested we might collaborate in a troubleshooting session. When might a couple of hours be available for a team debugging session? Others could be @tacaswell, @jklynch, or @mrakitin. @klauer would be a good addition, but SLAC has their own hands very full at this time. If we could identify the conditions to reproduce the problem, that's a big bonus!
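To make the two alternatives concrete, a rough sketch (the `put_with_retry` helper is hypothetical, not existing instrument code, and it assumes a recent ophyd where the status object has a `wait()` method):

```python
import logging

logger = logging.getLogger(__name__)

def put_with_retry(signal, value, attempts=3, timeout=10):
    """Write value to signal; on failure, retry a few times, then move on."""
    for attempt in range(1, attempts + 1):
        try:
            # set() returns a status object; wait() raises if it does not
            # finish within the timeout
            signal.set(value, timeout=timeout).wait()
            return True
        except Exception as exc:
            logger.warning("write to %s failed (attempt %d/%d): %s",
                           signal.name, attempt, attempts, exc)
    # the "don't care, just move on" branch: log it and keep collecting data
    logger.warning("giving up on %s, moving on", signal.name)
    return False
```

Note this only sketches the control flow; if the first set() is still stuck internally, a retry may itself fail with the same "Another set() call is still in progress" error.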
May be worth installing
Starting Monday, USAXS is not running; we switch to BNP in the front hutch. We can run BS (in a mode where it does not control the beamline) for the whole of October, if needed.
Set default timeout on [`EpicsSignalBase`](bluesky/ophyd#917)
Watch bluesky/ophyd#926 for the upstream fix. Interim fix for
Also, if the write operation fails due to timeout, the signal will log the timeout at the warn level. To suppress that log message (and others at warn or below), call
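One conventional way to do that, assuming the warning is emitted through Python's standard logging under the "ophyd" logger namespace, is to raise that logger's level:

```python
import logging

# Hide warn-level (and lower) messages from ophyd's loggers; errors still show.
logging.getLogger("ophyd").setLevel(logging.ERROR)
```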
By setting a timeout in the write command, the status object is allowed to complete its job. In the case of
Can also set the
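As a sketch of the write-with-timeout idea (the PV and signal names here are illustrative only, not the instrument's actual PVs):

```python
from ophyd import EpicsSignal

# illustrative signal; the real instrument PV name differs
user_data_state = EpicsSignal("XX:USAXS:state", name="user_data_state", string=True)

# Passing a timeout to set() means the returned status object always
# completes, either successfully or with a timeout error, instead of
# hanging forever.
status = user_data_state.set("fly scan complete", timeout=10)
status.wait()  # raises on timeout rather than blocking indefinitely
```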
@jilavsky When can we test this?
Without beam, any time. I am off on vacation from Friday until the end of next week. Either Thursday 10/29 or when I get back, 11/9.
Now, there is a new method in ophyd (since the 1.5.4 release) to set default (read) timeout, write_timeout, connection_timeout, and auto_monitor values for any instances of EpicsSignal. This must be called before creating any EpicsSignal, so it goes early in the startup code, before any devices are created.
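For reference, a sketch of what that call looks like (the numeric values here are examples, not the instrument's chosen settings):

```python
from ophyd.signal import EpicsSignalBase

# Must run before any EpicsSignal is created, i.e. early in the startup code.
EpicsSignalBase.set_defaults(
    auto_monitor=True,     # rely on CA monitors instead of explicit gets
    connection_timeout=5,  # seconds to wait for a PV to connect
    timeout=60,            # read timeout, seconds
    write_timeout=60,      # put-completion timeout, seconds
)
```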
To use that feature, we need to use the
ophyd updated
What input are we waiting for here? I will be setting up next week, 17-18 November. I want to test this at that time. My suggestion: install it and let's run this as a dummy data collection overnight on 17-18. I should have the instrument set up before lunch, assuming stuff works. Plan to run tests after that.
Testing: can we test this by killing the source of PVs while collecting data? For example, take down the Linkam soft IOC on the Gaga computer. That will make the PVs go away. If this is fixed, the data collection should continue, maybe with some delays for timeouts, but it should continue. That would be proof it works as expected (= as spec does).
I'm hoping that once we get control of timeouts, we can operate smoothly. Might have to implement some try..except code to implement the _keep going_ idea.
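A minimal sketch of that _keep going_ idea (the helper is hypothetical, not existing instrument code):

```python
def write_state_keep_going(signal, value, timeout=10):
    """Try to record the instrument state; on failure, log it and keep going."""
    try:
        signal.set(value, timeout=timeout).wait()
    except Exception as exc:
        # do not let a failed status write hang or abort the scan
        signal.log.warning("could not write %r to %s: %s -- continuing",
                           value, signal.name, exc)
```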
The triggering exception here was `Another set() call is still in progress for user_data_state`.
The root cause, as determined in an ophyd issue, is that a write timed out and left its set() call marked as still in progress, blocking every later write to that signal. We fixed this now (as we did for #412) by setting default values for all timeouts and a default of `auto_monitor=True`. This problem is fixed by controlling the timeouts and using the CA monitors. This required upstream changes in ophyd to create the `EpicsSignalBase.set_defaults()` method.
This morning I found the instrument hanging for over 3 hours at the end of the Flyscan state:
```
D Thu-02:14:46 - 45.05 8.6517096 -0.35955 11.04218 1807 39.77
D Thu-02:14:51 - 50.05 8.6033570 -0.53655 10.26343 2313 44.78
D Thu-02:14:56 - 55.05 8.5424353 -0.76758 9.28104 2851 49.78
D Thu-02:15:01 - 60.06 8.4637193 -1.06393 8.01275 3415 54.78
D Thu-02:15:06 - 65.06 8.3644328 -1.42022 6.44452 3999 59.78
D Thu-02:15:11 - 70.06 8.2453133 -1.86869 4.50669 4599 64.78
D Thu-02:15:16 - 75.06 8.1056845 -2.40073 2.22846 5212 69.78
D Thu-02:15:21 - 80.06 7.9339730 -3.01658 -0.51643 5836 74.78
D Thu-02:15:26 - 85.07 7.7379286 -3.75201 -3.69713 6470 79.80
D Thu-02:15:31 - 90.07 7.5066553 -4.61030 -7.38525 7109 84.80
D Thu-02:15:36 - 95.08 7.2468560 -5.57339 -11.56313 7755 89.80
D Thu-02:15:40 - writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5
I Thu-02:15:40 - 99.11 7.0906980 -6.10326 -13.86655 7999 93.80
D Thu-02:15:40 - 99.11674690246582s - progress_reporting is done
```
No error was reported, nor is there any obvious reason.
As far as I can tell, all hardware seems to be OK and able to move; just nothing is happening. Not clear what to do, actually.
The instrument was in remote operations. I did ctrl-C twice; a single ctrl-C was not enough. Got the regular RE abort, halt, stop, resume options.
Tried RE.resume() and the system resumed and was back in wait mode:
```
In [8]: RE.resume()
D Thu-05:32:20 - Error: writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5 - Another set() call is still in progress for user_data_state
I Thu-05:32:20 - finished writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5
I Thu-05:32:20 - HDF5 output complete: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5
New stream: 'mca'
D Thu-05:32:20 - after return: 11898.765989303589s
W Thu-05:32:20 - Exception during execution of command usaxs.mac:
Another set() call is still in progress for user_data_state
I Thu-05:32:20 - waiting for next user command
```
Tried preUSAXStune using remote OPS and got:
```
W Thu-05:33:46 - Exception during execution of command preUSAXStune:
Another set() call is still in progress for user_data_state
I Thu-05:33:46 - waiting for next user command
```
Obviously, nothing was working because another set() call was still in progress.
I stopped remote ops, restarted BS, and tried to run, but the instrument was at the end of the flyscan, so preUSAXStune failed.
I ran resetUSAXS from spec, which returned to Q=0.
I ran a few preUSAXStunes from BS to realign the instrument.
I restarted remote operations, which now work; back to data collection.
Lost 3.5 hours of beamtime, nothing major, but kind of annoying. We will run spec for most of the rest of this week due to the use of heaters, so this is a November issue.