Flyscan hanging for 3 hours with no obvious reason #417
And it hung again at 12:07, found out at 12:50. D Thu-12:06:05 - make_filename: /mnt/share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/muk_31_0414.jpg. Restarting BS. This is unusable; I will finish this set of samples and switch to spec, I cannot burn beamtime this way...
Tragically, this error message:
says it failed when trying to write the status to a PV (ophyd object).
I'm assuming that, like the PV timeout problems we had last fall, this is another piece of fundamental ophyd infrastructure that needs rethinking.
These are the last lines from the log (also shows the restart):
FYI, normal process (with no errors) looks like this:
We could add code in the except block here: https://github.com/APS-USAXS/ipython-usaxs/blob/02a1d9c0af05628a084a2a392b98c3a39b32b44a/profile_bluesky/startup/instrument/devices/usaxs_fly_scan.py#L208-L213 to clear the stuck set() call.
Such as (from bluesky/ophyd#757 (comment)):
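A minimal sketch of that idea, assuming (per the upstream discussion) that the blocking state lives in the signal's private `_set_thread` attribute; this is an ophyd implementation detail, so treat it as a workaround rather than a supported API, and the `user_data.state` name is only illustrative:

```python
def clear_stuck_set(signal):
    """Forcibly clear a Signal whose earlier set() never finished, so later
    writes stop raising 'Another set() call is still in progress'."""
    # _set_thread is private to ophyd.Signal and may change between releases
    if getattr(signal, "_set_thread", None) is not None:
        signal.log.warning("clearing unfinished set() on %s", signal.name)
        signal._set_thread = None

# e.g. inside the fly scan's except block:
#     clear_stuck_set(user_data.state)
```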
Whatever is going to be a reasonable fix.
We have timeout handling already.
But yes, it appears this was due to a timeout when writing the instrument state. The timeout left a set() call still marked as in progress, which blocked every later write to that signal.
This draft PR proposes a way out of the problem: bluesky/ophyd#901
We set a default timeout of 10 seconds: https://github.com/APS-USAXS/ipython-usaxs/blob/bf50e57de6849961db89e845a9f4e96e03288d9d/profile_bluesky/startup/instrument/framework/initialize.py#L124 The timeout value could be increased, BUT it is not obvious that this will address the problem that brought us here. Retrying the write, or not caring and just moving on, are two different possible implementations. Talked with the bluesky developers' team today about this in our weekly conference call. @danielballan suggested we might collaborate in a troubleshooting session. When might a couple of hours be available for a team debugging session? Others could be @tacaswell, @jklynch, or @mrakitin. @klauer would be a good addition, but SLAC has their own hands very full at this time. If we could identify the conditions to reproduce the problem, that's a big bonus!
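To make the two alternatives concrete, a rough sketch (the `put_with_retry` helper is hypothetical, not existing instrument code, and it assumes a recent ophyd where the status object has a `wait()` method):

```python
import logging

logger = logging.getLogger(__name__)

def put_with_retry(signal, value, attempts=3, timeout=10):
    """Write value to signal; on failure, retry a few times, then move on."""
    for attempt in range(1, attempts + 1):
        try:
            # set() returns a status object; wait() raises if it does not
            # finish within the timeout
            signal.set(value, timeout=timeout).wait()
            return True
        except Exception as exc:
            logger.warning("write to %s failed (attempt %d/%d): %s",
                           signal.name, attempt, attempts, exc)
    # the "don't care, just move on" branch: log it and keep collecting data
    logger.warning("giving up on %s, moving on", signal.name)
    return False
```

Note this only sketches the control flow; if the first set() is still stuck internally, a retry may itself fail with the same "Another set() call is still in progress" error.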
May be worth installing
Starting Monday, USAXS is not running; we switch to BNP in the front hutch. We can run BS (in a mode where it does not control the beamline) for the whole of October, if needed.
Set default timeout on [`EpicsSignalBase`](bluesky/ophyd#917)
Watch bluesky/ophyd#926 for the upstream fix. Interim fix for
Also, if the write operation fails due to timeout, the signal will log the timeout at the warn level. To suppress that log message (and others at warn or below), call
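One conventional way to do that, assuming the warning is emitted through Python's standard logging under the "ophyd" logger namespace, is to raise that logger's level:

```python
import logging

# Hide warn-level (and lower) messages from ophyd's loggers; errors still show.
logging.getLogger("ophyd").setLevel(logging.ERROR)
```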
By setting a timeout in the write command, the status object is allowed to complete its job. In the case of
Can also set the
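As a sketch of the write-with-timeout idea (the PV and signal names here are illustrative only, not the instrument's actual PVs):

```python
from ophyd import EpicsSignal

# illustrative signal; the real instrument PV name differs
user_data_state = EpicsSignal("XX:USAXS:state", name="user_data_state", string=True)

# Passing a timeout to set() means the returned status object always
# completes, either successfully or with a timeout error, instead of
# hanging forever.
status = user_data_state.set("fly scan complete", timeout=10)
status.wait()  # raises on timeout rather than blocking indefinitely
```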
@jilavsky When can we test this?
Without beam, any time. I am off on vacation from Friday until the end of next week. Either Thursday 10/29 or when I get back, 11/9.
Now, there is a new method in ophyd (since the 1.5.4 release) to set default (read) timeout, write_timeout, connection_timeout, and auto_monitor values for any instances of EpicsSignal. This must be called before creating any EpicsSignal, so it goes early in the startup code, before any devices are created.
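For reference, a sketch of what that call looks like (the numeric values here are examples, not the instrument's chosen settings):

```python
from ophyd.signal import EpicsSignalBase

# Must run before any EpicsSignal is created, i.e. early in the startup code.
EpicsSignalBase.set_defaults(
    auto_monitor=True,     # rely on CA monitors instead of explicit gets
    connection_timeout=5,  # seconds to wait for a PV to connect
    timeout=60,            # read timeout, seconds
    write_timeout=60,      # put-completion timeout, seconds
)
```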
To use that feature, we need to use the
ophyd updated
What input are we waiting for here? I will be setting up next week, 17-18 November. I want to test this at that time. My suggestion: install it and let's run this as a dummy data collection overnight on 17-18. I should have the instrument set up before lunch, assuming stuff works. Plan to run tests after that.
Testing: can we test this by killing the source of PVs while collecting data? For example, take down the Linkam soft IOC on the Gaga computer. That will make the PVs go away. If this is fixed, the data collection should continue, maybe with some delays for timeouts, but it should continue. That would be proof it works as expected (= as spec does).
I'm hoping that once we get control of timeouts, we can operate smoothly. Might have to implement some try..except code to implement the _keep going_ idea.
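A minimal sketch of that _keep going_ idea (the helper is hypothetical, not existing instrument code):

```python
def write_state_keep_going(signal, value, timeout=10):
    """Try to record the instrument state; on failure, log it and keep going."""
    try:
        signal.set(value, timeout=timeout).wait()
    except Exception as exc:
        # do not let a failed status write hang or abort the scan
        signal.log.warning("could not write %r to %s: %s -- continuing",
                           value, signal.name, exc)
```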
The triggering exception here was `Another set() call is still in progress for user_data_state`.
The root cause, as determined in an ophyd issue, is that a write timed out and left its set() call marked as still in progress, blocking every later write to that signal. We fixed this now (as we did for #412) by setting default values for all timeouts and a default of `auto_monitor=True`. This problem is fixed by controlling the timeouts and using the CA monitors. This required upstream changes in ophyd to create the `EpicsSignalBase.set_defaults()` method.
This morning I found the instrument hanging for over 3 hours at the end of the Flyscan state:
```
D Thu-02:14:46 - 45.05 8.6517096 -0.35955 11.04218 1807 39.77
D Thu-02:14:51 - 50.05 8.6033570 -0.53655 10.26343 2313 44.78
D Thu-02:14:56 - 55.05 8.5424353 -0.76758 9.28104 2851 49.78
D Thu-02:15:01 - 60.06 8.4637193 -1.06393 8.01275 3415 54.78
D Thu-02:15:06 - 65.06 8.3644328 -1.42022 6.44452 3999 59.78
D Thu-02:15:11 - 70.06 8.2453133 -1.86869 4.50669 4599 64.78
D Thu-02:15:16 - 75.06 8.1056845 -2.40073 2.22846 5212 69.78
D Thu-02:15:21 - 80.06 7.9339730 -3.01658 -0.51643 5836 74.78
D Thu-02:15:26 - 85.07 7.7379286 -3.75201 -3.69713 6470 79.80
D Thu-02:15:31 - 90.07 7.5066553 -4.61030 -7.38525 7109 84.80
D Thu-02:15:36 - 95.08 7.2468560 -5.57339 -11.56313 7755 89.80
D Thu-02:15:40 - writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5
I Thu-02:15:40 - 99.11 7.0906980 -6.10326 -13.86655 7999 93.80
D Thu-02:15:40 - 99.11674690246582s - progress_reporting is done
```
No error was reported, nor is there any obvious reason.
As far as I can tell, all hardware seems to be OK and able to move; just nothing is happening. Not clear what to do, actually.
The instrument was in remote operations. I did ctrl-C twice; a single ctrl-C was not enough. Got the regular RE abort, halt, stop, resume options.
Tried RE.resume() and the system resumed and was back in wait mode:
```
In [8]: RE.resume()
D Thu-05:32:20 - Error: writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5 - Another set() call is still in progress for user_data_state
I Thu-05:32:20 - finished writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5
I Thu-05:32:20 - HDF5 output complete: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5
New stream: 'mca'
D Thu-05:32:20 - after return: 11898.765989303589s
W Thu-05:32:20 - Exception during execution of command usaxs.mac:
Another set() call is still in progress for user_data_state
I Thu-05:32:20 - waiting for next user command
```
Tried preUSAXStune using remote OPS and got:
```
W Thu-05:33:46 - Exception during execution of command preUSAXStune:
Another set() call is still in progress for user_data_state
I Thu-05:33:46 - waiting for next user command
```
Obviously, nothing was working because another set() call was still in progress.
I stopped remote ops, restarted BS, and tried to run, but the instrument was at the end of the flyscan, so preUSAXStune failed.
I ran resetUSAXS from spec, which returned to Q=0.
I ran a few preUSAXStunes from BS to realign the instrument.
I restarted remote operations, which now work; back to data collection.
Lost 3.5 hours of beamtime, nothing major, but kind of annoying. We will run spec for most of the rest of this week due to the use of heaters, so this is a November issue.