Skip to content
This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

Flyscan hanging for 3 hours with no obvious reason #417

Closed
jilavsky opened this issue Sep 24, 2020 · 28 comments
Closed

Flyscan hanging for 3 hours with no obvious reason #417

jilavsky opened this issue Sep 24, 2020 · 28 comments
Assignees
Labels
bug Something isn't working high priority more urgent

Comments

@jilavsky
Copy link
Contributor

Today morning I found instrument hanging for over 3 hours at the end of Flyscan state:

D Thu-02:14:46 - 45.05 8.6517096 -0.35955 11.04218 1807 39.77
D Thu-02:14:51 - 50.05 8.6033570 -0.53655 10.26343 2313 44.78
D Thu-02:14:56 - 55.05 8.5424353 -0.76758 9.28104 2851 49.78
D Thu-02:15:01 - 60.06 8.4637193 -1.06393 8.01275 3415 54.78
D Thu-02:15:06 - 65.06 8.3644328 -1.42022 6.44452 3999 59.78
D Thu-02:15:11 - 70.06 8.2453133 -1.86869 4.50669 4599 64.78
D Thu-02:15:16 - 75.06 8.1056845 -2.40073 2.22846 5212 69.78
D Thu-02:15:21 - 80.06 7.9339730 -3.01658 -0.51643 5836 74.78
D Thu-02:15:26 - 85.07 7.7379286 -3.75201 -3.69713 6470 79.80
D Thu-02:15:31 - 90.07 7.5066553 -4.61030 -7.38525 7109 84.80
D Thu-02:15:36 - 95.08 7.2468560 -5.57339 -11.56313 7755 89.80
D Thu-02:15:40 - writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5
I Thu-02:15:40 - 99.11 7.0906980 -6.10326 -13.86655 7999 93.80
D Thu-02:15:40 - 99.11674690246582s - progress_reporting is done

--- no report reported or any obvious reason.

As far as I can say, all hardware seems to be OK and able to move, just nothing is happening. Not clear, what to do, actually.

Instrument was in remote operations. I did ctrl-C twice, single ctrl-C was not enough. Got regulart RE.abort, halt,stop, resume options.

tried to do RE.resume and system resumed and was back in wait mode:

In [8]: RE.resume()
D Thu-05:32:20 - Error: writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5 - Another set() call is still in progress for user_data_state
I Thu-05:32:20 - finished writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5
I Thu-05:32:20 - HDF5 output complete: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5
New stream: 'mca'
D Thu-05:32:20 - after return: 11898.765989303589s
W Thu-05:32:20 - Exception during execution of command usaxs.mac:
Another set() call is still in progress for user_data_state
I Thu-05:32:20 - waiting for next user command

tried preUSAXStune using remote OPS and

W Thu-05:33:46 - Exception during execution of command preUSAXStune:
Another set() call is still in progress for user_data_state
I Thu-05:33:46 - waiting for next user command

obviously, nothing was working for another set() call.

I stopped remote ops, restarted BS, tried to run, but instrument was at the end of flyscan so preUSAXStune failed.
I run resetUSAXS from spec, that returned to Q=0.
I run few preUSAXStunes from BS to realign instrument.
I restarted remote operations, which now works, back to data collection.

lost 3.5 hours of beamtime, nothing major. But kind of annoying. We will run spec most of the rest of this week due to use of heaters, so this is November issue.

@jilavsky jilavsky added bug Something isn't working high priority more urgent labels Sep 24, 2020
@jilavsky jilavsky added this to the 2020-11 operations milestone Sep 24, 2020
@jilavsky
Copy link
Contributor Author

And hanged again at 12:07, found out at 12:50.

D Thu-12:06:05 - make_filename: /mnt/share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/muk_31_0414.jpg
Transient Scan ID: 325 Time: 2020-09-24 12:06:072020-09/09_24_Myers/09_24_Myers_usaxs/
Persistent Unique Scan ID: '42608c22-5c3e-48ac-9fd9-1a198e13ee5a'
New stream: 'baseline'
New stream: 'aps_current_monitor'
I Thu-12:06:08 - HDF5 config: /share1/AreaDetectorConfig/FlyScan_config/saveFlyData.xml
I Thu-12:06:08 - HDF5 output: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/muk_31_0414.h5
D Thu-12:06:08 - progress_reporting has arrived
I Thu-12:06:09 - flying, s ar, deg ay, mm dy, mm channel elapsed, s
D Thu-12:06:13 - 5.01 8.7493427 0.00645 12.62816 0 0.00
D Thu-12:06:18 - 10.01 8.7479764 0.00119 12.60535 46 4.77
D Thu-12:06:23 - 15.02 8.7471126 -0.00196 12.59177 79 9.77
D Thu-12:06:28 - 20.02 8.7446726 -0.01122 12.55051 165 14.77
D Thu-12:06:33 - 25.03 8.7390203 -0.03285 12.45819 336 19.77
D Thu-12:06:38 - 30.03 8.7285692 -0.07247 12.28669 596 24.78
D Thu-12:06:43 - 35.04 8.7118361 -0.13512 12.01442 937 29.78
D Thu-12:06:48 - 40.04 8.6864256 -0.22989 11.60182 1345 34.78
D Thu-12:06:53 - 45.05 8.6512936 -0.36171 11.03485 1809 39.80
D Thu-12:06:58 - 50.05 8.6035025 -0.53855 10.26002 2314 44.80
D Thu-12:07:03 - 55.06 8.5423541 -0.77094 9.25361 2852 49.80
D Thu-12:07:08 - 60.06 8.4621219 -1.06383 7.97841 3416 54.80
D Thu-12:07:13 - 65.06 8.3654398 -1.42541 6.42622 4002 59.82
D Thu-12:07:18 - 70.07 8.2452665 -1.87176 4.48457 4602 64.82
D Thu-12:07:23 - 75.08 8.1039427 -2.39470 2.20956 5215 69.82
D Thu-12:07:28 - 80.08 7.9331157 -3.02129 -0.53868 5841 74.83
D Thu-12:07:33 - 85.09 7.7371766 -3.75420 -3.69016 6473 79.83
D Thu-12:07:38 - 90.10 7.5054829 -4.61611 -7.41519 7112 84.83
D Thu-12:07:43 - 95.10 7.2444942 -5.58380 -11.61098 7760 89.85
D Thu-12:07:47 - writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/muk_31_0414.h5
I Thu-12:07:47 - 98.82 7.0907560 -6.10323 -13.86644 7999 93.54
D Thu-12:07:47 - 98.82629561424255s - progress_reporting is done

restarting BS.

This is unusable, will finish this set of samples and switch to spec, I cannot burn beamtime this way...

@prjemian
Copy link
Contributor

Tragically, this error message:

D Thu-05:32:20 - Error: writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_23_Myers/09_23_Myers_usaxs/NREL_Yawei_1_0034.h5 - Another set() call is still in progress for user_data_state

says it failed when trying to write the status to a PV (ophyd object user_data.state, PV 9idcLAX:state)

@prjemian
Copy link
Contributor

I'm assuming that, like the PV timeout problems we had last fall, this is another piece of fundamental ophyd infrastructure that needs rethinking.

@prjemian
Copy link
Contributor

These are the last lines from the log (also shows the restart):

|2020-09-24 12:06:05.567|DEBUG|47615|bluesky-session|for_apstools|161|bluesky-run-engine| - write_path: /mnt/share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/
|2020-09-24 12:06:08.478|INFO|47615|bluesky-session|usaxs_fly_scan|143|Thread-112353| - HDF5 config: /share1/AreaDetectorConfig/FlyScan_config/saveFlyData.xml
|2020-09-24 12:06:08.478|INFO|47615|bluesky-session|usaxs_fly_scan|144|Thread-112353| - HDF5 output: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/muk_31_0414.h5
|2020-09-24 12:06:08.481|DEBUG|47615|bluesky-session|usaxs_fly_scan|99|Thread-112355| - progress_reporting has arrived
|2020-09-24 12:06:09.082|INFO|47615|bluesky-session|usaxs_fly_scan|106|Thread-112355| - flying, s    ar, deg      ay, mm       dy, mm       channel      elapsed, s 
|2020-09-24 12:06:13.488|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 5.01         8.7493427    0.00645      12.62816     0            0.00       
|2020-09-24 12:06:18.494|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 10.01        8.7479764    0.00119      12.60535     46           4.77       
|2020-09-24 12:06:23.499|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 15.02        8.7471126    -0.00196     12.59177     79           9.77       
|2020-09-24 12:06:28.502|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 20.02        8.7446726    -0.01122     12.55051     165          14.77      
|2020-09-24 12:06:33.508|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 25.03        8.7390203    -0.03285     12.45819     336          19.77      
|2020-09-24 12:06:38.512|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 30.03        8.7285692    -0.07247     12.28669     596          24.78      
|2020-09-24 12:06:43.518|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 35.04        8.7118361    -0.13512     12.01442     937          29.78      
|2020-09-24 12:06:48.521|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 40.04        8.6864256    -0.22989     11.60182     1345         34.78      
|2020-09-24 12:06:53.526|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 45.05        8.6512936    -0.36171     11.03485     1809         39.80      
|2020-09-24 12:06:58.530|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 50.05        8.6035025    -0.53855     10.26002     2314         44.80      
|2020-09-24 12:07:03.538|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 55.06        8.5423541    -0.77094     9.25361      2852         49.80      
|2020-09-24 12:07:08.542|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 60.06        8.4621219    -1.06383     7.97841      3416         54.80      
|2020-09-24 12:07:13.544|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 65.06        8.3654398    -1.42541     6.42622      4002         59.82      
|2020-09-24 12:07:18.554|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 70.07        8.2452665    -1.87176     4.48457      4602         64.82      
|2020-09-24 12:07:23.557|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 75.08        8.1039427    -2.39470     2.20956      5215         69.82      
|2020-09-24 12:07:28.562|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 80.08        7.9331157    -3.02129     -0.53868     5841         74.83      
|2020-09-24 12:07:33.567|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 85.09        7.7371766    -3.75420     -3.69016     6473         79.83      
|2020-09-24 12:07:38.577|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 90.10        7.5054829    -4.61611     -7.41519     7112         84.83      
|2020-09-24 12:07:43.582|DEBUG|47615|bluesky-session|usaxs_fly_scan|111|Thread-112355| - 95.10        7.2444942    -5.58380     -11.61098    7760         89.85      
|2020-09-24 12:07:47.300|DEBUG|47615|bluesky-session|usaxs_fly_scan|207|bluesky-run-engine| - writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/muk_31_0414.h5
|2020-09-24 12:07:47.302|INFO|47615|bluesky-session|usaxs_fly_scan|115|Thread-112355| - 98.82        7.0907560    -6.10323     -13.86644    7999         93.54      
|2020-09-24 12:07:47.303|DEBUG|47615|bluesky-session|usaxs_fly_scan|120|Thread-112355| - 98.82629561424255s - progress_reporting is done
|2020-09-24 12:45:09.245|DEBUG|47615|bluesky-session|usaxs_fly_scan|213|bluesky-run-engine| - Error: writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/muk_31_0414.h5 - 
|2020-09-24 12:45:09.246|INFO|47615|bluesky-session|usaxs_fly_scan|218|bluesky-run-engine| - finished writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/muk_31_0414.h5
|2020-09-24 12:45:09.368|INFO|47615|bluesky-session|usaxs_fly_scan|162|Thread-112382| - HDF5 output complete: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/muk_31_0414.h5
|2020-09-24 12:45:09.539|DEBUG|47615|bluesky-session|usaxs_fly_scan|230|bluesky-run-engine| - after return: 2341.0622260570526s
|2020-09-24 12:45:09.540|WARNING|47615|bluesky-session|autocollect|75|bluesky-run-engine| - Exception during execution of command usaxs.mac:
Another set() call is still in progress for user_data_state
|2020-09-24 12:45:09.540|INFO|47615|bluesky-session|autocollect|76|bluesky-run-engine| - waiting for next user command
|2020-09-24 12:45:44.587|INFO|25301|bluesky-session|session_logs|33|MainThread| - ############################################################ startup

@prjemian
Copy link
Contributor

FYI, normal process (with no errors) looks like this:

|2020-09-24 13:49:40.393|DEBUG|25301|bluesky-session|for_apstools|160|bluesky-run-engine| - make_filename: /mnt/share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/LBNL_12_0442.jpg
|2020-09-24 13:49:40.393|DEBUG|25301|bluesky-session|for_apstools|161|bluesky-run-engine| - write_path: /mnt/share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/
|2020-09-24 13:49:43.532|INFO|25301|bluesky-session|usaxs_fly_scan|143|Thread-19538| - HDF5 config: /share1/AreaDetectorConfig/FlyScan_config/saveFlyData.xml
|2020-09-24 13:49:43.533|INFO|25301|bluesky-session|usaxs_fly_scan|144|Thread-19538| - HDF5 output: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/LBNL_12_0442.h5
|2020-09-24 13:49:43.534|DEBUG|25301|bluesky-session|usaxs_fly_scan|99|Thread-19540| - progress_reporting has arrived
|2020-09-24 13:49:44.145|INFO|25301|bluesky-session|usaxs_fly_scan|106|Thread-19540| - flying, s    ar, deg      ay, mm       dy, mm       channel      elapsed, s 
|2020-09-24 13:49:48.540|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 5.01         8.7492720    0.00605      12.62684     0            0.00       
|2020-09-24 13:49:53.545|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 10.01        8.7479910    0.00113      12.60891     46           4.78       
|2020-09-24 13:49:58.555|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 15.02        8.7471293    -0.00192     12.59472     79           9.78       
|2020-09-24 13:50:03.563|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 20.03        8.7446602    -0.01154     12.56077     165          14.80      
|2020-09-24 13:50:08.564|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 25.03        8.7390842    -0.03290     12.47768     336          19.80      
|2020-09-24 13:50:13.567|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 30.03        8.7288099    -0.07205     12.32253     596          24.80      
|2020-09-24 13:50:18.577|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 35.04        8.7118501    -0.13555     12.06562     938          29.82      
|2020-09-24 13:50:23.581|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 40.05        8.6864396    -0.22937     11.67498     1347         34.82      
|2020-09-24 13:50:28.583|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 45.05        8.6515776    -0.36339     11.13746     1809         39.82      
|2020-09-24 13:50:33.587|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 50.05        8.6031222    -0.54090     10.39030     2314         44.82      
|2020-09-24 13:50:38.593|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 55.06        8.5419081    -0.77240     9.43939      2854         49.82      
|2020-09-24 13:50:43.600|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 60.07        8.4621402    -1.06663     8.21589      3418         54.83      
|2020-09-24 13:50:48.603|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 65.07        8.3654531    -1.42885     6.68914      4002         59.83      
|2020-09-24 13:50:53.607|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 70.07        8.2448098    -1.86779     4.80353      4604         64.83      
|2020-09-24 13:50:58.611|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 75.08        8.1039563    -2.39585     2.59782      5217         69.85      
|2020-09-24 13:51:03.616|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 80.08        7.9331287    -3.03608     -0.07875     5841         74.85      
|2020-09-24 13:51:08.621|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 85.09        7.7371900    -3.75862     -3.17921     6473         79.85      
|2020-09-24 13:51:13.624|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 90.09        7.5046098    -4.62177     -6.83601     7113         84.85      
|2020-09-24 13:51:18.629|DEBUG|25301|bluesky-session|usaxs_fly_scan|111|Thread-19540| - 95.10        7.2445072    -5.58583     -10.90616    7760         89.87      
|2020-09-24 13:51:22.935|DEBUG|25301|bluesky-session|usaxs_fly_scan|207|bluesky-run-engine| - writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/LBNL_12_0442.h5
|2020-09-24 13:51:22.940|INFO|25301|bluesky-session|usaxs_fly_scan|218|bluesky-run-engine| - finished writing fly scan HDF5 file: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/LBNL_12_0442.h5
|2020-09-24 13:51:22.956|INFO|25301|bluesky-session|usaxs_fly_scan|115|Thread-19540| - 99.41        7.0907690    -6.10323     -13.86642    7999         94.15      
|2020-09-24 13:51:22.960|DEBUG|25301|bluesky-session|usaxs_fly_scan|120|Thread-19540| - 99.4286744594574s - progress_reporting is done
|2020-09-24 13:51:23.119|INFO|25301|bluesky-session|usaxs_fly_scan|162|Thread-19549| - HDF5 output complete: /share1/USAXS_data/2020-09/09_24_Myers/09_24_Myers_usaxs/LBNL_12_0442.h5
|2020-09-24 13:51:23.262|DEBUG|25301|bluesky-session|usaxs_fly_scan|230|bluesky-run-engine| - after return: 99.73098731040955s
|2020-09-24 13:51:25.574|INFO|25301|bluesky-session|autocollect|76|bluesky-run-engine| - waiting for next user command

@prjemian
Copy link
Contributor

prjemian commented Sep 24, 2020

@prjemian
Copy link
Contributor

Such as (from bluesky/ophyd#757 (comment)):

user_data.state._set_thread = None

@jilavsky
Copy link
Contributor Author

Whatever is going to be reasonable fix.
Is ophyd really failing because it cannot write a single PV? It stops retrying? I thought there was routine which was returning back to PVs in increasing time period and retrying until user stops it.
PV timeouts are so common (at least at 9ID), that ophyd will be very fragile. Looks like something which should be fixed upstream?

@prjemian
Copy link
Contributor

We have timeout handling already.

@prjemian
Copy link
Contributor

But yes, it appears this was due to a timeout when writing the instrument state. The timeout left a .set() thread running. This dangling thread triggered an exception the next time state was written. Sloppy.

@prjemian
Copy link
Contributor

This draft PR proposes a way out of the problem: bluesky/ophyd#901

prjemian added a commit that referenced this issue Sep 24, 2020
@prjemian
Copy link
Contributor

We set a default timeout of 10 seconds : https://github.com/APS-USAXS/ipython-usaxs/blob/bf50e57de6849961db89e845a9f4e96e03288d9d/profile_bluesky/startup/instrument/framework/initialize.py#L124

The timeout value could be increased. BUT it is not obvious that will address the problem that brought us here. Either retry or don't care, just move on are two different implementations.

Talked with the bluesky developers' team today about this in our weekly conference call. @danielballan suggested we might collaborate in a troubleshooting session. When might a couple hours of such time be available for a team debugging session? Others could be @tacaswell, @jklynch, or @mrakitin. @klauer would be a good addition but SLAC has their own hands very full at this time. If we could identify the conditions to reproduce the problem, that's a big bonus!

@klauer
Copy link

klauer commented Sep 25, 2020

May be worth installing py-spy and seeing what it shows at that point (https://github.com/benfred/py-spy)

@jilavsky
Copy link
Contributor Author

Starting Monday USAXS is not running. We switch to BNP in front hutch. We can run BS (in mode when it does not control beamline) for whole October, if needed.
I would like to remind you again, that the missing PV is - based on experience - transient and relatively rare problem. If there is way to monitor the issue somehow and run BS for long time in some loop, we have a chance. But we really do not know how to provoke this issue. That is the main reason no one (BCDA or IT) ever found source and fix.

prjemian added a commit that referenced this issue Oct 6, 2020
Set default timeout on [`EpicsSignalBase`](bluesky/ophyd#917)
@prjemian
Copy link
Contributor

prjemian commented Oct 28, 2020

Watch bluesky/ophyd#926 for upstream fix. Interim fix for Another set() call is still in progress exception is to add a timeout=10 (or better value, seconds) keyword argument when changing the affected signal, such as bps.mv(signal, value, timeout=10). The timeout kwarg is passed through to the signal's .set() method.

@prjemian prjemian changed the title FLyscan hanging for 3 hours with no obvious reason Flyscan hanging for 3 hours with no obvious reason Oct 28, 2020
@prjemian
Copy link
Contributor

Also, if the write operation fails due to timeout, the signal will log the timeout at the warn level. To suppress that log message (and others at warn or below), call signal_name.log.setLevel("ERROR"). Only error or higher logs will be printed.

@prjemian
Copy link
Contributor

By setting a timeout in the write command, the status object is allowed to complete its job. In the case of user_data.state, we are announcing the current state of the instrument and we (the data acquisition software) really don't care if anyone is listening. Thus, the timeout should be really short so the status object is complete before we try to write the status again. That means really short, such as 0.1 s.

@prjemian
Copy link
Contributor

Can also set the write_timeout when creating the signal: https://github.com/bluesky/ophyd/blob/3dddf111946c1eb7200361d2158445db2acfb76d/ophyd/signal.py#L740. Pass it as a kwarg to Component().

prjemian added a commit that referenced this issue Oct 28, 2020
@prjemian
Copy link
Contributor

@jilavsky When can we test this?

@jilavsky
Copy link
Contributor Author

Without beam any time. I am off on vacations from Friday until end of next week. Either Thursday 10/29 or when I get back, 11/9.
With beam during USAXS setup on 11/17 and 11/18. Have users starting 11/19.

@prjemian
Copy link
Contributor

prjemian commented Nov 4, 2020

Now, there is a new method in ophyd (since the 1.5.4 release) to set default (read) timeout, write_timeout, connection_timeout, and auto_monitor values for any instances of EpicsSignal. This must be called before creating any EpicsSignal so it goes in instrument/framework/initialize.py. It is expected these settings will resolve this problem.

@prjemian
Copy link
Contributor

prjemian commented Nov 4, 2020

To use that feature, we need to use the ophyd master branch until the next release.

@prjemian
Copy link
Contributor

prjemian commented Nov 4, 2020

ophyd updated

prjemian added a commit that referenced this issue Nov 4, 2020
@jilavsky
Copy link
Contributor Author

What input are we waiting here for? I will be setting up next week 17 - 18 November. I want to test this at that time. My suggestion, install and let's run this as dummy data collection over night 17-18. I should have instrument setup before lunch, assuming stuff works. Plan to run tests after that.

@jilavsky
Copy link
Contributor Author

Testing - can we test this by killing source of PVs while collecting data? Like take down the Linkam soft ioc on Gaga computer. That will make PVs go away. If this is fixed, the data collection should continue, may be with some delays for timeouts, but should continue. That would be proof it works as expected (= as spec does).

@prjemian
Copy link
Contributor

prjemian commented Nov 12, 2020 via email

@prjemian
Copy link
Contributor

The triggering exception here was

Another set() call is still in progress for user_data_state

The root cause, as determined in an ophyd issue is that the timeout (specifically, the write_timeout) needs to be set to a finite value so the status object can complete its workflow. With timeout=None (the previous default), the status object will wait until its ._finished() method is called. Since that has not happened yet, the .set() call is still in progress.

We fixed this now (as we did for #412) by setting default values for all timeouts and default of auto_monitor=True (use EPICS Channel Access monitors to cache updated PV values locally): https://github.com/APS-USAXS/ipython-usaxs/blob/e0c913c8613203dd9400f0911eda64f581949770/profile_bluesky/startup/instrument/framework/initialize.py#L120-L125

This problem is fixed by controlling the timeouts and using the CA monitors. Required upstream changes in ophyd to create the ophyd.signal. EpicsSignalBase.set_defaults() method.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working high priority more urgent
Projects
None yet
Development

No branches or pull requests

3 participants