Problems during parallel computing with Dask #4
Comments
Please add a link to your python script in this issue. |
Please add a link to the notebook |
Thanks for your advice, Sarah, I added it. |
I have been investigating three stages: 1) data loading, 2) data preprocessing, including temporal and spatial resampling, and 3) the RF model. Data loading and preprocessing are not the problem; the RF model causes the memory problem. When I load the RF model outside of the function given to map_blocks and then pass the model object to map_blocks, the unmanaged memory is extremely high. As I said in the first post of this issue, when I tried to predict 1500 timesteps (the data for 1500 timesteps is 297 MB), I always hit the worker memory limit (240 GB). I checked again, and the memory is mostly unmanaged memory. After I changed the code to pass the model path to the map_blocks function instead of the model itself, the unmanaged memory looks normal, similar in size to the managed memory. But the RF model is only 245 MB, so I cannot understand why passing it causes so much unmanaged memory. |
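For reference, the "pass the path, not the model" pattern described above can be sketched roughly as follows. This is only a minimal illustration, not the actual notebook code: `load_model`, `predict_block` and the toy dict "model" are hypothetical stand-ins, and it assumes the real RF model is stored as a pickled file. It uses only the standard library so it runs without Dask.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=1)
def load_model(model_path):
    """Load the model from disk once per process and reuse it afterwards."""
    with open(model_path, "rb") as f:
        return pickle.load(f)


def predict_block(block, model_path):
    """Hypothetical per-block function: it receives a path, not the model object."""
    model = load_model(model_path)
    # Stand-in for model.predict(block): a toy linear "model" stored as a dict.
    return [model["slope"] * x + model["intercept"] for x in block]


# Toy stand-in for the trained RF model (assumption: the real one is a pickled estimator).
tmp_dir = tempfile.mkdtemp()
model_path = str(Path(tmp_dir) / "toy_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump({"slope": 2.0, "intercept": 1.0}, f)

result_a = predict_block([0.0, 1.0, 2.0], model_path)  # first call reads the file
result_b = predict_block([3.0], model_path)            # second call reuses the cached model
cache_info = load_model.cache_info()
```

With Dask, a function of this shape could be given to `map_blocks` with the path as a keyword argument (extra keyword arguments are forwarded to the function), so only a short string travels with each task and each worker process loads the model itself. Whether this is the reason the unmanaged memory dropped has not been confirmed in this thread.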
This experiment is still for the 5 degree area. Predicting 2000 or 5000 timesteps is now no problem. Predicting 10000 timesteps failed until I increased either the memory or the number of CPUs. Predicting 17000 timesteps failed until I increased both the memory and the number of CPUs. |
@geek-yang and @fnattino see the issue here |
@geek-yang and @fnattino, |
@geek-yang and @fnattino I prepared the two Jupyter notebooks. The code is on GitHub now: 1) 1 year in 10 degree area: For the 10 degree area, although I managed to make it run, could you help me check whether the script is correct? Specifically, I have the following 4 questions:
For the Europe area, I did not manage to make it run. The error is in cell 55 of https://github.com/EcoExtreML/Emulator/blob/main/2daskParallel/0921_1year_Europe.ipynb. It looks like a data size problem, but even when I tried to predict 151 * 151 pixels (the 10 degree area is 101 * 101 pixels), it gave me the same error. All the input data is on Snellius, so you can run my script there directly. I am using a fat node with 32 CPUs and 240 GB memory. |
Based on the progress we made on June 27th, we managed to predict 100 timesteps with Dask. I also managed to reduce the size of the trained RF model from 15 GB to 245 MB. I can now predict 200 timesteps (with 240 GB memory).
However, when I try to predict 1500 timesteps (the data for 1500 timesteps is 297 MB), I always hit the worker memory limit (240 GB), no matter how many workers I use (4/32/64); threads_per_worker is always 1. When I requested 960 GB, I still hit the worker memory limit. When I use my Python script (https://github.com/EcoExtreML/Emulator/blob/main/1computationBlockTest/2read10kminput-halfhourly-0616.py) without Dask to predict the whole year (17000 timesteps), it uses 100 GB memory. I do not understand why Dask needs so much memory. Could you give some advice on this problem? The script is at https://github.com/EcoExtreML/Emulator/blob/main/1computationBlockTest/2read10kminput-halfhourly-0628.ipynb.
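To put the numbers above side by side, here is a rough back-of-the-envelope sketch. It assumes the input size scales linearly with the number of timesteps, and (as a pure assumption, since the post does not say how the limit is configured) that the 240 GB is a total that is split evenly across workers.

```python
# Figures quoted in the post.
data_mb_1500 = 297.0          # input data for 1500 timesteps, in MB
timesteps_measured = 1500
timesteps_full_year = 17000

# Assumption: input size grows linearly with the number of timesteps.
mb_per_timestep = data_mb_1500 / timesteps_measured
full_year_mb = mb_per_timestep * timesteps_full_year
full_year_gb = full_year_mb / 1000

# Assumption: 240 GB is the total, shared evenly by the workers.
total_memory_gb = 240
per_worker_gb = {n: total_memory_gb / n for n in (4, 32, 64)}

print(f"Estimated full-year input: {full_year_gb:.2f} GB")
for n_workers, gb in per_worker_gb.items():
    print(f"{n_workers} workers -> {gb} GB per worker")
```

Under these assumptions the raw input for the full year is only a few GB, so the input data alone would not account for hitting a 240 GB limit.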