-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
model_read_lib implementation #68
Conversation
This implements a library for reading ML outputs and also includes point counts, because using the same approach for indexing helps with consistency.
"Fake" input that can get a start on output inspection is atResilience_data_drive/dataML_tall_unofficial.zip Forgot to mention: symlinksDurrell had mentioned that part of this whole chunk / input / output scheme was that symlinks from the input/ directory from one chunk into the output/ directories of other chunks would represent data dependencies using the filesystem. In this spirit, I've written this to take inputs from models/input/ (all paths listed at the top of the R file), but these files don't exist in Drive. The idea is to create them as symlinks to files that do. The code right now does not add the symlinks or call drive_sync. I'll add another commit to this PR with symlinks, which also requires them to be un-.gitignored. There's probably a separate discussion to be had on the best long-term approach, maybe in issue #26. |
This adds symlinks from the models/input/ directory that point into the output/ directories where those files were generated. Note that this makes a bit of a mess in .gitignore.
I will look at it ASAP, the week is already too busy! Thanks! Durrell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Breaking my review down:
- Comments RE symlinks: @mschulist & I figured that we should 'autogenerate' the symlinks using calls to Linux/macOS (the same) from the top of the scripts and keep the .gitignore file clean. For an example see naive_occupancy_17-21.R lines 29-38:
# Creating the folder within inputs that contains the symlinks that point to the output directory from "/data_ingest/src/readPC.R"`
if (dir.exists(here("point_counts/naive_occupancy/input/")) == F) {
dir.create(here("point_counts/naive_occupancy/input/"))
}
# Creating a symlink from the output of /point_count/data_ingest/ to the input of /point_count/naive_occupancy/
PointC_filename <- "PointC_2022-03-21.csv"
link_create(here("point_counts", "data_ingest", "output", PointC_filename), here("point_counts", "naive_occupancy", "input", PointC_filename))
# Read the google data
PointC <- fread(here("point_counts", "naive_occupancy", "input", PointC_filename))
It is a pain, but we could make it a function that would save one line of code but be more opaque.
Also, I realize we still have the bug on the PointC filename. Yikes!
I read through your functions in |
The |
I used the code in this gist while developing. I figured this didn't need to be in GitHub, since it's a one-liner and the function is documented. Maybe it's worth adding an example usage section to the doc comment.
"feed into JAGS" is the intent of readCombined. It's supposed to "just work" as
I haven't actually tested it this way yet, but I think it's easy. I'll create another branch for that. (I'm keeping separate branches because I want to leave combined.R in a working state at main until reviews decide this is correct and complete enough to cut-over to.) |
If the linked-to path is going to be hard-coded into the source file that depends on / reads that input, for the purpose of creating the symlink, it's less obvious to me that the symlink is providing any benefit as the code-on-data dependencies would already be represented in the code. (Or from another view, since the code already "knows the real path.") And if it changes for some reason, we could end up with stale / confusing symlinks. I guess I had better watch the video. |
Hi @matth79:
BTW I think Patrick Ball basically did a lot of the creation of input symlinks by hand as he moved through his analyses. They left him a breadcrumb trail as he puzzled through and fixed his work. A final use case that's nice, is that the symlinks in multiple |
This is a bug that did not show itself when this was being called from a test script where species (plural) had been defined locally but revealed itself when combined.R was ported. It is indended that outerIndices specify which species and years to include and that readMl get to know those in case it can use them to make the query more efficient.
Hi @matth79 & @mkclapp, Mary, at least one of us should try to run this in its entirety (with the next PR, right Matt)? Do you agree? Can we have a brief face-to-face tomorrow to run through the workflow and all the issues and figure out where to clean those up most efficiently? Thanks! Durrell |
Hi all, |
Yes, face-to-face sounds good. I'll send an invite. I do agree that we want one or both of you to be able to run, at least the test code in the gist before merging. The changes to combined.R I kept separate so that there's the option to have this merged without breaking that. The fact is I think it is working end-to-end. |
Hi @mkclapp & @matth79, Let me know what you all decided. If nothing more is necessary, once we get your review (e.g. all good or whatever), Mary, I can merge the pull request. |
Was able to run from source() with no problem with @matth79 and @ddkapan as witness! |
This implements a library for reading ML outputs and also includes point counts, because using the same approach for indexing helps with consistency.