model_read_lib implementation #68

matt-har-vey · 2022-04-04T18:03:04Z

This implements a library for reading ML outputs and also includes point counts, because using the same approach for indexing helps with consistency.

It would be very worth checking out this branch before merging and calling runCombined on the tall CSV from Mark
- as an integration
- more importantly as a check on the correctness of this code
  - I did inspect results on a test tall CSV very carefully, and the entire matrices have sensible values. I didn't spot-check individual values, which is a way to catch certain kinds of bugs (e.g. index ordering)
I ran a bit ahead of my plan for small CLs, but since the tall CSV is ready, or close, I thought reviewing a RNBU-only implementation is starting to look like an avoidable detour

This implements a library for reading ML outputs and also includes point counts, because using the same approach for indexing helps with consistency.

matt-har-vey · 2022-04-04T21:44:53Z

"Fake" input that can get a start on output inspection is at

Resilience_data_drive/dataML_tall_unofficial.zip

Forgot to mention: symlinks

Durrell had mentioned that part of this whole chunk / input / output scheme was that symlinks from the input/ directory from one chunk into the output/ directories of other chunks would represent data dependencies using the filesystem.

In this spirit, I've written this to take inputs from models/input/ (all paths listed at the top of the R file), but these files don't exist in Drive. The idea is to create them as symlinks to files that do. The code right now does not add the symlinks or call drive_sync.

I'll add another commit to this PR with symlinks, which also requires them to be un-.gitignored. There's probably a separate discussion to be had on the best long-term approach, maybe in issue #26.

This adds symlinks from the models/input/ directory that point into the output/ directories where those files were generated. Note that this makes a bit of a mess in .gitignore.

ddkapan · 2022-04-04T23:57:33Z

I will look at it ASAP, the week is already too busy! Thanks! Durrell

ddkapan

Breaking my review down:

Comments RE symlinks: @mschulist & I figured that we should 'autogenerate' the symlinks using calls to Linux/macOS (the same) from the top of the scripts and keep the .gitignore file clean. For an example see naive_occupancy_17-21.R lines 29-38:

# Creating the folder within inputs that contains the symlinks that point to the output directory from "/data_ingest/src/readPC.R"`
if (dir.exists(here("point_counts/naive_occupancy/input/")) == F) {
  dir.create(here("point_counts/naive_occupancy/input/"))
}

# Creating a symlink from the output of /point_count/data_ingest/ to the input of /point_count/naive_occupancy/
PointC_filename <- "PointC_2022-03-21.csv"
link_create(here("point_counts", "data_ingest", "output", PointC_filename), here("point_counts", "naive_occupancy", "input", PointC_filename))
# Read the google data
PointC <- fread(here("point_counts", "naive_occupancy", "input", PointC_filename))

It is a pain, but we could make it a function that would save one line of code but be more opaque.

Also, I realize we still have the bug on the PointC filename. Yikes!

ddkapan · 2022-04-05T18:54:12Z

Review of the function definitions:

I read through your functions in model_read_lib.R. I've also downloaded the large datasets you offer for testing. However, to even get to the testing stage it looks like I will have to identify and format inputs (for instance the included functions drop in to replace bits of combined.R). Since I am working on another feature branch, it would be great to have a test script that calls the functions defined in model_read_lib.R and produces reproducible output that can feed into Jags. Maybe we can work with @mkclapp & @mschulist to create thereby integrating it fully into the workflow. Otherwise, all your changes look awesome and I'm ready to integrate them with the main code as long as we move into the next steps to use the functions as a part of building up to some model input in an example script. Thoughts?

mschulist · 2022-04-05T18:59:34Z

Comments RE symlinks: @mschulist & I figured that we should 'autogenerate' the symlinks using calls to Linux/macOS (the same) from the top of the scripts and keep the .gitignore file clean. For an example see [naive_occupancy_17-21.R]

The link_create() function uses the fs library. I can try to make a function that will simplify how we write symlinks in the code, although it is not a ton of work to write it as you showed in the example. Putting the symlink also makes it more readable and follows the "best practices" better.

matt-har-vey · 2022-04-05T20:22:05Z

it would be great to have a test script

I used the code in this gist while developing. I figured this didn't need to be in GitHub, since it's a one-liner and the function is documented. Maybe it's worth adding an example usage section to the doc comment.

reproducible output that can feed into Jags

"feed into JAGS" is the intent of readCombined. It's supposed to "just work" as

data <- readCombined(...)
jags(..., data=data, ...)

I haven't actually tested it this way yet, but I think it's easy. I'll create another branch for that. (I'm keeping separate branches because I want to leave combined.R in a working state at main until reviews decide this is correct and complete enough to cut-over to.)

matt-har-vey · 2022-04-05T20:25:43Z

'autogenerate' the symlinks using calls to Linux/macOS

If the linked-to path is going to be hard-coded into the source file that depends on / reads that input, for the purpose of creating the symlink, it's less obvious to me that the symlink is providing any benefit as the code-on-data dependencies would already be represented in the code. (Or from another view, since the code already "knows the real path.") And if it changes for some reason, we could end up with stale / confusing symlinks.

I guess I had better watch the video.

ddkapan · 2022-04-05T20:55:11Z

Hi @matth79:

thanks for the gist. I'm on a different branch now but will test with that when I can. It is just what I needed.
Regarding the symlink issue, following Patrick Ball's video, we are holding onto the idea that a human can read and understand the project by looking at the organization. We can easily reference the output from a different step as input for a given step, but then we have to look at code instead of files. RE stale symlinks, they are created each time and never synchronized so it is an entirely superfluous step unless you want to be able to read the directories by eye or somehow build a dependency graph, which is easy with a makefile/graphviz (though I haven't done it). So this is a definite area for discussion and active 'change'...

BTW I think Patrick Ball basically did a lot of the creation of input symlinks by hand as he moved through his analyses. They left him a breadcrumb trail as he puzzled through and fixed his work. A final use case that's nice, is that the symlinks in multiple .../chunkX/input/ .../chunkY/input/'s can be pointing to the same .../chunk/output/ (just as code can) but having that in the directory structure has human-readable has a small but not overriding benefit and we have to make sure if we have symlinks, we are not copying those to Google_Drive, so definitely we need to discuss and decide... Thanks!

This is a bug that did not show itself when this was being called from a test script where species (plural) had been defined locally but revealed itself when combined.R was ported. It is indended that outerIndices specify which species and years to include and that readMl get to know those in case it can use them to make the query more efficient.

ddkapan · 2022-04-06T22:30:03Z

Hi @matth79 & @mkclapp,

Mary, at least one of us should try to run this in its entirety (with the next PR, right Matt)? Do you agree?

Can we have a brief face-to-face tomorrow to run through the workflow and all the issues and figure out where to clean those up most efficiently?

Thanks!

Durrell

mkclapp · 2022-04-07T00:10:06Z

Hi all,
Apologies for my sluggishness in responding to these PRs/reviews, I've had my head down working on other things at Pat's and Angela's request. Yes, a face-to-face tomorrow would be really helpful for me. I'm flexible all day except for 2:30-3:30p.

matt-har-vey · 2022-04-07T00:37:07Z

Yes, face-to-face sounds good. I'll send an invite.

I do agree that we want one or both of you to be able to run, at least the test code in the gist before merging.

The changes to combined.R I kept separate so that there's the option to have this merged without breaking that. The fact is I think it is working end-to-end.

ddkapan · 2022-04-07T23:24:26Z

Hi @mkclapp & @matth79, Let me know what you all decided. If nothing more is necessary, once we get your review (e.g. all good or whatever), Mary, I can merge the pull request.

mkclapp · 2022-04-07T23:48:25Z

Was able to run from source() with no problem with @matth79 and @ddkapan as witness!

model_read_lib implementation

b7a520b

This implements a library for reading ML outputs and also includes point counts, because using the same approach for indexing helps with consistency.

matt-har-vey added the enhancement New feature or request label Apr 4, 2022

matt-har-vey requested review from ddkapan and mkclapp April 4, 2022 18:03

matt-har-vey assigned mkclapp Apr 4, 2022

mkclapp mentioned this pull request Apr 4, 2022

Clean up readML.R and make it tall and remove partial recordings #62

Closed

model_read_lib input symlinks

7cd9082

This adds symlinks from the models/input/ directory that point into the output/ directories where those files were generated. Note that this makes a bit of a mess in .gitignore.

Fixes a formatting error

e4dd26c

ddkapan reviewed Apr 5, 2022

View reviewed changes

mkclapp approved these changes Apr 7, 2022

View reviewed changes

ddkapan merged commit b91cbe4 into calacademy-research:main Apr 7, 2022

matt-har-vey deleted the readml-impl branch April 8, 2022 04:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

model_read_lib implementation #68

model_read_lib implementation #68

matt-har-vey commented Apr 4, 2022

matt-har-vey commented Apr 4, 2022

ddkapan commented Apr 4, 2022

ddkapan left a comment •

edited

Loading

ddkapan commented Apr 5, 2022 •

edited

Loading

mschulist commented Apr 5, 2022

matt-har-vey commented Apr 5, 2022

matt-har-vey commented Apr 5, 2022

ddkapan commented Apr 5, 2022

ddkapan commented Apr 6, 2022

mkclapp commented Apr 7, 2022

matt-har-vey commented Apr 7, 2022

ddkapan commented Apr 7, 2022

mkclapp commented Apr 7, 2022

model_read_lib implementation #68

model_read_lib implementation #68

Conversation

matt-har-vey commented Apr 4, 2022

matt-har-vey commented Apr 4, 2022

"Fake" input that can get a start on output inspection is at

Forgot to mention: symlinks

ddkapan commented Apr 4, 2022

ddkapan left a comment • edited Loading

Choose a reason for hiding this comment

ddkapan commented Apr 5, 2022 • edited Loading

mschulist commented Apr 5, 2022

matt-har-vey commented Apr 5, 2022

matt-har-vey commented Apr 5, 2022

ddkapan commented Apr 5, 2022

ddkapan commented Apr 6, 2022

mkclapp commented Apr 7, 2022

matt-har-vey commented Apr 7, 2022

ddkapan commented Apr 7, 2022

mkclapp commented Apr 7, 2022

ddkapan left a comment •

edited

Loading

ddkapan commented Apr 5, 2022 •

edited

Loading