Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

model_read_lib implementation #68

Merged
merged 4 commits into from
Apr 7, 2022

Conversation

matt-har-vey
Copy link
Collaborator

This implements a library for reading ML outputs and also includes point counts, because using the same approach for indexing helps with consistency.

  • It would be very worth checking out this branch before merging and calling runCombined on the tall CSV from Mark
    • as an integration
    • more importantly as a check on the correctness of this code
      • I did inspect results on a test tall CSV very carefully, and the entire matrices have sensible values. I didn't spot-check individual values, which is a way to catch certain kinds of bugs (e.g. index ordering)
  • I ran a bit ahead of my plan for small CLs, but since the tall CSV is ready, or close, I thought reviewing a RNBU-only implementation is starting to look like an avoidable detour

This implements a library for reading ML outputs and also includes point
counts, because using the same approach for indexing helps with
consistency.
@matt-har-vey
Copy link
Collaborator Author

"Fake" input that can get a start on output inspection is at

Resilience_data_drive/dataML_tall_unofficial.zip

Forgot to mention: symlinks

Durrell had mentioned that part of this whole chunk / input / output scheme was that symlinks from the input/ directory from one chunk into the output/ directories of other chunks would represent data dependencies using the filesystem.

In this spirit, I've written this to take inputs from models/input/ (all paths listed at the top of the R file), but these files don't exist in Drive. The idea is to create them as symlinks to files that do. The code right now does not add the symlinks or call drive_sync.

I'll add another commit to this PR with symlinks, which also requires them to be un-.gitignored. There's probably a separate discussion to be had on the best long-term approach, maybe in issue #26.

This adds symlinks from the models/input/ directory that point into the
output/ directories where those files were generated.

Note that this makes a bit of a mess in .gitignore.
@ddkapan
Copy link
Collaborator

ddkapan commented Apr 4, 2022

I will look at it ASAP, the week is already too busy! Thanks! Durrell

Copy link
Collaborator

@ddkapan ddkapan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking my review down:

  1. Comments RE symlinks: @mschulist & I figured that we should 'autogenerate' the symlinks using calls to Linux/macOS (the same) from the top of the scripts and keep the .gitignore file clean. For an example see naive_occupancy_17-21.R lines 29-38:
# Creating the folder within inputs that contains the symlinks that point to the output directory from "/data_ingest/src/readPC.R"`
if (dir.exists(here("point_counts/naive_occupancy/input/")) == F) {
  dir.create(here("point_counts/naive_occupancy/input/"))
}

# Creating a symlink from the output of /point_count/data_ingest/ to the input of /point_count/naive_occupancy/
PointC_filename <- "PointC_2022-03-21.csv"
link_create(here("point_counts", "data_ingest", "output", PointC_filename), here("point_counts", "naive_occupancy", "input", PointC_filename))
# Read the google data
PointC <- fread(here("point_counts", "naive_occupancy", "input", PointC_filename))

It is a pain, but we could make it a function that would save one line of code but be more opaque.

Also, I realize we still have the bug on the PointC filename. Yikes!

@ddkapan
Copy link
Collaborator

ddkapan commented Apr 5, 2022

  1. Review of the function definitions:

I read through your functions in model_read_lib.R. I've also downloaded the large datasets you offer for testing. However, to even get to the testing stage it looks like I will have to identify and format inputs (for instance the included functions drop in to replace bits of combined.R). Since I am working on another feature branch, it would be great to have a test script that calls the functions defined in model_read_lib.R and produces reproducible output that can feed into Jags. Maybe we can work with @mkclapp & @mschulist to create thereby integrating it fully into the workflow. Otherwise, all your changes look awesome and I'm ready to integrate them with the main code as long as we move into the next steps to use the functions as a part of building up to some model input in an example script. Thoughts?

@mschulist
Copy link
Collaborator

Comments RE symlinks: @mschulist & I figured that we should 'autogenerate' the symlinks using calls to Linux/macOS (the same) from the top of the scripts and keep the .gitignore file clean. For an example see [naive_occupancy_17-21.R]

The link_create() function uses the fs library. I can try to make a function that will simplify how we write symlinks in the code, although it is not a ton of work to write it as you showed in the example. Putting the symlink also makes it more readable and follows the "best practices" better.

@matt-har-vey
Copy link
Collaborator Author

it would be great to have a test script

I used the code in this gist while developing. I figured this didn't need to be in GitHub, since it's a one-liner and the function is documented. Maybe it's worth adding an example usage section to the doc comment.

reproducible output that can feed into Jags

"feed into JAGS" is the intent of readCombined. It's supposed to "just work" as

data <- readCombined(...)
jags(..., data=data, ...)

I haven't actually tested it this way yet, but I think it's easy. I'll create another branch for that. (I'm keeping separate branches because I want to leave combined.R in a working state at main until reviews decide this is correct and complete enough to cut-over to.)

@matt-har-vey
Copy link
Collaborator Author

  1. 'autogenerate' the symlinks using calls to Linux/macOS

If the linked-to path is going to be hard-coded into the source file that depends on / reads that input, for the purpose of creating the symlink, it's less obvious to me that the symlink is providing any benefit as the code-on-data dependencies would already be represented in the code. (Or from another view, since the code already "knows the real path.") And if it changes for some reason, we could end up with stale / confusing symlinks.

I guess I had better watch the video.

@ddkapan
Copy link
Collaborator

ddkapan commented Apr 5, 2022

Hi @matth79:

  1. thanks for the gist. I'm on a different branch now but will test with that when I can. It is just what I needed.

  2. Regarding the symlink issue, following Patrick Ball's video, we are holding onto the idea that a human can read and understand the project by looking at the organization. We can easily reference the output from a different step as input for a given step, but then we have to look at code instead of files. RE stale symlinks, they are created each time and never synchronized so it is an entirely superfluous step unless you want to be able to read the directories by eye or somehow build a dependency graph, which is easy with a makefile/graphviz (though I haven't done it). So this is a definite area for discussion and active 'change'...

BTW I think Patrick Ball basically did a lot of the creation of input symlinks by hand as he moved through his analyses. They left him a breadcrumb trail as he puzzled through and fixed his work. A final use case that's nice, is that the symlinks in multiple .../chunkX/input/ .../chunkY/input/'s can be pointing to the same .../chunk/output/ (just as code can) but having that in the directory structure has human-readable has a small but not overriding benefit and we have to make sure if we have symlinks, we are not copying those to Google_Drive, so definitely we need to discuss and decide... Thanks!

This is a bug that did not show itself when this was being called from a
test script where species (plural) had been defined locally but revealed
itself when combined.R was ported.

It is indended that outerIndices specify which species and years to
include and that readMl get to know those in case it can use them to
make the query more efficient.
@ddkapan
Copy link
Collaborator

ddkapan commented Apr 6, 2022

Hi @matth79 & @mkclapp,

Mary, at least one of us should try to run this in its entirety (with the next PR, right Matt)? Do you agree?

Can we have a brief face-to-face tomorrow to run through the workflow and all the issues and figure out where to clean those up most efficiently?

Thanks!

Durrell

@mkclapp
Copy link
Collaborator

mkclapp commented Apr 7, 2022

Hi all,
Apologies for my sluggishness in responding to these PRs/reviews, I've had my head down working on other things at Pat's and Angela's request. Yes, a face-to-face tomorrow would be really helpful for me. I'm flexible all day except for 2:30-3:30p.

@matt-har-vey
Copy link
Collaborator Author

Yes, face-to-face sounds good. I'll send an invite.

I do agree that we want one or both of you to be able to run, at least the test code in the gist before merging.

The changes to combined.R I kept separate so that there's the option to have this merged without breaking that. The fact is I think it is working end-to-end.

@ddkapan
Copy link
Collaborator

ddkapan commented Apr 7, 2022

Hi @mkclapp & @matth79, Let me know what you all decided. If nothing more is necessary, once we get your review (e.g. all good or whatever), Mary, I can merge the pull request.

@mkclapp
Copy link
Collaborator

mkclapp commented Apr 7, 2022

Was able to run from source() with no problem with @matth79 and @ddkapan as witness!

@ddkapan ddkapan merged commit b91cbe4 into calacademy-research:main Apr 7, 2022
@matt-har-vey matt-har-vey deleted the readml-impl branch April 8, 2022 04:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants