
Clean up readML.R and make it tall and remove partial recordings #62

Closed
mschulist opened this issue Apr 1, 2022 · 16 comments
Labels
enhancement New feature or request outputs Request to add or update authoritative copy of intermediate output

Comments

@mschulist
Collaborator

mschulist commented Apr 1, 2022

readML.R output should be tall/sparse (report the time that it takes to rotate the matrix).
Any recordings that are <15 mins should be thrown out.
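The actual pipeline is in R (readML.R), but the reshape being asked for can be sketched in Python. All field names here (`recording_id`, `duration`, `start_time`, per-species logit columns) are illustrative assumptions, not the real readML.R schema:

```python
from datetime import timedelta

def make_tall(wide_rows, species_cols, min_duration=timedelta(minutes=15)):
    """Pivot 'wide' rows (one row per recording window, one logit column
    per species) into 'tall' rows (one row per species per window),
    dropping recordings shorter than min_duration.

    Field names are illustrative, not the actual readML.R schema.
    """
    tall = []
    for row in wide_rows:
        if row["duration"] < min_duration:
            continue  # throw out partial (<15 min) recordings
        for sp in species_cols:
            tall.append({
                "recording_id": row["recording_id"],
                "start_time": row["start_time"],
                "species": sp,
                "logit": row[sp],
            })
    return tall
```

Note that tall output repeats the key columns once per species, which is why the file size balloons until the rows are thresholded.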

@matt-har-vey
Collaborator

Draft of the function that will need to read it.

@matt-har-vey
Collaborator

Lots of discussion around this on PR #67

@mschulist
Collaborator Author

mschulist commented Apr 4, 2022

I just finished making the rough draft of the changes. The output now includes all logits and is tall, meaning that a lot of the data is repeated, making the file huge. It is 162GB and is on google drive.
https://drive.google.com/file/d/1a9C_iCJ8hFO261Qx7_j_9o8hZojOisMy/view?usp=sharing

@matt-har-vey
Collaborator

Cool. Nice work, and yes that is huge.

@mkclapp or @ddkapan could correct me if I'm wrong, but if I understand, we only need rows where (logit > -2.0) (the rule for making it "sparse"). That could cut down on the number of lines.

Downstream readers will only need (Species, Point, Date_Time, Start_Time, Logit). Leaving out ARU, filename, etc. could cut down a lot on characters per row.

This might be a good place to do the conversion from 6-letter to 4-letter species codes (to correspond to point counts). That would save 2 characters per row.

No need to read deeply, but incidentally, what you're making here is something like a fact table.
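The three suggestions above (threshold at logit > -2.0, keep only the five columns downstream readers need, map 6-letter to 4-letter species codes) can be sketched together. This is a Python illustration of an R pipeline; the `code_map` entries and the assumption that it's a plain dict are hypothetical:

```python
def sparsify(tall_rows, code_map, threshold=-2.0):
    """Keep only rows above the logit threshold, project down to the
    five columns downstream readers need, and translate 6-letter species
    codes to the 4-letter codes used in point counts.

    code_map is assumed to be a dict like {"AMEROB": "AMRO"}
    (illustrative entries, not a real lookup table).
    """
    keep = ("Species", "Point", "Date_Time", "Start_Time", "Logit")
    out = []
    for row in tall_rows:
        if row["Logit"] <= threshold:
            continue  # drop low-confidence rows: this is what makes it "sparse"
        slim = {k: row[k] for k in keep}  # drop ARU, filename, etc.
        slim["Species"] = code_map.get(slim["Species"], slim["Species"])
        out.append(slim)
    return out
```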

@mschulist
Collaborator Author

Thanks for those brilliant suggestions. I just reran it but only kept the columns that you mentioned and used the 4 letter codes. It's only 56GB now! And that's without getting rid of low logits.

@sdenton4
Collaborator

sdenton4 commented Apr 4, 2022 via email

@matt-har-vey
Collaborator

Now that there's only one species per line, thresholding removes the other-species logits from rows where only one or a few species were above threshold, which is most of them. Calling this a 90x reduction in size would be an overstatement, but it's that order of magnitude.

Going after record length, in contrast, could cut the size about in half in the best case. The biggest opportunity is the ISO-8601 date string, which could be compressed into something weird like a count of 30-minute periods since January 1, 2018 (small, but I would definitely not recommend it).

The improvement from thresholding is nonlinear-ish. The improvement from reducing record length is linear.

Notwithstanding that, as binary formats go, Avro is nice for being columnar and including the schema with the data. I think we have a stated soft preference for text, though.
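As a back-of-envelope comparison of the two levers, with all numbers being illustrative assumptions loosely based on figures mentioned in this thread, not measurements:

```python
# Rough comparison: thresholding (row-count lever) vs. shrinking record
# length (bytes-per-row lever). All values are assumed, not measured.
full_size_gb = 56.0      # tall file after column pruning (from this thread)
keep_fraction = 1 / 90   # assumed fraction of rows surviving the threshold

thresholded_gb = full_size_gb * keep_fraction  # order-of-magnitude reduction

bytes_per_row = 60       # assumed row length with the ISO-8601 timestamp
shorter_row = 30         # assumed best case after compressing the date field

shortened_gb = full_size_gb * shorter_row / bytes_per_row  # ~2x reduction
```

The row-count lever scales with how aggressive the threshold is (hence "nonlinear-ish"), while the record-length lever can at best roughly halve the bytes per row.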

@matt-har-vey
Collaborator

It could be a problem on my end, but when I point BigQuery at this and "SELECT *" without limit, I get errors like

Error while reading table: comb.dataML_tall, error message: CSV table references column position 4, but line starting at position:58676002278 contains only 1 columns.

I'll download this on a single machine and update this with what I find.

@mkclapp
Collaborator

mkclapp commented Apr 4, 2022

I missed a lot of activity over the weekend and am trying to track progress chronologically. It sounds like @mschulist drafted the updated readML.R, and @matth79 did a draft of the following step (a function to intake the tall csv and create arrays appropriate for JAGS).

@mschulist , can you identify the branch you're working on so I can pull and review via my computer (connected to alice)? Then I will review @matth79 's pull request on issue #68 .

@matt-har-vey
Collaborator

I looked through the Branches navigation and my guess is that @mschulist's changes are in this branch.

I think it might take a while to re-run, though, and maybe the output is already somewhere on alice. I've uploaded a file dataML_tall_unofficial.zip to the top level of the Resilience_data_drive folder. It's smaller and could get the code review unblocked, but since it came from me, it doesn't meet the "integration test" goal. (It's not an official "output" data product and is intended to be deleted once it's outlived its usefulness.)

@mschulist
Collaborator Author

mschulist commented Apr 4, 2022 via email

@matt-har-vey
Collaborator

To my earlier comment:

It could be a problem on my end, but when I point BigQuery at this and "SELECT *" without limit, I get errors like

It was BigQuery+Drive that was in error. I downloaded to a single machine and verified that all rows have 5 fields.
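The single-machine check described above (every row has exactly 5 fields) can be sketched with the standard library; the file path and column count here are just the ones discussed in this thread:

```python
import csv

def count_field_widths(path):
    """Tally how many fields each CSV row has. A clean 5-column file
    returns {5: n_rows}; any other key flags a malformed row.

    Uses csv.reader rather than a naive comma split so that quoted
    fields containing commas are counted correctly.
    """
    widths = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            widths[len(row)] = widths.get(len(row), 0) + 1
    return widths
```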

@matt-har-vey
Collaborator

We might also want to consider filtering out (UNKN, nonbird, human) at this stage.
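Filtering those classes at this stage would be a one-line addition to the tall-output step. A minimal sketch, assuming the column is named `Species` as in the five-column schema above:

```python
# Classifier classes named in this thread that aren't bird species.
NON_TARGET = {"UNKN", "nonbird", "human"}

def drop_non_target(rows):
    """Remove non-bird classes before writing the tall output
    (sketch; the 'Species' field name is an assumption)."""
    return [r for r in rows if r["Species"] not in NON_TARGET]
```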

@matt-har-vey matt-har-vey added enhancement New feature or request outputs Request to add or update authoritative copy of intermediate output labels Apr 7, 2022
@mschulist
Collaborator Author

I just uploaded the outputs from the new readML.R that are tall. Filtering logits that are below -2.5 significantly reduces file sizes. They are in the acoustic/data_ingest/output/tall/ directory in google drive.

@ddkapan
Collaborator

ddkapan commented Aug 17, 2022

@mschulist is this issue 'finished'? :)

@mschulist
Collaborator Author

Yes, they are all in google drive.
