Clean up readML.R and make it tall and remove partial recordings #62
Draft of the function that would read it. |
Lots of discussion around this on PR #67 |
I just finished making the rough draft of the changes. The output now includes all logits and is tall, meaning that a lot of the data is repeated, making the file huge. It is 162GB and is on google drive. |
Cool. Nice work, and yes that is huge. @mkclapp or @ddkapan could correct me if I'm wrong, but if I understand, we only need rows where (logit > -2.0) (the rule for making it "sparse"). That could cut down on the number of lines. Downstream readers will only need (Species, Point, Date_Time, Start_Time, Logit). Leaving out ARU, filename, etc. could cut down a lot on characters per row. This might be a good place to do the conversion from 6-letter to 4-letter species codes (to correspond to point counts). That would save 2 characters per row. No need to read deeply, but incidentally, what you're making here is something like a fact table. |
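A minimal sketch of that sparsification, in Python rather than the repo's R, with hypothetical row data and a made-up two-entry code map (the real 6-to-4-letter mapping would come from the point-count data):

```python
# Hypothetical 6-letter to 4-letter species code map; illustrative entries only.
SIX_TO_FOUR = {"AMEROB": "AMRO", "STEJAY": "STJA"}

def sparsify(rows, threshold=-2.0):
    """Keep only rows above the logit threshold, with just the five needed columns."""
    for r in rows:
        if float(r["Logit"]) > threshold:
            yield {
                "Species": SIX_TO_FOUR.get(r["Species"], r["Species"]),
                "Point": r["Point"],
                "Date_Time": r["Date_Time"],
                "Start_Time": r["Start_Time"],
                "Logit": r["Logit"],
            }

# Two example rows; column names beyond the five kept ones are assumptions.
rows = [
    {"Species": "AMEROB", "Point": "P1", "Date_Time": "2021-06-01T05:00:00",
     "Start_Time": "0", "Logit": "1.5", "ARU": "aru01", "filename": "x.wav"},
    {"Species": "STEJAY", "Point": "P1", "Date_Time": "2021-06-01T05:00:00",
     "Start_Time": "0", "Logit": "-3.2", "ARU": "aru01", "filename": "x.wav"},
]
kept = list(sparsify(rows))
```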
Thanks for those brilliant suggestions. I just reran it but only kept the columns that you mentioned and used the 4 letter codes. It's only 56GB now! And that's without getting rid of low logits. |
I'm not sure what the disk format is; there may be some good tricks to reduce the size of the logits.
If it's a binary format: save the logits as float16, or convert to an 8-bit int (where 0 = -2.0 and 255 = 5.0 or so).
If it's another CSV: find a way to save in a binary format, or drop to 1 place after the decimal.
Best,
-tom denton
http://inventingsituations.net
|
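The 8-bit option above can be sketched as follows (Python for illustration; the 0 = -2.0 and 255 = 5.0 endpoints come from the comment, everything else is an assumption):

```python
def quantize_logit(logit, lo=-2.0, hi=5.0):
    """Map a float logit in [lo, hi] to an 8-bit integer; clamp out-of-range values."""
    clamped = max(lo, min(hi, logit))
    return round((clamped - lo) / (hi - lo) * 255)

def dequantize_logit(q, lo=-2.0, hi=5.0):
    """Invert the mapping; worst-case error is half a step, about 0.014 here."""
    return lo + q / 255 * (hi - lo)
```

At one byte per logit instead of a multi-character decimal string, this trades a little precision for a large reduction in bytes per row.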
Thresholding, now that there's only one species per line, removes all the other-species logits from lines where only one of a few species logits had been above threshold, which is most of them. Saying this is a 90x reduction in size would be an overstatement, but it's that order of magnitude. Going after record length, in contrast, could cut the size about in half in the best case. The biggest opportunity is the ISO-8601 date string, which could be made something weird like a count of 30-minute periods since January 1, 2018 (small, but I would definitely not recommend.) The improvement from thresholding is nonlinear-ish. The improvement from reducing record length is linear. Notwithstanding that, as binary formats go, Avro is nice for being columnar and including the schema with the data. I think we have a stated soft preference for text, though. |
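For concreteness only (the comment above explicitly does not recommend it), the 30-minute-period date encoding would look something like this Python sketch:

```python
from datetime import datetime, timedelta

EPOCH = datetime(2018, 1, 1)  # arbitrary origin taken from the comment above

def to_period(dt):
    """Whole 30-minute periods elapsed since the epoch: a 5-6 digit integer
    instead of a 19-character ISO-8601 string."""
    return int((dt - EPOCH).total_seconds() // 1800)

def from_period(n):
    """Invert the encoding, losing any sub-period precision."""
    return EPOCH + timedelta(minutes=30 * n)
```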
It could be a problem on my end, but when I point BigQuery at this and "SELECT *" without limit, I get errors like Error while reading table: comb.dataML_tall, error message: CSV table references column position 4, but line starting at position:58676002278 contains only 1 columns. I'll download this on a single machine and update this with what I find. |
I missed a lot of activity over the weekend and am trying to track progress chronologically. It sounds like @mschulist drafted the updated readML.R, and @matth79 did a draft of the following step (a function to intake the tall csv and create arrays appropriate for JAGS). @mschulist , can you identify the branch you're working on so I can pull and review via my computer (connected to alice)? Then I will review @matth79 's pull request on issue #68 . |
I spied through the Branches navigation and guess @mschulist's changes are in this branch. I think it might take a while to re-run, though, and maybe the output is already somewhere on alice. I've uploaded a file dataML_tall_unofficial.zip to the top-level of the Resilience_data_drive folder. It's smaller and could get the code review unblocked, but since it came from me, it doesn't meet the "integration test" goal. (It's not an official "output" data product and is intended to be deleted soon, once it's outlived its usefulness.) |
Yes, that is the branch. The output is on alice already under my directory if you want to just use that without having to download the huge file.
|
To my earlier comment:
It was BigQuery+Drive that was in error. I downloaded to a single machine and verified that all rows have 5 fields. |
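That per-row verification can be done with a short script; here is an illustrative Python version (the sample rows are hypothetical, but the 5-column schema matches the discussion above):

```python
import csv

def count_bad_rows(lines, expected=5):
    """Count CSV rows whose field count differs from the expected schema width."""
    return sum(1 for row in csv.reader(lines) if row and len(row) != expected)

sample = [
    "AMRO,P1,2021-06-01T05:00:00,0,1.5",  # well-formed, 5 fields
    "STJA,P1,2021-06-01T05:00:00,0",      # malformed, missing the Logit field
]
```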
We might also want to consider filtering out (UNKN, nonbird, human) at this stage. |
I just uploaded the outputs from the new readML.R that are tall. Filtering logits that are below -2.5 significantly reduces file sizes. They are in the |
@mschulist is this issue 'finished'? :) |
Yes, they are all in google drive. |
readML.R output should be tall/sparse (report the time that it takes to rotate the matrix)
Any recordings that are <15 mins should be thrown out.
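A sketch of what those two requirements imply, in Python with hypothetical column names (the actual implementation lives in readML.R):

```python
import time

def make_tall(wide_rows, species_cols, min_minutes=15):
    """Pivot wide logit rows (one column per species) into tall rows,
    dropping recordings shorter than min_minutes, and time the rotation."""
    start = time.perf_counter()
    tall = []
    for r in wide_rows:
        if float(r["Duration_Min"]) < min_minutes:
            continue  # throw out partial recordings
        for sp in species_cols:
            tall.append({"Point": r["Point"], "Date_Time": r["Date_Time"],
                         "Species": sp, "Logit": r[sp]})
    return tall, time.perf_counter() - start

# Hypothetical wide-format input: one row per recording, one column per species.
wide = [
    {"Point": "P1", "Date_Time": "2021-06-01T05:00", "Duration_Min": "20",
     "AMRO": -1.0, "STJA": 2.0},
    {"Point": "P2", "Date_Time": "2021-06-01T05:00", "Duration_Min": "10",
     "AMRO": 0.5, "STJA": 0.1},  # under 15 minutes: dropped
]
tall, elapsed = make_tall(wide, ["AMRO", "STJA"])
```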