
Clean up readML.R and make it tall and remove partial recordings #62

Closed
mschulist opened this issue Apr 1, 2022 · 16 comments
Labels
enhancement New feature or request outputs Request to add or update authoritative copy of intermediate output

Comments

@mschulist
Collaborator

mschulist commented Apr 1, 2022

readML.R output should be tall/sparse (report the time that it takes to rotate the matrix).
Any recordings that are <15 mins should be thrown out.
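The actual pipeline is in R (readML.R), but the reshape being asked for can be sketched in Python. All field names here (`recording_id`, `duration`, `start_time`, per-species logit columns) are illustrative assumptions, not the real readML.R schema:

```python
from datetime import timedelta

def make_tall(wide_rows, species_cols, min_duration=timedelta(minutes=15)):
    """Pivot 'wide' rows (one row per recording window, one logit column
    per species) into 'tall' rows (one row per species per window),
    dropping recordings shorter than min_duration.

    Field names are illustrative, not the actual readML.R schema.
    """
    tall = []
    for row in wide_rows:
        if row["duration"] < min_duration:
            continue  # throw out partial (<15 min) recordings
        for sp in species_cols:
            tall.append({
                "recording_id": row["recording_id"],
                "start_time": row["start_time"],
                "species": sp,
                "logit": row[sp],
            })
    return tall
```

Note that tall output repeats the key columns once per species, which is why the file size balloons until the rows are thresholded.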

@matt-har-vey
Collaborator

Draft of the function that will need to read it.

@matt-har-vey
Collaborator

Lots of discussion around this on PR #67

@mschulist
Collaborator Author

mschulist commented Apr 4, 2022

I just finished making the rough draft of the changes. The output now includes all logits and is tall, meaning that a lot of the data is repeated, making the file huge. It is 162GB and is on google drive.
https://drive.google.com/file/d/1a9C_iCJ8hFO261Qx7_j_9o8hZojOisMy/view?usp=sharing

@matt-har-vey
Collaborator

Cool. Nice work, and yes that is huge.

@mkclapp or @ddkapan could correct me if I'm wrong, but if I understand, we only need rows where (logit > -2.0) (the rule for making it "sparse"). That could cut down on the number of lines.

Downstream readers will only need (Species, Point, Date_Time, Start_Time, Logit). Leaving out ARU, filename, etc. could cut down a lot on characters per row.

This might be a good place to do the conversion from 6-letter to 4-letter species codes (to correspond to point counts). That would save 2 characters per row.

No need to read deeply, but incidentally, what you're making here is something like a fact table.
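The three suggestions above (threshold at logit > -2.0, keep only the five columns downstream readers need, map 6-letter to 4-letter species codes) can be sketched together. This is a Python illustration of an R pipeline; the `code_map` entries and the assumption that it's a plain dict are hypothetical:

```python
def sparsify(tall_rows, code_map, threshold=-2.0):
    """Keep only rows above the logit threshold, project down to the
    five columns downstream readers need, and translate 6-letter species
    codes to the 4-letter codes used in point counts.

    code_map is assumed to be a dict like {"AMEROB": "AMRO"}
    (illustrative entries, not a real lookup table).
    """
    keep = ("Species", "Point", "Date_Time", "Start_Time", "Logit")
    out = []
    for row in tall_rows:
        if row["Logit"] <= threshold:
            continue  # drop low-confidence rows: this is what makes it "sparse"
        slim = {k: row[k] for k in keep}  # drop ARU, filename, etc.
        slim["Species"] = code_map.get(slim["Species"], slim["Species"])
        out.append(slim)
    return out
```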

@mschulist
Collaborator Author

Thanks for those brilliant suggestions. I just reran it but only kept the columns that you mentioned and used the 4 letter codes. It's only 56GB now! And that's without getting rid of low logits.

@sdenton4
Collaborator

sdenton4 commented Apr 4, 2022 via email

@matt-har-vey
Collaborator

Now that there's only one species per line, thresholding removes the other-species logits from rows where only one or a few species were above threshold, which is most of them. Calling this a 90x reduction in size would be an overstatement, but it's that order of magnitude.

Going after record length, in contrast, could cut the size about in half in the best case. The biggest opportunity is the ISO-8601 date string, which could be compressed into something weird like a count of 30-minute periods since January 1, 2018 (small, but I would definitely not recommend it).

The improvement from thresholding is nonlinear-ish. The improvement from reducing record length is linear.

Notwithstanding that, as binary formats go, Avro is nice for being columnar and including the schema with the data. I think we have a stated soft preference for text, though.
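As a back-of-envelope comparison of the two levers, with all numbers being illustrative assumptions loosely based on figures mentioned in this thread, not measurements:

```python
# Rough comparison: thresholding (row-count lever) vs. shrinking record
# length (bytes-per-row lever). All values are assumed, not measured.
full_size_gb = 56.0      # tall file after column pruning (from this thread)
keep_fraction = 1 / 90   # assumed fraction of rows surviving the threshold

thresholded_gb = full_size_gb * keep_fraction  # order-of-magnitude reduction

bytes_per_row = 60       # assumed row length with the ISO-8601 timestamp
shorter_row = 30         # assumed best case after compressing the date field

shortened_gb = full_size_gb * shorter_row / bytes_per_row  # ~2x reduction
```

The row-count lever scales with how aggressive the threshold is (hence "nonlinear-ish"), while the record-length lever can at best roughly halve the bytes per row.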

@matt-har-vey
Collaborator

It could be a problem on my end, but when I point BigQuery at this and "SELECT *" without limit, I get errors like

Error while reading table: comb.dataML_tall, error message: CSV table references column position 4, but line starting at position:58676002278 contains only 1 columns.

I'll download this on a single machine and update this with what I find.

@mkclapp
Collaborator

mkclapp commented Apr 4, 2022

I missed a lot of activity over the weekend and am trying to track progress chronologically. It sounds like @mschulist drafted the updated readML.R, and @matth79 did a draft of the following step (a function to intake the tall csv and create arrays appropriate for JAGS).

@mschulist , can you identify the branch you're working on so I can pull and review via my computer (connected to alice)? Then I will review @matth79 's pull request on issue #68 .

@matt-har-vey
Collaborator

I looked through the Branches navigation and my guess is that @mschulist's changes are in this branch.

I think it might take a while to re-run, though, and maybe the output is already somewhere on alice. I've uploaded a file dataML_tall_unofficial.zip to the top level of the Resilience_data_drive folder. It's smaller and could get the code review unblocked, but since it came from me, it doesn't meet the "integration test" goal. (It's not an official "output" data product and is intended to be deleted once it's outlived its usefulness.)

@mschulist
Collaborator Author

mschulist commented Apr 4, 2022 via email

@matt-har-vey
Collaborator

To my earlier comment:

It could be a problem on my end, but when I point BigQuery at this and "SELECT *" without limit, I get errors like

It was BigQuery+Drive that was in error. I downloaded to a single machine and verified that all rows have 5 fields.
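The single-machine check described above (every row has exactly 5 fields) can be sketched with the standard library; the file path and column count here are just the ones discussed in this thread:

```python
import csv

def count_field_widths(path):
    """Tally how many fields each CSV row has. A clean 5-column file
    returns {5: n_rows}; any other key flags a malformed row.

    Uses csv.reader rather than a naive comma split so that quoted
    fields containing commas are counted correctly.
    """
    widths = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            widths[len(row)] = widths.get(len(row), 0) + 1
    return widths
```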

@matt-har-vey
Collaborator

We might also want to consider filtering out (UNKN, nonbird, human) at this stage.
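Filtering those classes at this stage would be a one-line addition to the tall-output step. A minimal sketch, assuming the column is named `Species` as in the five-column schema above:

```python
# Classifier classes named in this thread that aren't bird species.
NON_TARGET = {"UNKN", "nonbird", "human"}

def drop_non_target(rows):
    """Remove non-bird classes before writing the tall output
    (sketch; the 'Species' field name is an assumption)."""
    return [r for r in rows if r["Species"] not in NON_TARGET]
```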

@matt-har-vey matt-har-vey added enhancement New feature or request outputs Request to add or update authoritative copy of intermediate output labels Apr 7, 2022
@mschulist
Collaborator Author

I just uploaded the outputs from the new readML.R that are tall. Filtering logits that are below -2.5 significantly reduces file sizes. They are in the acoustic/data_ingest/output/tall/ directory in google drive.

@ddkapan
Collaborator

ddkapan commented Aug 17, 2022

@mschulist is this issue 'finished'? :)

@mschulist
Collaborator Author

Yes, they are all in google drive.
