audio-embedding update #32
-
Some questions/thoughts regarding the unsupervised embedding methodology that seems to be at the heart of Earth Species' optimism that we might be able to make a monumental leap in understanding animal communication. https://github.com/earthspecies/unsupervised-audio-translation/blob/main/README.md
-
Hi Jeff! You are right that there are other methods that require some minimal amount of translated word pairs (the number of required pairs kept decreasing, 5000 -> 25, as improved methods were developed... and finally reached 0!). I also find it quite mind-boggling that translation without any word pairs is feasible, that there is enough structure in the embeddings to facilitate the alignment. Here is the paper that discusses the method: Word Translation Without Parallel Data<https://arxiv.org/abs/1710.04087>. This is a toy example<https://github.com/earthspecies/decoder-head-unsupervised-translation/blob/master/05_aligning_the_embeddings_using_vecmap.ipynb> from some time ago in which I attempted training embeddings using a very small vocabulary (only 4000 words). It is astonishing that even with such a toy approach enough structure is captured to facilitate the alignment! The results can be further improved with more data and other tweaks, but the core idea is there.
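As a rough illustration of the mechanical core these methods share, the sketch below does a supervised orthogonal Procrustes solve over a (possibly tiny) seed dictionary and then a nearest-neighbour lookup; the fully unsupervised variants bootstrap the seed pairs rather than being given them. The names and shapes are illustrative assumptions, not the code from the linked notebook:

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal map W minimizing ||X @ W - Y||_F, where row i of X and Y
    are the source- and target-language embeddings of a seed translation pair."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def translate(word_vec, W, tgt_vecs, tgt_words, k=5):
    """Map a source embedding into the target space and return the k
    nearest target words by cosine similarity."""
    q = word_vec @ W
    sims = (tgt_vecs @ q) / (np.linalg.norm(tgt_vecs, axis=1) * np.linalg.norm(q))
    return [tgt_words[i] for i in np.argsort(-sims)[:k]]
```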
-
Thank you for the paper! For some strange reason I’m not surprised that it works with 0 word pairs in Homo sapiens. And if other species just use their neural vector space as a mapping engine, so to speak, then this should work for magpies, elk and others 😊
Back to the 40% accuracy question though: did you find any word classes (e.g. adverbs, verbs, nouns, determiners) that had higher accuracy rates than others?
And, has anyone tried this with grammatical forms (e.g. tenses, mood, gender)?
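For concreteness, the kind of per-class breakdown meant here could be sketched roughly as below, assuming a spaCy model for the part-of-speech tags and a list of per-word correctness flags from the translation evaluation; the names are illustrative, not ESP code:

```python
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, used only for POS tags

def accuracy_by_pos(results):
    """results: list of (english_word, was_translation_correct) pairs."""
    tally = defaultdict(lambda: [0, 0])        # pos tag -> [correct, total]
    for word, correct in results:
        pos = nlp(word)[0].pos_                # e.g. NOUN, VERB, ADV, DET
        tally[pos][0] += int(correct)
        tally[pos][1] += 1
    return {pos: c / n for pos, (c, n) in tally.items()}
```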
-
I haven't looked specifically at how different word classes perform. But you are absolutely right - this would be very interesting to take a closer look at! I am also not aware of anyone publishing any results in this space.
-
From a field researcher’s perspective, I’m trying to understand how a 40% “accuracy” rate helps me in the field translate from a species’ vocalization to, e.g., English. Is it as simple as “we’ll at least provide a translation,” and then you determine whether that translation makes any sense in the field context? If not, we probably got it wrong; but if it does fit the context, we probably got it right.
Last point: I think the biggest problem we’re facing is the definition of a “word” in non-Homo sapiens vocalizations. After all, the word2vec algorithm requires a “word” as input. Have you given some thought to ways around that? I’ve seen some papers that suggest simply grabbing random time sequences of vocalizations.
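One crude version of that "grab fixed/random time sequences" idea might look like the sketch below: chop recordings into fixed-length chunks, featurize them, and quantise the chunks into a discrete vocabulary that a word2vec-style model could consume. The file name, chunk length, and cluster count are invented for illustration; this is not ESP's pipeline.

```python
import librosa
import numpy as np
from sklearn.cluster import KMeans

# Load a recording (hypothetical file) and slice it into 0.5 s chunks.
y, sr = librosa.load("vocalization.wav", sr=None)
chunk_len = int(0.5 * sr)
chunks = [y[i:i + chunk_len] for i in range(0, len(y) - chunk_len, chunk_len)]

# Represent each chunk by its flattened, log-scaled mel spectrogram.
feats = np.stack([
    librosa.power_to_db(librosa.feature.melspectrogram(y=c, sr=sr)).ravel()
    for c in chunks
])

# Quantise chunks into a discrete "vocabulary"; each cluster id acts as a
# pseudo-"word" token (assumes there are at least 50 chunks).
tokens = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(feats)
print(tokens[:20])  # token sequence that a skip-gram style model could train on
```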
-
Precisely; little is known about phoneme, morpheme, and lexeme (and coda) boundaries in other species. Chickadees, like sperm whales, have recognizable syllables, and we even have some understanding of what they can “mean” in sequence, BUT those same syllables are used in a lot of contexts where we have NO clue what the coda of meaning is.
If you have time in the future, can you post the ML methods used to arrive at lexeme boundaries in speech in an unsupervised way? I would like to dig into them and see where they might help with other non-Homo sapiens languages.
Fun, fun, fun. Thank you for taking the time to engage with my novice questions. The feedback is proving helpful as we in the Greater Yellowstone pick the species vocalizations that we think lend themselves best to the ML models you’ll be using… once we decide, we’ll get to work on collecting the data and a small dictionary of known “words” (or “sentences”).
From: radekosmulski
I think the 40% accuracy might be a bit misleading. For instance, measuring accuracy often doesn't take into account synonyms or near-synonyms, which embedding-based methods should return with some frequency. These closely related words could probably lend themselves to translation efforts to a greater extent than just getting 40% of words correct would imply.
It is very challenging to imagine today how these methods will perform in an interspecies context. Evaluation of results is also hard without ground truth, agreed, but as you mention, observation might be one of the tools at our disposal.
With regards to establishing a meaningful unit of animal vocalizations, the approach will probably have to vary from species to species. For sperm whales, for instance, communication is already organized in discrete units, the clicks. Still, I believe it remains an open question how much information a single click encodes, whether clicks are only meaningful in a sequence, what the meaning is of the variable-length silences in collections of clicks (codas), etc. The methods that we are working on developing and applying to animal communication can help answer these questions.
For many other species, identifying such a discrete unit might be much more challenging, but I am hopeful that, again, machine learning methods can lend a hand. Some work has already been done to arrive at word boundaries in speech in an unsupervised way. Extending these methods to animal vocalizations might be one approach. It might also be the case that the answer to this question will emerge from the bioacoustic research community.
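As a concrete illustration of the near-synonym point, a lenient precision@k over an aligned embedding space might be sketched roughly like this; the gold-dictionary layout and names below are illustrative assumptions, not ESP's actual evaluation code:

```python
import numpy as np

def lenient_precision_at_k(src_vecs, tgt_vecs, tgt_words, gold, k=5):
    """gold: dict mapping each source word to a *set* of acceptable target
    words (the reference translation plus near-synonyms). Row i of src_vecs
    is the embedding of the i-th key of gold, already mapped into the
    target space; a retrieval counts if any top-k neighbour is acceptable."""
    S = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    T = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    hits = 0
    for i, acceptable in enumerate(gold.values()):
        sims = T @ S[i]                  # cosine similarity to every target word
        top_k = np.argsort(-sims)[:k]    # indices of the k nearest neighbours
        if any(tgt_words[j] in acceptable for j in top_k):
            hits += 1
    return hits / len(gold)
```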
-
The local research scientists I’m working with here in the Greater Yellowstone and I are leaning towards contributing vocalizations from Black-capped Chickadees (BCC) for a few reasons:
1. Plenty of data already captured, with regional dialects identified and tagged; the species is widespread and often found in contexts where citizen scientists can collect vocalizations with cell phones, in a non-captive environment, all season long.
2. Intra-species dialogue data is available.
3. Inter-species threat calls are well studied, with solid evidence for the meaning of certain calls.
4. Vocalization boundaries are fairly well understood… although whether the units are lexemes (with meaning) or phonemes (no meaning) is less understood.
5. Unsupervised term discovery (UTD) could be applied to chick-a-dee-dee-dee-etc. segments to discover unknown words/syllables. In other words, we test the assumption that a common chick-a-dee-dee-dee-etc. call is a sentence (a meaningful segment beyond the word) and look for words within it. Because we know something about the “meaning” of the number of “dees” a BCC uses to indicate the type of threat (e.g. more dees are vocalized for scarier threats like a pygmy owl than a red-tailed hawk), we could then test some of the discovered words against ground truth (see the sketch after this list).
6. Once we have some “words” to work with, we embed those (hopefully) “words” and use (or not) the admittedly small chickadee-English dictionary we have to create an embedding alignment and test some translations against English in practical scenarios, e.g. how does a chickadee call translate to English when you bring them food outside your door versus when your dog walks outside?
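A first sanity check for point 5 might look like the sketch below: take the per-call counts of discovered “dee”-like units and check that they track the known threat-severity ordering. The data and ranks here are invented purely for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical records: (dee-like units found in a call, threat severity rank
# of the stimulus; smaller raptors such as a pygmy owl rank as scarier to a
# chickadee than a red-tailed hawk).
calls = [
    (2, 1),  # red-tailed hawk trial
    (4, 3),  # pygmy owl trial
    (3, 2),
    (5, 3),
    (2, 1),
]
dee_counts = [d for d, _ in calls]
threat_ranks = [r for _, r in calls]
rho, p = spearmanr(dee_counts, threat_ranks)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")  # clearly positive rho would support the segmentation
```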
Then, we want to turn our attention to elk: https://academic.oup.com/jmammal/article/87/6/1072/884332
From: radekosmulski
Great to talk! 🙂
Here is a paper<https://arxiv.org/pdf/1811.00403.pdf> on the completely unsupervised front that I have been looking at recently. It would be nice, though, to come across a paper where establishing word boundaries is a more central theme - I will stay on the lookout 🙂 I have only taken a cursory look into this space thus far.
-
Interestingly, Emily L. Mackevicius<https://www.ncbi.nlm.nih.gov/pubmed/?term=Mackevicius%20EL%5BAuthor%5D&cauthor=true&cauthor_uid=30719973> is a birdsong researcher who has also done a lot of work on the seemingly tougher problem of identifying neural sequences: Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6363393/
-
This short video covers her use of seqNMF and the end of the video shows its application to bird spectrograms: https://www.youtube.com/watch?v=XyWtCtZ_m-8
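For anyone who wants to poke at the idea before reading the paper, below is a bare-bones convolutive NMF sketch (multiplicative updates, Frobenius cost) of the kind seqNMF builds on; it omits the cross-factor penalty that is seqNMF's actual contribution, and all names and sizes are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def shift(A, l):
    """Shift the columns of A by l (right if l > 0, left if l < 0), zero-padding."""
    out = np.zeros_like(A)
    if l == 0:
        out[:] = A
    elif l > 0:
        out[:, l:] = A[:, :-l]
    else:
        out[:, :l] = A[:, -l:]
    return out

def conv_nmf(X, n_factors=3, n_lags=20, n_iter=200, eps=1e-9, seed=0):
    """Factor a spectrogram X (freq x time) as sum_l W[:, :, l] @ shift(H, l),
    i.e. each factor is a short spectrotemporal template that recurs in time."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, n_factors, n_lags))
    H = rng.random((n_factors, T))
    for _ in range(n_iter):
        Xhat = sum(W[:, :, l] @ shift(H, l) for l in range(n_lags))
        # Multiplicative update for the activations H (summed over lags).
        num = sum(W[:, :, l].T @ shift(X, -l) for l in range(n_lags))
        den = sum(W[:, :, l].T @ shift(Xhat, -l) for l in range(n_lags)) + eps
        H *= num / den
        Xhat = sum(W[:, :, l] @ shift(H, l) for l in range(n_lags))
        # Multiplicative update for each lag slice of the templates W.
        for l in range(n_lags):
            W[:, :, l] *= (X @ shift(H, l).T) / (Xhat @ shift(H, l).T + eps)
    return W, H
```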
-
Here is the Google doc we went over yesterday (ESP accessible only). I revised the text and made it available as the readme of the audio-embedding repository. I also linked to the repository from the project readme. I edited and cleaned up the doc; the roadmap is now available as the readme for the project repository.