audio-embedding update #32
-
Some questions/thoughts regarding the unsupervised embedding methodology that seems to be at the heart of Earth Species' optimism that we might be able to make a monumental leap in understanding animal communication. https://github.com/earthspecies/unsupervised-audio-translation/blob/main/README.md
-
Hi Jeff! You are right that there are other methods that require some minimal amount of translated word pairs (the number of required pairs kept decreasing, 5000 -> 25, as improved methods were developed... and finally reached 0!). I also find it quite mind-boggling that translation without any word pairs is feasible, that there is enough structure in the embeddings to facilitate the alignment. Here is the paper that discusses the method: Word Translation Without Parallel Data<https://arxiv.org/abs/1710.04087>. This is a toy example<https://github.com/earthspecies/decoder-head-unsupervised-translation/blob/master/05_aligning_the_embeddings_using_vecmap.ipynb> from some time ago in which I attempted training embeddings using a very small vocabulary (only 4000 words). It is astonishing that even with such a toy approach enough structure is captured to facilitate the alignment! The results can be further improved with more data and other tweaks, but the core idea is there.
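As a rough illustration of the mechanical core these methods share, the sketch below does a supervised orthogonal Procrustes solve over a (possibly tiny) seed dictionary and then a nearest-neighbour lookup; the fully unsupervised variants bootstrap the seed pairs rather than being given them. The names and shapes are illustrative assumptions, not the code from the linked notebook:

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal map W minimizing ||X @ W - Y||_F, where row i of X and Y
    are the source- and target-language embeddings of a seed translation pair."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def translate(word_vec, W, tgt_vecs, tgt_words, k=5):
    """Map a source embedding into the target space and return the k
    nearest target words by cosine similarity."""
    q = word_vec @ W
    sims = (tgt_vecs @ q) / (np.linalg.norm(tgt_vecs, axis=1) * np.linalg.norm(q))
    return [tgt_words[i] for i in np.argsort(-sims)[:k]]
```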
-
Thank you for the paper! For some strange reason I’m not surprised that it works with 0 word pairs in Homo sapiens. And if other species just use their neural vector space as a mapping engine, so to speak, then this should work for magpies, elk and others 😊
Back to the 40% accuracy question though: did you find any word classes (e.g. adverbs, verbs, nouns, determiners) that had higher accuracy rates than others?
And, has anyone tried this with grammatical forms (e.g. tenses, mood, gender)?
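For concreteness, the kind of per-class breakdown meant here could be sketched roughly as below, assuming a spaCy model for the part-of-speech tags and a list of per-word correctness flags from the translation evaluation; the names are illustrative, not ESP code:

```python
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, used only for POS tags

def accuracy_by_pos(results):
    """results: list of (english_word, was_translation_correct) pairs."""
    tally = defaultdict(lambda: [0, 0])        # pos tag -> [correct, total]
    for word, correct in results:
        pos = nlp(word)[0].pos_                # e.g. NOUN, VERB, ADV, DET
        tally[pos][0] += int(correct)
        tally[pos][1] += 1
    return {pos: c / n for pos, (c, n) in tally.items()}
```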
-
I haven't looked specifically at how different word classes perform. But you are absolutely right - this would be very interesting to take a closer look at! I am also not aware of anyone publishing any results in this space.
-
From a field researcher’s perspective, I’m trying to understand how a 40% “accuracy” rate helps me in the field translate from a species’ vocalization to, e.g., English. Is it as simple as “we’ll at least provide a translation,” and then you determine whether that translation makes any sense in the field context? If not, we probably got it wrong; but if it does fit the context, we probably got it right.
Last point: I think the biggest problem we’re facing is the definition of a “word” in non-Homo sapiens vocalizations. After all, the word2vec algorithm requires a “word” as input. Have you given some thought to ways around that? I’ve seen some papers that suggest simply grabbing random time sequences of vocalizations.
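One crude version of that "grab fixed/random time sequences" idea might look like the sketch below: chop recordings into fixed-length chunks, featurize them, and quantise the chunks into a discrete vocabulary that a word2vec-style model could consume. The file name, chunk length, and cluster count are invented for illustration; this is not ESP's pipeline.

```python
import librosa
import numpy as np
from sklearn.cluster import KMeans

# Load a recording (hypothetical file) and slice it into 0.5 s chunks.
y, sr = librosa.load("vocalization.wav", sr=None)
chunk_len = int(0.5 * sr)
chunks = [y[i:i + chunk_len] for i in range(0, len(y) - chunk_len, chunk_len)]

# Represent each chunk by its flattened, log-scaled mel spectrogram.
feats = np.stack([
    librosa.power_to_db(librosa.feature.melspectrogram(y=c, sr=sr)).ravel()
    for c in chunks
])

# Quantise chunks into a discrete "vocabulary"; each cluster id acts as a
# pseudo-"word" token (assumes there are at least 50 chunks).
tokens = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(feats)
print(tokens[:20])  # token sequence that a skip-gram style model could train on
```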
-
Precisely; little is known about phoneme, morpheme, and lexeme (and coda) boundaries in other species. Chickadees, like sperm whales, have recognizable syllables, and we even have some understanding of what they can “mean” in sequence, BUT those same syllables are used in a lot of contexts where we have NO clue what the coda of meaning is.
If you have time in the future, can you post the ML methods used to arrive at lexeme boundaries in speech in an unsupervised way? I would like to dig into them and see where they might help with other non-Homo sapiens languages.
Fun, fun, fun. Thank you for taking the time to engage with my novice questions. The feedback is proving helpful as we in the Greater Yellowstone pick the species vocalizations that we think lend themselves best to the ML models you’ll be using… once we decide, we’ll get to work on collecting the data and a small dictionary of known “words” (or “sentences”).
From: radekosmulski
I think the 40% accuracy might be a bit misleading. For instance, measuring accuracy often doesn't take into account synonyms or near-synonyms, which embedding-based methods should return with some frequency. These closely related words could probably lend themselves to translation efforts to a greater extent than just getting 40% of words correct would imply.
It is very challenging to imagine today how these methods will perform in an interspecies context. Evaluation of results is also hard without ground truth, agreed, but as you mention, observation might be one of the tools at our disposal.
With regards to establishing a meaningful unit of animal vocalizations, the approach will probably have to vary from species to species. For sperm whales, for instance, communication is already organized in discrete units, the clicks. Still, I believe it remains an open question how much information a single click encodes, whether clicks are only meaningful in a sequence, what the meaning is of the variable-length silences in collections of clicks (codas), etc. The methods that we are working on developing and applying to animal communication can help answer these questions.
For many other species, identifying such a discrete unit might be much more challenging, but I am hopeful that, again, machine learning methods can lend a hand. Some work has already been done to arrive at word boundaries in speech in an unsupervised way. Extending these methods to animal vocalizations might be one approach. It might also be the case that the answer to this question will emerge from the bioacoustic research community.
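As a concrete illustration of the near-synonym point, a lenient precision@k over an aligned embedding space might be sketched roughly like this; the gold-dictionary layout and names below are illustrative assumptions, not ESP's actual evaluation code:

```python
import numpy as np

def lenient_precision_at_k(src_vecs, tgt_vecs, tgt_words, gold, k=5):
    """gold: dict mapping each source word to a *set* of acceptable target
    words (the reference translation plus near-synonyms). Row i of src_vecs
    is the embedding of the i-th key of gold, already mapped into the
    target space; a retrieval counts if any top-k neighbour is acceptable."""
    S = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    T = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    hits = 0
    for i, acceptable in enumerate(gold.values()):
        sims = T @ S[i]                  # cosine similarity to every target word
        top_k = np.argsort(-sims)[:k]    # indices of the k nearest neighbours
        if any(tgt_words[j] in acceptable for j in top_k):
            hits += 1
    return hits / len(gold)
```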
-
The local research scientists I’m working with here in the Greater Yellowstone and I are leaning towards contributing vocalizations from Black-capped Chickadees (BCC) for a few reasons:
1. Plenty of data already captured, with regional dialects identified and tagged; the species is widespread and often found in contexts where citizen scientists can collect vocalizations with cell phones, in a non-captive environment, all season long.
2. Intra-species dialogue data is available.
3. Inter-species threat calls are well studied, with solid evidence for the meaning of certain calls.
4. Vocalization boundaries are fairly well understood… although whether the units are lexemes (with meaning) or phonemes (no meaning) is less understood.
5. Unsupervised term discovery (UTD) could be applied to chick-a-dee-dee-dee-etc. segments to discover unknown words/syllables. In other words, we test the assumption that a common chick-a-dee-dee-dee-etc. call is a sentence (a meaningful segment beyond the word) and look for words within it. Because we know something about the “meaning” of the number of “dees” a BCC uses to indicate the type of threat (e.g. more dees are vocalized for scarier threats like a pygmy owl than a red-tailed hawk), we could then test some of the discovered words against ground truth (see the sketch after this list).
6. Once we have some “words” to work with, we embed those (hopefully) “words” and use (or not) the admittedly small chickadee-English dictionary we have to create an embedding alignment and test some translations against English in practical scenarios, e.g. how does a chickadee call translate to English when you bring them food outside your door versus when your dog walks outside?
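A first sanity check for point 5 might look like the sketch below: take the per-call counts of discovered “dee”-like units and check that they track the known threat-severity ordering. The data and ranks here are invented purely for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical records: (dee-like units found in a call, threat severity rank
# of the stimulus; smaller raptors such as a pygmy owl rank as scarier to a
# chickadee than a red-tailed hawk).
calls = [
    (2, 1),  # red-tailed hawk trial
    (4, 3),  # pygmy owl trial
    (3, 2),
    (5, 3),
    (2, 1),
]
dee_counts = [d for d, _ in calls]
threat_ranks = [r for _, r in calls]
rho, p = spearmanr(dee_counts, threat_ranks)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")  # clearly positive rho would support the segmentation
```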
Then, we want to turn our attention to elk: https://academic.oup.com/jmammal/article/87/6/1072/884332
From: radekosmulski
Great to talk! 🙂
Here is a paper<https://arxiv.org/pdf/1811.00403.pdf> on the completely unsupervised front that I have been looking at recently. It would be nice, though, to come across a paper where establishing word boundaries is a more central theme - I will stay on the lookout 🙂 I have only taken a cursory look into this space thus far.
-
Interestingly, Emily L. Mackevicius<https://www.ncbi.nlm.nih.gov/pubmed/?term=Mackevicius%20EL%5BAuthor%5D&cauthor=true&cauthor_uid=30719973> is a birdsong researcher who has also done a lot of work on the seemingly tougher problem of identifying neural sequences: Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6363393/
-
This short video covers her use of seqNMF and the end of the video shows its application to bird spectrograms: https://www.youtube.com/watch?v=XyWtCtZ_m-8
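For anyone who wants to poke at the idea before reading the paper, below is a bare-bones convolutive NMF sketch (multiplicative updates, Frobenius cost) of the kind seqNMF builds on; it omits the cross-factor penalty that is seqNMF's actual contribution, and all names and sizes are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def shift(A, l):
    """Shift the columns of A by l (right if l > 0, left if l < 0), zero-padding."""
    out = np.zeros_like(A)
    if l == 0:
        out[:] = A
    elif l > 0:
        out[:, l:] = A[:, :-l]
    else:
        out[:, :l] = A[:, -l:]
    return out

def conv_nmf(X, n_factors=3, n_lags=20, n_iter=200, eps=1e-9, seed=0):
    """Factor a spectrogram X (freq x time) as sum_l W[:, :, l] @ shift(H, l),
    i.e. each factor is a short spectrotemporal template that recurs in time."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, n_factors, n_lags))
    H = rng.random((n_factors, T))
    for _ in range(n_iter):
        Xhat = sum(W[:, :, l] @ shift(H, l) for l in range(n_lags))
        # Multiplicative update for the activations H (summed over lags).
        num = sum(W[:, :, l].T @ shift(X, -l) for l in range(n_lags))
        den = sum(W[:, :, l].T @ shift(Xhat, -l) for l in range(n_lags)) + eps
        H *= num / den
        Xhat = sum(W[:, :, l] @ shift(H, l) for l in range(n_lags))
        # Multiplicative update for each lag slice of the templates W.
        for l in range(n_lags):
            W[:, :, l] *= (X @ shift(H, l).T) / (Xhat @ shift(H, l).T + eps)
    return W, H
```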
-
Here is the Google doc we went over yesterday (ESP accessible only). I revised the text and made it available as the readme of the audio-embedding repository. I also linked to the repository from the project readme. I edited and cleaned up the doc; the roadmap is now available as the readme for the project repository.