paper-reading | vernacular.ai

List of papers we cover during our weekly paper reading session. For past and missing links/notes, check out the (private) wiki.

28 August, 2020

Overlapping experiment infrastructure: More, better, faster experimentation

Optimal testing in the experiment-rich regime

21 August, 2020

Bridging Anaphora Resolution as Question Answering

A framework for understanding unintended consequences of machine learning

14 August, 2020

Reformer: The Efficient Transformer

24 July, 2020

Adversarial examples that fool both computer vision and time-limited humans

Language models are few-shot learners

10 July, 2020

DIET: Lightweight Language Understanding for Dialogue Systems

DialogueRNN: An attentive RNN for emotion detection in conversations

3 July, 2020

StarSpace: Embed All The Things!

Learning Asr-Robust Contextualized Embeddings for Spoken Language Understanding

12 June, 2020

Sentence-bert: Sentence embeddings using siamese bert-networks

PyTorch: An imperative style, high-performance deep learning library

What’s Hidden in a Randomly Weighted Neural Network?

Weakly Supervised Attention Networks for Entity Recognition

Improving BERT with Self-Supervised Attention

5 June, 2020

Hierarchical attention networks for document classification

Training classifiers with natural language explanations

ERD’14: entity recognition and disambiguation challenge

Audio adversarial examples: Targeted attacks on speech-to-text

29 May, 2020

Differentiable Reasoning over a Virtual Knowledge Base

Natural tts synthesis by conditioning wavenet on mel spectrogram predictions

15 May, 2020

NBDT: Neural-Backed Decision Trees

Faster Neural Network Training with Data Echoing

Universal Language Model Fine-tuning for Text Classification

8 May, 2020

Designing and Deploying Online Field Experiments

Intelligent Selection of Language Model Training Data

Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead

25 April, 2020

Gmail Smart Compose: Real-Time Assisted Writing

Supervised Learning with Quantum-Inspired Tensor Networks

The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning

Understanding deep learning requires rethinking generalization

17 April, 2020

Zoom In: An Introduction to Circuits

3 April, 2020

Sideways: Depth-Parallel Training of Video Models

Speech2face: Learning the face behind a voice

Dialog Methods for Improved Alphanumeric String Capture

Presents a way to collect alphanumeric strings at the dialog level via an ASR. Two main ideas (the first is sketched after the list):

  1. Skip listing over n-best hypotheses across turns (attempts)
  2. Chunking and confirming pieces one by one
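
A minimal sketch of idea (1), assuming we get one n-best list per user attempt; confirm stands in for the dialog turn that reads a hypothesis back and gets a yes/no, and all names here are hypothetical, not the paper's:

#+begin_src python
def capture_string(nbest_per_attempt, confirm):
    """Return the first confirmed hypothesis, skipping ones already rejected.

    nbest_per_attempt: list of n-best hypothesis lists, one per user attempt.
    confirm: callable that reads a hypothesis back to the user, returns bool.
    """
    rejected = set()
    for nbest in nbest_per_attempt:
        for hypothesis in nbest:
            if hypothesis in rejected:
                continue  # already offered and rejected in an earlier attempt
            if confirm(hypothesis):
                return hypothesis
            rejected.add(hypothesis)
            break  # re-prompt and move on to the user's next attempt
    return None
#+end_src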

28 February, 2020

Self-supervised dialogue learning

The self-supervision signal here comes from a model that tries to predict whether a provided tuple of turns is in order or not. Plugging this in as the discriminator in generative-discriminative dialog systems, they find better results.
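
A minimal sketch of how such a signal can be constructed, assuming a dialogue is just a list of turn strings (the window size and sampling scheme are placeholders, not the paper's setup):

#+begin_src python
import random

def make_order_example(dialogue, window=3):
    """Sample a window of consecutive turns; label 1 if in order, 0 if shuffled.

    Assumes the dialogue has at least `window` distinct turns.
    """
    start = random.randrange(len(dialogue) - window + 1)
    turns = dialogue[start:start + window]
    if random.random() < 0.5:
        return turns, 1                 # kept in the original order
    shuffled = turns[:]
    while shuffled == turns:
        random.shuffle(shuffled)        # misordered negative example
    return shuffled, 0
#+end_src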

7 February, 2020

Learning from Dialogue after Deployment: Feed Yourself, Chatbot!

This is an approach to collecting a supervision signal from deployment data. There are three tasks for the system (a chatbot that ranks candidate responses):

  1. Dialogue. The main task. Given the turns till now, the bot ranks which response to utter.
  2. Satisfaction. Given turns till now, last being user utterance, predict whether the user is satisfied.
  3. Feedback. After asking for feedback from the user, predict user’s response (feedback) based on the turns till now.

The models share weights, mostly between tasks 1 and 3.
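
A minimal sketch (not the paper's exact architecture; sizes and layers are placeholders) of how one shared dialogue-history encoder can serve all three tasks, with the two ranking heads producing vectors to score candidates against:

#+begin_src python
import torch
import torch.nn as nn

class SelfFeedingBot(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # shared by all tasks
        self.dialogue_head = nn.Linear(dim, dim)      # task 1: score responses
        self.satisfaction_head = nn.Linear(dim, 1)    # task 2: satisfied or not
        self.feedback_head = nn.Linear(dim, dim)      # task 3: score feedback

    def forward(self, history_ids):
        _, h = self.encoder(self.embed(history_ids))
        h = h[-1]                                     # final hidden state
        return (self.dialogue_head(h),
                torch.sigmoid(self.satisfaction_head(h)),
                self.feedback_head(h))
#+end_src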

31 January, 2020

Modeling Sequences with Quantum States: A Look Under the Hood

This paper explores a new direction in language modelling. The idea is still to learn the underlying distribution of sequences of characters, but here they do it by learning the quantum analogue of the classical probability distribution function. Unlike the classical case, marginal distributions there carry enough information to reconstruct the joint distribution. This is the central idea of the paper, and is explained in the first half. The second half of the paper explains the theory and implementation of the training algorithm, with a simple example. Future work would be to apply this algorithm to a more complicated example, and even adapt it to variable-length sequences.
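
A sketch of the central object, under the usual matrix-product-state (MPS) parameterisation (generic notation, not necessarily the paper's): sequence probabilities are Born probabilities of a quantum state,

#+begin_src latex
P(x_1, \dots, x_N) = \left| \Psi(x_1, \dots, x_N) \right|^2,
\qquad
\Psi(x_1, \dots, x_N) = \mathbf{v}_L^\top A^{(x_1)} A^{(x_2)} \cdots A^{(x_N)} \mathbf{v}_R ,
#+end_src

with one matrix $A^{(x)}$ per symbol; the marginals (reduced density matrices) of the data are what the training algorithm works from.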

Deep voice 2: Multi-speaker neural text-to-speech

This paper suggests improvements to DeepVoice and Tacotron, and also proposes a way to add trainable speaker embeddings. The speaker embeddings are initialized randomly and trained jointly through backpropagation. The paper lists some patterns that lead to better performance (a sketch follows the list):

  1. Transforming speaker embeddings to the appropriate dimension and form for every place they are added to the model. The transformed speaker embeddings are called site-specific speaker embeddings.
  2. Initializing recurrent layer hidden states with the site-specific speaker embeddings.
  3. Concatenating the site-specific speaker embedding to the input at every timestep of the recurrent layer.
  4. Multiplying layer activations element-wise with the site-specific speaker embeddings.
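
A minimal sketch of patterns 1-3 around a single recurrent layer (dimensions, nonlinearity, and layer choices are placeholders, not the paper's):

#+begin_src python
import torch
import torch.nn as nn

class SpeakerConditionedGRU(nn.Module):
    def __init__(self, n_speakers, spk_dim=16, in_dim=80, hid_dim=128):
        super().__init__()
        self.speaker_embed = nn.Embedding(n_speakers, spk_dim)  # trained jointly
        self.to_init_state = nn.Linear(spk_dim, hid_dim)  # pattern 1: site-specific projection
        self.to_input = nn.Linear(spk_dim, in_dim)        # pattern 1: another site-specific projection
        self.gru = nn.GRU(2 * in_dim, hid_dim, batch_first=True)

    def forward(self, x, speaker_id):            # x: (batch, time, in_dim)
        s = self.speaker_embed(speaker_id)       # global speaker embedding
        h0 = torch.tanh(self.to_init_state(s)).unsqueeze(0)  # pattern 2: init hidden state
        s_in = self.to_input(s).unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.gru(torch.cat([x, s_in], dim=-1), h0)  # pattern 3: concat per timestep
        return out
#+end_src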

A credit assignment compiler for joint prediction

This talks about an API for framing L2S (learning to search) style problems in the style of an imperative program, which allows for two optimizations:

  1. memoization
  2. forced path collapse, i.e., getting losses without rolling out to the last state

The main reduction that happens here is to a cost-sensitive classification problem.
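
A hypothetical sketch of what the imperative style looks like for sequence tagging; search.predict and search.loss only mirror the spirit of the API described in the paper, not its actual interface:

#+begin_src python
def toy_features(word, history):
    # placeholder feature representation: the word plus the previous prediction
    return (word, history[-1] if history else None)

def run_sequence_tagger(search, sentence, gold_tags=None):
    predicted = []
    for i, word in enumerate(sentence):
        oracle = gold_tags[i] if gold_tags is not None else None
        # each predict() call is one search step; the library can memoize
        # shared prefixes and collapse forced paths behind this call
        tag = search.predict(toy_features(word, predicted), oracle=oracle)
        predicted.append(tag)
    if gold_tags is not None:
        # declare the task loss; credit is assigned back to individual predictions
        search.loss(sum(p != g for p, g in zip(predicted, gold_tags)))
    return predicted
#+end_src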

17 January, 2020

Learning language from a large (unannotated) corpus

Introductory paper on the general approach used in the learn project. The idea is to learn various generalizable syntactic and semantic relations from an unannotated corpus. The relations are expressed using graphs sitting on top of link grammar and meaning-text theory (MTT). While the general approach is sketched out decently enough, there are details to be filled in at various steps and experiments to run (as of the writing in 2014).

On another note, the document is a nice read because of the many interesting ways it offers of looking at ideas in language understanding, going from syntax to reasoning via semantics.

10 January, 2020

Parsing English with a link grammar

We came to this via opencog's learn project. It is also a nice perspective-setting read if you are missing a formal introduction to grammars. Overall, a link grammar defines connectors on the left and right sides of each word (combined through conjunctions and disjunctions), which then link up, under certain constraints, to form a sentence. For example, a determiner like "the" carries a D+ connector that links rightward to the D- connector on a noun like "cat".

This specific paper shows the formulation and builds a parser for English, covering many (though not all) linguistic phenomena.

20 December, 2019

Generalized end-to-end loss for speaker verification

This paper is a development over their previous work, the tuple-based end-to-end (TE2E) loss, for speaker verification. They generalize the cosine similarity used in TE2E by building a similarity matrix between a speaker's utterance embeddings and speaker centroids. They suggest two losses in the paper:

  1. Softmax loss
  2. Contrast loss

Both loss functions have two components: one that pulls a speaker's utterances together, and another that pushes the utterances of different speakers apart. Of the two, the contrast loss is the more rigorous.
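
A sketch of the two losses in the paper's (approximate) notation, where $e_{ji}$ is the embedding of utterance $i$ of speaker $j$, $c_k$ is the centroid of speaker $k$, and the similarity matrix is a scaled cosine similarity:

#+begin_src latex
S_{ji,k} = w \cdot \cos(e_{ji}, c_k) + b, \qquad w > 0
% softmax loss: pull e_{ji} towards its own centroid, away from all centroids
L_{\text{softmax}}(e_{ji}) = -S_{ji,j} + \log \sum_{k} \exp(S_{ji,k})
% contrast loss: only the hardest wrong centroid contributes
L_{\text{contrast}}(e_{ji}) = 1 - \sigma(S_{ji,j}) + \max_{k \neq j} \sigma(S_{ji,k})
#+end_src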

13 December, 2019

Towards end-to-end spoken language understanding

This paper talks about developing an end-to-end model for intent recognition from speech. Current systems have several components, like ASR and NLU, each of which introduces errors of its own and degrades the quality of the speech-to-intent pipeline. Experiments were performed with the model on two tasks: speech-to-domain and speech-to-intent. The model's architecture is mostly inspired by end-to-end speech synthesis models. A distinctive feature of the architecture is that they sub-sample after the first GRU layer to reduce the sequence length and to tackle the vanishing gradient problem.
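
A minimal sketch of the sub-sampling trick, keeping every other frame after the first recurrent layer (layer sizes and counts are placeholders, not the paper's):

#+begin_src python
import torch
import torch.nn as nn

class SpeechToIntent(nn.Module):
    def __init__(self, n_mels=40, hid=128, n_intents=10):
        super().__init__()
        self.gru1 = nn.GRU(n_mels, hid, batch_first=True)
        self.gru2 = nn.GRU(hid, hid, batch_first=True)
        self.classifier = nn.Linear(hid, n_intents)

    def forward(self, x):                 # x: (batch, frames, n_mels)
        h, _ = self.gru1(x)
        h = h[:, ::2, :]                  # sub-sample: halve the sequence length
        _, last = self.gru2(h)
        return self.classifier(last[-1])  # intent logits
#+end_src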

Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One

They take a regular classifier, pick out the logits before the softmax, and formulate an energy-based model able to give $P(x, y)$ and $P(x)$. The formulation itself is pretty simple, with the energy function being $E_\theta(x) = -\mathrm{LogSumExp}_y f_\theta(x)[y]$. The final loss sums cross entropy (for the discriminative part) and the negative log likelihood of $P(x)$, approximated using SGLD. Check out the repo here.

Although the learning mechanism is a little fragile and needs work to be generally stable, the results are neat.
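
A minimal sketch of the energy and the two loss terms, assuming logits = f_theta(x) has shape (batch, n_classes); the SGLD sampler that produces the samples behind logits_sampled is omitted:

#+begin_src python
import torch
import torch.nn.functional as F

def energy(logits):
    # E(x) = -LogSumExp_y f(x)[y]; lower energy means higher unnormalised p(x)
    return -torch.logsumexp(logits, dim=-1)

def jem_loss(logits_real, logits_sampled, labels):
    ce = F.cross_entropy(logits_real, labels)        # discriminative p(y|x) term
    # generative p(x) term: push real data to lower energy than SGLD samples
    gen = energy(logits_real).mean() - energy(logits_sampled).mean()
    return ce + gen
#+end_src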

29 November, 2019

Overton: A Data System for Monitoring and Improving Machine-Learned Products

This is more about managing supervision than models. There are three problems they are trying to solve:

  1. Fine grained quality monitoring,
  2. Support for multi-component pipelines, and
  3. Updating supervision

For this, they build easy-to-use abstractions for describing supervision and developing models. They also do a lot of multitask learning and Snorkel-style weak supervision, including the recent slicing abstractions for fine-grained quality control.

While you have to adapt a few pieces to your own case (and scale), Overton is a nice testament to the success of things like weak supervision and higher-level development abstractions in production.

Slice-based learning: A programming model for residual learning in critical data slices

This takes Snorkel's labelling-function idea and uses it to group data instances into slices: segments which are interesting to us from an overall quality perspective. These slicing functions are important not only for identifying and narrowing down to specific kinds of data instances, but also for learning slice-specific representations, which works out as a computationally cheap way (there are other benefits too) of replicating a Mixture-of-Experts style model.

As with labelling functions, slice membership is predicted using noisy heuristics. This membership value, along with slice representations (and slice prediction confidences), helps create the slice-aware representation used for the final task. The appendix has a few good examples of slicing functions.
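
A minimal sketch of what slicing functions look like as plain heuristics (the example fields and heuristics are hypothetical; Snorkel wraps such functions in its slicing decorators):

#+begin_src python
def sf_short_utterance(example):
    # very short transcripts are a slice we care about separately
    return len(example["text"].split()) <= 3

def sf_contains_digits(example):
    # alphanumeric content tends to be harder for the downstream model
    return any(ch.isdigit() for ch in example["text"])

SLICING_FUNCTIONS = [sf_short_utterance, sf_contains_digits]

def slice_membership(example):
    # noisy membership vector, used both for per-slice monitoring and for
    # gating the slice-specific representations
    return [int(sf(example)) for sf in SLICING_FUNCTIONS]
#+end_src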

21 September, 2019

3 August, 2019

27 July, 2019

20 July, 2019

13 July, 2019

6 July, 2019

1 July, 2019

25 June, 2019

15 June, 2019

8 June, 2019

1 June, 2019

  • Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., et al., A Tutorial on Thompson Sampling, Foundations and Trends® in Machine Learning, 11(1), 1–96 (2018). (cite:russo2018tutorial)

18 May, 2019

13 May, 2019