describe the pairing function
breandan committed Oct 16, 2023
1 parent d7fe6a8 commit e57fc58
Showing 3 changed files with 68 additions and 6 deletions.
Binary file modified latex/tacas2023/tacas.pdf
Binary file not shown.
55 changes: 53 additions & 2 deletions latex/tacas2023/tacas.tex
@@ -384,6 +384,7 @@ \subsection{Bar-Hillel Construction}

%To generate edits from it, we can use the same procedure as before, but instead of interleaving $\err\sigma$ with $\varepsilon$ and introducing holes, we simply use $A\big((\_)^{|\err{\sigma}| + d}\big, G^\cap)$.


\subsection{Semiring Algebras}

There are a number of alternate semirings which can be used to solve for $A(\sigma)$. A first approach propagates the values from the bottom-up, while mapping nonterminals to lists of strings. Letting $D = V \rightarrow \mathcal{P}(\Sigma^*)$, we define $\oplus, \otimes: D \times D \rightarrow D$. Initially, we have $p(s: \Sigma) \coloneqq \{v \mid [v \rightarrow s]\in P\}$ and $p(\_) := \bigcup_{s\in \Sigma} p(s)$, then we compute the fixpoint using the algebra:
@@ -417,9 +418,59 @@ \subsection{Semiring Algebras}

In our experiments, we provide a comparison of the performance of the SAT algebra and these two semirings, evaluated on a dataset of Python statements.

\pagebreak\subsection{A Pairing Function for Breadth-Bounded Binary Trees}

The type of all possible trees that can be generated by a CFG in Chomsky Normal Form is identified by a recurrence relation:

\begin{equation}
L(p) = 1 + p L(p) \qquad P(a) = 1 + a L\big(P(a)^2\big)
\end{equation}

The number of binary trees inhabiting a single instance of $\mathbb{T}_2$ is sensitive to the number of nonterminals and rule expansions in the grammar. To compute the total number of trees with breadth $n$, we intersect the CFG with the regular language $\Sigma^n$, i.e., $\mathcal{L}(G^\cap) := \mathcal{L}(\mathcal{G}) \cap \Sigma^n$, abstractly parse the string consisting entirely of holes, let $T_S := M_\infty[0, n, S]$, and compute the total number of trees $|T_S|$ using the following recurrence:

% val totalTrees: BigInteger by lazy {
% if (branches.isEmpty()) BigInteger.ONE
% else branches.map { (l, r) -> l.totalTrees * r.totalTrees }
% .reduce { acc, it -> acc + it }
% }

\begin{equation}
|T: \mathbb{T}_2| = \begin{cases}
1 & \text{if $T$ is a leaf,} \\
\sum_{\langle T_1, T_2\rangle \in \texttt{children}(T)} |T_1| \cdot |T_2| & \text{otherwise.}
\end{cases}
\end{equation}
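
As a small illustration, consider a hypothetical forest (chosen only for exposition, not drawn from any grammar above) in which $\texttt{children}(T) = [\langle A, U\rangle, \langle B, W\rangle]$, where $A$ and $B$ are leaves and $U$ and $W$ each have two branches consisting of pairs of leaves. Applying the recurrence bottom-up gives:

\begin{equation}
|U| = |W| = 1 \cdot 1 + 1 \cdot 1 = 2, \qquad |T| = |A| \cdot |U| + |B| \cdot |W| = 1 \cdot 2 + 1 \cdot 2 = 4.
\end{equation}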

To sample all trees uniformly without replacement, we define a pairing function $\varphi^{-1}: \mathbb{T}_2 \rightarrow \mathbb{Z}_{|T|} \rightarrow \texttt{BTree} \times \mathbb{N}$, which returns the decoded tree together with the unconsumed quotient, as follows:

%private fun decodeTree(i: BigInteger): Pair<Tree, BigInteger> {
% if (branches.isEmpty()) return Tree(root) to i
% val (quotient1, remainder) =
% i.div(branches.size) to i.mod(branches.size.toBigInteger())
% val (lb, rb) = shuffledBranches[remainder.toString().toInt()]
% val (l, quotient2) = lb.decodeTree(quotient1)
% val (r, quotient3) = rb.decodeTree(quotient2)
% val concat = Tree(root, children = arrayOf(l, r))
% return concat to quotient3
%}

\begin{equation}
\varphi^{-1}(T: \mathbb{T}_2, i: \mathbb{Z}_{|T|}) := \begin{cases}
\Big\langle\texttt{BTree}\big(\pi_1(T)\big), i\Big\rangle & \text{if $T$ is a leaf,} \vspace{5pt}\\
\text{Let } b = |\texttt{children}(T)|,\\
\phantom{\text{Let }} q_1, r=\big\langle\lfloor\frac{i}{b}\rfloor, i \pmod{b}\big\rangle,\\
\phantom{\text{Let }} lb, rb = \texttt{children}(T)[r],\\
\phantom{\text{Let }} T_1, q_2 = \varphi^{-1}(lb, q_1),\\
\phantom{\text{Let }} T_2, q_3 = \varphi^{-1}(rb, q_2) \text{ in } \\
\Big\langle\texttt{BTree}\big(\pi_1(T), T_1, T_2\big), q_3\Big\rangle & \text{otherwise.} \\
\end{cases}
\end{equation}
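
For example, take a hypothetical forest (an assumption made purely for illustration) with $\texttt{children}(T) = [\langle A, U\rangle, \langle B, W\rangle]$, where $A, B$ are leaves and $U, W$ each have two branches consisting of pairs of leaves, so that $|T| = 4$ and $b = 2$ at the root. Writing $u_r$ and $w_r$ for the trees decoded from the $r$-th branch of $U$ and $W$ (indexed from zero), and $A, B$ for the corresponding leaf trees, unrolling the definition for each $i \in \mathbb{Z}_4$ gives:

\begin{equation}
\begin{aligned}
i = 0 &: \; q_1 = 0,\ r = 0 \;\mapsto\; \texttt{BTree}\big(\pi_1(T), A, u_0\big) \\
i = 1 &: \; q_1 = 0,\ r = 1 \;\mapsto\; \texttt{BTree}\big(\pi_1(T), B, w_0\big) \\
i = 2 &: \; q_1 = 1,\ r = 0 \;\mapsto\; \texttt{BTree}\big(\pi_1(T), A, u_1\big) \\
i = 3 &: \; q_1 = 1,\ r = 1 \;\mapsto\; \texttt{BTree}\big(\pi_1(T), B, w_1\big)
\end{aligned}
\end{equation}

In each case the left leaf passes $q_1$ through unchanged, so the right subtree receives $q_2 = q_1$ and selects its branch $q_1 \bmod 2$. The four indices therefore decode to four distinct trees. Both branches of $T$ have equal counts in this example; when the counts differ, the quotient passed down need not range over exactly the subtree's own indices, so this balanced case is only the simplest one to verify by hand.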

Then, instead of sampling trees directly, we can sample integers uniformly without replacement from $\mathbb{Z}_{|T|}$ using the construction defined in Section~\ref{sec:dsi}, and lazily decode them into trees.

\subsection{Complexity}

Let us consider some loose bounds on the the complexity of BCFLR. To do, we first consider the complexity of computing language-edit distance, which is a lower-bound on BCFLR complexity.
Let us consider some loose bounds on the complexity of BCFLR. To do so, we first consider the complexity of computing language edit distance, which is a lower bound on BCFLR complexity.

\begin{definition}
Language edit distance (LED) is the problem of computing the minimum number of edits required to transform an invalid string into a valid one, where validity is defined as containment in a context-free language, $\ell: \mathcal{L}$, i.e., $\Delta^*(\err{\sigma}, \ell) \coloneqq \min_{\sigma \in \ell}\Delta(\err{\sigma}, \sigma)$, and $\Delta$ is the Levenshtein distance. LED is known to have subcubic complexity~\cite{bringmann2019truly}.
@@ -456,7 +507,7 @@ \subsection{Complexity}

%In practice, this problem is ill-posed even when $q = \Delta^*(\err{\sigma}, \ell) \approx 1$. For example, consider the language of ursine dietary preferences. Although $\err{\sigma}\coloneqq$ ``Bears like to eat plastic'' is not a valid sentence, e.g., $\tilde{\sigma}\coloneqq$``Bears like to eat'' is $(\Delta^*=1)$, however there are many others with roughly the same edit distance, e.g., ``Bears like to eat \{\hlorange{berries}, \hlorange{honey}, \hlorange{fish}\}'', or ``\{\hlgreen{Polar}, \hlgreen{Panda}\} bears like to eat \{\hlgreen{seals}, \hlgreen{bamboo}\}''. In general, there are usually many strings nearby $\err{\sigma}$, and we seek to find those among them which are both syntactically valid and semantically plausible as quickly as possible.

\subsection{Sampling the Levenshtein ball without replacement in $\mathcal{O}(1)$}\label{sec:dsi}
\pagebreak\subsection{Sampling the Levenshtein ball without replacement in $\mathcal{O}(1)$}\label{sec:dsi}

Now that we have a reliable method to synthesize admissible completions for strings containing holes, i.e., fix \textit{localized} errors, $F: (\mathcal{G} \times \underline\Sigma^n) \rightarrow \{\Sigma^n\}\subseteq \mathcal{L}(\mathcal{G})$, how can we use $F$ to repair some unparseable string, i.e., $\err{\sigma_1\ldots\:\sigma_n}: \Sigma^n \cap\mathcal{L}(\mathcal{G})^\complement$ where the holes' locations are unknown? Three questions stand out in particular: how many holes are needed to repair the string, where should we put those holes, and how ought we fill them to obtain a parseable $\tilde{\sigma} \in \mathcal{L}(\mathcal{G})$?

@@ -33,21 +33,32 @@ class PTree(val root: String = "ε", val branches: List<Π2A<PTree>> = listOf())

fun choose(): Sequence<String> = choice.asSequence()

private fun decode(i: BigInteger): Pair<String, BigInteger> {
private fun decodeString(i: BigInteger): Pair<String, BigInteger> {
if (branches.isEmpty()) return (if ("ε" in root) "" else root) to i
val (quotient1, remainder) =
i.div(branches.size) to i.mod(branches.size.toBigInteger())
val (lb, rb) = shuffledBranches[remainder.toString().toInt()]
val (l, quotient2) = lb.decode(quotient1)
val (r, quotient3) = rb.decode(quotient2)
val (l, quotient2) = lb.decodeString(quotient1)
val (r, quotient3) = rb.decodeString(quotient2)
val concat = (if(l.isEmpty()) r else if(r.isEmpty()) l else "$l $r")
return concat to quotient3
}

private fun decodeTree(i: BigInteger): Pair<Tree, BigInteger> {
if (branches.isEmpty()) return Tree(root) to i
val (quotient1, remainder) =
i.div(branches.size) to i.mod(branches.size.toBigInteger())
val (lb, rb) = shuffledBranches[remainder.toString().toInt()]
val (l, quotient2) = lb.decodeTree(quotient1)
val (r, quotient3) = rb.decodeTree(quotient2)
val concat = Tree(root, children = arrayOf(l, r))
return concat to quotient3
}

fun sampleWithoutReplacement(): Sequence<String> = sequence {
println("Total trees in PTree: $totalTrees")
var i = BigInteger.ZERO
while (i < totalTrees) yield(decode(i++).first)
while (i < totalTrees) yield(decodeString(i++).first)
}

// Samples instantaneously from the parse forest, but may return duplicates
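
The counting and decoding logic in this hunk can be tried in isolation with the minimal sketch below. It is not the repository's code: `Forest`, `Node`, `leaf`, `exampleForest`, and `decodeAll` are hypothetical stand-ins for `PTree` and `Tree`, it assumes `java.math.BigInteger`, and it indexes `branches` directly rather than through `shuffledBranches`.

```kotlin
import java.math.BigInteger

// Hypothetical stand-in for the repository's Tree class.
data class Node(val label: String, val children: List<Node> = emptyList())

// Hypothetical stand-in for PTree: a label plus a list of (left, right) alternatives.
class Forest(val root: String, val branches: List<Pair<Forest, Forest>> = emptyList()) {
  // A leaf holds one tree; otherwise sum, over branches, the product of the child counts.
  val totalTrees: BigInteger by lazy {
    if (branches.isEmpty()) BigInteger.ONE
    else branches.map { (l, r) -> l.totalTrees * r.totalTrees }
      .reduce { acc, x -> acc + x }
  }

  // Peel off one base-|branches| digit to pick a branch, then thread the
  // remaining quotient through the left and right subforests in turn.
  fun decode(i: BigInteger): Pair<Node, BigInteger> {
    if (branches.isEmpty()) return Node(root) to i
    val b = branches.size.toBigInteger()
    val (lb, rb) = branches[i.mod(b).toInt()]
    val (l, q2) = lb.decode(i / b)
    val (r, q3) = rb.decode(q2)
    return Node(root, listOf(l, r)) to q3
  }
}

fun leaf(label: String) = Forest(label)

// A balanced example forest: both branches of S contain exactly two trees.
fun exampleForest(): Forest {
  val u = Forest("U", listOf(leaf("a") to leaf("b"), leaf("c") to leaf("d")))
  val w = Forest("W", listOf(leaf("e") to leaf("f"), leaf("g") to leaf("h")))
  return Forest("S", listOf(leaf("A") to u, leaf("B") to w))
}

// Decodes every index in [0, totalTrees) in increasing order.
fun decodeAll(f: Forest): List<Node> =
  generateSequence(BigInteger.ZERO) { it + BigInteger.ONE }
    .takeWhile { it < f.totalTrees }
    .map { f.decode(it).first }
    .toList()
```

Calling `decodeAll(exampleForest())` enumerates the forest once per index. The example is deliberately balanced, since that is the case in which each index can be checked by hand against the branch digits.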