[2019] Compound Probabilistic Context-Free Grammars for Grammar Induction
One-line summary:
Proposes compound probabilistic context-free grammars (PCFGs) for grammar induction.
Short summary (Abstract):
The study focuses on a novel approach to grammar induction by formalizing sentences as being generated by compound probabilistic context-free grammars. Unlike traditional methods that learn a single stochastic grammar, this new method incorporates per-sentence continuous latent variables to modulate the probabilities of context-free rules. This allows for marginal dependencies beyond those typically assumed in standard context-free grammars.
Inference is performed using collapsed variational inference, where an amortized variational posterior is placed on the continuous variable and the latent trees are marginalized with dynamic programming. The effectiveness of this approach is demonstrated through experiments on English and Chinese, comparing against recent state-of-the-art methods that induce grammars from words with neural language models.
* Useful sentences: vocabulary notes
Methodology
The main objects are probabilistic context-free grammars (PCFGs) and their parameterization. A PCFG consists of a grammar G together with rule probabilities ( \pi = \{\pi_r\}_{r \in \mathcal{R}} ), under which ( p_\pi(t) ) is the probability of a parse tree t and ( p_\pi(x) ) is the probability of a string x.
The standard parameterization associates each nonterminal with a categorical distribution over its rules, but this can be difficult to learn from natural language data. The paper instead proposes a neural parameterization in which rule probabilities are computed from distributed representations: each symbol on the left-hand side of a rule is associated with an embedding.
For each rule type r:
- ( \pi_{S \to A} = \frac{\exp(u_A^\top f_1(w_S))}{\sum_{A' \in \mathcal{N}} \exp(u_{A'}^\top f_1(w_S))} )
- ( \pi_{A \to BC} = \frac{\exp(u_{BC}^\top w_A)}{\sum_{B'C' \in \mathcal{M}} \exp(u_{B'C'}^\top w_A)} )
- ( \pi_{T \to w} = \frac{\exp(u_w^\top f_2(w_T))}{\sum_{w' \in \Sigma} \exp(u_{w'}^\top f_2(w_T))} )

where ( \mathcal{M} = (\mathcal{N} \cup \mathcal{P}) \times (\mathcal{N} \cup \mathcal{P}) ) and ( f_1, f_2 ) are MLPs with two residual layers. The set of input symbol embeddings for grammar G is ( E_G = \{ w_N \mid N \in \{S\} \cup \mathcal{N} \cup \mathcal{P} \} ), and λ denotes the parameters of the neural network.
The neural parameterization maintains the underlying probabilistic assumptions but allows shared, distributed representations which can be beneficial in unsupervised parsing.
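Below is a minimal sketch (not the authors' code) of this parameterization in PyTorch: rule probabilities are softmaxes over dot-product scores between symbol embeddings. The symbol counts and dimensions are hypothetical, and plain two-layer MLPs stand in for the paper's residual MLPs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralPCFGRules(nn.Module):
    """Rule probabilities as softmaxes over embedding dot products (hypothetical sizes)."""
    def __init__(self, n_nonterm=30, n_preterm=60, vocab=10000, dim=256):
        super().__init__()
        total = n_nonterm + n_preterm                               # |N ∪ P|
        self.w_root = nn.Parameter(torch.randn(dim))                # w_S
        self.w_nt = nn.Parameter(torch.randn(n_nonterm, dim))       # w_A, A ∈ N
        self.w_pt = nn.Parameter(torch.randn(n_preterm, dim))       # w_T, T ∈ P
        self.u_nt = nn.Parameter(torch.randn(n_nonterm, dim))       # u_A
        self.u_bc = nn.Parameter(torch.randn(total * total, dim))   # u_BC, BC ∈ M
        self.u_w = nn.Parameter(torch.randn(vocab, dim))            # u_w, w ∈ Σ
        # f1, f2: simplified two-layer MLPs standing in for the paper's residual MLPs
        self.f1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.f2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self):
        # log π_{S→A}: softmax over nonterminals A
        log_pi_root = F.log_softmax(self.u_nt @ self.f1(self.w_root), dim=0)
        # log π_{A→BC}: for each A, softmax over all child pairs BC ∈ (N∪P)×(N∪P)
        log_pi_binary = F.log_softmax(self.w_nt @ self.u_bc.T, dim=1)
        # log π_{T→w}: for each preterminal T, softmax over the vocabulary Σ
        log_pi_term = F.log_softmax(self.f2(self.w_pt) @ self.u_w.T, dim=1)
        return log_pi_root, log_pi_binary, log_pi_term
```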
The paper then introduces compound probabilistic context-free grammars (compound PCFGs) and their application to grammar induction.
There are ten key points:
- Compound PCFG: A probabilistic context-free grammar whose rule probabilities depend on an additional per-sentence latent variable ( z ). The distribution over trees arises from first sampling the rule probabilities, then generating a tree/sentence using the sampled probabilities (a minimal sketch of this generative process follows the list).
- Generative Process:
- Rule probabilities are obtained via ( z \sim p_{\gamma}(z) ), where ( p_{\gamma} ) is a prior with parameters γ (a spherical Gaussian in this paper).
- The rule probability function ( f_{\lambda} ) concatenates the input symbol embeddings with ( z ) and outputs sentence-level rule probabilities.
- Context-Free Assumptions:
- Conditioned on ( z ), context-free assumptions hold, meaning that grammatical structures can be generated without considering dependencies beyond direct parent-child relationships.
- Marginally (i.e., with ( z ) integrated out), these assumptions do not hold because of the dependence path through ( z ).
- Marginal Distribution:
- The marginal distribution over parse trees is ( p_{\theta}(t) = \int p_{\theta}(t \mid z)\, p_{\gamma}(z)\, dz ), where ( p_{\theta}(t \mid z) = \prod_{r \in t_{\mathcal{R}}} \pi_{z, r} ) and ( t_{\mathcal{R}} ) is the set of rules used in t.
- Expressiveness:
- Compound PCFGs are more expressive than traditional PCFGs because each sentence has its own set of rule probabilities.
- Latent Tree Structures:
- Despite assuming a tree-based generative process, compound PCFGs allow learning latent tree structures.
- Motivation for Compound PCFG:
- The motivation is based on the observation that first-order context-free assumptions are often used not because they represent an adequate model of natural language but because they allow for tractable training.
- Comparison with Higher-Order PCFGs:
- Higher-order PCFGs can introduce dependencies between children and ancestors/siblings through techniques like vertical/horizontal Markovization, but these complicate training due to the rapid increase in rule numbers.
- Bayesian Interpretation:
- The compound PCFG can be interpreted as a restricted version of some higher-order PCFG where a child depends on its ancestors and siblings through a shared latent vector.
- Application in Grammar Induction:
- This dependence among siblings is hypothesized to be especially useful in grammar induction from words, enhancing the ability to predict grammatical structures based on context.
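The sketch below illustrates the compound PCFG generative process described above: a per-sentence ( z ) is drawn from a spherical Gaussian prior and concatenated with the symbol embeddings to produce sentence-specific rule probabilities. This is a simplified illustration rather than the authors' implementation; the linear score heads and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompoundRules(nn.Module):
    """Sentence-specific rule probabilities conditioned on a latent vector z (hypothetical sizes)."""
    def __init__(self, n_nonterm=30, n_preterm=60, vocab=10000, dim=256, z_dim=64):
        super().__init__()
        total = n_nonterm + n_preterm
        self.w_root = nn.Parameter(torch.randn(dim))               # w_S
        self.w_nt = nn.Parameter(torch.randn(n_nonterm, dim))      # w_A, A ∈ N
        self.w_pt = nn.Parameter(torch.randn(n_preterm, dim))      # w_T, T ∈ P
        # simplified linear score heads over [symbol embedding; z]
        self.root_head = nn.Linear(dim + z_dim, n_nonterm)         # scores for S → A
        self.binary_head = nn.Linear(dim + z_dim, total * total)   # scores for A → BC
        self.term_head = nn.Linear(dim + z_dim, vocab)             # scores for T → w

    def forward(self, z):                       # z: (z_dim,), one sample per sentence
        def with_z(emb):                        # concatenate z onto each symbol embedding
            if emb.dim() == 1:
                return torch.cat([emb, z], dim=-1)
            zz = z.unsqueeze(0).expand(emb.size(0), -1)
            return torch.cat([emb, zz], dim=-1)
        log_pi_root = F.log_softmax(self.root_head(with_z(self.w_root)), dim=-1)
        log_pi_binary = F.log_softmax(self.binary_head(with_z(self.w_nt)), dim=-1)
        log_pi_term = F.log_softmax(self.term_head(with_z(self.w_pt)), dim=-1)
        return log_pi_root, log_pi_binary, log_pi_term

# Generative process for one sentence: sample z first, then a tree from π_z.
rules = CompoundRules()
z = torch.randn(64)          # z ~ p_γ(z), a spherical Gaussian prior
log_pi_z = rules(z)          # sentence-specific rule log-probabilities π_z
```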
Additional information about compound PCFGs:
Expressiveness and Challenge:
- Expressiveness: As noted above, compound PCFGs are more expressive than traditional PCFGs because each sentence has its own set of rule probabilities.
- Challenge in Learning and Inference: The expressivity comes with significant challenges, particularly in learning the parameters.
Log Marginal Likelihood:
- For a neural PCFG, the log marginal likelihood can be computed using the inside algorithm, which is differentiable and suitable for gradient-based optimization.
- For compound PCFGs, the log marginal likelihood additionally involves an integral over the continuous latent variable ( z ), which makes it intractable. However, conditioning on ( z ) keeps the inner summation over trees tractable via the inside algorithm (sketched below).
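As a concrete reference for that inner summation, here is a minimal sketch of the inside algorithm for a grammar in Chomsky normal form, written in log space with NumPy/SciPy. The array shapes and the symbol-indexing convention are assumptions made for this sketch.

```python
import numpy as np
from scipy.special import logsumexp

def inside_log_prob(log_root, log_binary, log_term, sent_ids):
    """
    log_root:   (NT,)              log π_{S→A}
    log_binary: (NT, NT+PT, NT+PT) log π_{A→BC} over child-pair indices
    log_term:   (PT, V)            log π_{T→w}
    sent_ids:   list of word ids, length n (assumes n >= 2)
    Returns log p(x) by summing over all binary trees (inside algorithm).
    """
    NT, PT, n = log_root.shape[0], log_term.shape[0], len(sent_ids)
    # beta[i, j, A] = log inside score of symbol A over words i..j (inclusive);
    # indices < NT are nonterminals, indices >= NT are preterminals.
    beta = np.full((n, n, NT + PT), -1e30)
    for i, w in enumerate(sent_ids):                 # width-1 spans: preterminal rules
        beta[i, i, NT:] = log_term[:, w]
    for width in range(2, n + 1):                    # wider spans: binary rules
        for i in range(n - width + 1):
            j = i + width - 1
            scores = []
            for k in range(i, j):                    # split point between i..k and k+1..j
                left, right = beta[i, k], beta[k + 1, j]
                # log π_{A→BC} + inside(B, i..k) + inside(C, k+1..j)
                scores.append(log_binary + left[None, :, None] + right[None, None, :])
            beta[i, j, :NT] = logsumexp(np.stack(scores), axis=(0, 2, 3))
    return logsumexp(log_root + beta[0, n - 1, :NT])  # marginalize over the root rule
```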
Collapsed Amortized Variational Inference:
- To handle the intractability, collapsed amortized variational inference is used.
- Sample ( z ) from a variational posterior distribution (using an amortized inference network).
- Perform inner marginalization conditioned on this sample.
- Calculate the evidence lower bound (ELBO) using: [ \mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}[q_\phi(z|x) \,\|\, p_\gamma(z)] ]
- Use the inside algorithm to compute ( p_\theta(x \mid z) ) (a sketch of the full ELBO step appears after this list).
Variational Family:
- The variational family is a diagonal Gaussian, with mean and log-variance vectors given by an affine layer over max-pooled hidden states from an LSTM.
Gradient Estimation:
- Low-variance estimators for the gradient of ELBO are obtained using the reparameterization trick for expected reconstruction likelihood and analytical expression for the KL term.
Bayesian Interpretation:
- Under a Bayesian PCFG view, this approach can be seen as empirical Bayes (Robbins, 1956).
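Putting the pieces above together, the following sketch shows one ELBO evaluation under the stated assumptions: an LSTM encoder produces a diagonal Gaussian posterior, ( z ) is drawn with the reparameterization trick, ( \log p_\theta(x \mid z) ) comes from the inside algorithm, and the KL to the spherical Gaussian prior is computed in closed form. `inside_fn` is a hypothetical placeholder for a differentiable (e.g., PyTorch) version of the inside sketch above, and `rules` refers to the `CompoundRules` sketch; dimensions are assumptions.

```python
import torch
import torch.nn as nn

class InferenceNetwork(nn.Module):
    """q_φ(z|x): diagonal Gaussian from affine layers over max-pooled LSTM states."""
    def __init__(self, vocab=10000, emb=256, hidden=256, z_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)

    def forward(self, word_ids):                    # word_ids: (1, n) tensor
        h, _ = self.lstm(self.emb(word_ids))        # (1, n, hidden)
        pooled = h.max(dim=1).values                # max-pool over time
        return self.to_mu(pooled), self.to_logvar(pooled)

def elbo(word_ids, enc, rules, inside_fn):
    """inside_fn(log_pi, word_ids) must return a differentiable log p_θ(x|z)."""
    mu, logvar = enc(word_ids)                                  # q_φ(z|x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # reparameterization trick
    log_pi = rules(z.squeeze(0))                                # sentence-specific rule log-probs
    log_px_given_z = inside_fn(log_pi, word_ids)                # inner marginalization over trees
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum()    # KL[q_φ(z|x) || N(0, I)], closed form
    return log_px_given_z - kl                                  # single-sample ELBO estimate
```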
MAP Inference
After training:
- Objective: Compare learned trees against an annotated treebank by inferring the most likely tree given a sentence.
- For neural PCFGs: Use the Viterbi version of the inside algorithm (i.e., the CKY algorithm).
- For compound PCFGs: The argmax is intractable, so it is estimated with [ \arg\max_t \int p_\theta(t \mid x, z)\, p_\theta(z \mid x)\, dz \approx \arg\max_t p_\theta(t \mid x, \mu_\phi(x)) ], i.e., by conditioning on the mean vector of the variational posterior and running the same Viterbi/CKY procedure (sketched below).
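For reference, here is a minimal sketch of the Viterbi (CKY) variant: the same dynamic program as the inside algorithm, with max in place of sum and backpointers so the best binary tree's spans can be recovered. Shapes follow the inside sketch above and are assumptions of this illustration.

```python
import numpy as np

def viterbi_spans(log_root, log_binary, log_term, sent_ids):
    """Same shapes as the inside sketch (assumes n >= 2); returns spans of the best binary tree."""
    NT, PT, n = log_root.shape[0], log_term.shape[0], len(sent_ids)
    best = np.full((n, n, NT + PT), -1e30)
    back = {}                                           # (i, j, A) -> (split k, child B, child C)
    for i, w in enumerate(sent_ids):
        best[i, i, NT:] = log_term[:, w]
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width - 1
            for k in range(i, j):
                cand = log_binary + best[i, k][None, :, None] + best[k + 1, j][None, None, :]
                for A in range(NT):
                    b, c = np.unravel_index(cand[A].argmax(), cand[A].shape)
                    if cand[A, b, c] > best[i, j, A]:
                        best[i, j, A] = cand[A, b, c]
                        back[(i, j, A)] = (k, b, c)
    root = int(np.argmax(log_root + best[0, n - 1, :NT]))
    spans = []
    def walk(i, j, A):                                  # collect predicted constituent spans
        spans.append((i, j))
        if (i, j, A) in back:
            k, b, c = back[(i, j, A)]
            walk(i, k, b)
            walk(k + 1, j, c)
    walk(0, n - 1, root)
    return spans
```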
Overall, compound PCFGs provide a flexible framework for modeling natural language syntax that can capture complex dependencies while maintaining computational tractability.
This section has covered their expressiveness, the challenges in inference and learning, variational methods for handling intractability, and MAP inference techniques.
- The key features of the proposed method are:
- An LSTM network is used to capture sequential dependencies in sentences.
- The LSTM hidden states are max-pooled and passed through an affine layer to produce the parameters of the variational posterior over the per-sentence latent variable ( z ).
- Xavier uniform initialization is employed for parameter initialization.
- Adam optimizer with specific hyperparameters (β1 = 0.75, β2 = 0.999) and a learning rate of 0.001 is used for training.
- A curriculum learning strategy gradually increases the sentence-length limit during training (see the training sketch after this list).
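A minimal training-loop sketch consistent with the hyperparameters listed above (Xavier-uniform initialization, Adam with β1 = 0.75, β2 = 0.999, learning rate 0.001, and a length-based curriculum). The curriculum schedule defaults, `corpus`, and `compute_loss` are hypothetical placeholders, not the authors' settings.

```python
import torch
import torch.nn as nn

def init_weights(module):
    for p in module.parameters():            # Xavier-uniform init for weight matrices
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

def train(model, corpus, compute_loss, epochs=15, start_len=10, step=5):
    """corpus: list of sentences (token-id lists); compute_loss(sent): returns -ELBO for one sentence.
    The epochs/start_len/step defaults are hypothetical, not the paper's settings."""
    init_weights(model)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.75, 0.999))
    max_len = start_len                       # curriculum: begin with short sentences only
    for _ in range(epochs):
        for sent in corpus:
            if len(sent) > max_len:
                continue                      # skip sentences over the current length limit
            loss = compute_loss(sent)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        max_len += step                       # gradually raise the length limit
```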
- The main contributions of this paper are:
- A new method for unsupervised grammar induction based on compound PCFGs: a neural parameterization with a per-sentence continuous latent variable, trained with collapsed amortized variational inference.
- Competitive performance with existing strong baselines in unsupervised parsing tasks.
- Insights into the challenges and potential improvements in unsupervised structure learning methods.
Results
The authors compare their approach against two recent strong baselines: the Parsing-Reading-Predict Network (PRPN) and Ordered Neurons (ON). Both are LSTM-based language models designed to induce syntactic structure.
Key points about the comparison:
- The hyperparameters of PRPN/ON were tuned using validation F1 scores to ensure fair comparisons.
- The authors discard trivial spans (width-one and sentence-level spans) when evaluating sentence-level F1, following recent practice in unsupervised structure learning (a sketch of this evaluation follows the list).
- Despite being labeled as “unsupervised,” the PTB results include some hyperparameter tuning based on F1 against validation trees.
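A minimal sketch of the evaluation convention described above: unlabeled sentence-level F1 over constituent spans, with width-one and whole-sentence spans discarded before scoring. The handling of sentences with no remaining spans is an assumption of this sketch.

```python
def sentence_f1(pred_spans, gold_spans, sent_len):
    """Unlabeled F1 over constituent spans, given as (start, end) inclusive, 0-indexed."""
    trivial = {(i, i) for i in range(sent_len)} | {(0, sent_len - 1)}
    pred = set(pred_spans) - trivial          # discard width-one and whole-sentence spans
    gold = set(gold_spans) - trivial
    if not pred and not gold:
        return 1.0                            # convention assumed here: nothing left to predict
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Corpus-level sentence-level F1: average sentence_f1 over all sentences.
```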
The authors compare their models (the neural PCFG and the compound PCFG) to various baselines on the Penn Treebank (PTB) and the Chinese Treebank (CTB). Their main findings are:
- Performance: All models outperform right branching baselines, with neural PCFGs and compound PCFGs being particularly strong.
- Compound PCFG Outperformance: The compound PCFG model significantly outperforms other methods on both English and Chinese datasets.
- Traditional PCFG Limitations: Traditional PCFGs using scalar parameterization were unable to induce meaningful grammars, indicating optimization issues.
The authors also analyze the learned tree structures, comparing their similarity against gold trees, left branching trees, right branching trees, and self-trees (averaged over all pairs). They find that PRPN is particularly consistent across multiple runs. Additionally, they observe differences in label recall among different models, suggesting potential for an ensemble approach.
The results from training RNNGs on induced trees from various models are presented in Table 3. For perplexity, the RNNGs trained on induced trees (Induced RNNG) do not improve upon an LSTM language model, while the supervised RNNG does outperform it. However, for grammaticality judgment, the RNNG trained with compound PCFG trees outperforms the LSTM LM despite obtaining worse perplexity and performs on par with the RNNG trained on gold trees.
Fine-tuning with the URNNG results in improvements in both perplexity and grammaticality judgment across all models (Induced URNNG). We also obtain large improvements on unsupervised parsing as measured by F1, with the fine-tuned URNNGs outperforming their respective original models. This may be due to an ensembling effect between the original model and the URNNG’s structured inference network.
The alignment of induced nonterminals ordered based on predicted frequency is shown in Figure 2. For each nonterminal, we visualize the proportion of correctly-predicted constituents that correspond to particular gold labels. The precision (i.e., probability of correctly predicting unlabeled constituents) is also shown in the rightmost column for reference.
Based on Table 4 and the subsequent analysis, here are some key observations about the continuous latent space:
- Topical Information Capture: The latent space appears to effectively capture topical information from sentences.
- Variation Due to z: To investigate variation due to ( z ) while holding the tree structure fixed, pairs of the form ( (\mu_{\phi}(x^{(n)}), t_n^j) ) were collected, where ( t_n^j ) is the j-th subtree of the (approximate) MAP tree for the n-th sentence.
- Subtree Analysis: For frequently occurring subtrees, PCA was performed on the set of mean vectors associated with each subtree to obtain the top principal components, and the constituents contributing most positively or negatively were identified (see the sketch after these observations).
- Regular Variation in Leaves: A common subtree associated with 180 unique constituents showed that the leaves forming a 6-word constituent varied regularly as ( z ) changed.
- Root Alignment: The root of this subtree (NT-04) aligned to prepositional phrases (PP), and the leaves identified were mostly PPs.
- Neural PCFG/HMM Performance: A neural PCFG/HMM achieved 68.2% accuracy, while a simpler HMM obtained 63.4%.
- Model Limitations: The model failed to identify certain subtrees as constituents in specific cases, highlighting potential limitations in capturing complex structures.
These observations suggest that the continuous latent space is effective at representing sentence structure and topical information, with variations due to (z) contributing meaningful insights into constituent variation.
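A minimal sketch of the kind of subtree analysis described above: given the variational mean vectors of sentences whose MAP trees contain a particular frequent subtree, run PCA and inspect the constituents at the extremes of the first principal component. The `pairs` input format and the scikit-learn usage are assumptions of this illustration, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA

def extreme_constituents(pairs, top_k=5):
    """pairs: list of (mean_vector, constituent_string) for occurrences of one frequent subtree."""
    vectors = np.stack([mu for mu, _ in pairs])                 # (num_occurrences, z_dim)
    constituents = [c for _, c in pairs]
    scores = PCA(n_components=1).fit_transform(vectors)[:, 0]   # projection onto the first PC
    order = np.argsort(scores)
    lowest = [constituents[i] for i in order[:top_k]]           # most negative along PC1
    highest = [constituents[i] for i in order[-top_k:]]         # most positive along PC1
    return lowest, highest
```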
In summary, fine-tuning RNNGs with URNNGs leads to significant improvements in both perplexity and grammaticality judgment, as well as better performance on unsupervised parsing tasks.
Example
This is an example of a hypothetical scenario involving sentence parsing and latent-space representation.
- Dataset: Assume we have a dataset of sentences where each sentence is parsed into its constituent parts (e.g., using a PCFG/HMM).
- Training / Test Data:
  - Sentence: "The cat sat on the mat."
  - Parsed Tree:
    [ROOT]
    └── [S]
        ├── [NP]                  # Noun Phrase
        │   ├── [DT] "the"        # Determiner
        │   └── [NN] "cat"        # Noun
        └── [VP]                  # Verb Phrase
            ├── [VBD] "sat"       # Verb, past tense
            └── [PP]              # Prepositional Phrase
                ├── [IN] "on"     # Preposition or subordinating conjunction
                └── [NP]
                    ├── [DT] "the"
                    └── [NN] "mat"
  - Latent Space Representation: ( \mu_{\phi}(x^{(n)}) ) for this sentence.
Summary
- Method: The study employs a continuous latent-space representation to capture sentence structure and topical information, using a neural probabilistic context-free grammar (PCFG) and hidden Markov model (HMM). Subtree analysis is conducted via PCA to identify variation due to ( z ), the continuous latent variable.
- Result: A neural PCFG/HMM achieves 68.2% accuracy on a test set, while a simpler HMM obtains 63.4%. The latent space effectively captures topical information and constituent variation, with regular patterns observed in the leaves as ( z ) changes.
- Example: For the sentence "The cat sat on the mat," the parsed tree above illustrates the constituent structure; in the paper's analysis, the subtree root NT-04 aligns to prepositional phrases (PP), and the identified leaves are mostly PPs. The latent-space representation captures this structure, demonstrating how the model represents constituent variation.
Miscellaneous
Reference format:
@inproceedings{kim2019compound, title={Compound Probabilistic Context-Free Grammars for Grammar Induction}, author={Kim, Yoon and Dyer, Chris and Rush, Alexander M.}, booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)}, pages={2369--2385}, year={2019}, organization={Association for Computational Linguistics} }
Kim, Yoon, Chris Dyer, and Alexander M. Rush. "Compound Probabilistic Context-Free Grammars for Grammar Induction." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 2369–2385. Florence, Italy: Association for Computational Linguistics, July 2019.