Improving Math Word Problems with Pre-trained Knowledge and Hierarchical Reasoning
∗
Weijiang Yu, Yingpeng Wen, Fudan Zheng, Nong Xiao
School of Computer Science and Engineering, Sun Yat-sen University
weijiangyu8@gmail.com, {wenyp6,zhengfd3}@mail2.sysu.edu.cn, xiaon6@mail.sysu.edu.cn
∗Corresponding Author

Abstract

Recent algorithms for math word problems (MWP) neglect outside knowledge that is not present in the problems. Most of them capture only word-level relationships and do not build hierarchical reasoning, as humans do, to mine the contextual structure between words and sentences. In this paper, we propose a Reasoning with Pre-trained Knowledge and Hierarchical Structure (RPKHS) network, which contains a pre-trained knowledge encoder and a hierarchical reasoning encoder. Firstly, our pre-trained knowledge encoder aims at reasoning the MWP by using outside knowledge from the pre-trained transformer-based models. Secondly, the hierarchical reasoning encoder is presented for seamlessly integrating the word-level and sentence-level reasoning to bridge the entity and context domain on MWP. Extensive experiments show that our RPKHS significantly outperforms state-of-the-art approaches on two large-scale commonly-used datasets, and boosts performance from 77.4% to 83.9% on Math23K, from 75.5% to 82.2% on Math23K with 5-fold cross-validation and from 83.7% to 89.8% on MAWPS. More extensive ablations are shown to demonstrate the effectiveness and interpretability of our proposed method.

1 Introduction

Math Word Problem (MWP) is a reasoning task for answering a mathematical query based on the problem description, which is an interdisciplinary research topic bridging mathematics and natural language processing. As shown in Table 1, a short narrative describes a problem and poses a question about the unknown quantity. In recent years, research on MWP using deep learning methods has been gaining increasing attention. Early research mainly focuses on Seq2Seq-based models (Sutskever et al., 2014; Ling et al., 2017b; Wang et al., 2017; Huang et al., 2018). These Seq2Seq-based methods aim to train an end-to-end model from scratch by using the training dataset. Some research focuses on developing structure-based approaches (Xie and Sun, 2019a; Wang et al., 2018a, 2019b; Liu et al., 2019a; Zhang et al., 2020b; Li et al., 2020b; Hong et al., 2021; Li et al., 2020a) by incorporating parsing trees into the neural models to produce promising results in generating solution expressions for the MWP.

Problem: Conner has 25000 dollars in his bank account. Every month he spends 1500 dollars. He does not add money to the account. How much money will Conner have in his account after 8 months?
Equation: x = 25000.0 − (1500.0 ∗ 8.0)
Solution: 13000.0
Table 1: An example of the math word problem task. Given a natural language description of a mathematical problem, the model must infer a formal math equation and the final quantity solution.

To answer this question, human beings not only need to parse the question and understand the context but also use external knowledge. However, the previous methods learn the textual description purely from the short and limited narrative without using any background knowledge that is not present in the description, which restrains the ability of the models to infer the MWP from a global perspective. Moreover, current methods mainly focus on designing diverse entity-level structures for word-level reasoning rather than bridging the hierarchical reasoning between the entity (word-level) and context (sentence-level). Single-level reasoning is not enough for solving the MWP. In this paper, we propose reasoning with pre-trained knowledge and hierarchical structure (RPKHS) to jointly address the two limitations.

Our RPKHS as shown in Figure 2 consists of
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3384–3394, November 7–11, 2021. ©2021 Association for Computational Linguistics
[Figure 1 panels: (a) Word-level Reasoning; (b) Sentence-level Reasoning; (c) Hierarchical Reasoning]
Figure 1: (a) Word-level reasoning is to build the relationship of each word in all textual descriptions, which can
also be considered as entity-level reasoning; (b) Sentence-level reasoning aims at mining the intra-relationship of
each sentence from the paragraph. (c) Hierarchical reasoning is to jointly excavate intra-relationship and inter-
relationship between word and sentence from the same paragraph.
two encoders, namely a pre-trained knowledge encoder and a hierarchical reasoning encoder, and a tree-structured decoder. It effectively incorporates the implicit linguistic knowledge into the model via the pre-trained knowledge encoder and generates a structural representation with our hierarchical reasoning encoder. The outputs of the two encoders are fed into a tree-structured decoder (Xie and Sun, 2019b) for final prediction.

To the best of our knowledge, we are the first to study the application of pre-trained knowledge to the MWP task. Implicit knowledge is embedded in some non-symbolic form, such as the weights of a neural network derived from annotated data or large-scale unsupervised language training. Recently, Transformer-based (Vaswani et al., 2017) and specifically BERT-based (Devlin et al., 2019b; Liu et al., 2019c) models have been proposed, which incorporate large-scale linguistic pre-training, implicitly capturing language-based knowledge. This type of knowledge can be quite useful for parsing the textual description.

For example, consider two sentences: ‘He has 25000 dollars in his bank account.’; ‘Paul appeared before the faculty to account for his various misdemeanors’. The word ‘account’ has totally different meanings in the two sentences because the scenes they describe differ. Hence, we think that the implicit pre-trained knowledge contains rich representations of such diverse word semantics. Such knowledge can also be regarded as a huge implicit vocabulary that endows each word with a rich representation. It can help the model to parse the correct semantics of words from complex text. In this paper, we take advantage of the implicit knowledge in pre-trained RoBERTa (Liu et al., 2019c) and analyze the effect of various pre-trained knowledge on the MWP task.

Current methods mainly learn the MWP by building word-level reasoning (as shown in Figure 1 (a)) with GNNs (Zhang et al., 2020b; Li et al., 2020b) and Seq2Seq models (Wang et al., 2017). They seldom consider modeling hierarchical structure. Since the descriptions of MWP have a hierarchical structure (words form sentences, sentences form a narrative), we likewise construct hierarchical reasoning (as shown in Figure 1 (c)) by first building representations of sentences from words, and then aggregating those into a whole narrative representation.

It is observed that different words and sentences in a mathematical narrative are differentially informative. The importance of words and sentences is highly context-dependent, i.e. the same word or sentence may be differentially important in different contexts (e.g., in ‘5 dollars’ and ‘5 pencils’, the word ‘5’ has different meanings). To include sensitivity to this fact, our model includes two levels of reasoning mechanisms, one at the word level and one at the sentence level. They lead the model to pay more or less attention to individual words and sentences when constructing the representation of the narrative. Taking the example shown in Table 1, intuitively, the first, second and fourth sentences carry stronger information for predicting the solution. Within these sentences, the words 25000 dollars and every month contribute more to inferring the math-aware results. In this
paper, we propose a hierarchical reasoning encoder to achieve this functionality.

Contributions. (1) As far as we know, we are the first to explore pre-trained knowledge on the MWP task via our pre-trained knowledge encoder. (2) We propose a hierarchical reasoning encoder to seamlessly integrate the word-level and sentence-level reasoning for bridging the entity and context domain on MWP. It can provide insight into which words and sentences contribute to the prediction, which can be of value in applications and analysis. (3) Our RPKHS outperforms previous approaches by a significant margin.

2 Related Work

The MWP is the task of translating a short paragraph consisting of multiple short sentences into target mathematical equations. Previous approaches usually solve the MWP by using rule-based methods (Yuhui et al., 2010; Bakman, 2007), statistical machine learning methods (Kushman et al., 2014; Mitra and Baral, 2016; Roy and Roth, 2018; Zou and Lu, 2019), semantic parsing methods (Shi et al., 2015; Roy and Roth, 2015; Huang et al., 2017) and deep learning methods (Ling et al., 2017a; Wang et al., 2018b; Liu et al., 2019b; Wang et al., 2017; Zhang et al., 2020a). Recently, the deep learning based methods have received more attention for their significant improvement. Wang et al. (2017) proposed a Seq2Seq-based model to directly map the linguistic text to a solution. Wang et al. (2018b) and Chiang and Chen (2019) implicitly modeled tree-based structure for decoding the MWP expressions, while (Wang et al., 2019a; Liu et al., 2019b; Xie and Sun, 2019b) optimized the decoder via explicit tree structure. Some research focused on graph structure for word-level reasoning. For example, Zhang et al. (2020a) built two customized graphs for enriching the quantity representations in the problem. Li et al. (2020b) presented a graph-to-tree encoder-decoder framework for grammar parsing.

However, they ignore the sentence-level relationship and the correlation between word and sentence. Different from the previous methods, we propose to use hierarchical reasoning containing word-level and sentence-level reasoning. Besides, we are the first to explore the effect of implicit knowledge from the pre-trained neural network weights on the task of math word problems.

3 Methodology

3.1 Overview

In this section, we explain the architecture and design of our proposed RPKHS network (i.e. Reasoning with Pre-trained Knowledge and Hierarchical Structure), composed of a pre-trained knowledge encoder, a hierarchical reasoning encoder and a tree-structured decoder, which can appropriately incorporate the outside knowledge into the model and bridge the hierarchical reasoning between the entity (word-level) and context (sentence-level). The overview of our RPKHS is illustrated in Figure 2. Our contributions mainly focus on the design of a joint-learning framework and two innovative encoders on the MWP task, which are discussed in detail in the following sections.

3.2 Problem Formulation

The math word problems (MWP) can be formulated as $(P, E)$, where $P$ is the problem text and $E$ is a solution expression. Assume a description of MWP has $L$ sentences $s_i$, and each sentence contains $T_i$ words. $w_{it}$ with $t \in [1, T]$ represents the words in the $i$-th sentence. Our proposed encoders project the raw problem descriptions into a vector representation, on which we build a tree-structured decoder to predict the mathematical expression.

3.3 Pre-trained Knowledge Encoder

We want to incorporate implicit external knowledge as well as math-aware knowledge which can be learned from the training set in our model. Language models, and especially transformer-based language models, have been shown to contain commonsense and factual knowledge (Petroni et al., 2019; Jiang et al., 2019). We adopt this direction in our model and build an encoder, pre-trained with RoBERTa (Liu et al., 2019c), which has been pre-trained on huge language corpora (e.g., BooksCorpus (Zhu et al., 2015), Wikipedia (Remy, 2002)) to capture implicit knowledge. We tokenize a description $Q$ using WordPiece (Wu et al., 2016) as in BERT (Devlin et al., 2019a), giving us a sequence of $|Q|$ tokens, embed them with the pre-trained RoBERTa embeddings, and append RoBERTa’s positional encoding, giving us a sequence of $d$-dimensional token representations $x^Q_1, \ldots, x^Q_{|Q|}$. We feed these into the transformer-based pre-trained knowledge encoder, fine-tuning the representation during training. We mean-pool the output of all
[Figure 2 diagram: Hierarchical Reasoning Encoder, Pre-trained Knowledge Encoder and Tree-structured Decoder. Example input: ‘If 6 times a number is decreased by 5, the result is 7 more than 3 times the sum of the number and 13. What is the number?’ Equation: (3.0*13.0+7.0+5.0)/(6.0-3.0); Solution: 17.0. Legend: Concatenation, Word Feature, Sentence Feature, Tree Node Feature, Voting Mechanism.]
Figure 2: Overview of our Reasoning with Pre-trained Knowledge and Hierarchical Structure (RPKHS). The hierarchical reasoning encoder receives the textual embedding to construct inter-relationship between sentence and word to aggregate semantics among entity and context. The pre-trained knowledge encoder captures a large amount of knowledge about the linguistic world from the pre-trained network weights, and incorporates the implicit knowledge into the input embedding to enrich the input representation. Then we concatenate the results from two encoders as the input of a tree-structured decoder for parsing the target mathematical equation and solution.
transformer steps to get our combined implicit knowledge representation Yp.

3.4 Hierarchical Reasoning Encoder

The proposed hierarchical reasoning encoder takes into account that different parts of a math description are not equally relevant. Moreover, determining the relevant sections involves modeling the interactions among the words, not just their isolated presence in the text. Therefore, to consider this aspect, the model includes two levels of reasoning mechanisms, one at the word level and the other at the sentence level, which let the model pay more or less attention to individual words and sentences when constructing the whole description representation. The hierarchical reasoning encoder is composed of 2 layers. The first layer is our word-level reasoning layer and the second layer is the sentence-level reasoning layer. In the following sections, we first introduce the GRU-based operation commonly used in our two layers. Then we present the details of the two reasoning layers.

GRU-based Sequence Encoding. The GRU (Bahdanau et al., 2015) uses a gating mechanism to track the state of sequences without using separate memory cells. There are two types of gates: the reset gate $r_t$ and the update gate $z_t$. They jointly control how information is updated to the state. At time $t$, the GRU computes the new state as

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t$. (1)

This is a linear interpolation between the previous state $h_{t-1}$ and the current new state $\hat{h}_t$ computed with new sequence information. The gate $z_t$ decides how much past information is kept and how much new information is added. $z_t$ is updated as

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$, (2)

where $x_t$ is the sequence vector at time $t$. The candidate state $\hat{h}_t$ is computed by

$\hat{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h)$, (3)

where $r_t$ is the reset gate which controls how much the previous state contributes to the candidate state. If $r_t$ is zero, then it forgets the past state. The reset gate is updated by

$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$. (4)
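
The GRU update in Eqs. (1)-(4) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the helper names (`gru_step`, `encode`, `init_params`), the dictionary layout of the parameters, the uniform random initialization and the toy dimensions are all assumptions made here for illustration only.

```python
import math
import random


def sigmoid(v):
    """Element-wise logistic function on a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def add(*vs):
    """Element-wise sum of several equal-length vectors."""
    return [sum(items) for items in zip(*vs)]


def hadamard(u, v):
    """Element-wise product of two vectors."""
    return [a * b for a, b in zip(u, v)]


def gru_step(x_t, h_prev, p):
    """One GRU update; p is a hypothetical dict holding W_*, U_*, b_*."""
    # Eq. (2): update gate z_t
    z_t = sigmoid(add(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev), p["b_z"]))
    # Eq. (4): reset gate r_t
    r_t = sigmoid(add(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev), p["b_r"]))
    # Eq. (3): candidate state, with the reset gate applied to U_h h_{t-1}
    pre = add(matvec(p["W_h"], x_t), hadamard(r_t, matvec(p["U_h"], h_prev)), p["b_h"])
    h_hat = [math.tanh(a) for a in pre]
    # Eq. (1): interpolate between the previous state and the candidate state
    keep = [1.0 - a for a in z_t]
    return add(hadamard(keep, h_prev), hadamard(z_t, h_hat))


def init_params(input_dim, hidden_dim, seed=0):
    """Small random parameters for the three gates (illustrative only)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for g in "zrh":
        p["W_" + g] = mat(hidden_dim, input_dim)
        p["U_" + g] = mat(hidden_dim, hidden_dim)
        p["b_" + g] = [0.0] * hidden_dim
    return p


def encode(sequence, params, hidden_dim):
    """Run the GRU over a sequence of vectors, starting from a zero state."""
    h = [0.0] * hidden_dim
    for x_t in sequence:
        h = gru_step(x_t, h, params)
    return h
```

For instance, `encode([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], init_params(3, 4), 4)` returns a 4-dimensional state. When the update gate is driven towards zero, Eq. (1) reduces to copying the previous state, which is one way to read the statement that $z_t$ decides how much past information is kept.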