Vol. 10, No. 1

Table Of Contents

*Representing the Structure of a Simple Context-Free Language in a Recurrent Neural Network: A Dynamical Systems Approach*

Paul Rodriguez, Department of Cognitive Science, UCSD

I will show that a RNN can learn to recognize the structure of a simple CFL, and that by using Dynamical Systems Theory, one can correctly identify how network states reflect that structure. An important question that arises is whether the RNN performs similarly to discrete automata.

A traditional view of NLP is based on mechanisms of discrete finite automata. Automata theory states that a regular language can be processed by a Finite State Machine (FSM). However, if a language has center-embedding, then that language is a Context-Free Language (CFL), which requires a Push-Down Automaton (PDA). Importantly, a PDA is a FSM with an additional memory-like resource, a "stack" or "counter", used to keep track of the embeddings (figure 1).

Figure 1. A push-down automaton that processes a simple string that consists of some number (n) of "a" followed by the same number of "b". These strings reflect center-embedding because each pair "ab" is embedded within another outer pair of "a..b". Each node represents a machine state and each arc represents a transition from state to state for an input with the corresponding stack control. A PDA mechanism could recognize strings from this language with a stack memory device or even something as simple as a counter to keep track of the embeddings.

Parallel Distributed Processing (PDP) architectures demonstrate a potentially radical alternative to the traditional theories of language processing. However, learning complex structural relationships in temporal data presents a serious challenge to PDP systems. Here I show that a Recurrent Neural Network (RNN) can learn a simple CFL. Furthermore, I use Dynamical Systems Theory to demonstrate how the RNN reflects the functional requirements of a CFL. In particular, the net can organize resources in order to keep track of input and maintain informational states that reflect the temporal structure of the input. The analysis enables a rough comparison to a PDA. This work is complementary to theoretical analyses of RNN capabilities (e.g. Siegelmann) yet original in its specific identification of the dynamic properties involved. The work is distinct from other proposed models of CFL processing, such as sequential recursive auto-associative memory (RAAM, e.g. Kwasny and Kalman, 1995), in that I am not directly attempting to represent constituent structures. An important question that arises is: How does a RNN compare to a PDA?

A recurrent neural network (RNN) architecture contains feedback connections which allow node activation values at one time step to affect the same node activation values at the next time step, thereby reflecting temporal processing. Figure 2 presents a typical RNN architecture with two binary-valued input nodes (shown as input lines), two hidden nodes, two output nodes, and one bias node. The feedback connections are implemented through copy units, and the weights can be updated through appropriate error correction procedures (Elman, 1991; St. John, 1992).

Figure 2. A recurrent neural network with 2 input lines, 2 hidden nodes, 2 output nodes, a bias input to hidden nodes and output nodes. In 1 time step activation values for hidden nodes and output nodes are calculated based on the input line and copy unit values. The recurrent connections are realized through copy units which save the hidden unit values at 1 time step and then inject those values at the next time step.
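As a sketch, the one-time-step update described in Figure 2 can be written in a few lines. The weights below are random placeholders for illustration only, not the trained values of any network in this paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x, h_prev, W_in, W_rec, b_h, W_out, b_o):
    """One time step: the hidden units read the input lines and the copy
    units (previous hidden values); the output units then read the hidden
    units. Returns the new hidden and output activations."""
    h = sigmoid(W_in @ x + W_rec @ h_prev + b_h)   # 2 hidden units
    o = sigmoid(W_out @ h + b_o)                   # 2 output units (predictions)
    return h, o

# Placeholder weights (NOT the trained values from the paper).
rng = np.random.default_rng(0)
W_in, W_rec = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
W_out = rng.normal(size=(2, 2))
b_h, b_o = rng.normal(size=2), rng.normal(size=2)

x_a = np.array([1.0, 0.0])   # "a" presented on input line 1
h = np.array([0.5, 0.5])     # copy units start at mid-range
h, o = rnn_step(x_a, h, W_in, W_rec, b_h, W_out, b_o)
```

The copy-unit mechanism is implicit here: the caller simply passes the previous step's hidden values back in as `h_prev`.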

Empirical investigations using natural language input have not demonstrated conclusive results on the sentence processing capabilities of RNNs.1 For example, Elman reports a RNN that can learn up to three levels of center-embedding (Elman, 1991). Stolcke reports a RNN that can learn up to two levels of center-embedding, and up to five levels of tail-recursion (Stolcke, 1990). Other work has shown that a RNN can process temporal semantic cues within sentences and reflect semantic constraints and associations (St. John, 1990).

Interestingly, it has been shown that a RNN can handle more levels of embedding if there are additional semantic constraints (Weckerley and Elman, 1992). The authors suggest that the RNN is a mechanism that reflects performance. However, in all these latter studies it may be the case that the RNN is performing the functional equivalent of a FSM. In other words, the RNN may not reflect the syntactic structure of complex sentences as a PDA does. Perhaps RNN capabilities are inherently limited such that performance is only reflected gratuitously. The current state of affairs presents a dilemma for RNN enthusiasts: they have a mechanism that reflects some aspects of performance, but they do not yet have a theory about the "nature of the solution".

- Can a RNN learn to process a simple CFL?
- What are the states and resources employed by the RNN?

The architecture for the experiment will be a RNN with 2 input units, 2 hidden units, 2 copy units, 2 output units, and 1 bias unit (see figure 2). In one time step the hidden unit activation values and then the output unit activation values will be updated. The copy units will hold the hidden unit values in preparation for the next time step.

The RNN will be trained to predict the next input such that if a network makes all the right predictions possible then the network is correctly processing the input. Elman points out several advantages in the prediction task; for one it requires minimal commitment to defining the role of an external teacher, and more importantly it is psychologically motivated by the observation that humans use prediction and anticipation in language processing (Elman, 1991).

For the simple CFL input a^n b^n, when the network has "a" input, it will not be able to accurately predict the next symbol, because the next symbol can be another "a" or the first "b" in the sequence. On the other hand, once the network has a "b" input, it can accurately predict the number of "b" symbols that matches the number of "a" inputs, and then also predict the end of the string (Batali, 1994). Correct processing of a string is defined as follows:

- A) "b" input => predict only "b" symbols, except for B).
- B) last "b" input => predict end of the string (or beginning of the next string).
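These two criteria can be made concrete with a small script. The `pred_fn` interface below is a hypothetical stand-in for whatever produces the network's predictions; it is not part of the original experiment:

```python
def anbn(n):
    """A string of n "a"s followed by n "b"s."""
    return "a" * n + "b" * n

def required_predictions(s):
    """Criteria A) and B): on each "b" input the network must predict "b",
    except on the last "b", where it must predict "a" (the start of the
    next string). Predictions on "a" inputs are unconstrained, since the
    next symbol is genuinely ambiguous there."""
    return {i: ("a" if i == len(s) - 1 else "b")
            for i, c in enumerate(s) if c == "b"}

def processes_correctly(pred_fn, n):
    """True if a predictor meets every required prediction for a^n b^n.
    pred_fn(s, i) is a hypothetical interface returning the symbol
    predicted after reading s[i]."""
    s = anbn(n)
    return all(pred_fn(s, i) == sym
               for i, sym in required_predictions(s).items())
```

For "aabb", for example, the required predictions are "b" after the first "b" and "a" after the second.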

Figure 3. A simple physical system can be described as a dynamical system, and a phase space diagram can provide a rough guide to the dynamics. The 3-D landscape will actually provide a convenient metaphor for describing many aspects of RNN behavior.

The 3-D physical landscape can be characterized by a vector flow field in a phase space diagram. The vector flow field represents the set of possible trajectories, and the phase space diagram has coordinate axes that represent the state variables of the system.2 In this example the state variables are the {x, y} coordinate point of the ball position. For each position on the landscape, the vector flows describe the change of position. In fact, the entire trajectory of the ball can be roughly plotted out in the phase space diagram.
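Assuming the hidden layer implements a map of the form h -> sigmoid(W h + drive), where `drive` collects the input-line and bias contributions for one input condition (Fa or Fb), a vector flow field like those in the later figures can be tabulated on a grid. The weights here are placeholders, not trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def flow_field(W_rec, drive, grid_pts=5):
    """Change-of-state vectors F(h) - h over a grid of hidden-unit values
    in [0,1] x [0,1]. Each vector is the one-step change in {HU1, HU2}
    under one input condition held constant."""
    hs = np.linspace(0.0, 1.0, grid_pts)
    field = {}
    for h1 in hs:
        for h2 in hs:
            h = np.array([h1, h2])
            field[(h1, h2)] = sigmoid(W_rec @ h + drive) - h
    return field
```

A zero vector at a grid point marks a fixed point of that input condition's dynamics.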

Since the output units are a function of the hidden unit values, it will be informative to draw a contour plot on the phase space diagram for each output unit. I will draw a one-line contour plot which cuts up the phase space such that for all {hidden unit 1, hidden unit 2} values on one side of the line the output unit is > .5, and on the other side < .5 (the threshold value).
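Because each output unit is a sigmoid of a weighted sum of the hidden units, that one-line contour is easy to compute: the .5 contour of o = sigmoid(w . h + b) is exactly the straight line w . h + b = 0. The weights in this sketch are hypothetical:

```python
import numpy as np

def predicts_right_side(h, w_out, b_out):
    """True if the hidden state {HU1, HU2} falls on the > .5 side of an
    output unit's dividing line. The .5 contour of o = sigmoid(w . h + b)
    is the line w . h + b = 0, so the prediction depends only on which
    side of that line the hidden state lies."""
    return float(np.dot(w_out, h) + b_out) > 0.0
```

This is why a single straight dividing line in the {HU1, HU2} plane fully captures each output unit's behavior.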

- a) all eigenvalues > 1 or < -1 => the system is expanding and the fixed point is repelling,
- b) all eigenvalues < 1 and > -1 => the system is contracting and the fixed point is attracting,
- c) at least one eigenvalue > 1 or < -1, and at least one < 1 and > -1 => the system is expanding in the direction of the eigenvector that corresponds to the large eigenvalue; the fixed point is a repelling point specifically referred to as a saddle point.
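For a sigmoid hidden-unit map, the eigenvalues in question are those of the Jacobian evaluated at the fixed point. The sketch below applies the classification above, assuming a map of the form h -> sigmoid(W_rec h + drive):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_fixed_point(W_rec, drive, h_star):
    """Classify a fixed point of the map h -> sigmoid(W_rec @ h + drive)
    by the eigenvalue magnitudes of its Jacobian:
    all > 1 -> repelling, all < 1 -> attracting, mixed -> saddle."""
    s = sigmoid(W_rec @ h_star + drive)
    J = (s * (1.0 - s))[:, None] * W_rec   # J = diag(s(1-s)) @ W_rec
    mags = np.abs(np.linalg.eigvals(J))
    if np.all(mags > 1.0):
        return "repelling"
    if np.all(mags < 1.0):
        return "attracting"
    return "saddle"
```

The sigmoid derivative s(1-s) is what scales the recurrent weights into the local contraction or expansion rates.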

Figure 4 shows the vector flow field for the two input conditions. Note that the vectors are scaled from their true magnitude, but the relative vector size still reflects the relative change in {HU1, HU2}. "Fa" and "Fb" represent the dynamics for the "a" input condition and "b" input condition respectively. Note that Fa has an attracting point near (0,.35), and Fb has an attracting point near (0,1). In the physical metaphor, the Fa landscape has a valley that slopes down to (0,.35), and Fb has a valley that slopes down to (0,1). One can imagine placing a ball on the Fa hill, letting it roll for a while, and then holding the ball steady at some location while the landscape is instantaneously switched to Fb. Now the ball will change direction and roll around to the other attracting point. This is exactly analogous to feeding the network some strings of the form "aa...aabb...bb".

Figure 4. Case 1 vector flow fields for Fa and Fb. There is one attracting point for Fa near (0,.35); Fb has an attracting point near (0,1).

The landscape analogy has at least one caveat. In real landscapes a ball will roll quickly down from a hilltop and perhaps roll past a small valley, depending on speed, momentum, etc. However, in the RNN dynamics there is an important rule of thumb: the trajectory will always take small steps near a fixed point - for both attracting and repelling points.

Figure 5 shows the trajectory for some short strings, n=2 and n=3. Each arrow represents 1 time step, hence 1 input. The tail of the arrow is the value of {HU1, HU2} before the time step (i.e. the copy units) and the head represents the updated value of {HU1, HU2} after applying the activation functions. Note that the initial value was chosen by allowing the network to run for one or two short strings. Importantly, the last "b" input results in a change in {HU1, HU2} that crosses over to the left of the dividing line, hence the network properly predicts an "a" when the last "b" is input. Not surprisingly the last {HU1, HU2} value is near the initial starting value.
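The landscape-switching picture can be sketched directly: iterate the hidden-state map, swapping the input drive when the symbol changes from "a" to "b". As before, the map form and weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def trajectory(string, W_rec, drive_a, drive_b, h0):
    """Iterate the hidden-state map over an input string, switching
    between the Fa and Fb "landscapes" as the input symbol switches.
    Returns the sequence of hidden-unit states {HU1, HU2}; consecutive
    pairs correspond to the tail and head of one arrow in the figures."""
    h = np.asarray(h0, dtype=float)
    path = [h]
    for c in string:
        drive = drive_a if c == "a" else drive_b
        h = sigmoid(W_rec @ h + drive)
        path.append(h)
    return path
```

Checking which side of the output unit's dividing line each point falls on would then reproduce the prediction behavior described above.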

Figure 5. Case 1 trajectories for n=2 and n=3. Each arrow represents one time step, hence one input value. The tail is the previous hidden unit coordinate value; the head is the updated value. The initial starting point for the first "a" is about {.05,.99}; the final "b" input ends at about the same value. The trajectory crosses the dividing line on the last "b" input.

Figure 6 shows the trajectories for longer strings: n=4 and n=5. As above, the network trajectory correctly crosses the dividing line on the last "b" input. Figure 7 shows the trajectories for n=8 and n=9. Note how the network arrows become increasingly shorter near the Fa attracting fixed point. Also, the first "b" input causes a transition to a region of phase space where the arrows take smaller step sizes as well. However, for n=9 the network crosses the dividing line on the 8th "b", suggesting that perhaps the step sizes are not properly matched.

Figure 6. Case 1 trajectories for n=4 and n=5. Again, the trajectory crosses the dividing line on the last ``b'' input.

Figure 7. Case 1 trajectories for n=8 and n=9. For n=8, the trajectory crosses the dividing line on only the last arrow. But for n=9 the trajectory crosses the line on the 8th ``b'' input, which is 1 time step too early. The network had similar results for n>9.

One important feature is that the case 1 network performed most of the processing of "a" input along the HU2 axis, and most of the processing of "b" input along the HU1 axis. In fact, Fa monotonically decreased in HU2, and Fb monotonically decreased in HU1, which intuitively suggests that the network is "counting a's" and then "counting b's", with a transition from "a" to "b" that attempts to set up a proper "b" starting point.

Figure 8. Case 2 vector flow fields for Fa and Fb. There is one attracting point for Fa near (0,.85), Fb seems to have an attracting point near (.4,.8), but see next figure.

Figure 9. Case 2 vector flow field for the composite Fb^2 overlaid on Fb. There is a period-2 fixed point at (0,.4) and (1,1). The Fb^2 composite shows that there is a repelling point near (.4,.8).

Figure 10 shows the trajectories for short strings, n=2 and n=3. Again the arrows represent 1 time step, hence 1 input. In the next figure, n=4 and n=5, one can see that the trajectories are actually oscillating around both the attracting point and the repelling point. Figure 12 confirms that the oscillations are matched up for longer strings. In other words, if the last "a" input ends up a little higher (or lower) than the attracting fixed point, then the "b" input must transition to a coordinate point that is also higher (or lower) than the repelling fixed point. Also, the smaller arrows of Fa must be matched by the small arrows of Fb. As the Fb trajectory steps expand around the repelling point, the network will cross the dividing line on the last expanding step after taking the correct number of total smaller steps. I will refer to such complementary dynamics as coordinated trajectories.

Figure 10. Case 2 trajectories for n=2 and n=3. Each arrow represents one time step, hence one input value. The trajectories oscillate around the fixed points but still manage to cross the dividing line on the last ``b'' input.

Figure 11. Case 2 trajectories for n=4 and n=5. Again, the trajectory crosses the dividing line on the last ``b'' input.

Figure 12. Case 2 trajectories for n=11 and n=16. The small trajectory steps seem to be matched near the Fa attracting point and the Fb repelling point.

The results in table 1 summarize my findings. I have added two other cases of networks with the same architecture trained on the same task, although with different combinations of initial seed, training set, and learning-rate (eta) values. Case 3 is a network that had dynamics similar to case 1.5 Case 4 is a network with dynamics similar to my case 2, which was found by Janet Wiles in related work.

  CASE   max n learned   Fa        Fb         1/Fa
  ----   -------------   ------    --------   --------
  1      8               .408      .832       2.45
  2      16              -.7095    -1.455*    -1.409*
  3      7               .242      1.29       4.136
  4      25              -.637     -1.536**   -1.57**

Table 1. A comparison of largest eigenvalues at the attracting fixed point of Fa and the repelling fixed point of Fb. The successful networks have eigenvalues that are inversely proportional to each other. The starred values indicate which values are proportional.

The table compares the largest eigenvalue of the Fa system with the largest eigenvalue of the Fb system. These eigenvalues are associated with the eigenvectors that correspond to the axis of contraction under Fa and the axis of expansion under Fb.6 The contraction and expansion rates, which are given by the eigenvalues, indicate the change in hidden unit values from one time step to the next. In other words, the size of the trajectory step, as seen in the graphical analysis, is completely determined by the eigenvalues. If the eigenvalues are inversely proportional, then step sizes for "a" input will be matched by step sizes for "b" input. In the landscape metaphor, if the eigenvalues of Fa and Fb are exactly inversely proportional to each other, then the upward slope of the valley in the Fa landscape matches the downward slope of the hilltop in the Fb landscape.
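The comparison in the table reduces to a one-line check: is the largest Fb eigenvalue approximately the reciprocal of the largest Fa eigenvalue? The tolerance below is my own assumption, chosen only to separate the starred cases from the unstarred ones:

```python
def step_sizes_matched(lam_a, lam_b, tol=0.1):
    """Sketch of the matching criterion: the largest expansion eigenvalue
    of Fb should be approximately the reciprocal of the largest
    contraction eigenvalue of Fa, so every small step taken while
    "counting a's" is undone by a matching step while "counting b's".
    The tolerance is an assumption, not a value from the paper."""
    return abs(1.0 / lam_a - lam_b) < tol

# Largest eigenvalues from Table 1:
case_2 = step_sizes_matched(-0.7095, -1.455)   # successful network
case_1 = step_sizes_matched(0.408, 0.832)      # failed beyond n = 8
```

For case 2, 1/(-.7095) = -1.409, which is close to the Fb eigenvalue -1.455; for case 1, 1/.408 = 2.45, far from .832.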

Based on the results in the table I can state the first attempt at a criterion for success as the following:

- C1) If the rates of contraction for Fa around the attracting point and the rate of expansion for Fb around the saddle point are inversely proportional to each other, then a RNN can process the a^n b^n language by matching step sizes for "a" input and "b" input.

Figure 13. Case 2 eigenvector lines on the Fb and composite Fb^2 flow fields. The stable eigenvector line crosses the HU2 axis near (0,.85). The attracting point of Fa is near this line.

Figure 13 replicates the vector flow field for Fb and Fb^2. I have also drawn in the lines in the direction of the eigenvectors, which intersect at the location of the saddle point. The line that crosses the HU2 axis near (0,.85) is the stable eigenvector line, and the line that crosses the HU2 axis near (0,.65) is the unstable eigenvector line. Values of {HU1,HU2} near the stable eigenvector get contracted in towards the saddle point before moving towards the period-2 fixed point. Notice that the attracting point for the Fa dynamics (see figure 12) nearly falls on the stable eigenvector line of the saddle point for the Fb dynamics. The full criterion can now be stated as follows:

- C2) If "C1)" above is satisfied, and if the attracting point of Fa lies on the stable eigenvector of the saddle point for Fb then a RNN can process the a^n b^n language.
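The geometric part of C2 can also be checked numerically: compute the stable eigenvector of the saddle's Jacobian and measure how far the Fa attracting point sits off the line through the saddle in that direction. The tolerance is again my own assumption:

```python
import numpy as np

def near_stable_eigenvector(p_attract, p_saddle, J_saddle, tol=0.05):
    """Sketch of the C2 check: distance from the Fa attracting point to
    the line through the Fb saddle point along its stable eigenvector
    (the eigenvector whose eigenvalue has the smallest magnitude).
    The tolerance is an assumption, not a value from the paper."""
    vals, vecs = np.linalg.eig(J_saddle)
    v = np.real(vecs[:, np.argmin(np.abs(vals))])
    v = v / np.linalg.norm(v)
    d = np.asarray(p_attract, dtype=float) - np.asarray(p_saddle, dtype=float)
    perp = d - np.dot(d, v) * v      # component off the eigenvector line
    return np.linalg.norm(perp) < tol
```

With the attracting point on that line, trajectories arriving near it under Fa are funneled in toward the saddle under Fb before being expelled along the unstable direction.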

In fact one can draw functional correspondences between the behavior of the network and the behavior of the PDA as follows:

  PDA:            RNN:
  stack/counter ~ coordinated trajectories
  state         ~ region of activation values
  transition    ~ a change in activation values

I am not implying that a RNN is equivalent to a PDA; the coordinated trajectory concept I introduce is neither a stack nor a counter. However, in the context of the task domain and network architecture, the hidden node dynamics realize the same functional and informational content as that provided by a stack or counter in the context of a PDA. Also, the dynamical systems interpretation provides the correct functional description (as well as the mathematical analysis) of the resources employed by a RNN. In particular, a coordinated trajectory is composed of the following resources in the a^n b^n task domain: 1) the contracting dynamics of the attracting point under the "a" input condition matches the expanding dynamics of the repelling point under the "b" input condition, and 2) the attracting point under the "a" input condition lies near the axis of convergence of the saddle point.

In conclusion, I can address the two questions I presented earlier: first, a RNN can learn to process a simple CFL; second, the states and resources employed are correctly identified by a dynamical systems interpretation. A RNN can employ its resources to behave functionally as a stack/counter. However, there are important differences that do not correspond to a PDA. In the case of a PDA, the states of the system that control the stack or counter are literally disconnected from the stack or counter device. In the case of a RNN, one cannot decouple the coordinated trajectory from the states of the system. In fact, the trajectory is the state of activation values and the change of activation values.7

The work presented here suggests a related question that one can now ask: How does content affect trajectories? In other words, since we have some characterization of the resources involved in trajectories, we can begin to understand how different task domains, including ones that use semantic input, affect those resources. The effects could arise from precision, the location of fixed points, the rates of contraction/expansion, the topographical relationships of those axes of contraction/expansion, the ability of output units to "cut up" the phase space, etc. The dynamical systems analysis presented here provides insight into some performance issues for RNN models of language processing, but by no means does it exhaust the possibilities.

In my case 2 network the "counting up" values oscillate and converge around an attracting fixed point, while the "counting down" values oscillate and expand around a repelling fixed point. More importantly, my network learned the solution, with hidden units that were fully connected to each other. The axis of expansion in case 2 contained components of both hidden units, which suggests a solution that was distributed among units. The output units were not specially constructed but rather developed their own connection weights to cut up the hidden unit phase space as appropriate. In fact, since the output units were not connected to the input units, the hidden layer was forced to find a solution that "counts up and down" in two different regions of phase space. Essentially, the case 2 network realizes a distributed solution for the task at hand.

Further testing of the case 2 network with strings from the balanced parenthesis language confirms that coordinated trajectories are sufficient to process other CFLs. For example, the case 2 network will correctly predict the transition at the end of strings such as "aabaabbb" and "aabaabaabbb". This result is striking because the network was never trained on any strings from the balanced parenthesis problem. The conditions to process this language are related to those for a^n b^n language. Future work will investigate whether or not a RNN can learn other CFLs. It is possible that the solution may involve a different set of dynamics or just dynamics that are more complex.
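For reference, the discrete counterpart of this generalization is a one-counter acceptor. A minimal acceptor for the balanced-parenthesis language (of which a^n b^n is a subset, per footnote 8) can be sketched as:

```python
def counter_accepts(s):
    """A one-counter acceptor for the balanced-parenthesis language over
    {"a", "b"}: increment on "a", decrement on "b", never go negative,
    and end at zero. The a^n b^n language is the subset in which every
    "a" precedes every "b"."""
    depth = 0
    for c in s:
        depth += 1 if c == "a" else -1
        if depth < 0:
            return False
    return depth == 0
```

The striking point in the text is that the network, trained only on a^n b^n, tracks this same depth quantity for strings such as "aabaabbb" that it never saw in training.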

Casey, Michael Patrick (1995). Computation in Discrete-Time Dynamical Systems, Ph.D. dissertation, unpublished, University of California at San Diego.

Chalmers, David (1992). "Syntactic Transformations on Distributed Representations", in Connectionist Natural Language Processing, ed. Noel Sharkey, Kluwer Academic.

Elman, Jeffrey L. (1990). "Finding Structure in Time", in Cognitive Science 14, Ablex.

Elman, Jeffrey L. (1991). "Distributed Representations, Simple Recurrent Networks, and Grammatical Structure" in Machine Learning 7, Kluwer Academic.

Giles, C.L., Sun, G.Z., Chen, H.H., Lee, Y.C., Chen, D. (1992). "Extracting and Learning an Unknown Grammar with Recurrent Neural Networks", in Advances in Neural Information Processing Systems 4, J.E. Moody, S.J. Hanson and R.P. Lippmann (eds), Morgan Kaufmann, San Mateo, Ca.

Giles, C. Lee, and Omlin, Christian W. (1993). "Extraction, Insertion and Refinement of Symbolic Rules in Dynamically Driven Recurrent Neural Networks", in Connection Science, Vol. 5, Nos. 3 and 4.

Hopcroft, John E., and Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages, and Computation, Addison-Wesley.

Kwasny, Stan C., and Kalman, Barry L. (1995). "Tail-recursive Distributed Representations and Simple Recurrent Networks", in Connection Science, Vol. 7, No. 1.

Lucas, S. M., Damper, R. I. (1992). "Syntactic Neural Networks", in Connectionist Natural Language Processing, ed. Noel Sharkey, Kluwer Academic.

Marcus, Mitchell P., (1980). A Theory of Syntactic Recognition for Natural Language, MIT Press, Cambridge, Mass..

Miikkulainen, Risto (1992). "Script Recognition with Hierarchical Feature Maps", in Connectionist Natural Language Processing, ed. Noel Sharkey, Kluwer Academic.

Maskara, Arun and Noetzel, Andrew (1992). "Sequence Recognition with Recurrent Neural Networks", Technical Report, New Jersey Institute of Technology.

McClelland, James L., and Kawamoto, Alan H. (1986). "Mechanisms of Sentence Processing: Assigning Roles to Constituents", in Parallel Distributed Processing Vol. 1, MIT Press.

Rumelhart, D. E., Hinton, G.E., Williams, R.J. (1987). "Learning Internal Representations by Error Propagation", in Parallel Distributed Processing Vol. 1, MIT Press.

Servan-Schreiber, David, Cleermans, Axel, and McClelland, James L. (1988) "Encoding Sequential Structure in Simple Recurrent Networks", CMU Technical Report CS-88-183.

Siegelmann, Hava Tova (1993). Foundations of Recurrent Neural Networks, Ph.D. dissertation, unpublished. Rutgers, The State University of New Jersey, New Brunswick.

Smith, Anthony W. and Zipser, David (1989). "Learning Sequential Structure with the Real-Time Recurrent Learning Algorithm", International Journal of Neural Systems, Vol.1, No. 2.

St John, Mark (1992). "The Story Gestalt: A Model of Knowledge-Intensive Processes in Text Comprehension", Cognitive Science, 16.

Stolcke, Andreas (1990). "Learning Feature-Based Semantics with Simple Recurrent Networks", Technical Report, ICSI, UC Berkeley.

Stolz, W.S. (1967). "A Study of the Ability to Decode Grammatically Novel Sentences", in Journal of Learning and Verbal Behavior, Vol. 6.

Sun, G.Z., Chen, H.H., Giles, C.L., Lee, Y.C., Chen, D. (1990). "Connectionist Pushdown Automata That Learn Context-Free Grammars", in IJCNN, Washington D.C.

Tino, Peter, Horne, Bill, and Giles, Lee (1995). "Finite State Machines and Recurrent Neural Networks - Automata and Dynamical Systems Approaches", Technical Report UMCP-CSD:CS-TR-3396, University of Maryland, College Park.

Watrous, Raymond L., and Kuhn, Gary M. (1992). "Induction of Finite-State Automata Using Second Order Recurrent Networks", in Advances in Neural Information Processing Systems 4, J.E. Moody, S.J. Hanson and R.P. Lippmann (eds), Morgan Kaufmann, San Mateo, Ca.

Weckerley, Jill and Elman, Jeffrey L. (1992). "A PDP Approach to Processing Center-Embedded Sentences", in Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, Indiana University, Bloomington, Lawrence Erlbaum Associates, Hillsdale, N.J.

Williams, Ronald J., and Zipser, David (1989). "Experimental Analysis of the Real-Time Recurrent Learning Algorithm", in Connection Science, Vol. 1, No. 1.

1 Several works have demonstrated that feed-forward networks are capable of representing complex structural relationships, such as the RAAM model (e.g. Chalmers, 1992). Also, there have been other works showing that feed-forward networks are capable of performing semantic processes, such as assigning thematic role information (McClelland and Kawamoto, 1986; Miikkulainen, 1992). Some researchers have built networks to handle temporal structure by using special delay lines or hierarchical structure (Lucas and Damper, 1992). My emphasis here is on networks that attempt to organize themselves to match the temporal structure of the language task.

2 Note that the phase space diagram does not include an axis for time.

3 An eigenvalue is a scalar λ, and an eigenvector is a vector v, such that for a system of equations F, Fv = λv.

4 The case 1 network did not actually have a repelling point, however, I used the point that had the smallest change in activation values, which still allows one to make a rough comparison of dynamics.

5 Case 3 had dynamics that monotonically increased for one hidden unit in Fa, then monotonically decreased for the other hidden unit in Fb, but in this case the dynamics had a repelling point.

6 The other eigenvalues are only important in the sense that they are small enough so that the larger eigenvalues "dominate" the action.

7 This claim is supported by formal definitions. In the set theoretic language of automata the stack or counter is a discrete set distinct from the set of control states. In the set theoretic language of dynamical systems a trajectory (or orbit) is the real-valued set of state variables through time.

8 In fact the language a^n b^n is a subset of the balanced parenthesis language.


Center for Research in Language

CRL Newsletter October 1995 Vol. 10, No. 1