Generalization, rules, and neural networks:
A simulation of Marcus et al. (1999)

Jeff Elman
Center for Research in Language & Department of Cognitive Science
University of California, San Diego

elman@crl.ucsd.edu

In their recent Science paper, Marcus, Vijayan, Bandi Rao, and Vishton (1999) report an interesting and important result. Seven-month-old infants were first habituated to sequences of nonsense syllables, which could be either of the form ABA or ABB (e.g., "le di le" or "le di di"). Subsequently, Marcus et al. found that infants showed a preference for novel test sequences which differed in form from the habituation stimuli (e.g., for the ABA group, "ba po po", and for the ABB group, "ba po ba"). Marcus et al. took these results to indicate that infants had extracted "algebra-like rules that represent relationships between placeholders (variables)" (p. 79).

The results do indicate that infants discriminated the difference between the two types of sequences (ABA vs. ABB; and in a third study, between AAB and ABB sequences). The distinction actually can be formulated in somewhat simpler terms than Marcus et al. have chosen to characterize it: in the case of one stimulus type (ABB sequences), the final two syllables are identical; in the case of the other (ABA or AAB), the final two syllables are different. The second experiment in their paper provides further evidence that the infants have formed a generalization which is sufficiently abstract that it can be extended to novel stimuli, including stimuli in which novel features carry the contrast. It is this ability which prompts Marcus et al. to conclude that the infants have formed an algebraic representation in which the specific syllables are replaced by variables such as "A" and "B".
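The simpler formulation can be written as a one-line predicate. The sketch below is only an illustration: the function name `final_pair_identical` is hypothetical and is not taken from Marcus et al. or from the simulation reported here. It shows that comparing the last two syllables is enough to separate ABB sentences from both ABA and AAB sentences.

```python
def final_pair_identical(sentence: str) -> bool:
    """Return True when the last two syllables of a three-syllable sentence match."""
    syllables = sentence.split()
    if len(syllables) != 3:
        raise ValueError("expected exactly three syllables")
    return syllables[1] == syllables[2]


# ABB sentences satisfy the predicate; ABA and AAB sentences do not.
for s in ["le di di", "le di le", "le le di", "ba po po", "ba po ba"]:
    print(s, final_pair_identical(s))
```

Note that the predicate never refers to the identity of any particular syllable, which is why it applies unchanged to the novel test items.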

In addition, Marcus et al. report (without details) failure to replicate these results with a simple recurrent network (SRN). Given their hypothesis of the mechanism which underlies the infants’ performance, and their failure to replicate the behavior with a network, Marcus et al. conclude that "such networks can simulate knowledge of grammatical rules only by being trained on all items to which they apply" (p. 79). This conclusion is consistent with other claims that have appeared in the literature, which argue that the regularities which networks are able to learn are closely tied to their training data (see, for example, Hadley, 1992, and Marcus, 1998; and Elman, 1998 for a rebuttal).

In the present case, Marcus et al. appear to make two assumptions which are questionable. The first assumption is that whatever prior experience infants might have had, before participating in the experiment, is not relevant to their experimental task. Thus, in attempting to simulate the infant data, the SRN is not provided with any training except that which occurred in the experiment. Second, Marcus et al. assume that neural networks can only (at best) form generalizations which mirror the conditional probabilities of the stimuli to which they have been exposed. Thus, generalization over more abstract patterns that are not literally encountered during learning is impossible.

In fact, infants bring a rich background knowledge to bear in the experimental task that Marcus et al. presented them with. By 7 months, infants have heard approximately 6 million words; by this age they have also already formed preferences for speech sounds in their own language and are aware of the surrounding language’s phonotactic regularities. There is a large body of experimental evidence indicating that infants are quite sensitive to very subtle perceptual regularities in the stimuli they encounter. Among other things, such sensitivity almost certainly includes the ability to discriminate stimuli that are similar or different (indeed, this capacity is precisely what makes the popular habituation task possible). If infants are to be granted the capacity to distinguish "same" from "different", based on extensive prior experience with the world across multiple sensory modalities, it is unreasonable not to provide networks with a similar prior experience.

Second, there is an important difference between learning statistics and statistically-driven learning. Neural networks do not simply memorize transitional probabilities. Rather, networks are induction engines in which generalizations arise over abstract classes of items. Statistical patterns provide the evidence for those classes and for the generalizations over them. Importantly, however, networks can generalize what they have learned to novel stimuli, as well as to stimuli which have not previously been encountered in similar contexts. For instance, a network that is exposed to training data in which certain restrictions apply to the class of nouns denoting humans will generalize its expectations to all human nouns, including those which it has not previously seen in specific contexts (Elman, 1998).

Simulation

The proof of the pudding is in the simulation, however. The following simulation was carried out to demonstrate that, given appropriate background knowledge, a simple recurrent network can learn the sorts of categories implicit in Marcus et al.’s study, and will generalize this to stimuli that were not used to train those categories. Thus, there are three phases to the simulation: (1) an initial period which is intended to correspond to the prior experience which is available to the infants in the experiment (but presumably not made available to the networks tested by Marcus et al.); (2) a second phase which corresponds as closely as possible to the habituation task the infants encountered; and (3) a testing phase in which the network’s response to the same novel stimuli used in the infant experiment is probed.

Stimuli: The infants in the Marcus et al. study were habituated to 12 different "sentences"; each sentence was composed of 3 syllables drawn from a set of 8 possible syllables (de, di, je, ji, le, li, we, wi). During testing, infants were presented with 4 novel sentences that were made up of 4 entirely new syllables (ba, po, ko, ga). In their Experiment 2, habituation and testing syllables were carefully chosen so that the ABB vs. ABX pattern seen during habituation not only involved different syllables than during testing, but was also carried by different features in testing than in training. The testing and training syllables used in Phase 2 of the simulation were identical (as far as can be determined from the published report) to those used in the infant experiment, as shown in Table 1. (The consonants and vowels were encoded using a reduced version of the distinctive feature notation developed by Plunkett & Marchman, 1993.)

However, it is obvious that prior to their participation in the experiment, infants had had extensive experience with the full range of English phonemes and phonetic features. To simulate this previous experience, the network was therefore initially exposed in Phase 1 to a larger number of stimuli that more accurately captured the broad range of syllables which the infants undoubtedly had encountered before participating in the experiment. These 120 syllables are shown in Table 2.

Table 1: Habituation and test stimuli (Phase 2)

Habituation	Featural representation

	de	-1 -1 -1 -1 -1  1 -1  1 -1  1 -1  1
	di	-1 -1 -1 -1 -1  1 -1  1 -1  1 -1 -1
	je	-1 -1  1 -1 -1  1 -1  1 -1  1 -1  1
	ji	-1 -1  1 -1 -1  1 -1  1 -1  1 -1 -1
	le	-1 -1  1  1 -1  1 -1  1 -1  1 -1  1
	li	-1 -1  1  1 -1  1 -1  1 -1  1 -1 -1
	we	 1  1  1  1 -1 -1 -1  1 -1  1 -1  1
	wi	 1  1  1  1 -1 -1 -1  1 -1  1 -1 -1

Testing

	ba	-1 -1 -1 -1  1 -1 -1  1  1 -1  1  1
	po	-1 -1 -1 -1  1 -1 -1 -1  1  1 -1  1
	ga	-1 -1 -1 -1 -1 -1  1  1  1 -1  1  1
	ko	-1 -1 -1 -1 -1 -1  1 -1  1  1 -1  1
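One way to check that the contrasts are carried by different features in habituation and in testing is to list, for each syllable set, the feature positions whose value varies within the set. The sketch below transcribes the vectors of Table 1 into a lookup table; the names `FEATURES` and `varying_positions` are hypothetical, positions are counted from 0, and the result holds only for this transcription of the table.

```python
# Feature vectors transcribed from Table 1 (12 features per CV syllable).
FEATURES = {
    "de": (-1, -1, -1, -1, -1,  1, -1,  1, -1,  1, -1,  1),
    "di": (-1, -1, -1, -1, -1,  1, -1,  1, -1,  1, -1, -1),
    "je": (-1, -1,  1, -1, -1,  1, -1,  1, -1,  1, -1,  1),
    "ji": (-1, -1,  1, -1, -1,  1, -1,  1, -1,  1, -1, -1),
    "le": (-1, -1,  1,  1, -1,  1, -1,  1, -1,  1, -1,  1),
    "li": (-1, -1,  1,  1, -1,  1, -1,  1, -1,  1, -1, -1),
    "we": ( 1,  1,  1,  1, -1, -1, -1,  1, -1,  1, -1,  1),
    "wi": ( 1,  1,  1,  1, -1, -1, -1,  1, -1,  1, -1, -1),
    "ba": (-1, -1, -1, -1,  1, -1, -1,  1,  1, -1,  1,  1),
    "po": (-1, -1, -1, -1,  1, -1, -1, -1,  1,  1, -1,  1),
    "ga": (-1, -1, -1, -1, -1, -1,  1,  1,  1, -1,  1,  1),
    "ko": (-1, -1, -1, -1, -1, -1,  1, -1,  1,  1, -1,  1),
}
HABITUATION = ["de", "di", "je", "ji", "le", "li", "we", "wi"]
TESTING = ["ba", "po", "ga", "ko"]


def varying_positions(syllables):
    """Return the 0-based feature positions that are not constant across the syllables."""
    vectors = [FEATURES[s] for s in syllables]
    return {i for i in range(12) if len({v[i] for v in vectors}) > 1}


hab_positions = varying_positions(HABITUATION)
test_positions = varying_positions(TESTING)
print(sorted(hab_positions))           # [0, 1, 2, 3, 5, 11]
print(sorted(test_positions))          # [4, 6, 7, 9, 10]
print(hab_positions & test_positions)  # set()
```

On this transcription the two sets of positions do not overlap, so any feature that distinguishes one test syllable from another was held constant throughout habituation.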

Table 2: Pre-exposure stimuli (Phase 1)

bA	ba	be	bE	bi	bI	bo	bO	bu	bU
pA	pa	pe	pE	pi	pI	po	pO	pu	pU
dA	da	de	dE	di	dI	do	dO	du	dU
tA	ta	te	tE	ti	tI	to	tO	tu	tU
gA	ga	ge	gE	gi	gI	go	gO	gu	gU
kA	ka	ke	kE	ki	kI	ko	kO	ku	kU
jA	ja	je	jE	ji	jI	jo	jO	ju	jU
CA	Ca	Ce	CE	Ci	CI	Co	CO	Cu	CU
lA	la	le	lE	li	lI	lo	lO	lu	lU
wA	wa	we	wE	wi	wI	wo	wO	wu	wU
DA	Da	De	DE	Di	DI	Do	DO	Du	DU
TA	Ta	Te	TE	Ti	TI	To	TO	Tu	TU

 

Architecture: A simple recurrent network (Elman, 1990) with the architecture shown in Figure 1 was used. The 12 input units represented the 12 phonetic features that were used to encode each CV syllable. The network had two output units: one was used during the pretraining phase, described below, and the other was used during the habituation and testing phases. (During pretraining, only the first output unit was trained; during habituation and testing, only the second output unit was trained.)

Figure 1. SRN architecture used in simulation
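For readers unfamiliar with the architecture, the following is a minimal sketch of the forward pass of an Elman-style SRN with 12 inputs and 2 outputs. Several details are assumptions made for illustration and are not taken from the simulation: the number of hidden units (10), the logistic activation function, the initial weight range, the zero initial context, and the omission of bias terms. Training (backpropagation of error) is not shown.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic activation (an assumed choice; the text does not name one)."""
    return 1.0 / (1.0 + math.exp(-x))


class SimpleRecurrentNetwork:
    """Minimal Elman-style SRN: the context layer holds a copy of the previous hidden state."""

    def __init__(self, n_in=12, n_hidden=10, n_out=2, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_in = matrix(n_hidden, n_in)       # input   -> hidden
        self.w_ctx = matrix(n_hidden, n_hidden)  # context -> hidden
        self.w_out = matrix(n_out, n_hidden)     # hidden  -> output
        self.context = [0.0] * n_hidden

    def reset(self):
        """Clear the context layer (whether the original simulation did this is not stated)."""
        self.context = [0.0] * len(self.context)

    def step(self, x):
        """Process one syllable vector and return the activations of the output units."""
        hidden = [
            sigmoid(sum(w * v for w, v in zip(row_in, x))
                    + sum(w * c for w, c in zip(row_ctx, self.context)))
            for row_in, row_ctx in zip(self.w_in, self.w_ctx)
        ]
        self.context = hidden  # copy-back: the hidden state becomes the next context
        return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in self.w_out]


net = SimpleRecurrentNetwork()
le = [-1, -1, 1, 1, -1, 1, -1, 1, -1, 1, -1, 1]  # "le" from Table 1
first = net.step(le)
second = net.step(le)  # same input, different context, so the output changes
print(first, second)
```

The point of the example is the copy-back step: because the hidden state is fed back as context, the response to a syllable depends on what preceded it, which is what allows a same/different judgment across successive syllables.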

 

 

Pretraining: During pretraining, the network was presented with 50,000 syllables drawn from the full set of 120 possible syllables, one at a time. As each new syllable was presented, the network was trained to indicate whether the current syllable was the same as or different from the previous syllable (encoded by outputting a 1 or 0, respectively). This initial task reflects the assumption that infants do indeed learn to notice similarity or dissimilarity between temporally adjacent stimuli, apart from any attempt to categorize or make sense of the stimuli. After the network had experienced 6 passes through this initial data set, the weights were saved and the habituation task began.
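The pretraining data can be pictured as a stream of syllables, each paired with a same/different target. The sketch below builds the 120-syllable inventory of Table 2 and generates such a stream. How often repetitions occurred in the original data is not reported, so the repetition probability `p_repeat` (and its value of 0.5) is purely an assumption, as are the function and variable names.

```python
import random

# Consonant and vowel symbols as they appear in Table 2 (12 x 10 = 120 syllables).
CONSONANTS = ["b", "p", "d", "t", "g", "k", "j", "C", "l", "w", "D", "T"]
VOWELS = ["A", "a", "e", "E", "i", "I", "o", "O", "u", "U"]
SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS]


def pretraining_stream(n, p_repeat=0.5, seed=0):
    """Yield (syllable, target) pairs.

    target is 1 if the syllable is the same as its predecessor and 0 if it is
    different; the first syllable has no predecessor, so its target is None.
    p_repeat is an assumed parameter, not a value reported in the text.
    """
    rng = random.Random(seed)
    previous = None
    for _ in range(n):
        if previous is not None and rng.random() < p_repeat:
            current = previous
        else:
            current = rng.choice(SYLLABLES)
        target = None if previous is None else int(current == previous)
        yield current, target
        previous = current


stream = list(pretraining_stream(10))
print(stream)
```

The target depends only on whether two successive syllables match, never on which syllables they are, so the task can be learned over the full inventory before any habituation sentence is seen.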

Habituation: During habituation, the same network (with weights from the pretraining phase) was presented with the 32 sentences listed in Table 3.

Table 3: Habituation stimuli

ABA	ABB

le di le	le di di
le je le	je le le
le li le	li le le
le we le	we le le
wi di wi	wi di di
wi je wi	je wi wi
wi we wi	li we we
wi li wi	we wi wi
ji di ji	di ji ji
ji je ji	je ji ji
ji li ji	ji li li
ji we ji	we ji ji
de di de	de di di
de je de	je de de
de we de	li de de
de li de	de we we

Each syllable in a sentence was presented, one syllable at a time. Upon hearing the final syllable, the network was trained to output (on a different output unit than used in the pretraining phase) a 0 in the case of ABA patterns, and a 1 in the case of ABB sentences (no training occurred following the first two syllables, and no training occurred on the output used during pretraining). In this way, the network was asked to make the same discriminations that the infants presumably learned to make, using the identical stimulus set. The network saw the habituation sentences 347 times, after which point the two categories were well discriminated.
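The structure of a habituation trial can be sketched as follows. Each sentence becomes a sequence of (syllable, target) pairs in which only the final syllable carries a target: 0 for ABA and 1 for ABB. Only a few sentences from Table 3 are included, and the function name `habituation_trials` and the use of `None` for "no training signal" are illustrative choices, not details of the original simulation.

```python
# A few rows taken from Table 3.
ABA_SENTENCES = ["le di le", "wi je wi", "ji li ji", "de we de"]
ABB_SENTENCES = ["le di di", "je wi wi", "ji li li", "de we we"]


def habituation_trials(aba, abb):
    """Return one trial per sentence as a list of (syllable, target) pairs.

    The target is None for the first two syllables (no training occurs there)
    and the category label on the final syllable: 0 for ABA, 1 for ABB.
    """
    trials = []
    for label, sentences in ((0, aba), (1, abb)):
        for sentence in sentences:
            syllables = sentence.split()
            targets = [None, None, label]
            trials.append(list(zip(syllables, targets)))
    return trials


trials = habituation_trials(ABA_SENTENCES, ABB_SENTENCES)
for trial in trials:
    print(trial)
```

Because the error signal arrives only after the third syllable, the network has to carry information about the earlier syllables forward in its context layer in order to respond correctly.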

Testing: During the testing phase, the network was shown the same 4 sentences that the infants were tested with; these stimuli are shown in Table 4. The network’s responses and target outputs (assuming generalization) are also shown.

Table 4: Testing stimuli

ABA	response	target	ABB	response	target
ba po ba	0.004	0	ba po po	0.853	1
ko ga ko	0.008	0	ko ga ga	0.622	1

These responses clearly indicate that the network learned to extend the ABA vs. ABB generalization to novel stimuli. (The responses to the ABB patterns are within the range of variability shown for the habituation stimuli; importantly, they are on the correct side of the 0.5 value which marks a chance response.)
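The comparison against the 0.5 criterion can be spelled out with the values reported in Table 4. This is a small illustrative check rather than part of the simulation; the function name `classify` is hypothetical.

```python
# (sentence, network response, target) as reported in Table 4.
TABLE_4 = [
    ("ba po ba", 0.004, 0),
    ("ko ga ko", 0.008, 0),
    ("ba po po", 0.853, 1),
    ("ko ga ga", 0.622, 1),
]


def classify(response, threshold=0.5):
    """Map a graded response to a category: 1 (ABB) above the threshold, else 0 (ABA)."""
    return int(response > threshold)


for sentence, response, target in TABLE_4:
    correct = classify(response) == target
    print(f"{sentence}: response={response:.3f} target={target} correct={correct}")
```

All four test sentences fall on the correct side of the criterion, although the ABB responses are further from their target than the ABA responses are.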

 

Discussion

The ability to generalize a pattern beyond the specific stimuli which gave rise to that generalization is an important characteristic of human cognition. This capacity is demonstrated as well by the neural network simulation described above. Two factors play an important role in the network’s ability to perform in this way.

First, the network does not approach the task with no prior experience. As is true for the infants in the Marcus et al. study, the network sees the full range of phonetic contrasts which occur in English prior to learning the categorization task. The network is also given the opportunity to learn such highly salient patterns as whether or not successive stimuli are the same or different. This background knowledge is important in allowing the network to generalize what it learns about subsets of stimuli to the broader class of possible stimuli to which the generalization might apply.

Second is the fact that networks do not simply record conditional probabilities. Rather, networks use conditional probabilities as the basis for learning that stimuli may belong to more abstract classes. Having formed abstract representations (in the hidden layers) which capture such class membership, networks are then able to learn that patterns which obtain over some members of the class are likely to be true of other members.

Of course, this is also true of symbolic systems. Variables operate over classes and permit generalizations that are expressed in terms of those variables to apply to all entities over which the variables range. Is it thus possible that the SRN in the simulation above is actually a symbolic machine in disguise?

I would argue no. There is an important difference between the two mechanisms. The strength of the symbolic system, and of variables, is that they are blind to the characteristics of the specific entities they represent. This is also the Achilles’ heel of such systems. Being blind to exemplars means that exceptions or subregularities must be noted separately, and there must be a mechanism for blocking the rule. In the case of the network, class membership is highly context- and content-dependent. The network’s representations live in a continuous high-dimensional space, and the temporal generalizations are supported by non-linear dynamics. Taken together, these two characteristics provide for representations and processes which are simultaneously sensitive to patterns of usage which are broad and general as well as narrow and specific.

The learning of classes and regularities is also highly experience-dependent in a way which I would argue closely resembles the learning trajectory for humans. Because categories emerge over time as a result of experience, networks tend to be initially very conservative in the generalizations they make. The same is true of young children; categories such as "noun" and "verb", for example, seem not to appear until the age of roughly two years. Even then, children are often unwilling to extend to a verb (for example) all of the behaviors which may be true for verbs. Instead, children tend to confine their expectations to the usages they have actually encountered (Olguin & Tomasello, 1993; Tomasello & Olguin, 1993).

At a later stage, after having experienced more of the world, networks begin to form classes and to generalize what they know about these classes to the items in the class. If there are accidental gaps in their experience, the networks will often apply the general pattern to missing cases. Thus, a network which is trained to expect that (in an artificial language) the direct object of verbs of perception or communication will always be human, but which has never seen the word "boy" as the direct object of any verb, will nonetheless expect "boy" as a possible direct object following the sentence fragment "Sue talks to. . ."

A symbolic system would behave similarly. The problem with a variable-based system, however, is that sometimes gaps are not accidental but systematic. The ungrammaticality of "She whispered him the news", vs. the acceptability of "She told him the news" probably reflects a subtle interaction between the discourse effects of the ditransitive construction and semantic dominance differences between "whisper" and "tell" (cf. Erteschik-Shir, 1979; Goldberg, 1995; Pinker, 1984). Then the problem is how to restrict the generalization process (cf. Goldberg, 1995, and Pinker, 1981, 1984 for discussion of this problem). In the case of networks, the generalization process is sensitive to gaps in an experience-dependent manner. Given the example above, in which "boy" happens never to appear in direct object position, a network with minimal experience will not expect "boy" as the object of any verb. With more experience—even if the additional data does not include examples of "boy" as an object—the network will learn that there are classes of words, that "boy" belongs to the class of humans, and that this class is a possible object of verbs of perception and communication. At this stage, "boy" will be expected, even though the network has never seen it in the object position. However, as the network’s experience increases, if "boy" continues never to be encountered when it is predicted as a possible object, the network will retreat from its initial overgeneralization and learn that "boy", although a member of the class of human nouns, is an exception to the pattern of human-as-object.

Thus, I would argue that there are indeed important differences between the ways in which symbolic systems and networks approach the issue of generalization. Networks are capable of abstraction and of generalizing beyond the data. But they are also highly sensitive to subregularities and to the partial productivity of patterns, in a manner that seems very difficult for symbolic systems. Ultimately, the decision about which framework is the best model for human cognition will depend on which one best captures the facts, and which one provides the best account of those facts. For the moment, I’m betting on the network.

References

Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14, 179-211. (Adobe PDF version: http://crl.ucsd.edu/~elman/Papers/fsit.pdf; compressed postscript: ftp://crl.ucsd.edu//pub/neuralnets/fsit.ps.gz)

Elman, J.L. (1998). Generalization, simple recurrent networks, and the emergence of structure. In M.A. Gernsbacher and S.J. Derry (Eds.), Proceedings of the Twentieth Annual Conference of the Cognitive Science Society. Mahwah, NJ: Lawrence Erlbaum Associates. (Adobe PDF version: http://crl.ucsd.edu/~elman/Papers/cogsci98.pdf; compressed postscript: http://crl.ucsd.edu/~elman/Papers/cogsci98.ps.gz)

Erteschik-Shir, N. (1979). Discourse constraints on dative movement. In T. Givon (Ed.), Syntax and Semantics 12: Discourse and Syntax. New York: Academic Press.

Goldberg, A. (1995). A Construction Grammar Approach to Argument Structure. Chicago: The University of Chicago Press.

Hadley, R.F. (1992). Compositionality and systematicity in connectionist language learning. In Proceedings of the 14th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates.

Marcus, G. (1998). Symposium on Cognitive Architecture: The algebraic mind. In M.A. Gernsbacher & S. Derry (Eds.), Proceedings of the 20th Annual Conference of the Cognitive Science Society. Mahwah, NJ: Lawrence Erlbaum Associates.

Marcus, G.F., Vijayan, S., Rao, S.B., & Vishton, P.M. (1999). Rule learning in seven-month-old infants. Science, 283, 77-80.

Olguin, R., & Tomasello, M. (1993). Two-year-olds do not have a grammatical category of verb. Cognitive Development, 8, 245-273.

Pinker, S. (1981). Comments on the paper by Wexler. In C.L. Baker and J.J. McCarthy (Eds.), The Logical Problem of Language Acquisition, Cambridge, MA: MIT Press.

Pinker, S. (1984). Language Learnability and Language Development. Cambridge, MA: MIT Press.

Plunkett, K., & Marchman, V. (1993). From rote learning to system building: Acquiring verb morphology in children and connectionist nets. Cognition, 48, 21-69.

Tomasello, M., & Olguin, R. (1993). Twenty-three-month-old children have a grammatical category of noun. Cognitive Development, 8, 451-464.