CRL Newsletter

February 1996
Vol. 10, No. 4

The newsletter of the Center for Research in Language, University of California, San Diego, La Jolla CA 92039. 858-534-2536; email: editor@crl.ucsd.edu.


Development in a Connectionist Framework: Rethinking the Nature-Nurture Debate

Kim Plunkett

Oxford University



A DEVELOPMENTAL PARADOX

Two findings in developmental psychology stand in apparent conflict. Piaget (1952) has shown that at a certain stage in development, children will cease in their attempts to reach for an object when it is partially or fully covered by an occluder. This finding is observed in children up to the age of about 6 months and is interpreted to indicate that the object concept is not well-established in early infancy. The object representations that are necessary to motivate reaching and grasping behavior are absent. In contrast, other studies have shown that young infants will express surprise when a stimulus array is transformed in such a way that the resulting array does not conform to reasonable expectations. For example, change in heart rate, sucking or GSR, is observed when an object, previously visible, fails to block the path of a moving drawbridge or a locomotive fails to reappear from a tunnel, or has changed colour when it reappears (Baillargeon, 1993; Spelke, 1994). These results are interpreted as indicating that important representations of object properties such as form, shape and the capacity to block the movement of other objects are already in place by 4 months of age. The conflict in these findings can be stated as follows: Why should the infant cease to reach for a partially or fully concealed object when it already controls representational characteristics of objects that confirm the stability of object properties over time, and that predict the interaction of those represented properties with objects that are visible in the perceptual array?

One answer to this conflict is that Piaget grossly underestimated young children's ability to retrieve hidden objects. However, this answer is no resolution to the conflict: Piaget's findings are robust. Alternatively, one might question Piaget's interpretation of his results. Young infants know a lot about the permanent properties of objects but recruiting object representations in the service of a reaching task requires additional sensorimotor skills which have little to do with the infant's understanding of the permanence of objects. Again, this response must be rejected. Young infants who are in full command of the skill to reach and grasp a visible object still fail to retrieve an object which is partially or fully concealed (von Hofsten, 1989). Motor skills are not the culprit here. The capacity to relate object knowledge to other domains seems to be an important part of object knowledge itself. Object knowledge has to be accessed and exercised.

A Resolution

A resolution of the conflict can be found in considering some fundamental differences in the nature of the two types of task that infants are required to perform. In experiments that measure "surprise" reactions to unusual object transformations such as failure to reappear from behind an occluder, the infant is treated as a passive observer (Baillargeon, 1993). In essence, the infant is evaluated for its expectations concerning the future state of a stimulus array. Failure of expectation elicits surprise. In the Piagetian task, the infant is required to actively transform the stimulus array. To achieve this, not only must the infant know where the object is but she must be able to coordinate that information with knowledge about the object's identity -- typically, the infant reaches for objects she wants. We suppose that this coordination is relatively easy for visible objects, because actions are supported by externally available cues. However, when the object is out of sight, the child has to rely on internal representations of the object's identity and position. We assume that the internal representations for object position and identity develop separately. This assumption is motivated by recent neurological evidence that spatial and featural information is processed in separate channels in the human brain -- the so-called 'what' and 'where' channels (Ungerleider & Mishkin 1982). In principle, the child could demonstrate knowledge of an object's position without demonstrating knowledge about its identity, or vice versa. Surprise reactions might be triggered by failure of infant expectations within either of these domains. For example, an object may suddenly change its featural properties or fail to appear in a predicted position. Internal representations are particularly important when the object is out of sight. Hence, we might expect infants to have greater difficulty performing tasks that involve the coordination of spatial and featural representations -- such as reaching for hidden objects -- when these representations are only partially developed.

Building A Model

The resolution outlined in the previous section constitutes a theory about the origins of infants' surprise reactions to objects' properties (spatial or featural) which do not conform to expectations. It also attempts to explain why these surprise reactions precede the ability to reach for hidden objects even though infants possess the motor skills to do so. Mareschal, Plunkett & Harris (1995) have constructed a computational model that implements the ideas outlined in this theory. The model consists of a complex neural network that processes a visual image of an object that can move across a flat plane. Different types of objects distinguished by a small number of features appear on the plane one at a time. These objects may or may not disappear behind an occluder. All objects move with a constant velocity so that if one disappears behind an occluder, it will eventually reappear on the other side. Object velocities can vary from one presentation to the next.

Figure 1: The modular neural network (Mareschal et al., 1995) used to track and initiate reaching responses for visible and hidden objects. An object recognition network and a visual tracking network process information from an input retina. The object recognition network learns spatially invariant representations of the objects that move around the retina. The visual tracking network learns to predict the next position of the object on the retina. The retrieval response network learns to integrate information from the other two modules in order to initiate a reaching response. The complete system succeeds in tracking visible objects before it can predict the reappearance of hidden objects. It also succeeds in initiating a reaching response for visible objects before it learns to reach for hidden objects.

The network is given two tasks. First, it must learn to predict the next position of the moving object, including its position when hidden behind an occluder. Second, the network must learn to initiate a motor response to reach for an object, both when visible and when hidden. The network is endowed with several information processing capacities that enable it to fulfil these tasks. The image of the object moving across the plane is processed by two separate modules. One module learns to form a spatially invariant representation of the object so that it can recognise its identity irrespective of its position on the plane (Foldiak 1991). The second module learns to keep track of the object but loses all information about the object's identity (Ungerleider & Mishkin 1982). This second module does all the work that is required to predict the position of the moving object. However, in order to reach for an object, the network needs to integrate information about the object's identity and its position. Both modules are required for this task. Therefore, the ability to reach can be impeded either because the representations of identity and position are not sufficiently developed or because the network has not yet managed to properly integrate these representations in the service of reaching.
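
The division of labour between the modules can be pictured with a small hand-coded sketch. It is not the Mareschal et al. network, in which all of these mappings are learned; the one-dimensional retina, the feature code "red-ball" and every function name below are assumptions made purely for illustration. The sketch shows only which information each channel keeps and why reaching needs both.

```python
from typing import List, Optional

# Hypothetical 1-D retina: each cell holds a feature code or None (empty).
Retina = List[Optional[str]]


def make_retina(size: int, position: int, features: str) -> Retina:
    """Place one object, described by a feature code, at one retinal position."""
    retina: Retina = [None] * size
    retina[position] = features
    return retina


def what_channel(retina: Retina) -> Optional[str]:
    """Identity only: report the feature code, discarding where it was seen."""
    for cell in retina:
        if cell is not None:
            return cell
    return None


def where_channel(retina: Retina) -> Optional[int]:
    """Position only: report the location, discarding what the object is."""
    for position, cell in enumerate(retina):
        if cell is not None:
            return position
    return None


def predict_next_position(previous: int, current: int) -> int:
    """Extrapolate one time step ahead, assuming constant velocity."""
    return current + (current - previous)


def plan_reach(identity: Optional[str], position: Optional[int],
               wanted: str) -> Optional[int]:
    """Reaching needs both channels: the wanted object and a known location."""
    if identity == wanted and position is not None:
        return position
    return None


frame_1 = make_retina(7, 1, "red-ball")
frame_2 = make_retina(7, 2, "red-ball")

identity = what_channel(frame_2)
predicted = predict_next_position(where_channel(frame_1), where_channel(frame_2))
reach_target = plan_reach(identity, predicted, wanted="red-ball")
```

Tracking uses only the output of `where_channel`, whereas `plan_reach` returns nothing if either the identity or the position is missing. This is the sense in which reaching can fail even when tracking succeeds.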

Given the additional task demands imposed on the network for reaching it would seem relatively unsurprising to discover that the network learns to track objects before it learns to reach for them. The crucial test of the model is whether it is able to make the correct predictions about the late onset of reaching for hidden objects relative to visible objects. In fact, the model makes the right predictions for the order of mastery in tracking and reaching for visible and hidden objects. It quickly learns to track and reach for visible objects, tracking being slightly more precocious than retrieval. Next, the network learns to track occluded objects as its internal representations of position are strengthened and it is able to "keep track" of the object in the absence of perceptual input. However, the ability to track hidden objects together with the already mastered ability to reach for visible objects does not guarantee mastery of reaching for hidden objects. The internal representations that control the integration of spatial and featural information require further development before this ability is mastered.

Evaluating the Model

Notice how this modelling endeavour provides a working implementation of a set of principles that constitute a theory about how infants learn to track and reach for visible and hidden objects. It identifies a set of tasks that the model must perform and the information processing capacities required to perform those tasks. All these constitute a set of assumptions that are not explained by the model. However, given these assumptions, the model is able to make correct predictions about the order of mastery of the different tasks. The model implements a coherent and accurate (not necessarily true -- the assumptions might be wrong) theory. However, this model, just like any other, has a number of free parameters which the modeller may 'tweak' in order to achieve the appropriate predictions. It is necessary to derive some novel predictions which can be tested against new experimental work with infants, in order to evaluate the generality of the solution the model has found. This model makes several interesting predictions including improved tracking skills at higher velocities and imperviousness to unexpected feature changes while tracking. The first experimental prediction has been confirmed (see Mareschal, Harris & Plunkett 1995) while the second prediction is currently being tested. This instance of model building and evaluation thus seems to support the initial insight that children's object representations develop in a fragmentary fashion, and that the development of these fragments of knowledge shapes infant performance on various tasks in line with their manner of involvement in the tasks concerned.

CONNECTIONIST INSIGHTS

The model described in the previous section is an example of a computer simulation that uses the learning capabilities of artificial neural networks to construct internal representations of a training environment in the service of several tasks (reaching and tracking). Neural networks are particularly good at extracting the statistical regularities of a training environment and exploiting them in a structured manner to achieve some goal. They consist of a well-specified architecture driven by a learning algorithm. The connections or weights between the simple processing units that make up the network are gradually adapted over time in response to localised messages from the learning algorithm. The final configuration of weights in the network constitutes what it knows about the environment and the tasks it is required to perform.
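
The idea of weights being adapted gradually by local error messages can be made concrete with a minimal sketch. The models reviewed here use a variety of architectures and learning algorithms; the example below assumes the simplest case, a single linear unit trained with the delta rule, and the training patterns, learning rate and function names are all hypothetical choices made for illustration.

```python
import random


def predict(weights, inputs):
    """Output of a single linear unit: the weighted sum of its inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def train_delta_rule(patterns, n_inputs, learning_rate=0.1, epochs=200, seed=0):
    """Adapt the weights a little after every pattern, using only the
    error at the unit's own output (a 'localised message')."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
    for _ in range(epochs):
        for inputs, target in patterns:
            error = target - predict(weights, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
    return weights


# Hypothetical training environment: the target follows the first feature
# and ignores the second; the third input is a constant bias of 1.
PATTERNS = [
    ([0.0, 0.0, 1.0], 0.0),
    ([0.0, 1.0, 1.0], 0.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 1.0], 1.0),
]

weights = train_delta_rule(PATTERNS, n_inputs=3)
```

After training, the first weight is close to 1 and the others are close to 0: the regularity in the environment (only the first feature matters) is now stored in the weights, not in any explicit rule.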

Connectionist modelling provides a flexible approach to evaluating alternative hypotheses concerning the start state of the organism (or what we may think of as its innate endowment), the effective learning environment that the organism occupies and the nature of the learning procedure for transforming the organism into its mature state. The start state of the organism is modelled by the choice of network architecture and computational properties of the units in the network. There is a wide range of possibilities from which the developmentalist can choose. The effective learning environment is determined by the manner in which the modeller chooses to define the task for the network. For example, the modeller must decide upon a representational format for the pattern of inputs and outputs for the network, and specify the manner in which the network samples patterns from the environment. These decisions constitute precise hypotheses about the nature of the learning environment. Finally, the modeller must decide how the network will learn. Again, a wide variety of learning algorithms are available to drive weight adaptation in networks. Any particular connectionist model embodies a set of decisions governing all of these factors, which are crucial for specifying clearly one's theory of development. Quite small changes in one of the choices can have dramatic consequences for the performance of the model -- some of them quite unexpected. Connectionist modelling offers a rich space for exploring a wide range of developmental hypotheses.

In the remainder of this article I will briefly review some connectionist modelling work that has explored some important areas in the hypothesis space of developmental theories. I aim to underscore four main lessons or insights that these models have provided:

1. When constructing theories in psychology, we use behavioural data from experiments or naturalistic observation as the objects that our explanations must fit. We attempt to infer underlying mechanisms from overt behaviour. Connectionist modelling encourages us to be suspicious of the explanations we propose. Often, networks surprise us with the simplicity of the solution they discover to apparently complex tasks -- sometimes, leading us to the conclusion that learning may not be as difficult as we thought.

2. When we see new forms of behaviour emerging in development, we are tempted to conclude that some radical change has occurred in the mechanisms governing that behaviour. Connectionist modelling has shown us that small and gradual internal changes in an organism can lead to dramatic non-linearities in its overt behaviour -- new behaviour need not mean new mechanisms.

3. Theories of development are often domain specific. Behaviours that are discrete and associated with distinguishable modalities, promote explanations that do not reach beyond the specifics of those modalities or domains. These encapsulated accounts often emphasise the impoverished character of the learning environment and lead to complex specifications of the organism's start state. Connectionist models provide a framework for investigating the interaction between modalities and a formalism for entertaining distributed as well as domain specific accounts of developmental change. This approach fosters an appreciation of developing systems in which domain specific representations emerge from a complex interaction of the organism's domain-general learning capacities with a rich learning environment.

4. Complex problems seem to require complex solutions. Mastery of higher cognitive processes appears to require the application of complex learning devices from the very start of development. Connectionist modelling has shown us that placing limitations on the processing capacity of developing systems during early learning can actually enhance their long-term potential. The ignorance and apparent inadequacies of the immature organism may, in fact, be highly beneficial for learning the solutions to complex problems. Small is beautiful.

INFERRING MECHANISMS FROM BEHAVIOUR

Children make mistakes. Developmentalists use these mistakes as clues to discover the nature of the mechanisms that drive correct performance. For example, in learning the past tense forms of irregular verbs or plurals of irregular nouns, English-speaking children may sometimes overgeneralise the "-ed" or "-s" suffixes to produce incorrect forms like "hitted" or "mans". These errors often occur after the child has already produced the irregular forms correctly, yielding the well-known U-shaped profile of development.

A Dual-Mechanism Account

A natural interpretation of this pattern of performance is to suggest that early in development, the child learns irregular forms by rote, simply storing in memory the forms that she hears in the adult language. At a later stage, the child recognises the regularities inherent in the inflectional system of English and re-organises her representation of the past tense or plural system to include a qualitatively new device that does the work of adding a suffix, obviating the need to memorise new forms. During this stage, some of the original irregular forms may get sucked into this new system and suffer inappropriate generalisation of the regular suffix. Finally, the child must sort out which forms cannot be generated with the new rule-based device. They do this by strengthening their memories for the irregular forms which can thereby block the application of the regular rule and eliminate overgeneralisation errors (Pinker & Prince 1988).
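
The logic of this account can be written down in a few lines. The sketch below is an illustration of the blocking idea only, not an implementation taken from Pinker & Prince; the stored verbs, the numeric memory strengths and the threshold are hypothetical, and the regular route ignores spelling adjustments such as consonant doubling.

```python
# Hypothetical rote memory of irregular past tense forms.
IRREGULAR_MEMORY = {"sing": "sang", "ring": "rang", "go": "went"}


def regular_route(stem):
    """Symbolic rule: add the suffix whatever the stem sounds like."""
    return stem + "ed"


def dual_route_past_tense(stem, memory_strength, threshold=0.5):
    """Return the stored irregular form if its memory trace is strong
    enough to block the rule; otherwise let the regular route apply."""
    stored = IRREGULAR_MEMORY.get(stem)
    if stored is not None and memory_strength.get(stem, 0.0) >= threshold:
        return stored  # blocking succeeds
    return regular_route(stem)  # rule applies (possibly wrongly)


# A weak memory trace fails to block the rule: an overgeneralisation error.
early = dual_route_past_tense("sing", {"sing": 0.2})

# A strengthened trace blocks the rule: the correct irregular form.
late = dual_route_past_tense("sing", {"sing": 0.9})

# Regular verbs never need memory: the rule handles them.
regular = dual_route_past_tense("walk", {})
```

On this view, recovery from errors such as "singed" corresponds to nothing more than `memory_strength` crossing the threshold for each irregular verb.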

Figure 2: The dual-route model for the English past tense (Pinker & Prince 1988). The model involves a symbolic regular route that is insensitive to the phonological form of the stem and a route for exceptions that is capable of blocking the output from the regular route. Failure to block the regular route produces the correct output for regular verbs but results in overgeneralisation errors for irregular verbs. Children must strengthen their representation of irregular past tense forms to promote correct blocking of the regular route.

This account of the representation and development of past tense and plural inflections in English assumes that two qualitatively different types of mechanism are needed to capture the profile of development in young children -- a rote memory system to deal with the irregular forms and a symbolic rule system to deal with the rest. The behavioural dissociation between regular and irregular forms -- children make mistakes on irregular forms but not on regular forms -- makes the idea of two separate mechanisms very appealing. Double dissociations between regular and irregular forms in disordered populations add to the strength of the claim that separate mechanisms are responsible for different types of errors: in some language disorders children may preserve performance on irregular verbs but not on regulars while in other disorders the opposite pattern is observed.

Although the evidence is consistent with the view that a dual-route mechanism underlies children's acquisition of English inflectional morphology, this is no proof that the theory is correct. There may be other types of mechanistic explanations for these patterns of behaviour and development. Connectionist modelling offers a tool for exploring alternative developmental hypotheses.

A Single-Mechanism Account

One of the earliest demonstrations of the learning abilities of neural networks was for English past tense acquisition. Rumelhart & McClelland (1986) suggested that the source of children's errors in learning past tense forms was to be found in their attempts to systematise the underlying relationship that holds between the verb's stem and its past tense form. For most verbs in English, the sound of the stem does not affect the past tense form. You just add "ed" on the end. However, there is a small subset of verbs which exhibit a different relationship between stem and past tense form. For example, there is a set of no-change verbs where the stem and past tense forms are identical (hit-->hit). All these verbs end in an alveolar consonant (/t/ or /d/). Other verbs undergo a particular type of vowel change (ring-->rang, sing-->sang), apparently triggered by the presence of the rhyme "-ing" in the stem. Neural networks are particularly good at picking up on these types of regularities, so Rumelhart & McClelland trained a simple network to produce the past tense forms of verbs when presented with their stems. The details of the learning procedure and network architecture are not important here (see Plunkett 1995 for a detailed review of this and related models).

Figure 3: Network overregularization errors on irregular verbs as found in the Plunkett & Marchman (1993) simulation compared to those produced by one of 83 children analysed by Marcus, Ullman, Pinker, Hollander, Rosen & Xu (1992). The thick line indicates the percentage of regular verbs in the child's/network's vocabulary at various points in learning. Note the initial period of error free performance and overall low error rate characteristic of the developmental profiles for the model and child. Plunkett & Marchman (1993) also demonstrated that the types of errors that occurred in the model closely resembled the types of errors produced by the children studied by Marcus et al. (1992).

What is important to note is that Rumelhart & McClelland were successful in training the network to perform the task and that en route to learning the correct past tense forms of English verbs, the network made mistakes that are similar to the kind of mistakes that children make during the acquisition of inflectional morphology. Furthermore, the network did not partition itself into qualitatively distinct devices during the process of learning -- one for regular verbs and one for irregular verbs. The representation of both verb types seemed to be distributed throughout the entire matrix of connections in the network. Nevertheless, a behavioural dissociation between regular and irregular verbs was observed in the network. Most of its errors occurred on irregular verbs.

More recently, Marchman (1993) has shown that damage to a network trained on the past tense problem results in further dissociations between regular and irregular forms: production of irregular forms remains intact while production of regular verbs deteriorates, mimicking patterns of performance observed in disordered populations. As with the Rumelhart & McClelland model, the representation of regular and irregular verbs was distributed throughout the network, i.e., there was no evidence of dissociable mechanisms.

As it turns out, there were a lot of fundamental design problems with the Rumelhart & McClelland model that made it untenable as a realistic model of children's acquisition of the English past tense (Pinker & Prince 1988). Some of these problems have been fixed, some haven't (MacWhinney & Leinbach 1991, Plunkett & Marchman 1991, 1993, Cottrell & Plunkett 1994). However, the basic insight that the original model offered still remains: The observation of behavioural dissociations in some domain of performance does not necessarily imply the existence of dissociable mechanisms driving those dissociations in behaviour. Behavioural dissociations can emerge as the result of subtle differences in the graded representations constructed by these networks for different types of tasks.

Of course, just because one can train a network to mimic children's performance in learning the past tense of English verbs, it does not follow that children learn them the same way as the network. The relatively simple learning system that Rumelhart & McClelland and other researchers have used to model children's learning may underestimate the complexity of the resources that children bring to bear on this problem. However, the neural network model does show that, in principle, children could use a relatively simple learning system to solve this problem. The modelling work has thereby enriched our understanding of the range and types of mechanism that might drive development in this domain.

DISCONTINUITIES IN DEVELOPMENT

Developmentalists often interpret discontinuities in behaviour as manifesting the onset of a new stage or phase of development (Piaget 1955; Karmiloff-Smith 1979; Siegler 1981). The child's transition to a new stage of development is usually construed as the onset of a new mode of operation of the cognitive system, perhaps as the result of the maturation of some cognitively relevant neural sub-system. For example, the vocabulary spurt that often occurs towards the end of the child's second year has been explained as the result of an insight (McShane 1979), in which the child discovers that objects have names. Early in development, the child lacks the necessary conceptual machinery to link object names with their referents. The insight is triggered by a switch that turns on the naming machine. Similar arguments have been offered to explain the developmental stages through which children pass in mastering the object concept, understanding quantity and logical relations.

It is a reasonable supposition that new behaviours are caused by new events in the child, just as it is reasonable to hypothesise that dissociable behaviours imply dissociable mechanisms. However, connectionism teaches us that new behaviours can emerge as a result of gradual changes in a simple learning device. It is well known that the behaviour of dynamical systems unfolds in a non-linear and unpredictable fashion (van Geert 1991). Neural networks are themselves dynamical systems and they exhibit just these non-linear properties.

Plunkett, Sinha, Moller & Strandsby (1992) trained a neural network to associate object labels with distinguishable images. The images formed natural (though overlapping) categories so that images that looked similar tended to have similar labels. The network was constructed so that it was possible to interrogate it about the name of an object when only given its image (call this production) or the type of image when only given its name (call this comprehension).
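
What it means to interrogate one network in both directions can be shown with a toy auto-associative memory. This is not the Plunkett et al. network: the sketch assumes a linear associator whose weights are set in one step by the Hebbian outer-product rule, with two hypothetical items whose image and label vectors do not overlap. It illustrates only how a single set of weights can serve both production and comprehension.

```python
IMAGE_SIZE = 4
LABEL_SIZE = 2

# Hypothetical items: name -> (image vector, label vector).
ITEMS = {
    "dog": ([1, 1, 0, 0], [1, 0]),
    "cup": ([0, 0, 1, 1], [0, 1]),
}


def outer_product_weights(patterns):
    """Hebbian storage: sum the outer product of each pattern with itself."""
    size = len(patterns[0])
    weights = [[0.0] * size for _ in range(size)]
    for pattern in patterns:
        for row in range(size):
            for col in range(size):
                weights[row][col] += pattern[row] * pattern[col]
    return weights


def recall(weights, probe):
    """Pass a (possibly partial) input pattern through the weights."""
    return [sum(w * x for w, x in zip(row, probe)) for row in weights]


# Each stored pattern is the image and the label presented together.
WEIGHTS = outer_product_weights([image + label for image, label in ITEMS.values()])


def produce(image):
    """Production: present only the image, read off the label units."""
    return recall(WEIGHTS, image + [0] * LABEL_SIZE)[IMAGE_SIZE:]


def comprehend(label):
    """Comprehension: present only the label, read off the image units."""
    return recall(WEIGHTS, [0] * IMAGE_SIZE + label)[:IMAGE_SIZE]


dog_image, dog_label = ITEMS["dog"]
label_activity = produce(dog_image)    # strongest on the first label unit
image_activity = comprehend(dog_label)  # active on the 'dog' image units
```

Because the same weight matrix is used in both directions, anything that changes the weights for one task necessarily affects the other.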

Network performance during training resembled children's vocabulary development during their second year. During the early stages of training, the network was unable to produce the correct names for most objects -- it got a few right but improvement was slow. However, with no apparent warning, production of correct names suddenly increased until all the objects in the network's training environment were correctly labelled. In other words, the network went through a vocabulary spurt. The network showed a similar improvement of performance for comprehension, except that the vocabulary spurt for comprehension preceded the productive vocabulary spurt. Last but not least, the network made a series of under- and over-extension errors en route to masterful performance (such as using the word 'dog' exclusively for the family pet or calling all four-legged animals 'dog') -- a phenomenon observed in young children using new words (Barrett 1995).

Figure 4: (a) Profile of vocabulary scores typical for many children during their second year -- taken from Plunkett (1993). Each data point indicates the number of different words used by the child during a recording session. It is usually assumed that the "bumps" in the curve are due to sampling error, though temporary regressions in vocabulary growth cannot be ruled out. The vocabulary spurt that occurs around 22 months is observed in many children. It usually consists of an increased rate of acquisition of nominals -- specifically names for objects (McShane 1979). (b) Simplified version of the network architecture used in Plunkett, Sinha, Moller & Strandsby (1992). The image is filtered through a retinal pre-processor prior to presentation to the network. Labels and images are fed into the network through distinct "sensory" channels. The network is trained to reproduce the input patterns at the output -- a process known as auto-association. Production corresponds to producing a label at the output when only an image is presented at the input. Comprehension corresponds to producing an image at the output when only a label is presented at the input.

There are several important issues that this model highlights: First, the pattern of behaviour exhibited by the model is highly non-linear despite the fact that the network architecture and the training environment remain constant throughout learning. The only changes that occur in the network are small increments in the connections that strengthen the association between an image and its corresponding label. No new mechanisms are needed to explain the vocabulary spurt. Gradual changes within a single learning device are, in principle, capable of explaining this profile of development. McClelland (1989) has made a similar point in the domain of children's developing understanding of weight/distance relations for solving balance beam problems (Siegler 1981).
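
A single unit is enough to show how equal-sized internal changes can produce very unequal changes in behaviour. The sketch below is not the vocabulary model itself; it assumes one unit with a logistic (S-shaped) activation function, a common choice in networks of this kind, whose weight grows by the same hypothetical increment at every step.

```python
import math


def logistic(net_input):
    """S-shaped activation function squashing any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net_input))


def activation_trajectory(start_weight=-6.0, increment=0.5, steps=24,
                          input_value=1.0):
    """Unit output after each of a series of equal-sized weight increments."""
    weights = [start_weight + increment * step for step in range(steps + 1)]
    return [logistic(weight * input_value) for weight in weights]


trajectory = activation_trajectory()

# Change in output produced by each (identical) weight increment.
gains = [later - earlier for earlier, later in zip(trajectory, trajectory[1:])]

for step, gain in enumerate(gains):
    print(f"step {step:2d}: output rises by {gain:.4f}")
```

Every step changes the weight by the same amount, yet the early steps barely move the output while the steps in the middle of the range move it by more than twenty times as much. Viewed from outside, the unit appears to "spurt", although nothing about its mechanism has changed.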

Second, the model predicts that comprehension precedes production. This in itself is not a particularly radical prediction to make. However, it is an emergent property of the network that was not "designed in" before the model was built. More important is the network's prediction that there should be a non-linearity in the receptive direction, i.e., a vocabulary spurt in comprehension. When the model was first built, there was no indication in the literature as to whether this prediction was accurate. The prediction has since been shown to be correct (Reznick & Goldfield 1992). This model provides a good example of how a computational model can be used not only to evaluate hypotheses about the nature of the mechanisms underlying some behaviour but also to generate predictions about the behaviour itself. The ability to generate novel predictions about behaviour is important in simulation work as it offers a way to evaluate the generality of the model in understanding human performance.

The behavioural characteristics of the model are a direct outcome of the interaction of the linguistic and visual representations that are used as inputs to the network. The non-linear profile of development is a direct consequence of the learning process that sets up the link between the linguistic and visual inputs and the asymmetries in production and comprehension can be traced back to the types of representation used for the two types of input. The essence of the interactive nature of the learning process is underscored by the finding that the network learns less quickly when only required to perform the production task. Learning to comprehend object labels at the same time as learning to label objects enables the model to learn the labels faster.

It is important to keep in mind that this simulation is a considerable simplification of the task that the child has to master in acquiring a lexicon. Words are not always presented with their referents and even when they are it is not always obvious (for a child who doesn't know the meaning of the word) what the word refers to. Nevertheless, within the constraints imposed upon the model, its message is clear: New behaviours don't necessarily require new mechanisms and systems integrating information across modalities can reveal surprising emergent properties that would not have been predicted on the basis of exposure to one modality alone.

SMALL IS BEAUTIFUL

The immature state of the developing infant places her at a decided disadvantage in relation to her mature, skilled caregivers. In contrast, the newborn of many other species are endowed with precocious skills at birth. Why is Homo sapiens not born with a set of cognitive abilities that match those of the adult of the species? This state of affairs may seem all the more strange given that we grow very few new neurons after birth and even synaptic growth has slowed dramatically by the first birthday. In fact, there may be important computational reasons for favouring a relatively immature brain over a cognitively precocious endowment.

A complete specification of a complex nervous system would be expensive in genetic resources. The programming required to fully determine the precise connectivity of any adult human brain far exceeds the information capacity in the human genome. Much current research in brain development and developmental neurobiology points to a dramatic genetic underspecification of the detailed architecture of the neural pathways that characterise the mature human brain -- particularly in the neo-cortex. So how does the brain know how to develop? It appears that evolution has hit upon a solution that involves a trade-off between nature and nurture: You don't need to encode in the genes what you can extract from the environment. In other words, use the environment as a depository of information that can be relied upon to drive neural development.

The emergence of neural structures in the brain is entirely dependent upon a complex interaction of the organism's environment and the genes' capacity to express themselves in that environment. This evolutionary engineering trick allows the emergence of a complex neural system with a limited investment in genetic pre-wiring. Of course, this can have disastrous consequences when the environment fails to present itself. On the other hand, the flexibility introduced by genetic underspecification can also be advantageous when things go wrong, such as brain damage. Since information is available in the environment to guide neural development, other brain regions can take over the task of the damaged areas. Underspecification and sensitivity to environmental conditions permit a higher degree of individual specialisation and adaptation to changing living conditions. Starting off with a limited amount of built-in knowledge can therefore be an advantage if you're prepared to take the chance that you can find the missing parts elsewhere.

There are, however, other reasons for wanting to start out life with some limits on processing capacity. It turns out that some complex problems are easier to solve if you first tackle them from an over-simplistic point of view. A good example of this is Elman's (1993) simulation of grammar learning in a simple recurrent network. The network's task was to predict the next word in a sequence of words representing a large number of English-like sentences. These sentences included long distance dependencies, i.e., the sentences included embedded clauses which separated the main noun from the main verb. Since English verbs agree with their subject nouns in number, the network must remember the number of the noun all the way through the embedded clause until it reaches the main verb of the sentence. For example, in a sentence like "The boy with the football that his parents gave him on his birthday chases the dog", the network must remember that "boy" and "chases" agree with each other. This is the type of phenomenon which Chomsky (1959) used to argue against a behaviourist approach to language.

Figure 5: (a) A simple recurrent network (Elman 1993) is good at making predictions. A sequence of items is presented to the network, one at a time. The network makes a prediction about the identity of the next item in the sequence at the output. Context units provide the network with an internal memory that keeps track of its position in the sequence. If it makes a mistake, the connections in the network are adapted slightly to reduce the error. (b) When the input consists of a sequence of words that make up sentences, the network is able to represent the sequences as trajectories through a state space. Small differences in the trajectories enable the network to keep track of long-distance dependencies.

Even after a considerable amount of training, the network did rather poorly at predicting the next word in the sequence -- as do humans (cf. "The boy chased the ???"). However, it did rather well at predicting the grammatical category of the next word. For example, it seemed to know when to expect a verb and when to expect a noun, suggesting that it had learnt some fundamental facts about the grammar of the language to which it had been exposed. On the other hand, it did very badly on long distance agreement phenomena, i.e., it could not predict correctly which form of the verb should be used after an intervening embedded clause. This is a serious flaw if the simulation is taken as a model of grammar learning in English speakers, since English speakers clearly are able to master long-distance agreement.
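
For readers who find code easier to follow than a diagram, the prediction step of a simple recurrent network of the kind shown in Figure 5(a) might be sketched roughly as follows. This is an illustrative sketch only, not Elman's implementation: the class and function names, the layer sizes, the logistic hidden units, the softmax output and the random starting weights are all assumptions made for the example, and the weight adaptation that follows each prediction error is omitted.

```python
import math
import random


def random_matrix(rows, cols, rng, scale=0.5):
    """Small random starting weights (the scale is an arbitrary choice)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def weighted_sums(matrix, vector):
    """Net input to each unit: one weighted sum per row of the matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Turn output activations into a probability for each candidate word."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class SimpleRecurrentNetwork:
    """Hypothetical minimal network: input -> hidden (+ context) -> output."""

    def __init__(self, n_words, n_hidden, seed=0):
        rng = random.Random(seed)
        self.n_words = n_words
        self.n_hidden = n_hidden
        self.w_input = random_matrix(n_hidden, n_words, rng)
        self.w_context = random_matrix(n_hidden, n_hidden, rng)
        self.w_output = random_matrix(n_words, n_hidden, rng)
        self.reset_context()

    def reset_context(self):
        """Wipe the internal memory held in the context units."""
        self.context = [0.0] * self.n_hidden

    def step(self, word_index):
        """Present one word and return a prediction over the next word."""
        one_hot = [1.0 if i == word_index else 0.0 for i in range(self.n_words)]
        from_input = weighted_sums(self.w_input, one_hot)
        from_context = weighted_sums(self.w_context, self.context)
        hidden = [logistic(a + b) for a, b in zip(from_input, from_context)]
        # The context units keep a copy of the hidden state for the next step.
        self.context = list(hidden)
        return softmax(weighted_sums(self.w_output, hidden))
```

Because the context units carry a copy of the previous hidden state, presenting the same word twice in succession yields two different predictions: the second is conditioned on what came before.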

Elman discovered two solutions to this problem. In the first, the network could learn to master long-distance dependencies if the sentences to which it was initially exposed did not contain any embedded clauses and consisted only of sequences in which the main verb and its subject were close together. Once the network had learnt the principle governing subject-verb agreement under these simplified circumstances, embedded clauses could be included in the sentences in the training environment and the network would eventually master the long-distance dependencies. Exposure to a limited sample of the language helped the network to decipher the fundamental principles of the grammar which it could then apply to the more complex problem. This demonstration shows how "motherese" might play a facilitatory role in language learning (Snow 1977).
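
A staged training environment of this kind can be sketched in a few lines. The function name, the number of stages and the linear shift from simple to complex sentences are assumptions chosen for illustration; the text above only requires that the earliest stage contains no embedded clauses and that they are introduced later.

```python
import random


def staged_corpus(simple, complex_, n_stages, per_stage, seed=0):
    """Yield one training set per stage, shifting from simple to complex sentences.

    `simple` holds sentences without embedded clauses and `complex_` holds
    sentences with them. The linear ramp is a hypothetical schedule.
    """
    rng = random.Random(seed)
    for stage in range(n_stages):
        share_complex = stage / (n_stages - 1) if n_stages > 1 else 1.0
        n_complex = round(per_stage * share_complex)
        batch = [rng.choice(complex_) for _ in range(n_complex)]
        batch += [rng.choice(simple) for _ in range(per_stage - n_complex)]
        rng.shuffle(batch)
        yield batch
```

The first stage yielded contains only simple sentences and the last only complex ones, so a network trained stage by stage meets subject-verb agreement first in its easiest form.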

Elman's second solution was to restrict the memory of the network at the outset of training while keeping the long distance dependencies in the training sentences. The memory constraint made it physically impossible for the network to make predictions about words more than three or four items downstream. This was achieved by resetting the context units in the recurrent network and is equivalent to restricting the system's working memory. When the network was constrained in this fashion it was only able to learn the dependencies between words that occurred close together in a sentence. However, this limitation had the advantage of preventing the network from being distracted by the difficult long-distance dependencies. So again the network was able to learn some of the fundamental principles of the grammar. The working memory of the network was then gradually expanded so that it had an opportunity to learn the long-distance dependencies. Under these conditions, the network succeeded in predicting the correct form of verbs after embedded clauses.
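
The effect of resetting the context units can be made concrete with a small sketch. It assumes, for simplicity, that the context is wiped at a fixed interval (`window`) and that the interval grows by one word per training phase; the function names and the particular schedule are hypothetical and are not taken from Elman's simulation.

```python
def reset_points(n_words, window):
    """Word positions before which the context units are wiped.

    A window of None stands for unrestricted memory (no resets).
    """
    if window is None:
        return []
    return list(range(window, n_words, window))


def same_memory_span(i, j, window):
    """True if no reset falls between word i and word j, so that a trace of
    word i can still influence processing at word j."""
    if window is None:
        return True
    return i // window == j // window


def memory_schedule(start=3, stop=7):
    """Hypothetical schedule: one window size per phase, then no limit."""
    return list(range(start, stop + 1)) + [None]


sentence = ("The boy with the football that his parents gave him "
            "on his birthday chases the dog").split()
subject = sentence.index("boy")
before_verb = sentence.index("chases") - 1  # the verb is predicted from here

for window in memory_schedule():
    print(window, same_memory_span(subject, before_verb, window))
```

With a small window the subject and the position from which the verb is predicted never share a memory span, so only local dependencies can be learnt; once the resets stop, the long-distance dependency becomes learnable in principle.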

The initial restriction on the system's working memory turned out to have beneficial effects: Somewhat surprisingly, the network succeeded in learning the grammar underlying word sequences when working memory started off small and was gradually expanded, while it failed when a full working memory was made available to the network at the start of training.

The complementary nature of the solutions that Elman discovered to the problem of learning long-distance agreement between verbs and their subjects highlights the way that nature and nurture can be traded off against one another in the search for solutions to complex problems. In one case, exogenous environmental factors assisted the network in solving the problem. In the other case, endogenous processing factors pointed the way to an answer. In both cases, though, the solution involved an initial simplification in the service of long term gain. In development, big does not necessarily mean better.

CURRENT SHORTCOMINGS

One trial learning

Children and adults learn quickly. For example, a single reference to a novel object as a wug may be sufficient for a child to use and understand the term appropriately on all subsequent occasions. The connectionist models described in this paper use learning algorithms which adjust network connections in a gradualistic, continuous fashion. An outcome of this computational strategy is that new learning is slow. To the extent that one trial learning is an important characteristic of human development, these connectionist models fail to provide a sufficiently broad basis for characterising the mechanisms involved in development.

There are two types of solution that connectionist modellers might adopt in response to this problem. First, it should be noted that connectionist learning algorithms are not inherently incapable of one trial learning. The rate of change in the strength of the connections in a network is determined by a parameter called the learning rate. Turning up the learning rate will result in faster learning for a given input pattern. For example, it is quite easy to demonstrate one trial learning in a network that exploits a Hebbian learning algorithm. However, a side effect of using high learning rates is that individual training patterns can interfere with each other, sometimes resulting in undesirable instabilities in the network. Of course, interference is not always undesirable and may help us explain instabilities in children's performance such as in their acquisition of the English past tense. Generally, though, catastrophic interference between training patterns (when training on one pattern completely wipes out the traces of a previously trained pattern) is undesirable. One way to achieve one trial learning without catastrophic interference is to ensure that the training patterns are orthogonal (or dissimilar) to each other. Many models deliberately choose input representations which fulfil this constraint.
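
The role of orthogonality can be illustrated with a minimal linear associator trained by a Hebbian rule. This is a sketch under stated assumptions rather than a model from the literature reviewed here: the function names, the two-unit patterns and the learning rate of 1.0 are chosen purely for the example, and the interference shown is the crosstalk that arises between similar patterns in such an associator.

```python
def hebbian_store(weights, pattern_in, pattern_out, learning_rate=1.0):
    """One presentation: strengthen each connection in proportion to the
    product of its input and output activity."""
    for i, out in enumerate(pattern_out):
        for j, inp in enumerate(pattern_in):
            weights[i][j] += learning_rate * out * inp


def recall(weights, pattern_in):
    """Output produced by the stored weights for a given input pattern."""
    return [sum(w * x for w, x in zip(row, pattern_in)) for row in weights]


def empty_weights(n_out, n_in):
    return [[0.0] * n_in for _ in range(n_out)]


# Orthogonal inputs: each association is stored in a single trial and
# recalled without disturbance from the other.
clean = empty_weights(2, 2)
hebbian_store(clean, [1.0, 0.0], [1.0, 0.0])
hebbian_store(clean, [0.0, 1.0], [0.0, 1.0])
print(recall(clean, [1.0, 0.0]))

# Overlapping inputs: storing the second association contaminates
# recall of the first.
mixed = empty_weights(2, 2)
hebbian_store(mixed, [1.0, 0.0], [1.0, 0.0])
hebbian_store(mixed, [0.6, 0.8], [0.0, 1.0])
print(recall(mixed, [1.0, 0.0]))
```

In the second case the amount of contamination equals the overlap (dot product) between the two input patterns, which is why dissimilar input representations protect one trial learning from interference.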

An alternative response to the problem of one trial learning in networks is to suggest that in some cases it is illusory, i.e., when individuals demonstrate what is apparently entirely new learning they are really exploiting old knowledge in novel ways. Vygotsky (1962) coined the term the Zone of Proximal Development to describe areas of learning where change could occur at a fast pace. Piaget (1952) used the notion of moderate novelty in a similar fashion. The performance of networks can change dramatically over just a couple of learning trials. For example, the Plunkett et al. (1992) simulation of vocabulary development exhibited rapid vocabulary growth after a prolonged period of slow lexical learning. The McClelland (1989) balance beam simulation shows similar stage-like performance. In both cases, the networks gradually move towards a state of readiness that then suddenly catapults them into higher levels of behaviour. Some one trial learning may be amenable to this kind of analysis. It seems unlikely, however, that all one trial learning is of this kind.

Defining the task and the teacher

Some network models are trained to carry out a specific task that involves a teacher. For example, the Rumelhart & McClelland model of past tense acquisition is taught to produce the past tense form of the verb when exposed to the corresponding stem. These are called supervised learning systems. In these simulations, the modeller must justify the source of the teacher signal and provide a rationale for the task the network is required to perform. Other models use an unsupervised form of learning such as auto-association (Plunkett et al., 1992) or prediction (Elman 1993, Mareschal et al., 1995). In these models, the teacher signal is the input to the network itself. In general, connectionist modellers prefer to use unsupervised learning algorithms. They involve fewer assumptions about the origins of the signals that drive learning. However, some tasks seem to be inherently supervised. For example, learning that a dog is called a dog rather than a chien involves exposure to appropriate supervision. Nevertheless, it is unclear how the brain goes about conceptualising the nature of the task to be performed and identifying the appropriate supervisory signal. Clearly, different parts of the brain end up doing different types of things. One of the challenges facing developmental connectionists is to understand how neural systems are able to define tasks for themselves in a self-supervisory fashion and to orchestrate the functioning of multiple networks in executing complex behaviour.
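
The difference between these training regimes lies in where the target for each input comes from, which the following sketch tries to make explicit. The function names and the toy data are hypothetical; real simulations present encoded patterns rather than strings.

```python
def supervised_pairs(stems, past_tense):
    """Targets come from an external teacher (here, a lookup table)."""
    return [(stem, past_tense[stem]) for stem in stems]


def autoassociation_pairs(items):
    """Auto-association: each input serves as its own target."""
    return [(item, item) for item in items]


def prediction_pairs(sequence):
    """Prediction: the target is simply the next item in the input stream."""
    return list(zip(sequence, sequence[1:]))


print(supervised_pairs(["walk", "go"], {"walk": "walked", "go": "went"}))
print(autoassociation_pairs(["dog", "cat"]))
print(prediction_pairs(["the", "boy", "chases", "the", "dog"]))
```

Only the first function needs information that is not already present in the input, which is the sense in which auto-association and prediction make fewer assumptions about the origin of the teacher signal.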

Biological plausibility

Throughout this paper I have tried to demonstrate how connectionist models can contribute to our understanding of the mechanisms underlying linguistic and cognitive development. Yet the learning algorithms employed in some of the models described here are assumed to be biologically implausible. For example, backpropagation (Rumelhart, Hinton & Williams 1986) involves propagating error backwards through the layers of nodes in the network. However, there is no evidence indicating that the brain propagates error across layers of neurons in this fashion and some have argued that we are unlikely to find such evidence (Crick 1989).
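
To make clear what propagating error backwards involves, a single backpropagation step for a network with one hidden layer might look roughly like this. The sketch assumes logistic units, a squared-error measure, no bias terms and an arbitrary learning rate; all names are chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, x):
    """Activity flows forwards: input -> hidden -> output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, output


def backprop_step(w_hidden, w_out, x, target, learning_rate=0.5):
    """Adjust both weight layers in place; return the error before the update."""
    hidden, output = forward(w_hidden, w_out, x)
    # Error signal at the output units.
    delta_out = [(o - t) * o * (1 - o) for o, t in zip(output, target)]
    # Error signal at the hidden units: the output error is sent backwards
    # across the very connections that carried activity forwards.
    delta_hidden = [
        h * (1 - h) * sum(delta_out[k] * w_out[k][j] for k in range(len(w_out)))
        for j, h in enumerate(hidden)
    ]
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] -= learning_rate * delta_out[k] * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] -= learning_rate * delta_hidden[j] * x[i]
    return sum((o - t) ** 2 for o, t in zip(output, target)) / 2
```

The computation of `delta_hidden` is the step at issue: each hidden unit needs to know the error of every output unit it feeds, weighted by the strength of the forward connection.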

There is a considerable literature concerning the appropriate level of interpretation of neural network simulations (see Smolensky 1988). For example, it is often argued that connectionist models can be given an entirely functionalist interpretation and the question of their relation to biological neural networks left open for further research. In other words, the vocabulary of connectionist models can be couched at the level of software rather than hardware, much like the classical symbolic approach to cognition. Many developmental connectionists, however, are concerned to understand the nature of the relationship between cognitive development and changes in brain organisation. Connectionist models which admit the use of biologically implausible components appear to undermine this attempt to understand the biological basis of the mechanisms of change.

Given the success of connectionist approaches to modelling development, it would seem wasteful to throw these simulations onto the waste bin of the biologically implausible. Clearly, the most direct way forward is to implement these models using biologically plausible learning algorithms, such as Hebbian learning. Nevertheless, there are several reasons for tentatively accepting the understanding achieved already through existing models. First, algorithms like backpropagation may not be that implausible. The neurotransmitters that communicate signals across the synaptic gap are still only poorly understood but it is known that they communicate information in both directions. Furthermore, information may be fed backwards through the layered system of neurons in the cortex -- perhaps exploiting the little-understood back-projecting neurons in the process.

A second, related proposal assumes that algorithms like backpropagation belong to a family of learning algorithms, all of which have similar computational properties and some of which have biologically plausible implementations. The study of networks trained with backpropagation could turn out to yield essentially the same results as networks trained with a biologically plausible counterpart. There is some support for this point of view. For example, Plaut & Shallice (1993) lesioned a connectionist network trained with backpropagation and compared its behaviour with a lesioned network originally trained using a contrastive Hebbian learning algorithm. The pattern of results obtained was essentially the same for both networks. This result does not obviate the need to build connectionist models that honour the rapidly expanding body of knowledge relating to brain structure and systems. However, it does suggest that given the rather large pockets of ignorance concerning brain structure and function, we should be careful about jettisoning our hard won understanding of computational systems that may yet prove to be closely related to the biological mechanisms underlying development.

SOME LESSONS

A commonly held view has been that connectionism involves a tabula rasa approach to human learning and development. It is unlikely that any developmental connectionist has ever taken this position. Indeed, it is difficult to imagine what a tabula rasa connectionist network might look like. All the models reviewed in this article assume a good deal of built-in architectural and processing constraints to get learning off the ground. In some cases, such as the Rumelhart & McClelland model of the past tense, the initial constraints are quite modest. In others, such as the Mareschal et al. model of visual tracking and reaching, the initial architectural and computational assumptions are rather complex. These modelling assumptions, together with the task definition, imply a commitment to the ingredients that are necessary to get learning off the ground.

What is needed to get learning off the ground? We have seen that there are two main sources of constraint:

1. The initial state of the organism embodies a variety of architectural and computational constraints that determine its information processing capabilities.

2. Environmental structure supports the construction of new representational capacities not initially present in the organism itself.

Modelling enables us to determine whether a theory about the initial state of the organism can make the journey to the mature state given a well-defined training environment. Modelling also enables us to investigate the minimal assumptions about the initial state that are needed to make this journey.

A minimalist strategy may not necessarily provide an accurate picture of the actual brain mechanisms that underlie human development. However, it provides an important potential contrast to theories of the initial state that are based on arguments from the poverty of the stimulus. Investigating the richness of the stimulus shifts the burden away from the need to postulate highly complex, hard-wired information processing structures. A minimalist strategy may also provide valuable insights into alternative solutions that the brain may adopt when richer resources fail.

Theories about the initial state of the organism cannot be dissociated from theories about what constitutes the organism's effective environment. Release two otherwise identical organisms in radically different environments and the representations they learn can be quite disparate. Connectionist modelling offers an invaluable tool for investigating these differences as well as examining the necessary conditions that permit the development of the emergent representations that we all share.

ACKNOWLEDGEMENTS

This manuscript was produced while the author was engaged in a collaborative book project together with Jeff Elman, Liz Bates, Mark Johnson, Annette Karmiloff-Smith and Domenico Parisi. The content of this manuscript has been influenced profoundly by discussions with my book co-authors. The reader is strongly recommended to consult Elman et al. (In press) for a more wide-ranging and detailed discussion of the issues raised here.

REFERENCES

Baillargeon, R. (1993). The object concept revisited: New directions in the investigation of infants' physical knowledge. In: C. E. Granrud (Ed.), Visual perception and cognition in infancy, 265-315. London, UK: LEA.

Barrett, M. D. (1995). Early Lexical Development. In P. Fletcher & B. MacWhinney (Eds.), The Handbook of Child Language, (pp. 362-392). Oxford: Blackwells.

Bates, E., Bretherton, I., & Snyder, L. (1988). From First Words to Grammar: Individual Differences and Dissociable Mechanisms. Cambridge, MA: Cambridge University Press.

Bliss, T. V. P., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. Journal of Physiology, 232, 331-356.

Chomsky, N. (1959). Review of Skinner's verbal behavior. Language, 35, 26-58.

Cottrell, G. W., & Plunkett, K. (1994). Acquiring the mapping from meanings to sounds. Connection Science, 6(4), 379-412.

Crick, F. H. C. (1989). The real excitement about neural networks. Nature, 337, 129-132.

Elman, J. L. (1993). Learning and development in neural networks: the importance of starting small. Cognition, 48(1), 71-99.

Elman, J., Bates, E., Karmiloff-Smith, A., Johnson, M., Parisi, D., & Plunkett, K. (In press). Rethinking Innateness: Development in a connectionist perspective. Cambridge, MA: MIT Press.

Foldiak, P. (1991). Learning invariance in transformational sequences. Neural Computation, 3, 194-200.

Karmiloff-Smith, A. (1979). Micro- and macrodevelopmental changes in language acquisition and other representational systems. Cognitive Science, 3, 91-118.

MacWhinney, B., & Leinbach, A. J. (1991). Implementations are not conceptualizations: Revising the verb learning model. Cognition, 40, 121-157.

McClelland, J. L. (1989). Parallel distributed processing: implications for cognition and development. In R. G. M. Morris (Ed.), Parallel Distributed Processing: Implications for Psychology and Neurobiology. Oxford: Clarendon Press.

McShane, J. (1979). The development of naming. Linguistics, 17, 879-905.

Marchman, V. A. (1993). Constraints on Plasticity in a Connectionist Model of the English Past Tense. Journal of Cognitive Neuroscience, 5(2), 215-24.

Marcus, G. F., Ullman, M., Pinker, S., Hollander, M., Rosen, T. J., & Xu, F. (1992). Overregularization in language acquisition. Monographs of the Society for Research in Child Development, 57(4), Serial No. 228.

Mareschal, D., Plunkett, K., & Harris, P. (1995). Developing Object Permanence: A Connectionist Model. In J. D. Moore & J. F. Lehman (Eds.), Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society, (pp. 170-175). Mahwah, NJ.: Lawrence Erlbaum Associates.

Piaget, J. (1952). The Origins of Intelligence in the Child. New York: International Universities Press.

Piaget, J. (1955). Les stades du développement intellectuel de l'enfant et de l'adolescent. In P. Osterrieth et al. (Eds.), Le problème des stades en psychologie de l'enfant. Paris: Presses Universitaires de France.

Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a Parallel Distributed Processing Model of language acquisition. Cognition, 29, 73-193.

Plaut, D. C., & Shallice, T. (1993). Deep Dyslexia: A Case Study of Connectionist Neuropsychology. Cognitive Neuropsychology, 10(5), 377-500.

Plunkett, K. (1995). Connectionist Approaches to Language Acquisition. In P. Fletcher & B. MacWhinney (Eds.), Handbook of Child Language, (pp. 36-72). Oxford: Blackwells.

Plunkett, K., & Marchman, V. (1991). U-shaped learning and frequency effects in a multi-layered perceptron: Implications for child language acquisition. Cognition, 38, 43-102.

Plunkett, K., & Marchman, V. (1993). From rote learning to system building: acquiring verb morphology in children and connectionist nets. Cognition, 48, 1-49.

Plunkett, K., Sinha, C. G., Moller, M. F., & Strandsby (1992). Symbol grounding or the emergence of symbols? Vocabulary growth in children and a connectionist net. Connection Science, 4, 293-312.

Reznick, J. S., & Goldfield, B. A. (1992). Rapid change in lexical development in comprehension and production. Developmental Psychology, 28, 406-413.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing: Explorations in the Microstructure of Cognition, Vol 1: Foundations, (pp. 318-362.). Cambridge, MA: MIT Press.

Rumelhart, D. E., & McClelland, J. L. (1986). On learning the past tense of English verbs. In J. L. McClelland & D. E. Rumelhart (Eds.), Parallel distributed processing: explorations in the microstructure of cognition. Cambridge: MIT Press.

Siegler, R. (1981). Developmental sequences within and between concepts. Monographs of the Society for Research in Child Development, 46, Whole No. 2.

Snow, C. E. (1977). Mothers' speech research: From input to interaction. In C. E. Snow & C. A. Ferguson (Eds.), Talking to children: Language input and acquisition. Cambridge: Cambridge University Press.

Spelke, E. S., Katz, G., Purcell, S. E., Ehrlich, S. M., & Breinlinger, K. (1994). Early knowledge of object motion: continuity and inertia. Cognition, 51, 131-176.

von Hofsten, C. (1989). Transition mechanisms in sensori-motor development. In: A. de Ribaupierre (Ed.), Transition mechanisms in child development: The longitudinal perspective, 223-259. Cambridge, UK: Cambridge University Press.

Vygotsky, L. (1962). Thought and language. Cambridge: MIT Press.

Ungerleider, L. G., & Mishkin, M. (1982). Two cortical visual systems. In: D. J. Ingle, M. A. Goodale, & Mansfield (Eds.), Analysis of visual behavior. Cambridge, MA: MIT Press.

van Geert, P. (1991). A dynamic systems model of cognitive and language growth. Psychological Review, 98, 3-53.

