CRL Newsletter
July 1995
Vol. 9, No. 3
The newsletter of the Center for Research in Language, University of California,
San Diego, La Jolla CA 92039. 858-534-2536; email: editor@crl.ucsd.edu.
Connectionist Modeling of the Fast
Mapping Phenomenon
Jeanne Milostan
Department of Computer Science and Engineering, UCSD
1 Introduction
The average child learns some 14,000 words before the age of 6, which represents
the daunting task of acquiring 9 new words per day, or about one each waking
hour [2]. Researchers examining the process by which this is accomplished
have time and again encountered an interesting effect: often the child can
acquire a new word from only one or a small number of exposures to that
word. Susan Carey has dubbed this phenomenon "fast mapping."
In this paper we examine the research which has been done on the manifestation
of fast mapping in children and explore how this may be explained in terms
of a general cognitive model of language acquisition. We then examine a
number of basic and advanced connectionist models and systems and weigh
how each stands in relation to describing and explaining the fast mapping
behavior. We then speculate on what is missing from the constellation of
models available and propose directions for future research in this area.
2 Fast Mapping
2.1 Empirical Demonstrations
Susan Carey [2] began by asking the question "What is learned when
a word is added to a child's vocabulary? Where does the process of word
learning begin?" In her study, she examined the preschool child's limits
on word learning capacity. The study tested the acquisition of a novel word
representing a color -- chromium. After demonstrating that none of the children
in the study (age 3 to 4) had a separate name for the color olive (each
identified it as green or brown), the experimenter presented the word chromium
to each child in the context of a task request: "Please hand me the
chromium cup; not the red one, the chromium one" where the choice was
between one red cup and one otherwise identical olive (chromium) cup. Carey
found that given only one exposure to the color name, upon comprehension
testing one week later 9 of the 14 subjects successfully identified either
an olive or a green color chip when asked to point to the chromium one.
Additionally, during a production test 6 weeks later, 8 of the 14 subjects
answered differently when asked to name the color chip than they had before
the experiment began. That is, where they had originally named the chip
green or brown, they now said they didn't know the name or used another
unstable color referent from their vocabulary, thus indicating that they
had learned and retained the knowledge that olive has its own color name.
Additionally, Carey found that the children who learned the name after the
brief exposure could take two different tacks. For some, the False-Synonym
group, chromium was used as another word for green. Other children adopted
the Odd-Color-Odd-Name strategy; these children demonstrated comprehension
of the word, but for production named another color from their lexicon which
also did not have a stable referent, thus again demonstrating they knew
that olive had a separate name.
Earlier, Nelson and Bonvillian [13] had performed a study in which children
were exposed to 18 new concepts, of which 9 were made-up words and 9 were
actual English words which the children had not yet acquired (7 control
children also did not acquire these words by the end of the study). In a
series of 10 experimental sessions, the children were presented with examples,
while in every third session an unnamed exemplar was used to test comprehension.
Comprehension was tested both by asking for the object by name, and by holding
up the example and asking "Bring me one of these." This study
demonstrated that the child could acquire the name from a single example,
but that learning was more likely when two or four named exemplars were
encountered.
In examining the question of what characteristics of language are dissociable,
Bates et al. [1] also performed a study examining the acquisition of a novel
concept in young children. In this study, a novel object was given both
a novel name ("fiffin") and a novel associated action ("glooping").
In an initial 5 minute exposure conducted in the home, the children were
shown several fiffins and glooping was demonstrated. In a lab session 2-3
days later, comprehension was tested through a multiple choice test and
in a play session: "Make the kitty gloop the fiffin." Of the 23
subjects, 9 performed the gesture successfully in the home, while 18 did
so in the lab. 8 subjects also made successful verbal attempts at pronunciation
in the home; 9 did so in the lab. During the multiple choice test, the average
score was 75% correct, where 33% was chance. Additionally, 18 kitties successfully
glooped. This study demonstrated again that children can obtain a concept
after an extremely brief exposure, and that it was not necessary to perform
imitation to obtain the concept, as many demonstrated lab comprehension
without acting out in the home. Additionally, Bates showed that the type
of knowledge the child demonstrated was correlated with language "style";
that is, fiffin comprehension was related to early comprehension, while
fiffin imitation was related to early production.
Mabel Rice [16] addressed word acquisition from television viewing, thus
offering evidence that neither lexical acquisition in general nor fast mapping
in particular is limited to interactive exchanges. In one study, Rice exposed a number
of children to short cartoon segments which were designed to introduce new
words. The test words in this case consisted of actual English words which the
children did not already have in their vocabularies, and which included
a number of words which were not object names or attributes. In all, each
subject was exposed to 20 new words in a brief time; each 12-minute cartoon
presented several instances of a few new words, for a total of 114 presentations
across all words. From this exposure, the 5-year-old subjects gained an average
of 4.87 words as compared to controls, while the 3-year-olds gained an average
of 1.56 words. This study demonstrated that new words need not be contained
in exaggerated, referent-matching contexts in order to be acquired, and
that the new word need not be surrounded exclusively by familiar words.
Additionally, this study demonstrated that words other than object names
and attributes were also subject to fast mapping, and that the new words
need not be presented in the exact same context each time in order to be learned.
Rice did additional work [17] in a more naturalistic home environment, where
it was demonstrated that children learn new words rapidly from educational
programs such as "Sesame Street" even with the environmental distractions
associated with home television viewing.
In a study intended to explore what aspects of a word are developed upon
fast mapping, Chris Dollaghan [3] tracked acquisition and use of a nonsense
word, "koob". This word was introduced in a naturalistic setting;
the experimenter asked the child (age 2:1 to 5:11) to "Hide the koob
under the bowl" rather than explicitly stating "This is a koob."
The experiment was constructed so that the child could actually perform
the task requested without forming any theory of the name of the intended
object. After one exposure to the word, the subjects were later tested for
comprehension ("Hand me the koob"), production ("What's this?"),
recognition ("What is this? Is it a koob, soob, or teed?") and
association with location ("Where did you hide this?"). In most
cases, an immediate inference between the unfamiliar word and object was
made, although the extent to which that knowledge was available for use
varied considerably from child to child.
2.2 Manifestations, Modulations, Limitations
The above studies and many others demonstrate that the fast mapping phenomenon
is a real, robust occurrence which appears over a variety of situations.
The amount of fast mapping varies with age; in particular, subjects who
are too young do not show much learning. Learning occurred more readily
over a broad base of examples rather than a narrow base (only 1 example).
Fast mapping is robust across method; children successfully acquired words
from limited exposure whether the presentation was by an experimenter, the
child's mother, or the television. It is robust over distraction, as demonstrated
in the unfamiliar environment of a laboratory or in the distracting environment
of television viewing in the home, with its associated sibling, parental
and play distractions. Fast mapping is robust across linguistic method of
presentation; the effect was present for words presented in incidental naming,
explicit presentation ("This is a ..."), and in sentences both where
the surrounding words were all familiar and when they contained other unknown
words.
The amount and manifestation of the effect was seen to vary with gender,
age, and cognitive style -- whether the child favors one-word, telegraphic
speech or whole-phrase speech. Additionally, the effect varied with
birth order and sibling constellation. Nelson and Bonvillian [13] found
that children whose next-older sibling was less than 24 months older gained
the most words, with first-born children close behind; children whose
next-older sibling was more than 24 months older gained the fewest.
Nelson and Bonvillian hypothesize that the first-born children have the
added advantage of parents who have more time to spend, and thus are exposed
to more explicit referential sentences and more parental time overall. Short-lag
children lose the benefit of total parental attention, but are helped to
a greater extent by the presence of an older sibling whose speech is more
like their own than like the parents'. That is, the short-lag child receives
more predigested and simplified examples of speech on which to bootstrap;
some of the processing has already been done and the short-lag child can
leverage off this benefit. Conversely, the longer-lag children do not have
this benefit, and also do not receive parental attention to the extent that
first-born children do.
2.3 Theory
Rare Event Cognitive Comparison Theory
In [14], Nelson explores how current language and cognitive levels facilitate
and limit what will be learned next. The overall acquisition mechanism depends
on cognitive comparisons between old and new structures in order for the
child to determine when the current language structure is insufficient.
The mechanism is seen as a "rare event" mechanism, as the attention
to new input which leads successfully to the development of new structures
for the child's future use occurs only rarely. The development of a new
structure occurs along the following lines:
1. Assignment of old structure to new input strings. As long as new input
matches the structures already in use, the system need not change.
2. Tentatively Abstracted Foci. Something happens to draw attention to some
area of the structure, to create a "hot spot" of attention. This
may occur because a number of mismatches of new input strings have drawn
attention, or simply because the child's existing structures have developed
to a certain extent which prepares for a new structure. In this way, developments
can bootstrap, as a child may not be ready for a particular structure until
other supporting structures have been laid out first.
3. Finding input mismatches within Tentatively Abstracted Foci. Once attention
is drawn, mismatches will be more readily noticed and attended to.
4. Selective Storage. Certain strings of interest will be stored, perhaps
in episodic memory.
5. Selective Retrieval. With attention to mismatches, previously encountered
examples can be retrieved for comparison with the detected mismatch. Note
that language advances can thus be made during private thinking, as the
child retrieves example strings from memory and mulls them over alone.
6. Selective Analysis. The child considers the newly collected data.
7. Selective Hypothesis Monitoring and Consolidation. A conclusion is reached
and new structures are tentatively created. Previously encountered and new
input strings are compared against the new structure, which is eventually
consolidated into the child's language structure.
From this point of view, one can see how input exposure will affect the
child's particular path to language mastery. Different types of input will
cause individual children to call into question various structures at different
times. The particular structures the child attends to will determine the
path of acquisition the child takes. The issue of birth order mentioned
above can be cast in this light; input from a slightly-older sibling is
more like the child's own production, thus the differences are small and
more easily attended to, allowing the younger child to ride on the coat
tails of the older sibling's language efforts. Similarly, first-born children
tend to get more explicit input from parents, and thus again attention is
more readily drawn.
A Tentative Approach
The fast mapping phenomenon may then be cast in the light of the preceding
information. One may envision a protracted "hot spot" of attention
to word naming which the child encounters, perhaps driven by the root cause
of the above elaborated system. Based on the data collected from the various
studies, we may draw the following conclusion: Fast mapping in children
and the resultant characteristics of the word which the child thus obtains
are affected by the area of attention, by the amount and style of input, and
by the use of episodic memory to integrate and store the information. It is reasonable
to hypothesize that a language acquisition model which incorporates these
elements may also demonstrate the fast mapping phenomenon.
3 Episodic Memory
Human memory is not performed by a single mechanism, but consists of several
different functionally and physically distinct components. To simplify,
a distinction may be drawn between what can be termed declarative memory,
that of explicit facts and events, and nondeclarative memory, which is involved
in such things as habit formation and priming. Only the information in the
declarative memory can be consciously recalled, and it is this part of memory
which is of concern when addressing one-trial lexical acquisition.
Lesion studies point out the essential role of the hippocampus and surrounding
structures in the operation of declarative memory. Again to simplify, the
hippocampus is involved in processes which bind together previously unrelated
events (represented in different parts of the brain), which then together
constitute a memory of the event in question. The hippocampus
also participates in forming in neocortex an integrative trace of a newly
formed memory, possibly through a feedback loop between neocortex and the
hippocampus. That is, the hippocampus develops and maintains a temporary
"trace" of the formed memory while a more permanent one is formed
elsewhere in the brain.
Thus it becomes clear that in order to adequately model the process of lexical
acquisition in a more realistic manner, and to thus develop a system which
will naturally manifest the fast mapping behavior, it is necessary to develop
a system which addresses the issues of attention, episodic memory, short-term
memory to long-term memory conversion, and the ability to generalize similarities
and still handle very novel input gracefully.
4 Connectionism
In this section we look at previous work which has been performed, with
an eye to systems which may result in fast mapping. We examine standard
network architectures, followed by specific systems which have attempted
to emulate episodic memory or lexical acquisition in general, or the fast
mapping phenomenon in particular. This is followed by an examination of
a handful of larger systems which attempt to more adequately address human
performance issues.
4.1 Basic Connectionist Models
Backpropagation
The backpropagation neural network [19] is a multilayer architecture consisting
of interconnected layers of processing units. Input vectors are presented
to the elements of the input layer, and activation is propagated through
the network to the output layer. During training, the values at the output
layer are compared to the actual desired output associated with the given
input vector. Any error in the network output is used to calculate adjustments
to the network weights using a gradient descent technique. The overall effect
is that over time, the network weights adjust to form a representation of
the function described by the set of input-output vectors presented. The
backprop network is often able to form a generalization of the function,
rather than a simple mapping-and-recall of the input-output pairs. This
generalization is often desirable, in that inputs which are similar to learned
data will receive outputs which are similar to the learned patterns. However,
truly novel inputs will often also be given a generalized output, rather
than the specific outputs with which they are paired. This has some utility in
modeling overgeneralization in language acquisition, but does not function
properly for the acquisition of novel concepts.
Unfortunately, as powerful as backprop is, it is not suitable for modeling
the fast mapping phenomenon. If a new input is presented to a network which
has already been trained, it is possible that the representation the network
has developed is not suitable to generate the proper output for the input.
In this case we would like to perform additional training on the network
to incorporate the new data. Unfortunately, presentation of the new input-output
pair to the network for only a few training cycles may not be sufficient
to adjust the weights to properly represent the new information. Simply
adding the new data to the existing training set and continuing training
will require many training episodes for the network to develop a representation
of the new data; it will not display the rapid learning desired. One also
may attempt to force the weights in the network to make a large adjustment
in the direction indicated by the new data; however, this technique runs
the risk of losing previously learned associations as the network may move
too far in that direction. Either way, the network will take too long to
learn or will not learn well enough to model fast mapping. (But see Section
4.5 below on some possibilities afforded by recurrent networks, i.e. networks
which allow self connections or backward connections.)
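The trade-off described above can be illustrated with a minimal sketch. It assumes a tiny one-hidden-layer sigmoid network trained by plain gradient descent on squared error; the class name TinyNet, the toy patterns, and the learning rates are all hypothetical choices made for illustration and are not taken from [19]. The network is trained for many passes on three old pairs and is then given a single small step on a new pair, so that the error on the new pair before and after that one exposure can be compared.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """One hidden layer of sigmoid units; the last weight in each row is a bias."""

    def __init__(self, n_in, n_hid, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hid)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1] + [1.0]
        y = sigmoid(sum(w * v for w, v in zip(self.w2, hb)))
        return xb, hb, y

    def error(self, x, target):
        return 0.5 * (self.forward(x)[2] - target) ** 2

    def train_step(self, x, target, rate):
        """One gradient descent step on the squared error for a single pair."""
        xb, hb, y = self.forward(x)
        d_out = (y - target) * y * (1.0 - y)
        # Hidden deltas are computed before the output weights change.
        d_hid = [d_out * self.w2[j] * hb[j] * (1.0 - hb[j])
                 for j in range(len(self.w1))]
        self.w2 = [w - rate * d_out * h for w, h in zip(self.w2, hb)]
        for j, row in enumerate(self.w1):
            self.w1[j] = [w - rate * d_hid[j] * v for w, v in zip(row, xb)]

# Hypothetical toy data: three previously learned pairs and one novel pair.
old_pairs = [((0, 0), 0.0), ((0, 1), 0.0), ((1, 0), 1.0)]
new_pair = ((1, 1), 0.0)

net = TinyNet(n_in=2, n_hid=3)
old_before = sum(net.error(x, t) for x, t in old_pairs)
for _ in range(2000):                      # many passes over the old data
    for x, t in old_pairs:
        net.train_step(x, t, rate=0.5)
old_after = sum(net.error(x, t) for x, t in old_pairs)

new_before = net.error(*new_pair)
net.train_step(*new_pair, rate=0.05)       # a single brief exposure
new_after = net.error(*new_pair)

print(f"old data: {old_before:.3f} -> {old_after:.3f} after 2000 passes")
print(f"new pair: {new_before:.3f} -> {new_after:.3f} after one step")
```

Raising the rate used for the single step makes the new pair move further in one exposure, but the same step then also shifts the weights that encode the old pairs, which is the risk noted in the text.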
Autoassociative Memory
The autoassociative neural network is actually a large family of paradigms,
all of which have in common the association of an input vector with itself.
One very useful member of this class is the Kohonen network [7]. In this
model, the network consists of a number of processing elements each of the
same dimensionality as the input. Through training, the values in the element
vectors are adjusted so that they come to represent the space of possible
input vectors (as represented by the examples given during training). Training
this network consists of identifying the processing element which lies closest
to the input vector, and adjusting the element vector towards the input
vector by some fraction of the distance between them. With the addition
of "neighborhood" links, in which each element is connected to
additional processing elements which will form its neighbors, the network
will form a topological map of the space of input data. A common use of
the "neighbors" is to adjust all those elements in the closest
element's neighborhood toward the input vector also, by a smaller amount
than the winner adjustment. Through a very large number of training presentations
the processing elements come to reflect the spatial representation and extent
of the training input. An example representation of a network which has
been trained to represent an even distribution of points in the unit square
is shown in Figure 1. This type of network, frequently called a topological
map or feature map, is often used as a memory of examples seen, and as such
may be a candidate for representing lexical memory in humans. Since the
network is self-organizing and topological, it will develop areas of common
information which can be seen as representing a lower dimensional projection
of the main information represented by the network. However, since the mapping
is continuous, the precise boundaries of the various categories developed
are not specified.

Figure 1
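The training procedure just described can be sketched as follows. This is a simplified illustration rather than the algorithm of [7] in full: it assumes a one-dimensional chain of elements (each element's neighbors are the elements on either side of it), fixed adjustment fractions instead of the decaying schedules normally used, and inputs drawn evenly from the unit square. The function names and parameter values are hypothetical.

```python
import random

def closest(units, x):
    """Index of the processing element whose vector lies closest to x."""
    return min(range(len(units)),
               key=lambda i: sum((u - v) ** 2 for u, v in zip(units[i], x)))

def train_map(n_units=10, steps=5000, rate=0.2, neighbor_rate=0.1, seed=0):
    rng = random.Random(seed)
    units = [[rng.random(), rng.random()] for _ in range(n_units)]
    for _ in range(steps):
        x = [rng.random(), rng.random()]       # input from the unit square
        win = closest(units, x)
        for i in (win - 1, win, win + 1):      # the winner and its chain neighbors
            if 0 <= i < n_units:
                # Neighbors move by a smaller fraction than the winner.
                frac = rate if i == win else neighbor_rate
                units[i] = [u + frac * (v - u) for u, v in zip(units[i], x)]
    return units

units = train_map()
for u in units:
    print(f"({u[0]:.2f}, {u[1]:.2f})")
```

Because every adjustment moves an element part of the way toward an input, the elements stay inside the region covered by the training data and gradually spread out over it.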
This feature map paradigm has several properties which make it less than
adequate for representing fast mapping. The most obvious weakness is that
the network always returns as the winner the vector of the element which
is closest to the presented input. For generalization, this is a desired
trait in that one will always be presented with a representative vector
which will be identical or similar to an actual input from the training
set, or some blended combination of inputs. The problem comes when a novel
input is presented which is very much unlike those previously seen. In normal
operation, that element vector which is closest to the input vector will
be returned as a "memory" of the input, regardless of the actual
distance to the input. The network does not take into consideration the
actual distance from the nearest vector, nor the typical distance between
vectors in the trained network. One may wish to use the distance between
the input vector and the closest processing element as an indication of
whether the input is correctly categorized by the network. However, absolute
error is not an adequate measure because there is no threshold for a decision
of "don't know." The distance threshold will vary between individual
processing elements; an element in a densely populated area of input space
will encompass a much smaller area for its valid inputs than an element
in a sparsely populated area. The network has no notion of representing
"don't know" or of flagging that the input is extremely different.
Additionally, a problem still lies in modifying the network to incorporate
the new information. Addition of a truly novel input may deform the network
severely, with performance returning only gradually through continued training
over the entire training set. This is clearly not an adequate model of fast
mapping. The standard implementation of autoassociative feature maps will
not adequately model the fast mapping phenomenon.
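The missing "don't know" response can be made concrete with a short sketch. The element vectors below are a hypothetical trained map, chosen by hand so that three elements are packed closely and two lie far apart; the functions recall and spacing are illustrative names, not part of any published implementation.

```python
import math

def recall(units, x):
    """Return the closest element and its distance; the map cannot decline to answer."""
    best = min(units, key=lambda u: math.dist(u, x))
    return best, math.dist(best, x)

def spacing(units, u):
    """Distance from element u to its nearest fellow element."""
    return min(math.dist(u, v) for v in units if v != u)

# Hypothetical trained map: a densely populated corner and a sparse region.
units = [(0.10, 0.10), (0.12, 0.10), (0.10, 0.12), (0.50, 0.50), (0.90, 0.90)]

familiar = recall(units, (0.11, 0.11))
novel = recall(units, (5.0, 5.0))      # far outside anything seen in training

print("familiar input:", familiar)
print("novel input:   ", novel)
print("dense spacing: ", spacing(units, (0.10, 0.10)))
print("sparse spacing:", spacing(units, (0.90, 0.90)))
```

The novel input still receives a winner, and the two spacing values differ by more than an order of magnitude, so no single distance threshold would separate "known" from "unknown" across the whole map.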
4.2 Attempts to Address "Fast" Mapping
Fast Weights
Hinton and Plaut [6] modified a standard backpropagation network to have
two connections between each unit: one with a slow, stable weight and one
with a fast, elastic weight. The slow weights function much as they would
in a regular connectionist model: they change slowly and hold the long-term
knowledge of the network. In contrast, the fast weights change rapidly and
continually decay toward zero, and thus reflect only the recent past. The
effective connection between two units is the sum of the fast and slow connections.
At any time, the system's knowledge can be thought of as the slow weights
with a temporary overlay of the fast weights.
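A minimal sketch of the two-weight idea follows. To keep it short it assumes a single linear unit trained with the delta rule rather than the backpropagation network used in [6], and the learning rates and decay factor are hypothetical values. What it is meant to show is only the structure: the effective weight is the sum of the slow and fast weights, the fast weights take large steps, and they decay toward zero over time.

```python
class FastSlowUnit:
    """A linear unit whose every connection has a slow weight and a fast weight."""

    def __init__(self, n_in, slow_rate=0.01, fast_rate=0.5, decay=0.9):
        self.slow = [0.0] * n_in
        self.fast = [0.0] * n_in
        self.slow_rate = slow_rate
        self.fast_rate = fast_rate
        self.decay = decay

    def output(self, x):
        # The effective connection strength is the sum of the two weights.
        return sum((s + f) * v for s, f, v in zip(self.slow, self.fast, x))

    def learn(self, x, target):
        err = target - self.output(x)
        for i, v in enumerate(x):
            self.slow[i] += self.slow_rate * err * v   # small, lasting change
            self.fast[i] += self.fast_rate * err * v   # large, temporary change

    def tick(self):
        """One time step without input: the fast weights decay toward zero."""
        self.fast = [self.decay * f for f in self.fast]

unit = FastSlowUnit(n_in=3)
pattern = [1.0, 0.0, 1.0]

unit.learn(pattern, target=1.0)        # a single presentation
after_one_trial = unit.output(pattern)

for _ in range(50):                    # time passes
    unit.tick()
after_decay = unit.output(pattern)

print(f"right after one trial: {after_one_trial:.3f}")
print(f"after the fast weights decay: {after_decay:.3f}")
```

After the decay only the small slow-weight change remains, which illustrates why a separate mechanism is needed to move rapidly acquired associations into long-term storage.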
This system could be used for rapid temporary learning. In other words,
when presented with a new association, the network could conceivably store
the information in one trial. Although this addresses the issue of a backprop
network not being able to rapidly integrate new information, this setup does not
address the possibility of previous knowledge being obscured by the new
addition. This solution is obviously better than forcing a single set of
weights in the direction of the new information, as the previous knowledge
is not lost; however previously learned associations may still be unavailable
while the temporary weights are in place.
Additionally, although it is easy to train the fast weights for the desired
one-shot learning effect, it is not clear how to incorporate the new associations
gracefully into the slow weights for long-term storage without the traditional
drawbacks of continued training with the entire training set plus the additions.
Thus, this system does not adequately meet our needs for fast mapping as
seen during language acquisition.
CHARM
The CHARM (Composite Holographic Associative Recall Model) system developed
by Janet Metcalfe [10] [9] [11] uses a mathematical technique similar to
that used in holography to form an associative system which can be rapidly
updated through the operations of convolution and correlation. The use of
convolution for association results in the interaction of all of the parts
of one item with all of the parts of another. The system is presented input/output
pairs represented as feature vectors to be associated. Through various mathematical
transformations, the input is associated with the output, and their total
is combined with the results of other pairs into a large system representation.
For recall, the input is presented to the whole system, and further mathematical
machinations are performed, resulting in a vector intended to represent
the output of the original pair. Due to the nature of the mathematics, this
system shows one-shot learning. That is, upon one presentation of a pair,
the association is contained in the system. This shows much more promise
in the modeling of fast mapping than the models considered thus far, but
as noted above, in practice the fast mapping phenomenon does not occur every
time, nor does a successful fast mapping imply that the concept has been
obtained in its entirety. If the CHARM model were an accurate representation,
more cases would be seen of concepts springing fully-formed from the little
wizards' minds, as it were.
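The store-by-convolution, recall-by-correlation scheme can be sketched in a few lines. This is an illustration under stated assumptions, not Metcalfe's implementation: it uses circular convolution for simplicity, items are random feature vectors, and the item labels and the vector length are hypothetical. Each pair is presented exactly once and added into a single composite trace.

```python
import math
import random

def convolve(a, b):
    """Circular convolution: every feature of a interacts with every feature of b."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def correlate(cue, trace):
    """Circular correlation: an approximate inverse of convolve, used for recall."""
    n = len(cue)
    return [sum(cue[j] * trace[(i + j) % n] for j in range(n)) for i in range(n)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))

rng = random.Random(0)
n = 512                                   # number of features per item (assumed)

def random_item():
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]

names = ["koob", "fiffin", "chromium"]    # labels only; the vectors are random
cues = {name: random_item() for name in names}
targets = {name: random_item() for name in names}

# One presentation per pair: each association is added into the composite trace.
trace = [0.0] * n
for name in names:
    pair = convolve(cues[name], targets[name])
    trace = [t + p for t, p in zip(trace, pair)]

# Recall: correlate a cue with the trace, then find the best-matching known target.
echo = correlate(cues["koob"], trace)
scores = {name: cosine(echo, targets[name]) for name in names}
best = max(scores, key=scores.get)

print("best match for the 'koob' cue:", best)
print({name: round(score, 2) for name, score in scores.items()})
```

The recalled vector is a noisy version of the stored target rather than an exact copy, and the noise grows as more pairs are superimposed in the same trace, which is one source of the interference effects mentioned below.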
However, CHARM does have to its credit the ability to model quite a number
of other psychological phenomena, including generalization and a number
of memory interference and failure effects. This system holds much promise
in its future applicability as a model of fast mapping.
4.3 The DISCERN Model
Description
The DISCERN model (DIstributed SCript processing and Episodic memoRy Network),
developed by Risto Miikkulainen [12], is a distributed artificial neural
network system which learns to process simple stories which follow a stereotypic
framework. As such, it combines the traditional symbolic artificial intelligence
paradigms of scripts and frames with more realistic cognitive modeling and
neurocomputation methodology. This model combines the issues traditionally
associated with script-based story understanding and adds to them the idea
of episodic memory. There are several issues which the symbolic approach
to script theory does not address. For instance, the architecture, processing
mechanisms and knowledge embedded in symbolic systems are hand-coded with
a specific domain and data in mind. Inferences are based on handcrafted
rules and representations of the scripts. Such systems cannot utilize the
statistical properties of the data to enhance processing.
One thing which the DISCERN model brings to the story understanding task
is the idea of episodic memory. Narratives are stored in the model one at
a time as they are read in, with only a single presentation. The new story
is recognized as an instance of a familiar sequence of events and attention
is paid only to the facts specific to the story, even though the system
has not gone back and explicitly reactivated all the stories previously
encountered. This parallels human episodic memory, which seems to be structured
to support classification based on similarities and storing the differences,
with the particular structures being developed by experience.
The episodic memory structure of DISCERN also supports associative retrieval.
As in humans, a question supplies only partial information about the story
to which it refers, yet the story is retrieved with only the question as
a cue. The DISCERN model has been developed to address these issues. It
is the implementation of episodic memory which is of interest for the purposes
of this paper.
Episodic Memory Implementation
The DISCERN model implements episodic memory as a collection of traces on
a hierarchical feature map system. As described above, a self-organizing
feature map (autoassociative network) is a (biologically-motivated) method
for unsupervised learning and for organizing information. The feature map
representation has many properties which make it well-suited for modeling
memory. Classification performed by a feature map is quite robust, even
in the presence of noise or incomplete inputs. Categorical perception can
be thus modeled, since inexact input often results in the recovery of the
exact representation of previously stored data. In contrast, since the feature
maps tend to be continuous with intermediate states, it is possible in some
cases to recover a blend of a number of items. However, as also mentioned
above, the feature map representation suffers from the drawback that boundaries
of related areas are not specified on the map. Additionally, feature maps
created from high-dimensional input vectors take a long time to train.
These drawbacks can be addressed to some extent with hierarchical feature
maps. In this case, the hierarchical nature of the input features is represented
by a pyramid of feature maps. This speeds the learning of the system, and
makes categorization easier. In this setup, the input features are initially
classified by the uppermost map. The vector is then passed down to subsequent
maps for more detailed classifications (Figure 2).

The episodic memory storage and retrieval is implemented in the system as
trace feature maps on the hierarchical map structure. Trace feature maps
differ from ordinary feature maps by creating a memory trace at the location
of classification on the map. The map remembers that at some point it received
an input item which was classified at that point. The traces can be stored
one at a time, and the whole of the traces over an episode constitute the
memory of events. The traces are modeled by using the "neighborhood"
links of the feature map as activity links to develop basins of activation.
The attraction bubbles created by the various memory traces are then superimposed
and blended. Upon memory recall, a partial or noisy input is presented to
the system. If it falls within an attraction bubble, the activation will
be drawn towards the center of the bubble, and the stored vector associated
with the center will be returned. In this case, the input vector could represent
a question for the system, with the unspecified features representing the
unknown roles, which would then be filled in through returning the center
vector of the specific instance activated.
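Recall from a partial cue can be sketched very roughly as follows. The sketch replaces the attractor dynamics of the trace feature map with a plain nearest-trace lookup inside a fixed radius, which is a deliberate simplification; the stored vectors, the radius, and the function name recall_trace are hypothetical. Unspecified features in the cue are marked with None and are filled in from the stored trace.

```python
def recall_trace(traces, cue, radius):
    """Return the stored trace nearest the cue on its specified features,
    or None when the cue falls outside every attraction bubble."""
    def dist(trace):
        return sum((t - c) ** 2
                   for t, c in zip(trace, cue) if c is not None) ** 0.5
    best = min(traces, key=dist)
    return best if dist(best) <= radius else None

# Hypothetical stored traces, each laid down in a single presentation.
traces = [(0.9, 0.1, 0.8, 0.2),
          (0.1, 0.9, 0.2, 0.7)]

question = (0.85, 0.15, None, None)     # last two roles are unknown
answer = recall_trace(traces, question, radius=0.3)

unfamiliar = (0.5, 0.5, None, None)     # not close to any stored trace
no_answer = recall_trace(traces, unfamiliar, radius=0.3)

print("answer:", answer)
print("unfamiliar cue:", no_answer)
```

In the full model the bubbles are superimposed and blended, so neighboring traces can interfere with one another; a fixed radius does not capture that behavior.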
In terms of representing episodic memory, this system performs well. New
stories presented to the system develop a memory trace which is robust in
a small number of presentations, and thus models the "fast" part
of fast mapping without resorting to an artificial, "guaranteed one-shot"
learning mechanism. The system demonstrates a number of memory phenomena
such as interference effects and generalization. The structure of the system
does not overly constrain how the information in the memory is to be organized,
and thus the system with use comes to reflect the statistical properties
of the data it has seen.
However, like the Kohonen Feature Maps reviewed earlier, the system suffers
from the limitation that it cannot learn truly novel information. That is,
although it can successfully represent stories on which it was not originally
trained, stories which are extremely unlike those seen during training will
not be handled correctly. Several suggestions for extensions to the system
(including those suggested by the author) addressing this limitation will
be examined in Section 5 below.
4.4 Attentional Mechanisms
The "sentence gestalt" model of St. John and McClelland [20] was
developed as an attempt to create a model which learns to convert a sentence
to a conceptual representation of the event which the sentence describes.
The model is intended to disambiguate ambiguous words, instantiate vague
words, assign thematic roles, and elaborate implied roles. In addition,
it is required to learn to perform these tasks, and perform them on-the-fly
as the sentence is presented, rather than waiting until the sentence is
finished and then performing calculations. The model is a mostly feed-forward
network with a number of hidden layers and a small amount of recurrence
(Figure 3).

The model performs rather well at its assigned tasks, and is able (through
the "probe" inputs) to answer questions about the representation
of a sentence it has developed. It also demonstrates a number of appropriate
phenomena such as generalization, interference and priming effects, and
frequency effects.
For the purposes of this paper, the most interesting property of the sentence
gestalt system is that it effectively develops an attentional mechanism.
That is, the system must learn through example which parts of the sentence
are important for providing which types of information. For example, the
system learns to balance word order against semantic constraints when
determining the meaning and roles of words in a sentence, without this
knowledge being otherwise coded into the system.
4.5 Generalization and Novelty
Although the linguistic processing model developed by Plaut et al. [15]
focuses mainly on learning to read (bold connections in Figure 4), the system
they have developed demonstrates some interesting behavior which may be
applicable to modeling fast mapping. Plaut and his co-authors develop a
recurrent network which learns to map orthography, the printed letters of
a word, to phonology, the phonetic representation of the word. Their effort
has produced a system which not only performs the mapping task, but successfully
demonstrates the frequency versus consistency effects shown by human subjects
and additionally shows performance following damage which parallels the
language difficulties of surface dyslexic patients.
For the fast mapping task, the interesting property of the model is how
the recurrent network behaves in the face of novelty. A concern with recurrent
networks is that, owing to the dynamics of the attractor surface defined by
the system weights, novel inputs will be treated as "incomplete" or "noisy"
data and subjected to the generalization behavior of the network (that is,
drawn toward an already learned pattern). However, their network develops
basins of attraction which interact like ripples in a pond to create additional
attractor basins for data which has not actually been presented to the network.
For instance, even if the network has only seen evidence for by (mapping
to the sound /bI/) and no (mapping to the sound /no/) the network may also
form an attractor basin into which bo would naturally fall (i.e. /bo/).
These extra basins can be shown to be a natural consequence of having a
highly connected, high dimensional dynamic space.
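One way to see how a basin for bo could exist without bo ever being presented is a small Hopfield-style sketch. This is not the network of Plaut et al.: it assumes that the two letter positions are handled by separate groups of units with no weights between them, a structure imposed by hand here rather than learned. The letter codes and the names hebbian, settle and decode are likewise hypothetical.

```python
def hebbian(patterns):
    """Hopfield-style outer-product weights (zero diagonal) for +1/-1 patterns."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]

def settle(weights, state, max_steps=20):
    """Synchronously update the state until it stops changing; return (state, steps)."""
    state = list(state)
    for step in range(max_steps):
        new = [1 if sum(w * s for w, s in zip(row, state)) >= 0 else -1
               for row in weights]
        if new == state:
            return state, step
        state = new
    return state, max_steps

# Hypothetical 8-unit codes for each letter (orthogonal within a slot).
ONSET = {"b": [1, 1, 1, 1, -1, -1, -1, -1], "n": [1, 1, -1, -1, 1, 1, -1, -1]}
VOWEL = {"y": [1, -1, 1, -1, 1, -1, 1, -1], "o": [1, -1, -1, 1, 1, -1, -1, 1]}

# "Training" on by and no only: each slot stores just the letters it has seen.
trained = ["by", "no"]
w_onset = hebbian([ONSET[w[0]] for w in trained])
w_vowel = hebbian([VOWEL[w[1]] for w in trained])

def decode(state, codes):
    """Name of the letter whose code equals the state, or None."""
    return next((k for k, v in codes.items() if v == state), None)

# Present the untrained word bo, with one unit flipped in each slot.
noisy_b = [-1] + ONSET["b"][1:]
noisy_o = VOWEL["o"][:-1] + [-1]
onset_state, onset_steps = settle(w_onset, noisy_b)
vowel_state, vowel_steps = settle(w_vowel, noisy_o)
word = decode(onset_state, ONSET) + decode(vowel_state, VOWEL)
```

Because each slot has its own attractors for the letters it has seen, any combination of a seen onset and a seen vowel is also a stable state of the whole system, including combinations that never occurred together in training.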
Note that in this case, the network demonstrates fast mapping. That is,
even though the network had not been trained on the mapping between the
letters bo and the sound /bo/, the network correctly made the mapping. The
network has an appropriate attractor basin for this mapping, and the system
provides just enough of a nudge to enter the basin and converge to the mapping.
At this point, any training on this specific example will serve to deepen
and expand the attractor basin, thus ensuring that the mapping will be made
more readily (in this case, fewer steps until convergence) in the future.
This rapid initial mapping followed by subsequent strengthening of the learned
associations is exactly the phenomenon which we seek. The use of this type
of network in the process of learning to speak has been anticipated by
Plaut et al., as represented by the dotted line in Figure 4, although this
use was not addressed directly in their paper.
5 What's Missing; What's Promising
None of the models explored captures fast mapping (or language acquisition
in general) well enough to serve as a model of human performance. However,
several of the systems show promise, which may be exploited through various
modifications. Using these modified systems, an overall connectionist system
can be developed which may indeed display
the desired fast-mapping phenomenon, while still producing overall behavior
which is consistent with other aspects of human language acquisition. The
proposed system combines aspects of the DISCERN model to represent episodic
memory, the sentence gestalt network to provide an attention mechanism,
and a recurrent network as described above to represent long-term memory.
The largest drawback of the DISCERN model's [12] representation of episodic
memory is its inability to represent truly novel inputs, due in part to its
basis in symbolic script theory but mostly to the nature of the autoassociative
networks used. However, as suggested by the author, modifying the episodic
memory to recruit new units dynamically as needed would address this problem,
with additional reorganization training conducted between input episodes.
This can be seen as an implementation of the structure building theory examined
in Section 2.3, with offline restructuring paralleled by language development
which occurs during the child's private play. We also propose that if the
input feature vectors were learned by an additional network system, rather
than being hand-coded to represent the scripts, this network would become
a more accurate model of episodic memory.
The sentence gestalt model developed by St. John and McClelland [20] is
an ideal candidate for the role of just such an additional network. As described
above, this network has demonstrated the ability to develop a form of attentional
mechanism. We propose that a network similar to that of St. John and McClelland
be used to determine which input features deserve the most attention. These
feature vectors may then be used as the basis for a system similar to the
DISCERN model.
Finally, we propose a recurrent system similar to the one used by Plaut
et al. [15] to represent long-term memory. This type of system provides
the generalization and the ability to represent novel inputs which a
representation of memory requires.
In this model, the sentence gestalt network/attention mechanism would steer
the focus of the network to features of interest, where the interest would
itself be defined by the attention mechanism and would evolve over the course
of the simulation. The gestalt of the inputs and features of focus would
then be sent to the episodic memory network, where the incoming information
would be incorporated into the episodic representation, recruiting units
as required to represent novel information. As mentioned above, the episodic
memory will be in a state of continual reorganization; once the episodic
memory "settles down" in its representation of a new concept,
that representation can then be incorporated slowly into long-term memory.
If the representation for a particular concept developed by the episodic
memory does not exist or is substantially different from that stored in
the long-term memory, the new representation will, through gradual training,
be incorporated into long-term memory. If the episodic representation
is consistent with that already in long-term memory, the representation
will consequently be strengthened in long-term memory through additional
training. Finally, feedback from both the episodic and the long-term memory
can interact with the attentional mechanism to provide the basis from which
to detect novelty and discrepancy worthy of attention.
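The hand-off from episodic to long-term memory described above might be sketched as follows. This is only an illustration of the proposed control flow, not an implementation of the proposed networks: long-term memory is reduced to a dictionary of vectors, and the function name consolidate, the rate and threshold values, the all-zero starting vector, and the example trace are all assumptions.

```python
import math

def consolidate(long_term, concept, episodic, rate=0.25, threshold=0.5):
    """One gradual training step; returns True if the episodic trace was novel."""
    # Assumption: an unknown concept starts from an all-zero vector, zero strength.
    entry = long_term.setdefault(
        concept, {"vector": [0.0] * len(episodic), "strength": 0.0})
    gap = math.sqrt(sum((e - v) ** 2
                        for e, v in zip(episodic, entry["vector"])))
    novel = gap > threshold
    if novel:
        # Missing or discrepant: move long-term memory a small step toward the trace.
        entry["vector"] = [v + rate * (e - v)
                           for v, e in zip(entry["vector"], episodic)]
    else:
        # Consistent: strengthen the existing representation.
        entry["strength"] += rate
    return novel

long_term = {}
settled_trace = [1.0, 0.0]   # hypothetical settled episodic representation
signals = [consolidate(long_term, "chromium", settled_trace) for _ in range(4)]
```

With these settings the first three steps report novelty while the long-term vector is pulled toward the episodic trace; on the fourth step the two agree closely enough that the entry is strengthened instead. The returned flag stands in for the feedback to the attentional mechanism.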
It is hoped that the system proposed will adequately model the process of
lexical acquisition in a more realistic manner, and thus will naturally
manifest the fast mapping behavior. The proposed system is intended to address
the issues of attention, episodic memory, short-term memory to long-term
memory conversion, and the ability to generalize yet still handle novel
input gracefully.
6 Conclusions
The prodigious rate at which young children acquire language has led some
to dub them "linguistic wizards." The task of acquiring thousands
of words, along with semantics and syntax (and learning to tie their shoes),
all within a few short years requires fast mapping, that is, the acquisition
of a word through extremely limited exposure. This effect has been studied
by a number of researchers, and has been found to be quite robust.
The field of connectionist modeling, in its quest for insight into human
language acquisition, has thus far failed to develop a feasible system which
adequately mimics human performance in language acquisition, including the
fast mapping phenomenon so prevalent in children attempting the task. However,
as this paper has shown, several current research efforts show promise in
addressing these issues. In light of this, a model has been proposed consisting
of a combination of various system components described in this paper, which
is intended to more closely model episodic memory, attention, and generalization.
It is argued that this model would then display the characteristics associated
with human performance, including the fast mapping phenomenon.
References
[1] Bates, E., Bretherton, I. & Snyder, L. (1988). Acquisition of a novel
concept at 20 months. From First Words to Grammar: Individual Differences
and Dissociable Mechanisms, 124-134. Cambridge, NY: Cambridge University
Press.
[2] Carey, S. (1978). The child as word learner. In M. Halle, G. Miller & J.
Bresnan (Eds.), Linguistic Theory and Psychological Reality, 264-293. Cambridge,
MA: MIT Press.
[3] Dollaghan, C. (1985). Child meets word: "Fast Mapping" in
preschool children. Journal of Speech and Hearing Research, 28, 449-454.
[4] Hecht-Nielsen, R. (1990). Neurocomputing. Reading, MA: Addison-Wesley.
[5] Hertz, J. A., Krogh, A. S. & Palmer, R. G. (1991). Introduction to the
Theory of Neural Computation. Reading, MA: Addison-Wesley.
[6] Hinton, G. E. & Plaut, D. C. (1987). Using fast weights to deblur old
memories. Proceedings of the Ninth Annual Conference of the Cognitive Science
Society, 177-186.
[7] Kohonen, T. (1984). Self Organization and Associative Memory, Second
Edition. Berlin: Springer-Verlag.
[8] McClelland, J. L. & Rumelhart, D. E. (1986). Parallel Distributed Processing
(Vol. 2). Cambridge, MA: MIT Press.
[9] Metcalfe, J. (1991). Recognition failure and the composite memory trace
in CHARM. Psychological Review, 98, 529-553.
[10] Metcalfe Eich, J. (1982). A composite holographic associative recall
model. Psychological Review, 89, 627-661.
[11] Metcalfe, J. & Murdock, B. B. (1981). An encoding and retrieval model
of single-trial free recall. Journal of Verbal Learning and Verbal Behavior,
20, 161-189.
[12] Miikkulainen, R. (1993). Subsymbolic Natural Language Processing: An
Integrated Model of Scripts, Lexicon and Memory. Cambridge, MA: MIT Press.
[13] Nelson, K. E. & Bonvillian, J. D. (1978). Early semantic development:
Conceptual growth and related processes between 2 and 4 1/2 years of age.
In K. E. Nelson (Ed.), Children's Language (Vol. 1), 467-556. New York:
Gardner Press.
[14] Nelson, K. E. (1987). Some observations from the perspective of the
rare event cognitive comparison theory of language acquisition. In K. E.
Nelson & A. van Kleeck (Eds.), Children's Language (Vol. 6), 289-331. Hillsdale,
NJ: Lawrence Erlbaum Associates.
[15] Plaut, D. C., McClelland, J. L., Seidenberg, M. S. & Patterson, K.
E. (1994). Understanding normal and impaired word reading: Computational
principles in quasi-regular domains. Submitted to Psychological Review.
[16] Rice, M. L. & Woodsmall, L. (1988). Lessons from television: Children's
word learning when viewing. Child Development, 59, 420-429.
[17] Rice, M. L., Huston, A. C., Truglio, R. & Wright, J. (1990). Words
from "Sesame Street": Learning vocabulary while viewing. Developmental
Psychology, 26, 421-428.
[18] Rumelhart, D. E. & McClelland, J. L. (1986). Parallel Distributed Processing
(Vol. 1). Cambridge, MA: MIT Press.
[19] Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). Learning
internal representations by error propagation. In D. E. Rumelhart & J. L.
McClelland (Eds.), Parallel Distributed Processing (Vol. 1), 318-362. Cambridge,
MA: MIT Press.
[20] St. John, M. F. & McClelland, J. L. (1990). Learning and applying contextual
constraints in sentence comprehension. Artificial Intelligence, 46, 217-257.