Learning to Recognize and Produce Words:
Towards a Connectionist Model
Michael Gasser
Computer Science Department, Indiana University
Abstract
This paper outlines a first cut at a connectionist approach to the
acquisition of morphological rules. The proposed model makes use of a
hierarchy of simple sequential networks, networks which take inputs or
produce outputs one at a time and which maintain a short-term memory
via recurrent connections. Three experiments are described which
demonstrate features of the model. In one, a network is trained to
learn a set of complex rules in which the stem of a verb appears
nowhere as a contiguous sequence of phones. In another, a network
trained to recognize syllables in an artificial language develops
distributed syllable representations which then provide the input
for another network that learns reduplication rules. Finally, it is
shown that the distributed syllable representations from the second
experiment can be applied to the production as well as the recognition
of syllables.
Morphophonological Acquisition and Performance
One of the tasks that faces a language learner is that of
discovering how words in the target language are built up from
morphemes. The processes involved are tied up intimately with the
phonology of the target language because the boundaries between
morphemes are the locus of many phonological rules and because the
prosodic structure of the word often depends on its morphological
makeup.
From the learner's perspective, the problem is best viewed as
one of mapping surface phonological representations, that is,
sequences of phones, onto lexical entries and ``entries'' for
grammatical morphemes. (footnote 1) The task has a perception and
a production side. For perception, the input is a sequence of phones,
the output a set of lexical and grammatical entries. For example,
for English nouns,
# b u g z # -> bug + plural
where small capitals indicate lexical/grammatical entries with
no phonological content. Thus the task involves in part the learning
of arbitrary associations between form and function. For production,
the input is a set of entries, the output a phone sequence.
For example,
bug + plural -> # b u g z #.
Note that there is no place for underlying representations here
because the learner does not have access to them.
Morphological and phonological rules are productive, and language
learners clearly recognize and produce forms they have never heard
before. Thus a learning model should be able to generalize on the
basis of a set of training items. For example, given pairs like
the following:
# b u g z # bug + plural
# b ʌ g # bug + singular
# s i: d # seed + singular,
the system might be expected to be able to respond to one or the other
of the following:
# s i: d z # -> ??
seed + plural -> ??.
The performance of a model can be evaluated on the basis of its
responses to questions like these and on the size of the training
sample required for generalization.
Connectionism and Language Acquisition
As with any problem in language acquisition, the central question
is one of the extent to which innate mechanisms are required, and the
extent to which these are linguistic, as opposed to general cognitive,
mechanisms. The approach taken here is one which starts from the
hypothesis that morphophonological learning is essentially a
statistical and not a symbolic process and that the innate aspects of
the process are features of the cognitive architecture rather than
explicit constraints on categories or rules.
Connectionist models provide a way of implementing this basic idea.
These models have several features in their favor:
They have a powerful capacity to discover regularity in
environmental patterns.
They can implement graded, as opposed to discrete, categories
and processes.
They are resistant to noise and damage.
They are constrained in ways which are grossly similar to the
constraints on nervous systems.
However, there has been only limited success applying connectionism to
domains in which the basic data are structured in complex ways.
Natural language morphology and phonology constitute such a domain.
This paper describes a connectionist model of word recognition and
production which is currently under development, with several crucial
aspects as yet unresolved. The paper discusses those aspects which
have been clarified.
Sequential Networks
If we take seriously the fact that language takes place in time,
that it is both perceived and produced in time, then our models must
deal with one segment at a time. In addition, because both perception
and production obviously require knowledge of more than just the
current segment, our models need to have a short-term memory (STM)
which is suited to the task. Sequential networks are connectionist
networks which satisfy these conditions. Sequential networks either
receive input events of some type one at a time or yield output events
one at a time. Recurrent connections on hidden and/or output layers
give these networks the capacity to develop a short-term memory.
Figure 1 shows two types of sequential networks, based on those
introduced originally by Jordan (1986) and Elman (1990). In a set
of simulations, I have found the architecture on the left to be
well-suited for recognition tasks and that on the right well-suited
for production tasks.
Both networks are sequential in the sense that they run in time; one
pass through each network corresponds to the input or output of a
single primitive event. In the figures, the solid arrows represent
complete connectivity between layers of processing units. The networks
are trained---that is, the weights on their connections are
adjusted---using the back-propagation learning algorithm (Rumelhart,
Hinton, & Williams, 1986). Back-propagation changes the weights in
such a way that the error between the output generated by the network
and a target provided by the environment (or a ``teacher'') is
minimized.
The fuzzy arrows in the figure represent simple copy connections.
Thus in both of the networks the pattern on the hidden layer is copied
to a context layer following each pass through the network. In the
production network, the output pattern is also added to the pattern on
a state layer. This pattern also decays on each cycle through the
network (indicated in the figure by the recurrent arrow), so the
pattern on it represents a weighted sum of past outputs. The
recurrent connections to the context and state layers give each
network the capacity to develop an STM because the network has
indirect access to previous inputs or outputs.
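To make the two recurrence schemes concrete, the following is a minimal sketch in Python with NumPy; the layer sizes, weight ranges, and decay constant are illustrative assumptions rather than values from the simulations, and the context layer that the production network also has in the model is omitted for brevity. The recognition step copies the hidden pattern to a context layer after each pass; the production step adds each output to a decaying state layer.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RecognitionStep:
    """Elman-style pass: the hidden pattern is copied to a context layer."""
    def __init__(self, n_in, n_hid, n_out, rng):
        self.W_ih = rng.uniform(-0.5, 0.5, (n_hid, n_in))
        self.W_ch = rng.uniform(-0.5, 0.5, (n_hid, n_hid))   # context -> hidden
        self.W_ho = rng.uniform(-0.5, 0.5, (n_out, n_hid))
        self.context = np.zeros(n_hid)

    def step(self, x):
        hidden = sigmoid(self.W_ih @ x + self.W_ch @ self.context)
        output = sigmoid(self.W_ho @ hidden)
        self.context = hidden.copy()          # copy connection after each pass
        return output

class ProductionStep:
    """Jordan-style pass: outputs are summed into a decaying state layer."""
    def __init__(self, n_in, n_hid, n_out, rng, decay=0.5):
        self.W_ih = rng.uniform(-0.5, 0.5, (n_hid, n_in))
        self.W_sh = rng.uniform(-0.5, 0.5, (n_hid, n_out))   # state -> hidden
        self.W_ho = rng.uniform(-0.5, 0.5, (n_out, n_hid))
        self.state = np.zeros(n_out)
        self.decay = decay

    def step(self, x):
        hidden = sigmoid(self.W_ih @ x + self.W_sh @ self.state)
        output = sigmoid(self.W_ho @ hidden)
        self.state = self.decay * self.state + output   # weighted sum of past outputs
        return output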
Input to the recognition network is a sequence of phones and
a final word boundary marker, presented one at a time in the form of
phonetic feature vectors. The network is trained to output the input
phone---this encourages it to pay attention to the input---and the
appropriate lexical and grammatical entries, each represented by a
single output unit. The lexical-grammatical target remains constant
throughout the presentation of a word, even though the network cannot
be expected to recognize a word until a sufficient number of phones
have been presented to distinguish it from all other possible words.
Input to the production network is a binary lexical-grammatical
vector which remains constant throughout the production of the word.
The network is trained to output in sequence the phones representing
the surface form of the target word and a final word boundary marker.
The effect of the state layer is to tell the network where it is
in the word.
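A minimal sketch of how a single training pair might be encoded under these conventions follows; the phonetic feature values in the FEATURES table are invented for illustration and are not the feature set used in the simulations.

import numpy as np

# Hypothetical phonetic feature vectors (e.g., voiced, nasal, high, back).
FEATURES = {
    "#": np.array([0.0, 0.0, 0.0, 0.0]),   # word boundary marker
    "b": np.array([1.0, 0.0, 0.0, 0.0]),
    "u": np.array([1.0, 0.0, 1.0, 1.0]),
    "g": np.array([1.0, 0.0, 0.0, 1.0]),
    "z": np.array([1.0, 0.0, 1.0, 0.0]),
}

LEXICON = ["bug", "seed"]          # one output unit per lexical entry
GRAMMAR = ["singular", "plural"]   # one output unit per grammatical entry

def lex_gram_vector(stem, morph):
    v = np.zeros(len(LEXICON) + len(GRAMMAR))
    v[LEXICON.index(stem)] = 1.0
    v[len(LEXICON) + GRAMMAR.index(morph)] = 1.0
    return v

# Recognition: phones arrive one at a time; the lexical-grammatical target is
# constant throughout the word (in the model the target also includes a copy
# of the current input phone, omitted here).
recog_inputs = [FEATURES[p] for p in "#bugz#"]
recog_target = lex_gram_vector("bug", "plural")

# Production: the input is constant; the targets are the phones in sequence.
prod_input   = lex_gram_vector("bug", "plural")
prod_targets = [FEATURES[p] for p in "#bugz#"]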
In recent years, several researchers have shown that sequential
networks similar to these can be trained to learn simple morphological
rules (Cottrell & Plunkett, 1991; Gasser & Lee, 1991). In particular,
Gasser and Lee (1991) showed that networks of the production type in
Figure 1 were capable of learning suffixing, prefixing, and mutation
rules for limited sets of words in an artificial language.
In the next section, I describe a simulation (previously
unreported) in which a sequential network is trained on a relatively
complex morphological learning problem. For this, and the other
experiments reported in this paper, each simulation was run three times,
with different randomly generated initial weights for each run;
results reported are means over the three runs.
Learning a Complex Morphological Rule
Amharic is an Ethiopian Semitic language which, like other Semitic
languages, has a rich system of verb morphology. Verbs are
characterized by a root, usually consisting of a set of three
consonants, and a conjugation template. The template specifies the
vowels in the surface form of the verb and sometimes also an additional
consonant or gemination on one of the root consonants. The two aspects
of a verb form are generally viewed as belonging to separate
phonological tiers (Goldsmith, 1990). We will consider only three
forms of an Amharic verb, all in the third person singular masculine:
the jussive, roughly `let him V'; the converb (also called the
gerundive), roughly `(him) having V-ed'; and the past. For the verb
for `steal' the three forms are as follows: y\e sr\a \k, s\a r\k o,
and s\a rr\a\k\a. (footnote 2) Note how challenging this problem is
for the language learner: there is no consistent sequence of two or
more surface segments which characterizes the verb roots.
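To make the nonconcatenative pattern concrete, here is a hypothetical sketch of how the three surface forms interleave a consonantal root with a conjugation template. The template notation and numbered consonant slots are my own restatement, and the vowel symbols are schematic; in the simulations the networks received only surface phone sequences, never an explicit root or template.

# Hypothetical templates: digits mark root-consonant slots; gemination is
# written as a repeated slot.
TEMPLATES = {
    "jussive": "y e 1 2 a 3",
    "converb": "1 a 2 3 o",
    "past":    "1 a 2 2 a 3 a",
}

def surface_form(root, form):
    """Interleave a triconsonantal root with a conjugation template."""
    out = []
    for slot in TEMPLATES[form].split():
        out.append(root[int(slot) - 1] if slot in "123" else slot)
    return " ".join(out)

root = ("s", "r", "k'")   # 'steal'
for form in TEMPLATES:
    print(form, "->", surface_form(root, form))
# jussive -> y e s r a k'
# converb -> s a r k' o
# past    -> s a r r a k' a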
Separate networks were trained on the recognition and production
tasks for this problem. In each case 30 different verbs were used.
For each of the three verb forms, a set of six verbs was set aside for
testing. During training, the networks saw all of the remaining forms,
that is, all three forms for some verbs and only two forms for others.
Each network was trained and then tested repeatedly on the patterns
set aside until performance got no better on the test set. This required
101 epochs (that is, repetitions of all the training patterns) for
production, 169 for recognition. For production, the task was to
produce the phones in sequence for a form that the network had never
seen, for example,
steal + past -> s \a r r \a \k \a
For recognition, the task was to assign a novel form to the appropriate
stem and grammatical morpheme, for example,
s \a r r \a \k \a -> steal + past
Results were as follows. For the recognition network, 59% of the stems
(chance: 3%) and 96% of the grammatical morphemes (chance: 33%)
were recognized correctly. For the production network, 87% of the
phones were generated correctly (chance: 5%). For recognition, most
errors (59%) involved confusion of one verb with another which shared
two of its three consonants, though not necessarily in the same order.
All production errors involved mistaken stem consonants. For example,
bury (stem: \k b r) + converb -> \k \a b d o (for \k \a b r o).
Thus, the networks show clear evidence of having learned the rules.
We must surmise that the hidden layer in each network discovers how to
make the double association of lexical entry to consonant triple and
of grammatical morpheme to phonological template.(footnote 3)
However, if sequential networks were also capable of learning
morphological rules of a type which do not occur in human language,
this would count as evidence against them. Gasser and Lee (1991)
showed that sequential networks failed to learn a rule which reversed
the segments in a three-segment morpheme and had considerable
difficulty with morphological deletion rules. Such rules are either
very rare or non-occurring in natural language.
Handling Hierarchical Structure
However, there are processes occurring in natural language
which the simple sequential networks described above have considerable
trouble acquiring, if they can acquire them at all. These are processes
that make reference to hierarchical phonological structure, which plays
an increasingly important role in phonology (Goldsmith, 1990; Hogg &
McCully, 1987). Reduplication is an example of such a process.
Reduplication involves the repetition of whole sequences of phones,
often altered in systematic ways (Moravcsik, 1978). It is a
morphological process in many languages, e.g., Ilokano puspusa `cats'
from pusa `cat' (Hayes & Abad, 1989); an aspect of lexical
representations in even more languages, e.g., English willy-nilly; and
a simplification process in language acquisition, e.g., kiki for
kitchen. The frequency of reduplication and the ease with which
speakers and hearers seem to handle it (footnote 4) lead one to
believe that the capacity for reduplication is somehow built into
the language processing architecture.
What is it that makes reduplication hard for sequential networks?
Consider what is involved in the recognition of syllable reduplication,
for example. The perception system must have the capacity to compare
entire subsequences within the input. This would seem to necessitate
a static summary representation of the first subsequence which remains
constant while the second subsequence comes into the system. In order
not to interfere with segment-level processing, this comparison process
apparently requires a separate syllable-level module.
Another process which seems to require a hierarchical architecture
is the perception and production of stress. In many languages, the
process of determining which syllables in a word are to be stressed
proceeds from right to left. For example, in MalakMalak, an Australian
language, stress is assigned to alternating syllables beginning from
the penultimate syllable and moving to the left through the word
(Birk, 1976). In such cases, the production mechanism must clearly
have access early on to a global representation of something like the
prosodic structure of the word to be produced. There is also evidence,
especially from the tip-of-the-tongue phenomenon (Rubin, 1975), that
human speakers begin with just such a global prosodic representation.
Thus again a simple one-phone-at-a-time approach will not do.
What I propose---and many of the details remain to be worked
out---is an architecture which combines the strengths of sequential
networks described above with a built-in capacity to discover and
make use of hierarchical structure in language. Figure 2 shows the
basic architecture of the model. Each of the three modules shown
consists of a sequential network (with an STM capacity denoted by the
recurrent connections on the hidden layer) and is responsible for one
hierarchical level, say, phones, syllables, or metrical feet. (No
claims are made regarding the number of levels required.) Each of these
sequential networks runs with a different clock: one takes segments as
its primitive ``beats'', another syllables.
(The jagged arrows in the lower right-hand corner of the figure
indicate the directions in which words are processed through the
network.)
For recognition, the input to each module is a pattern constituting
a distributed representation of a sequence of events at a lower level
(except at the lowest level, where the input may be a continuous
stream).
The module then turns an input sequence into a summary distributed
representation which is then handed to the next higher level, where it
is treated as a single input event. Production begins at the highest
level with a distributed representation of the prosodic structure of
the complete word. This is then transformed into sequential patterns
at successively lower levels. The unpacking of a distributed
representation of a sequence into its component events corresponds to
the ``spelling out'' operation which is a part of some psycholinguistic
models of word production (Shattuck-Hufnagel, 1983).
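A schematic sketch, in the recognition direction, of how such modules might be chained follows. It is purely illustrative: how units are delimited, how summaries are passed upward, and how many levels there are is exactly what remains to be worked out, so syllable boundaries are simply supplied by hand here.

import numpy as np

class SequentialModule:
    """One level of the hierarchy: a sequential net whose hidden/context
    pattern summarizes the sequence of input events seen so far."""
    def __init__(self, n_in, n_hid, rng):
        self.W_ih = rng.uniform(-0.5, 0.5, (n_hid, n_in))
        self.W_ch = rng.uniform(-0.5, 0.5, (n_hid, n_hid))
        self.reset()

    def reset(self):
        self.context = np.zeros(self.W_ch.shape[0])

    def step(self, event):
        self.context = np.tanh(self.W_ih @ event + self.W_ch @ self.context)
        return self.context                  # summary of the sequence so far

rng = np.random.default_rng(0)
phone_level    = SequentialModule(n_in=4, n_hid=8,  rng=rng)   # beats = phones
syllable_level = SequentialModule(n_in=8, n_hid=12, rng=rng)   # beats = syllables

def recognize(word, syllable_ends):
    """word: list of phone feature vectors; syllable_ends: indices of phones
    that close a syllable (deciding this is the open segmentation problem
    discussed below)."""
    phone_level.reset()
    syllable_level.reset()
    word_summary = None
    for i, phone in enumerate(word):
        syllable_summary = phone_level.step(phone)
        if i in syllable_ends:               # syllable complete: pass it up
            word_summary = syllable_level.step(syllable_summary)
            phone_level.reset()
    return word_summary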
This scheme requires a capacity for learning distributed
representations of subsequences, e.g., syllables, which can be used
in higher-level processing. For example, for syllable-based
reduplication the distributed syllable representations must encode
their contents in a fashion that allows them to be used in learning
reduplication rules. As noted above, such rules often involve
systematic modification of the reduplicated sequence. For instance,
Madurese has a process by which the final syllable of a stem is copied
onto the front of the stem with its vowel changed to a:
indhit -> dhatindhit (Stevens, 1968).
Within the context of the model being considered, this requires
syllable representations which encode the syllable nucleus (vowel) in
a systematic fashion. I next describe a simulation designed to
establish that this is at least possible.
A sequential network of the recognition type in Figure 1 was
trained to recognize all of the possible syllables in an artificial
language. That is, the lexical/grammatical output layer in the
figure was replaced by a layer of syllable units, one for each
syllable. This task has the effect of forcing the network to
distinguish the individual syllables and thereby to pay attention
to all of the phones in each one. The intention was to have the
pattern appearing on the hidden layer in the network following the
presentation of a syllable constitute a summary distributed
representation for the syllable.
There were 54 syllables in the language, describable as follows:
onset -> {p,m,t,n,k,0}
rime -> {a,i,u,aa,ii,uu,an,in,un}.
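For concreteness, the 54 syllables are just the cross-product of the six onsets (including the empty onset, written 0 above) with the nine rimes:

ONSETS = ["p", "m", "t", "n", "k", ""]          # "" = empty onset (0 above)
RIMES  = ["a", "i", "u", "aa", "ii", "uu", "an", "in", "un"]

SYLLABLES = [onset + rime for onset in ONSETS for rime in RIMES]
assert len(SYLLABLES) == 54                      # 6 onsets x 9 rimes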
Training the network to recognize the syllables required 1400
epochs. At this point the network had learned distinct distributed
representations for each syllable; these were just the hidden
layer patterns left following the presentation of the syllables.
These representations were then used as inputs to a second network, one
designed to learn a set of reduplication rules. To simplify matters,
a simple feedforward network, rather than a sequential network, was
used for this task. Input to the network consisted of a distributed
representation for a syllable (one of those learned by the syllable
recognition network) and a pattern representing one of three
reduplication rules. The network was trained to output the distributed
representation for the syllable resulting from applying the input
reduplication rule to the input syllable. The three rules were as
follows:
1. Make the syllable coda n, e.g., ma -> man, tii -> tin, kun -> kun.
2. Remove any coda and make the syllable nucleus short, e.g.,
man -> ma, tii -> ti, ku -> ku.
3. Make the syllable onset k, e.g., ma -> ka, un -> kun, kaa -> kaa.
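As a symbolic reference point for what the feedforward network is being asked to learn, the following sketch restates the three rules as string operations. The network itself never sees strings; it maps a learned distributed syllable representation plus a rule pattern onto the representation of the transformed syllable. The split into onset, nucleus, and coda is my own restatement for illustration.

VOWELS = ("a", "i", "u")

def split(syllable):
    """Split a syllable string into onset, nucleus, and coda."""
    onset = "" if syllable[0] in VOWELS else syllable[0]
    rest = syllable[len(onset):]
    coda = "n" if rest.endswith("n") else ""
    nucleus = rest[:len(rest) - len(coda)]
    return onset, nucleus, coda

def apply_rule(syllable, rule):
    onset, nucleus, coda = split(syllable)
    if rule == 1:                     # make the coda n (nucleus becomes short)
        return onset + nucleus[0] + "n"
    if rule == 2:                     # remove any coda, shorten the nucleus
        return onset + nucleus[0]
    if rule == 3:                     # make the onset k
        return "k" + nucleus + coda
    raise ValueError(rule)

assert apply_rule("ma", 1) == "man" and apply_rule("tii", 1) == "tin"
assert apply_rule("man", 2) == "ma" and apply_rule("ku", 2) == "ku"
assert apply_rule("un", 3) == "kun" and apply_rule("kaa", 3) == "kaa"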
For each rule, the network was trained on all but 11 of the 54
syllables. Training continued until all of the training items were
handled correctly; this required 1183 epochs. During the test phase,
71% of the syllables were produced correctly (chance: 2%); Rule 1
produced the most errors and Rule 3 the fewest (only 3%). The
k onset rule may have been easiest because it mapped input
syllables onto a relatively small set of possible output syllables
(9 vs. 18 for the n coda rule).
While this performance is far from perfect, there is strong evidence
that the rules have been learned. Most importantly, they have been
learned without the benefit of pre-specified representations which
provide explicit syllable structure. Rather the rules were applied
directly to distributed representations learned in the performance of
another task, syllable recognition. These representations seem to
encode syllable structure in a form which is suitable for
structure-sensitive operations such as reduplication. (footnote 5)
Recognition and Production
An important processing and acquisition issue, often overlooked, is
the relation between perception and production. While there are many
possible positions on the degree of sharing between the two processes,
some sharing must characterize the human processing system. For an
acquisition model, the problem is one of demonstrating that it is
possible to produce forms which have been encountered in recognition
but for which there has been no production training. That is, there
should be some generalization from perception to production. Obviously
there can be no generalization for arbitrary associations like those
connecting lexical items to their surface phonological realizations,
but we might expect generalization in two other places, morphological
rules and prosodic structure. In what follows, I describe an
experiment to test the latter.
A network similar to the production network in Figure 1 was
trained to produce syllables in the artificial language used in the
previous experiment. Inputs were the distributed syllable
representations developed during the recognition training phase of
that experiment. The hope was that these inputs would encode aspects
of the syllables which are relevant to their production as well as
their recognition. The production network was trained on all but 10
of the 54 syllables for 109 epochs and then tested on the remaining 10
syllables. 89% of the phones (including final syllable-boundary
markers) were correctly generated (chance: 11%). In only one case
(out of 108 test items on the three runs) did an error result in a
syllable that was not one of those in the language; in this case what
should have been {nun} appeared as {nud}.
Thus there is evidence of generalization from recognition to
production. Static summary representations of input sequences can be
used to generate the same sequences.
Discussion and Caveats
What I have shown in the three experiments described in this
paper is (1) that sequential networks can learn difficult
morphological processes and (2) that distributed syllable
representations developed during recognition embody the structure
necessary both for operations, such as reduplication, that apply
directly to those representations and for the reproduction of the
original sequence of phones.
However, there are many gaps in the implementation of the proposed
hierarchy of sequential networks. The network in the second experiment
was told precisely what the relevant syllables of the language were,
that is, how many were to be learned and what their constituent segments
were. Yet one of the most difficult tasks facing the learner is
precisely this: how is the input stream to be segmented into units, and
how many units should there be? Furthermore, during the production
training in the last experiment, target phones were available at every
step, a luxury that no human language learner has during production.
Finally, there is the question of what controls the overall network
shown in Figure 2. As units are completed in the perception direction,
for example, they need to be passed on to the next level. But what
decides when a unit is complete? This is related to the problem of
segmentation which plays a role in the learning of the units in the
first place.
With respect to phonology, there is as yet very little that can
be said about the adequacy of the model, particularly for processes
which seem to involve complex rule interactions. Some degree of rule
sequencing is possible in the model because of the hierarchical
structure, but it is not clear whether this will prove sufficient.
Incidentally, the levels proposed here are not those proposed in other
recent work within connectionist phonology (Goldsmith, forthcoming;
Lakoff, 1988; Touretzky & Wheeler, 1990). These models posit three
levels differing in abstraction, whereas the present proposal
distinguishes levels on the basis of the size of the units that are
taken as primitive events.
Conclusions
Connectionist networks, especially those that gain access to the
past via recurrent connections, are capable of detecting and making
use of many of the regularities which characterize the morphology of
natural language. This paper adds further evidence to this claim with
a demonstration that even rules which apparently involve the
interleaving of representations from separate tiers can be induced by
such a simple system. Yet some of what goes on in word recognition and
production seems to involve the direct manipulation of higher-level
representations. This may be possible in sequential networks which
process one phone at a time, but it is difficult to imagine how it
might be done as efficiently as it seems to be done by people.
What I have described is a way in which the addition of hierarchical
structure to the model might enable the efficient processing of
prosodic phenomena.
In explaining linguistic behavior, all else being equal, those
mechanisms are to be preferred which are more directly implementable
in brains. But we should not expect to be able to squeeze all of
cognition out of the simplest connectionist networks. The goal should
be to start with the basic constraints of connectionist models (which
are roughly similar to those in brains) and add the structure that
seems to be required to process and learn complex phenomena like
reduplication, stress patterns, and phonological rule interaction.
This approach, while still connectionist, shares with traditional
models the recognition that humans learn language not simply because
they are exposed to it, but because they are equipped to learn it.
What is different about connectionist models in this regard is that
the innate equipment must be architectural, rather than symbolic.
This paper has proposed a set of architectural constraints that seem
to be called for in the acquisition of morphophonological rules.
Footnotes
1. For the purposes of this paper, it will be convenient to ignore
structure in the morphological representation of words.
2. Tense/aspect, which together with the root determines the verb
stem, and person/number/gender, which determines the inflections, are
conflated in this experiment.
3. How the hidden layer represents these two tiers remains to be investigated.
4. I am unaware of experiments that demonstrate this facility, however.
5. For another demonstration of the use of distributed representations
for structure-sensitive operations, see Chalmers (1990).
Bibliography
Birk, D. B. W. (1976). The MalakMalak Language, Daly River (Western
Arnhem Land). Pacific Linguistics Series B, no. 45. Australian
National University, Canberra.
Chalmers, D. (1990). Syntactic transformations on distributed
representations. Connection Science, 2:53--62.
Cottrell, G. W. and Plunkett, K. (1991). Learning the past tense in
a recurrent network: Acquiring the mapping from meaning to
sounds. Annual Conference of the Cognitive Science Society,
13:328--333.
Elman, J. (1990). Finding structure in time. Cognitive Science,
14:179--211.
Gasser, M. and Lee, C.-D. (1991). A short term memory architecture
for the learning of morphophonemic rules. In Lippmann, R. P.,
Moody, J. E., and Touretzky, D. S., editors, Advances in Neural
Information Processing Systems 3, pages 605--611. Morgan
Kaufmann, San Mateo, CA.
Goldsmith, J. (1990). Autosegmental and Metrical Phonology. Basil
Blackwell, Cambridge, MA.
Goldsmith, J. (forthcoming). Phonology as an intelligent system.
In Napoli, D. J. and Kegl, J., editors, Bridges between
psychology and linguistics: a Swarthmore festschrift for Lila
Gleitman. Lawrence Erlbaum, Hillsdale, NJ.
Hayes, B. and Abad, M. (1989). Reduplication and syllabification in
Ilokano. Lingua, 77:331--374.
Hogg, R. and McCully, C. B. (1987). Metrical Phonology: A Coursebook.
Cambridge University Press, Cambridge.
Jordan, M. (1986). Attractor dynamics and parallelism in a
connectionist sequential machine. In Proceedings of the
Eighth Annual Conference of the Cognitive Science Society,
pages 531--546, Hillsdale, New Jersey. Lawrence Erlbaum
Associates.
Lakoff, G. (1988). A suggestion for a linguistics with connectionist
foundations. In Touretzky, D., editor, Proceedings of the 1988
Connectionist Models Summer School, pages 301--314. Morgan
Kaufmann, San Mateo, California.
Moravcsik, E. A. (1978). Reduplicative constructions. In Greenberg,
J. H., Ferguson, C. A., and Moravcsik, E. A., editors,
Universals of Human Language. Volume 3: Word Structure, pages
297--334. Stanford University Press, Stanford.
Rubin, D. C. (1975). Within-word structure in the TOT phenomenon.
Journal of Verbal Learning and Verbal Behavior, 14:392--397.
Rumelhart, D. E., Hinton, G., and Williams, R. (1986). Learning
internal representations by error propagation. In Rumelhart,
D. E. and McClelland, J. L., editors, Parallel Distributed
Processing, volume 1, pages 318--364. MIT Press, Cambridge, MA.
Shattuck-Hufnagel, S. (1983). Sublexical units and suprasegmental
structure in speech production planning. In MacNeilage, P. F.,
editor, The Production of Speech. Springer, New York.
Stevens, A. (1968). Madurese Phonology and Morphology. American
Oriental Society, New Haven, CT.
Touretzky, D. and Wheeler, D. (1990). A computational basis for
phonology. In Touretzky, D., editor, Advances in Neural
Information Processing Systems 2. Morgan Kaufmann, San Mateo, CA.