Learning to Recognize and Produce Words: Towards a Connectionist Model

Michael Gasser

Computer Science Department, Indiana University

Abstract

This paper outlines a first cut at a connectionist approach to the
acquisition of morphological rules.  The proposed model makes use of a
hierarchy of simple sequential networks, networks which take inputs or
produce outputs one at a time and which maintain a short-term memory
via recurrent connections.  Three experiments are described which 
demonstrate features of the model.  In one, a network is trained to 
learn a set of complex rules in which the stem of a verb appears 
nowhere as a contiguous sequence of phones.  In another, a network 
trained to recognize syllables in an artificial language develops 
distributed syllable representations which then provide the input
for another network that learns reduplication rules.  Finally, it is 
shown that the distributed syllable representations from the second 
experiment can be applied to the production as well as the recognition 
of syllables.


Morphophonological Acquisition and Performance

    One of the tasks that faces a language learner is that of 
discovering how words in the target language are built up from 
morphemes.  The processes involved are tied up intimately with the 
phonology of the target language because the boundaries between
morphemes are the locus of many phonological rules and because the 
prosodic structure of the word often depends on its morphological 
makeup.
     From the learner's perspective, the problem is best viewed as 
one of mapping surface phonological representations, that is, 
sequences of phones, onto lexical entries and ``entries'' for 
grammatical morphemes. (footnote 1) The task has a perception and 
a production side.  For perception, the input is a sequence of phones, 
the output a set of lexical and grammatical entries.  For example, 
for English nouns,


# b u g z # -> bug + plural 

where small capitals indicate lexical/grammatical entries with
no phonological content.  Thus the task involves in part the learning 
of arbitrary associations between form and function.  For production, 
the input is a set of entries, the output a phone sequence.
For example, 

bug + plural -> # b u g z #.

Note that there is no place for underlying representations here
because the learner does not have access to them.

      Morphological and phonological rules are productive, and language
learners clearly recognize and produce forms they have never heard
before.  Thus a learning model should be able to generalize on the 
basis of a set of training items.  For example, given pairs like 
the following:

# b u g z #  bug + plural
# b \uh g #  bug + singular
# s i: d #  seed + singular,

the system might be expected to be able to respond to one or the other
of the following:

# s i: d z # -> ??
seed + plural -> ??.

The performance of a model can be evaluated on the basis of its
responses to questions like these and on the size of the training 
sample required for generalization.


Connectionism and Language Acquisition

     As with any problem in language acquisition, the central question 
is one of the extent to which innate mechanisms are required, and the
extent to which these are linguistic, as opposed to general cognitive,
mechanisms.  The approach taken here is one which starts from the 
hypothesis that morphophonological learning is essentially a 
statistical and not a symbolic process and that the innate aspects of 
the process are features of the cognitive architecture rather than 
explicit constraints on categories or rules.

     Connectionist models provide a way of implementing this basic idea.
These models have several features in their favor:

     They have a powerful capacity to discover regularity in
        environmental patterns.
     They can implement graded, as opposed to discrete, categories
        and processes.
     They are resistant to noise and damage.
     They are constrained in ways which are grossly similar to the
        constraints on nervous systems.

However, there has been only limited success applying connectionism to
domains in which the basic data are structured in complex ways.
Natural language morphology and phonology constitute such a domain.

      This paper describes a connectionist model of word recognition and
production which is currently under development, with several crucial
aspects as yet unresolved.  The paper discusses those aspects which 
have been clarified.


Sequential Networks

    If we take seriously the fact that language takes place in time, 
that it is both perceived and produced in time, then our models must 
deal with one segment at a time.  In addition, because both perception 
and production obviously require knowledge of more than just the 
current segment, our models need to have a short-term memory (STM) 
which is suited to the task.  Sequential networks are connectionist 
networks which satisfy these conditions.  Sequential networks either 
receive input events of some type one at a time or yield output events 
one at a time.  Recurrent connections on hidden and/or output layers 
give these networks the capacity to develop a short-term memory.

Figure 1 shows two types of sequential networks, based on those 
introduced originally by Jordan (1986) and Elman (1990).  In a set 
of simulations, I have found the architecture on the left to be 
well-suited for recognition tasks and that on the right well-suited
for production tasks.

    Both networks are sequential in the sense that they run in time; one
pass through each network corresponds to the input or output of a
single primitive event.  In the figures, the solid arrows represent 
complete connectivity between layers of processing units.  The networks 
are trained---that is, the weights on their connections are 
adjusted---using the back-propagation learning algorithm (Rumelhart, 
Hinton, & Williams, 1986).  Back-propagation changes the weights in 
such a way that the error between the output generated by the network 
and a target provided by the environment (or a ``teacher'') is 
minimized.

     The fuzzy arrows in the figure represent simple copy connections.
Thus in both of the networks the pattern on the hidden layer is copied 
to a context layer following each pass through the network.  In the 
production network, the output pattern is also added to the pattern on 
a state layer.  This pattern also decays on each cycle through the 
network (indicated in the figure by the recurrent arrow), so the 
pattern on it represents a weighted sum of past outputs.  The 
recurrent connections to the context and state layers give each 
network the capacity to develop an STM because the network has 
indirect access to previous inputs or outputs.
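
     To make these mechanics concrete, the following is a minimal sketch 
(in Python, not the implementation used for the simulations reported 
here) of the two kinds of recurrent bookkeeping: the copying of the 
hidden pattern to a context layer and the decaying sum of past outputs 
on a state layer.  The layer sizes, weight ranges, and decay value are 
illustrative assumptions, and training by back-propagation is omitted.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class ElmanStep:
        # Recognition-style network: after each pass, the hidden pattern
        # is copied to a context layer that feeds back into the hidden layer.
        def __init__(self, n_in, n_hid, n_out, rng):
            self.W_in = rng.uniform(-0.5, 0.5, (n_hid, n_in))
            self.W_ctx = rng.uniform(-0.5, 0.5, (n_hid, n_hid))
            self.W_out = rng.uniform(-0.5, 0.5, (n_out, n_hid))
            self.context = np.zeros(n_hid)

        def step(self, x):
            hidden = sigmoid(self.W_in @ x + self.W_ctx @ self.context)
            self.context = hidden.copy()               # copy connection
            return sigmoid(self.W_out @ hidden)

    class JordanState:
        # State layer of a production-style network: each output is added
        # to a decaying running sum of past outputs, which tells the
        # network roughly where it is in the word.
        def __init__(self, n_out, decay=0.5):          # decay value assumed
            self.decay = decay
            self.state = np.zeros(n_out)

        def update(self, output):
            self.state = self.decay * self.state + output
            return self.state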

     Input to the recognition network is a sequence of phones and 
a final word boundary marker, presented one at a time in the form of 
phonetic feature vectors.  The network is trained to output the input 
phone---this encourages it to pay attention to the input---and the 
appropriate lexical and grammatical entries, each represented by a 
single output unit.  The lexical-grammatical target remains constant 
throughout the presentation of a word, even though the network cannot 
be expected to recognize a word until a sufficient number of phones 
have been presented to distinguish it from all other possible words.

     Input to the production network is a binary lexical-grammatical
vector which remains constant throughout the production of the word.
The network is trained to output in sequence the phones representing
the surface form of the target word and a final word boundary marker.
The effect of the state layer is to tell the network where it is
in the word.
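
     The two tasks can be illustrated with the English plural example 
given earlier.  The sketch below builds training pairs for each network, 
with one-hot phone vectors standing in for the phonetic feature vectors 
and one output unit per lexical/grammatical entry; the particular 
inventories are, of course, only illustrative.

    PHONES = ['#', 'b', 'u', 'g', 'z', 's', 'i:', 'd']
    ENTRIES = ['bug', 'seed', 'singular', 'plural']

    def one_hot(symbol, inventory):
        vec = [0.0] * len(inventory)
        vec[inventory.index(symbol)] = 1.0
        return vec

    def recognition_pairs(phones, entries):
        # Recognition: at each step the target is the current phone (the
        # network is trained to reproduce its input) plus the full
        # lexical/grammatical vector, held constant over the whole word.
        entry_vec = [1.0 if e in entries else 0.0 for e in ENTRIES]
        return [(one_hot(p, PHONES), one_hot(p, PHONES) + entry_vec)
                for p in phones]

    def production_pairs(phones, entries):
        # Production: the lexical/grammatical vector is the constant input;
        # the targets are the surface phones (and final boundary) in sequence.
        entry_vec = [1.0 if e in entries else 0.0 for e in ENTRIES]
        return [(entry_vec, one_hot(p, PHONES)) for p in phones]

    # e.g. recognition_pairs(['#', 'b', 'u', 'g', 'z', '#'], {'bug', 'plural'})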

     In recent years, several researchers have shown that sequential 
networks similar to these can be trained to learn simple morphological 
rules (Cottrell & Plunkett, 1991; Gasser & Lee, 1991).  In particular,
Gasser and Lee (1991) showed that networks of the production type in 
Figure 1 were capable of learning suffixing, prefixing, and mutation
rules for limited sets of words in an artificial language.

     In the next section, I describe a simulation (previously 
unreported) in which a sequential network is trained on a relatively 
complex morphological learning problem.  For this, and the other
experiments reported in this paper, each simulation was run three times, 
with different randomly generated initial weights for each run; 
results reported are means over the three runs.

Learning a Complex Morphological Rule

    Amharic is an Ethiopian Semitic language which, like other Semitic
languages, has a rich system of verb morphology.  Verbs are 
characterized by a root, usually consisting of a set of three
consonants, and a conjugation template.  The template specifies the
vowels in the surface form of the verb and sometimes also an additional
consonant or gemination on one of the root consonants.  The two aspects
of a verb form are generally viewed as belonging to separate 
phonological tiers (Goldsmith, 1990).  We will consider only three 
forms of an Amharic verb, all in the third person singular masculine: 
the jussive, roughly `let him V'; the converb (also called the 
gerundive), roughly `(him) having V-ed'; and the past.  For the verb 
for `steal' the three forms are as follows: y\e sr\a \k, s\a r\k o, 
and s\a rr\a\k\a. (footnote 2) Note how challenging this problem is
for the language learner: there is no consistent sequence of two or 
more surface segments which characterizes the verb roots.
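
     Schematically, a training item interleaves the root consonants with 
template material, roughly as in the sketch below, where 1, 2, and 3 
stand for the three root consonants (2 2 indicating gemination), e and 
a stand for the vowels written \e and \a above, and k' for the consonant 
written \k.  The templates are read off the three example forms only and 
are merely illustrative.

    TEMPLATES = {
        'jussive': ['y', 'e', 1, 2, 'a', 3],
        'converb': [1, 'a', 2, 3, 'o'],
        'past':    [1, 'a', 2, 2, 'a', 3, 'a'],
    }

    def conjugate(root, form):
        # Fill the numbered template slots with the root consonants.
        c1, c2, c3 = root
        slots = {1: c1, 2: c2, 3: c3}
        return [slots.get(x, x) for x in TEMPLATES[form]]

    # conjugate(("s", "r", "k'"), 'past') -> ['s', 'a', 'r', 'r', 'a', "k'", 'a']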

    Separate networks were trained on the recognition and production
tasks for this problem.  In each case 30 different verbs were used.
For each of the three verb forms, a set of six verbs was set aside for
testing.  During training, the networks saw all of the other forms, 
that is, all three forms for some verbs and two forms for others.

    Each network was trained and then tested repeatedly on the patterns
set aside until performance got no better on the test set.  This required
101 epochs (that is, repetitions of all the training patterns) for
production, 169 for recognition.  For production, the task was to 
produce the phones in sequence for a form that the network had never 
seen, for example, 

     steal + past ->  s \a r r \a \k \a

For recognition, the task was to assign a novel form to the appropriate 
stem and grammatical morpheme, for example,

     s \a r r \a \k \a -> steal + past

Results were as follows.  For the recognition network, 59% of the stems
(chance: 3%) and 96% of the grammatical morphemes (chance: 33%) 
were recognized correctly.  For the production network, 87% of the 
phones were generated correctly (chance: 5%).  For recognition, most 
errors (59%) involved confusion of one verb with another which shared 
two of its three consonants, though not necessarily in the same order.
All production errors involved mistaken stem consonants.  For example,

   bury (stem: \k b r) + converb ->  \k \a b d o (for \k \a b r o).

Thus, the networks show clear evidence of having learned the rules.
We must surmise that the hidden layer in each network discovers how to
make the double association of lexical entry to consonant triple and
of grammatical morpheme to phonological template.(footnote 3)

   However, if sequential networks were also capable of learning
morphological rules of a type which do not occur in human language,
this would count as evidence against them.  Gasser and Lee (1991) 
showed that sequential networks failed to learn a rule which reversed 
the segments in a three-segment morpheme and had considerable 
difficulty with morphological deletion rules.  Such rules are either
very rare or non-occurring in natural language.

Handling Hierarchical Structure

    However, there are processes occurring in natural language
which the simple sequential networks described above have considerable
trouble acquiring, if they can acquire them at all.  These are processes
that make reference to hierarchical phonological structure, which plays
an increasingly important role in phonology (Goldsmith, 1990; Hogg & 
McCully, 1987).  Reduplication is an example of such a process.
Reduplication involves the repetition of whole sequences of phones,
often altered in systematic ways (Moravcsik, 1978).  It is a 
morphological process in many languages, e.g., Ilokano puspusa `cats' 
from pusa `cat' (Hayes & Abad, 1989); an aspect of lexical 
representations in even more languages, e.g., English willy-nilly; and 
a simplification process in language acquisition, e.g., kiki for 
kitchen.  The frequency of reduplication and the ease with which 
speakers and hearers seem to handle it (footnote 4) lead one to 
believe that the capacity for reduplication is somehow built into 
the language processing architecture.

     What is it that makes reduplication hard for sequential networks? 
Consider what is involved in the recognition of syllable reduplication, 
for example.  The perception system must have the capacity to compare 
entire subsequences within the input.  This would seem to necessitate 
a static summary representation of the first subsequence which remains 
constant while the second subsequence comes into the system.  In order 
not to interfere with segment-level processing, this comparison process
apparently requires a separate syllable-level module.

    Another process which seems to require a hierarchical architecture
is the perception and production of stress.  In many languages, the 
process of determining which syllables in a word are to be stressed 
proceeds from right to left.  For example, in MalakMalak, an Australian 
language, stress is assigned to alternating syllables beginning from 
the penultimate syllable and moving to the left through the word 
(Birk, 1976).  In such cases, the production mechanism must clearly 
have access early on to a global representation of something like the
prosodic structure of the word to be produced.  There is also evidence, 
especially from the tip-of-the-tongue phenomenon (Rubin, 1975), that 
human speakers begin with just such a global prosodic representation.
Thus again a simple one-phone-at-a-time approach will not do.
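
     The MalakMalak pattern just described amounts to a simple 
right-to-left procedure; the sketch below merely restates that 
description and makes no claim about the mechanism.

    def malakmalak_stress(n_syllables):
        # Alternating stresses assigned from the penultimate syllable
        # leftward (after the description of Birk, 1976).
        stressed = [False] * n_syllables
        i = n_syllables - 2            # start at the penultimate syllable
        while i >= 0:
            stressed[i] = True
            i -= 2
        return stressed

    # malakmalak_stress(5) -> [False, True, False, True, False]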

    What I propose---and many of the details remain to be worked 
out---is an architecture which combines the strengths of sequential 
networks described above with a built-in capacity to discover and 
make use of hierarchical structure in language.  Figure 2 shows the
basic architecture of the model.  Each of the three modules shown 
consists of a sequential network (with an STM capacity denoted by the
recurrent connections on the hidden layer) and is responsible for one
hierarchical level, say, phones, syllables, or metrical feet.  (No 
claims are made regarding the number of levels required.) Each of these 
sequential networks runs with a different clock: one takes segments as 
its primitive ``beats'', another syllables.

(The jagged arrows in the lower right-hand corner of the figure
indicate the directions in which words are processed through the 
network.)

    For recognition, the input to each module is a pattern constituting 
a distributed representation of a sequence of events at a lower level
(except at the lowest level, where the input may be a continuous 
stream).  The module then turns an input sequence into a summary distributed
representation which is then handed to the next higher level, where it
is treated as a single input event.  Production begins at the highest 
level with a distributed representation of the prosodic structure of 
the complete word.  This is then transformed into sequential patterns 
at successively lower levels.  The unpacking of a distributed 
representation of a sequence into its component events corresponds to 
the ``spelling out'' operation which is a part of some psycholinguistic
models of word production (Shattuck-Hufnagel, 1983).
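
     A minimal sketch of the recognition direction, assuming sequential 
modules of the kind sketched earlier (with a step method and an 
accessible context layer), might look as follows.  How the system 
decides that a unit is complete is left open here, so the boundary test 
and the resetting of the lower-level context below are assumptions made 
purely for illustration.

    def recognize_word(phone_net, syll_net, phone_stream, is_syll_boundary):
        # The phone-level module runs on a phone clock; whenever a syllable
        # is judged complete, its hidden-layer pattern is handed up as a
        # single "beat" of the syllable-level module.
        word_pattern = None
        for phone in phone_stream:
            phone_net.step(phone)
            if is_syll_boundary(phone):
                syllable_summary = phone_net.context.copy()
                word_pattern = syll_net.step(syllable_summary)
                phone_net.context[:] = 0.0     # reset assumed, not claimed
        return word_pattern    # lexical/grammatical pattern after the word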

    This scheme requires a capacity for learning distributed
representations of subsequences, e.g., syllables, which can be used 
in higher-level processing.  For example, for syllable-based 
reduplication the distributed syllable representations must encode 
their contents in a fashion that allows them to be used in learning 
reduplication rules.  As noted above, such rules often involve 
systematic modification of the reduplicated sequence.  For instance,
Madurese has a process by which the final syllable of a stem is copied 
onto the front of the stem with its vowel changed to a: 

indhit -> dhatindhit (Stevens, 1968).
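
     Schematically (with the stem given to us pre-syllabified, something 
the model itself cannot assume):

    VOWELS = 'aiueo'

    def madurese_redup(syllables):
        # Copy the final syllable onto the front of the stem, changing its
        # vowel(s) to a; a restatement of the rule described above.
        last = syllables[-1]
        copy = ''.join('a' if ch in VOWELS else ch for ch in last)
        return copy + ''.join(syllables)

    # madurese_redup(['in', 'dhit']) -> 'dhatindhit'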

     Within the context of the model being considered, this requires
syllable representations which encode the syllable nucleus (vowel) in
a systematic fashion.  I next describe a simulation designed to 
establish that this is at least possible.

     A sequential network of the recognition type in Figure 1 was 
trained to recognize all of the possible syllables in an artificial 
language.  That is, the lexical/grammatical output layer in the 
figure was replaced by a layer of syllable units, one for each 
syllable.  This task has the effect of forcing the network to 
distinguish the individual syllables and thereby to pay attention 
to all of the phones in each one.  The intention was to have the 
pattern appearing on the hidden layer in the network following the
presentation of a syllable constitute a summary distributed 
representation for the syllable.

    There were 54 syllables in the language, describable as follows:

     onset -> {p,m,t,n,k,0}
     rime -> {a,i,u,aa,ii,uu,an,in,un}.
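
     Spelled out, the inventory is the cross-product of the six onsets 
(counting the empty onset, written 0 above) and the nine rimes:

    ONSETS = ['p', 'm', 't', 'n', 'k', '']    # '' is the empty onset
    RIMES = ['a', 'i', 'u', 'aa', 'ii', 'uu', 'an', 'in', 'un']

    SYLLABLES = [onset + rime for onset in ONSETS for rime in RIMES]
    assert len(SYLLABLES) == 54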

   Training the network to recognize the syllables required 1400
epochs.  At this point the network had learned a distinct distributed
representation for each syllable; these were just the hidden-layer
patterns left following the presentation of the syllables.
These representations were then used as inputs to a second network, one
designed to learn a set of reduplication rules.  To simplify matters, 
a simple feedforward network, rather than a sequential network, was 
used for this task.  Input to the network consisted of a distributed 
representation for a syllable (one of those learned by the syllable 
recognition network) and a pattern representing one of three 
reduplication rules.  The network was trained to output the distributed 
representation for the syllable resulting from applying the input 
reduplication rule to the input syllable.  The three rules were as 
follows:

     Make the syllable coda n, e.g., ma -> man, 
         tii -> tin, kun -> kun.
     Remove any coda and make the syllable nucleus short, e.g.,
         man -> ma, tii -> ti, ku -> ku.
     Make the syllable onset k, e.g., ma -> ka, 
         un -> kun, kaa -> kaa.
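
     These three mappings can be made concrete on the orthographic forms 
of the syllables (the network itself, of course, saw only the learned 
distributed representations).  The parse helper below is my own and 
simply exploits the fact that the only possible coda in this language 
is n.

    def parse(syllable):
        # Split a syllable of the artificial language into onset, nucleus, coda.
        onset = syllable[0] if syllable[0] in 'pmtnk' and len(syllable) > 1 else ''
        rest = syllable[len(onset):]
        coda = 'n' if rest.endswith('n') and len(rest) > 1 else ''
        nucleus = rest[:len(rest) - len(coda)]
        return onset, nucleus, coda

    def coda_n(s):                 # Rule 1: nucleus shortened, since a long
        o, n, _ = parse(s)         # vowel plus n is not a legal rime
        return o + n[0] + 'n'

    def no_coda_short(s):          # Rule 2
        o, n, _ = parse(s)
        return o + n[0]

    def onset_k(s):                # Rule 3
        _, n, c = parse(s)
        return 'k' + n + c

    assert coda_n('tii') == 'tin' and no_coda_short('man') == 'ma' and onset_k('un') == 'kun'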

    For each rule, the network was trained on all but 11 of the 54
syllables.  Training continued until all of the training items were 
handled correctly; this required 1183 epochs.  During the test phase,
71% of the syllables were produced correctly (chance: 2%).  Of the 
errors, Rule 1 accounted for the most and Rule 3 for the fewest (only 3%).  The 
k onset rule may have been easiest because it mapped input
syllables onto a relatively small set of possible output syllables 
(9 vs. 18 for the n coda rule).

    While this performance is far from perfect, there is strong evidence
that the rules have been learned.  Most importantly, they have been
learned without the benefit of pre-specified representations which
provide explicit syllable structure.  Rather the rules were applied 
directly to distributed representations learned in the performance of 
another task, syllable recognition.  These representations seem to 
encode syllable structure in a form which is suitable for 
structure-sensitive operations such as reduplication. (footnote 5)

Recognition and Production

    An important processing and acquisition issue, often overlooked, is
the relation between perception and production.  While there are many
possible positions on the degree of sharing between the two processes,
some sharing must characterize the human processing system.  For an 
acquisition model, the problem is one of demonstrating that it is 
possible to produce forms which have been encountered in recognition 
but for which there has been no production training.  That is, there 
should be some generalization from perception to production.  Obviously
there can be no generalization for arbitrary associations like those 
connecting lexical items to their surface phonological realizations, 
but we might expect generalization in two other places, morphological 
rules and prosodic structure.  In what follows, I describe an 
experiment to test the latter.

     A network similar to the production network in Figure 1 was 
trained to produce syllables in the artificial language used in the 
previous experiment.  Inputs were the distributed syllable 
representations developed during the recognition training phase of 
that experiment.  The hope was that these inputs would encode aspects 
of the syllables which are relevant to their production as well as 
their recognition.  The production network was trained on all but 10
of the 54 syllables for 109 epochs and then tested on the remaining 10 
syllables.  89% of the phones (including final syllable-boundary 
markers) were correctly generated (chance: 11%).  In only one case 
(out of 108 test items on the three runs) did an error result in a 
syllable that was not one of those in the language; in this case what
should have been {nun} appeared as {nud}.

     Thus there is evidence of generalization from recognition to 
production.  Static summary representations of input sequences can be 
used to generate the same sequences.


Discussion and Caveats

     What I have shown in the three experiments described in this 
paper is

     that sequential networks can learn difficult
         morphological processes and
     that distributed syllable representations developed during
         recognition embody the structure necessary for operations,
         such as reduplication, that apply directly to the
         representations, and for the
         reproduction of the original sequence of phones.

However, there are many gaps in the implementation of the proposed
hierarchy of sequential networks.  The network in the second experiment 
was told precisely what the relevant syllables of the language were, 
that is, how many were to be learned and what their constituent segments
were.  Yet one of the most difficult tasks facing the learner is 
precisely this: how is the input stream to be segmented into units, and
how many units should there be?  Furthermore, during the production 
training in the last experiment, target phones were available at every 
step, a luxury that no human language learner has during production.
Finally, there is the question of what controls the overall network
shown in Figure 2.  As units are completed in the perception direction,
for example, they need to be passed on to the next level.  But what 
decides when a unit is complete? This is related to the problem of 
segmentation which plays a role in the learning of the units in the
first place.

     With respect to phonology, there is as yet very little that can 
be said about the adequacy of the model, particularly for processes 
which seem to involve complex rule interactions.  Some degree of rule
sequencing is possible in the model because of the hierarchical 
structure, but it is not clear whether this will prove sufficient.  
Incidentally, the levels proposed here are not those proposed in other
recent work within connectionist phonology (Goldsmith, forthcoming; 
Lakoff, 1988; Touretzky & Wheeler, 1990).  These models posit three 
levels differing in abstraction, whereas the present proposal 
distinguishes levels on the basis of the size of the units that are 
taken as primitive events.

Conclusions

     Connectionist networks, especially those that gain access to the 
past via recurrent connections, are capable of detecting and making 
use of many of the regularities which characterize the morphology of 
natural language.  This paper adds further evidence to this claim with 
a demonstration that even rules which apparently involve the 
interleaving of representations from separate tiers can be induced by 
such a simple system.  Yet some of what goes on in word recognition and 
production seems to involve the direct manipulation of higher-level 
representations.  This may be possible in sequential networks which 
process one phone at a time, but it is difficult to imagine how it
might be done as efficiently as it seems to be done by people.
What I have described is a way in which the addition of hierarchical
structure to the model might enable the efficient processing of 
prosodic phenomena.

     In explaining linguistic behavior, all else being equal, those
mechanisms are to be preferred which are more directly implementable
in brains.  But we should not expect to be able to squeeze all of
cognition out of the simplest connectionist networks.  The goal should 
be to start with the basic constraints of connectionist models (which 
are roughly similar to those in brains) and add the structure that 
seems to be required to process and learn complex phenomena like 
reduplication, stress patterns, and phonological rule interaction.  
This approach, while still connectionist, shares with traditional 
models the recognition that humans learn language not simply because 
they are exposed to it, but because they are equipped to learn it. 
What is different about connectionist models in this regard is that 
the innate equipment must be architectural, rather than symbolic.
This paper has proposed a set of architectural constraints that seem 
called for in the acquisition of morphophonological rules.


Footnotes


1. For the purposes of this paper, it will be convenient to ignore 
   structure in the morphological representation of words.

2. Tense/aspect, which together with the root determines the verb 
   stem, and person/number/gender, which determines the inflections, are
   conflated in this experiment.

3. How the hidden layer represents these two tiers remains to be investigated.

4. I am unaware of experiments that demonstrate this facility, however.

5. For another demonstration of the use of distributed representations 
   for structure-sensitive operations, see Chalmers (1990).


Bibliography

Birk, D. B. W. (1976).  The MalakMalak Language, Daly River (Western 
	Arnhem Land).  Pacific Linguistics Series B, no. 45. Australia 
	National University, Canberra.

Chalmers, D. (1990).  Syntactic transformations on distributed 
	representations.  Connection Science, 2:53--62.

Cottrell, G. W. and Plunkett, K. (1991).  Learning the past tense in 
	a recurrent network: Acquiring the mapping from meaning to 
	sounds.  Annual Conference of the Cognitive Science Society,
	13:328--333.

Elman, J. (1990).  Finding structure in time.  Cognitive Science,
	14:179--211.

Gasser, M. and Lee, C.-D. (1991).  A short term memory architecture 
	for the learning of morphophonemic rules.  In Lippmann, R. P.,
 	Moody, J. E., and Touretzky, D. S., editors, Advances in Neural
	Information Processing Systems 3, pages 605--611. Morgan
  	Kaufmann, San Mateo, CA.

Goldsmith, J. (1990).  Autosegmental and Metrical Phonology.  Basil 
	Blackwell, Cambridge, MA.

Goldsmith, J. (forthcoming).  Phonology as an intelligent system.
  	In Napoli, D. J. and Kegl, J., editors, Bridges between
 	psychology and linguistics: a Swarthmore festschrift for Lila
 	Gleitman. Lawrence Erlbaum, Hillsdale, NJ.

Hayes, B. and Abad, M. (1989).  Reduplication and syllabification in 
	Ilokano.  Lingua, 77:331--374.

Hogg, R. and McCully, C. B. (1987).  Metrical Phonology: A Coursebook.
   	Cambridge University Press, Cambridge.

Jordan, M. (1986).  Attractor dynamics and parallelism in a 
	connectionist sequential machine.  In Proceedings of the 
	Eighth Annual Conference of the Cognitive Science Society, 
	pages 531--546, Hillsdale, New Jersey. Lawrence Erlbaum
 	Associates.

Lakoff, G. (1988).  A suggestion for a linguistics with connectionist 
	foundations.  In Touretzky, D., editor, Proceedings of the 1988 
	Connectionist Models Summer School, pages 301--314. Morgan 
	Kauffmann, San Mateo, California.

Moravcsik, E. A. (1978).  Reduplicative constructions.  In Greenberg,
	J. H., Ferguson, C. A., and Moravcsik, E. A., editors, 
	Universals of Human Language. Volume 3: Word Structure, pages 
	297--334.  Stanford University Press, Stanford.

Rubin, D. C. (1975).  Within-word structure in the TOT phenomenon.
   	Journal of Verbal Learning and Verbal Behavior, 14:392--397.

Rumelhart, D. E., Hinton, G., and Williams, R. (1986).  Learning 
	internal representations by error propagation.  In Rumelhart, 
	D. E. and McClelland, J. L., editors, Parallel Distributed 
	Processing, volume 1, pages 318--364. MIT Press, Cambridge, MA.

Shattuck-Hufnagel, S. (1983).  Sublexical units and suprasegmental 
	structure in speech production planning.  In MacNeilage, P. F.,
	editor, The Production of Speech.  Springer, New York.

Stevens, A. (1968).  Madurese Phonology and Morphology.  American 
	Oriental Society, New Haven, CT.

Touretzky, D. and Wheeler, D. (1990).  A computational basis for 
	phonology.  In Touretzky, D., editor, Advances in Neural 
	Information Processing Systems 2, San Mateo, CA. IEEE, Morgan 
	Kaufmann.
