Why It Might Pay To Assume That Languages Are Infinite

Walter J. Savitch

Computer Science and Engineering Department
University of California, San Diego
LaJolla, CA 92093-0114
wsavitch@ucsd.edu wsavitch@ucsd.bitnet

                          Abstract

We show that,  in  a  wide  variety  of  situations,  formal
languages which are finite can be given smaller descriptions
if we assume that they are but  the  observed  data  for  an
infinite  language.   This points to one possible reason for
assuming that natural language is infinite.




                          Overview


There are clear senses in which natural languages  are  fin-

ite.  It  is  generally  assumed that human brains and hence

linguistic  processing  ability  is  finite.  Moreover,  the

available  data  does  not  in  any  obvious way support the

hypothesis that natural language is  infinite,  rather  than

simply a large finite set.  At any point in time, an indivi-

dual, or the entire community of  human  beings,  will  have

experienced  only  a finite number of sentences. At no point

in history will scientists be able to point  to  any  actual

pool explicitly containing an infinite number of sentences.

Despite these apparently irrefutable facts, linguists  quite

typically treat natural languages as infinite sets, at least

when studying syntax.  Their view of  language  goes  beyond

the  finite data to a  language they perceive as both infin-

ite and real.  We will explore one possible explanation  for

why it may make sense to assume that languages are infinite.


     There is a clear and widely accepted rationale for this

technique  of  treating  finite  objects as if they extended

into infinity.  The rationale is at root the principle known

as  Occam's  Razor.   In  case  after  case it seems that an

infinite model is simpler and clearer than a  finite  model.

The  simplest and clearest finite model often consists of an

infinite model with a finiteness condition added to it.   In

this paper we are able to make this intuition precise enough

to produce  a  mathematical  theory  of  finite  sets  which

includes  a  notion  of  infinite.  We are able to show that

some finite sets have the property that, if we treat them as

infinite sets, then we obtain descriptions that are  simpler

than  the  ones we would obtain if we treated them as finite

sets. We call such sets essentially infinite.



     There are some  trivial  senses  in  which  any  finite

language  can be extended to a simpler infinite language. In

almost any grammar formalism, the set of  all  strings  over

the  basic  alphabet  (lexicon) is among the simplest of all

sets to describe, and every language can be embedded in this

simple  infinite  set.  The  notion  of essentially infinite

avoids such triviality by demanding  that,  over  a  certain

finite  domain,  both  the  finite data set and its infinite

extension agree exactly. This corresponds to demanding  that

our grammars account for negative as well as positive data.


     The results presented here provide a theoretical frame-

work  and  some  proof-of-concept examples. All the examples

are simple formal languages that are more properly viewed as

the  playthings  of  mathematicians than as fragments of any

real human language. The goal of this paper is to show  that

there  is  a  logical  basis for assuming that some finitely

witnessed languages are infinite.  Whether  human  languages

are  among  these  languages  is  held  out as a possibility

rather than as an established fact. Moreover, it is held out

as  a  methodological  imperative that may follow from other

methodological assumptions, and not  as  any  sort  of  fact

about  an  underlying  reality.   Indeed,  we  will see that

whether or not it is productive to assume that a language is

infinite  can  depend on the assumptions that are made about

the admissible class of grammars. If researchers  decide  to

model  languages  using  finite-state machines, then it will

turn out  that  certain  finite  languages  are  essentially

infinite.   If  researchers  decide  to  model  languages by

context-free grammars, then a different class  of  languages

will  turn  out  to be essentially infinite. If  yet another

class of grammars is used, then  the  class  of  essentially

infinite languages changes again.


                     Formal Definitions


Before giving the formal definition of essentially  infinite

we  will  need  to  introduce  some notation, all of it con-

sistent with standard usage.  A language is a set of  finite

length  strings  over  some  finite  alphabet.   If  G  is a

language describing device, such as a  finite-state  machine

or  context-free grammar, then L(G) will denote the language

characterized by G. For example,  if  G  is  a  finite-state

machine,  then L(G) is the set of all strings accepted by G;

if G is a context-free grammar, then L(G) is the set of  all

terminal  strings  that can be generated from the start sym-

bol. The exact definition depends on the type of  model  and

is  considered  part of the definition of the model. We will

refer to L(G) as  the  language  accepted  by  G.   For  any
               = 0, the languages DUP  are essentially finite from
                                n
the perspective of C, the class of  finite-state machines.



     The intuition for why the language ODD  is  essentially
                                           n
infinite  from  the finite-state machine perspective is sim-

ple: We know that a  two  state  machine  can  describe  the

infinite  set  of all strings of 0's and 1's that contain an

odd number of 1's. If we put an upper bound on the length of

the  strings,  then  the finite-state machine needs to count

and so needs more than 2 states. The formal details follow:



Counting Lemma for FSM's: Suppose that n > 0 and that F is a

finite language. If  the longest string in F is n symbols in

length, then any finite-state machine that accepts as inputs

(exactly) the strings in F must have at least n + 1 states.



     Proof. The proof is easily constructed using a standard

pumping  argument. It can be found in most introductory text

books on formal language theory.  



Theorem 1: The language ODD   is essentially  infinite  from
                           n
           the finite-state machine perspective, provided n => 2.


Proof: A two-state machine can easily be constructed to

accept  the  infinite  language consisting of all strings of

0's and 1's that contain an odd number of 1's.  However,  by

the  Counting Lemma, any finite-state machine for the finite

language ODD has at least n+1 states, and if n =>  2,
            n
then  n+1 => 3. Hence, when n => 2, the machine for the infin-

ite extension is  smaller  than  any  machine  that  accepts

exactly ODD .  
           n



Theorem 2: For all n > 0, the languages DUP  are essentially
                                           n
           finite from the finite-state perspective.



     Proof. It is easy to construct a  finite-state  machine

that  accepts DUP  and that has two states for each possible
                 n
w of length less than or equal to n.  Intuitively one set of

states recognizes the string before the c and the second set

checks that what follows the c matches what came before  the

c.  The other half of the proof is to show that  any finite-

state machine that accepts an  infinite  extension  of  DUP
							   n
will require at least the same number of states.           


     We will show that any finite-state  machine  that  per-

forms  correctly  on  DUP ,  even if it extends beyond DUP ,
                         n                                n
must have two states for each string w that has length n  or

less  and that consists of all 0's and 1's. To see that that

many states are required, note that any such machine must be

in  a  different state after reading each string w of length

less than or equal to n and consisting entirely of  0's  and

1's. To establish this, the proof proceeds by contradiction.

If there were fewer states, then there would be two distinct

strings,  x  and  y, such that, after reading either x or y,

the machine is in the same state. By "cutting  and  pasting"

pieces of the computations on xcx and ycy, we then obtain an

accepting computation of xcy. Now xcy is of equal or shorter

length  than  the  longest string in DUP  and is accepted by
                                        n
the finite-state machine that allegedly  extends  DUP .  But
						     n
xcy  is  not  in  DUP   and  we  have  a contradiction. This
                     n
establishes that there is one state for  each  string  w  of

length  n or less and consisting of all 0's and 1's. Further

analysis along the same lines shows that at least twice that

many  states are needed. Specifically, for each two distinct

strings x and y, of the appropriate type, the  machine  must

be  in a different state after reading each of x, y, xc, and

yc.


        Prevalence of Essentially Infinite Languages


The notion of essentially infinite is a generally applicable

concept.   Any  "reasonable" grammar model will give rise to

essentially infinite languages. To make this precise, let us

give the follow precise definition of the minimal conditions

that we need for a grammar to qualify as "reasonable."


Definition: Let C be a class of grammar models  and  size  a

mapping  from  C to the natural numbers.  The pair (C, size)

is defined to be reasonable provided  the  following  axioms

hold:



Axiom 1: For any n, there are only finitely many grammars G in C

         such that size(G) = n.


Axiom 2: There is at least one grammar G in C such that L(G)  is

         infinite.


Axiom 3: For every finite language F, there is a grammar G in C
                                                          F
         such that L(G ) = F.
                      F



     Even this set of axioms is more than we  need  for  the

results of this paper. We do not use Axiom 3 for the results

presented here.  However, Axiom 3 is true of  virtually  all

common  grammar models and it is resonable to demand it as a

guarantee that the grammar model is nontrivial for  purposes

of this study.



Theorem 3: For any grammar class C and any measure  of  size

such  that  (C,  size)  satisfies  axioms 1 and 2, there are

infinitely many finite languages that are essentially infin-

ite from the perspective of (C, size).


A trivial to prove, but important, Corollary is the  follow-

ing:



Corollary: For any grammar class C and any measure  of  size

such  that (C, size) is reasonable (i.e., satisfies axioms 1

through 3), there are  infinitely many finite languages that

are essentially infinite from the perspective of (C, size).



     Proof of Theorem 3. By Axiom 2 we know there is a gram-

mar  G   in C  such that L(G   ) is infinite.  Now, each of the
      inf                   inf
languages L , defined as follows, is finite.
           n             
				= n .
                     inf          m                o

We will show that



Claim: L  = L(G ) is essentially infinite, for all m => n .
        m      m                                        o



     Let G be any grammar in C such that

                   = n .
        m      inf              o

To show that L  is essentially infinite it will  suffice  to
              m

show that



(2)           size(G   ) < size(G).
                    inf

But, this can be seen as follows:



(3)           size(G ) =< size(G),
                    m

since they generate the same language L  and G  is a grammar
                                       m      m
of minimal size for this language.  Now (2) follows directly

from (1) and (3). Hence, the claim and the theorem are  pro-

ven.  


     Notice that this theorem is very general. The  grammars

G  need  not   possess  any  computability  properties.  The

languages L(G) need  not  even  be  recursively  enumerable.

Given   a grammar G and a candidate string w, we do not even

assume that there is an algorithm to determine if w  is in L(G).


     Treating finite sets as what they are hardly  seems  to

merit  attention  and  so  it  may seem that no comments are

needed about the existence of essentially finite sets.  How-

ever,  the  notion  of essentially finite is not the same as

the traditional notion of finite. One may well ask if essen-

tially  finite  sets  are  common or rare.  Perhaps surpris-

ingly, they are,  in  some  sense,  rarer  than  essentially

infinite sets. If, in Theorem 3, we replace the term "essen-

tially infinite" with "essentially finite," then the result-

ing claim about essentially finite sets is not provable.  It

is not difficult to  construct  definitions  for  a  grammar

class  and  a size measure that satisfies Axioms 0 through 3

and for which there are no essentially finite sets. However,

whenever we have looked at commonly used notions of grammar,

we have always been able to find languages that  are  essen-

tially finite.



           Infinity is in the Eye of the Beholder



The next result shows that whether  or  not  a  language  is

essentially infinite depends on the allowable class of gram-

mars.  A language can be essential infinite  from  the  per-

spective  of one grammar class, but not from the perspective

of another.



Theorem 4: For  all  n  greater  than  some  threshold,  the
                       
languages  PAL   = {wcR(w)  | length(w) =< n & w consists of all
              n
0's and 1's} are essentially finite  from the perspective of

finite-state  machines but are essentially infinite from the

perspective of context-free grammars. (You may use any  rea-

sonable  measure  of  size for context-free grammars such as

the number of nonterminals or the number of productions. R(w)

denotes w written backwards.)



     Proof.  Context-free grammars under any reasonable size

measure satisfy our axioms for reasonableness and so Theorem

3 applies.  If we go through the proof  of  Theorem  3  with
                                                  
L(G   )  set  equal  to the infinite language {wcR(w) | w con-
   inf
sists of all 0's and 1's}, then we see that there is  an  n
                             = n , L   is essentially infinite from
                          o                     =<2n  +  1
the context-free perspective.  But, for all n, L           =

PAL .   Hence, for all n above some fixed threshold, PAL  is
   n                                                    n
essentially infinite from the  perspective  of  context-free

grammars.



     To complete the proof of the theorem, it  will  suffice

to  show  that  for  all sufficiently large n, the languages

PAL  are essentially finite from the  finite-state  perspec-
   n
tive. But, the proof of that is just a slight variant of the

proof of Theorem 2.  


                        Related Work


A number of researchers  have  shown  that  a  more  complex

language  description  mechanism   can  yield  more succinct

descriptions of certain simple languages. For example, Meyer

and Fisher (1971) show that, by using pushdown automata, one

can give shorter descriptions of some finite-state languages

than  one  can obtain by using finite-state machine descrip-

tions. Although the finite-state model may be  adequate  for

describing  certain  languages,  one  can  obtain  a shorter

description by using a model  (pushdown  automata)  that  is

more  powerful  than is needed. This seems to be a very gen-

eral phenomenon as witnessed by numerous other model  pairs.

For  other  examples see Blum (1967),  Geller et al. (1975),

Moshier and Rounds (1987), and Valiant (1976).  However,  in

all  these cases, a finite language is always viewed as fin-

ite. The descriptive mechanism may be one typically used for

infinite languages but it describes finite languages as fin-

ite languages.  In contrast to  these  works,  the  approach

presented  here  views  certain  finite languages as if they

were infinite.


     Rounds, Manaster-Ramer,  and  Friedman  (1987)  discuss

efficiency  of  description  for  grammar  models and obtain

results related to the notion of  essentially finite.


     The notion of essentially finite  for  Turing  machines

(or  equivalently,  for  any model of a general-purpose com-

puter)  is, in its essence,  the same as  the  notion  of  a

random  sequence in the sense of Kolmogorov (1968). Although

we use no results from the literature that his work spawned,

much  of the inspiration for this study came from Kolmogorov

complexity theory.


                      Generalizations


There is nothing in  our  theory  that  depends  on  natural

language.   One  may  just  as well question why human being

would ever think that any set had an  infinite  cardinality.

We have never, nor will we ever, perceive an infinite number

of items. Indeed, upon reflection it  is  puzzling  why  the

notion  of infinity of number ever occurred to human beings.

This may be one example of what can be called a computation-

ally  derived  semantic  concept.  The  concept is used, not

because it is witnessed in the observable world, but because

it  makes  cognitive tasks simpler.  We hope to discover and

compare other examples of such computationally driven seman-

tic concepts.


                          Summary


We have seen that there is a mathematically precise sense in

which  finite languages can be viewed as infinite languages.

We call  such  languages  essentially  infinite.   A  finite

language  is  essentially  infinite if, in the precise sense

described above, it simplifies our theory to consider it  to

be  an  initial  segment  of  some  infinite language. Which

languages  are   essentially  infinite  and  which  are  not

depends  on the particular grammar model used.  Nonetheless,

for any reasonable grammar model, there will always be  some

such  finite  but  essentially infinite languages. Because a

given finite set may change  from  essentially  infinite  to

essentially  finite, or vice versa, depending on one's gram-

mar model, this does not seem  to  be  a  property  that  is

inherent  in  the underlying reality of some infinite exten-

sion language so much as it is a property of the way we view

the finite portion of language that we can perceive.


*
 This research was supported in part by NSF grant  DCR-
8604031.  I thank Alexis Manaster-Ramer for a number of
helpful discussions on this material.  Preliminary ver-
sions  of  most of these results were announced in Sav-
itch (1991).


                         References


Blum, M.: 1967, "On the Size of Machines,"  Information  and
     Control 11, 257-265.

Geller, M. M., H. B. Hunt III, T. G. Szymanski,  and  J.  D.
     Ullman:  1975,  "Economy  of  Description  by  Parsers,
     DPDA's and PDA's," IEEE 16th Symposium  on  Foundations
     of Computer Science, Berkeley, 122-127.

Hopcroft, J. E. and J.  D.  Ullman:  1979,  Introduction  to
     Automata  Theory,  Languages, and Computation, Addison-
     Wesley, Reading, Mass.

Kolmogorov, A.: 1968, "Three  Approaches  for  Defining  the
     Concept of Information Quantity," Selected Translations
     in Math. Stat. and Prob. 7, AMS publications.

Meyer, A. R.: 1972, "Program Size in Restricted  Programming
     Languages," Information and Control 21, 382-394.

Meyer, A. R. and M. J. Fischer: 1971, "Economy  of  Descrip-
     tion  of  Automata, Grammars, and Formal Systems," IEEE
     12th Symposium on Switching and Automata  Theory,  East
     Lansing, Mich. 188-191.

Moshier, M. D. and W. C. Rounds: 1987, "On the  Succinctness
     Properties   of   Unordered   Context-Free   Grammars,"
     Proceedings of the 25th Meeting of the ACL.

Rounds, William C., Alexis Manaster-Ramer, and Joyce  Fried-
     man:  1987,  "Finding Natural Language a Home in Formal
     Language  Theory,"   in   Manaster-Ramer,   A.   (ed.),
     Mathematics  of Language, John Benjamins, Philadelphia,
     87-114.

Savitch, Walter J.: 1991 "Infinity is in  the  Eyes  of  the
     Beholder," in C. Georgopoulos and  R.  Ishihara  (Eds.), 
     Interdisciplinary  Approaches to Language: Essays in Honor
     of S.- Y. Kuroda, Kluwer Academic Pub., 487-500.

Valiant, L.  G.:  1976,  "A  Note  on  the  Succinctness  of
     Description  of  Deterministic  Languages," Information
     and Control 32, 139-145.

[CRL Newsletter Home Page]
[CRL Home Page]
Center for Research in Language
CRL Newsletter
Article 6-4