University of California, San Diego

LaJolla, CA 92093-0114

wsavitch@ucsd.edu wsavitch@ucsd.bitnet

Abstract We show that, in a wide variety of situations, formal languages which are finite can be given smaller descriptions if we assume that they are but the observed data for an infinite language. This points to one possible reason for assuming that natural language is infinite. Overview There are clear senses in which natural languages are fin- ite. It is generally assumed that human brains and hence linguistic processing ability is finite. Moreover, the available data does not in any obvious way support the hypothesis that natural language is infinite, rather than simply a large finite set. At any point in time, an indivi- dual, or the entire community of human beings, will have experienced only a finite number of sentences. At no point in history will scientists be able to point to any actual pool explicitly containing an infinite number of sentences. Despite these apparently irrefutable facts, linguists quite typically treat natural languages as infinite sets, at least when studying syntax. Their view of language goes beyond the finite data to a language they perceive as both infin- ite and real. We will explore one possible explanation for why it may make sense to assume that languages are infinite. There is a clear and widely accepted rationale for this technique of treating finite objects as if they extended into infinity. The rationale is at root the principle known as Occam's Razor. In case after case it seems that an infinite model is simpler and clearer than a finite model. The simplest and clearest finite model often consists of an infinite model with a finiteness condition added to it. In this paper we are able to make this intuition precise enough to produce a mathematical theory of finite sets which includes a notion of infinite. We are able to show that some finite sets have the property that, if we treat them as infinite sets, then we obtain descriptions that are simpler than the ones we would obtain if we treated them as finite sets. We call such sets essentially infinite. There are some trivial senses in which any finite language can be extended to a simpler infinite language. In almost any grammar formalism, the set of all strings over the basic alphabet (lexicon) is among the simplest of all sets to describe, and every language can be embedded in this simple infinite set. The notion of essentially infinite avoids such triviality by demanding that, over a certain finite domain, both the finite data set and its infinite extension agree exactly. This corresponds to demanding that our grammars account for negative as well as positive data. The results presented here provide a theoretical frame- work and some proof-of-concept examples. All the examples are simple formal languages that are more properly viewed as the playthings of mathematicians than as fragments of any real human language. The goal of this paper is to show that there is a logical basis for assuming that some finitely witnessed languages are infinite. Whether human languages are among these languages is held out as a possibility rather than as an established fact. Moreover, it is held out as a methodological imperative that may follow from other methodological assumptions, and not as any sort of fact about an underlying reality. Indeed, we will see that whether or not it is productive to assume that a language is infinite can depend on the assumptions that are made about the admissible class of grammars. If researchers decide to model languages using finite-state machines, then it will turn out that certain finite languages are essentially infinite. If researchers decide to model languages by context-free grammars, then a different class of languages will turn out to be essentially infinite. If yet another class of grammars is used, then the class of essentially infinite languages changes again. Formal Definitions Before giving the formal definition of essentially infinite we will need to introduce some notation, all of it con- sistent with standard usage. A language is a set of finite length strings over some finite alphabet. If G is a language describing device, such as a finite-state machine or context-free grammar, then L(G) will denote the language characterized by G. For example, if G is a finite-state machine, then L(G) is the set of all strings accepted by G; if G is a context-free grammar, then L(G) is the set of all terminal strings that can be generated from the start sym- bol. The exact definition depends on the type of model and is considered part of the definition of the model. We will refer to L(G) as the language accepted by G. For any =0, the languages DUP are essentially finite from n the perspective of C, the class of finite-state machines. The intuition for why the language ODD is essentially n infinite from the finite-state machine perspective is sim- ple: We know that a two state machine can describe the infinite set of all strings of 0's and 1's that contain an odd number of 1's. If we put an upper bound on the length of the strings, then the finite-state machine needs to count and so needs more than 2 states. The formal details follow: Counting Lemma for FSM's: Suppose that n > 0 and that F is a finite language. If the longest string in F is n symbols in length, then any finite-state machine that accepts as inputs (exactly) the strings in F must have at least n + 1 states. Proof. The proof is easily constructed using a standard pumping argument. It can be found in most introductory text books on formal language theory. Theorem 1: The language ODD is essentially infinite from n the finite-state machine perspective, provided n => 2. Proof: A two-state machine can easily be constructed to accept the infinite language consisting of all strings of 0's and 1's that contain an odd number of 1's. However, by the Counting Lemma, any finite-state machine for the finite language ODD has at least n+1 states, and if n => 2, n then n+1 => 3. Hence, when n => 2, the machine for the infin- ite extension is smaller than any machine that accepts exactly ODD . n Theorem 2: For all n > 0, the languages DUP are essentially n finite from the finite-state perspective. Proof. It is easy to construct a finite-state machine that accepts DUP and that has two states for each possible n w of length less than or equal to n. Intuitively one set of states recognizes the string before the c and the second set checks that what follows the c matches what came before the c. The other half of the proof is to show that any finite- state machine that accepts an infinite extension of DUP n will require at least the same number of states. We will show that any finite-state machine that per- forms correctly on DUP , even if it extends beyond DUP , n n must have two states for each string w that has length n or less and that consists of all 0's and 1's. To see that that many states are required, note that any such machine must be in a different state after reading each string w of length less than or equal to n and consisting entirely of 0's and 1's. To establish this, the proof proceeds by contradiction. If there were fewer states, then there would be two distinct strings, x and y, such that, after reading either x or y, the machine is in the same state. By "cutting and pasting" pieces of the computations on xcx and ycy, we then obtain an accepting computation of xcy. Now xcy is of equal or shorter length than the longest string in DUP and is accepted by n the finite-state machine that allegedly extends DUP . But n xcy is not in DUP and we have a contradiction. This n establishes that there is one state for each string w of length n or less and consisting of all 0's and 1's. Further analysis along the same lines shows that at least twice that many states are needed. Specifically, for each two distinct strings x and y, of the appropriate type, the machine must be in a different state after reading each of x, y, xc, and yc. Prevalence of Essentially Infinite Languages The notion of essentially infinite is a generally applicable concept. Any "reasonable" grammar model will give rise to essentially infinite languages. To make this precise, let us give the follow precise definition of the minimal conditions that we need for a grammar to qualify as "reasonable." Definition: Let C be a class of grammar models and size a mapping from C to the natural numbers. The pair (C, size) is defined to be reasonable provided the following axioms hold: Axiom 1: For any n, there are only finitely many grammars G in C such that size(G) = n. Axiom 2: There is at least one grammar G in C such that L(G) is infinite. Axiom 3: For every finite language F, there is a grammar G in C F such that L(G ) = F. F Even this set of axioms is more than we need for the results of this paper. We do not use Axiom 3 for the results presented here. However, Axiom 3 is true of virtually all common grammar models and it is resonable to demand it as a guarantee that the grammar model is nontrivial for purposes of this study. Theorem 3: For any grammar class C and any measure of size such that (C, size) satisfies axioms 1 and 2, there are infinitely many finite languages that are essentially infin- ite from the perspective of (C, size). A trivial to prove, but important, Corollary is the follow- ing: Corollary: For any grammar class C and any measure of size such that (C, size) is reasonable (i.e., satisfies axioms 1 through 3), there are infinitely many finite languages that are essentially infinite from the perspective of (C, size). Proof of Theorem 3. By Axiom 2 we know there is a gram- mar G in C such that L(G ) is infinite. Now, each of the inf inf languages L , defined as follows, is finite. n = n . inf m o We will show that Claim: L = L(G ) is essentially infinite, for all m => n . m m o Let G be any grammar in C such that = n . m inf o To show that L is essentially infinite it will suffice to m show that (2) size(G ) < size(G). inf But, this can be seen as follows: (3) size(G ) =< size(G), m since they generate the same language L and G is a grammar m m of minimal size for this language. Now (2) follows directly from (1) and (3). Hence, the claim and the theorem are pro- ven. Notice that this theorem is very general. The grammars G need not possess any computability properties. The languages L(G) need not even be recursively enumerable. Given a grammar G and a candidate string w, we do not even assume that there is an algorithm to determine if w is in L(G). Treating finite sets as what they are hardly seems to merit attention and so it may seem that no comments are needed about the existence of essentially finite sets. How- ever, the notion of essentially finite is not the same as the traditional notion of finite. One may well ask if essen- tially finite sets are common or rare. Perhaps surpris- ingly, they are, in some sense, rarer than essentially infinite sets. If, in Theorem 3, we replace the term "essen- tially infinite" with "essentially finite," then the result- ing claim about essentially finite sets is not provable. It is not difficult to construct definitions for a grammar class and a size measure that satisfies Axioms 0 through 3 and for which there are no essentially finite sets. However, whenever we have looked at commonly used notions of grammar, we have always been able to find languages that are essen- tially finite. Infinity is in the Eye of the Beholder The next result shows that whether or not a language is essentially infinite depends on the allowable class of gram- mars. A language can be essential infinite from the per- spective of one grammar class, but not from the perspective of another. Theorem 4: For all n greater than some threshold, the languages PAL = {wcR(w) | length(w) =< n & w consists of all n 0's and 1's} are essentially finite from the perspective of finite-state machines but are essentially infinite from the perspective of context-free grammars. (You may use any rea- sonable measure of size for context-free grammars such as the number of nonterminals or the number of productions. R(w) denotes w written backwards.) Proof. Context-free grammars under any reasonable size measure satisfy our axioms for reasonableness and so Theorem 3 applies. If we go through the proof of Theorem 3 with L(G ) set equal to the infinite language {wcR(w) | w con- inf sists of all 0's and 1's}, then we see that there is an n = n , L is essentially infinite from o =<2n + 1 the context-free perspective. But, for all n, L = PAL . Hence, for all n above some fixed threshold, PAL is n n essentially infinite from the perspective of context-free grammars. To complete the proof of the theorem, it will suffice to show that for all sufficiently large n, the languages PAL are essentially finite from the finite-state perspec- n tive. But, the proof of that is just a slight variant of the proof of Theorem 2. Related Work A number of researchers have shown that a more complex language description mechanism can yield more succinct descriptions of certain simple languages. For example, Meyer and Fisher (1971) show that, by using pushdown automata, one can give shorter descriptions of some finite-state languages than one can obtain by using finite-state machine descrip- tions. Although the finite-state model may be adequate for describing certain languages, one can obtain a shorter description by using a model (pushdown automata) that is more powerful than is needed. This seems to be a very gen- eral phenomenon as witnessed by numerous other model pairs. For other examples see Blum (1967), Geller et al. (1975), Moshier and Rounds (1987), and Valiant (1976). However, in all these cases, a finite language is always viewed as fin- ite. The descriptive mechanism may be one typically used for infinite languages but it describes finite languages as fin- ite languages. In contrast to these works, the approach presented here views certain finite languages as if they were infinite. Rounds, Manaster-Ramer, and Friedman (1987) discuss efficiency of description for grammar models and obtain results related to the notion of essentially finite. The notion of essentially finite for Turing machines (or equivalently, for any model of a general-purpose com- puter) is, in its essence, the same as the notion of a random sequence in the sense of Kolmogorov (1968). Although we use no results from the literature that his work spawned, much of the inspiration for this study came from Kolmogorov complexity theory. Generalizations There is nothing in our theory that depends on natural language. One may just as well question why human being would ever think that any set had an infinite cardinality. We have never, nor will we ever, perceive an infinite number of items. Indeed, upon reflection it is puzzling why the notion of infinity of number ever occurred to human beings. This may be one example of what can be called a computation- ally derived semantic concept. The concept is used, not because it is witnessed in the observable world, but because it makes cognitive tasks simpler. We hope to discover and compare other examples of such computationally driven seman- tic concepts. Summary We have seen that there is a mathematically precise sense in which finite languages can be viewed as infinite languages. We call such languages essentially infinite. A finite language is essentially infinite if, in the precise sense described above, it simplifies our theory to consider it to be an initial segment of some infinite language. Which languages are essentially infinite and which are not depends on the particular grammar model used. Nonetheless, for any reasonable grammar model, there will always be some such finite but essentially infinite languages. Because a given finite set may change from essentially infinite to essentially finite, or vice versa, depending on one's gram- mar model, this does not seem to be a property that is inherent in the underlying reality of some infinite exten- sion language so much as it is a property of the way we view the finite portion of language that we can perceive. * This research was supported in part by NSF grant DCR- 8604031. I thank Alexis Manaster-Ramer for a number of helpful discussions on this material. Preliminary ver- sions of most of these results were announced in Sav- itch (1991). References Blum, M.: 1967, "On the Size of Machines," Information and Control 11, 257-265. Geller, M. M., H. B. Hunt III, T. G. Szymanski, and J. D. Ullman: 1975, "Economy of Description by Parsers, DPDA's and PDA's," IEEE 16th Symposium on Foundations of Computer Science, Berkeley, 122-127. Hopcroft, J. E. and J. D. Ullman: 1979, Introduction to Automata Theory, Languages, and Computation, Addison- Wesley, Reading, Mass. Kolmogorov, A.: 1968, "Three Approaches for Defining the Concept of Information Quantity," Selected Translations in Math. Stat. and Prob. 7, AMS publications. Meyer, A. R.: 1972, "Program Size in Restricted Programming Languages," Information and Control 21, 382-394. Meyer, A. R. and M. J. Fischer: 1971, "Economy of Descrip- tion of Automata, Grammars, and Formal Systems," IEEE 12th Symposium on Switching and Automata Theory, East Lansing, Mich. 188-191. Moshier, M. D. and W. C. Rounds: 1987, "On the Succinctness Properties of Unordered Context-Free Grammars," Proceedings of the 25th Meeting of the ACL. Rounds, William C., Alexis Manaster-Ramer, and Joyce Fried- man: 1987, "Finding Natural Language a Home in Formal Language Theory," in Manaster-Ramer, A. (ed.), Mathematics of Language, John Benjamins, Philadelphia, 87-114. Savitch, Walter J.: 1991 "Infinity is in the Eyes of the Beholder," in C. Georgopoulos and R. Ishihara (Eds.), Interdisciplinary Approaches to Language: Essays in Honor of S.- Y. Kuroda, Kluwer Academic Pub., 487-500. Valiant, L. G.: 1976, "A Note on the Succinctness of Description of Deterministic Languages," Information and Control 32, 139-145.

[CRL Newsletter Home Page]

[CRL Home Page]

Center for Research in Language

CRL Newsletter

Article 6-4