Why It Might Pay To Assume That Languages Are Infinite
Walter J. Savitch
Computer Science and Engineering Department
University of California, San Diego
LaJolla, CA 92093-0114
wsavitch@ucsd.edu wsavitch@ucsd.bitnet
Abstract
We show that, in a wide variety of situations, formal
languages which are finite can be given smaller descriptions
if we assume that they are but the observed data for an
infinite language. This points to one possible reason for
assuming that natural language is infinite.
Overview
There are clear senses in which natural languages are fin-
ite. It is generally assumed that human brains and hence
linguistic processing ability is finite. Moreover, the
available data does not in any obvious way support the
hypothesis that natural language is infinite, rather than
simply a large finite set. At any point in time, an indivi-
dual, or the entire community of human beings, will have
experienced only a finite number of sentences. At no point
in history will scientists be able to point to any actual
pool explicitly containing an infinite number of sentences.
Despite these apparently irrefutable facts, linguists quite
typically treat natural languages as infinite sets, at least
when studying syntax. Their view of language goes beyond
the finite data to a language they perceive as both infin-
ite and real. We will explore one possible explanation for
why it may make sense to assume that languages are infinite.
There is a clear and widely accepted rationale for this
technique of treating finite objects as if they extended
into infinity. The rationale is at root the principle known
as Occam's Razor. In case after case it seems that an
infinite model is simpler and clearer than a finite model.
The simplest and clearest finite model often consists of an
infinite model with a finiteness condition added to it. In
this paper we are able to make this intuition precise enough
to produce a mathematical theory of finite sets which
includes a notion of infinite. We are able to show that
some finite sets have the property that, if we treat them as
infinite sets, then we obtain descriptions that are simpler
than the ones we would obtain if we treated them as finite
sets. We call such sets essentially infinite.
There are some trivial senses in which any finite
language can be extended to a simpler infinite language. In
almost any grammar formalism, the set of all strings over
the basic alphabet (lexicon) is among the simplest of all
sets to describe, and every language can be embedded in this
simple infinite set. The notion of essentially infinite
avoids such triviality by demanding that, over a certain
finite domain, both the finite data set and its infinite
extension agree exactly. This corresponds to demanding that
our grammars account for negative as well as positive data.
The results presented here provide a theoretical frame-
work and some proof-of-concept examples. All the examples
are simple formal languages that are more properly viewed as
the playthings of mathematicians than as fragments of any
real human language. The goal of this paper is to show that
there is a logical basis for assuming that some finitely
witnessed languages are infinite. Whether human languages
are among these languages is held out as a possibility
rather than as an established fact. Moreover, it is held out
as a methodological imperative that may follow from other
methodological assumptions, and not as any sort of fact
about an underlying reality. Indeed, we will see that
whether or not it is productive to assume that a language is
infinite can depend on the assumptions that are made about
the admissible class of grammars. If researchers decide to
model languages using finite-state machines, then it will
turn out that certain finite languages are essentially
infinite. If researchers decide to model languages by
context-free grammars, then a different class of languages
will turn out to be essentially infinite. If yet another
class of grammars is used, then the class of essentially
infinite languages changes again.
Formal Definitions
Before giving the formal definition of essentially infinite
we will need to introduce some notation, all of it con-
sistent with standard usage. A language is a set of finite
length strings over some finite alphabet. If G is a
language describing device, such as a finite-state machine
or context-free grammar, then L(G) will denote the language
characterized by G. For example, if G is a finite-state
machine, then L(G) is the set of all strings accepted by G;
if G is a context-free grammar, then L(G) is the set of all
terminal strings that can be generated from the start sym-
bol. The exact definition depends on the type of model and
is considered part of the definition of the model. We will
refer to L(G) as the language accepted by G. For any
= 0, the languages DUP are essentially finite from
n
the perspective of C, the class of finite-state machines.
The intuition for why the language ODD is essentially
n
infinite from the finite-state machine perspective is sim-
ple: We know that a two state machine can describe the
infinite set of all strings of 0's and 1's that contain an
odd number of 1's. If we put an upper bound on the length of
the strings, then the finite-state machine needs to count
and so needs more than 2 states. The formal details follow:
Counting Lemma for FSM's: Suppose that n > 0 and that F is a
finite language. If the longest string in F is n symbols in
length, then any finite-state machine that accepts as inputs
(exactly) the strings in F must have at least n + 1 states.
Proof. The proof is easily constructed using a standard
pumping argument. It can be found in most introductory text
books on formal language theory.
Theorem 1: The language ODD is essentially infinite from
n
the finite-state machine perspective, provided n => 2.
Proof: A two-state machine can easily be constructed to
accept the infinite language consisting of all strings of
0's and 1's that contain an odd number of 1's. However, by
the Counting Lemma, any finite-state machine for the finite
language ODD has at least n+1 states, and if n => 2,
n
then n+1 => 3. Hence, when n => 2, the machine for the infin-
ite extension is smaller than any machine that accepts
exactly ODD .
n
Theorem 2: For all n > 0, the languages DUP are essentially
n
finite from the finite-state perspective.
Proof. It is easy to construct a finite-state machine
that accepts DUP and that has two states for each possible
n
w of length less than or equal to n. Intuitively one set of
states recognizes the string before the c and the second set
checks that what follows the c matches what came before the
c. The other half of the proof is to show that any finite-
state machine that accepts an infinite extension of DUP
n
will require at least the same number of states.
We will show that any finite-state machine that per-
forms correctly on DUP , even if it extends beyond DUP ,
n n
must have two states for each string w that has length n or
less and that consists of all 0's and 1's. To see that that
many states are required, note that any such machine must be
in a different state after reading each string w of length
less than or equal to n and consisting entirely of 0's and
1's. To establish this, the proof proceeds by contradiction.
If there were fewer states, then there would be two distinct
strings, x and y, such that, after reading either x or y,
the machine is in the same state. By "cutting and pasting"
pieces of the computations on xcx and ycy, we then obtain an
accepting computation of xcy. Now xcy is of equal or shorter
length than the longest string in DUP and is accepted by
n
the finite-state machine that allegedly extends DUP . But
n
xcy is not in DUP and we have a contradiction. This
n
establishes that there is one state for each string w of
length n or less and consisting of all 0's and 1's. Further
analysis along the same lines shows that at least twice that
many states are needed. Specifically, for each two distinct
strings x and y, of the appropriate type, the machine must
be in a different state after reading each of x, y, xc, and
yc.
Prevalence of Essentially Infinite Languages
The notion of essentially infinite is a generally applicable
concept. Any "reasonable" grammar model will give rise to
essentially infinite languages. To make this precise, let us
give the follow precise definition of the minimal conditions
that we need for a grammar to qualify as "reasonable."
Definition: Let C be a class of grammar models and size a
mapping from C to the natural numbers. The pair (C, size)
is defined to be reasonable provided the following axioms
hold:
Axiom 1: For any n, there are only finitely many grammars G in C
such that size(G) = n.
Axiom 2: There is at least one grammar G in C such that L(G) is
infinite.
Axiom 3: For every finite language F, there is a grammar G in C
F
such that L(G ) = F.
F
Even this set of axioms is more than we need for the
results of this paper. We do not use Axiom 3 for the results
presented here. However, Axiom 3 is true of virtually all
common grammar models and it is resonable to demand it as a
guarantee that the grammar model is nontrivial for purposes
of this study.
Theorem 3: For any grammar class C and any measure of size
such that (C, size) satisfies axioms 1 and 2, there are
infinitely many finite languages that are essentially infin-
ite from the perspective of (C, size).
A trivial to prove, but important, Corollary is the follow-
ing:
Corollary: For any grammar class C and any measure of size
such that (C, size) is reasonable (i.e., satisfies axioms 1
through 3), there are infinitely many finite languages that
are essentially infinite from the perspective of (C, size).
Proof of Theorem 3. By Axiom 2 we know there is a gram-
mar G in C such that L(G ) is infinite. Now, each of the
inf inf
languages L , defined as follows, is finite.
n
= n .
inf m o
We will show that
Claim: L = L(G ) is essentially infinite, for all m => n .
m m o
Let G be any grammar in C such that
= n .
m inf o
To show that L is essentially infinite it will suffice to
m
show that
(2) size(G ) < size(G).
inf
But, this can be seen as follows:
(3) size(G ) =< size(G),
m
since they generate the same language L and G is a grammar
m m
of minimal size for this language. Now (2) follows directly
from (1) and (3). Hence, the claim and the theorem are pro-
ven.
Notice that this theorem is very general. The grammars
G need not possess any computability properties. The
languages L(G) need not even be recursively enumerable.
Given a grammar G and a candidate string w, we do not even
assume that there is an algorithm to determine if w is in L(G).
Treating finite sets as what they are hardly seems to
merit attention and so it may seem that no comments are
needed about the existence of essentially finite sets. How-
ever, the notion of essentially finite is not the same as
the traditional notion of finite. One may well ask if essen-
tially finite sets are common or rare. Perhaps surpris-
ingly, they are, in some sense, rarer than essentially
infinite sets. If, in Theorem 3, we replace the term "essen-
tially infinite" with "essentially finite," then the result-
ing claim about essentially finite sets is not provable. It
is not difficult to construct definitions for a grammar
class and a size measure that satisfies Axioms 0 through 3
and for which there are no essentially finite sets. However,
whenever we have looked at commonly used notions of grammar,
we have always been able to find languages that are essen-
tially finite.
Infinity is in the Eye of the Beholder
The next result shows that whether or not a language is
essentially infinite depends on the allowable class of gram-
mars. A language can be essential infinite from the per-
spective of one grammar class, but not from the perspective
of another.
Theorem 4: For all n greater than some threshold, the
languages PAL = {wcR(w) | length(w) =< n & w consists of all
n
0's and 1's} are essentially finite from the perspective of
finite-state machines but are essentially infinite from the
perspective of context-free grammars. (You may use any rea-
sonable measure of size for context-free grammars such as
the number of nonterminals or the number of productions. R(w)
denotes w written backwards.)
Proof. Context-free grammars under any reasonable size
measure satisfy our axioms for reasonableness and so Theorem
3 applies. If we go through the proof of Theorem 3 with
L(G ) set equal to the infinite language {wcR(w) | w con-
inf
sists of all 0's and 1's}, then we see that there is an n
= n , L is essentially infinite from
o =<2n + 1
the context-free perspective. But, for all n, L =
PAL . Hence, for all n above some fixed threshold, PAL is
n n
essentially infinite from the perspective of context-free
grammars.
To complete the proof of the theorem, it will suffice
to show that for all sufficiently large n, the languages
PAL are essentially finite from the finite-state perspec-
n
tive. But, the proof of that is just a slight variant of the
proof of Theorem 2.
Related Work
A number of researchers have shown that a more complex
language description mechanism can yield more succinct
descriptions of certain simple languages. For example, Meyer
and Fisher (1971) show that, by using pushdown automata, one
can give shorter descriptions of some finite-state languages
than one can obtain by using finite-state machine descrip-
tions. Although the finite-state model may be adequate for
describing certain languages, one can obtain a shorter
description by using a model (pushdown automata) that is
more powerful than is needed. This seems to be a very gen-
eral phenomenon as witnessed by numerous other model pairs.
For other examples see Blum (1967), Geller et al. (1975),
Moshier and Rounds (1987), and Valiant (1976). However, in
all these cases, a finite language is always viewed as fin-
ite. The descriptive mechanism may be one typically used for
infinite languages but it describes finite languages as fin-
ite languages. In contrast to these works, the approach
presented here views certain finite languages as if they
were infinite.
Rounds, Manaster-Ramer, and Friedman (1987) discuss
efficiency of description for grammar models and obtain
results related to the notion of essentially finite.
The notion of essentially finite for Turing machines
(or equivalently, for any model of a general-purpose com-
puter) is, in its essence, the same as the notion of a
random sequence in the sense of Kolmogorov (1968). Although
we use no results from the literature that his work spawned,
much of the inspiration for this study came from Kolmogorov
complexity theory.
Generalizations
There is nothing in our theory that depends on natural
language. One may just as well question why human being
would ever think that any set had an infinite cardinality.
We have never, nor will we ever, perceive an infinite number
of items. Indeed, upon reflection it is puzzling why the
notion of infinity of number ever occurred to human beings.
This may be one example of what can be called a computation-
ally derived semantic concept. The concept is used, not
because it is witnessed in the observable world, but because
it makes cognitive tasks simpler. We hope to discover and
compare other examples of such computationally driven seman-
tic concepts.
Summary
We have seen that there is a mathematically precise sense in
which finite languages can be viewed as infinite languages.
We call such languages essentially infinite. A finite
language is essentially infinite if, in the precise sense
described above, it simplifies our theory to consider it to
be an initial segment of some infinite language. Which
languages are essentially infinite and which are not
depends on the particular grammar model used. Nonetheless,
for any reasonable grammar model, there will always be some
such finite but essentially infinite languages. Because a
given finite set may change from essentially infinite to
essentially finite, or vice versa, depending on one's gram-
mar model, this does not seem to be a property that is
inherent in the underlying reality of some infinite exten-
sion language so much as it is a property of the way we view
the finite portion of language that we can perceive.
*
This research was supported in part by NSF grant DCR-
8604031. I thank Alexis Manaster-Ramer for a number of
helpful discussions on this material. Preliminary ver-
sions of most of these results were announced in Sav-
itch (1991).
References
Blum, M.: 1967, "On the Size of Machines," Information and
Control 11, 257-265.
Geller, M. M., H. B. Hunt III, T. G. Szymanski, and J. D.
Ullman: 1975, "Economy of Description by Parsers,
DPDA's and PDA's," IEEE 16th Symposium on Foundations
of Computer Science, Berkeley, 122-127.
Hopcroft, J. E. and J. D. Ullman: 1979, Introduction to
Automata Theory, Languages, and Computation, Addison-
Wesley, Reading, Mass.
Kolmogorov, A.: 1968, "Three Approaches for Defining the
Concept of Information Quantity," Selected Translations
in Math. Stat. and Prob. 7, AMS publications.
Meyer, A. R.: 1972, "Program Size in Restricted Programming
Languages," Information and Control 21, 382-394.
Meyer, A. R. and M. J. Fischer: 1971, "Economy of Descrip-
tion of Automata, Grammars, and Formal Systems," IEEE
12th Symposium on Switching and Automata Theory, East
Lansing, Mich. 188-191.
Moshier, M. D. and W. C. Rounds: 1987, "On the Succinctness
Properties of Unordered Context-Free Grammars,"
Proceedings of the 25th Meeting of the ACL.
Rounds, William C., Alexis Manaster-Ramer, and Joyce Fried-
man: 1987, "Finding Natural Language a Home in Formal
Language Theory," in Manaster-Ramer, A. (ed.),
Mathematics of Language, John Benjamins, Philadelphia,
87-114.
Savitch, Walter J.: 1991 "Infinity is in the Eyes of the
Beholder," in C. Georgopoulos and R. Ishihara (Eds.),
Interdisciplinary Approaches to Language: Essays in Honor
of S.- Y. Kuroda, Kluwer Academic Pub., 487-500.
Valiant, L. G.: 1976, "A Note on the Succinctness of
Description of Deterministic Languages," Information
and Control 32, 139-145.
[CRL Newsletter Home Page]
[CRL Home Page]
Center for Research in Language
CRL Newsletter
Article 6-4