David A. Tanzer
A simple view of languages
How complex is the English language? The question has many dimensions and nuances. What does complexity mean, and how could it be measured? This is a tough nut to crack, so in this post we’ll retrench and reconsider the question in a more formal setting — computer science.
In this setting, a language gets identified with its extension — the collection of all its sentences. So English gets modeled as just a large repository of sentences. Although this is an ‘impoverished’ view of language, it still has value, as simpler models can give us ideas to work with.
It is easy to see that the extension of the English language is an infinite set of sentences, as it includes, for example:
- The father of Adam did not exist.
- The father of the father of Adam did not exist.
- The father of the father of the father of Adam did not exist.
- Etc., ad infinitum
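To make the recursion behind this explicit, here is a quick sketch in Python (purely illustrative) that generates the first few of these sentences; nothing stops it from going on forever:

```python
# Generate sentences of the form
#   "The father of (the father of)^n Adam did not exist."
# There is one for every n >= 0, so the family has no longest member.

def father_sentence(n):
    """Return the sentence with n+1 occurrences of 'father of'."""
    return "The " + "father of the " * n + "father of Adam did not exist."

for n in range(3):
    print(father_sentence(n))
# The father of Adam did not exist.
# The father of the father of Adam did not exist.
# The father of the father of the father of Adam did not exist.
```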
One can also consider very small languages. For example, the sublanguage of English consisting of the sentences where the subject and object are either Alice or Bob, and the only verb is ‘likes’:
- Alice likes Alice.
- Alice likes Bob.
- Bob likes Alice.
- Bob likes Bob.
Let’s call this language AB:
AB = {“Alice likes Alice”, “Alice likes Bob”, “Bob likes Alice”, “Bob likes Bob”}.
Here, a ‘sentence’ means something rather simple – a string of characters belonging to the language. So the sentence “Bob likes Bob” just means [B, o, b, SPACE, l, i, k, e, s, SPACE, B, o, b], and the word “Alice” is just a shorthand for the string [A, l, i, c, e]. There are no semantics here.
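Since a finite language is literally just a set of strings, we can write it down directly. Here is a rough sketch in Python, only to make the point concrete:

```python
# The AB language as a plain set of strings -- no semantics, just characters.
AB = {
    "Alice likes Alice",
    "Alice likes Bob",
    "Bob likes Alice",
    "Bob likes Bob",
}

def is_sentence_of_AB(s):
    """Membership in a finite language is just a lookup."""
    return s in AB

print(is_sentence_of_AB("Bob likes Bob"))      # True
print(is_sentence_of_AB("Bob likes Charlie"))  # False
print(list("Alice"))  # ['A', 'l', 'i', 'c', 'e'] -- a 'word' is just a string of characters
```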
Here are a couple of other examples of formal languages.
- The set of all strings involving just the characters 0 and 1, which start with 0 and alternate. That is: {“0”, “01”, “010”, “0101”, …}. This is a simple example of an infinite language.
- The set of prime numbers, written out as strings: {“2”, “3”, “5”, “7”, “11”, “13”, “17”, …}. Another infinite language.
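We obviously can’t list an infinite language, but we can still pin each one down by a membership test. Here is a small sketch (illustrative only) for the two languages above:

```python
import re

def in_alternating_language(s):
    """Strings over {0, 1} that start with 0 and alternate: 0, 01, 010, 0101, ..."""
    return re.fullmatch(r"0(10)*1?", s) is not None

def in_prime_language(s):
    """Decimal strings (no leading zeros) that spell out a prime number."""
    if not s.isdigit() or (len(s) > 1 and s[0] == "0"):
        return False
    n = int(s)
    return n >= 2 and all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

print([w for w in ["0", "01", "010", "0101", "00", "10"] if in_alternating_language(w)])
# ['0', '01', '010', '0101']
print([w for w in map(str, range(20)) if in_prime_language(w)])
# ['2', '3', '5', '7', '11', '13', '17', '19']
```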
Now that we know what these ‘languages’ are, we are ready to talk about how to measure their complexity.
Reposted from the Signal Beat, Copyright © 2021, All Rights Reserved.
(Actually Baez didn’t do it: David Tanzer did!)
I think there is something rather special about the complexities of natural languages. It is easy to observe, because pidgins don’t have it and creoles do. So it is something that children bring to their native language, and they synthesise it if it is absent. On the other hand, synthetic formal languages generally don’t have it. It is clearly something grammatical but not purely semantic or syntactic.
It’s not just pidgins versus creoles: my understanding is that older languages (such as those found in very isolated communities like a single river valley in New Guinea) are generally more complex than newer languages (which were creoles or pidgins more recently). If I were to guess, this is the result of generations upon generations of children bringing something to a language.
Interesting comments. I’m wondering to what extent these dimensions of language complexity can still be captured in terms of computational complexity. If the formal grammar for a pidgin is a comparatively simpler set of rules, then its parsing algorithm might have lower time complexity.
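To give a toy sense of the kind of difference I have in mind (my own made-up fragments, nothing standard): a fragment with only flat clauses can be recognized by a single regular expression, i.e. by a finite-state machine, whereas a fragment that allows center-embedded relative clauses is no longer a regular language, so its recognizer needs a stack (here, recursion):

```python
import re

# Flat fragment: "alice likes bob" etc. -- recognizable with constant memory.
FLAT = re.compile(r"(alice|bob) (likes|sees) (alice|bob)")

def is_flat_sentence(s):
    return FLAT.fullmatch(s) is not None

# Embedded fragment (toy grammar):
#   S  -> NP "sleeps"
#   NP -> "bob" | "bob" "who" NP "likes"
# The counts of "who" and "likes" must match, so this fragment is not regular.
def parse_np(words, i):
    """Return the index just past an NP starting at position i, or None."""
    if i >= len(words) or words[i] != "bob":
        return None
    if i + 1 < len(words) and words[i + 1] == "who":
        j = parse_np(words, i + 2)
        if j is not None and j < len(words) and words[j] == "likes":
            return j + 1
        return None
    return i + 1

def is_embedded_sentence(s):
    words = s.split()
    j = parse_np(words, 0)
    return j == len(words) - 1 and words[-1] == "sleeps"

print(is_flat_sentence("alice likes bob"))                             # True
print(is_embedded_sentence("bob who bob who bob likes likes sleeps"))  # True
```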
Well, no need to reinvent an already existing field :) (Parts of) computational linguistics is devoted to describing the complexity classes of the different modules of grammar. Jeff Heinz at Stony Brook is probably the most well-known researcher regarding the (sub)regular complexity of phonology. Thomas Graf, another computational linguist at Stony Brook, has a blog, ‘outdex’, where he sometimes talks about these complexity classes.
Probably the best-known result regarding the complexity of syntax is from the 1980s, by Shieber, establishing non-context-freeness using Swiss German. Lots of work by Aravind Joshi and students is relevant here, too. I’m not aware of much work on morphology (e.g. inflection), but I think that all morphological phenomena are by and large regular (in the technical sense).
Here are some ways in which pidgins are simpler, adapted from Wikipedia. Some are grammatical; some concern pronunciation:
• Typologically most closely resemble isolating languages
• Uncomplicated clausal structure (e.g., no subordinate clauses)
• Reduction or elimination of syllable codas
• Reduction of consonant clusters
• Elimination of aspiration or sound changes
• Lack of tones
• Lack of grammatical tense; use of separate words to indicate tense, usually preceding the verb
• Lack of conjugation or declension
• Lack of grammatical gender or grammatical number; these are supplanted by reduplication to represent plurals and superlatives (and other concepts of increase), and by clear indication of the gender of animate objects.
• Lack of clear parts of speech or word categorization; common use and derivation of new vocabulary through conversion, e.g. nominalization, verbification, adjectivization etc.
Interesting. It reminds me of this talk, in which the author uses Shannon’s and Zipf’s laws as guidance to categorize animal communication (including bees and flowers at some point) and to make inferences about possible extraterrestrial communication:
https://longnow.org/seminars/02020/apr/29/interspecies-communication-and-search-extraterrestrial-intelligence/
It might be difficult to distinguish simpler from different. Take, for example, the point about uncomplicated clausal structure (e.g., no subordinate clauses). Perhaps speakers of a pidgin can achieve just as much semantic complexity by using discourse connectives between main clauses, eliminating the need to embed clauses. So complexity at the discourse level of grammatical relations compensates for simplification at the sentence level.
Or maybe pidgins shouldn’t be compared to the codified speech of entire language communities, but are much narrower. A mother talks differently when interacting with her own child, since she knows what vocabulary and grammar the child has already used (some of this inferred from the mother’s knowledge of the child’s exposure and previous performance). Something like a pair-idiolect that evolves rapidly from day to day. Similarly, a pidgin speaker may adjust the register of their speech according to what they know and learn about what the listener picks up on. If they simplify with one listener, that doesn’t mean more complex resources of grammar and phonology aren’t available when speaking with a more sophisticated or familiar listener.
Insofar as English, or any other natural language, has only existed (and will only exist) for a finite time among a finite speaker population with a finite rate of thought, I would argue that its extension — its real, physical extent of actually spoken, actually written or even just actually thought sentences — is obviously likewise finite.
Grammarians don’t usually study the collection of “actually spoken, actually written or actually thought sentences”, since this collection is very hard to describe in detail, or to study. It is easier to describe the infinite collection of all grammatical sentences. Chomsky has written about this fact and argued that it’s important.
For example, there is a set of infinitely many sentences of this form:
A cat is asleep.
A cat and another cat is asleep.
A cat and another cat and another cat is asleep.
A cat and another cat and another cat and another cat is asleep.
etc.
and while only finitely many of these have ever been actually spoken, actually written or actually thought, all of them are considered grammatical in most theories of grammar.
I suspect fewer than 7 of them have been actually spoken, actually written or actually thought, unless someone deliberately decided to have fun with them.
I’m aware. What this argument establishes is that Chomskyite grammarians do not study natural languages like English; they study abstract mathematical objects that do not exist in nature. In real life only finitely many strings formed e.g. by repeating conjunctions or modifiers can be parsed by any human even in theory, if only because words take time to be spoken and humans have finite lifespans.
Studying languages as they actually exist in real life is indeed hard, but we do have many methods for actually doing this. I welcome you to engage sometimes with people who do this. Or not, if you wish… but if so, please stop pretending that whatever it is you are doing amounts to studying “the English language”.
More empirically grounded modelling that would allow you to keep around most concepts of theoretical linguistics would not even be hard! E.g. one model of language immediately superior to a set of sentences floating in a vacuum would be a set of agents running grammar-parsing algorithms on finite memory, allowing us to still maintain notions such as unrealized but grammatical sentences that could be parsed by some of these speakers. Any ontological grounding like this, though, no matter how simple, will necessarily require abandoning at minimum the notion of infinitely large languages.
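For what it’s worth, here is a minimal sketch of what I mean, under my own toy assumptions about the agents: each one runs the same grammaticality test but can only hold a bounded number of words in memory, so the language, understood as the set of sentences some agent in the population could parse (uttered or not), comes out finite even though the grammar itself generates sentences of every length:

```python
# A minimal sketch (toy assumptions, not a serious model of speakers).

def grammatical(sentence):
    """Toy grammar from this thread: 'A cat (and another cat)* is asleep.'"""
    if not sentence.endswith(" is asleep."):
        return False
    words = sentence[: -len(" is asleep.")].split()
    if words[:2] != ["A", "cat"]:
        return False
    rest = words[2:]
    return len(rest) % 3 == 0 and all(
        rest[i:i + 3] == ["and", "another", "cat"] for i in range(0, len(rest), 3)
    )

class Agent:
    """A speaker who can parse sentences of at most `memory` words."""
    def __init__(self, memory):
        self.memory = memory

    def can_parse(self, sentence):
        return len(sentence.split()) <= self.memory and grammatical(sentence)

population = [Agent(memory=m) for m in (7, 12, 20)]

def in_language(sentence):
    """Unrealized but grammatical sentences still count, if some agent could parse them."""
    return any(agent.can_parse(sentence) for agent in population)

print(in_language("A cat and another cat is asleep."))                # True: grammatical, 7 words
print(in_language("A cat " + "and another cat " * 7 + "is asleep."))  # False: 25 words, beyond every agent
```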