http://languagelog.ldc.upenn.edu/nll/?p=2004
Following up on "The order of ancestors" (12/24/2009) and "Sexual orders" (12/27/2009), I need to note one other important recent paper: Sarah Benor and Roger Levy, "The Chicken or the Egg? A Probabilistic Analysis of English Binomials", Language 82(2): 233-278, 2006. And several readers have pointed me to an older tradition of corpus linguistics that comes to a different set of conclusions about binomial ordering: Mishnah Keritot 6:9, etc.
Here's the abstract of the Benor and Levy paper:
Why is it preferable to say salt and pepper over pepper and salt? Based on an analysis of 692 binomial tokens from online corpora, we show that a number of semantic, metrical, and frequency constraints contribute significantly to ordering preferences, overshadowing the phonological factors that have traditionally been considered important. The ordering of binomials exhibits a considerable amount of variation. For example, although principal and interest is the more frequent order, interest and principal also occurs. We consider three frameworks for analysis of this variation: traditional optimality theory, stochastic optimality theory, and logistic regression. Our best models—using logistic regression—predict 79.2% of the binomial tokens and 76.7% of types, and the remainder are predicted as less frequent—but not ungrammatical—variants.
B & L take their examples from a number of tagged corpora, using a method described as follows:
The corpus search was conducted on three tagged corpora: the Switchboard (spoken), Brown (varied genres, written), and Wall Street Journal (WSJ; newspaper) sections of the Penn Treebank III, available from the Linguistic Data Consortium (Marcus et al. 1993). These corpora were searched for constructions of N and N, V and V, Adj and Adj, and Adv and Adv, where both X and X were part of the same XP. The search yielded 3,680 distinct binomials. Using the beginnings and ends of each corpus’s search results, we took a total of 411 input binomial TYPES (distinct sets A, B for some binomial sequence A and B) for analysis. This total consisted of 120 nouns, 103 verbs (including gerunds and participles), 118 adjectives, and 70 adverbs. We did not include binomials formed from personal names, because idiosyncratic factors frequently determine the ordering of names in a conjunction (however, we did not exclude the names of political entities such as countries or states). We discarded binomials formed with extender phrases, such as and stuff, as they are not in theory reversible (i.e. politics and everything cannot be everything and politics). For each of these binomials, we noted whether we considered each to be frozen (for example, by and large and north and south are frozen; honest and stupid and slowly and thoughtfully are not). We then searched for all occurrences of each binomial and its reverse in all three corpora, and included all such occurrences in our final corpus, yielding 692 tokens. Like Gustafsson (1976), we found that very few of the binomials occurred more than once in the three corpora. Most of those that did are frozen binomials, such as back and forth, which occurred forty-nine times.
Their technique has several important advantages. For one thing, the use of parsed corpora allows them to avoid apparent binomials like dogs and desserts from the string "…selling hamburgers, hot dogs and desserts", or dogs and columns from the string "a most unique newspaper, one that carries no headlines, photographs of cats and dogs and columns with names like 'The Downieville Dragnet.'". And this approach provides a valid sample of the binomials (common or otherwise) that happen to occur in a chosen chunk of text.
It also has an important disadvantage: the amount of text analyzed is only about three million words. 692 binomial tokens is thus a rate of about 231 per million. This is pretty common — it's about the same frequency as the word America, or the sequence "from a". But their observation that "very few of the [individual] binomials occurred more than once in the three corpora" is both expected, and telling. The nature of LNRE ("large numbers of rare events") distributions guarantees that the resulting sample will present a very noisy picture of the population frequency and the population order statistics for individual binomials. And this guarantee is honored by the facts, as can be seen in the following table, which compares a random selection of their 411 binomial types with counts from some larger corpora:
| Binomial | B&L | COCA | LDC News |
|---|---|---|---|
| English and Americans | 1 0 | 7 6 | 10 8 |
| Connecticut and Massachusetts | 1 0 | 15 23 | 140 190 |
| slowly and thoughtfully | 1 0 | 7 0 | 3 0 |
| abused and neglected | 1 0 | 86 18 | 336 57 |
| acute and correct | 1 0 | 0 0 | 0 0 |
| approved and commended | 1 0 | 0 0 | 0 0 |
| strawberries and bananas | 1 0 | 2 4 | 10 9 |
| oranges and grapefruit | 1 0 | 9 8 | 59 19 |
| warm and fuzzy | 1 0 | 154 5 | 1121 6 |
| fruits and nuts | 1 0 | 54 14 | 192 27 |
| T-ball and soccer | 2 0 | 1 2 | 2 2 |
| pinks and greens | 2 0 | 13 1 | 18 10 |
| gold and silver | 4 0 | 428 165 | 3287 548 |
| principal and interest | 5 2 | 55 33 | 980 787 |
(In each cell, the first number is the count for the cited order of the binomial, and the second number is the count for the reversed order.)
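To make the comparison easier to read, the counts can be turned into the share of tokens that appear in the cited order. The following is a minimal sketch and not part of Benor and Levy's analysis. The counts are copied from the table above, while the corpus labels (`treebank` for the first column, `coca`, `ldc_news`) and the function names are hypothetical names chosen for illustration.

```python
# Counts copied from the table above, as (cited order, reversed order).
# Column order: small Penn Treebank sample, COCA, LDC News.
# The labels below are names invented for this sketch.
CORPORA = ("treebank", "coca", "ldc_news")

ROWS = {
    "English and Americans": ((1, 0), (7, 6), (10, 8)),
    "Connecticut and Massachusetts": ((1, 0), (15, 23), (140, 190)),
    "slowly and thoughtfully": ((1, 0), (7, 0), (3, 0)),
    "abused and neglected": ((1, 0), (86, 18), (336, 57)),
    "acute and correct": ((1, 0), (0, 0), (0, 0)),
    "approved and commended": ((1, 0), (0, 0), (0, 0)),
    "strawberries and bananas": ((1, 0), (2, 4), (10, 9)),
    "oranges and grapefruit": ((1, 0), (9, 8), (59, 19)),
    "warm and fuzzy": ((1, 0), (154, 5), (1121, 6)),
    "fruits and nuts": ((1, 0), (54, 14), (192, 27)),
    "T-ball and soccer": ((2, 0), (1, 2), (2, 2)),
    "pinks and greens": ((2, 0), (13, 1), (18, 10)),
    "gold and silver": ((4, 0), (428, 165), (3287, 548)),
    "principal and interest": ((5, 2), (55, 33), (980, 787)),
}


def cited_share(cited, reversed_):
    """Fraction of tokens in the cited order, or None when there are no tokens."""
    total = cited + reversed_
    return None if total == 0 else cited / total


def reversed_majority(corpus):
    """Binomials whose reversed order is strictly more frequent in `corpus`."""
    col = CORPORA.index(corpus)
    return [name for name, cells in ROWS.items() if cells[col][1] > cells[col][0]]


if __name__ == "__main__":
    for name, cells in ROWS.items():
        shares = [cited_share(*cell) for cell in cells]
        shown = ["n/a" if s is None else f"{s:.2f}" for s in shares]
        print(f"{name:32s}", *shown)
    print("Reversed order more frequent in COCA:", reversed_majority("coca"))
```

On these counts, three of the fourteen binomials (Connecticut and Massachusetts, strawberries and bananas, T-ball and soccer) are more frequent in the reversed order in COCA, although each occurred only in the cited order in the small sample.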
Given that their model assigns weights to 20 "semantic, pragmatic, metrical, phonological, and word-frequency factors that may affect the ordering of binomials", and that the patterning of these factors in their 411 binomial types is far from a factorial design (as expected in real-world linguistic data), this amount of noise in type-token relations will certainly degrade the predictive power of the result.
As they observe, "Because our full logistic-regression model uses a large number of constraints relative to the size of the dataset, it is not possible to draw detailed conclusions from the specific values of resulting constraint weights". This would be true even if the estimated frequencies of binomial types were reasonably accurate; it's much more of a problem given that their counts are nearly all 1, and thus almost meaningless as a basis for predicting population frequency. (This is especially true if the model is tested via cross-validation. As far as I can tell, though, they tested on their training set, making the reported performance of 79.2% of tokens and 76.7% of types surprisingly low.)
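One way to see how little a single token says about the population preference is to put a confidence interval around the observed proportion. The sketch below uses the Wilson score interval, which is my choice for illustration and not something Benor and Levy report. The function name `wilson_interval` and the default `z = 1.96` (roughly a 95% interval) are assumptions of this example.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: tokens observed in the cited order
    n:         total tokens observed (cited + reversed)
    z:         normal quantile; 1.96 gives roughly a 95% interval
    """
    if n <= 0:
        raise ValueError("need at least one observation")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # A binomial seen once, in the cited order only (the typical case here).
    print(wilson_interval(1, 1))
    # "principal and interest" in LDC News: 980 cited, 787 reversed.
    print(wilson_interval(980, 980 + 787))
```

For a count of 1 and 0 the lower bound works out to 1 / (1 + z²), or about 0.21 with z = 1.96, so a single token in the cited order is compatible with a population preference anywhere from about 21% to 100% for that order.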
At the start of this post, I mentioned an older corpus-linguistics tradition that also must deal with the problem of binomial order in a small corpus (about half a million words). This older tradition, without access to generalized linear models, draws a different sort of conclusion from the fact that binomial order is hard to predict and apparently variable. Thus:
“This is the same Aaron and Moshe to whom G-d told, ‘Take the Jewish people, all of their hosts, out of Egypt.’” (Shemot 6:26)
The Tosefta at the end of Masekhet Keritot asks: Why does Aaron precede Moshe in this verse, whereas Moshe usually precedes Aaron? […]
[T]he Torah, one verse after another, switches the order of their names. When it speaks about the actual Exodus – “to whom G-d told, ‘Take the Jewish people, all of their hosts, out of Egypt’” – where Moshe was central, it lists Aaron first – “Aaron and Moshe.” (Shemot 6:26) Then, in the next verse when it talks of speaking to Pharaoh – “They are the ones who speak to Pharaoh the king of Egypt . . .” – it lists Moshe first – “this is Moshe and Aaron.” (Shemot 6:27) This switching of the names actually teaches a lesson. By listing Aaron first concerning the area where Moshe was central and listing Moshe first in the area where Aaron was central, it makes it clear that both had an equal role in the mission.
Or again:
Dealing with the duties and the relationship of the child to its parents:
a) Honor your father and your mother, (Exodus 20:12; Deut. 5:16)
b) Ye shall fear every man his mother and his father (Levit.19:3)
[In the matter of honor due to parents, the father is mentioned first; in the matter of reverence due to them, the mother is mentioned first. From this we infer that both are to be equally honored and revered. …]
And:
4. "You shall revere every man his mother, and his father"
Rabbi Yosi says that whoever fears their mother and father observes the Shabbat. He wonders why the mother is mentioned first, and Rabbi Shimon explains that the mother does not have the power to instill fear that the father does, therefore she is mentioned first. Rabbi Yehuda says that just as heaven and earth were created simultaneously, both parents are equal in fear and honor. Rabbi Shimon tells us about the sanctification below during mating and the supernal mating above.
Some similar arguments are advanced about sheep and goats, pigeons and doves, and perhaps other binomials. But here, I think, we have an even more problematic instance of testing on a training set with small type and token counts.