Introduction

In this corpus study, I look at modification patterns of two German nouns: “Auto” ‘car’ and “Flugzeug” ‘airplane’. In particular, I suggest relating the frequencies of different classes of adjectival modifiers of these nouns to the psychological notions of Conceptual Distinctiveness (how many obvious subcategories a concept has), object frequency in typical visual input, and recognition performance in picture matching tasks. I calculate the entropies of the distributions of adjectival modification of the two nouns in the DWDS core corpus (https://www.dwds.de/r). High entropy of the modification distribution is a good predictor of high distinctiveness and high object frequency in visual data for the concept related to the modified noun (many, even overlapping subcategories are available for the concept representing “Auto” ‘car’: wooden cars, blue cars, big cars, etc.). Low distinctiveness might be related to what is called the ‘well-establishedness of a kind’ in linguistics.

Cars might have high object frequency and are easily subcategorized with respect to many features. They are conceptually highly distinctive.

Planes, however, are rather rare. If they occur in a situation, they are often unique. If planes come in pluralities, they all look more or less the same (unless we are experts). Every plane seems to be a prototype. The objects are not (or are to a lesser extent) conceptually distinct.

Gregorová et al. (2021, pre-published psychological work, Version 1, https://psyarxiv.com/37a9q/) found a difference between “Flugzeug” ‘airplane’ and “Auto” ‘car’ in picture matching tasks where participants had to judge whether a picture matched a word and vice versa. Both nouns occur equally frequently in their data (p. 44). Interestingly, matching “Flugzeug” with the respective picture seems to be easier/faster than matching “Auto”. “Flugzeug” patterns with natural kinds like “bear” although it names an artifact. The authors relate the difference to a difference in object frequency in visual corpora: “Having fewer encounters with an object may constitute an advantage in recognizing these concepts for which we have experienced more exemplars” (Gregorová et al. 2023, 11). Moreover, Gregorová et al. relate object frequency to Conceptual Distinctiveness, a notion introduced by Konkle et al. (2010) to measure how easy it is to partition the members of a category into subcategories. High Conceptual Distinctiveness (many subcategories) interferes with the memorability of an object; objects with low Conceptual Distinctiveness (almost no subcategories) are easier to remember (Gregorová et al. 2023, 12).

From a linguistic point of view, these findings are interesting in several respects. (a) They could mean that some differences between words are basically iconic with respect to the level of visual perception, i.e. that there is a direct relation between our perceptual experience and words that goes beyond auditory perception. How we use a linguistic item might depend on what the objects of a category look like and how many (different) objects we have seen. And (b) they make a prediction about adjectival modification: a noun with low object frequency and low Conceptual Distinctiveness should be less modifiable than one with high object frequency and high Conceptual Distinctiveness.

My aim is to track this difference in distinctiveness in linguistic data. For object frequency and the related concept of Conceptual Distinctiveness, I refer to the findings of Gregorová et al. (2023). The two words “Flugzeug” and “Auto” come with similar token frequencies in SUBTLEX and in the DWDS core corpus, respectively.

I distinguish several levels of adjectival modification, loosely relating them to the ordering of adjectives within the nominal phrase discussed in the linguistic literature (Raskin and Nirenburg, NA). And I show that the distributions of the modification frequencies found in the DWDS core corpus for the two nouns differ in entropy. Whereas the distribution of the modification frequencies for “Auto” is more uniform (high entropy), since many different subcategories may be referred to with an adjective-noun combination containing “Auto”, the distribution of the modification frequencies for “Flugzeug” is skewed.

In the following sections, I show that the adjectives that modify the two nouns “Flugzeug” and “Auto” indeed differ, i.e. object frequency and Conceptual Distinctiveness can be mirrored linguistically in modification patterns. The modification patterns for “Auto” show high entropy: many subcategories are equally frequent, and the modified word names a concept that is conceptually more distinctive. The adjectives, so to speak, add distinctiveness. The modification patterns for “Flugzeug” show low (or intermediate) entropy: many modification patterns are infrequent (fewer subcategories), and the word names a concept that is conceptually less distinctive.

Furthermore, there is an interesting observation emerging from the data. There is a difference in meaning between combinations with adjectives denoting ORIGIN like “japanisch” ‘Japanese’. Such an adjective either names the location of the PRODUCER, as in “Japanese car”, or it names the POSSESSOR, as in “Japanese airplane”. Whereas it is easy to get a PRODUCER reading for “airplane”, it is not so easy for “car” to get a POSSESSOR reading with adjectives denoting relations to nations. The data in German are the same. I interpret this fact as evidence for the importance of entropy.

This investigation may, in addition, throw some light on the question of what a well-established kind is, a linguistic notion close to Conceptual Distinctiveness. A well-established kind may be one that allows only for fewer or more specialized modifications: there are few or no subcategories. The point is that this difference is grammatically relevant. The famous pair of sentences by Barbara Partee illustrates the difference in English for the noun “bottle”; it is usually explained by a difference in the type of kind. Consider the examples in (1): (1a) is grammatical, but (1b) is odd. “Coke bottle” counts as a well-established kind, with few or no subcategories; “green bottle” does not. An interesting fact is that at the time the pair of examples was coined, all Coke bottles were green, which shows that a difference in object frequency may in fact play a role: the fewer objects there are, the higher the probability that a kind is well-established.

  (1) a. The Coke bottle has a narrow neck.
      b. The green bottle has a narrow neck.

That there is a relation between what is called a (sub)category in psychology and a kind in linguistics was also investigated in my talk at DGfS 2022 in Tübingen. Whereas “Flugzeug” seems to pattern like a genuine kind, “Auto” does not. I observe the following phenomena: “Deutschland produziert zwei Autos, den BMW und den VW” ‘Germany produces two cars, the BMW and the VW’ sounds odd, whereas “Deutschland produziert nur ein Flugzeug, den Airbus” ‘Germany produces only one airplane, the Airbus’ is fine. Thinking of cars, we think of the color, size, rear windows, doors and fronts of actual cars, but not of a kind of car, not of a prototype or icon. Airplanes are not partitioned like that. We may say “Wir nehmen das Auto” ‘we take (our) car’ or “Wir nehmen das Flugzeug” ‘we take a plane’. This difference is usually discussed in connection with the notion of ‘weak definites’, where well-established kinds play a role as well (Schwarz 2014). Gregorová et al. (2023, 12) also use the notion of “more narrow categories”.


Let us start the stats: the idea is to check whether nouns have a preference for certain types of adjectives that modify them. From this, one might conclude that nouns with different preference schemas have different types of meaning (well-established or not).

Setup

Loaded packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

Credits

Tidyverse is very intuitive to use (Wickham et al. 2019).

packageVersion('tidyverse')
## [1] '2.0.0'

The first “thank you” goes to ChatGPT, accessed on Aug 30 2023. The chat bot almost never tires of answering questions about R as such and about how to quickly fix faulty R code. The idea of looking at entropy and applying it to my data is from Bodo Winter. THANK YOU. Furthermore, I thank Orin Percus, Carla Umbach and the participants of my talks for discussion of this topic.

Data loaded: the data is a table exported from the DWDS core corpus 1900–1999 (https://www.dwds.de/r) with the query (KWIC) “$p=ADJA Flugzeug”. This query produces 3301 hits. I eliminated semicolons in the context.

airpl <- read_csv("/Users/cecile/Documents/Aktuelles/DGFS2022/DatenCorpusPilot/Auto-Flugzeug-Statistik mit R/data/2Flugzeug_dwds_export_2021-08-12.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 3206 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): No.;Date;Genre;Bibl;ContextBefore;Hit;Kind;ContextAfter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sample_n(airpl,5)
## # A tibble: 5 × 1
##   `No.;Date;Genre;Bibl;ContextBefore;Hit;Kind;ContextAfter`                     
##   <chr>                                                                         
## 1 323;26.07.76;Gebrauchsliteratur;o. A. [ms]: Kfir-C 2. In: Aktuelles Lexikon 1…
## 2 2361;31.12.40;Zeitung;Archiv der Gegenwart, Bd. 10, 31.12.1940;In einem Luftk…
## 3 483;07.01.61;Zeitung;Archiv der Gegenwart, Bd. 31, 07.01.1961;Am 3. Dezember …
## 4 1917;07.11.41;Zeitung;Archiv der Gegenwart, Bd. 11, 07.11.1941;;Englische;08_…
## 5 2160;21.05.41;Zeitung;Archiv der Gegenwart, Bd. 11, 21.05.1941;Nach dem Überf…
carl <- read_csv("/Users/cecile/Documents/Aktuelles/DGFS2022/DatenCorpusPilot/Auto-Flugzeug-Statistik mit R/data/2Auto_dwds_export_2021-08-12.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 762 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): No.;Date;Genre;Bibl;ContextBefore;Hit;Kind;ContextAfter
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sample_n(carl,5)
## # A tibble: 5 × 1
##   `No.;Date;Genre;Bibl;ContextBefore;Hit;Kind;ContextAfter`                     
##   <chr>                                                                         
## 1 "210;02.10.71;Zeitung;Neues Deutschland, 02.10.1971;In der baden-württembergi…
## 2 "151;31.12.85;Belletristik;Arjouni, Jakob: Happy birthday, Türke!, Zürich: Di…
## 3 "76;22.03.96;Zeitung;Die Zeit, 22.03.1996, Nr. 13;Die Menschen packen ihre Ha…
## 4 "574;01.03.35;Zeitung;Völkischer Beobachter (Berliner-Ausgabe), 01.03.1935;Ge…
## 5 "85;31.12.94;Belletristik;Jentzsch, Kerstin: Seit die Götter ratlos sind, Mün…

Tagging the adjectives:

I sorted the hits (column 6) alphabetically and assigned each a level/category (column 7, ‘Kind’). I used categories/levels from the literature on the ordering of adjectives. It is often observed that ordinals and superlatives are placed highest in the spine of adjective ordering and relational adjectives are lowest; subsective (i.e. gradable) adjectives come somewhere in the middle, with more subjective adjectives preceding less subjective ones (Scontras, Degen, and Goodman 2017). I follow Laenzlinger’s (2000) ordering. Which order is the right one is highly debated in linguistic typology; Laenzlinger’s ordering is the most specific.

Quantity-denoting adjectives:
- ORD: superlatives, ordinals and “only”-equivalents
- CARD: cardinal numbers, indefinites like “irgendwelche” ‘any’, and “einzeln” ‘single’
- QUANT: adjectives like “ähnlich” ‘similar’ that rely on a comparison relation

Speaker-oriented adjectives:
- SPOR: like “possible”, “unknown”, “local” or “heutig” ‘today’s’
- QUAL: like “nice”

‘Internal’ physical values:
- PHYS: size, length, height, speed, depth and width

Ordinal measures:
- MEAS: weight, temperature, wetness and age

‘External’ physical properties:
- SHAPE: like “rectangular”
- COLOR: like “red”
- ORIGIN: like “Japanese”
- MAT: as in “wooden”

The latter kinds of adjectives are usually not gradable. Adjectives specifying the material of an object or its origin are discussed as relational adjectives in the linguistic literature (McNally and Boleda 2004). Under this class I also subsume adjectives with a part-whole meaning, e.g. “8-motoriges Flugzeug” ‘airplane with 8 engines’, and adjectives that deny a property, as in “propellerlos” ‘propellerless’.

ORIGIN must be kept apart from THEM. Adjectives expressing a subkind/subcategory (i.e., a brand), e.g. “japanisches Flugzeug” ‘airplane that was produced by the Japanese in Japan’, get the category ORIGIN.

THEM: adjectives expressing a possession relation, e.g. “japanisches Flugzeug” ‘airplane that belongs to the Japanese’. I used ChatGPT to find out whether an adjective denoting ORIGIN like “japanisch” may get a brand or a possessor reading, by checking whether the country of origin had an aviation industry. If a country has or had an aviation industry, I tried to find out whether the combination of an adjective expressing a connection with a country and “Auto” or “Flugzeug” is used metonymically or not. If it says “the American airplane was delivered to Germany”, “American” is an intersective adjective that signals ORIGIN; this use is not metonymic. If it says “the American airplane dropped bombs on Germany”, the adjective signals possession and can be paraphrased by ‘belongs to the Americans’. This use is usually linked to event nouns, where an adjective may contribute a theme to the event described by the noun; compare “the Italian invasion” ‘the invasion by the Italians’. The adjective occupies a thematic role signalling possession (Barker 2011).

Two issues are interesting here: (a) It is possible to have THEM and ORIGIN in the same phrase, as witnessed by “Ukrainian American and Soviet airplanes” ‘airplanes from the US and the Soviet Union owned by Ukraine’: THEM (the possessor) precedes ORIGIN (the brand). And (b) the THEM interpretation disappears if further adjectives modify the phrase: “new American airplane”, if the modification pattern is restrictive, only has the ORIGIN reading for the adjective “American”. If we switch the order, the THEM reading resurfaces (see also Seymour (1995) on possessive DPs, https://files.eric.ed.gov/fulltext/ED383174.pdf). An exception is “former”, as in “former American airplane”. This phrase is ambiguous: it refers either to some airplane that used to belong to the US or to a collection of things that used to be an airplane of American origin. But it seems impossible for “former” to target “American” alone in the ORIGIN reading. In that sense, ORIGIN is close to inalienable possession and THEM is close to alienable possession (see Aikhenvald (2015) for the difference).

Participles, PART_PERF (passive participles) and PART_PRÄS (active participles), locate the situation of a nominal referent in the past or present and add information about the situation. They are usually not included in the discussion of adjectival order; Cinque (2009) is an exception. For him, participles are reduced relative clauses located higher in the tree of possible adjunctions, below CARD and above PHYS. I leave it at that and position the participles below QUAL.

NA is used if the hit is not an adjective, e.g. “z.B.” ‘for example’ or “Anjas”, a possessive.

That is, I use the following labels:

ORD > CARD > QUANT > SPOR > QUAL = PART = THEM > PHYS > MEAS > SHAPE > COLOR > ORIGIN > MAT and NA

I did the coding myself and all mistakes are therefore mine.

Cleaning up the files

The csv files had only one column; I separated it into its eight columns. There were warning messages that I ignored.

airpl <- separate(airpl, 'No.;Date;Genre;Bibl;ContextBefore;Hit;Kind;ContextAfter', into = c("No.","Date","Genre","Bibl","ContextBefore","Hit","Kind" ,"ContextAfter"), sep = ";")
## Warning: Expected 8 pieces. Missing pieces filled with `NA` in 3 rows [3204, 3205,
## 3206].
sample_n(airpl,5)
## # A tibble: 5 × 8
##   No.   Date     Genre   Bibl             ContextBefore Hit   Kind  ContextAfter
##   <chr> <chr>    <chr>   <chr>            <chr>         <chr> <chr> <chr>       
## 1 3081  19.07.33 Zeitung Archiv der Gege… Als Höchstle… gift… 07_P… Flugzeugen …
## 2 3043  16.05.37 Zeitung Deutsche Volksz… Die beste Ar… deut… 08_T… Flugzeugen …
## 3 1369  23.02.43 Zeitung Archiv der Gege… Am gestrigen… fein… 08_T… Flugzeuge P…
## 4 2949  11.12.39 Zeitung Archiv der Gege… Auf See kein… Fein… 08_T… Flugzeuge b…
## 5 2617  14.09.40 Zeitung Archiv der Gege… Zwei          eige… 08_T… Flugzeuge w…
carl <- separate(carl, 'No.;Date;Genre;Bibl;ContextBefore;Hit;Kind;ContextAfter', into = c("No.","Date","Genre","Bibl","ContextBefore","Hit","Kind" ,"ContextAfter"), sep = ";")
## Warning: Expected 8 pieces. Additional pieces discarded in 37 rows [2, 30, 36, 38, 40,
## 59, 64, 71, 96, 99, 134, 168, 228, 232, 236, 241, 260, 271, 278, 307, ...].
## Warning: Expected 8 pieces. Missing pieces filled with `NA` in 3 rows [760,
## 761, 762].
sample_n(carl, 5)
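The warnings above show that some rows had extra or missing semicolons. tidyr’s separate() can make the handling of such rows explicit via its extra and fill arguments; a minimal sketch on toy data (the data frame df and its values are hypothetical, not from the corpus):

```r
library(tidyr)

# Toy data: one row with too many pieces, one with too few.
df <- data.frame(x = c("1;first;extra;tail", "2"))

# extra = "merge" keeps overflow text in the last column instead of
# discarding it; fill = "right" pads short rows with NA on the right.
separate(df, x, into = c("No.", "Hit"), sep = ";",
         extra = "merge", fill = "right")
```

With these arguments set, separate() emits no warnings, so nothing is silently dropped.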
airpl$Kind <- factor(airpl$Kind)
levels(airpl$Kind)
##  [1] "01_ORD"       "02_CARD"      "03_QUANT"     "04_SPOR"      "05_QUAL"     
##  [6] "06_PART_PERF" "07_PART_PRÄS" "08_THEM"      "09_PHYS"      "10_MEAS"     
## [11] "12_COLOR"     "13_ORIGIN"    "14_MAT"       "15_NA"

Same for cars:

carl$Kind <- factor(carl$Kind)
levels(carl$Kind)
##  [1] "01_ORD"       "02_CARD"      "03_QUANT"     "04_SPOR"      "05_QUAL"     
##  [6] "06_PART_PERF" "07_PART_PRÄS" "08_THEM"      "09_PHYS"      "10_MEAS"     
## [11] "11_SHAPE"     "12_COLOR"     "13_ORIGIN"    "14_MAT"       "15_NA"

The result was a nice table, as in Excel. Then I selected the columns that interest me.

airpl <- select(airpl, No., Hit, Kind)
sample_n(airpl,5)
## # A tibble: 5 × 3
##   No.   Hit              Kind   
##   <chr> <chr>            <fct>  
## 1 203   britische        08_THEM
## 2 2483  feindliche       08_THEM
## 3 1334  Deutsche         08_THEM
## 4 671   sowjetrussisches 08_THEM
## 5 1567  britische        08_THEM

Same for cars:

carl <- select(carl, No., Hit, Kind)

I thought about throwing out the participles. But they also locate the object a noun refers to locally and temporally, so I keep them for now. Maybe I will reconsider this decision later.

# airpl <- airpl %>%
#   filter(Kind != "06_PART_PERF") %>%
#   filter(Kind != "07_PART_PRÄS")

Evaluating the data: Frequencies

Let us count the occurrences of the different levels in the first step.

Freq_airpl <- table(airpl$Kind)
print(as.data.frame(Freq_airpl))
##            Var1 Freq
## 1        01_ORD  164
## 2       02_CARD   29
## 3      03_QUANT   11
## 4       04_SPOR   31
## 5       05_QUAL   25
## 6  06_PART_PERF  179
## 7  07_PART_PRÄS  107
## 8       08_THEM 2502
## 9       09_PHYS   15
## 10      10_MEAS   36
## 11     12_COLOR    3
## 12    13_ORIGIN   27
## 13       14_MAT   73
## 14        15_NA    1

It turns out that THEM is the most frequent level of adjective uses in combination with “Flugzeug” ‘airplane’. This finding may be visualized. The distribution of adjectival modification is skewed.

plot_data <- as.data.frame(Freq_airpl)

ggplot(plot_data, aes(x = Var1, y = Freq)) +
  theme_classic() +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Factor Levels") +
  ylab("Observed Frequency") +
  ggtitle("Frequency of Factor Levels")

The maximal frequency, 2502, belongs to THEM, i.e. THEM is the mode of the distribution. This means that we wouldn’t be surprised if an adjective that signals (alienable) POSSESSION accompanies the noun “Flugzeug”. In other words, “Flugzeug” has a preference for possession: the objects are related to countries, armies or companies.

We may calculate the frequency of the mode, the level that is tendentially central in the distribution of adjectives modifying “Flugzeug”. In this case, the mode is a good predictor of which kind of adjective is used with the noun.

mode <- max(plot_data$Freq)  # the frequency of the modal level, not the level itself
print(mode)
## [1] 2502
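Note that max(plot_data$Freq) returns the frequency of the mode, not the modal level itself. A minimal sketch of extracting the level name with which.max(), on toy counts mimicking an excerpt of the table above:

```r
# Toy frequency table (hypothetical excerpt of the counts above):
plot_data <- data.frame(
  Var1 = c("01_ORD", "08_THEM", "12_COLOR"),
  Freq = c(164, 2502, 3)
)

# which.max() gives the index of the largest count; indexing Var1 with it
# yields the modal level rather than its frequency.
mode_level <- as.character(plot_data$Var1[which.max(plot_data$Freq)])
print(mode_level)
## [1] "08_THEM"
```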

The total count (3203) differs from the 3301 hits reported by DWDS: only 3206 rows were read in (see the parsing warnings above), and the three rows whose missing pieces were filled with NA are not counted in the frequency table.

total_count <- sum(Freq_airpl)
print(total_count)
## [1] 3203

Finally, let us calculate the probabilities for each level. They are values from the interval [0,1] and measure the relative frequency with which a certain kind of adjective occurs. THEM has a high probability.

airpl <- airpl %>% 
  group_by(Kind) %>% 
  # note: this overwrites the raw table with the per-level summary
  summarise(Percentage = n() / nrow(airpl))
print(airpl, n = 22)
## # A tibble: 15 × 2
##    Kind         Percentage
##    <fct>             <dbl>
##  1 01_ORD         0.0512  
##  2 02_CARD        0.00905 
##  3 03_QUANT       0.00343 
##  4 04_SPOR        0.00967 
##  5 05_QUAL        0.00780 
##  6 06_PART_PERF   0.0558  
##  7 07_PART_PRÄS   0.0334  
##  8 08_THEM        0.780   
##  9 09_PHYS        0.00468 
## 10 10_MEAS        0.0112  
## 11 12_COLOR       0.000936
## 12 13_ORIGIN      0.00842 
## 13 14_MAT         0.0228  
## 14 15_NA          0.000312
## 15 <NA>           0.000936
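The same relative frequencies can be obtained directly from a count table with base R’s prop.table(), and they should sum to 1, since every hit falls into exactly one level. A small sketch with toy counts (hypothetical values):

```r
# Toy counts (hypothetical excerpt of the frequency table):
counts <- c(ORD = 164, THEM = 2502, COLOR = 3)

# prop.table() divides each count by the total:
probs <- prop.table(counts)
round(probs, 3)

# Sanity check: the probabilities sum to 1.
stopifnot(abs(sum(probs) - 1) < 1e-9)
```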

All these calculations are preliminaries to the calculation of the entropy of our distribution.

And we repeat the procedure for the data on “Auto” ‘car’. We produce a frequency table.

Freq_carl <- table(carl$Kind)
print(as.data.frame(Freq_carl))
##            Var1 Freq
## 1        01_ORD   32
## 2       02_CARD    8
## 3      03_QUANT   52
## 4       04_SPOR   16
## 5       05_QUAL  106
## 6  06_PART_PERF  135
## 7  07_PART_PRÄS  118
## 8       08_THEM   62
## 9       09_PHYS   48
## 10      10_MEAS   91
## 11     11_SHAPE    1
## 12     12_COLOR   42
## 13    13_ORIGIN   20
## 14       14_MAT   15
## 15        15_NA   13

We visualize the data. The distribution of the frequencies is more uniform.

plot_data2 <- as.data.frame(Freq_carl)

ggplot(plot_data2, aes(x = Var1, y = Freq)) +
  theme_classic() +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Factor Levels") +
  ylab("Observed Frequency") +
  ggtitle("Frequency of Factor Levels")

And we add the mode, a total count and the respective probabilities:

mode <- max(plot_data2$Freq)
print(mode)
## [1] 135

The total count is lower for “Auto” than for “Flugzeug”.

total_count <- sum(Freq_carl)
print(total_count)
## [1] 759

Finally, let us calculate the probabilities for each level once more for “Auto”.

carl <- carl %>% 
  group_by(Kind) %>% 
  summarise(Percentage = n() / nrow(carl))
print(carl, n = 22)
## # A tibble: 16 × 2
##    Kind         Percentage
##    <fct>             <dbl>
##  1 01_ORD          0.0420 
##  2 02_CARD         0.0105 
##  3 03_QUANT        0.0682 
##  4 04_SPOR         0.0210 
##  5 05_QUAL         0.139  
##  6 06_PART_PERF    0.177  
##  7 07_PART_PRÄS    0.155  
##  8 08_THEM         0.0814 
##  9 09_PHYS         0.0630 
## 10 10_MEAS         0.119  
## 11 11_SHAPE        0.00131
## 12 12_COLOR        0.0551 
## 13 13_ORIGIN       0.0262 
## 14 14_MAT          0.0197 
## 15 15_NA           0.0171 
## 16 <NA>            0.00394

Evaluating the data: Entropy

Entropy is part of descriptive statistics; here it serves as a measure of how typical the mode is for the distribution obtained. It is also used as a measure of disorder (for molecules), uncertainty (where to locate molecules), surprise (about new information) and non-typicality. High normalized entropy (closer to 1) means that the mode is not very informative about what is normal: levels other than the one with the maximum count may have similar probabilities. Low normalized entropy (closer to 0) means that the mode is highly informative.

Let us calculate the entropy of the levels of modification, i.e. how evenly (how close to uniformly) the different kinds of adjectives are distributed in combination with the two nouns.

We may think of the entropy as telling us how surprised we are to see a certain adjective, depending on how probable it is that the adjective occurs. And it should be clear that we would be more surprised to see an adjective that triggers an inference other than possession in combination with “Flugzeug” ‘airplane’. Therefore we expect a low entropy for the levels. On the other hand, we might be less surprised if any adjective occurs with “Auto” ‘car’, and we expect a high entropy for adjectives in combination with “Auto”.

The hint to calculate the entropy (for data science) and to check for differences in the distribution of adjectival modification came from Bodo Winter. Bodo led a workshop at Frankfurt University in connection with ViCom (https://vicom.info) and introduced us to tidyverse (and brms). For an explanation of what entropy is, I follow Josh Starmer’s StatQuest video (https://www.youtube.com/watch?v=YtebGVx-Fxw), which helped me a lot to grasp the concept. Entropy is also explained in Gries (2021); Section 3.1.1.2 covers Dispersion / Normalized Entropy (p. 94f.).

Surprise is measured as the log of the inverse of the probability with which an event occurs; events are occurrences of adjectival modifiers. The normalized entropy lies between 0 and 1 and indicates varying degrees of uncertainty (ChatGPT, p.c.): depending on the modified noun, the use of adjectives from some classes is not surprising at all, while the use of others is very surprising. With “Flugzeug” there is little uncertainty about which adjectives to use: the majority of modifications carry the label THEM, i.e. adjectives expressing to whom the object(s) referred to belong(s). This intuition is represented by a low entropy value. With “Auto” there is more uncertainty about which properties the objects have; they could have many properties. The choice of adjectives for modification is more uniformly distributed, and it is more difficult to predict which category of adjectives could actually be used. (What remains to be done here is checking whether all the modifications are in fact restrictive.)
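To make the notion concrete, surprise in bits is log2(1/p); a toy sketch (the helper surprise() is mine, not part of the analysis above):

```r
# Surprise (in bits) of an event with probability p:
surprise <- function(p) log2(1 / p)

# Seeing a THEM adjective with "Flugzeug" (p close to 0.78) is barely
# surprising; seeing a COLOR adjective (p close to 0.001) is very surprising:
surprise(0.78)   # about 0.36 bits
surprise(0.001)  # about 9.97 bits
```

Entropy is then just the average surprise, weighted by the probabilities.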

entr_airpl <- airpl %>% 
  # normalized entropy for "Flugzeug": raw Shannon entropy divided by
  # log2 of the number of categories (14 factor levels plus the <NA> row)
  summarize(Entropy = sum(airpl$Percentage * log2(1/airpl$Percentage)) / log2(length(levels(airpl$Kind))+1))
print(entr_airpl$Entropy[1])
## [1] 0.3627374
entr_carl <- carl %>% 
  # normalized entropy for "Auto": divided by log2 of the 15 factor levels
  summarize(Entropy = sum(carl$Percentage * log2(1/carl$Percentage)) / log2(length(levels(carl$Kind))))
print(entr_carl$Entropy[1])
## [1] 0.8787849
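For comparison, here is a self-contained sketch of normalized Shannon entropy that treats any count vector the same way (the helper norm_entropy() is hypothetical, not the code used above; it normalizes by log2 of the number of categories in its input):

```r
# Normalized Shannon entropy of a vector of counts, in [0, 1]:
norm_entropy <- function(freq) {
  p <- freq / sum(freq)   # relative frequencies
  p <- p[p > 0]           # convention: 0 * log2(0) = 0
  -sum(p * log2(p)) / log2(length(freq))
}

# A uniform distribution has normalized entropy 1; a maximally skewed
# distribution (one dominant category, as with THEM) approaches 0:
norm_entropy(c(10, 10, 10, 10))  # exactly 1
norm_entropy(c(997, 1, 1, 1))    # close to 0
```

Wrapping the normalization in one function also guards against applying different denominators to the two nouns.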

Conclusion

I would like to suggest, further, that entropy in the sense demonstrated here is a measure of the well-establishedness of a kind and of Conceptual Distinctiveness (or ‘narrow categories’), and that it relates to object frequency. The smaller the set of observed objects (or the less variation in what they look like), the more likely it is that we are dealing with a well-established kind.

References

Aikhenvald, Alexandra Y. 2015. “Possession and Ownership: A Cross-Linguistic Perspective.” In Possession and Ownership, edited by Alexandra Y. Aikhenvald and R. M. W. Dixon, 1–65. Bodmin; King’s Lynn: MPG Books Group.
Barker, Chris. 2011. “Possessives and Relational Nouns.” In Semantics: An International Handbook of Natural Language Meaning, edited by Klaus von Heusinger, Claudia Maienborn, and Paul Portner, 2:1109–30. Berlin: De Gruyter.
Cinque, Guglielmo. 2009. The Syntax of Adjectives. Linguistic Inquiry Monograph. MIT Press.
Gregorová, Klara, Jacopo Turini, Benjamin Gagl, and Melissa Le-Hoa Võ. 2021. “Access to Meaning from Visual Input: Object and Word Frequency Effects in Categorization Behavior.” Version 1. Preprint, Frankfurt University.
———. 2023. “Access to Meaning from Visual Input: Object and Word Frequency Effects in Categorization Behavior.” Journal of Experimental Psychology General Online Publication (May).
Gries, Stefan Th. 2021. Statistics for Linguistics with R. 3rd ed. De Gruyter Mouton.
Konkle, Talia, Timothy F. Brady, George A. Alvarez, and Aude Oliva. 2010. “Conceptual Distinctiveness Supports Detailed Visual Long-Term Memory for Real-World Objects.” Journal of Experimental Psychology: General 139 (3): 558–78.
Laenzlinger, Christopher. 2000. “French Adjective Ordering: Perspectives on DP-Internal Movement Types.” Generative Grammar in Geneva 1: 55–104.
McNally, Louise, and Gemma Boleda. 2004. “Relational Adjectives as Properties of Kinds.” In Empirical Issues in Formal Syntax and Semantics 5, 5:179–96. CNRS.
Raskin, Viktor, and Sergei Nirenburg. NA. “Lexical Semantics of Adjectives: A Microtheory of Adjectival Meaning.” New Mexico State University: Computing Research Laboratory.
Schwarz, Florian. 2014. “How Weak and How Definite Are Weak Definites?” In Weak Referentiality, edited by Ana Aguilar-Guevara, Bert Le Bruyn, and Joost Zwarts. Linguistik Aktuell/Linguistics Today 219. John Benjamins Publishing Company.
Scontras, Gregory, Judith Degen, and Noah D. Goodman. 2017. “Subjectivity Predicts Adjective Ordering Preferences.” Open Mind: Discoveries in Cognitive Science 1 (1): 53–65.
Seymour, Deborah Mandelbaum. 1995. “New Woman’s Suitcases: The Possessive Adjective Switch. LSA Annual Meeting.” US Department of Education.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686.