Vagueness and Learning: A Type-Theoretic Approach

We present a formal account of the meaning of vague scalar adjectives such as 'tall', formulated in Type Theory with Records. Our approach makes precise how perceptual information can be integrated into the meaning representation of these predicates; how an agent evaluates whether an entity counts as tall; and how the proposed semantics can be learned and dynamically updated through experience.


Introduction
Traditional semantic theories such as those described in Partee (1989) and Blackburn and Bos (2005) offer precise accounts of the truth-conditional content of linguistic expressions, but do not deal with the connection between meaning, perception and learning. One can argue, however, that part of getting to know the meaning of linguistic expressions consists in learning to identify the individuals or the situations that the expressions can describe. For many concrete words and phrases, this identification relies on perceptual information. In this paper, we focus on characterising the meaning of vague scalar adjectives such as 'tall', 'dark', or 'heavy'. We propose a formal account that brings together notions from traditional formal semantics with perceptual information, which allows us to specify how a logic-based interpretation function is determined and modified dynamically by experience.
The need to integrate language and perception has been emphasised by researchers working on the generation and resolution of referring expressions (Kelleher et al., 2005; Reiter et al., 2005; Portet et al., 2009) and, perhaps even more strongly, in the field of robotics, where grounding language in perceptual information is critical to allow artificial agents to autonomously acquire and verify beliefs about the world (Siskind, 2001; Steels, 2003; Roy, 2005; Skocaj et al., 2010). Most of these approaches, however, do not build on theories of formal semantics for natural language. Here we choose to formalise our account in a theoretical framework known as Type Theory with Records (TTR), which has been shown to be suitable for formalising classic semantic aspects such as intensionality, quantification, and negation (Cooper, 2005a; Cooper, 2010; Cooper and Ginzburg, 2011) as well as less standard phenomena such as linguistic interaction (Ginzburg, 2012; Purver et al., 2014), perception and action (Dobnik et al., 2013), and semantic coordination and learning (Larsson, 2009). In this paper we use TTR to put forward an account of the semantics of vague scalar predicates like 'tall' that makes precise how perceptual information can be integrated into their meaning representation; how an agent evaluates whether an entity counts as tall; and how the proposed semantics for these expressions can be learned and dynamically updated through language use.
We start by giving a brief overview of TTR and explaining how it can be used for classifying entities as being of particular types integrating perceptual information. After that, in Section 3, we describe the main properties of vague scalar predicates. Section 4 presents a probabilistic TTR formalisation of the meaning of 'tall', which captures its context-dependence and its vague character. In Section 5, we then offer an account of how that meaning representation is acquired and updated with experience. Finally, in Section 6 we discuss related work, before concluding in Section 7.

Meaning as Classification in TTR
In this section we give a brief and hence inevitably partial introduction to Type Theory with Records. For more comprehensive introductions, we refer the reader to Cooper (2005b) and Cooper (2012).

Type Theory with Records: Main Notions
As in any type theory, the most central notion in TTR is that of a judgement that an object a is of type T, written a : T. In TTR, judgements are seen as fundamentally related to perception, in the sense that perceiving inherently involves categorising what we perceive. Some common basic types in TTR are Ind (the type of individuals) and R+ (the type of positive real numbers). All basic types are members of a special type Type. Given types T1 and T2, we can create the function type T1 → T2, whose domain is the set of objects of type T1 and whose range is the set of objects of type T2. Types can also be constructed from predicates and objects, P(a1, ..., an). Such types are called ptypes and correspond roughly to propositions in first-order logic. In TTR, propositions are types of proofs, where proofs can be a variety of things, from situations to sensor readings (more on this below).
Next, we introduce records and record types. These are structured objects made up of pairs ⟨ℓ, v⟩ of labels and values that are displayed in a matrix:

(1) a. A record type:

       [ ℓ1 : T1
         ℓ2 : T2(ℓ1)
         ...
         ℓn : Tn(ℓ1, ℓ2, ..., ℓn−1) ]

    b. A record:

       [ ℓ1 = a1
         ℓ2 = a2
         ...
         ℓn = an ]

The record in (1b) is of the record type in (1a) if and only if a1 : T1, a2 : T2(a1), ..., and an : Tn(a1, a2, ..., an−1). Note that the record may contain more fields but would still be of type (1a) if the typing condition holds. Records and record types can be nested, so that the value of a label is itself a record (or record type). We can use paths within a record or record type to refer to specific bits of structure: for instance, we can use r.ℓ2 to refer to a2 in (1b).
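The typing condition just stated can be illustrated with a small Python sketch. This is purely illustrative and not part of the formalism: records are modelled as dicts, and the field types of a record type as checking functions that may consult earlier fields (all names and stand-in "proof" values below are our own invention).

```python
def of_type(record, record_type):
    """Check r : T field by field, as in (1): each field's check may depend
    on the values of earlier fields, and extra fields in the record are
    allowed without affecting the judgement."""
    for label, check in record_type.items():
        if label not in record or not check(record[label], record):
            return False
    return True

# Hypothetical encoding of (2): x : Ind, c_man : man(x), c_run : run(x),
# with tuples like ("man", a) standing in for proofs of ptypes.
record_type = {
    "x":     lambda v, r: isinstance(v, str),      # v : Ind
    "c_man": lambda v, r: v == ("man", r["x"]),    # proof of man(x)
    "c_run": lambda v, r: v == ("run", r["x"]),    # proof of run(x)
}
record = {"x": "a", "c_man": ("man", "a"), "c_run": ("run", "a")}
print(of_type(record, record_type))   # True
```

Note that a record with additional fields still passes the check, mirroring the subtyping behaviour of TTR record types.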
As can be seen in (1a), the labels ℓ1, ..., ℓn in a record type can be used elsewhere to refer to the values associated with them. This is a common way of constructing ptypes where the arguments of a predicate are entities that have been introduced before in the record type. A sample record and record type are shown in (2).
(2) The record
      [ x     = a
        c_man = prf(man(a))
        c_run = prf(run(a)) ]
    is of the record type
      [ x     : Ind
        c_man : man(x)
        c_run : run(x) ]

In (2), a is an entity of type Ind and prf(P) is used as a placeholder for proofs of ptypes P. In the record type above, the ptypes man(x) and run(x) constructed from predicates are dependent on x (introduced earlier in the record type).

Perceptual Meaning
Larsson (2013) proposes a system formalised in TTR where some perceptual aspects of meaning are represented using classifiers. For example, the meaning of 'right' (as in 'to the right of') involves a two-input perceptron classifier κ_right(w, t, r), specified by a weight vector w and a threshold t, which takes as input a context r including an object x and a position-sensor reading sr_pos. The sensor reading consists of a vector containing two real numbers representing the space coordinates of x. The classifier classifies x as either being to the right on a plane or not.

(3) if r : [ x      : Ind
             sr_pos : RealVector ],

    then κ_right(w, t, r) = [ c_right : right(r.x) ]     if (r.sr_pos · w) > t
                            [ c_right : ¬right(r.x) ]    otherwise

As output we get a record type containing either a ptype right(x) or its negation, ¬right(x). Larsson (2013) proposes that readings from sensors may count as proofs of such ptypes. A classifier can be used for judging x as being of a particular type on the grounds of perceptual information. A perceptual proof for right(x) would thus include the output from the position sensor that is directed towards x. Here, this output would be the space coordinates of x.
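A minimal Python sketch of such a perceptron classifier follows. The function and field names (kappa_right, sr_pos) and the particular weights are illustrative assumptions; in practice w and t would be trained from data, and the tuples returned merely stand in for the ptypes of the formalism.

```python
def kappa_right(w, t, r):
    """Two-input perceptron for 'right', after (3): compare the dot product
    of the position-sensor reading r['sr_pos'] (the space coordinates of x)
    with the weight vector w against the threshold t."""
    dot = sum(wi * vi for wi, vi in zip(w, r["sr_pos"]))
    if dot > t:
        return ("right", r["x"])        # stand-in for the ptype right(x)
    return ("not_right", r["x"])        # stand-in for ¬right(x)

ctx = {"x": "obj1", "sr_pos": [0.9, 0.2]}   # x and its sensed coordinates
print(kappa_right([1.0, 0.0], 0.5, ctx))    # ('right', 'obj1')
```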

Vague Scalar Predicates
Scalar predicates such as 'tall', 'long' and 'expensive', also called "relative gradable adjectives" (Kennedy, 2007), are interpreted with respect to a scale, i.e., a dimension such as height, length, or cost along which entities for which the relevant dimension is applicable can be ordered. This makes scalar predicates compatible with degree morphology, like comparative and superlative morphemes ('taller than', 'the longest') and intensifier morphemes such as 'very' or 'quite'. In this paper, our focus is on the so-called positive form of these adjectives (e.g. 'tall' as opposed to 'taller' or 'tallest').
A property that distinguishes the positive form from the comparative and the superlative forms is its context-dependence. To take a common example: if Sue's height is 180cm, she may be appropriately described as a tall woman, but probably not as a tall basketball player. Thus, what counts as tall can vary from context to context, with the most relevant contextual parameter being a comparison class relative to which the adjective is interpreted (e.g., the set of women, the set of basketball players, etc.). In addition to being context-dependent, positive-form scalar predicates are also vague, in the sense that they give rise to borderline cases, i.e., entities for which it is unclear whether the predicate holds or not.
Vagueness is certainly a property that affects most natural language expressions, not only scalar adjectives. However, scalar adjectives have a relatively simple semantics (they are often unidimensional) and thus constitute a perfect case study for investigating the properties and effects of vagueness on language use. Gradable adjectives have received a great deal of attention in the formal semantics literature. It is common to distinguish between two main approaches to their semantics: delineation-based and degree-based approaches. The delineation approach is associated with the work of Klein (1980), who proposes that gradable adjectives denote partial functions dependent on a comparison class. They partition the comparison class into three disjoint sets: a positive extension, a negative extension, and an extension gap (entities for which the predicate is neither true nor false). In contrast, degree-based approaches assume a measure function m mapping individuals x to degrees on a particular scale (degrees of height, degrees of darkness, etc.) and a standard of comparison or degree threshold θ (again, dependent on a comparison class) such that x belongs to the adjective's denotation if m(x) > θ (Kamp, 1975; Pinkal, 1979; Pinkal, 1995; Barker, 2002; Kennedy and McNally, 2005; Kennedy, 2007; Solt, 2011; Lassiter, 2011).
We build on degree approaches but adopt a perception-based perspective and take a step further to formalise how the meaning of these predicates can be learned and constantly updated through language use.

A Perceptual Semantics for 'Tall'
To exemplify our approach, we will use the scalar predicate 'tall' throughout.

Context-sensitivity
We first focus on capturing the context-dependence of relative scalar predicates. For this we define a type T_ctxt as follows:

(4) T_ctxt = [ c : Type
               x : c
               h : R+ ]

The context (ctxt) of a scalar predicate like 'tall' is a record of the type in (4), which includes: a type c (typically a subtype of Ind) representing the comparison class; an individual x within the comparison class (the argument of tall); and a perceived measure on the relevant scale, in this case the perceived height h of x, expressed as a positive real number.
The context presupposes the acquisition of sensory input from the environment. In particular, it assumes that an agent using such a representation is able to classify the entity in focus x as being of type c and is able to use some height sensor to obtain an estimate of x's height (the value of h is the sensor reading). We thus forgo the inclusion of an abstract measure function in the representation. In an artificial agent, this may be accomplished by image processing software for detecting and measuring objects in a digital image.
Besides the ctxt, we also assume a standard threshold of tallness θ_tall of the type given in (5). θ_tall is a function from a type specifying a comparison class to a height value, which corresponds to a tallness threshold for that comparison class. (In Section 5 we will discuss how such a threshold may be computed.)

(5) θ_tall : Type → R+

The meaning of 'tall' involves a classifier for tallness, κ_tall, of the following type:

(6) κ_tall : (Type → R+, T_ctxt) → RecType

We define this classifier as a one-input perceptron that compares the perceived height h of an individual x to the relevant threshold θ determined by a comparison class c. Thus, if θ : Type → R+ and r : T_ctxt, then:

(7) κ_tall(θ, r) = [ c_tall : tall(r.x) ]     if r.h > θ(r.c)
                   [ c_tall : ¬tall(r.x) ]    otherwise

Simplifying somewhat, we can represent the meaning of 'tall', tall, as a record specifying the type of context (T_ctxt) where an utterance of 'tall' can be made, the parameter of the tallness classifier (the threshold θ), and a function f which is applied to the context to produce the content of 'tall'.
The output of the function f is an Austinian proposition (Cooper, 2005b): a judgement that a situation (sit, represented as a record r of type T_ctxt) is of a particular type (specified in sit-type). In the case of tall, the context of utterance (which instantiates r) is judged to be of the type where there is an individual x which is either tall or not tall, according to the output of the classifier κ_tall. The context of utterance in the sit field will include the height-sensor reading, which means that the sensor reading is part of the proof of the sit-type indicating that x is tall (or not, as the case may be). Thus, to decide whether to refer to some individual x as tall or to evaluate someone else's utterance describing x as tall, an agent applies the function tall.f to the current situation, represented as a record r : T_ctxt. As an example, let us consider a situation that includes the following context:

(8) ctxt = [ c = Human
             x = john_smith
             h = 1.88 ]

Let us assume that given the comparison class Human, θ_tall(Human) = 1.87. In this case, tall.f(ctxt) will compute as shown in (9):

(9) tall.f(ctxt) = [ sit      = ctxt
                     sit-type = [ c_tall : tall(john_smith) ] ]

The resulting Austinian proposition corresponds to the agent's judgement that the situation in sit is one where John Smith counts as tall.
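The crisp classification step can be rendered as a short Python sketch. The names (kappa_tall, theta_tall) and the particular heights are illustrative; tuples again stand in for ptypes.

```python
def kappa_tall(theta, r):
    """Crisp tallness classifier: r['x'] counts as tall iff the perceived
    height r['h'] exceeds the threshold theta(c) for the comparison class."""
    if r["h"] > theta(r["c"]):
        return ("tall", r["x"])         # stand-in for the ptype tall(x)
    return ("not_tall", r["x"])         # stand-in for ¬tall(x)

theta_tall = {"Human": 1.87}.get        # threshold per comparison class
ctxt = {"c": "Human", "x": "john_smith", "h": 1.88}
print(kappa_tall(theta_tall, ctxt))     # ('tall', 'john_smith')
```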

Vagueness
According to the above account, 'tall' has a precise interpretation: given a degree of height and a comparison class, the threshold sharply determines whether tall applies or not. There are several ways in which one can account for vagueness, amongst others by introducing perceptual uncertainty (possibly inaccurate sensor readings). Here, in line with Lassiter (2011), we opt for substituting the precise threshold with a noisy, probabilistic threshold. We consider the threshold to be a normal random variable, which can be represented by the parameters of its Gaussian distribution: the mean µ and the standard deviation σ (the noise width). To incorporate this modification into our approach, we update the tallness classifier κ_tall we had defined in (6) so that it now takes as parameters µ_tall and σ_tall, both of them dependent on the comparison class and hence of type Type → R+. The output of the classifier is now a probability rather than a ptype such as tall(x) or ¬tall(x). Before indicating how this probability is computed, we give the type of the vague version of the classifier in (10) and the vague representation of the meaning of 'tall' in (11).
(10) κ_tall : (Type → R+, Type → R+, T_ctxt) → [0, 1]

(11) tall = [ ctxt-type = T_ctxt
              µ         = µ_tall
              σ         = σ_tall
              f         = λr : T_ctxt . [ sit      = r
                                          sit-type = [ c_tall : tall(r.x) ]
                                          prob     = κ_tall(µ, σ, r) ] ]

The output of the function tall.f is now a probabilistic Austinian proposition (Cooper et al., 2014). Like before, the proposition expresses a judgement that a situation sit is of a particular type. But here the judgement is probabilistic: it encodes the belief of an agent concerning the likelihood that sit is of a type where x counts as tall.
Since we take the noisy threshold to be a normal random variable, given a particular µ and σ, we can calculate the probability that the height r.h of individual r.x counts as tall as follows:

    κ_tall(µ, σ, r) = 1/2 · [1 + erf((r.h − µ(r.c)) / (σ(r.c) · √2))]

Here erf is the error function, defined as

    erf(x) = (2/√π) ∫_0^x e^(−t²) dt

The error function defines a sigmoid shape (see Figure 1), in line with the upward monotonicity of 'tall'. The output of κ_tall(µ, σ, r) corresponds to the probability that h will exceed the normal random threshold with mean µ and deviation σ. Let us consider an example. Assume that we have µ_tall(Human) = 1.87 and σ_tall(Human) = 0.05 (see Section 5.1 below for justification of the latter value). Let us also assume the same ctxt as above in (8). In this case, tall.f(ctxt) will compute as in (12):

(12) tall.f(ctxt) = [ sit      = ctxt
                      sit-type = [ c_tall : tall(john_smith) ]
                      prob     = 0.579 ]

This probability can now be used in further probabilistic reasoning, to decide whether to refer to an individual x as tall, or to evaluate someone else's utterance describing x as tall. For example, an agent may map different probabilities to different adjective qualifiers of tallness to yield compositional phrases such as 'sort of tall', 'quite tall', 'very tall', 'extremely tall', etc. The meanings of these composed adjectival phrases could specify probability ranges trained independently. Compositionality for vague perceptual meanings, and the interaction between compositionality and learning, is an exciting area for future research.
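The computation in the running example can be reproduced with Python's standard math.erf. As before, the function and parameter names (kappa_tall_vague, mu_tall, sigma_tall) are our own illustrative choices; only the formula and the example values come from the text.

```python
import math

def kappa_tall_vague(mu, sigma, r):
    """Vague tallness classifier: probability that r['h'] exceeds a Gaussian
    threshold with mean mu(c) and standard deviation sigma(c), computed as
    0.5 * (1 + erf((h - mu) / (sigma * sqrt(2))))."""
    m, s = mu(r["c"]), sigma(r["c"])
    return 0.5 * (1.0 + math.erf((r["h"] - m) / (s * math.sqrt(2))))

mu_tall    = {"Human": 1.87}.get        # mean threshold per comparison class
sigma_tall = {"Human": 0.05}.get        # noise width per comparison class
ctxt = {"c": "Human", "x": "john_smith", "h": 1.88}
print(round(kappa_tall_vague(mu_tall, sigma_tall, ctxt), 3))   # 0.579
```

A height exactly at the mean threshold yields probability 0.5, and the probability rises smoothly with height, reflecting the sigmoid shape of the error function.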

Learning from Language Use
In this section we consider possibilities for computing the noisy threshold we have introduced in the previous section and discuss how such a threshold and the probabilistic judgements it gives rise to are updated with language use.

Computing the Noisy Threshold
We assume that agents keep track of judgements made by other agents. More concretely, for a vague scalar predicate like 'tall', we assume that an agent has at its disposal a set of observations consisting of entities of a particular type T (a comparison class such as Human) that have been judged to be tall, together with their observed heights. Judgements of tallness may vary across individuals; indeed, such variation (both inter- and intra-individual) is a hallmark of vague predicates. We use Ω_tall^T to refer to the set of heights of those entities x : T that have been considered tall by some individual. From this agent-specific set of observations, which is constantly updated as the agent is exposed to new judgements by other individuals, we want to compute a noisy threshold, which the agent uses to make her own judgements of tallness, as specified in (11).
Different functions can be used to compute µ_tall and σ_tall from Ω_tall^T. What constitutes an appropriate function is an empirical matter, and the most suitable function may well vary across predicates (what works for 'tall' may not be suitable for 'dark' or 'expensive', for example). Hardly any work has been done on trying to identify how the threshold is computed from experience. A notable exception, however, is the work of Schmidt et al. (2009), who collect judgements of people asked to indicate which items are tall given distributions of items of different heights. Schmidt and colleagues then propose different probabilistic models to account for the data and compare their output to the human judgements. They explore two types of models: threshold-based models and category-based or cluster models. The best performing models within these two types perform equally well, and the study does not identify any advantages of one type over the other. Since we have chosen threshold models as our case study, we focus our attention on those here.
Each of the threshold models tested by Schmidt et al. (2009) corresponds to a possible way of computing the mean µ_tall of a noisy threshold from a set of observations. The best performing threshold model in their study is the relative height by range model, where (in our notation):

(13) relative height by range (RH-R):
     µ_tall(T) = max(Ω_tall^T) − k · (max(Ω_tall^T) − min(Ω_tall^T))

Here max(Ω_tall^T) and min(Ω_tall^T) stand for the maximum and the minimum height, respectively, of the items that have been judged to be tall by some individual. According to this threshold model, any item within the top k% of the range of heights that have been judged to be tall counts as tall. The model includes two parameters, k and a noise-width parameter that in our approach corresponds to σ_tall. Schmidt et al. (2009) report that the best fit of their data was obtained with k = 29% and σ_tall = 0.05.
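The RH-R computation in (13) is straightforward to sketch in Python. The observation set below is hypothetical; only the formula and the value k = 0.29 come from the text.

```python
def mu_RHR(observations, k=0.29):
    """Relative height by range, after (13): mu = max - k * (max - min),
    over the set of heights judged tall for a given comparison class."""
    hi, lo = max(observations), min(observations)
    return hi - k * (hi - lo)

omega_human_tall = [1.80, 1.85, 1.92, 2.00]    # hypothetical observed heights
print(round(mu_RHR(omega_human_tall), 3))      # 2.00 - 0.29 * 0.20 = 1.942
```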

Updating Vague Meanings
We now want to specify how the vague meaning of 'tall' is updated as an agent is exposed to new judgements via language use. Our setting so far offers a straightforward solution to this: if a new entity x : T with height h is referred to as tall, the agent adds h to its set of observations Ω_tall^T and recomputes µ_tall(T), for instance using RH-R as defined in (13). If RH-R is used, ideally the values of k and σ_tall should be (re)estimated from Ω_tall^T. For the sake of simplicity, however, here we will assume that these two parameters take the values experimentally validated by Schmidt et al. (2009) and are kept constant. An update to µ_tall will then take place only if h > max(Ω_tall^T) or h < min(Ω_tall^T). This in turn will trigger an update to the probability output by κ_tall.
As an example, let us assume that our initial set of observations Ω_tall^Human yields µ_tall(Human) = 1.87, and that a new entity with a height below min(Ω_tall^Human) is subsequently referred to as tall. Adding this height to Ω_tall^Human lowers the minimum of the range and hence, via RH-R, lowers µ_tall(Human). If we were to re-evaluate John Smith's tallness in light of this observation, we would get a new probability of 0.64 that he is tall (in contrast to the earlier probability of 0.579 given in (12)).
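The update dynamics can be sketched end to end in Python. The observation set and the specific heights are hypothetical; the sketch only illustrates the qualitative effect described above, namely that a new, shorter entity judged tall lowers µ and thereby raises the probability of existing tallness judgements.

```python
import math

def mu_RHR(obs, k=0.29):
    """Relative height by range: mu = max - k * (max - min)."""
    return max(obs) - k * (max(obs) - min(obs))

def p_tall(h, mu, sigma=0.05):
    """Probability that h exceeds a Gaussian threshold N(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((h - mu) / (sigma * math.sqrt(2))))

omega = [1.80, 1.85, 1.92, 2.00]       # hypothetical observed 'tall' heights
before = p_tall(1.88, mu_RHR(omega))

omega.append(1.75)                     # a new, shorter entity judged tall
after = p_tall(1.88, mu_RHR(omega))    # lower minimum -> lower mu

print(before < after)                  # True: a height of 1.88 now counts
                                       # as tall with higher probability
```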

Possible Extensions
The set of observations Ω_tall^Human can be derived from a set of Austinian propositions corresponding to instances where people have been judged to be tall. To update from an Austinian proposition p, we simply add p.sit.h to Ω_tall^Human and recompute µ_tall(p.sit.c). Note that we are here treating these Austinian propositions as non-probabilistic. This seems to make sense, since an addressee does not have direct access to the probability associated with the judgement of the speaker. If we were to take these probabilities into account (for instance, the use of a hedge as in 'sort of tall' may license inferences about such probabilities), and if those probabilities are not always 1, we would need a different way of computing µ_tall than the one specified so far.
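This bookkeeping can be sketched with an Austinian proposition encoded as a nested dict. The field names (sit, c, h) mirror the record types above; everything else, including the string standing in for the sit-type, is illustrative.

```python
def add_judgement(omega, p):
    """Update the per-class observation sets from an Austinian proposition p:
    the perceived height p['sit']['h'] is added to the set for the
    comparison class p['sit']['c'], after which mu can be recomputed."""
    c = p["sit"]["c"]
    omega.setdefault(c, []).append(p["sit"]["h"])
    return omega

prop = {"sit": {"c": "Human", "x": "john_smith", "h": 1.88},
        "sit_type": "tall(x)"}          # schematic stand-in for the sit-type
print(add_judgement({}, prop))          # {'Human': [1.88]}
```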
Somewhat related to the point above, note that in our approach we treat all judgements equally, i.e., we do not distinguish between possible different levels of trustworthiness amongst speakers. An agent who is told that an entity with height h is tall adds that observation to its knowledge base without questioning the reliability of the speaker. This is clearly a simplification. For instance, there is developmental evidence showing that children are more sensitive to reliable speakers than to unreliable ones during language acquisition (Scofield and Behrend, 2008).

Other Approaches
Within the literature in formal semantics, Lassiter (2011) has put forward a proposal that extends in interesting ways earlier work by Barker (2002) and shares some aspects with the account we have presented here. Operating in a probabilistic version of classical possible-worlds semantics, Lassiter assumes a probability distribution over a set of possible worlds and a probability distribution over a set of possible languages. Each possible language represents a precise interpretation of a predicate like 'tall': tall 1 = λx.x's height ≥ 5'6"; tall 2 = λx.x's height ≥ 5'7"; and so forth. Lassiter thus treats "metalinguistic belief" (representing an agent's knowledge of the meaning of words) in terms of probability distributions over precise languages. Since each precise interpretation of 'tall' includes a given threshold, this can be seen as defining a probability distribution over possible thresholds, similarly to the noisy threshold we have used in our account. Lassiter, however, is not concerned with learning.
Within the computational semantics literature, DeVault and Stone (2004) describe an implemented system in a drawing domain that is able to interpret and execute instructions including vague scalar predicates such as 'Make a small circle'. Their approach makes use of degree-based semantics, but does not take into account comparison classes. This is possible in their drawing domain since the kinds of geometric figures it includes (squares, rectangles, circles) do not have intrinsic expected properties (size, length, etc.). Their focus is on modelling how the threshold for a predicate such as 'small' is updated during an interaction with the system given the local discourse context. For instance, if the initial context just contains a square, the size of that square is taken to be the standard of comparison for the predicate 'small'. The user's utterance 'Make a small circle' is then interpreted as asking for a circle of an arbitrary size that is smaller than the square.
In our characterisation of the context-sensitivity of vague gradable adjectives in Section 4.1, we have focused on their dependence on general comparison classes corresponding to types of entities (such as Human, Woman, etc.) with expected properties such as height. Thus, in contrast to DeVault and Stone (2004), who focus on the local context of discourse, we have focused on what could be called the global context (an agent's experience regarding types of entities and their expected properties). How these two types of context interact remains an open question, which we plan to explore in our future work (see Kyburg and Morreau (2000), Kemp et al. (2007), and Fernández (2009) for pointers in this direction).

Conclusions and future work
Traditional formal semantics theories postulate a fixed, abstract interpretation function that mediates between natural language expressions and the world, but fall short of specifying how this function is determined or modified dynamically by experience. In this paper we have presented a characterisation of the semantics of vague scalar predicates such as 'tall' that clarifies how their context-dependent meaning and their vague character are connected with perceptual information, and we have also shown how this low-level perceptual information (here, real-valued readings from a height sensor) connects to high-level logical semantics (ptypes) in a probabilistic framework. In addition, we have put forward a proposal for explaining how the meaning of vague scalar adjectives like 'tall' is dynamically updated through language use.
Tallness is a function of a single value (height), and 'tall' is in this sense a uni-dimensional predicate. Indeed, most linguistic approaches to vagueness focus on uni-dimensional predicates such as 'tall'. However, many vague predicates are multi-dimensional, including predicates for positions ('above'), shapes ('hexagonal'), and colours ('green'), amongst many others. Together with compositionality (mentioned at the end of Section 4.2), generalisation of the present account to multi-dimensional vague predicates is an interesting area of future development.