Elaboration of a vector based semantic classification over the words and notions of the natural language


The problem of vector-based semantic classification of the words and notions of a natural language is discussed. A set of generative grammar rules is proposed for generating the semantic classification vector. Examples of applying the classification are given, and a theorem on the incompleteness of an arbitrary formal classification is presented. The principles of assigning functions of meaningful phrases over the classification word groups are analyzed.

Natural language generation, natural language semantics

Short URL: https://sciup.org/148176098

IDR: 148176098


One of the most important problems of formal language theory, a subdivision of theoretical computer science, is the syntactic and semantic analysis of the sentences of a given language. In the study of natural and machine language structure, the foreground problem is natural language generation, i.e. generating grammatically and semantically meaningful phrases and texts that satisfy definite meaningfulness criteria, for example the Turing test. The importance of the matter is determined by the significance of such applied tasks as building natural-language interfaces and developing expert systems, electronic translators, electronic summarizing systems, e-learning systems, software for advertising and user dialogue, etc.

The principal purpose of this research is to offer a classification of natural language words and notions that makes it possible to generate meaningful speech and to define criteria of meaningful speech. The basic tasks are to determine the classification vector for natural language words and notions; to create a classification dictionary for a set of the most common English words; to enable algorithms of meaningful speech generation based on the given classification; and to prove the theorem on the incompleteness of an arbitrary formal classification for describing the differences in natural language word meanings.

The novelty of the work lies in the distinguishing features and application efficiency of the generative grammar described below, which generates the vector coordinates for the classification of natural language words and notions, and in the particular ways the classification is used for natural language generation.

A great number of researchers currently work on the problem of generating the meaningful subset of the language: philologists, programmers, mathematicians, semantics experts, philosophers, etc. [1; 2; 3; 4]. Especially impressive today are the results in generating grammatically meaningful natural language phrases. Text editors, electronic translators and other systems effectively generate grammatically meaningful language structures. However, the generation of semantically meaningful speech is a less studied topic. Nevertheless, many systems based on semantic nets, speech graffiti, ontologies and other methods show good results in dialogue with a natural language user. The most popular method of sustaining a dialogue with the user relies on databases of natural language dialogues between people, such as forum participants. Representing natural language phrases and texts as functions and functional clusters over a multidimensional semantic classification remains insufficiently developed, in spite of the fact that the method has shown its efficiency for generating meaningful speech [5; 6; 7].

Classification of Natural Language Words and Notions. Let us look at a semantic classification of natural language words and notions that reduces them to 16 classes of language semes (semantic “atoms” of meaning), further to four gene-semes (elementary particles of meaning), and finally to the notion of a link (a “quantum” of meaning), which can be represented with the apparatus of notional semantic nets. A definition based on the meaning quantum is a semantic net whose arcs bear the semantics of equivalence between some elements, which means a link between objects.

Using four elementary particles, the gene-semes {system, classification, localization, perception}, it is possible to describe the natural language. Localization is defined as an object in which all levels of the subsystems are similar; for example, a triangle formed by the stars of a galaxy is similar to any proportional triangle created by the planet houses of a star system. Perception is defined as an object in which all the (perceived) subsystems are similar to the (perceiving) super-systems. For example, a real image of a vase in the light spectrum forms a pyramid of informational similarity, first in the pupil of the eye and then in the brain. The system (structure) is defined as an object with heterogeneous subsystems and super-systems. For example, the structures of an automobile body and of its wheels are heterogeneous. Classification is defined as an object in which all subsystems are similar to the super-systems. For example, crab apples possess all the properties of apples, while apples possess all the properties of fruit.

Using four gene-semes it is possible to determine 16 classes of semes. We shall give some examples of such a definition for the semes class: “Basic semes”:

  • creature – perceiving and localized in space;

  • thing – not perceiving and localized in space;

  • mind – perceiving and not localized in space;

  • abstraction – not perceiving and not localized in space;

  • idea – perceived and not localizing in space;

  • place – not necessarily perceived and localized in space;

  • information – perceived and localizing;

  • abstraction – not necessarily perceived and not localizing.
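The first four definitions above combine two binary features, so they can be pictured as a small feature table. The following is a minimal Python sketch under that reading; the names `BasicSeme`, `BASIC_SEMES` and `find_seme` are illustrative assumptions and do not come from the article, and only the four semes defined through "perceiving" and "localized" are encoded.

```python
from typing import NamedTuple, Optional


class BasicSeme(NamedTuple):
    """A basic seme described by two binary features (hypothetical encoding)."""
    name: str
    perceiving: bool   # does the object perceive?
    localized: bool    # is the object localized in space?


# Only the four semes defined via "perceiving" / "localized" in the list above.
BASIC_SEMES = [
    BasicSeme("creature", perceiving=True, localized=True),
    BasicSeme("thing", perceiving=False, localized=True),
    BasicSeme("mind", perceiving=True, localized=False),
    BasicSeme("abstraction", perceiving=False, localized=False),
]


def find_seme(perceiving: bool, localized: bool) -> Optional[str]:
    """Return the name of the seme with the given feature combination, if any."""
    for seme in BASIC_SEMES:
        if seme.perceiving == perceiving and seme.localized == localized:
            return seme.name
    return None


print(find_seme(perceiving=True, localized=False))
```

Two binary gene-seme features give four combinations, which is consistent with the way four gene-semes are said to yield 16 classes of semes.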

The following basic classes of meaning atoms are determined as semes of the natural language:

  • 1.    Basic semes: creature, place, information and others.

  • 2.    Semes of probability: existing, non-existing, necessary, possible and the derived ones.

  • 3.    Semes-predicates: relation-x, relation-x-x, relation-creature-x and others.

  • 4.    Semes-arguments: subject, object, recipient, instrument and others.

  • 5.    Semes of localization: of, in, on, at and others.

  • 6.    Semes-relations: includes, is included in, includes and is included in, partially includes, is more than, is less than and others.

  • 7.    Semes-numbers: digits from 0 to 15.

  • 8.    Semes of indefinite number: all, many, some, few, no and others.

  • 9–12. Semes of the language stylistics: positive – negative, low – high and others.

  • 13–16. Semes characterizing the description of images and forms: wide – narrow, stable – unstable and others.

Based on the classification of natural language semes, a five-coordinate classification vector G for natural language notions is proposed. The coordinate values of the vector G are assigned by means of a generative grammar of the following form:

  • 1.    The first level of the notions classification corresponds to the coordinate $G_1$ of the vector $G$. Let $G_1$ = {something, relation, mind, idea, information, place, thing, creature}.

  • 2.    The second level of the notions classification is presented by the coordinate $G_2$. The set of $G_2$ values is assigned by a set of generative grammar rules: $\{S \to F\,d,\ S \to F\,x,\ d \to \text{alive},\ d \to \text{not alive},\ x \to \text{which alive},\ x \to \text{which not alive},\ F \to \text{of},\ F \to \text{in},\ F \to \text{on},\ F \to \text{at}\}$, where the notion "at" means any non-zero distance between objects.

  • 3.    The third level of the notions classification is determined by the coordinate $G_3$, $G_3$ = {X–y (essence), X–X–y (essence of essence), […]–X–y (property), […]–X–X–y (connection), […] (action), […]–X–X–y (joining), […]–X–y (presenting), […]–X–X–y (exchange)}, where X is any of the basic semes determined on the first level of the classification, while y is any sequence of such semes. X is the seme that is main by its meaning. The sign “–” denotes concatenation here. Explanations are given in round brackets.

  • 4.    The set of $G_4$ values of the vector $G$ is assigned by a set of generative grammar rules: $\{S \to P_1\text{-}P_2\text{-}P_3\text{-}P_4\text{-}P_5\text{-}P_6\text{-}P_7\text{-}P_8,\ P_1 \to g\text{-quantity},\ P_1 \to X,\ P_2 \to g\text{-stability},\ P_2 \to X,\ P_3 \to g\text{-positivity},\ P_3 \to X,\ P_4 \to g\text{-spectrum},\ P_4 \to X,\ P_5 \to g\text{-information content},\ P_5 \to X,\ P_6 \to g\text{-location},\ P_6 \to X,\ P_7 \to g\text{-size},\ P_7 \to X,\ P_8 \to g\text{-being artificial},\ P_8 \to X\}$, where $g$ is a linguistic scale value such as {minimal, …, little, …, medium, …, big, …, maximal, X}. Here X is an empty symbol.

  • 5.    The set of $G_5$ values of the vector $G$ is assigned by a set of generative grammar rules: $\{S \to x,\ x \to (xFx),\ x \to xFx,\ x \to 1\ \text{(existing)},\ x \to 0\ \text{(non-existing)},\ x \to \Diamond\ \text{(possible)},\ x \to \Box\ \text{(necessary)},\ F \to \text{includes},\ F \to \text{is included in},\ F \to \text{includes and is included in},\ F \to \text{partially includes},\ F \to \text{more than},\ F \to \text{less than},\ F \to \text{equal to},\ F \to \text{similar to},\ F \to \text{becomes},\ F \to \text{derives from},\ F \to \text{is simultaneous to},\ F \to \text{is not simultaneous to},\ F \to \text{implies},\ F \to \text{is determined by},\ F \to \text{corresponds to},\ F \to \text{is connected to}\}$.
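Because the second-level grammar is finite and non-recursive, its language can be enumerated directly. The sketch below is a hedged Python illustration, not the authors' software: the dictionary `G2_RULES` and the function `expand` are assumed names, and the symbol that produces of/in/on/at is treated as the same non-terminal F that appears in the S rules.

```python
from itertools import product

# Second-level grammar, transcribed from the rules for the coordinate G2.
# Assumption: the non-terminal producing of/in/on/at is the F used in the S rules.
G2_RULES = {
    "S": [["F", "d"], ["F", "x"]],
    "d": [["alive"], ["not alive"]],
    "x": [["which alive"], ["which not alive"]],
    "F": [["of"], ["in"], ["on"], ["at"]],
}


def expand(symbol, rules):
    """Return every terminal string derivable from `symbol`.

    Works only for non-recursive grammars such as the G2 rules;
    the recursive G5 rules would need a depth limit.
    """
    if symbol not in rules:          # terminal: nothing left to rewrite
        return [symbol]
    results = []
    for right_hand_side in rules[symbol]:
        parts = [expand(s, rules) for s in right_hand_side]
        for combination in product(*parts):
            results.append(" ".join(combination))
    return results


G2_VALUES = expand("S", G2_RULES)
print(len(G2_VALUES))
print(G2_VALUES[:4])
```

Under these assumptions the enumeration contains values such as "in which alive", which is the second coordinate used in the example vectors for shop-like words.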

All further levels of the classification are formed by recursively repeating the five levels of classification described above. The level index can be calculated by the formula $G_i = G_{\mathrm{mod}(i,5)}$, where $i$ belongs to the set of natural numbers. Any notion or class of notions of the natural language corresponds to a definite classification vector.
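The recursive repetition of the five levels can be sketched as a small index mapping. Read literally, mod(i, 5) gives 0 for i = 5, 10, …, which is not one of the coordinates 1 to 5; the Python sketch below therefore assumes a 1-based reading in which those levels map to the fifth coordinate. The function name `coordinate_for_level` is hypothetical.

```python
def coordinate_for_level(i: int) -> int:
    """Map a classification level i = 1, 2, 3, ... to a coordinate index 1..5.

    Assumption: levels 5, 10, 15, ... reuse coordinate 5, i.e. the formula
    is read as ((i - 1) mod 5) + 1 rather than a plain i mod 5.
    """
    if i < 1:
        raise ValueError("level index must be a natural number (>= 1)")
    return (i - 1) % 5 + 1


for level in (1, 5, 6, 12):
    print(level, "->", coordinate_for_level(level))
```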

For example, the group of words {take, give, buy, sell, accept, present, etc.} corresponds to a vector such as [thing\\relation-creature-creature-X].

The group of words {shop, kiosk, supermarket, etc.} corresponds to a vector such as [thing\in which alive\X] + [thing\\relation-creature-creature-X].

The word “transport” corresponds to the vector [thing\in which alive\X] + [place\\relation-creature-X].
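The bracketed notation in these examples can be read mechanically: square brackets delimit one vector, a backslash separates coordinates, an empty field between two backslashes is an unspecified coordinate, and "+" joins several vectors for one word group. The Python sketch below assumes exactly this reading; `parse_vector` and the constants are illustrative names, not part of the article.

```python
def parse_vector(text):
    # Split "[a\b\c] + [d\e\f]" into a list of coordinate tuples.
    # An empty field (two backslashes in a row) becomes an empty string.
    vectors = []
    for part in text.split("+"):
        part = part.strip()
        if not (part.startswith("[") and part.endswith("]")):
            raise ValueError("expected a bracketed vector, got: " + part)
        coordinates = tuple(field.strip() for field in part[1:-1].split("\\"))
        vectors.append(coordinates)
    return vectors


# Raw strings keep the backslashes exactly as they are written in the text.
TRADE = parse_vector(r"[thing\\relation-creature-creature-X]")
SHOP = parse_vector(r"[thing\in which alive\X] + [thing\\relation-creature-creature-X]")
TRANSPORT = parse_vector(r"[thing\in which alive\X] + [place\\relation-creature-X]")

print(SHOP)
# A shop shares its second component with the take/give/buy/sell group.
print(SHOP[1] == TRADE[0])
```

Comparing components in this way shows how word groups can overlap in the notion space: under this reading, shop-like words and "transport" share their first component, while shop-like words and trade verbs share the relation component.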

Each word corresponds to a set of semantic notions, i.e. points of the notion space. However, using five coordinates of the multidimensional classification vector is a definite simplification. In its most complete form the classification can be based on 16 coordinates of a recursively repeating vector of values.

The principle of meaningful speech generation based on the proposed classification has been tested in software such as the “Electronic Dictionary”.

The Incompleteness of a Formal Classification Theorem. This theorem is the basis for the given classification. Let us introduce a definition of a conditionally complete classification and prove the theorem of semantic classification incompleteness.

Definition 1. Let us consider a system representing word semantics as points of a vector space to be conditionally complete if, for arbitrary elements $a \in \{a', a'', a''', \dots\}$, $b \in \{b', b'', b''', \dots\}$, …, $c \in \{c', c'', c''', \dots\}$ and the vector $v[a, b, \dots, c]$, the following holds: for any notion $A$, $A \sim a' \lor a'' \lor a''' \lor \dots$; for any notion $B$, $B \sim b' \lor b'' \lor b''' \lor \dots$; …; for any notion $C$, $C \sim c' \lor c'' \lor c''' \lor \dots$, where “$\sim$” is the sign of correspondence.

Theorem. Any system representing word semantics as points of a vector space is incomplete. In other words, for any classification there exist words whose meaning elements are not completely classified by it. For any classification $A$ of the word set $\{a_i\}$, where any $a_k \sim v[a_x, b_y, \dots, c_z]$, the meaning of the word $S(a_k)$ includes a meaning shade $S(a_k).L_n$ such that $\neg(S(a_k).L_n \sim S(v[a_x, b_y, \dots, c_z]))$, that is, $\neg(S(a_k) \in S(v[a_x, b_y, \dots, c_z]))$, where $a = \{a_1, a_2, a_3, \dots\}$, $b = \{b_1, b_2, b_3, \dots\}$, …, $c = \{c_1, c_2, c_3, \dots\}$.

Let us show an example of a meaning going beyond the meaning determined by a classification. The word light $\sim v[\text{action}, \dots, \text{from the surface}, \dots, \text{intensive}, \dots]$, while $\neg(S(\text{light}).\text{shining} \sim S(v[\text{action}, \dots, \text{from the surface}, \dots, \text{intensive}, \dots]))$; thus the emotional and associative series determined by a person cannot be completely expressed by a formal classification. Consequently, a series of meaning elements cannot be expressed by any formalism: for example, it is impossible to explain to a blind person what the sensation of a color such as red is, and therefore it is impossible for that person to imagine it. Thus words correspond to positions in a classification according to the law of the excluded middle, but their meanings are not reduced to this division.

Lemma 1. A word meaning can have arbitrarily large cardinality. Proof. Let us understand the cardinality of a word meaning as the cardinality of the set $\{S(a_k).L_j\}$ for a definite word $a_k$. Let the meaning of the word $a_k$ be assigned by a definition in the form of a semantic net $\{L_j(L_j', L_j'')\}$. The word $a_k$ is correlated with an object of reality that stands in a system of relations with outer objects, with parts of the system and with the perception of the system. Relations with outer objects of reality (distance, concatenation, simultaneity) determine the meaning $\{L_j(L_j', L_j'')\}$, where $L_j''$ is an outer object, and reality (for example, the set of points of space, the number of literary worlds, time, subsets of sets of objects and points) is in principle endless, so the set of word meanings is in principle unlimited: $\{L_j''\} = \infty \vdash \{L_j(L_j', L_j'')\} = \infty$.

For example, the meaning of any word can always be extended: a reading student, a student reading a book, a student sitting and reading a book, etc., without limit.

Lemma 2. The number of possible words with different semantics is endless. Proof. $\{L_j(L_j', L_j'')\} = \infty \vdash \{L\} = \infty \vdash \{S(a_k).L\} = \infty$, because the word $a_k$ can be arbitrary.

Proof of the theorem. Let the classification $A$ be assigned by a vector of coordinates $v[a, b, \dots, c]$, where $a_k \sim v[a_x, b_y, \dots, c_z]$, $S(a_k).L_j$ is an element of the meaning of the word $a_k$, and $a = \{a_1, a_2, a_3, \dots\}$, $b = \{b_1, b_2, b_3, \dots\}$, …, $c = \{c_1, c_2, c_3, \dots\}$. For any $a_k$, let it be true that $v[a_x, b_y, \dots, c_z] \sim S(a_k)$. For any coordinate $d$ of the vector $v[a, b, \dots, c]$: $d = \{d' \lor \neg d'\} \cup \{d'' \lor \neg d''\} \cup \dots$ Let us assign a value $g = \{g' \lor \neg g'\} \cup \{g'' \lor \neg g''\} \cup \dots$, where $g \neq a$, $g \neq b$, …, $g \neq c$. As a result of the union of the classification vectors $v[a, b, \dots, c]$ and $v'[g]$, the vector $v''[a, b, \dots, c, g]$ is obtained. It is evident that, according to Lemma 2, an $S(a_r).L_h$ can be found such that $S(a_r) \in v''[a, b, \dots, c, g]$. Let us consider the sets of such meaning elements $E = \{S(a_r).L_h\}$ and $I = \{S(a_k).L_j\}$, with $I \subseteq E$. The sets are different if $g$ is not empty, because $g \neq a$, $g \neq b$, …, $g \neq c$. Let $Y = E - I$.

If a classification $v[a, b, \dots, c]$ could exist such that the set $Y$ is always empty, then it would always hold that either $S(g) = \emptyset$ or $S(g) \in S(a) \cup S(b) \cup \dots \cup S(c)$. Let us assign a word $a_t$ such that $S(a_t) = S(v[a, b, \dots, c])$. Let us show that it is always possible to select a $g$ such that $Y$ is not empty: by Lemma 2, a word $a_f$ can be found such that $S(a_f) = S(v[a, b, \dots, c, g])$ and such that its semantics always differs from the semantics of the word $a_t$. Otherwise $\{S(a_t)\} \neq \infty$, where $S(a_t)$ is any meaning of the word-classification. Accordingly, the set $Y$ is not empty and a complete classification $v[a, b, \dots, c]$ cannot exist. The theorem is proved.
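The construction used in the proof, extending a classification by a new coordinate g that separates meanings the old coordinates cannot tell apart, can be pictured with a toy example. The Python sketch below is an illustration only and proves nothing: the feature names, the second word "glare" and its feature values are hypothetical, and only the coordinates given for the word "light" (action, from the surface, intensive, with the shade "shining") follow the text.

```python
def classify(features, coordinates):
    # Project a word's features onto the chosen coordinates; any feature
    # outside these coordinates is invisible to the classification.
    return tuple(features.get(name) for name in coordinates)


def collisions(words, coordinates):
    # Group words that receive the same classification vector.
    seen = {}
    for word, features in words.items():
        seen.setdefault(classify(features, coordinates), []).append(word)
    return [group for group in seen.values() if len(group) > 1]


# "glare" and the feature names are hypothetical illustration data.
WORDS = {
    "light": {"kind": "action", "source": "from the surface",
              "intensity": "intensive", "shade": "shining"},
    "glare": {"kind": "action", "source": "from the surface",
              "intensity": "intensive", "shade": "dazzling"},
}

BASE = ("kind", "source", "intensity")   # plays the role of v[a, b, ..., c]
EXTENDED = BASE + ("shade",)             # plays the role of v''[a, b, ..., c, g]

print(collisions(WORDS, BASE))       # both words fall on the same point
print(collisions(WORDS, EXTENDED))   # the extra coordinate separates them
```

In the argument of the theorem the same step can be repeated for any finite set of coordinates, which is why no fixed vector is expected to capture every shade of meaning.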

Thus a meta-notion always exists that adds an extra meaning to the classification of words. This means that no word classification can generate all word meanings. The theorem correlates with Gödel's theorem on the incompleteness of formal systems.

The Principles of Meaningful Natural Language Generation. Let us consider the principles of meaningful speech generation based on the proposed vector-based classification (see the figure).

The structures of different levels are formed over the given semantic classification of words and notions of the natural language. On the first level there are the word groups of the language. On the second level they are united into word combinations, i.e. pairs of words linked semantically and grammatically; on that level the combinations of words that are more or less useful as word combinations are assigned. On the third level the words are united into patterns, for example: “Determiner + Attribute + Subject + Modality + Predicate + Determiner + Attribute + Object + Link + Determiner + Attribute + Nominal Group (Modifier of Time) + Link + Determiner + Attribute + Nominal Group (Modifier of Place) + Link + Determiner + Attribute + Nominal Group (Modifier of Purpose) +…”. Semantic chains of this type are presented in the following way: “this/that/…+ hungry/full/…+ vegetarian/gourmand/…+ can/wants to/…+ eat/cook/…+ the/a/…+ tasty/aromatic/…+ pie/salad/…+ after/before/…+ five/six/…+ hours + in/for/…five/six/… + minutes +…+ in + a + big/beautiful/…+ restaurant/canteen/…+ on + a + big/beautiful/…+ street/square/… named after Smith/Brown/…+ in + a + big/beautiful/… + city/village/…+ Ababa/Acaca/…+ in order to/to/…+ taste/know/…+ a + pungent/spicy/…+ taste/aftertaste/…+…”. On the fourth level words are separated into subsets of these patterns: “I/he/…+ have eaten/tasted/…+ on a street/square/…+ named after + Smith/Brown/…”. On the fifth level the fragments of the patterns are united into semantic patterns of the second rank: “the taste of a pie surprised me in the morning” (pattern type: Relation-Attribute_of_Object-Time), “The restaurant gladdened me with a crunching crust” (pattern class: Relation-Place-Part_of_Object). Generating and ordering the semantic patterns of the second rank is an important task that determines the success of a software system for natural speech generation. Examples of semantic patterns of natural speech generation are shown in the table:

the… (этот…)
+ stuff (предмет) / taste (вкус) / after-taste (привкус) / smack (привкус) / flavor (вкус)
+ of the… (…)
+ berry (ягода) / strawberry (клубника) / raspberry (малина) / gooseberry (крыжовник) / currant (смородина) / bilberry (черника) / blackberry (черная смородина) / cranberry (клюква) / sweet cherry (черешня) / cherry (вишня) / grape (виноград) / raisin (изюм)
+ is… (является…)
+ sweet (сладкий) / sour (кислый) / salty (соленый) / bitter (горький) / pungent (острый) / weak (слабый) / strong (сильный)

…- /у-йте(сь)
+ enjoy (наслаждаться) / feel (чувствовать) / savor (смаковать) / discuss (наслаждаться) / identify (узнать) / notice (заметить) / learn (узнавать) / experience (испытать)
+ the… (этот…)
+ good (хороший) / great (великолепный) / excellent (отличный) / wonderful (чудесный) / superior (превосходный) / splendid (великолепный) / magnificent (сказочный) / surprising (удивительный) / lovely (красивый) / worthy (стоящий) / useful (полезный) / funny (забавный)
+ thing (вещь) / object (объект) / gem (прелесть) / must (важная вещь) / trifle (мелочь) / process (процесс) / time (время) / moment (момент)
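The slot-filling idea behind such patterns can be sketched in a few lines of Python. This is a minimal illustration, not the authors' "Electronic Dictionary" software: the slot lists copy a few English words from the table, while the names `PATTERN`, `generate` and `count_phrases` are assumptions introduced here.

```python
import random
from itertools import product

# One pattern: fixed elements alternate with word groups from the table.
PATTERN = [
    ["the"],
    ["taste", "after-taste", "smack", "flavor"],
    ["of the"],
    ["strawberry", "raspberry", "cherry", "grape"],
    ["is"],
    ["sweet", "sour", "salty", "bitter"],
]


def generate(pattern, rng):
    """Pick one word per slot and join the choices into a phrase."""
    return " ".join(rng.choice(slot) for slot in pattern)


def count_phrases(pattern):
    """Number of distinct phrases the pattern can produce."""
    total = 1
    for slot in pattern:
        total *= len(slot)
    return total


def all_phrases(pattern):
    """Enumerate every phrase of the pattern (feasible only for small patterns)."""
    return [" ".join(choice) for choice in product(*pattern)]


rng = random.Random(0)          # fixed seed so that runs are repeatable
print(generate(PATTERN, rng))
print(count_phrases(PATTERN))
```

Every phrase produced this way is grammatical by construction; whether it is also meaningful depends on how well the word groups in each slot agree semantically, which is the role the classification vectors are meant to play.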

Research article