History of Computational Linguistics. What is Computational Linguistics? Cognitive toolkit of computational linguistics

Novoselova Irina

Why aren't all machine translations perfect? What determines the quality of a translation? Does the author have enough knowledge to use and supplement existing computer dictionaries? The author tried to answer these questions in this work. The full report is in the attached file; the product of the project work is available on the school portal.


International research conference of high school students and students
"Education. Science. Profession"

Section "Foreign Linguistics"

"Computational Linguistics"

Prepared by Irina Novosyolova
MOU Gymnasium No. 39 "Classic"
Grade 10 "B"

Scientific supervisors:
Chigrineva Tatyana Dmitrievna, English teacher of the highest category
Osipova Svetlana Leonidovna, computer science teacher of the highest category

Otradny
2011

  1. English words in ICT


  2. My experiment

One of the tasks was to conduct an experiment comparing the capabilities of various online linguistic dictionaries and translators in order to find the most accurate translation from English into Russian.

The following sites have been tested:

  1. http://translate.eu/
  2. http://translate.google.ru/#ru
  3. http://www.langinfo.ru/index.php?div=6
  4. http://www2.worldlingo.com/ru/products_services/worldlingo_translator.html

To keep the experiment fair, I chose sentences of varying stylistic and translational complexity; the line in parentheses after each sentence gives the intended reference translation. The input phrases are as follows:

1. A new report says today's teenagers are more selfish than they were 20 years ago

(New report says today's teens are more selfish than they were 20 years ago)

2. She believes video games and the Internet are the biggest reasons for this increased selfishness.

(She believes that video games and the Internet are the biggest reasons for this growing selfishness)

3. They want to be better than others

(They want to be better than the rest)

4. She found the big increase started from the year 2000, which is when violent video games became really popular.

(She found a lot of growth starting in 2000 when violent video games became really popular)

After translating these sentences on online translator sites, I got the following results:

  1. http://translate.eu/

Computational Linguistics: Methods, Resources, Applications

Introduction

The term computational linguistics (CL) has in recent years become increasingly common in connection with the development of various applied software systems, including commercial software products. This is due to the rapid growth of textual information in society, including on the Internet, and the need for automatic processing of texts in natural language (NL). This circumstance stimulates the development of computational linguistics as a field of science and the development of new information and linguistic technologies.

Within computational linguistics, which has existed for more than 50 years (and is also known under the names machine linguistics and automatic text processing in NL), many promising methods and ideas have been proposed, but not all of them have yet found their way into software products used in practice. Our goal is to characterize the specifics of this area of research, formulate its main tasks, indicate its connections with other sciences, give a brief overview of the main approaches and resources used, and briefly characterize the existing applications of CL. For a more detailed acquaintance with these issues, the books listed in the Literature section can be recommended.

1. Tasks of computational linguistics

Computational linguistics arose at the intersection of such sciences as linguistics, mathematics, computer science and artificial intelligence. The origins of CL go back to the research of the famous American scientist N. Chomsky on formalizing the structure of natural language; its development rests on results in general linguistics. Linguistics studies the general laws of natural language, its structure and functioning, and includes the following areas:

Ø Phonology - studies the sounds of speech and the rules for combining them in the formation of speech;

Ø Morphology - deals with the internal structure and external form of words, including parts of speech and their categories;

Ø Syntax - studies the structure of sentences, the rules of word compatibility and word order in a sentence, as well as its general properties as a unit of language;

Ø Semantics and pragmatics - closely related areas: semantics deals with the meaning of words, sentences and other units of speech, while pragmatics deals with how this meaning is expressed in connection with the specific goals of communication;

Ø Lexicography - describes the lexicon of a particular NL, i.e. its individual words and their grammatical properties, as well as methods for creating dictionaries.

The results of N. Chomsky, obtained at the intersection of linguistics and mathematics, laid the foundation for the theory of formal languages and grammars (often called generative grammars). This theory now belongs to mathematical linguistics and is used to process not so much NL as artificial languages, primarily programming languages. By its nature it is a fully mathematical discipline.

Mathematical linguistics also includes quantitative linguistics, which studies the frequency characteristics of language (words, their combinations, syntactic constructions, etc.) using mathematical methods of statistics, so this branch of science can also be called statistical linguistics.

CL is also closely related to such an interdisciplinary scientific field as artificial intelligence (AI), within which computer models of individual intellectual functions are developed. One of the first working programs in the field of AI and CL is the well-known program of T. Winograd, which understood simple human commands to manipulate a world of blocks, formulated in a restricted subset of NL. It should be noted that despite the obvious overlap of research in CL and AI (since language proficiency involves intellectual functions), AI does not subsume all of CL, since CL has its own theoretical basis and methodology. What these sciences have in common is computer modeling as the main method and final goal of research.

Thus, the task of CL can be formulated as the development of computer programs for the automatic processing of texts in NL. Although processing is understood quite broadly, far from all types of processing can be called linguistic, and the corresponding processors can be called linguistic. A linguistic processor must use one or another formal model of the language (even a very simple one), which means it must be language-dependent in one way or another (that is, depend on a specific NL). Thus, for example, the Microsoft Word text editor can be called linguistic (if only because it uses dictionaries), while the NotePad editor cannot.

The complexity of the tasks of CL is due to the fact that NL is a complex multi-level system of signs that arose for the exchange of information between people, developed in the process of human practical activity, and is constantly changing in connection with this activity. Another difficulty in developing CL methods (and in studying NL within linguistics) is associated with the diversity of natural languages and significant differences in their vocabulary, morphology and syntax; different languages provide different ways of expressing the same meaning.

2. Features of the NL system: levels and connections

The objects of linguistic processors are NL texts. Texts are understood as any samples of speech, oral or written, of any genre, but CL mainly considers written texts. A text has a one-dimensional, linear structure and also carries a certain meaning, while language acts as a means of converting the transmitted meaning into texts (speech synthesis) and vice versa (speech analysis). A text is composed of smaller units, and there are several ways of dividing it into units belonging to different levels.

The existence of the following levels is generally recognized:

The level of sentences (statements) - syntactic level;

· Lexico-morphological homonymy (the most common type) occurs when the word forms of two different lexemes coincide; for example, the Russian word form стих is both a past-tense masculine singular verb ('subsided') and a noun in the nominative singular ('verse');

· Syntactic homonymy means ambiguity of the syntactic structure, leading to several interpretations: Students from Lvov went to Kyiv; Flying planes can be dangerous (Chomsky's famous example), etc.

3. Modeling in computational linguistics

The development of a linguistic processor (LP) involves describing the linguistic properties of the processed NL text, and this description is organized as a model of the language. As in modeling in mathematics and programming, a model is understood as some system that reflects a number of essential properties of the phenomenon being modeled (i.e., NL) and therefore has structural or functional similarity to it.

Language models used in CL are usually built on the basis of theories created by linguists by studying various texts and relying on their linguistic intuition (introspection). What is the specificity of CL models? The following features can be distinguished:

· Formality and, ultimately, algorithmizability;

· Functionality (the purpose of modeling is to reproduce the functions of language as a "black box", without building an accurate model of human speech synthesis and analysis);

· Generality of the model, i.e. it takes into account a rather large set of texts;

· Experimental validity, which involves testing the model on different texts;

· Reliance on dictionaries as a mandatory component of the model.

The complexity of NL, of its description and processing, leads to dividing this process into separate stages corresponding to the levels of the language. Most modern LPs are of a modular type, in which each level of linguistic analysis or synthesis corresponds to a separate processor module. In particular, in the case of text analysis, individual LP modules perform:

Ø Graphematic analysis, i.e. identifying word forms in the text (the transition from characters to words);

Ø Morphological analysis - the transition from word forms to their lemmas (dictionary forms of lexemes) or stems (the core parts of words, minus inflectional morphemes);

Ø Syntactic analysis, i.e. identifying the grammatical structure of the text's sentences;

Ø Semantic and pragmatic analysis, which determines the meaning of phrases and the corresponding reaction of the system within which the LP works.

Different schemes of interaction of these modules are possible (sequential operation or parallel interleaved analysis); however, the individual levels (morphology, syntax and semantics) are still processed by different mechanisms.

Thus, the LP can be considered as a multi-stage converter that, in the case of text analysis, translates each of its sentences into an internal representation of its meaning, and vice versa in the case of synthesis. The corresponding language model can be called structural.

Although complete CL models require taking into account all the main levels of the language and the availability of appropriate modules, when solving some applied problems, it is possible to do without the representation of individual levels in the LP. For example, in early experimental CL programs, the processed texts belonged to very narrow problem areas (with a limited set of words and a strict word order), so that word recognition could use their initial letters, omitting the stages of morphological and syntactic analysis.

Another example of a reduced model, which is now used quite often, is the language model of the frequencies of characters and their combinations (bigrams, trigrams, etc.) in texts of a particular NL. Such a statistical model captures linguistic information at the level of the characters (letters) of a text, and it is sufficient, for example, for detecting typos or recognizing which language a text is written in. A similar model based on statistics of individual words and their joint occurrence in texts (word bigrams and trigrams) is used, for example, to resolve lexical ambiguity or to determine the part of speech of a word (in languages such as English).
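
As a minimal illustration of such a character-level statistical model, the sketch below (plain Python; the tiny "training" snippets are invented assumptions) scores a string against character-trigram frequency profiles of two languages and picks the closer one. A real system would build the profiles from large corpora.

```python
from collections import Counter

def char_trigrams(text: str) -> Counter:
    """Count overlapping character trigrams in a lowercased string."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def profile_similarity(sample: Counter, profile: Counter) -> float:
    """Crude similarity: share of the sample's trigram mass also seen in the profile."""
    total = sum(sample.values()) or 1
    return sum(c for g, c in sample.items() if g in profile) / total

# Toy "training corpora" (assumptions; real profiles come from large text collections)
english_profile = char_trigrams("the cat sat on the mat and the dog ran to the park")
german_profile = char_trigrams("der hund lief in den park und die katze sass auf der matte")

def guess_language(text: str) -> str:
    sample = char_trigrams(text)
    scores = {
        "English": profile_similarity(sample, english_profile),
        "German": profile_similarity(sample, german_profile),
    }
    return max(scores, key=scores.get)

print(guess_language("the dog sat on the mat"))      # expected: English
print(guess_language("die katze lief in den park"))  # expected: German
```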

Note that structural-statistical models are also possible, in which certain statistics are taken into account when representing individual levels of NL: words, syntactic constructions, etc.

In a modular type LP, at each stage of text analysis or synthesis, an appropriate model (morphology, syntax, etc.) is used.

The morphological models for analyzing word forms that exist in CL differ mainly in the following parameters:

· the result of the analysis: a lemma or stem with a set of morphological characteristics (gender, number, case, aspect, person, etc.) of the given word form;

· the method of analysis: based on a dictionary of the language's word forms, on a stem dictionary, or a dictionary-free method;

· the possibility of processing word forms of lexemes not included in the dictionary.

In morphological synthesis, the initial data are the lexeme and specific morphological characteristics of the requested word form of the given lexeme; it is also possible to request the synthesis of all forms of the given lexeme. The result of both morphological analysis and synthesis is generally ambiguous.

To model syntax within CL, a large number of different ideas and methods have been proposed, differing in the way the syntax of the language is described, the way this information is used in the analysis or synthesis of an NL sentence, and the way the syntactic structure of a sentence is represented. Rather conventionally, three main approaches to creating models can be singled out: a generative approach going back to the ideas of Chomsky; an approach going back to the ideas of I. Melchuk and represented by the Meaning-Text model; and an approach that attempts to overcome the limitations of the first two, in particular the theory of syntactic groups.

Within the generative approach, syntactic analysis is usually performed on the basis of a formal context-free grammar describing the phrase structure of a sentence, or on the basis of some extension of context-free grammars. These grammars proceed from a sequential, linear division of a sentence into phrases (syntactic constructions, for example noun phrases) and therefore reflect both its syntactic and its linear structure simultaneously. The hierarchical syntactic structure of an NL sentence obtained as a result of the analysis is described by a constituency (phrase structure) tree, whose leaves contain the words of the sentence, whose subtrees correspond to the syntactic constructions (phrases) included in the sentence, and whose arcs express the nesting relations of the constructions.
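
A minimal sketch of such constituency parsing, assuming the NLTK library is installed: a toy context-free grammar is written by hand and a chart parser builds the constituency trees of a short sentence (real grammars are, of course, far larger).

```python
import nltk

# A toy context-free grammar (hand-written assumption, not a real-world resource)
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | Det N PP
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N  -> 'student' | 'book' | 'library'
V  -> 'reads' | 'takes'
P  -> 'in' | 'from'
""")

parser = nltk.ChartParser(grammar)
sentence = "the student reads a book in the library".split()

# Print every constituency tree licensed by the grammar (PP attachment is ambiguous)
for tree in parser.parse(sentence):
    tree.pretty_print()
```

Note how the two trees produced for this sentence illustrate the syntactic homonymy discussed earlier: the prepositional phrase can attach either to the verb or to the noun.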

The approach under consideration also includes network grammars, which serve both as a device for describing the language system and for specifying a sentence analysis procedure based on the concept of a finite automaton, for example the augmented transition network (ATN).

In the second approach, a more visual and widespread way of representing the syntactic structure of a sentence is used: dependency trees. The nodes of the tree contain the words of the sentence (usually with the verb-predicate at the root), and each arc connecting a pair of nodes is interpreted as a syntactic subordinating (dependency) link between them, the direction of the link corresponding to the direction of the arc. Since in this case the syntactic links of words and the word order in the sentence are separated, dependency trees can describe discontinuous and non-projective constructions, which occur quite often in languages with free word order.
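
As a small illustration (a hand-built sketch, not the output of any particular parser), a dependency tree can be stored simply as a list of head-dependent-relation triples over word indices:

```python
# Dependency analysis of "The student reads a thick book",
# stored as (head index, dependent index, relation) triples; index 0 is an artificial ROOT.
words = ["ROOT", "The", "student", "reads", "a", "thick", "book"]

dependencies = [
    (0, 3, "root"),   # ROOT  -> reads
    (3, 2, "nsubj"),  # reads -> student   (subject)
    (2, 1, "det"),    # student -> The     (determiner)
    (3, 6, "obj"),    # reads -> book      (direct object)
    (6, 4, "det"),    # book -> a
    (6, 5, "amod"),   # book -> thick      (adjectival modifier)
]

# Print each subordinating link in readable form
for head, dep, rel in dependencies:
    print(f"{words[head]:8s} --{rel}--> {words[dep]}")
```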

Constituency trees are more suitable for describing languages with rigid word order; representing discontinuous and non-projective constructions with them requires extending the grammatical formalism used. On the other hand, within this approach constructions with non-subordinating relations are described more naturally. A common difficulty for both approaches is the representation of homogeneous (coordinated) members of a sentence.

Syntactic models in all approaches try to take into account the restrictions imposed on the combination of language units in speech, and in one way or another they use the concept of valence. Valence is the ability of a word or other language unit to attach other units in a certain syntactic way; an actant is a word or syntactic construction that fills such a valence. For example, the Russian verb meaning 'to hand over' has three main valences, which can be expressed by the interrogative words who? to whom? what? Within the generative approach, the valences of words (primarily verbs) are described mainly in the form of special frames (subcategorization frames), and within the dependency tree approach, as government patterns.

Models of the semantics of language are the least developed within CL. For the semantic analysis of sentences, so-called case grammars and semantic cases (valences) are used, on the basis of which the semantics of a sentence is described through the links of the main word (the verb) with its semantic actants, i.e. through semantic cases. For example, the verb 'to hand over' is described by the semantic cases of giver (agent), addressee and object of transfer.

To represent the semantics of an entire text, two logically equivalent formalisms are commonly used (both are developed in detail within AI; see the sketch after this list):

· Predicate calculus formulas expressing properties, states, processes, actions and relationships;

· Semantic networks, i.e. labeled graphs in which vertices correspond to concepts and arcs correspond to the relationships between them.
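
A minimal sketch of the two equivalent representations for the sentence "Ivan gave Maria a book" (the names and relation labels are illustrative assumptions): the same content is encoded as a predicate with arguments and as a set of labeled edges of a semantic network.

```python
# Predicate-calculus style: a predicate name with its semantic actants
predicate_form = ("give", {"agent": "Ivan", "addressee": "Maria", "object": "book"})

# Semantic-network style: labeled edges of a graph whose vertices are concepts
semantic_network = [
    ("give_event", "agent", "Ivan"),
    ("give_event", "addressee", "Maria"),
    ("give_event", "object", "book"),
]

# Both encode the same meaning; converting one into the other is mechanical
name, args = predicate_form
rebuilt = [("give_event", role, filler) for role, filler in args.items()]
print(rebuilt == semantic_network)  # True (dicts preserve insertion order in Python 3.7+)
```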

As for models of pragmatics and discourse, which allow processing not only individual sentences but the text as a whole, the ideas of van Dijk are mainly used to build them. One of the rare successful models is the model of discourse synthesis of coherent texts. Such models must take into account anaphoric references and other discourse-level phenomena.

Concluding the characterization of language models within CL, let us dwell a little longer on the theory of linguistic models "Meaning-Text", within which many fruitful ideas appeared that were ahead of their time and are still relevant.

According to this theory, NL is considered as a special kind of converter that transforms given meanings into the corresponding texts and given texts into the corresponding meanings. Meaning is understood as the invariant of all synonymous transformations of a text. The content of a coherent fragment of speech, without division into phrases and word forms, is represented as a special semantic representation consisting of two components: a semantic graph and information about the communicative organization of the meaning.

The distinctive features of the theory are the following:

o orientation towards the synthesis of texts (the ability to generate correct texts is considered the main criterion of language competence);

o the multi-level, modular nature of the model, with the main levels of the language divided into surface and deep levels: for example, deep (semanticized) and surface ("pure") syntax are distinguished, as well as surface-morphological and deep-morphological levels;

o the integral nature of the language model: the information presented at each level is preserved by the corresponding module performing the transition from that level to the next;

o special means of describing syntactics (rules for combining units) at each level; to describe lexical compatibility, a set of lexical functions was proposed, with the help of which rules of syntactic paraphrasing are formulated;

o emphasis on the dictionary rather than the grammar: the dictionary stores information relating to different levels of the language; in particular, for syntactic analysis, government patterns of words are used, describing their syntactic and semantic valences.

This theory and language model has found its embodiment in the ETAP machine translation system.

4. Linguistic resources

The development of linguistic processors requires an appropriate presentation of linguistic information about the processed NL. This information is displayed in a variety of computer dictionaries and grammars.

Dictionaries are the most traditional form of representing lexical information; they differ in their units (usually words or phrases), their structure, and the vocabulary they cover (dictionaries of terms of a specific problem area, dictionaries of general vocabulary, etc.). The unit of a dictionary is called a dictionary entry; it provides information about the lexeme. Lexical homonyms are usually presented in different dictionary entries.

Morphological dictionaries used for morphological analysis are the most common in CL; their dictionary entries contain morphological information about the corresponding word: part of speech, inflectional class (for inflectional languages), a list of word meanings, etc. Depending on the organization of the linguistic processor, grammatical information, such as word government patterns, can also be added to the dictionary.

There are dictionaries that provide more information about words. For example, the "Meaning-Text" linguistic model essentially relies on an explanatory-combinatorial dictionary, whose entries present, in addition to morphological, syntactic and semantic information (syntactic and semantic valences), information about the lexical compatibility of the word.

A number of linguistic processors use dictionaries of synonyms. A relatively new type of dictionary is the dictionary of paronyms, i.e. outwardly similar words that differ in meaning, such as stranger and alien, or editing and reference.

Another type of lexical resource is the phrase base, in which the most typical phrases of a particular language are collected. Such a base of phrases of the Russian language (about a million units) forms the core of the CrossLexic system.

More complex types of lexical resources are thesauri and ontologies. A thesaurus is a semantic dictionary, i.e. a dictionary that presents the semantic relations of words: synonymy, genus-species relations (sometimes called the above-below relation), part-whole relations, and associations. The spread of thesauri is connected with solving information retrieval problems.

The concept of ontology is closely related to that of a thesaurus. An ontology is a set of concepts and entities of a certain field of knowledge, oriented towards repeated use for various tasks. Ontologies can be created on the basis of the vocabulary existing in a language; in this case they are called linguistic ontologies.

The WordNet system is considered such a linguistic ontology: a large lexical resource collecting the words of the English language (nouns, adjectives, verbs and adverbs) and presenting several types of semantic relations between them. For each of these parts of speech, words are grouped into sets of synonyms (synsets), between which relations of antonymy, hyponymy (the genus-species relation) and meronymy (the part-whole relation) are established. The resource contains about 25 thousand words; the number of hierarchy levels for the genus-species relation averages 6-7, sometimes reaching 15. The upper level of the hierarchy forms a common ontology, i.e. a system of basic concepts about the world.
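
A small sketch of how such relations can be queried programmatically, assuming the NLTK library with its WordNet data package is installed (nltk.download('wordnet') on first use):

```python
from nltk.corpus import wordnet as wn

# Synsets (groups of synonyms) for the noun "dog"
for synset in wn.synsets("dog", pos=wn.NOUN)[:3]:
    print(synset.name(), "-", synset.definition())

dog = wn.synset("dog.n.01")

# Genus-species relation (hypernymy): more general concepts
print("hypernyms:", [s.name() for s in dog.hypernyms()])

# Part-whole relation (meronymy): parts of a dog
print("meronyms:", [s.name() for s in dog.part_meronyms()])

# Walking up the hierarchy to the common ontology at the top
print("root:", [s.name() for s in dog.root_hypernyms()])
```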

Following the scheme of the English WordNet, similar lexical resources for other European languages were built, united under the common name EuroWordNet.

A completely different kind of linguistic resource is the grammar, whose type depends on the syntax model used in the processor. To a first approximation, a grammar is a set of rules expressing the general syntactic properties of words and groups of words. The total number of grammar rules also depends on the syntax model, varying from several tens to several hundreds. In essence, the trade-off between grammar and vocabulary in a language model manifests itself here: the more information is presented in the dictionary, the shorter the grammar can be, and vice versa.

It should be noted that the construction of computer dictionaries, thesauri and grammars is voluminous and time-consuming work, sometimes even more time-consuming than the development of a linguistic model and the corresponding processor. Therefore, one of the important subtasks of CL is automating the construction of linguistic resources.

Computer dictionaries are often formed by converting ordinary text dictionaries, but often much more complex and painstaking work is required to build them. This usually happens when building dictionaries and thesauri for rapidly developing scientific fields - molecular biology, computer science, etc. The source material for extracting the necessary linguistic information can be collections and corpora of texts.

A text corpus is a collection of texts gathered according to a certain principle of representativeness (by genre, authorship, etc.), in which all texts are marked up, i.e. provided with linguistic annotation: morphological, accentual, syntactic, etc. At present there are at least a hundred different corpora, for different NLs and with different annotations; in Russia the best known is the National Corpus of the Russian Language.

Annotated corpora are created by linguists and used both for linguistic research and for tuning (training) the models and processors used in CL with well-known mathematical methods of machine learning. Thus, machine learning is used to tune methods for resolving lexical ambiguity, recognizing parts of speech, and resolving anaphoric references.

Since corpora and collections of texts are always limited in terms of the linguistic phenomena represented in them (and corpora, in addition, are created for a rather long time), recently Internet texts are increasingly considered as a more complete linguistic resource. Undoubtedly, the Internet is the most representative source of modern speech samples, but its use as a corpus requires the development of special technologies.

5. Computational linguistics applications

The field of applications of computational linguistics is constantly expanding, so we will characterize here the most well-known applied problems solved by its tools.

Machine translation is the earliest application of CL, together with which this field itself arose and developed. The first translation programs were built over 50 years ago and were based on the simplest word-by-word translation strategy. However, it was quickly realized that machine translation requires a complete linguistic model taking into account all levels of the language, up to semantics and pragmatics, which repeatedly hampered the development of this direction. A fairly complete model is used in the domestic ETAP system, which translates scientific texts from French into Russian.

Note, however, that in the case of translation into a related language, for example from Spanish to Portuguese or from Russian to Ukrainian (which have much in common in syntax and morphology), the processor can be implemented on the basis of a simplified model, for example, the same word-for-word translation strategy.

Currently, there is a whole range of computer translation systems (of varying quality), from large international research projects to commercial automatic translators. Of significant interest are projects of multilingual translation, using an intermediate language in which the meaning of translated phrases is encoded. Another modern direction is statistical translation, based on the statistics of the translation of words and phrases (these ideas, for example, are implemented in the Google search engine translator).

But despite many decades of development of this whole area, in general, the task of machine translation is still very far from being completely solved.

Another fairly old application of computational linguistics is information retrieval and related tasks of indexing, summarizing, classifying and categorizing documents.

Full-text search of documents in large databases (primarily scientific, technical and business documents) is usually carried out on the basis of their search images, by which a set of keywords is understood: words that reflect the main topic of the document. At first, only individual words of NL were considered as keywords, and the search was carried out without taking into account their inflection, which is not critical for weakly inflectional languages such as English. For inflectional languages, for example Russian, it was necessary to use a morphological model that takes inflection into account.

The search query was also presented as a set of words; suitable (relevant) documents were determined on the basis of the similarity between the query and the search image of the document. Creating the search image of a document involves indexing its text, i.e. identifying its keywords. Since the topic and content of a document are often reflected much more accurately not by individual words but by phrases, phrases began to be considered as keywords as well. This significantly complicated the indexing procedure, since it became necessary to use various combinations of statistical and linguistic criteria to select meaningful phrases in the text.

In fact, information retrieval mainly uses the vector model of text (sometimes called bag of words), in which a document is represented by a vector (set) of its keywords. Modern Internet search engines also use this model, indexing texts by the words they contain (while using very sophisticated ranking procedures to return relevant documents).
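
A minimal sketch of this bag-of-words retrieval scheme, assuming scikit-learn is installed (the documents and query are invented toy examples): documents and a query are mapped to TF-IDF vectors and ranked by cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "machine translation of scientific texts",
    "statistical methods for speech recognition",
    "full text search and document indexing",
]
query = "document search by keywords"

# Build the bag-of-words (TF-IDF weighted) vectors for the document collection
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# The query is represented in the same vector space
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity between the query and document vectors
scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")
```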

The specified text model (with some complications) is also used in the related problems of information retrieval considered below.

Text summarization (abstracting) is the reduction of a text's volume to obtain a summary (a condensed version of its content), which makes searching in document collections faster. A common summary can also be compiled for several thematically related documents.

The main method of automatic summarization is still the selection of the most significant sentences of the text being summarized; for this, the keywords of the text are usually computed first and then a significance coefficient is calculated for each sentence. The selection of significant sentences is complicated by anaphoric links between sentences, which it is undesirable to break; special sentence selection strategies are being developed to solve this problem.
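
A minimal sketch of this extractive approach (plain Python, with a naive word-frequency significance score and no handling of anaphora): each sentence is scored by the frequency of its content words, and the top-scoring sentences are returned in document order.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to", "for", "its", "was", "also", "on"}  # toy stop list

def summarize(text: str, n_sentences: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)  # keyword weights = raw frequencies

    def score(sentence: str) -> float:
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)

    # Keep the n highest-scoring sentences, preserving their original order
    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in ranked)

text = ("Computational linguistics develops programs for processing texts. "
        "Machine translation was its earliest application. "
        "Today search engines also rely on text processing methods.")
print(summarize(text, n_sentences=2))
```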

A task close to summarization is annotation of a document's text, i.e. compiling its annotation. In its simplest form, an annotation is a list of the main topics of the text, which can be identified with the same indexing procedures.

When creating large collections of documents, the tasks of classification and clustering of texts become relevant, in order to form classes of thematically related documents. Classification means assigning each document to a certain class with previously known parameters, while clustering means dividing a set of documents into clusters, i.e. subsets of thematically related documents. To solve these problems, machine learning methods are used, which is why these applied tasks are called Text Mining and belong to the scientific direction known as Data Mining.
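
A minimal sketch of such text classification, assuming scikit-learn is installed (the tiny training set and class labels are invented for illustration): documents are turned into bag-of-words vectors and a naive Bayes classifier assigns a new text to one of two classes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training collection: each document is already labeled with its class
train_texts = [
    "the parser builds a syntactic tree of the sentence",
    "morphological analysis finds the lemma of a word form",
    "the match ended with a late goal by the home team",
    "the striker scored twice in the second half",
]
train_labels = ["linguistics", "linguistics", "sport", "sport"]

# Bag-of-words vectorization followed by a naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["the grammar describes the structure of the sentence"]))  # expected: linguistics
print(model.predict(["the team won the second match"]))                        # expected: sport
```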

Very close to the classification problem is text rubrication: assigning a text to one of previously known thematic rubrics (usually the rubrics form a hierarchical tree of topics).

The classification task is becoming ever more widespread; it is solved, for example, in spam recognition, and a relatively new application is the classification of SMS messages on mobile devices. A new and relevant direction of research for the general task of information retrieval is multilingual document search.

Another relatively new task related to information retrieval is question answering. It is solved by determining the type of the question, searching for texts that potentially contain the answer, and extracting the answer from these texts.

A completely different applied direction, which is developing slowly but steadily, is the automation of the preparation and editing of texts in NL. One of the first applications in this direction were programs for automatic word hyphenation and programs for spelling checks (spellers, or auto-correctors). Despite the apparent simplicity of the hyphenation problem, its correct solution for many NLs (for example, English) requires knowledge of the morphemic structure of the words of the corresponding language, and hence the corresponding dictionary.

Spell checking has long been implemented in commercial systems and relies on an appropriate dictionary and morphology model. An incomplete syntax model is also used, on the basis of which fairly frequent syntactic errors (for example, word agreement errors) are detected. At the same time, the detection of more complex errors, for example the misuse of prepositions, has not yet been implemented in auto-correctors. Many lexical errors are also not detected, in particular errors resulting from typos or from confusing similar words (for example, weight instead of weighty). Modern CL research proposes methods for the automated detection and correction of such errors, as well as of some other types of stylistic errors; these methods use statistics on the occurrence of words and phrases.
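
A minimal sketch of dictionary-based spelling correction (plain Python, with a toy word list as an assumption): a misspelled word is replaced by the dictionary word with the smallest edit (Levenshtein) distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Toy dictionary (a real speller uses a full word-form dictionary of the language)
DICTIONARY = {"linguistics", "language", "grammar", "syntax", "semantics"}

def correct(word: str) -> str:
    if word in DICTIONARY:
        return word
    return min(DICTIONARY, key=lambda w: edit_distance(word, w))

print(correct("sintax"))   # -> syntax
print(correct("grammer"))  # -> grammar
```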

An applied task close to supporting text preparation is computer-assisted language learning; within this direction, computer systems for teaching languages (English, Russian, etc.) are often developed, and similar systems can be found on the Internet. Typically these systems support the study of certain aspects of a language (morphology, vocabulary, syntax) and are based on the corresponding models, for example a morphology model.

As for the study of vocabulary, electronic analogues of ordinary text dictionaries are also used for this (these, in fact, contain no language models). However, multifunctional computer dictionaries with no paper analogues are also being developed, aimed at a wide range of users, for example the CrossLexic dictionary of Russian phrases. This system covers a wide range of vocabulary (words and their admissible combinations) and also provides information on word government patterns, synonyms, antonyms and other semantic correlates of words, which is clearly useful not only for learners of Russian but also for native speakers.

The next application area worth mentioning is automatic generation of texts in NL. In principle this task can be considered a subtask of machine translation, already discussed above; however, the direction has a number of specific tasks. One such task is multilingual generation, i.e. the automatic construction, in several languages, of special documents (patent claims, operating instructions for technical products or software systems) on the basis of their specification in a formal language. Quite detailed language models are used to solve this problem.

An increasingly relevant applied task, often also referred to as Text Mining, is information extraction from texts (Information Extraction), which is required when solving problems of economic and industrial analytics. To do this, certain objects are identified in the NL text: named entities (names, persons, geographical names), their relations, and the events associated with them. As a rule, this is implemented on the basis of partial parsing of the text, which allows processing news feeds from news agencies. Since the task is quite complex not only theoretically but also technologically, the creation of significant systems for extracting information from texts is feasible mainly within commercial companies.
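
A small sketch of the named-entity recognition step, assuming the spaCy library and its small English model are installed (python -m spacy download en_core_web_sm); the sample sentence is invented:

```python
import spacy

# Load a pretrained English pipeline (assumes en_core_web_sm has been downloaded)
nlp = spacy.load("en_core_web_sm")

text = "Google opened a new research office in Zurich, and Sundar Pichai attended the ceremony."
doc = nlp(text)

# Print the named entities found by partial (shallow) analysis of the text
for ent in doc.ents:
    print(ent.text, "-", ent.label_)
# Typical output: Google - ORG, Zurich - GPE, Sundar Pichai - PERSON
```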

The Text Mining direction also includes two other related tasks, opinion mining and sentiment analysis, which attract the attention of a growing number of researchers. The first task searches (in blogs, forums, online stores, etc.) for user opinions about products and other objects and analyzes these opinions. The second is close to the classical task of content analysis of mass-communication texts; it evaluates the general tone of statements.

Another application worth mentioning is supporting dialogue with the user in NL within some information software system. Most often this problem has been solved for specialized databases; in this case the query language is quite limited (lexically and grammatically), which makes it possible to use simplified language models. Queries to the database, formulated in NL, are translated into a formal language, after which the required information is retrieved and the corresponding response phrase is built.

Last in our list of CL applications (but not least in importance) are speech recognition and synthesis. Recognition errors that inevitably arise in these tasks are corrected by automatic methods based on dictionaries and linguistic knowledge of morphology; machine learning is also applied in this area.

Conclusion

Computational linguistics demonstrates quite tangible results in various applications for automatic processing of texts in NL. Its further development depends both on the emergence of new applications and the independent development of various language models, in which many problems have not yet been solved. The most developed are the models of morphological analysis and synthesis. Syntax models have not yet been brought to the level of stable and efficient modules, despite the large number of proposed formalisms and methods. Even less studied and formalized are models of the level of semantics and pragmatics, although automatic processing of discourse is already required in a number of applications. Note that the already existing tools of computational linguistics itself, the use of machine learning and text corpora, can significantly advance the solution of these problems.

Literature

1. Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval, Adison Wesley, 1999.

2. Bateman, J., Zock M. Natural Language Generation. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p.304.

3. Biber, D., Conrad S., and Reppen D. Corpus Linguistics. Investigating Language Structure and Use. Cambridge University Press, Cambridge, 1998.

4. Bolshakov I. A., Gelbukh A. Computational Linguistics: Models, Resources, Applications. Mexico: IPN, 2004.

5. Brown P., Pietra S., Mercer R., Pietra V. The Mathematics of Statistical Machine Translation. // Computational Linguistics, Vol. 19(2): 263-3

6. Carroll J R. Parsing. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 233-248.

7. Chomsky, N. Syntactic Structures. The Hague: Mouton, 1957.

8. Grishman R. Information extraction. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 545-559.

9. Harabagiu, S., Moldovan D. Question Answering. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 560-582.

10. Hearst, M. A. Automated Discovery of WordNet Relations. In: Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database. MIT Press, Cambridge, 1998, p.131-151.

11. Hirst, G. Ontology and the Lexicon. In: Handbook on Ontologies in Information Systems. Berlin, Springer, 2003.

12. Jacquemin C., Bourigault D. Term extraction and automatic indexing // Mitkov R. (ed.): Handbook of Computational Linguistics. Oxford University Press, 2003. p. 599-615.

13. Kilgarriff A., Grefenstette G. Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, V. 29, No. 3, 2003, p. 333-347.

14. Manning, Ch. D., H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

15. Matsumoto Y. Lexical Knowledge Acquisition. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 395-413.

16. The Oxford Handbook on Computational Linguistics. R. Mitkov (Ed.). Oxford University Press, 2005.

17. Oakes, M., Paice C. D. Term extraction for automatic abstracting. Recent Advances in Computational Terminology. D. Bourigault, C. Jacquemin and M. L'Homme (Eds), John Benjamins Publishing Company, Amsterdam, 2001, p. 353-370.

18. Pedersen, T. A decision tree of bigrams is an accurate predictor of word senses. Proc. 2nd Annual Meeting of NAC ACL, Pittsburgh, PA, 2001, p. 79-86.

19. Samuelsson C. Statistical Methods. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 358-375.

20. Salton, G. Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer. Reading, MA: Addison-Wesley, 1988.

21. Somers, H. Machine Translation: Latest Developments. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 512-528.

22. Strzalkowski, T. (ed.) Natural Language Information Retrieval. Kluwer, 19p.

23. Woods W. A. Transition Network Grammars for Natural Language Analysis. Communications of the ACM, V. 13, 1970, No. 10, p. 591-606.

24. WordNet: An Electronic Lexical Database / Ed. Christiane Fellbaum. Cambridge, MIT Press, 1998.

25. Wu J., Yu-Chia Chang Y., Teruko Mitamura T., Chang J. Automatic Collocation Suggestion in Academic Writing // Proceedings of the ACL 2010 Conference Short Papers, 2010.

26. and others. Linguistic support of the ETAP-2 system. Moscow: Nauka, 1989.

27. etc. Data analysis technologies: Data Mining, Visual Mining, Text Mining, OLAP - 2nd ed. - St. Petersburg: BHV-Petersburg, 2008.

28. Bolshakov, Vocabulary - a large electronic dictionary of combinations and semantic connections of Russian words. // Comp. linguistics and intelligence. technologies: Proceedings of int. Conf. "Dialogue 2009". Issue: RGGU, 2009, pp. 45-50.

29. Bolshakova E. I., Bolshakov detection and automated correction of Russian malapropisms // NTI. Ser. 2, No. 5, 2007, pp. 27-40.

30. van Dijk T. A., Kintsch W. A strategy for understanding coherent text // New in Foreign Linguistics. Issue XXIII. M.: Progress, 1988, p. 153-211.

31. Vasiliev V. G., Krivenko M. P. Methods of automated text processing. – M.: IPI RAN, 2008.

32. Winograd T. A program that understands natural language. M.: Mir, 1976.

33. Smooth structure of natural language in automated communication systems. - M., Nauka, 1985.

34. Gusev, V.D., Salomatina dictionary of paronyms: version 2. // NTI, Ser. 2, No. 7, 2001, p. 26-33.

35. Zakharov - space as a language corpus // Computational Linguistics and Intelligent Technologies: Proceedings of Int. Conference Dialogue ‘2005 / Ed. , - M .: Nauka, 2005, p. 166-171.

36. Kasevich of general linguistics. - M., Nauka, 1977.

37. Leontief understanding of texts: Systems, models, resources: Textbook - M.: Academy, 2006.

38. Linguistic Encyclopedic Dictionary / Ed. V. N. Yartseva, Moscow: Soviet Encyclopedia, 1990, 685 p.

39., Saliy for automatic indexing and categorization: development, structure, maintenance. // NTI, Ser. 2, No. 1, 1996.

40. Luger J. Artificial intelligence: strategies and methods for solving complex problems. M., 2005.

41. McKeown K. Discursive strategies for text synthesis in natural language // New in Foreign Linguistics. Issue XXIV. M.: Progress, 1989, p. 311-356.

42. Melchuk I. A. An Essay on the Theory of Linguistic Models "Meaning - Text". M.: Nauka, 1974.

43. National Corpus of the Russian Language. http://*****

44. Khoroshevsky VF OntosMiner: a family of systems for extracting information from multilingual document collections // Ninth National Conference on Artificial Intelligence with International Participation KII-2004. T. 2. - M .: Fizmatlit, 2004, pp. 573-581.


History of the development of computational linguistics

The process of formation and development of modern linguistics as the science of natural language represents a long historical development of linguistic knowledge. Linguistic knowledge is based on elements that were formed in the course of activity inseparably linked with the development of the structure of oral speech, the emergence, further development and improvement of writing, learning to write, and also the interpretation and decoding of texts.

Natural language as the object of linguistics occupies a central place in this science. In the course of the development of language, ideas about it also changed. Whereas earlier no special importance was attached to the internal organization of language, and it was considered primarily in the context of its relationship with the outside world, from the end of the 19th and beginning of the 20th centuries a special role has been assigned to the internal formal structure of language. It was during this period that the famous Swiss linguist Ferdinand de Saussure developed the foundations of such sciences as semiology and structural linguistics, which were set out in detail in his Course in General Linguistics (1916).

Saussure put forward the idea of considering language as a single mechanism, an integral system of signs, which in turn makes it possible to describe language mathematically. He was the first to propose a structural approach to language, namely describing a language by studying the relations between its units. By units, or "signs", he understood the word, which unites meaning and sound. The concept proposed by the Swiss scholar is based on the theory of language as a system of signs consisting of three parts: language (French langue), speech (French parole) and speech activity (French langage).

The scholar himself defined the science he created, semiology, as "a science that studies the life of signs within the life of society". Since language is a sign system, in answer to the question of what place linguistics occupies among the other sciences, Saussure argued that linguistics is part of semiology. It is generally accepted that it was the Swiss philologist who laid the theoretical foundation of the new direction in linguistics, becoming the founder, the "father", of modern linguistics.

The concept put forward by F. de Saussure was further developed in the works of many outstanding scholars: in Denmark by L. Hjelmslev, in the Czech Republic by N. Trubetzkoy, in the USA by L. Bloomfield, Z. Harris and N. Chomsky. As for our country, structural linguistics began its development here in about the same period as in the West, at the turn of the 19th and 20th centuries, in the works of F. Fortunatov and I. Baudouin de Courtenay. It should be noted that I. Baudouin de Courtenay worked closely with F. de Saussure. If Saussure laid the theoretical foundation of structural linguistics, Baudouin de Courtenay can be considered the person who laid the foundations for the practical application of the methods proposed by the Swiss scholar. It was he who defined linguistics as a science using statistical methods and functional dependencies and separated it from philology. The first area where mathematical methods were applied in linguistics was phonology, the science of the sound structure of language.

It should be noted that the postulates put forward by F. de Saussure found reflection in the problems of linguistics that became relevant in the middle of the 20th century. It was in this period that a clear trend towards the mathematization of the science of language emerged. In practically all large countries the rapid development of science and computer technology began, which in turn required ever newer linguistic foundations. The result was a rapid convergence of the exact sciences and the humanities, as well as active interaction between mathematics and linguistics, which found practical application in solving urgent scientific problems.

In the 1950s, at the intersection of such sciences as mathematics, linguistics, computer science and artificial intelligence, a new direction of science arose: computational linguistics (also known as machine linguistics or automatic processing of texts in natural language). The main stages in the development of this direction took place against the backdrop of the evolution of artificial intelligence methods. A powerful impetus to the development of computational linguistics was the creation of the first computers. However, with the advent of a new generation of computers and programming languages in the 1960s, a fundamentally new stage in the development of this science began. It should also be noted that the origins of computational linguistics go back to the works of the famous American linguist N. Chomsky on formalizing the structure of language. The results of his research, obtained at the intersection of linguistics and mathematics, formed the basis of the theory of formal languages and grammars (generative grammars), which is widely used to describe both natural and artificial languages, in particular programming languages. More precisely, this theory is a fully mathematical discipline. It can be considered one of the first results in such a direction of applied linguistics as mathematical linguistics.

The first experiments and developments in computational linguistics concerned the creation of machine translation systems, as well as systems modeling human language abilities. In the late 1980s, with the advent and active development of the Internet, there was rapid growth in the volume of text information available in electronic form. This brought information retrieval technologies to a qualitatively new stage of development. The need arose for automatic processing of texts in natural language, and completely new tasks and technologies appeared. Scientists faced the problem of quickly processing a huge stream of unstructured data. In search of a solution to this problem, great importance was attached to the development and application of statistical methods in automatic text processing. It was with their help that it became possible to solve such problems as dividing texts into clusters united by a common theme, extracting certain fragments from a text, etc. In addition, the use of methods of mathematical statistics and machine learning made it possible to address speech recognition and the creation of search engines.

Scientists did not stop at the results achieved: they continued to set new goals and objectives and to develop new techniques and research methods. All this led to linguistics acting as an applied science combining a number of other sciences, the leading role among which belongs to mathematics with its variety of quantitative methods and the ability to use them for a deeper understanding of the phenomena being studied. Thus mathematical linguistics began its formation and development. At the moment this is a rather "young" science (it has existed for about fifty years); however, despite its very "young age", it is an already established field of scientific knowledge with many successful achievements.

The term "computational linguistics" usually refers to a wide area of ​​using computer tools - programs, computer technologies for organizing and processing data - to model the functioning of a language in certain conditions, situations, problem areas, as well as the scope of computer language models. only in linguistics, but also in related disciplines. Actually, only in the latter case we are talking about applied linguistics in the strict sense, since computer language modeling can also be considered as a field of application of programming theory (computer science) in the field of linguistics. Nevertheless, the general practice is such that the field of computational linguistics covers almost everything related to the use of computers in linguistics: "The term" computational linguistics "sets a general orientation towards the use of computers to solve a variety of scientific and practical problems related to language, without limiting in any way ways of solving these problems.

The institutional aspect of computational linguistics. As a special scientific direction, computational linguistics took shape in the 1960s. The flow of publications in this area is very large. In addition to thematic collections, the journal Computational Linguistics is published quarterly in the USA. A great deal of organizational and scientific work is carried out by the Association for Computational Linguistics, which has regional structures around the world (in particular, a European branch). International conferences on computational linguistics, COLING, are held every two years. The relevant issues are also widely represented at international conferences on artificial intelligence at various levels.

Cognitive toolkit of computational linguistics

Computational linguistics as a special applied discipline is distinguished primarily by its tools, that is, by the use of computer means for processing language data. Since computer programs modeling particular aspects of language functioning can use a variety of programming tools, it might seem that no common metalanguage is needed. However, this is not so. There are general principles of computer modeling of thinking that are implemented in one way or another in any computer model. This language is based on the theory of knowledge developed in artificial intelligence and forming an important branch of cognitive science.

The main thesis of the theory of knowledge states that thinking is a process of processing and generating knowledge. Knowledge itself is treated as an undefined (primitive) category. The human cognitive system acts as a "processor" that processes knowledge. In epistemology and cognitive science, two main types of knowledge are distinguished: declarative ("knowing what") and procedural ("knowing how"). Declarative knowledge is usually presented as a set of propositions, statements about something. A typical example of declarative knowledge is the interpretation of words in ordinary explanatory dictionaries; for example, a cup is "a small rounded drinking vessel, usually with a handle, made of porcelain, faience, etc.". Declarative knowledge lends itself to verification in terms of "true/false". Procedural knowledge is presented as a sequence (list) of operations or actions to be performed, i.e. a general instruction about actions in a certain situation. A typical example of procedural knowledge is an instruction for using a household appliance.

Unlike declarative knowledge, procedural knowledge cannot be verified as true or false; it can be evaluated only by the success or failure of the algorithm.

Most of the concepts of the cognitive toolkit of computational linguistics are homonymous: they simultaneously designate certain real entities of the human cognitive system and ways of representing these entities in certain metalanguages. In other words, the elements of the metalanguage have both an ontological and an instrumental aspect. Ontologically, the division into declarative and procedural knowledge corresponds to different types of knowledge in the human cognitive system. Thus, knowledge about specific objects of reality is mainly declarative, while such functional human abilities as walking, running or driving a car are realized in the cognitive system as procedural knowledge. Instrumentally, knowledge (both ontologically procedural and ontologically declarative) can be represented either as a set of descriptions or as an algorithm, an instruction. In other words, ontologically declarative knowledge about the real-world object "table" can be represented procedurally as a set of instructions or algorithms for its creation and assembly (the creative aspect of procedural knowledge) or as an algorithm for its typical use (the functional aspect of procedural knowledge). In the first case this may be a manual for a novice carpenter, and in the second a description of the possibilities of an office desk. The converse is also true: ontologically procedural knowledge can be represented declaratively.
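
A tiny sketch of this instrumental distinction (the example and all names are illustrative assumptions): the same knowledge about making tea represented declaratively, as a set of checkable statements, and procedurally, as an executable sequence of actions.

```python
# Declarative representation: propositions about tea that can be judged true or false
tea_facts = {
    "tea is brewed with hot water": True,
    "brewing time is about four minutes": True,
    "tea is brewed with ice-cold water": False,
}

# Procedural representation: a sequence of actions; it cannot be "true" or "false",
# only successful or unsuccessful when executed
def brew_tea() -> list:
    steps = []
    steps.append("boil water")
    steps.append("put tea leaves into the pot")
    steps.append("pour hot water over the leaves")
    steps.append("wait about four minutes")
    steps.append("pour the tea into a cup")
    return steps

# Declarative knowledge is verified; procedural knowledge is executed
print(all(value for value in tea_facts.values() if value is not False))
print(brew_tea())
```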

Whether any ontologically declarative knowledge can be represented procedurally, and any ontologically procedural knowledge declaratively, requires a separate discussion. Researchers agree that any declarative knowledge can, in principle, be represented procedurally, although this may turn out to be very uneconomical for a cognitive system. The reverse is hardly true. The point is that declarative knowledge is far more explicit and easier for a person to grasp than procedural knowledge. In contrast to declarative knowledge, procedural knowledge is predominantly implicit. Thus, the language ability, being procedural knowledge, is hidden from a person and is not consciously accessible to him. An attempt to explicate the mechanisms of language functioning leads to dysfunction. Specialists in lexical semantics know, for example, that the prolonged semantic introspection needed to study a word's plane of content leads to the researcher partially losing the ability to distinguish between correct and incorrect uses of the analyzed word. Other examples can be cited: from the point of view of mechanics, the walking human body is a complex system of two interacting pendulums, yet nobody walks by consciously computing that mechanics.

In the theory of knowledge, various knowledge structures are used to study and represent knowledge - frames, scenarios, plans. According to M. Minsky, "a frame is a data structure designed to represent a stereotyped situation" [Minsky 1978, p. 254]. In more detail, a frame can be described as a conceptual structure for the declarative representation of knowledge about a typified, thematically unified situation, containing slots interconnected by certain semantic relations. For purposes of illustration, a frame is often represented as a table whose rows form the slots. Each slot has its own name and content (see Table 1).

Table 1. Fragment of the "table" frame in tabular form

Depending on the specific task, frame structuring can be much more complex; a frame can include nested subframes and references to other frames.

Instead of a table, a predicate form of representation is often used, in which the frame appears as a predicate or a function with arguments. There are other ways to represent a frame as well. For example, it can be represented as a tuple of the following form: ((frame name) (slot name 1) (slot value 1), ..., (slot name n) (slot value n)).

Typically, frames in knowledge representation languages have this form.
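
As a purely illustrative sketch (the class and slot names are assumptions made for this example, not a real knowledge representation language), a frame with named slots, a nested subframe and the tuple form described above might look as follows in Python:

class Frame:
    # A frame is a named set of slots; a slot value may itself be another frame.
    def __init__(self, name, **slots):
        self.name = name
        self.slots = slots

    def __repr__(self):
        return f"Frame({self.name!r})"

    def as_tuple(self):
        # Rough analogue of the tuple form:
        # ((frame name) (slot name 1) (slot value 1), ..., (slot name n) (slot value n))
        return (self.name,) + tuple(self.slots.items())

# Fragment of a hypothetical "table" frame with a nested subframe.
table_top = Frame("table top", material="wood", shape="rectangular")
table = Frame("table",
              is_a="furniture",
              part=table_top,               # nested subframe
              typical_use="working, eating")

print(table.as_tuple())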

Like other cognitive categories of computational linguistics, the concept of a frame is homonymous. Ontologically, it is part of the human cognitive system, and in this sense the frame can be compared with such concepts as gestalt, prototype, stereotype, and schema. In cognitive psychology these categories are considered precisely from the ontological point of view. Thus, D. Norman distinguishes two main modes of existence and organization of knowledge in the human cognitive system - semantic networks and schemas. "Schemas," he writes, "are organized packets of knowledge assembled to represent distinct, self-contained units of knowledge. My schema for Sam may contain information describing his physical features, his activities, and personality traits. This schema correlates with other schemas that describe its other aspects" [Norman 1998, p. 359]. Taken on its instrumental side, the frame is a structure for the declarative representation of knowledge. In current AI systems frames can form complex knowledge structures; frame systems allow hierarchy - one frame can be part of another frame.

In terms of content, the concept of a frame is very close to the category of a dictionary interpretation (definition). Indeed, a slot is an analogue of a valence, and the filling of a slot is an analogue of an actant. The main difference between them is that an interpretation contains only linguistically relevant information about the word's plane of content, whereas a frame, firstly, is not necessarily tied to a word and, secondly, includes all information relevant to the given problem situation, including extralinguistic information (knowledge about the world).

A scenario is a conceptual structure for the procedural representation of knowledge about a stereotyped situation or stereotyped behavior. The elements of a scenario are the steps of an algorithm or instruction. One usually speaks of a "restaurant scenario", a "shopping scenario" and so on (a minimal sketch follows below).
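
Purely as an illustration, and assuming an invented set of steps, a "restaurant scenario" can be written out as a list of procedural steps in Python:

# A scenario as procedural knowledge: an ordered list of steps.
restaurant_scenario = [
    "enter the restaurant",
    "wait to be seated",
    "read the menu",
    "order a meal",
    "eat",
    "pay the bill",
    "leave",
]

def run_scenario(steps):
    # "Executing" the scenario means going through its steps in order.
    for number, step in enumerate(steps, start=1):
        print(f"step {number}: {step}")

run_scenario(restaurant_scenario)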

Originally the frame was also used for procedural representation (cf. the term "procedural frame"), but the term "scenario" is now more commonly used in this sense. A scenario can be represented not only as an algorithm but also as a network whose vertices correspond to certain situations and whose arcs correspond to connections between situations. Along with the concept of a scenario, some researchers use the category of a script for computer modeling of intelligence. According to R. Schank, a script is a generally accepted, well-known sequence of causal relationships. For example, understanding the dialogue

- It's pouring rain outside.

- You still have to go to the store: there's nothing left in the house - the guests ate everything yesterday.

is based on implicit semantic connections of the type "if it is raining, it is undesirable to go outside, because you can get sick". These connections form a script, which native speakers use to understand each other's verbal and non-verbal behavior.
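
The kind of implicit causal links at work here can be sketched, purely for illustration, as a small network; the links themselves are assumptions of the example (Python):

# A script as a network of causal links connecting the two remarks above.
causal_links = {
    "guests ate everything yesterday": ["there is no food in the house"],
    "there is no food in the house": ["someone has to go to the store"],
    "it is pouring rain": ["going outside is undesirable", "you may get wet and fall ill"],
}

def trace(fact, depth=0, seen=None):
    # Follow the causal links a hearer relies on to connect the remarks.
    seen = set() if seen is None else seen
    if fact in seen:
        return
    seen.add(fact)
    print("  " * depth + fact)
    for consequence in causal_links.get(fact, []):
        trace(consequence, depth + 1, seen)

trace("guests ate everything yesterday")
trace("it is pouring rain")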

As a result of applying a scenario to a specific problem situation, a plan is formed. A plan is used for the procedural representation of knowledge about possible actions leading to a specific goal; it relates a goal to a sequence of actions.

In the general case, a plan includes a sequence of procedures that transfer the initial state of the system to a final one and lead to the achievement of a certain subgoal or goal. In AI systems, the plan arises as a result of the planning activity of a corresponding module, the planner. The planning process may be based on adapting data from one or more scenarios, activated by testing procedures, to the problem situation at hand. The plan is executed by an executive module that controls the cognitive procedures and physical actions of the system. In the elementary case, a plan in an intelligent system is a simple sequence of operations; in more complex versions the plan is tied to a specific subject, its resources, capabilities and goals, to detailed information about the problem situation, and so on. The plan emerges in the course of interaction between the model of the world (part of which is formed by scenarios), the planning module and the executive module.

Unlike a scenario, a plan is associated with a specific situation and a specific performer and pursues a specific goal. The choice of plan is governed by the performer's resources. The feasibility of a plan is an obligatory condition of its generation in a cognitive system, whereas the characteristic of feasibility does not apply to a scenario.
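
As a sketch under stated assumptions (the goal, actions and resources are invented for the example), this distinction can be made concrete: a plan ties a goal to a sequence of actions and is generated only if it is feasible given the performer's resources (Python):

from dataclasses import dataclass, field

@dataclass
class Plan:
    goal: str
    actions: list = field(default_factory=list)
    required_resources: set = field(default_factory=set)

    def is_feasible(self, available_resources):
        # A plan is generated only if the performer's resources cover its needs.
        return self.required_resources <= set(available_resources)

shopping_plan = Plan(
    goal="buy food",
    actions=["take an umbrella", "walk to the store", "buy groceries", "return home"],
    required_resources={"money", "umbrella"},
)

performer_resources = {"money", "umbrella", "time"}
print(shopping_plan.is_feasible(performer_resources))   # True: the plan can be generated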

Another important concept is the model of the world. The model of the world is usually understood as a set of knowledge about the world, organized in a certain way, that is inherent in a cognitive system or its computer model. In a somewhat more general sense, the model of the world is spoken of as the part of the cognitive system that stores knowledge about the structure of the world, its regularities, and so on. In another sense, the model of the world is associated with the results of understanding a text or, more broadly, a discourse. In the process of understanding a discourse, its mental model is built, which is the result of the interaction between the content plane of the text and the knowledge about the world possessed by the subject [Johnson-Laird 1988, p. 237 ff.]. The first and second understandings are often combined; this is typical of researchers working within cognitive linguistics and cognitive science.

Closely related to the category of the frame is the concept of a scene. In the literature, the category of a scene is mainly used to designate a conceptual structure for the declarative representation of situations and their parts that are actualized in a speech act and highlighted by linguistic means (lexemes, syntactic constructions, grammatical categories, etc.). Being associated with linguistic forms, a scene is often actualized by a particular word or expression. In plot (story) grammars (see below), a scene appears as part of an episode or narrative. Characteristic examples of scenes are the set of blocks that an AI system manipulates, or the scene of action in a story together with the participants in that action. In artificial intelligence, scenes are used in image recognition systems, as well as in programs focused on the investigation (analysis, description) of problem situations. The concept of a scene has become widespread in theoretical linguistics, as well as in logic, in particular in situational semantics, where the meaning of a lexical unit is directly linked to a scene.

Computational linguistics has today practically exhausted itself. This is directly indicated by the unsuccessful experience of researchers and developers of "intelligent" information products, who have been working for more than half a century on such ambitious tasks as, for example, adequate machine translation or semantic search for information in collections of natural language documents.

The future of machine processing of natural language texts is, of course, seen in the creation and development of supralinguistic technologies capable of analyzing the content of information at the level of semantic understanding of context, much as a person does. However, the creation of "thinking machines" has long been hampered by two main factors: the lack of the necessary methodology and of proper tools for solving two fundamental problems - finding a "formula of meaning" and building a "model of knowledge about the universe" in some formalized, computer-accessible form, without which it is impossible to reproduce the nature of human thinking at the level of a program.

Linguists, together with cyberneticists, have not been able to overcome these problems, since the problems lie beyond the boundaries of their subject specialization; this has significantly slowed the development of such long-awaited applied areas of text processing as "smart" dialogue systems or "semantic" Internet search engines. And machine translation itself still leaves much to be desired.

The experience of scientific and technological progress suggests that the desired breakthrough result is, as a rule, ultimately obtained at the junction of different technological fields and subject disciplines. Apparently, the problem of "machine thinking" will be solved only when we understand exactly how our natural consciousness works in procedural terms, and when we can reliably establish whether those thinking procedures, described to us in the necessary and sufficient detail, lend themselves to final computer algorithmization.

It should be noted that in recent years a new scientific discipline has begun to develop that studies precisely the procedural nature of human mental activity. One can say that a significant breakthrough has now been made in this direction, and we already have a fairly clear picture of how the algorithm of human thinking works. Speaking in general terms, it should first of all be noted that a person thinks not in images, as is usually assumed, but in "models of image behavior" (MPO). Secondly, we think "ontologically", that is, we constantly ask questions, even without noticing it, and permanently look for answers to them (also automatically). Finally, a meaningful understanding of everything that happens around the individual, or in his mind during any contemplation, is achieved with the help of a certain "model representation" of the surrounding universe: the MPOs received in the moment are compared with the ideas about the universe stored in long-term memory. These three pillars make up the entire technology of natural thinking, which now only remains to be translated into a language understandable to programmers in order to obtain the long-awaited result.

When people comprehend a natural language message, they practically never establish an instant correspondence between the stated judgment and the concepts and image behavior models stored in their memory. Each time, they assign to the received (perceived) MPOs the first associative-heuristic correspondence that arises in their minds on the basis of their experience and knowledge, and only then, in the course of further rethinking of the text, do they begin to clarify and concretize the information received. Computational linguistics, by contrast, seeks to establish exact correspondences between the meanings of words and their mutual relations, trying to overcome the ambiguity of the verbal means inherent in any language, which is very different from how our thinking works. After all, a person achieves understanding of speech or text not through knowledge of the morphological loads of words, nor by establishing syntactic links between words, and not even by recognizing the specific meanings (semantics) of words, but precisely through initial associative assumptions and the subsequent "iterative scrolling of the entire context" in order to form a final picture of how the perceived information corresponds to its internal content.