Machine linguistics. History, development and formation of computational linguistics as a scientific direction

At the Faculty of Philology of the Higher School of Economics, a new master's program dedicated to computational linguistics is being launched: it welcomes applicants with a basic education in the humanities or in mathematics, and anyone interested in solving problems in one of the most promising branches of science. Its head, Anastasia Bonch-Osmolovskaya, told Theory and Practice what computational linguistics is, why robots will not replace humans, and what will be taught in the HSE master's program in computational linguistics.

This program is almost the only one of its kind in Russia. Where did you study yourself?

I studied at Moscow State University, at the Department of Theoretical and Applied Linguistics of the Faculty of Philology. I did not get there right away: at first I entered the Russian department, but then I became seriously interested in linguistics, and I was attracted by the atmosphere that remains at the department to this day. The most important thing there is good contact between teachers and students and their mutual interest.

When I had children and had to earn a living, I went into commercial linguistics. In 2005 it was not very clear what this field of activity even was. I worked in various linguistic companies: I started with a small company at the Public.ru website, a media library, where I first began to deal with linguistic technologies. Then I worked for a year at Rosnanotech, where I had the idea of making an analytical portal on which the data would be automatically structured. Then I headed the linguistic department at the Avicomp company, which is already serious production work in the field of computational linguistics and semantic technologies. At the same time, I taught a course in computational linguistics at Moscow State University and tried to make it more modern.

Two resources for a linguist: the National Corpus of the Russian Language, a site created by linguists for scientific and applied research related to the Russian language. It is a model of the Russian language, presented through a huge array of texts from different genres and periods. The texts are provided with linguistic markup, from which one can obtain information about the frequency of particular linguistic phenomena. WordNet is a huge lexical database of English; the main idea of WordNet is to connect into one big network not words but their meanings. WordNet can be downloaded and used for your own projects.

What does computational linguistics do?

This is the most interdisciplinary field. The most important thing here is to understand what is happening in the electronic world and who will help you do specific things.

We are surrounded by a huge amount of digital information, and there are many business projects whose success depends on processing it; these projects can be related to marketing, politics, economics, whatever. It is very important to be able to handle this information effectively: what matters is not only the speed of processing but also the ease with which you can filter out the noise, get the data you need, and assemble a complete picture from it.

Previously, computational linguistics was associated with some global ideas, for example: people thought that machine translation would replace human translation, that robots would work instead of people. Now that seems like a utopia, and machine translation is used in search engines for quick search in an unfamiliar language. That is, linguistics now rarely deals with abstract tasks - mostly with small things that can be inserted into a large product to make money.

One of the big tasks of modern linguistics is the semantic web, where search happens not simply by word matching but by meaning, and all sites are in some way marked up semantically. This can be useful, for example, for police or medical reports, which are written every day. Analyzing the internal connections in them yields a lot of necessary information, and reading and tallying it manually takes incredibly long.

In a nutshell: we have a thousand texts, we need to sort them into piles, represent each text as a structure, and get a table we can actually work with. This is called unstructured information processing. On the other hand, computational linguistics also deals, for example, with creating artificial texts. One company came up with a mechanism for generating texts on topics that are boring for a person to write about: changes in real estate prices, weather forecasts, reports on football matches. Ordering these texts from a person is much more expensive, and computer texts on such topics are written in coherent human language.
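As an illustration of what "sorting a thousand texts into piles" can look like in practice, here is a minimal sketch using the scikit-learn library; the sample texts and the number of clusters are assumptions chosen for the example.

```python
# A minimal sketch of unstructured text processing: cluster documents
# into "piles" and represent each as a row in a table of cluster labels.
# Assumes scikit-learn is installed; texts and k are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = [
    "Real estate prices rose by two percent in March",
    "The home team won the football match three to one",
    "Rain and strong wind are expected tomorrow",
    "Apartment prices in the city center keep climbing",
]

vectorizer = TfidfVectorizer()   # turn each text into a weighted word-frequency vector
matrix = vectorizer.fit_transform(texts)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(matrix)

for text, label in zip(texts, kmeans.labels_):
    print(label, text[:50])      # a simple "table": cluster id + text
```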

In Russia, Yandex is actively engaged in developing search over unstructured information, and Kaspersky Lab hires research groups that study machine learning. Is anyone else on the market trying to come up with something new in computational linguistics?

**Books on Computational Linguistics:**

Daniel Jurafsky, James H. Martin, Speech and Language Processing

Christopher Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval

Yakov Testelets, "Introduction to General Syntax"

Most linguistic developments are the property of large companies; almost nothing can be found in open access. This hinders the development of the industry: we have no free linguistic market, no boxed solutions.

Moreover, there is a lack of comprehensive information resources. There is a project like the National Corpus of the Russian Language. It is one of the best national corpora in the world, it is developing rapidly and opens up incredible opportunities for scientific and applied research. The difference is about the same as in biology before and after DNA research.

But many resources simply do not exist for Russian. Thus, there is no analogue of such a wonderful English-language resource as FrameNet: a conceptual network in which all possible connections of a particular word with other words are formally presented. For example, take the word "fly": who can fly, where, with what prepositions the word is used, what words it combines with, and so on. This resource helps connect language with real life, that is, trace how a particular word behaves at the level of morphology and syntax. It is very useful.

Avicomp is currently developing a plug-in for finding related articles. That is, if you are interested in some article, you can quickly see the history of the story: when the topic arose, what was written, and when the peak of interest in the problem was. For example, with this plugin it will be possible, starting from an article on events in Syria, to see very quickly how events unfolded there over the past year.

How will the learning process in the master's program be structured?

Education at HSE is organized into separate modules, as in Western universities. Students will be divided into small teams, mini-startups: at the end we should get several finished projects. We want to get real products, which we will then open up to people and leave in the public domain.

In addition to the direct supervisors of student projects, we want to find curators for them from among potential employers - from the same Yandex, for example - who will also play this game and give the students advice.

I hope that people from very different fields will come: programmers, linguists, sociologists, marketers. We will have several adaptation courses in linguistics, mathematics and programming. Then there will be two serious courses in linguistics connected with the most relevant linguistic theories; we want our graduates to be able to read and understand contemporary linguistic articles. It is the same with mathematics. We will have a course called "Mathematical Foundations of Computational Linguistics", which will present the branches of mathematics on which modern computational linguistics rests.

To enroll in the master's program, you need to pass an entrance language examination and a portfolio competition.

In addition to the main courses, there will be a line of elective subjects. We have planned several cycles: two of them focus on deeper study of individual topics, including, for example, machine translation and corpus linguistics, and one, on the contrary, is related to adjacent areas, such as social networks, machine learning, or Digital Humanities - a course that we hope will be taught in English.

Computational linguists develop text and speech recognition algorithms, synthesize artificial speech, create semantic translation systems, and work on artificial intelligence itself (in the classical sense of the word, as a replacement for human intelligence, it is unlikely ever to appear, but various expert systems based on data analysis will).

Speech recognition algorithms will be used more and more in everyday life: smart homes and electronic devices will have no remotes and buttons; a voice interface will be used instead. This technology is being perfected, but there are still many challenges: it is difficult for a computer to recognize human speech, because different people speak very differently. Therefore, as a rule, recognition systems work well either when trained for one speaker and already adjusted to his pronunciation, or when the number of phrases the system can recognize is limited (as, for example, in voice commands for a TV).
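To make the "limited set of phrases" approach concrete, here is a minimal sketch: the recognizer's possibly noisy text output is snapped to the closest phrase in a fixed command vocabulary. The command list and the similarity cutoff are assumptions for the example.

```python
# A minimal sketch of limited-vocabulary command recognition:
# map noisy recognizer output onto a small fixed set of phrases.
import difflib

COMMANDS = ["turn on the tv", "turn off the tv", "volume up", "volume down"]

def match_command(recognized_text: str) -> str | None:
    """Return the closest known command, or None if nothing is close enough."""
    matches = difflib.get_close_matches(recognized_text.lower(), COMMANDS, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(match_command("turn of the tv"))   # -> "turn off the tv"
print(match_command("open the window"))  # -> None
```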

Specialists creating semantic translation programs still have a lot of work ahead of them: at the moment, good algorithms have been developed only for translation into and out of English. There are many problems here: different languages are organized differently semantically, they differ even at the level of phrase construction, and not all meanings of one language can be conveyed by the semantic apparatus of another. In addition, the program must distinguish homonyms, correctly recognize parts of speech, and select the meaning of a polysemous word appropriate to the context.
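Selecting the contextually appropriate sense of a polysemous word is the word-sense disambiguation task. Here is a minimal sketch of one classic approach, the Lesk algorithm as implemented in NLTK (assuming NLTK and its WordNet data are installed):

```python
# A minimal word-sense disambiguation sketch using NLTK's Lesk algorithm:
# pick the WordNet sense of "bank" that best overlaps with the context.
from nltk.wsd import lesk

# nltk.download("wordnet") may be needed on first run
context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank")

print(sense, "-", sense.definition())
```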

Synthesizing artificial speech (for example, for home robots) is also painstaking work. It is difficult to make artificially created speech sound natural to the human ear, because there are millions of nuances we pay no attention to but without which everything is no longer "right": false starts, pauses, hesitations, and so on. The speech stream is continuous and at the same time discrete: we speak without pausing between words, yet we have no difficulty understanding where one word ends and another begins, while for a machine this is a big problem.

The biggest direction in computational linguistics is connected with Big Data. There are huge corpora of texts, such as news feeds, from which certain information must be isolated - for example, highlighting newsworthy events or tailoring an RSS feed to the tastes of a particular user. Such technologies already exist and will keep developing, because computing power is growing rapidly. Linguistic analysis of texts is also used to ensure security on the Internet and to search for information needed by intelligence services.

Where can one study to become a computational linguist? In Russia, unfortunately, there is quite a strong division between specialties related to classical linguistics and those related to programming, statistics, and data analysis. To become a digital linguist, you need to understand both. Foreign universities offer higher-education programs in computational linguistics, but for us the best option so far is to get a basic linguistic education and then master the fundamentals of IT. It is good that there are now many different online courses; unfortunately, in my student days there was nothing of the kind. I studied at the Faculty of Applied Linguistics at Moscow State Linguistic University, where we had courses in artificial intelligence and speech recognition - but still not enough. Now IT companies are actively trying to interact with educational institutions. My colleagues from Kaspersky Lab and I also try to participate in the educational process: we give lectures, hold student conferences, and give grants to graduate students. But for now the initiative comes more from employers than from universities.

COURSE WORK

in the discipline "Informatics"

on the topic: "Computational Linguistics"


INTRODUCTION

1. Place and role of computational linguistics in linguistic research

2. Modern interfaces of computational linguistics

CONCLUSION

LITERATURE


Introduction

Automated information technologies play an important role in the life of modern society, and their importance continuously grows with time. But the development of information technology is very uneven: while the modern level of computer technology and means of communication is striking, successes in the field of semantic information processing are much more modest. These successes depend, first of all, on achievements in the study of the processes of human thinking and speech communication between people, and on the ability to model these processes on a computer.

When it comes to creating promising information technologies, the problems of automatic processing of text information presented in natural languages come to the fore. This is determined by the fact that a person's thinking is closely connected with his language. Moreover, natural language is a tool of thinking. It is also a universal means of communication between people - a means of perceiving, accumulating, storing, processing and transmitting information. The problems of using natural language in automatic information processing systems are dealt with by the science of computational linguistics. This science arose relatively recently, at the turn of the 1950s and 1960s. Over the past half century, significant scientific and practical results have been obtained in computational linguistics: machine translation systems for translating texts from one natural language to another, systems for automated information search in texts, systems for automatic analysis and synthesis of oral speech, and many others. This work is devoted to the construction of an optimal computer interface using computational linguistics in linguistic research.


1. Place and role of computational linguistics in linguistic research

In the modern world, computational linguistics is increasingly used in various linguistic studies.

Computational linguistics is a field of knowledge concerned with solving problems of automatic processing of information presented in natural language. Its central scientific problems are the problem of modeling the process of understanding the meaning of texts (the transition from a text to a formalized representation of its meaning) and the problem of speech synthesis (the transition from a formalized representation of meaning to texts in natural language). These problems arise in solving a number of applied tasks, in particular: automatic detection and correction of errors when entering texts into a computer, automatic analysis and synthesis of oral speech, automatic translation of texts from one language to another, communication with a computer in natural language, automatic classification and indexing of text documents, their automatic abstracting, and searching for documents in full-text databases.

The linguistic tools created and used in computational linguistics can be conditionally divided into two parts: declarative and procedural. The declarative part includes dictionaries of language and speech units, texts, and various kinds of grammar tables; the procedural part includes the means of manipulating those language and speech units, texts, and grammar tables. The computer interface belongs to the procedural part of computational linguistics.
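The split can be made concrete in code: the declarative part is pure data (a dictionary), the procedural part is the operations over it. A minimal sketch with an invented toy lexicon:

```python
# A minimal sketch of the declarative/procedural split in linguistic tools.
# Declarative part: a toy lexicon (pure data, invented for the example).
LEXICON = {
    "run":  {"pos": "verb", "stem": "run"},
    "runs": {"pos": "verb", "stem": "run"},
    "dog":  {"pos": "noun", "stem": "dog"},
    "dogs": {"pos": "noun", "stem": "dog"},
}

# Procedural part: operations that manipulate the declarative data.
def analyze(word: str) -> dict | None:
    """Look a word form up in the lexicon and return its description."""
    return LEXICON.get(word.lower())

def stems(text: str) -> list[str]:
    """Reduce a text to the stems of the word forms the lexicon knows."""
    return [entry["stem"] for w in text.split() if (entry := analyze(w))]

print(stems("Dogs run"))  # -> ['dog', 'run']
```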

Success in solving applied problems of computational linguistics depends, first of all, on the completeness and accuracy with which declarative means are represented in computer memory and on the quality of the procedural means. The required level of solving these problems has not yet been reached, although work in computational linguistics is being carried out in all developed countries of the world (Russia, the USA, England, France, Germany, Japan, and others).

Nevertheless, serious scientific and practical achievements in computational linguistics can be noted. In a number of countries (Russia, the USA, Japan, and others), experimental and industrial systems for machine translation of texts from one language to another have been built, as have a number of experimental systems for communicating with computers in natural language; work is under way on terminological data banks, thesauri, and bilingual and multilingual machine dictionaries (Russia, the USA, Germany, France, and others); systems for automatic analysis and synthesis of oral speech are being built (Russia, the USA, Japan, and others); and research continues on building models of natural languages.

An important methodological problem of applied computational linguistics is the correct assessment of the necessary balance between the declarative and procedural components of automatic text processing systems. What should be preferred: powerful computational procedures resting on relatively small dictionary systems with rich grammatical and semantic information, or a powerful declarative component with relatively simple computer interfaces? Most scientists believe the second way is preferable. It will lead to practical goals more quickly, since it presents fewer dead ends and hard-to-overcome obstacles, and computers can be used there on a larger scale to automate research and development.

The need to concentrate efforts primarily on developing the declarative component of automatic text processing systems is confirmed by half a century of experience in computational linguistics: despite the indisputable successes of this science, the enthusiasm for algorithmic procedures has not brought the expected success, and some disappointment in the possibilities of procedural means has even set in.

In the light of the foregoing, a promising path for the development of computational linguistics is one in which the main efforts are directed at creating powerful dictionaries of language and speech units, studying their semantic-syntactic structure, and creating basic procedures for morphological, semantic-syntactic and conceptual analysis and synthesis of texts. This will make it possible to solve a wide range of applied problems in the future.

Computational linguistics faces, first of all, the tasks of linguistic support for the processes of collecting, accumulating, processing and searching for information. The most important of them are:

1. Automation of compilation and linguistic processing of machine dictionaries;

2. Automation of the processes of detecting and correcting errors when entering texts into a computer;

3. Automatic indexing of documents and information requests;

4. Automatic classification and abstracting of documents;

5. Linguistic support of information search processes in monolingual and multilingual databases;

6. Machine translation of texts from one natural language to another;

7. Construction of linguistic processors that provide users with communication with automated intelligent information systems (in particular, with expert systems) in natural language, or in a language close to natural;

8. Extraction of factual information from non-formalized texts.

Let us dwell in detail on the problems most relevant to the research topic.

In the practical work of information centers, there is a need to solve the problem of automated detection and correction of errors in texts as they are entered into a computer. This complex task can be conditionally divided into three tasks: spelling, syntactic, and semantic control of texts. The first can be solved by a morphological analysis procedure using a fairly powerful reference machine dictionary of word stems. During spelling control, the words of the text undergo morphological analysis, and if their stems are identified with stems in the reference dictionary, they are considered correct; if not, they are presented to a human, accompanied by a micro-context, for review. The human detects and corrects the distorted words, and the corresponding software system applies these corrections to the text.
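A minimal sketch of this spelling-control loop (the stem dictionary and the crude stemming rule are assumptions for the example; a real system would use full morphological analysis):

```python
# A minimal spelling-control sketch: words whose stems are found in the
# reference dictionary pass; the rest are flagged for human review
# together with a micro-context.
REFERENCE_STEMS = {"the", "linguist", "process", "comput", "text"}  # toy dictionary

def stem(word: str) -> str:
    """A crude stand-in for morphological analysis: strip common endings."""
    for suffix in ("ics", "ing", "ion", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def spelling_control(words: list[str]) -> list[tuple[str, str]]:
    """Return (word, micro-context) pairs that need human review."""
    flagged = []
    for i, word in enumerate(words):
        if stem(word.lower()) not in REFERENCE_STEMS:
            context = " ".join(words[max(0, i - 2): i + 3])
            flagged.append((word, context))
    return flagged

print(spelling_control("the linguist procesed the text".split()))
# -> [('procesed', 'the linguist procesed the text')]
```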

The task of syntactic control of texts for error detection is much more difficult than spelling control: first, because it includes spelling control as a mandatory component, and second, because the problem of syntactic analysis of unrestricted texts has not yet been fully solved. Nevertheless, partial syntactic control of texts is quite possible. Two ways are open here: either compile sufficiently representative machine dictionaries of reference syntactic structures and compare the syntactic structures of the analyzed text with them, or develop a complex system of rules for checking the grammatical consistency of text elements. The first way seems more promising to us, although it certainly does not exclude using elements of the second. The syntactic structure of texts should be described in terms of grammatical classes of words (more precisely, as sequences of sets of grammatical information about words).
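The first way can be sketched as matching a sentence's sequence of grammatical classes against a dictionary of reference structures (the tag sequences and the toy tagger below are invented for the illustration):

```python
# A minimal sketch of syntactic control via reference structures:
# the sentence's part-of-speech sequence must match a known pattern.
REFERENCE_STRUCTURES = {          # toy dictionary of valid POS sequences
    ("DET", "NOUN", "VERB"),
    ("DET", "NOUN", "VERB", "DET", "NOUN"),
}

TOY_TAGGER = {"the": "DET", "a": "DET", "dog": "NOUN",
              "cat": "NOUN", "sees": "VERB", "barks": "VERB"}

def syntactic_control(sentence: str) -> bool:
    """True if the sentence's POS sequence matches a reference structure."""
    tags = tuple(TOY_TAGGER.get(w, "UNK") for w in sentence.lower().split())
    return tags in REFERENCE_STRUCTURES

print(syntactic_control("The dog sees a cat"))  # True
print(syntactic_control("Dog the sees cat a"))  # False
```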

The task of semantic control of texts, that is, detecting semantic errors in them, belongs to the class of artificial intelligence tasks. In full, it can be solved only by modeling the processes of human thinking, which will apparently require creating powerful encyclopedic knowledge bases and software tools for manipulating knowledge. Nevertheless, for limited subject areas and for formalized information this problem is quite solvable. It should be posed and solved as a task of semantic-syntactic control of texts.

The problem of automating the indexing of documents and queries is traditional for automated text search systems. At first, indexing was understood as the process of assigning documents and queries classification indices reflecting their thematic content. Later the concept was transformed, and the term "indexing" came to denote the process of translating descriptions of documents and queries from natural language into a formalized one, in particular into the language of "search images". Search images of documents came to be formed, as a rule, as lists of keywords and phrases reflecting their thematic content, and search images of queries as logical structures in which keywords and phrases are connected by logical and syntactic operators.

Automatic indexing of documents is conveniently carried out on the texts of their abstracts (when available), since an abstract reflects the main content of the document in concentrated form. Indexing can be done with or without thesaurus control. In the first case, keywords and phrases from the reference machine dictionary are searched for in the text of the document's title and abstract, and only those found in the dictionary are included in the search image of the document. In the second case, keywords and phrases are extracted from the text and included in the search image regardless of whether they belong to any reference dictionary. A third variant was also implemented, in which the search image included, along with terms from the machine thesaurus, terms extracted from the title and the first sentence of the document's abstract. Experiments have shown that search images compiled automatically from titles and abstracts provide greater search completeness than manually compiled ones. This is explained by the fact that an automatic indexing system reflects the various aspects of document content more fully than manual indexing does.
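A minimal sketch of indexing with thesaurus control (the reference dictionary and the sample document are toy assumptions):

```python
# A minimal sketch of automatic indexing with thesaurus control:
# only terms present in the reference dictionary enter the search image.
REFERENCE_DICTIONARY = {"machine translation", "information retrieval",
                        "indexing", "thesaurus"}  # toy thesaurus

def search_image(title: str, abstract: str) -> set[str]:
    """Build a document's search image from its title and abstract."""
    text = (title + " " + abstract).lower()
    # keep only those dictionary terms that actually occur in the text
    return {term for term in REFERENCE_DICTIONARY if term in text}

doc_image = search_image(
    "Advances in machine translation",
    "We discuss indexing of parallel corpora for information retrieval.",
)
print(doc_image)  # {'machine translation', 'indexing', 'information retrieval'}
```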

Automatic indexing of queries raises roughly the same problems as automatic indexing of documents. Here, too, keywords and phrases must be extracted from the text, and the words of the query normalized. Logical links between keywords and phrases, and contextual operators, can be entered manually or by an automated procedure. An important element of automatic query indexing is supplementing its keywords and phrases with their synonyms and hyponyms (sometimes also hypernyms and other terms associated with the original query terms). This can be done automatically or interactively using a machine thesaurus.
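A minimal query-expansion sketch using WordNet through NLTK as the machine thesaurus (assuming NLTK and its WordNet data are installed):

```python
# A minimal query-expansion sketch: enrich a query term with synonyms
# and hyponyms drawn from WordNet acting as the machine thesaurus.
from nltk.corpus import wordnet as wn

# nltk.download("wordnet") may be needed on first run
def expand(term: str) -> set[str]:
    """Collect synonyms and direct hyponyms of a query term."""
    expansion = set()
    for synset in wn.synsets(term):
        expansion.update(l.name().replace("_", " ") for l in synset.lemmas())
        for hyponym in synset.hyponyms():
            expansion.update(l.name().replace("_", " ") for l in hyponym.lemmas())
    return expansion

print(sorted(expand("dictionary"))[:10])
```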

We have already partially considered the problem of automating documentary information search in connection with automatic indexing. The most promising approach is searching documents by their full texts, since using any kind of substitute (bibliographic descriptions, search images of documents, abstract texts) leads to information loss during search. The greatest losses occur when bibliographic descriptions are used as substitutes for the primary documents, the smallest when abstracts are used.

Important qualities of information retrieval are its completeness (recall) and accuracy (precision). Completeness can be ensured by taking into account as fully as possible the paradigmatic connections between units of language and speech (words and phrases), and accuracy by taking into account their syntagmatic connections. There is an opinion that completeness and accuracy are inversely related: measures that improve one of these characteristics degrade the other. But this is true only for a fixed search logic; if the logic is improved, both characteristics can improve simultaneously.
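These two quantities are computed in the standard way; a small sketch over invented document sets:

```python
# Completeness (recall) and accuracy (precision) of a search result,
# computed over toy sets of document identifiers.
def precision_recall(retrieved: set[int], relevant: set[int]) -> tuple[float, float]:
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}      # documents the system returned (invented)
relevant = {2, 3, 5, 6, 7}    # documents actually relevant (invented)
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```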

The search for information in full-text databases should be built as interactive communication between the user and the information retrieval system (IPS), in which the user successively looks through text fragments (paragraphs, sections) that satisfy the logical conditions of the query and selects those of interest. Both the full texts of documents and any fragments of them can be returned as the final search results.

As can be seen from the foregoing, automatic information search has to overcome the language barrier that arises between the user and the IPS because of the variety of forms in which the same meaning is expressed in texts. This barrier becomes even more significant when searching multilingual databases. A cardinal solution here can be machine translation of document texts from one language into another, done either in advance, before documents are loaded into the search engine, or during the search itself. In the latter case, the user's query must be translated into the language of the document collection being searched, and the search results into the language of the query. Search engines of this kind are already operating on the Internet. At VINITI RAS, the Cyrillic Browser system was built, which makes it possible to search Russian-language texts with queries in English, with the search results also displayed in the user's language.

An important and promising task of computational linguistics is the construction of linguistic processors that let users communicate with intelligent automated information systems (in particular, expert systems) in natural language or in a language close to natural. Since information is stored in modern intelligent systems in formalized form, linguistic processors, acting as intermediaries between a person and a computer, must solve two main tasks: 1) the transition from the texts of input queries and messages in natural language to a representation of their meaning in a formalized language (when entering information into the computer); 2) the transition from a formalized representation of the meaning of output messages to their representation in natural language (when giving information to a person). The first task should be solved by morphological, syntactic and conceptual analysis of the input queries and messages, the second by conceptual, syntactic and morphological synthesis of the output messages.

Conceptual analysis of information queries and messages consists in identifying their conceptual structure (the boundaries of concept names and the relations between concepts in the text) and translating this structure into a formalized language. It is carried out after the morphological and syntactic analysis of the queries and messages. Conceptual synthesis of messages consists in the transition from the representation of the elements of their structure in a formalized language to a verbal representation, after which the messages are given the necessary syntactic and morphological form.

Machine translation of texts from one natural language to another requires dictionaries of translation correspondences between the names of concepts. Knowledge of such translation correspondences was accumulated by many generations of people and compiled in special editions - bilingual or multilingual dictionaries. For specialists with some knowledge of foreign languages, these dictionaries served as valuable aids in translating texts.

In traditional general-purpose bilingual and multilingual dictionaries, translation equivalents were indicated mainly for individual words, and much less often for phrases. Indicating translation equivalents for phrases was more typical of specialized terminological dictionaries. Therefore, when translating text segments containing polysemous words, students often had difficulties.

Below are translation correspondences between several pairs of English and Russian phrases on "school" topics.

1) The bat looks like a mouse with wings - Летучая мышь выглядит как мышь с крыльями.

2) Children like to play in the sand on the beach - Дети любят играть в песке на пляже.

3) A drop of rain fell on my hand - Капля дождя упала мне на руку.

4) Dry wood burns easily - Сухое дерево хорошо горит.

5) He pretended not to hear me - Он сделал вид, что не слышит меня.

Here the English phrases are not idiomatic expressions. Nevertheless, their translation into Russian can only with some stretch be considered simple word-for-word translation, since almost all the words in them are polysemous. Therefore, only the achievements of computational linguistics can help students here.


COMPUTATIONAL LINGUISTICS, a direction in applied linguistics focused on the use of computer tools - programs, computer technologies for organizing and processing data - for modeling the functioning of a language in certain conditions, situations and problem areas, and also the whole sphere of application of computer language models in linguistics and related disciplines. Strictly speaking, only in the latter case are we talking about applied linguistics in the narrow sense, since computer modeling of a language can also be considered a sphere of application of computer science and programming theory to solving problems of the science of language. In practice, however, almost everything related to the use of computers in linguistics is called computational linguistics.

As a special scientific direction, computational linguistics took shape in the 1960s. The Russian term "компьютерная лингвистика" is a calque of the English computational linguistics. Since the adjective computational can also be rendered in Russian as "вычислительный", the term "вычислительная лингвистика" is also found in the literature, but in Russian science it has acquired a narrower meaning, approaching the concept of "quantitative linguistics". The flow of publications in this area is very large. In addition to thematic collections, the journal Computational Linguistics is published quarterly in the United States. A great deal of organizational and scientific work is carried out by the Association for Computational Linguistics, which has regional structures (in particular, a European branch). International conferences on computational linguistics, COLING, are held every two years. The relevant issues are also usually well represented at various conferences on artificial intelligence.

Toolkit of Computational Linguistics.

Computational linguistics as a special applied discipline is distinguished above all by its instrument - the use of computer tools for processing language data. Since computer programs modeling particular aspects of the functioning of a language can use a wide variety of programming tools, it might seem that there is no point in speaking of a general conceptual apparatus of computational linguistics. However, this is not so. There are general principles of computer modeling of thinking that are implemented, in one way or another, in any computer model. They rest on the theory of knowledge, which was originally developed within artificial intelligence and later became one of the branches of cognitive science. The most important conceptual categories of computational linguistics are knowledge structures such as "frames" (conceptual structures for the declarative representation of knowledge about a typified, thematically unified situation), "scenarios" (conceptual structures for the procedural representation of knowledge about a stereotypical situation or stereotypical behavior), and "plans" (knowledge structures that capture ideas about possible actions leading to a specific goal). Closely related to the category of frame is the concept of "scene", used in the computational linguistics literature mainly to denote a conceptual structure for the declarative representation of situations and their parts that are actualized in a speech act and highlighted by linguistic means (lexemes, syntactic constructions, grammatical categories, etc.).
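A frame is usually implemented as a structure with named slots and default values. A minimal sketch of a frame for a typified "lecture" situation (the slots and defaults are invented for the example):

```python
# A minimal sketch of a frame: a declarative structure with named slots
# describing a typified situation. Slot names and defaults are invented.
from dataclasses import dataclass

@dataclass
class LectureFrame:
    lecturer: str | None = None          # slot: who gives the lecture
    audience: str = "students"           # slot with a default filler
    place: str = "lecture hall"          # slot with a default filler
    topic: str | None = None             # slot: what the lecture is about

    def unfilled_slots(self) -> list[str]:
        """Slots a dialogue system would still need to ask about."""
        return [name for name, value in vars(self).items() if value is None]

frame = LectureFrame(topic="computational linguistics")
print(frame.unfilled_slots())  # -> ['lecturer']
```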

A certain organized set of knowledge structures forms the "model of the world" of a cognitive system and of its computer model. In artificial intelligence systems, the model of the world forms a special block that, depending on the chosen architecture, may include general knowledge about the world (in the form of simple propositions such as "it is cold in winter" or production rules such as "if it is raining outside, you need to put on a raincoat or take an umbrella"), specific facts ("the highest peak in the world is Everest"), and also values and their hierarchies, sometimes singled out into a special "axiological block".
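Production rules of the "if condition, then action" kind are directly executable. A minimal forward-chaining sketch over the article's own examples (the fact and rule encodings are assumptions):

```python
# A minimal sketch of a world model with production rules:
# facts are propositions, rules fire when their condition holds.
facts = {"it is raining outside"}  # current state of the world (invented encoding)

rules = [
    ("it is raining outside", "put on a raincoat or take an umbrella"),
    ("it is cold in winter", "dress warmly"),
]

def fire_rules(facts: set[str]) -> list[str]:
    """Forward chaining: collect the actions of all rules whose condition is a known fact."""
    return [action for condition, action in rules if condition in facts]

print(fire_rules(facts))  # -> ['put on a raincoat or take an umbrella']
```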

Most elements of the conceptual apparatus of computational linguistics are homonymous: they simultaneously denote certain real entities of the human cognitive system and the ways of representing those entities used in their theoretical description and modeling. In other words, the elements of the conceptual apparatus of computational linguistics have ontological and instrumental aspects. For example, in the ontological aspect, the separation of declarative and procedural knowledge corresponds to different types of knowledge that a person has - knowledge of WHAT (declarative; for example, knowledge of the postal address of some person NN), on the one hand, and knowledge of HOW (procedural; for example, knowledge that allows one to find that person's apartment even without knowing the formal address), on the other. In the instrumental aspect, knowledge can be embodied in a set of descriptions, in a data set, on the one hand, and in an algorithm or instruction executed by a computer or some other model of a cognitive system, on the other.

Directions of Computational Linguistics.

The sphere of computational linguistics is very diverse and includes such areas as computer modeling of communication, modeling of plot structure, hypertext technologies for text presentation, machine translation, and computer lexicography. In a narrow sense, the problems of computational linguistics are often associated with an interdisciplinary applied area with the somewhat unfortunate name "natural language processing". It arose in the late 1960s and developed within the scientific and technological discipline of "artificial intelligence". By its inner form, the phrase "natural language processing" covers all areas in which computers are used to process language data. In practice, however, a narrower understanding of the term has taken hold: the development of methods, technologies and specific systems that ensure communication between a person and a computer in natural or restricted natural language.

The rapid development of natural language processing came in the 1970s with an unexpected exponential growth in the number of end users of computers. Since it is impossible to teach all users programming languages and technologies, the problem of organizing their interaction with computer programs arose. The solution to this communication problem followed two main paths. In the first case, attempts were made to adapt programming languages and operating systems to the end user. As a result, high-level languages such as Visual Basic appeared, as well as convenient operating systems built in the conceptual space of metaphors familiar to humans - DESK, LIBRARY. The second path was the development of systems that allow interacting with a computer in a specific problem area in natural language or some restricted version of it.

The architecture of natural language processing systems generally includes a block for analyzing the user's speech message, a block for interpreting the message, a block for generating the meaning of the answer, and a block for synthesizing the surface structure of the utterance. A special part of the system is the dialogue component, which records the dialogue strategies, the conditions for applying those strategies, and ways of overcoming possible communication failures.
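This block architecture maps naturally onto a pipeline of functions. A minimal structural sketch (the block implementations are stubs standing in for real analysis and synthesis machinery):

```python
# A minimal structural sketch of the classic NLP-system architecture:
# analysis -> interpretation -> answer generation -> surface synthesis.
def analyze(utterance: str) -> list[str]:
    """Analysis block: segment the user's message into tokens."""
    return utterance.lower().rstrip("?").split()

def interpret(tokens: list[str]) -> dict:
    """Interpretation block: map tokens onto a formal meaning (toy rule)."""
    if "capital" in tokens and "france" in tokens:
        return {"ask": "capital", "of": "France"}
    return {"ask": None}

def generate_meaning(query: dict) -> dict:
    """Answer-generation block: look the formal query up in a toy knowledge base."""
    kb = {("capital", "France"): "Paris"}
    return {"answer": kb.get((query["ask"], query.get("of")))}

def synthesize(meaning: dict) -> str:
    """Surface-synthesis block: turn the formal answer into an utterance."""
    return f"The answer is {meaning['answer']}." if meaning["answer"] else "I do not know."

print(synthesize(generate_meaning(interpret(analyze("What is the capital of France?")))))
```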

Among natural language processing systems, one usually distinguishes question-answer systems, dialogue problem-solving systems, and connected text processing systems. Initially, question-answer systems were developed in response to the poor quality of query encoding when searching for information in information retrieval systems. Since the problem domain of such systems was very limited, this somewhat simplified the algorithms for translating queries into a formal language representation and the reverse procedure of transforming a formal representation into natural language statements. Among domestic developments, the POET system, created by a team of researchers led by E.V. Popov, belongs to this type of program. The system processes queries in Russian (with minor restrictions) and synthesizes an answer. The block diagram of the program assumes passing through all stages of analysis (morphological, syntactic and semantic) and the corresponding stages of synthesis.

Dialogue problem-solving systems, unlike systems of the previous type, play an active role in communication, since their task is to obtain a solution to a problem on the basis of the knowledge represented in the system itself and the information that can be obtained from the user. The system contains knowledge structures that record typical sequences of actions for solving problems in the given problem domain, as well as information about the required resources. When the user asks a question or sets a task, the corresponding scenario is activated. If some components of the scenario are missing, or some resources are lacking, the system initiates communication. This is how, for example, the SNUKA system works, which solves problems of planning military operations.

Connected text processing systems are quite diverse in structure. Their common feature is the extensive use of knowledge representation technologies. The functions of such systems are to understand the text and to answer questions about its content. Understanding is treated not as a universal category but as a process of extracting information from a text determined by a specific communicative intention. In other words, the text is "read" only on the assumption of what exactly the potential user wants to know about it. Thus, connected text processing systems turn out to be by no means universal but problem-oriented. Typical examples are the RESEARCHER and TAILOR systems, which form a single software package allowing the user to obtain information from patent abstracts describing complex physical objects.

The most important area of computational linguistics is the development of information retrieval systems (IPS). The latter arose in the late 1950s and early 1960s as a response to a sharp increase in the volume of scientific and technical information. By the type of stored and processed information, as well as by the features of search, IPS are divided into two large groups: documentary and factographic. Documentary IPS store the texts of documents or their descriptions (abstracts, bibliographic cards, etc.). Factographic IPS deal with descriptions of specific facts, not necessarily in textual form: these can be tables, formulas and other kinds of data presentation. There are also mixed IPS, including both documents and factographic information. At present, factographic IPS are built on database (DB) technologies.

To provide information retrieval in an IPS, special information retrieval languages are created, based on information retrieval thesauri. An information retrieval language is a formal language intended to describe certain aspects of the content plan of the documents stored in the IPS and of the query. The procedure for describing a document in an information retrieval language is called indexing. As a result of indexing, each document is assigned a formal description in the information retrieval language - the search image of the document. The query is indexed similarly: it is assigned a search image of the query and a search prescription. Information retrieval algorithms are based on comparing the search prescription with the search images of documents. The criterion for returning a document in response to a query may be a full or partial match between the document's search image and the search prescription. In some cases the user can formulate the criterion himself, guided by his information need.

Automated IPS more often use descriptor information retrieval languages. The subject matter of a document is described by a set of descriptors: words and terms denoting simple, fairly elementary categories and concepts of the problem domain. As many descriptors are entered into the document's search image as there are distinct topics touched on in the document. The number of descriptors is not limited, which makes it possible to describe a document in a multidimensional feature matrix. Often restrictions are imposed on the combinability of descriptors; in this case one can say that the information retrieval language has a syntax.
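A minimal sketch of descriptor-based retrieval: documents are indexed as descriptor sets, and the match criterion (full or partial inclusion of the search prescription) can be chosen by the user. All documents and descriptors are invented:

```python
# A minimal sketch of a descriptor IPS: documents are indexed with
# descriptor sets, and a search prescription is matched against them.
DOCUMENTS = {  # search images of documents (invented)
    "doc1": {"machine translation", "dictionaries"},
    "doc2": {"speech synthesis", "phonetics"},
    "doc3": {"machine translation", "syntax", "dictionaries"},
}

def search(prescription: set[str], require_all: bool = True) -> list[str]:
    """Return documents whose search image fully (or partially) matches the prescription."""
    match = (lambda image: prescription <= image) if require_all \
            else (lambda image: bool(prescription & image))
    return [doc for doc, image in DOCUMENTS.items() if match(image)]

print(search({"machine translation", "dictionaries"}))      # ['doc1', 'doc3']
print(search({"phonetics", "syntax"}, require_all=False))   # ['doc2', 'doc3']
```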

One of the first systems working with a descriptor language was the American UNITERM system created by M. Taube. In this system the keywords of a document - uniterms - functioned as descriptors. A peculiarity of this IPS is that the dictionary of the information language was not set in advance but arose in the process of indexing documents and queries. The development of modern information retrieval systems is associated with non-thesaurus-type IPS. Such systems work with the user in restricted natural language, and the search is carried out in the texts of document abstracts, in their bibliographic descriptions, and often in the documents themselves. For indexing in non-thesaurus-type IPS, words and phrases of natural language are used.

To a certain extent, the field of computational linguistics also includes work on creating hypertext systems, considered as a special way of organizing text and even as a fundamentally new kind of text, opposed in many of its properties to ordinary text formed in the Gutenberg tradition of printing. The idea of hypertext is associated with the name of Vannevar Bush, science adviser to President F. Roosevelt. Bush theoretically substantiated the project of the technical system "Memex", which allowed the user to link texts and their fragments by various types of links, mainly associative relations. The absence of computer technology made the project difficult to implement, since the mechanical system proved too complex for practical realization.

Bush's idea was reborn in the 1960s in T. Nelson's "Xanadu" system, which already assumed the use of computer technology. "Xanadu" allowed the user to read the set of texts entered into the system in different ways and in different sequences; the software made it possible both to remember the sequence of texts viewed and to choose almost any of them at an arbitrary moment. Nelson called a set of texts with the relations connecting them (a system of transitions) hypertext. Many researchers regard the creation of hypertext as the beginning of a new information age, opposed to the era of printing. The linearity of writing, outwardly reflecting the linearity of speech, turns out to be a fundamental category limiting human thinking and the understanding of text. The world of meaning is non-linear, so compressing semantic information into a linear speech segment requires special "communicative packaging": division into theme and rheme, division of the content plan of the utterance into explicit (assertion, proposition, focus) and implicit (presupposition, consequence, discourse implicature) layers. Rejecting the linearity of text, both in presenting it to the reader (i.e., in reading and understanding) and in synthesis, would, according to the theorists, contribute to the "liberation" of thinking and even to the emergence of new forms of it.

In a computer system, hypertext is represented as a graph whose nodes contain traditional texts or their fragments, images, tables, videos, etc. The nodes are connected by a variety of relations whose types are specified by the developers of the hypertext software or by the reader. The relations define the potential possibilities of movement, or navigation, through the hypertext. Relations can be unidirectional or bidirectional; accordingly, bidirectional arrows allow the user to move in both directions, unidirectional arrows only in one. The chain of nodes through which the reader passes while viewing the components of the text forms a path, or route.
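A minimal sketch of such a hypertext graph with typed, directional links and a recorded route (node names and link types are invented):

```python
# A minimal hypertext sketch: nodes hold text, directed edges carry a
# relation type; a reader's traversal is recorded as a route.
nodes = {  # invented node contents
    "intro": "What hypertext is ...",
    "memex": "Bush's Memex project ...",
    "xanadu": "Nelson's Xanadu system ...",
}

links = {  # source -> list of (target, relation type); one-way unless mirrored
    "intro": [("memex", "history"), ("xanadu", "history")],
    "memex": [("xanadu", "successor"), ("intro", "back")],  # "back" makes intro<->memex bidirectional
}

def navigate(start: str, choices: list[int]) -> list[str]:
    """Follow the reader's link choices from a start node; return the route."""
    route, current = [start], start
    for choice in choices:
        current = links[current][choice][0]
        route.append(current)
    return route

print(navigate("intro", [0, 0]))  # -> ['intro', 'memex', 'xanadu']
```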

Computer implementations of hypertext can be hierarchical or network. A hierarchical - tree-like - structure significantly limits the possibilities of transition between components: in such a hypertext, the relations between components resemble the structure of a thesaurus based on genus-species relations. Network hypertext allows various types of relations between components, not limited to genus-species ones. By mode of existence, static and dynamic hypertexts are distinguished. A static hypertext does not change during operation; the user can record comments in it, but they do not change the substance of the matter. For a dynamic hypertext, change is the normal form of existence; dynamic hypertexts typically function where a flow of information must be constantly analyzed, i.e., in information services of various kinds. An example is the Arizona Information System (AAIS), a hypertext updated with 300-500 abstracts every month.

Relations between hypertext elements can either be fixed by the creators in advance or generated whenever the user accesses the hypertext. In the first case we speak of hypertexts with a rigid structure, in the second of hypertexts with a soft structure. The rigid structure is technologically straightforward. The technology for organizing a soft structure must rest on semantic analysis of the proximity of documents (or other information sources) to one another - a non-trivial task of computational linguistics. Nowadays, keyword-based soft-structure technologies are widespread: the transition from one node to another in the hypertext network is carried out by searching for keywords, and since the set of keywords may differ each time, the structure of the hypertext also changes each time.
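A minimal sketch of a keyword-based soft structure: the outgoing links of a node are recomputed at access time from keyword overlap (documents and keyword sets are invented):

```python
# A minimal soft-structure sketch: hypertext links are not stored but
# generated on access, from keyword overlap between documents.
KEYWORDS = {  # invented keyword sets per document
    "doc1": {"hypertext", "navigation", "graph"},
    "doc2": {"hypertext", "memex", "history"},
    "doc3": {"syntax", "grammar"},
}

def soft_links(current: str, min_overlap: int = 1) -> list[str]:
    """Generate the current node's links from keyword overlap, recomputed each call."""
    here = KEYWORDS[current]
    return [doc for doc, kws in KEYWORDS.items()
            if doc != current and len(here & kws) >= min_overlap]

print(soft_links("doc1"))  # -> ['doc2']  (shares the keyword "hypertext")
```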

The technology of building hypertext systems does not distinguish between textual and non-textual information. Meanwhile, including visual and audio information (videos, paintings, photographs, sound recordings, etc.) requires a significant change in the user interface and more powerful software and hardware support. Such systems are called hypermedia, or multimedia. The visual appeal of multimedia systems predetermined their wide use in education and in creating computer versions of encyclopedias. There are, for example, beautifully executed CD-ROMs with multimedia systems based on children's encyclopedias published by Dorling Kindersley.

Within computer lexicography, computer technologies for compiling and operating dictionaries are being developed. Special programs - databases, computer card indexes, text processing tools - make it possible to generate dictionary entries automatically, to store dictionary information and to process it. The many different computer lexicographic programs fall into two large groups: programs for supporting lexicographic work, and automatic dictionaries of various types, including lexicographic databases. An automatic dictionary is a dictionary in a special machine format intended for use on a computer by a user or by a computer word processing program. In other words, automatic dictionaries for the human end user differ from automatic dictionaries for word processing programs. Automatic dictionaries intended for the end user differ significantly, in interface and in the structure of the dictionary entry, from the automatic dictionaries included in machine translation systems, automatic abstracting systems, information retrieval systems, etc. Most often they are computer versions of well-known conventional dictionaries. The software market offers computer analogues of explanatory dictionaries of English (an automatic Webster, an automatic explanatory English dictionary from the Collins publishing house, an automatic version of the New Large English-Russian Dictionary edited by Yu.D. Apresyan and E.M. Mednikova); there is also a computer version of Ozhegov's dictionary. Automatic dictionaries for word processing programs can be called automatic dictionaries in the strict sense. They are generally not intended for the average user; the features of their structure and the scope of their vocabulary material are determined by the programs that interact with them.

Computer modeling of plot structure is another promising direction of computational linguistics. The study of plot structure belongs to the problems of structural literary studies (in the broad sense), semiotics, and cultural studies. The available computer programs for plot modeling are based on three basic formalisms of plot representation - the morphological and syntactic approaches to plot representation, and the cognitive approach. Ideas about the morphological structure of plot go back to the famous works of V.Ya. Propp on the Russian fairy tale. Propp noticed that, for all the abundance of characters and events in fairy tales, the number of character functions is limited, and he proposed an apparatus for describing those functions. Propp's ideas formed the basis of the TALE computer program, which simulates the generation of fairy-tale plots. The algorithm of the TALE program is based on the sequence of character functions in the tale. In effect, the Propp functions define a set of typified situations, ordered on the basis of an analysis of empirical material. The possibilities of coupling different situations in the generation rules were determined by the typical sequence of functions, in the form in which it can be established from the texts of fairy tales. In the program, typical sequences of functions were described as typical scenarios of character encounters.
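The core of such a generator can be sketched in a few lines: a typical sequence of Propp functions drives the choice of typified situations (the function list and the situation texts are simplified assumptions, not the actual TALE program):

```python
# A minimal sketch of Propp-style plot generation: a fixed sequence of
# character functions is realized by picking a typified situation for each.
# Function names follow Propp; the situation texts are invented.
import random

FUNCTION_SEQUENCE = ["absentation", "interdiction", "violation",
                     "villainy", "struggle", "victory", "return"]

SITUATIONS = {  # typified realizations of each function (toy examples)
    "absentation": ["The parents leave for the fair."],
    "interdiction": ["The hero is told not to open the door."],
    "violation": ["The hero opens the door anyway."],
    "villainy": ["A dragon abducts the princess.", "A witch casts a spell."],
    "struggle": ["The hero fights the villain in an open field."],
    "victory": ["The villain is defeated."],
    "return": ["The hero returns home in glory."],
}

def generate_tale(seed: int = 0) -> str:
    random.seed(seed)
    return " ".join(random.choice(SITUATIONS[f]) for f in FUNCTION_SEQUENCE)

print(generate_tale())
```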

The theoretical basis of the syntactic approach to the plot of a text was provided by "plot grammars", or "story grammars". They appeared in the mid-1970s as a result of transferring the ideas of N. Chomsky's generative grammar to the description of the macrostructure of text. Where the most important components of syntactic structure in a generative grammar were verb and noun phrases, in most plot grammars the basic units were the setting, the event and the episode. The theory of plot grammars widely discussed minimality conditions, that is, restrictions that determine the status of a sequence of plot elements as a normal plot. It turned out, however, that this cannot be achieved by purely linguistic methods: many restrictions are sociocultural in nature. Plot grammars, while differing significantly in their sets of categories in the generation tree, allowed only a very limited set of rules for modifying the narrative structure.

In the early 1980s, one of R. Schank's students, W. Lehnert, as part of work on a computer plot generator, proposed an original formalism of affective plot units, which proved a powerful means of representing plot structure. Although it was originally developed for an artificial intelligence system, this formalism has been used in purely theoretical studies as well. The essence of Lehnert's approach is that the plot is described as a successive change in the cognitive-emotional states of the characters. Thus the focus of Lehnert's formalism is not on the external components of plot - setting, event, episode, moral - but on its content characteristics. In this respect, Lehnert's formalism is partly a return to Propp's ideas.

Computational linguistics also includes machine translation, which is currently experiencing a rebirth.

Literature:

Popov E.V. Communication with Computers in Natural Language. Moscow, 1982
Sadur V.G. Voice Communication with Electronic Computers and Problems of Their Development. In: Speech Communication: Problems and Prospects. Moscow, 1983
Baranov A.N. Categories of Artificial Intelligence in Linguistic Semantics. Frames and Scripts. Moscow, 1987
Kobozeva I.M., Laufer N.I., Saburova I.G. Modeling Communication in Human-Machine Systems. In: Linguistic Support of Information Systems. Moscow, 1987
Olker H.R. Fairy Tales, Tragedies and Ways of Presenting World History. In: Language and Modeling of Social Interaction. Moscow, 1987
Gorodetsky B.Yu. Computational Linguistics: Modeling Language Communication
McQueen K. Discursive Strategies for Natural Language Text Synthesis. In: New in Foreign Linguistics, Issue XXIV, Computational Linguistics. Moscow, 1989
Popov E.V., Preobrazhensky A.B. Features of the Implementation of NL Systems
Preobrazhensky A.B. The State of Development of Modern NL Systems. In: Artificial Intelligence. Book 1, Communication Systems and Expert Systems. Moscow, 1990
Subbotin M.M. Hypertext: A New Form of Written Communication. VINITI, Ser. Informatics, 1994, Vol. 18
Baranov A.N. Introduction to Applied Linguistics. Moscow, 2000



The term "computational linguistics" usually refers to a broad area of using computer tools - programs, computer technologies for organizing and processing data - to model the functioning of a language in certain conditions, situations and problem areas, as well as the sphere of application of computer language models not only in linguistics but also in related disciplines. Strictly speaking, only in the latter case are we talking about applied linguistics in the narrow sense, since computer modeling of a language can also be considered a field of application of programming theory (computer science) to linguistics. Nevertheless, the general practice is that the field of computational linguistics covers almost everything connected with the use of computers in linguistics: the term "computational linguistics" sets a general orientation toward the use of computers for solving a variety of scientific and practical problems related to language, without restricting in any way the means of solving those problems.

Institutional aspect of computational linguistics. As a special scientific direction, computational linguistics took shape in the 1960s. The flow of publications in this area is very large. In addition to thematic collections, the journal Computational Linguistics is published quarterly in the USA. Extensive organizational and scientific work is carried out by the Association for Computational Linguistics, which has regional structures around the world (in particular, a European chapter). International conferences on computational linguistics, COLING, are held every two years. Relevant issues are also widely represented at international conferences on artificial intelligence at various levels.

Cognitive toolkit of computational linguistics

Computational linguistics as a special applied discipline is distinguished primarily by its toolkit, that is, by the use of computer tools for processing language data. Since computer programs modeling particular aspects of the functioning of a language may use the most diverse programming tools, it might seem that there is no need to speak of a common metalanguage of computational linguistics. However, this is not so. There are general principles of computer modeling of thinking that are implemented in one way or another in any computer model. This metalanguage is based on the theory of knowledge developed in artificial intelligence, which forms an important branch of cognitive science.

The main thesis of the theory of knowledge states that thinking is a process of processing and generating knowledge. "Knowledge" itself is treated as an undefined category. The human cognitive system acts as a "processor" that processes knowledge. In epistemology and cognitive science, two main types of knowledge are distinguished: declarative ("knowing that") and procedural ("knowing how"). Declarative knowledge is usually presented as a set of propositions, statements about something. A typical example of declarative knowledge is the interpretation of words in ordinary explanatory dictionaries; for example, cup: "a small rounded drinking vessel, usually with a handle, made of porcelain, faience, etc." Declarative knowledge lends itself to verification in terms of "true/false." Procedural knowledge is presented as a sequence (list) of operations, actions to be performed: a general instruction about actions in a certain situation. A typical example of procedural knowledge is the instructions for using household appliances.

Unlike declarative knowledge, procedural knowledge cannot be verified as true or false; it can be evaluated only by the success or failure of the corresponding algorithm.
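To make the distinction concrete, here is a minimal sketch (not from the source): declarative knowledge as a set of propositions verifiable as true or false, procedural knowledge as an instruction evaluated only by its outcome. The facts and steps are invented for illustration.

```python
# Declarative knowledge: propositions about a cup, each checkable
# as true or false (invented for illustration).
cup_facts = {
    "is_drinking_vessel": True,
    "has_handle": True,
    "material": "porcelain",
}

# Procedural knowledge: a household-appliance-style instruction.
# It cannot be "true" or "false"; it can only succeed or fail.
def make_tea():
    steps = ["fill kettle", "boil water", "put tea bag in cup", "pour water"]
    for step in steps:
        print("doing:", step)
    return "tea is ready"

print(make_tea())
```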

Most concepts of the cognitive toolkit of computational linguistics are homonymous: they simultaneously designate real entities of the human cognitive system and ways of representing those entities in certain metalanguages. In other words, the elements of the metalanguage have both an ontological and an instrumental aspect. Ontologically, the division into declarative and procedural knowledge corresponds to different types of knowledge in the human cognitive system. Thus, knowledge about specific objects of reality is mostly declarative, while a person's functional abilities to walk, run, or drive a car are realized in the cognitive system as procedural knowledge. Instrumentally, knowledge (both ontologically procedural and declarative) can be represented either as a set of descriptions or as an algorithm, an instruction. In other words, ontologically declarative knowledge about the real-world object "table" can be represented procedurally as a set of instructions or algorithms for its creation and assembly (the creative aspect of procedural knowledge) or as an algorithm for its typical use (the functional aspect of procedural knowledge). In the first case, this may be a guide for a novice carpenter; in the second, a description of the possibilities of an office desk. The converse is also true: ontologically procedural knowledge can be represented declaratively.

It requires a separate discussion whether any ontologically declarative knowledge can be represented procedurally, and any ontologically procedural knowledge declaratively. Researchers agree that any declarative knowledge can, in principle, be represented procedurally, although this may turn out to be very uneconomical for a cognitive system. The reverse is hardly true. The point is that declarative knowledge is much more explicit and easier for a person to grasp than procedural knowledge. In contrast to declarative knowledge, procedural knowledge is predominantly implicit. Thus, the language ability, being procedural knowledge, is hidden from a person and not consciously accessible. An attempt to explicate the mechanisms of language functioning can even lead to dysfunction. Specialists in lexical semantics know, for example, that the prolonged semantic introspection needed to study a word's plane of content leads the researcher to partially lose the ability to distinguish between correct and incorrect uses of the analyzed word. Other examples can be cited: it is known that, from the point of view of mechanics, the human body is a complex system of two interacting pendulums, yet the walker, whose skill is procedural knowledge, needs no awareness of this to walk.

In the theory of knowledge, knowledge is studied and represented using various knowledge structures: frames, scenarios, and plans. According to M. Minsky, "a frame is a data structure designed to represent a stereotyped situation" [Minsky 1978, p. 254]. In more detail, a frame is a conceptual structure for the declarative representation of knowledge about a typified, thematically unified situation, containing slots interconnected by certain semantic relations. For illustration, a frame is often represented as a table whose rows form the slots. Each slot has its own name and content (see Table 1).

Table 1. Fragment of the "table" frame in tabular form

Depending on the specific task, the structure of a frame can be significantly more complex; a frame can include nested subframes and references to other frames.

Instead of a table, a predicate form of representation is often used. In this case, the frame takes the form of a predicate or a function with arguments. There are other ways of representing a frame as well. For example, it can be represented as a tuple of the following kind: ((frame name) (slot name 1)(slot value 1), ..., (slot name n)(slot value n)).

Typically, frames in knowledge representation languages have this form.
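For illustration, here is a minimal sketch of such a frame as a Python data structure, in the tuple-like spirit just described. The frame name "table" echoes the example above, while the slot names and values are invented.

```python
# A minimal sketch of a frame as a named structure with slots.
# Slot names and values are invented for illustration.

table_frame = {
    "frame": "table",
    "slots": {
        "kind":     "furniture",         # slot: superordinate category
        "parts":    ["top", "legs"],     # slot: typical components
        "function": "working surface",   # slot: typical use
        "material": None,                # unfilled slot (value unknown)
    },
}

# A frame may reference another frame, forming a hierarchy:
desk_frame = {
    "frame": "desk",
    "is_a": table_frame["frame"],   # reference to the "table" frame
    "slots": {"function": "office work"},
}

print(desk_frame["is_a"])  # -> table
```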

Like other cognitive categories of computational linguistics, the concept of a frame is homonymous. Ontologically, it is a part of the human cognitive system, and in this sense the frame is comparable to such concepts as gestalt, prototype, stereotype, and schema. In cognitive psychology, these categories are considered precisely from the ontological point of view. Thus, D. Norman distinguishes two main modes of existence and organization of knowledge in the human cognitive system: semantic networks and schemas. "Schemas," he writes, "are organized packets of knowledge assembled to represent distinct, self-contained units of knowledge. My schema for Sam may contain information describing his physical features, his activities, and personality traits. This schema correlates with other schemas that describe his other aspects" [Norman 1998, p. 359]. On the instrumental side, the frame is a structure for the declarative representation of knowledge. In existing AI systems, frames can form complex knowledge structures; frame systems allow hierarchies, with one frame being part of another frame.

In content, the concept of a frame is very close to the category of lexical interpretation. Indeed, a slot is an analogue of a valence, and the filling of a slot is an analogue of an actant. The main difference is that an interpretation contains only linguistically relevant information about a word's plane of content, whereas a frame, first, is not necessarily tied to a word and, second, includes all information relevant to a given problem situation, including extralinguistic information (knowledge of the world).

A scenario is a conceptual structure for the procedural representation of knowledge about a stereotyped situation or behavior. The elements of a scenario are the steps of an algorithm or instruction. One usually speaks of a "restaurant scenario," a "shopping scenario," and so on.
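As an illustration, here is a minimal sketch of a scenario as procedural knowledge: an ordered list of steps. The steps of this "restaurant scenario" follow the common textbook illustration and are not taken from any particular system.

```python
# A minimal sketch of a scenario: procedural knowledge as an ordered
# list of steps. The step list is invented for illustration.

restaurant_scenario = [
    "enter the restaurant",
    "take a table",
    "study the menu",
    "order food",
    "eat",
    "pay the bill",
    "leave",
]

def run(scenario):
    # Like any procedural knowledge, a scenario is evaluated by
    # success or failure, not by truth or falsity.
    for step in scenario:
        print("step:", step)

run(restaurant_scenario)
```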

Originally the frame, too, was used for procedural representation (cf. the term "procedural frame"), but the term "scenario" is now more commonly used in this sense. A scenario can be represented not only as an algorithm but also as a network whose vertices correspond to situations and whose arcs correspond to connections between situations. Along with the concept of a scenario, some researchers use the category of script for computer modeling of intelligence. According to R. Schank, a script is a generally accepted, well-known sequence of causal relations. For example, understanding the dialogue

It's pouring outside.

You still have to go to the store: there is nothing in the house; the guests ate everything up yesterday.

is based on non-explicit semantic connections of the type "if it rains, it is undesirable to go outside, because one can fall ill." These connections form a script, which native speakers use to understand each other's verbal and non-verbal behavior.
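A toy sketch of the idea follows: the script supplies the causal links that the dialogue leaves implicit. The rules below are invented for illustration; a real script system in the spirit of Schank's work would be far richer.

```python
# A toy sketch of a script as a chain of causal relations that fills
# in the implicit connections of the dialogue above. All rules are
# invented for illustration.

script = [
    ("it is raining", "you can get wet and fall ill outside"),
    ("you can fall ill outside", "it is undesirable to go outside"),
    ("there is no food at home", "someone has to go to the store"),
]

# Understanding the dialogue = making these links explicit:
for cause, effect in script:
    print(f"{cause} -> {effect}")
```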

As a result of applying a scenario to a specific problem situation, a plan is formed. A plan is used for the procedural representation of knowledge about possible actions leading to a specific goal. A plan relates a goal to a sequence of actions.

In the general case, a plan includes a sequence of procedures that transform the initial state of a system into a final one and lead to the achievement of a particular subgoal or goal. In AI systems, a plan arises as a result of the planning activity of a corresponding module, the planning module. The planning process may rely on adapting the data of one or several scenarios, activated by testing procedures, to the problem situation at hand. The plan is executed by an executive module that controls the cognitive procedures and physical actions of the system. In the elementary case, a plan in an intelligent system is a simple sequence of operations; in more complex versions, a plan is tied to a particular subject, its resources, capabilities, goals, detailed information about the problem situation, and so on. A plan emerges in the course of communication between the model of the world (part of which is formed by scenarios), the planning module, and the executive module.

Unlike a scenario, a plan is associated with a specific situation and a specific performer and pursues a specific goal. The choice of plan is governed by the performer's resources. The feasibility of a plan is an obligatory condition of its generation in a cognitive system, whereas the characteristic of feasibility is inapplicable to a scenario.
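A minimal sketch of this relationship, under the assumption of a toy "shopping" scenario: the generic scenario is adapted to a specific performer, resources, and goal, and the plan is generated only if it is feasible. All names and the feasibility check are invented for illustration.

```python
# A minimal sketch of deriving a plan from a scenario. The scenario,
# the performer, and the feasibility check are all invented.

shopping_scenario = ["go to the store", "choose goods", "pay", "return home"]

def make_plan(scenario, performer, resources, goal):
    # Unlike a scenario, a plan is checked for feasibility against
    # the performer's resources before it is generated.
    if "money" not in resources:
        return None  # infeasible plan: not generated at all
    return {"performer": performer, "goal": goal, "steps": list(scenario)}

plan = make_plan(shopping_scenario, performer="speaker",
                 resources={"money", "umbrella"}, goal="buy food")
print(plan)
```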

Another important concept is the model of the world. The model of the world is usually understood as a set of knowledge about the world, organized in a certain way, inherent in a cognitive system or its computer model. In a somewhat more general sense, the model of the world is spoken of as the part of a cognitive system that stores knowledge about the structure of the world, its patterns, and so on. In another sense, the model of the world is associated with the results of understanding a text or, more broadly, a discourse. In the process of understanding a discourse, a mental model of it is built, which results from the interaction between the plane of content of the text and the knowledge of the world inherent in the subject [Johnson-Laird 1988, p. 237 ff.]. The first and second understandings are often combined; this is typical of researchers working in cognitive linguistics and cognitive science.

Closely related to the category of frame is the concept of a scene. The category of scene is mainly used in the literature to designate a conceptual structure for the declarative representation of situations and their parts that are actualized in a speech act and highlighted by linguistic means (lexemes, syntactic constructions, grammatical categories, etc.). Being associated with linguistic forms, a scene is often actualized by a particular word or expression. In plot grammars (see above), a scene appears as part of an episode or narrative. Typical examples of scenes are the set of blocks an AI system works with, the place of action in a story and its participants, etc. In artificial intelligence, scenes are used in image recognition systems, as well as in programs oriented to the investigation (analysis, description) of problem situations. The concept of the scene has become widespread in theoretical linguistics, as well as in logic, in particular in situation semantics, in which the meaning of a lexical unit is directly associated with a scene.