What is computational linguistics?

21.09.2019

COURSE WORK

in the discipline "Informatics"

on the topic: “Computational linguistics”


INTRODUCTION

1. The place and role of computational linguistics in linguistic research

2. Modern interfaces for computational linguistics

CONCLUSION

LITERATURE


Introduction

Automated information technologies play an important role in the life of modern society, and their importance continuously increases over time. Yet information technology develops very unevenly: while the modern level of computer hardware and communications is impressive, progress in the semantic processing of information is far more modest. Success here depends, first of all, on advances in the study of human thinking and of verbal communication between people, and on the ability to model these processes on a computer.

When it comes to creating promising information technologies, the problems of automatically processing text information presented in natural languages come to the fore. This is because human thinking is closely connected with language; indeed, natural language is a tool of thinking. It is also a universal means of communication between people: a means of perceiving, accumulating, storing, processing and transmitting information. The science that deals with the use of natural language in automatic information processing systems is computational linguistics. It arose relatively recently, at the turn of the 1950s and 1960s. Over the past half century, significant scientific and practical results have been obtained in this field: systems for machine translation of texts from one natural language to another, systems for automated information retrieval in texts, systems for the automatic analysis and synthesis of oral speech, and many others have been created. This work is devoted to constructing an optimal computer interface using computational linguistics for conducting linguistic research.


1. The place and role of computational linguistics in linguistic research

In the modern world, computational linguistics is increasingly used in a variety of linguistic studies.

Computational linguistics is a field of knowledge associated with solving problems of the automatic processing of information presented in natural language. The central scientific problems of computational linguistics are modeling the process of understanding the meaning of texts (the transition from a text to a formalized representation of its meaning) and speech synthesis (the transition from a formalized representation of meaning to a text in natural language). These problems arise in a number of applied tasks, in particular: the automatic detection and correction of errors when entering texts into a computer, automatic analysis and synthesis of oral speech, automatic translation of texts from one language to another, communication with a computer in natural language, automatic classification and indexing of text documents, their automatic abstracting, and searching for documents in full-text databases.

The linguistic tools created and used in computational linguistics can be divided into two parts: declarative and procedural. The declarative part includes dictionaries of language and speech units, texts, and grammar tables of various kinds; the procedural part includes the means of manipulating those units, texts and tables. The computer interface belongs to the procedural part of computational linguistics.

Success in solving applied problems of computer linguistics depends, first of all, on the completeness and accuracy of the representation of declarative means in computer memory and on the quality of procedural means. To date, the required level of solving these problems has not yet been achieved, although work in the field of computational linguistics is being carried out in all developed countries of the world (Russia, USA, England, France, Germany, Japan, etc.).

Nevertheless, serious scientific and practical achievements in the field of computational linguistics can be noted. Thus, in a number of countries (Russia, USA, Japan, etc.) experimental and industrial systems for machine translation of texts from one language to another have been built, a number of experimental systems for communicating with computers in natural language have been built, work is underway to create terminological data banks, thesauruses, bilingual and multilingual machine dictionaries (Russia, USA, Germany, France, etc.), systems for automatic analysis and synthesis of oral speech are being built (Russia, USA, Japan, etc.), research is being conducted in the field of constructing natural language models.

An important methodological problem of applied computational linguistics is the correct assessment of the necessary relationship between the declarative and procedural components of automatic text information processing systems. What should be preferred: powerful computational procedures based on relatively small vocabulary systems with rich grammatical and semantic information, or a powerful declarative component with relatively simple computer interfaces? Most scientists believe that the second way is preferable. It will lead to the achievement of practical goals faster, since there will be fewer dead ends and difficult obstacles to overcome, and here it will be possible to use computers on a larger scale to automate research and development.

The need to mobilize efforts, first of all, on the development of the declarative component of automatic text information processing systems is confirmed by half a century of experience in the development of computer linguistics. After all, here, despite the undeniable successes of this science, the passion for algorithmic procedures has not brought the expected success. There was even some disappointment in the capabilities of procedural means.

In light of the above, it seems promising to develop such a path of development of computer linguistics, when the main efforts will be aimed at creating powerful dictionaries of language and speech units, studying their semantic-syntactic structure and creating basic procedures for morphological, semantic-syntactic and conceptual analysis and synthesis of texts. This will allow us to solve a wide range of applied problems in the future.

Computer linguistics faces, first of all, the tasks of linguistic support for the processes of collecting, accumulating, processing and retrieving information. The most important of them are:

1. Automation of the compilation and linguistic processing of machine dictionaries;

2. Automation of the processes of detecting and correcting errors when entering texts into a computer;

3. Automatic indexing of documents and information requests;

4. Automatic classification and abstracting of documents;

5. Linguistic support for information retrieval processes in monolingual and multilingual databases;

6. Machine translation of texts from one natural language to another;

7. Construction of linguistic processors that enable users to communicate with automated intelligent information systems (in particular, expert systems) in natural language, or in a language close to natural;

8. Extracting factual information from informal texts.

Let us dwell in detail on the problems most relevant to the topic of research.

In the practical work of information centers, there is a need for the automated detection and correction of errors in texts as they are entered into a computer. This complex task can be conventionally divided into three subtasks: orthographic, syntactic and semantic control of texts. The first can be solved by a morphological analysis procedure that uses a fairly powerful reference machine dictionary of word stems. During spelling control, the words of the text undergo morphological analysis; if their stems match stems in the reference dictionary, the words are considered correct; if not, they are presented, together with a microcontext, to a human for review. A person detects and corrects the distorted words, and the corresponding software system enters these corrections into the text.
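As a minimal sketch of the orthographic-control scheme just described: words whose stems match a reference stem dictionary are accepted, and the rest are flagged with a microcontext for human review. The tiny stem dictionary and the crude suffix stemmer below are toy assumptions, not a real morphological analyzer.

```python
STEM_DICTIONARY = {"linguist", "comput", "translat", "analys"}  # toy reference stems

def stem(word: str) -> str:
    """Crude stemmer: strip a few common English suffixes (illustration only)."""
    for suffix in ("ation", "ics", "ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def spell_check(text: str, window: int = 2):
    """Return (word, microcontext) pairs for words whose stems are not in the dictionary."""
    words = text.lower().split()
    flagged = []
    for i, word in enumerate(words):
        if stem(word) not in STEM_DICTIONARY:
            # Present the suspect word together with a small context window.
            context = " ".join(words[max(0, i - window): i + window + 1])
            flagged.append((word, context))
    return flagged
```

A real system would use a full morphological analyzer and a dictionary of tens of thousands of stems; the control flow, however, is essentially the one shown.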

The task of syntactic control of texts in order to detect errors is essentially more difficult than spelling control. First, it includes spelling control as an obligatory component; second, the problem of syntactic analysis of informal texts has not yet been fully solved. However, partial syntactic control of texts is quite possible. Two paths are open here: either compile fairly representative machine dictionaries of reference syntactic structures and compare the syntactic structures of the analyzed text against them, or develop a complex system of rules for checking the grammatical consistency of text elements. The first path seems more promising to us, although it does not, of course, exclude the use of elements of the second. The syntactic structure of texts should be described in terms of grammatical classes of words (more precisely, as sequences of sets of grammatical information about words).
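The first path (comparison with reference syntactic structures) can be sketched as follows: sentences are mapped to sequences of grammatical classes, and those sequences are checked against a set of admissible reference structures. The lexicon and the pattern set here are purely illustrative assumptions.

```python
# Toy lexicon mapping words to grammatical classes (POS tags).
LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "text": "NOUN",
    "sees": "VERB", "reads": "VERB",
    "quickly": "ADV",
}

# Toy machine dictionary of reference syntactic structures.
REFERENCE_STRUCTURES = {
    ("DET", "NOUN", "VERB", "DET", "NOUN"),
    ("DET", "NOUN", "VERB", "ADV"),
}

def pos_sequence(sentence: str):
    """Map a sentence to its sequence of grammatical classes."""
    return tuple(LEXICON.get(w, "UNK") for w in sentence.lower().split())

def syntax_ok(sentence: str) -> bool:
    """True if the sentence's class sequence matches a reference structure."""
    return pos_sequence(sentence) in REFERENCE_STRUCTURES
```

This is only partial control, exactly as the text says: a sentence rejected here may still be grammatical under a structure absent from the reference dictionary.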

The task of semantic control of texts in order to detect semantic errors should be classified as a task of artificial intelligence. It can be solved in full only by modeling the processes of human thinking, which will apparently require powerful encyclopedic knowledge bases and software tools for manipulating knowledge. Nevertheless, for limited subject areas and for formalized information this task is entirely solvable. It should be posed and solved as a problem of semantic-syntactic control of texts.

The problem of automating the indexing of documents and queries is traditional for automated text information retrieval systems. At first, indexing was understood as the process of assigning classification indices to documents and queries that reflected their thematic content. Subsequently, this concept was transformed and the term “indexing” began to refer to the process of translating descriptions of documents and queries from natural language into formalized language, in particular, into the language of “search images”. Search images of documents began, as a rule, to be drawn up in the form of lists of keywords and phrases reflecting their thematic content, and search images of queries - in the form of logical structures in which keywords and phrases were connected to each other by logical and syntactic operators.

It is convenient to index documents automatically from the texts of their abstracts (if available), since an abstract reflects the main content of the document in concentrated form. Indexing can be carried out with or without thesaurus control. In the first case, key words and phrases from the reference machine dictionary are sought in the title and abstract of the document, and only those found in the dictionary are included in the document's search image. In the second case, key words and phrases are extracted from the text and included in the search image regardless of whether they belong to any reference dictionary. A third option has also been implemented, in which the search image includes, along with terms from the machine thesaurus, terms extracted from the title and the first sentence of the document's abstract. Experiments have shown that search images compiled automatically from titles and abstracts provide greater search completeness than search images compiled manually. This is because an automatic indexing system reflects the various aspects of document content more fully than manual indexing does.
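The two indexing modes described above can be sketched in a few lines. The stopword list and the machine thesaurus here are toy assumptions; a real system would use large reference dictionaries and morphological normalization.

```python
STOPWORDS = {"the", "a", "of", "in", "and", "is", "for"}
MACHINE_THESAURUS = {"translation", "morphology", "corpus", "indexing"}  # toy reference dictionary

def search_image(title: str, abstract: str, thesaurus_control: bool = True):
    """Build a document's search image (a list of key terms) from title and abstract.

    With thesaurus control, only terms found in the reference dictionary
    enter the search image; without it, all content words do.
    """
    words = (title + " " + abstract).lower().split()
    terms = [w for w in words if w not in STOPWORDS]
    if thesaurus_control:
        terms = [w for w in terms if w in MACHINE_THESAURUS]
    return sorted(set(terms))
```

For example, `search_image("Machine translation", "Problems of morphology and indexing")` keeps only the thesaurus terms, while passing `thesaurus_control=False` also admits words like "machine" and "problems".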

Automatic indexing of queries poses approximately the same problems as automatic indexing of documents. Here you also have to extract keywords and phrases from the text and normalize the words included in the query text. Logical connections between keywords and phrases and contextual operators can be entered manually or using an automated procedure. An important element of the process of automatic indexing of a query is the addition of its constituent keywords and phrases with their synonyms and hyponyms (sometimes also hyperonyms and other terms associated with the original query terms). This can be done automatically or interactively using a machine thesaurus.
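A minimal sketch of the thesaurus-based query expansion just described: each query term is supplemented with its synonyms and hyponyms, giving one OR-group per term; the groups are then combined by AND. The thesaurus entries are hypothetical.

```python
# Toy machine thesaurus: per-term synonyms and hyponyms (hypothetical entries).
THESAURUS = {
    "car":  {"synonyms": {"automobile"}, "hyponyms": {"sedan", "hatchback"}},
    "fast": {"synonyms": {"quick"},      "hyponyms": set()},
}

def expand_query(terms):
    """Return one OR-group per original term; the groups are to be ANDed together."""
    groups = []
    for term in terms:
        entry = THESAURUS.get(term, {"synonyms": set(), "hyponyms": set()})
        groups.append(sorted({term} | entry["synonyms"] | entry["hyponyms"]))
    return groups
```

In an interactive setting, the user would review each expanded group and remove unwanted terms before the search is run.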

We have already partially considered the problem of automating the search for documentary information in connection with automatic indexing. The most promising approach is to search documents by their full texts, since using any kind of substitute for this purpose (bibliographic descriptions, search images of documents, the texts of their abstracts) leads to a loss of information during the search. The greatest losses occur when bibliographic descriptions are used as substitutes for the primary documents; the smallest, when abstracts are used.

Important characteristics of the quality of information retrieval are its completeness and accuracy. The completeness of the search can be ensured by taking maximum account of the paradigmatic connections between units of language and speech (words and phrases), and accuracy - by taking into account their syntagmatic connections. There is an opinion that the completeness and accuracy of a search are inversely related: measures to improve one of these characteristics lead to a deterioration in the other. But this is only true for fixed search logic. If this logic is improved, then both characteristics can be improved simultaneously.
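Search completeness and accuracy correspond to what is now usually called recall and precision. As a sketch, both can be computed from the set of retrieved documents and the set of relevant ones:

```python
def recall(retrieved: set, relevant: set) -> float:
    """Completeness: the share of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0

def precision(retrieved: set, relevant: set) -> float:
    """Accuracy: the share of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 1.0
```

The trade-off mentioned in the text shows up directly: retrieving more documents (e.g. by query expansion) tends to raise recall while lowering precision, unless the search logic itself improves.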

It is advisable to organize the search for information in full-text databases as interactive communication between the user and the information retrieval system (IRS), in which the user sequentially views text fragments (paragraphs) satisfying the logical conditions of the query and selects those of interest. The final search results may be the full texts of documents or any fragments of them.
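Paragraph-level retrieval under a conjunctive query condition can be sketched as follows; real systems would also support OR/NOT operators and contextual conditions.

```python
def matching_paragraphs(text: str, query_terms):
    """Return the paragraphs containing every query term (case-insensitive AND)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    terms = [t.lower() for t in query_terms]
    return [p for p in paragraphs if all(t in p.lower() for t in terms)]
```

The user would then page through the returned paragraphs interactively and mark the relevant ones, with the full document text available on request.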

As the preceding discussion shows, automatic information retrieval must overcome the language barrier that arises between the user and the information system from the variety of forms in which the same meaning is expressed in texts. This barrier becomes even more significant if the search has to be conducted in multilingual databases. A radical solution here could be machine translation of document texts from one language to another, carried out either in advance, before documents are loaded into the search engine, or during the search itself. In the latter case, the user's query must be translated into the language of the document array being searched, and the search results into the language of the query. Search engines of this kind already operate on the Internet. VINITI RAS has also built a Cyrillic Browser system, which allows information to be searched in Russian-language texts using queries in English, with the search results likewise in the user's language.
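The query-translation approach can be illustrated with a toy word-by-word scheme; the English-Russian mini-dictionary below is an assumption for illustration, whereas real systems use full machine translation of the query and of the results.

```python
# Toy English-to-Russian dictionary (illustrative entries only).
EN_RU = {"school": "школа", "book": "книга", "language": "язык"}

def translate_query(terms, dictionary=EN_RU):
    """Translate query terms word by word; keep untranslatable terms as-is."""
    return [dictionary.get(t.lower(), t) for t in terms]
```

The translated terms would then be used to search the Russian-language document array, and the retrieved texts translated back into the query language.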

An important and promising task of computer linguistics is the construction of linguistic processors that ensure user communication with intelligent automated information systems (in particular, expert systems) in natural language or in a language close to natural. Since in modern intelligent systems information is stored in a formalized form, linguistic processors, acting as intermediaries between a person and a computer, must solve the following main tasks: 1) the task of transitioning from the texts of input information requests and messages in natural language to representing their meaning in a formalized language (when entering information into a computer); 2) the task of transition from a formalized representation of the meaning of output messages to its representation in natural language (when issuing information to a person). The first task must be solved by morphological, syntactic and conceptual analysis of input queries and messages, the second - by conceptual, syntactic and morphological synthesis of output messages.

The conceptual analysis of information requests and messages consists in identifying their conceptual structure (the boundaries of the names of concepts and the relationships between concepts in the text) and translating this structure into a formalized language. It is carried out after the morphological and syntactic analysis of requests and messages. The conceptual synthesis of messages consists in the transition from a representation of the elements of their structure in a formalized language to a verbal representation. The messages are then given the necessary syntactic and morphological form.

For machine translation of texts from one natural language to another, it is necessary to have dictionaries of translation correspondence between the names of concepts. Knowledge about such translation correspondences was accumulated by many generations of people and was compiled in the form of special publications - bilingual or multilingual dictionaries. For specialists who have some knowledge of foreign languages, these dictionaries served as valuable aids in translating texts.

In traditional bilingual and multilingual general-purpose dictionaries, translation equivalents were indicated primarily for individual words, and for phrases - much less often. Indication of translation equivalents for phrases was more typical for special terminological dictionaries. Therefore, when translating sections of texts containing polysemantic words, students often encountered difficulties.

Below are translation correspondences between several pairs of English and Russian phrases on “school” topics.

1) The bat looks like a mouse with wings – Bat looks like a mouse with wings.

2) Children like to play in the sand on the beach - Children love to play in the sand on the seashore.

3) A drop of rain fell on my hand - A drop of rain fell on my hand.

4) Dry wood burns easily - Dry wood burns well.

5) He pretended not to hear me - He pretended not to hear me.

Here the English phrases are not idiomatic expressions. Nevertheless, their translation into Russian can only with some stretch be considered simple word-by-word translation, since almost all the words they contain are ambiguous. Therefore, only the achievements of computational linguistics can help students here.

COMPUTER LINGUISTICS (a calque of the English computational linguistics) is one of the areas of applied linguistics in which computer programs and computer technologies for organizing and processing data are developed and used to study language and to model the functioning of language in particular conditions, situations and problem areas. Viewed from the other side, it is the area in which computer models of language are applied in linguistics and related disciplines. As a distinct scientific direction, computational linguistics took shape in European research in the 1960s. Since the English adjective computational can also be rendered in Russian as "вычислительный" ("calculational"), the term "вычислительная лингвистика" is also found in the literature; in Russian science, however, it takes on a narrower meaning, approaching the concept of "quantitative linguistics".

Computational linguistics is often also taken to include quantitative linguistics, which characterizes an interdisciplinary direction in applied research where quantitative or statistical methods of analysis serve as the main tool for studying language and speech. Quantitative linguistics is sometimes contrasted with combinatorial linguistics, in which the dominant role belongs to the "non-quantitative" mathematical apparatus: set theory, mathematical logic, the theory of algorithms, and so on. From a theoretical point of view, the use of statistical methods in linguistics makes it possible to supplement the structural model of language with a probabilistic component, that is, to create a theoretical structural-probabilistic model with significant explanatory potential. In the applied field, quantitative linguistics is represented above all by the use of fragments of this model for linguistic monitoring of the functioning of language, the decryption of coded text, the authorship attribution of texts, and so on.

The term "computer linguistics" and the problems of this area are often associated with the modeling of communication, above all with enabling human interaction with a computer in natural or restricted natural language (special natural language processing systems are created for this purpose), as well as with the theory and practice of information retrieval systems (IRS). Enabling human-computer communication in natural language is sometimes called "natural language processing" (a translation of the English term Natural Language Processing). This area of computational linguistics arose in the late 1960s abroad and developed within the scientific and technological discipline called artificial intelligence (the works of R. Schank, M. Lebowitz, T. Winograd and others). By its meaning, the phrase "natural language processing" should cover all areas in which computers are used to process language data. In practice, however, a narrower understanding of the term has taken hold: the development of methods, technologies and specific systems that enable human communication with a computer in natural or restricted natural language.

To a certain extent, computational linguistics can also be said to include work on hypertext systems, which are considered a special way of organizing text and even a fundamentally new type of text, opposed in many of its properties to ordinary text formed in the Gutenberg tradition of printing.

The competence of computational linguistics also includes automatic translation.

Within computational linguistics, a relatively new direction has emerged that has been actively developing since the 1980s-90s: corpus linguistics, in which general principles for constructing corpora of linguistic data (in particular, text corpora) are developed using modern computer technologies. Text corpora are collections of specially selected texts from books, magazines, newspapers and other sources, transferred to computer media and intended for automatic processing. One of the first text corpora, the so-called Brown Corpus, was created for American English at Brown University in 1962-63 under the leadership of W. Francis. In Russia, since the early 2000s, the Vinogradov Russian Language Institute of the Russian Academy of Sciences has been developing the National Corpus of the Russian Language, a representative sample of Russian-language texts of about 100 million word usages. In addition to the actual construction of data corpora, corpus linguistics is engaged in creating computer tools (computer programs) designed to extract a variety of information from text corpora. From the user's point of view, text corpora are subject to requirements of representativeness, completeness and economy.
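A simple example of the kind of tool corpus linguistics builds over text corpora: extracting a frequency list of word usages from a small collection of texts (a toy stand-in for a real corpus).

```python
from collections import Counter
import re

def frequency_list(corpus_texts):
    """Count word usages across a list of texts.

    The tokenizer here is a simple regular expression over Latin and
    Cyrillic letters; real corpus tools use full morphological annotation.
    """
    counts = Counter()
    for text in corpus_texts:
        counts.update(re.findall(r"[a-zа-яё]+", text.lower()))
    return counts
```

Frequency lists of this kind were among the first products of the Brown Corpus and remain a basic building block of corpus queries.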

Computational linguistics is actively developing both in Russia and abroad, and the flow of publications in this area is very large. In addition to thematic collections, the journal Computational Linguistics has been published quarterly in the United States since 1984. Extensive organizational and scientific work is conducted by the Association for Computational Linguistics, which has regional structures around the world (in particular, a European chapter). International COLING conferences are held every two years (in 2008 the conference took place in Manchester). The main directions of computational linguistics are also discussed at the annual international conference "Dialogue", organized by the Russian Research Institute of Artificial Intelligence, the Faculty of Philology of Moscow State University, Yandex and a number of other organizations. Relevant issues are also widely represented at international conferences on artificial intelligence at various levels.

Lit.: Zvegintsev V. A. Theoretical and Applied Linguistics. Moscow, 1968; Piotrovsky R. G., Bektaev K. B., Piotrovskaya A. A. Mathematical Linguistics. Moscow, 1977; Gorodetsky B. Yu. Topical Problems of Applied Linguistics // New in Foreign Linguistics. Moscow, 1983. Issue 12; Kibrik A. E. Applied Linguistics // Kibrik A. E. Essays on General and Applied Issues of Linguistics. Moscow, 1992; Kennedy G. An Introduction to Corpus Linguistics. London, 1998; Bolshakov I. A., Gelbukh A. Computational Linguistics: Models, Resources, Applications. Mexico City, 2004; National Corpus of the Russian Language: 2003-2005. Moscow, 2005; Baranov A. N. Introduction to Applied Linguistics. 3rd ed. Moscow, 2007; Computational Linguistics and Intellectual Technologies. Moscow, 2008. Issue 7.


History of the development of computational linguistics

The formation and development of modern linguistics as a science of natural language represents a long historical evolution of linguistic knowledge. Linguistic knowledge rests on elements that took shape in the course of activities inextricably linked with the development of the structure of oral speech; the emergence, development and improvement of writing; learning to write; and the interpretation and decoding of texts.

Natural language, as the object of linguistics, occupies a central place in this science. As language developed, ideas about it also changed. Whereas previously no special importance was attached to the internal organization of language, which was considered primarily in the context of its relationship with the outside world, from the late 19th and early 20th centuries a special role began to be assigned to the internal, formal structure of language. It was during this period that the famous Swiss linguist Ferdinand de Saussure developed the foundations of semiology and structural linguistics, set out in detail in his Course in General Linguistics (1916).

Saussure proposed considering language as a single mechanism, an integral system of signs, which in turn makes it possible to describe language mathematically. He was the first to propose a structural approach to language, namely, describing a language by studying the relationships between its units. By units, or "signs", he understood the word, which combines both meaning and sound. The concept proposed by the Swiss scholar is based on a theory of language as a system of signs consisting of three parts: language (French langue), speech (French parole) and speech activity (French langage).

Saussure himself defined the science he created, semiology, as "a science that studies the life of signs within the life of society". Since language is a sign system, in answering the question of what place linguistics occupies among the other sciences, he argued that linguistics is part of semiology. It is generally accepted that the Swiss philologist laid the theoretical foundation of a new direction in linguistics, becoming the founder, the "father", of modern linguistics.

The concept put forward by F. de Saussure was further developed in the works of many outstanding scholars: in Denmark by L. Hjelmslev, in the Czech Republic by N. Trubetzkoy, in the USA by L. Bloomfield, Z. Harris and N. Chomsky. In our country, structural linguistics began its development at approximately the same time as in the West, at the turn of the 19th-20th centuries, in the works of F. Fortunatov and I. Baudouin de Courtenay. It should be noted that I. Baudouin de Courtenay worked closely with F. de Saussure. If Saussure laid the theoretical foundation of structural linguistics, Baudouin de Courtenay can be considered the person who laid the foundations for the practical application of the methods the Swiss scholar proposed. It was he who defined linguistics as a science that uses statistical methods and functional dependencies, and separated it from philology. The first field to apply mathematical methods in linguistics was phonology, the study of the sound structure of language.

It should be noted that the postulates put forward by F. de Saussure proved relevant to the problems facing linguistics in the middle of the 20th century. It was during this period that a clear tendency toward the mathematization of the science of language emerged. In almost all large countries, the rapid development of science and computer technology began, which in turn demanded ever newer linguistic foundations. The result was a rapid convergence of the exact sciences and the humanities, and the active interaction of mathematics and linguistics found practical application in solving pressing scientific problems.

In the 1950s, at the junction of mathematics, linguistics, computer science and artificial intelligence, a new branch of science arose: computer linguistics (also known as machine linguistics or automatic natural language text processing). The main stages of this direction's development took place against the backdrop of the evolution of artificial intelligence methods. A powerful impetus for the development of computer linguistics was the creation of the first computers. With the advent of a new generation of computers and programming languages in the 1960s, a fundamentally new stage in the development of this science began. It should also be noted that the origins of computational linguistics go back to the work of the famous American linguist N. Chomsky on formalizing the structure of language. The results of his research, obtained at the intersection of linguistics and mathematics, formed the basis of the theory of formal languages and grammars (generative grammars), which is widely used to describe both natural and artificial languages, in particular programming languages. More precisely, this theory is a fully mathematical discipline; it can be considered one of the first achievements in the area of applied linguistics known as mathematical linguistics.

The first experiments and developments in computational linguistics concerned the creation of machine translation systems and of systems modeling human language abilities. In the late 1980s, with the advent and active development of the Internet, there was rapid growth in the volume of text information available in electronic form. As a result, information retrieval technologies moved to a qualitatively new stage of development. The need arose for the automatic processing of natural language texts, and completely new tasks and technologies appeared. Scientists faced the problem of quickly processing a huge stream of unstructured data. In the search for a solution, great importance came to be attached to the development and application of statistical methods for automatic text processing. With their help, it became possible to solve such problems as dividing texts into clusters united by a common theme, highlighting particular fragments in a text, and so on. In addition, methods of mathematical statistics and machine learning made it possible to tackle speech recognition and the creation of search engines.
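The clustering of texts by common theme can be illustrated with a toy statistical sketch: texts are represented as word-frequency vectors and greedily grouped by cosine similarity. The similarity threshold and the single-pass strategy are illustrative choices, not a production algorithm.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two word-frequency vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(texts, threshold=0.5):
    """Greedy single-pass clustering: join each text to the first similar cluster."""
    vectors = [Counter(t.lower().split()) for t in texts]
    clusters = []  # list of (centroid_vector, member_indices)
    for i, vec in enumerate(vectors):
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                members.append(i)
                centroid.update(vec)  # fold the new text into the centroid
                break
        else:
            clusters.append((Counter(vec), [i]))
    return [members for _, members in clusters]
```

Modern systems replace raw counts with weighted features (e.g. TF-IDF) and use more robust clustering algorithms, but the underlying statistical idea is the same.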

Scientists did not stop at the results achieved: they continued to set new goals and objectives and to develop new techniques and research methods. All this led linguistics to act as an applied science combining a number of other disciplines, with the leading role belonging to mathematics, with its variety of quantitative methods and the ability to apply them for a deeper understanding of the phenomena under study. Thus mathematical linguistics began its formation and development. At present it is a fairly "young" science (it has existed for about fifty years); yet, despite its age, it represents an established field of scientific knowledge with many successful achievements.

Linguistics (from the Latin lingua, "language") is the science that studies languages: the science of natural human language in general and of all the languages of the world as its individual representatives. In the broad sense of the word, linguistics is divided into scientific and practical; most often, "linguistics" means scientific linguistics. It is part of semiotics, the science of signs. Linguistics is professionally studied by linguists.

Linguistics and Computer Science.
Automated information technologies play an important role in the life of modern society. But the development of information technology is very uneven: while the modern level of computer technology and means of communication is astonishing, successes in the field of semantic processing of information are much more modest. These successes depend, first of all, on achievements in the study of the processes of human thinking and of verbal communication between people, and on the ability to model these processes on a computer. And this is a task of extreme complexity. When it comes to creating promising information technologies, the problems of automatically processing text information presented in natural languages come to the fore. This is determined by the fact that a person's thinking is closely connected with his language; moreover, natural language is a tool of thinking. It is also a universal means of communication between people: a means of perceiving, accumulating, storing, processing and transmitting information. The problems of using natural language in automatic information processing systems are the subject of the science of computational linguistics. This science arose relatively recently, at the turn of the fifties and sixties of the last century. At first, during its formation, it had various names: mathematical linguistics, computational linguistics, engineering linguistics. But in the early eighties the name computational linguistics became established.

Computational linguistics is a field of knowledge concerned with the automatic processing of information presented in natural language. Its central scientific problems are modeling the process of understanding the meaning of texts (the transition from a text to a formalized representation of its meaning) and the problem of speech synthesis (the transition from a formalized representation of meaning to texts in natural language). These problems arise when solving a number of applied tasks:
1) automatic detection and correction of errors when entering texts into a computer,
2) automatic analysis and synthesis of oral speech,
3) automatic translation of texts from one language to another,
4) communication with a computer in natural language,
5) automatic classification and indexing of text documents, their automatic abstracting, and searching for documents in full-text databases.
Over the past half century, significant scientific and practical results have been obtained in the field of computational linguistics: systems for machine translation of texts from one natural language to another, automated systems for searching for information in texts, systems for automatic analysis and synthesis of oral speech, and many others. But there have also been disappointments. For example, the problem of machine translation of texts from one language to another turned out to be much more difficult than the pioneers of machine translation and their followers imagined. The same can be said about automated search for information in texts and the tasks of analyzing and synthesizing oral speech. Scientists and engineers will apparently have to work much more to achieve the desired results.

Natural language processing (syntactic, morphological and semantic analysis of text). This also includes:
- Corpus linguistics: the creation and use of electronic text corpora.
- The creation of electronic dictionaries, thesauri and ontologies (for example, Lingvo). Dictionaries are used, for example, for automatic translation and spell checking.
- Automatic translation of texts. Promt is popular among Russian translators; among free systems there is Google Translate.
- Automatic extraction of facts from text (information extraction, text mining).
- Automatic text summarization. This feature is included, for example, in Microsoft Word.
- Building knowledge management systems (see expert systems).
- The creation of question-answering systems.
- Optical character recognition (OCR), for example the FineReader program.
- Automatic speech recognition (ASR); both paid and free software exists.
- Automatic speech synthesis.
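To make one of these tasks concrete, a naive form of extractive summarization scores each sentence by the total frequency of its words in the document and keeps the highest-scoring ones. This is only an illustrative sketch; the `summarize` function and the sample text are invented here, and production systems normalize for sentence length, handle morphology, and use far richer features:

```python
from collections import Counter

def summarize(text, n_sentences=2):
    """Keep the n highest-scoring sentences, where a sentence's score is the
    summed document frequency of its words; preserve the original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in s.split())
    score = lambda s: sum(freq[w.lower()] for w in s.split())
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return ". ".join(s for s in sentences if s in top) + "."

text = ("Machine translation converts text between languages. "
        "Translation systems rely on large text corpora. "
        "The weather was pleasant yesterday. ")
print(summarize(text, n_sentences=1))
```

The off-topic weather sentence shares few words with the rest of the document, so it scores lowest and is the first to be dropped.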

The Faculty of Philology of the Higher School of Economics is launching a new master's program devoted to computational linguistics: it welcomes applicants with a basic education in the humanities or mathematics, and anyone interested in solving problems in one of the most promising branches of science. Its director, Anastasia Bonch-Osmolovskaya, told Theory and Practice what computational linguistics is, why robots will not replace humans, and what will be taught in the HSE master's program in computational linguistics.

This program is almost the only one of its kind in Russia. Where did you study?

I studied at Moscow State University in the Department of Theoretical and Applied Linguistics, Faculty of Philology. I didn’t get there right away, first I entered the Russian department, but then I became seriously interested in linguistics, and I was attracted by the atmosphere that remains in the department to this day. The most important thing there is good contact between teachers and students and their mutual interest.

When I had children and needed to earn a living, I went into commercial linguistics. In 2005 it was not very clear what this field of activity as such was. I worked in different linguistic companies: I began with a small company attached to the site Public.ru, a kind of media library, where I first worked on linguistic technologies. Then I worked for a year at Rosnanotech, where there was an idea to build an analytical portal on which data would be structured automatically. Then I headed the linguistic department at the Avicomp company, which is serious production in the field of computational linguistics and semantic technologies. At the same time, I taught a course on computational linguistics at Moscow State University and tried to make it more modern.

Two resources for a linguist: the National Corpus of the Russian Language, a site created by linguists for scientific and applied research related to the Russian language. It is a model of the Russian language, presented through a huge array of texts of different genres and periods. The texts are equipped with linguistic markup, with the help of which you can obtain information about the frequency of particular linguistic phenomena. WordNet is a huge lexical database of the English language; the main idea of WordNet is to connect not words but their meanings into one large network. WordNet can be downloaded and used for your own projects.

What does computational linguistics do?

This is a highly interdisciplinary field. The most important thing here is to understand what is going on in the electronic world and who can help you do specific things.

We are surrounded by a very large amount of digital information, and there are many business projects whose success depends on information processing; these projects can relate to marketing, politics, economics and anything else. And it is very important to be able to handle this information effectively: what matters is not only the speed of processing, but also the ease with which you can filter out the noise, get the data you need and build a complete picture from it.

Previously, some global ideas were associated with computer linguistics, for example: people thought that machine translation would replace human translation, that robots would work instead of people. But now it seems like a utopia, and machine translation is used in search engines to quickly search in an unknown language. That is, now linguistics rarely deals with abstract problems - mostly with some small things that can be inserted into a large product and make money on it.

One of the big tasks of modern linguistics is the semantic web, when the search occurs not just by matching words, but by meaning, and all sites are in one way or another marked by semantics. This can be useful, for example, for police or medical reports that are written every day. Analysis of internal connections provides a lot of necessary information, but reading and calculating it manually is incredibly time-consuming.

In a nutshell: we have a thousand texts, we need to sort them into groups, present each text in the form of a structure and obtain a table we can already work with. This is called unstructured information processing. On the other hand, computational linguistics deals, for example, with the creation of artificial texts. There is a company that has come up with a mechanism for generating texts on topics that are boring for a person to write about: changes in real estate prices, the weather forecast, reports on football matches. It is much more expensive to order these texts from a person, and computer-generated texts on such topics are written in coherent human language.

Yandex is actively involved in developments in the field of searching for unstructured information in Russia; Kaspersky Lab hires research groups that study machine learning. Is someone in the market trying to come up with something new in the field of computational linguistics?

**Books on computational linguistics:**

Daniel Jurafsky and James H. Martin, "Speech and Language Processing"

Christopher Manning, Prabhakar Raghavan, Hinrich Schütze, "Introduction to Information Retrieval"

Yakov Testelets, “Introduction to General Syntax”

Most linguistic developments are the property of large companies; almost nothing can be found in the public domain. This slows down the development of the industry; we do not have a free linguistic market or packaged solutions.

In addition, there is a lack of comprehensive information resources. There is such a project as the National Corpus of the Russian Language. It is one of the best national corpora in the world; it is rapidly developing and opens up incredible opportunities for scientific and applied research. The difference is about the same as in biology before and after DNA research.

But many resources do not exist for Russian. Thus, there is no analogue to such a wonderful English-language resource as FrameNet, a conceptual network in which all possible connections of a particular word with other words are formally presented. For example, there is the word "fly": who can fly, where, with what preposition this word is used, what words it is combined with, and so on. This resource helps connect language to real life, that is, to track how a specific word behaves at the level of morphology and syntax. It is very useful.

The Avicomp company is currently developing a plugin for searching for articles with similar content. That is, if you are interested in an article, you can quickly look at the history of the story: when the topic arose, what was written, and when the peak of interest in the problem occurred. For example, with the help of this plugin it will be possible, starting from an article devoted to events in Syria, to see very quickly how events there developed over the last year.

How will the learning process in the master's program be structured?

Education at HSE is organized in separate modules, just like in Western universities. Students will be divided into small teams, mini-startups - that is, at the end we should receive several finished projects. We want to get real products, which we will then open to people and leave in the public domain.

In addition to the students' immediate project managers, we want to find them curators from among their potential employers - from the same Yandex, for example, who will also play this game and give the students some advice.

I hope that people will come from very different areas: programmers, linguists, sociologists, marketers. We will have several adaptation courses in linguistics, mathematics and programming. Then we will have two serious courses in linguistics, connected with the most relevant linguistic theories; we want our graduates to be able to read and understand modern linguistic articles. The same goes for mathematics. We will have a course called "Mathematical Foundations of Computational Linguistics," which will cover the branches of mathematics on which modern computational linguistics is based.

In order to enroll in a master's program, you need to pass an entrance exam in language and pass a portfolio competition.

In addition to the main courses, there will be a line of elective subjects. We have planned several cycles: two of them are focused on a more in-depth study of individual topics, including, for example, machine translation and corpus linguistics, while one, on the contrary, is related to adjacent areas, such as social networks, machine learning or Digital Humanities, a course that we hope will be taught in English.