Computational linguistics as an applied linguistic discipline

21.09.2019
Plan:

1. What is computational linguistics?

2. The object and subject of computational linguistics

3. Problems of computational linguistics

4. Research methods in computational linguistics

5. History of and reasons for the emergence of computational linguistics

6. Basic terms of computational linguistics

7. Scientists working on problems of computational linguistics

8. Associations and conferences on computational linguistics

9. References


Computational linguistics is an independent direction in applied linguistics, focused on the use of computers to solve problems involving natural language. (Shchilikhina K.M.)


Computational linguistics, as one of the areas of applied linguistics, studies the linguistic foundations of computer science and all aspects of the connection between language and thought, modeling language and thought in a computer environment by means of computer programs. Its interests lie in: 1) optimization of communication based on linguistic knowledge; 2) creation of natural language interfaces and typologies of language understanding for human-machine communication; 3) creation and modeling of computer information systems (Sosnina E.P.)


The object of computational linguistics is the analysis of language in its natural state, as used by people in various communicative situations, and the question of how the features of language can be formally described.


Tasks of computational linguistics:


Computational linguistics research methods:

1. The modeling method: used for studying objects that are not accessible to direct observation. In the definition of the mathematician Claude Shannon, a model is a representation of an object in some form different from the form of its real existence.

2. The knowledge representation method: methods of representing knowledge oriented towards automatic processing by modern computers.

3. The programming language theory method: a field of computer science concerned with the design, analysis, characterization and classification of programming languages and the study of their individual features.


Reasons for the emergence of computational linguistics

1. The emergence of computers

2. The problem of untrained users communicating with computers


1. A dictionary lookup system developed at Birkbeck College, London, in 1948

2. Warren Weaver's memorandum (1949)

3. The introduction of the first computers into machine translation

4. The Georgetown experiment in 1954


The 1960s-70s:

1. the ALPAC (Automatic Language Processing Advisory Committee) report
2. a new stage in the development of computer technology and its active use for linguistic tasks
3. the creation of a new generation of computers and programming languages
4. growing interest in machine translation


Late 1980s - early 1990s:

  • The emergence and active development of the Internet

  • Rapid growth in the volume of text information in electronic form

  • The need for automatic processing of natural-language texts


Modern commercial systems:

1. Products of PROMT and ABBYY (Lingvo)
2. Machine translation technologies
3. Translation memory technologies

  • Reviving texts

  • Communication models

  • Computer lexicography

  • Machine translation

  • Text corpora


Natural language text analysis

3 levels of text structure:
  • Surface syntactic structure

  • Deep syntactic structure

  • Semantic level
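The three levels can be made concrete in code. The sketch below is purely illustrative: the sentence, the structure labels, and the toy `analyze()` mapping are invented assumptions, not the output of a real parser.

```python
# A toy illustration of the three levels of text structure (not a real parser).

sentence = "The cat drank the milk"

# 1. Surface syntactic structure: the words in their surface order.
surface = ["The", "cat", "drank", "the", "milk"]

# 2. Deep syntactic structure: the predicate and its grammatical relations,
#    abstracted away from word order and function words.
deep = {"predicate": "drank",
        "subject": "cat",
        "object": "milk"}

# 3. Semantic level: a predicate-argument ("who did what to whom") frame.
semantic = {"event": "DRINKING",
            "agent": "CAT",
            "theme": "MILK",
            "time": "PAST"}

def analyze(tokens):
    """Map a 5-token 'DET N V DET N' sentence to a deep structure (toy rule)."""
    return {"predicate": tokens[2], "subject": tokens[1], "object": tokens[4]}
```

Each level strips away more surface detail: synthesis, as the text notes next, runs this mapping in the opposite direction.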


The problem of synthesis is the inverse of the problem of analysis.

Bringing text to life:

1. Exchanging texts via visual images on a display screen

2. Two modalities of human thinking: symbolic and visual


Communication models:

1. Imitation of the communication process
2. Creation of an effective dialogue model


Hypertext is a special way of organizing and presenting text, in which several texts or text fragments can be interconnected by various types of links.
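The definition above can be sketched as a small graph of text fragments joined by typed links, which is what makes the reading order non-linear. The fragment names, texts, and link types below are invented for illustration.

```python
# A minimal sketch of hypertext: fragments (nodes) joined by typed links.

hypertext = {
    "intro":      {"text": "Computational linguistics uses computers for language.",
                   "links": [("definition", "see-also"), ("history", "next")]},
    "definition": {"text": "It is a field of applied linguistics.",
                   "links": [("history", "next")]},
    "history":    {"text": "The field emerged in the 1950s.",
                   "links": []},
}

def reachable(net, start):
    """All fragments a reader can reach from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(target for target, _ in net[node]["links"])
    return seen
```

In a traditional text the reading path is a single chain; here `reachable()` shows that many paths through the same fragments exist.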


Differences between hypertext and traditional text

Natural language processing includes:

1. processing of spoken language

2. processing of written text


Spoken language processing:

1. automatic speech synthesis

A) development of text-to-speech synthesizers, which include two blocks: a linguistic text-processing block and an acoustic synthesis block.

2. automatic speech recognition


Written text processing:

1) text recognition

2) text analysis

3) text synthesis


An information retrieval system (IRS) is a software system for storing, searching for, and delivering information of interest.

Zakharov V.P. defines an IRS as an ordered set of documents and information technologies intended for storing and retrieving information, whether texts or data.


Three types of IRS:

  • Manual: searching in a library

  • Mechanized: technical means that support the selection of the required documents

  • Automatic: searching for information using a computer
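The "automatic" type can be sketched with an inverted index, the classic data structure behind computer-based document search: each word maps to the set of documents containing it, and a query returns the documents containing all query words. The three sample documents are invented.

```python
from collections import defaultdict

# A minimal sketch of an automatic IRS: an inverted index over toy documents.

docs = {
    1: "machine translation converts text between natural languages",
    2: "a corpus is a structured collection of texts",
    3: "machine learning methods help process natural language text",
}

# Build the index: word -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = index[words[0]].copy()
    for word in words[1:]:
        result &= index[word]          # intersect: all words must occur
    return result
```

For example, `search("machine")` finds documents 1 and 3, while `search("corpus")` finds only document 2.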


Computer lexicography

Computer lexicography is one of the important areas of applied linguistics; it deals with the theory and practice of compiling dictionaries.

There are two directions in lexicography:

  • Traditional lexicography compiles traditional dictionaries

  • Machine lexicography automates dictionary preparation and addresses the development of electronic dictionaries


Tasks of computer lexicography

  • Automatically deriving various dictionaries from text

  • Creating dictionaries that are electronic versions of traditional dictionaries, or complex electronic linguistic dictionaries for traditional dictionary work, for example LINGVO

  • Developing theoretical and practical aspects of compiling special computer dictionaries, for example for information retrieval or machine translation
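The first task, automatically deriving a dictionary from text, can be sketched as a word-frequency count. The sample sentence is invented, and a real system would also lemmatize word forms rather than count raw forms.

```python
from collections import Counter
import re

# A minimal sketch: building a frequency dictionary from raw text.

def frequency_dictionary(text):
    """Count word-form frequencies, ignoring case and punctuation."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

sample = "The corpus contains texts. The texts are annotated."
freq = frequency_dictionary(sample)
# freq.most_common(2) -> [('the', 2), ('texts', 2)]
```

The resulting `Counter` is already a simple frequency dictionary; sorting it by count gives the core of many corpus-derived dictionaries.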


Machine translation

Machine translation is the conversion of text in one natural language into text in another natural language by computer.

Types of machine translation:

  • FAMT (Fully Automated Machine Translation): fully automatic translation

  • HAMT (Human-Aided Machine Translation): machine translation with human participation

  • MAHT (Machine-Aided Human Translation): translation carried out by a person using auxiliary software and linguistic tools


  • 2) professional MT: higher-quality translation followed by human editing

  • 3) interactive MT: translation in special support systems, carried out in dialogue mode with a computer system. The quality of MT depends on customization options, resources, and the type of texts.
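To see why fully automatic translation is hard, here is a deliberately naive word-for-word sketch over an invented English-to-German glossary. Real systems also model syntax, morphology, and context; the gap between this toy output and a fluent translation is exactly what human editing in HAMT compensates for.

```python
# A deliberately naive word-for-word MT sketch (invented toy glossary).

glossary = {"the": "die", "cat": "Katze", "drinks": "trinkt", "milk": "Milch"}

def translate(sentence):
    """Replace each word by its glossary entry, keeping unknown words as-is."""
    return " ".join(glossary.get(w, w) for w in sentence.lower().split())
```

For example, `translate("The cat drinks milk")` yields "die Katze trinkt Milch", but any word outside the glossary passes through untranslated, and agreement and word order are never checked.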

Corpus of texts

A text corpus is a collection of texts based on a logical concept, a logical idea that unites them.

A language corpus is a large, electronically stored, unified, structured, annotated, philologically competent array of language data designed for solving specific linguistic problems.


Representativeness is the most important property of a corpus


The purpose of a language corpus is to show the functioning of linguistic units in their natural contextual environment.



Based on a corpus, you can obtain the following data:

1. on the frequency of grammatical categories

2. on changes in frequencies

3. on changes in contexts across different time periods

4. on the co-occurrence of lexical units

5. on the features of their combinability
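The frequency and co-occurrence data above can be sketched with plain counting over a tiny invented corpus: unigram frequencies for word forms and bigram counts for co-occurring neighbors.

```python
from collections import Counter

# A minimal sketch of corpus statistics: word frequencies and
# bigram co-occurrence counts over a tiny invented corpus.

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)                 # single-word frequencies
    bigrams.update(zip(tokens, tokens[1:])) # adjacent co-occurrences
```

Here `unigrams["the"]` is 4 and `bigrams[("sat", "on")]` is 2; on a real corpus the same counts, normalized and tracked over time periods, yield the frequency and combinability data listed above.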


The Brown Corpus


A text corpus is a collection of texts based on a logical concept, a logical idea that unites them. This logical idea is embodied in the rules for organizing texts into a corpus, in the algorithms and programs for analyzing the corpus, and in the associated ideology and methodology. A national corpus represents a given language at a certain stage (or stages) of its existence and in all the diversity of its genres, styles, and territorial and social variants.

Basic terms of computational linguistics

    Programming languages (PLs) are a class of artificial languages intended for processing information with a computer. Any programming language is a strict (formal) sign system by means of which computer programs are written. By various estimates, there are currently between one and ten thousand different programming languages.

  • Computer science is the science of the laws governing the recording, storage, processing, transmission and use of information by technical means.



Information retrieval is the process of finding documents (texts, records, etc.) that match a received query.

An information retrieval system (IRS) is an ordered set of documents (document arrays) and information technologies intended for storing and retrieving information: texts (documents) or data (facts).

Machine lexicography (computational lexicography) deals with automating the preparation of dictionaries and addresses the development of electronic dictionaries.

Machine translation is the transformation by computer of a text in one natural language into a content-equivalent text in another natural language.

Hypertext is a technology for organizing information: a specially structured text, divided into separate blocks and presented non-linearly, for the effective presentation of information in computer environments.


    A frame is a structure for representing declarative knowledge about a typified, thematically unified situation, i.e., a data structure about a stereotypical situation.

  • A scenario is a sequence of several episodes in time; it too represents a stereotypical situation or stereotypical behavior, but its elements are steps of an algorithm or instructions.

  • A plan is a representation of knowledge about possible actions that are necessary to achieve a certain goal.
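The three knowledge structures can be sketched as Python data, using the classic "restaurant" situation; all slot names and values below are invented illustrations, not a standard of knowledge representation.

```python
# A sketch of frame, scenario, and plan as data structures (toy example).

# Frame: declarative knowledge about a typified situation (slots and fillers).
restaurant_frame = {
    "situation": "restaurant visit",
    "participants": ["customer", "waiter"],
    "props": ["menu", "table", "bill"],
}

# Scenario: the same situation as an ordered sequence of episodes.
restaurant_scenario = ["enter", "order", "eat", "pay", "leave"]

# Plan: actions selected to reach a stated goal.
plan = {"goal": "have dinner",
        "actions": ["find restaurant", "order food", "eat"]}

def next_step(scenario, done):
    """Return the first scenario step that has not yet happened."""
    for step in scenario:
        if step not in done:
            return step
    return None
```

The frame answers "what belongs to this situation?", the scenario "in what order does it unfold?", and the plan "which actions reach the goal?": the same distinction the definitions above draw in prose.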



Scientists in the field of computational linguistics:

  • Soviet and Russian scientists: Alexey Lyapunov, Igor Melchuk, Olga Kulagina, Yu.D. Apresyan, N.N. Leontyeva, Yu.S. Martemyanov, Z.M. Shalyapina, Igor Boguslavsky, A.S. Narin'yani, A.E. Kibrik, A.N. Baranov.

  • Western scientists: Yorick Wilks, Gregory Grefenstette, Greville Corbett, John Carroll, Diana McCarthy, Lluís Màrquez, Dan Moldovan, Joakim Nivre, Victor Raskin, Eduard Hovy.


Associations and conferences in computational linguistics:
  • "Dialogue" is the main Russian conference on computational linguistics, with international participation.

The Dialogue's priority is computer modeling of the Russian language. The working languages of the conference are Russian and English. To attract foreign reviewers, most applied papers are submitted in English.

Main directions of the conference:
  • Linguistic semantics and semantic analysis

  • Formal language models and their applications

  • Theoretical and computer lexicography

  • Methods for evaluation of text analysis and machine translation systems

  • Corpus linguistics. Creation, application, evaluation of corpora

  • Internet as a linguistic resource. Linguistic technologies on the Internet

  • Ontologies. Knowledge extraction from texts

  • Computer analysis of documents: abstracting, classification, search

  • Automatic sentiment analysis of texts

  • Machine translation

  • Communication models. Communication, dialogue and speech acts

  • Speech analysis and synthesis



2. The Association for Computational Linguistics (ACL) is an international scientific and professional society of people working on problems involving natural language and computing. The annual meeting is held every summer in locations where significant computational linguistics research is carried out. Founded in 1962, it was originally named the Association for Machine Translation and Computational Linguistics (AMTCL); in 1968 it became the ACL.
  • The ACL has European (EACL) and North American (NAACL) branches.

  • The ACL journal, Computational Linguistics, is the premier forum for research in computational linguistics and natural language processing. Since 1988 the journal has been published for the ACL by MIT Press.

  • The ACL book series, Research in Natural Language Processing, is published by Cambridge University Press.

  • Every year ACL and its chapters organize international conferences in different countries.

ACL 2014 was held in Baltimore, USA.

References:

1. Marchuk Yu.N. Computational Linguistics: textbook. Moscow: AST: East-West, 2007. 317 p.

2. Shilikhina K.M. Fundamentals of Applied Linguistics: textbook for specialty 021800 (031301), Theoretical and Applied Linguistics. Voronezh, 2006.

3. Boyarsky K.K. Introduction to Computational Linguistics: textbook. St. Petersburg: NRU ITMO, 2013. 72 p.

4. Shchipitsina L.Yu. Information Technologies in Linguistics: textbook. Moscow: Flinta: Nauka, 2013. 128 p.

5. Sosnina E.P. Introduction to Applied Linguistics: textbook. 2nd ed., revised and enlarged. Ulyanovsk: Ulyanovsk State Technical University, 2012. 110 p.

6. Baranov A.N. Introduction to Applied Linguistics: textbook. Moscow: Editorial URSS, 2001. 360 p.

7. Applied Linguistics: textbook / L.V. Bondarko, L.A. Verbitskaya, G.Ya. Martynenko et al.; ed. A.S. Gerd. St. Petersburg: St. Petersburg University Press, 1996. 528 p.

8. Shemyakin Yu.I. Beginnings of Computer Linguistics: textbook. Moscow: MGOU, Rosvuznauka, 1992.


    History of the development of computational linguistics

    The formation of modern linguistics as a science of natural language represents a long historical development of linguistic knowledge. Linguistic knowledge rests on elements that took shape in the course of activities inextricably linked with the development of the structure of oral speech, the emergence, further development and improvement of writing, the teaching of writing, and the interpretation and decoding of texts.

    Natural language as an object of linguistics occupies a central place in this science. In the course of language development, ideas about it also changed. If previously no special importance was attached to the internal organization of language, which was considered primarily in the context of its relationship with the outside world, then, starting from the late 19th and early 20th centuries, a special role was assigned to the internal formal structure of language. It was in this period that the famous Swiss linguist Ferdinand de Saussure developed the foundations of such sciences as semiology and structural linguistics, set out in detail in his book "A Course in General Linguistics" (1916).

    The scientist proposed considering language as a single mechanism, an integral system of signs, which in turn makes it possible to describe language mathematically. Saussure was the first to propose a structural approach to language, namely, a description of language by studying the relationships between its units. By a unit, or "sign," he understood the word, which combines both meaning and sound. The concept proposed by the Swiss scientist is based on the theory of language as a system of signs consisting of three parts: language (French langue), speech (French parole) and speech activity (French langage).

    The scientist himself defined the science he created, semiology, as "a science that studies the life of signs within the life of society." Since language is a sign system, in answering the question of what place linguistics occupies among the other sciences, Saussure argued that linguistics is part of semiology. It is generally accepted that the Swiss philologist laid the theoretical foundation of a new direction in linguistics, becoming the founder, the "father," of modern linguistics.

    The concept put forward by F. de Saussure was further developed in the works of many outstanding scientists: in Denmark, L. Hjelmslev; in the Czech lands, N. Trubetzkoy; in the USA, L. Bloomfield, Z. Harris and N. Chomsky. As for our country, structural linguistics began its development here at approximately the same time as in the West, at the turn of the 19th and 20th centuries, in the works of F. Fortunatov and I. Baudouin de Courtenay. It should be noted that I. Baudouin de Courtenay worked closely with F. de Saussure. If Saussure laid the theoretical foundation of structural linguistics, Baudouin de Courtenay can be considered the person who laid the foundations for the practical application of the methods proposed by the Swiss scientist. It was he who defined linguistics as a science that uses statistical methods and functional dependencies, and separated it from philology. The first field to apply mathematical methods in linguistics was phonology, the science of the sound structure of language.

    It should be noted that the postulates put forward by F. de Saussure were reflected in the problems of linguistics that became relevant in the middle of the 20th century. It was in this period that a clear tendency towards the mathematization of the science of language emerged. In almost all large countries the rapid development of science and computer technology began, which in turn demanded ever newer linguistic foundations. The result was a rapid convergence of the exact sciences and the humanities, and the active interaction of mathematics and linguistics found practical application in solving current scientific problems.

    In the 1950s, at the junction of such sciences as mathematics, linguistics, computer science and artificial intelligence, a new branch of science arose: computational linguistics (also known as machine linguistics or automatic natural-language text processing). The main stages in the development of this field took place against the backdrop of the evolution of artificial intelligence methods. A powerful impetus for the development of computational linguistics was the creation of the first computers. However, with the advent of a new generation of computers and programming languages in the 1960s, a fundamentally new stage in the development of this science began. It should also be noted that the origins of computational linguistics go back to the work of the famous American linguist N. Chomsky on formalizing the structure of language. The results of his research, obtained at the intersection of linguistics and mathematics, formed the basis for the theory of formal languages and grammars (generative grammars), which is widely used to describe both natural and artificial languages, in particular programming languages. Strictly speaking, this theory is an entirely mathematical discipline; it can be considered one of the first products of such a direction of applied linguistics as mathematical linguistics.

    The first experiments and developments in computational linguistics involved the creation of machine translation systems, as well as systems modeling human language abilities. At the end of the 1980s, with the advent and active development of the Internet, came rapid growth in the volume of text information available in electronic form. This moved information retrieval technologies to a qualitatively new stage of development. The need arose for automatic processing of natural-language texts, and completely new tasks and technologies appeared. Scientists faced the problem of quickly processing a huge stream of unstructured data. In search of a solution, great importance was given to the development and application of statistical methods in automatic text processing. With their help it became possible to solve such tasks as dividing texts into clusters united by a common theme, highlighting certain fragments in a text, and so on. In addition, the use of mathematical statistics and machine learning made it possible to address speech recognition and the creation of search engines.

    Scientists did not stop at the results achieved: they continued to set new goals and objectives and to develop new techniques and research methods. All this led linguistics to act as an applied science, combining a number of other sciences, the leading role among which belongs to mathematics with its variety of quantitative methods and the ability to use them for a deeper understanding of the phenomena under study. This is how mathematical linguistics began its formation and development. At the moment it is a fairly "young" science (it has existed for about fifty years); however, despite its very "young age," it is an already established field of scientific knowledge with many successful achievements.

    The Faculty of Philology of the Higher School of Economics is launching a new master's program devoted to computational linguistics: it welcomes applicants with a basic education in the humanities or mathematics, and anyone interested in solving problems in one of the most promising branches of science. Its director, Anastasia Bonch-Osmolovskaya, told Theories and Practices what computational linguistics is, why robots will not replace humans, and what will be taught in the HSE master's program in computational linguistics.

    This program is almost the only one of its kind in Russia. Where did you study?

    I studied at Moscow State University in the Department of Theoretical and Applied Linguistics, Faculty of Philology. I didn’t get there right away, first I entered the Russian department, but then I became seriously interested in linguistics, and I was attracted by the atmosphere that remains in the department to this day. The most important thing there is good contact between teachers and students and their mutual interest.

    When I had children and needed to earn a living, I went into commercial linguistics. In 2005 it was not very clear what this area of activity even was. I worked in different linguistic companies: I started with a small company at the site Public.ru, a kind of media library, where I began working on linguistic technologies. Then I worked for a year at Rosnanotech, where there was an idea to create an analytical portal on which data would be automatically structured. Then I headed the linguistic department at the Avicomp company, which is already serious production in the field of computational linguistics and semantic technologies. At the same time I taught a course on computational linguistics at Moscow State University and tried to make it more modern.

    Two resources for a linguist. The first is a site created by linguists for scientific and applied research related to the Russian language: a model of the Russian language, presented through a huge array of texts of different genres and periods. The texts carry linguistic markup, with the help of which one can obtain information about the frequency of particular linguistic phenomena. The second, Wordnet, is a huge lexical database of the English language; Wordnet's main idea is to connect into one large network not words but their meanings. Wordnet can be downloaded and used for your own projects.
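The core idea behind Wordnet, linking meanings rather than word forms, can be sketched with a tiny hand-built network of word senses connected by hypernym ("is-a") links. The sense inventory below is an invented fragment, not real Wordnet data.

```python
# A toy network in the spirit of Wordnet: senses linked by hypernymy.

hypernym = {
    "dog.n.01":    "canine.n.01",
    "canine.n.01": "animal.n.01",
    "cat.n.01":    "feline.n.01",
    "feline.n.01": "animal.n.01",
}

def ancestors(sense):
    """Chain of increasingly general senses above `sense`."""
    chain = []
    while sense in hypernym:
        sense = hypernym[sense]
        chain.append(sense)
    return chain

def related(a, b):
    """Two senses count as related if their hypernym chains meet."""
    return bool(set(ancestors(a) + [a]) & set(ancestors(b) + [b]))
```

Here "dog" and "cat" are related because their chains meet at "animal", which is exactly the kind of meaning-level connection the real Wordnet encodes at scale.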

    What does computational linguistics do?

    This is the most interdisciplinary field. The most important thing here is to understand what is going on in the electronic world and who will help you do specific things.

    We are surrounded by a very large amount of digital information, there are many business projects, the success of which depends on the processing of information, these projects can relate to the field of marketing, politics, economics and anything else. And it is very important to be able to handle this information effectively - the main thing is not only the speed of processing information, but also the ease with which you can, after filtering out the noise, get the data you need and create a complete picture from it.

    Previously, some global ideas were associated with computer linguistics, for example: people thought that machine translation would replace human translation, that robots would work instead of people. But now it seems like a utopia, and machine translation is used in search engines to quickly search in an unknown language. That is, now linguistics rarely deals with abstract problems - mostly with some small things that can be inserted into a large product and make money on it.

    One of the big tasks of modern linguistics is the semantic web, when the search occurs not just by matching words, but by meaning, and all sites are in one way or another marked by semantics. This can be useful, for example, for police or medical reports that are written every day. Analysis of internal connections provides a lot of necessary information, but reading and calculating it manually is incredibly time-consuming.

    In a nutshell: we have a thousand texts, we need to sort them into groups, present each text as a structure, and get a table we can actually work with. This is called unstructured information processing. On the other hand, computational linguistics deals, for example, with the creation of artificial texts. One company has come up with a mechanism for generating texts on topics that are boring for a person to write about: changes in real estate prices, weather forecasts, football match reports. It is much more expensive to commission such texts from a person, and computer-generated texts on such topics read as coherent human language.
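The "thousand texts sorted into groups" task can be sketched with bag-of-words vectors and cosine similarity. The sample texts, the greedy grouping rule, and the 0.3 threshold below are invented simplifications; real systems use TF-IDF weighting and proper clustering algorithms.

```python
from collections import Counter
import math

# A minimal sketch of grouping texts by lexical similarity.

def vector(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def group(texts, threshold=0.3):
    """Assign each text to the first group whose seed is similar enough."""
    groups = []
    for text in texts:
        v = vector(text)
        for g in groups:
            if cosine(v, g["seed"]) >= threshold:
                g["members"].append(text)
                break
        else:  # no similar group found: start a new one
            groups.append({"seed": v, "members": [text]})
    return [g["members"] for g in groups]
```

On three invented headlines, the two real-estate texts end up together while the football report forms its own group, a miniature of the table-building step described above.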

    Yandex is actively involved in developments in the field of searching for unstructured information in Russia; Kaspersky Lab hires research groups that study machine learning. Is someone in the market trying to come up with something new in the field of computational linguistics?

    **Books on computational linguistics:**

    Daniel Jurafsky and James H. Martin, Speech and Language Processing

    Christopher Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval

    Yakov Testelets, Introduction to General Syntax

    Most linguistic developments are the property of large companies; almost nothing can be found in the public domain. This slows down the development of the industry; we do not have a free linguistic market or packaged solutions.

    In addition, there is a lack of full-fledged information resources. There is a project such as the National Corpus of the Russian Language. It is one of the best national corpora in the world, rapidly developing and opening up incredible opportunities for scientific and applied research. The difference is about the same as in biology before and after DNA research.

    But many resources do not exist for Russian. Thus, there is no analogue to such a wonderful English-language resource as FrameNet, a conceptual network where all possible connections of a particular word with other words are formally presented. For example, there is the word "fly": who can fly, where, with what preposition this word is used, what words it combines with, and so on. This resource helps connect language with real life, that is, to track how a specific word behaves at the level of morphology and syntax. It is very useful.

    The Avicomp company is currently developing a plugin for finding articles with similar content. That is, if you are interested in an article, you can quickly look at the history of the story: when the topic arose, what was written, and when the peak of interest in the problem occurred. For example, with the help of this plugin it will be possible, starting from an article devoted to events in Syria, to see very quickly how events there developed over the past year.

    How will the learning process in the master's program be structured?

    Education at HSE is organized in separate modules, just like in Western universities. Students will be divided into small teams, mini-startups - that is, at the end we should receive several finished projects. We want to get real products, which we will then open to people and leave in the public domain.

    In addition to the students' immediate project managers, we want to find them curators from among their potential employers - from the same Yandex, for example, who will also play this game and give the students some advice.

    I hope that people will come from the most varied areas: programmers, linguists, sociologists, marketers. We will have several adaptation courses in linguistics, mathematics and programming. Then we will have two serious courses in linguistics related to the most relevant linguistic theories; we want our graduates to be able to read and understand modern linguistic articles. The same goes for mathematics. We will have a course called "Mathematical Foundations of Computational Linguistics," which will cover the branches of mathematics on which modern computational linguistics is based.

    In order to enroll in a master's program, you need to pass an entrance exam in language and pass a portfolio competition.

    In addition to the main courses, there will be a line of elective subjects. We have planned several cycles: two are focused on more in-depth study of individual topics, including, for example, machine translation and corpus linguistics, and one, on the contrary, is connected with adjacent areas such as social media, machine learning, or the Digital Humanities (a course we hope will be taught in English).


    COMPUTATIONAL LINGUISTICS, a direction in applied linguistics focused on the use of computer tools (programs, computer technologies for organizing and processing data) for modeling the functioning of language in certain conditions, situations and problem areas, as well as the entire sphere of application of computer language models in linguistics and related disciplines. Strictly speaking, only in the latter case are we talking about applied linguistics in the narrow sense, since computer modeling of language can also be considered a field of application of computer science and programming theory to solving problems in the science of language. In practice, however, computational linguistics covers almost everything connected with the use of computers in linguistics.

    Computational linguistics took shape as a special scientific field in the 1960s. The Russian term for the field is a translation of the English "computational linguistics." Since the adjective computational can also be translated as "calculational," a parallel term is also found in the literature, but in Russian science it has taken on a narrower meaning, approaching the concept of "quantitative linguistics." The flow of publications in this area is very large. In addition to thematic collections, the journal Computational Linguistics is published quarterly in the United States. A great deal of organizational and scientific work is conducted by the Association for Computational Linguistics, which has regional structures (in particular, a European branch). Every two years international conferences on computational linguistics, COLING, are held. Relevant issues are also usually widely represented at various conferences on artificial intelligence.

    Toolkit for Computational Linguistics.

    Computational linguistics as a special applied discipline is distinguished primarily by its instrument, i.e., by the use of computer tools for processing language data. Since computer programs modeling particular aspects of the functioning of a language can use a variety of programming tools, it might seem unnecessary to speak of a general conceptual apparatus of computational linguistics. However, this is not so. There are general principles of computer modeling of thinking that are implemented in one way or another in any computer model. They are based on the theory of knowledge, which originally developed within artificial intelligence and later became one of the branches of cognitive science. The most important conceptual categories of computational linguistics are knowledge structures such as "frames" (conceptual structures for the declarative representation of knowledge about a typified, thematically unified situation), "scenarios" (conceptual structures for the procedural representation of knowledge about a stereotypical situation or stereotypical behavior), and "plans" (knowledge structures that capture ideas about possible actions leading to a certain goal). Closely related to the category of frame is the concept of "scene". The scene category is used in the computational linguistics literature mainly to designate a conceptual structure for the declarative representation of situations and their parts that are actualized and highlighted by language means (lexemes, syntactic constructions, grammatical categories, etc.).

    A set of knowledge structures organized in a certain way forms the "world model" of a cognitive system and of its computer model. In artificial intelligence systems, the world model forms a special block which, depending on the chosen architecture, may include general knowledge about the world (in the form of simple propositions such as "it is cold in winter" or of production rules such as "if it is raining outside, put on a raincoat or take an umbrella"), specific facts ("The highest peak in the world is Everest"), and also values and their hierarchies, sometimes set apart in a special "axiological block".
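
    The two kinds of general knowledge mentioned above, simple propositions and production rules, can be sketched as a tiny fact base plus an if-then rule interpreter. The facts and rules are invented for illustration and mirror the examples in the text.

```python
# Illustrative sketch of a "world model" block: specific facts stored
# as a dictionary, general knowledge as production rules (condition ->
# conclusion) evaluated against the fact base.

facts = {"it_is_raining": True, "highest_peak": "Everest"}

# Each rule pairs a condition over the facts with a conclusion.
rules = [
    (lambda f: f.get("it_is_raining"), ("advice", "take an umbrella")),
    (lambda f: f.get("season") == "winter", ("expect", "cold weather")),
]

def infer(facts, rules):
    # Fire every rule whose condition holds; collect the conclusions.
    conclusions = []
    for condition, conclusion in rules:
        if condition(facts):
            conclusions.append(conclusion)
    return conclusions

print(infer(facts, rules))  # [('advice', 'take an umbrella')]
```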

    Most elements of the conceptual toolkit of computational linguistics are homonymous: they simultaneously designate real entities of the human cognitive system and the ways of representing these entities used in their theoretical description and modeling. In other words, the elements of the conceptual apparatus of computational linguistics have both an ontological and an instrumental aspect. In the ontological aspect, for example, the division into declarative and procedural knowledge corresponds to different types of knowledge available to a person: knowledge-THAT (declarative; e.g., knowledge of the postal address of some person NN) on the one hand, and knowledge-HOW (procedural; e.g., knowledge that allows one to find that person's apartment even without knowing the formal address) on the other. In the instrumental aspect, knowledge can be embodied in a set of descriptions, in a body of data, on the one hand, or in an algorithm, an instruction carried out by a computer or by some other model of a cognitive system, on the other.

    Directions of computational linguistics.

    The field of CL is very diverse and includes such areas as computer modeling of communication, modeling of plot structure, hypertext technologies for text presentation, machine translation, and computer lexicography. In a narrow sense, the problems of CL are often associated with an interdisciplinary applied area bearing the somewhat unfortunate name "natural language processing" (a translation of the English term Natural Language Processing). It arose in the late 1960s and developed within the scientific and technological discipline of "artificial intelligence". By its internal form, the phrase "natural language processing" should cover all areas in which computers are used to process language data. In practice, however, a narrower understanding of the term has taken hold: the development of methods, technologies, and specific systems that support human communication with a computer in natural or restricted natural language.

    The field of "natural language processing" developed rapidly in the 1970s, which was connected with an unexpected exponential growth in the number of computer end users. Since it is impossible to teach programming languages and techniques to all users, the problem of organizing their interaction with computer programs arose. The solution to this communication problem followed two main paths. In the first case, attempts were made to adapt programming languages and operating systems to the end user. As a result, high-level languages such as Visual Basic appeared, as well as convenient operating systems built in the conceptual space of metaphors familiar to humans, such as DESK and LIBRARY. The second path was the development of systems that would allow interaction with a computer in a specific problem area in natural language or some restricted version of it.

    In the general case, the architecture of natural language processing systems includes a block for analyzing the user's message, a block for interpreting the message, a block for generating the meaning of the response, and a block for synthesizing the surface structure of the utterance. A special part of the system is the dialogue component, which records dialogue strategies, the conditions for applying these strategies, and ways of overcoming possible communication failures.
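
    The four-block architecture just described can be sketched as a pipeline of stub functions. Every function body below is a deliberately trivial placeholder (real analysis, interpretation, and synthesis are far richer); the sketch only shows how the blocks chain together.

```python
# Schematic sketch of the block architecture described above:
# analysis -> interpretation -> response planning -> surface synthesis.
# All logic is placeholder-level, for illustration only.

def analyze(utterance):
    # Block 1: analysis of the user's message (here: crude tokenization).
    return {"tokens": utterance.lower().rstrip("?").split()}

def interpret(parsed):
    # Block 2: interpretation (here: a toy intent decision).
    intent = "question" if "what" in parsed["tokens"] else "statement"
    return {"intent": intent}

def plan_response(meaning):
    # Block 3: generating the meaning of the response.
    if meaning["intent"] == "question":
        return {"speech_act": "answer"}
    return {"speech_act": "acknowledge"}

def synthesize(plan):
    # Block 4: synthesis of the surface form of the utterance.
    if plan["speech_act"] == "answer":
        return "Here is an answer."
    return "Noted."

def dialogue_system(utterance):
    return synthesize(plan_response(interpret(analyze(utterance))))

print(dialogue_system("What is hypertext?"))  # Here is an answer.
```

    A dialogue component would sit above this pipeline, choosing strategies and recovering from communication failures; it is omitted here for brevity.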

    Among computer natural language processing systems, one usually distinguishes question-answering systems, interactive problem-solving systems, and systems for processing connected text. Question-answering systems were initially developed as a reaction to the poor quality of query encoding when searching for information in information retrieval systems. Since the problem area of such systems was highly restricted, this somewhat simplified the algorithms for translating queries into a formal-language representation and the reverse procedure of converting a formal representation into natural-language statements. Among Russian developments, programs of this type include the POET system, created by a team of researchers under the leadership of E.V. Popov. The system processes requests in Russian (with minor restrictions) and synthesizes an answer. The program's flow involves passing through all stages of analysis (morphological, syntactic, and semantic) and the corresponding stages of synthesis.

    Interactive problem-solving systems, unlike systems of the previous type, play an active role in communication, since their task is to obtain a solution to a problem on the basis of the knowledge represented in the system and the information that can be obtained from the user. The system contains knowledge structures that record typical sequences of actions for solving problems in a given problem area, as well as information about the required resources. When the user asks a question or sets a specific task, the corresponding script is activated. If some components of the script are missing, or some resources are lacking, the system initiates communication. This is how, for example, the SNUKA system works, which solves problems of planning military operations.

    Systems for processing connected texts are quite diverse in structure. Their common feature is the extensive use of knowledge representation technologies. The function of such systems is to understand a text and to answer questions about its content. Understanding is treated not as a universal category but as a process of extracting information from a text, determined by a specific communicative intention. In other words, the text is "read" only on the assumption that it is precisely certain things that the potential user wants to learn from it. Thus, systems for processing connected texts turn out to be not universal at all, but problem-oriented. Typical examples of systems of this type are the RESEARCHER and TAILOR systems, which form a single software package allowing the user to obtain information from patent abstracts describing complex physical objects.

    The most important area of computational linguistics is the development of information retrieval systems (IRS). These arose in the late 1950s and early 1960s as a response to the sharp increase in the volume of scientific and technical information. By the type of information stored and processed, as well as by their search features, information retrieval systems are divided into two large groups: documentary and factual. Documentary IRS store the texts of documents or their descriptions (abstracts, bibliographic cards, etc.). Factual IRS deal with descriptions of specific facts, not necessarily in text form: these can be tables, formulas, and other kinds of data. There are also mixed IRS, including both documents and factual information. At present, factual IRS are built on database (DB) technologies. To support retrieval, special information retrieval languages are created, based on information retrieval thesauri. An information retrieval language is a formal language intended to describe certain aspects of the content of the documents stored in the IRS and of the query. The procedure of describing a document in an information retrieval language is called indexing. As a result of indexing, each document is assigned a formal description in the information retrieval language: the search image of the document. The query is indexed in a similar way, being assigned a search image of the query, or search prescription. Information retrieval algorithms are based on comparing the search prescription with the search image of the document. The criterion for returning a document in response to a query may be a full or partial match between the search image of the document and the search prescription. In some cases the user can formulate the retrieval criterion himself, as determined by his information need.
Automated information retrieval systems often use descriptor information retrieval languages. The subject matter of a document is described by a set of descriptors: words and terms denoting simple, fairly elementary categories and concepts of the problem area. As many descriptors are entered into the search image of the document as there are distinct topics covered in it. The number of descriptors is not limited, which makes it possible to describe the document in a multidimensional matrix of features. Often a descriptor information retrieval language imposes restrictions on the compatibility of descriptors; in that case one can say that the information retrieval language has a syntax.
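
    Descriptor-based retrieval as described above reduces to comparing sets: a document's search image is its set of descriptors, a query yields a search prescription, and the retrieval criterion is full or partial overlap. The documents and descriptors below are invented for illustration.

```python
# Sketch of descriptor-based retrieval: each document's "search image"
# is a set of descriptors; the query's search image is the "search
# prescription"; retrieval compares the two sets.

documents = {
    "doc1": {"hypertext", "navigation", "nonlinear-text"},
    "doc2": {"machine-translation", "dictionary"},
    "doc3": {"hypertext", "multimedia"},
}

def search(prescription, documents, partial=True):
    # Full match: the document's search image must contain every
    # descriptor of the prescription. Partial match: any shared
    # descriptor is enough.
    results = []
    for doc_id, image in documents.items():
        if prescription <= image or (partial and prescription & image):
            results.append(doc_id)
    return results

print(search({"hypertext", "multimedia"}, documents, partial=False))  # ['doc3']
print(search({"hypertext", "multimedia"}, documents, partial=True))
```

    Letting the caller choose between full and partial match mirrors the point in the text that the user may formulate the retrieval criterion himself, according to his information need.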

    One of the first systems to work with a descriptor language was the American UNITERM system, created by M. Taube. In this system, document keywords, the uniterms, functioned as descriptors. The peculiarity of this IRS is that the dictionary of the information language was not specified in advance but arose in the process of indexing documents and queries. The development of modern information retrieval systems is associated with non-thesaurus IRS. Such IRS interact with the user in a restricted natural language, and the search is carried out through the texts of document abstracts, through their bibliographic descriptions, and often through the documents themselves. For indexing in non-thesaurus IRS, words and phrases of natural language are used.

    To a certain extent, the field of computational linguistics also includes work on creating hypertext systems, considered as a special way of organizing text and even as a fundamentally new kind of text, contrasted in many of its properties with ordinary text formed in the Gutenberg tradition of printing. The idea of hypertext is associated with the name of Vannevar Bush, science advisor to President F. Roosevelt. Bush theoretically substantiated the project of the Memex technical system, which would allow the user to connect texts and their fragments by various kinds of links, chiefly by associative relations. The lack of computer technology made the project difficult to realize, since the mechanical system proved too complex for practical implementation.

    Bush's idea was reborn in the 1960s in T. Nelson's Xanadu system, which already presupposed the use of computer technology. Xanadu allowed the user to read the set of texts entered into the system in different ways and in different sequences; the software made it possible both to remember the sequence of texts viewed and to select almost any of them at an arbitrary moment. A set of texts together with the relations connecting them (a system of transitions) was called hypertext by Nelson. Many researchers view the creation of hypertext as the beginning of a new information era, opposed to the era of printing. The linearity of writing, which outwardly reflects the linearity of speech, turns out to be a fundamental category limiting human thinking and the understanding of text. The world of meaning is nonlinear, so compressing semantic information into a linear speech segment requires special "communicative packaging": division into theme and rheme, division of the content plane of an utterance into explicit (assertion, proposition, focus) and implicit (presupposition, consequence, discourse implicature) layers. Abandoning the linearity of text both in its presentation to the reader (i.e., in reading and understanding) and in its synthesis would, according to theorists, promote the "liberation" of thinking and even the emergence of new forms of thinking.

    In a computer system, hypertext is represented as a graph whose nodes contain traditional texts or their fragments, images, tables, video clips, etc. The nodes are connected by a variety of relations whose types are specified either by the hypertext software developers or by the reader. The relations define the potential possibilities of movement, or navigation, through the hypertext. Relations can be unidirectional or bidirectional: bidirectional arrows allow the user to move in both directions, unidirectional arrows in one direction only. The chain of nodes through which the reader passes while viewing the components of the text forms a path, or route.
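
    The graph representation just described can be sketched directly: nodes holding fragments, typed directed edges, and a route as a chain of visited nodes. All node names and contents below are invented for the example; a bidirectional link is simply modeled as a pair of opposite edges.

```python
# A hypertext sketched as a directed graph: nodes hold text fragments,
# edges are typed relations, a route is the chain of visited nodes.

nodes = {
    "intro": "What hypertext is...",
    "history": "Bush's Memex project...",
    "xanadu": "Nelson's Xanadu system...",
}

# (source, target, relation type); a bidirectional relation is
# represented by two opposite unidirectional edges.
edges = [
    ("intro", "history", "elaboration"),
    ("history", "xanadu", "chronology"),
    ("xanadu", "history", "chronology"),   # back link
]

def successors(node):
    # Nodes reachable from this node in one navigation step.
    return [tgt for src, tgt, _ in edges if src == node]

# One possible route (navigation path) through the hypertext:
route = ["intro", "history", "xanadu", "history"]
assert all(b in successors(a) for a, b in zip(route, route[1:]))
print(route)
```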

    Computer implementations of hypertext can be hierarchical or network-based. The hierarchical, tree-like structure of a hypertext significantly limits the possibilities of transition between its components; in such a hypertext, the relations between components resemble the structure of a thesaurus based on genus-species relations. A network hypertext allows various types of relations between components, not limited to genus-species relations. By mode of existence, static and dynamic hypertexts are distinguished. A static hypertext does not change during operation; the user can record comments in it, but these do not change its substance. For a dynamic hypertext, change is the normal form of existence. Dynamic hypertexts typically function where it is necessary to analyze a constant flow of information, i.e., in information services of various kinds. An example of such a hypertext is the Arizona Information System (AAIS), which is updated monthly with 300–500 abstracts.

    The relations between hypertext elements can either be fixed in advance by the creators or generated whenever a user accesses the hypertext. In the first case one speaks of hypertexts with a rigid structure, in the second of hypertexts with a soft structure. The rigid structure is technologically straightforward. The technology for organizing a soft structure must be based on a semantic analysis of the proximity of documents (or other information sources) to one another; this is a non-trivial task of computational linguistics. Nowadays, keyword-based soft-structure technologies are widespread: the transition from one node to another in the hypertext network is carried out by searching for keywords. Since the set of keywords may differ each time, the structure of the hypertext changes each time as well.
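
    The keyword-based soft structure can be sketched as links computed at access time rather than stored: which nodes are reachable depends on which keywords the user selects, so the hypertext's structure differs from session to session. The documents and keyword sets below are invented for illustration.

```python
# Sketch of a "soft" hypertext structure: links are not stored but
# generated on each access from shared keywords.

keywords = {
    "doc1": {"frame", "script", "knowledge"},
    "doc2": {"script", "dialogue"},
    "doc3": {"corpus", "statistics"},
}

def dynamic_links(node, selected_keywords):
    # The link set depends on the keywords selected for this access,
    # so the resulting structure can change every time.
    return [other for other, kws in keywords.items()
            if other != node and selected_keywords & kws]

print(dynamic_links("doc1", {"script"}))      # ['doc2']
print(dynamic_links("doc1", {"knowledge"}))   # []
```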

    The technology for building hypertext systems does not distinguish between textual and non-textual information. However, the inclusion of visual and audio information (video clips, paintings, photographs, sound recordings, etc.) requires significant changes to the user interface and more powerful software and hardware support. Such systems are called hypermedia, or multimedia. The vividness of multimedia systems predetermined their wide use in teaching and in creating computer versions of encyclopedias. There are, for example, beautifully produced CD-ROMs with multimedia systems based on the children's encyclopedias published by Dorling Kindersley.

    Within computer lexicography, computer technologies for compiling and operating dictionaries are being developed. Special programs - databases, computer card files, word processing tools - make it possible to generate dictionary entries automatically, to store dictionary information, and to process it. The many different computer lexicographic programs fall into two large groups: programs supporting lexicographic work, and automatic dictionaries of various types, including lexicographic databases. An automatic dictionary is a dictionary in a special machine format intended for use on a computer by a human user or by a word processing program. In other words, automatic dictionaries for the human end user and automatic dictionaries for word processing programs are distinguished. Automatic dictionaries intended for the end user differ considerably in interface and entry structure from the automatic dictionaries built into machine translation systems, automatic abstracting systems, information retrieval systems, etc. Most often they are computer versions of well-known conventional dictionaries. The software market offers computer analogues of explanatory dictionaries of English (an automatic Webster, an automatic version of the Collins English Dictionary, an automatic version of the New Great English-Russian Dictionary edited by Yu.D. Apresyan and E.M. Mednikova); there is also a computer version of Ozhegov's dictionary. Automatic dictionaries for word processing programs may be called automatic dictionaries in the strict sense. They are generally not intended for the ordinary user. The features of their structure and the scope of their vocabulary are determined by the programs that interact with them.

    Computer modeling of plot structure is another promising area of computational linguistics. The study of plot structure belongs to the problems of structural literary studies (in the broad sense), semiotics, and cultural studies. The available computer programs for plot modeling are based on three basic formalisms of plot representation: the morphological and the syntactic approaches to plot representation, and the cognitive approach. Ideas about the morphological organization of plot structure go back to the famous works of V.Ya. Propp on the Russian fairy tale. Propp noticed that, for all the abundance of characters and events in fairy tales, the number of character functions is limited, and he proposed an apparatus for describing these functions. Propp's ideas formed the basis of the TALE computer program, which simulates the generation of fairy-tale plots. The algorithm of the TALE program is based on the sequence of character functions in the fairy tale. In effect, Propp's functions defined a set of typified situations, ordered on the basis of an analysis of empirical material. The possibilities for linking various situations in the generation rules were determined by the typical sequence of functions, in the form in which it can be established from the texts of fairy tales. In the program, typical function sequences were described as typical scenarios of character encounters.
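
    The generation principle attributed to TALE above, a fixed typical order of character functions with interchangeable concrete realizations, can be sketched as follows. This is not the actual TALE program: the function inventory is a tiny invented subset of Propp's, and the realizations are made up for the example.

```python
# Sketch of Propp-style plot generation: a plot is a sequence of
# character functions in a fixed typical order; generation picks a
# concrete realization for each function. Inventory is illustrative.

import random

FUNCTION_ORDER = ["absence", "interdiction", "violation", "villainy",
                  "struggle", "victory", "return"]

REALIZATIONS = {
    "absence": ["The parents leave home."],
    "interdiction": ["The hero is told not to open the door."],
    "violation": ["The hero opens the door."],
    "villainy": ["A dragon abducts the princess.",
                 "Koschei steals the treasure."],
    "struggle": ["The hero fights the villain."],
    "victory": ["The villain is defeated."],
    "return": ["The hero returns home."],
}

def generate_tale(seed=None):
    rng = random.Random(seed)
    return [rng.choice(REALIZATIONS[f]) for f in FUNCTION_ORDER]

for sentence in generate_tale(seed=1):
    print(sentence)
```

    The fixed `FUNCTION_ORDER` plays the role of the empirically established typical sequence of functions; variability enters only through the choice of realization for each function.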

    The theoretical basis of the syntactic approach to the plot of a text was formed by "plot grammars", or "story grammars". They appeared in the mid-1970s as a result of transferring the ideas of N. Chomsky's generative grammar to the description of the macrostructure of text. Whereas the most important components of syntactic structure in a generative grammar were verb and noun phrases, in most plot grammars the basic components were the setting, the event, and the episode. The theory of plot grammars widely discussed conditions of minimality, that is, the restrictions that determine the status of a sequence of plot elements as a normal plot. It turned out, however, that this cannot be done by purely linguistic methods: many of the restrictions are sociocultural in nature. Plot grammars, while differing considerably in their sets of categories in the generation tree, allowed only a very limited set of rules for modifying narrative structure.
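
    The transfer of generative-grammar ideas to narrative can be sketched with rewrite rules over narrative categories instead of noun and verb phrases. The grammar below is an invented toy, not any published story grammar; a depth limit stands in for the minimality restrictions discussed above.

```python
# Sketch of a "story grammar": rewrite rules in the spirit of a
# generative grammar, but over narrative categories (setting, event,
# episode). Terminals are quoted strings; the rule set is invented.

import random

GRAMMAR = {
    "story":    [["setting", "episode"]],
    "episode":  [["event", "reaction"],
                 ["event", "reaction", "episode"]],   # recursive episode
    "setting":  [["'Once upon a time in a small village.'"]],
    "event":    [["'Something unexpected happened.'"]],
    "reaction": [["'The hero responded.'"]],
}

def expand(symbol, rng, depth=0):
    if symbol not in GRAMMAR:            # terminal: strip the quotes
        return [symbol.strip("'")]
    productions = GRAMMAR[symbol]
    if depth > 3:                        # cap recursion: shortest rule
        productions = [productions[0]]
    production = rng.choice(productions)
    return [w for sym in production for w in expand(sym, rng, depth + 1)]

rng = random.Random(0)
print(" ".join(expand("story", rng)))
```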

    In the early 1980s, one of R. Schank's students, W. Lehnert, as part of work on creating a computer plot generator, proposed the original formalism of affective plot units, which proved to be a powerful means of representing plot structure. Although it was originally developed for an artificial intelligence system, this formalism has also been used in purely theoretical studies. The essence of Lehnert's approach is that the plot is described as a successive change in the cognitive-emotional states of the characters. The focus of Lehnert's formalism is thus not on the external components of the plot - setting, event, episode, moral - but on its content characteristics. In this respect, Lehnert's formalism is in part a return to Propp's ideas.

    The competence of computational linguistics also includes machine translation, which is currently experiencing a revival.

    Literature:

    Popov E.V. Communication with a computer in natural language. M., 1982
    Sadur V.G. Speech communication with electronic computers and problems of their development. – In the book: Speech communication: problems and prospects. M., 1983
    Baranov A.N. Categories of artificial intelligence in linguistic semantics. Frames and scripts. M., 1987
    Kobozeva I.M., Laufer N.I., Saburova I.G. Modeling communication in human-machine systems. – In: Linguistic Support of Information Systems. M., 1987
    Olker H.R. Fairy tales, tragedies and ways of presenting world history. – In the book: Language and modeling of social interaction. M., 1987
    Gorodetsky B.Yu. Computational linguistics: modeling language communication
    McQueen K. Discourse strategies for natural language text synthesis. – New in foreign linguistics. Vol. XXIV, Computational Linguistics. M., 1989
    Popov E.V., Preobrazhensky A.B. Features of the implementation of NL systems
    Preobrazhensky A.B. State of development of modern NL systems. - Artificial intelligence. Book 1, Communication systems and expert systems. M., 1990
    Subbotin M.M. Hypertext. A new form of written communication. – VINITI, Ser. Computer Science, 1994, vol. 18
    Baranov A.N. Introduction to Applied Linguistics. M., 2000

    

    COMPUTER LINGUISTICS (a calque of the English computational linguistics), one of the areas of applied linguistics in which computer programs and computer technologies for organizing and processing data are developed and used to study language and to model the functioning of language in particular conditions, situations, and problem areas. On the other hand, it is the area in which computer models of language are applied in linguistics and related disciplines. As a special scientific direction, computational linguistics took shape in European research in the 1960s. Since the English adjective computational can also be rendered in Russian as "вычислительный", the term "вычислительная лингвистика" is also found in the literature, but in Russian scholarship it has taken on a narrower meaning, close to "quantitative linguistics".

    Quantitative linguistics is often also subsumed under computational linguistics. The term designates an interdisciplinary direction in applied research in which quantitative or statistical methods of analysis serve as the main tool for studying language and speech. Quantitative linguistics is sometimes contrasted with combinatorial linguistics, in which the dominant role is played by a "non-quantitative" mathematical apparatus: set theory, mathematical logic, the theory of algorithms, etc. From a theoretical point of view, the use of statistical methods in linguistics makes it possible to supplement the structural model of language with a probabilistic component, i.e., to create a theoretical structural-probabilistic model with considerable explanatory potential. In the applied sphere, quantitative linguistics is represented above all by the use of fragments of this model for linguistic monitoring of the functioning of language, for the decipherment of coded text, for authorship attribution of texts, etc.

    The term "computational linguistics" and the problems of this area are often associated with the modeling of communication, above all with supporting human interaction with a computer in natural or restricted natural language (special natural language processing systems are created for this purpose), as well as with the theory and practice of information retrieval systems (IRS). Supporting human communication with a computer in natural language is sometimes designated "natural language processing" (a translation of the English term Natural Language Processing). This area of computational linguistics arose in the late 1960s abroad and developed within the scientific and technological discipline called artificial intelligence (the works of R. Schank, M. Lebowitz, T. Winograd, and others). By its internal form, the phrase "natural language processing" should cover all areas in which computers are used to process language data. In practice, however, a narrower understanding of the term has taken hold: the development of methods, technologies, and specific systems that support human communication with a computer in natural or restricted natural language.

    To a certain extent, computer linguistics can include work in the field of creating hypertext systems, considered as a special way of organizing text and even as a fundamentally new type of text, contrasted in many of its properties with ordinary text formed in the Gutenberg tradition of printing (see Gutenberg).

    The competence of computational linguistics also includes automatic translation.

    Within computational linguistics, a relatively new direction has emerged, developing actively since the 1980s-1990s: corpus linguistics, which works out general principles for constructing corpora of linguistic data (in particular, text corpora) using modern computer technologies. Text corpora are collections of specially selected texts from books, magazines, newspapers, etc., transferred to machine-readable media and intended for automatic processing. One of the first text corpora, for American English, was created at Brown University (the so-called Brown Corpus) in 1962-63 under the leadership of W. Francis. In Russia, since the early 2000s, the Vinogradov Russian Language Institute of the Russian Academy of Sciences has been developing the Russian National Corpus, a representative sample of Russian-language texts of about 100 million running words. Besides the actual construction of data corpora, corpus linguistics deals with the creation of computer tools (computer programs) for extracting various kinds of information from text corpora. From the user's point of view, text corpora are subject to requirements of representativeness, completeness, and economy.
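
    The kind of corpus tool mentioned above can be sketched in miniature: a word frequency list and a simple concordance (sentences containing a given word) over a toy "corpus". The three sentences are invented for the example; real corpora, of course, run to millions of word usages.

```python
# Sketch of two basic corpus-linguistics tools over a toy corpus:
# a word frequency list and a simple concordance.

from collections import Counter

corpus = [
    "the hero returns home",
    "the dragon guards the treasure",
    "the hero fights the dragon",
]

# Frequency list: count every running word (word usage) in the corpus.
tokens = [w for sentence in corpus for w in sentence.split()]
frequencies = Counter(tokens)
print(frequencies.most_common(3))

def concordance(word, corpus):
    # Return every sentence in which the word occurs.
    return [s for s in corpus if word in s.split()]

print(concordance("dragon", corpus))
```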

    Computational linguistics is actively developing both in Russia and abroad. The flow of publications in this area is very large. In addition to thematic collections, the journal Computational Linguistics has been published quarterly in the United States since 1984. Much organizational and scientific work is carried out by the Association for Computational Linguistics, which has regional structures around the world (in particular, a European chapter). Every two years, international COLING conferences are held (in 2008 the conference took place in Manchester). The main directions of computational linguistics are also discussed at the annual international conference "Dialogue", organized by the Russian Research Institute of Artificial Intelligence, the Faculty of Philology of Moscow State University, Yandex, and a number of other organizations. Relevant issues are also widely represented at international conferences on artificial intelligence at various levels.

    Lit.: Zvegintsev V.A. Theoretical and Applied Linguistics. M., 1968; Piotrovsky R.G., Bektaev K.B., Piotrovskaya A.A. Mathematical Linguistics. M., 1977; Gorodetsky B.Yu. Current problems of applied linguistics // New in Foreign Linguistics. M., 1983. Issue 12; Kibrik A.E. Applied linguistics // Kibrik A.E. Essays on General and Applied Issues of Linguistics. M., 1992; Kennedy G. An Introduction to Corpus Linguistics. L., 1998; Bolshakov I.A., Gelbukh A. Computational Linguistics: Models, Resources, Applications. Mexico City, 2004; National Corpus of the Russian Language: 2003-2005. M., 2005; Baranov A.N. Introduction to Applied Linguistics. 3rd ed. M., 2007; Computational Linguistics and Intellectual Technologies. M., 2008. Issue 7.