Mathematical apparatus developed by Marusenko
Methods of pattern recognition were first used to attribute anonymous and pseudonymous works on the basis of an author's individual style characteristics by M.A. Marusenko (Marusenko, 1990). Since then the method has been successfully applied to a number of literary works of doubtful authorship. Among them are “Quiet Flows the Don”, attributed to M. Sholokhov, the works attributed to Romain Gary, and others (Marusenko, 2001; Chepiga, 2007).
In the present work a text is viewed as a complex linguistic object with a wide inventory of elements that can be analyzed on many levels. Given the requirements of an adequate description of a text, a form of multidimensional statistical analysis, pattern recognition theory, was taken as the basis of the new method of attributing anonymous and pseudonymous literary works. Within pattern recognition, style is considered to be “a set of properties, characterizing the content, the ways of connecting, and the statistical-probability regularity of the use of language means which form the given individual writer’s style” (Marusenko, 1990, p. 24).
The set of properties that characterize the structure of a text in its syntactic aspect then becomes the full set of informative parameters. The informative parameters are meant to distinguish works by different authors, and the composition of the set is determined for each concrete case by a special procedure for selecting informative parameters.
A key theoretical premise of this work is that the attribution procedure is divided into three relatively independent stages:
- Formulating a literary-critical attribution hypothesis by means of traditional philological analysis, drawing on all available attribution methods.
- Rejecting or not rejecting the literary-critical hypothesis. The hypothesis is tested by means of pattern recognition theory.
- Interpreting the results of testing the attribution hypothesis.
The hypothesis is considered statistically corroborated if the recognition results coincide with the original literary-critical attribution hypothesis (at the established significance level). Otherwise the hypothesis is considered disproved, and either an alternative hypothesis is put forward or the original hypothesis is modified. Under this scheme, statistical-probabilistic methods of analyzing language and style serve only as a supplementary means of testing an attribution hypothesis that was originally formed by philological methods.
Testing the literary-critical hypothesis takes place in several stages using a certain set of procedures:
- Determining the a priori set of individual stylistic parameters. Since the parameters in the a priori dictionary of parameters should reflect style in its structural-syntactic aspect (Martynenko, 1988), they are taken from the works of researchers who studied sentence structure and composition using mathematical methods (among them Sevbo, 1981; Fucks, 1975; Kjetsaa, 1989; Vasak, 1980).
- Determining the a priori set of classes. The composition of the a priori classes is determined by the requirements of uniformity of time and genre, while their volume is measured in the main syntactic unit, the sentence.
- Describing the classes from the a priori alphabet of classes in the language of parameters from the a priori dictionary of parameters. Each linguistic object subject to stylistic diagnostics is represented by a mathematical object p, characterized by an n-dimensional vector, where n is the number of parameters.
When describing the attributed objects in the language of parameters from the a priori dictionary of parameters, the researcher must process the data by hand. This makes it possible to describe the text adequately in its syntactic aspect. Of course, it is very important to exclude accidental mistakes, which is why manual processing entails forming general rules for analyzing texts, introducing rules for parameterizing the text for each parameter and, finally, counting linguistic phenomena according to those rules.
- Determining the informative set of parameters. At this stage a necessary and sufficient number of parameters for assigning an object to a class is selected from the parameter space, and redundant parameters are eliminated. The set of informative parameters is formed using M.M. Bongard's scheme (Bongard, 1967). Bongard's scheme is a sequence of successive stages, so the research can be replicated by other researchers. A detailed description of this stage is given using the “Corneille-Molière” problem as an example.
- Choosing a decision rule. In this work, determining the author of an anonymous or pseudonymous text is treated as the task of finding the distance between the multidimensional vector corresponding to the a priori class M1 and the multidimensional vector corresponding to the a priori class M2 of the unknown author. The decision rule is the function chosen to measure this distance and to decide whether the objects are the same or different. The recognition algorithm should partition the feature space into regions corresponding to the classes with a minimum of recognition errors. The algorithm used in this work is a two-step recognition procedure with a deterministic step and a probabilistic step.
- Appraising the quality of classification. Since the classes obtained from the mathematical classification procedure can be artifacts, it is necessary to appraise the quality of classification. This appraisal may lead to a correction of the structure of the classes obtained.
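The description and decision-rule steps listed above can be illustrated with a short sketch. It is not Marusenko's actual procedure: the parameter names, the sample values, the standardization step, and the choice of Euclidean distance to class centroids are all assumptions made for the sake of the example.

```python
import math
from statistics import mean, pstdev

# Hypothetical syntactic parameters (assumed names, not taken from the source).
PARAMS = ("mean_sentence_length", "clauses_per_sentence", "share_of_simple_sentences")


def make_scaler(vectors):
    """Build a function that standardizes a vector using the given sample."""
    columns = list(zip(*vectors))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns

    def scale(vector):
        return tuple((x - m) / s for x, m, s in zip(vector, mus, sigmas))

    return scale


def centroid(vectors):
    """Component-wise mean of a list of n-dimensional vectors."""
    return tuple(mean(c) for c in zip(*vectors))


def euclidean(a, b):
    """Euclidean distance, used here as an assumed decision function."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def attribute(unknown, classes):
    """Assign `unknown` to the a priori class with the nearest centroid."""
    scale = make_scaler([v for members in classes.values() for v in members])
    distances = {
        name: euclidean(scale(unknown), centroid([scale(v) for v in members]))
        for name, members in classes.items()
    }
    return min(distances, key=distances.get), distances


# Invented illustrative values: each tuple describes one text in PARAMS order.
classes = {
    "M1": [(18.2, 2.1, 0.30), (17.5, 2.3, 0.28), (19.0, 2.0, 0.33)],
    "M2": [(25.1, 3.4, 0.15), (24.3, 3.1, 0.18), (26.0, 3.6, 0.12)],
}
unknown_text = (18.8, 2.2, 0.31)
label, distances = attribute(unknown_text, classes)
```

Here each text is an n-dimensional vector with n = 3, and the anonymous text is assigned to whichever class lies closer in the standardized parameter space. In practice the parameters would come from the manual parameterization and selection stages described above.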
Applying the mathematical apparatus developed by M.A. Marusenko to real historical and literary material has shown it to be highly effective (Marusenko, 2001).
The results of testing real attribution hypotheses, described in several research works, indicate that the recognition system is stable with respect to fluctuations in text volume and to the temporal evolution of the parameters of an author's style. In the majority of cases the recognition system separates the objects completely into their corresponding classes. Otherwise, after the quality of classification has been appraised, one can hypothesize that one or more further classes of authors exist which were not accounted for in the original attribution hypothesis. Moreover, the sequential use of deterministic and probabilistic recognition algorithms rules out the case in which no recognition decision can be reached.
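The source does not spell out the deterministic and probabilistic algorithms, so the following is only one possible reading of how a two-step procedure always yields a decision. It assumes that the deterministic step accepts an object only if it falls inside the region of exactly one class (a sphere around the class centroid, an assumption), and that the probabilistic step otherwise ranks the classes by weights derived from centroid distances (also an assumption).

```python
import math
from statistics import mean


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return tuple(mean(c) for c in zip(*vectors))


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def deterministic_step(x, classes):
    """Return the single class whose assumed region contains x, else None."""
    hits = []
    for name, members in classes.items():
        c = centroid(members)
        radius = max(distance(m, c) for m in members)  # assumed class boundary
        if distance(x, c) <= radius:
            hits.append(name)
    return hits[0] if len(hits) == 1 else None


def probabilistic_step(x, classes):
    """Normalized weights from exp(-distance); a stand-in for real probabilities."""
    weights = {
        name: math.exp(-distance(x, centroid(members)))
        for name, members in classes.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def recognize(x, classes):
    """Try the deterministic step first and fall back to the probabilistic one."""
    label = deterministic_step(x, classes)
    if label is not None:
        return label, "deterministic"
    probabilities = probabilistic_step(x, classes)
    return max(probabilities, key=probabilities.get), "probabilistic"


# Invented two-dimensional data for illustration only.
toy_classes = {
    "M1": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "M2": [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)],
}
inside = recognize((0.5, 0.5), toy_classes)
outside = recognize((2.0, 2.0), toy_classes)
```

The first object lies inside the assumed region of M1 and is settled by the deterministic step. The second lies outside both regions, so the probabilistic step still returns the more likely class instead of leaving the object unrecognized.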