Authorship and Linguistic Expertise of Text

This work examines modern methods of authorship and linguistic expertise for determining text authorship and assesses their effectiveness in judicial practice.

Contents

Philosophical Framework
Introduction
Literature Review
Delimitation of Competencies of Linguistic and Authorship Expertise
Effectiveness of Computer Methods in Authorship Expertise
Problems of Author Identification with Limited Data
Applicability of Authorship Expertise to Short Texts
The Problem of Plagiarism and Borrowing in Texts
Criticism and Limitations
Limitations Related to Volume and Quality of Data
The Problem of Imitation and Masking of Idiolect
Difficulties of Interpretation and Lack of Standardization
Detailed Exposition
Delimitation of Authorship and Linguistic Expertise
Methods of Authorship Expertise

Philosophical Framework

The question of text authorship, its uniqueness, and the possibility of identifying the author by linguistic features is deeply rooted in philosophical discussions about the nature of language and individuality. If language is considered as a system, which, according to Bloomfield, is a “set of habits” [Bloomfield, 1933], then the author’s idiolect becomes a unique manifestation of these habits, a kind of linguistic fingerprint. However, as Chomsky noted, language is not only a set of habits but also a generative system that allows the creation of an infinite number of new expressions [Chomsky, 1957]. This raises the question of how stable and identifiable individual patterns are in such a dynamic system.

In the context of judicial practice, where the text serves as evidence, we face the problem of the performativity of language described by Austin [Austin, 1962]. Words do not merely describe reality but also perform actions, making the analysis of their semantic and pragmatic content critically important. Text expertise goes beyond simple form comparison, requiring a deep understanding of how linguistic means are used to achieve specific communicative goals and how these goals reflect the individuality of the author.

Introduction

In modern linguistics, authorship and linguistic expertise of texts play a key role in solving tasks related to establishing authorship, detecting plagiarism, and analyzing semantic content in legal practice. Both types of expertise are in demand amid the rise of crimes committed using internet communications, where anonymity and the remote nature of communication create new challenges for law enforcement agencies [Litvinova et al., 2020]. The main hypothesis is that each author has a unique idiolect that can be identified using specialized linguistic and statistical methods.

However, despite the availability of tested methodologies, unresolved problems remain, especially related to new objects of study such as short texts from social networks or texts created using computer programs [Litvinova et al., 2020]. The delimitation of competencies between linguistic experts and authorship experts also remains a controversial issue, especially in cases concerning violations of copyright and related rights [Kuznetsov et al., 2019]. The effectiveness of these methods in judicial practice depends on many factors, including the volume and quality of the material studied and the expert’s qualifications.

Literature Review

Delimitation of Competencies of Linguistic and Authorship Expertise

The question of delimiting the competencies of linguistic and authorship expertise in judicial practice remains a subject of active discussion despite the apparent obviousness of their subject areas. Indeed, at first glance, linguistics deals with language as a system and its functioning, while authorship studies focus on the personality of the author manifested through the text. However, in practice, these fields often overlap, creating methodological and procedural difficulties.

One of the key arguments for a clear division of competencies is presented by V.O. Kuznetsov and E.K. Kryuk, who state that “the tasks within the competence of the authorship expert are related to the study of written-speech skills and aimed at establishing the authorship of the text and the conditions of text composition” [Kuznetsov et al., 2019]. This means that authorship expertise is intended to answer questions like “who wrote this text?” or “was the text written under dictation?”, investigating individual features of speech behavior that form a unique “author’s handwriting” or idiolect. The goal of authorship expertise, in their opinion, is “to establish the author of the document based on the study of general and specific features of written speech reflecting the level of development of his linguistic skills” [Kuznetsov et al., 2019].

At the same time, linguistic expertise, according to Kuznetsov and Kryuk, has a different goal: “to establish the features of a particular linguistic meaning expressed in the text or its components (word, phrase, utterance, etc.), regardless of the degree of development of the linguistic skills of the text’s author” [Kuznetsov et al., 2019]. That is, the linguist analyzes the content of the text, its semantics, pragmatics, and stylistics to determine, for example, the presence of insults, threats, calls for unlawful actions, or the fact of plagiarism. The task of establishing plagiarism, as the authors emphasize, “does not involve studying the author’s written speech and identifying features of written speech reflecting the degree of development of the author’s linguistic skills, but rather the study of at least two texts as products of speech activity” [Kuznetsov et al., 2019]. This is a semantic comparison of speech works aimed at identifying the degree of adequacy of meaning transfer from one text to another, which is the prerogative of linguistic expertise [Kuznetsov et al., 2019].

However, despite these seemingly clear distinctions, many researchers and practitioners do not make such a strict separation. V.O. Kuznetsov and E.K. Kryuk note that “a number of domestic scholars do not distinguish between authorship and linguistic expertise in cases related to copyright and related rights violations” [Kuznetsov et al., 2019]. Such studies may be called “linguistic authorship expertise,” “authorship expertise,” or simply “expertise,” with keywords indicating “authorship expertise” [Kuznetsov et al., 2019]. This indicates that in practice, the boundaries between these types of expertise are often blurred, and their tasks may be considered complementary or even integrated.

For example, E.I. Galyashina, whose works are often cited in the context of forensic linguistics, reserves for linguistic expertise the questions of establishing or interpreting the semantic content of the text, its originality, individuality, novelty, and its degree of confusing similarity with opposing designations [Kuznetsov et al., 2019]. She allows that, depending on the tasks set, the expertise may be complex, combining authorship and linguistic research. This points to the recognition of the interdisciplinary nature of many forensic tasks, where a complete answer requires analysis of both authorship features and semantic content.

A.Yu. Khomenko in his 2014 work also speaks about linguistic and authorship expertise but combines their tasks within a single study he calls linguistic analysis [Kuznetsov et al., 2019]. Within such analysis, authorship methods are used to establish the originality of the text, its novelty (based on individual authorship features, idiolect), communicative orientation, and author’s intention. This approach emphasizes that idiolect, being the object of authorship analysis, is simultaneously a key element for assessing originality, which in turn is important for linguistic expertise in plagiarism cases.

Tatiana Litvinova and Anastasiya Gromova, studying problems of forensic authorship expertise, note that its object is “the text embodying the author’s idiolect” [Litvinova et al., 2020]. By idiolect, they mean a unique realization of the language system consisting of stable and variable choices made by the author. This definition emphasizes that even when solving tasks related to author identification, it is impossible to completely abstract from the linguistic features of the text. Moreover, they point out that the analysis of linguistic features acquires primary importance for solving the task of identifying and diagnosing the personality of the text’s author [Litvinova et al., 2020], especially in the absence of handwritten elements.

Although theoretically a boundary can be drawn between expertises, in practice they often turn out to be closely intertwined. Authorship expertise, focusing on author identification, inevitably relies on linguistic features forming the idiolect. Linguistic expertise, especially in cases of plagiarism or originality assessment, also cannot ignore individual speech features that distinguish the author’s text from borrowed ones. This creates a need for a comprehensive approach where the expert must have competencies in both areas or work as part of an interdisciplinary team.

The problem of author identification, as Litvinova and Gromova note, can have different aspects, including the “closed-set problem” (who among a limited circle of persons is the author) and the “verification problem” (whether a given person is the author of a forensically significant text) [Litvinova et al., 2020]. Solving these tasks essentially requires deep analysis of linguistic characteristics of the text, which can be both conscious and unconscious manifestations of the idiolect. Even seemingly insignificant parameters such as “symbol sequences, punctuation habits” can be effective for author identification, especially in complex cross-genre scenarios [Litvinova et al., 2020].

Ultimately, delimiting the competencies of linguistic and authorship expertise, while important for methodological clarity, should not hinder their synergistic interaction. Modern challenges, such as analyzing texts from social networks or anonymous messages, require increasingly sophisticated methods often at the intersection of these two fields. The lack of clear division, as Kuznetsov and Kryuk mention, may be less a problem and more a reflection of the natural interrelation between language and its bearer. After all, how can one study text features without considering who created it, and how can one identify the author without analyzing their speech activity? This question becomes especially relevant in the context of developing computer methods that allow processing huge volumes of textual data, revealing patterns invisible to the human eye.

Effectiveness of Computer Methods in Authorship Expertise

After delimiting the competencies of linguistic and authorship expertise, a natural question arises: how capable are modern technologies of assisting experts in solving these tasks? Computer methods undoubtedly promise increased objectivity and reproducibility of results, which is especially valuable in judicial practice where expert intuition, however deep, will always be questioned. Indeed, many researchers, as noted by T.A. Litvinova and A.V. Gromova, consider statistical methods based on computer technologies more objective because they “are not based on expert intuition, and the results obtained on their basis are more reproducible” [Litvinova et al., 2020].

However, despite apparent universality and accuracy, computer methods in authorship expertise face significant limitations that prevent considering them a panacea. V.O. Kuznetsov and E.K. Kryuk emphasize that in judicial practice on copyright and related rights violations, there is often no clear understanding of whether the task falls within the competence of the authorship expert or the linguistic expert [Kuznetsov et al., 2019]. This blurring of boundaries is exacerbated when it comes to applying complex computer algorithms requiring deep knowledge in both linguistics and information technology.

The main problem is that most studies demonstrating high effectiveness of computer methods are conducted under conditions far from real forensic authorship expertise tasks. Litvinova and Gromova point out several such circumstances: “such works study problems far from authorship tasks, for example, the problem of identifying an author from a large group (several hundreds or even thousands)” [Litvinova et al., 2020]. This means models developed to select one author from a thousand may be ineffective when confirming or refuting authorship of a specific person.

Moreover, large volumes of text are often used — “several thousands or even tens of thousands of words,” or a “large number of texts from each author” are analyzed [Litvinova et al., 2020]. In real forensic practice, experts often deal with text fragments, anonymous messages, or short documents, which significantly complicates the application of statistical methods requiring substantial data volume to identify stable stylistic markers. Kim Luyckx and Walter Daelemans, for example, in their 2008 work also noted that most authorship attribution studies focus on a small number of authors and use training data volumes unrealistic for forensic stylometry, leading to overestimation of approach accuracy [Luyckx et al., 2008].

Another critical point highlighted by Litvinova and Gromova is that “little attention is paid to linguistic features themselves and their discriminative ability, since scientists focus mainly on the accuracy of the models they create” [Litvinova et al., 2020]. As a result, computer systems may report a high percentage of matches yet be unable to explain which linguistic features of the text served as the basis for such conclusions. For forensic expertise, where not only the result but also its justification is required, such a “black box” model is unacceptable. As Gerald McMenamin notes, authorship attribution should be based on a set of markers, not a single feature [McMenamin, 2001].

Nevertheless, the potential of computer methods cannot be denied. As Litvinova and Gromova emphasize, they are not a panacea but a powerful auxiliary tool that expands the expert’s capabilities [Litvinova et al., 2020]. The authors suggest not opposing traditional and computer methods but using the advantages of each. For example, linguistic analysis based on corpus data supplemented by statistical methods with visualization of results can significantly increase the accuracy of conclusions.

Modern studies, especially those conducted by linguists, begin to consider the specifics of forensic authorship expertise, focusing on problems such as small text volume. Benedict Boenninghoff and colleagues, for example, investigate authorship verification in social networks where texts are often very short and propose new neural network architectures to improve effectiveness in such challenging conditions [Boenninghoff et al., 2019]. This demonstrates a move towards adapting computer methods to real tasks.

It is important to understand that even with the most advanced computer algorithms, the element of expert subjectivity cannot be completely excluded. The choice of method, linguistic parameters, their number and type — all remain at the specialist’s discretion. No method is free from subjectivity, as Litvinova and Gromova rightly note [Litvinova et al., 2020]. This means computer tools must be in the hands of a qualified expert capable of interpreting results and correlating them with linguistic theory.

Ultimately, the effectiveness of computer methods in authorship expertise depends on how deeply they are integrated with linguistic analysis and how well adapted they are to the specifics of forensic tasks. Simple application of statistical models without considering the linguistic nature of the idiolect and the context of text creation can lead to erroneous conclusions. Arta Misini and colleagues in their review note that stylometric features can be lexical, syntactic, semantic, structural, and content-specific, all requiring careful analysis [Misini et al., 2022].

Despite significant advances in computational linguistics and machine learning, authorship expertise cannot be fully automated. Computer methods provide powerful tools for analyzing large data volumes and revealing hidden patterns, but their application requires deep understanding of linguistic principles and limitations. This leads us to the next question: what to do when data for analysis is extremely limited, and even the most advanced computer methods cannot find enough statistically significant markers?

Problems of Author Identification with Limited Data

The effectiveness of computer methods in authorship expertise, discussed earlier, is not universal. It decreases significantly when the expert faces a limited data volume, which is one of the most acute problems in forensic authorship expertise. Most studies in statistical analysis or machine learning for authorship attribution traditionally focus on a small number of authors, leading to an overestimation of the significance of features extracted from training data [Luyckx et al., 2008]. These studies often use training data volumes that are unrealistic for real-world settings such as forensics, which distorts the perception of the accuracy of the proposed approaches.

Indeed, in practical application, especially in forensic expertise, the volume of text available for analysis is often extremely small. This creates serious methodological challenges. As Tatiana Litvinova and Anastasiya Gromova note, there is a significant gap between tasks set by computer science researchers and the real needs of forensic authorship expertise (FAE) [Litvinova et al., 2020]. They emphasize that research tasks should focus not on the efficiency of machine learning methods in processing large databases but on the tasks most frequently encountered in authorship expert practice.

One key problem is that traditional methods based on statistical analysis of many textual features lose reliability with small text volumes. For example, Moshe Koppel, Jonathan Schler, and Shlomo Argamon in their work [Koppel et al., 2009] describe three classes of authorship attribution approaches: unitary invariant approach, multivariate analysis, and machine learning. However, even they acknowledge that each approach faces difficulties when working with limited data. The unitary invariant approach, which seeks a single numerical function of the text to distinguish authors, proved unstable and gave way to multivariate methods [Koppel et al., 2009].

The problem is exacerbated by the fact that computer programs, as David Wullschleger notes, “read” text as a stream of symbols, recognizing words by boundaries, whereas humans can recognize entire passages and use knowledge from their subject area [Kuznetsov et al., 2019]. This means even the most advanced machine learning algorithms may be limited in capturing subtle stylistic nuances, which become especially important when data is scarce. In such conditions, as foreign researchers argue, qualitative analysis methods such as stylistic and semantic analysis aimed at interpreting textual differences at various linguistic levels become more justified [Kuznetsov et al., 2019].

However, the situation is not hopeless. Koppel, Schler, and Argamon [Koppel et al., 2009] point to the possibility of adapting machine learning methods for limited data. They explore scenarios in which no small closed set of candidates is available and propose approaches to the profiling, “needle in a haystack,” and verification problems. This implies not just applying existing algorithms but modifying them to suit the specifics of the task. For example, for authorship verification, when it is necessary to confirm that a given text was written by a specific author rather than to identify that author among many, similarity-based methods can be used [MacLeod et al., 2012].

Nicci MacLeod and Tim Grant [MacLeod et al., 2012] note that similarity-based methods are more suitable for cases with many potential authors. They propose verifying authorship if the similarity between the anonymous document and known texts of the author exceeds a certain threshold. They use 4-grams (sequences of four characters) as the basis for analysis, which they consider effective for authorship attribution and measurable in any language without special background knowledge. However, they acknowledge that for cases with small open candidate sets and limited anonymous text, no satisfactory solution yet exists.
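
The threshold idea described above can be illustrated with a minimal Python sketch. It assumes cosine similarity over character 4-gram counts as the similarity measure and an arbitrary threshold of 0.5. Neither the measure, the threshold value, nor the function names come from the cited work, and in practice a threshold would have to be calibrated on texts of known authorship.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Count overlapping character n-grams (lowercased, whitespace collapsed)."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify_authorship(known_texts, questioned_text, threshold=0.5, n=4):
    """Return (score, decision); the threshold is a hypothetical placeholder."""
    known_profile = Counter()
    for text in known_texts:
        known_profile.update(char_ngrams(text, n))
    score = cosine_similarity(known_profile, char_ngrams(questioned_text, n))
    return score, score >= threshold


# Invented example messages, for illustration only.
known = ["cant make it 2nite, c u tmrw", "c u at the station tmrw"]
score, accepted = verify_authorship(known, "c u tmrw at the usual place")
print(round(score, 3), accepted)
```

On texts this short the score is driven by a handful of 4-grams, which is one way to see why the authors regard small open candidate sets with limited text as unsolved.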

Burrows also notes that existing computational stylistics methods are better suited for “closed” games than “open” ones. He proposes an authorship attribution method suitable for cases with little or no external evidence to identify the most likely candidate. His approach is based on the idea that a distinctive ‘stylistic signature’ usually consists of many tiny strokes. He claims his procedure is successful in distinguishing the most likely author for texts over 1500 words and, more importantly for our topic, is even more valuable for narrowing down the list of likely candidates for texts as short as 100 words.

With limited data, the focus shifts from searching for universal statistical patterns to identifying unique, albeit few, idiolect markers. This requires deeper linguistic analysis, not just quantitative counting. Efstathios Stamatatos [Stamatatos, 2017] proposes a new method that improves authorship attribution efficiency by distorting the text before extracting stylometric measures. The goal of this step is to mask thematic information unrelated to the author’s personal style, which is especially important under cross-topic conditions when training and test corpora differ in topic.
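
The distortion step can be sketched roughly as follows. This is an illustration in the spirit of the approach, not a reproduction of the published procedure: it assumes that every word outside a list of frequent words is replaced by asterisks of the same length and that digits are replaced by `#`. The word list and the function name are hypothetical.

```python
import re


def distort_text(text, frequent_words):
    """Mask words outside `frequent_words` with '*' and digits with '#'.

    Frequent (mostly function) words and punctuation are kept, so the
    remaining signal is closer to style than to topic.
    """
    def mask(match):
        token = match.group(0)
        if token.isdigit():
            return "#" * len(token)
        if token.lower() in frequent_words:
            return token
        return "*" * len(token)

    # [^\W_]+ matches runs of letters and digits, including non-ASCII letters.
    return re.sub(r"[^\W_]+", mask, text)


# Hypothetical frequent-word list; in practice it would be derived from a corpus.
FREQUENT = {"the", "at", "and", "was"}
print(distort_text("The jury met at 10, and the verdict was read.", FREQUENT))
# prints: The **** *** at ##, and the ******* was ****.
```

Stylometric features such as character n-grams would then be extracted from the masked text rather than from the original.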

Ultimately, solving the problem of author identification with limited data lies in an interdisciplinary approach. Litvinova and Gromova [Litvinova et al., 2020] call for integrating efforts of linguists, authorship experts, and computer scientists. They emphasize the need to create a closed database of forensically significant texts whose authorship has been established during judicial examination, allowing more precise tuning of algorithms and testing their effectiveness on realistic data. Without such a database and without considering forensic practice specifics, computer methods, despite their potential, will remain theoretical developments unable to fully meet experts’ needs.

In the context of limited data, the question of the applicability of authorship expertise to short texts such as social media messages or SMS becomes especially relevant. After all, in such cases, the volume of available information is minimal, and author identification can be critically important.

Applicability of Authorship Expertise to Short Texts

If in the previous section we discussed problems of author identification with limited data, the question of applicability of authorship expertise to short texts is a logical continuation of this discussion but with a special focus on the specificity of the material itself. Short texts such as SMS messages, microblog posts (e.g., Twitter), or instant messages represent a unique challenge for traditional authorship analysis methods. Why? Because these methods have traditionally been limited by the message size to which they could be successfully applied, making them unsuitable for analyzing shorter messages [MacLeod et al., 2012].

Indeed, classical approaches based on analyzing extensive idiolect, word frequency, syntactic constructions, and punctuation require a significant text volume to identify stable author markers. When data volume shrinks to a few sentences or even phrases, the statistical significance of many features sharply decreases. For example, text complexity metrics such as average word or sentence length, central to early authorship studies, prove uninformative on short fragments. Moshe Koppel, Jonathan Schler, and Shlomo Argamon note that none of these measures proved particularly useful on their own [Koppel et al., 2009], especially with small samples.
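
The two complexity metrics mentioned above are simple to compute, which makes their weakness on short fragments easy to see. The sketch below is only an illustration: the tokenization rules and the function name are assumptions, and on a message of a few words each metric rests on a handful of observations.

```python
import re


def complexity_metrics(text):
    """Return mean word length and mean sentence length (in words), or None."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    if not words or not sentences:
        return None
    return {
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "mean_sentence_length": len(words) / len(sentences),
    }


print(complexity_metrics("I am here. Where are you now?"))
# prints: {'mean_word_length': 3.0, 'mean_sentence_length': 3.5}
```

Adding or removing a single long word in such a message shifts both values noticeably, so they cannot be treated as stable author markers at this scale.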

However, this does not mean authorship expertise is powerless against short texts. On the contrary, the active development of internet communication and the growth of forensic authorship expertise aimed at identification in digital environments have stimulated the development of new approaches [Litvinova et al., 2020]. Researchers began seeking other, more subtle markers that manifest even under limited volume conditions. For example, Nicci MacLeod and Tim Grant describe a project aimed at developing and automating forensic linguistic methods successfully applied to short message analysis in criminal cases [MacLeod et al., 2012].

One promising direction has been the use of statistical methods capable of working with sparse data. Tim Grant, discussing authorship attribution of SMS messages, emphasizes that linguistic distinctiveness and linguistic consistency are matters of degree and can be investigated using statistical methods [MacLeod et al., 2012]. He proposes using Jaccard’s coefficient to assess similarity between short messages. This coefficient compares the presence or absence of certain stylistic features encoded as binary values (1 or 0). An important advantage of Jaccard’s coefficient is that matching two zero values (absence of a feature) in two texts does not affect the overall similarity score [MacLeod et al., 2012], which is critical for short texts where absence of many features is normal.
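
A minimal sketch of Jaccard’s coefficient on binary feature vectors shows why shared absences do not inflate the score. The feature inventory and the example vectors are invented for illustration and are not taken from the cited study.

```python
def jaccard(features_a, features_b):
    """Jaccard coefficient for two equal-length binary feature vectors.

    Only positions where at least one text has the feature are counted,
    so a feature absent from both texts (0, 0) has no effect.
    """
    if len(features_a) != len(features_b):
        raise ValueError("feature vectors must have the same length")
    both = sum(1 for a, b in zip(features_a, features_b) if a and b)
    either = sum(1 for a, b in zip(features_a, features_b) if a or b)
    return both / either if either else 0.0


# Hypothetical features: ["2" for "to", no apostrophes, lowercase "i",
#                         emoticon present, all-caps word present]
msg_a = [1, 1, 0, 0, 0]
msg_b = [1, 0, 1, 0, 0]

print(round(jaccard(msg_a, msg_b), 3))          # prints: 0.333
print(round(jaccard(msg_a[:3], msg_b[:3]), 3))  # prints: 0.333 (dropping shared zeros changes nothing)
```

A simple matching coefficient, by contrast, would count the two shared absences as agreement and give 3/5 for the same pair, which overstates similarity when most features are absent from any short message.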

Developing this idea, MacLeod and Grant propose using an extension of Jaccard’s coefficient called Delta-S (Δs), which allows weighting variables and their interrelations [MacLeod et al., 2012]. This is especially relevant for short messages where, for example, two different digit substitutions may indicate more similar stylistic preferences than, say, a digit substitution and accent stylization. This approach allows recognizing similar but not identical stylistic choices, increasing the accuracy of the similarity metric.

Computer methods play a key role here. Moshe Koppel, Jonathan Schler, and Shlomo Argamon note that modern machine learning methods allow considering a wide range of potentially relevant features without significant accuracy loss even if most features turn out irrelevant [Koppel et al., 2009]. This opens possibilities for analyzing microfeatures such as emoji use, specific abbreviations, and punctuation patterns that remain characteristic of a particular author even when text volume is limited. Tatiana Litvinova and Anastasiya Gromova also emphasize that “punctuation choices as a component of the orthological parameter of the idiolect of a modern Russian language speaker” can be used in authorship expertise aimed at identification [Litvinova et al., 2020].

However, despite progress, difficulties remain. For example, short texts often contain informal constructions, slang, and errors that may be accidental or intentional. Interpreting them requires the expert’s deep understanding of the communication context and of sociolinguistic features. Moreover, as V.O. Kuznetsov and E.K. Kryuk note, when analyzing text matches, even short ones, it is important to distinguish fully matching, partially matching, and differing fragments [Kuznetsov et al., 2019]. They propose conducting comparisons starting from theme and composition, then moving to details such as verbatim sentence matches or the use of synonymous means.

The applicability of authorship expertise to short texts is not only possible but actively developing thanks to new methodologies and computer tools. However, it requires a more subtle approach to feature selection and statistical processing. This leads us to the next question: if we can identify the author by the smallest details of their style, what about situations where these details are deliberately borrowed or copied?

The Problem of Plagiarism and Borrowing in Texts

After considering the complexities of authorship expertise of short texts, it is logical to move to the problem of plagiarism and borrowing, which also requires thorough analysis of text fragments but with a different goal — establishing the fact of improper appropriation of authorship. Here the focus shifts from identifying a specific author to detecting matches between texts and assessing their legitimacy.

Plagiarism, as a legal concept, is defined as appropriation of authorship, which may manifest in declaring oneself the author of someone else’s work, publishing someone else’s text under one’s own name, or publishing a work created in collaboration without indicating other authors’ names [Kuznetsov et al., 2019]. This definition, enshrined in the Criminal Code of the Russian Federation (Article 146) and clarified by the Plenum of the Supreme Court, emphasizes not only the fact of borrowing but also the intention to pass off another’s work as one’s own.

Detecting plagiarism is a task requiring a comprehensive approach beyond purely linguistic analysis. As Kuznetsov and Kryuk note, studying only linguistic features of compared texts is clearly insufficient to establish borrowing. This means the expert must not only detect matching fragments but also assess their semantic significance, the volume of borrowing relative to the entire text, and the correctness of references if present.

Linguistic expertise in plagiarism cases aims to solve tasks related to identifying text features associated with plagiarism. This includes comparing texts to find matching fragments, establishing the fact of borrowing, and determining its direction [Kuznetsov et al., 2019]. For example, the expert may find verbatim matches in formulations, similar composition, or identical factual information, as shown in an example comparing texts about nutrition [Kuznetsov et al., 2019]. In this case, despite minor changes (removal of pronouns or verbs), the essence and lexical content remained identical, indicating borrowing.

Abroad, within forensic linguistics, the problem of plagiarism is also actively studied, though without the division into linguistic and authorship expertise that is traditional in domestic practice [Kuznetsov et al., 2019]. Various types of plagiarism are distinguished: intralingual, interlingual (translation plagiarism), dictionary plagiarism, and plagiarism of documents recording investigative actions. Quantitative and qualitative methods are used for detection, including statistical and stylometric analysis.

Quantitative methods involve studying indicators such as the percentage of matching words, percentage of so-called hapax legomena (words occurring only once in the text), percentage of unique hapax legomena, and others [Kuznetsov et al., 2019]. These methods are often implemented with specialized plagiarism detection software. However, as David Wullschleger points out, computers “read” text as a stream of symbols, recognizing words by boundaries, indicating the imperfection of automated systems and the need for manual analysis of large text volumes [Kuznetsov et al., 2019].
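
Indicators of this kind can be sketched in a few lines of Python. The source does not give exact formulas, so the definitions below (shares computed over the vocabulary of the suspect text, hapax legomena shared with the source text) are assumptions chosen for illustration, as are the function names and the example sentences.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def hapax_legomena(tokens):
    """Words that occur exactly once in the token list."""
    return {word for word, count in Counter(tokens).items() if count == 1}


def pct(part, whole):
    return 100.0 * part / whole if whole else 0.0


def overlap_indicators(suspect, source):
    """Assumed indicator definitions; all percentages relate to the suspect text."""
    s_tokens, o_tokens = tokenize(suspect), tokenize(source)
    s_vocab, o_vocab = set(s_tokens), set(o_tokens)
    s_hapax, o_hapax = hapax_legomena(s_tokens), hapax_legomena(o_tokens)
    return {
        "shared_vocabulary_pct": pct(len(s_vocab & o_vocab), len(s_vocab)),
        "hapax_pct": pct(len(s_hapax), len(s_vocab)),
        "shared_hapax_pct": pct(len(s_hapax & o_hapax), len(s_hapax)),
    }


result = overlap_indicators(
    "eat fresh fruit and eat fresh vegetables daily",
    "eat fresh fruit and vegetables every day",
)
print({name: round(value, 2) for name, value in result.items()})
# prints: {'shared_vocabulary_pct': 83.33, 'hapax_pct': 66.67, 'shared_hapax_pct': 75.0}
```

Such figures only flag candidate overlaps. Whether a match amounts to borrowing remains a matter for the expert’s qualitative comparison.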

The problem of imitation and masking of idiolect, discussed by Litvinova and Gromova, is also relevant in plagiarism context. If the author deliberately changes their style to hide borrowing or, conversely, to pass off another’s text as their own, this significantly reduces classification model accuracy. Studies show that detecting attempts to distort idiolect is possible in principle, and frequency analysis of function words plays an important role [Litvinova et al., 2020]. For example, distorted texts may contain more adverbs, particles, and personal pronouns but fewer nouns, and sentences may be shorter and simpler.
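
Frequency analysis of function words can be illustrated with a short sketch. The word list below is a small hypothetical English sample rather than the inventory used in the cited studies, and comparing two such profiles would be only a first step in checking for a possible change of style.

```python
import re
from collections import Counter

# Hypothetical sample; a real study would use a much longer, language-specific list.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "i", "you", "it", "not")


def function_word_profile(text, function_words=FUNCTION_WORDS):
    """Relative frequency of each function word per 100 word tokens."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: (100.0 * counts[word] / total if total else 0.0)
        for word in function_words
    }


profile = function_word_profile("I saw the dog and the cat")
print(round(profile["the"], 2), profile["of"])
# prints: 28.57 0.0
```

Counting adverbs, particles, or nouns, as mentioned above, would additionally require a part-of-speech tagger, which this sketch deliberately leaves out.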

However, as Litvinova and Gromova rightly note, studies specifically dedicated to determining the intention to distort idiolect are very few. Although classifiers capable of accurately detecting signs of concealment or imitation of idiolect exist, the question of which linguistic elements contribute most to this differentiation remains open. This is especially important for forensic authorship expertise, where it is necessary not only to state the fact of coincidence but also to understand whether this coincidence was accidental, intentional, or the result of imitation.

In the context of plagiarism, especially in academia, the question of “self-plagiarism” or reuse of one’s own texts arises. Although legally this is not always plagiarism in the strict sense of appropriating another’s authorship, ethical norms and requirements for originality of scientific works often prohibit such reuse without proper citation. Here linguistic expertise can help establish the degree of coincidence and determine whether the reuse is substantial or insignificant.

The problem of plagiarism and borrowing is thus multifaceted and requires not only linguistic analysis but also analysis of content. Automated systems are useful for the initial detection of matches, but the final decision on the presence and nature of plagiarism always remains with the expert, who must consider all nuances, including possible distortion of idiolect and the context in which the text was created. This calls for critical reflection on the possibilities and limitations of existing methods, and for recognizing that even the most advanced tools cannot replace in-depth expert evaluation.

Criticism and Limitations

Limitations Related to Volume and Quality of Data

One of the most significant limitations of authorship and linguistic expertise is that their effectiveness depends on the volume and quality of the text material studied. As Litvinova and Gromova note, most studies demonstrating high accuracy of computer methods use large volumes of text (thousands or tens of thousands of words) or a large number of texts from each author [Litvinova et al., 2020]. In real forensic practice, the expert often faces limited data such as short social media messages, anonymous notes, or document fragments. Under such conditions, traditional statistical methods, which need a significant volume of data to identify stable stylistic markers, lose their reliability. The number of candidates matters as well: when an author must be identified among a large group (several hundred or several thousand people), the accuracy of computer methods may be significantly lower than for a small number of candidates [Koppel et al., 2009]. If the data volume were always sufficient, author identification would reduce to a text classification task, where machine learning is highly effective. Under data scarcity, however, as Luyckx and Daelemans show, the reported accuracy of many approaches is overestimated because they were not evaluated on training data volumes realistic for forensic stylometry [Luyckx et al., 2008].

The Problem of Imitation and Masking of Idiolect

Another serious limitation is the possibility of imitating another’s style or deliberately distorting one’s own idiolect. If an author consciously tries to change their “handwriting” to avoid identification or, conversely, to impersonate someone else, the effectiveness of the expertise can drop significantly. Litvinova and Gromova point out that detecting attempts to distort idiolect is possible in principle, and that frequency analysis of function words plays an important role [Litvinova et al., 2020]. However, although classifiers capable of detecting signs of concealment or imitation exist, the question of which linguistic elements contribute most to this differentiation remains open. If imitation were impossible, author identification would be a much simpler task based on unique and immutable markers. Since language is a dynamic system and humans are capable of adaptation and manipulation, the expert must take this factor into account, which complicates the process and requires deeper linguistic analysis rather than quantitative counting alone. For example, studies show that stylometry is not a “silver bullet” for detecting fake news because style can be deliberately changed [Potthast et al., 2018].

Difficulties of Interpretation and Lack of Standardization

Finally, even with sufficient data and no deliberate imitation, the results of an expertise can be difficult to interpret. The lack of unified standards and methodologies, together with the blurred boundary between the competencies of linguistic and authorship expertise noted by Kuznetsov and Kryuk, creates problems for the legal significance of conclusions [Kuznetsov et al., 2019]. Computer methods that provide high accuracy are often hard to interpret, whereas interpretability is critically important for forensic expert research [Litvinova et al., 2020]. If universal, transparent, and generally accepted methodologies existed, and the results of computer analysis were easily interpretable for non-specialists (e.g., judges), decision-making would be significantly simpler. However, since linguistic expertise often requires an interdisciplinary approach and attention to context, standardization remains a difficult task. As a result, expert conclusions are challenged because of methodological disagreements or because the results are hard to explain, which undermines trust in expertise as a whole [Koehler, 2013].

Detailed Exposition

Delimitation of Authorship and Linguistic Expertise

In judicial practice, where the text serves as evidence, there is a need for specialized knowledge for its analysis. Here, authorship and linguistic expertise come to the fore, each having its specificity but closely related to the other. Understanding their differences and interrelations is critically important for correctly formulating questions to the expert and adequately interpreting the obtained results.

Authorship expertise, by its nature, aims to establish the identity of the text’s author through their unique linguistic features forming the so-called idiolect or “author’s handwriting.” As V.O. Kuznetsov and E.K. Kryuk note, the tasks of the authorship expert “are related to studying written-speech skills and aimed at establishing the authorship of the text and the conditions of text composition” [Kuznetsov et al., 2019]. This means the expert analyzes stable, individual characteristics of written speech that allow distinguishing one author from another. For example, these may be features of lexical choice, syntactic constructions, punctuation, as well as frequency of use of certain words or phrases.

Linguistic expertise, in turn, focuses on the semantic content of the text and its stylistics, regardless of who the author is. Its goal is “to establish the features of a particular linguistic meaning expressed in the text or its components (word, phrase, utterance, etc.), regardless of the degree of development of the linguistic skills of the text’s author” [Kuznetsov et al., 2019]. This may involve analyzing the text for insults, threats, or calls for unlawful actions, as well as detecting plagiarism. In plagiarism cases, the linguist compares texts as products of speech activity in order to find matching fragments, determine the direction of borrowing, and assess the degree of originality.

However, despite these theoretical distinctions, in practice, a clear division between authorship and linguistic expertise is often not made. Kuznetsov and Kryuk point out that “a number of domestic scholars do not distinguish between authorship and linguistic expertise in cases related to copyright and related rights violations” [Kuznetsov et al., 2019]. Sometimes they are combined under the general name “linguistic authorship expertise” or simply “authorship expertise.” This is explained by the fact that in tasks related to copyright infringement, both establishing authorship and analyzing text originality are often required, which lies at the intersection of competencies.

For example, when establishing plagiarism, it is insufficient to merely detect matching fragments; it is also necessary to assess whether these matches result from borrowing rather than accidental similarity or use of conventional formulations. Here, linguistic analysis of content and text stylistics intertwines with the authorship approach, which can help determine whether the disputed text corresponds to the idiolect of the alleged author. A comprehensive approach involving specialists from various fields becomes a necessity, especially when dealing with the content side of the text under study [Kuznetsov et al., 2019].

Methods of Authorship Expertise

Authorship expertise, as we have established, seeks to identify the author through analysis of their unique idiolect. But how exactly does the expert “read” this idiolect? The methods used in authorship expertise can be conditionally divided into traditional and modern, although the boundary between them is increasingly blurred due to the development of computer technologies.

Traditional methods of authorship expertise are based on in-depth linguistic analysis of the text and identification of stable features of written speech characteristic of a particular author. The core of these methods is idiolect analysis, which involves studying the lexical, syntactic, morphological, and stylistic features of the text. For example, the expert may pay attention to the frequency of use of certain words (lexical level), a preference for certain types of sentences (syntactic level), characteristic errors or, conversely, impeccable grammar (morphological level), as well as the overall tone and register of the text (stylistic level).

Frequency analysis of words is one of the most common methods. It involves counting the frequency of various words, especially so-called “function” words (prepositions, conjunctions, particles), which are believed to be less subject to conscious control and therefore more reliably reflect the individual style of the author. For example, a study may reveal that one author more often uses the conjunction “however,” while another prefers “nevertheless.” These seemingly insignificant details, accumulating, form a unique statistical profile.
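A function-word profile of this kind can be sketched in a few lines of Python. The word list below is a tiny hypothetical sample chosen for illustration (real studies use much larger, language-specific lists), and the per-1,000-word rate and the Manhattan distance are assumed, commonly used choices rather than a method prescribed by the sources cited here.

```python
import re
from collections import Counter

# Hypothetical, deliberately short list of English function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "but", "however", "nevertheless")


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Return the rate of each function word per 1,000 word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: (1000.0 * counts[word] / total if total else 0.0)
        for word in vocabulary
    }


def manhattan_distance(profile_a, profile_b):
    """Sum of absolute rate differences; smaller means more similar profiles."""
    return sum(abs(profile_a[word] - profile_b[word]) for word in profile_a)


if __name__ == "__main__":
    known = function_word_profile("However, the cat and the dog slept.")
    disputed = function_word_profile("Nevertheless, the cat and the dog slept.")
    print(manhattan_distance(known, disputed))
```

On texts this short the rates are dominated by chance, which is exactly the data-volume limitation discussed earlier; the sketch shows the mechanics of building and comparing profiles, not a reliable attribution.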

Syntactic constructions also provide rich material for analysis. The expert may study average sentence length, sentence structure (simple, compound, compound-complex, complex), and the use of inversions, introductory words, and introductory constructions. For example, one author may prefer short, choppy phrases, while another uses long, elaborate sentences with many subordinate clauses. These preferences are usually stable and can serve as reliable idiolect markers.
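Of the syntactic indicators mentioned, sentence length is the easiest to quantify. The following Python sketch is a simplification under stated assumptions: sentences are split on terminal punctuation followed by whitespace, which mishandles abbreviations and quoted speech, and the function names are hypothetical. Classifying sentence structure would require a syntactic parser and is not attempted here.

```python
import re
from statistics import mean, pstdev


def sentence_lengths(text):
    """Return the number of word tokens in each sentence of the text."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(re.findall(r"\w+", s)) for s in sentences]


def sentence_length_profile(text):
    """Summarize sentence length: count, mean, and population std deviation."""
    lengths = sentence_lengths(text)
    if not lengths:
        return {"sentences": 0, "mean_len": 0.0, "stdev_len": 0.0}
    return {
        "sentences": len(lengths),
        "mean_len": mean(lengths),
        "stdev_len": pstdev(lengths),
    }


if __name__ == "__main__":
    print(sentence_length_profile("I came. I saw it all. Then I left quickly!"))
```

The standard deviation is reported alongside the mean because two authors with the same average sentence length may differ sharply in how much their sentence lengths vary.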

Punctuation, at first glance, seems strictly regulated, but individual features manifest here as well. The expert may analyze the frequency of various punctuation marks, their placement in non-standard cases, and the presence or absence of punctuation errors. For example, excessive use of dashes or, conversely, their absence where they are required may be characteristic. In her empirical evaluation, Carole Chaski found that syntactic analysis and syntactically classified punctuation were two techniques that successfully differentiated and clustered documents [Chaski, 2001].
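The simplest of these punctuation indicators, the relative frequency of each mark, can be sketched in Python as follows. The set of marks and the function name are assumptions made for illustration. The sketch counts marks only; it does not reproduce Chaski's syntactically classified punctuation, which additionally requires knowing the syntactic position of each mark.

```python
from collections import Counter

# Assumed set of marks to track; "\u2014" is the long dash character.
PUNCTUATION_MARKS = ",;:-\u2014()!?."


def punctuation_profile(text, marks=PUNCTUATION_MARKS):
    """Return each mark's share of all tracked punctuation marks in the text."""
    counts = Counter(ch for ch in text if ch in marks)
    total = sum(counts.values())
    return {mark: (counts[mark] / total if total else 0.0) for mark in marks}


if __name__ == "__main__":
    profile = punctuation_profile("Well, yes; no, maybe.")
    # Print only the marks that actually occur in the sample.
    print({mark: share for mark, share in profile.items() if share > 0})
```

Shares are computed relative to the total number of tracked marks rather than to text length, so that texts of different lengths can be compared; this normalization is one reasonable choice among several.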

Modern methods of authorship expertise increasingly include the use of computer programs and corpus linguistics methods. This allows automating the process of counting and analyzing huge volumes of textual data, revealing hidden patterns difficult to detect manually. For example, computer

Sources

Stephan Lewandowsky; Ullrich K. H. Ecker; John Cook. Beyond misinformation: Understanding and coping with the “post-truth” era (2017)
H. Andrew Schwartz; Johannes C. Eichstaedt; Margaret L. Kern; Lukasz Dziurzynski; Stephanie M. Ramones; Megha Agrawal; Achal Shah; Michał Kosiński; David Stillwell; Martin E. P. Seligman; Lyle Ungar. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach (2013)
Nikolaos Aletras; Dimitrios Tsarapatsanis; Daniel Preoțiuc-Pietro; Vasileios Lampos. Predicting judicial decisions of the European Court of Human Rights: a Natural Language Processing perspective (2016)
Martin Potthast; Johannes Kiesel; Kevin Reinartz; Janek Bevendorff; Benno Stein. A Stylometric Inquiry into Hyperpartisan and Fake News (2018)
Olivier De Vel; Alison Anderson; Malcolm Corney; George Mohay. Mining e-mail content for author identification forensics (2001)
Moshe Koppel; Jonathan Schler; Shlomo Argamon. Computational methods in authorship attribution (2009)
Maciej Eder; Jan Rybicki; Mike Kestemont. Stylometry with R: A Package for Computational Text Analysis (2016)
Jiwei Li; Myle Ott; Claire Cardie; Eduard Hovy. Towards a General Rule for Identifying Deceptive Opinion Spam (2014)
Hossein Hassani; Christina Beneki; Stephan Unger; Maedeh Taj Mazinani; Mohammad Reza Yeganegi. Text Mining in Big Data Analytics (2020)
Miriam A. Locher; Richard J. Watts. Chapter 4. Relational work and impoliteness: Negotiating norms of linguistic behaviour (2008)
Dong Nguyen; A. Seza Doğruöz; Carolyn Penstein Rosé; Franciska de Jong. Computational Sociolinguistics: A Survey (2016)
Cati Brown; Tony Snodgrass; Susan Kemper; Ruth Herman; Michael A. Covington. Automatic measurement of propositional idea density from part-of-speech tagging (2008)
Upendra Sapkota; Steven Bethard; Manuel Montes; Thamar Solorio. Not All Character N-grams Are Created Equal: A Study in Authorship Attribution (2015)
Kim Luyckx; Walter Daelemans. The effect of author set size and data size in authorship attribution (2010)
Kim Luyckx; Walter Daelemans. Authorship attribution and verification with many authors and limited data (2008)
Kate Haworth. The dynamics of power and resistance in police interview discourse (2006)
Norman Meuschke; Béla Gipp. State-of-the-art in detecting academic plagiarism (2013)
Efstathios Stamatatos. Authorship Attribution Using Text Distortion (2017)
Natasha Fernandes; Mark Dras; Annabelle McIver. Generalised Differential Privacy for Text Document Processing (2019)
Patrick Juola; John Sofko; Patrick McKinley Brennan. A Prototype for Authorship Attribution Studies (2006)
Tim Grant. TXT 4N6: method, consistency, and distinctiveness in the analysis of SMS text messages (2013)
Yanir Seroussi; Ingrid Zukerman; Fabian Bohnert. Authorship Attribution with Topic Models (2014)
Jack Grieve; Isobelle Clarke; Emily Chiang; Hannah P. Gideon; Annina Heini; Andrea Nini; Emily Waibel. Attributing the Bixby Letter using n-gram tracing (2018)
Nicci MacLeod; Tim Grant. Whose Tweet? Authorship analysis of micro-blogs and other short-form messages (2012)
Heba El-Fiqi; Eleni Petraki; Hussein A. Abbass. Network motifs for translator stylometry identification (2019)
Malcolm Coulthard. An Introduction to Forensic Linguistics (2016)
Shlomo Argamon. Register in computational language research (2019)
Benedikt Boenninghoff; Robert M. Nickel; Steffen Zeiler; Dorothea Kolossa. Similarity Learning for Authorship Verification in Social Media (2019)
Carole E. Chaski. Empirical evaluations of language-based author identification techniques (2001)
Graeme Hirst; Vanessa Wei Feng. Changes in Style in Authors with Alzheimer's Disease (2012)
Gerald McMenamin. Style markers in authorship studies (2001)
Nektaria Potha; Efstathios Stamatatos. Intrinsic Author Verification Using Topic Modeling (2018)
David Wright. Stylistics versus Statistics: A corpus linguistic approach to combining techniques in forensic authorship analysis using Enron emails (2014)
G. Heydon. Researching Forensic Linguistics (2019)
Kelly Bodwin; Albert Yoon. A Statistical Approach to Judicial Authorship: A Case Study of Judge Easterbrook (2012)
Arta Misini; A. Kadriu; Ercan Canhasi. A Survey on Authorship Analysis Tasks and Techniques (2022)
Ahmed Alduais; Mohammed Ali Al-Khulaidi; Silvia Allegretta; Mona Mohammed Abdulkhalek. Forensic linguistics: A scientometric review (2023)
Tatiana Litvinova; Anastasiya Gromova. Current Problems of Forensic Authorship Analysis and the Possibility of Their Solution with the Use of Computer Methods: Problems and Prospects (2020)
Matthias Schlesewsky. Linguistische Daten aus experimentellen Umgebungen: Eine multiexperimentelle und multimodale Perspektive (2009)
Jonathan J. Koehler. Linguistic Confusion in Court: Evidence From the Forensic Sciences (2013)
V. O. Kuznetsov; E. K. Kryuk. Demarcating a Linguistic Expert’s and an Authorship Investigator’s Competencies When Examining Copyright and Related Rights Objects (2019)
Nishchal Sharma; Ajay Kumar. Deep Learning for Stylometry and Authorship Attribution: a Review of Literature (2024)
Doru B; Maier C; Busse JS; Lücke T; Schönhoff J; Enax-Krumova E; Hessler S; Berger M; Tokic M. Detecting Artificial Intelligence-Generated Versus Human-Written Medical Student Essays: Semirandomized Controlled Study (2025)
Juola P. Verifying authorship for forensic purposes: A computational protocol and its validation (2021)
Rui Ribeiro; J. P. Carvalho; Luísa Coheur. Leveraging Fuzzy Fingerprints from Large Language Models for Authorship Attribution (2024)
Mario Crespo Miguel. Analysis of parameters on author attribution of Spanish electronic short texts (2016)
Juan Antonio Cutillas Espinosa; Juan Manuel Hernández Campoy. Historical sociolinguistics and authorship elucidation in medieval private written correspondence: (2021)
Alison Johnson; Malcolm Coulthard. Introduction
Leonard Bloomfield. Language (1933)
Noam Chomsky. Syntactic Structures (1957)
Noam Chomsky. Aspects of the Theory of Syntax (1965)
John Langshaw Austin. How to Do Things with Words (1962)
Dwight Fee; Norman Fairclough. Discourse and Social Change (1993)
M. A. K. Halliday; Christian M. I. M. Matthiessen. An Introduction to Functional Grammar (2014)
Noam Chomsky. The Minimalist Program (2014)
Norman Fairclough. Critical Discourse Analysis: The Critical Study of Language (1995)
Norman Fairclough. Analysing Discourse: Textual Analysis for Social Research (2003)
M. A. K. Halliday. Language as social semiotic: the social interpretation of language and meaning (1978)
M. A. K. Halliday; Ruqaiya Hasan. Cohesion in English (2014)
James R. Bennett; Edward S. Herman; Noam Chomsky. Manufacturing Consent: The Political Economy of the Mass Media (1989)