Software

Historical languages are increasingly being modelled computationally. Syntactically annotated texts are often a sine qua non for this modelling, but parsing pre-modern language varieties faces severe data sparsity, intensified by high levels of orthographic variation. In this paper we present a good-quality Early Slavic dependency parser, obtained by manipulating modern Slavic data to resemble the orthography and morphosyntax of pre-modern varieties. The tool can be deployed to expand historical treebanks, which are crucial for data collection and quantification, and beneficial to downstream NLP tasks and historical text mining.


Introduction
Dependency parsing is important in many downstream natural language processing (NLP) tasks, including event extraction, word vector representation enhancement, and text classification and summarization. Training good-quality parsers for historical languages is a challenging task, since they normally provide very little data with very high levels of linguistic variation, which in machine learning easily translates into high levels of noise.
In this paper we present a variety-agnostic part-of-speech (PoS) tagger and dependency parser for Early Slavic (OldSlavNet), trained on multilingual Slavic data spanning a thousand years via orthographic and morphosyntactic harmonization of the modern data with their pre-modern counterparts. Early Slavic and Modern Russian data was [...] now available for Russian and Serbian, can be downloaded from the parser's repository and used to harmonize new Modern Russian and Serbian texts with Early Slavic, thus potentially improving the parsing performance.
The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
The parser is especially useful for expanding historical treebanks, that is, large collections of digital texts annotated with syntactic information. Treebanks are a versatile source of data: they are directly exploited in many NLP tasks, such as those mentioned above, and they are also used by the humanities at large as stand-alone collections of carefully digitized textual data enriched with linguistic information.

Data and parser architecture
The parser works in the UD framework [6], one of the most widely employed formats for dependency parsing.
The tool's neural-network architecture is based on jPTDP [7]. The following are the main new features in OldSlavNet's model:
- ArgParse replaces the older OptParse to allow for wider reusability of our code.
- RMSProp [8] is employed instead of Adam [9] as the optimizer to avoid exploding gradients. The initial learning rate was set to 0.1 instead of None.
- Since the previous experiment in [10], the training set has been expanded with Modern Russian and Serbian data. OldSlavNet's documentation contains a detailed breakdown of the corpus on which the parser was trained and tested.
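The first two changes listed above can be illustrated with a minimal, self-contained sketch. It is not OldSlavNet's actual code: the option names `--optimizer` and `--learning-rate`, the function names, and the decay rate `rho` and stabilizer `eps` are assumptions made for illustration; only the use of ArgParse, RMSProp and the initial learning rate of 0.1 come from the description above.

```python
import argparse
import math


def build_arg_parser():
    """Hypothetical command-line options (names are assumptions, not the tool's real flags)."""
    parser = argparse.ArgumentParser(description="Illustrative training options")
    parser.add_argument("--optimizer", choices=["rmsprop", "adam"], default="rmsprop")
    parser.add_argument("--learning-rate", type=float, default=0.1)
    return parser


def rmsprop_step(params, grads, cache, lr=0.1, rho=0.9, eps=1e-8):
    """One plain RMSProp update over lists of scalar parameters.

    Each gradient is divided by a running root-mean-square of past gradients,
    so a very large gradient produces a bounded step.
    rho and eps are assumed values, not taken from the paper.
    """
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        c = rho * c + (1.0 - rho) * g * g  # running average of squared gradients
        new_cache.append(c)
        new_params.append(p - lr * g / (math.sqrt(c) + eps))
    return new_params, new_cache


if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    params, cache = rmsprop_step([1.0], [2.0], [0.0], lr=args.learning_rate)
    print(args.optimizer, params, cache)
```

In a real training loop the update would be delegated to the RMSProp trainer of the underlying neural-network library; the scalar version here only makes the normalization step explicit.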

Usage
The following is the end-to-end process for using the tool to tag new Early Slavic text:
1. Pre-process your text file: Convert your Early Slavic text to the CoNLL-U UD format by running the converter.py script included in OldSlavNet's repository. The input must be an already tokenized, one-sentence-per-line text file (Fig. 1).
2. Install the required dependencies: Run: pip install -r requirements.
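The pre-processing step above can be pictured with the following minimal sketch, which turns tokenized, one-sentence-per-line text into unannotated CoNLL-U. It is not the converter.py script shipped with OldSlavNet, whose exact output may differ; the function name `sentences_to_conllu` is hypothetical, and the sketch assumes that tokens are separated by whitespace and that all annotation columns are left as underscores for the parser to fill in.

```python
def sentences_to_conllu(lines):
    """Convert tokenized, one-sentence-per-line text to unannotated CoNLL-U.

    Each token becomes one row with the ten tab-separated CoNLL-U columns
    (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC);
    everything except ID and FORM is set to the placeholder "_".
    """
    blocks = []
    for line in lines:
        tokens = line.split()
        if not tokens:  # skip empty lines
            continue
        rows = ["\t".join([str(idx), form] + ["_"] * 8)
                for idx, form in enumerate(tokens, start=1)]
        blocks.append("\n".join(rows) + "\n")
    # Sentences are separated (and the file is terminated) by a blank line.
    return "\n".join(blocks) + ("\n" if blocks else "")


if __name__ == "__main__":
    sample = ["w1 w2 w3", "w4 w5"]  # placeholder tokens, not real Early Slavic
    print(sentences_to_conllu(sample), end="")
```

Writing the returned string to a file gives an input in the general shape that a CoNLL-U-based parser expects, with one blank line after every sentence.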

Impact
OldSlavNet's previous version (known as jPTDP-GEN) enabled the study in [10], which discussed the improvement of dependency parsers for low-resource historical languages using cross-dialectal data. OldSlavNet, a generic (i.e. variety-agnostic) parser, was shown to perform better than two variety-specific parsers for Early Slavic, indicating that markedly non-standardized historical languages are likely to benefit more from the development of generic, cross-variety models than from specialized ones. Since [10], OldSlavNet has further improved its real-world performance (i.e. its ability to tackle a wider range of pre-modern Slavic varieties and genres) thanks to additional data from Modern Russian and Modern Serbian, as Table A.1 shows.
OldSlavNet has been trialled on new texts in the TOROT Treebank [1,2], a major annotated historical corpus for Slavic and an offspring of the PROIEL project [11,12]. The expansion of historical Slavic treebanks using OldSlavNet will contribute to the advancement of research domains that benefit from syntactically annotated data, particularly from less-resourced languages with great spelling variation:
1. Semantic change detection: A methodological gap which has been noted for decades [13] is the integration of syntactic information in meaning change modelling. Early Slavic treebank data can now be used in semantic change detection by generating word representations that are both semantically and syntactically constrained (e.g. syntactic word embeddings [14] and syntactic topic models [15]), thus improving the semantic models themselves. Understanding the mechanisms of meaning change in different historical contexts will help design better tools for semantic change detection, which has a wide range of applications in text processing, including information retrieval [16][17][18], culturomics [19], Diachronic Text Evaluation (DTE) [20,21], recontextualization of past texts [22], OCR error correction [23], and abusive content detection [24], among others (see [25] for a detailed survey of applications).
2. Improving NLP system evaluation practices: Early Slavic is ideally placed to be used in the evaluation of NLP systems and methods, in light of its many related subvarieties and its high orthographic variation. This variation is a challenge in computational models of language change, since NLP systems tend to disregard low-frequency types, which are inevitable in historical sources. More syntactically annotated data for Early Slavic will allow us to systematically investigate how NLP approaches to infrequent tokens impact the generalization of a system's results, thus improving our evaluation practices.
3. Improving representativeness: Expanding Early Slavic treebanks will allow us to develop methods for large-scale quantitative diachronic analyses of linguistic phenomena in languages other than English. The lack of large, non-English diachronic corpora has been stressed in the literature (e.g. [26] and [25]) as a possible bias in historical linguistic research that aims at generalizing findings cross-linguistically.

Limitations and future improvements
The scripts used to harmonize Russian and Serbian orthography and morphology to Early Slavic are still experimental. Presently, only the tokens belonging to the most frequent morphological tags have been harmonized. Figs. 3 and 4 illustrate how the harmonization routine currently works on a Serbian and a Russian sentence respectively. Given the promising results, in future releases we plan to develop harmonization scripts encompassing a wider range of morphotags, which is expected to yield even better parsing performance on pre-modern Slavic varieties. A drawback of the current version of OldSlavNet is that it takes already sentencized text (i.e. with one sentence per line, as shown in Fig. 1) as input, which requires users to manually split their text into sentences. Integration of OldSlavNet with spaCy [27] is, however, underway, in order to complement the parser with an Early Slavic sentencizer that takes unbroken text as input and provides a one-sentence-per-line output, which can then be fed directly to OldSlavNet to add syntactic annotation.
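Until such a sentencizer is available, a rough first pass at the one-sentence-per-line format could be produced with a naive, punctuation-based splitter like the sketch below. This is not the planned spaCy component and is only an illustration: the function name `naive_sentencize` and the default set of terminators are assumptions, and the output would still need manual checking, since punctuation in pre-modern sources does not reliably mark sentence boundaries.

```python
import re


def naive_sentencize(text, terminators=".!?;"):
    """Split running text into sentences after any run of terminator characters.

    terminators is an assumed, user-adjustable set of sentence-final marks.
    Whitespace inside each sentence (including line breaks) is normalized to
    single spaces so that every sentence fits on one line.
    """
    cls = re.escape(terminators)
    # A sentence is a run of non-terminators followed by optional terminators.
    chunks = re.findall(rf"[^{cls}]+[{cls}]*", text)
    sentences = [" ".join(chunk.split()) for chunk in chunks]
    return [s for s in sentences if s]


if __name__ == "__main__":
    for sentence in naive_sentencize("First one. Second  one!\nThird"):
        print(sentence)
```

Passing a different `terminators` string lets users adapt the splitter to the punctuation conventions of a given edition.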

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.