This paper presents work where a general-purpose text categorization method was applied to categorize medical free-texts. The purpose of the experiments was to examine how such a method performs without any domain-specific knowledge, hand-crafting or tuning. Additionally, we compare the results from the general-purpose method with results from runs in which a medical thesaurus as well as automatically extracted keywords were used when building the classifiers. We show that standard text categorization techniques using stemmed unigrams as the basis for learning can be applied directly to categorize medical reports, yielding an F-measure of 83.9, and outperforming the more sophisticated methods.
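Purely as an illustration of the kind of pipeline sketched in the abstract above, and not the paper's actual system, the following snippet shows stemmed-unigram text categorization; the choice of scikit-learn, the Snowball stemmer language, the linear SVM and the micro-averaged F-measure call are assumptions made for the example.

```python
# Minimal sketch (not the paper's system): stemmed-unigram text categorization.
# The stemmer language and classifier choice are illustrative assumptions.
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

stemmer = SnowballStemmer("english")

def stem_tokens(text):
    # Lowercase, split on whitespace, and stem each token (unigrams only).
    return [stemmer.stem(tok) for tok in text.lower().split()]

def train_and_evaluate(train_texts, train_labels, test_texts, test_labels):
    model = make_pipeline(
        TfidfVectorizer(tokenizer=stem_tokens, token_pattern=None),
        LinearSVC(),
    )
    model.fit(train_texts, train_labels)
    predicted = model.predict(test_texts)
    # Micro-averaged F-measure over all categories.
    return f1_score(test_labels, predicted, average="micro")
```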
Although Sweden has yet to allocate funds specifically intended for CLARIN activities, there are some ongoing activities which are directly relevant to CLARIN, and which are explicitly linked to CLARIN. These activities have been funded by the Committee for Research Infrastructures and its subcommittee DISC (Database Infrastructure Committee) of the Swedish Research Council.
Historical ciphers, a special type of manuscript, contain encrypted information that is important for the interpretation of our history. The first step towards decipherment is to transcribe the images, either manually or by automatic image processing techniques. Despite the improvements in handwritten text recognition (HTR) thanks to deep learning methodologies, the need for labelled training data is an important limitation. Given that ciphers often use symbol sets drawn from various alphabets as well as unique symbols without any transcription scheme available, these supervised HTR techniques are not suitable for transcribing ciphers. In this paper we propose an unsupervised method for transcribing encrypted manuscripts based on clustering and label propagation, which has been successfully applied to community detection in networks. We analyze the performance on ciphers with various symbol sets, and discuss the advantages and drawbacks compared to supervised HTR methods.
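As a rough illustration of the general idea of clustering symbol images and spreading a handful of seed labels, not the authors' implementation, here is a sketch using scikit-learn; the feature representation, the cluster count and the LabelPropagation settings are all assumptions.

```python
# Illustrative sketch only: cluster symbol images and spread a few seed labels.
# Feature extraction and algorithm choices are assumptions, not the paper's method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.semi_supervised import LabelPropagation

def transcribe_symbols(symbol_images, seed_labels, n_clusters=20):
    """symbol_images: (n_symbols, height, width) array of cropped glyphs.
    seed_labels: array of length n_symbols with -1 for unlabelled glyphs."""
    # Flatten each glyph into a feature vector (a real system would use
    # stronger descriptors, e.g. learned embeddings).
    features = symbol_images.reshape(len(symbol_images), -1).astype(float)

    # Group visually similar glyphs.
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)

    # Propagate the few known labels to the unlabelled glyphs.
    propagator = LabelPropagation(kernel="knn", n_neighbors=7)
    propagator.fit(features, seed_labels)
    return cluster_ids, propagator.transduction_
```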
In this paper we apply the ensemble approach to the identification of incorrectly annotated items (noise) in a training set. In a controlled experiment, memory-based, decision tree-based and transformation-based classifiers are used as a filter to detect and remove noise deliberately introduced into a manually tagged corpus. The results indicate that the method can be successfully applied to automatically detect errors in a corpus.
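The following sketch illustrates the general ensemble-filter idea; the classifiers used here (k-NN, decision tree, naive Bayes) merely stand in for the memory-based, decision-tree-based and transformation-based learners of the paper and are assumptions for the example.

```python
# Sketch of an ensemble filter for annotation noise (illustrative classifier
# choices; assumes count-style non-negative features for MultinomialNB).
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

def flag_noisy_items(X, y, min_votes=2):
    """Return indices of items whose gold label is contradicted by at least
    min_votes out-of-fold classifier predictions."""
    classifiers = [KNeighborsClassifier(), DecisionTreeClassifier(), MultinomialNB()]
    disagreements = np.zeros(len(y), dtype=int)
    for clf in classifiers:
        # Out-of-fold predictions, so no classifier judges its own training items.
        predictions = cross_val_predict(clf, X, y, cv=10)
        disagreements += (predictions != y).astype(int)
    return np.where(disagreements >= min_votes)[0]
```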
CLARIN is a European Research Infrastructure Consortium (ERIC), which aims at (a) making extensive language-based materials available as primary research data to the humanities and social sciences (HSS); and (b) offering state-of-the-art language technology (LT) as an e-research tool for this purpose, positioning CLARIN centrally in what is often referred to as the digital humanities (DH). The Swedish CLARIN node Swe-Clarin was established in 2015 with funding from the Swedish Research Council.
In this paper, we describe the composition and activities of Swe-Clarin, aiming at meeting the requirements of all HSS and other researchers whose research involves using text and speech as primary research data, and spreading the awareness of what Swe-Clarin can offer these research communities. We focus on one of the central means for doing this: pilot projects conducted in collaboration between HSS researchers and Swe-Clarin, together formulating a research question, the addressing of which requires working with large language-based materials. Four such pilot projects are described in more detail, illustrating research on rhetorical history, second-language acquisition, literature, and political science. A common thread to these projects is an aspiration to meet the challenge of conducting research on the basis of very large amounts of textual data in a consistent way without losing sight of the individual cases making up the mass of data, i.e., to be able to move between Moretti’s “distant” and “close reading” modes.
While the pilot projects clearly make substantial contributions to DH, they also reveal some needs for more development, and in particular a need for document-level access to the text materials. As a consequence of this, work has now been initiated in Swe-Clarin to meet this need, so that Swe-Clarin together with HSS scholars investigating intricate research questions can take on the methodological challenges of big-data language-based digital humanities.
The goal of the project is to model the prosodic structuring of speech in terms of boundaries and groupings. The modeling will include different communicative situations and be based on existing as well as new speech corpora. Production and perception studies will be used in parallel with automatic methods developed for analysis, modeling and prediction of prosody. The model will be perceptually evaluated using synthetic speech.
Historical ciphers contain a wide range of symbols from various symbol sets. Identifying the cipher alphabet is a prerequisite before decryption can take place, and it is a time-consuming process. In this work we explore the use of image processing for identifying the underlying alphabet in cipher images and for comparing alphabets between ciphers. The experiments show that ciphers with similar alphabets can be successfully discovered through clustering.
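A minimal, hypothetical sketch of how such clustering might look, assuming glyphs have already been segmented and described by feature vectors; none of the concrete choices below are taken from the paper.

```python
# Illustrative sketch (not the paper's pipeline): represent each cipher by a
# histogram over a shared vocabulary of symbol shapes, then group ciphers whose
# alphabets look alike. Feature and clustering choices are assumptions.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def cluster_ciphers(symbol_features, cipher_ids, n_symbol_types=50, n_groups=5):
    """symbol_features: (n_symbols, n_features) descriptors of segmented glyphs.
    cipher_ids: for each glyph, the index of the cipher it was cut from."""
    # Shared "visual alphabet": quantize all glyphs into symbol types.
    symbol_types = KMeans(n_clusters=n_symbol_types, n_init=10).fit_predict(symbol_features)

    # One normalized histogram of symbol types per cipher.
    n_ciphers = int(np.max(cipher_ids)) + 1
    histograms = np.zeros((n_ciphers, n_symbol_types))
    for cipher, symbol in zip(cipher_ids, symbol_types):
        histograms[cipher, symbol] += 1
    histograms /= histograms.sum(axis=1, keepdims=True)

    # Ciphers with similar histograms are assumed to share an alphabet.
    return AgglomerativeClustering(n_clusters=n_groups).fit_predict(histograms)
```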
Manual transcription of handwritten text is a time-consuming task. In the case of encrypted manuscripts, recognition is even more complex due to the huge variety of alphabets and symbol sets. To speed up and ease this process, we present a web-based tool aimed at (semi-)automatically transcribing the encrypted sources. The user uploads one or several images of the desired encrypted document(s) as input, and the system returns the transcription(s). This process is carried out interactively with the user to obtain more accurate results. The developed web tool is freely available for exploration and testing.
The paper demonstrates how data-driven learning methods are applied in teaching Turkish as a foreign language at the Department of Linguistics and Philology, Uppsala University. In data-driven teaching, language corpora, concordance programs, and annotation tools developed in collaboration with computational linguists are employed. This paper illustrates how resources developed initially for research purposes in different subjects (such as Computational Linguistics, Linguistics, Turkic languages), are now being used in teaching environments.
We present the Swedish-Turkish parallel corpus, providing students and researchers with easily accessible annotated linguistic data. The web-based corpora can be used both by regular and distance students. They also function as learning tools for formulating and testing hypotheses concerning lexical, morphological and syntactic aspects of Turkish. Furthermore, they help the students to practice contrastive studies and translation between Swedish and Turkish.
Language resources and tools to create and process these resources are necessary components in human language technology and natural language applications. In this paper, we describe a survey of existing language resources for Swedish, and the need for Swedish language resources to be used in research and real-world applications in language technology as well as in linguistic research. The survey is based on a questionnaire sent to industry and academia, institutions and organizations, and to experts involved in the development of Swedish language resources in Sweden, the Nordic countries and world-wide.
Language resources, such as lexicons, databases, dictionaries and corpora, and tools to create and process these resources are necessary components in human language technology and natural language applications. In this survey, we describe the inventory process and the results concerning existing language resources for Swedish, as well as the need for Swedish language resources to be used in research and real-world applications in language technology and in linguistic research. The survey is based on a questionnaire sent to industry and academia, institutions and organizations, and to experts involved in the development of Swedish language resources in Sweden, the Nordic countries and world-wide. This study is a result of the project “An Infrastructure for Swedish language technology” supported by the Swedish Research Council's Committee for Research Infrastructures in 2007-2008.
Handwritten Text Recognition (HTR) in low-resource scenarios (i.e., when the amount of labeled data is scarce) is a challenging problem. This is particularly true for historical encrypted manuscripts, commonly known as ciphers, which contain secret messages and were typically used in military or diplomatic correspondence, records of secret societies, or private letters. To hide their contents, the sender and receiver created their own secret method of writing. The cipher alphabets often include digits, Latin or Greek letters, Zodiac and alchemical signs, combined with various diacritics, as well as invented symbols. The first step in the decryption process is the transcription of these manuscripts, which is difficult due to the great variation in handwriting styles and cipher alphabets, with only a limited number of pages available per cipher. Although different strategies can be considered to deal with the insufficient amount of training data (e.g., few-shot learning, self-supervised learning), the performance of available HTR models is not yet satisfactory. Thus, the proposed competition, which includes ciphers with a large number of symbol sets and scribes, aims to boost research in HTR in low-resource scenarios.
In historical encrypted sources we can find encrypted text sequences, also called ciphertext, as well as non-encrypted cleartexts written in a known language. While most cryptanalysis focuses on the decryption of ciphertext, cleartext is often overlooked although it can give us important clues about the historical interpretation and contextualisation of the manuscript. In this paper, we investigate to what extent we can automatically distinguish cleartext from ciphertext in historical ciphers and to what extent we are able to identify its language. The problem is challenging as cleartext sequences in ciphers are often short, up to a few words, and appear in different languages due to historical code-switching. To identify the sequences and the language(s), we chose a rule-based approach and ran 7 different models based on historical language models on various ciphertexts.
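To make the task concrete, here is a hedged sketch of one way to score short sequences against per-language character n-gram models; the paper's rule-based system and its historical language models work differently, and the names and threshold below are purely illustrative.

```python
# Minimal illustration of per-sequence language scoring with character bigram
# models; not the paper's method. The threshold value is an assumption.
import math
from collections import Counter

def train_char_bigram_model(text, alpha=1.0):
    """Return add-alpha smoothed counts for a character bigram model."""
    text = f"#{text.lower()}#"
    return {
        "bigrams": Counter(text[i:i + 2] for i in range(len(text) - 1)),
        "unigrams": Counter(text),
        "vocab": len(set(text)),
        "alpha": alpha,
    }

def score(model, sequence):
    """Average log-probability of the sequence under the bigram model."""
    sequence = f"#{sequence.lower()}#"
    total = 0.0
    for i in range(len(sequence) - 1):
        bigram, first = sequence[i:i + 2], sequence[i]
        numerator = model["bigrams"][bigram] + model["alpha"]
        denominator = model["unigrams"][first] + model["alpha"] * model["vocab"]
        total += math.log(numerator / denominator)
    return total / max(len(sequence) - 1, 1)

def identify_language(sequence, models, threshold=-4.0):
    """Return the best-scoring language, or 'ciphertext' if nothing fits well."""
    best_lang, best = max(((lang, score(m, sequence)) for lang, m in models.items()),
                          key=lambda pair: pair[1])
    return best_lang if best > threshold else "ciphertext"
```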
The automatic transcription of encrypted manuscripts is a challenge due to the different handwriting styles and the often invented symbol alphabets. Many transcription methods require annotated sources, including symbol locations. However, most existing transcriptions are provided at line or page level, making it necessary to find the bounding boxes of the transcribed symbols in the image, a process referred to as alignment. In this work, we therefore develop several alignment methods and discuss their performance on encrypted documents with various symbol sets.
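As an illustration of what a very simple aligner could look like, assuming binarized line images and line-level transcriptions, here is a projection-profile sketch; it is not one of the methods evaluated in the paper.

```python
# Simple illustrative alignment sketch: segment a binarized text-line image
# with a vertical projection profile and pair the segments with the line-level
# transcription in reading order. All assumptions, not the paper's methods.
import numpy as np

def align_line(binary_line, transcription):
    """binary_line: 2D array with ink pixels = 1, background = 0.
    transcription: list of symbols transcribed for this line.
    Returns a list of (symbol, (x_start, x_end)) pairs, or None if counts differ."""
    # Columns containing any ink.
    ink_columns = binary_line.sum(axis=0) > 0

    # Group consecutive ink columns into candidate symbol segments.
    segments, start = [], None
    for x, has_ink in enumerate(ink_columns):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, len(ink_columns)))

    # Naive one-to-one pairing in reading order; real aligners must also handle
    # touching or fragmented symbols, e.g. with dynamic programming.
    if len(segments) != len(transcription):
        return None
    return list(zip(transcription, segments))
```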
This study aims to investigate the length, frequency and position of various types of pauses in three different speaking styles: elicited spontaneous dialogues, professional reading and non-professional reading.
In this study, we investigate the correlation between silent pauses and discourse boundaries in terms of theme shifts. We examine three speaking styles in Swedish: professional and non-professional reading, and elicited spontaneous dialogues. Considerable attention is given to the syntactic and discourse context in which pauses appear, as well as to the characteristics of the discourse structure in terms of pauses.
This paper presents a study on whether and how automatically extracted keywords can be used to improve text categorization. In summary, we show that a higher performance, as measured by micro-averaged F-measure on a standard text categorization collection, is achieved when the full-text representation is combined with the automatically extracted keywords. The combination is obtained by giving higher weights to words in the full-texts that are also extracted as keywords. We also present results for experiments in which the keywords are the only input to the categorizer, either represented as unigrams or intact. Of these two experiments, the unigrams have the best performance, although neither performs as well as headlines only.
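A minimal sketch of the keyword-boosting idea described above, assuming tf-idf features and a fixed boost factor; the paper's exact weighting scheme may differ.

```python
# Illustrative sketch: up-weight full-text terms that were also extracted as
# keywords. The vectorizer and boost factor are assumptions for the example.
from sklearn.feature_extraction.text import TfidfVectorizer

def keyword_boosted_matrix(documents, document_keywords, boost=2.0):
    """documents: list of full texts; document_keywords: one set of extracted
    keywords per document. Returns a sparse document-term matrix."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(documents).tolil()
    vocabulary = vectorizer.vocabulary_

    for doc_index, keywords in enumerate(document_keywords):
        for keyword in keywords:
            for term in keyword.lower().split():  # keywords may be multi-word
                if term in vocabulary:
                    matrix[doc_index, vocabulary[term]] *= boost
    return matrix.tocsr()
```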
We present a set of resources and tools to support research and development in the field of historical cryptology. The tools, developed to work together in a pipeline, support the transcription and decipherment of ciphertexts. The pipeline encompasses cataloging the documents in the DECODE database, which houses ciphers dating from the 14th century to 1965, transcription using both manual and AI-assisted methods, cryptanalysis, and subsequent historical and linguistic analysis to contextualize the decrypted content. The project encounters challenges with the accuracy of automated transcription technologies and the necessity for significant user involvement in the transcription and analysis processes. These insights highlight the critical balance between technological innovation and the indispensable input of domain expertise in advancing the field of historical cryptology.
We report recent developments of the DECODE database aimed at the systematic collection and annotation of encrypted sources: ciphertexts, keys and related documents. We released a new, more functional graphical user interface, revised some metadata features, and enlarged the collection, tripling its size.
The Copiale cipher is a 105-page enciphered book dated 1866. We describe the features of the book and the method by which we deciphered it.
The Copiale Cipher is a 105-page, hand-written encrypted manuscript from the mid-eighteenth century. Its code was cracked and the text was deciphered by using modern computational technology combined with philological methods. We describe the book, the features of the text, and give a brief summary of the method by which we deciphered it. Finally, we present the content and the secret society, namely the Oculists, who were hiding behind the cipher.
In Meister’s 1906 landmark study, “Die Geheimschrift im Dienste der päpstlichen Kurie von ihren Anfängen bis zum Ende des XVI Jahrhunderts”, the 16th Century papal cryptographic service is described as a vibrant, highly professional organization, at the forefront of the science of cryptography in the Late Renaissance. In his work from 1993, Alvarez concluded that by the 19th Century, “the reputation of papal cryptography, once so lustrous, has sadly faded.” However, until now, very little was known about the evolution of papal cryptography from the 16th to the 18th Century. In this article, we describe how we obtained a large collection of original papal ciphertexts from the Vatican archives, transcribed them, and how we were able to recover most of the keys, and to decipher the original plaintexts using novel cryptanalysis methods and the open-source e-learning CrypTool platform. The recovered keys and decipherments provide unique insights into papal cryptographic practices from the 16th to the 18th Century. The 16th Century is characterized by innovation and a high level of sophistication, with a primary focus on cryptographic security. From the 17th Century, only the simpler but also less secure forms of ciphers remain in use, and papal cryptography significantly lags behind other European states.
A widely shared recognition over the past decade is that the methodology and the basic concepts of science and technology studies (STS) can be used to analyze collaborations in the cross-disciplinary field of digital humanities (DH). The concepts of trading zones (Galison, 2010), boundary objects (Star and Griesemer, 1989), and interactional expertise (Collins and Evans, 2007) are particularly fruitful for describing projects in which researchers from massively different epistemic cultures (Knorr Cetina, 1999) are trying to develop a common language. The literature, however, primarily concentrates on examples where only two parties, historians and IT experts, work together. More exciting perspectives open up for analysis when more than two, more nuanced and different epistemic cultures seek a common language and common research goals. In the DECRYPT project funded by the Swedish Research Council, computational linguists, historians, computer scientists and AI experts, cryptologists, computer vision specialists, historical linguists, archivists, and philologists collaborate with strikingly different methodologies, publication patterns, and approaches. They develop and use common resources (including a database and a large collection of European historical texts) and tools (among others a code-breaking software, a hand-written text recognition tool for transcription), researching partly overlapping topics (handwritten historical ciphers and keys) to reach common goals. In this article, we aim to show how the STS concepts are illuminating when describing the mechanisms of the DECRYPT collaboration and shed some light on the best practices and challenges of a truly cross-disciplinary DH project.
We present an overview of instructions for the use of European historical cipher keys in early modern times. We describe the structure of instructions and the content presented to the key users. We exemplify various key instruction types and give a text edition of typical examples in various languages. The study is based on the analysis of more than 1,600 cipher keys collected from archives and libraries in ten European countries. We examine the practical implementation of cipher keys to the extent that instructions offer insights into everyday cryptographic practices. We focus on the typical rules scribes were expected to adhere to and the common errors they were instructed to avoid. We aim to reconstruct the apprehensions and considerations of the authors of cipher keys: They sought to offer assistance to users while likely harboring concerns regarding the potential misuse of their intellectual product. Given the secretive nature of cryptology, the documentation of knowledge transfer is scarce. In addition to the detailed manuals authored by well-known cryptologists, anonymous cipher key instructions offer valuable insights into this knowledge transfer process. By studying these instructions, historians gain direct access to a realm of knowledge that would otherwise remain hidden from their view.
Handwritten Text Recognition techniques, which aim to automatically identify and transcribe handwritten text, have been applied to historical sources including ciphers. In this paper, we compare the performance of two machine learning architectures: an unsupervised method based on clustering and a deep learning method with few-shot learning. Both models are tested on seen and unseen data from historical ciphers with different symbol sets consisting of various types of graphic signs. We compare the models and highlight their differences in performance, with their advantages and shortcomings.
The goal of this study is to investigate the structuring of speech in terms of prosodic boundaries in spontaneous dialogues in Swedish. In particular, the relation between boundaries as perceived by listeners, and their acoustic and linguistic realizations as uttered by the speakers, is examined.
This study investigates the structuring of speech in terms of prosodic boundaries. In particular, the relation between boundaries as perceived by the listeners, and their acoustic and linguistic realizations as uttered by speakers is examined.
The aim of this study is a systematic evaluation and comparison of four state-of-the-art data-driven learning algorithms applied to part of speech tagging of Swedish. The algorithms included in this study are Hidden Markov Model, Maximum Entropy, Memory-Based Learning, and Transformation-Based Learning. The systems are evaluated from several aspects. Both the effects of tag set and the effects of the size of training data are examined. The accuracy is calculated as well as the error rate for known and unknown tokens. The results show differences between the approaches due to the different linguistic information built into the systems.
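A small sketch of the evaluation protocol mentioned above (overall accuracy plus separate error rates for known and unknown tokens); the data format is an assumption made for the example.

```python
# Sketch of tagger evaluation: accuracy overall, error rates for known and
# unknown tokens. Assumes gold/predicted are aligned lists of (token, tag).
def evaluate_tagger(gold, predicted, training_vocabulary):
    """training_vocabulary: set of word forms seen in the training data."""
    correct = 0
    known_errors = unknown_errors = 0
    known_total = unknown_total = 0
    for (token, gold_tag), (_, predicted_tag) in zip(gold, predicted):
        hit = gold_tag == predicted_tag
        correct += hit
        if token in training_vocabulary:
            known_total += 1
            known_errors += not hit
        else:
            unknown_total += 1
            unknown_errors += not hit
    return {
        "accuracy": correct / len(gold),
        "known_error_rate": known_errors / known_total if known_total else None,
        "unknown_error_rate": unknown_errors / unknown_total if unknown_total else None,
    }
```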
In this paper well-known state-of-the-art data-driven algorithms are applied to part-of-speech tagging and shallow parsing of Swedish texts.
In this paper Brill's rule-based PoS tagger is tested and adapted for Hungarian. It is shown that the present system does not achieve as high accuracy for Hungarian as it does for English (and other Germanic languages) because of the structural differences between these languages. Hungarian, unlike English, has rich morphology, is agglutinative with some inflectional characteristics, and has fairly free word order. The tagger has the greatest difficulties with parts of speech belonging to open classes because of their complicated morphological structure. It is shown that the accuracy of tagging can be increased from approximately 83% to 97% by simply changing the rule-generating mechanisms, namely the lexical templates in the lexical training module.
Three data-driven algorithms are applied to shallow parsing of Swedish texts, using PoS taggers as the basis for parsing. The constituent structure is represented by nine types of phrases in a hierarchical structure containing labels for every constituent type the token belongs to. The results show that the best performance is obtained by training on the basis of PoS tags with labels marking the phrasal constituents, without considering the words themselves. Transformation-based learning gives the highest accuracy (94.44%), followed by the Maximum Entropy framework (mxpost) (92.47%) and the Hidden Markov model (TnT) (92.42%).
We investigate the relationship between prosodic phrase boundaries in terms of pausing and the linguistic structure on morpho-syntactic and discourse levels in spontaneous dialogues as well as in read aloud speech in Swedish. Both the speakers' production and the listeners' perception of pausing are considered and mapped to the linguistic structure.
Three data-driven publicly available part-of-speech taggers are applied to shallow parsing of Swedish texts. The phrase structure is represented by nine types of phrases in a hierarchical structure containing labels for every constituent type the token belongs to in the parse tree. The encoding is based on the concatenation of the phrase tags on the path from lowest to higher nodes. Various linguistic features are used in learning; the taggers are trained on the basis of lexical information only, part-of-speech only, and a combination of both, to predict the phrase structure of the tokens with or without part-of-speech. Special attention is directed to the taggers' sensitivity to different types of linguistic information included in learning, as well as the taggers' sensitivity to the size and the various types of training data sets. The method can be easily transferred to other languages.
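To illustrate the label encoding described above, here is a small sketch that assigns each token the concatenation of phrase labels from the lowest node upwards; the tree representation, the separator, the "O" tag for tokens outside any phrase, and the example sentence are assumptions, and details such as marking phrase beginnings are omitted.

```python
# Illustrative encoding sketch: each token receives the concatenation of the
# phrase labels on its path from the lowest phrase upwards (representation and
# separator are assumptions, not the exact scheme used in the paper).
def encode_tokens(tree, path=()):
    """tree: either a token string (leaf) or a (phrase_label, [children]) pair.
    Returns a list of (token, encoded_label) pairs."""
    if isinstance(tree, str):
        # Lowest phrase first, then the higher nodes, joined into one tag.
        return [(tree, "|".join(reversed(path)) or "O")]
    label, children = tree
    encoded = []
    for child in children:
        encoded.extend(encode_tokens(child, path + (label,)))
    return encoded

# Hypothetical example: a PP containing an NP.
example = ("PP", ["på", ("NP", ["det", "stora", "huset"])])
# encode_tokens(example) ->
# [('på', 'PP'), ('det', 'NP|PP'), ('stora', 'NP|PP'), ('huset', 'NP|PP')]
```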
HunPoS, a freely available open source part-of-speech tagger—a reimplementation of one of the best performing taggers, TnT—is applied to Swedish and evaluated when the tagger is trained on various sizes of training data. The tagger’s accuracy is compared to other data-driven taggers for Swedish. The results show that the tagging performance of HunPoS is as accurate as TnT and can be used efficiently to tag running text.