Digitala Vetenskapliga Arkivet

1 - 46 of 46
  • 1.
    Adams, Allison
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Learning with learner corpora: Using the TLE for native language identification (2017). In: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition, 2017, p. 1-7. Conference paper (Refereed)
    Abstract [en]

    This study investigates the usefulness of the Treebank of Learner English (TLE) when applied to the task of Native Language Identification (NLI). The TLE is effectively a parallel corpus of Standard/Learner English, as there are two versions; one based on original learner essays, and the other an error-corrected version. We use the corpus to explore how useful a parser trained on ungrammatical relations is compared to a parser trained on grammatical relations, when used as features for a native language classification task. While parsing results are much better when trained on grammatical relations, native language classification is slightly better using a parser trained on the original treebank containing ungrammatical relations.

    Download full text (pdf)
    fulltext
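
    As a rough illustration of how parser output can feed a native language classifier, the sketch below counts dependency-relation labels per essay and trains a linear model on them. It is a minimal sketch under assumed inputs (pre-parsed essays); the feature design and classifier choice are illustrative, not the paper's exact setup.

    ```python
    # Sketch: native language identification from dependency-relation features.
    # Assumes each essay is already parsed into (relation, head_relation) pairs;
    # feature design and classifier are illustrative, not the paper's exact setup.
    from collections import Counter
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def relation_features(parsed_essay):
        """Count relation labels and relation bigrams in one parsed essay."""
        feats = Counter()
        for sentence in parsed_essay:          # sentence: list of (rel, head_rel)
            for rel, head_rel in sentence:
                feats[f"rel={rel}"] += 1
                feats[f"bigram={head_rel}>{rel}"] += 1
        return feats

    def train_nli(parsed_essays, native_languages):
        """Fit a linear classifier mapping relation features to the writer's L1."""
        model = make_pipeline(DictVectorizer(), LinearSVC())
        model.fit([relation_features(e) for e in parsed_essays], native_languages)
        return model
    ```
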
  • 2. Baldwin, Timothy
    et al.
    Croft, William
    Nivre, Joakim
    Savary, Agata
    Stymne, Sara
    Vylomova, Ekaterina
    Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics (2023). In: Vol. 13, no. 5, p. 22-70. Article in journal (Other academic)
    Abstract [en]

    The Dagstuhl Seminar 23191 entitled “Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics” took place May 7–12, 2023. Its main objectives were to deepen the understanding of language universals and linguistic idiosyncrasy, to harness idiosyncrasy in treebanking frameworks in computationally tractable ways, and to promote a higher degree of convergence in universalism-driven initiatives to natural language morphology, syntax and semantics. Most of the seminar was devoted to working group discussions, covering topics such as: representations below and beyond word boundaries; annotation of particular kinds of constructions; semantic representations, in particular for multiword expressions; finding idiosyncrasy in corpora; large language models; and methodological issues, community interactions and cross-community initiatives. Thanks to the collaboration of linguistic typologists, NLP experts and experts in different annotation frameworks, significant progress was made towards the theoretical, practical and networking objectives of the seminar.

    Download full text (pdf)
    fulltext
  • 3.
    Cerniavski, Rafal
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. Conversy AB.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Multilingual Automatic Speech Recognition for Scandinavian Languages (2023). In: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) / [ed] Tanel Alumäe; Mark Fishel, Tartu: University of Tartu, 2023, p. 460-466. Conference paper (Refereed)
    Abstract [en]

    We investigate the effectiveness of multilingual automatic speech recognition models for Scandinavian languages by further fine-tuning a Swedish model on Swedish, Danish, and Norwegian. We first explore zero-shot models, which perform poorly across the three languages. However, we show that a multilingual model based on a strong Swedish model, further fine-tuned on all three languages, performs well for Norwegian and Danish, with a relatively low decrease in the performance for Swedish. With a language classification module, we improve the performance of the multilingual model even further.

    Download full text (pdf)
    fulltext
  • 4.
    Danilova, Vera
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    UD-MULTIGENRE: a UD-Based Dataset Enriched with Instance-Level Genre Annotations (2023). In: Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL) / [ed] Duygu Ataman, Association for Computational Linguistics, 2023, p. 253-267. Conference paper (Refereed)
    Abstract [en]

    Prior research on the impact of genre on cross-lingual dependency parsing has suggested that genre is an important signal. However, these studies suffer from a scarcity of reliable data for multiple genres and languages. While Universal Dependencies (UD), the only available large-scale resource for cross-lingual dependency parsing, contains data from diverse genres, the documentation of genre labels is missing, and there are multiple inconsistencies. This makes studies of the impact of genres difficult to design. To address this, we present a new dataset, UD-MULTIGENRE, where 17 genres are defined and instance-level annotations of these are applied to a subset of UD data, covering 38 languages. It provides a rich ground for research related to text genre from a multilingual perspective. Utilizing this dataset, we can overcome the data shortage that hindered previous research and reproduce experiments from earlier studies with an improved setup. We revisit a previous study that used genre-based clusters and show that the clusters for most target genres provide a mix of genres. We compare training data selection based on clustering and gold genre labels and provide an analysis of the results. The dataset is publicly available. (https://github.com/UppsalaNLP/UD-MULTIGENRE)

    Download full text (pdf)
    fulltext
  • 5.
    de Lhoneux, Miryam
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Arc-Hybrid Non-Projective Dependency Parsing with a Static-Dynamic Oracle (2017). In: IWPT 2017 15th International Conference on Parsing Technologies: Proceedings of the Conference, Pisa, Italy: Association for Computational Linguistics, 2017, p. 99-104. Conference paper (Refereed)
    Abstract [en]

    We extend the arc-hybrid transition system for dependency parsing with a SWAP transition that enables reordering of the words and construction of non-projective trees. Although this extension potentially breaks the arc-decomposability of the transition system, we show that the existing dynamic oracle can be modified and combined with a static oracle for the SWAP transition. Experiments on five languages with different degrees of non-projectivity show that the new system gives competitive accuracy and is significantly better than a system trained with a purely static oracle.

    Download full text (pdf)
    fulltext
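
    For readers unfamiliar with the transition system, the minimal sketch below spells out the mechanics of the four transitions (SHIFT, LEFT-ARC, RIGHT-ARC, SWAP) on a stack/buffer configuration. Preconditions, dependency labels, the static-dynamic oracle, and the neural scoring model from the paper are all omitted.

    ```python
    # Minimal sketch of the arc-hybrid transition system extended with SWAP.
    # Words are token indices; 0 is the artificial root. Oracle and scoring omitted.
    class Config:
        def __init__(self, n_words):
            self.stack = [0]                        # starts with the root only
            self.buffer = list(range(1, n_words + 1))
            self.arcs = []                          # collected (head, dependent) pairs

        def shift(self):                            # move buffer front onto the stack
            self.stack.append(self.buffer.pop(0))

        def left_arc(self):                         # attach stack top to buffer front
            dep = self.stack.pop()
            self.arcs.append((self.buffer[0], dep))

        def right_arc(self):                        # attach stack top to the item below it
            dep = self.stack.pop()
            self.arcs.append((self.stack[-1], dep))

        def swap(self):                             # move second-topmost stack item back to
            top = self.stack.pop()                  # the buffer, enabling word reordering
            moved = self.stack.pop()                # and hence non-projective trees
            self.stack.append(top)
            self.buffer.insert(0, moved)
    ```
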
  • 6.
    de Lhoneux, Miryam
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Old School vs. New School: Comparing Transition-Based Parsers with and without Neural Network Enhancement (2017). In: Proceedings of the 15th Treebanks and Linguistic Theories Workshop (TLT), 2017, p. 99-110. Conference paper (Refereed)
    Abstract [en]

    In this paper, we attempt a comparison between "new school" transition-based parsers that use neural networks and their classical "old school" counterpart. We carry out experiments on treebanks from the Universal Dependencies project. To facilitate the comparison and analysis of results, we only work on a subset of those treebanks. However, we carefully select this subset in the hope to have results that are representative for the whole set of treebanks. We select two parsers that are hopefully representative of the two schools; MaltParser and UDPipe and we look at the impact of training size on the two models. We hypothesize that neural network enhanced models have a steeper learning curve with increased training size. We observe, however, that, contrary to expectations, neural network enhanced models need only a small amount of training data to outperform the classical models but the learning curves of both models increase at a similar pace after that. We carry out an error analysis on the development sets parsed by the two systems and observe that overall MaltParser suffers more than UDPipe from longer dependencies. We observe that MaltParser is only marginally better than UDPipe on a restricted set of short dependencies.

  • 7.
    de Lhoneux, Miryam
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    What Should/Do/Can LSTMs Learn When Parsing Auxiliary Verb Constructions? (2019). In: CoRR, Vol. abs/1907.07950. Article in journal (Other academic)
    Abstract [en]

    This article is a linguistic investigation of a neural parser. We look at transitivity and agreement information of auxiliary verb constructions (AVCs) in comparison to finite main verbs (FMVs). This comparison is motivated by theoretical work in dependency grammar and in particular the work of Tesnière (1959) where AVCs and FMVs are both instances of a nucleus, the basic unit of syntax. An AVC is a dissociated nucleus; it consists of at least two words, and an FMV is its non-dissociated counterpart, consisting of exactly one word. We suggest that the representation of AVCs and FMVs should capture similar information. We use diagnostic classifiers to probe agreement and transitivity information in vectors learned by a transition-based neural parser in four typologically different languages. We find that the parser learns different information about AVCs and FMVs if only sequential models (BiLSTMs) are used in the architecture but similar information when a recursive layer is used. We find explanations for why this is the case by looking closely at how information is learned in the network and looking at what happens with different dependency representations of AVCs.

  • 8.
    de Lhoneux, Miryam
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    What Should/Do/Can LSTMs Learn When Parsing Auxiliary Verb Constructions? (2020). In: Computational Linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 46, no. 4, p. 763-784. Article in journal (Refereed)
    Abstract [en]

    There is a growing interest in investigating what neural NLP models learn about language. A prominent open question is the question of whether or not it is necessary to model hierarchical structure. We present a linguistic investigation of a neural parser adding insights to this question. We look at transitivity and agreement information of auxiliary verb constructions (AVCs) in comparison to finite main verbs (FMVs). This comparison is motivated by theoretical work in dependency grammar and in particular the work of Tesnière (1959), where AVCs and FMVs are both instances of a nucleus, the basic unit of syntax. An AVC is a dissociated nucleus; it consists of at least two words, and an FMV is its non-dissociated counterpart, consisting of exactly one word. We suggest that the representation of AVCs and FMVs should capture similar information. We use diagnostic classifiers to probe agreement and transitivity information in vectors learned by a transition-based neural parser in four typologically different languages. We find that the parser learns different information about AVCs and FMVs if only sequential models (BiLSTMs) are used in the architecture but similar information when a recursive layer is used. We find explanations for why this is the case by looking closely at how information is learned in the network and looking at what happens with different dependency representations of AVCs. We conclude that there may be benefits to using a recursive layer in dependency parsing and that we have not yet found the best way to integrate it in our parsers.

    Download full text (pdf)
    fulltext
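
    A diagnostic classifier of the kind used here is essentially a simple supervised probe trained on frozen parser vectors. The sketch below is an illustrative approximation (logistic regression, an arbitrary train/test split), not the authors' exact probing setup.

    ```python
    # Sketch of a diagnostic classifier ("probe"): a linear model trained on frozen
    # parser-internal vectors to test whether a property (e.g. agreement) is
    # linearly decodable. Probe design and split are illustrative assumptions.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def probe(vectors, labels):
        """Train a probe on parser vectors and return held-out accuracy."""
        X_train, X_test, y_train, y_test = train_test_split(
            vectors, labels, test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        return clf.score(X_test, y_test)
    ```
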
  • 9.
    de Lhoneux, Miryam
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Shao, Yan
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Basirat, Ali
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Kiperwasser, Eliyahu
    Bar-Ilan University.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Goldberg, Yoav
    Bar-Ilan University.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    From raw text to Universal Dependencies: look, no tags! (2017). In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Vancouver, Canada: Association for Computational Linguistics, 2017, p. 207-217. Conference paper (Refereed)
    Abstract [en]

    We present the Uppsala submission to the CoNLL 2017 shared task on parsing from raw text to universal dependencies. Our system is a simple pipeline consisting of two components. The first performs joint word and sentence segmentation on raw text; the second predicts dependency trees from raw words. The parser bypasses the need for part-of-speech tagging, but uses word embeddings based on universal tag distributions. We achieved a macroaveraged LAS F1 of 65.11 in the official test run and obtained the 2nd best result for sentence segmentation with a score of 89.03. After fixing two bugs, we obtained an unofficial LAS F1 of 70.49.

    Download full text (pdf)
    fulltext
  • 10.
    Della Corte, Giuseppe
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    IESTAC: English-Italian Parallel Corpus for End-to-End Speech-to-Text Machine Translation (2020). In: Proceedings of the First International Workshop on Natural Language Processing Beyond Text, 2020, p. 41-50. Conference paper (Refereed)
    Abstract [en]

    We discuss a set of methods for the creation of IESTAC: an English-Italian speech and text parallel corpus designed for the training of end-to-end speech-to-text machine translation models and publicly released as part of this work. We first mapped English LibriVox audiobooks and their corresponding English Gutenberg Project e-books to Italian e-books with a set of three complementary methods. Then we aligned the English and the Italian texts using both traditional Gale-Church based alignment methods and a recently proposed tool to perform bilingual sentences alignment computing the cosine similarity of multilingual sentence embeddings. Finally, we forced the alignment between the English audiobooks and the English side of our textual parallel corpus with a text-to-speech and dynamic time warping based forced alignment tool. For each step, we provide the reader with a critical discussion based on detailed evaluation and comparison of the results of the different methods.
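
    The embedding-based alignment step mentioned above can be approximated as in the sketch below, which greedily pairs English and Italian sentences by cosine similarity of multilingual sentence embeddings. The model name and the greedy strategy are illustrative assumptions; the paper used a dedicated alignment tool alongside Gale-Church methods.

    ```python
    # Sketch: greedy 1-1 sentence alignment by cosine similarity of multilingual
    # sentence embeddings. Model choice and greedy pairing are illustrative only.
    from sentence_transformers import SentenceTransformer, util

    def align_greedy(english_sents, italian_sents,
                     model_name="sentence-transformers/LaBSE"):
        model = SentenceTransformer(model_name)
        en = model.encode(english_sents, convert_to_tensor=True)
        it = model.encode(italian_sents, convert_to_tensor=True)
        sims = util.cos_sim(en, it)              # (len(en), len(it)) similarity matrix
        pairs, used = [], set()
        for i in range(len(english_sents)):
            j = int(sims[i].argmax())
            if j not in used:                    # keep only the best unused match
                used.add(j)
                pairs.append((english_sents[i], italian_sents[j], float(sims[i][j])))
        return pairs
    ```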

  • 11.
    Dürlich, Luise
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. RISE Research Institutes of Sweden, Kista, Sweden.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. RISE Research Institutes of Sweden, Kista, Sweden.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    What Causes Unemployment?: Unsupervised Causality Mining from Swedish Governmental Reports (2023). In: Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023), Association for Computational Linguistics, 2023, p. 25-29. Conference paper (Refereed)
    Abstract [en]

    Extracting statements about causality from text documents is a challenging task in the absence of annotated training data. We create a search system for causal statements about user-specified concepts by combining pattern matching of causal connectives with semantic similarity ranking, using a language model fine-tuned for semantic textual similarity. Preliminary experiments on a small test set from Swedish governmental reports show promising results in comparison to two simple baselines.

    Download full text (pdf)
    fulltext
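
    The two-stage approach described in the abstract (connective matching followed by semantic ranking against a user query) might be sketched as below. The Swedish connective list, the model name, and the scoring details are illustrative assumptions, not the paper's implementation.

    ```python
    # Sketch of a two-stage causal-statement search: (1) keep sentences containing
    # a causal connective, (2) rank the candidates by similarity to a query.
    # The connective list and the model name are illustrative assumptions.
    import re
    from sentence_transformers import SentenceTransformer, util

    CONNECTIVES = re.compile(r"\b(leder till|beror på|på grund av|orsakar|medför)\b", re.I)

    def search_causal(sentences, query,
                      model_name="KBLab/sentence-bert-swedish-cased"):
        candidates = [s for s in sentences if CONNECTIVES.search(s)]
        if not candidates:
            return []
        model = SentenceTransformer(model_name)
        sims = util.cos_sim(model.encode([query], convert_to_tensor=True),
                            model.encode(candidates, convert_to_tensor=True))[0]
        return sorted(zip(candidates, sims.tolist()), key=lambda x: -x[1])
    ```
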
  • 12.
    Dürlich, Luise
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. RISE.
    Reimann, Sebastian
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. Ruhr-Universität Bochum.
    Finnveden, Gustav
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. RISE.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Cause and Effect in Governmental Reports: Two Data Sets for Causality Detection in Swedish (2022). In: Proceedings of the First Workshop on Natural Language Processing for Political Sciences (PoliticalNLP), 2022, p. 46-55. Conference paper (Refereed)
    Abstract [en]

    Causality detection is the task of extracting information about causal relations from text. It is an important task for different types of document analysis, including political impact assessment. We present two new data sets for causality detection in Swedish. The first data set is annotated with binary relevance judgments, indicating whether a sentence contains causality information or not. In the second data set, sentence pairs are ranked for relevance with respect to a causality query, containing a specific hypothesized cause and/or effect. Both data sets are carefully curated and mainly intended for use as test data. We describe the data sets and their annotation, including detailed annotation guidelines. In addition, we present pilot experiments on cross-lingual zero-shot and few-shot causality detection, using training data from English and German.

    Download full text (pdf)
    fulltext
  • 13.
    Hardmeier, Christian
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nakov, Preslav
    Qatar Computing Research Institute.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Tiedemann, Jörg
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Versley, Yannick
    University of Heidelberg.
    Cettolo, Mauro
    Fondazione Bruno Kessler.
    Pronoun-Focused MT and Cross-Lingual Pronoun Prediction: Findings of the 2015 DiscoMT Shared Task on Pronoun Translation (2015). In: Proceedings of the Second Workshop on Discourse in Machine Translation (DiscoMT), Stroudsburg, PA: Association for Computational Linguistics, 2015, p. 1-16. Conference paper (Other academic)
    Abstract [en]

    We describe the design, the evaluation setup, and the results of the DiscoMT 2015 shared task, which included two subtasks, relevant to both the machine translation (MT) and the discourse communities: (i) pronoun-focused translation, a practical MT task, and (ii) cross-lingual pronoun prediction, a classification task that requires no specific MT expertise and is interesting as a machine learning task in its own right. We focused on the English–French language pair, for which MT output is generally of high quality, but has visible issues with pronoun translation due to differences in the pronoun systems of the two languages. Six groups participated in the pronoun-focused translation task and eight groups in the cross-lingual pronoun prediction task.

    Download full text (pdf)
    DiscoMTSharedTask
  • 14.
    Hardmeier, Christian
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Tiedemann, Jörg
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation (2013). In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, 2013, p. 193-198. Conference paper (Refereed)
    Abstract [en]

    We describe Docent, an open-source decoder for statistical machine translation that breaks with the usual sentence-by-sentence paradigm and translates complete documents as units. By taking translation to the document level, our decoder can handle feature models with arbitrary discourse-wide dependencies and constitutes an essential infrastructure component in the quest for discourse-aware SMT models.

    Download full text (pdf)
    ACL2013Demo
  • 15.
    Hardmeier, Christian
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Tiedemann, Jörg
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Smith, Aaron
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Anaphora Models and Reordering for Phrase-Based SMT (2014). In: Proceedings of the Ninth Workshop on Statistical Machine Translation, Association for Computational Linguistics, 2014, p. 122-129. Conference paper (Refereed)
    Abstract [en]

    We describe the Uppsala University systems for WMT14. We look at the integration of a model for translating pronominal anaphora and a syntactic dependency projection model for English–French. Furthermore, we investigate post-ordering and tunable POS distortion models for English–German.

    Download full text (pdf)
    WMT2014
  • 16.
    Håkansson, David
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för nordiska språk.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Östman, Carin
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för nordiska språk.
    Som om …: Stil och struktur hos komparativa subjunktionsfraser med konditional bisats [As if ...: Style and structure of comparative subjunction phrases with a conditional subordinate clause] (2024). In: Svenskans beskrivning 38: Förhandlingar vid trettioåttonde sammankomsten. Örebro 4–6 maj 2022, Del I / [ed] Denny Jansson; Ida Melander; Gustaw Westberg; Daroon Yassin Falk, Örebro: Örebro universitet, 2024, Vol. 38:1, p. 306-323. Conference paper (Refereed)
    Download full text (pdf)
    fulltext
  • 17. Karamolegkou, Antonia
    et al.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Investigation of Transfer Languages for Parsing Latin: Italic Branch vs. Hellenic Branch (2021). In: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Linköping University Electronic Press, 2021, p. 315-320. Conference paper (Refereed)
    Abstract [en]

    Choosing a transfer language is a crucial step in transfer learning. In much previous research on dependency parsing, related languages have successfully been used. However, when parsing Latin, it has been suggested that languages such as ancient Greek could be helpful. In this work we parse Latin in a low-resource scenario, with the main goal to investigate if Greek languages are more helpful for parsing Latin than related Italic languages, and show that this is indeed the case. We further investigate the influence of other factors including training set size and content as well as linguistic distances. We find that one explanatory factor seems to be the syntactic similarity between Latin and Ancient Greek. The influence of genres or shared annotation projects seems to have a smaller impact.

    Download full text (pdf)
    fulltext
  • 18.
    Lameris, Harm
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Whit’s the Richt Pairt o Speech: PoS tagging for Scots (2021). In: Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Association for Computational Linguistics, 2021, p. 39-48. Conference paper (Refereed)
    Abstract [en]

    In this paper we explore PoS tagging for the Scots language. Scots is spoken in Scotland and Northern Ireland, and is closely related to English. As no linguistically annotated Scots data were available, we manually PoS tagged a small set that is used for evaluation and training. We use English as a transfer language to examine zero-shot transfer and transfer learning methods. We find that training on a very small amount of Scots data was superior to zero-shot transfer from English. Combining the Scots and English data led to further improvements, with a concatenation method giving the best results. We also compared the use of two different English treebanks and found that a treebank containing web data was superior in the zero-shot setting, while it was outperformed by a treebank containing a mix of genres when combined with Scots data.

    Download full text (pdf)
    fulltext
  • 19.
    Loáiciga, Sharid
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nakov, Preslav
    Qatar Computing Research Institute.
    Hardmeier, Christian
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Tiedemann, Jörg
    University of Helsinki.
    Cettolo, Mauro
    Fondazione Bruno Kessler.
    Versley, Yannick
    LinkedIn.
    Findings of the 2017 DiscoMT Shared Task on Cross-lingual Pronoun Prediction (2017). In: Proceedings of the Third Workshop on Discourse in Machine Translation, 2017, article id 4801. Conference paper (Other academic)
    Abstract [en]

    We describe the design, the setup, and the evaluation results of the DiscoMT 2017 shared task on cross-lingual pronoun prediction. The task asked participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We further provided a lemmatized target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. The aim of the task was to predict, for each target-language pronoun placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the entire document.

    We offered four subtasks, each for a different language pair and translation direction: English-to-French, English-to-German, German-to-English, and Spanish-to-English. Five teams participated in the shared task, making submissions for all language pairs. The evaluation results show that all participating teams outperformed two strong n-gram-based language model-based baseline systems by a sizable margin.

    Download full text (pdf)
    fulltext
  • 20.
    Parks, Magdalena
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Karlgren, Jussi
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Plausibility Testing for Lexical Resources (2017). In: Proceedings of CLEF 2017, 2017, p. 132-137. Conference paper (Refereed)
    Abstract [en]

    This paper describes principles for evaluation metrics for lexical components and an implementation of them based on requirements from practical information systems.

  • 21.
    Ramisch, Carlos
    et al.
    Aix-Marseille Université.
    Savary, Agata
    University of Tours.
    Guillaume, Bruno
    LORIA/Inria Nancy.
    Waszczuk, Jakub
    University of Duesseldorf.
    Candito, Marie
    Paris Diderot University.
    Ashwini, Vaidya
    IIT Delhi.
    Barbu Mititelu, Verginica
    Romanian Academy.
    Bhatia, Archna
    Florida IHMC.
    Iñurrieta, Uxoa
    University of the Basque Country.
    Giouli, Voula
    Athena Research Center.
    Güngör, Tunga
    Boğaziçi University.
    Jiang, Menghan
    The Hong Kong Polytechnic University.
    Lichte, Timm
    University of Tübingen.
    Liebeskind, Chaya
    Jerusalem College of Technology.
    Monti, Johanna
    “L’Orientale” University of Naples.
    Ramisch, Renata
    The Interinstitutional Center for Computational Linguistics, Federal University of São Carlos.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Walsh, Abigail
    Dublin City University.
    Xu, Hongzhi
    Shanghai International Studies University.
    Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions (2020). In: Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, 2020, p. 107-118. Conference paper (Refereed)
    Abstract [en]

    We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.

    Download full text (pdf)
    fulltext
  • 22.
    Reimann, Sebastian
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. Ruhr-Universität Bochum.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Exploring Cross-Lingual Transfer to Counteract Data Scarcity for Causality Detection (2022). In: WWW '22: Companion Proceedings of the Web Conference 2022, New York, USA: Association for Computing Machinery (ACM), 2022, p. 501-508. Conference paper (Refereed)
    Abstract [en]

    Finding causal relations in text is an important task for many types of textual analysis. It is a challenging task, especially for the many languages with no or only little annotated training data available. To overcome this issue, we explore cross-lingual methods. Our main focus is on Swedish, for which we have a limited amount of data, and where we explore transfer from English and German. We also present additional results for German with English as a source language. We explore both a zero-shot setting without any target training data, and a few-shot setting with a small amount of target data. An additional challenge is the fact that the annotation schemes for the different data sets differ, and we discuss how we can address this issue. Moreover, we explore the impact of different types of sentence representations. We find that we have the best results for Swedish with German as a source language, for which we have a rather small but compatible data set. We are able to take advantage of a limited amount of noisy Swedish training data, but only if we balance its classes. In addition we find that the newer transformer-based representations can make better use of target language data, but that a representation based on recurrent neural networks is surprisingly competitive in the zero-shot setting.

    Download full text (pdf)
    fulltext
  • 23.
    Rizal, Arra’Di Nur
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data (2020). In: Proceedings of the 4th Workshop on Computational Approaches to Code Switching / [ed] Thamar Solorio, Monojit Choudhury, Kalika Bali, Sunayana Sitaram, Amitava Das & Mona Diab, 2020, p. 26-35. Conference paper (Refereed)
    Abstract [en]

    Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools, which are typically trained on monolingual corpora. In this paper, we explore and evaluate different types of word embeddings for Indonesian–English code-mixed text. We propose the use of code-mixed embeddings, i.e. embeddings trained on code-mixed text. Because large corpora of code-mixed text are required to train embeddings, we describe a method for synthesizing a code-mixed corpus, grounded in literature and a survey. Using sentiment analysis as a case study, we show that code-mixed embeddings trained on synthesized data are at least as good as cross-lingual embeddings and better than monolingual embeddings.

    Download full text (pdf)
    fulltext
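
    A heavily reduced version of the pipeline, synthesize code-mixed sentences and then train embeddings on them, could look like the sketch below. The dictionary-based substitution heuristic and its parameters are illustrative assumptions; the paper grounds its synthesis method in the literature and a survey.

    ```python
    # Sketch: synthesize Indonesian-English code-mixed sentences by probabilistic
    # lexical substitution, then train word embeddings on the synthetic corpus.
    # The substitution heuristic and all parameters are illustrative assumptions.
    import random
    from gensim.models import Word2Vec

    def synthesize(indonesian_sentences, id_to_en, switch_prob=0.3, seed=0):
        """Replace some Indonesian tokens with dictionary translations."""
        rng = random.Random(seed)
        corpus = []
        for tokens in indonesian_sentences:          # each sentence: list of tokens
            mixed = [id_to_en.get(t, t) if rng.random() < switch_prob else t
                     for t in tokens]
            corpus.append(mixed)
        return corpus

    def train_embeddings(code_mixed_corpus):
        return Word2Vec(sentences=code_mixed_corpus, vector_size=100,
                        window=5, min_count=2, workers=4)
    ```
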
  • 24.
    Ruby, Ahmed
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Hardmeier, Christian
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. IT Univ Copenhagen, Dept Comp Sci, Copenhagen, Denmark.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    A Mention-Based System for Revision Requirements Detection (2021). In: Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language, Association for Computational Linguistics, 2021, p. 58-63. Conference paper (Refereed)
    Abstract [en]

    Exploring aspects of sentential meaning that are implicit or underspecified in context is important for sentence understanding. In this paper, we propose a novel architecture based on mentions for revision requirements detection. The goal is to improve understandability, addressing some types of revisions, especially for the Replaced Pronoun type. We show that our mention-based system can predict replaced pronouns well on the mention-level. However, our combined sentence-level system does not improve on the sentence-level BERT baseline. We also present additional contrastive systems, and show results for each type of edit.

  • 25. Savary, Agata
    et al.
    Ben Khelil, Cherifa
    Ramisch, Carlos
    Giouli, Voula
    Barbu Mititelu, Verginica
    Hadj Mohamed, Najet
    Krstev, Cvetana
    Liebeskind, Chaya
    Xu, Hongzhi
    Jiang, Menghan
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Güngör, Tunga
    Pickard, Thomas
    Guillaume, Bruno
    Bhatia, Archna
    Butler, Alexandra
    Candito, Marie
    Gantar, Apolonija
    Iñurrieta, Uxoa
    Gatt, Albert
    Kovalevskaite, Jolanta
    Krek, Simon
    Lichte, Timm
    Ljubešic, Nikola
    Monti, Johanna
    Parra Escartín, Carla
    Shamsfard, Mehrnoush
    Stoyanova, Ivelina
    Vincze, Veronika
    Walsh, Abigail
    PARSEME Corpus Release 1.3 (2023). In: Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023) / [ed] Archna Bhatia; Kilian Evang; Marcos Garcia; Voula Giouli; Lifeng Han; Shiva Taslimipoor, Stroudsburg: Association for Computational Linguistics, 2023, p. 24-35. Conference paper (Refereed)
    Abstract [en]

    We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus represents 26 languages now. All monolingual corpora therein use Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts impact on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

    Download full text (pdf)
    fulltext
  • 26.
    Savary, Agata
    et al.
    Université Paris-Saclay.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Barbu Mititelu, Verginica
    Romanian Academy Research Institute for Artificial Intelligence.
    Schneider, Nathan
    Georgetown University.
    Ramisch, Carlos
    Aix Marseille University.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. RISE Research Institutes of Sweden, Sweden.
    PARSEME Meets Universal Dependencies: Getting on the Same Page in Representing Multiword Expressions (2023). In: Northern European Journal of Language Technology (NEJLT), ISSN 2000-1533, Vol. 9, no. 1. Article in journal (Refereed)
    Abstract [en]

    Multiword expressions (MWEs) are challenging and pervasive phenomena whose idiosyncratic properties show notably at the levels of lexicon, morphology, and syntax. Thus, they should best be annotated jointly with morphosyntax. In this position paper we discuss two multilingual initiatives, Universal Dependencies and PARSEME, addressing these annotation layers in cross-lingually unified ways. We compare the annotation principles of these initiatives with respect to MWEs, and we put forward a roadmap towards their gradual unification. The expected outcomes are more consistent treebanking and higher universality in modeling idiosyncrasy.

    Download full text (pdf)
    fulltext
  • 27.
    Smith, Aaron
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Bohnet, Bernd
    de Lhoneux, Miryam
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Shao, Yan
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    82 Treebanks, 34 Models: Universal Dependency Parsing with Multi-Treebank Models (2018). In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 2018, p. 113-123. Conference paper (Refereed)
  • 28.
    Smith, Aaron
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    de Lhoneux, Miryam
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    An Investigation of the Interactions Between Pre-Trained Word Embeddings, Character Models and POS Tags in Dependency Parsing (2018). In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2018, p. 2711-2720. Conference paper (Refereed)
    Abstract [en]

    We provide a comprehensive analysis of the interactions between pre-trained word embeddings, character models and POS tags in a transition-based dependency parser. While previous studies have shown POS information to be less important in the presence of character models, we show that in fact there are complex interactions between all three techniques. In isolation each produces large improvements over a baseline system using randomly initialised word embeddings only, but combining them quickly leads to diminishing returns. We categorise words by frequency, POS tag and language in order to systematically investigate how each of the techniques affects parsing quality. For many word categories, applying any two of the three techniques is almost as good as the full combined system. Character models tend to be more important for low-frequency open-class words, especially in morphologically rich languages, while POS tags can help disambiguate high-frequency function words. We also show that large character embedding sizes help even for languages with small character sets, especially in morphologically rich languages.

  • 29.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Cross-Lingual Domain Adaptation for Dependency Parsing (2020). In: Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories (TLT), 2020, p. 62-69. Conference paper (Refereed)
    Abstract [en]

    We show how we can adapt parsing to low-resource domains by combining treebanks across languages for a parser model with treebank embeddings. We demonstrate how we can take advantage of in-domain treebanks from other languages, and show that this is especially useful when only out-of-domain treebanks are available for the target language. The method is also extended to low-resource languages by using out-of-domain treebanks from related languages. Two parameter-free methods for applying treebank embeddings at test time are proposed, which give competitive results to tuned methods when applied to Twitter data and transcribed speech. This gives us a method for selecting treebanks and training a parser targeted at any combination of domain and language.

    Download full text (pdf)
    fulltext
  • 30.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    The Effect of Translationese on Tuning for Statistical Machine Translation (2017). In: Proceedings of the 21st Nordic Conference on Computational Linguistics, 2017, p. 241-246. Conference paper (Refereed)
    Abstract [en]

    We explore how the translation direction in the tuning set used for statistical machine translation affects the translation results. We explore this issue for three language pairs. While the results on different metrics are somewhat conflicting, using tuning data translated in the same direction as the translation systems tends to give the best length ratio and Meteor scores for all language pairs. This tendency is confirmed in a small human evaluation.

    Download full text (pdf)
    fulltext
  • 31.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Using a Grammar Checker and its Error Typology for Annotation of Statistical Machine Translation Errors (2013). In: Proceedings of the 24th Scandinavian Conference of Linguistics / [ed] Jani-Matti Tirkkonen, Esa Anttikoski, Joensuu, Finland: University of Eastern Finland, 2013, p. 332-344. Conference paper (Refereed)
  • 32.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Cancedda, Nicola
    Ahrenberg, Lars
    Generation of Compound Words in Statistical Machine Translation into Compounding Languages (2013). In: Computational Linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 39, no. 4, p. 1067-1108. Article in journal (Refereed)
    Abstract [en]

    In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, where compounds are split before training the SMT system, and merged after the translation step. This approach reduces sparsity in the training data, but runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, where compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order and show that it can lead to improvements both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and a translation with at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech based information in the translation process, in order to handle compounds.
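
    As a toy illustration of the split step in a split-merge pipeline, a vocabulary-driven greedy splitter might look like the sketch below. The real system scores splits with corpus statistics and part-of-speech information and learns how to merge compound parts back together; none of that is reproduced here.

    ```python
    # Toy sketch of compound splitting for a split-merge SMT pipeline: try to split
    # a word into two known parts, optionally dropping a linking "s". Real systems
    # score candidate splits with corpus statistics and POS information, and the
    # merging step is learned rather than a simple concatenation.
    def split_compound(word, vocabulary, min_part=3):
        for i in range(min_part, len(word) - min_part + 1):
            head, tail = word[:i], word[i:]
            if tail in vocabulary:
                if head in vocabulary:
                    return [head, tail]
                if head.endswith("s") and head[:-1] in vocabulary:  # linking element
                    return [head[:-1], tail]
        return [word]                                # leave unsplittable words intact
    ```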

  • 33.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    de Lhoneux, Miryam
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Smith, Aaron
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Parser Training with Heterogeneous Treebanks (2018). In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, 2018, p. 619-625. Conference paper (Refereed)
    Abstract [en]

    How to make the most of multiple heterogeneous treebanks when training a monolingual dependency parser is an open question. We start by investigating previously suggested, but little evaluated, strategies for exploiting multiple treebanks based on concatenating training sets, with or without fine-tuning. We go on to propose a new method based on treebank embeddings. We perform experiments for several languages and show that in many cases fine-tuning and treebank embeddings lead to substantial improvements over single treebanks or concatenation, with average gains of 2.0–3.5 LAS points. We argue that treebank embeddings should be preferred due to their conceptual simplicity, flexibility and extensibility.

    Download full text (pdf)
    fulltext
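
    The treebank-embedding idea can be pictured as a small input module: each token representation is a word embedding concatenated with an embedding of the treebank the sentence comes from, so one parser can be trained on several treebanks while being told which annotation convention each sentence follows. The module below is a minimal sketch with arbitrary dimensions, not the parser architecture from the paper.

    ```python
    # Sketch: input layer for a multi-treebank parser. Each token vector is the
    # word embedding concatenated with an embedding of the source treebank, so a
    # single model can learn treebank-specific preferences. Dimensions arbitrary.
    import torch
    import torch.nn as nn

    class TreebankAwareEmbedding(nn.Module):
        def __init__(self, vocab_size, n_treebanks, word_dim=100, tb_dim=12):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            self.tb_emb = nn.Embedding(n_treebanks, tb_dim)

        def forward(self, word_ids, treebank_id):
            # word_ids: (batch, seq_len); treebank_id: (batch,)
            words = self.word_emb(word_ids)
            tb = self.tb_emb(treebank_id).unsqueeze(1).expand(-1, word_ids.size(1), -1)
            return torch.cat([words, tb], dim=-1)  # (batch, seq_len, word_dim + tb_dim)
    ```
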
  • 34.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Hardmeier, Christian
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Tiedemann, Jörg
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Feature Weight Optimization for Discourse-Level SMT (2013). In: Proceedings of the Workshop on Discourse in Machine Translation (DiscoMT), Association for Computational Linguistics, 2013, p. 60-69. Conference paper (Refereed)
    Abstract [en]

    We present an approach to feature weight optimization for document-level decoding. This is an essential task for enabling future development of discourse-level statistical machine translation, as it allows easy integration of discourse features in the decoding process. We extend the framework of sentence-level feature weight optimization to the document-level. We show experimentally that we can get competitive and relatively stable results when using a standard set of features, and that this framework also allows us to optimize document- level features, which can be used to model discourse phenomena.

    Download full text (pdf)
    DiscoMT2013
  • 35.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Hardmeier, Christian
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Tiedemann, Jörg
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Tunable Distortion Limits and Corpus Cleaning for SMT (2013). In: Proceedings of the Eighth Workshop on Statistical Machine Translation, Association for Computational Linguistics, 2013, p. 225-231. Conference paper (Refereed)
    Abstract [en]

    We describe the Uppsala University system for WMT13, for English-to-German translation. We use the Docent decoder, a local search decoder that translates at the document level. We add tunable distortion limits, that is, soft constraints on the maximum distortion allowed, to Docent. We also investigate cleaning of the noisy Common Crawl corpus. We show that we can use alignment-based filtering for cleaning with good results. Finally, we investigate the effects of corpus selection for recasing.
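
    A tunable (soft) distortion limit can be pictured as a penalty feature: instead of forbidding long jumps, every source word by which a jump exceeds the limit adds to a feature value whose weight is tuned like any other. The sketch below only illustrates that idea; it is not Docent's actual feature implementation.

    def soft_distortion_penalty(source_spans, limit):
        """source_spans: source-side (start, end) spans of the phrases, in target order."""
        penalty = 0
        prev_end = 0
        for start, end in source_spans:
            jump = abs(start - prev_end)
            penalty += max(0, jump - limit)
            prev_end = end
        return penalty

    # Jumps of 6 and 9 source words exceed the limit of 4 by 2 and 5: penalty 7.
    print(soft_distortion_penalty([(0, 3), (9, 12), (3, 9)], limit=4))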

    Ladda ner fulltext (pdf)
    WMT2013
  • 36.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Pettersson, Eva
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Megyesi, Beáta
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Palmér, Anne
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för nordiska språk.
    Annotating Errors in Student Texts: First Experiences and Experiments2017Ingår i: Proceedings of Joint 6th NLP4CALL and 2nd NLP4LA Nodalida workshop, Göteborg, 2017, s. 47-60Konferensbidrag (Refereegranskat)
    Abstract [en]

    We describe the creation of an annotation layer for word-based writing errors for a corpus of student writings. The texts are written in Swedish by students between 9 and 19 years old. Our main purpose is to identify errors regarding spelling, split compounds and merged words. In addition, we also identify simple word-based grammatical errors, including morphological errors and extra words. In this paper we describe the corpus and the annotation process, including detailed descriptions of the error types and guidelines. We find that we can perform this annotation with substantial inter-annotator agreement, but that there are still some remaining issues with the annotation. We also report results on two pilot experiments regarding spelling correction and the consistency of downstream NLP tools, to exemplify the usefulness of the annotated corpus.
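
    Inter-annotator agreement of the kind reported here is often measured with Cohen's kappa for two annotators. The sketch below shows that computation on made-up labels; it is not the actual annotation data or necessarily the exact agreement measure used in the paper.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
        return (observed - expected) / (1 - expected)

    a = ["spelling", "split", "ok", "ok", "spelling"]
    b = ["spelling", "ok", "ok", "ok", "spelling"]
    print(round(cohens_kappa(a, b), 2))  # 0.67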

    Ladda ner fulltext (pdf)
    fulltext
  • 37.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Smith, Christian
    Department of Computer Science, Linköping University.
    On the Interplay between Readability, Summarization, and MTranslatability2012Ingår i: Proceedings of the Fourth Swedish Language Technology Conference (SLTC 2012), 2012, s. 70-71Konferensbidrag (Refereegranskat)
  • 38.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Svedjedal, Johan
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Historisk-filosofiska fakulteten, Litteraturvetenskapliga institutionen.
    Östman, Carin
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för nordiska språk.
    Språklig rytm i skönlitterär prosa. En fallstudie i Karin Boyes Kallocain2018Ingår i: Samlaren: Tidskrift för forskning om svensk och annan nordisk litteratur, ISSN 0348-6133, E-ISSN 2002-3871, Vol. 139, s. 128-161Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Sara Stymne, Department of Linguistics and Philology, Uppsala University

    Johan Svedjedal, Department of Literature, Uppsala University

    Carin Östman, Department of Scandinavian Languages, Uppsala University

    Linguistic Rhythm in Narrative Prose: the case of Karin Boye’s Kallocain (Språklig rytm i skönlitterär prosa. En fallstudie i Karin Boyes Kallocain)

    The concept of rhythm in prose is ambiguous, and there is no consensus on how to define it. In this work, we focus on linguistic rhythm, at word, sentence and paragraph levels. We adopt and slightly extend rhythm indicators used in previous research, and show that these can be calculated fully automatically, on a much larger scale than previously done.

    We adopt the Swedish poet and novelist Karin Boye’s (1900–41) novel Kallocain (1940) as a case study. It is an icily dystopian depiction of a totalitarian future, where the protagonist Leo Kall first embraces this system, but for various reasons later rebels against it. The peripety comes when he gives a public speech, questioning the State. It has been pointed out that, from precisely this point on, the novel is characterized by a much freer rhythm, and that Boye as an author took considerable interest in questions of linguistic rhythm. This paper sets out to test this hypothesis by applying sixteen indicators of linguistic rhythm in narrative prose, such as word length, sentence length, and ratio of punctuation.

    We first note that we can expect differences between narrative and dialogue and limit most of our study to the first-person narrative. We find significant differences, mainly in phrase and word lengths, between the parts before and after Leo Kall’s conversion. In a further investigation we note that there is also great variation among indicators within each part of the novel. We also show that machine learning can be used to differentiate small segments from each part of the novel, with higher accuracy than a random classifier. Finally, we undertake a small study of dialogue, which, however, is mainly inconclusive. In summary, we find some support for the claim that there is a rhythm break in Kallocain. We also believe that our study is important from a methodological point of view, since it provides a method for large-scale studies of prose rhythm in the future.
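
    Several of the word- and sentence-level indicators mentioned above (word length, sentence length, ratio of punctuation) can indeed be computed fully automatically with very little code. The sketch below is a simplified illustration; the paper's sixteen indicators and exact tokenization are not reproduced.

    import re

    def rhythm_indicators(text):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        tokens = re.findall(r"\w+|[^\w\s]", text)
        words = [t for t in tokens if re.match(r"\w", t)]
        punct = [t for t in tokens if not re.match(r"\w", t)]
        return {
            "mean_word_length": sum(len(w) for w in words) / len(words),
            "mean_sentence_length": len(words) / len(sentences),
            "punctuation_ratio": len(punct) / len(tokens),
        }

    print(rhythm_indicators("Jag vet det. Men jag vill inte veta det, inte nu."))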

    Ladda ner fulltext (pdf)
    Samlaren_2018_128-161.pdf
  • 39.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Tiedemann, Jörg
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Hardmeier, Christian
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Statistical Machine Translation with Readability Constraints2013Ingår i: Proceedings of the 19th Nordic Conference on Computational Linguistics (NODALIDA 2013), Linköping, Sweden: Linköping University Electronic Press, 2013, s. 375-386Konferensbidrag (Refereegranskat)
    Abstract [en]

    This paper presents experiments with document-level machine translation with readability constraints. We describe the task of producing simplified translations from a given source with the aim to optimize machine translation for specific target users such as language learners. In our approach, we introduce global features that are known to affect readability into a document-level SMT decoding framework. We show that the decoder is capable of incorporating those features and that we can influence the readability of the output as measured by common metrics. This study presents the first attempt of jointly performing machine translation and text simplification, which is demonstrated through the case of translating parliamentary texts from English to Swedish.
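
    A typical global readability measure for Swedish is LIX (average sentence length plus the percentage of words longer than six characters), which can be computed per candidate document and weighted in a document-level decoder's score. The sketch below illustrates the measure only; the exact feature set used in the paper is not reproduced.

    import re

    def lix(text):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        words = re.findall(r"\w+", text)
        long_words = [w for w in words if len(w) > 6]
        return len(words) / len(sentences) + 100 * len(long_words) / len(words)

    # A decoder can be biased towards more readable candidate documents, e.g. by
    # adding -weight * lix(candidate) to the document score.
    print(round(lix("Riksdagen beslutade om budgetpropositionen. Den antogs."), 1))  # 53.0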

  • 40. Stymne, Sara
    et al.
    Tiedemann, Jörg
    Nivre, Joakim
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Estimating Word Alignment Quality for SMT Reordering Tasks2014Ingår i: Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014, s. 275-286Konferensbidrag (Refereegranskat)
    Abstract [en]

    Previous studies of the effect of word alignment on translation quality in SMT generally explore link-level metrics only and mostly do not show any clear connections between alignment and SMT quality. In this paper, we specifically investigate the impact of word alignment on two pre-reordering tasks in translation, using a wider range of quality indicators than previously done. Experiments on German–English translation show that reordering may require alignment models different from those used by the core translation system. Sparse alignments with high precision on the link level, for translation units, and on the subset of crossing links, like intersected HMM models, are preferred. Unlike SMT performance, the desired alignment characteristics are similar for small and large training data for the pre-reordering tasks. Moreover, we confirm previous research showing that the fuzzy reordering score is a useful and cheap proxy for performance on SMT reordering tasks.
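
    Link-level alignment quality indicators such as precision, recall and alignment error rate (AER) can be computed directly from predicted links and gold sure/possible links, as in the sketch below (toy data; the fuzzy reordering score from the paper is not reproduced).

    def alignment_scores(predicted, sure, possible):
        """Each argument is a set of (source_index, target_index) links;
        'possible' should include all sure links."""
        prec = len(predicted & possible) / len(predicted)
        rec = len(predicted & sure) / len(sure)
        aer = 1 - (len(predicted & sure) + len(predicted & possible)) / (len(predicted) + len(sure))
        return prec, rec, aer

    pred = {(0, 0), (1, 2), (2, 1)}
    sure = {(0, 0), (2, 1)}
    poss = sure | {(1, 2), (1, 1)}
    print(alignment_scores(pred, sure, poss))  # (1.0, 1.0, 0.0)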

  • 41.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Östman, Carin
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för nordiska språk.
    SLäNDa: An Annotated Corpus of Narrative and Dialogue in Swedish Literary Fiction2020Ingår i: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, s. 826-834Konferensbidrag (Refereegranskat)
    Abstract [en]

    We describe a new corpus, SLäNDa, the Swedish Literary corpus of Narrative and Dialogue. It contains Swedish literary fiction, which has been manually annotated for cited materials, with a focus on dialogue. The annotation covers excerpts from eight Swedish novels written between 1879 and 1940, a period of modernization of the Swedish language. SLäNDa contains annotations for all cited materials that are separate from the main narrative, like quotations and signs. The main focus is on dialogue, for which we annotate speech segments, speech tags, and speakers. In this paper we describe the annotation protocol and procedure and show that we can reach a high inter-annotator agreement. In total, SLäNDa contains annotations of 44 chapters with over 220K tokens. The annotation identified 4,733 instances of cited material and 1,143 named speaker-speech mappings. The corpus is useful for developing computational tools for different types of analysis of literary narrative and speech. We perform a small pilot study where we show how our annotation can help in analyzing language change in Swedish. We find that a number of common function words have their modern version appear earlier in speech than in narrative.

    Ladda ner fulltext (pdf)
    fulltext
  • 42.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Östman, Carin
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för nordiska språk.
    SLäNDa Version 2.0: Improved and Extended Annotation of Narrative and Dialogue in Swedish Literature2022Ingår i: Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022), European Language Resources Association, 2022, s. 5324-5333Konferensbidrag (Refereegranskat)
    Abstract [en]

    In this paper, we describe version 2.0 of the SLäNDa corpus. SLäNDa, the Swedish Literary corpus of Narrative and Dialogue, now contains excerpts from 19 novels, written between 1809 and 1940. The main focus of the SLäNDa corpus is to distinguish between direct speech and the main narrative. In order to isolate the narrative, we also annotate everything else which does not belong to the narrative, such as thoughts, quotations, and letters. SLäNDa version 2.0 has a slightly updated annotation scheme from version 1.0. In addition, we added new texts from eleven authors and performed quality control on the previous version. We are specifically interested in different ways of marking speech segments, such as quotation marks, dashes, or no marking at all. To allow a detailed evaluation of this aspect, we added dedicated test sets to SLäNDa for these different types of speech marking. In a pilot experiment, we explore the impact of typographic speech marking by using these test sets, as well as artificially stripping the training data of speech markers.
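
    The "artificial stripping" experiment can be pictured as a small preprocessing step that removes the dashes and quotation marks used to mark speech. The sketch below is an illustrative guess at such a step, not the authors' actual preprocessing.

    import re

    # leading dash (hyphen, en dash or em dash) plus following space, or any double quote
    SPEECH_MARKERS = re.compile(r'^[\u2013\u2014-]\s*|["\u201c\u201d\u00ab\u00bb]')

    def strip_speech_markers(line):
        return SPEECH_MARKERS.sub("", line).strip()

    print(strip_speech_markers("– Jag kommer strax, sade hon."))
    print(strip_speech_markers("”Jag kommer strax”, sade hon."))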

    Ladda ner fulltext (pdf)
    fulltext
  • 43.
    Stymne, Sara
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Östman, Carin
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för nordiska språk.
    Håkansson, David
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för nordiska språk.
    Parser Evaluation for Analyzing Swedish 19th–20th Century Literature2023Ingår i: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) / [ed] Tanel Alumäe; Mark Fishel, Tartu: University of Tartu, 2023, s. 335-346Konferensbidrag (Refereegranskat)
    Abstract [en]

    In this study, we aim to find a parser for accurately identifying different types of subordinate clauses, and related phenomena, in 19th–20th-century Swedish literature. Since no test set is available for parsing from this time period, we propose a lightweight annotation scheme for annotating a single relation of interest per sentence. We train a variety of parsers for Swedish and compare evaluations on standard modern test sets and our targeted test set. We find clear trends in which parser types perform best on the standard test sets, but that performance is considerably more varied on the targeted test set. We believe that our proposed annotation scheme can be useful for complementing standard evaluations, with a low annotation effort.
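
    With a single annotated relation of interest per sentence, evaluation essentially reduces to checking whether the parser recovered exactly that dependency. The sketch below illustrates this with an assumed data format.

    def targeted_recall(gold_relations, predicted_parses):
        """gold_relations: one (head, dependent, label) triple per sentence;
        predicted_parses: one set of (head, dependent, label) triples per sentence."""
        hits = sum(rel in parse for rel, parse in zip(gold_relations, predicted_parses))
        return hits / len(gold_relations)

    gold = [(2, 4, "advcl")]
    pred = [{(2, 1, "nsubj"), (0, 2, "root"), (2, 4, "advcl")}]
    print(targeted_recall(gold, pred))  # 1.0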

    Ladda ner fulltext (pdf)
    fulltext
  • 44.
    You, Huiling
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Zhu, Xingran
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Uppsala NLP at SemEval-2021 Task 2: Multilingual Language Models for Fine-tuning and Feature Extraction in Word-in-Context Disambiguation2021Ingår i: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Association for Computational Linguistics, 2021, s. 150-156Konferensbidrag (Refereegranskat)
    Abstract [en]

    We describe the Uppsala NLP submission to SemEval-2021 Task 2 on multilingual and cross-lingual word-in-context disambiguation. We explore the usefulness of three pre-trained multilingual language models, XLM-RoBERTa (XLMR), Multilingual BERT (mBERT) and multilingual distilled BERT (mDistilBERT). We compare these three models in two setups, fine-tuning and as feature extractors. In the second case we also experiment with using dependency-based information. We find that fine-tuning is better than feature extraction. XLMR performs better than mBERT in the cross-lingual setting both with fine-tuning and feature extraction, whereas these two models give a similar performance in the multilingual setting. mDistilBERT performs poorly with fine-tuning but gives similar results to the other models when used as a feature extractor. We submitted our two best systems, fine-tuned with XLMR and mBERT.
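
    In the feature-extraction setup, a frozen multilingual encoder supplies a contextual vector for the target word in each sentence, and the two occurrences can then be compared, for instance with cosine similarity. The sketch below, using the Hugging Face transformers library, is illustrative: the model name and threshold are assumptions, and the submitted systems trained classifiers on top of such features rather than applying a raw threshold.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModel.from_pretrained("xlm-roberta-base")
    model.eval()

    def word_vector(sentence, char_start, char_end):
        enc = tok(sentence, return_tensors="pt", return_offsets_mapping=True)
        offsets = enc.pop("offset_mapping")[0]
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        # average the subword vectors whose character span overlaps the target word
        keep = [i for i, (s, e) in enumerate(offsets.tolist())
                if s < char_end and e > char_start]
        return hidden[keep].mean(dim=0)

    v1 = word_vector("He sat on the bank of the river.", 14, 18)
    v2 = word_vector("She deposited the money at the bank.", 31, 35)
    same_sense = torch.cosine_similarity(v1, v2, dim=0) > 0.6  # illustrative threshold
    print(bool(same_sense))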

  • 45.
    Černiavski, Rafal
    et al.
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Uppsala University at SemEval-2022 Task 1: Can Foreign Entries Enhance an English Reverse Dictionary?2022Ingår i: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Association for Computational Linguistics, 2022, s. 88-93Konferensbidrag (Refereegranskat)
    Abstract [en]

    We present the Uppsala University system for SemEval-2022 Task 1: Comparing Dictionaries and Word Embeddings (CODWOE). We explore the performance of multilingual reverse dictionaries as well as the possibility of utilizing annotated data in other languages to improve the quality of a reverse dictionary in the target language. We mainly focus on character-based embeddings. In our main experiment, we train multilingual models by combining the training data from multiple languages. In an additional experiment, using resources beyond the shared task, we use the training data in Russian and French to improve the English reverse dictionary using unsupervised embeddings alignment and machine translation. The results show that multilingual models can occasionally, but not consistently, outperform the monolingual baselines. In addition, we demonstrate an improvement of an English reverse dictionary using translated entries from the Russian training data set.
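
    A reverse dictionary in this setting maps a definition to the embedding of the word it defines, so that candidate words can be ranked by similarity to the predicted vector. The sketch below shows a minimal gloss-to-embedding regressor; dimensions, data and the character-based embeddings used in the paper are placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReverseDictionary(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, hidden=128, out_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, out_dim)

        def forward(self, gloss_ids):
            _, (h, _) = self.encoder(self.emb(gloss_ids))
            return self.out(h[-1])  # predicted embedding of the defined word

    model = ReverseDictionary(vocab_size=5000)
    pred = model(torch.randint(0, 5000, (8, 12)))   # batch of 8 glosses, 12 tokens each
    target = torch.randn(8, 256)                    # gold embeddings of the defined words
    loss = F.mse_loss(pred, target)                 # trained to reconstruct the embedding
    print(pred.shape, loss.item())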

  • 46.
    Šoštarić, Margita
    et al.
    University of Zagreb.
    Hardmeier, Christian
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Stymne, Sara
    Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi.
    Discourse-Related Language Contrasts in English-Croatian Human and Machine Translation2018Ingår i: Proceedings of the Third Conference on Machine Translation: Research Papers, 2018, s. 36-48Konferensbidrag (Refereegranskat)
    Abstract [en]

    We present an analysis of a number of coreference phenomena in English-Croatian human and machine translations. The aim is to shed light on the differences in the way these structurally different languages make use of discourse information and provide insights for discourse-aware machine translation system development. The phenomena are automatically identified in parallel data using annotation produced by parsers and word alignment tools, enabling us to pinpoint patterns of interest in both languages. We make the analysis more fine-grained by including three corpora pertaining to three different registers. In a second step, we create a test set with the challenging linguistic constructions and use it to evaluate the performance of three MT systems. We show that both SMT and NMT systems struggle with handling these discourse phenomena, even though NMT tends to perform somewhat better than SMT. By providing an overview of patterns frequently occurring in actual language use, as well as by pointing out the weaknesses of current MT systems that commonly mistranslate them, we hope to contribute to the effort of resolving the issue of discourse phenomena in MT applications.
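
    Pinpointing such phenomena in parallel data boils down to combining annotation on one side with word alignment to see how a construction is rendered on the other. The sketch below shows the alignment lookup on a toy English-Croatian pair; real input would come from parsers and an alignment tool.

    def aligned_tokens(src_index, alignment, tgt_tokens):
        """alignment: set of (source_index, target_index) links."""
        return [tgt_tokens[j] for i, j in alignment if i == src_index]

    en = ["She", "saw", "it"]
    hr = ["Vidjela", "ga", "je"]
    alignment = {(0, 0), (1, 0), (1, 2), (2, 1)}
    # the English pronoun "it" (index 2) is aligned to the Croatian clitic "ga"
    print(aligned_tokens(2, alignment, hr))  # ['ga']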

    Ladda ner fulltext (pdf)
    fulltext