Digitala Vetenskapliga Arkivet

Word2Vec: Optimal hyperparameters and their impact on natural language processing downstream tasks
Adewumi, Oluwatosin. Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0002-5582-2031
Liwicki, Foteini. Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0002-6756-0147
Liwicki, Marcus. Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab. ORCID iD: 0000-0003-4029-6574
2022 (English). In: Open Computer Science, E-ISSN 2299-1093, Vol. 12, no 1, p. 134-141. Article in journal (Refereed). Published.
Abstract [en]

Word2Vec is a prominent model for natural language processing tasks. Similar inspiration is found in the distributed embeddings (word vectors) of recent state-of-the-art deep neural networks. However, the wrong combination of hyperparameters can produce poor-quality embeddings. The objective of this work is to show empirically that an optimal combination of Word2Vec hyperparameters exists and to evaluate various combinations. We compare them with the publicly released, original Word2Vec embedding. Both intrinsic and extrinsic (downstream) evaluations are carried out, including named entity recognition and sentiment analysis. Our main contributions include showing that the best model is usually task-specific, that high analogy scores do not necessarily correlate positively with F1 scores, and that performance does not depend on data size alone. If ethical considerations, such as saving time, energy, and the environment, are taken into account, relatively small corpora may do just as well or, in some cases, even better. Increasing the embedding dimension beyond a certain point degrades quality or performance. In addition, using a relatively small corpus, we obtain better WordSim scores, corresponding Spearman correlations, and better downstream performance (with significance tests) than the original model, which was trained on a 100-billion-word corpus.
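
The abstract turns on tuning Word2Vec hyperparameters (architecture, embedding dimension, context window, negative sampling) and scoring the resulting embeddings intrinsically with WordSim and analogy tests. The sketch below shows what one such run could look like, assuming the gensim library and its bundled WordSim-353 and Google analogy test files; the file corpus.txt and all hyperparameter values are illustrative placeholders, not the article's actual data or grid.

# Minimal sketch, assuming gensim >= 4.x; values are illustrative only.
from gensim.models import Word2Vec
from gensim.test.utils import datapath
from gensim.utils import simple_preprocess

# Hypothetical corpus file, one sentence per line; the article trains on far larger corpora.
corpus = [simple_preprocess(line) for line in open("corpus.txt", encoding="utf-8")]

# A single hyperparameter combination: skip-gram with negative sampling,
# 300-dimensional vectors, window of 5. The study varies such settings.
model = Word2Vec(
    sentences=corpus,
    vector_size=300,   # embedding dimension; larger is not always better
    window=5,          # context window size
    sg=1,              # 1 = skip-gram, 0 = CBOW
    hs=0,              # no hierarchical softmax ...
    negative=5,        # ... use negative sampling instead
    epochs=5,
    workers=4,
)

# Intrinsic evaluation: WordSim-353 similarity (Spearman correlation) and the
# Google analogy test set, both shipped with gensim's test data.
pearson, spearman, oov = model.wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
analogy_score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"WordSim-353 Spearman: {spearman[0]:.3f}  analogy accuracy: {analogy_score:.3f}")

Repeating this over a grid of hyperparameter combinations, and adding downstream tasks such as named entity recognition and sentiment analysis, reproduces the shape of the comparison the abstract describes.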

Place, publisher, year, edition, pages
Walter de Gruyter, 2022. Vol. 12, no 1, p. 134-141
Keywords [en]
Word2Vec, hyperparameters, embeddings, named entity recognition, sentiment analysis
National Category
Language Technology (Computational Linguistics)
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-90107
DOI: 10.1515/comp-2022-0236
ISI: 000772580600001
Scopus ID: 2-s2.0-85127883356
OAI: oai:DiVA.org:ltu-90107
DiVA, id: diva2:1650577
Funder
Vinnova, 2019-02996
Note

Validated; 2022; Level 2; 2022-04-07 (sofila);

This article has previously appeared as a manuscript in a thesis as: Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks

Available from: 2022-04-07. Created: 2022-04-07. Last updated: 2022-10-28. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Search in DiVA

By author/editor
Adewumi, Oluwatosin; Liwicki, Foteini; Liwicki, Marcus
By organisation
Embedded Internet Systems Lab
Language Technology (Computational Linguistics)
