
Using word2vec in clustering operons
Romashko Dmitrii Aleksandrovich

Graduate Student, Department of Mathematical Methods in Economics, Far Eastern Federal University

690091, Russia, Primorskii krai, g. Vladivostok, o. Russkii, poselok Ayaks - 10, kampus DVFU, Korpus D
Medvedev Aleksandr Yur'evich

Graduate Student, Department of Mathematical Methods in Economics, Far Eastern Federal University

690091, Russia, Primorskii krai, g. Vladivostok, o. Russkii, poselok Ayaks - 10, kampus DVFU, Korpus D


This article addresses the task of clustering operons (special units of genetic information) in order to identify groups of operons with similar functions. The authors examine the open operon databases used as sources of initial data for the study. They describe the selection and preparation of data for clustering, the specifics of the clustering process, and its relationship to the approaches traditionally used for natural language analysis. The quality and composition of the resulting groups are then analyzed. To convert the raw data into vectors, the classical implementation of the word2vec algorithm is used together with a number of features of the original data. The resulting representation is clustered with the DBSCAN algorithm using cosine distance. The novelty of the proposed method lies in applying algorithms that are not standard for this kind of initial data. The approach works well on large amounts of data, requires no additional data labeling, and forms the factors for clustering on its own. The results show that the proposed approach can be used to build services for comparative analysis of bacterial genomes.
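The clustering step named in the abstract (DBSCAN over cosine distance) can be illustrated with a minimal sketch. This is not the authors' code. The vectors below are hand-made stand-ins for word2vec embeddings, and the function name `cluster_operons` and the values of `eps` and `min_samples` are assumptions chosen only for this toy data. scikit-learn's `DBSCAN` is used because scikit-learn appears in the article's reference list.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical stand-ins for embeddings of seven operons. In a real
# pipeline these rows would come from a trained word2vec model.
vectors = np.array([
    [1.00, 0.00, 0.00],
    [0.90, 0.10, 0.00],
    [0.95, 0.05, 0.00],
    [0.00, 1.00, 0.00],
    [0.10, 0.90, 0.00],
    [0.05, 0.95, 0.00],
    [0.00, 0.00, 1.00],  # points in a direction of its own
])


def cluster_operons(embeddings, eps=0.05, min_samples=2):
    """Cluster embedding rows with DBSCAN using cosine distance.

    eps and min_samples are illustrative values, not values from the article.
    """
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    # fit_predict returns one integer label per row; -1 marks noise.
    return model.fit_predict(embeddings)


labels = cluster_operons(vectors)
print(labels)
```

On this toy input the first three rows share one label, the next three share another, and the last row is marked as noise (label -1). Cosine distance compares vector directions rather than magnitudes, which is a common choice for word embeddings.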

Keywords: clustering, DBSCAN, word embeddings, word2vec, machine learning, methods, algorithms, operons, natural language processing, open access databases





This article is written in Russian. The full text is available in Russian.

Brouwer R.W., Kuipers O.P., Hijum S.A. The relative value of operon predictions // Brief Bioinform.-2008.-P. 367–375.
Cao H., Ma Q., Chen X., Xu Y. DOOR: a prokaryotic operon database for genome analyses and functional inference // Brief Bioinform.-2017.-P. 8.
Ester M., Kriegel H.-P., Sander J., Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise // Proceedings of the Second International Conference on Knowledge Discovery and Data Mining / Ed. by Simoudis E., Han J., Fayyad U.M.-1996.-P. 226–231.
Goldberg Y., Levy O. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method // URL: (accessed 27.01.2018)
Loman N.J., Pallen M.J. Twenty years of bacterial genome sequencing // Nat Rev Microbiol.-2015.-P. 787–794.
Pertea M., Ayanbule K., Smedinghoff M., Salzberg S.L. Prediction of Operons in Microbial Genomes // Nucleic Acids Research.-2008.-P. 479–482.
Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. Distributed representations of words and phrases and their compositionality // Advances in Neural Information Processing Systems.-2013.-No. 26.-P. 3111–3119.
Pedregosa F. et al. Scikit-learn: Machine Learning in Python // JMLR.-2011.-P. 2825–2830.
Rehurek R., Sojka P. Software Framework for Topic Modelling with Large Corpora // Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.-2010.-P. 7.
van der Walt S., Colbert S.C., Varoquaux G. The NumPy Array: A Structure for Efficient Numerical Computation // Computing in Science & Engineering.-2011.-P. 22–30.
Taboada B., Ciria R., Martinez-Guerrer C.E., Merino E. ProOpDB: Prokaryotic Operon DataBase // Nucleic Acids Research.-2012.-No. 40.-P. 627–631.
Taboada B., Verde C., Merino E. High accuracy operon prediction method based on STRING database scores // Nucleic Acids Research.-2010.-No. 38.-P. 130.
McKinney W. Data Structures for Statistical Computing in Python // Proceedings of the 9th Python in Science Conference.-2010.-P. 51–56.