Using word2vec in clustering operons
Abstract. This article addresses the task of clustering operons (special units of genetic information) and describes how clustering can be used to identify groups of operons with similar functions. The specifics of the open operon databases used as sources of initial data for the study are considered. The authors describe the selection and preparation of data for clustering, the features of the clustering process, and its relationship to the approaches traditionally used for the analysis of natural languages. The quality and composition of the resulting groups are then analyzed. To convert the raw data into vectors, the classical implementation of the word2vec algorithm is used together with a number of features of the original data. The resulting representation is clustered with the DBSCAN algorithm using cosine distance. The novelty of the proposed method lies in applying algorithms that are non-standard for this kind of initial data. The approach works well on large amounts of data, does not require additional data labeling, and forms the factors for clustering on its own. The results show that the proposed approach can be used to implement services for the comparative analysis of bacterial genomes.
Keywords: clustering, DBSCAN, word embeddings, word2vec, machine learning, methods, algorithms, operons, natural language processing, open access databases
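The clustering step named in the abstract (density-based clustering over cosine distance) can be illustrated with a minimal sketch. This is not the authors' implementation. The operon names, the three-dimensional vectors, and the `eps` and `min_samples` values below are hypothetical, and the vectors are assumed to have already been produced by a word2vec-style embedding step. The DBSCAN routine is written by hand in plain Python only to keep the example self-contained; a library implementation would normally be used instead.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dbscan(vectors, eps, min_samples):
    """Return one label per vector: 0, 1, ... for clusters, -1 for noise."""
    n = len(vectors)
    # Precompute each point's eps-neighbourhood (a point is its own neighbour).
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing
    return labels


# Hypothetical operon embeddings (illustrative values, not real data).
operon_vectors = {
    "operon_A": [1.0, 0.1, 0.0],
    "operon_B": [0.9, 0.2, 0.0],
    "operon_C": [1.0, 0.0, 0.1],
    "operon_D": [0.0, 1.0, 0.1],
    "operon_E": [0.1, 0.9, 0.0],
    "operon_F": [0.0, 0.0, 1.0],
}

names = list(operon_vectors)
labels = dbscan([operon_vectors[name] for name in names], eps=0.05, min_samples=2)
for name, label in zip(names, labels):
    print(name, label)
```

With these toy values, the first three operons fall into one cluster, the next two into a second, and the last is marked as noise (label -1). Cosine distance compares only the direction of the vectors, which is why it is a common choice for word2vec-style embeddings.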
Article received: 28-01-2018
This article is written in Russian. The full text in Russian is available here.