Word2Vec: classification of text documents
The distributional-semantics tool Word2Vec demonstrates impressive results, and using it consistently earns practitioners prize-winning places in computational linguistics contests. The advantage of the utility, and of its analogues GloVe and AdaGram, is the low cost of training and of preparing the training texts. But there are drawbacks: the vector representation works well on individual words, satisfactorily on word combinations, so-so on phrases, and not at all on long texts.
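For reference on how cheap the training really is, here is a minimal sketch using the gensim library. The corpus path and hyperparameters are illustrative assumptions, not the settings used in this work:

```python
# Minimal sketch of training Word2Vec with gensim.
# "corpus.txt" and all hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One pre-tokenized sentence per line; case and punctuation preserved.
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=5,      # ignore rare words
    workers=8,        # parallel training threads
    sg=1,             # skip-gram architecture
)
model.save("w2v.model")
```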
In this article I propose to discuss an approach that maps a text of any length to a vector, making it possible to compare texts (compute the distance between them), add them, and subtract them.
From vector representations to the semantic vector
Vector representations of words produced by Word2Vec have an interesting property: only the distances between vectors carry meaning, not the vectors themselves. In other words, decomposing the vector representation of a specific word into components and studying them looks like a hopeless task. First and foremost because training starts from random initial vectors, and moreover the training process itself is random. Its randomness stems from the stochastic learning scheme: the parallel training threads do not synchronize their updates with one another, a pure data race. Training quality suffers little from this race, while training speed grows very noticeably. Because of this randomness in the algorithm and the data, the vector representation of a word does not decompose into meaningful components and can only be used as a whole.
A negative consequence of these properties of the vector representation is the rapid degradation of the vectors under arithmetic operations on them. Adding the vectors of two words usually demonstrates the common ground between those words (if the words really are related in the real world), but attempts to increase the number of terms very quickly lead to the loss of any practically useful result. Adding up the words of a phrase, let alone a few sentences, no longer works. A different approach is required.
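To make the degradation concrete, here is a small sketch, assuming the gensim model trained above; the query words are illustrative and must be present in the vocabulary:

```python
# Sketch: querying neighbors of summed word vectors with gensim.
# Assumes the model trained earlier; the example words are illustrative.
from gensim.models import Word2Vec

model = Word2Vec.load("w2v.model")
wv = model.wv

# Two related words: the sum usually lands near their common ground.
print(wv.most_similar(positive=["car", "sport"], topn=5))

# Summing many words quickly washes any useful signal out.
many = ["car", "sport", "engine", "road", "driver", "race", "fuel"]
print(wv.most_similar(positive=many, topn=5))
```

With two related words the neighbors are usually sensible; as the list of terms grows, the neighbors stop meaning anything in particular.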
From a common-sense point of view, how would you describe an arbitrary text? You might try to name its subject, and perhaps say a few words about its style. Texts devoted to cars will obviously contain rather many occurrences of the word “car” and words close to it, may contain the word “sport”, the names of car brands, and so on. Texts on other subjects, on the other hand, will contain such words much more rarely or not at all. Thus, by listing a sufficient number of possible topics, we can compute statistics on the presence of each category's words in a text and obtain a semantic vector of the text: a vector in which each element expresses how strongly the text relates to the topic encoded by that element.
The style of the text, in turn, is also determined by statistical means: the filler words and speech patterns characteristic of the author, the way sentences begin, the placement of punctuation marks. Since during training we preserved the distinction between upper- and lowercase letters and did not strip punctuation from the text, the Word2Vec dictionary is full of tokens like “text,”, that is, words with an attached comma. Such tokens can be used to capture the author's style. Of course, reliably identifying a style requires a really huge text corpus, or at least a very distinctive style; nevertheless, telling a newspaper article from a forum post or a tweet is a snap.
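The article does not show the preprocessing itself, so the following is only a plausible sketch: a whitespace tokenizer that keeps case and leaves punctuation attached, which is exactly what produces dictionary entries like “text,”:

```python
# Sketch of whitespace tokenization that keeps case and punctuation
# attached to words, so "text," stays distinct from "text".
# The article does not show its preprocessing; this is an assumption.
def tokenize(line: str) -> list[str]:
    return line.split()

print(tokenize("like text, and Text."))
# ['like', 'text,', 'and', 'Text.']
```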
Thus, to construct the semantic vector of a text we need to describe a sufficient number of stable clusters that reflect the topic and style of the text. The Word2Vec utility itself has built-in clustering based on k-means, and we use it. The clustering divides all the words in the dictionary into a specified number of clusters, and if the number of clusters is large enough, we can expect each cluster to point to a relatively narrow topic of the text, or rather, to a narrow marker of a topic or style. In my task I used two thousand clusters. That is, the semantic vector of a text is two thousand elements long, and each element of the vector can be interpreted through the words of the corresponding cluster.
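The same clustering can be reproduced outside the utility. Here is a sketch using scikit-learn's k-means over the gensim vectors; the cluster count (two thousand) is from the text, everything else is an assumption:

```python
# Sketch: cluster the whole Word2Vec vocabulary into 2000 clusters,
# mirroring the k-means clustering built into the word2vec utility.
# Assumes the vocabulary holds well over 2000 words.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import MiniBatchKMeans

model = Word2Vec.load("w2v.model")
vectors = model.wv.vectors          # shape: (vocab_size, dim)

kmeans = MiniBatchKMeans(n_clusters=2000, random_state=0)
kmeans.fit(vectors)

centers = kmeans.cluster_centers_   # shape: (2000, dim)
np.save("centers.npy", centers)
```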
The relative density of words from each cluster in a given text describes that text well. Of course, every individual word is related to many clusters, to some more, to some less. Therefore we first compute the semantic vector of a word, as a vector of distances from the word to the center of each cluster in the Word2Vec vector space. Then, by adding up the semantic vectors of the individual words that make up the text, we obtain the semantic vector of the whole text.
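A sketch of this construction is given below. The article does not specify the distance metric or the normalization, so Euclidean distance and length normalization are my assumptions:

```python
# Sketch of the semantic vector construction described above.
# Euclidean distance and sum-then-normalize are assumptions; the
# article only says "distance to each cluster center, then add up".
import numpy as np

def word_semantic_vector(word_vec: np.ndarray, centers: np.ndarray) -> np.ndarray:
    # One element per cluster: distance from the word to that center.
    return np.linalg.norm(centers - word_vec, axis=1)

def text_semantic_vector(tokens, wv, centers) -> np.ndarray:
    # Add up the per-word semantic vectors over in-vocabulary tokens.
    sem = np.zeros(len(centers))
    for tok in tokens:
        if tok in wv:
            sem += word_semantic_vector(wv[tok], centers)
    # Normalize so texts of different lengths stay comparable (assumed).
    n = np.linalg.norm(sem)
    return sem / n if n > 0 else sem
```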
An algorithm based on computing the relative frequency of words that mark the relevant topics has the virtue of working on texts of any length: from a single word to infinity. However, as we know, it is hard to find a truly long text with one common theme; the topic of a text often drifts from its beginning to its end. A short text or message, on the contrary, cannot cover a variety of topics precisely because of its brevity. As a result, the semantic vector of a long text differs in character from that of a short text, where far fewer features are active but those features are expressed much more strongly. Text length is not taken into account explicitly, yet the algorithm reliably separates short and long texts in the vector space.
How to use the semantic vector of text?
Since each text is assigned a vector in the semantic space, we can compute the distance between any two texts as the cosine measure between their vectors. Given distances between texts, we can apply the kMeans algorithm to perform clustering or classification, only this time in the vector space of texts rather than of individual words. For example, if the task is to filter out of a stream of texts (news, forum posts, tweets, etc.) only those on a subject that interests us, we can prepare pre-labeled texts and, for each incoming text, compute the class toward which it gravitates most (the maximum of the cosine measure averaged over several best matches of each class; kMeans in its pure form).
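Here is a sketch of that classification rule, scoring each class by the averaged cosine of its best matches as described; the labeled vectors and the number of best matches per class are illustrative assumptions:

```python
# Sketch: nearest-class assignment by cosine measure over semantic
# vectors. "labeled" and top_n are illustrative assumptions.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(text_vec, labeled, top_n=3):
    # labeled: list of (class_name, semantic_vector) pairs.
    # Score each class by the mean cosine of its top_n best matches.
    scores = {}
    for cls in {c for c, _ in labeled}:
        sims = sorted((cosine(text_vec, v) for c, v in labeled if c == cls),
                      reverse=True)
        scores[cls] = float(np.mean(sims[:top_n]))
    return max(scores, key=scores.get)
```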
In this way we solved a rather complex problem of text classification over a large number of classes (several hundred), with significant differences between texts in style (different sources, lengths, even languages of the messages) and with thematic relatedness between classes (one text can often be relevant to several classes). Unfortunately, the specific figures of the obtained results are under NDA, but the overall effectiveness of the approach is as follows: 90% accuracy at 9% of classes, 99% accuracy at 44% of classes, 76% accuracy at 3% of classes. These results should be interpreted as follows: the classifier sorts all several hundred target classes by the estimated degree of conformity of the text to each class; if we then take the top 3% of classes, the target class is in that list with 76% probability, and for the top 9% of classes the probability already exceeds 90%. Without exaggeration, this is a result of amazing power, and of great practical value to the customer.
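To make the metric concrete: the exact class count is under NDA, so take a hypothetical 300 classes; then the quoted figures read as follows:

```python
# Worked example of the reported metric: recall within the top-k% of
# the ranked classes. 300 classes is a hypothetical stand-in, since
# the real count is under NDA; the percentages are from the article.
n_classes = 300
for pct, recall in [(3, 0.76), (9, 0.90), (44, 0.99)]:
    k = round(n_classes * pct / 100)
    print(f"top {pct}% = top {k} classes: "
          f"contains the target class with probability {recall:.2f}")
```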
I invite you to hear a more detailed report, with a full description of the algorithm, formulas, charts, and results, at the upcoming Dialogue conference.
The semantic vector as a feature vector
The semantic vector of a text, as already mentioned, consists of interpretable elements (nobody is going to make sense of all two thousand elements of the vector, but in principle it is possible). Yes, they are not independent, but they nevertheless form a ready-to-use feature vector that you can feed to your favorite general-purpose classifier: SVM, decision trees, or deep networks.
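For example, a sketch of feeding the semantic vectors to a linear SVM with scikit-learn; the data files here are purely illustrative:

```python
# Sketch: the 2000-element semantic vectors as features for a
# general-purpose classifier. The .npy files are illustrative
# assumptions, stand-ins for whatever storage you actually use.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X = np.load("text_semantic_vectors.npy")  # shape: (n_texts, 2000)
y = np.load("text_labels.npy")            # one class label per text

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```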
Conclusions
The method of converting a text of arbitrary length into a vector, built on the Word2Vec vector representations of words, really works and gives good results in text clustering and classification problems. The text features encoded by the semantic vector do not degrade as the text grows longer; on the contrary, they make it possible to differentiate long texts from one another more finely and cleanly separate texts of significantly different lengths. The total amount of computation is modest: about a month on an ordinary server.
I will be happy to answer your questions in the comments.