Classification of ads from social networks: in search of a better solution

I'll tell you how text classification helped me find an apartment, and why I gave up on regular expressions and neural networks and switched to a lexical parser.
About a year ago I needed to find an apartment to rent. Most ads from individuals are published on social networks, where they are written in free form and there are no search filters. Manually reviewing publications in different communities takes a long time and is inefficient.
At the time there were several services that collected ads from social networks and published them on a website, so all the ads could be seen in one place. Unfortunately, they also had no filters by ad type or price. So after some time I decided to create a service with the functionality I needed.
Text classification
First attempt (RegExp)
At first I thought I'd solve the problem head-on with regular expressions.
Besides writing the regular expressions themselves, the results needed subsequent processing: counting the number of matches and their positions relative to each other. A problem arose with handling sentences: it was impossible to separate one sentence from another, so the text was processed all at once.
As the regular expressions and the result processing grew more complex, it became harder and harder to raise the percentage of correct answers on the test sample.
Regular expressions used in the tests:
- '/(room|\d.{0,10}rooms[^n])/u'
- '/(apartments\D{4})/u'
- '/(((^|\D)1\D{0,30} (\.|KK|kV)|single|ednos)|(apartments\D{0,3}1(\D).{0,10}room))/u'
- '/(((^|\D)2\D{0,30}(C\.|CC|CV)|duh.{0,5}K|dvos|duh.{5,10}(C\.|CC|CV))|(quarter\D{0,3}2(\D).{0,10}comnat))/u'
- '/(((^|\D)3\D{0,30}(C\.|CC|CV)|Tr(e|e)h{0,5}K|Tr(e|e)s|Tr(e|e)h{5,10}(C\.|CC|CV))|(quarter\D{0,3}3(\D).{0,10}comnat))/u'
- '/(((^|\D)4\D{0,30} (\.|KK|kV)|the four seasons\SX)|(apartments\D{0,3}4(\D).{0,10}room))/u'
- '/(Studio)/u'
- '/(isch.{1,5}neighbor)/u'
- '/(SDA|sat down|sat down|Liber(W|d))/u'
- '/(\?)$/u'
On the test set, this method gave 72.61% correct answers.
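The shape of this approach can be sketched in Go (illustrative only: the category names and English patterns below are my stand-ins, not the original Russian rules, and the real post-processing also weighed match counts and positions):

```go
package main

import (
	"fmt"
	"regexp"
)

// category pairs a rental-ad type with the pattern that detects it.
type category struct {
	name    string
	pattern *regexp.Regexp
}

// Illustrative patterns; the real rules were Russian-language and
// considerably more involved.
var categories = []category{
	{"studio", regexp.MustCompile(`(?i)studio`)},
	{"1-room", regexp.MustCompile(`(?i)(^|\D)1\D{0,10}room`)},
	{"2-room", regexp.MustCompile(`(?i)(^|\D)2\D{0,10}room`)},
	{"room", regexp.MustCompile(`(?i)\broom\b`)},
}

// classify returns the first category whose pattern matches, or "unknown".
func classify(text string) string {
	for _, c := range categories {
		if c.pattern.MatchString(text) {
			return c.name
		}
	}
	return "unknown"
}

func main() {
	fmt.Println(classify("Renting out a 2 room apartment near the subway"))
}
```

Even in this toy form, the ordering of the rules already matters; scaling this up is exactly where the approach became hard to maintain.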
Second attempt (Neural networks)
Lately it has become very fashionable to apply machine learning to just about anything. After training, it is difficult or even impossible to say why a network decided the way it did, but that does not prevent neural networks from being successfully applied to text classification. For the tests I used a multilayer perceptron trained with backpropagation.
The following ready-made neural network libraries were used:
- FANN, written in C
- Brain, written in JavaScript
Texts of different lengths had to be converted so that they could be fed to a neural network with a constant number of inputs. To do this, from all the texts of the test sample I selected the n-grams longer than 2 characters that occurred in more than 15% of the texts. There were a little more than 200 of them.
Example n-grams:
- /nye/u
- /C/u
- /DOB/u
- /con/u
- /gender/u
- /ale/u
- /two/u
- //u
- /I'll give/u
To classify an ad, the n-grams were searched for in its text and their positions were determined; this data was then fed to the input of the neural network, with the values scaled to the range 0 to 1.
On the test set, this method gave 77.13% correct answers (even though the tests were run on the same sample that was used for training).
I am sure that with a test set several orders of magnitude larger, and with recurrent networks, much better results could be achieved.
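The feature-extraction step described above can be sketched as follows (a simplified Go version under my own assumptions: character trigrams, the 15% document-frequency cutoff from the text, and values scaled to 0..1 by the position of the first occurrence):

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams returns all character n-grams of the given size in s.
func ngrams(s string, n int) []string {
	r := []rune(s)
	var out []string
	for i := 0; i+n <= len(r); i++ {
		out = append(out, string(r[i:i+n]))
	}
	return out
}

// selectFeatures keeps n-grams occurring in more than the given
// fraction of texts, mimicking the 15% cutoff from the article.
func selectFeatures(texts []string, n int, minFrac float64) []string {
	df := map[string]int{}
	for _, t := range texts {
		seen := map[string]bool{}
		for _, g := range ngrams(strings.ToLower(t), n) {
			if !seen[g] {
				seen[g] = true
				df[g]++
			}
		}
	}
	var feats []string
	for g, c := range df {
		if float64(c) > minFrac*float64(len(texts)) {
			feats = append(feats, g)
		}
	}
	return feats
}

// vectorize maps a text to values in [0,1]: 0 if the n-gram is absent,
// otherwise 1 minus the relative position of its first occurrence,
// so earlier matches produce larger network inputs.
func vectorize(text string, feats []string) []float64 {
	t := strings.ToLower(text)
	v := make([]float64, len(feats))
	for i, g := range feats {
		if idx := strings.Index(t, g); idx >= 0 {
			v[i] = 1 - float64(idx)/float64(len(t))
		}
	}
	return v
}

func main() {
	texts := []string{"rent a room", "room for rent", "studio for rent"}
	feats := selectFeatures(texts, 3, 0.15)
	fmt.Println(len(feats), vectorize("rent a room", feats))
}
```

The fixed-length vector produced by `vectorize` is what can be fed to a perceptron with a constant number of inputs.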
Third attempt (parser)
Around the same time I started reading more articles about natural language processing and came across Tomita, an excellent parser from Yandex. Its main advantage over similar programs is that it works with Russian and has quite intelligible documentation. You can use regular expressions in its configuration, which was great, because I had already written some.
In essence it is a much more advanced version of regular expressions: much more powerful and convenient. But here, too, I could not do without preprocessing the text. Text that users write on social networks often does not follow the grammatical and syntactic rules of the language, so the parser has difficulty processing it: splitting the text into sentences, splitting sentences into tokens, and reducing words to normal form.
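The preprocessing mentioned here could look roughly like this (a minimal sketch under my own assumptions: collapse whitespace and split on sentence punctuation; the real pipeline also reduced words to normal form, which is not shown):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	spaces    = regexp.MustCompile(`\s+`)
	sentences = regexp.MustCompile(`[.!?]+`)
)

// preprocess collapses whitespace and splits free-form ad text into
// rough sentences, so a downstream parser sees one clause at a time.
func preprocess(text string) []string {
	clean := spaces.ReplaceAllString(strings.TrimSpace(text), " ")
	var out []string
	for _, s := range sentences.Split(clean, -1) {
		if s = strings.TrimSpace(s); s != "" {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	fmt.Println(preprocess("Renting a room!!!   Near the subway. Call me"))
}
```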
Example configuration
#encoding "utf8"
#GRAMMAR_ROOT ROOT
Rent -> Word<kwset=[rent, populate]>;
Flat -> Word<kwset=[flat]> interp (+FactRent.Type="flat");
AnyWordFlat -> AnyWord<kwset=~[rent, populate, studio, flat, room, neighbor, search, number, numeric]>;
ROOT -> Rent AnyWordFlat* Flat { weight=1 };
All the configurations can be viewed here. On the test set, this method gave 93.40% correct answers. Besides classifying the text, it also extracted facts such as the rental price, the size of the apartment, the subway station, and the phone number.
Try the parser online
Request:
curl -X POST -d 'renting out a two-room apartment, 50.4 sqm, 30 thousand per month. phone +7 999 999 9999' 'http://api.socrent.ru/parse'
Answer:
{"type":2,"phone":["9999999999"],"area":50.4,"price":30000}
Types of ads:
0 — room
1 — 1-room apartment
2 — 2-room apartment
3 — 3-room apartment
4 — 4+ room apartment
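A client of this endpoint could decode the response like so (a sketch: the field names come from the JSON above, while the struct and helper are my own):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ParseResult mirrors the JSON returned by the parsing endpoint.
type ParseResult struct {
	Type  int      `json:"type"`  // ad type: 0 room .. 4 four-plus rooms
	Phone []string `json:"phone"` // extracted phone numbers
	Area  float64  `json:"area"`  // apartment size, sqm
	Price int      `json:"price"` // monthly rent
}

// decode parses the endpoint's JSON response into a ParseResult.
func decode(body string) (ParseResult, error) {
	var res ParseResult
	err := json.Unmarshal([]byte(body), &res)
	return res, err
}

func main() {
	res, err := decode(`{"type":2,"phone":["9999999999"],"area":50.4,"price":30000}`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d-room, %.1f sqm, %d/month\n", res.Type, res.Area, res.Price)
}
```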
In the end, given the small test set and the need for high accuracy, it turned out to be more effective to write the algorithms by hand.

Service development
In parallel with the text classification task, I wrote several services to collect ads and present them in a user-friendly form.
github.com/mrsuh/rent-view
The service responsible for display.
Written in NodeJS. Uses the doT.js template engine and MongoDB.
github.com/mrsuh/rent-collector
The service responsible for collecting ads. Written in PHP. Uses the Symfony3 framework and MongoDB.
It was written with the expectation of gathering data from different sources, but as it turned out, almost all ads are posted on the social network VKontakte. This social network has an excellent API, so it was not difficult to collect ads from the walls and discussions of public groups.
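Fetching a community wall through the VK API boils down to a `wall.get` call; building such a request can be sketched as below (the method and parameter names follow the public VK API, but the group id, token, and version are placeholders, and the collector's actual code may differ):

```go
package main

import (
	"fmt"
	"net/url"
)

// wallGetURL builds a VK API wall.get request for a community's wall.
// Method name and parameters follow the public VK API; the group id,
// token, and API version passed in are caller-supplied placeholders.
func wallGetURL(groupID int, token, version string, count int) string {
	q := url.Values{}
	q.Set("owner_id", fmt.Sprintf("-%d", groupID)) // negative id = community
	q.Set("count", fmt.Sprintf("%d", count))
	q.Set("access_token", token)
	q.Set("v", version)
	return "https://api.vk.com/method/wall.get?" + q.Encode()
}

func main() {
	fmt.Println(wallGetURL(123456, "TOKEN", "5.131", 100))
}
```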
github.com/mrsuh/rent-parser
The service responsible for classifying ads. Written in Golang. Uses the Tomita parser. In essence it is a wrapper around the parser, but it also performs preprocessing of the text and postprocessing of the parsing results.
For all services, CI is set up with Travis-CI, and deployment is automated with Ansible (I wrote about setting this up in this article).
Statistics
The service has been running for about two months for the city of Saint Petersburg and in that time has collected a little more than 8000 ads. Here are some interesting statistics on the ads for the entire period.
On average, 131.2 ads were added per day (more precisely, texts that were classified as ads).

The most active hour is 12 noon.

The most popular subway station is Devyatkino.

Conclusion: if you do not have a large sample on which to train a network, and you need high accuracy, it is best to use hand-written algorithms.
If anyone wants to solve a similar problem, here is a test set of 8000 texts and their types.