ElasticSearch with MongoDB


When the indexes of a MongoDB database are not enough, it is better to rely on a dedicated product for search. In this post we talk about the integration between MongoDB and ElasticSearch as the basis for a Ruby on Rails service.
I’m working on a new project for a client who needs to manage a few terabytes of e-mail archives: about ten years of mailing list history, for which the Mailman web search engine has to be replaced with something more scalable and performant. The choice fell on MongoDB and ElasticSearch, and I’ll tell you what I learned.

The first problem is the size of the archive which, being considerable, requires a system that scales horizontally whenever more space is needed: here the battle-tested MongoDB is a safe bet. Activating replication and sharding at the same time gives you a sort of “RAID 10” at the database level.

Now to ElasticSearch: frankly, this is the first time I have used this component in production, so it is proving to be a continuous discovery.

For those not familiar with it, ElasticSearch is a search engine written in Java and based on Apache Lucene. Its features are similar to those of Apache Solr but, unlike Solr, it scales horizontally with greater ease.

The theory of operation of ElasticSearch is simple: once installed, the daemon listens on a port (9200 by default) and communicates with the outside world through a REST API, which you use to manage indexes, send data to be indexed, and perform searches.
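As a minimal sketch of that lifecycle, assuming a node on localhost:9200 and a hypothetical “emails” index (the names are illustrative, not from the project), the REST API can be driven from Ruby’s standard library alone:

    require "net/http"
    require "json"

    http = Net::HTTP.new("localhost", 9200)   # default ElasticSearch port

    # Index a document: PUT /<index>/<type>/<id> with a JSON body.
    doc = { subject: "Welcome to the list", body: "Hello everyone" }.to_json
    http.send_request("PUT", "/emails/email/1", doc,
                      "Content-Type" => "application/json")

    # Search it back: the query DSL is JSON too.
    query = { query: { match: { subject: "welcome" } } }.to_json
    res = http.send_request("POST", "/emails/_search", query,
                            "Content-Type" => "application/json")
    puts JSON.parse(res.body)["hits"]["total"]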

Here are the five things I learned while implementing the staging infrastructure with MongoDB and ElasticSearch:
  1. Integrating ElasticSearch with MongoDB: there are many guides on using a “river” to keep the two systems connected. Basically, the river monitors MongoDB’s oplog to update the ElasticSearch indexes. This solution has two problems: the oplog is produced by MongoDB’s replication system, so in a setup where both replication and sharding are active there are as many oplogs as there are shards, and the ElasticSearch river gets only a partial view of the new data; I also found it inefficient (slow). The solution was a Ruby on Rails gem (mongoid-elasticsearch) that updates the ElasticSearch indexes whenever the MongoDB database is modified (see the model sketch after this list).
  2. What to feed to ElasticSearch: of course there is no point in indexing binary content. In any case, I discovered that ElasticSearch is very sensitive to encoding, so it wants its data in UTF-8. To force the encoding to UTF-8 in Rails I used Marshal.dump($dato).force_encoding("ISO-8859-1").encode("UTF-8").
  3. Queries and filters: ElasticSearch analyzes query results to assign a relevance score, while filters do not, so the latter are much faster (a combined sketch follows the list).
  4. The syntax for complex queries is really verbose: it is JSON code, even if the gem “masks” it in Ruby style.
  5. The “term” and “terms” filters match exact, non-analyzed values: since the default analyzer lowercases terms at indexing time, remember to downcase the value you filter on.
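For point 1, here is a minimal sketch of the gem-based approach; the Email model and its fields are hypothetical, while elasticsearch! and the es.search wrapper are, as far as I recall, the gem’s documented entry points:

    # Gemfile: gem "mongoid-elasticsearch"
    class Email
      include Mongoid::Document
      include Mongoid::Elasticsearch

      field :subject, type: String
      field :body,    type: String

      # Registers the model with ElasticSearch; the gem hooks into
      # Mongoid's save/destroy callbacks, so every write to MongoDB
      # also updates the index, with no river involved.
      elasticsearch!
    end

    Email.create(subject: "Welcome", body: "Hello everyone")
    Email.es.search("welcome")   # query through the gem's wrapper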
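And for points 3 to 5, a sketch of a “filtered” query (ElasticSearch 1.x style) sent straight to the REST API, with hypothetical index and field names. The query part is scored, the filter part is a fast yes/no match, and the term filter value is downcased by hand because term filters are not analyzed; the raw JSON also shows the verbosity the gem normally hides:

    require "net/http"
    require "json"
    require "uri"

    search = {
      query: {
        filtered: {
          # Scored full-text match.
          query:  { match: { body: "upgrade instructions" } },
          # Unscored exact match: downcase it, since the indexed
          # terms were lowercased by the default analyzer.
          filter: { term: { list_name: "Mailman-Users".downcase } }
        }
      }
    }.to_json

    uri = URI("http://localhost:9200/emails/_search")
    res = Net::HTTP.post(uri, search, "Content-Type" => "application/json")
    puts JSON.parse(res.body)["hits"]["hits"]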

I hope these few travel notes will be useful for your own ElasticSearch implementations.
