July 22, 2015

Geo Clustering 3,000,000 Points On The Fly: A Brief HOW TO

Why Clustering?

Showing a lot of markers on a map looks very messy. Markers may start to overlap and the user experience is beginning to suffer. That problem is solved by using clustering: combining markers that are close to each other and displaying a single marker containing the number of markers it represents. 

Source: http://goo.gl/V9JdU5
This can be done on the client side or on the server side. Client-side clustering has three major problems:

  1. For example there are 3 million points in your data base and the user has a very high zoom level on the map. Therefore, you would have to load all the data. Depending on your data storage loading all the data may be a problem. But for sure it becomes one if you have to handle a lot of requests. 
  2. The second problem is the transfer to the client. Even if your data set for each point is very small, the payload becomes huge if many data sets are transferred to the client. In our example it would be a couple of hundred kilobytes. Again depending on your infrastructure it may be a problem or it could be handled easily, but for sure it's going to be a bottle neck as traffic starts to increase.
  3. The last problem is displaying that huge number of points on the client. Tests showed us that if you try client-side clustering, this is only suitable for up to 1,000 points. After that it starts to become laggy on the user side because the client does not have enough resources. In that context, consider that there are still clients with 1048 MB RAM out there and don't forget to think about tablets and phones. Testing rendering performance with markers can be done e.g. with this tool. Rendering 500 Points on an Intel Core i7 takes about 200ms. Even that small number of points take some time to render. So you want to keep the number of points to show as small as possible.
The solution to these problems is server-side clustering. The clustering takes place at the point of data storage. 

Clustering data is actually a very complex topic. There are various methods and algorithms available that fit different needs. The best solution for clustering points is using the Euclidean distance as dimension. Informally spoken: group points that are close to each other.

One example of an algorithm for clustering geo points is k-means clustering. It is a popular approach in data mining - this algorithm clusters a number of observations into a given number of clusters. Each observation in this cluster belongs to the cluster with the nearest mean. From a complexity theory view point this is an NP-hard problem. Nevertheless, for our special purpose this can be solved in reasonable time (for those who love details: O(ndk+1 log n) with n is the number of data points, d is dimensions, k is the number of clusters - and there are heuristics that can do better). Solutions with this algorithm are available for e.g. PostgreSQL.

But it can be done in a simpler and faster way without advanced mathematics. To illustrate this point, it takes some basic understanding of the concept of geohashes.

Geohashes

Geohashes are another way of representing latitude and longitude values. For example 67,99 lat and -22,99 lng can be written as gkpdue5bt. Originally developed to provide a URL friendly coordinate representation it turned out that they are a very useful way of indexing geo points in databases or search engines like SOLR.

A wide range of tools already have native support for geo hashes e.g. Elastic Search or SOLR.

This blog post does a great job in explaining how geohash representation is visualised. The world is divided in a grid as shown in the next picture. Each rectangle gets a unique letter by which it can be referenced. At this level it is a very inaccurate location.

Source: http://goo.gl/y3oa5c
Narrowing it down is done by dividing the rectangles. As shown in the next picture, each region is divided into 32 cells. Each cell gets its own unique identifier that starts with the identifier of the parent area and ends with another letter. As we can see, the region referenced via DP is smaller than the region D.

Source: http://goo.gl/y3oa5c
Still it's not nearly as exact as normal latitude and longitude values. The concept of dividing and adding another identifier to the hash at the end is done up to 12 times. As shown in the next picture, each additional digit narrows it down more and more up to a specific point.

Source: http://goo.gl/y3oa5c
But it will never be a single point as compared to normal geo coordinates, and it doesn't have to be. This calculation already is close enough for most applications. The following table shows the size of the areas that are referenced with a geohash with a given length. At 12 digits, it's down to 4 cm which is feasible for geo clustering or even just showing points on a map.



Creating geohashes

If you are using Java there is a library in place that can handle the overall calculation for you. Add this maven dependency:

<dependency>
<groupId>ch.hsr</groupId>
<artifactId>geohash</artifactId>
<version>1.1.0</version>
</dependency>

The actual Java code is just a single line:

GeoHash geoHash = GeoHash.withCharacterPrecision(48.13,16.28), 5);

This generates a geo hash with a precision of 5 digits.

Clustering with geohashes

Having the geohashes concept in place, it can easily be used to cluster search results based on their geolocation. As this clustering is just a prefix search and a way of grouping or faceting, it can be performed by a wide range of tools. In this example we use SOLR as data storage.

Our infrastructure looks like the model in the following figure:




Our clients, e.g. web, mobile phones, etc,  do not know anything about how the clustering is done on the server. It is important to hide these details in order to be able to change the implementation. The clients request points for a certain location and zoom level. The web server then determines the geohash and the geohash level to request these from the data storage. It creates the corresponding prefix, does a prefix search and performs the grouping on the server. This is a very fast way of clustering results. Even if you have millions of records, this is a scalable implementation.

At first what is needed is a multivalued field that contains our geohases with different levels of precision. In SOLR we use a multivalued field that contains these values:

<field name="geohash" type="string" indexed="true" stored="true" required="false" multiValued="true"/>

The indexed value of some point with 48.1347389,16.28228 will be:

geohash : 
1_u 
2_u2 
3_u2e 
4_u2e9 
5_u2e9f 
6_u2e9fd 
7_u2e9fd8 
8_u2e9fd8m
9_u2e9fd8mh 
10_u2e9fd8mht 
11_u2e9fd8mhte 
12_u2e9fd8mhtev 

Each geohash is prefixed with its length. This makes it very easy to query the data. After you have indexed your data according to the configuration above you can execute a query like the following:

select?wt=json&fl=*&q=*:*&facet=true&rows=0&facet.field=geohash&facet.prefix=2_&facet.limit=-1

{
    "responseHeader": {
        "status": 0,
        "QTime": 2,
        "params": {
            "facet": "true",
            "fl": "*",
            "q": "*:*",
            "facet.limit": "-1",
            "facet.prefix": "2_",
            "facet.field": "geohash",
            "wt": "json",
            "rows": "0"
        }
    },
    "response": {
        "numFound": 29916,
        "start": 0,
        "docs": []
    },
    "facet_counts": {
        "facet_queries": {},
        "facet_fields": {
            "geohash": [
                "2_9q",
                1,
                "2_9u",
                1,
                "2_d7",
                1,
                ...
                58,
                "2_sw",
                1,
                "2_sx",
                2,
                "2_u0",
                292,
                "2_u2",
                29000,
                "2_u3",
                2
            ]
        }
    }
}

As you can see the result contains the geohash and the number of results that are found within. That is all information that is actually needed to be displayed on a map with clustered results.

All you need to do now is convert the geohashes back to lat/lng values that could be easily displayed on the front end. With Java this can be done with the GeoHash library:


GeoHash geoHash = GeoHash.fromGeohashString("sw");
clusterPoint.setLat(geoHash.getPoint().getLatitude());
clusterPoint.setLng(geoHash.getPoint().getLongitude());

This data can be sent to the front end and displayed. Taking the example from above, instead of sending 3,000,000 points to the front end we now only send about a dozen. That means less work load on the back end, the way to the user and on the client. All problems with clustering on the client are solved.

Summary

As is the case with many other topics, displaying points on a map can turn out to be tricky due to the amount of data you have to handle. Dealing with 100 data points or even 1,000 is no problem at all. But as the number of records is going to be in the millions, completely different solutions are needed. Server-side clustering may seem like a very difficult topic at first glance, but with the right tools and know-how it's going to be as simple as a prefix query on your favourite data storage.

2 comments:

  1. Clustering is a process of partitioning a set of data into a set of meaningful sub-classes, called clusters. Help users understand the natural grouping or structure in a data set. Used either as a stand-alone tool to get insight into data distribution or as a preprocessing step for other algorithms.

    ReplyDelete
  2. the set of the data english to spanish translation online is to used either tool to get data useful information

    ReplyDelete