[RFC] GeoIP database auto update to provide the latest IP to geo location mapping in GeoIP processor #5856

heemin32 · 2023-01-13T19:36:50Z

The purpose of this RFC (request for comments) is to gather community feedback on a proposal to automatically update the GeoIP database used by the GeoIP processor.

Problem Statement

During data ingestion in an OpenSearch cluster, users need to add location information such as city name, country name, or coordinates for a given IP address. Because IP addresses are reassigned among organizations, the mapping from an IP address to a location keeps changing. To keep the location information for a given IP address accurate, the mapping data needs to be updated periodically. However, OpenSearch uses static mapping data that does not get updated.

Current State

OpenSearch has a GeoIP processor with which a user can add location data such as city name, country name, latitude/longitude, and more, based on an IP address in a document. OpenSearch uses GeoLite2 databases as the mapping data from IP address to location information; the bundled databases were provided by MaxMind on 2019/11/19.

OpenSearch gets the GeoLite2 Country, GeoLite2 City, and GeoLite2 ASN database files from a Maven repository and includes them in the build artifact. When a node starts, it prepares the list of available databases by reading the GeoLite2 databases from local disk. When the GeoIP processor is called for the first time after the node starts, it loads the appropriate database into memory and uses it. Users can put their own database in the $OS_CONFIG/ingest-geoip folder, either to override an existing database or to add a new one. However, users have to restart every node to reload the updated database files from disk.

MaxMind updates the GeoLite2 databases twice weekly, but OpenSearch users cannot benefit from these updates because there is no easy way to update the database automatically without restarting the nodes in a cluster.

Proposal

We want a feature in OpenSearch where the mapping data from IP address to location information is updated regularly without manual intervention, so that users get more accurate location information for an IP address during data ingestion with minimum effort.

Approach

We will have a free database distribution server that hosts the MaxMind database file. The file on the server will be updated regularly. The server will have a manifest file for each database.
A user calls an API to create a GeoIP policy with the URL of a manifest file.
The OpenSearch cluster will parse the manifest file, download the GeoIP database file, and make it ready to be used in a GeoIP processor.
While the OpenSearch cluster is preparing the GeoIP database for use by GeoIP processors, a request to create a GeoIP processor using the GeoIP policy will fail.
Once the GeoIP database is ready to be used by GeoIP processors, a user can create a GeoIP processor using the GeoIP policy name.
In the background, the OpenSearch cluster updates the GeoIP database at the given interval.
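The steps above can be sketched as a small state machine. This is only an illustration in Python, not the proposed implementation: the manifest fields (`url`, `updated_at`), the state names `PREPARING` and `AVAILABLE`, and every function name are hypothetical, since the RFC defers the actual API to #5860. The download itself is left out.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeoIpPolicy:
    """Hypothetical in-memory view of a GeoIP policy."""
    name: str
    manifest_url: str
    state: str = "PREPARING"                    # becomes "AVAILABLE" once a database is ready
    database_updated_at: Optional[int] = None   # timestamp of the database in use


def parse_manifest(text: str) -> dict:
    """Parse a manifest document and check the (assumed) required fields."""
    manifest = json.loads(text)
    for key in ("url", "updated_at"):
        if key not in manifest:
            raise ValueError(f"manifest is missing {key!r}")
    return manifest


def refresh(policy: GeoIpPolicy, manifest: dict) -> bool:
    """Run one update cycle; return True if a newer database was taken."""
    current = policy.database_updated_at
    if current is not None and manifest["updated_at"] <= current:
        return False  # nothing newer on the server
    # A real implementation would download manifest["url"] here.
    policy.database_updated_at = manifest["updated_at"]
    policy.state = "AVAILABLE"
    return True


def create_processor(policy: GeoIpPolicy) -> str:
    """Creating a processor fails until the policy's database is ready."""
    if policy.state != "AVAILABLE":
        raise RuntimeError(f"policy {policy.name!r} is not ready yet")
    return f"processor using policy {policy.name}"
```

Calling `refresh` on a schedule with the latest parsed manifest corresponds to the background update step; it is a no-op whenever the server has nothing newer.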

Data flow diagram

API design

#5860

Data format

Option1. MaxMind Format

One option is to keep using the MaxMind data format and the MaxMind SDK to read the database file, as we do today. A cluster manager node will download the GeoIP database from an external endpoint and store it in an index. It will notify all ingest nodes to download the new database file from the index. Once every ingest node is ready to use the new database file, the manager node will update a flag in an index. Each ingest node checks the flag in the index to decide whether to start using the new database.
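The flag-based switchover described above can be sketched as follows. This is a simplified, single-process model in Python under stated assumptions: the class and method names are hypothetical, the "flag" is modeled as a version number, and node failures or membership changes are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """Hypothetical model of the manager-side bookkeeping for one database."""
    ingest_nodes: frozenset
    active_version: int = 0       # the flag that ingest nodes poll
    candidate_version: int = 0    # newest version stored in the index
    ready: set = field(default_factory=set)

    def publish(self, version: int) -> None:
        """Manager stored a new database in the index and notified the nodes."""
        self.candidate_version = version
        self.ready.clear()

    def report_ready(self, node: str) -> None:
        """An ingest node finished downloading the candidate database."""
        if node not in self.ingest_nodes:
            raise ValueError(f"unknown ingest node: {node}")
        self.ready.add(node)
        # Flip the flag only after every ingest node has the new file.
        if self.ready == self.ingest_nodes:
            self.active_version = self.candidate_version

    def version_to_use(self) -> int:
        """What an ingest node reads when it checks the flag."""
        return self.active_version
```

Because the flag flips only after the last node reports in, no node starts using the new database while another node is still missing it.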

Pros

Small IP address to location information mapping data size (GeoLite2-City: 70 MB)
Just a few seconds to prepare data to be used by a GeoIP processor
Fast data ingestion time. (0.05ms/doc)

Cons

Dependency on a MaxMind format
Dependency on a MaxMind data

Possible future improvement

We can have our own binary format and SDK for the mapping data. This will remove a dependency on the MaxMind data format in OpenSearch.

Option2. OpenSearch Index(Preferred)

In this option, we will use an OpenSearch index. The OpenSearch cluster will download a file in CSV format from an external endpoint. After the download completes, it will put the data into an index in the OpenSearch cluster. ~~It will also create an index alias pointing to the newly created index.~~ Update: We are not going to use an alias but will query the index directly, because an alias, unlike a system index, can be modified or deleted by a user.

The index will have a single shard with auto_expand_replicas set to 0-all, so that a query against the index can be served within the same node, which keeps processing time low.
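As a rough illustration, the index settings described here could look like the body below. `index.number_of_shards` and `index.auto_expand_replicas` are existing OpenSearch index settings; the helper function name is made up for this sketch, and the mapping is omitted because the RFC does not specify it.

```python
import json


def geoip_index_settings() -> dict:
    """Build a create-index request body with one shard replicated to all data nodes."""
    return {
        "settings": {
            "index": {
                "number_of_shards": 1,
                # "0-all" lets the replica count grow and shrink with the
                # number of data nodes, so each data node holds a copy.
                "auto_expand_replicas": "0-all",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(geoip_index_settings(), indent=2))
```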

The GeoIP processor will query the index internally to populate location data during the ingest time.

CSV file format

The first line will contain the field name of each column.
The first column should be an IP range in CIDR format.

//Example
cidr latitude longitude city country 
1.0.0.0/24 -37.8333 145.2375 Seattle "United States"
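To make the file layout concrete, here is a minimal Python sketch that parses a file shaped like the example and looks up an IP address with a linear scan. It takes the example's space-separated layout literally (pass `delimiter=","` for a comma-separated file); the function names are hypothetical, and the real processor would query the index rather than scan a list.

```python
import csv
import io
import ipaddress

# Same rows as the example above, without the trailing space in the header.
SAMPLE = (
    "cidr latitude longitude city country\n"
    '1.0.0.0/24 -37.8333 145.2375 Seattle "United States"\n'
)


def load_ranges(text: str, delimiter: str = " "):
    """Return a list of (network, fields) pairs; the first column is the CIDR range."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    first_column = reader.fieldnames[0]
    ranges = []
    for row in reader:
        network = ipaddress.ip_network(row.pop(first_column))
        ranges.append((network, row))
    return ranges


def lookup(ranges, ip: str):
    """Return the fields of the first range containing ip, or None."""
    address = ipaddress.ip_address(ip)
    for network, fields in ranges:
        if address in network:
            return fields
    return None


if __name__ == "__main__":
    print(lookup(load_ranges(SAMPLE), "1.0.0.42"))
```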

Pros

No dependency on MaxMind data format.
Can use a GeoIP database from a provider other than MaxMind, for example, the IP2Location database provided by Hexasoft. We cannot provide a free database distribution server for the IP2Location database due to its license, but users can set up their own server.
Can benefit from future indexing performance improvements in OpenSearch out of the box.

Cons

Larger data size than Option 1. (GeoLite2-City: 400MB in a segment file which is 3.5 times larger than option1)
Slower to prepare data for use in a processor due to indexing time. (GeoLite2-City: 300 seconds)
Slower data ingestion than Option 1. (0.059ms/doc) Ingestion is even slower if a cluster has separate ingest nodes that are not data nodes. (0.071ms/doc)
Repeated indexing load on the cluster during each database update.
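For a sense of scale, the per-document times quoted for the two options can be turned into relative slowdowns. The short Python sketch below only does that arithmetic on the figures from this RFC; the function and constant names are made up for illustration.

```python
def relative_slowdown(baseline_ms: float, candidate_ms: float) -> float:
    """Fraction by which candidate_ms exceeds baseline_ms."""
    return (candidate_ms - baseline_ms) / baseline_ms


OPTION1_MS = 0.05                   # MaxMind format
OPTION2_MS = 0.059                  # OpenSearch index
OPTION2_SEPARATE_INGEST_MS = 0.071  # OpenSearch index, dedicated ingest nodes

if __name__ == "__main__":
    for label, ms in [
        ("Option 2", OPTION2_MS),
        ("Option 2 with separate ingest nodes", OPTION2_SEPARATE_INGEST_MS),
    ]:
        print(f"{label}: {relative_slowdown(OPTION1_MS, ms):.0%} slower than Option 1")
```

With these figures, Option 2 is about 18% slower per document, and about 42% slower when ingest nodes are separate from data nodes.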

Possible future improvement

For an index created by a GeoIP policy, we can store it on ingest nodes. This will prevent an increase in query latency when a cluster has ingest nodes separate from data nodes.
We can generate the index using the Lucene library directly to reduce indexing time.

Questions to community

Do you want to use your own GeoIP database, or any GeoIP database other than the free GeoLite2 database provided by MaxMind?
Which GeoIP database have you used in OpenSearch: GeoLite2-City, GeoLite2-Country, GeoLite2-ASN, or all of them?
How frequently do you want to update the GeoIP database? Once a day or once a week?
Do you have separate ingest nodes from data nodes in the cluster?

Implementations


navneet1v · 2023-01-14T02:17:47Z

> The GeoIP processor will query the index internally to populate location data during the ingest time.

If there are separate ingest nodes and data nodes, what will the search look like, given that the data is not present on the ingest node? Which node will receive the search request?

Will there be a separate RFC with details on the index mapping and the query used to get location details for an IP?

heemin32 · 2023-01-14T03:24:11Z

> If there are separate ingest nodes and data nodes, what will the search look like, given that the data is not present on the ingest node? Which node will receive the search request?

It will use the internal client to query the index. Therefore, routing will be decided in the same way as for any other index query.

> Will there be a separate RFC with details on the index mapping and the query used to get location details for an IP?

I will share implementation details later in a separate document.

hdhalter · 2023-03-17T22:35:31Z

Hi @heemin32 , if this feature requires documentation, can you please create a documentation issue for it? Thanks!

heemin32 · 2023-03-20T16:32:32Z

> Hi @heemin32 , if this feature requires documentation, can you please create a documentation issue for it? Thanks!

Thanks @hdhalter. Created an issue.

hdhalter · 2023-03-20T16:50:42Z

Thanks! Will you be adding it to the unified backlog project?

heemin32 · 2023-03-20T17:05:15Z

~~What is the unified backlog project?~~
Added. Thanks!

dblock · 2023-03-24T17:39:55Z

I like the idea to use an OpenSearch index a lot more than introducing an additional store or format. Will comment on #5860.

wbeckler · 2023-03-24T18:33:44Z

If we are using an OpenSearch index, maybe it makes sense to store the data as an index snapshot.

heemin32 · 2023-03-24T19:12:44Z

With an index snapshot, there is a chance of incompatibility between versions. Also, having to generate an index snapshot makes it hard for users to set up their own server for an isolated region.

heemin32 · 2023-06-16T22:15:02Z

After implementation, we ran a performance test and found a large performance degradation compared to the existing GeoIP processor. The following is a summary of the benchmark results.

Tool: https://github.com/opensearch-project/opensearch-benchmark
Machine(OpenSearch cluster): Single docker container in m5.xlarge with 10G heap size.
Machine(Benchmark tool): m5.xlarge
Workload: http_logs
Setting: default (5000 bulk size and 100% of data)

Processor	Throughput
Legacy GeoIP	~~140,969~~ docs/s(Update: The value is with benchmark tool 0.5.0. With version 1.0.0, the throughput is 53,611 docs/s)
New GeoIP	16,631 docs/s

There are three reasons for this low performance.
1. We make a search call inside the ingest process for each document. This consumes search threads, and requests start to be rejected at a certain point because search threads are not released quickly enough to handle all the incoming ingestion requests. This can become a bigger problem when there is search traffic: ingest and search requests will compete for the limited search threads, and ingestion throughput will decrease further.
2. Because we make a search call for every document, there is overhead from request/response serialization and deserialization.
3. The Lucene library is slower at searching IP ranges than the MaxMind library. See the library-level benchmark results below.

Tool: https://github.com/maxmind/MaxMind-DB-Reader-java/blob/main/sample/Benchmark.java
Machine: Mac M1 processor with 32GB memory
Data: GeoLite2-City
For Lucene, we constructed a segment file using the Lucene library before running the benchmark.

MaxMind	Lucene
220K (850K with cache) searches/second	36K searches/second

Therefore, we will look for another way to overcome this performance degradation. Most likely, we will use the MaxMind data format and its library for now, and leave room to support other formats in the future.

dblock · 2023-06-20T20:46:36Z

What is the IP distribution in your data set? Would an in-memory size-bound cache suffice to overcome this?

nandi-github · 2023-06-20T22:16:43Z

What is the size of the entire IP-Geo mapping file?

heemin32 · 2023-06-20T23:47:02Z

> What is the size of the entire IP-Geo mapping file?

Depends on what format we use. For MaxMind binary format, City data is 70MB, Country data is 6MB, ASN data is 8MB as of today.

nandi-github · 2023-06-21T02:06:48Z

> > What is the size of the entire IP-Geo mapping file?
>
> Depends on what format we use. For MaxMind binary format, City data is 70MB, Country data is 6MB, ASN data is 8MB as of today.

To run the mapping, can we assume a worst-case memory requirement of ~100MB? If so, can we evaluate keeping it in memory for performance?

heemin32 · 2023-06-21T02:19:32Z

> What is the IP distribution in your data set? Would an in-memory size-bound cache suffice to overcome this?

With cache size of 10,000 items

file	record count	cache miss count	cache hit ratio
documents-181998.json	2,708,746	47,636	98.24%
documents-191998.json	9,697,882	133,047	98.62%
documents-201998.json	13,053,463	145,562	98.88%
documents-211998.json	17,647,279	167,819	99.04%
documents-221998.json	10,716,760	99,268	99.07%
documents-231998.json	11,961,342	99,138	99.17%
documents-241998.json	181,463,624	1,043,508	99.42%

With cache size of 1,000 items

file	record count	cache miss count	cache hit ratio
documents-181998.json	2,708,746	62,591	97.69%
documents-191998.json	9,697,882	224,963	97.68%
documents-201998.json	13,053,463	264,461	97.97%
documents-211998.json	17,647,279	304,869	98.27%
documents-221998.json	10,716,760	191,528	98.21%
documents-231998.json	11,961,342	223,639	98.13%
documents-241998.json	181,463,624	19,161,527	89.44%
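The hit ratios in the tables above are 1 - (cache miss count / record count). A size-bound cache that produces such counters can be sketched as below. This is an illustration in Python, not the code used for the measurements: the thread does not say which eviction policy was used, so least-recently-used (LRU) eviction is an assumption, and all names are hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Size-bound cache with LRU eviction that counts hits and misses."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        """Return the cached value for key, calling loader(key) on a miss."""
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = loader(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        return value

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In the processor, `key` would be the IP address and `loader` the expensive lookup (the index query or the database read), so each cache hit avoids one search call.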

heemin32 · 2023-06-21T16:43:57Z

I changed the code to make the search call synchronously, which makes the ingestion thread wait until the search completes. After the change, the error rate (too many requests) went to zero. Also, with caching, the throughput is close to that of the legacy GeoIP processor.

Method	Throughput
GeoIP + no cache	46K docs/s
GeoIP + cache 1,000	53K docs/s
IP2Geo(Index) + no cache	13K docs/s
IP2Geo(Index) + cache 1,000	57K docs/s
IP2Geo(Index) + cache 10,000	78K docs/s

dreamer-89 · 2023-07-19T19:16:45Z

@heemin32 @vamshin : Is this on track for the 2.9 release? If not, please update the labels accordingly.

DarshitChanpura · 2023-09-08T17:08:17Z

@heemin32 Should this issue be labelled for 2.11?

heemin32 · 2023-09-08T17:10:44Z

It will be released in 2.10. Closing it.

heemin32 added enhancement Enhancement or improvement to existing feature or request untriaged labels Jan 13, 2023

minalsha added RFC Issues requesting major changes and removed untriaged labels Jan 13, 2023

heemin32 mentioned this issue Jan 13, 2023

[RFC] GeoIP database auto update - API design #5860

Closed

heemin32 mentioned this issue Mar 3, 2023

About GeoIP processor opensearch-project/geospatial#41

Closed

heemin32 changed the title ~~[RFC]GeoIP database auto update~~ [RFC]GeoIP database auto update to provide the latest IP to geo location mapping in GeoIP processorGeoIP database auto update Mar 3, 2023

heemin32 changed the title ~~[RFC]GeoIP database auto update to provide the latest IP to geo location mapping in GeoIP processorGeoIP database auto update~~ [RFC] GeoIP database auto update to provide the latest IP to geo location mapping in GeoIP processor Mar 3, 2023

This was referenced Mar 7, 2023

[Feature]GeoIP datasource implementation #6559

Closed

[Feature]GeoIP datasource integration in GeoIP processor #6560

Closed

heemin32 mentioned this issue Mar 15, 2023

[RFC] GeoIP database auto update to provide the latest IP to geo location mapping in GeoIP processor opensearch-project/geospatial#241

Closed

vamshin added the Geospatial label Mar 15, 2023

vamshin assigned heemin32 Mar 15, 2023

vamshin added the v2.8.0 'Issues and PRs related to version v2.8.0' label Mar 15, 2023

heemin32 mentioned this issue Mar 20, 2023

[DOC]GeoIP processor opensearch-project/documentation-website#3524

Closed

4 tasks

heemin32 mentioned this issue Mar 29, 2023

Move Job Scheduler plugin to core (modules) opensearch-project/job-scheduler#147

Open

vamshin added v2.9.0 'Issues and PRs related to version v2.9.0' and removed v2.8.0 'Issues and PRs related to version v2.8.0' labels May 25, 2023

heemin32 mentioned this issue May 26, 2023

Blocking snapshot of given index patterns #7778

Closed

vamshin added v2.10.0 and removed v2.9.0 'Issues and PRs related to version v2.9.0' labels Jul 19, 2023

This was referenced Jul 21, 2023

IP2Geo processor implementation opensearch-project/geospatial#360

Closed

IP2Geo processor implementation opensearch-project/geospatial#362

Merged

gdiazlo mentioned this issue Aug 22, 2023

Wazuh-indexer ingest-geoip module is using outdated maxmind's databases. wazuh/wazuh-packages#2008

Closed

heemin32 closed this as completed Sep 8, 2023
