[RFC] GeoIP database auto update to provide the latest IP to geo location mapping in GeoIP processor #5856

Closed
heemin32 opened this issue Jan 13, 2023 · 19 comments
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Geospatial RFC Issues requesting major changes v2.10.0

Comments

@heemin32
Contributor

heemin32 commented Jan 13, 2023

The purpose of this RFC (request for comments) is to gather community feedback on a proposal to automatically update the GeoIP database used by the GeoIP processor.

Problem Statement

There is a need to add location information, such as city name, country name, or coordinates, for a given IP address during data ingestion in an OpenSearch cluster. Because the assignment of IP addresses to organizations changes over time, the mapping from an IP address to location information keeps changing by nature. Therefore, to keep the location information for a given IP address accurate, the mapping data needs to be updated periodically. However, OpenSearch uses static mapping data that does not get updated.

Current State

OpenSearch has a GeoIP processor with which a user can add location data such as city name, country name, latitude/longitude, and more, based on an IP address in a document. As its mapping data from IP address to location information, OpenSearch uses the GeoLite2 databases that MaxMind provided on 2019/11/19.
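
For context, a pipeline that uses the existing GeoIP processor could look roughly like the sketch below. The description text and the `client_ip` and `geo` field names are made up for illustration; `field`, `target_field`, and `database_file` are the processor options this sketch assumes are available.

```python
import json

# Request body for an ingest pipeline with one GeoIP processor.
# "client_ip" and "geo" are hypothetical field names chosen for this example.
pipeline = {
    "description": "Add location data based on the client IP",
    "processors": [
        {
            "geoip": {
                "field": "client_ip",                    # source field holding the IP address
                "target_field": "geo",                   # where the location data is written
                "database_file": "GeoLite2-City.mmdb",   # bundled (static) database
            }
        }
    ],
}

print(json.dumps(pipeline, indent=2))
```

Because `database_file` points at a file shipped with the build, the results stay as old as that file until the nodes are restarted with a newer one.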

OpenSearch gets the GeoLite2 Country, GeoLite2 City, and GeoLite2 ASN database files from a Maven repository and includes them in the build artifact. When a node starts, it prepares the list of available databases by reading the GeoLite2 databases from the local disk. When the GeoIP processor is called for the first time after the node starts, it loads the appropriate database into memory and uses it. Users can put their own database in the $OS_CONFIG/ingest-geoip folder, either to override an existing database or to add a new one. However, users have to restart every node to reload the updated database files from disk.

MaxMind updates the GeoLite2 databases twice weekly, but OpenSearch users cannot benefit from the updates because there is no easy way to update the database automatically without restarting the nodes in a cluster.

Proposal

We want OpenSearch to have a feature that regularly updates the mapping data from IP address to location information without manual intervention, so that a user gets more accurate location information for an IP address during data ingestion with minimal effort.

Approach

  1. We will have a free database distribution server that hosts the MaxMind database files. The files on the server will be updated regularly. The server will have a manifest file for each database.
  2. A user calls an API to create a GeoIP policy with the URL of a manifest file.
  3. The OpenSearch cluster parses the manifest file, downloads the GeoIP database file, and makes it ready to be used in a GeoIP processor.
  4. While the OpenSearch cluster is preparing the GeoIP database, a request to create a GeoIP processor using the GeoIP policy will fail.
  5. Once the GeoIP database is ready, a user can create a GeoIP processor using the GeoIP policy name.
  6. In the background, the OpenSearch cluster updates the GeoIP database at the given interval.
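
The behavior in steps 4 and 5 can be pictured as a small state check. The sketch below is only an illustration: the class names, the state names, the `policy` option on the processor, and the example URL are all hypothetical, and the real API is discussed in #5860.

```python
from enum import Enum


class PolicyState(Enum):
    PREPARING = "preparing"   # database is still being downloaded and prepared
    AVAILABLE = "available"   # database is ready for GeoIP processors


class PolicyNotReadyError(Exception):
    """Raised when a processor is created before the database is ready."""


class GeoIpPolicy:
    def __init__(self, name, manifest_url, update_interval_days):
        self.name = name
        self.manifest_url = manifest_url
        self.update_interval_days = update_interval_days
        self.state = PolicyState.PREPARING  # step 3 has not finished yet

    def mark_database_ready(self):
        # Called once the manifest is parsed and the database is downloaded.
        self.state = PolicyState.AVAILABLE


def create_processor(policy, field):
    # Steps 4 and 5: creation fails until the policy's database is ready.
    if policy.state is not PolicyState.AVAILABLE:
        raise PolicyNotReadyError(f"policy '{policy.name}' is still preparing its database")
    return {"geoip": {"field": field, "policy": policy.name}}


policy = GeoIpPolicy("city", "https://example.com/manifest.json", update_interval_days=3)
try:
    create_processor(policy, "client_ip")
except PolicyNotReadyError as err:
    print("rejected:", err)

policy.mark_database_ready()
print(create_processor(policy, "client_ip"))
```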

Data flow diagram

[Screenshot: data flow diagram]

API design

#5860

Data format

Option 1. MaxMind Format

One option is to keep using the MaxMind data format and the MaxMind SDK to read the database file, as we do today. A cluster manager node will download the GeoIP database from an external endpoint and store it in an index. It will then notify all ingest nodes to download the new database file from the index. Once every ingest node is ready to use the new database file, the cluster manager node will update a flag in an index. Each ingest node checks the flag in the index to decide whether to start using the new database.
[Screenshot for Option 1]
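
The flag-based switchover described for this option can be simulated in a few lines. This is a sketch under assumptions: the class, the version strings, and the dictionary standing in for the flag stored in an index are all hypothetical, and the real coordination would go through cluster APIs rather than in-process calls.

```python
class IngestNode:
    def __init__(self, name):
        self.name = name
        self.active_db = "db-v1"     # database currently used by processors
        self.downloaded_db = None    # newer database fetched but not yet active

    def download(self, version):
        # Fetch the new database from the index and keep it on standby.
        self.downloaded_db = version

    def is_ready(self, version):
        return self.downloaded_db == version

    def refresh(self, flag):
        # Switch only when the shared flag names a version this node has downloaded.
        if flag == self.downloaded_db:
            self.active_db = flag


def roll_out(nodes, version, flag_holder):
    for node in nodes:
        node.download(version)
    # Flip the flag only after every ingest node has the new file.
    if all(node.is_ready(version) for node in nodes):
        flag_holder["current"] = version


nodes = [IngestNode("ingest-1"), IngestNode("ingest-2")]
flag = {"current": "db-v1"}   # stands in for the flag stored in an index

roll_out(nodes, "db-v2", flag)
for node in nodes:
    node.refresh(flag["current"])

print([node.active_db for node in nodes])
```

Gating the switch on a single flag means no ingest node starts using the new database before all of them have it.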

Pros

  • Small IP address to location information mapping data size (GeoLite2-City: 70 MB)
  • Just a few seconds to prepare data to be used by a GeoIP processor
  • Fast data ingestion time. (0.05ms/doc)

Cons

  • Dependency on the MaxMind format
  • Dependency on MaxMind data

Possible future improvement

  • We can have our own binary format and SDK for the mapping data. This will remove a dependency on the MaxMind data format in OpenSearch.

Option 2. OpenSearch Index (Preferred)

In this option, we will use an OpenSearch index. The OpenSearch cluster will download a file in CSV format from an external endpoint. After the download completes, it will put the data into an index in the cluster. It will also create an index alias pointing to the newly created index. Update: We are not going to use an alias. Instead, we will query the index directly, because an alias, unlike a system index, can be modified or deleted by a user.

The index will have a single shard with auto_expand_replicas set to 0-all, so that a query against the index can be served within the same node to achieve a fast processing time.

The GeoIP processor will query the index internally to populate location data during the ingest time.
[Screenshot for Option 2]
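
As a rough sketch of what this could look like, the index body and per-document query below use the single-shard and `auto_expand_replicas` settings named above. The `cidr` field name, the `ip_range` mapping, and the `term` query are assumptions made for illustration and are not confirmed by this RFC.

```python
import json

# Index creation body: one shard, replicated to every data node.
index_body = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "auto_expand_replicas": "0-all",
        }
    },
    "mappings": {
        "properties": {
            # Assumed field name and type: each document stores one CIDR block.
            "cidr": {"type": "ip_range"},
        }
    },
}


def lookup_query(ip):
    # Assumed query shape: find the single range that contains the given IP.
    return {"size": 1, "query": {"term": {"cidr": ip}}}


print(json.dumps(index_body, indent=2))
print(json.dumps(lookup_query("1.0.0.42")))
```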

CSV file format

  1. The first line will contain the field name for each column.
  2. The first column should be an IP range in CIDR format.
//Example
cidr latitude longitude city country 
1.0.0.0/24 -37.8333 145.2375 Seattle "United States"
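
To make the format concrete, here is a minimal sketch that parses the example above and resolves an IP address against it. It assumes the columns are separated by whitespace with quoted values, exactly as the example is rendered here; the real delimiter may differ. The linear scan is for illustration only.

```python
import ipaddress
import shlex

SAMPLE = '''cidr latitude longitude city country
1.0.0.0/24 -37.8333 145.2375 Seattle "United States"
'''


def parse_rows(text):
    lines = [line for line in text.splitlines() if line.strip()]
    header = shlex.split(lines[0])           # first line: field names
    rows = []
    for line in lines[1:]:
        values = shlex.split(line)           # keeps "United States" as one value
        row = dict(zip(header, values))
        row["cidr"] = ipaddress.ip_network(row["cidr"])  # first column: CIDR range
        rows.append(row)
    return rows


def lookup(rows, ip):
    address = ipaddress.ip_address(ip)
    for row in rows:
        if address in row["cidr"]:
            return row
    return None


rows = parse_rows(SAMPLE)
match = lookup(rows, "1.0.0.42")
print(match["city"], match["country"])
```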

Pros

  • No dependency on MaxMind data format.
  • Can use a GeoIP database from a provider other than MaxMind, for example, the IP2Location database provided by Hexasoft. We cannot provide a free database distribution server for the IP2Location database due to its license, but a user can set up their own server.
  • Can benefit from future indexing performance improvements in OpenSearch out of the box.

Cons

  • Bigger data size than Option 1. (GeoLite2-City: 400MB in a segment file which is 3.5 times larger than Option 1)
  • Slower to prepare data to be used in a processor due to indexing time. (GeoLite2-City: 300 seconds)
  • Slower data ingestion than Option 1. (0.059ms/doc) Even slower data ingestion if a cluster has separate ingest nodes which are not data nodes. (0.071ms/doc)
  • Repeated indexing load on the cluster during each database update.

Possible future improvement

  • For an index created by a GeoIP policy, we can store it on the ingest nodes. This will prevent an increase in query latency when a cluster has ingest nodes that are separate from data nodes.
  • We can generate the index using the Lucene library directly to reduce the indexing time.

Questions to community

  1. Do you want to use your own GeoIP database, or any GeoIP database other than the free GeoLite2 database provided by MaxMind?
  2. Which GeoIP databases have you used in OpenSearch: GeoLite2-City, GeoLite2-Country, GeoLite2-ASN, or all of them?
  3. How frequently do you want to update the GeoIP database? Once a day or once a week?
  4. Do you have ingest nodes that are separate from data nodes in your cluster?

Implementations

@heemin32 heemin32 added enhancement Enhancement or improvement to existing feature or request untriaged labels Jan 13, 2023
@minalsha minalsha added RFC Issues requesting major changes and removed untriaged labels Jan 13, 2023
@navneet1v
Contributor

> The GeoIP processor will query the index internally to populate location data during the ingest time.

If there are separate ingest nodes and data nodes, what will the search look like, given that the data is not present on the ingest node? Which node will receive the search request?

Will there be a separate RFC that provides details on the index mapping and the query used to get IP-to-location details?

@heemin32
Contributor Author

heemin32 commented Jan 14, 2023

> If there are separate ingest nodes and data nodes, what will the search look like, given that the data is not present on the ingest node? Which node will receive the search request?

It will use an internal client to query the index. Therefore, routing will be decided in the same way as for other index queries.

> Will there be a separate RFC that provides details on the index mapping and the query used to get IP-to-location details?

I will share implementation details later in a separate document.

@heemin32 heemin32 changed the title from "[RFC]GeoIP database auto update" to "[RFC]GeoIP database auto update to provide the latest IP to geo location mapping in GeoIP processorGeoIP database auto update" Mar 3, 2023
@heemin32 heemin32 changed the title from "[RFC]GeoIP database auto update to provide the latest IP to geo location mapping in GeoIP processorGeoIP database auto update" to "[RFC] GeoIP database auto update to provide the latest IP to geo location mapping in GeoIP processor" Mar 3, 2023
@vamshin vamshin added the v2.8.0 'Issues and PRs related to version v2.8.0' label Mar 15, 2023
@hdhalter

Hi @heemin32, if this feature requires documentation, can you please create a documentation issue for it? Thanks!

@heemin32
Contributor Author

> Hi @heemin32, if this feature requires documentation, can you please create a documentation issue for it? Thanks!

Thanks @hdhalter. Created an issue.

@hdhalter

Thanks! Will you be adding it to the unified backlog project?

@heemin32
Contributor Author

heemin32 commented Mar 20, 2023

What is the unified backlog project?
Added. Thanks!

@dblock
Member

dblock commented Mar 24, 2023

I like the idea to use an OpenSearch index a lot more than introducing an additional store or format. Will comment on #5860.

@wbeckler

If we are using an OpenSearch index, maybe it makes sense to store the data as an index snapshot.

@heemin32
Contributor Author

With an index snapshot, there is a chance of incompatibility between versions. Also, having to generate an index snapshot makes it hard for a user to set up their own server for an isolated region.

@vamshin vamshin added v2.9.0 'Issues and PRs related to version v2.9.0' and removed v2.8.0 'Issues and PRs related to version v2.8.0' labels May 25, 2023
@heemin32
Contributor Author

heemin32 commented Jun 16, 2023

After the implementation, we ran a performance test and found a significant performance degradation compared to the existing GeoIP processor. The following is a summary of the benchmark results.

Tool: https://github.com/opensearch-project/opensearch-benchmark
Machine(OpenSearch cluster): Single docker container in m5.xlarge with 10G heap size.
Machine(Benchmark tool): m5.xlarge
Workload: http_logs
Setting: default (5000 bulk size and 100% of data)
| Processor | Throughput |
| --- | --- |
| Legacy GeoIP | 140,969 docs/s (Update: this value was measured with benchmark tool version 0.5.0. With version 1.0.0, the throughput is 53,611 docs/s.) |
| New GeoIP | 16,631 docs/s |

There are three reasons for this low performance:

  1. We make a search call inside the ingest process for each document. This consumes search threads, and requests are rejected at a certain point because search threads are not released quickly enough to handle all the incoming ingestion requests. This can become a greater problem when there is search traffic: ingest and search requests will compete for the limited search thread resources, and the ingestion throughput will decrease further.
  2. Because we make a search call for every document, there is overhead from request/response serialization and deserialization.
  3. The Lucene library is slower at searching IP ranges than the MaxMind library. See the library-level benchmark results below.

Tool: https://github.com/maxmind/MaxMind-DB-Reader-java/blob/main/sample/Benchmark.java
Machine: Mac M1 processor with 32GB memory
Data: GeoLite2-City
For Lucene, we constructed a segment file using the Lucene library before running the benchmark.

| Library | Throughput |
| --- | --- |
| MaxMind | 220K searches/second (850K with cache) |
| Lucene | 36K searches/second |

Therefore, we will look for another way to overcome this performance degradation. Most likely, we will use the MaxMind data format and its library for now and leave room to support other formats in the future.

@dblock
Member

dblock commented Jun 20, 2023

What is the IP distribution in your data set? Would an in-memory size-bound cache suffice to overcome this?

@nandi-github

What is the size of the entire IP-Geo mapping file?

@heemin32
Contributor Author

heemin32 commented Jun 20, 2023

> What is the size of the entire IP-Geo mapping file?

It depends on which format we use. For the MaxMind binary format, City data is 70MB, Country data is 6MB, and ASN data is 8MB as of today.

@nandi-github

> What is the size of the entire IP-Geo mapping file?
>
> It depends on which format we use. For the MaxMind binary format, City data is 70MB, Country data is 6MB, and ASN data is 8MB as of today.

For running the mapping, can we assume that the maximum memory requirement is ~100MB in the worst case? If so, can we evaluate keeping it in memory for performance?

@heemin32
Contributor Author

heemin32 commented Jun 21, 2023

> What is the IP distribution in your data set? Would an in-memory size-bound cache suffice to overcome this?

With a cache size of 10,000 items

| File | Record count | Cache miss count | Cache hit ratio |
| --- | --- | --- | --- |
| documents-181998.json | 2,708,746 | 47,636 | 98.24% |
| documents-191998.json | 9,697,882 | 133,047 | 98.62% |
| documents-201998.json | 13,053,463 | 145,562 | 98.88% |
| documents-211998.json | 17,647,279 | 167,819 | 99.04% |
| documents-221998.json | 10,716,760 | 99,268 | 99.07% |
| documents-231998.json | 11,961,342 | 99,138 | 99.17% |
| documents-241998.json | 181,463,624 | 1,043,508 | 99.42% |

With a cache size of 1,000 items

| File | Record count | Cache miss count | Cache hit ratio |
| --- | --- | --- | --- |
| documents-181998.json | 2,708,746 | 62,591 | 97.69% |
| documents-191998.json | 9,697,882 | 224,963 | 97.68% |
| documents-201998.json | 13,053,463 | 264,461 | 97.97% |
| documents-211998.json | 17,647,279 | 304,869 | 98.27% |
| documents-221998.json | 10,716,760 | 191,528 | 98.21% |
| documents-231998.json | 11,961,342 | 223,639 | 98.13% |
| documents-241998.json | 181,463,624 | 19,161,527 | 89.44% |
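
For readers who want to reproduce this kind of measurement, the sketch below shows one way a size-bound cache with hit and miss counters could be put in front of a lookup. It is an illustration only: the class name and the LRU eviction policy are assumptions, and the cache used for the numbers above may work differently. The hit ratio is computed as 1 - misses / records, for example 1 - 47,636 / 2,708,746 ≈ 98.24%.

```python
from collections import OrderedDict


class LruLookupCache:
    """Size-bound cache that evicts the least recently used entry."""

    def __init__(self, max_items, loader):
        self.max_items = max_items
        self.loader = loader          # function that performs the real lookup
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, ip):
        if ip in self.entries:
            self.hits += 1
            self.entries.move_to_end(ip)      # mark as most recently used
            return self.entries[ip]
        self.misses += 1
        value = self.loader(ip)
        self.entries[ip] = value
        if len(self.entries) > self.max_items:
            self.entries.popitem(last=False)  # drop the least recently used entry
        return value

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Tiny demonstration with a stand-in loader instead of a real database lookup.
cache = LruLookupCache(max_items=2, loader=lambda ip: {"ip": ip})
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]:
    cache.get(ip)

print(cache.hits, cache.misses, cache.hit_ratio())
```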

@heemin32
Contributor Author

heemin32 commented Jun 21, 2023

I changed the code to make the search call synchronous, which makes the ingestion thread wait until the search completes. After the change, the error rate (too many requests) went to zero. Also, with caching, the throughput is close to that of the legacy GeoIP processor.

| Method | Throughput |
| --- | --- |
| GeoIP + no cache | 46K docs/s |
| GeoIP + cache 1,000 | 53K docs/s |
| IP2Geo(Index) + no cache | 13K docs/s |
| IP2Geo(Index) + cache 1,000 | 57K docs/s |
| IP2Geo(Index) + cache 10,000 | 78K docs/s |

@dreamer-89
Member

@heemin32 @vamshin: Is this on track for the 2.9 release? If not, please update the labels accordingly.

@DarshitChanpura
Member

@heemin32 Should this issue be labelled for 2.11?

@heemin32
Contributor Author

heemin32 commented Sep 8, 2023

It will be released in 2.10. Closing it.

@heemin32 heemin32 closed this as completed Sep 8, 2023