add processor "GeoLite2" #277

Open
DpoBoceka opened this issue Sep 13, 2019 · 9 comments
Labels
enhancement processors Any tasks or issues relating specifically to processors

Comments

@DpoBoceka
Contributor

Sometimes, when our messages contain IP addresses (especially when triaging web server logs), we want to enrich them from a GeoIP database such as this one:

https://dev.maxmind.com/geoip/geoip2/geolite2/

And here is a reader for it:

https://github.com/oschwald/maxminddb-golang

What do you think, should we expand benthos with such functionality?
Of course, we could load all that data into a cache or SQL database and use the processors we already have, but that would be more of a workaround. Implementing this would take one more point away from Logstash.

@Jeffail
Collaborator

Jeffail commented Sep 13, 2019

Hey @DpoBoceka, seems like a reasonable addition.

@Jeffail Jeffail added enhancement processors Any tasks or issues relating specifically to processors labels Sep 13, 2019
@DpoBoceka
Contributor Author

I'm on it.
I think at a later stage of the implementation we should have some sort of cache_size option like Logstash has, because it would be wasteful to look up the same addresses over and over again. Or perhaps the Linux filesystem cache would handle that and there would be no overhead.
Any words of advice?

@Jeffail
Collaborator

Jeffail commented Jan 29, 2020

Don't worry about that for now; eventually we can add a cache field to optionally point to a cache resource.
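
The repeated-lookup concern raised above can be illustrated with a small in-memory memoiser in front of the database query. This is only a sketch under stated assumptions: `GeoRecord`, `LookupFunc` and `CachedLookup` are hypothetical names, the cache has no size limit or eviction, and it is not a Benthos cache resource of the kind the reply refers to.

```go
package main

import "sync"

// GeoRecord is a hypothetical, trimmed-down lookup result used only here.
type GeoRecord struct {
	CountryISO string
	City       string
}

// LookupFunc stands in for whatever actually queries the GeoLite2 database.
type LookupFunc func(ip string) (GeoRecord, error)

// CachedLookup memoises results per IP string. It never evicts entries;
// a real cache_size option would need a bound (for example an LRU).
type CachedLookup struct {
	mu     sync.Mutex
	lookup LookupFunc
	seen   map[string]GeoRecord
}

// NewCachedLookup wraps fn with an empty cache.
func NewCachedLookup(fn LookupFunc) *CachedLookup {
	return &CachedLookup{lookup: fn, seen: make(map[string]GeoRecord)}
}

// Get returns the cached record for ip, calling the wrapped lookup only
// the first time a given ip is seen. Errors are not cached.
func (c *CachedLookup) Get(ip string) (GeoRecord, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if rec, ok := c.seen[ip]; ok {
		return rec, nil
	}
	rec, err := c.lookup(ip)
	if err != nil {
		return GeoRecord{}, err
	}
	c.seen[ip] = rec
	return rec, nil
}
```

Whether this helps in practice depends on how often addresses repeat and on how fast the underlying reader already is, which this thread does not measure.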

@jamesharr
Contributor

Do you think this is going to make its way into benthos? Is there any work I can help with?

@Jeffail
Collaborator

Jeffail commented Aug 16, 2021

Hey @jamesharr, my plan was to adapt the processor from the existing PR into a bloblang method, as it'd make it easier to compose, but it's taking me a while to get around to it. If you're interested in having a go that'd be awesome, just let me know if I can help.

@jamesharr
Contributor

Hello Jeffail, I'm struggling to get started with this one. I took a wrong turn somewhere learning the code-base and I think I need to set it down for a little bit and pick it up again.

What do I need to do to create a bloblang method? Is there a good example I can base some work off of?

In part, it's been a long time since I've written Go, but I also think my lack of Benthos experience probably isn't helping here. Any pointers would be helpful, thanks!

@jamesharr
Contributor

Hi @Jeffail,

So I have a "hello world" bloblang functioning, but not anything super useful at the moment.

I'm wondering a few things:

  1. What do you think the appropriate API would look like?

On the API topic, which makes more sense to you?

root.geo_city = this.ip_address.geoip_city()
root.geo_city.country.iso_code // == "US"
root.geo_city.country.name // == 'United States'
root.geo_city.city.name // == "Minneapolis"
// other fields as noted in https://github.com/maxmind/GeoIP2-python#city-database

root.geo_asn = this.ip_address.geoip_asn()
root.geo_asn.autonomous_system_number // == "1211"
root.geo_asn.autonomous_system_organization // == "Telstra Pty Ltd"

or how about this API?

root.geo_city = geoip_city(this.ip_address)
root.geo_asn = geoip_asn(this.ip_address)

  2. I'm not sure how to open (and keep open) the GeoIP file. This is probably where I'll need a pointer and/or example if there is one.

@Jeffail
Collaborator

Jeffail commented Sep 4, 2021

Hey @jamesharr, I would suggest taking a string argument for a file path. The constructor of a bloblang function/method gets called only once when its arguments are static, so in the case of something like foo.bar("baz") the method bar is created once and called many times. You can simply read the file in the constructor and not worry about caching the result or anything, similar to the file function: https://github.com/Jeffail/benthos/blob/master/internal/bloblang/query/functions.go#L320

And I think we ought to go with the method approach as it generally looks cleaner when put at the end of a long coercion/coalesce chain:

root.foo = this.(bar | baz).string().trim().geoip_city(path: "./something/db.zip")

In my opinion that looks cleaner than:

root.foo = geoip_city(ip_address: this.(bar | baz).string().trim(), path: "./something/db.zip")

Having said all that, there are a few caveats that ought to be addressed. I'll take care of these myself afterwards; I'm just noting them here for future reference:

  • Linting bloblang mappings in a config will call the constructor of this method and block on reading the file. Instead, the linting context should swap out impure methods/functions (this one, file, env, etc) and replace them with placeholders (since we only care that the mapping is valid).
  • If a user calls this method with dynamic arguments (root.foo = this.bar.geoip_city(path: this.baz)) then there's no limit to how many files will be opened, which in this particular case is a bit of a footgun. We should lock this method down so that arguments must be static in order for it to parse.
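
The second caveat can be pictured as a parse-time check that refuses anything other than a literal path. The `arg` type below is a hypothetical stand-in for a parsed argument, since the real Benthos parser types are not shown in this thread; treat it as a sketch of the idea, not of the eventual implementation.

```go
package main

import "errors"

// arg is a hypothetical parsed argument: either a literal known at parse
// time or a query that is resolved per message.
type arg struct {
	literal string
	dynamic bool
}

var errDynamicPath = errors.New("geoip_city: path must be a static string")

// checkStaticPath rejects dynamic arguments at parse time, so a mapping
// can never cause an unbounded number of database files to be opened.
func checkStaticPath(path arg) (string, error) {
	if path.dynamic {
		return "", errDynamicPath
	}
	return path.literal, nil
}
```

With a check like this, geoip_city(path: "./db.mmdb") would parse while geoip_city(path: this.baz) would fail before any message is processed.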

@jamesharr
Contributor

Here's my first pass at getting a .geo_city structure.

https://github.com/Jeffail/benthos/pull/866/files

It seems to work so far, but it's missing a lot of polish. A few questions...

  • Skim over the code, there's quite a few TODOs in there with context. Pick out a couple important ones and I'll try to
  • I'm being lazy about getting the data back from this API right now (since it's a POC). I believe I'm leaning on the JSON marshaller to convert the values, but if I had to guess, it seems like it's happening at the output stage in benthos and not in the bloblang processor. I say this because I can't seem to access fields while I'm in bloblang. I'm actually kind of amazed it's working.
    • See the sample below, we get a lot of fields. It's using CamelCase instead of snake_case on output. I'm guessing you'll want the latter.
    • I'm thinking what's needed is a struct2map tool to convert the structures to map[string]interface{} before it returns. Is that the correct approach?
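
A struct2map helper of the kind asked about could look roughly like the reflection-based sketch below, which also renames fields to snake_case. It assumes, as the comment above guesses, that field access inside a mapping needs generic map[string]interface{} values and not Go structs. `StructToMap`, `snakeCase`, `country` and `cityRecord` are hypothetical names; the two record types only mimic part of the sample output and are not the reader library's own types.

```go
package main

import (
	"reflect"
	"strings"
	"unicode"
)

// country and cityRecord mimic a small part of the sample output.
type country struct {
	GeoNameID uint
	IsoCode   string
	Names     map[string]string
}

type cityRecord struct {
	Country country
}

// snakeCase converts Go field names such as "GeoNameID" to "geo_name_id".
func snakeCase(name string) string {
	runes := []rune(name)
	var b strings.Builder
	for i, r := range runes {
		if unicode.IsUpper(r) && i > 0 {
			prevLower := unicode.IsLower(runes[i-1])
			nextLower := i+1 < len(runes) && unicode.IsLower(runes[i+1])
			if prevLower || nextLower {
				b.WriteByte('_')
			}
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

// StructToMap converts v into maps, slices and scalars only.
func StructToMap(v interface{}) interface{} {
	return toGeneric(reflect.ValueOf(v))
}

func toGeneric(v reflect.Value) interface{} {
	switch v.Kind() {
	case reflect.Invalid:
		return nil
	case reflect.Ptr, reflect.Interface:
		if v.IsNil() {
			return nil
		}
		return toGeneric(v.Elem())
	case reflect.Struct:
		out := make(map[string]interface{}, v.NumField())
		for i := 0; i < v.NumField(); i++ {
			f := v.Type().Field(i)
			if f.PkgPath != "" { // skip unexported fields
				continue
			}
			out[snakeCase(f.Name)] = toGeneric(v.Field(i))
		}
		return out
	case reflect.Map: // assumes string keys, as in the Names maps
		if v.IsNil() {
			return nil
		}
		out := make(map[string]interface{}, v.Len())
		for iter := v.MapRange(); iter.Next(); {
			out[iter.Key().String()] = toGeneric(iter.Value())
		}
		return out
	case reflect.Slice, reflect.Array:
		if v.Kind() == reflect.Slice && v.IsNil() {
			return nil
		}
		out := make([]interface{}, v.Len())
		for i := range out {
			out[i] = toGeneric(v.Index(i))
		}
		return out
	default:
		return v.Interface()
	}
}
```

Numeric fields keep their Go types (a uint stays a uint), so a further normalisation step may be needed if the consumer expects only JSON-style numbers.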

Blobl example:

        root = this
        let geoip_data = this.ip.geoip_city(path: "GeoLite2-City.mmdb")
        root.geoip_data = $geoip_data
        root.city_name = $geoip_data.City.Names.en # this always returns null

Output (for 2001:4860:4860::8844 / dns.google)

{
  "geoip_data": {
    "City": {
      "GeoNameID": 0,
      "Names": null
    },
    "Continent": {
      "Code": "NA",
      "GeoNameID": 6255149,
      "Names": {
        "de": "Nordamerika",
        "en": "North America",
        "es": "Norteamérica",
        "fr": "Amérique du Nord",
        "ja": "北アメリカ",
        "pt-BR": "América do Norte",
        "ru": "Северная Америка",
        "zh-CN": "北美洲"
      }
    },
    "Country": {
      "GeoNameID": 6252001,
      "IsInEuropeanUnion": false,
      "IsoCode": "US",
      "Names": {
        "de": "USA",
        "en": "United States",
        "es": "Estados Unidos",
        "fr": "États-Unis",
        "ja": "アメリカ合衆国",
        "pt-BR": "Estados Unidos",
        "ru": "США",
        "zh-CN": "美国"
      }
    },
    "Location": {
      "AccuracyRadius": 100,
      "Latitude": 37.751,
      "Longitude": -97.822,
      "MetroCode": 0,
      "TimeZone": "America/Chicago"
    },
    "Postal": {
      "Code": ""
    },
    "RegisteredCountry": {
      "GeoNameID": 6252001,
      "IsInEuropeanUnion": false,
      "IsoCode": "US",
      "Names": {
        "de": "USA",
        "en": "United States",
        "es": "Estados Unidos",
        "fr": "États-Unis",
        "ja": "アメリカ合衆国",
        "pt-BR": "Estados Unidos",
        "ru": "США",
        "zh-CN": "美国"
      }
    },
    "RepresentedCountry": {
      "GeoNameID": 0,
      "IsInEuropeanUnion": false,
      "IsoCode": "",
      "Names": null,
      "Type": ""
    },
    "Subdivisions": null,
    "Traits": {
      "IsAnonymousProxy": false,
      "IsSatelliteProvider": false
    }
  },
  "ip": "2001:4860:4860::8844"
}


3 participants