IP ranges/CIDR

go
hhvm

#1

Hello! How can I store IP ranges (IPv4/IPv6) / CIDR blocks in Aerospike, and search for a specific IP address within those ranges? Thanks a lot! Zac.

e.g. an example in Redis: http://stackoverflow.com/questions/9989023/store-ip-ranges-in-redis


#2

Zac,

The techniques you use in a relational system (convert to an integer and select) work exactly the same in Aerospike. In a particular table, store the integer value of the IP address. Create an integer index on that column - it’s a range index.

Then you can query on that integer range, and you’ll get back the necessary rows.
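The integer form and the range bounds are easy to compute with Go's standard library; a quick sketch (the helper names are mine, and this only shows the conversion, not the Aerospike query itself):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// ipv4ToUint32 converts a dotted-quad IPv4 address to the integer value you
// would store in an integer-indexed bin.
func ipv4ToUint32(s string) uint32 {
	return binary.BigEndian.Uint32(net.ParseIP(s).To4())
}

// cidrBounds returns the first and last addresses of a CIDR block as integers;
// a range query between these two values matches every IP in the block.
func cidrBounds(cidr string) (first, last uint32) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		panic(err)
	}
	first = binary.BigEndian.Uint32(ipnet.IP.To4())
	last = first | ^binary.BigEndian.Uint32(ipnet.Mask)
	return first, last
}

func main() {
	lo, hi := cidrBounds("128.100.60.0/24")
	ip := ipv4ToUint32("128.100.60.20")
	fmt.Println(lo <= ip && ip <= hi) // true
}
```

A record per network with `first`/`last` stored in integer bins, plus an index on one of them, gives you the relational-style lookup described above.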

However, let me suggest a higher-performance approach than anything suggested in the IP range post. If you’re willing to live with class C (/24) resolution, you can store every /24 on the internet in about 16M entries, which - even allowing 128B of name/info per /24 - fits in a few GB of RAM with Aerospike.

For short ranges, the reads will be really quick - and vastly more scalable in any distributed database (including Aerospike).

The tradeoff? A secondary index in Aerospike gives constant-speed writes regardless of range size, so for a write-heavy use you’re better off with the index. For reads of large ranges with many IPs, use the index; for reads of small ranges or single IPs, use the KV approach.

This KV approach (you can imagine extra bells and whistles, like storing individual IP information in different columns - remember, NoSQL is “flex schema” and thus loves sparse data layouts) is better when you know more about the access patterns, like often writing by /24 and reading by single IP address.


#3

Thank you so much for your answer! The IP-range data is 99% reads… Do you have any tricks for IPv6 ranges (even /28 ranges…)? Thanks again! Zac


#4

I’ve done a little research but I don’t have any clear insights (I know a lot about IPv4). Even though IPv6 has segmentation in the address space itself, what matters most is how the data is distributed and queried in the real world, and I don’t have real-world experience with IPv6 distributions yet.

Maybe someone else in the community can offer advice, or you can check back with what you’ve found.


#5

Hi Again,

We are trying to load something similar to the MaxMind City database into Aerospike and find the appropriate range for a specific IP address (the ranges are all over the place, even /12 ranges; for simplicity let’s assume that there aren’t any collisions between them).

We are updating the ranges constantly, so saving every possible IP in the world, or saving them as class C, is too much of a hassle.

I saw that you answered in How to query IP to Geo data from Aerospike:

“An alternate pattern is to think of the problem in a more relational way. If your geo information is supplied as a “network” (with a unique name and a range), you can insert a row per network, use the integer form of the IP address in a column, index that column, and look up using range queries — just like you would do in a relational or SQL system (but it’ll be parallel and distributed).”

But we saw that a range query cannot span 2 different columns… What are we missing?

We also thought to build one 64-bit column that contains [ from_ip << 32 | to_ip ] and then search for the IP using BETWEEN [ current_ip << 32, masked to the biggest possible network, e.g. /12 ] AND [ current_ip << 32 | 0xFFFFFFFF ], but we will get a lot of results back…
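A sketch of that packed-column scheme in Go (the helper names are mine; this only computes the key values - the actual BETWEEN query would run against the 64-bit bin):

```go
package main

import "fmt"

// pack combines a range's endpoints into one 64-bit value: from<<32 | to.
func pack(from, to uint32) uint64 {
	return uint64(from)<<32 | uint64(to)
}

// searchBounds gives the BETWEEN bounds for a single-IP lookup, assuming no
// stored range is wider than maxPrefix (e.g. 12 for a /12): the lower bound
// masks the IP down to the widest possible network start, and the upper bound
// pairs the IP itself with the largest possible end address.
func searchBounds(ip uint32, maxPrefix uint) (low, high uint64) {
	mask := uint32(0xFFFFFFFF) << (32 - maxPrefix)
	low = uint64(ip&mask) << 32
	high = uint64(ip)<<32 | 0xFFFFFFFF
	return low, high
}

func main() {
	ip := uint32(0x01020304)                 // 1.2.3.4
	rangeVal := pack(0x01020300, 0x010203FF) // 1.2.3.0 - 1.2.3.255
	low, high := searchBounds(ip, 12)
	fmt.Println(low <= rangeVal && rangeVal <= high) // true
}
```

As noted, every range whose packed value lands between the bounds comes back, so the client (or a UDF) still has to pick the tightest match from the results.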

Maybe we can use a Lua script that returns only the last result?

Will it be fast enough (we need to perform this search for each HTTP query)?

Also, what will happen when the database is spread across several nodes?

Thanks again !!!

Zac.


#6

Zac,

The only way to create a reasonably optimal solution would be for you to explain some of the following:

  • Do the ranges themselves change? Or just the values inside a range?
  • What are the common sizes of the range (how many IP addresses per range)?
  • How many reads per second are you planning for?
  • How many writes per second are you planning for?
  • What amount of latency is acceptable?

Range queries in distributed databases are expensive. When implemented in the naive way, they lead to hot spots (for example, large parts of the IP address space are dark, military, or assigned to parts of the world where you might receive little traffic). Aerospike’s are not naive - they never create a hotspot - but they do impact every server in a cluster. Finding a way to turn your query into key-value lookups - even several of them - will almost always be better than a distributed range query.

The blanket statement that “saving them as class C is too much of a hassle” obviously depends on the pattern. If all the ranges are class C, that solution is optimal; if they are all class A, storing by /24 is poor, but storing everything as class A would be very good. Let me know more about the use-case pattern, and I can suggest a more optimal solution.

It seems this is an interesting enough problem that we might consider writing a library that does this. What language are you writing in?


#7

Thanks a lot for your answer !

I will try to answer your questions one-by-one :

Our ranges change several times each hour (so generating the whole set of updated ranges and reloading the Aerospike database is very expensive work…).

If we build the MaxMind GeoIP City database (GeoLiteCity-Blocks.csv) using the MaxMind Database builder, it can take several minutes …

They have a very nice in-memory database format (that I think you would like…). They have improved their code several times, and they claim 2.6M requests per second.

Ranges can change, and some of the ranges are just single IP addresses.

Ranges can be very large - even /12 .

We want to use it for a Real Time Bidding system.

We are writing it in Java, but we’re considering Go as well (for its concurrency support).

Thanks a lot !

Zac


#8

Hi Zac, I think my questions got lost a bit in the formatting:

  • Do the ranges themselves change? Or just the values inside a range?
  • What are the common sizes of the range (how many IP addresses per range)?
  • How many reads per second are you planning for?
  • How many writes per second are you planning for?
  • What amount of latency is acceptable?

Thanks !


#9

Is there anything else I should add?


#10

Hi Zac,

I don’t see answers to all the questions our CTO asked… Please post a clear answer to each one, and I am sure he will be able to assist you further!

Cheers,

Maud


#11

Sorry if my previous answer was misunderstood. I’ve tried to clarify my answers below:

  • Do the ranges themselves change? Or just the values inside a range?

  Ranges can change. Let’s assume that there aren’t any values inside the ranges (BTW, some of the ranges are just single IP addresses; from-ip/to-ip are the same).

  • What are the common sizes of the range (how many IP addresses per range)?

  Ranges can be very large - even a CIDR of “/12” (just like in the MaxMind City database).

  • How many reads per second are you planning for?

  We want to use it for a Real Time Bidding system (we will query on each of the RTB requests), so I guess even 1 million rps.

  • How many writes per second are you planning for?

  A small amount - a max of 10 requests per second (after the initial database load).

  • What amount of latency is acceptable?

  We want to use it for a Real Time Bidding system. Up to 10ms.


#12

Hi Zac,

Sorry for the delay.

From these parameters, I think you should use the primary key. In RTB - a space where Aerospike has a lot of experience - a secondary index has too much latency variability “in the line of fire”.

Given the small number of writes (10 per second after first load), you want to optimize for low read latency.

I would suggest starting with class C buckets and putting everything for a /24 into one record. Create 256 bins and store the data for each IP address within them. Exactly how to structure the data inside that key depends on the data you have. A UDF might make sense.

Your reads will have optimal speed. Your writes will have to update many things, but that’s OK, because updates are rare. The amount of memory you’ll consume is pretty small compared to having one key per IP address.
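The read path this layout gives you can be sketched in a few lines of Go (plain maps standing in for the Aerospike record here; the set/bin naming is mine, not a fixed convention):

```go
package main

import (
	"fmt"
	"strings"
)

// Record stands in for one Aerospike record: the key is the /24 network,
// and each bin (map entry here) is named after the host octet it covers.
type Record map[string]string

// splitIP returns the /24 record key and the host-octet bin name for a
// dotted-quad IPv4 address, e.g. "128.100.60.20" -> ("128.100.60", "20").
func splitIP(ip string) (key, bin string) {
	i := strings.LastIndex(ip, ".")
	return ip[:i], ip[i+1:]
}

func main() {
	// One get by primary key, then one bin access - no range query involved.
	store := map[string]Record{
		"128.100.60": {"20": "recA", "21": "recB"},
	}
	key, bin := splitIP("128.100.60.20")
	fmt.Println(store[key][bin]) // recA
}
```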

-b


#13

Hi,

So I have read this thread several times, but I just don’t understand what I should be doing to build a solution where I can look up IPs.

I have 18M addresses in ip_from/ip_to format. Unfortunately, as I have seen, this data does change, as a lot of it is mobile data.

I too am running an RTB-like system, so I need as many reads as possible. Writes aren’t that many - only when the DB gets updated (maybe once a week).

The amount of latency that is acceptable right now is 50ms and below. But who am I kidding? I’m using Postgres for the job right now with https://github.com/RhodiumToad/ip4r, which allowed me to get running very quickly.

Let me be honest, it took me as long to read this thread as it did to get me up and running with my db and postgres.

So please, in a concise way: what exactly should I be doing here?

Thanks


#14

Brian,

I found this on Google Groups - is this what you are talking about when making 256 bins?

https://groups.google.com/forum/#!topic/nosql-databases/9wIn9xdA_co

So, this is just off the top of my head, and I'm not an expert, but
perhaps this will give you some ideas:

I'm assuming you're talking IPv4.  If it's IPv6, you'll need to do some tweaking

Create a CF with RowKey of AsciiType and Column Names of LongType.
Values are BytesType.  Basically you create one Row for each /8 and
name the row with the network address of that row.  So you'd have:

1.0.0.0
2.0.0.0
3.0.0.0

etc as row keys

Then for every /16 under that, store a single name/value pair where
the name is the inet_aton encoded value of the IP address of the /16
network and the value is a bitmask representing the IP's of the /16.
Set 1 = filter, 0 = don't filter.  Basically the first bit would be
0.1, 255th bit would be 0.255 and the 256th bit 1.0, etc.

So basically you'll have:
255 rows
each row with 255 columns
each column would store 8K bytes (since 2^16/8 = 8K)

Alternatively, you could store 16K columns per row (each column is a
/24) and each column would have 8 bytes.  Off the top of my head I'm
not sure which would be faster, but the first solution would be more
disk space efficient.  If you need to update your bitmasks regularly,
you're probably better off with the second solution.

Wrap a little API around this and you have fast and direct access to
know if a given IP should be filtered or not.
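For what it’s worth, the bitmask encoding that post describes can be sketched in a few lines of Go (the type and method names are mine):

```go
package main

import "fmt"

// filter16 holds one bit per address in a /16 network: 2^16 bits = 8 KiB,
// matching the 8K-byte column values the quoted post describes.
type filter16 [8192]byte

// Set marks the host (the low 16 bits of an address) as "filter".
func (f *filter16) Set(host uint16) { f[host/8] |= 1 << (host % 8) }

// IsSet reports whether a host is marked.
func (f *filter16) IsSet(host uint16) bool { return f[host/8]&(1<<(host%8)) != 0 }

func main() {
	var f filter16
	f.Set(0x0105) // e.g. x.y.1.5 within the /16
	fmt.Println(f.IsSet(0x0105), f.IsSet(0x0106)) // true false
}
```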

#15

Bump? Am I allowed to bump? Just need some guidance on the right way to solve this.

I also found what’s below.

Is this a better solution? I actually understand this one, as opposed to what I posted above.

I wonder how aerospike would fare?

As I say, I have 18m records which is gonna balloon doing it the /24 way. I’ll post my findings and my code in PHP.

Would be nice to hear from someone so I’m not talking to the void.

http://stackoverflow.com/questions/10478794/more-than-4-billion-key-value-pairs-in-redis#13925591

The approach we use for fast Geo-IP resolution is take all the IP ranges and break them at the /24 (the first three quads), and store a record holding all the matches in those addresses. This gives you 16 million keys and O(1) access. If you'll tolerate the client-side complexity of breaking up the stored record, it's performant without taking up lots of RAM.

In more detail:

  • Take all ranges and break them by their first 24 bits. The range 128.100.60.0-128.100.60.9 becomes one record, <128.100.60 | 0 9 | (...recA...)>. The range 128.100.60.10 - 128.100.62.80 becomes <128.100.60 | 10 255 | (...recB...)>, <128.100.61 | 0 255 | (...recB...)>, and <128.100.62 | 0 80 | (...recB...)>.
  • Combine all the records with the same prefix into a hash whose key is the top of its range. So:
    key 128.100.60: {9: {...recA...}, 255: {...recB...}}
    key 128.100.61: {255: {...recB...}}
    key 128.100.62: {80: {...recB...}, ...}
  • To retrieve a specific IP, retrieve the compound record by its 24-bit key and return the first result whose sub-key is equal to or larger than the last octet. If I looked up 128.100.60.20, I would find that 9 was too small but 255 was not, and so return recB.

This is a common strategy for doing range joins (even spatial joins!) in things like Hadoop: partition on some reasonable chunk, and then index on one end of the range.
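That retrieval step looks roughly like this in Go (a sketch with plain maps; in practice the bucket would come back from a single Aerospike get):

```go
package main

import (
	"fmt"
	"sort"
)

// lookup implements the retrieval rule from the quoted post: within a /24
// bucket keyed by the top end of each sub-range, the answer is the value of
// the first key that is >= the host octet.
func lookup(bucket map[int]string, host int) (string, bool) {
	keys := make([]int, 0, len(bucket))
	for k := range bucket {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	for _, k := range keys {
		if host <= k {
			return bucket[k], true
		}
	}
	return "", false
}

func main() {
	// The bucket for 128.100.60 from the example above.
	bucket := map[int]string{9: "recA", 255: "recB"}
	rec, _ := lookup(bucket, 20) // .20 falls past 9, under 255
	fmt.Println(rec)             // recB
}
```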

#16

Sure, this is a fine way to do it. Sorry if that sounds terse, but it’s what I was talking about - a class C is a /24.

Hopefully you simply did it…


#17

Thanks for the clarification, Brian. When I get a chance to do this, I will throw up my process/code base for critique.


#18

Thought I would contribute this. I finally managed to do it.

Thanks to @bbulkow for the clarification.

I could have done this in PHP, but traversing 18m+ records, I thought it best to use Go. :smile:

Feel free to use as you wish. Any improvements? Please let me know!

Would love some feedback from @bbulkow or a golang expert at aerospike. :smile:

package main

import (
	"database/sql"
	"encoding/binary"
	"errors"
	"fmt"
	as "github.com/aerospike/aerospike-client-go"
	_ "github.com/lib/pq"
	"net"
	"strings"
)

func as_connect() (*as.Client, error) {
	client, err := as.NewClient("127.0.0.1", 3000)

	if err != nil {
		panic(err)
	}

	return client, err
}

const (
	DB_HOST     = "host"
	DB_USER     = "user"
	DB_PASSWORD = "password"
	DB_NAME     = "db"
)

func pg_connect() *sql.DB {
	dbinfo := fmt.Sprintf("host=%s dbname=%s user=%s password='%s' sslmode=disable", DB_HOST, DB_NAME, DB_USER, DB_PASSWORD)

	// Note: sql.Open only validates its arguments; it does not actually
	// connect. A db.Ping() here would catch bad credentials early.
	db, err := sql.Open("postgres", dbinfo)

	if err != nil {
		fmt.Println("Error: The data source arguments are not valid")
	}

	return db
}

func put_data(client *as.Client, _key string, m map[string]interface{}) {
	key, _ := as.NewKey("cache", "geo_ip", _key)

	err := client.Put(nil, key, m)

	if err != nil {
		fmt.Println(_key)

		panic(err)
	} else {
		fmt.Println("Imported - " + _key)
	}
}

func gen_ranges() (*sql.Rows, error) {
	db := pg_connect()

	sql := `
		select ip_from, ip_to, country_code, country_name, region_code, region as region_name, city_code, city_name, isp_name, mobile_carrier
		from bar
		order by ip_from
	`

	rows, err := db.Query(sql)

	return rows, err
}

func Ip2long(ipAddr string) (uint32, error) {
	ip := net.ParseIP(ipAddr)
	if ip == nil {
		return 0, errors.New("wrong ipAddr format")
	}
	// To4 returns nil for non-IPv4 addresses; guard against it to avoid a
	// panic in binary.BigEndian.Uint32.
	ip = ip.To4()
	if ip == nil {
		return 0, errors.New("not an IPv4 address")
	}
	return binary.BigEndian.Uint32(ip), nil
}

func Long2ip(ipLong uint32) string {
	ipByte := make([]byte, 4)
	binary.BigEndian.PutUint32(ipByte, ipLong)
	ip := net.IP(ipByte)
	return ip.String()
}

type Record struct {
	Ip_from        string
	Ip_to          string
	Country_code   string
	Country_name   string
	Region_code    string
	Region_name    string
	City_code      string
	City_name      string
	Isp_name       string
	Mobile_carrier string
}

func main() {
	var rowData Record
	ips := make(map[string]map[string]map[string]string)

	rows, _ := gen_ranges()

	if rows != nil {
		for rows.Next() {
			// Check the Scan error so a malformed row doesn't silently
			// carry over stale values from the previous iteration.
			if err := rows.Scan(&rowData.Ip_from, &rowData.Ip_to, &rowData.Country_code, &rowData.Country_name, &rowData.Region_code, &rowData.Region_name, &rowData.City_code, &rowData.City_name, &rowData.Isp_name, &rowData.Mobile_carrier); err != nil {
				panic(err)
			}

			s_ip := strings.Split(rowData.Ip_from, ".")
			s_network := strings.Join(s_ip[:3], ".")
			e_ip := strings.Split(rowData.Ip_to, ".")
			e_network := strings.Join(e_ip[:3], ".")

			if s_network == e_network {
				// Match
				e_host := e_ip[len(e_ip)-1]

				row := make(map[string]string)
				row["ip_from"] = rowData.Ip_from
				row["ip_to"] = rowData.Ip_to
				row["country_code"] = rowData.Country_code
				row["country_name"] = rowData.Country_name
				row["region_code"] = rowData.Region_code
				row["region_name"] = rowData.Region_name
				row["city_code"] = rowData.City_code
				row["city_name"] = rowData.City_name
				row["isp_name"] = rowData.Isp_name
				row["mobile_carrier"] = rowData.Mobile_carrier

				if _, ok := ips[s_network]; !ok {
					ips[s_network] = make(map[string]map[string]string)
				}

				ips[s_network][e_host] = row
			} else {
				n_ips := make(map[string][]string)

				// Non Match
				start_ip, err := Ip2long(rowData.Ip_from)

				if err != nil {
					panic(err)
				}

				end_ip, err := Ip2long(rowData.Ip_to)

				if err != nil {
					panic(err)
				}

				// NOTE: this walks every single address in the range, so a
				// wide range (e.g. a /12 with ~1M addresses) is slow here;
				// stepping one /24 (256 addresses) at a time would be faster.
				for ip := start_ip; ip <= end_ip; ip++ {
					split_ip := strings.Split(Long2ip(ip), ".")
					split_network := strings.Join(split_ip[:3], ".")
					split_host := split_ip[len(split_ip)-1]

					n_ips[split_network] = append(n_ips[split_network], split_host)
				}

				for network, hosts := range n_ips {
					row := make(map[string]string)
					row["ip_from"] = rowData.Ip_from
					row["ip_to"] = rowData.Ip_to
					row["country_code"] = rowData.Country_code
					row["country_name"] = rowData.Country_name
					row["region_code"] = rowData.Region_code
					row["region_name"] = rowData.Region_name
					row["city_code"] = rowData.City_code
					row["city_name"] = rowData.City_name
					row["isp_name"] = rowData.Isp_name
					row["mobile_carrier"] = rowData.Mobile_carrier

					last_host := len(hosts) - 1
					z_host := hosts[last_host]

					if _, ok := ips[network]; !ok {
						ips[network] = make(map[string]map[string]string)
					}

					ips[network][z_host] = row
				}
			}
		}
	}

	client, _ := as_connect()

	for network, hosts := range ips {
		m := make(map[string]interface{})

		for host, _map := range hosts {
			m[host] = _map
		}

		put_data(client, network, m)
	}
}

#19

Oh and I tested the end result in my hhvm application. By searching an IP and getting back the result.

Works nicely. Now to do the real result and test on my live-test servers. :smiley:

	private function gen_geo_location($click_ip)
	{
		$ip = explode(".", $click_ip);

		list($host) = array_slice($ip, -1); // host
		array_pop($ip);
		$network = implode(".", $ip); // network

		$as_data = array(
			'bin' => 'cache',
			'table' => 'geo_ip',
			'key' => $network
		);

		$row = $this->aerospike_funcs->get($as_data);

		if ($row)
		{
			$keys = array_keys($row);
			sort($keys, SORT_NUMERIC); // force numeric ordering of the host keys

			$g_row = null; // note: stays null if $host is above every key

			foreach ($keys as $value)
			{
				if ($host <= $value)
				{
					$g_row = $row[$value];
					break;
				}
			}

			$data = array(
				'ip_from' => $g_row['ip_from'],
				'ip_to' => $g_row['ip_to'],
				'click_isp' => $g_row['isp_name'],
				'country_code' => $g_row['country_code'],
				'country_name' => $g_row['country_name'],
				'region_code' => $g_row['region_code'],
				'region_name' => $g_row['region_name'],
				'city_code' => $g_row['city_code'],
				'city_name' => $g_row['city_name'],
				'isp_name' => $g_row['isp_name'],
				'carrier' => $g_row['mobile_carrier']
			);
		}
		else
		{
			$data = array(
				'ip_from' => 'No IP Address',
				'ip_to' => 'No IP Address',
				'click_isp' => 'No IP Address',
				'country_code' => 'NA',
				'country_name' => 'No IP Address',
				'region_code' => 'NA',
				'region_name' => 'No IP Address',
				'city_code' => 'No IP Address',
				'city_name' => 'No IP Address',
				'isp_name' => 'No IP Address',
				'carrier' => 'No IP Address'
			);
		}

		return $data;
	}

#20

Ewwwwwww. Doing this in Golang - that is, reading the GeoIP data back from the database.

Let’s say I’m getting the data (I have my own custom function which returns an as.BinMap):

rec, err := core.Get(client, "geoip", "ranges", "124.209.1")

But what I get back for the host bin is an interface{}, which I cannot iterate through until I cast it to a map[interface{}]interface{}; only then can I put the key/value pairs into another map.

m2 := make(map[string]string)

// The comma-ok type assertion avoids a panic if the bin holds something unexpected.
if foo, ok := rec[host].(map[interface{}]interface{}); ok {
    for key, value := range foo {
        m2[key.(string)] = value.(string)
    }
}

m["ip_from"] = m2["ip_from"]
.......

Again, would be nice if someone from aerospike could chime in! :slight_smile: