Motivation

Background

Nowadays, most smart algorithms rely on distances computed between the data items they process.

For instance, face recognition systems compute embeddings of facial characteristics and then compare the distances between these embeddings to identify the person under consideration.

However, for computer network entities such as IP addresses or ports, it is often difficult to define distance metrics, because different kinds of similarity may be relevant (physical or geographical distance, Euclidean distance, ...).

Some distance metrics exist for IP addresses, for instance [1] or [2]. However, to the best of our knowledge, very little effort has been made to define a distance or similarity between computer ports.

With this website, we intend to offer open access to a metric of this kind. It can be accessed in two ways: through the web interface of this application and through its public API.

Computation of the distance

In this application, we use a network dissimilarity metric (which, for the sake of simplicity, we call a distance) built from a large Darknet, also known as a Network Telescope. A Darknet captures traffic targeting the unused address space of a network in order to observe large-scale events on the Internet. All received packets are therefore unsolicited, and this traffic can be labelled as abnormal or malicious.

The semantic distance is derived from these anomalous TCP connections by building a graph of network ports that represents the paths followed by attackers while probing Darknet addresses. For instance, if an attacker starts by scanning port 80 and then jumps to port 443, an edge from port 80 to port 443 is added to the graph. The edges of the graph are weighted by the number of times the jump from the source port to the destination port occurs. This graph representation encodes probing activities in a way that conveys the trends and knowledge of the attackers.
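The graph construction described above can be sketched as follows. This is only a minimal illustration and not the implementation used by the application: the input format (one ordered list of probed ports per attacker) and the name `build_port_graph` are assumptions made for the example.

```python
from collections import Counter


def build_port_graph(scans):
    """Count port-to-port jumps over all scans.

    `scans` is a hypothetical input: an iterable of lists, each listing
    the ports probed by one attacker, in order.
    Returns a Counter mapping (source_port, destination_port) -> jump count.
    """
    edge_weights = Counter()
    for ports in scans:
        # Each pair of consecutive ports is one jump, i.e. one weighted edge.
        for src, dst in zip(ports, ports[1:]):
            if src != dst:  # assumption: re-probing the same port is not a jump
                edge_weights[(src, dst)] += 1
    return edge_weights


scans = [[80, 443], [80, 443, 8080], [22, 23]]
graph = build_port_graph(scans)
print(graph[(80, 443)])  # 2
```

The edges are directed here, since a jump from port 80 to port 443 is counted separately from a jump from 443 to 80.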

After rescaling the edge weights (to minimize the impact of outlier weights), the weights are inverted to transform them into distances, so that frequently observed jumps correspond to short edges. The final representation is a graph whose nodes are port numbers and whose edges carry the rescaled distances between them. From this representation, pairwise distances between port numbers are computed with Dijkstra's algorithm.
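The three steps above (rescaling, inversion, shortest paths) can be sketched as follows. The exact rescaling function is not specified on this page, so the logarithmic rescaling below is an assumption chosen only for illustration, as are the function names and the directed adjacency-dictionary layout.

```python
import heapq
import math


def weights_to_distances(edge_weights):
    """Turn jump counts into edge distances.

    Assumption: a log rescaling (1 + log(count)) dampens outlier counts;
    the rescaling really used by the application may differ.
    """
    adjacency = {}
    for (src, dst), count in edge_weights.items():
        rescaled = 1.0 + math.log(count)
        # Inversion: a frequent jump becomes a short edge.
        adjacency.setdefault(src, {})[dst] = 1.0 / rescaled
        adjacency.setdefault(dst, {})
    return adjacency


def dijkstra(adjacency, source):
    """Shortest-path distance from `source` to every reachable port."""
    distances = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, port = heapq.heappop(queue)
        if dist > distances.get(port, math.inf):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, edge in adjacency.get(port, {}).items():
            candidate = dist + edge
            if candidate < distances.get(neighbour, math.inf):
                distances[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return distances


edge_weights = {(80, 443): 100, (443, 8080): 100, (80, 8080): 1}
adjacency = weights_to_distances(edge_weights)
# Pairwise distances: one Dijkstra run per source port.
pairwise = {port: dijkstra(adjacency, port) for port in adjacency}
print(pairwise[80])
```

In this toy example, the path from port 80 to port 8080 through port 443 is shorter than the direct edge, because the two frequent jumps yield two short edges whose sum is below the distance of the rare direct jump.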

This distance metric proved capable of preventively blocking more than 99% of probes in real-world traffic while blocking less than 0.5% of legitimate traffic.

Source of this work

The proposed similarity metric comes from a research internship carried out in the Resist team of the INRIA of Nancy by Laurent Evrard from the University of Namur.

A scientific paper about this work has been written and is available here.

If you use this work for academic purposes, please cite the following reference:

    
@inproceedings{evrard_attacker_2019,
    address = {Washington, DC, USA},
    title = {Attacker {Behavior}-{Based} {Metric} for {Security} {Monitoring} {Applied} to {Darknet} {Analysis}},
    booktitle = {Proceedings of {IFIP}/{IEEE} {International} {Symposium} on {Integrated} {Network} {Management}},
    author = {Evrard, Laurent and François, Jérôme and Colin, Jean-Noël},
    year = {2019}
}
    

Limitations

This web application and its API do not limit the number of requests a user can make. However, with the rapid evolution of online probing activities, the number of ports targeted by probes grows every day, and the size of the semantic distance dataset files grows quickly as well.

Thus, to prevent performance issues, only the 50 million lowest semantic distances are provided for download for each dataset. Also, only the one million lowest of these distances are available for direct search.
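As an illustration of this truncation, the following sketch keeps only the lowest distances of a dataset. The dictionary layout (port pair to distance) and the function name are assumptions made for the example, and the limit is reduced to 2 for readability.

```python
import heapq


def lowest_distances(distances, limit):
    """Return the `limit` port pairs with the smallest distances, ascending."""
    return heapq.nsmallest(limit, distances.items(), key=lambda item: item[1])


distances = {(80, 443): 0.2, (22, 23): 0.1, (80, 8080): 0.9, (25, 110): 0.4}
print(lowest_distances(distances, 2))  # [((22, 23), 0.1), ((80, 443), 0.2)]
```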

Finally, note that this application supports several kinds of dataset. These kinds are defined by the number of days of data on which the distances are based.

Four kinds are distinguished:

  1. A day dataset: distances are generated from one day of data
  2. A week dataset: distances are generated from one week of data
  3. A month dataset: distances are generated from one month of data
  4. A year dataset: distances are generated from one year of data

A second distinction is made between permanent and temporary datasets. Temporary datasets are day datasets, or week, month or year datasets that are time-shifted, that is, that do not start at the beginning of the time period (Monday, the first day of the month or the first day of the year).
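This rule can be expressed as a small check. The sketch below is an illustration under assumptions: the kind names (`"day"`, `"week"`, `"month"`, `"year"`) and the function `is_temporary` are hypothetical and are not part of the application's API.

```python
from datetime import date


def is_temporary(kind, start):
    """Return True if a dataset of this kind and start date is temporary."""
    if kind == "day":
        return True  # day datasets are always temporary
    if kind == "week":
        return start.weekday() != 0  # weekday() == 0 means Monday
    if kind == "month":
        return start.day != 1
    if kind == "year":
        return (start.month, start.day) != (1, 1)
    raise ValueError(f"unknown dataset kind: {kind}")


print(is_temporary("month", date(2017, 10, 11)))  # True
print(is_temporary("month", date(2017, 10, 1)))   # False
```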

Permanent datasets are kept forever, whereas a temporary dataset is kept only until a more recent temporary dataset of the same kind is added. For instance, if we have a month dataset that begins on 11 October 2017 and we add a new one that begins on 15 November 2017, the older one is removed.
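The replacement of temporary datasets can be sketched as follows. The in-memory `store` dictionary and the function name are assumptions made for the example; the application's actual storage is not described on this page.

```python
from datetime import date


def add_temporary_dataset(store, kind, start):
    """Register a temporary dataset, replacing an older one of the same kind.

    `store` is a hypothetical dict mapping a dataset kind to the start date
    of the temporary dataset currently kept for that kind.
    """
    current = store.get(kind)
    if current is None or start > current:
        store[kind] = start  # the older temporary dataset is dropped
    return store


store = {}
add_temporary_dataset(store, "month", date(2017, 10, 11))
add_temporary_dataset(store, "month", date(2017, 11, 15))
print(store["month"])  # 2017-11-15
```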

Hosting and data provision

The INRIA High Security Lab hosts this application and provides the Darknet data needed to compute the distances.

What is the HSL?

The High Security Laboratory (HSL; in French LHS, Laboratoire de Haute Sécurité) was built in 2008-2009 by Inria’s Nancy-Grand Est Center with help from FEDER (European Fund for Regional Development), the Region of Lorraine, the Greater Nancy Metropolitan District and the Ministry for Higher Education and Research via the Regional Research and Technology Delegation. Business research has been undertaken in partnership with universities in Lorraine, the CNRS, and Inria.