Tool for fingerprinting HTTP requests of malware. Based on Tshark and written in Python3. Working prototype stage.
Its main objective is to provide unique representations (fingerprints) of malware requests, which help in their identification. Unique means here that each fingerprint should be seen in only one particular malware family, yet one family can have multiple fingerprints. Hfinger represents the request in a shorter form than printing the whole request, but one that is still human-interpretable.
Hfinger can be used in manual malware analysis, but also in sandbox systems or SIEMs. The generated fingerprints are useful for grouping requests, pinpointing requests to particular malware families, identifying different operations of one family, or discovering unknown malicious requests that were omitted by other security systems but share a fingerprint.
An academic paper accompanies work on this tool, describing, for example, the motivation of design choices and the evaluation of the tool compared to p0f, FATT, and Mercury.
The idea
The basic assumption of this project is that HTTP requests of different malware families are more or less unique, so they can be fingerprinted to provide some sort of identification. Hfinger retains information about the structure and values of some headers to provide means for further analysis, for example, grouping of similar requests. At this moment, it is still a work in progress.
After analysis of malware's HTTP requests and headers, we have identified some parts of requests as being most distinctive. These include:
* Request method
* Protocol version
* Header order
* Popular headers' values
* Payload length, entropy, and presence of non-ASCII characters
Additionally, some standard features of the request URL were also considered. All these parts were translated into a set of features, described in detail in the Fingerprint creation section below.
The above features are translated into a varying-length representation, which is the actual fingerprint. Depending on the report mode, different features are used to fingerprint requests. More information on these modes is presented below. The feature selection process will be described in the forthcoming academic paper.
Installation
Minimum requirements needed before installation:
* Python >= 3.3,
* Tshark >= 2.2.0.
Installation available from PyPI:
pip install hfinger
Hfinger has been tested on Xubuntu 22.04 LTS with the tshark package in version 3.6.2, but should work with older versions like 2.6.10 on Xubuntu 18.04 or 3.2.3 on Xubuntu 20.04.
Please note that, as with any PoC, you should run Hfinger in a separated environment, at least a Python virtual environment. Its setup is not covered here, but the Python documentation on virtual environments (venv) describes it.
Usage
After installation, you can call the tool directly from a command line with hfinger or as a Python module with python -m hfinger.
For example:
foo@bar:~$ hfinger -f /tmp/test.pcap
[1.4"]
Help can be displayed with the short -h or long --help switches:
usage: hfinger [-h] (-f FILE | -d DIR) [-o output_path] [-m {0,1,2,3,4}] [-v]
               [-l LOGFILE]

Hfinger - fingerprinting malware HTTP requests stored in pcap files

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Read a single pcap file
  -d DIR, --directory DIR
                        Read pcap files from the directory DIR
  -o output_path, --output-path output_path
                        Path to the output directory
  -m {0,1,2,3,4}, --mode {0,1,2,3,4}
                        Fingerprint report mode.
                        0 - similar number of collisions and fingerprints as mode 2, but using fewer features,
                        1 - representation of all designed features, but a little more collisions than modes 0, 2, and 4,
                        2 - optimal (the default mode),
                        3 - the lowest number of generated fingerprints, but the highest number of collisions,
                        4 - the highest fingerprint entropy, but slightly more fingerprints than modes 0-2
  -v, --verbose         Report information about non-standard values in the request
                        (e.g., non-ASCII characters, no CRLF tags, values not present in the configuration list).
                        Without --logfile (-l) will print to the standard error.
  -l LOGFILE, --logfile LOGFILE
                        Output logfile in the verbose mode. Implies -v or --verbose switch.
You must provide a path to a pcap file (-f) or a directory (-d) with pcap files. The output is in JSON format. It will be printed to standard output or saved to the provided directory (-o) using the name of the source file. For example, output of the command:
hfinger -f example.pcap -o /tmp/pcap
will be saved to:
/tmp/pcap/example.pcap.json
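The naming rule from the example above can be sketched in a few lines of Python. This is only an illustration inferred from that single example; the helper name output_json_path is hypothetical and is not part of Hfinger.

```python
from pathlib import PurePosixPath


def output_json_path(pcap_path, output_dir):
    """Return <output_dir>/<pcap file name>.json (rule inferred from the example)."""
    pcap_name = PurePosixPath(pcap_path).name  # keeps the ".pcap" suffix
    return PurePosixPath(output_dir) / (pcap_name + ".json")


print(output_json_path("example.pcap", "/tmp/pcap"))
```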
Report mode -m/--mode can be used to change the default report mode by providing an integer in the range 0-4. The modes differ in the represented request features or in rounding modes. We chose the default mode (2) because it represents all features that are usually used during request analysis while also offering a low number of collisions and generated fingerprints. With other modes, you can achieve different goals. For example, in mode 3 you get a lower number of generated fingerprints but a higher chance of a collision between malware families. If you are unsure, you don't have to change anything. More information on report modes is in the Report modes section.
Beginning with version 0.2.1, Hfinger is less verbose. You should use -v/--verbose if you want to receive information about encountered non-standard values of headers, non-ASCII characters in the non-payload part of the request, lack of CRLF tags (\r\n\r\n), and other problems with analyzed requests that are not application errors. When any such issues are encountered in the verbose mode, they will be printed to the standard error output. You can also save the log to a defined location using the -l/--logfile switch (it implies -v/--verbose). The log data will be appended to the log file.
Using hfinger in a Python application
Beginning with version 0.2.0, Hfinger supports importing to other Python applications. To use it in your app, simply import the hfinger_analyze function from hfinger.analysis and call it with a path to the pcap file and the reporting mode. The returned result is a list of dicts with fingerprinting results.
For example:
from hfinger.analysis import hfinger_analyze

pcap_path = "SPECIFY_PCAP_PATH_HERE"
reporting_mode = 4
print(hfinger_analyze(pcap_path, reporting_mode))
Beginning with version 0.2.1, Hfinger uses the logging module for logging information about encountered non-standard values of headers, non-ASCII characters in the non-payload part of the request, lack of CRLF tags (\r\n\r\n), and other problems with analyzed requests that are not application errors. Hfinger creates its own logger using the name hfinger, but without prior configuration the log information is in practice discarded. If you want to receive this log information, then before calling hfinger_analyze you should configure the hfinger logger: set the log level to logging.INFO, configure a log handler according to your needs, and add it to the logger. More information is available in the hfinger_analyze function docstring.
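The logger setup described above could look roughly like the following sketch. It uses only the standard logging module; the helper name configure_hfinger_logging, the choice of a stream handler, and the message format are illustrative assumptions, not requirements of Hfinger.

```python
import logging
import sys


def configure_hfinger_logging(stream=sys.stderr):
    """Attach a handler to the "hfinger" logger so its records are not discarded."""
    logger = logging.getLogger("hfinger")  # logger name stated in the text above
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)  # assumption: log to a stream
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


logger = configure_hfinger_logging()
# From here on, a call such as hfinger_analyze(pcap_path, reporting_mode)
# would have its log records written to the chosen stream.
```

Replacing the stream handler with logging.FileHandler would send the same records to a file instead.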
Fingerprint creation
A fingerprint is based on features extracted from a request. Which features from the full list are used depends on the chosen report mode (see the Report modes section). The figure below represents the creation of an exemplary fingerprint in the default report mode.
Three parts of the request are analyzed to extract information: URI, headers' structure (including method and protocol version), and payload. Particular features of the fingerprint are separated using | (pipe). The final fingerprint generated for the POST request from the example is:
2|3|1|php|0.6|PO|1|us-ag,ac,ac-en,ho,co,co-ty,co-le|us-ag:f452d7a9/ac:as-as/ac-en:id/co:Ke-Al/co-ty:te-pl|A|4|1.4
The creation of features is described below in the order of appearance in the fingerprint.
Firstly, URI features are extracted:
* URI length, represented as a base 10 logarithm of the length, rounded to an integer (in the example the URI is 43 characters long, so log10(43)≈2),
* number of directories (in the example there are 3 directories),
* average directory length, represented as a base 10 logarithm of the actual average length of the directory, rounded to an integer (in the example there are three directories with a total length of 20 characters (6+6+8), so log10(20/3)≈1),
* extension of the requested file, but only if it is on the list of known extensions in hfinger/configs/extensions.txt,
* average value length, represented as a base 10 logarithm of the actual average value length, rounded to one decimal place (in the example two values have the same length of 4 characters, so the average is 4 characters and log10(4)≈0.6).
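The arithmetic behind the URI features can be checked with a short Python sketch. It starts from the raw numbers quoted in the example (43 characters, directories of 6, 6, and 8 characters, two values of 4 characters) rather than parsing a URI; the helper name log10_feature and the handling of non-positive inputs are assumptions made for illustration.

```python
import math


def log10_feature(value, ndigits=None):
    """log10(value), rounded to an integer (ndigits=None) or to ndigits decimals."""
    if value <= 0:
        return 0  # assumption: the text does not say how empty inputs are handled
    return round(math.log10(value), ndigits)


uri_length = 43
directory_lengths = [6, 6, 8]
value_lengths = [4, 4]

uri_length_feature = log10_feature(uri_length)
directory_count = len(directory_lengths)
avg_directory_feature = log10_feature(sum(directory_lengths) / len(directory_lengths))
avg_value_feature = log10_feature(sum(value_lengths) / len(value_lengths), 1)

print(uri_length_feature, directory_count, avg_directory_feature, avg_value_feature)
# prints: 2 3 1 0.6
```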
Secondly, header structure features are analyzed:
* request method, encoded as the first two letters of the method (PO),
* protocol version, encoded as an integer (1 for version 1.1, 0 for version 1.0, and 9 for version 0.9),
* order of the headers,
* and popular headers and their values.
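The first two items of this list are simple enough to sketch directly. The function names and the dictionary below are hypothetical; only the mapping itself (first two letters of the method; 1, 0, and 9 for versions 1.1, 1.0, and 0.9) comes from the description above.

```python
# Version-to-code mapping as described in the text.
PROTOCOL_CODES = {"1.1": 1, "1.0": 0, "0.9": 9}


def encode_method(method):
    """Encode the request method as its first two letters, e.g. POST -> PO."""
    return method[:2]


def encode_protocol_version(version):
    """Encode the HTTP version; other versions are not covered by the text."""
    return PROTOCOL_CODES[version]


print(encode_method("POST"), encode_protocol_version("1.1"))
```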
To represent the order of the headers in the request, each header's name is encoded according to the schema in hfinger/configs/headerslow.json; for example, the User-Agent header is encoded as us-ag. Encoded names are separated by a comma (,). If the header name does not start with an upper case letter (or any of its parts does not, when analyzing compound headers such as Accept-Encoding), then the encoded representation is prefixed with an exclamation mark (!). If the header name is not on the list of known headers, it is hashed using the FNV1a hash, and the hash is used as the encoding.
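A minimal sketch of this encoding is shown below. Several details are assumptions: the small HEADER_CODES table is a hypothetical excerpt (only us-ag and ac are confirmed in the text, the rest are inferred from the example fingerprint; the real table lives in hfinger/configs/headerslow.json), the hash is taken to be 32-bit FNV-1a over the UTF-8 bytes of the name because the example fingerprint shows 8 hex digits, and the exclamation mark is applied to hashed names as well.

```python
# Hypothetical excerpt of the name-encoding table.
HEADER_CODES = {
    "user-agent": "us-ag",
    "accept": "ac",
    "accept-encoding": "ac-en",
    "host": "ho",
    "connection": "co",
    "content-type": "co-ty",
    "content-length": "co-le",
}


def fnv1a_32(data):
    """Standard 32-bit FNV-1a hash of a bytes object."""
    digest = 0x811C9DC5  # FNV offset basis
    for byte in data:
        digest ^= byte
        digest = (digest * 0x01000193) & 0xFFFFFFFF  # FNV prime, kept to 32 bits
    return digest


def encode_header_name(name):
    """Encode one header name; unknown names fall back to their hash."""
    code = HEADER_CODES.get(name.lower())
    if code is None:
        code = format(fnv1a_32(name.encode("utf-8")), "08x")
    # Every dash-separated part must start with an upper case letter.
    capitalized = all(part[:1].isupper() for part in name.split("-"))
    return code if capitalized else "!" + code


def encode_header_order(names):
    """Join encoded names with commas, in order of appearance."""
    return ",".join(encode_header_name(name) for name in names)


print(encode_header_order(["User-Agent", "Accept", "accept-encoding", "Host"]))
# prints: us-ag,ac,!ac-en,ho
```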
When analyzing popular headers, the request is checked for their presence. These headers are:
* Connection
* Accept-Encoding
* Content-Encoding
* Cache-Control
* TE
* Accept-Charset
* Content-Type
* Accept
* Accept-Language
* User-Agent
When the header is found in the request, its value is checked against a table of typical values to create pairs of header_name_representation:value_representation. The name of the header is encoded according to the schema in hfinger/configs/headerslow.json (as presented before), and the value is encoded according to the schema stored in the hfinger/configs directory or the configs.py file, depending on the header. In the above example Accept is encoded as ac and its value */* as as-as (asterisk-asterisk), giving ac:as-as. The pairs are inserted into the fingerprint in order of appearance in the request and are delimited using a slash (/). If the header value cannot be found in the encoding table, it is hashed using the FNV1a hash.
If the header value is composed of multiple values, they are tokenized to provide a list of values delimited with a comma (,); for example, Accept: */*, text/* would give ac:as-as,te-as. However, at this point of development, if the header value contains a "quality value" tag (q=), then the whole value is encoded with its FNV1a hash. Finally, values of the User-Agent and Accept-Language headers are directly encoded using their FNV1a hashes.
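The value encoding for a single header can be sketched as follows, using Accept as the example. The table ACCEPT_VALUE_CODES is a hypothetical two-entry excerpt built from the values quoted in the text, and the hash is assumed to be 32-bit FNV-1a over the UTF-8 bytes of the value; the real tables and exact hashing input are defined by Hfinger's configuration and code.

```python
# Hypothetical excerpt of the value table for the Accept header.
ACCEPT_VALUE_CODES = {"*/*": "as-as", "text/*": "te-as"}


def fnv1a_hex(text):
    """32-bit FNV-1a of the UTF-8 bytes, as 8 hex digits (assumed variant)."""
    digest = 0x811C9DC5
    for byte in text.encode("utf-8"):
        digest ^= byte
        digest = (digest * 0x01000193) & 0xFFFFFFFF
    return format(digest, "08x")


def encode_accept(value):
    """Build the name:value pair for an Accept header value."""
    if "q=" in value:
        # Quality values are not tokenized: the whole value is hashed.
        return "ac:" + fnv1a_hex(value)
    tokens = [token.strip() for token in value.split(",")]
    encoded = [ACCEPT_VALUE_CODES.get(token) or fnv1a_hex(token) for token in tokens]
    return "ac:" + ",".join(encoded)


pairs = [encode_accept("*/*, text/*"), "ac-en:id"]
print("/".join(pairs))  # pairs are joined with "/" in order of appearance
# prints: ac:as-as,te-as/ac-en:id
```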
Finally, the payload features are extracted:
* presence of non-ASCII characters, represented with the letter N, and with A otherwise,
* payload's Shannon entropy, rounded to an integer,
* and payload length, represented as a base 10 logarithm of the actual payload length, rounded to one decimal place.
Report modes
Hfinger operates in five report modes, which differ in the features represented in the fingerprint and thus in the information extracted from requests. These are (with the number used in the tool configuration):
* mode 0: producing a similar number of collisions and fingerprints as mode 2, but using fewer features,
* mode 1: representing all designed features, but producing a little more collisions than modes 0, 2, and 4,
* mode 2: optimal (the default mode), representing all features that are usually used during request analysis, but also offering a low number of collisions and generated fingerprints,
* mode 3: producing the lowest number of generated fingerprints of all modes, but with the highest number of collisions,
* mode 4: offering the highest fingerprint entropy, but also producing slightly more fingerprints than modes 0-2.
The modes were chosen to optimize Hfinger's capability to uniquely identify malware families against the number of generated fingerprints. Modes 0, 2, and 4 offer a similar number of collisions between malware families; however, mode 4 generates a little more fingerprints than the other two. Mode 2 represents more request features than mode 0 with a comparable number of generated fingerprints and collisions. Mode 1 is the only one representing all designed features, but it almost doubles the number of collisions compared to modes 0, 2, and 4. Mode 3 produces at least two times fewer fingerprints than other modes, but it introduces about nine times more collisions. A description of all designed features is available in the project documentation.
The modes consist of the following features (in the order of appearance in the fingerprint):
* mode 0:
  * number of directories,
  * average directory length represented as an integer,
  * extension of the requested file,
  * average value length represented as a float,
  * order of headers,
  * popular headers and their values,
  * payload length represented as a float.
* mode 1:
  * URI length represented as an integer,
  * number of directories,
  * average directory length represented as an integer,
  * extension of the requested file,
  * variable length represented as an integer,
  * number of variables,
  * average value length represented as an integer,
  * request method,
  * version of protocol,
  * order of headers,
  * popular headers and their values,
  * presence of non-ASCII characters,
  * payload entropy represented as an integer,
  * payload length represented as an integer.
* mode 2:
  * URI length represented as an integer,
  * number of directories,
  * average directory length represented as an integer,
  * extension of the requested file,
  * average value length represented as a float,
  * request method,
  * version of protocol,
  * order of headers,
  * popular headers and their values,
  * presence of non-ASCII characters,
  * payload entropy represented as an integer,
  * payload length represented as a float.
* mode 3:
  * URI length represented as an integer,
  * average directory length represented as an integer,
  * extension of the requested file,
  * average value length represented as an integer,
  * order of headers.
* mode 4:
  * URI length represented as a float,
  * number of directories,
  * average directory length represented as a float,
  * extension of the requested file,
  * variable length represented as a float,
  * average value length represented as a float,
  * request method,
  * version of protocol,
  * order of headers,
  * popular headers and their values,
  * presence of non-ASCII characters,
  * payload entropy represented as a float,
  * payload length represented as a float.
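To illustrate how a mode selects and orders features, the sketch below rebuilds the example fingerprint from the Fingerprint creation section for mode 2 and derives what the feature list for mode 0 implies. The feature keys and the function build_fingerprint are hypothetical names; only modes 0 and 2 are shown because they use the same integer/float representations as the values in the example.

```python
# Feature order per mode, following the lists above (hypothetical key names).
MODE_FEATURES = {
    0: ["dir_count", "avg_dir_len", "extension", "avg_val_len",
        "header_order", "popular_headers", "payload_len"],
    2: ["uri_len", "dir_count", "avg_dir_len", "extension", "avg_val_len",
        "method", "protocol", "header_order", "popular_headers",
        "non_ascii", "payload_entropy", "payload_len"],
}

# Feature values taken from the example fingerprint in the default mode.
example_features = {
    "uri_len": 2,
    "dir_count": 3,
    "avg_dir_len": 1,
    "extension": "php",
    "avg_val_len": 0.6,
    "method": "PO",
    "protocol": 1,
    "header_order": "us-ag,ac,ac-en,ho,co,co-ty,co-le",
    "popular_headers": "us-ag:f452d7a9/ac:as-as/ac-en:id/co:Ke-Al/co-ty:te-pl",
    "non_ascii": "A",
    "payload_entropy": 4,
    "payload_len": 1.4,
}


def build_fingerprint(features, mode):
    """Join the features selected by the mode with the pipe separator."""
    return "|".join(str(features[name]) for name in MODE_FEATURES[mode])


print(build_fingerprint(example_features, 2))
print(build_fingerprint(example_features, 0))
```

The first line printed matches the example fingerprint; the second is what the mode 0 feature list would yield for the same request, under the assumptions stated above.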