Automates the journey from raw PCAPs to context‑rich adversary profiles: transforms packets into enriched flows, cleans and groups attack patterns, applies hierarchical rules to infer attacker motivations and skills, maps findings into a STIX‑based knowledge graph, and includes an ML module for instant attribute prediction.
Raw network capture files.
Direct output from CICFlowMeter. Includes statistical characteristics per flow such as: Flow Duration, Fwd Packet Length Mean, Bwd IAT Max, etc. Base data for creating groupings and enrichment.
These characteristics describe in detail the behavior of grouped traffic, allowing you to identify suspicious patterns, statistical anomalies, or traces of cyberattacks. They are classified by category:
📌 Identification and Context
- id: Unique identifier (UUID) of the traffic group.
- threat_type, threat, attack, stage: Semantic tags if available (threat type, attack, kill chain stage).
- ips_src, ips_dst: List of unique source and destination IPs within the group.
- ports_src, ports_dst: List of unique source and destination ports.
📌 Volume and Count
- total_flows: Total number of network flows.
- total_packets_sent, total_packets_received: Total packets sent and received.
- total_bytes_sent, total_bytes_received: Bytes sent and received.
- unique_dst_ports_count, num_unique_src_ports: Number of unique destination and source ports.
- most_frequent_dst_port, most_frequent_src_port: Most common port observed.
- unique_dst_ips: Number of unique destination IPs.
📌 Time and Duration
- avg_flow_duration, std_flow_duration, iqr_flow_duration: Average, standard deviation, and interquartile range of flow duration.
- skewness_flow_duration, kurtosis_flow_duration: Skewness and kurtosis of flow duration.
- first_seen, last_activity: Timestamp of the first and last observed flow.
- active_time_ratio: Active time over total time.
- total_idle_time, avg_idle_time_between_flows, max_idle_time: Idle time measures.
- avg_time_between_connections, avg_time_between_flows: Average time between connections or flows.
- flow_frequency_per_minute: Number of flows per minute.
- extreme_flow_duration_count: Count of flows with extreme duration.
- time_diff: Time difference between the first and last flow.
📌 Size and Packets
- avg_packet_size: Average packet size.
- variance_packet_size: Packet size variance.
- fwd_packet_size_range, bwd_packet_size_range: Packet size range in the forward and backward directions.
- fwd_packet_size_to_mean_ratio, bwd_packet_size_to_mean_ratio: Ratio between individual size and average.
- flow_variance_ratio, high_variance_flow: Traffic variability indicators.
📌 Inter-arrival Time (IAT)
- mean_iat, std_iat, max_iat, min_iat: Metrics for the time between packet arrivals.
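The IAT metrics above can be derived directly from packet timestamps. A minimal sketch (the function name and the sample timestamps are illustrative, not taken from the project's code):

```python
import numpy as np

def iat_metrics(timestamps):
    """Inter-arrival-time statistics from packet timestamps (seconds)."""
    iat = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    return {
        "mean_iat": float(iat.mean()),
        "std_iat": float(iat.std()),
        "max_iat": float(iat.max()),
        "min_iat": float(iat.min()),
    }

# Hypothetical packet arrival times (seconds)
m = iat_metrics([0.0, 0.1, 0.3, 0.6, 1.0])
```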
📌 TCP Indicators / Flags
- syn_count, ack_count, fin_count, rst_count, psh_count, urg_count: Number of packets with each TCP flag.
- syn_ack_ratio, rst_syn_ratio, rst_syn_flag_ratio, fin_to_syn_ratio, urg_to_ack_ratio, psh_to_syn_ratio: Ratios between flag counts, useful for detecting scans or attacks.
📌 Distributions and Entropy
- protocol_distribution: Proportion of protocols (TCP, UDP, etc.).
- udp_count, tcp_count: Total count per protocol.
- udp_tcp_traffic_ratio: Ratio between UDP and TCP traffic.
- entropy_dst_ips, entropy_dst_ports: Measure of entropy (diversity) in destination IPs or ports.
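Entropy here is the standard Shannon entropy over the observed value distribution: uniform spread across many ports or IPs gives a high value, while repeated traffic to a single target gives zero. A minimal sketch:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (bits) of observed values, e.g. destination ports."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A scan touching many distinct ports yields high entropy;
# repeated traffic to one port yields zero entropy.
scan_like = shannon_entropy([22, 23, 80, 443])  # 4 equally likely ports
single = shannon_entropy([443, 443, 443, 443])  # one port
```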
📌 Aggregate Ratios and Behaviors
- ratio_fwd_to_bwd_packets, ratio_fwd_to_bwd_bytes: Bidirectional information flow.
- byte_sent_received_ratio: Ratio between bytes sent and received.
- flow_duration_to_packet_ratio: Ratio between flow duration and number of packets.
- unique_ports_per_dst_ip_ratio: Metric of unique port usage per destination IP.
- active_to_idle_ratio: Ratio between active and idle time.
📌 Connection Dynamics
- unique_dst_ports_group, unique_dst_ips_group: Structural diversity within the group.
- unique_sessions_count: Number of sessions (unique IP:port combinations).
- port_change_frequency, dst_ip_change_frequency: Frequency of port or destination IP changes.
- num_flows_in_high_traffic_periods: Flows detected during periods of high activity.
🚨 Attack Indicators
- ddos_indicator: Heuristic indicator of DDoS attacks.
- port_scan_indicator: Probability of port scanning.
- brute_force_indicator: Probability of brute force attack.
- data_exfiltration_indicator: Signs of data exfiltration.
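These indicators combine several of the low-level features above into a heuristic score. The following sketch illustrates the idea for port scanning only; the thresholds and weights are made up for this example and are not the ones used by the project's scripts:

```python
def port_scan_indicator(unique_dst_ports, total_flows, avg_packet_size,
                        syn_count, ack_count):
    """Heuristic score in [0, 1]: many distinct ports probed with small,
    mostly-unanswered SYN packets suggests a port scan.
    Thresholds are illustrative only."""
    score = 0.0
    if total_flows and unique_dst_ports / total_flows > 0.8:
        score += 0.5   # almost every flow hits a new port
    if avg_packet_size < 100:
        score += 0.25  # probe packets are small
    if syn_count > 0 and ack_count / syn_count < 0.1:
        score += 0.25  # few SYNs are ever answered
    return score

s = port_scan_indicator(unique_dst_ports=950, total_flows=1000,
                        avg_packet_size=60, syn_count=1000, ack_count=20)
```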
These characteristics represent attributes of potential attackers. They are generated from enriched traffic views to support threat hunting, intelligence analysis, and attribution activities.
📌 Identification and Context
- Id: Profile identifier.
- IPs: List of observed IPs.
- Target, PreferredTarget: Primary and secondary targets of the attacker.
- FirstSeen, LastActivity: First and last trace of activity.
- Country: Country associated with the attacker's infrastructure.
📌 Attacker Profile
- Profile: Main type of attacker (APT, Hacktivist, Criminal...).
- ProfileDist: Probabilistic distribution among profiles.
- Motivation: Motivation (economic, ideological, espionage...).
- Affiliation: Relationship with known groups, governments, or actors.
- Attitude: Level of aggressiveness or exposure of the attacker.
- Knowledge: Level of technical or environmental knowledge.
- Skills: Detected or inferred skills.
- AutomationLevel: Level of automation of actions.
- Evasion: Level and types of evasive techniques used.
📌 Techniques and Tools
- TTPs: Observed techniques, tactics, and procedures (MITRE).
- KillChainPhase: Phase(s) of the attack in the Kill Chain model.
- Tools: Tools used (exploits, RATs, backdoors, etc.).
📌 Campaigns and Groups
- ThreatGroup: Known group (if association is successful).
- Campaigns: Previous campaigns related to current behavior.
📌 Risk Assessment
- RiskLevel: Estimated risk level.
- Confidence: Degree of certainty in the generated profile.
📌 Others
- Comments: Additional observations, semantic explanations, or automatically generated inferences.
This section describes the software requirements and dependencies necessary for the proper functioning of the traffic analysis, attacker profiling, Neo4j visualization, and machine learning model training scripts.
Python 3.7 or higher is required.
Required libraries:
pandas
numpy
scikit-learn
imbalanced-learn
joblib
matplotlib
geoip2
ipwhois
neo4j
transformers
These libraries enable statistical analysis, data preprocessing, visualization, geographic enrichment, and NLP model execution for advanced profiling.
CICFlowMeter is the essential tool for converting .pcap traffic files into labeled .csv files that the analysis scripts can process.
An operational Neo4j database is required to store and visualize attacker profiles as a graph.
The ML environment must support:
- Multi-class classification models
- Class balancing (via SMOTE or balanced RandomForest)
- Custom transformations with ColumnTransformer
- Exporting models and metrics
- Folder structure for batch processing by class
This requires that the ML, visualization, and data manipulation libraries are correctly installed, as detailed in the Python libraries block.
Once installation is complete, the general workflow follows these steps:
- Traffic conversion: use CICFlowMeter to transform .pcap to .csv.
- Low-level analysis: process the .csv flow files with Low-Level.py to generate enriched views grouped by source IP, destination IP, IP pair, or attack tags.
- High-level profiling: use High-Level.py on the enriched output to generate attacker profiles with 23 analytical attributes.
- Graph visualization:
- Define the node and relationship schema by running neo4j_schema.cypher.
- Load the generated profiles into Neo4j using the Dataset-Neo.py script.
- ML model training:
- Preprocess the data if necessary.
- Run ML.py to train classification models and evaluate metrics.
- Use script_ML.sh if batch automation across multiple folders/classes is required.
[.pcap files]
↓
[CICFlowMeter]
↓
[Low-Level.py → enriched views]
↓
[High-Level.py → attacker profiles]
↓
[Dataset-Neo.py → Neo4j graph]
↓
[ML.py / script_ML.sh → classification models]
The first step in the flow is to transform the network capture files (.pcap) into labeled network flows (.csv), which will serve as input for subsequent analyses.
Required command:
cicflowmeter -f example.pcap -c flows.csv
Parameters:
- -f example.pcap: path to the network capture file to be processed.
- -c flows.csv: name of the .csv file that will be generated as output, containing the network flows.
This flows.csv file will include statistics per flow such as source/destination IP addresses, ports, duration, number of packets, bytes transferred, and more. It is the starting point for running the Low-Level.py script.
This script takes a previously generated .csv file of network flows (for example, using CICFlowMeter) as input and produces four enriched views grouped from different perspectives. These views are essential for statistical analysis, anomaly pattern detection, and subsequent attacker profiling.
🔍 What exactly does it do?
It groups network flows by:
- Source IP (_by_src.csv)
- Destination IP (_by_dst.csv)
- IP pair (_by_pair.csv)
- Conditional grouping (_conditional.csv) — based on whether the attack is DoS, DDoS, or APT.
It aggregates network behavior metrics by group:
- Average flow duration
- Total number of packets and bytes sent/received
- TCP flags, packet/byte rate, IP entropy, etc.
It adds additional context:
- Lists of unique IPs and ports (ips_src, ips_dst, ports_src, ports_dst)
- Unique group identifier (id, UUID v4)
- NaN placeholders for any columns missing from the original dataset, to keep the output structure consistent.
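The grouping and aggregation described above amount to a pandas groupby per view. A minimal sketch of the `_by_src` case, using a miniature hypothetical flow table (real CICFlowMeter CSVs have many more columns):

```python
import uuid
import pandas as pd

# Hypothetical miniature flow table.
flows = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "dst_ip": ["10.0.0.9", "10.0.0.8", "10.0.0.9"],
    "dst_port": [80, 443, 80],
    "flow_duration": [1.5, 2.5, 4.0],
    "tot_fwd_pkts": [10, 20, 5],
})

# Group by source IP and aggregate, roughly as the *_by_src view does.
by_src = flows.groupby("src_ip").agg(
    avg_flow_duration=("flow_duration", "mean"),
    total_packets_sent=("tot_fwd_pkts", "sum"),
    ips_dst=("dst_ip", lambda s: sorted(set(s))),
    ports_dst=("dst_port", lambda s: sorted(set(s))),
).reset_index()

# One UUID v4 per traffic group, as in the real output.
by_src["id"] = [str(uuid.uuid4()) for _ in range(len(by_src))]
```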
📤 Generated outputs
Running this script with an input file generates the following files:
output_by_src.csv # Grouped by src_ip
output_by_dst.csv # Grouped by dst_ip
output_by_pair.csv # Grouped by src_ip + dst_ip
output_conditional.csv # Conditional grouping based on attack type
python3 Low-Level.py -i flows.csv -o output
- -i flows.csv: previously generated .csv file with network flows.
- -o output: prefix for the output files that will be generated.
The resulting files will be used as input for high-level profiling with High-Level.py.
This script implements a hybrid expert system to infer attacker profiles from enriched network traffic. It uses a fact base (the enriched traffic dataset), a rule-encoded knowledge base, and inference techniques.
🧠 What does it do?
From a .csv file with grouped and enriched malicious traffic (generated by Low-Level.py), the script:
- Assigns one of 12 attacker profiles (nation-state, spy, criminal, etc.).
- Calculates strategic attributes such as:
- Motivation (economic, espionage, sabotage...)
- Skill and knowledge level
- MITRE techniques (TTPs) used
- Phase of the attack in the Kill Chain
- APT group or campaign (if there is a match)
- Probable affiliation (state, contractor, criminal, etc.)
- Optionally, it generates automated comments with NLP models (FLAN-T5).
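Conceptually, the expert system scores each candidate profile against rules fired by the low-level indicators, then normalizes the scores into a distribution. The sketch below is a toy version with made-up rules and weights; the real knowledge base in High-Level.py is far larger:

```python
def infer_profile(group):
    """Toy forward-chaining rules over an enriched traffic group (a dict of
    low-level features). Scores and thresholds are illustrative only."""
    scores = {"criminal": 0.0, "nation-state": 0.0, "script-kiddie": 0.0}
    if group.get("brute_force_indicator", 0) > 0.5:
        scores["criminal"] += 0.4          # credential attacks are usually for profit
    if group.get("data_exfiltration_indicator", 0) > 0.5:
        scores["nation-state"] += 0.5      # stealthy exfiltration hints at espionage
    if group.get("port_scan_indicator", 0) > 0.5 and group.get("entropy_dst_ports", 0) > 3:
        scores["script-kiddie"] += 0.3     # noisy, broad scanning
    total = sum(scores.values()) or 1.0
    dist = {k: v / total for k, v in scores.items()}  # ProfileDist
    profile = max(dist, key=dist.get)                 # Profile
    return profile, dist

profile, dist = infer_profile({
    "data_exfiltration_indicator": 0.9,
    "brute_force_indicator": 0.1,
})
```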
📥 Required files
Mandatory input:
- Enriched traffic CSV: enriched_output.csv
Required semantic maps:
- network_observable_attack_mapping_v2.csv: rules for detecting TTPs.
- network_tool_mapping_v1.csv: rules for detecting tools.
- network_evasion_mapping_v1.csv: rules for evasion techniques.
External database:
- groups.csv: database of known groups.
- campaigns.csv: database of campaigns with techniques and dates.
- GeoLite2-Country.mmdb: GeoIP database for location.
📤 Generated outputs
CSV of the attacker model:
Id, IPs, Target, PreferredTarget, FirstSeen, LastActivity,
Country, AutomationLevel, Evasion, TTPs, KillChainPhase,
RiskLevel, Confidence, Tools, Skills,
Profile, ProfileDist, Motivation, Knowledge,
Attitude, Affiliation, ThreatGroup, Campaigns, Comment
python3 High-Level.py enriched_output.csv profiles_output.csv \
-m network_observable_attack_mapping_v2.csv \
-t network_tool_mapping_v1.csv \
-e network_evasion_mapping_v1.csv \
-c iot
Arguments:
- enriched_output.csv: input file with enriched traffic.
- profiles_output.csv: output file name.
- -m: file with the mapping of observables to TTPs (default: network_observable_attack_mapping_v2.csv)
- -t: file with the mapping of tools.
- -e: file with the mapping of evasion techniques.
- -c: environment context (iot, 5g, cloud, ics)
Once the attacker profiles (profiles_output.csv) have been generated using High-Level.py, you can upload them to Neo4j as a knowledge graph with entities, tools, techniques, campaigns, groups, and their relationships.
🏗️ 4.1 Inserting the schema into Neo4j
The neo4j_schema.cypher file contains the necessary instructions to:
- Create unique indexes on key nodes (:Entity, :Tool, :TTP, etc.)
- Define integrity constraints (for example, avoid duplicate nodes by ID)
- Establish the model of nodes and relationships in the graph
✅ Instruction to execute the schema:
cat neo4j_schema.cypher | cypher-shell -u neo4j -p <your_password>
Make sure the Neo4j server is running and reachable over the bolt:// protocol (port 7687 by default).
🧾 4.2 Insert the data generated by High-Level.py
The Dataset-Neo.py script loads the .csv profile file and performs the following:
- Creation of :Entity nodes for each profile (one per traffic group)
- Creation of related nodes: :TTP, :Tool, :Campaign, :ThreatGroup
- Establish relationships:
(:Entity)-[:USES]->(:Tool)
(:Entity)-[:APPLIES]->(:TTP)
(:Entity)-[:BELONGS_TO]->(:ThreatGroup)
(:Entity)-[:PART_OF]->(:Campaign)
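These MERGE operations can be sketched in Python with the official neo4j driver. The helper below builds the Cypher for one profile; the function name is illustrative and, for readability, it interpolates values into the query string, whereas real loading code should pass parameters instead:

```python
# from neo4j import GraphDatabase  # only needed when actually connecting

def profile_to_cypher(profile):
    """Build MERGE clauses for one attacker profile, mirroring the
    relationships Dataset-Neo.py creates. Illustrative sketch only:
    values are interpolated here for readability, but production code
    should use query parameters to avoid Cypher injection."""
    stmts = ["MERGE (e:Entity {id: $id})"]
    for i, tool in enumerate(profile.get("Tools", [])):
        stmts.append(f"MERGE (t{i}:Tool {{name: '{tool}'}}) MERGE (e)-[:USES]->(t{i})")
    for i, ttp in enumerate(profile.get("TTPs", [])):
        stmts.append(f"MERGE (p{i}:TTP {{id: '{ttp}'}}) MERGE (e)-[:APPLIES]->(p{i})")
    return "\n".join(stmts)

cypher = profile_to_cypher({"Id": "abc", "Tools": ["hydra"], "TTPs": ["T1110"]})

# To actually load it (assuming a local instance with these credentials):
# driver = GraphDatabase.driver("bolt://localhost:7687",
#                               auth=("neo4j", "your_password"))
# with driver.session() as session:
#     session.run(cypher, id="abc")
```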
⚙️ Preliminary configuration
Open the Dataset-Neo.py file and modify the following variables to connect to your Neo4j instance:
URI = "bolt://localhost:7687"
USER = "neo4j"
PASSWORD = "your_password"
CSV_PATH = "profiles_output.csv"
Once configured:
python3 Dataset-Neo.py
This will automatically create the complete graph from the profiles, allowing you to query it visually or using Cypher.
Once you have the attacker profiles generated with High-Level.py, you can use this data to train supervised classification models, which allow you to:
- Predict the type of attacker
- Classify anomalous behaviors
- Evaluate the effectiveness of countermeasures
🎯 What does ML.py do?
This script:
- Automatically preprocesses the columns of the dataset
- Trains multiple models (including Random Forests and Gradient Boosting)
- Applies class balancing techniques (such as SMOTE and balanced RandomForest)
- Evaluates the models
python3 ML.py -i profiles_output.csv -t Profile
- -i: Path to the input .csv file
- -t: Name of the target column to predict (e.g., Profile, Motivation, Affiliation...)
📤 Outputs generated
After running, the script will automatically generate:
- metrics_cv.csv: cross-validation results (mean ± deviation)
- metrics_train.csv: metrics in the training data
- metrics_test.csv: metrics in the test data
- .pkl files with the trained models (if enabled)
- Optional graphs if visualization generation is enabled
⚙️ Automation with script_ML.sh (optional)
If you have datasets organized into subdirectories (for example, by attack type), you can run mass training with:
chmod +x script_ML.sh
./script_ML.sh
This script:
- Automatically preprocesses each subdirectory
- Runs batch training with ML.py
- Organizes results and models by folder
📁 Repository Structure
The repository is structured to cover all phases of the system in a modular way: traffic analysis, enrichment, attacker profiling, knowledge graph generation, and machine learning model training.
Repository-Name/
│
├── docs/ # Documentation and visualizations
│ ├── Correlations/ # Correlation heatmaps
│ │ ├── High_level_correlation_heatmap.png
│ │ ├── High_vs_low_level_correlation_heatmap.png
│ │ └── low_level_correlation_heatmap.png
│ ├── Implementation/ # System design and workflow
│ │ ├── Framework.pdf # General framework diagram
│ │ ├── High-Order.pdf # Calculation order for high-level attributes
│ │ └── Implementation.pdf # Technical implementation schema
│ └── KG/ # Sample queries for the Knowledge Graph (Neo4j)
│ └── queries.txt
│
├── src/ # System source code
│ ├── High/ # Scripts for high-level attribute calculation
│ │ └── High-Level.py
│ ├── Low/ # Scripts for traffic enrichment
│ │ └── Low-Level.py
│ └── KG/ # Loading and structuring graph in Neo4j
│ ├── Dataset-Neo.py
│ └── neo4j_schema.cypher
│
├── tests/ # ML model evaluation and training
│ ├── ML/ # General training
│ │ ├── ML.py
│ │ ├── Script_ML.sh
│ │ └── Results.png # Graphical comparison of results
│ ├── Affiliation/
│ ├── Attitude/
│ ├── Knowledge/
│ ├── Motivation/
│ ├── Profile/
│ ├── RiskLevel/
│ └── Skills/
│ ├── Merged.csv # Unified dataset for training
│ ├── Models_good/ # Trained models and metrics
│ └── Preprocessed_good/ # Processed and scaled datasets
│ ... # (This same structure repeats for all folders above)
│
├── data/ # Input and output datasets for the pipeline
│ ├── High/ # Output from high-level attribute computation
│ │ └── profiles_output.csv
│ └── Low/ # Output from low-level enrichment
│ └── enriched_output_by_src.csv
│
└── README.md # Main project documentation
- Pedro Beltrán López -> pedro.beltranl@um.es
- Manuel Gil Pérez -> mgilperez@um.es
- Pantaleone Nespoli -> pantaleone.nespoli-@um.es