Automates the journey from raw PCAPs to context‑rich adversary profiles: transforms packets into enriched flows, cleans and groups attack patterns, applies hierarchical rules to infer attacker motivations and skills, maps findings into a STIX‑based knowledge graph, and includes an ML module for instant attribute prediction.
Raw network capture files.
Direct output from CICFlowMeter. Includes statistical characteristics per flow such as: Flow Duration, Fwd Packet Length Mean, Bwd IAT Max, etc. Base data for creating groupings and enrichment.
These characteristics describe in detail the behavior of grouped traffic, allowing you to identify suspicious patterns, statistical anomalies, or traces of cyberattacks. They are classified by category:
📌 Identification and Context
- id: Unique identifier (UUID) of the traffic group.
- threat_type, threat, attack, stage: Semantic tags if available (threat type, attack, kill chain stage).
- ips_src, ips_dst: List of unique source and destination IPs within the group.
- ports_src, ports_dst: List of unique source and destination ports.
📌 Volume and Count
- total_flows: Total number of network flows.
- total_packets_sent, total_packets_received: Total packets sent and received.
- total_bytes_sent, total_bytes_received: Bytes sent and received.
- unique_dst_ports_count, num_unique_src_ports: Number of unique destination and source ports.
- most_frequent_dst_port, most_frequent_src_port: Most common port observed.
- unique_dst_ips: Number of unique destination IPs.
📌 Time and Duration
- avg_flow_duration, std_flow_duration, iqr_flow_duration: Average, standard deviation, and interquartile range of flow duration.
- skewness_flow_duration, kurtosis_flow_duration: Skewness and kurtosis of flow duration.
- first_seen, last_activity: Timestamp of the first and last observed flow.
- active_time_ratio: Active time over total time.
- total_idle_time, avg_idle_time_between_flows, max_idle_time: Idle time measures.
- avg_time_between_connections, avg_time_between_flows: Average time between connections or flows.
- flow_frequency_per_minute: Number of flows per minute.
- extreme_flow_duration_count: Count of flows with extreme duration.
- time_diff: Time difference between the first and last flow.
📌 Size and Packets
- avg_packet_size: Average packet size.
- variance_packet_size: Packet size variance.
- fwd_packet_size_range, bwd_packet_size_range: Packet size range in the forward and backward directions.
- fwd_packet_size_to_mean_ratio, bwd_packet_size_to_mean_ratio: Ratio between individual size and average.
- flow_variance_ratio, high_variance_flow: Traffic variability indicators.
📌 Inter-arrival Time (IAT)
- mean_iat, std_iat, max_iat, min_iat: Metrics for the time between packet arrivals.
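The IAT metrics above can be derived directly from packet timestamps. A minimal sketch (the function name and the sample timestamps are illustrative, not taken from the project's code):

```python
import numpy as np

def iat_metrics(timestamps):
    """Inter-arrival-time statistics from packet timestamps (seconds)."""
    iat = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    return {
        "mean_iat": float(iat.mean()),
        "std_iat": float(iat.std()),
        "max_iat": float(iat.max()),
        "min_iat": float(iat.min()),
    }

# Hypothetical packet arrival times (seconds)
m = iat_metrics([0.0, 0.1, 0.3, 0.6, 1.0])
```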
📌 TCP Indicators / Flags
- syn_count, ack_count, fin_count, rst_count, psh_count, urg_count: Number of packets with each TCP flag.
- syn_ack_ratio, rst_syn_ratio, rst_syn_flag_ratio, fin_to_syn_ratio, urg_to_ack_ratio, psh_to_syn_ratio: Ratios between flag counts, useful for detecting scans or attacks.
📌 Distributions and Entropy
- protocol_distribution: Proportion of protocols (TCP, UDP, etc.).
- udp_count, tcp_count: Total count per protocol.
- udp_tcp_traffic_ratio: Ratio between UDP and TCP traffic.
- entropy_dst_ips, entropy_dst_ports: Measure of entropy (diversity) in destination IPs or ports.
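Entropy here is the standard Shannon entropy over the observed value distribution: uniform spread across many ports or IPs gives a high value, while repeated traffic to a single target gives zero. A minimal sketch:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (bits) of observed values, e.g. destination ports."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A scan touching many distinct ports yields high entropy;
# repeated traffic to one port yields zero entropy.
scan_like = shannon_entropy([22, 23, 80, 443])  # 4 equally likely ports
single = shannon_entropy([443, 443, 443, 443])  # one port
```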
📌 Aggregate Ratios and Behaviors
- ratio_fwd_to_bwd_packets, ratio_fwd_to_bwd_bytes: Bidirectional information flow.
- byte_sent_received_ratio: Ratio between bytes sent and received.
- flow_duration_to_packet_ratio: Ratio between flow duration and number of packets.
- unique_ports_per_dst_ip_ratio: Metric of unique port usage per destination IP.
- active_to_idle_ratio: Ratio between active and idle time.
📌 Connection Dynamics
- unique_dst_ports_group, unique_dst_ips_group: Structural diversity within the group.
- unique_sessions_count: Number of sessions (unique IP:port combinations).
- port_change_frequency, dst_ip_change_frequency: Frequency of port or destination IP changes.
- num_flows_in_high_traffic_periods: Flows detected during periods of high activity.
🚨 Attack Indicators
- ddos_indicator: Heuristic indicator of DDoS attacks.
- port_scan_indicator: Probability of port scanning.
- brute_force_indicator: Probability of brute force attack.
- data_exfiltration_indicator: Signs of data exfiltration.
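These indicators combine several of the low-level features above into a heuristic score. The following sketch illustrates the idea for port scanning only; the thresholds and weights are made up for this example and are not the ones used by the project's scripts:

```python
def port_scan_indicator(unique_dst_ports, total_flows, avg_packet_size,
                        syn_count, ack_count):
    """Heuristic score in [0, 1]: many distinct ports probed with small,
    mostly-unanswered SYN packets suggests a port scan.
    Thresholds are illustrative only."""
    score = 0.0
    if total_flows and unique_dst_ports / total_flows > 0.8:
        score += 0.5   # almost every flow hits a new port
    if avg_packet_size < 100:
        score += 0.25  # probe packets are small
    if syn_count > 0 and ack_count / syn_count < 0.1:
        score += 0.25  # few SYNs are ever answered
    return score

s = port_scan_indicator(unique_dst_ports=950, total_flows=1000,
                        avg_packet_size=60, syn_count=1000, ack_count=20)
```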
These characteristics represent attributes of potential attackers. They are generated from enriched traffic views to support threat hunting, intelligence analysis, and attribution activities.
📌 Identification and Context
- Id: Profile identifier.
- IPs: List of observed IPs.
- Target, PreferredTarget: Primary and secondary targets of the attacker.
- FirstSeen, LastActivity: First and last trace of activity.
- Country: Country associated with the attacker's infrastructure.
📌 Attacker Profile
- Profile: Main type of attacker (APT, Hacktivist, Criminal...).
- ProfileDist: Probabilistic distribution among profiles.
- Motivation: Motivation (economic, ideological, espionage...).
- Affiliation: Relationship with known groups, governments, or actors.
- Attitude: Level of aggressiveness or exposure of the attacker.
- Knowledge: Level of technical or environmental knowledge.
- Skills: Detected or inferred skills.
- AutomationLevel: Level of automation of actions.
- Evasion: Level and types of evasive techniques used.
📌 Techniques and Tools
- TTPs: Observed techniques, tactics, and procedures (MITRE).
- KillChainPhase: Phase(s) of the attack in the Kill Chain model.
- Tools: Tools used (exploits, RATs, backdoors, etc.).
📌 Campaigns and Groups
- ThreatGroup: Known group (if association is successful).
- Campaigns: Previous campaigns related to current behavior.
📌 Risk Assessment
- RiskLevel: Estimated risk level.
- Confidence: Degree of certainty in the generated profile.
📌 Others
- Comments: Additional observations, semantic explanations, or automatically generated inferences.
This section describes the software requirements and dependencies necessary for the proper functioning of the traffic analysis, attacker profiling, Neo4j visualization, and machine learning model training scripts.
Python 3.7 or higher is required.
Required libraries:
pandas
numpy
scikit-learn
imbalanced-learn
joblib
matplotlib
geoip2
ipwhois
neo4j
transformers
These libraries enable statistical analysis, data preprocessing, visualization, geographic enrichment, and NLP model execution for advanced profiling.
CICFlowMeter is the essential tool for converting .pcap traffic files into labeled .csv files that the analysis scripts can process.
An operational Neo4j database is required to store and visualize attacker profiles as a graph.
The ML environment must support:
- Multi-class classification models
- Class balancing (via SMOTE or balanced RandomForest)
- Custom transformations with ColumnTransformer
- Exporting models and metrics
- Folder structure for batch processing by class
This requires that the ML, visualization, and data manipulation libraries are correctly installed, as detailed in the Python libraries block.
Once installation is complete, the general workflow follows these steps:
- Traffic conversion: use CICFlowMeter to transform .pcap to .csv.
- Low-level analysis: process the .csv flow files with Low-Level.py to generate enriched views grouped by source IP, destination IP, IP pair, or attack tags.
- High-level profiling: use High-Level.py on the enriched output to generate attacker profiles with 23 analytical attributes.
- Graph visualization:
- Define the node and relationship schema by running neo4j_schema.cypher.
- Load the generated profiles into Neo4j using the Dataset-Neo.py script.
- ML model training:
- Preprocess the data if necessary.
- Run ML.py to train classification models and evaluate metrics.
- Use script_ML.sh if batch automation across multiple folders/classes is required.
[.pcap files]
↓
[CICFlowMeter]
↓
[Low-Level.py → enriched views]
↓
[High-Level.py → attacker profiles]
↓
[Dataset-Neo.py → Neo4j graph]
↓
[ML.py / script_ML.sh → classification models]
The first step in the flow is to transform the network capture files (.pcap) into labeled network flows (.csv), which will serve as input for subsequent analyses.
Required command:
cicflowmeter -f example.pcap -c flows.csv
Parameters:
- -f example.pcap: path to the network capture file to be processed.
- -c flows.csv: name of the .csv file that will be generated as output, containing the network flows.
This flows.csv file will include statistics per flow such as source/destination IP addresses, ports, duration, number of packets, bytes transferred, and more. It is the starting point for running the Low-Level.py script.
This script takes a previously generated .csv file of network flows (for example, using CICFlowMeter) as input and produces four enriched views grouped from different perspectives. These views are essential for statistical analysis, anomaly pattern detection, and subsequent attacker profiling.
🔍 What exactly does it do?
It groups network flows by:
- Source IP (_by_src.csv)
- Destination IP (_by_dst.csv)
- IP pair (_by_pair.csv)
- Conditional grouping (_conditional.csv) — based on whether the attack is DoS, DDoS, or APT.
It aggregates network behavior metrics by group:
- Average flow duration
- Total number of packets and bytes sent/received
- TCP flags, packet/byte rate, IP entropy, etc.
It adds additional context:
- Lists of unique IPs and ports (ips_src, ips_dst, ports_src, ports_dst)
- Unique group identifier (id, UUID v4)
- NaN placeholders for any columns missing from the original dataset, to keep the output structure consistent.
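The grouping and aggregation described above amount to a pandas groupby per view. A minimal sketch of the `_by_src` case, using a miniature hypothetical flow table (real CICFlowMeter CSVs have many more columns):

```python
import uuid
import pandas as pd

# Hypothetical miniature flow table.
flows = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "dst_ip": ["10.0.0.9", "10.0.0.8", "10.0.0.9"],
    "dst_port": [80, 443, 80],
    "flow_duration": [1.5, 2.5, 4.0],
    "tot_fwd_pkts": [10, 20, 5],
})

# Group by source IP and aggregate, roughly as the *_by_src view does.
by_src = flows.groupby("src_ip").agg(
    avg_flow_duration=("flow_duration", "mean"),
    total_packets_sent=("tot_fwd_pkts", "sum"),
    ips_dst=("dst_ip", lambda s: sorted(set(s))),
    ports_dst=("dst_port", lambda s: sorted(set(s))),
).reset_index()

# One UUID v4 per traffic group, as in the real output.
by_src["id"] = [str(uuid.uuid4()) for _ in range(len(by_src))]
```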
📤 Generated outputs
Running this script with an input file generates the following files:
output_by_src.csv # Grouped by src_ip
output_by_dst.csv # Grouped by dst_ip
output_by_pair.csv # Grouped by src_ip + dst_ip
output_conditional.csv # Conditional grouping based on attack type
python3 Low-Level.py -i flows.csv -o output
- -i flows.csv: previously generated .csv file with network flows.
- -o output: prefix for the output files that will be generated.
The resulting files will be used as input for high-level profiling with High-Level.py.
This script implements a hybrid expert system to infer attacker profiles from enriched network traffic. It uses a fact base (the enriched traffic dataset), a rule-encoded knowledge base, and inference techniques.
🧠 What does it do?
From a .csv file with grouped and enriched malicious traffic (generated by Low-Level.py), the script:
- Assigns one of 12 attacker profiles (nation-state, spy, criminal, etc.).
- Calculates strategic attributes such as:
- Motivation (economic, espionage, sabotage...)
- Skill and knowledge level
- MITRE techniques (TTPs) used
- Phase of the attack in the Kill Chain
- APT group or campaign (if there is a match)
- Probable affiliation (state, contractor, criminal, etc.)
- Optionally, it generates automated comments with NLP models (FLAN-T5).
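Conceptually, the expert system scores each candidate profile against rules fired by the low-level indicators, then normalizes the scores into a distribution. The sketch below is a toy version with made-up rules and weights; the real knowledge base in High-Level.py is far larger:

```python
def infer_profile(group):
    """Toy forward-chaining rules over an enriched traffic group (a dict of
    low-level features). Scores and thresholds are illustrative only."""
    scores = {"criminal": 0.0, "nation-state": 0.0, "script-kiddie": 0.0}
    if group.get("brute_force_indicator", 0) > 0.5:
        scores["criminal"] += 0.4          # credential attacks are usually for profit
    if group.get("data_exfiltration_indicator", 0) > 0.5:
        scores["nation-state"] += 0.5      # stealthy exfiltration hints at espionage
    if group.get("port_scan_indicator", 0) > 0.5 and group.get("entropy_dst_ports", 0) > 3:
        scores["script-kiddie"] += 0.3     # noisy, broad scanning
    total = sum(scores.values()) or 1.0
    dist = {k: v / total for k, v in scores.items()}  # ProfileDist
    profile = max(dist, key=dist.get)                 # Profile
    return profile, dist

profile, dist = infer_profile({
    "data_exfiltration_indicator": 0.9,
    "brute_force_indicator": 0.1,
})
```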
📥 Required files
Mandatory input:
- Enriched traffic CSV: enriched_output.csv
Required semantic maps:
- network_observable_attack_mapping_v2.csv: rules for detecting TTPs.
- network_tool_mapping_v1.csv: rules for detecting tools.
- network_evasion_mapping_v1.csv: rules for evasion techniques.
External database:
- groups.csv: database of known groups.
- campaigns.csv: database of campaigns with techniques and dates.
- GeoLite2-Country.mmdb: GeoIP database for location.
📤 Generated outputs
CSV of the attacker model:
Id, IPs, Target, PreferredTarget, FirstSeen, LastActivity,
Country, AutomationLevel, Evasion, TTPs, KillChainPhase,
RiskLevel, Confidence, Tools, Skills,
Profile, ProfileDist, Motivation, Knowledge,
Attitude, Affiliation, ThreatGroup, Campaigns, Comment
python3 High-Level.py enriched_output.csv profiles_output.csv \
-m network_observable_attack_mapping_v2.csv \
-t network_tool_mapping_v1.csv \
-e network_evasion_mapping_v1.csv \
-c iot
Arguments:
- enriched_output.csv: input file with enriched traffic.
- profiles_output.csv: output file name.
- -m: file with the mapping of observables to TTPs (default: network_observable_attack_mapping_v2.csv)
- -t: file with the mapping of tools.
- -e: file with the mapping of evasion techniques.
- -c: environment context (iot, 5g, cloud, ics)
Once the attacker profiles (profiles_output.csv) have been generated using High-Level.py, you can upload them to Neo4j as a knowledge graph with entities, tools, techniques, campaigns, groups, and their relationships.
🏗️ 4.1 Inserting the schema into Neo4j
The neo4j_schema.cypher file contains the necessary instructions to:
- Create unique indexes on key nodes (:Entity, :Tool, :TTP, etc.)
- Define integrity constraints (for example, avoid duplicate nodes by ID)
- Establish the model of nodes and relationships in the graph
✅ Instruction to execute the schema:
cat neo4j_schema.cypher | cypher-shell -u neo4j -p <your_password>
Make sure the Neo4j server is running and reachable over the bolt:// protocol (port 7687 by default).
🧾 4.2 Insert the data generated by High-Level.py
The Dataset-Neo.py script loads the .csv profile file and performs the following:
- Creation of :Entity nodes for each profile (one per traffic group)
- Creation of related nodes: :TTP, :Tool, :Campaign, :ThreatGroup
- Establish relationships:
(:Entity)-[:USES]->(:Tool)
(:Entity)-[:APPLIES]->(:TTP)
(:Entity)-[:BELONGS_TO]->(:ThreatGroup)
(:Entity)-[:PART_OF]->(:Campaign)
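These MERGE operations can be sketched in Python with the official neo4j driver. The helper below builds the Cypher for one profile; the function name is illustrative and, for readability, it interpolates values into the query string, whereas real loading code should pass parameters instead:

```python
# from neo4j import GraphDatabase  # only needed when actually connecting

def profile_to_cypher(profile):
    """Build MERGE clauses for one attacker profile, mirroring the
    relationships Dataset-Neo.py creates. Illustrative sketch only:
    values are interpolated here for readability, but production code
    should use query parameters to avoid Cypher injection."""
    stmts = ["MERGE (e:Entity {id: $id})"]
    for i, tool in enumerate(profile.get("Tools", [])):
        stmts.append(f"MERGE (t{i}:Tool {{name: '{tool}'}}) MERGE (e)-[:USES]->(t{i})")
    for i, ttp in enumerate(profile.get("TTPs", [])):
        stmts.append(f"MERGE (p{i}:TTP {{id: '{ttp}'}}) MERGE (e)-[:APPLIES]->(p{i})")
    return "\n".join(stmts)

cypher = profile_to_cypher({"Id": "abc", "Tools": ["hydra"], "TTPs": ["T1110"]})

# To actually load it (assuming a local instance with these credentials):
# driver = GraphDatabase.driver("bolt://localhost:7687",
#                               auth=("neo4j", "your_password"))
# with driver.session() as session:
#     session.run(cypher, id="abc")
```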
⚙️ Preliminary configuration
Open the Dataset-Neo.py file and modify the following variables to connect to your Neo4j instance:
URI = "bolt://localhost:7687"
USER = "neo4j"
PASSWORD = "your_password"
CSV_PATH = "profiles_output.csv"
Once configured:
python3 Dataset-Neo.py
This will automatically create the complete graph from the profiles, allowing you to query it visually or using Cypher.
Once you have the attacker profiles generated with High-Level.py, you can use this data to train supervised classification models, which allow you to:
- Predict the type of attacker
- Classify anomalous behaviors
- Evaluate the effectiveness of countermeasures
🎯 What does ML.py do?
This script:
- Automatically preprocesses the columns of the dataset
- Trains multiple models (including Random Forests and Gradient Boosting)
- Applies class balancing techniques (such as SMOTE and balanced RandomForest)
- Evaluates the models
python3 ML.py -i profiles_output.csv -t Profile
- -i: Path to the input .csv file
- -t: Name of the target column to predict (e.g., Profile, Motivation, Affiliation...)
📤 Outputs generated
After running, the script will automatically generate:
- metrics_cv.csv: cross-validation results (mean ± deviation)
- metrics_train.csv: metrics in the training data
- metrics_test.csv: metrics in the test data
- .pkl files with the trained models (if enabled)
- Optional graphs if visualization generation is enabled
⚙️ Automation with script_ML.sh (optional)
If you have datasets organized into subdirectories (for example, by attack type), you can run mass training with:
chmod +x script_ML.sh
./script_ML.sh
This script:
- Automatically preprocesses each subdirectory
- Runs batch training with ML.py
- Organizes results and models by folder
📁 Repository Structure
The repository is structured to cover all phases of the system in a modular way: traffic analysis, enrichment, attacker profiling, knowledge graph generation, and machine learning model training.
Repository-Name/
│
├── docs/ # Documentation and visualizations
│ ├── Correlations/ # Correlation heatmaps
│ │ ├── High_level_correlation_heatmap.png
│ │ ├── High_vs_low_level_correlation_heatmap.png
│ │ └── low_level_correlation_heatmap.png
│ ├── Implementation/ # System design and workflow
│ │ ├── Framework.pdf # General framework diagram
│ │ ├── High-Order.pdf # Calculation order for high-level attributes
│ │ └── Implementation.pdf # Technical implementation schema
│ └── KG/ # Sample queries for the Knowledge Graph (Neo4j)
│ └── queries.txt
│
├── src/ # System source code
│ ├── High/ # Scripts for high-level attribute calculation
│ │ └── High-Level.py
│ ├── Low/ # Scripts for traffic enrichment
│ │ └── Low-Level.py
│ └── KG/ # Loading and structuring graph in Neo4j
│ ├── Dataset-Neo.py
│ └── neo4j_schema.cypher
│
├── tests/ # ML model evaluation and training
│ ├── ML/ # General training
│ │ ├── ML.py
│ │ ├── Script_ML.sh
│ │ └── Results.png # Graphical comparison of results
│ ├── Affiliation/
│ ├── Attitude/
│ ├── Knowledge/
│ ├── Motivation/
│ ├── Profile/
│ ├── RiskLevel/
│ └── Skills/
│ ├── Merged.csv # Unified dataset for training
│ ├── Models_good/ # Trained models and metrics
│ └── Preprocessed_good/ # Processed and scaled datasets
│ ... # (This same structure repeats for all folders above)
│
├── data/ # Input and output datasets for the pipeline
│ ├── High/ # Output from high-level attribute computation
│ │ └── profiles_output.csv
│ └── Low/ # Output from low-level enrichment
│ └── enriched_output_by_src.csv
│
└── README.md # Main project documentation
- Pedro Beltrán López -> pedro.beltranl@um.es
- Manuel Gil Pérez -> mgilperez@um.es
- Pantaleone Nespoli -> pantaleone.nespoli-@um.es