A comprehensive data analysis system for identifying and clustering similar tactical plays in football (soccer) matches using hierarchical clustering and advanced feature engineering. Built with Object-Oriented Programming principles and featuring an interactive GUI for exploration.
- Overview
- Features
- Quick Start
- Project Architecture
- Play Definition & Extraction Algorithm
- Feature Engineering
- Clustering Algorithm
- Configuration Parameters
- GUI Usage
- Code Structure
- Customization Guide
- Algorithm Tuning
- Output Files
- Requirements
This system analyzes football match event data to:
- Extract tactical plays from raw event sequences
- Engineer features that capture tactical characteristics
- Cluster similar plays using hierarchical clustering
- Generate descriptive names for each tactical pattern
- Provide interactive exploration through a GUI
- Visualize plays on football field diagrams
- Compare plays side-by-side with detailed metrics
A play is defined as a sequence of events by a single team that:
- Starts with a forward pass (PA event)
- Contains at least 2 passes (PA or CR - cross events)
- Ends in the attacking third (final ball position must be x ≥ 20 in normalized coordinates)
- Ends with a terminal event:
  - Possession lost (team changes, LO, CA, TA events)
  - Shot taken (SH event - may result in GOAL)
- Same team maintains possession throughout
- Intermediate events (dribbles, touches, etc.) are included but don't count toward the pass requirement
- Minimum forward progress if starting in defensive third (≥5 meters)
Field Positioning: The system divides the pitch into three zones:
- Defensive third: x ≤ -16.67 (own half defensive zone)
- Middle third: -16.67 < x < 16.67 (midfield zone)
- Attacking third: x ≥ 16.67 (opponent's defensive zone)
Only plays ending in the attacking third are extracted to focus on genuine attacking patterns.
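The zone boundaries above can be expressed as a small helper. This is a minimal sketch: the function name `classify_third` and the constant names are illustrative and not taken from the project code.

```python
# Boundaries taken from the zone list above (normalized x coordinates).
DEFENSIVE_THIRD_MAX_X = -16.67
ATTACKING_THIRD_MIN_X = 16.67


def classify_third(x: float) -> str:
    """Return the pitch zone for a normalized x coordinate."""
    if x <= DEFENSIVE_THIRD_MAX_X:
        return "defensive"
    if x >= ATTACKING_THIRD_MIN_X:
        return "attacking"
    return "middle"


print(classify_third(-30.0))  # defensive
print(classify_third(0.0))    # middle
print(classify_third(25.0))   # attacking
```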
Example Play:
Pass (PA) → Dribble (DR) → Pass (PA) → Touch (IT) → Shot (SH) at x=25 ✓ VALID (2 passes + shot in attacking third)
Pass (PA) → Pass (PA) → Team Change at x=22 ✓ VALID (2 passes + possession lost in attacking third)
Pass (PA) → Shot (SH) ✗ INVALID (only 1 pass)
Pass (PA) → Pass (PA) → Shot (SH) at x=10 ✗ INVALID (not in attacking third)
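The four examples above can be checked with a short validity function. This is a sketch under simplifying assumptions, not the project's extractor: `is_valid_play` and its parameters are hypothetical, a team change is modeled as a `possession_changed` flag, and the forward-pass direction, duration, and forward-progress checks are left out. The threshold defaults to the x ≥ 20 figure from the definition; pass 16.67 to use the zone boundary instead.

```python
from typing import Sequence

PASS_EVENTS = {"PA", "CR"}                 # events that count toward the pass minimum
TERMINAL_EVENTS = {"SH", "LO", "CA", "TA"}  # events that end a play


def is_valid_play(event_types: Sequence[str], final_x: float,
                  possession_changed: bool = False,
                  min_passes: int = 2,
                  attacking_third_x: float = 20.0) -> bool:
    """Check one same-team event sequence against the play definition."""
    if not event_types or event_types[0] != "PA":
        return False  # a play must start with a pass
    pass_count = sum(1 for e in event_types if e in PASS_EVENTS)
    if pass_count < min_passes:
        return False
    ended = possession_changed or event_types[-1] in TERMINAL_EVENTS
    return ended and final_x >= attacking_third_x


print(is_valid_play(["PA", "DR", "PA", "IT", "SH"], final_x=25))          # True
print(is_valid_play(["PA", "PA"], final_x=22, possession_changed=True))   # True
print(is_valid_play(["PA", "SH"], final_x=25))                            # False
print(is_valid_play(["PA", "PA", "SH"], final_x=10))                      # False
```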
- Automatic play extraction from JSON event data
- Multi-dimensional feature engineering (21 features per play)
- Hierarchical clustering using Ward's linkage method
- Automatic cluster naming based on tactical characteristics
- Dynamic re-clustering with adjustable threshold
- Field plots showing ball movement paths
- Side-by-side comparison of plays
- Color-coded outcomes (goal, shot, possession lost)
- Player position tracking throughout plays
- Browse clusters with descriptive names
- List all plays in a cluster with details
- View cluster statistics (goals, shots, averages)
- Compare any two plays with similarity scores
- Adjust clustering threshold and re-analyze in real-time
- Install required packages:
  pip install -r requirements.txt
- Ensure your data is in the correct location:
  - Place JSON event files in the Event Data/ folder
  - JSON files should follow StatsBomb-style event format
Option 1: GUI Application (Recommended)
python run_gui.py
Option 2: Command Line
from src.main import TacticalAnalyzer
analyzer = TacticalAnalyzer()
results = analyzer.run_analysis()
browser = analyzer.create_browser(cluster_id=1)
browser.list()
browser.compare(1, 2)
The system follows SOLID principles for maintainability and extensibility:
📦 Project Root
├── 📁 Event Data/ # Input JSON files (match event data)
├── 📁 output/ # Generated analysis results
│ ├── all_plays.csv
│ ├── cluster_analysis.csv
│ ├── cluster_summary.csv
│ └── detailed_clusters.json
├── 📁 src/ # Core source code
│ ├── __init__.py
│ ├── config.py # Configuration management
│ ├── models.py # Data models (Play, PlayEvent, etc.)
│ ├── utils.py # Utility functions
│ ├── data_loader.py # JSON parsing & play extraction
│ ├── feature_engineer.py # Feature calculation
│ ├── clustering.py # Clustering algorithms
│ ├── visualizer.py # Field visualization
│ ├── browser.py # Interactive exploration
│ └── main.py # Main analysis pipeline
├── gui_app.py # Tkinter GUI application
├── run_gui.py # GUI launcher
├── requirements.txt # Python dependencies
└── README.md # This file
Each module has one clear purpose:
- config.py → Manages configuration settings
- models.py → Defines data structures
- data_loader.py → Loads and parses data
- feature_engineer.py → Calculates features
- clustering.py → Performs clustering
- visualizer.py → Creates visualizations
- browser.py → Provides interactive interface
- Easy to add new event types without modifying existing code
- New clustering algorithms can be added by extending PlayClusterer
- New features can be added in feature_engineer.py without breaking existing features
- All components work with abstract interfaces
- PlayExtractor, PlayClusterer, FeatureEngineer can be replaced with alternative implementations
- Small, focused classes with minimal public methods
- GUI components separate from core analysis logic
- Components depend on configuration objects, not hardcoded values
- Easy to test with mock configurations
- Database/storage layer abstracted through PathConfig
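The sketch below illustrates two of the points above: swapping in an alternative clusterer behind a shared interface, and injecting a configuration object so that tests can pass a mock. Every name here (`MockConfig`, `BaseClusterer`, `GapClusterer`) is hypothetical, and the actual PlayClusterer interface may look different.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class MockConfig:
    """Stand-in for a configuration object (hypothetical name)."""
    clustering_threshold: float = 12.0


class BaseClusterer(ABC):
    """Hypothetical interface; the real PlayClusterer may differ."""

    def __init__(self, config: MockConfig) -> None:
        self.config = config  # injected rather than read from a global

    @abstractmethod
    def cluster(self, values: Sequence[float]) -> Dict[int, List[float]]:
        ...


class GapClusterer(BaseClusterer):
    """Toy 1-D clusterer: start a new cluster when the gap exceeds the threshold."""

    def cluster(self, values: Sequence[float]) -> Dict[int, List[float]]:
        clusters: Dict[int, List[float]] = {}
        cluster_id = 0
        previous = None
        for v in sorted(values):
            if previous is None or v - previous > self.config.clustering_threshold:
                cluster_id += 1
            clusters.setdefault(cluster_id, []).append(v)
            previous = v
        return clusters


clusterer = GapClusterer(MockConfig(clustering_threshold=5.0))
print(clusterer.cluster([1.0, 2.0, 10.0, 11.0, 30.0]))
```

Because the threshold arrives through the constructor, a test can build the clusterer with any configuration without touching global state.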
The play extraction uses a sliding window approach:
def extract_plays(events):
    """
    Scan through all events looking for valid play sequences.
    """
    plays = []
    i = 0
    while i < len(events):
        # Try to extract a play starting from position i
        play_data = try_extract_play(events, i)
        if play_data:
            play, next_idx = play_data
            plays.append(play)
            i = next_idx  # Jump to end of this play
        else:
            i += 1  # Move to next event
    return plays
Step 1: Find Starting Point
- Scan for forward pass (PA event)
- Verify pass direction matches team's attack direction
- Calculate attack direction based on:
  - Stadium metadata
  - Team (home/away)
  - Period (1st/2nd half)
Step 2: Collect Same-Team Events
while current_event.team_id == starting_team_id:
    # Add event to play
    if event_type in ['PA', 'CR']:
        pass_count += 1
    if event_type in ['SH', 'LO', 'CA', 'TA']:
        # Terminal event - check if valid play
        if pass_count >= 2:
            create_play()
        break
Step 3: Validation
- Minimum passes: 2 (configurable in config.py)
- Duration: 3-30 seconds (configurable)
- Forward progress: ≥5 meters (configurable)
Step 4: Play Creation
- Extract all metadata (team, match, time)
- Calculate features (see Feature Engineering)
- Normalize coordinates to standard field orientation
- Determine outcome (GOAL, SHOT, POSSESSION_LOST, etc.)
This definition captures meaningful attacking sequences:
- 2+ passes filters out simple turnovers
- Terminal event ensures plays have clear outcomes
- Same team requirement maintains tactical coherence
- Intermediate events preserve full context (dribbles, touches)
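The scan-and-collect steps can be put together in a runnable form. This is a simplified sketch, not the project's PlayExtractor: events are plain dictionaries with assumed `team` and `type` keys, and the forward-pass direction check, the duration and forward-progress filters, and the attacking-third check are omitted.

```python
from typing import Dict, List, Optional, Tuple

PASS_EVENTS = {"PA", "CR"}
TERMINAL_EVENTS = {"SH", "LO", "CA", "TA"}
Event = Dict[str, object]


def try_extract_play(events: List[Event], start: int,
                     min_passes: int = 2) -> Optional[Tuple[List[Event], int]]:
    """Return (play_events, next_index) if a play starts at `start`, else None."""
    if events[start]["type"] != "PA":
        return None
    team = events[start]["team"]
    play: List[Event] = []
    passes = 0
    i = start
    while i < len(events) and events[i]["team"] == team:
        event = events[i]
        play.append(event)
        i += 1
        if event["type"] in PASS_EVENTS:
            passes += 1
        if event["type"] in TERMINAL_EVENTS:
            break
    # The sequence ended on a terminal event, a team change, or end of data.
    return (play, i) if passes >= min_passes else None


def extract_plays(events: List[Event]) -> List[List[Event]]:
    """Scan the event list and collect every valid play."""
    plays = []
    i = 0
    while i < len(events):
        result = try_extract_play(events, i)
        if result:
            play, i = result  # jump past the extracted play
            plays.append(play)
        else:
            i += 1
    return plays


events = [
    {"team": 1, "type": "PA"}, {"team": 1, "type": "DR"},
    {"team": 1, "type": "PA"}, {"team": 1, "type": "SH"},
    {"team": 2, "type": "PA"}, {"team": 2, "type": "LO"},
    {"team": 1, "type": "PA"}, {"team": 1, "type": "CR"},
    {"team": 2, "type": "CA"},
]
plays = extract_plays(events)
print([[e["type"] for e in p] for p in plays])
```

The first play ends on a shot, the team 2 sequence is rejected because it contains only one pass, and the last play ends when possession changes.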
The system calculates 21 features per play grouped into 5 categories:
Per-play counts of common event types:
- PA (Pass), SH (Shot), CR (Cross), IT (Interception/Touch)
- LO (Loss), CA (Clearance), DR (Dribble), TC (Touch)
| Feature | Description | Calculation | Tactical Meaning |
|---|---|---|---|
| delta_x | Forward progress | final_x - initial_x | Penetration depth |
| delta_y | Lateral movement | abs(final_y - initial_y) | Width of attack |
| max_x | Deepest penetration | max(all_x_coords) | Threat level |
| total_distance | Ball travel distance | Σ√(Δx² + Δy²) | Play complexity |
| num_events | Event count | Integer | Play length |
| duration | Play length | Seconds | Tempo |
| Feature | Description | Tactical Use |
|---|---|---|
| start_x | Horizontal starting position | Identifies build-up zone |
| start_y | Absolute lateral starting position | Distinguishes wing vs center starts |
| Feature | Description | Calculation | Meaning |
|---|---|---|---|
| y_variance | Lateral movement variance | var(all_y_coords) | Straight vs diagonal path |
| final_y | Ending lateral position | abs(final_y) | Wing vs center finish |
| Feature | Description | Calculation | Use |
|---|---|---|---|
| avg_attackers_ahead | Offensive support | mean(attackers ahead of ball) | Formation analysis |
| avg_defenders_ahead | Defensive pressure | max(1, mean(defenders ahead)) | Resistance level (min 1 for GK) |
| wing_side | Attack position | 'WING' if abs(y) > 15 else 'CENTER' | Positional categorization |
Note: The avg_defenders_ahead is guaranteed to be at least 1, accounting for the goalkeeper who is always present.
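The path-based formulas in the tables can be computed directly from the ball coordinates of a play. The sketch below is illustrative: `spatial_features` is a hypothetical helper, it assumes one (x, y) pair per event, it uses the population variance for y_variance, and it assumes wing_side is decided from the final y position, none of which the tables specify.

```python
import numpy as np


def spatial_features(xs, ys, wing_threshold=15.0):
    """Compute path-based features from ball coordinates (one entry per event)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Length of each segment between consecutive ball positions.
    step_lengths = np.hypot(np.diff(xs), np.diff(ys))
    return {
        "delta_x": xs[-1] - xs[0],
        "delta_y": abs(ys[-1] - ys[0]),
        "max_x": xs.max(),
        "total_distance": step_lengths.sum(),
        "y_variance": ys.var(),  # population variance (assumption)
        "final_y": abs(ys[-1]),
        # Assumption: the wing/center label uses the final y position.
        "wing_side": "WING" if abs(ys[-1]) > wing_threshold else "CENTER",
    }


features = spatial_features(xs=[0, 3, 6, 11], ys=[0, 4, 8, 20])
for name, value in features.items():
    print(f"{name}: {value}")
```

For this path the three segments have lengths 5, 5 and 13, so total_distance is 23 while delta_x is only 11, which is the gap between ball travel and net forward progress that the two features are meant to capture.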
For clustering, features are combined into a 21-dimensional vector:
def get_feature_vector(play):
    """
    21-dimensional feature vector for clustering.
    """
    return np.array([
        # Event type counts (8)
        count_PA, count_SH, count_CR, count_IT,
        count_LO, count_CA, count_DR, count_TC,
        # Spatial features (6)
        play.delta_x, play.delta_y, play.max_x,
        play.total_distance, play.num_events, play.duration,
        # Starting position (2)
        start_x, start_y,
        # Trajectory shape (2)
        y_variance, final_y,
        # Tactical features (3)
        play.avg_attackers_ahead, play.avg_defenders_ahead,
        1.0 if play.wing_side == 'WING' else 0.0
    ])
The system uses Agglomerative Hierarchical Clustering with Ward's linkage:
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
# Calculate pairwise distances
distance_matrix = pdist(feature_matrix, metric='euclidean')
# Build linkage tree using Ward's method
linkage_matrix = linkage(distance_matrix, method='ward')
# Cut tree at threshold
cluster_labels = fcluster(linkage_matrix,
                          t=clustering_threshold,
                          criterion='distance')
✅ Advantages:
- No need to specify number of clusters beforehand
- Produces dendrogram showing hierarchical relationships
- Works well with Euclidean distance in tactical feature space
- Ward's method minimizes within-cluster variance
❌ Limitations:
- O(n² log n) time complexity
- Sensitive to outliers
- Can't undo merges
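For a runnable view of the whole approach, the sketch below clusters a synthetic feature matrix with Ward's linkage, drops clusters with fewer than two members, and renumbers the rest by size. The helper name `cluster_and_rank` and the synthetic data are illustrative and stand in for real play feature vectors; the project's own pipeline may differ in detail.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_and_rank(feature_matrix, threshold, min_cluster_size=2):
    """Ward-cluster rows, drop small clusters, renumber by size (largest = 1)."""
    # linkage() accepts raw observations and uses Euclidean distance for Ward.
    tree = linkage(feature_matrix, method="ward")
    labels = fcluster(tree, t=threshold, criterion="distance")

    members = defaultdict(list)
    for row_index, label in enumerate(labels):
        members[int(label)].append(row_index)

    kept = [rows for rows in members.values() if len(rows) >= min_cluster_size]
    kept.sort(key=len, reverse=True)
    return {new_id: rows for new_id, rows in enumerate(kept, start=1)}


rng = np.random.default_rng(0)
# Synthetic stand-in for play features: two tight groups plus one outlier.
group_a = rng.normal(loc=0.0, scale=0.1, size=(5, 3))
group_b = rng.normal(loc=10.0, scale=0.1, size=(3, 3))
outlier = np.array([[50.0, 50.0, 50.0]])
features = np.vstack([group_a, group_b, outlier])

clusters = cluster_and_rank(features, threshold=5.0)
print({cid: len(rows) for cid, rows in clusters.items()})
```

The outlier forms a singleton cluster and is filtered out, which is one practical way the sensitivity to outliers noted above shows up in the results.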
Step 1: Filter Plays
valid_plays = [p for p in plays if p.delta_x >= min_forward_progress]
Step 2: Extract Features
feature_matrix = np.array([get_feature_vector(p) for p in valid_plays])
Step 3: Cluster
cluster_labels = hierarchical_clustering(feature_matrix, threshold)
Step 4: Filter Small Clusters
# Remove clusters with < 2 plays
filtered = {cid: plays for cid, plays in clusters.items() if len(plays) >= 2}
Step 5: Renumber & Sort
# Sort by cluster size (largest first)
# Renumber sequentially: 1, 2, 3, ...
Clusters are automatically named based on characteristics:
def generate_cluster_name(plays):
    """
    Create descriptive name from play statistics.
    Format: [Position] [Speed] [Depth] [Conversion]
    Example: "Wing Attack Fast Deep High-Conv"
    """
    # wing_pct, avg_duration, avg_forward and goal_rate are aggregates
    # computed over `plays` (computation omitted here)
    # Position (wing_side >= 70% threshold)
    if wing_pct >= 0.7:
        position = "Wing Attack"
    elif wing_pct <= 0.3:
        position = "Central Attack"
    else:
        position = "Mixed Attack"
    # Speed (duration thresholds)
    if avg_duration < 5:
        speed = "Fast"
    elif avg_duration > 10:
        speed = "Slow Build"
    else:
        speed = "Medium"
    # Depth (forward progress)
    if avg_forward > 30:
        depth = "Deep"
    elif avg_forward > 20:
        depth = "Mid"
    else:
        depth = "Short"
    # Conversion (goal rate); left out of the name when no goals were scored
    if goal_rate >= 0.3:
        conversion = "High-Conv"
    elif goal_rate > 0:
        conversion = "Low-Conv"
    else:
        conversion = ""
    return f"{position} {speed} {depth} {conversion}".strip()
Example Names:
- "Wing Attack Fast Deep High-Conv" → Quick wing plays that score
- "Central Attack Slow Build Mid" → Patient buildup through center
- "Mixed Attack Medium Short Low-Conv" → Varied short attacks
from dataclasses import dataclass

@dataclass
class AnalysisConfig:
    # Play Duration Filters
    min_play_duration: float = 3.0      # Minimum seconds
    max_play_duration: float = 30.0     # Maximum seconds
    # Spatial Filters
    min_forward_progress: float = 5.0   # Minimum meters forward
    # Clustering
    clustering_threshold: float = 12.0  # Distance threshold
    # Position Thresholds
    ahead_threshold: float = 1.0        # Meters to count as "ahead"
    forward_threshold: float = 1.0      # Meters to count as "forward"
min_play_duration:
- Increase (e.g., 5.0):
  - ✅ Filters quick turnovers
  - ✅ Focuses on sustained attacks
  - ❌ May miss quick counter-attacks
- Decrease (e.g., 1.0):
  - ✅ Captures rapid transitions
  - ❌ Includes more noise
max_play_duration:
- Increase (e.g., 60.0):
  - ✅ Includes long possession plays
  - ❌ May merge multiple distinct sequences
- Decrease (e.g., 15.0):
  - ✅ Focuses on direct attacks
  - ❌ Misses patient buildup
min_forward_progress:
- Increase (e.g., 10.0):
  - ✅ Only penetrating attacks
  - ❌ Misses lateral/possession plays
- Decrease (e.g., 2.0):
  - ✅ Includes all forward movement
  - ❌ More plays to cluster
clustering_threshold:
- Increase (e.g., 20.0):
  - ✅ Fewer, broader clusters
  - ✅ Merges similar patterns
  - ❌ May group distinct tactics
- Decrease (e.g., 8.0):
  - ✅ More specific clusters
  - ✅ Finer tactical distinctions
  - ❌ More clusters to analyze
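The effect of the clustering threshold is easy to see by cutting one linkage tree at several heights. This sketch uses a random matrix as a stand-in for real play features, so the printed counts only illustrate the trend: a higher threshold never yields more clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)
# Synthetic stand-in for a play feature matrix (60 plays, 21 features).
features = rng.normal(size=(60, 21)) * 3.0

tree = linkage(features, method="ward")  # build the tree once, cut it many times
counts = {}
for threshold in [8.0, 12.0, 20.0]:
    labels = fcluster(tree, t=threshold, criterion="distance")
    counts[threshold] = int(labels.max())  # labels run from 1 to the cluster count
    print(f"threshold={threshold}: {counts[threshold]} clusters")
```

Building the tree once and re-cutting it is also cheaper than re-running the whole clustering for each candidate threshold.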
Recommended Values by Use Case:
| Use Case | Duration | Progress | Threshold | Result |
|---|---|---|---|---|
| Counter-attacks | 1-10s | 15m | 8.0 | Fast, direct plays |
| Possession play | 10-60s | 3m | 15.0 | Patient buildup |
| General analysis | 3-30s | 5m | 12.0 | Balanced coverage |
| High-level patterns | 5-45s | 8m | 20.0 | Broad categories |
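One way to keep these combinations at hand is a set of named presets derived from the default configuration. The sketch below redefines a reduced `AnalysisConfig` so that it runs on its own; the preset names and the `PRESETS` dictionary are hypothetical, while the values come from the table above.

```python
from dataclasses import dataclass, replace


@dataclass
class AnalysisConfig:
    # Defaults match the "General analysis" row.
    min_play_duration: float = 3.0
    max_play_duration: float = 30.0
    min_forward_progress: float = 5.0
    clustering_threshold: float = 12.0


GENERAL = AnalysisConfig()

# replace() returns a modified copy and leaves GENERAL untouched.
PRESETS = {
    "counter_attacks": replace(GENERAL, min_play_duration=1.0, max_play_duration=10.0,
                               min_forward_progress=15.0, clustering_threshold=8.0),
    "possession_play": replace(GENERAL, min_play_duration=10.0, max_play_duration=60.0,
                               min_forward_progress=3.0, clustering_threshold=15.0),
    "general": GENERAL,
    "high_level": replace(GENERAL, min_play_duration=5.0, max_play_duration=45.0,
                          min_forward_progress=8.0, clustering_threshold=20.0),
}

print(PRESETS["counter_attacks"])
```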
Controls Section:
- Cluster Dropdown: Select pattern to explore
  - Format: Cluster 1: Wing Attack Fast Deep (15 plays)
- List Plays: Show all plays in cluster with details
- Summary: Display cluster statistics
- Compare: Enter two play numbers to compare
- Cluster Threshold: Adjust and re-analyze
Output Section:
- Displays results of commands
- Scrollable text area
- Monospace font for alignment
When comparing two plays, a single window opens with:
Top Section:
- Cluster name and similarity score (0.0-1.0)
Middle Section (Field Plots):
- Left: Play #1 in gold
- Right: Play #2 in blue
- Ball path with markers
- Start (circle) and end (X) positions
- Outcome icons (⚽ goal, 🎯 shot, ❌ lost)
Bottom Section (Details):
- Left Panel: Play #1 metrics and events
- Right Panel: Play #2 metrics and events
- Complete event sequences with:
- Event number
- Event type (PA, IT, DR, SH, etc.)
- Player name
- Ball position
- Attackers ahead
- Enter: Execute selected command
- Tab: Navigate between fields
- Ctrl+C: Copy selected text from output
# Global settings
analysis_config = AnalysisConfig()
path_config = PathConfig()

# Usage
from src.config import analysis_config
threshold = analysis_config.clustering_threshold

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class PlayEvent:
    """Single event in a play."""
    event_type: str
    time: float
    ball_x: float
    ball_y: float
    attacking_players_ahead: int
    defending_players_ahead: int
    team_id: int
    player_name: Optional[str]

@dataclass
class Play:
    """Complete play sequence."""
    play_id: str
    match_id: int
    team_name: str
    events: List[PlayEvent]
    # ... 25+ fields

class EventParser:
    """Parse raw JSON events."""

class PlayExtractor:
    """Extract plays from events."""
    def extract_plays(self, events, metadata) -> List[Play]: ...
    def _try_extract_play(self, events, start_idx) -> Optional[tuple]: ...

class DataLoader:
    """Load all JSON files."""
    # typing.Tuple keeps the annotation valid on Python 3.8
    def load_all_matches(self) -> Tuple[List[Play], Dict]: ...

class FeatureEngineer:
    """Calculate play features."""
    def engineer_features(self, plays: List[Play]) -> List[Play]: ...
    def _calculate_spatial_features(self, play: Play): ...
    def _calculate_tactical_features(self, play: Play): ...

class PlayClusterer:
    """Cluster plays by pattern."""
    def cluster_plays(self, plays) -> OrderedDict: ...
    def calculate_similarity(self, play1, play2) -> float: ...

class ClusterAnalyzer:
    """Analyze clusters."""
    def analyze_clusters(self, clusters) -> Dict: ...
    def _generate_cluster_name(self, plays) -> str: ...

class FieldVisualizer:
    """Draw plays on fields."""
    def draw_field(self, ax): ...
    def plot_play(self, ax, play, color): ...
    def compare_plays(self, play1, play2) -> Figure: ...

class ComparisonPrinter:
    """Print comparison tables."""
    def print_comparison(self, play1, play2, similarity): ...

class PlayBrowser:
    """Browse and compare plays."""
    def list(self): ...             # List all plays
    def compare(self, n1, n2): ...  # Compare two plays
    def summary(self): ...          # Show statistics

class TacticalAnalyzer:
    """Main analysis pipeline."""
    def run_analysis(self) -> dict: ...
    def reanalyze_with_threshold(self, threshold) -> dict: ...
    def create_browser(self, cluster_id) -> PlayBrowser: ...
    def save_results(self): ...
In src/feature_engineer.py:
def _calculate_custom_feature(self, play: Play):
    """Add your custom feature."""
    # Example: Calculate average pass distance
    pass_distances = []
    for i in range(len(play.events) - 1):
        if play.events[i].event_type == 'PA':
            # Distance from the pass origin to the next event's ball position
            dx = play.events[i+1].ball_x - play.events[i].ball_x
            dy = play.events[i+1].ball_y - play.events[i].ball_y
            dist = np.sqrt(dx**2 + dy**2)
            pass_distances.append(dist)
    play.avg_pass_distance = np.mean(pass_distances) if pass_distances else 0.0
In src/utils.py:
def get_feature_vector(play: Play) -> np.ndarray:
    """Add to feature vector."""
    return np.array([
        # ... existing features ...
        play.avg_pass_distance,  # NEW FEATURE
    ])
In src/data_loader.py:
# Add to terminal events list
if event_type in ['SH', 'LO', 'CA', 'TA', 'NEW_EVENT']:
    # Handle new terminal event
    ...
In src/clustering.py:
def _generate_cluster_name(self, plays: List[Play]) -> str:
    """Custom naming logic."""
    # Add your own naming criteria
    # Average number of passes (PA or CR) per play in the cluster
    avg_passes = np.mean([
        sum(e.event_type in ('PA', 'CR') for e in p.events) for p in plays
    ])
    if avg_passes > 5:
        return "Long Possession Pattern"
    # ...
In src/clustering.py:
# Replace Euclidean with custom metric
import numpy as np
from scipy.spatial.distance import pdist
def custom_distance(u, v):
    # Your custom distance calculation
    return np.sum(np.abs(u - v))  # Manhattan distance example
distance_matrix = pdist(feature_matrix, metric=custom_distance)
# Note: Ward's linkage assumes Euclidean distances; with a custom metric,
# use a linkage method such as 'average' or 'complete' instead
Solution:
- ✅ Increase clustering_threshold (e.g., 15.0 → 20.0)
- ✅ Increase min_forward_progress (filter more plays)
- ✅ Reduce feature dimensionality (fewer features)
Solution:
- ✅ Decrease clustering_threshold (e.g., 12.0 → 8.0)
- ✅ Add more discriminative features
- ✅ Use different linkage method (e.g., 'complete' instead of 'ward')
Solution:
- ✅ Check min_forward_progress (may be too high)
- ✅ Verify goal events in source data
- ✅ Lower duration thresholds to capture quick goals
Option 1: Faster Clustering
# Use mini-batch k-means instead
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=20, batch_size=1000)
labels = kmeans.fit_predict(feature_matrix)
Option 2: Dimensionality Reduction
from sklearn.decomposition import PCA
# Reduce 21 features to 5
pca = PCA(n_components=5)
reduced_features = pca.fit_transform(feature_matrix)
Option 3: Sampling
# Analyze subset of plays
import random
# Cap the sample size so small datasets do not raise ValueError
sample_plays = random.sample(all_plays, min(5000, len(all_plays)))

from sklearn.metrics import silhouette_score
score = silhouette_score(feature_matrix, cluster_labels)
# Score: -1 (poor) to 1 (excellent)
# Good clusters: > 0.5

# Try different thresholds
thresholds = [5, 10, 15, 20, 25]
for t in thresholds:
    clusters = cluster_plays(plays, threshold=t)
    print(f"Threshold {t}: {len(clusters)} clusters")
Complete play database with all features:
play_id,match_id,team_name,duration,delta_x,delta_y,num_events,outcome,cluster_id
M3812_T1_T123,3812,Poland,5.2,12.3,-3.4,4,SHOT,1
...
Cluster summaries:
cluster_id,name,total,goals,shots,avg_duration,avg_forward
1,Wing Attack Fast Deep,15,3,8,4.2,18.5
...
Statistical overview:
metric,value
total_plays,247
total_clusters,8
avg_plays_per_cluster,30.9
...
Full cluster details in JSON:
{
  "1": {
    "name": "Wing Attack Fast Deep",
    "plays": [...],
    "statistics": {...}
  }
}
- Python 3.8+ required
- Tested on Python 3.9, 3.10, 3.11
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.4.0
scipy>=1.7.0
Optional:
scikit-learn>=0.24.0 # For advanced clustering
seaborn>=0.11.0 # For enhanced visualizations
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Field Dimensions:
- Length: 105 meters
- Width: 68 meters
- Origin: Center of field (0, 0)
- X-axis: Length (-52.5 to +52.5)
- Y-axis: Width (-34 to +34)
Normalization: All plays normalized to attack left-to-right regardless of actual direction.
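A minimal sketch of such a normalization, assuming the origin at the center spot as described above: positions of a team attacking right-to-left are rotated by 180° so that every play runs toward positive x. The function name and the `attacks_left_to_right` flag are hypothetical, and flipping y along with x is an assumption (it keeps a team's left and right wings consistent); the project may flip only x.

```python
from typing import Tuple


def normalize_position(x: float, y: float,
                       attacks_left_to_right: bool) -> Tuple[float, float]:
    """Return coordinates in the standard left-to-right attacking orientation."""
    if attacks_left_to_right:
        return x, y
    # Rotate 180 degrees about the center spot (assumption: y is flipped too).
    return -x, -y


print(normalize_position(-30.0, 10.0, attacks_left_to_right=False))  # (30.0, -10.0)
print(normalize_position(25.0, 5.0, attacks_left_to_right=True))     # (25.0, 5.0)
```

Features that use abs(y), such as final_y and the wing/center label, give the same result whether or not y is flipped.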
| Code | Event Type | Description |
|---|---|---|
| PA | Pass | Player passes ball to teammate |
| CR | Cross | Cross from wing into box |
| SH | Shot | Shot at goal |
| IT | Interception/Touch | Ball touched/intercepted |
| DR | Dribble | Player dribbles with ball |
| LO | Loss | Possession lost |
| CA | Clearance | Defensive clearance |
| TA | Tackle | Defensive tackle |
| TC | Touch | General touch |
| CH | Challenge | Challenge for ball |
def calculate_similarity(play1, play2) -> float:
    """
    Calculate similarity score between plays.
    Returns: 0.0 (very different) to 1.0 (identical)
    """
    vec1 = get_feature_vector(play1)
    vec2 = get_feature_vector(play2)
    # Euclidean distance
    distance = np.linalg.norm(vec1 - vec2)
    # Normalize to 0-1 similarity
    max_distance = 100.0
    similarity = 1.0 - min(distance / max_distance, 1.0)
    return similarity
- ✅ Check JSON format matches StatsBomb schema
- ✅ Verify min_forward_progress not too high
- ✅ Check duration thresholds
- ✅ Decrease clustering_threshold
- ✅ Check feature variance (may need more features)
- ✅ Verify plays have diverse characteristics
- ✅ Check for long-running analysis (large datasets)
- ✅ Verify matplotlib backend compatibility
- ✅ Try running analysis in command line first
- ✅ Update matplotlib: pip install -U matplotlib
- ✅ Check TkAgg backend: matplotlib.use('TkAgg')
- Machine Learning
  - Neural network embeddings for plays
  - Supervised classification by play type
  - Outcome prediction models
- Advanced Clustering
  - DBSCAN for arbitrary shapes
  - HDBSCAN for hierarchical density
  - Fuzzy clustering for overlapping patterns
- Extended Features
  - Player formation analysis
  - Passing network metrics
  - Pressure indicators
  - Space creation metrics
- Visualization
  - 3D plots of feature space
  - Interactive web dashboard
  - Animation of play sequences
  - Heatmaps of player positions
- Analysis Tools
  - Compare across matches
  - Team tactical signatures
  - Evolution of patterns over season
  - Success rate prediction