HAWKS
A tool for generating synthetic clustering datasets with controllable difficulty — used to benchmark and stress-test clustering algorithms via evolutionary multi-objective optimisation.
Overview
HAWKS evolves synthetic datasets with parameterised difficulty, allowing researchers to systematically test how clustering algorithms perform as data complexity increases. Rather than benchmarking on fixed datasets, you can generate datasets that target a specific silhouette score, cluster overlap, or shape profile — then run your algorithm of choice against them.
The core idea: treat dataset generation as a multi-objective optimisation problem, using a genetic algorithm to search the space of possible datasets subject to difficulty constraints.
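To make the optimisation framing concrete, here is a minimal sketch in plain Python. It is not the HAWKS implementation. It collapses the problem to a single objective (distance to a target silhouette width), represents a dataset only by its cluster centres, and uses a simple elitist, mutation-only loop. All names (silhouette, sample_dataset, fitness, evolve) and all parameter values are hypothetical and chosen for illustration.

```python
import math
import random

POINTS_PER_CLUSTER = 10  # illustrative value, not a HAWKS default
SPREAD = 1.0             # shared standard deviation of every cluster


def silhouette(points, labels):
    """Mean silhouette width (Euclidean); assumes at least two clusters."""
    sizes = {c: labels.count(c) for c in set(labels)}
    total = 0.0
    for i, p in enumerate(points):
        own = labels[i]
        if sizes[own] == 1:
            continue  # a singleton cluster contributes a width of 0
        # Sum of distances from p to every other point, grouped by cluster.
        sums = dict.fromkeys(sizes, 0.0)
        for j, q in enumerate(points):
            if j != i:
                sums[labels[j]] += math.dist(p, q)
        a = sums[own] / (sizes[own] - 1)                        # cohesion
        b = min(sums[c] / sizes[c] for c in sizes if c != own)  # separation
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def sample_dataset(centres, points_per_cluster, spread, seed):
    """Draw Gaussian points around each centre; a fixed seed keeps it repeatable."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centres):
        for _ in range(points_per_cluster):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


def fitness(centres, target):
    """Distance between the dataset's silhouette width and the target (lower is better)."""
    points, labels = sample_dataset(centres, POINTS_PER_CLUSTER, SPREAD, seed=0)
    return abs(silhouette(points, labels) - target)


def evolve(target, num_clusters=3, pop_size=10, generations=20, seed=1):
    """Elitist loop: mutate every parent, then keep the best pop_size individuals."""
    rng = random.Random(seed)
    population = [
        [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(num_clusters)]
        for _ in range(pop_size)
    ]
    population.sort(key=lambda ind: fitness(ind, target))
    for _ in range(generations):
        # Gaussian mutation of every cluster centre.
        children = [
            [(x + rng.gauss(0, 0.5), y + rng.gauss(0, 0.5)) for x, y in parent]
            for parent in population
        ]
        merged = population + children
        population = sorted(merged, key=lambda ind: fitness(ind, target))[:pop_size]
    return population[0]


best = evolve(target=0.5)
print(f"distance to target silhouette: {fitness(best, 0.5):.3f}")
```

Because parents compete with their children for survival, the best distance to the target can only shrink or stay the same from one generation to the next. A multi-objective search would typically keep a set of trade-off solutions rather than a single best individual.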
Key Features
- Controllable difficulty — set target silhouette scores, cluster shapes, and overlap levels
- Multi-objective — jointly optimise multiple dataset properties
- scikit-learn compatible — drop-in with standard clustering workflows
- pip installable — pip install hawks
Publications
HAWKS is described in two papers. The GECCO 2019 conference paper (nominated for best paper on the evolutionary machine learning track) introduced the core approach. The 2021 IEEE Transactions on Evolutionary Computation paper extends this work, most notably by adding a “versus mode”: datasets are evolved against two clustering algorithms, one designated the “winner” and the other the “loser”, so that the biases of each algorithm are exploited to produce a dataset that maximises the performance difference between them.
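One way to picture the versus-mode objective is as a score difference between the two algorithms on the same dataset. The sketch below assumes the Adjusted Rand Index (ARI) as the performance measure, which the text above does not specify. The function names adjusted_rand_index and versus_fitness are hypothetical, and the label lists stand in for the output of two real clustering algorithms.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same points (needs at least two points)."""
    n = len(labels_true)
    # Pair counts within each contingency cell, each true cluster, each predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


def versus_fitness(labels_true, winner_labels, loser_labels):
    """Higher is better: the winner recovers the truth well, the loser does not."""
    return (adjusted_rand_index(labels_true, winner_labels)
            - adjusted_rand_index(labels_true, loser_labels))


truth = [0, 0, 0, 1, 1, 1]
winner = [1, 1, 1, 0, 0, 0]  # same partition as the truth, cluster names swapped
loser = [0, 1, 0, 1, 0, 1]   # partition unrelated to the truth
gap = versus_fitness(truth, winner, loser)
print(f"performance gap: {gap:.3f}")
```

An evolutionary search could use a quantity like this as the objective to maximise, steering datasets towards structure that suits the winner and misleads the loser.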