HAWKS

A tool for generating synthetic clustering datasets with controllable difficulty — used to benchmark and stress-test clustering algorithms via evolutionary multi-objective optimisation.

Python, Evolutionary Algorithms, Clustering, Synthetic Data, Research

Overview

HAWKS evolves synthetic datasets with parameterised difficulty, allowing researchers to systematically test how clustering algorithms perform as data complexity increases. Rather than benchmarking on fixed datasets, you can generate datasets that target a specific silhouette score, cluster overlap, or shape profile — then run your algorithm of choice against them.

The core idea: treat dataset generation as a multi-objective optimisation problem, using a genetic algorithm to search the space of possible datasets subject to difficulty constraints.
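As a rough illustration of that idea, the sketch below evolves a tiny 1-D dataset towards a target silhouette score using only the Python standard library. It is not the HAWKS implementation: the genome (one centre per cluster), the truncation selection, the Gaussian mutation, and the names `silhouette`, `make_dataset`, and `evolve` are all assumptions made for this example, and the constraints are left out for brevity.

```python
import random


def silhouette(points, labels):
    """Mean silhouette width for 1-D points (needs >= 2 clusters of >= 2 points)."""
    total = 0.0
    for i, (p, lab) in enumerate(zip(points, labels)):
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(abs(p - q))
        a = sum(dists[lab]) / len(dists[lab])  # mean distance within own cluster
        b = min(sum(d) / len(d) for k, d in dists.items() if k != lab)  # nearest other cluster
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


def make_dataset(centres, offsets):
    """Decode a genome (cluster centres) into points and ground-truth labels."""
    points, labels = [], []
    for label, (centre, offs) in enumerate(zip(centres, offsets)):
        for o in offs:
            points.append(centre + o)
            labels.append(label)
    return points, labels


def evolve(target, n_clusters=3, n_per_cluster=10, pop_size=20, generations=30, seed=0):
    """Toy GA: minimise |silhouette - target| by moving the cluster centres."""
    rng = random.Random(seed)
    # Fixed per-cluster noise, so fitness depends only on the evolved centres.
    offsets = [[rng.gauss(0, 1) for _ in range(n_per_cluster)] for _ in range(n_clusters)]
    population = [[rng.uniform(-5, 5) for _ in range(n_clusters)] for _ in range(pop_size)]

    def error(genome):
        return abs(silhouette(*make_dataset(genome, offsets)) - target)

    history = []
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        parents = population[: pop_size // 2]  # truncation selection; parents survive (elitism)
        children = [[c + rng.gauss(0, 0.3) for c in rng.choice(parents)]
                    for _ in range(pop_size - len(parents))]  # Gaussian mutation
        population = parents + children
    best = min(population, key=error)
    history.append(error(best))
    return best, offsets, history


if __name__ == "__main__":
    best, offsets, history = evolve(target=0.6)
    points, labels = make_dataset(best, offsets)
    print("silhouette of evolved dataset:", round(silhouette(points, labels), 3))
```

Because the parents are carried over unchanged, the best error in `history` can never get worse from one generation to the next. Lowering `target` pushes the clusters together, which is the sense in which the target controls difficulty.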

Key Features

Controllable difficulty — set target silhouette scores, cluster shapes, and overlap levels
Multi-objective — jointly optimise multiple dataset properties
scikit-learn compatible — drop-in with standard clustering workflows
pip installable — pip install hawks
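To make "jointly optimise multiple dataset properties" concrete, here is a minimal sketch of the textbook Pareto-dominance comparison for candidates scored on several objectives at once. It illustrates the general notion only and is not a description of how HAWKS combines its objectives internally; the candidate names and the two objectives (silhouette error and overlap, both minimised) are hypothetical.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, objs in scores.items()
        if not any(dominates(other, objs) for o, other in scores.items() if o != name)
    )


# Hypothetical candidates: (silhouette error, cluster overlap), lower is better for both.
scores = {
    "A": (0.02, 0.30),
    "B": (0.10, 0.05),
    "C": (0.12, 0.35),
    "D": (0.02, 0.10),
}

front = pareto_front(scores)
print(front)  # ['B', 'D']
```

Here D beats A on overlap at equal silhouette error, and C is worse than every other candidate on both objectives, so only B and D remain as genuine trade-offs.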

Publications

HAWKS is associated with two papers. The GECCO 2019 conference paper (nominated for best paper on the evolutionary machine learning track) introduced the core approach. The 2021 IEEE Transactions on Evolutionary Computation paper expands on this work, in particular adding a “versus mode” in which a dataset is evolved against two clustering algorithms, one designated the “winner” and the other the “loser”, so that the biases of each algorithm are exploited to maximise the performance difference between them.
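The versus-mode objective can be sketched as a fitness function that rewards a dataset for being easy for one algorithm and hard for the other. The example below is an assumption-laden illustration rather than the paper's method: the two 1-D clusterers (`kmeans_1d` and `largest_gap_split`) are simple stand-ins, and plain label accuracy is used in place of whatever performance measure the paper adopts.

```python
def kmeans_1d(points, iterations=20):
    """Two-cluster Lloyd's algorithm in 1-D, initialised at the min and max points."""
    c0, c1 = min(points), max(points)
    labels = []
    for _ in range(iterations):
        labels = [0 if abs(p - c0) <= abs(p - c1) else 1 for p in points]
        left = [p for p, lab in zip(points, labels) if lab == 0]
        right = [p for p, lab in zip(points, labels) if lab == 1]
        c0, c1 = sum(left) / len(left), sum(right) / len(right)
    return labels


def largest_gap_split(points):
    """Cut at the widest gap between sorted neighbours (single-linkage with two clusters)."""
    order = sorted(points)
    gaps = [(b - a, (a + b) / 2) for a, b in zip(order, order[1:])]
    _, cut = max(gaps)
    return [0 if p < cut else 1 for p in points]


def accuracy(truth, predicted):
    """Fraction of matching labels, allowing the two cluster labels to be swapped."""
    same = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
    return max(same, 1 - same)


def versus_fitness(points, truth, winner, loser):
    """Higher is better: the winner should score well and the loser badly."""
    return accuracy(truth, winner(points)) - accuracy(truth, loser(points))


# A mild outlier fools the gap-based split but not k-means.
outlier_points = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 1.15, 1.2, 1.3, 2.2]
outlier_truth = [0] * 5 + [1] * 5

# A wide cluster next to a small tight one fools k-means but not the gap-based split.
unbalanced_points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.5, 6.6]
unbalanced_truth = [0] * 6 + [1] * 2

print(versus_fitness(outlier_points, outlier_truth, kmeans_1d, largest_gap_split))
print(versus_fitness(unbalanced_points, unbalanced_truth, largest_gap_split, kmeans_1d))
```

Both calls return a positive value because each dataset exploits a bias of the designated loser. In an evolutionary setting, a search would maximise `versus_fitness` over candidate datasets instead of using hand-picked points.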