Graph Clustering Recipe

The Graph clustering recipe computes community assignments on selected node groups and edge groups. It can write either a dataset of nodes or an enriched edge dataset.

The recipe reads from a graph database and runs the clustering algorithms with Dataiku execution. For related details, see Graph database recipe settings and Algorithm execution and sampling.

Algorithms

The following table summarizes algorithm support.

Algorithm	Dataiku execution	In database
Fastgreedy	Undirected only	Not supported
Multilevel	Undirected only	Not supported
Infomap	Directed and undirected	Not supported
Walktrap	Directed and undirected	Not supported
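
The support table above can be read as a simple lookup. The following is a minimal sketch of that reading, not the recipe's actual code; the names `SUPPORTS_DIRECTED` and `available_algorithms` are hypothetical and introduced only for illustration.

```python
# Hypothetical lookup mirroring the table above: True means the algorithm
# accepts directed graphs, False means it is undirected only.
SUPPORTS_DIRECTED = {
    "Fastgreedy": False,
    "Multilevel": False,
    "Infomap": True,
    "Walktrap": True,
}


def available_algorithms(directed):
    """Return the algorithms usable for the given graph direction setting."""
    if not directed:
        # Every algorithm in the table supports undirected graphs.
        return list(SUPPORTS_DIRECTED)
    return [name for name, ok in SUPPORTS_DIRECTED.items() if ok]


print(available_algorithms(directed=True))   # ['Infomap', 'Walktrap']
print(available_algorithms(directed=False))  # all four algorithms
```

With a directed graph, only Infomap and Walktrap remain available, which matches the algorithms the recipe still offers when the Directed graph option is enabled.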

Input / Output

Input

Graph folder (Optional): Dataiku Folder that contains your materialized graph database. Leave it empty to run on an unmanaged Neo4j database directly.

Output

Output dataset: Dataset containing the computed community assignments.

Settings

Node groups

Choose one or more node groups to include in the computation.

Edge groups

Select the edge groups that define the relationships to consider.

Directed graph

Enable this option to treat relationships as directed. Fastgreedy and Multilevel are hidden when this option is enabled because they support undirected graphs only.

Execution engine

Graph clustering runs with Dataiku execution only. No graph clustering algorithm currently runs in database, either on Neo4j or on the built-in graph database.

Weight property

Optionally select a numeric edge property to use as the relationship weight for clustering. The selected property must exist on all selected edge groups.
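
The constraint that the weight property must exist on all selected edge groups can be checked up front. The sketch below is an illustration under stated assumptions, not the recipe's implementation: the function `missing_weight_groups`, the edge group names, and the property names are all hypothetical.

```python
def missing_weight_groups(edge_groups, weight_property):
    """Return the names of edge groups that lack the weight property."""
    return [
        name
        for name, properties in edge_groups.items()
        if weight_property not in properties
    ]


# Hypothetical edge groups mapped to the properties they define.
edge_groups = {
    "FOLLOWS": {"since", "strength"},
    "MENTIONS": {"count"},
}

print(missing_weight_groups(edge_groups, "strength"))  # ['MENTIONS']
```

An empty result means the property is usable as a weight across the whole selection; any listed group would need the property added or would need to be deselected.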

Output type

Choose Dataset of nodes to write one row per node, or Dataset of edges to keep an edge dataset enriched with community assignments for both endpoints.

For edge output, community assignments are computed on nodes and then joined back to both endpoints of each output relationship.
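
The join described above can be sketched in a few lines. This is an illustrative example only: the function `enrich_edges` and the column names `source`, `target`, `source_community`, and `target_community` are assumptions, and the actual output columns of the recipe may be named differently.

```python
def enrich_edges(edges, community_of):
    """Attach the community of each endpoint to every edge row."""
    enriched = []
    for edge in edges:
        row = dict(edge)  # copy so the input rows are left untouched
        row["source_community"] = community_of.get(edge["source"])
        row["target_community"] = community_of.get(edge["target"])
        enriched.append(row)
    return enriched


# Hypothetical node-level result: node id -> community id.
community_of = {"a": 0, "b": 0, "c": 1}
edges = [
    {"source": "a", "target": "b", "weight": 2.0},
    {"source": "b", "target": "c", "weight": 0.5},
]

for row in enrich_edges(edges, community_of):
    print(row)
```

In this sketch, an edge whose two community columns differ connects two communities, which is one common reason to prefer the edge output over the node output.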

Clustering algorithms

Use Select all to compute all algorithms supported by the current graph settings, or select individual algorithms.

Algorithm-specific parameters

Multilevel

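Resolution: Resolution parameter of the multilevel community detection algorithm. In the usual convention, higher values tend to produce more, smaller communities, and lower values tend to produce fewer, larger ones.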

Infomap

Trials: Number of attempts the Infomap algorithm makes to partition the graph.

Walktrap

Steps: Length, in steps, of the random walks performed by the Walktrap algorithm.

Advanced parameters

Output batch size: Number of result rows written at a time. This only controls output writing and does not change the graph used for computation. With Dataiku execution, the graph is loaded in memory independently of this value.
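
To illustrate why the batch size affects only writing, here is a minimal sketch of batched output. It assumes the full result is already computed in memory; the helper `iter_batches` and the row layout are hypothetical and do not reflect the recipe's internal writer.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# The full result already exists in memory; batching only affects writing.
results = [{"node": i, "community": i % 3} for i in range(7)]
batch_sizes = [len(batch) for batch in iter_batches(results, batch_size=3)]
print(batch_sizes)  # [3, 3, 1]
```

Changing `batch_size` here changes how many rows go out per write, but the set of rows written stays the same.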