HomeTechnologyGraph Data Science With Python/NetworkX

Graph Data Science With Python/NetworkX

We’re inundated with data. Ever-increasing databases and spreadsheets are rife with hidden business insights. How can we analyze data and extract conclusions when there’s so much of it? Graphs (networks, not bar graphs) provide an elegant approach.

We often use tables to represent information generically. But graphs use a specialised data structure: Instead of a desk row, a node represents an element. An edge connects 2 nodes to indicate their relationship.

This graph data structure enables us to observe data from unique angles, which is why graph data science is used in every field from molecular biology to the social sciences:

On the left, a protein interaction graph with many dots of varied sizes and colors, and lines of different colors between them. Most dots (nodes) form a large central cluster, but some dots are connected only in pairs, triplets, or quadruplets on the fringes, disconnected from the main cluster. On the right, a Twitter interaction graph where nodes are of subpixel size and fall broadly into three sets: A dense central cluster with a few fuzzy blobs of various colors and sizes connected by fuzzy streams of various colors; a light cloud consisting of small smudges and sprinklings of mostly gray; and a buffer of white before an outer gray fuzzy ring surrounding the first two sets.
Left image credit: TITZ, Björn, et al. “The Binary Protein Interactome of Treponema Pallidum …” PLoS One, 3, no. 5 (2008).

Right image credit: ALBANESE, Federico, et al. “Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter.” (August 24, 2020): arXiv:2008.10749 [cs.SI]

So how can developers leverage graph data science? Let’s turn to the most-used data science programming language: Python.

Getting Started With “Graph Theory” Graphs in Python

Python developers have several graph data libraries available to them, such as NetworkX, igraph, SNAP, and graph-tool. Pros and cons aside, they have very identical interfaces for handling and processing Python graph data buildings.

We’ll use the wellliked NetworkX library. It’s simple to install and use, and helps the community detection algorithm we’ll be using.

Creating a new graph with NetworkX is straightforward:

import networkx as nx
G = nx.Graph()

But G isn’t much of a graph yet, being devoid of nodes and edges.

How to Add Nodes to a Graph

We can add a node to the community by chaining on the return value of Graph() with .add_node() (or .add_nodes_from() for multiple nodes in a list). We can also add arbitrary traits or attributes to the nodes by passing a dictionary as a parameter, as we show with node 4 and node 5:

G.add_node("node 1")
G.add_nodes_from(["node 2", "node 3"])
G.add_nodes_from([("node 4", {"abc": 123}), ("node 5", {"abc": 0})])
print(G.nodes)
print(G.nodes["node 4"]["abc"]) # accessed like a dictionary

This will output:

['node 1', 'node 2', 'node 3', 'node 4', 'node 5']
123

But without edges between nodes, they’re isolated, and the dataset is no better than a simple desk.

How to Add Edges to a Graph

Similar to the technique for nodes, we can use .add_edge() with the names of 2 nodes as parameters (or .add_edges_from() for multiple edges in a list), and optionally include a dictionary of attributes:

G.add_edge("node 1", "node 2")
G.add_edge("node 1", "node 6")
G.add_edges_from([("node 1", "node 3"), 
                  ("node 3", "node 4")])
G.add_edges_from([("node 1", "node 5", {"weight" : 3}), 
                  ("node 2", "node 4", {"weight" : 5})])

The NetworkX library helps graphs like these, where each edge can have a weight. For example, in a social community graph where nodes are users and edges are interactions, weight could signify how many interactions happen between a given pair of users—a highly relevant metric.

NetworkX lists all edges when using G.edges, but it does not include their attributes. If we want edge attributes, we can use G[node_name] to get everything that’s related to a node or G[node_name][connected_node_name] to get the attributes of a particular edge:

print(G.nodes)
print(G.edges)
print(G["node 1"])
print(G["node 1"]["node 5"])

This will output:

['node 1', 'node 2', 'node 3', 'node 4', 'node 5', 'node 6']
[('node 1', 'node 2'), ('node 1', 'node 6'), ('node 1', 'node 3'), ('node 1', 'node 5'), ('node 2', 'node 4'), ('node 3', 'node 4')]
{'node 2': {}, 'node 6': {}, 'node 3': {}, 'node 5': {'weight': 3}}
{'weight': 3}

But reading our first graph this way is impractical. Thankfully, there’s a much better illustration.

How to Generate Images From Graphs (and Weighted Graphs)

Visualizing a graph is essential: It lets us see the relationships between the nodes and the structure of the community quickly and clearly.

A quick call to nx.draw(G) is all it takes:

Six red dots with black lines connecting them. Four form a quadrilateral, one corner of which connects to the remaining two.

Let’s make weightier edges correspondingly thicker via our nx.draw() call:

weights = [1 if G[u][v] == {} else G[u][v]['weight'] for u,v in G.edges()]
nx.draw(G, width=weights)

We provided a default thickness for weightless edges, as seen in the result:

Similar to the previous image but with slightly shifted dot positions and two lines standing out (one being three times thicker and another five times thicker than the rest).

Our methods and graph algorithms are about to get more complex, so the next step is to use a better-known dataset.

Graph Data Science Using Data From the Movie Star Wars: Episode IV

To make it simpler to interpret and understand our results, we’ll use this dataset. Nodes represent necessary characters, and edges (which aren’t weighted here) signify co-appearance in a scene.

Note: The dataset is from Gabasova, E. (2016). Star Wars social community. DOI: https://doi.org/10.5281/zenodo.1411479.

First, we’ll visualize the data with nx.draw(G_starWars, with_labels = True):

A much busier graph with 19 blue dots (each labeled with a character name in all capitals) and uniformly thick lines between many of them.

Characters that usually appear together, like R2-D2 and C-3PO, appear closely related. In contrast, we can see that Darth Vader does not share scenes with Owen.

Python NetworkX Visualization Layouts

Why is each node located where it is in the previous graph?

It’s the result of the default spring_layout algorithm. It simulates the force of a spring, attracting related nodes and repelling disconnected ones. This helps highlight well-related nodes, which end up in the center.

NetworkX has other layouts that use different standards to position nodes, like circular_layout:

pos = nx.circular_layout(G_starWars)
nx.draw(G_starWars, pos=pos, with_labels = True)

The result:

The exact same graph in terms of node and edge presence but the blue dots form a circle. (Note: Not every pair of neighboring dots in the oval share an edge.)

This format is neutral in the sense that the location of a node does not depend on its significance—all nodes are represented equally. (The circular format could also help visualize separate related components—subgraphs having a path between any 2 nodes—but here, the whole graph is 1 huge related component.)

Both of the layouts we’ve seen have a degree of visible muddle because edges are free to cross other edges. But Kamada-Kawai, another force-directed algorithm like spring_layout, positions the nodes so as to minimize the energy of the system.

This reduces edge-crossing but at a price: It’s slower than other layouts and therefore not highly counseled for graphs with many nodes.

This 1 has a specialised draw function:

nx.draw_kamada_kawai(G_starWars, with_labels = True)

That yields this form instead:

The same graph again. It looks more like the first one but the blue dots are spread out more uniformly with fewer overlapping edges.

Without any special intervention, the algorithm put main characters (like Luke, Leia, and C-3PO) in the center, and less prominent ones (like Camie and General Dodonna) by the border.

Visualizing the graph with a specific format can give us some interesting qualitative results. Still, quantitative results are a vital part of any data science analysis, so we’ll need to define some metrics.

Node Analysis: Degree and PageRank

Now that we can visualize our community clearly, it may be of interest to us to characterize the nodes. There are multiple metrics that portray the traits of the nodes and, in our example, of the characters.

One basic metric for a node is its degree: how many edges it has. The degree of a Star Wars character’s node measures how many other characters they shared a scene with.

The degree() function can calculate the degree of a character or of the full community:

print(G_starWars.degree["LUKE"])
print(G_starWars.degree)

The output of both instructions:

15
[('R2-D2', 9), ('CHEWBACCA', 6), ('C-3PO', 10), ('LUKE', 15), ('DARTH VADER', 4), ('CAMIE', 2), ('BIGGS', 8), ('LEIA', 12), ('BERU', 5), ('OWEN', 4), ('OBI-WAN', 7), ('MOTTI', 3), ('TARKIN', 3), ('HAN', 6), ('DODONNA', 3), ('GOLD LEADER', 5), ('WEDGE', 5), ('RED LEADER', 7), ('RED TEN', 2)]

Sorting nodes from highest to lowest according to degree can be done with a single line of code:

print(sorted(G_starWars.degree, key=lambda x: x[1], reverse=True))

The output:

[('LUKE', 15), ('LEIA', 12), ('C-3PO', 10), ('R2-D2', 9), ('BIGGS', 8), ('OBI-WAN', 7), ('RED LEADER', 7), ('CHEWBACCA', 6), ('HAN', 6), ('BERU', 5), ('GOLD LEADER', 5), ('WEDGE', 5), ('DARTH VADER', 4), ('OWEN', 4), ('MOTTI', 3), ('TARKIN', 3), ('DODONNA', 3), ('CAMIE', 2), ('RED TEN', 2)]

Being just a total, the degree doesn’t take into account details of individual edges. Does a given edge connect to an otherwise isolated node or to a node that is related with the full community? Google’s PageRank algorithm aggregates this information to gauge how “necessary” a node is in a community.

The PageRank metric can be interpreted as an agent shifting randomly from 1 node to another. Better-related nodes have more paths leading through them, so the agent will tend to visit them more often.

Such nodes will have a higher PageRank, which we can calculate with the NetworkX library:

pageranks = nx.pagerank(G_starWars) # A dictionary
print(pageranks["LUKE"])
print(sorted(pageranks, key=lambda x: x[1], reverse=True))

This prints Luke’s rank and our characters sorted by rank:

0.12100659993223405
['OWEN', 'LUKE', 'MOTTI', 'DODONNA', 'GOLD LEADER', 'BIGGS', 'CHEWBACCA', 'LEIA', 'BERU', 'WEDGE', 'RED LEADER', 'RED TEN', 'OBI-WAN', 'DARTH VADER', 'CAMIE', 'TARKIN', 'HAN', 'R2-D2', 'C-3PO']

Owen is the character with the highest PageRank, surpassing Luke, who had the highest degree. The analysis: Although Owen is not the character who shares the most scenes with other characters, he is a character who shares scenes with many necessary characters such as Luke himself, R2-D2, and C-3PO.

In greater contrast, C-3PO, the character with the third-highest degree, is the 1 with the lowest PageRank. Despite C-3PO having many connections, a lot of them are with unimportant characters.

The takeaway: Using multiple metrics can give deeper perception into different traits of a graph’s nodes.

When analyzing a community it may be necessary to separate communities: groups of nodes that are highly related to each other but minimally related with nodes outdoors their community.

There are multiple algorithms for this. Most of them are found inside unsupervised machine learning algorithms because they assign a label to the nodes without the need for them to have been labeled previously.

One of the most wellknown is label propagation. In it, each node begins with a unique label, in a community of 1. The labels of the nodes are iteratively updated according to the majority of the labels of the neighboring nodes.

The labels diffuse through the community until all nodes share a label with most of their neighbors. Groups of nodes closely related to each other end up having the same label.

With the NetworkX library, operating this algorithm takes a mere 3 lines of Python:

from networkx.algorithms.community.label_propagation import label_propagation_communities

communities = label_propagation_communities(G_starWars)
print([community for community in communities])

The output:

[{'R2-D2', 'CAMIE', 'RED TEN', 'RED LEADER', 'OBI-WAN', 'DODONNA', 'LEIA', 'WEDGE', 'HAN', 'OWEN', 'CHEWBACCA', 'GOLD LEADER', 'LUKE', 'BIGGS', 'C-3PO', 'BERU'}, {'DARTH VADER', 'TARKIN', 'MOTTI'}]

In this list of sets, each set represents a community. Readers acquainted with the movie will notice the algorithm managed to perfectly separate the “fine guys” from the “bad guys,” differentiating the characters meaningfully without using any true (community) label or metadata.

Intelligent Insights Using Graph Data Science in Python

We’ve seen that getting initiated with graph data science tools is more straightforward than it might sound. Once we represent data as a graph using the NetworkX library in Python, a few quick lines of code can be illuminating. We can visualize our dataset, measure and compare node traits, and cluster nodes sensibly via community detection algorithms.

Having the skill to extract conclusions and insights from a community using Python enables developers to combine with tools and methodology commonly found in data science services pipelines. From search engines to flight scheduling to electrical engineering, these methods apply easily to a broad range of contexts.

Community Detection Algorithms
Zhao Yang, René Algesheimer, and Claudio Tessone. “A Comparative Analysis of Community Detection Algorithms on Artificial Networks.” Scientific Reports, 6, no. 30750 (2016).

Graph Deep Learning
Thomas Kipf. “Graph Convolutional Networks.” September 30, 2016.

Applications of Graph Data Science
Albanese, Federico, Leandro Lombardi, Esteban Feuerstein, and Pablo Balenzuela. “Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter.” (August 24, 2020): arXiv:2008.10749 [cs.SI].

Cohen, Elior. “PyData Tel Aviv Meetup: Node2vec.” YouTube. November 22, 2018. Video, 21:09. https://www.youtube.com/watch?v=828rZgV9t1g.

Go to the source

Most Popular