We’re inundated with data. Ever-increasing databases and spreadsheets are rife with hidden business insights. How can we analyze data and extract conclusions when there’s so much of it? Graphs (networks, not bar graphs) provide an elegant approach.

We often use tables to represent information generically. But graphs use a specialized data structure: Instead of a table row, a *node* represents an element. An *edge* connects two nodes to indicate their relationship.

This graph data structure enables us to observe data from unique angles, which is why graph data science is used in every field from molecular biology to the social sciences:

*Image credit: Albanese, Federico, et al. “Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter.” (August 24, 2020): arXiv:2008.10749 [cs.SI].*

So how can developers leverage graph data science? Let’s turn to the most-used data science programming language: Python.

## Getting Started With “Graph Theory” Graphs in Python

Python developers have several graph data libraries available to them, such as NetworkX, igraph, SNAP, and graph-tool. Pros and cons aside, they have very similar interfaces for handling and processing Python graph data structures.

We’ll use the popular NetworkX library. It’s simple to install and use, and it supports the community detection algorithm we’ll be using.

Creating a new graph with NetworkX is straightforward:

```
import networkx as nx
G = nx.Graph()
```

But `G` isn’t much of a graph yet, being devoid of nodes and edges.

### How to Add Nodes to a Graph

We can add a node to the graph by chaining on the return value of `Graph()` with `.add_node()` (or `.add_nodes_from()` for multiple nodes in a list). We can also add arbitrary properties or attributes to the nodes by passing a dictionary as a parameter, as we show with `node 4` and `node 5`:

```
G.add_node("node 1")
G.add_nodes_from(["node 2", "node 3"])
G.add_nodes_from([("node 4", {"abc": 123}), ("node 5", {"abc": 0})])
print(G.nodes)
print(G.nodes["node 4"]["abc"]) # accessed like a dictionary
```

This will output:

```
['node 1', 'node 2', 'node 3', 'node 4', 'node 5']
123
```
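As an aside, node attributes can also be read in bulk rather than node by node. A quick sketch using `nx.get_node_attributes`, which returns a dictionary covering only the nodes that define the given key:

```python
import networkx as nx

# Rebuild the nodes from above; "node 1" has no "abc" attribute,
# so it is absent from the result.
G = nx.Graph()
G.add_node("node 1")
G.add_nodes_from([("node 4", {"abc": 123}), ("node 5", {"abc": 0})])
print(nx.get_node_attributes(G, "abc"))  # {'node 4': 123, 'node 5': 0}
```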

But without edges between nodes, they’re isolated, and the dataset is no better than a simple table.

### How to Add Edges to a Graph

Similar to the technique for nodes, we can use `.add_edge()` with the names of two nodes as parameters (or `.add_edges_from()` for multiple edges in a list), and optionally include a dictionary of attributes:

```
G.add_edge("node 1", "node 2")
G.add_edge("node 1", "node 6") # "node 6" is created automatically
G.add_edges_from([("node 1", "node 3"),
                  ("node 3", "node 4")])
G.add_edges_from([("node 1", "node 5", {"weight": 3}),
                  ("node 2", "node 4", {"weight": 5})])
```

The NetworkX library supports weighted graphs like these, where each edge can carry a weight. For example, in a social network graph where nodes are users and edges are interactions, the weight could signify how many interactions occur between a given pair of users, a highly relevant metric.
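The interaction-counting idea can be sketched in a few lines. This is a hypothetical example (the user names and interaction list are invented): each repeated interaction between a pair of users bumps the weight of their edge.

```python
import networkx as nx

# Hypothetical interaction log: each tuple is one interaction between two users.
interactions = [("ana", "bob"), ("ana", "bob"), ("bob", "carol"), ("ana", "bob")]

S = nx.Graph()
for u, v in interactions:
    if S.has_edge(u, v):
        S[u][v]["weight"] += 1  # existing pair: increment the count
    else:
        S.add_edge(u, v, weight=1)  # first interaction creates the edge

print(S["ana"]["bob"]["weight"])   # 3
print(S["bob"]["carol"]["weight"]) # 1
```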

NetworkX lists all edges when using `G.edges`, but it does not include their attributes. If we want edge attributes, we can use `G[node_name]` to get everything related to a node, or `G[node_name][connected_node_name]` to get the attributes of a particular edge:

```
print(G.nodes)
print(G.edges)
print(G["node 1"])
print(G["node 1"]["node 5"])
```

This will output:

```
['node 1', 'node 2', 'node 3', 'node 4', 'node 5', 'node 6']
[('node 1', 'node 2'), ('node 1', 'node 6'), ('node 1', 'node 3'), ('node 1', 'node 5'), ('node 2', 'node 4'), ('node 3', 'node 4')]
{'node 2': {}, 'node 6': {}, 'node 3': {}, 'node 5': {'weight': 3}}
{'weight': 3}
```
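Besides indexing edge by edge, `G.edges(data=True)` yields each edge together with its attribute dictionary, which is handy for inspecting everything at once. A small self-contained sketch:

```python
import networkx as nx

# A miniature version of the graph above, with one weighted and one
# unweighted edge.
G = nx.Graph()
G.add_edge("node 1", "node 5", weight=3)
G.add_edge("node 1", "node 2")

# Each item is a (u, v, attributes) triple.
for u, v, attrs in G.edges(data=True):
    print(u, v, attrs)
```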

But reading even our small graph this way is impractical. Thankfully, there’s a much better representation.

### How to Generate Images From Graphs (and Weighted Graphs)

Visualizing a graph is essential: It lets us see the relationships between the nodes and the structure of the network quickly and clearly.

A quick call to `nx.draw(G)` is all it takes:

Let’s make higher-weight edges correspondingly thicker via our `nx.draw()` call:

```
# Fall back to a width of 1 for edges that have no weight attribute.
weights = [G[u][v].get("weight", 1) for u, v in G.edges()]
nx.draw(G, width=weights)
```

We provided a default thickness for weightless edges, as seen in the result:

Our methods and graph algorithms are about to get more complex, so the next step is to use a better-known dataset.

## Graph Data Science Using Data From the Movie *Star Wars: Episode IV*

To make our results simpler to interpret and understand, we’ll use this dataset. Nodes represent key characters, and edges (which aren’t weighted here) signify co-appearance in a scene.

*Note: The dataset is from Gabasova, E. (2016).* Star Wars *social network. DOI: https://doi.org/10.5281/zenodo.1411479.*
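The dataset’s interaction file can be turned into a NetworkX graph along these lines. This is only a sketch: a tiny inline sample stands in for the downloaded JSON (in practice you’d `json.load()` the full file), and the field names should be verified against the actual download.

```python
import json
import networkx as nx

# Tiny inline stand-in for the dataset file: characters live under "nodes",
# co-appearances under "links", with links given as node indices.
raw = """{
  "nodes": [{"name": "LUKE"}, {"name": "LEIA"}, {"name": "C-3PO"}],
  "links": [{"source": 0, "target": 1}, {"source": 0, "target": 2}]
}"""
data = json.loads(raw)

G_starWars = nx.Graph()
G_starWars.add_nodes_from(node["name"] for node in data["nodes"])
G_starWars.add_edges_from(
    (data["nodes"][link["source"]]["name"],
     data["nodes"][link["target"]]["name"])
    for link in data["links"]
)
print(sorted(G_starWars.nodes))  # ['C-3PO', 'LEIA', 'LUKE']
```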

First, we’ll visualize the data with `nx.draw(G_starWars, with_labels = True)`:

Characters that usually appear together, like R2-D2 and C-3PO, appear closely connected. In contrast, we can see that Darth Vader does not share scenes with Owen.

### Python NetworkX Visualization Layouts

Why is each node located where it is in the previous graph?

It’s the result of the default `spring_layout` algorithm. It simulates the force of a spring, attracting connected nodes and repelling disconnected ones. This helps highlight well-connected nodes, which end up in the center.
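Under the hood, a layout is just a dictionary of node positions, which you can compute yourself and reuse across draw calls. A small sketch (the seed value is arbitrary; it only makes the force simulation reproducible):

```python
import networkx as nx

# A toy graph: a triangle with one extra node hanging off it.
G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])

# spring_layout maps each node to an (x, y) position; pass pos=pos to
# nx.draw() to reuse the same arrangement.
pos = nx.spring_layout(G, seed=42)
print(pos["a"])  # a length-2 array with the node's coordinates
```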

NetworkX has other layouts that use different criteria to position nodes, like `circular_layout`:

```
pos = nx.circular_layout(G_starWars)
nx.draw(G_starWars, pos=pos, with_labels = True)
```

The result:

This layout is neutral in the sense that the location of a node does not depend on its significance; all nodes are represented equally. (The circular layout can also help visualize separate *connected components*, subgraphs having a path between any two nodes, but here, the whole graph is one giant connected component.)
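Those connected components can also be listed directly rather than eyeballed. A sketch on a toy graph with two obviously separate components:

```python
import networkx as nx

# "a"-"b"-"c" form one component; "x"-"y" form another, disconnected one.
G = nx.Graph([("a", "b"), ("b", "c"), ("x", "y")])

# nx.connected_components yields one set of nodes per component.
components = list(nx.connected_components(G))
print(components)
```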

Both of the layouts we’ve seen have a degree of visual clutter because edges are free to cross other edges. But Kamada-Kawai, another force-directed algorithm like `spring_layout`, positions the nodes so as to minimize the energy of the system.

This reduces edge crossings, but at a price: It’s slower than other layouts and therefore not recommended for graphs with many nodes.

This one has a specialized draw function:

```
nx.draw_kamada_kawai(G_starWars, with_labels = True)
```

That yields this shape instead:

Without any special intervention, the algorithm put main characters (like Luke, Leia, and C-3PO) in the center, and less prominent ones (like Camie and General Dodonna) near the border.

Visualizing the graph with a specific layout can give us some interesting qualitative results. Still, quantitative results are a vital part of any data science analysis, so we’ll need to define some metrics.

### Node Analysis: Degree and PageRank

Now that we can visualize our network clearly, it may be of interest to us to characterize the nodes. There are multiple metrics that describe the properties of the nodes and, in our example, of the characters.

One basic metric for a node is its *degree:* how many edges it has. The degree of a *Star Wars* character’s node measures how many other characters they shared a scene with.

The `degree()` function can calculate the degree of a single character or of the full network:

```
print(G_starWars.degree["LUKE"])
print(G_starWars.degree)
```

The output of both statements:

```
15
[('R2-D2', 9), ('CHEWBACCA', 6), ('C-3PO', 10), ('LUKE', 15), ('DARTH VADER', 4), ('CAMIE', 2), ('BIGGS', 8), ('LEIA', 12), ('BERU', 5), ('OWEN', 4), ('OBI-WAN', 7), ('MOTTI', 3), ('TARKIN', 3), ('HAN', 6), ('DODONNA', 3), ('GOLD LEADER', 5), ('WEDGE', 5), ('RED LEADER', 7), ('RED TEN', 2)]
```

Sorting nodes from highest to lowest according to degree can be done with a single line of code:

```
print(sorted(G_starWars.degree, key=lambda x: x[1], reverse=True))
```

The output:

```
[('LUKE', 15), ('LEIA', 12), ('C-3PO', 10), ('R2-D2', 9), ('BIGGS', 8), ('OBI-WAN', 7), ('RED LEADER', 7), ('CHEWBACCA', 6), ('HAN', 6), ('BERU', 5), ('GOLD LEADER', 5), ('WEDGE', 5), ('DARTH VADER', 4), ('OWEN', 4), ('MOTTI', 3), ('TARKIN', 3), ('DODONNA', 3), ('CAMIE', 2), ('RED TEN', 2)]
```

Being just a total, the degree doesn’t take into account details of individual edges. Does a given edge connect to an otherwise isolated node, or to a node that is connected to the whole network? Google’s PageRank algorithm aggregates this information to gauge how “important” a node is in a network.

The PageRank metric can be interpreted as an agent moving randomly from one node to another. Better-connected nodes have more paths leading through them, so the agent will tend to visit them more often.

Such nodes will have a higher PageRank, which we can calculate with the NetworkX library:

```
pageranks = nx.pagerank(G_starWars) # A dictionary mapping each character to a score
print(pageranks["LUKE"])
```

This prints Luke’s PageRank:

```
0.12100659993223405
```

With 19 characters in the graph, a uniform distribution would give each node a score of about 0.053, so Luke, who also has the highest degree, is by far the most central character. To rank every character, we can sort the dictionary by value with `sorted(pageranks, key=pageranks.get, reverse=True)`, much as we sorted by degree above.

The two metrics don’t always agree, though. A character with only a handful of scenes can earn a relatively high PageRank if those scenes are shared with important characters such as Luke, R2-D2, and C-3PO, while a character with many connections to minor characters can rank lower than their degree alone would suggest.

The takeaway: Using multiple metrics can give deeper insight into different properties of a graph’s nodes.

### Clustering Nodes: Community Detection

When analyzing a network, it may be useful to identify *communities*: groups of nodes that are highly connected to each other but sparsely connected with nodes outside their community.

There are multiple algorithms for this. Most of them are unsupervised machine learning algorithms, in the sense that they assign labels to nodes without the need for any prior labeling.

One of the most well-known is *label propagation*. In it, each node starts with a unique label, in a community of one. The labels of the nodes are iteratively updated according to the majority label of the neighboring nodes.

The labels diffuse through the network until all nodes share a label with most of their neighbors. Groups of nodes closely connected to each other end up with the same label.

With the NetworkX library, running this algorithm takes a mere three lines of Python:

```
from networkx.algorithms.community.label_propagation import label_propagation_communities
communities = label_propagation_communities(G_starWars)
print([community for community in communities])
```

The output:

```
[{'R2-D2', 'CAMIE', 'RED TEN', 'RED LEADER', 'OBI-WAN', 'DODONNA', 'LEIA', 'WEDGE', 'HAN', 'OWEN', 'CHEWBACCA', 'GOLD LEADER', 'LUKE', 'BIGGS', 'C-3PO', 'BERU'}, {'DARTH VADER', 'TARKIN', 'MOTTI'}]
```

In this list of sets, each set represents a community. Readers acquainted with the movie will notice the algorithm managed to perfectly separate the “good guys” from the “bad guys,” differentiating the characters meaningfully without using any ground-truth community label or metadata.
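Label propagation is not the only option NetworkX ships. As a sketch, its greedy modularity maximization algorithm can be swapped in the same way; here it is shown on a toy graph (two triangles joined by a single bridge edge) rather than on the movie data, whose results may differ between algorithms.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two triangles joined by one bridge edge: an obvious two-community structure.
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"),
              ("x", "y"), ("y", "z"), ("x", "z"),
              ("c", "x")])

# Greedily merges communities to maximize modularity; returns sets of nodes.
communities = greedy_modularity_communities(G)
print([set(c) for c in communities])
```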

## Intelligent Insights Using Graph Data Science in Python

We’ve seen that getting started with graph data science tools is more straightforward than it might sound. Once we represent data as a graph using the NetworkX library in Python, a few quick lines of code can be illuminating. We can visualize our dataset, measure and compare node properties, and cluster nodes sensibly via community detection algorithms.

Being able to extract conclusions and insights from a network using Python enables developers to integrate with the tools and methodology commonly found in data science pipelines. From search engines to flight scheduling to electrical engineering, these methods apply to a broad range of contexts.

### Recommended Reading on Graph Data Science

**Community Detection Algorithms**

Yang, Zhao, René Algesheimer, and Claudio Tessone. “A Comparative Analysis of Community Detection Algorithms on Artificial Networks.” Scientific Reports 6, no. 30750 (2016).

**Graph Deep Learning**

Thomas Kipf. “Graph Convolutional Networks.” September 30, 2016.

**Applications of Graph Data Science**

Albanese, Federico, Leandro Lombardi, Esteban Feuerstein, and Pablo Balenzuela. “Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter.” (August 24, 2020): arXiv:2008.10749 [cs.SI].

Cohen, Elior. “PyData Tel Aviv Meetup: Node2vec.” YouTube. November 22, 2018. Video, 21:09. https://www.youtube.com/watch?v=828rZgV9t1g.