## Basic Graph Theory

We begin with a top-level view of graphs by looking at definitions, structure, and dynamics.

### Graph Theory

A graph is a formal mathematical representation of a network ("a collection of objects connected in some fashion"). Each object in a graph is called a node (or vertex). Corresponding to the connections (or lack thereof) in a network are edges (or links) in a graph. Each edge in a graph joins two distinct nodes.

More formally, we define a graph G as an ordered pair G = (V, E) where

• V is a set of nodes (vertices).
• E is a set of edges (links).
• Each edge is a pair of vertices.

In other words, each element of E is a pair of elements of V. Example: The picture above represents the following graph:

• V = {1, 2, 3, 4, 5, 6}
• E = {{1, 2}, {1, 5}, {2, 3}, {2, 5}, {3, 4}, {4, 5}, {4, 6}}
• G = (V, E)

Loops: An edge with identical endpoints is a loop: We disallow loops: any graph G=(V,E) by definition has no loops in its set of edges E.

### Undirected and Directed

Undirected graph: The edges of a graph are assumed to be unordered pairs of nodes. Sometimes we say undirected graph to emphasize this point. In an undirected graph, we write edges using curly braces to denote unordered pairs. For example, an undirected edge {2,3} from vertex 2 to vertex 3 is the same thing as an undirected edge {3,2} from vertex 3 to vertex 2.

Directed graph: In a directed graph, the two directions are counted as being distinct directed edges. In an directed graph, we write edges using parentheses to denote ordered pairs. For example, edge (2, 3)  is directed from 2 to 3 , which is different than the directed edge (3, 2) from 3 to 2. Directed graphs are drawn with arrowheads on the links, as shown below: Neighborhood and Degree

Two vertices are called adjacent if they share a common edge, in which case the common edge is said to join the two vertices. An edge and a vertex on that edge are called incident.

See the 6-node graph for examples of adjacent and incident: Nodes 4 and 6 are adjacent (as well as many other pairs of nodes)

Nodes 1 and 3 are not adjacent (as well as many other pairs of nodes)

Edge {2,5} is incident to node 2 and node 5.

The neighborhood of a vertex v in a graph G is the set of vertices adjacent to v. The neighborhood is denoted N(v). The neighborhood does not include v itself. For example, in the graph below N(5) = {4, 2, 1} and  N(6) = {4}.

The degree of a vertex is the total number of vertices adjacent to the vertex. The degree of a vertex v is denoted deg(v). We can equivalently define the degree of a vertex as the cardinality of its neighborhood and say that for any vertex v, deg(v) = |N(v)|.

The following is a vertex degree table of the graph above.

 Vertex ID Vertex Degree 1 2 2 3 3 2 4 3 5 3 6 1

In a directed graph, we define degree exactly the same as above (and note that "adjacent" does not imply any direction or lack of direction).

In a directed graph it is important to distinguish between indegree and outdegree. Recall that any directed edge has two distinct ends: a head (the end with an arrowhead) and a tail. Each end is counted separately. The sum of head endpoints count toward the indegree of a vertex and the sum of tail endpoints count toward the outdegree of a vertex. The directed graph above has the following degrees, indegrees, and outdegrees:

 Vertex ID Vertex Degree Indegree Outdegree 1 2 0 2 2 4 3 1 3 1 0 1 4 2 1 1 5 1 1 0

### Density and Average Degree

The density of a graph G = (V, E) measures how many edges are in set E compared to the maximum possible number of edges between vertices in set V.

Density is calculated as follows: An undirected graph has no loops and can have at most $|V| \ast (|V|-1)/2$ edges, so the density of an undirected graph is $2 \ast |E|/(|V| \ast (|V|-1))$. A directed graph has no loops and can have at most $|V| \ast (|V|-1)$ edges, so the density of a directed graph is $|E|/(|V| \ast (|V|-1))$.

The average degree of a graph G is another measure of how many edges are in set E compared to number of vertices in set V. Because each edge is incident to two vertices and counts in the degree of both vertices, the average degree of an undirected graph is $2 \ast |E|/|V|$.

### Paths

A path in a graph represents a way to get from an origin to a destination by traversing edges in the graph. For example, in the undirected graph G = (V, E) drawn below, there are many paths from node 6 to node 1. One such path is highlighted with red: We write a path P  as an ordered list of directed edges:  P = ((v1, v2), (v2, v3), ..., (vk, vk + 1)) . The first vertex of the first edge of a path is the origin and the second vertex of the last edge is the destination. Both origin and destination are called endpoints of the path.

Examples:

• The red path above is ((6,4), (4,3), (3,2), (2,5), (5,1)); it is a path in G from node 6 to node 1
• The blue path below is ((6,4), (4,5), (5,1)); it is also a path in G from node 6 to node 1 Paths in undirected graphs defined formally

Even in an undirected graph (such as the 6-node drawing above), we define a path as an ordered list of directed edges: P = ((v1, v2), (v2, v3), ..., (vk, vk + 1)). The strict ordering of nodes in the definition of a path corresponds to the order in which nodes and edges are traversed to get from the origin to the destination.

Let us suppose we have an undirected graph G = (V, E) and an ordered list of directed edges P = (e1, e2, ..., ek). P is a path in G only if the following are true of every directed edge ei = (vi, v i + 1) in P:

Directed edge ei in P corresponds to an undirected edge in  E.

In other words, {vi, vi + 1} ∈ E.

If directed edge is not the last edge in P , then let ei + 1  be the next edge in P. Directed edge  ei + 1  starts with a tail node identical to the head node of  ei .

This means that ei = (vi, vi + 1) and ei + 1 = (vi + 1, vi + 2).

The final requirement for P to be a path in G is that P cannot revisit any nodes. An ordered list of directed edges that meets the above two requirements and that does revisit nodes is called a walk. Examples of walks that are not paths, based on the 6-node graph drawn above:

• W = ((6, 4), (4, 3), (3, 2), (2, 5), (5, 3), (4, 3), (3, 2), (2, 5), (5,1)) is a walk from node 6 to node 1
• W = ((6, 4), (4, 6), (6, 4), (4, 5), (5, 1), (1, 5), (5, 1)) is a walk from node 6 to node 1

A walk does not have to revisit nodes. Therefore, any path by definition is also a walk.

### Paths in directed graphs

A directed path is a path in a directed graph where the directions of edges in the path match the directions of edges in the directed graph. For example, suppose we consider this directed graph: The blue lines highlight P = ((6, 4), (4, 5), (5, 1)) which which is a directed path from 6 to 1 in directed graph G.

As a counter-example, consider the same directed graph with a different ordered list of edges that is not a directed path: In this case, the red lines highlight P = ((6, 4), (4, 3), (3, 2), (2, 5), (5, 1)); the 4th edge of P -- (2, 5) -- goes the "wrong way"; and so P  is not a directed path in G. We instead refer to P as an undirected path in a directed graph. Information Today's "Bow Tie Structure of the Web" article puts it this way: "In an undirected path, hyperlinks can be followed forward or backward, a technique available to search engine spiders but not to a person using a Web browser".

### Length

The length of a path is the number of edges that it uses. Note that this definition is merely a restatement of the definition of the length of an ordered list.

Examples:

• The length of the red path below is 5
• The length of the blue path below is 3 ### Distance

Given a graph G, The distance d(x, y) between two vertices x and y is the length of the shortest path from x to y, considering all possible paths in G from x to y.

The distance between any node and itself is 0. If there is no path from x to y then d(x, y) is infinity.

Examples:

• In the preceding graph, the distance from 6 to 1 is 3.
• There is no path from 6 to 1 shorter than 3.
• There is only one path from 6 to 1 with length 3, and that path is ((6,4), (4,5), (5,1)).
• There are many longer paths from 6 to 1, none of which has any bearing on the distance from 6 to 1.
• In the graph below, the distance from b to c is 2.
• There is no path from b to c shorter than 2.
• There are multiple paths from b to c of length 2. One such path is ((b,f), (f,c)). There are also many longer paths from b to c.
• In the graph below, the distance from a to b is infinity.
• There is no path from a to b. The above examples all rely on small graphs where a simple visual inspection reveals what is the shortest path between any specified pair of nodes. For larger graphs, it is generally not possible to discern shortest paths so easily. Computer science provides algorithms, such as the breadth-first search algorithm, to compute shortest paths (and hence distances) in a graph of any size.

For now, we will limit our calculations of shortest paths to small graphs where visual inspection is easy and a formal algorithmic approach is not necessary.

### "Connected": a word of many meanings

The word "connected" speaks to the most basic structural properties of networks. It is arguably both the most important and the most overused term in the network vocabulary.

Important and proper uses of "connected":

Connected has several distinct formal definitions, each of which is important. We introduce three of them here and elaborate the details later.

• A path connects its origin and destination nodes
• Two nodes are connected if there is at least one path that connects them
• A graph is connected if each of its nodes is connected to all the others

Examples:

• Nodes 6 and 2 are connected in graph G1 below
• Nodes a and b are not connected in graph G2 below
• Graph G1 is connected; graph G2 is not connected Graph G1 = (V1,E1) Graph G2 = (V2,E2)

#### Common sloppy uses of "connected":

Sometimes "connected" is used informally when another term would be more clear. For example, avoid using "connected" when one of these unambiguous words would apply:

• Joined
• Incident

### Induced Subgraphs

It is often useful to consider only part of a graph. Induced subgraphs are one particularly convenient way to define a specific sub-part of a larger graph.

Given a graph G = (V, E), an induced subgraph is defined by a subset of vertices S (i.e., S ⊆  V). Then

• The nodes of the subgraph induced by S are simply the nodes in S, and
• The edges of the subgraph induced by S are all edges in E that have both endpoints in S.
• We can write the induced subgraph formally as GS = (S, ES) , where ES is the set of edges defined above (edges in E that have both endpoints in S).

Example 1: Consider the graph G = (V, E) drawn below. Original graph

The subgraph induced by the subset of nodes S = {1, 2, 3, 4} can then be drawn: Induced subgraph

Example 2: Let G = (V, E) be the Facebook friends network:

• V = {x: x has a Facebook account}
• E = {{x,y}: x and y are Facebook friends}

This is a big network: if we want to draw it, we must consider a small subgraph.

### "Connected" defined formally

Earlier we stated that:

• A path connects its origin and destination nodes
• Two nodes are connected if there is at least one path that connects them
• A graph is connected if each of its nodes is connected to all the others

"Connected nodes" defined formally:

Given an undirected graph G = (V, E), two nodes x and y (i.e., x ∈ V and y ∈ V) are connected if and only if one of the following is true:

• x = y, or
• there is at least one path in G with origin x and destination y

The most important clarification in our formal definition of connected nodes is that any node is by definition connected to itself.

### Connected graphs and connected components

A connected graph (as defined above) is said to consist of a single connected component. If a graph is not connected, then it consists of two or more connected components.

The book Six Degrees offers this intuitive metaphor for connected component:

"Suppose the nodes of a graph are buttons and the edges of the graph are threads joining the buttons. The buttons and threads are all on the floor. Grab one button and lift it. If that button has any threads, then other buttons will start to rise in addition to the one button you are holding. Each additional button that rises off the floor may have even more threads that cause even more buttons to rise. Keep lifting until every button connected to the one in your hand (by any combination of threads) is off the floor. The buttons and threads that have risen off the floor are a connected component".

The following examples illustrate the idea of connected component.

Graph G1 is connected and therefore has 1 connected component: Graph G2 has 2 connected components: Graph G3 has 3 connected components: A connected component can be defined as an induced subgraph. For example, G3 (above) consists of three connected components:

• the subgraph induced by {1,2,5,6}
• the subgraph induced by {3,4}
• the subgraph induced by {7,8}

### Hubs

A hub is a node in a graph with much higher degree than average. For example, nodes 1 and 2 are both hubs in the figure below: ### Clusters

Informally speaking, a cluster is a group of nodes in a graph that are more connected to each other than they are to the rest of the graph. For example, the red and yellow regions below are clusters: #### Defining clusters, part one: connected components

The following sections present a "semi-formal" definition of clusters in three parts:

• First, we formally define connected component
• Second, we introduce and formally define clique
• Third, we informally define cluster based on the above two formal definitions

Given an undirected graph G = (V, E), we define a connected component to be a subgraph induced by node set S ⊂ V, i.e., GS = (S, ES), with the following two properties:

1. GS is connected
2. There is no edge in E that joins a node in S to a node not in S

The above 2 mathematical properties of a connected component translate into the button-and-thread metaphor as follows:

GS is connected: The only buttons that rise off the floor do so because of the one button you are lifting and the threads that ultimately connect that button to other buttons (i.e., paths)

There is no edge in E that joins a node in S to a node not in S: You have lifted the one button in your hand high enough so that no more buttons will come off the floor no matter how much higher you lift the button you are holding.

For example:

The graph G(drawn above and again below) is not a connected component because it violates property : it is not connected. Graph G2 = (V2,E2)

The subgraph induced by S = {c, d, e, f} (drawn with dark nodes and edges below) is not a connected component because it violates property #2. There are edges in E that join nodes in S to node b, which is a node not in S. Defining clusters, part two: cliques

In an undirected graph G = (V, E) a clique is a subgraph induced by node set S ⊆ V, i.e., GS = (S, ES) such that the density of GS is 1. Put another way, in a clique, every pair of nodes is adjacent.

Example: Consider undirected graph G = (V, E) drawn below. The following are cliques:

• The subgraph induced by {c, d, e, f}
• The subgraph induced by {b, f}

The following are not cliques:

• The subgraph induced by {a, b, c}
• The subgraph induced by {b, c, d, e}, which is drawn below. The subgraph is not a clique since its density is less than 1. The subgraph induced by {c, d, e, f} is the largest clique in G: the clique with the most nodes. Usually the largest clique in a graph is the most interesting clique in that graph.

### Defining clusters, part three

Our definition of cluster informally draws upon our formal definitions of connected components and cliques.

In an undirected graph G = (V, E) a cluster is a subgraph induced by node set S ⊆ V, i.e., GS = (S, ES) with the following two properties:

• The average degree of GS is "relatively high"; (a relaxed adaptation of clique-ness)
• There are "relatively few" edges in E that join a node in S to a node not in S; (a relaxed adaptation of connected component-ness)

Example: In the graph G = (V, E) drawn below, the following subsets of nodes induce subgraphs that can fairly be called clusters:

Subsets of nodes:

• {1, 2, 3, 4, 5, 7}
• {20, 21, 22, 23, 24, 25}
• {14, 15, 16}

Corresponding clusters: Not all connected components are clusters. For example:

• In the graph above, the subgraph induced by {9, 10, 11, 12, 13} is a connected component, but it is arguably not a cluster, because the average degree of that induced subgraph is relatively low.
• The subgraph induced by {6} is a connected component, but it is definitely not a cluster, because the average degree of the induced subgraph is zero.

Not all cliques are clusters – even relatively large cliques. For example:

• The subgraph induced by {1, 2, 3, 4} is a clique, but it is not a cluster because every single one of those nodes is adjacent to node 5; there are too many edges joining nodes in {1, 2, 3, 4} to nodes not in {1, 2, 3, 4}.
• Even the largest clique in the above graph – the subgraph induced by {1, 2, 3, 4, 5} – is arguably still not a cluster because node 7 is adjacent to so many nodes in {1, 2, 3, 4, 5}.

### Limitations of traditional graph theory

Conceived in 1735 by the fertile Leonhard Euler, graph theory developed over the next 200 years to include topics such as:

• Shortest paths: Given a graph G = (V, E) and a pair of nodes x, y: How can we find a shortest path from x to y?
• Maximum cliques: A clique is a subset of nodes that are all adjacent to each other. Given a graph G = (V, E): How can we find large cliques in G? How can we find the largest clique in G? (Please refer to the network structure page for a more thorough definition of clique.)
• Graph coloring: Given a graph and the ability to assign a "color" to each of its nodes: How can we use the fewest possible number of distinct colors so that (1) every node has a color assigned to it, and (2) no two adjacent nodes are assigned the same color?

In studying these and other questions, graph theorists usually made two important assumptions: (1) The graph is known, and (2) the graph does not change.

When we use graph theory to study the Web, we must acknowledge that these two assumptions are not realistic. For example:

• The Web graph is only partially known. Estimating the cardinality of the set of all Web pages is enough to stretch our powers of creative arithmetic; what if we actually had to name all 30-70 billion of them? Our knowledge of the set of all hyperlinks is rougher still.
• The Web graph is constantly changing. Even if all Web builders were to cease and desist immediately, the Web pages they have already created include many that automatically change their outgoing links each time a Web user visits. For example, Google, Facebook, Amazon, etc. all present different links to different users at different times.

### Introduction to network dynamics

In the 1940s and 50s, graph theory expanded in important ways to strengthen our understanding of network dynamics. Two notable milestones were:

1. Studying dynamic flows: commodities that move over predetermined graphs. During World War II, a great deal of attention was paid to questions such as: "How do we get supplies to our soldiers despite our enemies' attempts to stop us"? Here nodes represent sources, destinations, and transfer stations for commodities such as food, fuel, ammunition, and medicine. Edges represent roads and other means of transport between pairs of nodes.
2. Studying dynamic graphs: sets of nodes and edges that change over time. Just after World War II, for reasons much more theoretical than supplying troops, mathematicians formalized the study of graphs whose very nodes and edges change over time. For example, if a graph started with 1,000 nodes and no edges, and then edges were added randomly one at a time to join those 1,000 nodes, what could we say about the evolving properties of that dynamic graph? Both types of network dynamics are important to understanding the Web. For example:
1. Information flows through the Web via the Internet. Typically, Web hosts provide information, Web browsers request and receive information, and Internet routers are transfer stations. Internet connections such as cables and wi-fi correspond to edges.
2. The Web is an ever-changing collection of pages and links. So many Web pages and hyperlinks are being added and edited at this very moment that any attempt to catalog them would be outdated before it could be completed. Pages are also removed from the Web (either temporarily by service disruptions or permanently by decommissioning), resulting in "dead links" and other complications that frustrate our hope to know the Web.

Our discussion here will focus on the second type of network dynamics: What are the evolving properties of the Web as millions of people continually edit its pages and hyperlinks? This is a core question in the realm of Web Science.

The first type of network dynamics–how information flows through the Web–is no less important than our chosen focus. We defer its discussion for two reasons: (1) The question of how information can best move right now from a Web host to a Web browser, known as packet-routing, has been well-studied in the realm of traditional Computer Science, where we can use it as a "black box"; and (2) The question of how information spreads over time through populations of Web users (e.g., information diffusion) is more advanced than the material presented here.

Also, in focusing on the ever-changing Web, we will ignore dynamic pages such as Google, Facebook and Amazon, that present different links each time a user visits. We will focus on static pages that millions of regular Web builders make every day with HTML and CSS. What do we learn when we view each of those pages not as an isolated artifact but as a crossroads in a network of millions of Web users and builders?

### Random Graphs

Chapter 2 of Six Degrees starts with an excellent introduction to random graphs. Have that handy before you continue.

Random graphs (and their dynamics) are based on the following assumptions:

• The graph we start with has many nodes but no edges.
• Edges are added randomly one at a time to the graph. (I.e., any pair of non-adjacent nodes is equally likely to be the next edge added to the existing graph.)
• Eventually, we end with a clique: every pair of nodes is adjacent.

Given the above framework, how does a graph evolve between the starting and ending points? That is a core question of random graph dynamics.

One of the most famous results in the dynamics of random graph describes the size of the largest connected component, also called the giant component. Read in Six Degrees the explanation of Figure 2.2 on p. 45. A similar figure labeled with our own terminology is below: After you absorb Six Degrees, then consider what this means:

• By eyeballing the above figure, we can estimate that a random graph with average degree of 2 has a giant component with 75% of all nodes.
• Whether |V| is big or small doesn't matter at all: To connect a set of nodes into a giant component, just randomly add |V| edges and you have guaranteed that the average degree is at least 2, thus creating a giant component with 75% of all nodes.

Example: Suppose all links are deleted from the Web. Only the 30-70 billion link-less pages remain. If each author of each Web page added just one random link to each of his pages, then the resulting giant component would connect 75% of all Web pages into a giant component connected by undirected paths (which allow both forward and backward traversing of hyperlinks).

### Random graph algorithm

An algorithm is a set of instructions that explain how to compute something. It is like a recipe, only for making data instead of food.

An algorithm has three parts:

1. Input (like the ingredients of a recipe)
2. Output (like the title of a recipe: "chile con carne, serves 12")
3. The steps of the algorithm (like the steps of the recipe)

The following algorithm describes the creation of a random graph.

#### Random Graph Algorithm:

Input:

1. Node set V, with |V|>1

Output:

1. Graph G = (V, E), with E = {{x, y} : x ∈ V and y ∈ V and : x ≠  y} (i.e., a clique)

Steps:

1. Define E = { } and G = (V, E)

2. Choose random pair of nodes x, y that are non-adjacent

3. Add element {x, y} to set E

4. If |E| = |V| * (|V| - 1)/2 then we are done

5. Go to step 2

Note: The main point of most algorithms is generating output, just as the main reason for most recipes is serving food. For example, a sorting algorithm is useful as a way to alphabetize names, and Google's PageRank algorithm is useful as a way to list the most popular and influential Web pages first.

Our random graph algorithm, however, does not generate interesting output. It creates a clique using a more complicated sequence of steps than is necessary to create a clique. The point of the algorithm is the journey it describes, not the destination where it eventually arrives.

### Clusters and homophily

Informally speaking, a cluster is a group of nodes in a graph that are more connected to each other than they are to the rest of the graph. For example, the red and yellow regions below are clusters: Clusters provide a good big-picture view of the Web, much like a table of contents provides a big-picture view of a book. Unlike a book and its table of contents, however, a cluster involves many authors (Web builders) acting independently. Also, a Web builder may be unaware of the clusters he is forming.

The sociological force behind Web clusters is homophily, or "birds of a feather" (which "flock together"). Here is one important way that homophily impacts the Web:

1. When a Web builder links from his site to another site, he usually does so because he perceives something important shared by his site and the other site.
2. Therefore hyperlinks represent more than channels of navigation; they also represent shared likes and dislikes.
3. The navigational meaning of a hyperlink is clear and explicit, but the shared likes and dislikes represented by a hyperlink are often unstated and implicit.
4. Therefore clusters of mutually interlinked Web pages represent self-organized groups with specific shared likes and dislikes, which are usually unstated and implicit.

Motivated by the above, cluster-based search engines find interlinked groups of Web pages and then deduce the specific "meaning" of each cluster by analyzing the content of the interlinked pages. An example of cluster-based search engines is iBoogie.

Triadic closure is a simple and effective mathematical model of homophily in a network. The basic idea is illustrated below: Nodes 2 and 3 share a mutual neighbor but are not adjacent. Adding edge {2, 3} is then an example of triadic closure.

Triadic closure usually occurs in the context of a larger graph. For example, in the graph below–considering only the solid black lines as edges–there are many possible opportunities for triadic closure. Just a few of those possible opportunities are illustrated with dotted red lines. Chapter 2 of Six Degrees pp 56-61 has a good introduction to homophily and triadic closure. Read that before you continue. The following algorithm describes the process of repeated triadic closure:

Input:

An undirected graph G = (V, E) with |V| > 0 and |E| > 0

Output:

See below…

Steps:

1. Look for two nodes x, y that are non-adjacent and share a common neighbor
2. If no such pair of nodes x, y exists, then we are done
3. Add element {x, y} to set E
4. Go to step 1

The output of the triadic closure algorithm has some notable properties. We can deduce these properties from the following observations:

Consider the closing of a single triad: Nodes x and y are non-adjacent and share a common neighbor; then x and y are joined by edge {x, y}.

1. Adding edge {x, y} does not change the number of connected components in G. Nodes x and y were already connected before we joined them with {x, y}.

2. Adding edge {x, y} increases the density of the connected component that contains nodes x and y.

Repeating the above observations over the course of every iteration of the triadic closure algorithm, we see that the output graph G' = (V, E') computed by the algorithm has these two properties

1. Output graph G' has exactly the same number of connected components as input graph G. Furthermore, the nodes that induce each connected component are the same in G' and G.
2. Each connected component of G' has the maximum possible density: it is a clique.

Example: Suppose the triadic closure algorithm starts with graph G = (V, E) drawn below: Upon completion, the algorithm will have added all the orange edges drawn below: The output graph has the same number of connected components as the input graph (with the same nodes in those components). And each connected component in the output graph is a clique.

An algorithm that identifies and closes triads is an example of random-bias. A network resulting from that algorithm is an example of randomly-biased network (graph). Intuitively, a random-biased network is a network that obeys to some property but it is otherwise random. In the triadic closure example, the bias is towards closing a triad (adding an edge) if two nodes have already a common neighbor; the two nodes with the common neighbor are selected however at random.

A hub is a node in a graph with a much higher degree than average. For example, nodes 1 and 2 are both hubs in the figure below: Centrality is the mathematical term most commonly used to describe hubs. Degree and indegree are two common simple measures of centrality.

Google made centrality-based Web search famous, and centrality-based search engines now dominate the Web. If cluster-based search engines reveal something like a Web "table of contents", then centrality-based search engines provide quick shortcuts to "the good parts"–the most popular pages and pictures.

The sociological force behind Web hubs is cumulative advantage, or "rich get richer". Here is one important way that cumulative advantage impacts the Web:

1. When a Web builder is looking for online resources, he probably uses a centrality-based search engine like Google, and he is more likely to look at higher-ranked pages.
2. Pages atop centrality-based rankings are there because they are already popular (essentially because they have high indegree.)
3. Therefore Web builders are more likely to link to (and increase the indegree of) pages that are already popular. It is possible that Web builders will never even see the "good" pages that are not popular.
4. Therefore Google and its ilk, to some extent, drive a feedback loop that amplifies the popularity of whatever is already popular. In other words, the rank of a Web page next week has less to do with its content and more to do with its rank this week. The main change to be expected in next week's rankings is that disparities between high rankings and low rankings will increase. The rich get richer.

### Preferential attachment algorithm

Chapter 4 of Six Degrees pp 108-109 has a good introduction to cumulative advantage (known also as "rich get richer" and the "Matthew effect" among other names) and its effect on network dynamics. Read that before you continue.

Preferential attachment is the mathematical model used to represent the force of cumulative advantage in network dynamics. The algorithm below describes the process of repeated preferential attachment. This is equivalent to the algorithm informally described on pp 108-109 of Six Degrees:

Preferential Attachment Algorithm:

Input:

1. An undirected graph G = (V, E) such that V = {1, 2, 3, …, k}, k > 1, and |E| > 0
2. Integer n, with n > k. Variable n represents the number of nodes in the output graph

Output:

See example below…

Steps:

1. i = k + 1
2. Choose a node x based on degree: if deg(x) = 2 * deg(y), then x is twice as likely to be picked as y.
3. Add element i to set V
4. Add element {i,x} to set E
5. Add 1 to the value of i
6. if i > n then we are done

7. Go to step 2

Example: Suppose the preferential attachment algorithm starts with V = {1, 2}, E = {{1, 2}}, and n = 20. Therefore k = 2 and the input graph can be drawn as: The sequence below illustrates the first few iterations of the algorithm.

Iteration 1

Create node 3 and join it to a pre-existing node. Node 1 and node 2 are equally likely. Suppose the algorithm joins node 3 to node 1. Then…

Iteration 2

Create node 4 and join it to a pre-existing node. Node 1 is twice as likely as node 2 or node 3. Suppose the algorithm joins node 4 to node 1. Then…

Iteration 3

Create node 5 and join it to a pre-existing node. Node 1 is 3 times as likely as node 2, 3 or 4. Suppose the algorithm joins node 5 to node 2 (i.e., a lucky break for node 2). Then…

Iteration 4

Create node 6 and join it to a pre-existing node. Iterations continue...

In the very last iteration, node 20 is joined to node 2

Output graph: Notice in the above example that nodes 1 and 2 start out as equals. However, the random choice of node 1 in the first iteration gives node 1 a nudge up in popularity that feeds on itself. In the end, node 1 is the single dominant hub and node 2 is a second-tier mini-hub.

A seemingly insignificant event in iteration 1 has caused a very substantial effect because of cumulative advantage.

Source: Flavio Esposito, https://cs.slu.edu/~esposito/teaching/1080/webscience/graphs.html

https://cs.slu.edu/~esposito/teaching/1080/webscience/structure.html