Basic Graph Theory
We begin with a top-level view of graphs by looking at definitions, structure, and dynamics.
Graph Theory
A graph is a formal mathematical representation of a network ("a collection of objects connected in some fashion").
Each object in a graph is called a node (or vertex). Corresponding to the connections (or lack thereof) in a network are edges (or links) in a graph. Each edge in a graph joins two distinct nodes.
More formally, we define a graph G as an ordered pair G = (V, E) where
 V is a set of nodes (vertices).
 E is a set of edges (links).
 Each edge is a pair of vertices.
In other words, each element of E is a pair of elements of V. Example: The picture above represents the following graph:
 V = {1, 2, 3, 4, 5, 6}
 E = {{1, 2}, {1, 5}, {2, 3}, {2, 5}, {3, 4}, {4, 5}, {4, 6}}
 G = (V, E)
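As a sketch of how this definition might be represented in code (the names V, E, and has_edge are our own, not from the text), each undirected edge can be stored as a Python frozenset so that {1, 2} and {2, 1} compare equal:

```python
# The example graph above: V = {1,...,6} and seven undirected edges.
V = {1, 2, 3, 4, 5, 6}
E = {frozenset(e) for e in
     [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]}

def has_edge(E, x, y):
    """True if the undirected edge {x, y} is in E."""
    return frozenset((x, y)) in E
```

With this representation, has_edge(E, 2, 1) and has_edge(E, 1, 2) necessarily agree, matching the unordered-pair definition.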
Loops: An edge with identical endpoints is called a loop. We disallow loops: any graph G = (V, E) by definition has no loops in its edge set E.
Undirected and Directed
Undirected graph: The edges of a graph are assumed to be unordered pairs of nodes. Sometimes we say undirected graph to emphasize this point. In an undirected graph, we write edges using curly braces to denote unordered pairs. For example, an undirected edge {2,3} from vertex 2 to vertex 3 is the same thing as an undirected edge {3,2} from vertex 3 to vertex 2.
Directed graph: In a directed graph, the two directions are counted as distinct directed edges. In a directed graph, we write edges using parentheses to denote ordered pairs. For example, edge (2, 3) is directed from 2 to 3, which is different from the directed edge (3, 2) from 3 to 2. Directed graphs are drawn with arrowheads on the links, as shown below:
Neighborhood and Degree
Two vertices are called adjacent if they share a common edge, in which case the common edge is said to join the two vertices. An edge and a vertex on that edge are called incident.
See the 6-node graph for examples of adjacent and incident:
Nodes 4 and 6 are adjacent (as well as many other pairs of nodes)
Nodes 1 and 3 are not adjacent (as well as many other pairs of nodes)
Edge {2,5} is incident to node 2 and node 5.
The neighborhood of a vertex v in a graph G is the set of vertices adjacent to v. The neighborhood is denoted N(v). The neighborhood does not include v itself. For example, in the graph below N(5) = {4, 2, 1} and N(6) = {4}.
The degree of a vertex is the total number of vertices adjacent to it. The degree of a vertex v is denoted deg(v). We can equivalently define the degree of a vertex as the cardinality of its neighborhood and say that for any vertex v, deg(v) = |N(v)|.
The following is a vertex degree table of the graph above.
Vertex ID | Vertex Degree
1         | 2
2         | 3
3         | 2
4         | 3
5         | 3
6         | 1
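The neighborhood and degree definitions translate directly into code. Here is a Python sketch for the 6-node example graph (the function names are our own; edges are stored as 2-element frozensets):

```python
# The 6-node example graph.
E = {frozenset(e) for e in
     [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]}

def neighborhood(E, v):
    """N(v): the set of vertices adjacent to v (v itself excluded)."""
    return {w for edge in E if v in edge for w in edge if w != v}

def degree(E, v):
    """deg(v) = |N(v)| in a loop-free undirected graph."""
    return len(neighborhood(E, v))
```

Running degree over vertices 1 through 6 reproduces the vertex degree table above.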
In a directed graph, we define degree exactly the same as above (and note that "adjacent" does not imply any direction or lack of direction).
In a directed graph it is important to distinguish between indegree and outdegree. Recall that any directed edge has two distinct ends: a head (the end with the arrowhead) and a tail. Each end is counted separately: head endpoints count toward the indegree of a vertex, and tail endpoints count toward the outdegree of a vertex.
The directed graph above has the following degrees, indegrees, and outdegrees:
Vertex ID | Vertex Degree | Indegree | Outdegree
1         | 2             | 0        | 2
2         | 4             | 3        | 1
3         | 1             | 0        | 1
4         | 2             | 1        | 1
5         | 1             | 1        | 0
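Indegree and outdegree are easy to count in code when each directed edge is stored as an ordered (tail, head) pair. The edge set below is an assumption: the original figure is not reproduced here, so we use one directed edge set consistent with the degree table above (function names are also our own):

```python
# A hypothetical directed edge set consistent with the table above;
# each edge is an ordered pair (tail, head).
E = {(1, 2), (1, 4), (3, 2), (4, 2), (2, 5)}

def in_degree(E, v):
    """Number of directed edges whose head is v."""
    return sum(1 for (tail, head) in E if head == v)

def out_degree(E, v):
    """Number of directed edges whose tail is v."""
    return sum(1 for (tail, head) in E if tail == v)
```

For every vertex, in_degree plus out_degree equals the (undirected) degree, since each edge end is counted exactly once.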
Density and Average Degree
The density of a graph G = (V, E) measures how many edges are in set E compared to the maximum possible number of edges between vertices in set V.
Density is calculated as follows: An undirected graph has no loops and can have at most |V|(|V| - 1)/2 edges, so the density of an undirected graph is 2|E| / (|V|(|V| - 1)). A directed graph has no loops and can have at most |V|(|V| - 1) edges, so the density of a directed graph is |E| / (|V|(|V| - 1)).
The average degree of a graph G is another measure of how many edges are in set E compared to the number of vertices in set V. Because each edge is incident to two vertices and counts in the degree of both vertices, the average degree of an undirected graph is 2|E| / |V|.
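These formulas can be checked against the 6-node example graph, which has 7 edges. A Python sketch (function names are our own):

```python
# The 6-node example graph.
V = {1, 2, 3, 4, 5, 6}
E = {frozenset(e) for e in
     [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]}

def density_undirected(V, E):
    """|E| divided by the maximum possible |V|(|V|-1)/2 edges."""
    n = len(V)
    return len(E) / (n * (n - 1) / 2)

def average_degree(V, E):
    """2|E| / |V|: each edge contributes to two vertices' degrees."""
    return 2 * len(E) / len(V)
```

For the example graph the density is 7/15 and the average degree is 14/6, in agreement with the degree table above (degrees sum to 14 over 6 vertices).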
Paths
A path in a graph represents a way to get from an origin to a destination by traversing edges in the graph. For example, in the undirected graph G = (V, E) drawn below, there are many paths from node 6 to node 1. One such path is highlighted with red:
Examples:
 The red path above is ((6,4), (4,3), (3,2), (2,5), (5,1)); it is a path in G from node 6 to node 1
 The blue path below is ((6,4), (4,5), (5,1)); it is also a path in G from node 6 to node 1
Even in an undirected graph (such as the 6-node drawing above), we define a path as an ordered list of directed edges: P = ((v_{1}, v_{2}), (v_{2}, v_{3}), ..., (v_{k}, v_{k+1})). The strict ordering of nodes in the definition of a path corresponds to the order in which nodes and edges are traversed to get from the origin to the destination.
Let us suppose we have an undirected graph G = (V, E) and an ordered list of directed edges P = (e_{1}, e_{2}, ..., e_{k}). P is a path in G only if the following are true of every directed edge e_{i} = (v_{i}, v _{i + 1}) in P:
Directed edge e_{i} in P corresponds to an undirected edge in E.
In other words, {v_{i}, v_{i + 1}} ∈ E.
If directed edge e_{i} is not the last edge in P, then let e_{i+1} be the next edge in P. Directed edge e_{i+1} starts with a tail node identical to the head node of e_{i}.
This means that e_{i} = (v_{i}, v_{i+1}) and e_{i+1} = (v_{i+1}, v_{i+2}).
The final requirement for P to be a path in G is that P cannot revisit any nodes. An ordered list of directed edges that meets the above two requirements, whether or not it revisits nodes, is called a walk. Examples of walks that are not paths, based on the 6-node graph drawn above:
 W = ((6, 4), (4, 3), (3, 2), (2, 5), (5, 4), (4, 3), (3, 2), (2, 5), (5, 1)) is a walk from node 6 to node 1
 W = ((6, 4), (4, 6), (6, 4), (4, 5), (5, 1), (1, 5), (5, 1)) is a walk from node 6 to node 1
A walk does not have to revisit nodes. Therefore, any path by definition is also a walk.
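The two edge-by-edge requirements plus the no-revisit rule can be sketched as checks over an ordered list of directed edges (a Python sketch; is_walk and is_path are our own names):

```python
# The 6-node example graph, with edges stored as unordered frozensets.
E = {frozenset(e) for e in
     [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]}

def is_walk(E, P):
    """P is a list of directed edges. Each must correspond to an edge
    in E, and each edge's head must equal the next edge's tail."""
    if not P:
        return False
    if any(frozenset(e) not in E for e in P):
        return False
    return all(P[i][1] == P[i + 1][0] for i in range(len(P) - 1))

def is_path(E, P):
    """A path is a walk that never revisits a node."""
    if not is_walk(E, P):
        return False
    nodes = [P[0][0]] + [head for (tail, head) in P]
    return len(nodes) == len(set(nodes))
```

The blue path ((6,4), (4,5), (5,1)) passes both checks, while the longer walks above pass is_walk but fail is_path because they revisit nodes.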
Paths in directed graphs
A directed path is a path in a directed graph where the directions of edges in the path match the directions of edges in the directed graph. For example, suppose we consider this directed graph:
The blue lines highlight P = ((6, 4), (4, 5), (5, 1)), which is a directed path from 6 to 1 in directed graph G.
As a counterexample, consider the same directed graph with a different ordered list of edges that is not a directed path:
In this case, the red lines highlight P = ((6, 4), (4, 3), (3, 2), (2, 5), (5, 1)); the 4th edge of P, (2, 5), goes the "wrong way", and so P is not a directed path in G. We instead refer to P as an undirected path in a directed graph. Information Today's "Bow Tie Structure of the Web" article puts it this way: "In an undirected path, hyperlinks can be followed forward or backward, a technique available to search engine spiders but not to a person using a Web browser".
Length
The length of a path is the number of edges that it uses. Note that this definition is merely a restatement of the definition of the length of an ordered list.
Examples:
 The length of the red path below is 5
 The length of the blue path below is 3
Distance
Given a graph G, the distance d(x, y) between two vertices x and y is the length of the shortest path from x to y, considering all possible paths in G from x to y.
The distance between any node and itself is 0. If there is no path from x to y then d(x, y) is infinity.
Examples:
 In the preceding graph, the distance from 6 to 1 is 3.
 There is no path from 6 to 1 shorter than 3.
 There is only one path from 6 to 1 with length 3, and that path is ((6,4), (4,5), (5,1)).
 There are many longer paths from 6 to 1, none of which has any bearing on the distance from 6 to 1.
 In the graph below, the distance from b to c is 2.
 There is no path from b to c shorter than 2.
 There are multiple paths from b to c of length 2. One such path is ((b,f), (f,c)). There are also many longer paths from b to c.
 In the graph below, the distance from a to b is infinity.
 There is no path from a to b.
The above examples all rely on small graphs where a simple visual inspection reveals the shortest path between any specified pair of nodes. For larger graphs, it is generally not possible to discern shortest paths so easily. Computer science provides algorithms, such as the breadth-first search algorithm, to compute shortest paths (and hence distances) in a graph of any size.
For now, we will limit our calculations of shortest paths to small graphs where visual inspection is easy and a formal algorithmic approach is not necessary.
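For a sense of what such an algorithm looks like, here is a breadth-first search sketch that computes d(x, y) on the 6-node example graph (the function name is our own):

```python
from collections import deque

# The 6-node example graph.
E = {frozenset(e) for e in
     [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]}

def distance(E, x, y):
    """Length of a shortest path from x to y via breadth-first search;
    returns float('inf') if no path exists."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        v = queue.popleft()
        if v == y:
            return dist[v]
        # Visit every neighbor of v not yet reached.
        for w in {u for edge in E if v in edge for u in edge if u != v}:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return float('inf')
```

Breadth-first search explores nodes in increasing order of distance from x, which is why the first time it reaches y it has found a shortest path.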
Network Structure
"Connected": a word of many meanings
The word "connected" speaks to the most basic structural properties of networks. It is arguably both the most important and the most overused term in the network vocabulary.
Important and proper uses of "connected":
Connected has several distinct formal definitions, each of which is important. We introduce three of them here and elaborate the details later.
 A path connects its origin and destination nodes
 Two nodes are connected if there is at least one path that connects them
 A graph is connected if each of its nodes is connected to all the others
Examples:
 Nodes 6 and 2 are connected in graph G1 below
 Nodes a and b are not connected in graph G2 below
 Graph G1 is connected; graph G2 is not connected
Graph G1 = (V1,E1) 

Graph G2 = (V2,E2) 
Common sloppy uses of "connected":
Sometimes "connected" is used informally when another term would be more clear. For example, avoid using "connected" when one of these unambiguous words would apply:
 Joined
 Adjacent
 Incident
Induced Subgraphs
It is often useful to consider only part of a graph. Induced subgraphs are one particularly convenient way to define a specific subpart of a larger graph.
Given a graph G = (V, E), an induced subgraph is defined by a subset of vertices S (i.e., S ⊆ V). Then
 The nodes of the subgraph induced by S are simply the nodes in S, and
 The edges of the subgraph induced by S are all edges in E that have both endpoints in S.

We can write the induced subgraph formally as G_{S} = (S, E_{S}) , where E_{S} is the set of edges defined above (edges in E that have both endpoints in S).
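The formal definition maps directly to a one-liner in code (a Python sketch; the function name is our own):

```python
# The 6-node example graph.
E = {frozenset(e) for e in
     [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]}

def induced_subgraph(E, S):
    """G_S = (S, E_S): keep exactly the edges with both endpoints in S."""
    S = set(S)
    return S, {e for e in E if e <= S}
```

Calling it with S = {1, 2, 3, 4} on the example graph keeps only the edges {1,2}, {2,3}, and {3,4}, matching the drawing in Example 1 below.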
Example 1: Consider the graph G = (V, E) drawn below.
Original graph 
The subgraph induced by the subset of nodes S = {1, 2, 3, 4} can then be drawn:
Induced subgraph 
Example 2: Let G = (V, E) be the Facebook friends network:
 V = {x: x has a Facebook account}
 E = {{x,y}: x and y are Facebook friends}
This is a big network: if we want to draw it, we must consider a small subgraph.
"Connected" defined formally
Earlier we stated that:
 A path connects its origin and destination nodes
 Two nodes are connected if there is at least one path that connects them
 A graph is connected if each of its nodes is connected to all the others
"Connected nodes" defined formally:
Given an undirected graph G = (V, E), two nodes x and y (i.e., x ∈ V and y ∈ V) are connected if and only if one of the following is true:
 x = y, or
 there is at least one path in G with origin x and destination y
The most important clarification in our formal definition of connected nodes is that any node is by definition connected to itself.
Connected graphs and connected components
A connected graph (as defined above) is said to consist of a single connected component. If a graph is not connected, then it consists of two or more connected components.
The book Six Degrees offers this intuitive metaphor for connected component:
"Suppose the nodes of a graph are buttons and the edges of the graph are threads joining the buttons. The buttons and threads are all on the floor. Grab one button and lift it. If that button has any threads, then other buttons will start to rise in addition to the one button you are holding. Each additional button that rises off the floor may have even more threads that cause even more buttons to rise. Keep lifting until every button connected to the one in your hand (by any combination of threads) is off the floor. The buttons and threads that have risen off the floor are a connected component".
The following examples illustrate the idea of connected component.
Graph G1 is connected and therefore has 1 connected component:
Graph G2 has 2 connected components:
Graph G3 has 3 connected components:
A connected component can be defined as an induced subgraph. For example, G_{3} (above) consists of three connected components:
 the subgraph induced by {1,2,5,6}
 the subgraph induced by {3,4}
 the subgraph induced by {7,8}
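Connected components can be computed by repeatedly "lifting a button": start from any unvisited node and expand until nothing more rises. The sketch below does this in Python; the edge set is a hypothetical one consistent with the three components of G3 listed above (the original figure is not reproduced here):

```python
# A hypothetical edge set giving the three components {1,2,5,6}, {3,4}, {7,8}.
V = {1, 2, 3, 4, 5, 6, 7, 8}
E = {frozenset(e) for e in [(1, 2), (2, 5), (5, 6), (3, 4), (7, 8)]}

def connected_components(V, E):
    """Return the node sets that induce each connected component."""
    unseen, components = set(V), []
    while unseen:
        component = set()
        frontier = {unseen.pop()}        # "grab one button and lift it"
        while frontier:
            component |= frontier
            # Nodes pulled off the floor by edges into the component.
            frontier = {w for edge in E for w in edge
                        if edge & component and w not in component}
        unseen -= component
        components.append(component)
    return components
```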
Hubs
A hub is a node in a graph with much higher degree than average. For example, nodes 1 and 2 are both hubs in the figure below:
Clusters
Informally speaking, a cluster is a group of nodes in a graph that are more connected to each other than they are to the rest of the graph. For example, the red and yellow regions below are clusters:
Defining clusters, part one: connected components
The following sections present a "semiformal" definition of clusters in three parts:
 First, we formally define connected component
 Second, we introduce and formally define clique
 Third, we informally define cluster based on the above two formal definitions
Given an undirected graph G = (V, E), we define a connected component to be a subgraph induced by a node set S ⊆ V, i.e., G_{S} = (S, E_{S}), with the following two properties:
 G_{S} is connected
 There is no edge in E that joins a node in S to a node not in S
The above two mathematical properties of a connected component translate into the button-and-thread metaphor as follows:
G_{S} is connected: The only buttons that rise off the floor do so because of the one button you are lifting and the threads that ultimately connect that button to other buttons (i.e., paths)
There is no edge in E that joins a node in S to a node not in S: You have lifted the one button in your hand high enough so that no more buttons will come off the floor no matter how much higher you lift the button you are holding.
For example:
The graph G_{2} (drawn above and again below) is not a connected component because it violates property #1: it is not connected.
Graph G2 = (V2,E2) 
The subgraph induced by S = {c, d, e, f} (drawn with dark nodes and edges below) is not a connected component because it violates property #2. There are edges in E that join nodes in S to node b, which is a node not in S.
Defining clusters, part two: cliques
In an undirected graph G = (V, E) a clique is a subgraph induced by node set S ⊆ V, i.e., G_{S} = (S, E_{S}) such that the density of G_{S} is 1. Put another way, in a clique, every pair of nodes is adjacent.
Example: Consider undirected graph G = (V, E) drawn below.
The following are cliques:
 The subgraph induced by {c, d, e, f}
 The subgraph induced by {b, f}
The following are not cliques:
 The subgraph induced by {a, b, c}
 The subgraph induced by {b, c, d, e}, which is drawn below. The subgraph is not a clique since its density is less than 1.
The subgraph induced by {c, d, e, f} is the largest clique in G: the clique with the most nodes. Usually the largest clique in a graph is the most interesting clique in that graph.
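A clique test just checks that every pair of nodes in S is adjacent. The sketch below uses a hypothetical edge set consistent with the examples above (the original drawing is not reproduced here; the function name is our own):

```python
from itertools import combinations

# Hypothetical edges: {c,d,e,f} fully connected, plus {b,f}, {b,c}, {a,b}.
E = {frozenset(e) for e in
     [('c', 'd'), ('c', 'e'), ('c', 'f'), ('d', 'e'), ('d', 'f'), ('e', 'f'),
      ('b', 'f'), ('b', 'c'), ('a', 'b')]}

def is_clique(E, S):
    """True if the subgraph induced by S has density 1,
    i.e., every pair of nodes in S is adjacent."""
    return all(frozenset(pair) in E for pair in combinations(S, 2))
```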
Defining clusters, part three
Our definition of cluster informally draws upon our formal definitions of connected components and cliques.
In an undirected graph G = (V, E) a cluster is a subgraph induced by node set S ⊆ V, i.e., G_{S} = (S, E_{S}) with the following two properties:
 The average degree of G_{S} is "relatively high"; (a relaxed adaptation of cliqueness)
 There are "relatively few" edges in E that join a node in S to a node not in S; (a relaxed adaptation of connected componentness)
Example: In the graph G = (V, E) drawn below, the following subsets of nodes induce subgraphs that can fairly be called clusters:
Subsets of nodes:
 {1, 2, 3, 4, 5, 7}
 {20, 21, 22, 23, 24, 25}
 {14, 15, 16}
Corresponding clusters:
Not all connected components are clusters. For example:
 In the graph above, the subgraph induced by {9, 10, 11, 12, 13} is a connected component, but it is arguably not a cluster, because the average degree of that induced subgraph is relatively low.
 The subgraph induced by {6} is a connected component, but it is definitely not a cluster, because the average degree of the induced subgraph is zero.
Not all cliques are clusters – even relatively large cliques. For example:
 The subgraph induced by {1, 2, 3, 4} is a clique, but it is not a cluster because every single one of those nodes is adjacent to node 5; there are too many edges joining nodes in {1, 2, 3, 4} to nodes not in {1, 2, 3, 4}.

Even the largest clique in the above graph – the subgraph induced by {1, 2, 3, 4, 5} – is arguably still not a cluster because node 7 is adjacent to so many nodes in {1, 2, 3, 4, 5}.
Limitations of traditional graph theory
Conceived in 1735 by the prolific Leonhard Euler, graph theory developed over the next 200 years to include topics such as:
 Shortest paths: Given a graph G = (V, E) and a pair of nodes x, y: How can we find a shortest path from x to y?
 Maximum cliques: A clique is a subset of nodes that are all adjacent to each other. Given a graph G = (V, E): How can we find large cliques in G? How can we find the largest clique in G? (Please refer to the network structure page for a more thorough definition of clique.)
 Graph coloring: Given a graph and the ability to assign a "color" to each of its nodes: How can we use the fewest possible number of distinct colors so that (1) every node has a color assigned to it, and (2) no two adjacent nodes are assigned the same color?
In studying these and other questions, graph theorists usually made two important assumptions: (1) The graph is known, and (2) the graph does not change.
When we use graph theory to study the Web, we must acknowledge that these two assumptions are not realistic. For example:
 The Web graph is only partially known. Estimating the cardinality of the set of all Web pages is enough to stretch our powers of creative arithmetic; what if we actually had to name all 30-70 billion of them? Our knowledge of the set of all hyperlinks is rougher still.
 The Web graph is constantly changing. Even if all Web builders were to cease and desist immediately, the Web pages they have already created include many that automatically change their outgoing links each time a Web user visits. For example, Google, Facebook, Amazon, etc. all present different links to different users at different times.
Introduction to network dynamics
In the 1940s and 50s, graph theory expanded in important ways to strengthen our understanding of network dynamics. Two notable milestones were:
 Studying dynamic flows: commodities that move over predetermined graphs. During World War II, a great deal of attention was paid to questions such as: "How do we get supplies to our soldiers despite our enemies' attempts to stop us?" Here nodes represent sources, destinations, and transfer stations for commodities such as food, fuel, ammunition, and medicine. Edges represent roads and other means of transport between pairs of nodes.
 Studying dynamic graphs: sets of nodes and edges that change over time. Just after World War II, for reasons much more theoretical than supplying troops, mathematicians formalized the study of graphs whose very nodes and edges change over time. For example, if a graph started with 1,000 nodes and no edges, and then edges were added randomly one at a time to join those 1,000 nodes, what could we say about the evolving properties of that dynamic graph?
Both types of network dynamics are important to understanding the Web. For example:
 Information flows through the Web via the Internet. Typically, Web hosts provide information, Web browsers request and receive information, and Internet routers are transfer stations. Internet connections such as cables and wifi correspond to edges.
 The Web is an ever-changing collection of pages and links. So many Web pages and hyperlinks are being added and edited at this very moment that any attempt to catalog them would be outdated before it could be completed. Pages are also removed from the Web (either temporarily by service disruptions or permanently by decommissioning), resulting in "dead links" and other complications that frustrate our hope to know the Web.
Our discussion here will focus on the second type of network dynamics: What are the evolving properties of the Web as millions of people continually edit its pages and hyperlinks? This is a core question in the realm of Web Science.
The first type of network dynamics, how information flows through the Web, is no less important than our chosen focus. We defer its discussion for two reasons: (1) The question of how information can best move right now from a Web host to a Web browser, known as packet routing, has been well-studied in the realm of traditional Computer Science, where we can use it as a "black box"; and (2) The question of how information spreads over time through populations of Web users (e.g., information diffusion) is more advanced than the material presented here.
Also, in focusing on the ever-changing Web, we will ignore dynamic pages such as Google, Facebook, and Amazon, which present different links each time a user visits. We will focus on static pages that millions of regular Web builders make every day with HTML and CSS. What do we learn when we view each of those pages not as an isolated artifact but as a crossroads in a network of millions of Web users and builders?
Random Graphs
Chapter 2 of Six Degrees starts with an excellent introduction to random graphs. Have that handy before you continue.
Random graphs (and their dynamics) are based on the following assumptions:
 The graph we start with has many nodes but no edges.
 Edges are added randomly one at a time to the graph. (I.e., any pair of nonadjacent nodes is equally likely to be the next edge added to the existing graph.)
 Eventually, we end with a clique: every pair of nodes is adjacent.
Given the above framework, how does a graph evolve between the starting and ending points? That is a core question of random graph dynamics.
One of the most famous results in random graph dynamics describes the size of the largest connected component, also called the giant component. Read in Six Degrees the explanation of Figure 2.2 on p. 45. A similar figure labeled with our own terminology is below:
After you absorb Six Degrees, then consider what this means:
 By eyeballing the above figure, we can estimate that a random graph with average degree of 2 has a giant component with 75% of all nodes.
 Whether |V| is big or small doesn't matter at all: To connect a set of nodes into a giant component, just randomly add |V| edges and you have guaranteed that the average degree is 2, thus creating a giant component with roughly 75% of all nodes.
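This claim is easy to check by simulation. The sketch below (function name and union-find bookkeeping are our own) builds a random graph with a given number of nodes and edges and reports what fraction of nodes land in the largest connected component:

```python
import random
from collections import Counter

def giant_component_fraction(n, m, seed=None):
    """Build a random graph on n nodes by adding m distinct random edges,
    then return the fraction of nodes in the largest component."""
    rng = random.Random(seed)
    parent = list(range(n))  # union-find structure tracking components

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    edges = set()
    while len(edges) < m:
        x, y = rng.randrange(n), rng.randrange(n)
        if x != y and frozenset((x, y)) not in edges:
            edges.add(frozenset((x, y)))
            parent[find(x)] = find(y)   # merge the two components

    sizes = Counter(find(v) for v in range(n))
    return max(sizes.values()) / n
```

With n = 1000 and m = 1000 edges (average degree 2), the largest component typically holds roughly three-quarters of all nodes; with far fewer edges (average degree well below 1), only a tiny fraction.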
Example: Suppose all links are deleted from the Web. Only the 30-70 billion linkless pages remain. If each author of each Web page added just one random link to each of his pages, then roughly 75% of all Web pages would be connected into a giant component by undirected paths (which allow both forward and backward traversal of hyperlinks).
Random graph algorithm
An algorithm is a set of instructions that explain how to compute something. It is like a recipe, only for making data instead of food.
An algorithm has three parts:
 Input (like the ingredients of a recipe)
 Output (like the title of a recipe: "chile con carne, serves 12")
 The steps of the algorithm (like the steps of the recipe)
The following algorithm describes the creation of a random graph.
Random Graph Algorithm:
Input:
 Node set V, with |V| > 1
Output:
 Graph G = (V, E), with E = {{x, y} : x ∈ V and y ∈ V and x ≠ y} (i.e., a clique)
Steps:
1. Define E = { } and G = (V, E)
2. Choose a random pair of nonadjacent nodes x, y
3. Add element {x, y} to set E
4. If |E| = |V| * (|V| - 1)/2 then we are done
5. Go to step 2
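The steps above can be sketched in Python. Rather than sampling a nonadjacent pair on every iteration, this version shuffles all possible edges once, which produces the same uniform-random evolution (the generator yields the edge set after each step so the journey, not just the destination, is visible; the function name is our own):

```python
import random
from itertools import combinations

def random_graph_evolution(V, seed=None):
    """Run the random graph algorithm: starting from E = {}, add one
    random edge per step until G is a clique, yielding E after each step."""
    rng = random.Random(seed)
    remaining = [frozenset(p) for p in combinations(sorted(V), 2)]
    rng.shuffle(remaining)  # equivalent to picking a random nonadjacent pair each step
    E = set()
    for edge in remaining:
        E.add(edge)
        yield set(E)
```

The run ends after exactly |V| * (|V| - 1)/2 steps, when every pair of nodes is adjacent.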
Note: The main point of most algorithms is generating output, just as the main reason for most recipes is serving food. For example, a sorting algorithm is useful as a way to alphabetize names, and Google's PageRank algorithm is useful as a way to list the most popular and influential Web pages first.
Our random graph algorithm, however, does not generate interesting output. It creates a clique using a more complicated sequence of steps than is necessary to create a clique. The point of the algorithm is the journey it describes, not the destination where it eventually arrives.
Clusters and homophily
Informally speaking, a cluster is a group of nodes in a graph that are more connected to each other than they are to the rest of the graph. For example, the red and yellow regions below are clusters:
Clusters provide a good big-picture view of the Web, much like a table of contents provides a big-picture view of a book. Unlike a book and its table of contents, however, a cluster involves many authors (Web builders) acting independently. Also, a Web builder may be unaware of the clusters he is forming.
The sociological force behind Web clusters is homophily, or "birds of a feather" (which "flock together"). Here is one important way that homophily impacts the Web:
 When a Web builder links from his site to another site, he usually does so because he perceives something important shared by his site and the other site.
 Therefore hyperlinks represent more than channels of navigation; they also represent shared likes and dislikes.
 The navigational meaning of a hyperlink is clear and explicit, but the shared likes and dislikes represented by a hyperlink are often unstated and implicit.
 Therefore clusters of mutually interlinked Web pages represent selforganized groups with specific shared likes and dislikes, which are usually unstated and implicit.
Motivated by the above, cluster-based search engines find interlinked groups of Web pages and then deduce the specific "meaning" of each cluster by analyzing the content of the interlinked pages. An example of a cluster-based search engine is iBoogie.
Triadic closure
Triadic closure is a simple and effective mathematical model of homophily in a network. The basic idea is illustrated below:
Nodes 2 and 3 share a mutual neighbor but are not adjacent. 
Adding edge {2, 3} is then an example of triadic closure. 
Triadic closure usually occurs in the context of a larger graph. For example, in the graph below–considering only the solid black lines as edges–there are many possible opportunities for triadic closure. Just a few of those possible opportunities are illustrated with dotted red lines.
Triadic closure algorithm
Chapter 2 of Six Degrees pp. 56-61 has a good introduction to homophily and triadic closure. Read that before you continue. The following algorithm describes the process of repeated triadic closure:
Triadic Closure Algorithm:
Input:
An undirected graph G = (V, E) with |V| > 0 and |E| > 0
Output:
See below…
Steps:
1. Look for two nodes x, y that are nonadjacent and share a common neighbor
2. If no such pair of nodes x, y exists, then we are done
3. Add element {x, y} to set E
4. Go to step 1
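A Python sketch of the algorithm (the function name is our own; the input edge set is left unmodified):

```python
def close_all_triads(V, E):
    """Repeat triadic closure until no two nonadjacent nodes share a
    common neighbor; returns the resulting edge set."""
    E = set(E)

    def neighbors(v):
        return {w for edge in E if v in edge for w in edge if w != v}

    changed = True
    while changed:
        changed = False
        for x in V:
            for y in V:
                # Nonadjacent pair with a common neighbor: close the triad.
                if x < y and frozenset((x, y)) not in E and neighbors(x) & neighbors(y):
                    E.add(frozenset((x, y)))
                    changed = True
    return E
```

On a path graph 1-2-3-4 plus a separate edge {5, 6}, the algorithm turns the first component into a 4-node clique while leaving the 2-node component untouched, illustrating the two output properties described below.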
The output of the triadic closure algorithm has some notable properties. We can deduce these properties from the following observations:
Consider the closing of a single triad: nodes x and y are nonadjacent and share a common neighbor; then x and y are joined by edge {x, y}.
 Adding edge {x, y} does not change the number of connected components in G: nodes x and y were already connected before we joined them with {x, y}.
 Adding edge {x, y} increases the density of the connected component that contains nodes x and y.
Repeating the above observations over the course of every iteration of the triadic closure algorithm, we see that the output graph G' = (V, E') computed by the algorithm has these two properties:
 Output graph G' has exactly the same number of connected components as input graph G. Furthermore, the nodes that induce each connected component are the same in G' and G.
 Each connected component of G' has the maximum possible density: it is a clique.
Example: Suppose the triadic closure algorithm starts with graph G = (V, E) drawn below:
Upon completion, the algorithm will have added all the orange edges drawn below:
The output graph has the same number of connected components as the input graph (with the same nodes in those components). And each connected component in the output graph is a clique.
An algorithm that identifies and closes triads is an example of random bias. A network resulting from such an algorithm is an example of a randomly-biased network (graph). Intuitively, a randomly-biased network is a network that obeys some property but is otherwise random. In the triadic closure example, the bias is towards closing a triad (adding an edge) when two nodes already have a common neighbor; the two nodes with the common neighbor are, however, selected at random.
Hubs and cumulative advantage
A hub is a node in a graph with a much higher degree than average. For example, nodes 1 and 2 are both hubs in the figure below:
Centrality is the mathematical term most commonly used to describe hubs. Degree and indegree are two common simple measures of centrality.
Google made centrality-based Web search famous, and centrality-based search engines now dominate the Web. If cluster-based search engines reveal something like a Web "table of contents", then centrality-based search engines provide quick shortcuts to "the good parts": the most popular pages and pictures.
The sociological force behind Web hubs is cumulative advantage, or "rich get richer". Here is one important way that cumulative advantage impacts the Web:
 When a Web builder is looking for online resources, he probably uses a centrality-based search engine like Google, and he is more likely to look at higher-ranked pages.
 Pages atop centrality-based rankings are there because they are already popular (essentially because they have high indegree).
 Therefore Web builders are more likely to link to (and increase the indegree of) pages that are already popular. It is possible that Web builders will never even see the "good" pages that are not popular.
 Therefore Google and its ilk, to some extent, drive a feedback loop that amplifies the popularity of whatever is already popular. In other words, the rank of a Web page next week has less to do with its content and more to do with its rank this week. The main change to be expected in next week's rankings is that disparities between high rankings and low rankings will increase. The rich get richer.
Preferential attachment algorithm
Chapter 4 of Six Degrees pp. 108-109 has a good introduction to cumulative advantage (also known as "rich get richer" and the "Matthew effect", among other names) and its effect on network dynamics. Read that before you continue.
Preferential attachment is the mathematical model used to represent the force of cumulative advantage in network dynamics. The algorithm below describes the process of repeated preferential attachment. It is equivalent to the algorithm informally described on pp. 108-109 of Six Degrees:
Preferential Attachment Algorithm:
Input:
 An undirected graph G = (V, E) such that V = {1, 2, 3, …, k}, k > 1, and |E| > 0
 Integer n, with n > k. Variable n represents the number of nodes in the output graph
Output:
See example below…
Steps:
1. i = k + 1
2. Choose an existing node x with probability proportional to its degree: if deg(x) = 2 * deg(y), then x is twice as likely to be picked as y
3. Add element i to set V
4. Add element {i, x} to set E
5. Add 1 to the value of i
6. If i > n then we are done
7. Go to step 2
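A Python sketch of the algorithm, specialized to the starting graph V = {1, 2}, E = {{1, 2}} used in the example below (the function name and the endpoint-list trick are our own). Keeping a list in which every edge contributes both of its endpoints means a uniform draw from that list picks a node with probability proportional to its degree:

```python
import random

def preferential_attachment(n, seed=None):
    """Grow a graph from V = {1, 2}, E = {{1, 2}} to n nodes; each new
    node joins one existing node chosen proportionally to degree."""
    rng = random.Random(seed)
    # Every edge contributes its two endpoints to this list, so a uniform
    # draw from it picks a node with probability proportional to its degree.
    endpoints = [1, 2]
    E = {frozenset((1, 2))}
    for i in range(3, n + 1):
        x = rng.choice(endpoints)       # step 2: degree-biased choice
        E.add(frozenset((i, x)))        # steps 3-4: join new node i to x
        endpoints += [i, x]
    return E
```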
Example: Suppose the preferential attachment algorithm starts with V = {1, 2}, E = {{1, 2}}, and n = 20. Therefore k = 2 and the input graph can be drawn as:
The sequence below illustrates the first few iterations of the algorithm.
Iteration 1
Create node 3 and join it to a preexisting node. Node 1 and node 2 are equally likely.
Suppose the algorithm joins node 3 to node 1. Then…
Iteration 2
Create node 4 and join it to a preexisting node. Node 1 is twice as likely as node 2 or node 3.
Suppose the algorithm joins node 4 to node 1. Then…
Iteration 3
Create node 5 and join it to a preexisting node. Node 1 is 3 times as likely as node 2, 3 or 4.
Iteration 4
Create node 6 and join it to a preexisting node.
Iterations continue...
In the very last iteration, node 20 is joined to node 2
Output graph:
Notice in the above example that nodes 1 and 2 start out as equals. However, the random choice of node 1 in the first iteration gives node 1 a nudge up in popularity that feeds on itself. In the end, node 1 is the single dominant hub and node 2 is a second-tier mini-hub.
A seemingly insignificant event in iteration 1 has caused a very substantial effect because of cumulative advantage.
Source: Flavio Esposito, https://cs.slu.edu/~esposito/teaching/1080/webscience/graphs.html,
https://cs.slu.edu/~esposito/teaching/1080/webscience/dynamics.html,
https://cs.slu.edu/~esposito/teaching/1080/webscience/structure.html
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.