Friday, November 7, 2014

Kosaraju's algorithm explained in simple terms

Why does it work?

Understanding why an algorithm works technically comes under the argument of correctness. It is a very useful exercise if you want to become a great programmer. This post is not a description of the algorithm. Go to wikipedia or something for that. You must be familiar with the algorithm before reading this. This post is just an assistance in convincing you of its correctness.

The algorithm that we are going to understand in this post, was invented by this guy. Alright, lets cut to the chase.



Prerequisites and definitions

Suppose you have a directed graph G, you know that if we do a DFS search on G, we end up with a forest of trees. You must also know the following kind of edges that are in such DFS trees (the numbers in the nodes are not the order we are talking about here):




In a DFS forest of G, cross edges can be between trees too.


The core logic

This algorithm creates and uses an artificial order on the vertices of the graph G:
This order comes from DFS exploration of the graph, resulting in a set of trees (called a forest). This is the order in which a DFS call on a node returns.

In the code of Kosaraju's algorithm, this is implemented by maintaining a global int counter c, and modifying the DFS on the graph G as follows: In the DFS code on a vertex V, just after the DFS on all the unvisited neighbors of V have returned, increment c and assign c to the order of the current node.

This ordering has a very useful property:

A cross edge is never from a lower ordered node to a higher ordered node. 
 However, there can be a back edge from a lower ordered node to a higher ordered node, or a forward edge from a higher ordered node to a lower ordered node.


Why?: This should be apparent from property of DFS. It is easy to see. Suppose that there was a cross edge from a lower ordered node V1 to a higher ordered node V2. This means, during the DFS on V1, V2 must have been visited (unless V2 is an ancestor of V1, in which case it would be a back edge), and DFS on V2 must have been called, and that call must have returned before V1's did. This means, V2's order must be smaller than V1's. A contradiction.

Note that we are asserting this only for cross edges. However, there can still be forward and backward edges from lower ordered nodes to higher ordered nodes and vice-avers.

One more thing: given a vertex V in graph G, suppose we want to list all the vertices in G, from which we can reach V. Simple. Just reverse all the edges in G, and do a DFS starting from V.


The Algorithm

We start by marking all the vertices in the graph as "not marked".

Then we take the highest ordered "non marked" node V among all the nodes, and find those "non marked" vertices in G, from which we can reach V (using DFS on the reversed graph of G). Let the set of these vertices be SV. We mark all the vertices in SV as "marked". SV is one of strongly connected components of G.

We repeat the above paragraph until all the vertices are "marked". That's it. We have listed ALL the strongly connected components of G.


Why did that work just now?

All the magic is in the ordering. When a vertex W's order is smaller than a vertex V, there are 2 cases:
  1. V can't be reached from W. Not interested in W in this case, as it won't be in the strongly connected component containing V.
  2. V can be reached from W:
    In this case, W has to be in the subtree rooted at V. If not, then V can be reached from W, and W is not in the subtree rooted at V. This means W's order can not be smaller than V's (Why? because firstly, W can't be ancestor of V as W's order is lower than V's, and secondly, A cross edge is always from a higher ordered node to a lower ordered node). Contradiction.


    In simpler terms
    , if W wasn't in the subtree rooted at V, a call to DFS at W would definitely have reached V, and the DFS at V would have returned before DFS at W would. This means, W's order would have become higher than V's
Since we always pick the current highest ordered vertex from the graph, all the "not marked" vertices other than V in the graph are W. Suppose that currently, V has the highest order among all the "not marked" vertices, and SV be the set of all the "not marked" vertices from which we can reach V. Then, in the DFS tree containing V, W has to be in the subtree rooted at V (as argued earlier). Thus W can be reached from V (through the tree edges ofcourse).

Hence we conclude, for every vertex W in SV, V can reach W and W can reach V. Sound familiar? Thats right. This means that SV is the set of vertices in the strongly connected component containing V.


I hope you have a clearer understanding of Kosaraju's algorithm now. Please like/share and all that if you do. If you have any questions, please comment below. I will resolve them.