Couple of quick points:

1. Bayesian networks don't necessarily encode causality: they encode conditional (in)dependence relations, and the arrows need not point from cause to effect — a network with some edges reversed can represent exactly the same distribution.

2. It's possible to come up with networks which aren't acyclic: the factor graphs used in, e.g., turbo coding contain cycles. There's extensive theory on loopy belief propagation, which attempts to formulate a good set of criteria for when belief propagation will (a) converge and (b) converge to the right solution in a graph with cycles.
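To make the "loopy" part concrete, here's a minimal sketch of sum-product message passing on the smallest cyclic graph: a triangle of binary variables. The potentials, variable names, and convergence threshold are all made up for illustration; on a single loop with these weak attractive couplings the messages happen to converge, which is not guaranteed for general cyclic graphs.

```python
# Pairwise MRF on a 3-cycle of binary variables:
#   p(x) ∝ φ0(x0) ψ(x0,x1) ψ(x1,x2) ψ(x2,x0)
PSI = [[2.0, 1.0], [1.0, 2.0]]               # attractive pairwise potential on every edge
PHI = [[1.0, 2.0], [1.0, 1.0], [1.0, 1.0]]   # unary potentials; node 0 prefers state 1
EDGES = [(0, 1), (1, 2), (2, 0)]

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Directed messages m[(i, j)](x_j), initialised uniform.
msgs = {(i, j): [0.5, 0.5] for a, b in EDGES for (i, j) in ((a, b), (b, a))}

delta = 1.0
while delta > 1e-9:  # iterate until messages stop changing (no guarantee in general)
    new, delta = {}, 0.0
    for (i, j) in msgs:
        out = []
        for xj in (0, 1):
            total = 0.0
            for xi in (0, 1):
                # φ_i(x_i) ψ(x_i, x_j) times all incoming messages to i except from j
                prod = PHI[i][xi] * PSI[xi][xj]
                for (k, l) in msgs:
                    if l == i and k != j:
                        prod *= msgs[(k, l)][xi]
                total += prod
            out.append(total)
        out = normalize(out)
        delta = max(delta, max(abs(a - b) for a, b in zip(out, msgs[(i, j)])))
        new[(i, j)] = out
    msgs = new

def belief(i):
    # Approximate marginal: unary potential times all incoming messages.
    b = [PHI[i][x] for x in (0, 1)]
    for (k, l) in msgs:
        if l == i:
            b = [b[x] * msgs[(k, l)][x] for x in (0, 1)]
    return normalize(b)
```

The attractive couplings propagate node 0's preference for state 1 around the loop, so every node's belief ends up tilted toward 1; on graphs with loops these beliefs are only approximations to the true marginals.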

3. It's probably already known, but the Chow-Liu tree is essentially the optimal tree approximation to a given probability distribution, and it needs only second-order (pairwise) statistics. It turns out that greedily building a maximum-weight spanning tree, using the pairwise mutual information as the edge weight, is optimal among tree approximations. Since mutual information is closely related to entropy, this may be of interest in that connection.
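The Chow-Liu construction is short enough to sketch directly: estimate pairwise mutual information from samples, then run a greedy maximum-weight spanning tree (Kruskal with union-find). This is a toy sketch assuming discrete variables and empirical plug-in estimates; the function names and the demo chain X0 → X1 → X2 are mine.

```python
import math
import random
from itertools import combinations

def mutual_information(samples, i, j):
    # Empirical mutual information (in nats) between discrete variables i and j.
    n = len(samples)
    pij, pi, pj = {}, {}, {}
    for s in samples:
        pij[(s[i], s[j])] = pij.get((s[i], s[j]), 0) + 1
        pi[s[i]] = pi.get(s[i], 0) + 1
        pj[s[j]] = pj.get(s[j], 0) + 1
    return sum((c / n) * math.log(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def chow_liu_tree(samples, n_vars):
    # Maximum-weight spanning tree with MI edge weights (greedy Kruskal).
    edges = sorted(((mutual_information(samples, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))
    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                  # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Demo: a Markov chain X0 -> X1 -> X2 where each step copies with prob 0.9.
random.seed(0)
flip = lambda x: x if random.random() < 0.9 else 1 - x
samples = []
for _ in range(2000):
    x0 = random.randint(0, 1)
    x1 = flip(x0)
    x2 = flip(x1)
    samples.append((x0, x1, x2))
edges = chow_liu_tree(samples, 3)
```

By the data-processing inequality the endpoint pair (0, 2) has the lowest mutual information, so the recovered tree is the chain's own skeleton: edges (0, 1) and (1, 2).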