
Andrew Eckford is starting up the Banff meeting on information theory and biology. An overview:

Information theory is everywhere in biology.

Evolution is a process of optimization.

Therefore:

- Evolution should have discovered the Shannon limits.

What does this mean, if true?

Information theory should help us understand biological processes. E.g., any transducer should be operating near its channel capacity.

Biological communication should have good information-theoretic properties that human technology can exploit.

But is it true? There's evidence on both sides.

Yes: neuroscience on the information-processing properties of neurons, and "infotaxis": how mobile organisms navigate a chemical gradient in an information-theoretically efficient manner.

No: information-optimal strategies are not always utility-optimal. Strategies for eating are different from strategies for learning!

So, information theory is only as good as the model to which it's applied. Organisms learn *and* act.

## Comments

There's been a long history of work in this field: Yockey (1950s), Barlow (60s), Berger (70s). However, the work has long been outside the mainstream of information theory (IT). A recent surge of interest in "systems biology" is changing this.

The problem may go back to Claude Shannon's editorial [The Bandwagon](http://dsp.rice.edu/sites/dsp.rice.edu/files/shannon-bandwagon.pdf), in which he tried to pour cold water on the fad for information theory. However, he merely said he wanted information theory to be mathematically rigorous. Now biology is in a position to be mathematically rigorous.

We'll have talks on:

- modelling at the level of individual neurons
- modelling at the level of systems: sensory systems
- calculating channel capacity for some specific systems, e.g. cell-cell communications
- computing Shannon capacity can be hard - for example, it may require computing the permanent of a matrix, which is a lot harder than the determinant
- information theory in evolution (me), information theory and the origin of life, information theory in biodiversity
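To see why the permanent is so much harder than the determinant: the determinant of an $n \times n$ matrix takes $O(n^3)$ time, while the fastest known exact methods for the permanent, such as Ryser's formula, are exponential. A minimal sketch in Python (the matrix here is just an arbitrary example):

```python
from itertools import combinations
import numpy as np

def permanent(A):
    # Ryser's formula: O(2^n * n) work - exponential, unlike the determinant.
    n = A.shape[0]
    total = 0.0
    for r in range(1, n + 1):
        for cols in combinations(range(n), r):
            row_sums = A[:, cols].sum(axis=1)
            total += (-1) ** r * np.prod(row_sums)
    return (-1) ** n * total

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(permanent(A))        # 1*4 + 2*3 = 10
print(np.linalg.det(A))    # 1*4 - 2*3 = -2
```

Note the two quantities differ only in the signs of the terms, yet no polynomial-time algorithm for the permanent is known (computing it is #P-hard).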

Peter Thomas' talk.

What does channel capacity have to do with biology? Bell and Sejnowski 1997 looked at ensembles of visual scenes in a paper "The independent components of natural scenes are edge filters", and used information theory to find optimal visual receptors. Smith and Lewicki (2006 _Nature_) did a similar analysis for auditory receptors. Schneider 2010 used a similar analysis to understand nucleic acids (gene encodings).

Vergassola _et al_ (2007 _Nature_) did a similar analysis for chemotaxis. Do small organisms move along the gradient of a chemical signal? No, often they move in such a way as to maximize information gained!

But to do these analyses rigorously, we need to be able to measure channel capacities of biological systems, _cell by cell_. Levchenko et al. (2011 _Science_) started doing this:

- Raymond Cheong, Alex Rhee, Chiaochun Joanne Wang, Ilya Nemenman and Andre Levchenko, [Information transduction capacity of noisy biochemical signaling networks](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3895446/).

Three stages of signalling:

1. Secretion of a signalling molecule.
2. Diffusion from sender to receiver.
3. Ligand binding of the signalling molecule to a receptor protein.

All three stages involve noise, which must be taken into account when computing channel capacities.

Binding: see the finite sets of states and random transitions between these in Thomas' paper on the "calcium-calmodulin binding graph". The chemical master equation is involved here!

They applied a theorem due to Chen and Berger to compute the channel capacity of a two-state discrete-time Markov model - a simplified model of this sort:

- Andrew W. Eckford and Peter J. Thomas, [Capacity of a simple intercellular signal transduction channel](http://arxiv.org/abs/1305.2245).
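The Chen-Berger result gives a capacity formula for their specific Markov channel, but for a general discrete memoryless channel the standard numerical tool is the Blahut-Arimoto algorithm. A rough sketch (the binary symmetric channel below is just an illustration, not the Eckford-Thomas model):

```python
import numpy as np

def blahut_arimoto(W, iters=1000):
    """Capacity in bits of a discrete memoryless channel with W[x, y] = p(y|x)."""
    n_in = W.shape[0]
    p = np.full(n_in, 1.0 / n_in)              # start from the uniform input distribution
    for _ in range(iters):
        q = p @ W                              # induced output distribution
        with np.errstate(divide='ignore', invalid='ignore'):
            # d[x] = D( W[x, :] || q ), the relative entropy for each input letter
            d = np.nansum(W * np.log2(W / q), axis=1)
        p = p * np.exp2(d)                     # multiplicative Blahut-Arimoto update
        p /= p.sum()
    return float(np.dot(p, d))

# Binary symmetric channel, crossover 0.1: capacity = 1 - H(0.1), about 0.531 bits.
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])
print(blahut_arimoto(W))
```

At convergence the relative entropies `d[x]` are equal for every input letter used with positive probability, and that common value is the capacity.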

Information processing needs to be thought of as part of the bigger "perception-action loop". The statistical structure of the world determines the percept; the creature then determines an action using some strategy, and this changes the statistical structure of the world.

So, we've left the world of "directed acyclic graphs", because our causal Bayesian network contains _loops!!!_

- Edward K. Agarwala, Hillel J. Chiel and Peter J. Thomas, Pursuit of food versus pursuit of information in Markov chain models of a perception-action loop, _Journal of Theoretical Biology_.

> **Abstract.** Efficient coding, redundancy reduction, and other information theoretic optimization principles have successfully explained the organization of many biological phenomena, from the physiology of sensory receptive fields to the variability of certain DNA sequence ensembles. Here we examine the hypothesis that behavioral strategies that are optimal for survival must necessarily involve efficient information processing, and ask whether there can be circumstances in which deliberately sacrificing some information can lead to higher utility. To this end, we present an analytically tractable model for a particular instance of a perception-action loop: a creature searching for a randomly moving food source confined to a 1D ring world. The model incorporates the statistical structure of the creature's world, the effects of the creature's actions on that structure, and the creature's strategic decision process. The underlying model takes the form of a Markov process on an infinite dimensional state space. To analyze it we construct an exact coarse graining that reduces the model to a Markov process on a finite number of "information states". This mathematical technique allows us to make quantitative comparisons between the performance of an information-theoretically optimal strategy with other candidate search strategies on a food gathering task. We find that:
>
> 1. Information-optimal search does not necessarily optimize utility (expected food gain).
> 2. The rank ordering of search strategies by information performance does not predict their ordering by expected food obtained.
> 3. The relative advantage of different strategies depends on the statistical structure of the environment, in particular the variability of motion of the source.
>
> We conclude that there is no simple relationship between information and utility. Even in the absence of information processing costs or bandwidth constraints, behavioral optimality does not imply information efficiency, nor is there a simple tradeoff between the two objectives of gaining information about a food source versus obtaining the food itself. For a wide range of values of the food source's movement parameter, the strategy of collecting the most information possible about the unknown source location carries an ineliminable structural cost, leading to a situation in which a foraging creature could actually choose to be less well-informed while simultaneously being, on average, better fed.

William Bialek argues that many biological processes operate at the theoretical limit of information utilization already. He also argues that this drives them to operate at phase transitions or critical points of their respective parameter spaces. He gives several concrete examples with calculations to back up the claim.

Good introductions to this work are:

- a very nice [talk](http://t.co/HcabJvC1tV)
- [a lecture series](http://www.youtube.com/playlist?list=PLoxv42WBtfCAY8icy7uChz_kpBXpWoMwk)
- a survey paper: Thierry Mora and William Bialek, [Are biological systems poised at criticality?](http://arxiv.org/abs/1012.2242)

There's a forum thread referencing Bialek [here](http://forum.azimuthproject.org/discussion/1164/national-institute-for-mathematical-and-biological-synthesis/?Focus=8563#Comment_8563). John offered links to copies of a couple of interesting-sounding PDFs you might like.

Hello John

In all that was mentioned above, the most important concept is missing: that of **compression**, which is found in living organisms and their genome. Please review this for the ants (p. 12):

- [Ants and Kolmogorov Complexity](http://www.reznikova.net/AntsBits.pdf)

Compression of information allows for large network processing - I am sure the concept requires no further suggestion to you.

Kolmogorov complexity is in full mathematical gear as invented by Kolmogorov (may he rest in peace) and L.A. Levin. Therefore, for your purposes, the mathematical and computational framework is there already, 100% done!

The same concept is found in genome compression:

- [Evolutionary Principles of Genomic Compression](http://www.santafe.edu/media/workingpapers/02-05-021.pdf)

In particular, it is this genomic compression that allows for the economic energetics of bird flight:

- [The Smallest Avian Genomes are Found in Hummingbirds](http://msb.unm.edu/birds/publications_files/Gregory_et_al_PRSB_Hummingbird_Genome_Size.pdf)

John, compression has no mathematical closed form and is difficult to deal with mathematically the traditional way, but it is easy in code to compress any data. In fact any **serialized** object could be compressed; then replace the compression with its Kolmogorov complexity counterpart, and you get yourself a mathematical computational framework to explain some nifty awesome stuff in the universe.

Dara

Thanks for your comments, everyone! I'll have to look at these references when I get time. I'd invited William Bialek to my own workshop on [information and entropy in biological systems](http://johncarlosbaez.wordpress.com/2014/07/04/entropy-and-information-in-biological-systems-part-2/). Unfortunately, when we had to move the date due to a football game in the town where the workshop is being held, he said he could no longer attend.

FWIW Possible typo at #4 if it's the NN guy it's s/sejnoski/sejnowski/.


Toby Berger, inventor of "rate distortion theory", is talking now on "Neuroscience applications of generalized inverse Gaussian distributions".

While he's warming up: [Generalized inverse Gaussian distributions](https://en.wikipedia.org/wiki/Generalized_inverse_Gaussian_distribution), or **GIG distributions** for short, are a 3-parameter family of probability distributions on the positive real line. "The GIG distribution is conjugate to the normal distribution when serving as the mixing distribution in a normal variance-mean mixture" - whatever that means.
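For the curious, the GIG family is available in SciPy as `scipy.stats.geninvgauss`; SciPy parameterizes it by shape parameters `p` and `b` (the values below are arbitrary, just to poke at the distribution):

```python
import numpy as np
from scipy.stats import geninvgauss

# Draw samples from a GIG distribution with arbitrary shape parameters.
samples = geninvgauss.rvs(p=0.5, b=1.0, size=10_000, random_state=0)

print(samples.min() > 0.0)   # support is the positive half-line
print(samples.mean())
```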

**The generalized Shannon-Blackwell billiard ball channel.** Each step you put either a white or black billiard ball into a huge box, and your friend randomly picks one and takes it out. What is the channel capacity? **Nobody knows!** You can get at least $1/2$ a bit per use; in fact, the capacity is known to be at least $0.52$.

This is similar to one cell communicating with another by putting out different kinds of molecules into a liquid, with the other cell randomly picking out these molecules. Of course there can be more than 2 kinds of molecules: there's a set of species $\{0, \dots, M-1\}$.

Suppose the number of molecules in the solution is huge, like $b = 10^9$. Suppose $M = 2$. Consider testing the hypotheses: $H_0$: the next $N$ inputs are all $0$'s; $H_1$: the next $N$ inputs are all $1$'s.

Find the smallest $N$ such that the next $N$ outputs discriminate reliably between $H_0$ and $H_1$. Answer: $const \cdot p^{1/3} b^{2/3}$. (What the heck is $p$?)

This is much smaller than you might think! Since $b^{2/3} = 10^6$, you only need to change the concentration of 0's by one in ten thousand to reliably distinguish between these hypotheses.
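A quick way to get a feel for this channel's memory is to simulate it. This is just a Monte Carlo sketch of the channel dynamics, not a capacity computation:

```python
import random

def run_channel(inputs, b=1000, seed=0):
    """Each use: the sender drops in a ball (0 = white, 1 = black),
    then the receiver removes a uniformly random ball from the box."""
    rng = random.Random(seed)
    box = [0] * b                  # start with an all-white box
    outputs = []
    for x in inputs:
        box.append(x)
        outputs.append(box.pop(rng.randrange(len(box))))
    return outputs

# Feed in a long run of black balls: the outputs only slowly drift toward black,
# since each output is drawn from the whole well-mixed box.
out = run_channel([1] * 5000)
print(sum(out[:100]) / 100, sum(out[-100:]) / 100)
```

The long memory (the box holds $b$ balls) is exactly what makes the capacity hard to compute.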

The "classical" view of cortical processing was:

1. Information gets compressed as it travels up the hierarchy.
2. They ignored the flow of information back _down_, e.g. from the cortex toward the retina.

Both of these are wrong. Indeed, even the briefest of stimuli sets off big chains of neural firings going up and down the pathways.

Information is not being destroyed as it goes up the hierarchy. It gets rearranged, and about halfway up, information from different sensory systems (visual, auditory) gets mixed. Top-down signals largely (not exclusively) _suppress_ certain cells in the levels below.

The idea: sensation is focused on trying to confirm or disconfirm hypotheses about reality. Signals irrelevant to this decision process get suppressed.

The goal of sensation is _not_ to accurately measure some quantity in the outside world. The error criterion you're trying to minimize is _not_ the error between reality and perception. The error that matters is the difference between the sensation you _think_ your action will produce, and the sensation it _does_ produce.

A simple model:

Let $a,b$ be the times of occurrence of two successive spikes generated by a neuron.

Goal: maximize the number of bits of information that knowledge of $t = b - a$ conveys to the neuron's targets, per joule of energy that the neuron expends to process its input during the interspike interval (ISI), generate the spike at $b$, and propagate that spike to all the targets.

Actually, what matters is the _bit rate_ at which the targets receive information, per watt expended by the neuron. This is less than the single-ISI bits per joule, because while energy is additive, the information received is _subadditive_ due to correlations.

- Toby Berger, [Information and decision theory explain why there are neural pulse trains](http://www.clsp.jhu.edu/seminars/1236/).

In 2010-2011 a team of eminent scientists led by Dharmendra Modha generated a real-time simulation of 10 million neurons in a cat's visual cortex. They managed to get it stable: previous simulations would destabilize, so after a while all neurons were firing as fast as possible, or not at all!

However, the simulation used $10^9$ times as much power per neuron as a real cat!

The human brain is wonderfully efficient, running on 40 watts.

Claim: the three main energy costs of an interspike interval of duration $t$ are proportional to $\log(1/t)$, $1/t$, and $t$.

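If the total cost per interspike interval were $E(t) = a\log(1/t) + b/t + c\,t$ for some positive coefficients (the coefficients below are invented, purely for illustration), the cost-minimizing interval follows from $E'(t) = -a/t - b/t^2 + c = 0$, i.e. $c t^2 - a t - b = 0$. A quick numerical check:

```python
import numpy as np

a, b, c = 1.0, 2.0, 3.0    # hypothetical cost coefficients
# E(t) = a*log(1/t) + b/t + c*t
# E'(t) = -a/t - b/t**2 + c = 0  =>  c*t**2 - a*t - b = 0
t_star = (a + np.sqrt(a**2 + 4*b*c)) / (2*c)   # positive root of the quadratic

ts = np.linspace(0.01, 5.0, 100_000)
E = a*np.log(1/ts) + b/ts + c*ts
print(t_star, ts[np.argmin(E)])    # analytic minimizer vs. brute-force minimizer
```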

The point (which the speaker just barely got around to at the very end of his talk):

The GIG (generalized inverse Gaussian) distribution manages to maximize bit rate per watt under certain assumptions.


Next talk: [Naftali Tishby](http://www.cs.huji.ac.il/~tishby/) on "Sensing and acting under information constraints: a principled approach to biology and intelligence".

Life: systems that exploit the predictability of their environment for their survivability.

There's one function: how much value a certain amount of information about the future can provide.

And there's another: how little a certain amount of information about the past can cost.

And another: how many bits about the future some bits about the past will provide. This depends on the "window size" for the past and for the future.

If you can compute these 3 functions, you can do something.

In "metabolic information processing", sensory information is received and used for acting almost immediately. There's hardly any long-term planning.

In something like playing chess, we have a different problem: "long-term planning". Here we model the future and use information from that model to make our present decision.

JM Fuster's "perception-action cycle" is the circular flow of information between the organism and its environment that takes place in a sensory-guided sequence of behaviors toward a goal.


The brain compresses information about the past, selecting just the "useful" information, and uses this to make _predictions about the future_ and _actions that affect the future_.

We may try to formalize this using a **[partially observable Markov decision process](https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process)**. This has states, actions, transition probabilities (the probability of going from one state to another _given an action_), and rewards (the reward for an action going from one state to another). This allows a nice graphical model for the perception-action cycle.
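In a POMDP the agent doesn't see the state directly; it maintains a *belief* (a probability distribution over states) and updates it by Bayes' rule after each observation. A minimal sketch with a made-up two-state world (a single action, so the action index is dropped):

```python
import numpy as np

# Hypothetical two-state world ("source is left" / "source is right").
T = np.array([[0.9, 0.1],
              [0.1, 0.9]])    # T[s, s'] = p(s' | s): state transition probabilities
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])    # O[s', o] = p(o | s'): observation probabilities

def belief_update(b, o):
    b_pred = b @ T             # predict step: push the belief through the dynamics
    b_new = b_pred * O[:, o]   # correct step: weight by the observation likelihood
    return b_new / b_new.sum()

b = np.array([0.5, 0.5])       # start maximally uncertain
b = belief_update(b, o=0)
print(b)                       # belief shifts toward the state that explains o=0
```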

Naftali Tishby is saying very clear and interesting things about deep issues... unfortunately there are a lot of pictures I can't type in here!

He's generalizing something I don't know: the [Bellman equation](https://en.wikipedia.org/wiki/Bellman_equation). Unlike the traditional Bellman equation, his version takes into account the gain of information produced by actions.
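For reference, the classical Bellman equation $V(s) = \max_a \big[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \big]$ is solved by value iteration; Tishby's information-gain term is not shown here. A toy example with invented numbers:

```python
import numpy as np

# A tiny MDP: 2 states, 2 actions, made up purely for illustration.
P = np.array([[[1.0, 0.0],     # P[s, a, s'] = p(s' | s, a)
               [0.0, 1.0]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[0.0, 1.0],      # R[s, a]: immediate reward
              [1.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman backup: Q(s,a) = R(s,a) + gamma * E[V(s')], then maximize over a.
    Q = R + gamma * np.einsum('saz,z->sa', P, V)
    V = Q.max(axis=1)

print(V)   # optimal values: state 1 loops on reward 2, state 0 jumps to state 1
```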

Claim: we enjoy music because we are constantly trying to predict what'll happen next, and we get a dopamine reward when we succeed. If we fail, we try to build a model that does better.

His collaborators have played simple sequences of tones to anesthetized cats, who in fact can still hear and try to predict the next tone. He compares their neural activity to what you'd expect from optimal prediction models.

The number of neural spikes after a tone is proportional to the "surprise" of that tone - that is, minus the logarithm of its probability in the Markov process whereby the tones were generated!
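The "surprise" (surprisal) of a tone is just $-\log_2 p$ of its transition probability. A tiny illustration with a made-up two-tone Markov chain:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])   # P[i, j] = probability of tone j right after tone i

def surprisals(seq):
    # Bits of surprise for each transition in the observed tone sequence.
    return [-np.log2(P[i, j]) for i, j in zip(seq, seq[1:])]

print(surprisals([0, 0, 1, 0]))   # the rare 0 -> 1 transition carries the most bits
```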

A paper:

- Naftali Tishby and Daniel Polani, [Information theory of decisions and actions](http://www.cs.huji.ac.il/labs/learning/Papers/IT-PAC.pdf).

> Claim: we enjoy music because we are constantly trying to predict what'll happen next, and we get a dopamine reward when we succeed. If we fail we try to build a model that does better.

Music that is too predictable is usually boring, i.e. it feels like no reward.

> Too predictable music is usually boring.

Check out mirror neurons: the more predictable the tunes, the more we tend to repeat them. This is true, I believe, even in animals who dance to periodic drum beats or the bass of human music (lots of them on YouTube).

If you make the sequence of notes random, then the mirror neurons would not fire, as John mentioned. In that case the pleasure glands would not release.

This is the same as the predator-prey chase: roll a ball in front of a child or a cat, and uncontrollably they chase the ball.

Same as when children and songbirds repeat the sounds of their parents.

Same as in mathematics' definitions and proofs: they are made to be repeatable, and this process of repeating is not possible except by mirror neurons. That is how children learn arithmetic and algebra.

Dara

Now [Ilya Nemenman](http://nemenmanlab.org/~ilya/index.php/Ilya_Nemenman) is talking about "Predictive information".

(He's done a lot of interesting work on quantum techniques for chemical reaction networks, and knows a whole community of people working on this whom I hadn't met! They have an annual conference on it!!! More on that later.)

His webpage says:

> Are there phenomenological, coarse-grained, and yet functionally accurate representations of biological processes, or are we forever doomed to every detail mattering?
>
> In my mind, the question is not if some details don't matter, but which ones. A lot of smart people have thought about this question before. The dream is that, by stripping unnecessary details, we will eventually understand the basics of how we can function reliably in an ever changing world. I hope to achieve some quantitative understanding of such complex phenomena as evolution, sensory processes, animal behavior, human cognition, and, who knows, maybe one day even human consciousness.
>
> What can be a more noble science goal? As I argued a while back:
>
> > _Studying string theory cannot be more exciting than studying the brain that can study string theory._

He's done work with Andre Levchenko (who couldn't attend) on information processing by molecular pathways.


Hello John

> information processing by molecular pathways

I love to write simulators and modellers for this.

Dara

Most accurate algorithm for estimating information from a discrete signal: the **NSB algorithm**.
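The NSB (Nemenman-Shafee-Bialek) estimator itself involves averaging over Dirichlet priors and is beyond a short sketch, so here, for contrast, is only a simpler baseline: the naive plug-in estimator and the first-order Miller-Madow bias correction that NSB improves upon. This is my own illustration, not from the talk; the toy uniform source and function names are assumptions.

```python
import numpy as np
from collections import Counter

def plugin_entropy(samples):
    """Naive (plug-in) entropy estimate in bits; biased low for small samples."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def miller_madow_entropy(samples):
    """Plug-in estimate plus the first-order Miller-Madow bias correction."""
    k = len(set(samples))   # number of distinct symbols actually observed
    n = len(samples)
    return plugin_entropy(samples) + (k - 1) / (2 * n * np.log(2))

rng = np.random.default_rng(0)
samples = rng.integers(0, 8, size=50)   # uniform over 8 symbols: true H = 3 bits
print(plugin_entropy(samples), miller_madow_entropy(samples))
```

With only 50 samples the plug-in estimate falls noticeably below 3 bits, and the correction recovers most of the gap; NSB goes further by being much less sensitive to undersampling.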

In biology most communications reach their target only _after things have changed significantly_.

Predictability of the future is a deviation from extensivity of entropy: the information needed to describe past, present, and future together is less than the sum of the information needed to describe past, present, and future separately. The difference is the **predictive information** $I_{pred}(T)$, where $T$ is the relevant time scale. We have

$$ I_{pred}(T) \ge 0 $$

and subextensivity:

$$ \lim_{T \to \infty} I_{pred}(T) / T = 0 $$

If

$$ \lim_{T \to \infty} I_{pred}(T) = \mathrm{const} $$

we have no long-range structure - e.g. a Markov process spits out new entropy at a constant rate.

More interesting case:

$$ I_{pred}(T) \sim \frac{K}{2} \log T $$

We get this when we're learning $K$ parameters from $T$ samples of a Markov process - learning these parameters with better and better precision as $T$ grows. This may be related to long-range correlations as seen at critical points - but details are unknown.

Mysterious case:

$$ I_{pred}(T) \sim C T^\beta $$

with $\beta \lt 1$. For example: reading English prose seems to give $\beta \approx 1/2$. As we look at longer and longer samples of English prose, there's more and more information; it seems to grow forever (from experiments done so far), but it grows sublinearly.
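A minimal sketch (my own check, not from the talk) of the Markov case above: for a stationary symmetric two-state Markov chain with flip probability $p$, the block entropy follows the chain rule, $H(T) = 1 + (T-1)h(p)$ bits, and the predictive information $I_{pred}(T) = 2H(T) - H(2T)$ comes out constant in $T$, as claimed.

```python
import numpy as np

def block_entropy(T, p):
    """Entropy in bits of a length-T block of a stationary symmetric
    two-state Markov chain with flip probability p."""
    h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # binary entropy h(p)
    return 1 + (T - 1) * h   # chain rule: H(X_1) + (T-1) * H(X_{i+1} | X_i)

def I_pred(T, p):
    """Predictive information: H(past block) + H(future block) - H(both)."""
    return 2 * block_entropy(T, p) - block_entropy(2 * T, p)

# for a Markov chain I_pred(T) = 1 - h(p), independent of T
for T in (1, 5, 50):
    print(T, I_pred(T, 0.1))
```

Algebraically, $2(1 + (T-1)h) - (1 + (2T-1)h) = 1 - h(p)$: the past tells you about the future only through the current state, so no long-range structure accumulates.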

Now Daniel Polani is talking about "Informational principles in the perception-action loop".

Sensors in biology are often highly optimized:

* detection of just a few molecules (moths).
* humans can detect a few photons; certain toads can detect single photons.
* human children can hear at close to the limit imposed by thermal noise; then they go to discos and lose their hearing.

Cognitive processing is very expensive: it can often use 40% or 50% of our total power!

So, there must be huge evolutionary pressure for optimized sensors and high cognitive skills.

_Was man nicht im Kopf hat, muss man in den Beinen haben._ ("What you don't have in your head, you must have in your legs.") You don't need to run as fast if you can sense danger earlier.

Landauer's principle: on the lowest level, erasure of information from memory must create heat. This gives the connection between energy and information that defeats Maxwell's demon.

_However_, when discussing human cognition we have vast amounts of waste heat, so Landauer's principle is not directly relevant.

Here Ashby (1956) said "only variety can destroy entropy". Touchette and Lloyd in 2000 and 2004 studied open-loop and closed-loop controllers and saw there was a limit on how well we can reduce the entropy of the system being controlled:

* [Information-theoretic approach to the study of control systems](http://arxiv.org/abs/physics/0104007).

Supposedly they showed:

$$ \Delta H_{closed} \le \Delta H_{open} + I(W_t; A_t) $$

where $I(W_t; A_t)$ is the mutual information of the world at time $t$ and the action at time $t$.

_Learn about this stuff!_ This is a cool relationship between information theory and control theory!
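A toy check of the inequality above (my own construction, not from Touchette and Lloyd's paper): a uniform binary world $W$, a sensor reading $A$ equal to $W$ flipped with probability $\epsilon$, and a closed-loop action that cancels the sensed value. In this toy the entropy reduction equals the mutual information exactly, so the bound is tight.

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

eps = 0.2   # sensor noise level
# world W uniform on {0,1}; sensed action A = W flipped with probability eps
p_wa = np.array([[0.5 * (1 - eps), 0.5 * eps],
                 [0.5 * eps, 0.5 * (1 - eps)]])   # joint p(w, a)
I_WA = H(p_wa.sum(axis=0)) + H(p_wa.sum(axis=1)) - H(p_wa.ravel())

# closed loop: act to cancel the sensed value, leaving W XOR A,
# which equals 1 with probability eps
dH_closed = H([0.5, 0.5]) - H([1 - eps, eps])
# open loop: a fixed deterministic action merely permutes the states,
# so it cannot reduce the entropy of a uniform world
dH_open = 0.0
print(dH_closed, dH_open + I_WA)
```

Here $\Delta H_{closed} = 1 - h(\epsilon) = I(W;A)$: the controller can remove exactly as much entropy as it learns about the world.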

Too many pictures to take good notes...

Consider a Bayesian network where the world at time $t$ gives a sensation at time $t$ which gives an action at time $t$ which (together with the world) affects the world at time $t+1$. If the vector of sensations is $\mathbf{S}_t = (S_0, \dots, S_t)$, then the mutual information between $\mathbf{S}_t$ and $W_0$ (the state of the world at time zero) approaches a limit as $t \to \infty$.

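A sketch of that saturation (my own toy model, not from the talk): if the world state $W_0$ is a fixed fair bit and each sensation is $W_0$ corrupted by independent flip noise, then $I(\mathbf{S}_t; W_0)$ can be computed exactly and climbs toward its 1-bit ceiling as $t \to \infty$.

```python
from math import comb, log2

def binom(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def h2(q):
    """Binary entropy in bits."""
    if q <= 0 or q >= 1:
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

def info_about_world(t, eps):
    """I(S_1..S_t ; W_0) in bits, for W_0 uniform on {0,1} and
    sensations S_i = W_0 XOR Bernoulli(eps) noise."""
    H_cond = 0.0
    for k in range(t + 1):           # k = number of 1s among the sensations
        a = binom(k, t, eps)         # P(k ones | W_0 = 0)
        b = binom(k, t, 1 - eps)     # P(k ones | W_0 = 1)
        p_k = 0.5 * (a + b)
        H_cond += p_k * h2(0.5 * a / p_k)   # posterior uncertainty about W_0
    return 1.0 - H_cond             # H(W_0) - H(W_0 | S_1..S_t)

for t in (1, 5, 20, 60):
    print(t, info_about_world(t, 0.3))
```

The mutual information is monotone in $t$ and bounded by $H(W_0) = 1$ bit, so it must approach a limit; with a changing world (as in the Bayesian network above) the limit is generally below that ceiling.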

The traditional work on Markov decision processes doesn't take into account decision costs. So, let's add a cost for the amount of information that needs to be used to make the decision.

Minimize (mutual information between state of world and action) minus $\beta$ times (expected value of reward). Here $\beta$ is a Lagrange multiplier that says how much we care about a high expected reward versus how much we try to avoid using a lot of information.

There's a nice "grid world" example of a robot trying to find a specific point while walking around on a grid, in [the paper by Tishby and Polani](http://www.cs.huji.ac.il/labs/learning/Papers/IT-PAC.pdf) which founded this line of work.
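For a one-step version of this trade-off, the optimal policy satisfies the self-consistent condition $p(a|w) \propto p(a)\, e^{\beta U(w,a)}$, which suggests a Blahut-Arimoto-style alternation. A hedged sketch (the function name and toy utility matrix are mine, not from Tishby and Polani):

```python
import numpy as np

def info_constrained_policy(p_w, U, beta, n_iter=300):
    """Minimize I(W;A) - beta * E[U(w,a)] over policies p(a|w).
    Fixed point: p(a|w) proportional to p(a) * exp(beta * U(w,a))."""
    nW, nA = U.shape
    policy = np.full((nW, nA), 1.0 / nA)     # p(a|w), start uniform
    for _ in range(n_iter):
        p_a = p_w @ policy                   # marginal over actions
        policy = p_a * np.exp(beta * U)      # reweight by exponentiated utility
        policy /= policy.sum(axis=1, keepdims=True)
    return policy

p_w = np.array([0.5, 0.5])
U = np.array([[1.0, 0.0],    # action 0 is rewarded in state 0
              [0.0, 1.0]])   # action 1 is rewarded in state 1
print(info_constrained_policy(p_w, U, beta=0.0))    # no info budget: ignore the state
print(info_constrained_policy(p_w, U, beta=10.0))   # cheap info: near-greedy policy
```

At $\beta = 0$ the policy ignores the world entirely (zero mutual information); as $\beta \to \infty$ it becomes the usual greedy policy, paying full information cost for full reward.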

In biology, the success criterion is survival (really reproduction). The problem is then that feedback is too sparse: you only know you've done something wrong when you die (before having children). So, to optimize behavior it helps to have smaller rewards and punishments: pain, pleasure, unhappiness, happiness.

_Adaptational feedback should be dense and rich._

"Being in control of your destiny - and knowing it - is good". This is connected to

controllabilityandobservability, two key concepts in control theory.See Klyubin

et al, 2005 and 2008 on empowerment:Abstract of the last paper:

`"Being in control of your destiny - and knowing it - is good". This is connected to _controllability_ and _observability_, two key concepts in control theory. See Klyubin _et al_, 2005 and 2008 on [empowerment](http://www.prokopenko.net/empowerment.html): > The information transfer can also be interpreted as the acquisition of information from the environment by a single adapting individual: there is evidence that pushing the information flow to the information-theoretic limit (i.e. maximization of information transfer) can give rise to intricate behaviour, induce a necessary structure in the system, and ultimately adaptively reshape the system [1-3]. The central hypothesis of Klyubin et al. is that there exists "a local and universal utility function which may help individuals survive and hence speed up evolution by making the fitness landscape smoother", while adapting to morphology and ecological niche. The proposed general utility function, **empowerment**, couples the agent’s sensors and actuators via the environment. _Empowerment is the perceived amount of influence or control the agent has over the world, and can be seen as the agent’s potential to change the world. It can be measured via the amount of Shannon information that the agent can "inject into" its sensor through the environment, a effecting future actions and future perceptions._ Such a perception-action loop defines the agent’s actuation channel, and technically empowerment is defined as the capacity of this actuation channel: the maximum mutual information for the channel over all possible distributions of the transmitted signal. "The more of the information can be made to appear in the sensor, the more control or influence the agent has over its sensor" – this is the main motivation for this local and universal utility function [2]. > 1. Klyubin, A.S., Polani, D. and Nehaniv, C.L. Organization of the information flow in the perception-action loop of evolved agents. 
In Proceedings of 2004 NASA/DoD Conference on Evolvable Hardware, page 177-180. IEEE Computer Society, 2004. > 2. Klyubin, A.S., Polani, D. and Nehaniv, C.L. All else being equal be empowered. In M. S. Capcarr‘ere, A. A. Freitas, P. J. Bentley, C. G. Johnson, and J. Timmis, editors, Advances in Artificial Life, 8th European Conference, ECAL 2005, volume 3630 of LNCS, page 744-753. Springer, 2005. > 3. Klyubin, A.S., Polani, D. and Nehaniv, C.L. [Empowerment: A Universal Agent-Centric Measure of Control](http://homepages.herts.ac.uk/~comqdp1/publications/files/cec2005_klyubin_polani_nehaniv.pdf). In Proc. CEC 2005. IEEE. Abstract of the last paper: > **Abstract.** The classical approach to using utility functions suffers from the drawback of having to design and tweak the functions on a case by case basis. Inspired by examples from the animal kingdom, social sciences and games we propose empowerment, a rather universal function, defined as the information-theoretic capacity of an agent's actuation channel. The concept applies to any sensorimotor apparatus. Empowerment as a measure reflects the properties of the apparatus as long as they are observable due to the coupling of sensors and actuators via the environment. Using two simple experiments we also demonstrate how empowerment influences sensor-actuator evolution.`
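Since empowerment is defined as the capacity of the actuation channel $p(s'|a)$, it can be computed with the classic Blahut-Arimoto algorithm. A sketch (the toy binary channel here is my own example, not from Klyubin et al.):

```python
import numpy as np

def kl_rows(P, q):
    """KL divergence (nats) of each row of P from the distribution q."""
    with np.errstate(divide='ignore', invalid='ignore'):
        t = np.where(P > 0, P * np.log(P / q), 0.0)
    return t.sum(axis=1)

def channel_capacity(P, n_iter=500):
    """Blahut-Arimoto capacity (bits) of a channel P[a, s'] = p(s'|a)."""
    nA = P.shape[0]
    r = np.full(nA, 1.0 / nA)     # distribution over actions p(a)
    for _ in range(n_iter):
        q = r @ P                 # induced sensor distribution p(s')
        r = r * np.exp(kl_rows(P, q))
        r /= r.sum()
    return float(r @ kl_rows(P, r @ P)) / np.log(2)

# toy actuation channel: two actions, each driving the next sensor state
# to a distinct value but flipped with probability eps
eps = 0.1
P = np.array([[1 - eps, eps],
              [eps, 1 - eps]])
print(channel_capacity(P))   # for this binary symmetric channel: 1 - h(eps) bits
```

The agent's empowerment in this toy loop is $1 - h(\epsilon)$ bits: noisier actuation means less control the agent can "inject into" its sensor.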

I think Nemenman was Bialek's student. They have co-authored a number of papers together, some with also Tishby. Predictivity turns up in Bialek's talks a fair bit. Nemenman has also done work on applying field theory in machine learning. His thesis looks very interesting. That and a number of his papers are on my reading list. If only I could tick papers off the list at the same rate as I add to it ... :)


Ooh [videos](https://www.birs.ca/events/2014/5-day-workshops/14w5170/videos)! :)

Hi Daniel,

> a number of his papers are on my reading list. If only I could tick papers off the list at the same rate as I add to it ... :)

I guess this is characteristic of the Azimuth project network. Now of the ~1300 papers in my Azimuth downloads folder, never mind the bookmarks, which one is the optimal "next"? :)

I went through Nemenman's publications list and copied the maths- and machine-learning-sounding links. I'll add titles and post links to their pdfs tomorrow.

> Ooh videos! :)

They really should upload them to YouTube, though. I am completely hooked on YouTube's speed-up feature.

Susanne Still is giving a talk.

Say we're using Bayesian updating, replacing the prior $p(y)$ by the posterior $p(y|x) = p(x|y)p(y)/p(x)$.

How much work does it take, on average, to do this?

$$ \langle \Delta F(p(y) \to p(y|x)) \rangle_x = \langle F(p(y|x)) \rangle_x - F(p(y)) = \Delta E + k_B T I[x,y] $$

where $I[x,y]$ is the mutual information, i.e. the entropy of $y$ minus the entropy of $y$ given $x$.
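A numerical check of the case $\Delta E = 0$ (the toy prior and likelihood are my own, not from the talk): the average work of the update is just $k_B T \ln 2$ times the mutual information in bits.

```python
import numpy as np

def I_bits(p_xy):
    """Mutual information in bits from a joint distribution p(x, y)."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2((p_xy / (px * py))[mask])))

# toy inference: uniform prior p(y), likelihood p(x|y) a noisy readout of y
p_y = np.array([0.5, 0.5])
eps = 0.2
p_x_given_y = np.array([[1 - eps, eps],
                        [eps, 1 - eps]])
p_xy = p_x_given_y * p_y[:, None]       # joint distribution; rows y, columns x

kT = 4.11e-21                           # k_B * T in joules at roughly 298 K
work = kT * np.log(2) * I_bits(p_xy)    # average work when Delta E = 0
print(I_bits(p_xy), work)
```

Here $I[x,y] = 1 - h(0.2) \approx 0.28$ bits, so the average work per update is on the order of $10^{-21}$ J: tiny, but strictly positive whenever the data actually tells you something about $y$.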

In many cases $\Delta E$ is zero.

Then define

$$ L(U) := \min_{p(y|x) \;\text{such that}\; \langle u(x,y) \rangle = U} \langle \Delta F \rangle $$

where $u(x,y)$ is some utility function.

The following titles in Ilya Nemenman's [publications](http://nemenmanlab.org/~ilya/index.php/Publications) looked interesting and possibly even relevant to other non-biological Azimuth work.

* Bryan C. Daniels and Ilya Nemenman, [Automated adaptive inference of coarse-grained dynamical models in systems biology (2014)](http://arxiv.org/abs/1404.6283)
* David J. Schwab, Ilya Nemenman and Pankaj Mehta, [Zipf's law and criticality in multivariate data without fine-tuning (2013)](http://arxiv.org/abs/1310.0448)
* Martin Tchernookov and Ilya Nemenman, [Predictive information in a nonequilibrium critical model (2012)](http://arxiv.org/abs/1212.3896)
* Ilya Nemenman, [Inference of entropies of discrete random variables with unknown cardinalities (2002)](http://arxiv.org/abs/physics/0207009)
* N. A. Sinitsyn and I. Nemenman, [Time-dependent corrections to effective rate and event statistics in Michaelis-Menten kinetics (2010)](http://arxiv.org/abs/1001.4212)
* Golan Bel and Ilya Nemenman, [Ergodic and Nonergodic Anomalous Diffusion in Coupled Stochastic Processes (2009)](http://arxiv.org/abs/0901.4785)
* Ilya Nemenman, Fariel Shafee and William Bialek, [Entropy and inference, revisited (2001)](http://arxiv.org/abs/physics/0108025)

This one caught my eye:

* William Bialek, Ilya Nemenman and Naftali Tishby, [Complexity Through Nonextensivity](http://arxiv.org/abs/physics/0103076)

Abstract:

> The problem of defining and studying complexity of a time series has interested people for years. In the context of dynamical systems, Grassberger has suggested that a slow approach of the entropy to its extensive asymptotic limit is a sign of complexity. We investigate this idea further by information theoretic and statistical mechanics techniques and show that these arguments can be made precise, and that they generalize many previous approaches to complexity, in particular unifying ideas from the physics literature with ideas from learning and coding theory; there are even connections of this statistical approach to algorithmic or Kolmogorov complexity. Moreover, a set of simple axioms similar to those used by Shannon in his development of information theory allows us to prove that the divergent part of the subextensive component of the entropy is a unique complexity measure. We classify time series by their complexities and demonstrate that beyond the 'logarithmic' complexity classes widely anticipated in the literature there are qualitatively more complex, 'power-law' classes which deserve more attention.

> there are even connections of this statistical approach to algorithmic or Kolmogorov complexity

Kolmogorov complexity ROCKS! It's one of the best tools at the hands of our theoretical masters here, with its mathematical machinery fully developed on the basis of measure theory.