
# Bio-inspired information theory

Andrew Eckford is starting up the Banff meeting on information theory and biology. An overview:

• Information theory is everywhere in biology.

• Evolution is a process of optimization.

Therefore:

What does this mean, if true?

• Information theory should help us understand biological processes. E.g., any transducer should be operating near its signal capacity.

• Biological communication should have good information-theoretic properties that human technology can exploit.

But is it true? There's evidence on both sides.

• Yes: neuroscience on the information processing properties of neurons, "infotaxis": how mobile organisms operate inside a chemical gradient in an information-theoretically efficient manner.

• No: information-optimal strategies are not always utility-optimal. Strategies for eating are different from strategies about learning!

So, information theory is only as good as the model to which it's applied. Organisms learn and act.

1.
edited October 2014

There's been a long history of work in this field: Yockey (1950s), Barlow (60s), Berger (70s). However, the work has long been outside the mainstream of information theory (IT). A recent surge of interest in "systems biology" is changing this.

The problem may go back to Claude Shannon's editorial [The Bandwagon](http://dsp.rice.edu/sites/dsp.rice.edu/files/shannon-bandwagon.pdf), in which he tried to pour cold water on the fad for information theory. However, he merely said he wanted information theory to be mathematically rigorous. Now biology is in a position to be mathematically rigorous.
2.

We'll have talks on:

• modelling at the level of individual neurons

• modelling at the level of systems: sensory systems

• calculating channel capacity for some specific systems, e.g. cell-cell communications

• computing Shannon capacity can be hard - for example, it may require computing the permanent of a matrix, which is a lot harder than the determinant

• information theory in evolution (me), information theory and the origin of life, information theory in biodiversity
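To make the permanent-versus-determinant remark above concrete, here is a minimal sketch in Python. It is only an illustration, not anything presented at the meeting: both functions sum over all permutations, and the only difference is the sign factor. The determinant can nonetheless be computed in polynomial time (for example by Gaussian elimination), whereas no polynomial-time algorithm is known for the permanent. The function names are hypothetical.

```python
from itertools import permutations


def permanent(matrix):
    """Permanent by direct summation over all n! permutations."""
    n = len(matrix)
    total = 0
    for perm in permutations(range(n)):
        product = 1
        for row, col in enumerate(perm):
            product *= matrix[row][col]
        total += product
    return total


def determinant(matrix):
    """Same sum as the permanent, but each term carries the permutation sign."""
    n = len(matrix)
    total = 0
    for perm in permutations(range(n)):
        # The sign is (-1) raised to the number of inversions of the permutation.
        inversions = sum(
            1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j]
        )
        product = 1
        for row, col in enumerate(perm):
            product *= matrix[row][col]
        total += (-1) ** inversions * product
    return total


example = [[1, 2], [3, 4]]
print(permanent(example), determinant(example))
```

The brute-force sum is only usable for tiny matrices; it is here to show how similar the two definitions look, which is what makes the gap in difficulty surprising.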

3.
edited October 2014

Peter Thomas' talk.

What does channel capacity have to do with biology? Bell and Sejnoski 1997 looked at ensembles of visual scenes in a paper "The independent components of natural scenes are edge filters", and used information theory to find optimal visual receptors.

Smith and Lewicki (2006 Nature) did a similar analysis for auditory receptors.

Schneider 2010 used a similar analysis to understand nucleic acids (gene encodings).

Vergassola et al (2007 Nature) did a similar analysis for chemotaxis. Do small organisms move along the gradient of a chemical signal? No, often they move in such a way as to maximize information gained!

But to do these analyses rigorously, we need to be able to measure channel capacities of biological systems, cell by cell. Levchenko et al. (2011 Science) started doing this:

• Raymond Cheong, Alex Rhee, Chiaochun Joanne Wang, Ilya Nemenman and Andre Levchenko, [Information transduction capacity of noisy biochemical signaling networks](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3895446/).
4.

Three stages of signalling:

1. Secretion of a signalling molecule.

2. Diffusion from sender to receiver.

3. Ligand binding of signalling molecule to a receptor protein.

All three stages involve noise which must be taken into account when computing channel capacities.

Binding: see the finite sets of states and random transitions between these in Thomas' paper on the "calcium-calmodulin binding graph". The chemical master equation is involved here!

They applied a theorem due to Chen and Berger to compute the channel capacity of a two-state discrete time Markov model - a simplified model of this sort:

• Andrew W. Eckford and Peter J. Thomas, [Capacity of a simple intercellular signal transduction channel](http://arxiv.org/abs/1305.2245).
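For readers who want to see what "computing a channel capacity" means in the simplest possible case, here is a rough Python sketch. It is not the Chen and Berger theorem and not the Markov model from the paper: it assumes a memoryless channel with two inputs, described by a hypothetical table `channel[x][y] = P(y | x)`, and finds the capacity by brute-force search over input distributions.

```python
import math


def mutual_information(p_input, channel):
    """I(X;Y) in bits, where channel[x][y] = P(y | x)."""
    n_out = len(channel[0])
    p_output = [
        sum(p_input[x] * channel[x][y] for x in range(len(p_input)))
        for y in range(n_out)
    ]
    info = 0.0
    for x, px in enumerate(p_input):
        for y in range(n_out):
            joint = px * channel[x][y]
            if joint > 0:
                info += joint * math.log2(channel[x][y] / p_output[y])
    return info


def binary_input_capacity(channel, steps=10001):
    """Maximize I(X;Y) over P(X=0) = q by a simple grid search (assumed resolution)."""
    best = 0.0
    for k in range(steps):
        q = k / (steps - 1)
        best = max(best, mutual_information([q, 1 - q], channel))
    return best


# Illustrative noise level: each input symbol is flipped with probability 0.1.
flip = 0.1
noisy_channel = [[1 - flip, flip], [flip, 1 - flip]]
print(binary_input_capacity(noisy_channel))
```

For this symmetric example the search lands on the uniform input distribution, and the result agrees with the textbook formula $1 - H(0.1)$ for a binary symmetric channel. Channels with memory, like the receptor model in the paper, need more machinery than this.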
5.
edited October 2014

Information processing needs to be thought of as part of the bigger "perception-action loop". The statistical structure of the world determines the percept; the creature then determines an action using some strategy, and this changes the statistical structure of the world.

So, we've left the world of "directed acyclic graphs", because our causal Bayesian network contains loops!!!

• Edward K. Agarwala, Hillel J. Chiel and Peter J. Thomas, Pursuit of food versus pursuit of information in Markov chain models of a perception-action loop, Journal of Theoretical Biology.

Abstract. Efficient coding, redundancy reduction, and other information theoretic optimization principles have successfully explained the organization of many biological phenomena, from the physiology of sensory receptive fields to the variability of certain DNA sequence ensembles. Here we examine the hypothesis that behavioral strategies that are optimal for survival must necessarily involve efficient information processing, and ask whether there can be circumstances in which deliberately sacrificing some information can lead to higher utility? To this end, we present an analytically tractable model for a particular instance of a perception–action loop: a creature searching for a randomly moving food source confined to a 1D ring world. The model incorporates the statistical structure of the creature's world, the effects of the creature's actions on that structure, and the creature's strategic decision process. The underlying model takes the form of a Markov process on an infinite dimensional state space. To analyze it we construct an exact coarse graining that reduces the model to a Markov process on a finite number of “information states”. This mathematical technique allows us to make quantitative comparisons between the performance of an information-theoretically optimal strategy with other candidate search strategies on a food gathering task. We find that:

1. Information optimal search does not necessarily optimize utility (expected food gain).

2. The rank ordering of search strategies by information performance does not predict their ordering by expected food obtained.

3. The relative advantage of different strategies depends on the statistical structure of the environment, in particular the variability of motion of the source.

We conclude that there is no simple relationship between information and utility. Even in the absence of information processing costs or bandwidth constraints, behavioral optimality does not imply information efficiency, nor is there a simple tradeoff between the two objectives of gaining information about a food source versus obtaining the food itself. For a wide range of values of the food source's movement parameter, the strategy of collecting the most information possible about the unknown source location carries an ineliminable structural cost, leading to a situation in which a foraging creature could actually choose to be less well-informed while simultaneously being, on average, better fed.

6.
edited October 2014

William Bialek argues that many biological processes operate at the theoretical limit of information utilization already. He also argues that this drives them to operate at phase transitions or critical points of their respective parameter spaces. He gives several concrete examples with calculations to back up the claim.

Good introductions to this work are:

• a very nice [talk](http://t.co/HcabJvC1tV)

• [a lecture series](http://www.youtube.com/playlist?list=PLoxv42WBtfCAY8icy7uChz_kpBXpWoMwk)

• a survey paper: Thierry Mora and William Bialek, [Are biological systems poised at criticality?](http://arxiv.org/abs/1012.2242)
7.
edited October 2014

There's a forum thread referencing Bialek [here](http://forum.azimuthproject.org/discussion/1164/national-institute-for-mathematical-and-biological-synthesis/?Focus=8563#Comment_8563). John offered links to copies of a couple of interesting-sounding PDFs you might like.
8.

Hello John

In all that was mentioned above, the most important concept is missing: that of compression, which is found in living organisms and their genomes. Please review this for the ants (P12):

[Ants and Kolmogorov Complexity](http://www.reznikova.net/AntsBits.pdf)

Compression of information allows for large network processing. I am sure the concept requires no further suggestion to you.

Kolmogorov Complexity is in full mathematical gear as invented by Kolmogorov (may he rest in peace) and LA Levin. Therefore for your purposes the mathematical and computational framework is there already 100% done!

The same concept is found in genome compression:

[Evolutionary Principles of Genomic Compression](http://www.santafe.edu/media/workingpapers/02-05-021.pdf)

In particular it is this genomic compression that allows for the economic energetics of bird flight:

[The Smallest Avian genomes are found in hummingbirds](http://msb.unm.edu/birds/publications_files/Gregory_et_al_PRSB_Hummingbird_Genome_Size.pdf)

John, compression has no mathematical closed form and is difficult to deal with mathematically the traditional way, but it is easy in code to compress any data. In fact any serialized object could be compressed; then replace the compression with its Kolmogorov Complexity counterpart, and you get yourself a mathematical computational framework to explain some nifty awesome stuff in the universe.
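The "easy in code" part can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Kolmogorov complexity itself is uncomputable, and the length of a compressed file gives only an upper bound on it (up to a constant for the decompressor). The choice of `zlib` and the two sample byte strings are arbitrary.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data, a crude complexity proxy."""
    return len(zlib.compress(data, 9))


# A highly regular string and a pseudo-random one of the same length.
regular = b"ab" * 500
rng = random.Random(0)  # fixed seed so the example is repeatable
noisy = bytes(rng.randrange(256) for _ in range(1000))

print(compressed_size(regular), compressed_size(noisy))
```

The regular string compresses to a small fraction of its length, while the pseudo-random one does not shrink, which is the qualitative behavior one would expect from a complexity measure.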

Dara

9.
edited October 2014

Thanks for your comments, everyone! I'll have to look at these references when I get time. I'd invited William Bialek to my own workshop on [information and entropy in biological systems](http://johncarlosbaez.wordpress.com/2014/07/04/entropy-and-information-in-biological-systems-part-2/). Unfortunately, when we had to move the date due to a football game in the town where the workshop is being held, he said he could no longer attend.
10.

FWIW Possible typo at #4 if it's the NN guy it's s/sejnoski/sejnowski/.

11.
edited October 2014

Toby Berger, inventor of "rate distortion theory", is talking now on "Neuroscience applications of generalized inverse Gaussian distributions".

While he's warming up: [Generalized inverse Gaussian distributions](https://en.wikipedia.org/wiki/Generalized_inverse_Gaussian_distribution), or GIG distributions for short, are a 3-parameter family of probability distributions on the positive real line. "The GIG distribution is conjugate to the normal distribution when serving as the mixing distribution in a normal variance-mean mixture" - whatever that means.
12.

The generalized Shannon-Blackwell billiard ball channel. Each step you put in either a white or black billiard ball in a huge box and your friend randomly picks one and takes one out. What is the channel capacity? Nobody knows! You can get at least $1/2$ a bit per use, but the capacity is at least $0.52$.

This is similar to one cell communicating to another by putting out different kinds of molecules into a liquid, with the other cell randomly picking out these molecules. Of course there can be more than 2 kinds of molecules: there's a set of species $\{0, \dots, M-1\}$.

Suppose the number of molecules in the solution is huge, like $b = 10^9$. Suppose $M = 2$. Consider testing the hypotheses: $H_0$: the next $N$ inputs are all $0$'s. $H_1$: the next $N$ inputs are all $1$'s.

Find the smallest $N$ such that the next $N$ outputs discriminate reliably between $H_0$ and $H_1$. Answer: $\mathrm{const} \cdot p^{1/3} b^{2/3}$. (What the heck is $p$?)

This is much smaller than you might think! Since $b^{2/3} = 10^6$, you only need to change the concentration of 0's by one in ten thousand to reliably distinguish between these hypotheses.
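The arithmetic behind that last step can be written out as follows. This only restates the formula quoted from the talk; the constant $c$ and the quantity $p$ are left symbolic, since the notes do not say what they are.

$$N_{\min} \approx c \, p^{1/3} b^{2/3}, \qquad b = 10^9 \;\Rightarrow\; b^{2/3} = (10^9)^{2/3} = 10^6, \qquad \frac{N_{\min}}{b} \approx c \, p^{1/3} \cdot 10^{-3}.$$

So the fraction of the pool that has to change is $10^{-3}$ times $c \, p^{1/3}$. The "one in ten thousand" figure would then correspond to $c \, p^{1/3}$ being of order $0.1$; that is an inference from the formula, not something stated in the notes.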
13.

The "classical" view of cortical processing was:

1. Information gets compressed as it travels up the hierarchy.

2. They ignored the flow of information back down, e.g. from the cortex toward the retina.

Both these are wrong. Indeed, even the briefest of stimuli set off big chains of neural firings going up and down the pathways.

Information is not being destroyed as it goes up the hierarchy. It gets rearranged, and about halfway up, information from different sensory systems (visual, auditory) gets mixed. Top-down signals largely (not exclusively) suppress certain cells in the levels below.

The idea: sensation is focused on trying to confirm or disconfirm hypotheses about reality. Signals irrelevant to this decision process get suppressed.

14.

The goal of sensation is not to accurately measure some quantity in the outside world. The error criterion you're trying to minimize is not the error between reality and perception. The error that matters is the difference between the sensation you think your action will produce and the sensation it does produce.
15.

A simple model:

Let $a,b$ be times of occurrence of two successive spikes generated by a neuron.

Goal: maximize the number of bits of information that knowledge of $t = b - a$ conveys to the neuron's targets per joule of energy that the neuron expends to process its input during the interspike interval (ISI) to generate the spike at $b$, and propagate that spike to all the targets.
16.

Actually, what matters is the bit rate at which the targets receive information, per watt expended by the neuron. This is less than the single-ISI bit per watt, because while energy is additive, the information received is subadditive due to correlations.
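The subadditivity argument can be spelled out as a short inequality. As assumptions for illustration only, write $T_1, \dots, T_n$ for successive ISIs with a common marginal distribution, measure the information they carry by their entropy $H$, and let $e(T_i)$ be the energy spent on the $i$-th interval. None of this notation comes from the talk.

$$H(T_1, \dots, T_n) \le \sum_{i=1}^n H(T_i) = n \, H(T_1), \qquad \mathbb{E}\Big[\sum_{i=1}^n e(T_i)\Big] = n \, \mathbb{E}[e(T_1)],$$

$$\frac{H(T_1, \dots, T_n)}{n \, \mathbb{E}[e(T_1)]} \le \frac{H(T_1)}{\mathbb{E}[e(T_1)]}.$$

Equality holds in the first inequality exactly when the intervals are independent, so any correlation between successive ISIs pushes the bits per joule of the whole train below the single-ISI figure.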

17.
• Toby Berger, [Information and decision theory explain why there are neural pulse trains](http://www.clsp.jhu.edu/seminars/1236/).
18.

In 2010-2011 a team of eminent scientists led by Dharmendra Modha generated a real-time simulation of 10 million neurons in a cat's visual cortex. They managed to get it stable: previous simulations would destabilize, so that after a while all neurons were firing as fast as possible or not at all!

However, the simulation used $10^9$ times as much power per neuron as a real cat!

The human brain is wonderfully efficient, running on 40 watts.

19.

Claim: the three main energy costs of an interspike interval of duration $t$ are proportional to $\log(1/t)$, $1/t$, and $t$.

20.

The point (which the speaker just barely got around to at the very end of his talk):

The GIG (generalized inverse Gaussian) distribution manages to maximize bit rate per watt under certain assumptions.
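One way to see why the GIG family shows up here is a standard maximum-entropy calculation. This is a sketch of a related textbook fact, not Berger's actual derivation (which optimizes bits per watt); the symbols $\lambda$, $\alpha$, $\beta$ and the constraint values $c_1, c_2, c_3$ are introduced only for illustration. Maximize the differential entropy of the ISI density $f$ on $t > 0$ while holding fixed the averages of the three cost terms quoted earlier in the thread:

$$\max_f \; -\int_0^\infty f(t) \log f(t) \, dt \quad \text{subject to} \quad \mathbb{E}[\log T] = c_1, \quad \mathbb{E}[1/T] = c_2, \quad \mathbb{E}[T] = c_3.$$

Setting the variation of the Lagrangian to zero gives a density that is the exponential of a linear combination of $\log t$, $1/t$ and $t$:

$$f(t) \propto t^{\lambda - 1} \exp\!\Big(-\tfrac{1}{2}\big(\alpha t + \beta / t\big)\Big), \qquad t > 0,$$

which is the generalized inverse Gaussian density with parameters $(\lambda, \alpha, \beta)$.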

21.
edited October 2014

Next talk: [Naftali Tishby](http://www.cs.huji.ac.il/~tishby/) on "Sensing and acting under information constraints: a principled approach to biology and intelligence".

Life: systems that exploit the predictability of their environment for their survivability.

There's one function: how much value a certain amount of information about the future can provide.

And there's another: how little having a certain amount of information about the past costs.

And another: how many bits about the future will some bits about the past provide. This depends on the "window size" for the past and for the future.

If you can compute these 3 functions, you can do something.

22.

In "metabolic information processing" sensory information is received and used for acting almost immediately. There's hardly any long term planning.

In something like playing chess, we have a different problem: "long term planning". Here we model the future and use information from that model to make our present decision.

23.

JM Fuster's "perception-action cycle" is the circular flow of information between the organism and its environment that takes place in a sensory-guided sequence of behaviors toward a goal.

24.
edited October 2014

The brain compresses information about the past, selecting just "useful" information, and uses this to make predictions about the future and actions that affect the future.

We may try to formalize this using a [partially observed Markov decision process](https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process). This has states, actions, transition probabilities (probability to go from one state to another given an action), and rewards (reward for an action going from one state to another).

This allows a nice graphical model for the perception-action cycle.
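Here is a minimal Python sketch of the "compress the past" step in such a model: the standard Bayesian belief update, where the creature's whole history is summarized by a probability distribution over hidden states. It is not taken from the talk. The table layouts `transition[a][s][s2]` and `observe[a][s2][o]`, the action name `"wait"` and all the numbers are hypothetical, and rewards are left out because they do not enter the belief update.

```python
def belief_update(belief, action, observation, transition, observe):
    """One Bayes-filter step for a partially observed Markov decision process.

    transition[a][s][s2] = P(s2 | s, a)
    observe[a][s2][o]    = P(o | s2, a)
    """
    n_states = len(belief)
    unnormalized = []
    for s2 in range(n_states):
        # Predict: where could the world be after taking the action?
        predicted = sum(
            belief[s] * transition[action][s][s2] for s in range(n_states)
        )
        # Correct: weight by how well each state explains the observation.
        unnormalized.append(observe[action][s2][observation] * predicted)
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return [weight / total for weight in unnormalized]


# Hypothetical two-state world with one action and two possible observations.
transition = {"wait": [[0.9, 0.1], [0.1, 0.9]]}
observe = {"wait": [[0.8, 0.2], [0.2, 0.8]]}

new_belief = belief_update([0.5, 0.5], "wait", 0, transition, observe)
print(new_belief)
```

Starting from a uniform belief, seeing observation 0 shifts the belief toward state 0, since that state is assumed to produce observation 0 more often.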

25.
edited October 2014

[Naftali Tishby](http://www.cs.huji.ac.il/~tishby/) is saying very clear and interesting things about deep issues... unfortunately there are a lot of pictures I can't type in here!
26.
edited October 2014

He's generalizing something I don't know: the [Bellman equation](https://en.wikipedia.org/wiki/Bellman_equation). Unlike the traditional Bellman equation, his takes into account the gain of information produced by actions.
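For reference, the traditional Bellman equation for a fully observed Markov decision process can be written as follows, using the ingredients listed earlier in the thread (states $s$, actions $a$, transition probabilities $P(s' \mid s, a)$ and rewards $R(s, a, s')$). The discount factor $\gamma \in [0, 1)$ is an extra assumption, and the information term from the talk is not reproduced, since these notes do not give its form.

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \, \big[ R(s, a, s') + \gamma \, V^*(s') \big].$$

Here $V^*(s)$ is the best expected total discounted reward obtainable starting from state $s$.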

27.
edited October 2014

Claim: we enjoy music because we are constantly trying to predict what'll happen next, and we get a dopamine reward when we succeed. If we fail we try to build a model that does better.

His collaborators have played simple sequences of tones to anesthetized cats, who can in fact still hear and try to predict the next tone. He compares their neural activity to what you'd expect from optimal prediction models.

The number of neural spikes after a tone is proportional to the "surprise" of that tone, that is, the negative logarithm of its probability in the Markov process whereby the tones were generated!
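The "surprise" of each tone is easy to compute once the Markov process is known. Here is a small Python sketch; the two-tone chain and its transition probabilities are made up for illustration and are not the stimuli used in the experiments.

```python
import math


def surprisals(sequence, transition):
    """Surprise -log2 P(next | current), in bits, for each step of a sequence."""
    return [
        -math.log2(transition[current][nxt])
        for current, nxt in zip(sequence, sequence[1:])
    ]


# Hypothetical two-tone Markov chain: transition[current][next] = probability.
transition = {
    "low": {"low": 0.75, "high": 0.25},
    "high": {"low": 0.5, "high": 0.5},
}
tones = ["low", "low", "high", "low"]
tone_surprisals = surprisals(tones, transition)
print(tone_surprisals)
```

Rare transitions (here, low to high with probability 0.25) get a large surprise of 2 bits, while the likely low-to-low transition gets well under 1 bit.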

Comment Source:Claim: we enjoy music because we are constantly trying to predict what'll happen next, and we get a dopamine reward when we succeed. If we fail we try to build a model that does better. His collaborators have played simple sequences of tones to anesthesized cats, who still in fact can hear and try to predict the next tone. He compares their neural activity to what you'd expect from optimal prediction models. The number of neural spikes after a tone is proportional to the "surprise" of that tone, that is, the logarithm of its probability in the Markov process whereby the tones were generated!
28.

A paper:

• Naftali Tishby and Daniel Polani, Information theory of decisions and actions.

Comment Source:A paper: * Naftali Tishby and Daniel Polani, [Information theory of decisions and actions](http://www.cs.huji.ac.il/labs/learning/Papers/IT-PAC.pdf)
29.

Claim: we enjoy music because we are constantly trying to predict what’ll happen next, and we get a dopamine reward when we succeed. If we fail we try to build a model that does better.

Too predictable music is usually boring, i.e. feels like no reward.

Comment Source:>Claim: we enjoy music because we are constantly trying to predict what’ll happen next, and we get a dopamine reward when we succeed. If we fail we try to build a model that does better. Too predictable music is usually boring, i.e. feels like no reward.
30.

Too predictable music is usually boring

Check out mirror neurons: the more predictable the tune, the more we tend to repeat it. I believe this is true even of animals that dance to the periodic drum beats or bass of human music (there are lots of them on YouTube).

If you make the sequence of notes random, then the mirror neurons will not fire, as John mentioned! In that case the pleasure response would not be triggered.

This is exactly the same as the predator-prey chase: roll a ball in front of a child or a cat, and they chase it uncontrollably.

It is the same when children and songbirds repeat the sounds of their parents.

It is the same with definitions and proofs in mathematics: they are made to be repeatable, and this process of repeating is not possible without mirror neurons. That is how children learn arithmetic and algebra.

Dara

Comment Source:>Too predictable music is usually boring Check out mirror neurons: the more predictable the tune, the more we tend to repeat it. I believe this is true even of animals that dance to the periodic drum beats or bass of human music (there are lots of them on YouTube). If you make the sequence of notes random, then the mirror neurons will not fire, as John mentioned! In that case the pleasure response would not be triggered. This is exactly the same as the predator-prey chase: roll a ball in front of a child or a cat, and they chase it uncontrollably. It is the same when children and songbirds repeat the sounds of their parents. It is the same with definitions and proofs in mathematics: they are made to be repeatable, and this process of repeating is not possible without mirror neurons. That is how children learn arithmetic and algebra. Dara
31.
edited October 2014

Now Ilya Nemenman is talking about "Predictive information".

(He's done a lot of interesting work on quantum techniques for chemical reaction networks, and knows a whole community of people working on this who I hadn't met! They have an annual conference on it!!! More on that later.)

His webpage says:

Are there phenomenological, coarse-grained, and yet functionally accurate representations of biological processes, or are we forever doomed to every detail mattering?

In my mind, the question is not if some details don't matter, but which ones. A lot of smart people have thought about this question before. The dream is that, by stripping unnecessary details, we will eventually understand the basics of how we can function reliably in an ever changing world. I hope to achieve some quantitative understanding of such complex phenomena as evolution, sensory processes, animal behavior, human cognition, and, who knows, maybe one day even human consciousness.

What can be a more noble science goal? As I argued a while back:

Studying string theory cannot be more exciting than studying the brain that can study string theory.

Comment Source:Now [Ilya Nemenman](http://nemenmanlab.org/~ilya/index.php/Ilya_Nemenman) is talking about "Predictive information". (He's done a lot of interesting work on quantum techniques for chemical reaction networks, and knows a whole community of people working on this who I hadn't met! They have an annual conference on it!!! More on that later.) His webpage says: > Are there phenomenological, coarse-grained, and yet functionally accurate representations of biological processes, or are we forever doomed to every detail mattering? > In my mind, the question is not if some details don't matter, but which ones. A lot of smart people have thought about this question before. The dream is that, by stripping unnecessary details, we will eventually understand the basics of how we can function reliably in an ever changing world. I hope to achieve some quantitative understanding of such complex phenomena as evolution, sensory processes, animal behavior, human cognition, and, who knows, maybe one day even human consciousness. > What can be a more noble science goal? As I argued a while back: > > _Studying string theory cannot be more exciting than studying the brain that can study string theory._
32.

He's done work with Andre Levchenko (who couldn't attend) on information processing by molecular pathways.

Comment Source:He's done work with Andre Levchenko (who couldn't attend) on information processing by molecular pathways.
33.

Hello John

information processing by molecular pathways

I love to write simulators and modellers for this

Dara

Comment Source:Hello John > information processing by molecular pathways I love to write simulators and modellers for this Dara
34.

Most accurate algorithm for estimating information from a discrete signal: the NSB algorithm.

Comment Source:Most accurate algorithm for estimating information from a discrete signal: the **NSB algorithm**.
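The NSB (Nemenman-Shafee-Bialek) estimator itself is a Bayesian method and too involved for a short note. To show the problem it addresses, here is a sketch of two much simpler estimators, which are explicitly not NSB: the naive "plug-in" entropy estimate, which is biased downward for small samples, and the Miller-Madow correction, which adds $(m-1)/(2N)$ nats for $m$ observed symbols and $N$ samples. Function names are illustrative.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Naive entropy estimate in nats, using observed frequencies as probabilities."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow bias correction (m - 1) / (2 N)."""
    n = len(samples)
    m = len(set(samples))  # number of distinct symbols actually observed
    return plugin_entropy(samples) + (m - 1) / (2 * n)


data = list("aaaaabbbbb")
print(plugin_entropy(data), miller_madow_entropy(data))
```

Both estimators become unreliable when the number of samples is small compared with the number of possible symbols, which is the regime where a more careful method is needed.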
35.
edited October 2014

In biology most communications reach their target only after things have changed significantly.

Predictability of the future is a deviation from extensivity of entropy: the information needed to describe past, present, and future together is less than the sum of the information needed to describe past, present and future separately. The difference is the predictive information $I_{pred}(T)$, where $T$ is the relevant time scale.

We have

$$I_{pred}(T) \ge 0$$

and subextensivity:

$$\lim_{T \to \infty} I_{pred}(T) / T = 0$$

If

$$\lim_{T \to \infty} I_{pred}(T) = \mathrm{const}$$

we have no long-range structure - e.g. a Markov process spits out new entropy at a constant rate.

More interesting case:

$$I_{pred}(T) \sim \frac{K}{2} \log T$$

We get this when we're learning $K$ parameters from $T$ samples of a Markov process - learning these parameters with better and better precision as $T$ grows. This may be related to long-range correlations as seen at critical points - but details are unknown.

Mysterious case:

$$I_{pred}(T) \sim C T^\beta$$

with $\beta \lt 1$. For example: reading English prose seems to give $\beta \approx 1/2$. As we look at longer and longer samples of English prose, there's more and more information; it seems to grow forever (from experiments done so far), but it grows sublinearly.

Comment Source:In biology most communications reach their target only _after things have changed significantly_. Predictability of the future is a deviation from extensivity of entropy: the information needed to describe past, present, and future together is less than the sum of the information needed to describe past, present and future separately. The difference is the **predictive information** $I_{pred}(T)$, where $T$ is the relevant time scale. We have $$I_{pred}(T) \ge 0$$ and subextensivity: $$\lim_{T \to \infty} I_{pred}(T) / T = 0$$ If $$\lim_{T \to \infty} I_{pred}(T) = const$$ we have no long range structure - e.g. a Markov process spits out new entropy at a constant rate. More interesting case: $$I_{pred}(T) \sim \frac{K}{2} \log T$$ We get this when we're learning $K$ parameters from $T$ samples of a Markov process - learning these parameters with better and better precision as $T$ grows. This may be related to long-range correlations as seen at critical points - but details are unknown. Mysterious case: $$I_{pred}(T) \sim C T^\beta$$ with $\beta \lt 1$. For example: reading English prose seems to give $\beta \approx 1/2$. As we look at longer and longer samples of English prose, there's more and more information; it seems to grow forever (from experiments done so far), but it grows sublinearly.
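The constant case can be checked numerically. The sketch below assumes one common way of making the definition concrete: take a past window and a future window of equal length $T$, write $S(T)$ for the entropy of a block of $T$ consecutive symbols, and set $I_{pred}(T) = 2S(T) - S(2T)$. The two-state chain is invented for illustration, and block entropies are computed by brute-force enumeration.

```python
import itertools
import math

# Hypothetical two-state Markov chain: P[i][j] = P(next = j | current = i).
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution of a two-state chain, in closed form.
a, b = P[0][1], P[1][0]
pi = [b / (a + b), a / (a + b)]


def block_entropy(T):
    """Entropy in bits of T consecutive symbols of the stationary chain."""
    total = 0.0
    for seq in itertools.product((0, 1), repeat=T):
        p = pi[seq[0]]
        for x, y in zip(seq, seq[1:]):
            p *= P[x][y]
        if p > 0:
            total -= p * math.log2(p)
    return total


def predictive_information(T):
    """Information shared between a past and a future window, each of length T."""
    return 2 * block_entropy(T) - block_entropy(2 * T)


for T in range(1, 6):
    print(T, round(predictive_information(T), 6))
```

For a first-order Markov chain $S(T)$ is exactly linear in $T$, so the printed value does not change with $T$: the past tells you nothing about the future beyond what the current state already does.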
36.
edited October 2014

Now Daniel Polani is talking about "Informational principles in the perception-action loop".

Sensors in biology are often highly optimized:

• detection of just a few molecules (moths).

• humans can detect a few photons, certain toads can detect single photons.

• human children can hear at close to the limit imposed by thermal noise; then they go to discos and lose their hearing.

Cognitive processing is very expensive: it can often use 40% or 50% of our total power!

So, there must be huge evolutionary pressure for optimized sensors and high cognitive skills.

Was man nicht im Kopf hat, muss man in den Beinen haben. ("What you don't have in your head, you must have in your legs.")

You don't need to run as fast if you can sense danger earlier.

Comment Source:Now Daniel Polani is talking about "Informational principles in the perception-action loop". Sensors in biology are often highly optimized: * detection of just a few molecules (moths). * humans can detect a few photons, certain toads can detect single photons. * human children can hear at close to the limit imposed by thermal noise; then they go to discos and lose their hearing. Cognitive processing is very expensive: it can often use 40% or 50% of our total power! So, there must be huge evolutionary pressure for optimized sensors and high cognitive skills. *Was man nicht im Kopf hat, muss man in den Beinen haben.* ("What you don't have in your head, you must have in your legs.") You don't need to run as fast if you can sense danger earlier.
37.
edited October 2014

Landauer's principle: on the lowest level, erasure of information from memory must create heat. This gives the connection between energy and information that defeats Maxwell's demon.

However, when discussing human cognition we have vast amounts of waste heat so Landauer's principle is not directly relevant.

Here Ashby (1956) said "only variety can destroy entropy". Touchette and Lloyd in 2000, 2004 studied open-loop and closed-loop controllers and saw there was a limit on how well we could reduce the entropy of the system being controlled:

• Information-theoretic approach to the study of control systems.

Supposedly they showed:

$$\Delta H_{closed} \le \Delta H_{open} + I(W_t; A_t)$$ where $I(W_t; A_t)$ is the mutual information of the world at time $t$ and the action at time $t$. Learn about this stuff! This is a cool relationship between information theory and control theory!

Comment Source:Landauer's principle: on the lowest level, erasure of information from memory must create heat. This gives the connection between energy and information that defeats Maxwell's demon. _However_, when discussing human cognition we have vast amounts of waste heat so Landauer's principle is not directly relevant. Here Ashby (1956) said "only variety can destroy entropy". Touchette and Lloyd in 2000, 2004 studied open-loop and closed-loop controllers and saw there was a limit on how well we could reduce the entropy of the system being controlled: * [Information-theoretic approach to the study of control systems](http://arxiv.org/abs/physics/0104007). Supposedly they showed: $$\Delta H_{closed} \le \Delta H_{open} + I(W_t; A_t)$$ where $I(W_t; A_t)$ is the mutual information of the world at time $t$ and the action at time $t$. _Learn about this stuff!_ This is a cool relationship between information theory and control theory!
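A one-bit toy example shows the inequality at work. This is only a sketch under assumptions of my own: $\Delta H$ is read as the entropy *reduction* achieved by the controller, the world is a single fair bit, the action either keeps or flips it, and $\Delta H_{open}$ is the best reduction available to a controller that cannot see the world. None of these modelling choices come from the talk.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def run(policy, p_world):
    """policy[w][a] = P(action a | world w); action 1 flips the bit.

    Returns (entropy reduction of the world, mutual information I(W; A)).
    """
    joint = {(w, a): p_world[w] * policy[w][a] for w in p_world for a in (0, 1)}
    p_action = {a: sum(joint[w, a] for w in p_world) for a in (0, 1)}
    new_world = {0: 0.0, 1: 0.0}
    for (w, a), p in joint.items():
        new_world[w ^ a] += p
    info = sum(p * math.log2(p / (p_world[w] * p_action[a]))
               for (w, a), p in joint.items() if p > 0)
    return entropy(p_world) - entropy(new_world), info


p_world = {0: 0.5, 1: 0.5}
open_loop = {0: {0: 1.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}    # always "keep"
closed_loop = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}  # flip only if w == 1

print("open loop:  ", run(open_loop, p_world))
print("closed loop:", run(closed_loop, p_world))
```

The blind controller cannot reduce the entropy of a fair bit at all, while the controller that observes the bit removes one full bit of entropy using exactly one bit of mutual information, so in this toy case the bound holds with equality.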
38.
edited October 2014

Too many pictures to take good notes...

Consider a Bayesian network where the world at time $t$ gives a sensation at time $t$ which gives an action at time $t$ which (together with the world) affects the world at time $t+1$. If the vector of sensations is $\mathbf{S}_t = (S_0, \dots, S_t)$, then the mutual information between $\mathbf{S}_t$ and $W_0$ (the state of the world at time zero) approaches a limit as $t \to \infty$.

Comment Source:Too many pictures to take good notes... Consider a Bayesian network where the world at time $t$ gives a sensation at time $t$ which gives an action at time $t$ which (together with the world) affects the world at time $t+1$. If the vector of sensations is $\mathbf{S}_t = (S_0, \dots, S_t)$, then the mutual information between $\mathbf{S}_t$ and $W_0$ (the state of the world at time zero) approaches a limit as $t \to \infty$.
39.
edited October 2014

The traditional work on Markov decision processes doesn't take into account the decision costs. So, let's add a cost for the amount of information that needs to be used to make the decision.

Minimize (mutual information between state of world and action) minus $\beta$ times (expected value of reward). Here $\beta$ is a Lagrange multiplier that says how much we care about a high expected reward versus how much we try to avoid using a lot of information.

There's a nice "grid world" example of a robot trying to find a specific point while walking around on a grid, in the paper by Tishby and Polani which founded this line of work.

Comment Source:The traditional work on Markov decision processes doesn't take into account the decision costs. So, let's add a cost for the amount of information that needs to be used to make the decision. Minimize (mutual information between state of world and action) minus $\beta$ times (expected value of reward). Here $\beta$ is a Lagrange multiplier that says how much we care about a high expected reward versus how much we try to avoid using a lot of information. There's a nice "grid world" example of a robot trying to find a specific point while walking around on a grid, in [the paper by Tishby and Polani](http://www.cs.huji.ac.il/labs/learning/Papers/IT-PAC.pdf) which founded this line of work.
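For a single decision (no time steps, so much simpler than the full treatment in the paper), this trade-off can be solved by a short alternating iteration of the rate-distortion type: the policy is $\pi(a|s) \propto q(a)\, e^{\beta r(s,a)}$, and $q(a)$ is the marginal distribution over actions. The sketch below is an illustration of that one-step case only; the function names and the two-state reward table are hypothetical.

```python
import math


def info_limited_policy(p_s, reward, beta, iters=200):
    """Minimize I(S; A) - beta * E[reward] for a one-step decision.

    p_s[s] is the state distribution, reward[s][a] the reward table.
    Returns (policy, q) with policy[s][a] = P(a | s) and q[a] = P(a).
    """
    n_s, n_a = len(reward), len(reward[0])
    q = [1.0 / n_a] * n_a
    for _ in range(iters):
        policy = []
        for s in range(n_s):
            weights = [q[a] * math.exp(beta * reward[s][a]) for a in range(n_a)]
            z = sum(weights)
            policy.append([w / z for w in weights])
        q = [sum(p_s[s] * policy[s][a] for s in range(n_s)) for a in range(n_a)]
    return policy, q


def mutual_information_bits(p_s, policy, q):
    """I(S; A) in bits for the given policy and its action marginal q."""
    return sum(p_s[s] * policy[s][a] * math.log2(policy[s][a] / q[a])
               for s in range(len(p_s)) for a in range(len(q))
               if policy[s][a] > 0 and q[a] > 0)


p_s = [0.5, 0.5]
reward = [[1.0, 0.0],   # in state 0, action 0 is rewarded
          [0.0, 1.0]]   # in state 1, action 1 is rewarded
for beta in (0.0, 1.0, 10.0):
    policy, q = info_limited_policy(p_s, reward, beta)
    print(beta, round(mutual_information_bits(p_s, policy, q), 4))
```

With $\beta = 0$ the agent ignores the state entirely and uses no information; as $\beta$ grows it pays for more and more of the one bit needed to always pick the rewarded action.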
40.

In biology, the success criterion is survival (really reproduction). The problem is then that feedback is too sparse: you only know you've done something wrong when you die (before having children). So, to optimize behavior it helps to have smaller rewards and punishments: pain, pleasure, unhappiness, happiness. Adaptational feedback should be dense and rich.

Comment Source:In biology, the success criterion is survival (really reproduction). The problem is then that feedback is too sparse: you only know you've done something wrong when you die (before having children). So, to optimize behavior it helps to have smaller rewards and punishments: pain, pleasure, unhappiness, happiness. _Adaptational feedback should be dense and rich._
41.
edited October 2014

"Being in control of your destiny - and knowing it - is good". This is connected to controllability and observability, two key concepts in control theory.

See Klyubin et al, 2005 and 2008 on empowerment:

The information transfer can also be interpreted as the acquisition of information from the environment by a single adapting individual: there is evidence that pushing the information flow to the information-theoretic limit (i.e. maximization of information transfer) can give rise to intricate behaviour, induce a necessary structure in the system, and ultimately adaptively reshape the system [1-3]. The central hypothesis of Klyubin et al. is that there exists "a local and universal utility function which may help individuals survive and hence speed up evolution by making the fitness landscape smoother", while adapting to morphology and ecological niche. The proposed general utility function, empowerment, couples the agent’s sensors and actuators via the environment. Empowerment is the perceived amount of influence or control the agent has over the world, and can be seen as the agent’s potential to change the world. It can be measured via the amount of Shannon information that the agent can "inject into" its sensor through the environment, affecting future actions and future perceptions. Such a perception-action loop defines the agent’s actuation channel, and technically empowerment is defined as the capacity of this actuation channel: the maximum mutual information for the channel over all possible distributions of the transmitted signal. "The more of the information can be made to appear in the sensor, the more control or influence the agent has over its sensor" – this is the main motivation for this local and universal utility function [2].

1. Klyubin, A.S., Polani, D. and Nehaniv, C.L. Organization of the information flow in the perception-action loop of evolved agents. In Proceedings of 2004 NASA/DoD Conference on Evolvable Hardware, page 177-180. IEEE Computer Society, 2004.

2. Klyubin, A.S., Polani, D. and Nehaniv, C.L. All else being equal be empowered. In M. S. Capcarrère, A. A. Freitas, P. J. Bentley, C. G. Johnson, and J. Timmis, editors, Advances in Artificial Life, 8th European Conference, ECAL 2005, volume 3630 of LNCS, page 744-753. Springer, 2005.

3. Klyubin, A.S., Polani, D. and Nehaniv, C.L. Empowerment: A Universal Agent-Centric Measure of Control. In Proc. CEC 2005. IEEE.

Abstract of the last paper:

Abstract. The classical approach to using utility functions suffers from the drawback of having to design and tweak the functions on a case by case basis. Inspired by examples from the animal kingdom, social sciences and games we propose empowerment, a rather universal function, defined as the information-theoretic capacity of an agent's actuation channel. The concept applies to any sensorimotor apparatus. Empowerment as a measure reflects the properties of the apparatus as long as they are observable due to the coupling of sensors and actuators via the environment. Using two simple experiments we also demonstrate how empowerment influences sensor-actuator evolution.

Comment Source:"Being in control of your destiny - and knowing it - is good". This is connected to _controllability_ and _observability_, two key concepts in control theory. See Klyubin _et al_, 2005 and 2008 on [empowerment](http://www.prokopenko.net/empowerment.html): > The information transfer can also be interpreted as the acquisition of information from the environment by a single adapting individual: there is evidence that pushing the information flow to the information-theoretic limit (i.e. maximization of information transfer) can give rise to intricate behaviour, induce a necessary structure in the system, and ultimately adaptively reshape the system [1-3]. The central hypothesis of Klyubin et al. is that there exists "a local and universal utility function which may help individuals survive and hence speed up evolution by making the fitness landscape smoother", while adapting to morphology and ecological niche. The proposed general utility function, **empowerment**, couples the agent’s sensors and actuators via the environment. _Empowerment is the perceived amount of influence or control the agent has over the world, and can be seen as the agent’s potential to change the world. It can be measured via the amount of Shannon information that the agent can "inject into" its sensor through the environment, affecting future actions and future perceptions._ Such a perception-action loop defines the agent’s actuation channel, and technically empowerment is defined as the capacity of this actuation channel: the maximum mutual information for the channel over all possible distributions of the transmitted signal. "The more of the information can be made to appear in the sensor, the more control or influence the agent has over its sensor" – this is the main motivation for this local and universal utility function [2]. > 1. Klyubin, A.S., Polani, D. and Nehaniv, C.L. Organization of the information flow in the perception-action loop of evolved agents. In Proceedings of 2004 NASA/DoD Conference on Evolvable Hardware, page 177-180. IEEE Computer Society, 2004. > 2. Klyubin, A.S., Polani, D. and Nehaniv, C.L. All else being equal be empowered. In M. S. Capcarrère, A. A. Freitas, P. J. Bentley, C. G. Johnson, and J. Timmis, editors, Advances in Artificial Life, 8th European Conference, ECAL 2005, volume 3630 of LNCS, page 744-753. Springer, 2005. > 3. Klyubin, A.S., Polani, D. and Nehaniv, C.L. [Empowerment: A Universal Agent-Centric Measure of Control](http://homepages.herts.ac.uk/~comqdp1/publications/files/cec2005_klyubin_polani_nehaniv.pdf). In Proc. CEC 2005. IEEE. Abstract of the last paper: > **Abstract.** The classical approach to using utility functions suffers from the drawback of having to design and tweak the functions on a case by case basis. Inspired by examples from the animal kingdom, social sciences and games we propose empowerment, a rather universal function, defined as the information-theoretic capacity of an agent's actuation channel. The concept applies to any sensorimotor apparatus. Empowerment as a measure reflects the properties of the apparatus as long as they are observable due to the coupling of sensors and actuators via the environment. Using two simple experiments we also demonstrate how empowerment influences sensor-actuator evolution.
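Since empowerment is defined as a channel capacity, it can be computed for small discrete examples with the Blahut-Arimoto iteration, a standard algorithm for channel capacity. The sketch below is a generic illustration, not the authors' code; the function name and the example channels are made up, and `channel[a][s]` is assumed to hold the probability that the sensor ends up in state `s` after the agent takes action `a`.

```python
import math


def channel_capacity_bits(channel, iters=500):
    """Capacity of a discrete channel, channel[a][s] = P(sensor s | action a)."""
    n_a, n_s = len(channel), len(channel[0])
    q = [1.0 / n_a] * n_a  # distribution over actions, to be optimized

    def output_dist(dist):
        return [sum(dist[a] * channel[a][s] for a in range(n_a)) for s in range(n_s)]

    for _ in range(iters):
        out = output_dist(q)
        # d[a] is the KL divergence (in nats) between P(. | a) and the output mix.
        d = [sum(p * math.log(p / out[s])
                 for s, p in enumerate(channel[a]) if p > 0 and out[s] > 0)
             for a in range(n_a)]
        weights = [q[a] * math.exp(d[a]) for a in range(n_a)]
        total = sum(weights)
        q = [w / total for w in weights]

    out = output_dist(q)
    return sum(q[a] * p * math.log2(p / out[s])
               for a in range(n_a) for s, p in enumerate(channel[a])
               if q[a] > 0 and p > 0)


# Four actions that each lead reliably to a distinct sensor state.
reliable = [[1.0 if s == a else 0.0 for s in range(4)] for a in range(4)]
# Three actions, two of which have the same effect.
redundant = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Two actions whose effect is garbled 10% of the time.
noisy = [[0.9, 0.1], [0.1, 0.9]]

for name, ch in (("reliable", reliable), ("redundant", redundant), ("noisy", noisy)):
    print(name, round(channel_capacity_bits(ch), 4))
```

More distinguishable outcomes mean more empowerment: redundant actions add nothing, and noise between action and sensor reduces it.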
42.

I think Nemenman was Bialek's student. They have co-authored a number of papers together, some also with Tishby. Predictivity turns up in Bialek's talks a fair bit. Nemenman has also done work on applying field theory in machine learning. His thesis looks very interesting. That and a number of his papers are on my reading list. If only I could tick papers off the list at the same rate as I add to it ... :)

Comment Source:I think Nemenman was Bialek's student. They have co-authored a number of papers together, some also with Tishby. Predictivity turns up in Bialek's talks a fair bit. Nemenman has also done work on applying field theory in machine learning. His thesis looks very interesting. That and a number of his papers are on my reading list. If only I could tick papers off the list at the same rate as I add to it ... :)
43.

Ooh videos! :)

Comment Source:Ooh [videos](https://www.birs.ca/events/2014/5-day-workshops/14w5170/videos)! :)
44.
edited October 2014

Hi Daniel,

a number of his papers are on my reading list. If only I could tick papers off the list at the same rate as I add to it … :)

I guess this is characteristic of the Azimuth project network. Now of the ~1300 papers in my Azimuth downloads folder, never mind the bookmarks, which one is the optimal "next"? :)

I went through Nemenman's publications list and copied the maths and machine learning sounding links. I'll add titles and post links to their pdfs tomorrow.

Comment Source:Hi Daniel, > a number of his papers are on my reading list. If only I could tick papers off the list at the same rate as I add to it … :) I guess this is characteristic of the Azimuth project network. Now of the ~1300 papers in my Azimuth downloads folder, never mind the bookmarks, which one is the optimal "next"? :) I went through Nemenman's publications list and copied the maths and machine learning sounding links. I'll add titles and post links to their pdfs tomorrow.
45.

Ooh videos! :)

They really should upload them to YouTube though. I am completely hooked on YouTube's speed-up feature.

Comment Source:> Ooh videos! :) They really should upload them to YouTube though. I am completely hooked on YouTube's speed-up feature.
46.
edited October 2014

Susanne Still is giving a talk.

Say we're using Bayesian updating and replacing the prior $p(y)$ by the posterior $p(y|x) = p(x|y)p(y)/p(x)$.

How much work does it take, on average, to do this?

$$\langle \Delta F(p(y) \to p(y|x)) \rangle_x = \langle F(p(y|x)) \rangle_x - F(p(y)) = \Delta E + k_B T I[x,y]$$ where $I[x,y]$ is the mutual information, i.e. the entropy of $y$ minus the entropy of $y$ given $x$.

Comment Source:Susanne Still is giving a talk. Say we're using Bayesian updating and replacing the prior $p(y)$ by the posterior $p(y|x) = p(x|y)p(y)/p(x)$. How much work does it take, on average, to do this? $$\langle \Delta F(p(y) \to p(y|x)) \rangle_x = \langle F(p(y|x)) \rangle_x - F(p(y)) = \Delta E + k_B T I[x,y]$$ where $I[x,y]$ is the mutual information, i.e. the entropy of $y$ minus the entropy of $y$ given $x$.
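To get a feel for the size of the $k_B T\, I[x,y]$ term, here is a small sketch that computes the mutual information of a joint distribution and converts it to joules. It assumes $\Delta E = 0$, measures $I[x,y]$ in nats (so that multiplying by $k_B T$ gives an energy), and uses a temperature of 300 K; the joint distributions are invented examples.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def mutual_information_nats(joint):
    """joint[x][y] = p(x, y). Returns I[x, y] = H(y) - H(y | x) in nats."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_y = -sum(p * math.log(p) for p in p_y if p > 0)
    h_y_given_x = -sum(p * math.log(p / p_x[i])
                       for i, row in enumerate(joint) for p in row if p > 0)
    return h_y - h_y_given_x


def update_cost_joules(joint, temperature_kelvin):
    """k_B * T * I[x, y], i.e. the average free energy change when Delta E = 0."""
    return K_B * temperature_kelvin * mutual_information_nats(joint)


perfect = [[0.5, 0.0], [0.0, 0.5]]     # x determines y: one full bit learned
useless = [[0.25, 0.25], [0.25, 0.25]]  # x says nothing about y

print(update_cost_joules(perfect, 300.0))
print(update_cost_joules(useless, 300.0))
```

Learning one bit at room temperature corresponds to a few times $10^{-21}$ joules, the same scale that appears in Landauer's principle, while an observation that carries no information about $y$ costs nothing by this measure.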
47.
edited October 2014

In many cases $\Delta E$ is zero.

Then define

$$L(U) := \min_{p(y|x) \;\text{such that}\; \langle u(x,y) \rangle = U} \langle \Delta F \rangle$$ where $u(x,y)$ is some utility function.

Comment Source:In many cases $\Delta E$ is zero. Then define $$L(U) := \min_{p(y|x) \;\text{such that}\; \langle u(x,y) \rangle = U} \langle \Delta F \rangle$$ where $u(x,y)$ is some utility function.
48.

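The following titles in Ilya Nemenman's publications looked interesting and possibly even relevant to other non-biological Azimuth work.

• Bryan C. Daniels and Ilya Nemenman, Automated adaptive inference of coarse-grained dynamical models in systems biology (2014)

• David J. Schwab, Ilya Nemenman and Pankaj Mehta, Zipf's law and criticality in multivariate data without fine-tuning (2013)

• Martin Tchernookov and Ilya Nemenman, Predictive information in a nonequilibrium critical model (2012)

• Ilya Nemenman, Inference of entropies of discrete random variables with unknown cardinalities (2002)

• N. A. Sinitsyn and I. Nemenman, Time-dependent corrections to effective rate and event statistics in Michaelis-Menten kinetics (2010)

• Golan Bel and Ilya Nemenman, Ergodic and Nonergodic Anomalous Diffusion in Coupled Stochastic Processes (2009)

• Ilya Nemenman, Fariel Shafee, William Bialek, Entropy and inference, revisited (2001)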

Comment Source:The following titles in Ilya Nemenman's [publications](http://nemenmanlab.org/~ilya/index.php/Publications) looked interesting and possibly even relevant to other non-biological Azimuth work. * Bryan C. Daniels and Ilya Nemenman, [Automated adaptive inference of coarse-grained dynamical models in systems biology (2014)](http://arxiv.org/abs/1404.6283) * David J. Schwab, Ilya Nemenman and Pankaj Mehta, [Zipf's law and criticality in multivariate data without fine-tuning (2013)](http://arxiv.org/abs/1310.0448) * Martin Tchernookov and Ilya Nemenman, [Predictive information in a nonequilibrium critical model (2012)](http://arxiv.org/abs/1212.3896) * Ilya Nemenman, [Inference of entropies of discrete random variables with unknown cardinalities (2002)](http://arxiv.org/abs/physics/0207009) * N. A. Sinitsyn and I. Nemenman, [Time-dependent corrections to effective rate and event statistics in Michaelis-Menten kinetics (2010)](http://arxiv.org/abs/1001.4212) * Golan Bel and Ilya Nemenman, [Ergodic and Nonergodic Anomalous Diffusion in Coupled Stochastic Processes (2009)](http://arxiv.org/abs/0901.4785) * Ilya Nemenman, Fariel Shafee, William Bialek, [Entropy and inference, revisited (2001)](http://arxiv.org/abs/physics/0108025)
49.

This one caught my eye:

• William Bialek, Ilya Nemenman, Naftali Tishby, Complexity Through Nonextensivity.

Abstract:

The problem of defining and studying complexity of a time series has interested people for years. In the context of dynamical systems, Grassberger has suggested that a slow approach of the entropy to its extensive asymptotic limit is a sign of complexity. We investigate this idea further by information theoretic and statistical mechanics techniques and show that these arguments can be made precise, and that they generalize many previous approaches to complexity, in particular unifying ideas from the physics literature with ideas from learning and coding theory; there are even connections of this statistical approach to algorithmic or Kolmogorov complexity. Moreover, a set of simple axioms similar to those used by Shannon in his development of information theory allows us to prove that the divergent part of the subextensive component of the entropy is a unique complexity measure. We classify time series by their complexities and demonstrate that beyond the 'logarithmic' complexity classes widely anticipated in the literature there are qualitatively more complex, 'power--law' classes which deserve more attention.

Comment Source:This one caught my eye + William Bialek, Ilya Nemenman, Naftali Tishby [Complexity Through Nonextensivity](http://arxiv.org/abs/physics/0103076) Abstract: > The problem of defining and studying complexity of a time series has interested people for years. In the context of dynamical systems, Grassberger has suggested that a slow approach of the entropy to its extensive asymptotic limit is a sign of complexity. We investigate this idea further by information theoretic and statistical mechanics techniques and show that these arguments can be made precise, and that they generalize many previous approaches to complexity, in particular unifying ideas from the physics literature with ideas from learning and coding theory; there are even connections of this statistical approach to algorithmic or Kolmogorov complexity. Moreover, a set of simple axioms similar to those used by Shannon in his development of information theory allows us to prove that the divergent part of the subextensive component of the entropy is a unique complexity measure. We classify time series by their complexities and demonstrate that beyond the 'logarithmic' complexity classes widely anticipated in the literature there are qualitatively more complex, 'power--law' classes which deserve more attention.
50.

there are even connections of this statistical approach to algorithmic or Kolmogorov complexity

Kolmogorov Complexity ROCKS! Best tool at the hands of our theoretical master here, its mathematical machinery fully developed based on measure theory.

Comment Source:>there are even connections of this statistical approach to algorithmic or Kolmogorov complexity Kolmogorov Complexity ROCKS! Best tool at the hands of our theoretical master here, its mathematical machinery fully developed based on measure theory.