"Reanalysis of business visits from deployments of a mobile phone app"

Today I pushed out a project I've been working on for a little over a month: a semi-mathematical tutorial on mark-recapture methods, sometimes called "multi-list methods". The principal announcement is at my blog, but I have put up the code both in my own non-GitHub repository and at the Azimuth-accessible repo.

I don't dwell on this at the blog, but I do a bit in the accompanying tutorial paper: this was only possible because Yauck, Rivest, and Rothman, in the related paper published in the Journal of the American Statistical Association, made all their data available to the public. Thus I was able to use it for a reanalysis, a practice which has its scientific and statistical uses, but which also lets people use actual data for teaching and other purposes. Such publication of data sets is now the norm in statistical journals, and in many scientific ones, but in some fields, such as Internet measurement, which I criticize, it is typically not done. I cite papers there which used mark-recapture for Internet measurement purposes, but you can't say much more about them since their datasets are not available.

Yauck, Rivest, and Rothman, thankfully, chose to publish in JASA, which is not the typical outlet for an Internet measurement paper. Moreover, it's a tribute to their work, and principally, I believe, to Rivest's, that the method of estimating equations they published was worthy of JASA rather than, say, being sent off to the Journal of Computational and Graphical Statistics.

The technique I pursued there, for open populations and for ones where the probability of capture is not uniform, is the one developed by R. Tanaka, and it is not well known in the mark-recapture literature. This is striking, since it is just regression, and the means of generalizing from Tanaka's insights are straightforward. People in the business go off and create specialized likelihood and probability models and then, when a dataset is in hand, they try all of them, and the ones which seem to work well are embraced as an accurate depiction. As a Bayesian I have some issues with this kind of multi-model shotgun approach, but, that said, my own work with the Tanaka technique is not Bayesian, nor is the R package, segmented, by Professor V. M. R. Muggeo, a Bayesian approach.

Tanaka's technique is worthwhile because it is so transparent compared with some of the others.

These methods have important uses in public and health policy, for instance, counting numbers of intravenous drug addicts: see the references in the tutorial, including papers by Lavallée and Rivest (yes, the same), and by Bird and King.
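
For anyone who hasn't met the idea, here is a minimal sketch of the simplest two-list case, in Python. It is the textbook closed-population Lincoln-Petersen estimator with Chapman's small-sample correction, not Tanaka's technique and not the estimating-equations method of Yauck, Rivest, and Rothman. The function names and the counts are made up purely for illustration.

```python
def lincoln_petersen(n1, n2, m):
    """Two-list population estimate: n1 * n2 / m.

    n1: individuals on the first list (the "marked" ones)
    n2: individuals on the second list
    m:  individuals appearing on both lists (the "recaptures")
    Assumes a closed population and equal capture probability.
    """
    if m <= 0:
        raise ValueError("need at least one individual on both lists")
    if m > min(n1, n2):
        raise ValueError("overlap cannot exceed either list size")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's variant, which is less biased when the overlap m is small."""
    if m < 0 or m > min(n1, n2):
        raise ValueError("overlap must lie between 0 and the smaller list size")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


if __name__ == "__main__":
    # Hypothetical counts, not taken from any dataset mentioned here.
    n1, n2, m = 200, 150, 30
    print("Lincoln-Petersen:", lincoln_petersen(n1, n2, m))
    print("Chapman:", round(chapman(n1, n2, m), 2))
```

The whole trick is that the fraction of the second list already seen on the first list estimates the fraction of the population that the first list covers. Everything harder (open populations, non-uniform capture probability) is about what to do when that assumption fails.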

This is also pretty much the kind of work I did while at Akamai, although not only using mark-recapture methods. There are ways of using series to estimate sizes of subpopulations, and classifying them, too. I'm working on several projects to get some of the techniques I developed out in public, since the ones I used to work on are unfortunately now lost to the dark caves of proprietary information.
