Any interesting medium-sized datasets (1-4GB)?

I've been away from computers a lot recently, and what little time I have had has mostly been spent working on an improved mechanism for easy-to-write data crunching (in particular, model-based exploratory data analysis) for the language Julia. (This has partly been motivated by the well-known observation that programming languages incur "re-implementation costs for infrastructure", which means new languages only take off when they have new elements compelling enough to overcome this cost. Otherwise they just lie on the shelf unused.) Julia seems to have both the compelling new elements and enough general momentum that it might take off. So getting some data crunching (the way I, in my wisdom, think it's best done) in there might make it easier and more productive to work with data, including ecological data. (This is a roundabout way of saying this is a bit related to Azimuth.)

Anyway, I'm reaching the point of being able to run stuff, and I'm looking for any interesting, primarily numerical data sets to crunch on (some categorical elements would be fine). I'm looking for stuff in the 1-4GB range (enough to fit in the main memory of a recent workstation, not so big that splitting it over multiple machines is strongly advised). I've found there's a reasonable selection of financial or social-network-type datasets available, but it would be nice to feel that at the very least my testing was working on environmental data, if only to show that it's something that can be done.
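For anyone wanting to check whether a candidate dataset lands in that 1-4GB window before downloading it, here is a rough back-of-envelope sketch in Python. It assumes a dense table stored as 8-byte floats and ignores any categorical columns, indexes, or container overhead; the function names `dense_size_gib` and `fits_target` are made up for illustration and do not come from any library.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate in-memory size, in GiB, of a dense numeric table.

    Assumption: every value takes `bytes_per_value` bytes (8 for a
    double-precision float) and there is no per-row or per-column overhead.
    """
    return n_rows * n_cols * bytes_per_value / 2**30


def fits_target(n_rows, n_cols, low_gib=1.0, high_gib=4.0, bytes_per_value=8):
    """Return True if the estimated size lies within [low_gib, high_gib]."""
    size = dense_size_gib(n_rows, n_cols, bytes_per_value)
    return low_gib <= size <= high_gib


if __name__ == "__main__":
    # Hypothetical shapes, purely for illustration.
    for rows, cols in [(1_000_000, 10), (50_000_000, 10), (500_000_000, 10)]:
        size = dense_size_gib(rows, cols)
        verdict = "in range" if fits_target(rows, cols) else "out of range"
        print(f"{rows} rows x {cols} cols: about {size:.2f} GiB ({verdict})")
```

The estimate is only a lower bound on what a real loader will use, since text parsing and intermediate copies can temporarily need considerably more memory than the final array.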

Obviously this is an odd request: caring about data size rather than content means it's difficult to look up on the main wiki, so I thought I'd briefly ask here.

Comments

  • 1.

    Hi David,

    Serendipitously, I just got an email notifying me of the release of the [NOAA climate and weather reforecast data set](http://esrl.noaa.gov/psd/forecasts/reforecast2/README.GEFS_Reforecast2.pdf), from which you can download exactly the amount of data you need. hth.
  • 2.
    edited September 2012

    I'm not succeeding in finding databases listed by their _size_, but NOAA is good for data:

    * The [Integrated Surface Database](http://www.ncdc.noaa.gov/oa/climate/isd/index.php) has a total of 500 gigabytes, but you don't need to download all of it!
    * Their [Online Climate Data Directory](http://www.ncdc.noaa.gov/oa/climate/climatedata.html#search) may also be a good point of access.

    But Jim's suggestion sounds good.
  • 3.

    Many thanks for the suggestions. I'll take a look.

  • 4.

    You could try looking through the [PCMDI CMIP5 archive](http://cmip-pcmdi.llnl.gov/cmip5/) of climate model simulation output, if you sign up for an account and learn their web interface. The [KNMI Climate Explorer](http://climexp.knmi.nl/) has a mirror of some of this with an easier interface. Many of the files are in [NetCDF](http://www.unidata.ucar.edu/software/netcdf/) format.