Dpdata

Datapository Datasets

All raw data is stored in /data/raw and then organized by datatype. Some of that data is also stored in a Postgres database (dp). Individual datatypes are described below.

BGP

/data/raw/bgp/

Routeviews

This dataset is public.

The "all updates" data is stored as a mirror of the raw Routeviews FTP site, recompressed with lzma instead of bzip2. We have not loaded it all into the database; so far we analyze it only with Hadoop or DpMPP, since it represents hundreds of billions of BGP updates.

Data            Location                    Database table
All updates     /mnt/sn002_01/raw/rv_tmp/   no database entries
Select updates  /data/raw/bgp/updates/      bgpup_routeviews_oix
Select tables   /data/raw/bgp/tables        bgptables_routeviews_oix
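
Because the "all updates" mirror is lzma-compressed, a quick way to sanity-check files before launching a larger Hadoop or DpMPP job is to stream-decompress them with Python's standard lzma module. The following is a minimal sketch, not an official tool: the helper names are hypothetical, and the .lzma/.xz file suffixes are an assumption about how the mirror names its files.

```python
import lzma
from pathlib import Path


def iter_lzma_files(root):
    """Yield lzma-compressed files under root, in sorted order.

    Assumption: compressed files end in .lzma or .xz; adjust the
    suffix set if the mirror uses a different naming scheme.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in {".lzma", ".xz"}:
            yield path


def decompressed_size(path, chunk_size=1 << 20):
    """Stream-decompress one file and return its uncompressed byte count.

    Reading in chunks keeps memory use flat even for very large files.
    lzma.open auto-detects the legacy .lzma and the .xz container formats.
    """
    total = 0
    with lzma.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

For example, `for f in iter_lzma_files("/mnt/sn002_01/raw/rv_tmp/"): print(f, decompressed_size(f))` would walk the mirror and report sizes. Parsing the decompressed BGP records themselves is out of scope for this sketch.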

Spam

This dataset can be used by permission of Nick Feamster.

/data/raw/spam/

The database version is currently being updated.

DNS root zone information

This dataset can be used by permission of David Andersen.

It is a full copy of the DNS root, .com, and .net zones.

/data/raw/dnsroot

Netflow datasets

  • Abilene data, least significant bits zeroed. Used by permission of dga/feamster. May not be exported from this machine.
  • Geant data. Used by permission of feamster. May not be exported.
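
To illustrate what bit zeroing does to an address, here is a minimal sketch. It assumes the zeroing applies to IPv4 addresses in the flow records; the function name is hypothetical, and the number of bits is a parameter because this page does not say how many bits were cleared in the Abilene data.

```python
import ipaddress


def zero_low_bits(addr, bits):
    """Return addr with its `bits` least significant bits set to zero.

    Assumption: addr is an IPv4 address in dotted-quad form. The bit
    count is illustrative only and is not taken from the dataset.
    """
    if not 0 <= bits <= 32:
        raise ValueError("bits must be between 0 and 32")
    ip = ipaddress.IPv4Address(addr)
    # Build a 32-bit mask whose low `bits` bits are zero.
    mask = (0xFFFFFFFF << bits) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(int(ip) & mask))
```

With 8 bits cleared, `zero_low_bits("192.0.2.77", 8)` gives `"192.0.2.0"`, so every host in the same /24 collapses to one value. Analyses that need per-host resolution will not work on data anonymized this way.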

In addition, we have CMU-internal netflow traces stored elsewhere. Contact dga, George Nychis, and Vyas Sekar for information. We expect to bring these online soon, but they will carry serious access restrictions.

Nick Feamster has Georgia Tech netflow data available at the Georgia Tech Datapository. The data is restricted, but it may be possible to run analyses against it without exporting the actual data. Contact Nick for more information.

Packet Traces

  • PREDICT /data/raw/predict - anonymized traces from LBL/ICSI. Public.

Email datasets

  • The Enron email corpus: /data/raw/email/enron (public)
  • Archives of the NANOG email list from 1994-2006: /data/raw/nanog (public)

Search datasets

  • The AOL search query sample database (not yet moved into an official location)

End-to-end probing datasets

  • RON monitoring data from 2000-2004: /data/raw/ronmon (public)

Other datasets

  • Roofnet SIGCOMM 2004 traces: /data/raw/datasets/roofnet-sigcomm04.tar.bz2
  • DOT mail traces from a research group (stores mail hashes). Not public; we will provide samples and run queries against the data for you. Records content hashes of various sorts, hashes of to/from addresses, size, etc.