Datapository
Wiki space for use in the datapository project. Note that all information on these pages is public...
Please put Datapository documentation at dpdoc. For now, data is documented at Dpdata.
The Switchover to CMU datapository
Things that are broken
- More than half of the example queries refer to data that we don't have loaded
- The proxy for AS whois lookups is broken
- DpMPP clients stick around after computation is done
Current Efforts
GT upgrade (avr)
- UID remapping: UIDs must be consistent with Emulab-assigned UIDs.
- Ubuntu upgrade: Need to be running the same version as CMU (Gibbon).
- Get Emulab-ified: Install and configure the Emulab account management scripts. Add users to the right groups on Emulab.
Data Collection
- Check on RON testbed BGP data collection
- Check on root nameserver monitoring
- Check on all-pairs probing
- Fix all broken feeds
- Get pulling working again
- Get rotating and archiving and auto-cleaning on remote nodes working
- Deal with backlog of old collected and archived and probably broken data
Documentation
- Cleanup and clarification pass on documentation
- Inventory of available datasets
- Update existing libraries and modules to have RDoc info
- Automate documentation installation as part of make / make install
Data Insertion
- Create BGP table insertion class (BGP up is working) -- Done
- Create spam insertion class (feamster + GT people doing now)
- Create netflow insertion class
- After RON testbed cleanup: Create all-pairs and nameserver insertion mechanisms (later!)
Setup and Installation
- Install the table definition files in share/table_defs/
- Single script to set up database and get things running
- Excludes database creation, which must be done as a privileged user (put this in the docs)
- Autoconfify or paths.confify the database to use
- Add to docs: Centralized logging setup (machine-setup.txt, I assume) (jpmoss)
- FIX the authentication stuff
Federation and mirroring
- jpmoss is getting rsync between datapositories working (autogenerated from feeds table) [done]
- Convenience script to say "mirror everything public on this DP" (needed) [done]
- use xmlrpc feeds query
- A DP must not be behind a NAT if it's going to get mirrored.
- Anirudh: Redoing pull scripts in some way
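The idea of autogenerating rsync invocations from the feeds table could look roughly like the sketch below. This is only an illustration under stated assumptions: the row fields `name`, `path`, and `public`, the host name, and the local mirror root are all hypothetical, and the real dp.feeds schema and jpmoss's scripts may differ.

```python
def rsync_commands(feeds, remote_host, local_root):
    """Build one rsync command (as an argv list) per public feed.

    `feeds` is an iterable of dicts with hypothetical keys
    'name', 'path', and 'public'; the real feeds table may differ.
    """
    commands = []
    for feed in feeds:
        if not feed["public"]:
            continue  # only mirror feeds marked public
        # Trailing slashes make rsync copy directory contents, not the directory itself.
        source = f"{remote_host}:{feed['path'].rstrip('/')}/"
        dest = f"{local_root.rstrip('/')}/{feed['name']}/"
        commands.append(["rsync", "-az", "--partial", source, dest])
    return commands


if __name__ == "__main__":
    # Hypothetical rows standing in for a query against the feeds table.
    example_feeds = [
        {"name": "routeviews_oix", "path": "/data/routeviews_oix", "public": True},
        {"name": "internal_feed", "path": "/data/internal_feed", "public": False},
    ]
    for argv in rsync_commands(example_feeds, "dp.example.org", "/srv/mirror"):
        print(" ".join(argv))
```

Returning argv lists rather than shell strings means the commands can be passed straight to `subprocess.run` without quoting problems.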
Other
- Add pull scripts to install targets automake/autoconf/etc. (jpmoss)
- Start fetching all of routeviews (nick can you look at this?)
- Depends on having pull scripts installed!
- update dp.feeds set pull='y' where name='routeviews_oix' (done)
- Sanity check dp permissions, etc. (dga not working)
Todo later
- Automate permissions management
- Emulab hooks for postgres account creation
- Emulab hooks for setting permissions on postgres
- Need to get Emulab group information propagated
- Automatic linking of papers to datasets
- Workflow!
- Postgres 8.3 when 8.3.1 comes out (to be done with next Ubuntu release - probably summer 2008)
- Enumerated types
- More efficient data storage
- Better ORDER BY ... LIMIT handling, very useful for doing small bits of a large table
- Get our patches more integrated into bgpdump. sigh.
- Fix the AS grapher some day
Completed Efforts
- Initial population of dp schemas (jpmoss) (collect/insert/schema_create)
The Fricking Datapository Paper
- Must decide what we're doing here (everyone but mostly nick & dave)
- Topics and contribution nuggets:
- Data analysis patterns for network data analysis:
- Window-based joins
- Change detection (e.g., fast flux)
- What can we get out of Nychis's IMC paper?
- Management - things like recompressing data
- Error reporting and other provenance issues
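As a concrete picture of what a window-based join might mean here, the following is a minimal sketch that pairs events from two streams when they share a key and occur within a fixed number of seconds of each other. The tuple layout `(timestamp, key, value)` and the sample data are assumptions made for illustration; this is not the project's actual implementation.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def window_join(left, right, window):
    """Join two event lists on key, keeping pairs within `window` seconds.

    Each event is a (timestamp, key, value) tuple (an assumed layout).
    Returns (key, left_ts, left_value, right_ts, right_value) tuples.
    """
    # Group right-hand events by key, sorted by timestamp for binary search.
    grouped = defaultdict(list)
    for ts, key, value in right:
        grouped[key].append((ts, value))
    index = {}
    for key, rows in grouped.items():
        rows.sort(key=lambda r: r[0])
        index[key] = ([r[0] for r in rows], [r[1] for r in rows])

    joined = []
    for ts, key, value in left:
        if key not in index:
            continue
        times, values = index[key]
        lo = bisect_left(times, ts - window)   # first right event inside the window
        hi = bisect_right(times, ts + window)  # one past the last right event inside it
        for i in range(lo, hi):
            joined.append((key, ts, value, times[i], values[i]))
    return joined


if __name__ == "__main__":
    # Hypothetical data: routing events on the left, probe losses on the right.
    updates = [(100, "10.0.0.0/8", "withdraw")]
    probes = [(95, "10.0.0.0/8", "loss"), (200, "10.0.0.0/8", "loss")]
    print(window_join(updates, probes, window=10))
```

In a database the same pattern would typically be a join with a `BETWEEN` condition on the timestamps; the in-memory version just makes the windowing explicit.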
HowTos
Keeping datapositories in sync
Packages
- Getting the list of packages from sn001@cmu:
 ssh sn001.datapository.net dpkg --get-selections > file
 dpkg --set-selections < file
 apt-get dselect-upgrade
(Beware: This runs a significant risk of incrementally gaining more and more packages that never seem to disappear. We'll have to do manual pruning from time to time.)
- When you add packages locally that you want to persist, notify James so he can also add them on sn001 and note that we really do want those packages in the long term.
- Update doc/building to reflect the packages that we really want installed
- This should probably be programmatic!
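One way the package bookkeeping could be made programmatic is to diff two saved `dpkg --get-selections` listings and report what differs. The sketch below assumes each listing has been saved as text with one `package state` pair per line; the function names are made up for illustration.

```python
def parse_selections(text):
    """Return the set of package names marked 'install' in a selections listing."""
    selected = set()
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "<package> <state>"; skip blank or unexpected lines.
        if len(parts) == 2 and parts[1] == "install":
            selected.add(parts[0])
    return selected


def diff_selections(reference_text, local_text):
    """Compare a reference listing (e.g. from sn001) against the local one.

    Returns (extra_locally, missing_locally), each a sorted list of names.
    """
    reference = parse_selections(reference_text)
    local = parse_selections(local_text)
    return sorted(local - reference), sorted(reference - local)


if __name__ == "__main__":
    # Hypothetical listings standing in for saved dpkg --get-selections output.
    reference = "bzip2\t\tinstall\nrsync\t\tinstall\noldpkg\t\tdeinstall\n"
    local = "bzip2\t\tinstall\nlocaltool\t\tinstall\n"
    extra, missing = diff_selections(reference, local)
    print("only local (pruning candidates):", extra)
    print("missing locally:", missing)
```

Packages that show up only locally are candidates either for manual pruning or for adding to the long-term list in doc/building.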
Database Schema
- For now: Email to everyone the commands they need to run to update.
- Later: We'll figure something out.
Feeds?
Tools to eventually look at
- Nutch (open source mapreduce)
- Hadoop (In progress - Hadoop is installed, upgrading to Hadoop on Demand as we speak - 2/19/2008)
- Ideas from MapReduce, Sawzall, other Google stuff (see above)
- BGP analysis BOF at NANOG
- Parallel BZIP2
- Parallel LZMA?
