Dpdoc
Datapository Documentation
Getting a Datapository Account
Datapository accounts are managed through Emulab. You must use an SSH public key to log in to the Datapository. To get your Datapository account, go to the Emulab Join Project Page and request to join the project named datapository.
For help setting up SSH public key authentication, see:
Using the Datapository
There are two major aspects of using the Datapository: the code and utilities, and the machines on which you can run analyses.
Datapository Code
The Datapository libraries are installed in /usr/local/datapository. You can check out your own copy with:
svn co svn+ssh://YOUR_USERNAME@ops.emulab.net/proj/datapository/svn/datapository/
In doc/STYLE, you will find the general guidelines for keeping the code base consistent.
In src/trunk, you'll see a few sets of useful things:
- lib - Datapository convenience functions, mostly for accessing the database.
- See /usr/local/datapository/Dpdb.rb for a good starting point.
- util - Utility programs. Notable among them are programs such as recompress.rb, which use the DpMPP.rb framework to schedule massively parallel computations on every node in the cluster. You can start these using the start_clients.rb script.
Always test your changes. You will find convenience scripts for this in src/etc. After generating an apache2.conf file that uses a non-privileged port, run apache2 -X -f <apache2 conf that was just generated>. Please remember to shut down your temporary web server when you are done.
Datapository nodes
There are 14 datapository nodes, but only 2 are visible to the world:
- sn001.datapository.net (2x Xeon 5130, dual core, @ 2.0 GHz, 8 GB RAM, 6 TB RAID)
- sn002.datapository.net (2x Xeon 5130, dual core, @ 2.0 GHz, 8 GB RAM, 6 TB RAID)
The remaining 4 sn* nodes and the cn* nodes can be reached only through sn001 or sn002 (described below).
- sn003 - sn006: (2x Xeon E5440, quad core, @ 2.83 GHz)
sn* nodes are storage nodes. Each of these has 6 terabytes of disk space attached. sn003, sn004, sn005, and sn006 are behind the firewall.
In addition, there are 8 "computation" nodes:
- cn001.datapository.net
- cn002.datapository.net
- ...
These nodes can be accessed from either sn001 or sn002, or you can ssh to them directly through a port relay on sn001 at ports 2201, 2202, 2203, ..., 2208. We recommend creating .ssh/config entries like:
Host cn001
    Hostname sn001.datapository.net
    Port 2201
...
Host sn003
    Hostname sn001.datapository.net
    Port 3203
(Note that the sn003 to sn006 nodes use a different port range.)
You can download the entire config.txt file and append it to your .ssh/config file.
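If you would rather generate the compute-node entries than type them, a short loop can emit them. This is a minimal sketch, assuming that cnNNN is relayed at port 2200 + N on sn001, which matches the cn001 example and the 2201 to 2208 range given above. The function name gen_cn_entries is hypothetical. The sn003 to sn006 entries are left out because only the sn003 port is stated here.

```shell
#!/bin/sh
# Print .ssh/config entries for the compute nodes cn001..cn008.
# Assumption: cnNNN is relayed on sn001 at port 2200 + N.
gen_cn_entries() {
    i=1
    while [ "$i" -le 8 ]; do
        printf 'Host cn%03d\n' "$i"
        printf '    Hostname sn001.datapository.net\n'
        printf '    Port %d\n' "$((2200 + i))"
        i=$((i + 1))
    done
}

# Print the entries so they can be reviewed before use.
gen_cn_entries
```

After checking the output, you could append it with gen_cn_entries >> ~/.ssh/config. The downloadable config.txt remains the authoritative source for the port numbers.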
Running parallel computations
We have two frameworks for running parallel computations on the datapository: DpMPP and Hadoop.
DpMPP
DpMPP is a home-brewed parallel program runner. It's probably similar to every other home-brewed program runner in existence, but it gets the job done for simple things like recompressing terabytes of packet traces from gzip to lzma.
Because DpMPP reads from the NFS filesystem, it works well only for compute-bound jobs. A common way we use it is to read highly compressed data from the NFS filesystem and then process it on the remote nodes, which cuts down the fileserver load and allows very effective parallelization.
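To make the pattern concrete, here is a sketch of the kind of per-file work a node might do when recompressing a trace from gzip to lzma. It is not the actual recompress.rb code and does not use DpMPP. The helper name recompress_one and the .lzma output naming are assumptions, and it presumes the standard gzip and xz tools are installed on the node.

```shell
#!/bin/sh
# Recompress one gzip file to LZMA without storing the uncompressed data.
# Hypothetical helper for illustration; not the real recompress.rb.
recompress_one() {
    src=$1                    # e.g. trace.gz
    dst=${src%.gz}.lzma       # e.g. trace.lzma
    # Decompress to stdout and pipe straight into the LZMA compressor, so
    # only compressed bytes are read from and written to the filesystem.
    # Caveat: plain sh reports only the exit status of xz, not of gzip.
    gzip -dc "$src" | xz --format=lzma -c > "$dst.tmp" &&
        mv "$dst.tmp" "$dst"  # rename only after compression succeeded
}
```

Calling recompress_one trace.gz would leave trace.lzma next to the original. The CPU-heavy decompression and compression happen on the node running the function, which is why this kind of job parallelizes well even though the data lives on NFS.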
Hadoop
Hadoop 0.16 is installed on the cluster in /data/users/hadoop. To run Hadoop, type:
/data/users/hadoop/hadoop/bin/hadoop
In general, Hadoop has higher overhead, but it can take advantage of the storage on the nodes if you're doing something that requires fast storage access.
Datapository Data
See Dpdata
Datapository Database
Much Datapository data is also stored in a Postgres database on sn001. To access this database, run:
psql dp
Then, at the psql prompt, set the schema search path:
set search_path='dp';
You can then list the available tables using \d and run queries, e.g.
select * from bgpup_abilene_atla limit 3;
Or include the AS path:
select * from bgpup_abilene_atla natural join bgpup_abilene_atla_aspath limit 3;
(Note that the latter query may not actually give you any aspaths. Some entries have an empty aspath and they sometimes come up first in that query.)
