FAWN
The FAWN Page
- fakebook [1]
- showgraph [2]
- Storage Statistics for stuff we (or Milo/Jiri) have benchmarked: Storage Stats
Related Work / Links
- Google's "The Case for Energy-Proportional Computing" [3]
- IBM Blue Gene [4]
- Mehul's SIGMOD07 paper on JouleSort - Energy-efficient computing [5].
- Flash writes get slower as the flash fills up: Tevero filesystem FAQ
- Flash Data Structures/Algos Survey [6]. Good background reading.
- Power consumption numbers of existing systems - Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments [7]
- More numbers - Energy cost, the key challenge of today's data centers: a power consumption analysis of TPC-C results [8]
- Cute USB RAID [11]
- Intel's X25-M SSD [12]
- Facebook and memcache consistency [13]
- Flashing Up The Storage Layer [16]
- SSD Talk:
- http://perspectives.mvdirona.com/2008/11/13/IntelsSolidStateDrives.aspx
- http://perspectives.mvdirona.com/2008/10/15/WhenSSDsMakeSenseInServerApplications.aspx
- http://perspectives.mvdirona.com/2008/10/19/WhenSSDsMakeSenseInClientApplications.aspx
- http://torvalds-family.blogspot.com/2008/10/so-i-got-one-of-new-intel-ssds.html
Alix3c2
- To access BIOS + Activate Netboot
- set /dev/ttyS0 to 38400 baud
- Press 's' during memtest
- set baud rate to 19200 (Press 2)
- set repeatable PXE netboot (E in BIOS)
- save and exit
- Netboot and use 19200 baud for /dev/ttyS0
- Alix3c2 has an AMD Geode LX processor
Reinstalling Clusterhead and Setting-up Netboot (for the Alix3c2 wimpies)
- install ubuntu (~500G drive)
- /dev/sda1 ==> / ==> ext3 (10GB)
- /dev/sda5 ==> /nfsroot ==> ext3 (10GB)
- /dev/sda7 ==> /home ==> ext3 (460GB)
- /dev/sda2 ==> extended (with /dev/sda5 as swap - 4GB)
- create all users you want to support on clusterhead/cluster
- create a default user. sudo su to get root. copy over the passwd, shadow, group files from backup (/media/disk/clusterhead_2.0/). If all users are in group admin then they get sudo rights automatically
- Let people copy over whatever files they need from the backup into their home directories. Permissions should be set properly on the home directories to allow them to log in properly.
- /etc/network/interfaces should look like this
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
    address 128.2.218.116
    netmask 255.255.0.0
    gateway 128.2.223.254
auto eth1
iface eth1 inet static
    address 10.79.0.1
    netmask 255.255.255.0
auto eth2
iface eth2 inet static
    address 10.78.0.1
    netmask 255.255.255.0
auto ath0
#iface ath0 inet dhcp
auto wlan0
#iface wlan0 inet dhcp
- copy the following lines into /etc/resolv.conf
search cmcl.cs.cmu.edu
nameserver 10.79.0.1
nameserver 208.67.222.222
- /etc/init.d/networking restart
- you should be able to access the wide area network now.
- sudo su; passwd (set a password for the root user on the machine)
- sudo apt-get update
- sudo apt-get upgrade
- sudo apt-get install openssh-server dhcp3-server build-essential tftpd-hpa xinetd libncurses5-dev
- /etc/dhcp3/dhcpd.conf should look like this
ddns-update-style none;
option domain-name "cmcl.cs.cmu.edu";
default-lease-time 600;
max-lease-time 7200;
authoritative;
log-facility local7;
subnet 10.79.0.0 netmask 255.255.255.0 {
range 10.79.0.100 10.79.0.240;
option routers 10.79.0.1;
option broadcast-address 10.79.0.255;
option domain-name-servers 10.79.0.1;
filename "pxelinux.0";
}
subnet 10.78.0.0 netmask 255.255.255.0 {
range 10.78.0.100 10.78.0.200;
option routers 10.78.0.1;
option broadcast-address 10.78.0.255;
option domain-name-servers 10.78.0.1;
}
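The pool ranges and router addresses in the config above are easy to mistype. The following is a minimal sketch, using only Python's standard `ipaddress` module, that checks each pool lies inside its subnet and does not swallow the router address. The `SUBNETS` table is copied by hand from the config above, and the helper name `check_pool` is hypothetical; it does not parse dhcpd.conf itself.

```python
import ipaddress

# Values copied by hand from the dhcpd.conf above; this table layout is illustrative only.
SUBNETS = [
    {"net": "10.79.0.0/255.255.255.0",
     "range": ("10.79.0.100", "10.79.0.240"),
     "router": "10.79.0.1"},
    {"net": "10.78.0.0/255.255.255.0",
     "range": ("10.78.0.100", "10.78.0.200"),
     "router": "10.78.0.1"},
]


def check_pool(entry):
    """Return (ok, pool_size) for one subnet entry."""
    net = ipaddress.ip_network(entry["net"])
    low, high = (ipaddress.ip_address(a) for a in entry["range"])
    router = ipaddress.ip_address(entry["router"])
    ok = (
        low in net and high in net and router in net  # everything inside the subnet
        and low <= high                               # range is not reversed
        and not (low <= router <= high)               # router is not handed out as a lease
    )
    return ok, int(high) - int(low) + 1


if __name__ == "__main__":
    for entry in SUBNETS:
        print(entry["net"], check_pool(entry))
```

With the values above, the 10.79.0.0 pool holds 141 addresses and the 10.78.0.0 pool holds 101, which bounds how many wimpies can hold a lease at once on each interface.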
- restart the dhcp server
- install nfs server (https://help.ubuntu.com/community/SettingUpNFSHowTo, http://ubuntuguide.org/wiki/Ubuntu:Feisty#NFS_Server)
apt-get install nfs-kernel-server portmap
sudo dpkg-reconfigure portmap
# answer "No" to loopback
# add this line to /etc/exports:
/nfsroot 10.79.0.0/24(rw,no_root_squash,async)
# create /nfsroot
sudo /etc/init.d/nfs-kernel-server restart
sudo exportfs -a
# (exportfs is an optional step)
- Add the following to /etc/xinetd.d/tftp
service tftp
{
disable = no
socket_type = dgram
wait = yes
user = nobody
server = /usr/sbin/in.tftpd
server_args = -v -s /var/lib/tftpboot
only_from = 10.79.0.0/24
interface = 10.79.0.1
}
- killall -HUP xinetd
- /etc/default/tftpd-hpa should look like this
#Defaults for tftpd-hpa
RUN_DAEMON="yes"
OPTIONS="-l -s /var/lib/tftpboot"
- now create the tftpboot directory and start the tftp server
mkdir -p /var/lib/tftpboot
/etc/init.d/tftpd-hpa start
- pxe configuration
mkdir /var/lib/tftpboot/pxelinux.cfg
# /var/lib/tftpboot/pxelinux.cfg/default should look as follows (replace kernel with whatever the latest kernel is)
-----
SERIAL 0 19200 0
LABEL linux
#KERNEL vmlinuz-2.6.24-19-generic
KERNEL vmlinuz-2.6.24.3
#APPEND root=/dev/nfs initrd=initrd.img-2.6.24-19-generic nfsroot=10.79.0.1:/nfsroot ip=dhcp console=ttyS0,19200n8 rw
APPEND root=/dev/nfs initrd=initrd.img-2.6.24.3 nfsroot=10.79.0.1:/nfsroot ip=dhcp console=ttyS0,19200n8 rw
-----
* cp /lib/modules/2.6.24.3 directory from backup to /nfsroot/lib/modules/2.6.24.3
chmod 644 /var/lib/tftpboot/pxelinux.cfg/default
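Since the kernel version appears in both the KERNEL and APPEND lines of pxelinux.cfg/default, it is easy to update one and forget the other. Below is a minimal sketch that renders the same four active lines from a single version string. The function name `render_pxe_default` and its parameters are hypothetical; the defaults are simply the values used on this page.

```python
def render_pxe_default(kernel_version, server="10.79.0.1",
                       nfsroot="/nfsroot", baud=19200):
    """Render the active lines of pxelinux.cfg/default for one kernel version."""
    append = (
        f"root=/dev/nfs initrd=initrd.img-{kernel_version} "
        f"nfsroot={server}:{nfsroot} ip=dhcp console=ttyS0,{baud}n8 rw"
    )
    lines = [
        f"SERIAL 0 {baud} 0",            # serial console on port 0
        "LABEL linux",
        f"KERNEL vmlinuz-{kernel_version}",
        f"APPEND {append}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Redirect this output to /var/lib/tftpboot/pxelinux.cfg/default if it looks right.
    print(render_pxe_default("2.6.24.3"), end="")
```

The kernel and initrd files named in the output still have to exist in /var/lib/tftpboot/ under exactly those names.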
- if kernel is already built and you are restoring from backup, do the following. replace "uname -r" with the kernel version that is built for the wimpies (e.g. 2.6.24.3) if you don't want to use the latest kernel running on clusterhead.
cp /boot/initrd.img-`uname -r` /var/lib/tftpboot/
cp /boot/vmlinuz-`uname -r` /var/lib/tftpboot/
- If the kernel is not already built, configure 2.6.24.3 using FAWN_Kernel_Config and then build the kernel using the following script
#!/bin/bash -x
# download the kernel source (2.6.24.3) from kernel.org and unzip/untar it. say you put this in /home/amar/src/
cd /home/amar/src/linux-2.6.24.3
#make mrproper
make clean
make dep modules modules_install bzImage
mkinitramfs -d /home/amar/src/linux-2.6.24.3/initramfs_conf/ -o ./install/initrd.img-2.6.24.3 2.6.24.3
#cat initrd.img-2.6.24.3 | gunzip | cpio -ivdm
cp ./arch/x86/boot/bzImage /var/lib/tftpboot/vmlinuz-2.6.24.3
cp ./install/initrd.img-2.6.24.3 /var/lib/tftpboot/
- Change the permissions for the kernel image files
chmod 755 /var/lib/tftpboot/initrd.img-2.6.24.3
chmod 755 /var/lib/tftpboot/vmlinuz-2.6.24.3
- You should have a backup of /var/lib/tftpboot/pxelinux.0 on the backup disk (external hdd) of clusterhead. Alternatively, you can create your own pxelinux.0 (painful)
cp /clusterhead_backup/ /var/lib/tftpboot/pxelinux.0
- This is how /var/lib/tftpboot/ should look now
/var/lib/tftpboot/
|-- initrd.img-2.6.24-19-generic
|-- vmlinuz-2.6.24-19-generic
|-- pxelinux.0
`-- pxelinux.cfg
    `-- default
- setup /nfsroot correctly
cp -av /bin /dev/ /etc/ /home /lib /mnt /root /sbin /tmp /usr/ /var /nfsroot
mkdir /nfsroot/proc
mkdir /nfsroot/sys
/nfsroot/etc/network/interfaces should look as follows
auto lo
iface lo inet loopback
iface eth0 inet manual
sudo /usr/sbin/chroot /nfsroot update-rc.d -f SERVICE_NAME remove
where SERVICE_NAME is in the following list
- bluetooth
- powernowd
- dhcp3-server
- anacron
- cron (?)
- usplash
- laptop-mode
- atd
- acpid
- acpi-*
- pulseaudio
- nfs-server
- cupsys
- gdm
- and whatever else you think is not needed from the list in /etc/rc2.d/
copy the following lines to /nfsroot/etc/event.d/ttyS0
start on runlevel 2
start on runlevel 3
start on runlevel 4
start on runlevel 5
stop on runlevel 0
stop on runlevel 1
stop on runlevel 6
respawn
exec /sbin/getty 19200 ttyS0
- cp /lib/modules/2.6.24.3 directory from backup to /nfsroot/lib/modules/2.6.24.3
- restart the nfs server
sudo /etc/init.d/nfs-kernel-server restart
sudo exportfs -a
- /nfsroot/etc/fstab should look as follows
# /etc/fstab: static file system information.
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc       /proc      proc   defaults  0 0
/dev/nfs   /          nfs    defaults  0 0
none       /tmp       tmpfs  defaults  0 0
none       /var/run   tmpfs  defaults  0 0
none       /var/lock  tmpfs  defaults  0 0
none       /var/tmp   tmpfs  defaults  0 0
/dev/sda1  /localfs   ext2   rw,dev,exec,suid,auto,user,noatime,nodiratime  0 0
- restart a node that will be netbooted. View the boot process using the serial connector and GtkTerm (/dev/ttyS0, 19200).
if you aren't watching the GtkTerm output, watch /var/log/messages for the node's IP address and log in using ssh
- to format the flash drive on the alix nodes with an ext2 filesystem (backend)
sudo fdisk /dev/sda
# select the right options and create a partition (primary, fstype ext2)
sudo mkfs.ext2 /dev/sda1
# vrv comments:
# wanted to use vfat so that we could have it be mounted on reboot with the right permissions
# ext2 does not support it: http://www.usenet-forums.com/linux-general/79037-mount-fstab-question.html
# but ext2 is so much faster than vfat (dd on CF = 26.6MB/s on ext2, 17.4MB/s on vfat).
- proxy for the back-end nodes
# on clusterhead
apt-get install tinyproxy
# /etc/tinyproxy/tinyproxy.conf should have
Listen 10.79.0.1
Allow 10.79.0.0/24
# on the back-end nodes
export http_proxy=http://10.79.0.1:8888/
export ftp_proxy=http://10.79.0.1:8888/
# try sudo apt-get update to check if the thing works! :-)
- ntp to set date on wimpies
# on clusterhead
sudo apt-get install ntp
# /etc/ntp.conf should have the following lines
restrict 10.79.0.0 mask 255.255.255.0 nomodify notrap
broadcast 10.79.0.255
# on each wimpy, do the tedious job of:
sudo ntpdate 10.79.0.1
- Setting up other things (for things like watts graph, fakebook)
sudo apt-get install gnuplot ruby
sudo apt-get install apache2 php5 php5-memcache libapache2-mod-php5
sudo cp svn/fawn/src/wattsup/{showGraph,parseWatts,scripthandler}.php /var/www/.
sudo cp -R svn/fawn/src/wattsup/wattsuplogs /var/www/.
sudo chown www-data /var/www/*.php /var/www/wattsuplogs/* /var/www/wattsuplogs
# to deal with dns requests (and subsequent delay) from wattsup device for ntp, need to forward queries:
sudo apt-get install dnsmasq
sudo apt-get install php5-mysql php5-cli memcached imagemagick
# add the following to /etc/php5/apache2/php.ini
extension=memcache.so
# restart apache2
sudo /etc/init.d/apache2 restart
- MySQL
mysql -u root -p
GRANT ALL PRIVILEGES ON *.* TO 'sn'@'localhost' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
\q
mysql -u sn -p
CREATE DATABASE social_network;
\q
mysql -u sn -p social_network < ~/fawn/src/fakebook/dbcreate/dbbackup.sql
References
http://www.debian-administration.org/articles/478: debian-administration
https://help.ubuntu.com/community/Installation/Netboot
http://wiki.antlinux.com/pmwiki.php?n=HowTos.SetupPxeBoot
http://www.digitalpeer.com/id/linuxnfs
http://www.intra2net.com/de/produkte/opensource/diskless-howto/howto.html
http://wiki.openwrt.org/SoekrisPort#head-cb32eb65b03b258a7bed12deb2d6d4821de819bb
optimizations
random notes so that we don't forget.
- minimize services started on the back-end nodes
- fawndb
- the reason we use a log layout is that small random writes are expensive. When you merge or split files, not having a log-like structure results in a lot of random writes. In the case of BDB (hash or tree structure): for small files that fit into memory, there was no memory pressure and writes were not flushed immediately, so split and merge times for small files were not as bad as those for large files that did not fit in memory. For large files, even a page with just one modified entry that had to be evicted because of memory pressure (due to writes all over the place in virtual memory) would result in an expensive Flash write.
- idea for addition and compaction in fawndb/hashdb
Index has a fixed number of entries of the form (key, <value_ptr, size> array[10]). At the end of these entries there is also a Deleted_Entries_Index, which has a fixed number of entries of the form (<value_ptr, size> array[10]). Deleted_Entries_Index also has a size param that is updated every time entries are added or removed.
- on an insert write value in log fashion and insert <value_ptr, size> entry in Index entry corresponding to key
- if on insert there is no place in the Index for the new entry, (note: all this can be avoided using an indirect block)
-- create a new "value" block with data from all existing entries and write it out to the log
-- move current Index entries for the key into the Deleted_Entries_Index (and update size entry there)
-- add new <value_ptr, size> entry to the Index entry for this key
-- flush changes to index entry
-- if Deleted_Entries_Index is full, or size > SIZE_LIMIT_TO_START_COMPACTION (some config value), begin file compaction
-- else update Deleted_Entries_Index
- run a low priority thread that wakes-up infrequently and compacts the file.
- an interesting point would be to compare the performance of BDB over an underlying log-structured FS - does this make hashdb unnecessary?
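The insert and compaction idea above can be sketched as a toy in-memory model. This is not the fawndb/hashdb code: the class name `LogIndexStore`, the capacity `DELETED_SLOTS`, and the value of `SIZE_LIMIT_TO_START_COMPACTION` are assumptions made for illustration, a key's value is assumed to be the concatenation of its entries, the log is a `bytearray` rather than a file on Flash, and compaction runs inline instead of in a low-priority thread.

```python
SLOTS_PER_KEY = 10                      # <value_ptr, size> array[10] from the notes
DELETED_SLOTS = 32                      # assumed capacity of Deleted_Entries_Index
SIZE_LIMIT_TO_START_COMPACTION = 4096   # assumed config value, in bytes


class LogIndexStore:
    """Toy model: values live in an append-only log, the index holds pointers."""

    def __init__(self):
        self.log = bytearray()   # append-only value log
        self.index = {}          # key -> list of (value_ptr, size)
        self.deleted = []        # dead (value_ptr, size) extents
        self.deleted_size = 0    # total bytes covered by self.deleted

    def _append(self, data):
        ptr = len(self.log)
        self.log += data         # writes only ever go to the end of the log
        return (ptr, len(data))

    def _read(self, extent):
        ptr, size = extent
        return bytes(self.log[ptr:ptr + size])

    def get(self, key):
        return b"".join(self._read(e) for e in self.index.get(key, []))

    def insert(self, key, value):
        extents = self.index.setdefault(key, [])
        if len(extents) == SLOTS_PER_KEY:
            # No free slot: write one merged block, retire the old pointers.
            merged = self._append(b"".join(self._read(e) for e in extents))
            self.deleted.extend(extents)
            self.deleted_size += sum(size for _, size in extents)
            extents[:] = [merged]
        extents.append(self._append(value))
        # Simplification: the check happens after the dead extents are recorded.
        if (len(self.deleted) >= DELETED_SLOTS
                or self.deleted_size > SIZE_LIMIT_TO_START_COMPACTION):
            self.compact()

    def compact(self):
        """Rewrite the log with live data only and fix up every pointer."""
        new_log = bytearray()
        for extents in self.index.values():
            fresh = []
            for extent in extents:
                data = self._read(extent)
                fresh.append((len(new_log), len(data)))
                new_log += data
            extents[:] = fresh
        self.log = new_log
        self.deleted = []
        self.deleted_size = 0
```

Every write in this model is an append, so the only rewriting of old data happens in `compact()`, which is the property the log layout is meant to buy on Flash.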
- power
- alix for back-end nodes
also consumes less power under lighter workloads (cpu, ping, and dd tests showed that as each test started, the back-end node consumed an additional 1 W)
- use alix box as front-end node
- 3-prong power connector
- switch that consumes low power
? http://www.dlink.com/products/?pid=71
- shorter ethernet cables save power (http://www.dlink.com/corporate/environment/dlink-green-ethernet/)
Hardware: Atom
The Architecture
The chips
- Atom N270 - 1.6 GHz (533 MHz bus), 2W. $44 in bulk.
- Atom 230 - 1.6 GHz - 4W. $29 in bulk.
- Atom Z500 - 800 MHz - 0.85W
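As a rough way to compare the chips listed above, the sketch below ranks them by clock rate per watt using only the figures on this page. Clock per watt is a crude proxy and says nothing about actual throughput or about chipset power; the name `mhz_per_watt` is hypothetical.

```python
# (name, clock in MHz, power in W), taken from the list above
CHIPS = [
    ("Atom N270", 1600, 2.0),
    ("Atom 230", 1600, 4.0),
    ("Atom Z500", 800, 0.85),
]


def mhz_per_watt(chips):
    """Return (name, MHz per W) pairs, highest ratio first."""
    ratios = [(name, mhz / watts) for name, mhz, watts in chips]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in mhz_per_watt(CHIPS):
        print(f"{name}: {ratio:.0f} MHz/W")
```

By this measure the Z500 comes out ahead of the N270, which in turn is twice as good as the 230, but the chipset paired with each CPU can easily dominate the total board power.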
Chipsets
- http://www.hardforum.com/showthread.php?t=1324788 GC, GSE
- http://www.tomshardware.com/reviews/intel-atom-cpu,1947-3.html Low power SCH chipset (Poulsbo)
Boards using Atoms
- http://www.ibase.com.tw/ib882.htm - Z510, SCH US15W chipset, SODIMM socket up to 1GB, 2.7W, IDE, Serial, 2xgig LAN, 1xSATA, 8xUSB, SD/SDIO/MMC
- http://www.ibase.com.tw/ib883.htm - N270, 945GSE chipset, 1.6 GHz, 1GB max, 2xgig-e, 1xSATA, 4xUSB, 2 serial, CF socket, VGA
- http://www.globalamericaninc.com/p3308370/3308370_-_3.5%22_Embedded_Controller_with_the_Intel_Atom_Processor/product_info.html - 2x gigabit, 7-8W by spec (requested quote, expect expensive)
Workloads
Below are some workload ideas for FAWN
RUBiS
Open source implementation of an ebay-style workload from Rice University back in 2001. http://rubis.objectweb.org/
Has lots of fancy aspects to it, but all we want is a simple application that has a web-based interface to a SQL backend using PHP calls in apache, which this does! The webpage is currently running at http://clusterhead.pc.cs.cmu.edu (but the database isn't running at the moment). The distribution also comes with a driver program that allows us to modify how clients interface with the website (for example, to change read/write ratios).
RUBBoS
Like RUBiS but meant for Slashdot-type bulletin boards that exhibit lots of content caching capabilities due to popular items being very popular. Read write ratios not too high.
TPC-W
Old TPC benchmark for bookstore-like transactions, e.g. Amazon.com. Both RUBiS and TPC-W have 15% write ratios.
This page mentions that RUBBoS benefits from caching whereas TPC-W does not, because TPC-W does not exhibit a lot of locality in accesses.
TPC-App
Updated version of TPC-W to reflect modern web-transaction websites more accurately. There is no open-source implementation that I know of. One person at the OSDL started working on a version in Java but it does not look sufficiently developed yet.
Power Meters
searching efforts over summer 2008 - after wading thru all the crap the 2 good ones to consider are:
Also
- Greenbox [17]
TODOs
- try a log structured file system on the back-end nodes ( [log based FS + bdb] == hashdb?)
- Understand why BDB does not do well with Hash index structure
- FAWNDB insert times do not scale linearly with DB size on the wimpies. More importantly, it does not nearly achieve the potential dd performance of the Alix/ExtremeIV combo, which it should since it is writing in a log-structured manner. We need to find out why this is.
- Is FAWN going to be a good architecture in 10 years? Things are changing so quickly in Flash and I/O in general, so we need a good explanation for how improvements in I/O will not render FAWN obsolete.
- Need to counter the argument that you can take an array of super fast Flash devices and pair it with one beefy node. We can do this by highlighting that the QPS/W of nodes grows sublinearly while the QPS/W of I/O devices grows, at best, linearly, which means that you always want to choose the most power-efficient node and find the storage that best uses the I/O capability of that node.
