Developer Documentation

FAQ is here.

A work-in-progress comparison to SOLR.

CAUTION: This documentation is old and a lot of it is out of date. -- Matt, May 2014

Table of Contents

I. Getting Started - Setting up your PC for Gigablast development.
   A. SSH - A brief introduction to setting up ssh
II. Using the Shell - Some common shell commands.
III. Using GIT - Source code management.
IV. Hardware Administration - Gigablast hardware resources.
V. Directory Structure - How files are laid out.
VI. Kernels - Kernels used by Gigablast.
VII. Coding Conventions - The coding style used at Gigablast.
VIII. Debugging Gigablast - How to debug gb.
IX. Code Overview - Basic layers of the code.
   A. The Control Loop Layer
   B. The Database Layer
      1. Associated Classes
      2. Rdb - The core database class.
         a. Adding Records
         b. Merging Rdb Files
         c. Deleting Records
         d. Reading Records
         e. Error Correction - How Rdb deals with corrupted data on disk.
         f. Adding Records over the Net
         g. Reading Records over the Net
         i. Variable Key Sizes
         j. Caching Records
      3a. Posdb - The new word-position storing search index Rdb.
      3b. Indexdb - The retired/unused tf/idf search index Rdb.
      4. Datedb - For returning search results constrained or sorted by date.
      5. Titledb - Holds the cached web pages.
      6. Spiderdb - Holds the urls to be spidered, sorted by spider date.
      8. Checksumdb - Each record is a hash of the document content. Used for deduping.
      9. Sitedb - Used for mapping a url or site to a ruleset.
      10. Clusterdb - Used for doing site clustering and duplicate search result removal.
      11. Catdb - Used to hold DMOZ.
      12. Robotdb - Just a cache for robots.txt files.
   C. The Network Layer
      1. Associated Classes
      2. Udp server - Used for internal communication.
         a. Multicasting
         b. Message Classes
      3. TCP server - Used by the HTTP server.
      4. HTTP server - The web server.
      5. DNS Resolver
   D. The Build Layer
      1. Associated Files
      2. Spiderdb.cpp - Most of the spider code.
      3. Link Analysis
      4. The Sample Vector - Used for deduping at spider time.
      5. The Summary Vector - Used for removing similar search results.
      6. The Gigabit Vector - Used for clustering results by topic.
      7. Deduping - Preventing the indexing of duplicate pages.
      8. Indexing Error Codes - Reasons why a document failed to be indexed.
      9. DocIds - How DocIds are assigned.
      10. Scaling To Many DocIds - How we scale to many DocIds.
   E. The Search Results Layer
      1. Associated Classes
      1. The Query Parser
      2. Getting the Termlists
      3. Intersecting the Termlists
      4. Raiding Termlist Intersections
      5. The Search Results Cache
      6. Site Clustering
      7. Deduping
      8. Family Filter
      9. Gigabits
      10. Topic Clustering
      11. Related and Reference Pages
      12. The Spell Checker
   F. The Administration Layer
      1. Associated Classes
      2. Collections
      3. Parameters and Controls
      4. The Log System
      5. Changing Live Clusters
   G. The Core Functions Layer
      1. Associated Classes
      2. Files
      3. Threads
   H. File List - List of all source code files with brief descriptions.
XIV. Fighting Spam - Search engine spam identification.



Getting Started

Welcome to Gigablast. Thanks for supporting this open source software.

SSH - A Brief Introduction

Gigablast uses SSH extensively to administer and develop its software. SSH is a secure shell connection that helps protect sensitive information over an insecure network. Internally, we access our development machines with SSH, but being prompted for a password when logging in between machines becomes tiresome, and development gb processes will not execute unless you can ssh without a password. This is where the authorized keys file comes in. In your user directory there is a .ssh directory, and inside it are the authorized_keys and known_hosts files. The first thing we have to do is generate a key for your user on the current machine. Remember to press <Enter> when asked for a passphrase, otherwise you will still have to log in.
ssh-keygen -t dsa
That should have generated an id_dsa.pub and an id_dsa file. Since you now have a key, we need to let all of our trusted hosts know about it. Luckily, ssh provides a simple way to copy your key into the appropriate files. When you run this command the destination machine will ask you to log in.
ssh-copy-id -i ~/.ssh/id_dsa.pub destHost
Perform this operation for all of our development machines, replacing destHost with the appropriate machine name. Once you have done this, try sshing into each machine. You should no longer be prompted for a password. If you are still asked for a password/passphrase, go through this section again, but DO NOT ENTER A PASSPHRASE when ssh-keygen requests one. NOTE: Make sure to copy the id to the same machine you generated the key on as well. This is important for running Gigablast since it needs to ssh freely.

Host Key Has Changed and Now I Can Not Access the Machine
This is easily remedied. If a machine crashes and the OS needs to be reinstalled for whatever reason, the SSH host key for that machine will change, but your known_hosts file will still have the old host key. This protects against man-in-the-middle attacks, but when we know the machine was simply rebuilt we can correct the problem. Just open the ~/.ssh/known_hosts file with your favorite editor, find the line with the name or IP address of the offending machine, and delete the entire line (the known_hosts file separates host keys by newlines).



Using the Shell

Everyone uses the bash shell. The navigation keystrokes work in emacs, too. Here are the common commands:

Command - Description
export LD_PATH=/home/mwells/dynamicslibs/ - Export the LD_PATH variable, used to tell the OS where to look for dynamic libraries.
Ctrl+p - Show the previously executed command.
Ctrl+n - Show the next executed command.
Ctrl+f - Move cursor forward.
Ctrl+b - Move cursor backward.
Ctrl+a - Move cursor to start of line.
Ctrl+e - Move cursor to end of line.
Ctrl+k - Cut buffer from cursor forward.
Ctrl+y - Yank (paste) buffer at cursor location.
Ctrl+Shift+- - Undo last keystrokes.
history - Show the list of last commands executed. Edit /home/username/.bashrc to change the number of commands stored in the history. All are stored in the /home/username/.history file.
!xxx - Execute command #xxx, where xxx is a number shown by the 'history' command.
ps auxww - Show all processes.
ls -latr - Show all files reverse sorted by time.
ls -larS - Show all files reverse sorted by size.
ln -s <x> <y> - Make directory or file y a symbolic link to x.
cat xxx | awk -F":" '{print $1}' - Show contents of file xxx, but for each line, use : as a delimiter and print out the first token.
dsh -c -f hosts 'cat /proc/scsi/scsi' - Show all hard drives on all machines listed in the file hosts. -c means to execute this command concurrently on all those machines. dsh must be installed with apt-get install dsh for this to work. You can use double quotes in a single-quoted dsh command without problems, so you can grep for a phrase, for instance.
apt-cache search xxx - Search for a package to install. xxx is a space-separated list of keywords. Debian only.
apt-cache show xxx - Show details of the package named xxx. Debian only.
apt-get install xxx - Installs a package named xxx. Must be root to do this. Debian only.
adduser xxx - Add a new user to the system with username xxx.


Using Git

Git is what we use to do source code control. Git allows us to have many engineers making changes to a common set of files. The basic commands are the following:

git clone <srcdir> <destDir> - This will copy the git repository to the destination directory.

More information available at github.com


Setting up GitHub on a Linux Box

To make things easier, you can set up github access via ssh. Here's a quick list of commands to run: (assuming Ubuntu/Debian)
sudo apt-get install git-core git-doc
git config --global user.name "Your Name"
git config --global user.email "your@email.com"
git config --global color.ui true
ssh-keygen -t rsa -C "your@email.com" -f ~/.ssh/git_rsa
cat ~/.ssh/git_rsa.pub
Copy and paste the ssh-rsa output from the above command into your Github profile's list of SSH Keys.
ssh-add ~/.ssh/git_rsa
If that gives you an error about inability to connect to ssh agent, run:
eval `ssh-agent -a`
Then test and clone!
ssh -T git@github.com
git clone git@github.com:gigablast/open-source-search-engine



Hardware Administration

Hardware Failures

If a machine has a bunch of I/O errors in the log or the gb process is at a standstill, log in and type "dmesg" to see the kernel's ring buffer. An error that means the hard drive is fried looks something like this:
mwells@gf36:/a$ dmesg | tail -3
scsi5: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 13 89 68 82 00 00 18 00 
Info fld=0x1389688b, Current sd08:34: sense key Medium Error
 I/O error: dev 08:34, sector 323627528
If you do a cat /proc/scsi/scsi you can see what type of hard drives are in the server:
mwells@gf36:/a$ cat /proc/scsi/scsi 
Attached devices: 
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
So for this error we should replace the rightmost hard drive with a spare hard drive. Usually we have spare hard drives floating around. You can find out where all the equipment is stored by looking at colo.html.

If a drive is not showing up in /proc/scsi/scsi when it should be, it may be a bad sata channel on the motherboard, but sometimes you can see it again if you reboot a few times.

Often, if a motherboard fails, or a machine just stops responding to pings and cannot be revived, we designate it as a junk machine and replace its good hard drives with bad ones from other machines. Then we just need to fix the junk machine.

At some point we email John Wang from TwinMicro, or Nick from ACME Micro to fix/RMA the gb/g/gf machines and gk machines respectively. We have to make sure there are plenty of spares to deal with spikes in the hardware failure rate.

Sometimes you cannot ssh to a machine but you can rsh to it. This is usually because a drive is bad and ssh cannot access some files it needs to log you in. Sometimes you can ping a machine but cannot rsh or ssh to it, for the same reason. Recently we had some problems with machines not coming back up after reboot; that was due to a flaky 48-port NetGear switch.

Replacing a Hard Drive

When a hard drive goes bad, make sure it is labelled as such so we don't end up reusing it. If it is in the leftmost drive slot (drive #0) then the OS will need to be reinstalled using a boot CD. Otherwise:
1. As root, umount /dev/md0.
2. If necessary, rebuild the software raid first with mkraid -c /etc/raidtab --really-force /dev/md0.
3. Rebuild ext2 using the command in /home/mwells/gbsetup2: mke2fs -b4096 -m0 -N20000 -R stride=32 /dev/md0.
4. To test the hard drives thoroughly, after the mke2fs run nohup badblocks -b4096 -p0 -c10000 -s -w -o/home/mwells/badblocks /dev/md0 >& /home/mwells/bbout .
5. Once the raid is repaired, mount /dev/md0 and copy over the data from the host's twin. On the newly repaired host use nohup rcp -r gbXX:/a/* /a/ &, where gbXX is the twin that still has all the data. DO NOT ERASE THE TWIN'S DATA. You will also have to do rcp -r gbXX:/a/.antiword /a/ because rcp does not pick that directory up.

Measuring Drive Performance

Slow drives are sometimes worse than broken drives. Run ./gb thrutest <directoryToWriteTo> 10000000000 to make Gigablast write out 10GB of files and then read from them. This is also a good test for checking a drive for errors. Each drive should have a write speed of about 39MB/s for a HITACHI, so the whole raid should be doing 150MB/s for writing and close to 200MB/s for reading. Fuller drives will not do so well because of fragmentation. If the performance is less than it should be, try mounting each drive individually and checking what speed you get in order to isolate the culprit. Sometimes re-seating a drive (just pulling it out and plugging it back in) does the trick. A soft reboot does not seem to work. I don't know yet whether a power cycle works.

Measuring Memory Performance

Run membustest 50000000 50 to read 50MB from main memory 50 times to see how fast the memory bus is. It should be at least 2000MB/sec, and 2500MB/sec for the faster servers. Partap also wrote gb memtest to see the maximum amount of memory that can be allocated, in addition to testing memory bus speeds. The max memory that can be allocated seems to be about 3GB on our 2.4.31 kernel.

Measuring Network Performance

Use thunder 50000000 on the server with ip address a.b.c.d and thunder 50000000 a.b.c.d on the client to see how fast the client can read UDP packets from the server. Gigablast needs about a Gigabit between all machines, and they must all be on a Gigabit switch. Sometimes the 5013CM-T resets the Gigabit to 100Mbps and causes Gigablast to basically come to a halt. You can detect that by doing a dmesg | grep e1000 | tail -1 on each machine to see what speed its network interface is running at.


Kernels

Gigablast runs well on the Linux 2.4.31 kernel, which resides in /gb/kernels/2.4.31/. The .config file is in that directory and named dotconfig. It has the Marvell patch, mvsata340_2.4.patch.bz2, applied so Linux can see the Marvell chip that drives the 4 SATA devices on our SuperMicro 5013CM-T 1U servers. In addition we comment out the 3 'offline' statements in drivers/scsi/scsi_error.c in the kernel source, and raise the 3 TIMEOUT values to 30. These changes allow the kernel to deal with the 'drive offlined' messages that the Hitachi drives' SCSI interface gives every now and then. We still set the 'ghost lun' thing in the kernel source to 10, but I am not sure if that actually helps.

Older kernels had problems with unnecessary swapping and also with crashing, most likely due to the e1000 driver, which drives the 5013CM-T's Gigabit ethernet functionality. All issues seem to be worked out at this point, so any kernel crash is likely due to a hardware failure, like bad memory or a bad disk.

I would like to have larger block sizes on the ext2fs filesystem, because 4KB is just not good for fragmentation purposes. We use mostly huge files on our partitions, raid-0'ed across four 400GB SATA drives, so it would help if we could use a block size of even a few megabytes. I've contacted Theodore Ts'o, the guy who wrote e2fs, and he may do it for a decent chunk of cash.


Coding Conventions

When editing the source please do not wrap lines past 80 columns. Also, please make an effort to avoid excessive nesting; it makes for hard-to-read code and can almost always be avoided.

Classes and files are almost exclusively named using capitalized letters for the beginning of each word, like MyClass.cpp. Global variables start with g_, static variables start with s_ and member variables start with m_. In order to save vertical space and make things more readable, left curly brackets ({'s) are always placed on the same line as the class, function or loop declaration. In pointer declarations, the * sticks to the variable name, not the type, as in void *ptr. All variables are to be named in camel case, as in lowerCamelCase. Underscores in variable names are forbidden, with the exception of the g_, s_ and m_ prefixes, of course.

And try to avoid lots of nested curly brackets. Usually you do not have to nest more than 3 levels deep if you write the code correctly.

Using the *p, p++ and p < pend mechanisms is generally preferred to using the p[i], i++ and i < n mechanisms. So stick to that whenever possible.

Using ?'s in the code, like a = b ? d : e; is also forbidden because it is too cryptic and easy to mess up.

Upon error, g_errno is set and the conditions causing the error are logged. Whenever possible, Gigablast should recover after error and continue running gracefully.

Gigablast uses many non-blocking functions that return false if they block, or true if they do not block. These types of functions will often set the g_errno global error variable on error. They almost always take a pointer to the callback function and the callback state as arguments. The callback function will be called upon completion of the function.
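Here is a minimal sketch of that convention; the function names and the stand-in g_errno definition below are hypothetical illustrations, not actual Gigablast code.

#include <cstdint>

static int32_t g_errno = 0;                 // stand-in for Gigablast's global error variable

// Return false if the operation blocked (callback fires later), or true if it
// completed right away (check g_errno for any error).
bool getRemoteList(void *state, void (*callback)(void *state)) {
    bool launched = true;                   // pretend we kicked off asynchronous I/O
    if (launched) return false;             // blocked; callback(state) will be called on completion
    g_errno = 0;                            // completed immediately with no error
    return true;
}

// Typical call site propagates the "blocked" return value up the stack:
//   if (!getRemoteList(this, gotListWrapper)) return false;
//   // did not block: continue processing inline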

Gigablast no longer uses Xavier Leroy's pthread library (LinuxThreads) because it was buggy. Instead, the Linux clone() system call is used. That function is pretty much exclusively Linux, and was the basis for pthreads anyway. It uses an older version of libc which does not contain thread support for the errno variable, so, like the old pthreads library, Gigablast's Threads.cpp class has implemented the _errno_location() function to work around that. This function provides a pointer to a separate thread's (process's) own errno location so we do not contend for the same errno.

We also make an effort to avoid deep subroutine calls. Having to trace the IP stack down several functions is time-consuming and should be avoided.

Gigablast also tries to make a consistent distinction between size and length. Length usually refers to a string length (like strlen()) and does not include the NULL terminator, if any. Size is the overall size of a data blob, and includes everything.

When making comments please do not insert the "//" at the very first character of the line in the first column of the screen. It should be indented just like any other line.

Whenever possible, please mimic the currently used styles and conventions, and please avoid using third party libraries. Even the standard template library has serious performance/scalability issues and should be avoided.


Debugging Gigablast

General Procedure

When Gigablast cores you generally want to use gdb to look at the core and see where it happened. You can use BitKeeper's bk revtool <filename> to see if the class or file that cored was recently edited. You may be looking at a bug that was already fixed and only need to update to the latest version.

Using GDB

We generally use gdb to debug Gigablast. If you are running gb, the Gigablast process, under gdb and it receives a signal, gdb will break, and you have to tell it to ignore the signal from now on by typing handle SIG39 nostop noprint for signal 39, for instance, and then continue execution by typing 'c' and enter. When debugging a core on a customer's machine you might have to copy your version of gdb over to it if they don't have one installed.

There is also a /gb/bin/gdbserver that you can use to debug a remote gb process, although no one really uses it; Partap used it a few times.

The most common way to use gdb is:

  • gdb ./gb to start up gdb.
  • run -c ./hosts.conf 0 to run gb under gdb.
  • Ctrl-C to break gb
  • where to make gdb print out the ip stack
  • break MyClass.cpp:123 to break gb at line 123 in MyClass.cpp.
  • print <variableName> to print out the value of a variable.
  • watch <variableName> to set a watchpoint so gdb breaks if the variable changes value.
  • set <variableName> <value> to set the value of a variable.
  • set print elements 100000 to tell gdb to print out the first 100000 bytes when printing strings, as opposed to limiting it to the default 200 or so bytes.

You can also use gdb to do poor man's profiling by repeatedly attaching gdb to the gb pid like gdb ./gb <pid> and seeing where it is spending its time. This is a fairly effective random sampling technique.

If a gb process goes into an infinite loop you can get it to save its in-memory data by attaching gdb to its pid and typing print mainShutdown(1) which will tell gdb to run that function which will save all gb's data to disk so you don't end up losing data.

To debug a core, type gdb ./gb <coreFilename>. Then you can examine why gb core dumped. Please copy the gb binary and move the core to another filename if you want to preserve the core for another engineer to look at. You need to use the exact gb binary that produced the core in order to analyze it properly.

It is useful to have the following .gdbinit file in your home directory:

set print elements 100000
handle SIG32 nostop noprint
handle SIG35 nostop noprint
handle SIG39 nostop noprint
set overload-resolution off
The overload-resolution setting gets in the way when trying to print the return value of some functions, like uccDebug() for instance.

Using Git to Help Debug

Git can be a useful debugging aid. By doing git log you can see the latest changes made to gb. This may allow you to isolate the bug. git diff HEAD~1 HEAD~0 will show the actual changes made in the last commit, instead of just the commit summaries. Substituting '1' with '2' will show the changes from the last two commits, etc.

Memory Leaks

All memory allocations, malloc, calloc, realloc and new, go through Gigablast's Mem.cpp class. That class has a hash table of pointers to any outstanding allocated memory, including the size of the allocation. When Gigablast allocates memory it will allocate a little more than requested so it can pad the memory with special bytes, called magic characters, that are verified when the memory is freed to see if the buffer was overwritten or underwritten. In addition to doing these boundary checks, Gigablast will also print out all the unfreed memory at exit. This allows you to detect memory leaks. The only way we can leak memory without detection now is if it is leaked in a third party library which Gigablast links to.
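A hedged sketch of the boundary-check idea described above (this is not Mem.cpp's actual implementation, just an illustration of the technique): pad each allocation with magic bytes and verify them on free to catch buffer underwrites and overwrites.

#include <cstdlib>
#include <cstring>
#include <cstdio>

static const int  PAD   = 8;
static const char MAGIC = (char)0xda;

void *checkedMalloc(size_t size) {
    char *raw = (char *)malloc(size + 2 * PAD + sizeof(size_t));
    if (!raw) return NULL;
    memcpy(raw, &size, sizeof(size_t));                     // remember the requested size
    memset(raw + sizeof(size_t), MAGIC, PAD);               // front fence
    memset(raw + sizeof(size_t) + PAD + size, MAGIC, PAD);  // back fence
    return raw + sizeof(size_t) + PAD;                      // caller sees only the middle
}

void checkedFree(void *p) {
    char *raw = (char *)p - PAD - sizeof(size_t);
    size_t size;
    memcpy(&size, raw, sizeof(size_t));
    for (int i = 0; i < PAD; i++) {
        if (raw[sizeof(size_t) + i] != MAGIC ||
            raw[sizeof(size_t) + PAD + size + i] != MAGIC)
            fprintf(stderr, "buffer under/overwrite detected\n");
    }
    free(raw);
}

A real allocator wrapper would also record the pointer and size in a hash table so unfreed entries can be printed at exit, which is how the leak report described above works.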

Random Memory Writes

This is the hardest problem to track down. Where did that random memory write come from? Usually RdbTree.cpp uses the most memory, so if the write happens it will corrupt the tree. To fight this you can protect the RdbTree by forcing RdbTree::m_useProtection to true in RdbTree::set(). This will make Gigablast protect the memory when it is not being accessed explicitly by RdbTree.cpp code. It does this by calling mprotect(), which can slow things down quite a bit.

Corrupted Data In Memory

Try to reproduce it under gdb immediately before the corruption occurs, then set a watchpoint on some memory that was getting corrupted. Try to determine the start of the corruption in memory. This may lead you to a buffer that was overflowing. Oftentimes a gb process will core in RdbTree.cpp, which never happens unless the machine has bad RAM. Bad RAM also causes gb to go into an infinite loop in Msg3.cpp's RdbMap logic. In these cases try replacing the memory or using another machine.

Common Core Dumps

Problem - Cause
Core in RdbTree - Probably bad RAM

Common Coding Mistakes

1. Calling atoip() twice or more in the same printf() statement. atoip() outputs into a single static buffer and cannot be shared this way.
2. Not calling class constructors and destructors when mallocing/freeing class objects. You need to allow the class to initialize properly after allocation and to free any memory it allocated before the object itself is freed.



System Administration


Code Overview

Gigablast consists of about 500,000 lines of C++ source. The latest gb version being developed as of this writing is /gb/datil/. Gigablast is a single executable, gb, that contains all the functionality needed to run a search engine. Gigablast attempts to uniformly distribute every function amongst all hosts in the network. Please see overview.html to get a general overview of the search engine from a USER perspective.

Matt started writing Gigablast in 2000, with the intention of making it the most efficient search engine in the world. A lot of attention was paid to making every piece of the code fast and scalable, while still readable. In July 2013, the majority of the search engine source code was open sourced. Gigablast continues to develop and expand the software, and also offers consulting services for companies wishing to run their own search engine.


The Control Loop Layer

Gigablast uses a signal-based architecture, as opposed to a thread-based architecture. At the heart of the Gigablast process is the I/O control loop, contained in the Loop.cpp file. Gigablast sits in an idle state until it receives a signal on a file or socket descriptor, or a "thread" exits and sends a signal to be cleaned up. Threads are only used when necessary, which is currently for performing disk I/O, intersecting long lists of docIds to generate search results (which can block the CPU for a while) and merging database data from multiple files into a single list.

Threads are not only hard to debug, but can make everything more latent. If you have 10 requests coming in, each requiring 1 second of CPU time, and you thread them all, it will take 10 seconds until the queries are satisfied and the results returned; but if you handle them FIFO (one at a time) then the first request is satisfied in 1 second, the second in 2 seconds, etc. The Threads.cpp class handles the use of threads and uses the Linux clone() call. Xavier Leroy's pthreads were found to be too buggy at the time.

Loop.cpp also allows other classes to call its registerCallback() function, which takes a file or socket descriptor and will call a given callback when data is ready for reading or writing on that descriptor. Gigablast accomplishes this with Linux's fcntl() call, which you can use to tell the kernel to connect a particular file descriptor to a signal. The heart of gb is really sigtimedwait(), which sits there idle waiting for signals. When one arrives it breaks out and the callback responsible for the associated file descriptor is called.
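As a rough illustration of that mechanism, here is a hedged sketch using the standard Linux fcntl()/sigtimedwait() calls; it is not the actual Loop.cpp code, and registerFd()/handleReadyFd() are hypothetical stand-ins for the real registration and dispatch logic.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE                      // F_SETSIG is Linux-specific
#endif
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static void handleReadyFd(int fd) { (void)fd; /* dispatch to the callback registered for fd */ }

// Ask the kernel to send rtSig (carrying the fd in siginfo) when fd has I/O ready.
static int registerFd(int fd, int rtSig) {
    if (fcntl(fd, F_SETOWN, getpid()) < 0) return -1;
    if (fcntl(fd, F_SETSIG, rtSig)    < 0) return -1;
    int flags = fcntl(fd, F_GETFL);
    return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK);
}

// The idle loop: block the signals and collect them with sigtimedwait().
static void controlLoop(int rtSig) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, rtSig);
    sigaddset(&set, SIGIO);              // fallback when the signal queue overflows
    sigprocmask(SIG_BLOCK, &set, NULL);

    struct timespec ts = { 0, 10 * 1000 * 1000 };   // 10ms timeout for sleep callbacks
    for (;;) {
        siginfo_t info;
        int sig = sigtimedwait(&set, &info, &ts);
        if      (sig == rtSig) handleReadyFd(info.si_fd);   // one fd is ready
        else if (sig == SIGIO) { /* queue overflowed: poll every registered fd */ }
        // on timeout (-1 with EAGAIN): run sleep callbacks, etc.
    }
}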

In the event that the queue of ready file descriptors is full, the kernel will deliver a SIGIO to the gb process, which performs a poll over all the registered file descriptors.

Gigablast will also catch other signals, like segmentation faults and HUP signals and save the data to disk before exiting. That way we do not lose data because of a core dump or an accidental HUP signal.

Using a signal-based architecture means that we have to work with callbacks a lot. Everything is triggered by an event, similar to GUIs, but our events are not mouse related; they are disk and network events. This naturally leads to the concept of a callback, which is a function to be called when a task is completed. So you might call a function like readData(myCallback, myState, readbuf, bytesToRead) and Gigablast will perform the requested read operation and call myCallback(myState) when it is done. This means you have to worry about state, whereas in a thread-based architecture you do not.

Furthermore, all callbacks must be expressed as pointers to C functions, not C++ functions. So we call these C functions wrapper functions and have them immediately call their C++ counterparts. You'll see these wrapper functions clearly illustrated throughout the code.
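A small sketch of that wrapper convention; the class and function names here are hypothetical examples, not actual Gigablast code.

class StateMachine {                       // hypothetical example class
public:
    void gotReply() { /* resume the C++ logic using member state here */ }
};

// Plain C-style function whose address can be passed around as a callback pointer.
static void gotReplyWrapper(void *state) {
    ((StateMachine *)state)->gotReply();   // immediately hand off to the C++ counterpart
}

// Registration site (hypothetical): readData(gotReplyWrapper, this, buf, bytesToRead);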

Gigablast also has support for asynchronous signals. You can tell the kernel to deliver you an asynchronous signal on an I/O event, and that signal, also known as an interrupt, will interrupt whatever the CPU is doing and call gb's signal handler, sigHandlerRT(). Be careful: you cannot allocate memory when in a real-time/asynchronous signal handler. You should only call function names ending in _ass, which stands for asynchronous-signal safe. The global instance of the UdpServer called g_udpServer2 is based on these asynchronous signals, mostly set up for doing fast pings without having to worry about what the CPU is doing at the time, but it really does not interrupt like it should, most likely due to kernel issues. It doesn't seem any faster than the non-asynchronous UDP server, g_udpServer.

This signal-based architecture also makes a lot of functions return false if they block, meaning they never completed because they have to wait on I/O, or return true and set g_errno on error. So if a function returns false on you, generally, you should return false, too.

The Loop class is primarily responsible for calling callbacks when an event occurs on a file/socket descriptor. Loop.cpp also calls callbacks that registered with it via registerSleepCallback(). Those sleep callbacks are called every X milliseconds, or more than X milliseconds apart if load is extreme. Loop also responds to SIGCHLD signals, which are delivered when a thread exits. The only socket descriptor related callbacks are for TcpServer and UdpServer, each of which has its own routine for calling callbacks for an associated socket descriptor. But when the system is under heavy load we must be careful with which callbacks we call: we should call the highest priority (lowest niceness) callbacks before calling any of the lowest priority (highest niceness) callbacks. To solve this problem we require that every callback registered with Loop.cpp return once 2 ms or more have elapsed, if possible. If it still needs to be called by Loop.cpp to do more work it should return true at that time, otherwise it should return false. We do this in order to give higher priority callbacks a chance to be called. This is similar to the low latency patch in the Linux kernel. In practice, low priority callbacks will try to break out after 2 ms or more, but high priority callbacks will not, and Loop::doPoll() just calls the low priority callbacks.

The Database Layer

Gigablast has the following databases which each contain an instance of the Rdb class:
Indexdb     - Used to hold the index.
Datedb      - Like indexdb, but its scores are dates.
Titledb     - Used to hold cached web pages.
Spiderdb    - Used to hold urls sorted by their scheduled time to be spidered.
Checksumdb  - Used for preventing spidering of duplicate pages.
Sitedb      - Used for classifying webpages. Maps webpages to rulesets.
Clusterdb   - Used to hold the site hash, family filter bit, language id of a document.
Catdb       - Used to classify a document using DMOZ.

Rdb

Rdb.cpp is a record-based database. Each record has a key, an optional blob of data, and an optional long (4 bytes) that holds the size of the blob. If present, the size of the blob is stored right after the key. The key is 12 bytes and of type key_t, as defined in the types.h file.
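A hedged sketch of walking one such record in a buffer, assuming the fixed 12-byte key case with a data blob (this is not the real RdbList code, which also handles dataless records, fixed data sizes and compressed keys):

#include <cstdint>
#include <cstring>

char *nextRecord(char *p) {
    p += 12;                         // skip the 12-byte key
    int32_t dataSize;
    memcpy(&dataSize, p, 4);         // the 4-byte size follows the key
    p += 4;
    p += dataSize;                   // skip the data blob itself
    return p;                        // now points at the next record's key
}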

Associated Classes (.cpp and .h files)

Checksumdb (DB) - Rdb that maps a docId to a checksum for an indexed document. Used to dedup same content from the same hostname at build time.
Clusterdb (DB) - Rdb that maps a docId to the hash of a site and its family filter bit and, optionally, a sample vector used for deduping search results. Used for site clustering, family filtering and deduping at query time.
Datedb (DB) - Like indexdb, but its scores are 4-byte dates.
Indexdb (DB) - Rdb that maps a termId to a score and docId pair. The search index is stored in Indexdb.
MemPool (DB) - Used by RdbTree to add new records to the tree without having to do an individual malloc.
MemPoolTree (DB) - Unused. Was our own malloc routine.
Msg0 (DB) - Fetches an RdbList from across the network.
Msg1 (DB) - Adds all the records in an RdbList to various hosts in the network.
Msg3 (DB) - Reads an RdbList from several consecutive files in a particular Rdb.
Msg5 (DB) - Uses Msg3 to read RdbLists from multiple files and then merges those lists into a single RdbList. Does corruption detection and repair. Integrates the list from the RdbTree into the single RdbList.
MsgB (DB) - Unused. A distributed cache for caching anything.
Rdb (DB) - The core database class from which all are derived.
RdbBase (DB) - Each Rdb has an array of RdbBases, one for each collection. Each RdbBase has an array of BigFiles for that collection.
RdbCache (DB) - Can cache RdbLists or individual Rdb records.
RdbDump (DB) - Dumps the RdbTree to an Rdb file. Also used by RdbMerge to dump the merged RdbList to a file.
RdbList (DB) - A list of Rdb records.
RdbMap (DB) - Maps an Rdb key to an offset into an RdbFile.
RdbMem (DB) - Memory manager for RdbTree so it does not have to allocate space for every record in the tree.
RdbMerge (DB) - Merges multiple Rdb files into one Rdb file. Uses Msg5 and RdbDump to do the reading and writing respectively.
RdbScan (DB) - Reads an RdbList from an RdbFile; used by Msg3.
RdbTree (DB) - A binary tree of Rdb records. All collections share a single RdbTree, so the collection number is specified for each node in the tree.
SiteRec (DB) - A record in Sitedb.
Sitedb (DB) - An Rdb that maps a url to a Sitedb record which contains a ruleset to be used to parse and index that url.
SpiderRec (DB) - A record in Spiderdb.
Spiderdb (DB) - An Rdb whose records are urls sorted by the times they should be spidered. The key contains other information, like whether the url is old or new to the index, and the priority of the url, currently from 0 to 7.
TitleRec (DB) - A record in Titledb.
Titledb (DB) - An Rdb where the records are basically compressed web pages, along with other info like the quality of the page. Contains an instance of the LinkInfo class.


Adding a Record to Rdb

When a record is added to an Rdb it is housed in a binary tree, RdbTree.cpp, whose size is configurable via gb.conf. For example, <indexdbMaxTreeSize> specifies how much memory, in bytes, the tree for Indexdb can use. When the tree is 90% full it is dumped to a file on disk. The records are dumped in order of their keys. When enough files have accrued on disk, a merge is performed to keep the number of files down and responsiveness up.

Each file, called a BigFile and defined by BigFile.h, can actually consist of multiple physical files, each limited in size to 512MB. In this manner Gigablast overcomes the 2GB file limit imposed by some kernels or disk systems. Each physical file in a BigFile, after the first file, has a ".partX" extension added to its filename, where X ranges from 1 to infinity. Throughout this document, "BigFile" is used interchangeably with "file".

If the tree cannot accommodate a record add, it will return an ETRYAGAIN error. Typically, most Gigablast routines will wait one second before retrying the add.

Dumping an Rdb Tree

The RdbDump class is responsible for dumping the RdbTree class to disk. It dumps a little bit of the tree at a time because it can take a few hundred milliseconds to gather a large list of records from the tree, especially when the records are dataless (just keys). Since RdbTree::getList() is not threaded it is important that RdbDump gets a little bit at a time so it does not block other operations.

Merging Rdb Files

Rdb merges just enough files as to keep the number of files at or below the threshold specified in gb.conf, which is <indexdbMaxFilesToMerge> for Indexdb, for example. RdbMerge.cpp is used to control the merge logic. The merge chooses which files to merge so as to minimize the amount of disk writing for the long term. It is important to keep the number of files low, because any time a key range of records is requested, the number of disk seeks will be low and the database will be responsive.

When too many files have accumulated on disk, the database will enter "urgent merge mode". It will show this in the log when it happens. When that happens, Gigablast will not dump the corresponding RdbTree to disk because it will create too many files. If the RdbTree is full then any attempts to add data to it when Gigablast is in "urgent merge mode" will fail with an ETRYAGAIN error. These error replies are counted on a per host basis and displayed in the Hosts table. At this point the host is considered to be a spider bottleneck, and most spiders will be stuck waiting to add data to the host.

If Gigablast is configured to "use merge tokens", (which no longer works for some reason and has since been disabled) then any file merge operation may be postponed if another instance of Gigablast is performing a merge on the same computer's IDE bus, or if the twin of the host is merging. This is done mostly for performance reasons. File merge operations tend to decrease disk access times, so having a handy twin and not bogging down an IDE bus allows Gigablast's load balancing algorithms to redirect requests to hosts that are not involved with a big file merge.

Deleting Rdb Records

The last bit of the 12-byte key, key_t, is called the delbit. It is 0 if the key is a "negative key" and 1 if the key is a "positive key". When performing a merge, a negative key may collide with its positive counterpart, thus annihilating one another. Therefore, deletes are performed by taking the key of the record you want to delete, changing the low bit from a 1 to a 0, and then adding that key to the database. Negative keys are always dataless.
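A sketch of that delete convention (assumption: 12-byte keys stored least-significant byte first, so the delbit is the low bit of the first byte; the real code uses the key_t helpers in types.h):

void makeNegativeKey(char *key12) {
    key12[0] &= ~0x01;   // 0 = negative (delete) key, 1 = positive key
}
// Add the resulting dataless key to the Rdb; during a merge it annihilates
// its positive counterpart.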

Reading a List of Rdb Records

When a user requests a list of records, Gigablast will read the records in the key range from each file. If no more than X bytes of records are requested, then Gigablast will read no more than X bytes from each file. After the reading, it merges the lists from each file into a final list. During this phase it will also annihilate positive keys with their negative counterparts.

In order to determine where to start reading in each file for the database, Gigablast uses a "map file". The map file, encompassed by RdbMap.cpp, records the key of the first record that occurs on each disk page. Each disk page is PAGE_SIZE bytes, currently 16k. In addition to recording the key of the first record that starts on each page, it records a 2-byte offset of that key into the page. The map file is represented by the RdbMap class.
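A sketch of how such a page map lets a reader find a starting disk offset for a key range (the structures and the byte-wise memcmp() comparison below are simplifying assumptions, not the real RdbMap code):

#include <cstdint>
#include <cstring>

struct PageEntry {             // one entry per disk page
    char    firstKey[12];      // key of the first record starting on the page
    int16_t keyOffset;         // byte offset of that key within the page
};

const int32_t PAGE_SIZE = 16 * 1024;

// Return a file offset at or before the first record >= startKey.
int64_t getStartOffset(PageEntry *map, int32_t numPages, const char *startKey) {
    int32_t i = 0;
    // the real code would binary search; byte-wise memcmp() stands in for the
    // real key comparison here
    while (i + 1 < numPages && memcmp(map[i + 1].firstKey, startKey, 12) <= 0)
        i++;
    return (int64_t)i * PAGE_SIZE + map[i].keyOffset;
}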

The Msg3::readList() routine is used retrieve a list of Rdb records from a local Rdb. Like most Msg classes, it allows you to specify a callback function and callback data in case it blocks. The Msg5 class contains the Msg3 class and extends it with the error correction (discussed below) and caching capabilities. Calling Msg5::getList() also allows you to incorporate the lists from the RdbTree in order to get realtime data. Msg3 is used mostly by the file merge operation.

Rdb Error Correction

Every time a list of records is read, either for answering a query or for doing a file merge operation, it is checked for corruption. Right now just the order of the keys of the records, and whether those keys are out of the requested key range, are checked. Later, checksums may be implemented as well. If some keys are found to be out of order or out of the requested key range, the request for the list is forwarded to that host's twin. The twin is a mirror image. If the twin has the data intact, it returns its data across the network, but the twin will return all keys it has in that range, not necessarily just from a single file, thus we can end up patching the corrupted data with a list that is hundreds of times bigger. Since this procedure also happens during a file merge operation, merging is a way of correcting the data.

If there is no twin available, or the twin's data is corrupt as well, then Gigablast attempts to excise the smallest amount of data so as to regain integrity. Any time corruption is encountered it is noted in the log with a message like:
1145413139583 81 WARN  db     [31956]  Key out of order in list of records.
1145413139583 81 WARN  db     [31956]  Corrupt filename is indexdb1215.dat.
1145413139583 81 WARN  db     [31956]  startKey.n1=af6da55 n0=14a9fe4f69cd6d46 endKey.n1=b14e5d0 n0=8d4cfb0deeb52cc3
1145413139729 81 WARN  db     [31956]  Removed 0 bytes of data from list to make it sane.
1145413139729 81 WARN  db     [31956]  Removed 6 recs to fix out of order problem.
1145413139729 81 WARN  db     [31956]  Removed 12153 recs to fix out of range problem.
1145413139975 81 WARN  net     Encountered a corrupt list.
1145413139975 81 WARN  net     Getting remote list from twin instead.
1145413139471 81 WARN  net     Received good list from twin. Requested 5000000 bytes and got 5000010. 
	startKey.n1=af6cc0e n0=ae7dfec68a44a788 endKey.n1=ffffffffffffffff n0=ffffffffffffffff

Getting a List of Records from Rdb in a Network

Msg0.cpp is used to retrieve a list of Rdb records from an arbitrary host. Indexdb is currently the only database where lists of records are needed. All the other databases pretty much just do single key lookups. Each database chooses how to partition its records based on the key across the many hosts in a network. For the most part, the most significant bits of the key are used to determine which host is responsible for storing that record. This is why Gigablast is currently limited to running on 2, 4, 8, 16 ... or 2^n machines. Msg0 should normally be used to get records, not Msg5 or Msg3.

Adding a Record to Rdb in a Network

You can add a record to a local instance of Rdb by using Rdb::addRecord() and you can use Rdb::deleteRecord() to delete records. But in order to add a record to a remote host you must use Msg1.cpp. This class will use the key of the record to determine which host or hosts in the network are responsible for storing the record. It will continue to send the add request to each host until it receives a positive reply. If the host receiving the record is down, the add operation will not complete, and will keep retrying forever.

Variable Key Size

As of June 2005, Rdb supports a variable key size. Before, the key was always of class key_t as defined in types.h as being 12 bytes. Now the key can be 16 bytes as well. The key size, either 12 or 16 bytes, is passed into the Rdb::init() function. MAX_KEY_BYTES is currently #define'd as being 16, so if you ever use a key bigger than 16 bytes that must be updated. Furthermore, the functions for operating on key_t's, while still available, have been generalized to accept a key size parameter. These functions are in types.h as well, and the more popular ones are KEYSET() and KEYCMP() for setting and comparing variable sized keys.
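As a minimal sketch of what such generic key helpers boil down to (an assumption-level illustration, not the exact types.h implementations), keys are treated as raw byte strings of length keySize:

#include <cstring>

void keySet(char *dst, const char *src, int keySize) {
    memcpy(dst, src, keySize);       // copy the key bytes, not just the pointer
}

int keyCmp(const char *a, const char *b, int keySize) {
    // assumption: compare most-significant bytes first; the real KEYCMP()
    // accounts for the key's little-endian layout
    for (int i = keySize - 1; i >= 0; i--) {
        if ((unsigned char)a[i] < (unsigned char)b[i]) return -1;
        if ((unsigned char)a[i] > (unsigned char)b[i]) return  1;
    }
    return 0;
}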

To convert the Rdb files from a fixed-size key to a variable-sized one required modifying the files of the Associated Classes (listed above). Essentially, all variables of type key_t were changed to character pointers (char *), and all key assignment and comparison operators (and others) were changed to use the more general KEYSET() and KEYCMP() functions in types.h. To maintain backwards compatibility and ease migration, all Rdb associated classes still accept key_t parameters, but now they also accept character pointer parameters for passing in keys.

Some of the more common bugs from this change: since the keys are now character pointers, data owned by one class often got overwritten by another; therefore you have to remember to copy the key using KEYSET() rather than just operating on the key that the pointer points to. Another common mistake is using the KEYCMP() function without a comparison operator immediately following it, such as <, >, == or !=. Also, tossing the variable key size, m_ks, around and keeping it in agreement is another weak point. There were some cases of leaving a statement like (char *)&key alone when it should have been changed to just key since it was made into a (char *) from a key_t. And there was a case of not dealing with the 6-byte compression correctly, like replacing 6 with m_ks-6 when we should not have, as in RdbList::constrain_r(). Whenever possible, the original code was left intact and simply commented out to aid in future debugging.

Posdb

Indexdb was replaced with Posdb in 2012 in order to store word position information as well as what field the word was contained in. Word position information is basically the position of the word in the document and starts at position 0. A sequence of whitespace is counted as one, and a sequence of punctuation containing a comma or something else is counted as 2. An alphanumeric word is counted as one. So in the sentence "The quick, brown" the word brown would have a word position of 5. The field of the word in the document could be the title, a heading, a meta tag, the text of an inlink or just the plain document body.
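The toy program below illustrates the counting rules described above and reproduces the example (it is an assumption-laden sketch, not the actual tokenizer): an alphanumeric word advances the position by 1, a whitespace run by 1, and a punctuation run containing a comma by 2 (other punctuation runs by 1 in this sketch).

#include <cstdio>
#include <cctype>

int main() {
    const char *s = "The quick, brown";
    int pos = 0;
    for (const char *p = s; *p; ) {
        if (isalnum((unsigned char)*p)) {
            const char *start = p;
            while (*p && isalnum((unsigned char)*p)) p++;
            printf("word '%.*s' at position %d\n", (int)(p - start), start, pos);
            pos += 1;                                  // a word counts as one
        } else {
            bool hasComma = false, allSpace = true;
            while (*p && !isalnum((unsigned char)*p)) {
                if (*p == ',') hasComma = true;
                if (!isspace((unsigned char)*p)) allSpace = false;
                p++;
            }
            pos += (allSpace || !hasComma) ? 1 : 2;    // a comma run counts as two
        }
    }
    return 0;   // prints: 'The' at 0, 'quick' at 2, 'brown' at 5
}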

The 18-byte key for a Posdb record has the following bitmap:
tttttttt tttttttt tttttttt tttttttt  t = termId (48bits)
tttttttt tttttttt dddddddd dddddddd  d = docId (38 bits)
dddddddd dddddddd dddddd0r rrrggggg  r = siterank, g = langid
wwwwwwww wwwwwwww wwGGGGss ssvvvvFF  w = word position , s = wordspamrank
pppppb1M MMMMLZZD                    v = diversityrank, p = densityrank
                                     M = unused, b = in outlink text
                                     L = langIdShiftBit (upper bit for langid)
                                     Z = compression bits. can compress to
                                         12 or 6 bytes keys.
G: 0 = body 
   1 = intitletag 
   2 = inheading 
   3 = inlist 
   4 = inmetatag
   5 = inlinktext
   6 = tag
   7 = inneighborhood
   8 = internalinlinktext
   9 = inurl

F: 0 = original term
   1 = conjugate/sing/plural
   2 = synonym
   3 = hyponym

Posdb.cpp tries to rank documents highest that have the query terms closest together. If most terms are close together in the body, but one term is in the title, then there is a slight penalty. This penalty as well as the weights applied to the different density ranks, siteranks, etc are in the Posdb.h and Posdb.cpp files.

Indexdb

Indexdb has been replaced by Posdb, but the key for an Indexdb record has the following bitmap:
tttttttt tttttttt tttttttt tttttttt  t = termid (48bits)
tttttttt tttttttt ssssssss dddddddd  s = ~score
dddddddd dddddddd dddddddd dddddd0Z  d = docId (38 bits)  Z = delbit
When Rdb::m_useHalfKeys is on and the preceding key has the same 6 bytes as the following key, then the following key, called a half key, only requires 6 bytes and therefore has the following bitmap:
ssssssss dddddddd dddddddd dddddddd  d = docId, s = ~score
dddddddd dddddd1Z                    Z = delbit
Every term that Gigablast indexes, be it a word or phrase, is hashed using the hash64() routine in hash.h. This is a very fast and effective hashing function. The resulting hash of the term is called the termid. It is constrained to 48 bits.

Next, the score of that term is computed. Generally, the more times the term occurs in the document the higher the score. Other factors, like incoming links and link text and quality of the document, may contribute to the score of the term. The score is 4 bytes but is logarithmically mapped into 8 bits, and complemented, so documents with higher scores appear above those with lower scores, since the Rdb is sorted in ascending order.
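An illustrative sketch of that score mapping (the constants and function name are arbitrary assumptions, not the actual Indexdb scoring code): the 4-byte score is squashed logarithmically into 8 bits and then complemented so higher scores sort first in an ascending-order Rdb.

#include <cstdint>
#include <cmath>

static uint8_t compressScore(uint32_t score) {
    double  v      = std::log2((double)score + 1.0);    // roughly 0 .. 32
    uint8_t mapped = (uint8_t)(v * 255.0 / 32.0);        // squash into 8 bits
    return (uint8_t)~mapped;                             // complement: bigger score -> smaller key
}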

The <indexdbTruncationLimit> parameter in gb.conf, reflected as Conf::m_indexdbTruncationLimit, allows you to specify the maximum number of docids allowed per termid. This allows you to save disk space. Right now it is set to 5 million, a fairly hefty number. The merge routine, RdbList::indexMerge_r() will ensure that no returned Indexdb list violates this truncation limit. This routine is used for reading data for resolving queries as well as doing file merge operations. A list of Indexdb records all with the same termid, is called a termlist or indexlist.

Datedb

The key for a Datedb record has the following 16-byte bitmap:
tttttttt tttttttt tttttttt tttttttt  t = termId (48bits)
tttttttt tttttttt DDDDDDDD DDDDDDDD  D = ~date
DDDDDDDD DDDDDDDD ssssssss dddddddd  s = ~score
dddddddd dddddddd dddddddd dddddd0Z  d = docId (38 bits)

And, similar to indexdb, datedb also has a half bit for compression down to 10 bytes:

                  DDDDDDDD DDDDDDDD  D = ~date
DDDDDDDD DDDDDDDD ssssssss dddddddd  s = ~score
dddddddd dddddddd dddddddd dddddd1Z  d = docId (38 bits)

Datedb was added along with variable-sized keys (mentioned above). It is basically the same as indexdb, but has a 4-byte date field inserted. IndexTable.cpp was modified slightly to treat dates as scores in order to provide sort by date functionality. By setting the sdate=1 cgi parameter, Gigablast should limit the termlist lookups to Datedb. Using date1=X and date2=Y cgi parameters will tell Gigablast to constrain the termlist by those dates. date1 and date2 are currently seconds since the epoch. Gigablast will search for the date of a document in this order, stopping at the first non-zero date value:
1. <meta name=datenum content=123456789>
2. <meta name=date content="Dec 1 2006">
3. The last modified date in the HTTP mime that the web server returns.

Titledb

Titledb holds a cached copy of every web page in the index. The TitleRec class is used to serialize and deserialize the information from an Rdb record into something more useful. It also uses the freely available zlib library to compress web pages to about 1/5th of their size. Depending on the word density on a page and the truncation limit, Titledb is usually slightly bigger than Indexdb. See TitleRec.h to see what kind of information is contained in a TitleRec.

The key of a Titledb record, a TitleRec, has the following bitmap:
dddddddd dddddddd dddddddd dddddddd  d = docId
dddddddd hhhhhhhh hhhhhhhh hhhhhhhh  h = hash of site name
hhcccccc cccccccc cccccccc cccccccD  c = content hash, D = delbit
The low bits of the top 31 bits of the docId are used to determine which host in the network stores the Titledb record. See Titledb::getGroupId().

The hash of the sitename is used for doing site clustering. The hash of the content is used for deduping identical pages from the search results.

IMPORTANT: Any time you change TITLEREC_VERSION in Titledb.h or make any parsing change to TitleRec.cpp, you must also change it in all parent bk repositories to keep things backwards compatible. We need to ensure that newer versions of gb can parse the Titledb records of older versions. Because if you increment TITLEREC_VERSION in the newer repository and then someone else increments it for a different reason in the parent repository, then you end up having a conflict as to what the latest version number actually stands for. So versions need to be immediately propagated both ways to avoid this.

Spiderdb

NOTE: this description of spiderdb is very outdated! - matt Aug 2013.

Spiderdb is responsible for the spidering schedule. Every url in spiderdb is either old or new. If it is already in the index, it is old, otherwise, it is new. The urls in Spiderdb are further categorized by a priority from 0 to 7. If there are urls ready to be spidered in the priority 7 queue, then they take precedence over urls in the priority 6 queue.

All the urls in Spiderdb priority queue are sorted by the date they are scheduled to be spidered. Typically, a url will not be spidered before its time, but you can tweak the window in the Spider Controls page. Furthermore, you can turn the spidering of new or old urls on and off, in addition to disabling spidering based on the priority level. Link harvesting can also be toggled based on the spider queue. This way you can harvest links only from higher priority spider queues.

Some spiderdb records are docid based only and will not have an associated url. These records were usually added from the PageReindex.cpp tool which allows the user to store the docids of a bunch of search results directly into spiderdb for respidering.

The bitmap of a key in Spiderdb:
00000000 00000000 00000000 pppNtttt  t = time to spider, p = ~ of priority
tttttttt tttttttt tttttttt ttttRRf0  R = retry #, f = forced?
dddddddd dddddddd dddddddd dddddddD  d = top 32 bits of docId, D = delbit
                                     N = 1 iff url not in titledb (isNew)
Each Spiderdb record also records the number of times the url was tried and failed. In the Spider Controls you can specify how many failures are allowed before Gigablast gives up and deletes the url from Spiderdb, and possibly from the other databases if it was indexed.

If a url was "forced", then it was added to Spiderdb even though it was detected as already being in there. Urldb lets us know if a url is in Spiderdb or not. So if the url is added to Spiderdb even though it is in there already it is called "forced" url. Once a forced url is spidered then that Spiderdb record is discarded. Otherwise, if the url was spidered and was not forced, it is rescheduled for a future spidering, and the old Spiderdb record is deleted and a new one is added.

Gigablast attempts to determine if the url changed since its last spidering. If it has, then it will try to decrease the time between spiderings, otherwise it will try to increase that time. The min and max respider times can be specified in the Spider Controls page, so you can ensure Gigablast does not respider a url too soon and also does not wait too long between respiderings.

For more about the spider process see The Build Layer.

Checksumdb

cccccccc hhhhhhhh hhhhhhhh cccccccc  h = host name hash
cccccccc cccccccc cccccccc cddddddd  c = content, collection and host hash
dddddddd dddddddd dddddddd dddddddD  d = docId , D = delbit

Sitedb

dddddddd dddddddd dddddddd dddddddd  d = domain hash (w/ collection)
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu  u = special url hash
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu  
Sitedb maps a url to a site file number (sfn). The site file is now called a ruleset file and has the name sitedbN.xml in the working directory. All rulesets must be archived in the Bitkeeper repository at /gb/conf/. Therefore, all gb clusters share the same ruleset name space. Msg8.cpp and Msg9.cpp are used to respectively get and set sitedb records.

Catdb

dddddddd dddddddd dddddddd dddddddd  d = domain hash
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu  u = special url hash
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu

Data Block:
 . number of catids   (1 byte)
 . list of catids    (4 bytes each)
 . sitedb file #     (3 bytes)
 . sitedb version #  (1 byte)
 . siteUrl (remaining bytes)
Catdb is a special implementation of Sitedb. While it is similar in that a single record is stored per url and the keys are created using the same hashes, the record stores additional category information about the url. This includes how many categories the url is in and which categories it is in (their ids). Like sitedb, Msg8 and Msg9 are used to get and set catdb. Msg2a is used to generate a full catdb using directory information and calling Msg9.

Catdb is only used at spider time for the spider to lookup urls and see if they are in the directory and which categories they are under. A url's category information is stored in its TitleRec and is also indexed using special terms (see below).

dmozparse

The program "dmozparse" is implemented individually and allows for the creation of proprietary data files using the RDF data available from DMOZ. Dmozparse will create two main data files, gbdmoz.content.dat and gbdmoz.structure.dat. gbdmoz.content.dat stores a list of url strings and their associated category IDs. This is used to populate catdb. gbdmoz.structure.dat stores and list of category names, their IDs, their parent IDs, their offsets into the original RDF files, and the number of urls present in each category. The offsets are used to lookup more complex information about the categories, such as sub-category lists and titles and summaries for urls. See the Overview for proper use of dmozparse.

Msg2a

Catdb is generated using Msg2a, which reads the gbdmoz.content.dat file generated by dmozparse and sets the Catdb records accordingly with Msg9. Msg2a is also used to update Catdb when new data is available. See the Overview for proper building and updating of Catdb.

Categories

The Categories class is used to store the Directory's hierarchy and provide general functionality when dealing with the directory. It is instanced as the global "g_categories". The hierarchy is loaded from the gbdmoz.structure.dat file when Gigablast starts. Categories provides an interface to look up and print category names based on their ID. Titles and summaries for a given url in a category may also be looked up. Categories also provides the ability to generate a list of sub-categories for a given category, which is used by Msg2b (see below).

Msg2b

Msg2b is used to generate and sort a category's directory listing. This includes the sub-categories, related categories, and other languages of the given category. Input is given as the category ID to generate the directory listing for. Msg2b will store the listing internally. This is primarily used by PageResults to make a DMOZ style directory which can be browsed by users.

Indexed Catids, Direct and Indirect

When a url that is in the directory gets spidered, the IDs of the categories the url is in are indexed. For the exact categories the url appears in, the prefix "gbdcat" is hashed with the category id. For all parent categories of these categories, the prefix "gbpdcat" is used. The base categories the url is in will also be indexed with "gbpdcat". These are considered "direct" catids.

All urls which are under a sub-url that is in the directory will have category IDs indexed as "indirect" catids. All of the direct catids associated with the sub-url will be indexed using the prefixes "gbicat" and "gbpicat".

Indexing the direct catids allows for a fast lookup of all the urls under a category. Searches can also be restricted to single categories, either the base category itself or the inclusion of all sub-categories. Indirect catids allow pages under those listed in a category to be searched as well.

The Network Layer

The network code consists of 5 parts:
1. The UDP Server
2. The Multicast class
3. The Message Classes
4. The TCP Server
5. The HTTP Server



Associated Classes (.cpp and .h files)

Dns (Net) - A DNS client built on top of the UdpServer class.
DnsProtocol (Net) - Uses UdpServer to make a protocol for talking to DNS servers. Used by the Dns class.
HttpMime (Net) - Creates and parses an HTTP MIME header.
HttpRequest (Net) - Creates and parses an HTTP request.
HttpServer (Net) - Gigablast's highly efficient web server; contains a TcpServer class.
Multicast (Net) - Used to reroute a request if it fails to be answered in time. Also used to send a request to multiple hosts in the cluster, usually to a group (shard) for data storage purposes.
TcpServer (Net) - A TCP server which contains an array of TcpSockets.
TcpSockets (Net) - A C++ wrapper for a TCP socket.
UdpServer (Net) - A reliable UDP server that uses non-blocking sockets and calls handlers when receiving a message. The handler called depends on that message's type. The handler is UdpServer::m_handlers[msgType].
UdpSlot (Net) - Basically a "socket" for the UdpServer. The UdpServer contains an array of a few thousand of these. When none are available to receive a request, the dgram is dropped and will later be resent by the requester in a back-off fashion.


The UDP Server

The UdpServer.cpp, UdpSlot.cpp and UdpProtocol.h classes constitute the framework for Gigablast's UDP server, a fast and reliable method for transmitting data between hosts in a Gigablast network.

Gigablast uses two instances of the UdpServer class. One of them, g_udpServer2, uses asynchronous signals to reply to received dgrams by interrupting the current process. g_udpServer is not asynchronous. The asynchronous instance is not really used any more, since the 2.4.31 kernel does not seem to really support it: ping replies were not instant under heavy load.

The way it works is that the caller calls g_udpServer.sendRequest ( ... ) to send a message. The receiving host will get a signal that the udp socket descriptor is ready for reading. It will then peek at the datagram using the read ( ... , MSG_PEEK , ... ) call to get the transid (transaction id) from the datagram. The transid is either associated with an existing UdpSlot or a new UdpSlot is created to handle the transaction. The UdpSlots are stored in a fixed-size array, so if you run out of them, the incoming datagram will be silently dropped and will be re-sent later.
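Here is a minimal sketch of the MSG_PEEK idea described above: peek at the front of the next datagram to read a transaction id without consuming the datagram, so the full read can be directed at the right UdpSlot. The header size and the byte offset of the transid here are illustrative assumptions, not the actual UdpProtocol layout.

#include <sys/socket.h>
#include <cstdint>
#include <cstring>
#include <unistd.h>

// Peek at the next datagram on 'fd' and pull out a 32-bit transaction id
// assumed to sit at byte offset 'transIdOffset' in the header. MSG_PEEK
// leaves the datagram queued so a later recv() still sees the whole thing.
bool peekTransIdSketch(int fd, int transIdOffset, uint32_t *transIdOut) {
    unsigned char hdr[16];
    ssize_t n = recv(fd, hdr, sizeof(hdr), MSG_PEEK | MSG_DONTWAIT);
    if (n < transIdOffset + 4) return false;   // not enough header bytes
    memcpy(transIdOut, hdr + transIdOffset, 4);
    return true;
}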

When a UdpSlot is selected to receive a datagram, it then calls UdpSlot::sendAck() to send back an ACK datagram packet. When the sender receives the ACK it sets a bit in UdpSlot::m_readAckBits[] to indicate it has done so. Likewise, UdpSlot::m_sendAckBits[] are updated on the machine receiving the message, and m_readBits[] and m_sendBits[] are used to keep track of which non-ACK datagrams have been read or sent respectively.

Gigablast will send up to ACK_WINDOW_SIZE (defined in UdpSlot.cpp as 12) non-ACK datagrams before it requires an ACK to send more. If doing a local send then Gigablast will send up to ACK_WINDOW_SIZE_LB (loopback) datagrams.

If the sending host has not received an ACK datagram within RESEND_0 (defined to be 33 in UdpSlot.cpp) milliseconds, it will resend the datagrams that have not been ACKed. If the host is sending on g_udpServer, then RESEND_1 (defined to be 100 in UdpSlot.cpp) milliseconds is used. If sending a short message (under one datagram in size) then RESEND_0_SHORT (33 ms) is used. The resends are taken care of in UdpServer::timePoll() which is called every 20ms for g_udpServer2 (defined in main.cpp) and every 60ms for g_udpServer.
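The window and resend rules can be summarized with a small sketch. The constants mirror the text (ACK_WINDOW_SIZE, RESEND_0); the surrounding struct is a simplified assumption, not the real UdpSlot.

#include <cstdint>

const int ACK_WINDOW_SIZE = 12;   // unACKed dgrams allowed in flight
const int RESEND_0        = 33;   // ms before resending on the async server

struct SlotSketch {
    int     numSent;     // non-ACK dgrams sent so far
    int     numAcked;    // ACKs received so far
    int64_t lastSendMs;  // time of the last send

    // May we send another non-ACK datagram right now?
    bool canSendMore() const {
        return (numSent - numAcked) < ACK_WINDOW_SIZE;
    }

    // Called periodically (e.g. from a timePoll-style loop); returns true if
    // the unACKed datagrams should be resent.
    bool needsResend(int64_t nowMs) const {
        return (numSent > numAcked) && (nowMs - lastSendMs >= RESEND_0);
    }
};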

Each datagram can be up to DGRAM_SIZE bytes. You can make this up to 64k and let the kernel deal with chopping the datagrams up into IP packets that fit under the MTU, but unfortunately, the new 2.6 kernel does not allow you to send datagrams larger than the MTU over the loopback device. Datagrams sent over the loopback device can be up to DGRAM_SIZE_LB bytes, and datagrams used by the Dns class can be up to DGRAM_SIZE_DNS bytes.

The UdpProtocol class is used to define and parse the header used by each datagram. The Dns class derives its DnsProtocol class from UdpProtocol in order to use the proper DNS headers.

The bitmap of the header used by the UdpProtocol class is the following:
REtttttt ACNnnnnn nnnnnnnn nnnnnnnn  R = is Reply?, E = hadError? N=nice
iiiiiiii iiiiiiii iiiiiiii iiiiiiii  t = msgType, A = isAck?, n = dgram #
ssssssss ssssssss ssssssss ssssssss  i = transId C = cancelTransAck
dddddddd dddddddd dddddddd dddddddd  s = msgSize (iff !ack) (w/o hdrs!)
dddddddd ........ ........ ........  d = msg content ...
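For illustration, here is a hedged sketch of decoding that 12-byte header. The bit positions track the bitmap above; the struct name and the byte-order handling of transId and msgSize are assumptions, not the actual UdpProtocol code.

#include <cstdint>
#include <cstring>

struct DgramHeaderSketch {
    bool     isReply, hadError, isAck, cancelTransAck, nice;
    uint8_t  msgType;    // 6 bits
    uint32_t dgramNum;   // 21 bits
    uint32_t transId;
    uint32_t msgSize;    // only meaningful when !isAck
};

DgramHeaderSketch parseHeaderSketch(const unsigned char *b) {
    DgramHeaderSketch h;
    h.isReply        = (b[0] >> 7) & 1;   // R
    h.hadError       = (b[0] >> 6) & 1;   // E
    h.msgType        =  b[0]       & 0x3f; // tttttt
    h.isAck          = (b[1] >> 7) & 1;   // A
    h.cancelTransAck = (b[1] >> 6) & 1;   // C
    h.nice           = (b[1] >> 5) & 1;   // N
    // 21-bit dgram number: low 5 bits of byte 1 plus bytes 2 and 3.
    h.dgramNum = ((uint32_t)(b[1] & 0x1f) << 16) |
                 ((uint32_t)b[2] << 8) | b[3];
    memcpy(&h.transId, b + 4, 4);   // byte order as sent; no conversion here
    memcpy(&h.msgSize, b + 8, 4);
    return h;
}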
The niceness (N bit) of a datagram can be either 0 or 1. If it is 0 (not nice) then it will take priority over a datagram with a niceness of 1. This just means that we will call the handlers for it first. It's not a very big deal.

The datagram number bits (n) are used to number the datagrams so we can re-assemble them in the correct order.

The message type corresponds to all of the Msg*.cpp classes you see in the source tree. The message type also maps to a pointer to a function, called the handler function, which generates and sends back a reply. This map (array) of pointers to handler functions is UdpServer::m_handlers[]. Each Msg*.cpp class must call g_udpServer2.registerHandler() to register its handler function for its message type. Its handler function will be called when a complete request of its message type (msgType) is received.
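A sketch of the handler-table idea follows: each message type maps to one registered handler function, and the server dispatches a completed request by message type. The signature is an assumption modeled on the description, not the exact UdpServer::registerHandler() prototype.

#include <cstdint>

typedef void (*HandlerSketch)(void *slot, int32_t niceness);

// One handler per message type; msgType is 6 bits, so 64 entries.
static HandlerSketch s_handlers[64];

bool registerHandlerSketch(uint8_t msgType, HandlerSketch h) {
    if (s_handlers[msgType]) return false;   // already registered
    s_handlers[msgType] = h;
    return true;
}

// Called once a complete request of type msgType has been assembled.
void dispatchSketch(uint8_t msgType, void *slot) {
    if (s_handlers[msgType]) s_handlers[msgType](slot, 1);
}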

The handler function must set the UdpSlot's (UDP socket) m_requestBuf to NULL if it does not want UdpServer to free it. More likely, however, is that when receiving a reply from the UdpServer you will want to set UdpSlot::m_readBuf to NULL so the UdpServer doesn't free it, but then, of course, you are responsible for freeing it.

Multicasting

The Multicast.cpp class is also considered part of the network layer. It takes a request and broadcasts it to a set of twin hosts. It can do this in one of two ways. One way is to send the request to a set of twin hosts and not complete until all hosts have sent back a reply. This method is used when adding data. Gigablast needs to add data to a host and all of its mirrors for the add to be considered a success. So if one host crashes and is not available to receive the data, the requester will keep re-sending the request to that host until it comes back online.

The second broadcast method employed by Multicast.cpp is used mainly for reading data. If m_loadBalancing is true, it will send an asynchronous request to each host (using Msg34.cpp) in a set of twin hosts to determine the load on each host. Once it knows the load of each host, the multicast will forward the request to the least loaded host. If, for some reason, the selected host does not reply within a set amount of time (depends on the message type -- see Multicast.cpp) then the request will be re-routed to another host in the set of twins.
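Here is an illustrative sketch of that second method: ask each twin for its load, forward the request to the least loaded one, and re-route to another twin if no reply arrives in time. The struct and timings are assumptions for the sketch, not the real Multicast or Msg34 interfaces.

#include <vector>
#include <climits>

struct TwinSketch {
    int  hostId;
    int  load;          // as reported by the load query (Msg34 in the text)
    bool triedAlready;  // set after we forward a request to this twin
};

// Pick the least loaded twin we have not tried yet; returns -1 if none left,
// in which case the multicast has run out of hosts to re-route to.
int pickLeastLoadedSketch(const std::vector<TwinSketch> &twins) {
    int best = -1, bestLoad = INT_MAX;
    for (size_t i = 0; i < twins.size(); i++) {
        if (twins[i].triedAlready) continue;
        if (twins[i].load < bestLoad) { bestLoad = twins[i].load; best = (int)i; }
    }
    return best;
}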

The Msg classes

All of the Msg*.cpp files correspond to network messages. Most of them are requests. See the classes that start with Msg* for explanations in the File List.

The TCP Server


The HTTP Server


The DNS Resolver


Originally Dns.cpp was built to interface with servers running the bind9 DNS program. A query containing the hostname would be sent to a bind9 process (which was selected based on the hash of the hostname) and the bind9 process would perform the recursive DNS lookups and return the final answer, the IP address of that hostname.

However, since the advent of the 256 node cluster, we've had to run multiple instances of bind9, about 10 of them, just to support the spider load. Rather than worry about those processes or servers going down, being terminated, or contending too much for memory or cpu with the Gigablast processes they share the server with, we decided to just replace bind9's recursive lookup functionality with about 400 lines of additional code in Dns.cpp. This also serves to make the code base more independent.

Dns::getIp() is the entry point to Dns.cpp's functionality. It will create a UDP datagram query and send it to a nameserver (DNS server). It will expect to get back the IP address of the hostname in the query, receive an error packet, or time out. Now, since bind9 is out of the loop, we have to direct the request to a root DNS server. The root servers will refer us to TLD root servers which will then refer us to others, etc. Sometimes the nameservers we are referred to do not have their IPs listed in the reply and so we have to look those up on the side. Also, we often are referred to a list of servers to ask, and so if any of those time out we have to move to the next in the list. This explains the nature of the DnsState class which is used to keep our information persistent as we wait for replies.

DnsState has an m_depth member which is 0 when we ask the root servers, 1 when we ask the TLD root servers, etc. m_depth is used to offset us into an array of DNS IPs, m_dnsIps[], and an array of DNS names, m_dnsNames[], which unfortunately do not have IPs included and must be looked up should we need to ask them. Furthermore, at each depth we are limited to MAX_DNS_IPS IP addresses and MAX_DNS_IPS DNS names. So if we get referred to more than that many IP addresses we will truncate the list. This value is a hefty 32 right now, and that is per depth level. MAX_DEPTH is about 7. That is how many times we can be referred to other nameservers before giving up and returning ETIMEDOUT. We also avoid asking the same IP address twice by keeping a list of m_triedIps in the DnsState. And DnsState has an m_buf buffer that is used for allocating extra DnsStates for doing IP lookups on nameservers we are referred to but which have no IPs given in the response. We can use m_buf to recursively create up to 3 DnsStates before giving up; for instance, we may be trying to get the IP of a nameserver and be referred to 2 more nameservers, without IPs, so we must allocate another DnsState, etc.

To avoid duplicate parallel IP lookups we use s_table, which is a hashtable using the HashTableT.cpp class in which each bucket is a CallbackEntry, which consists of a callback function and a state. So Dns::getIp() will use the hostname you want to get an IP for as the key into this hash table. If there is a CallbackEntry for that key then you must wait in line because someone has already sent out a request for that hostname. The line is kept in the form of a linked list. Each CallbackEntry has an m_nextKey member, too, which refers to the key of the next CallbackEntry in s_table waiting on that same hostname. However, the entries that are not first in line use an s_bogus key, since all keys in the hash table must be unique. The s_bogus key is a long long that is incremented each time it is used, so it should never wrap, since a long long is absolutely huge.
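Here is a simplified sketch of that "wait in line" dedup: a table keyed by hostname holds the first requester's callback, and later requesters for the same hostname are queued behind it so only one query goes out. The types and names are illustrative, not the actual CallbackEntry API.

#include <string>
#include <unordered_map>
#include <vector>
#include <functional>

typedef std::function<void(long ip)> IpCallbackSketch;

static std::unordered_map<std::string, std::vector<IpCallbackSketch>> s_waiters;

// Returns true if the caller should actually launch the DNS request; false
// means an identical request is already in flight and the callback was queued.
bool addWaiterSketch(const std::string &host, IpCallbackSketch cb) {
    auto it = s_waiters.find(host);
    if (it != s_waiters.end()) { it->second.push_back(cb); return false; }
    s_waiters[host].push_back(cb);
    return true;
}

// Called when the reply (or error) for 'host' arrives: notify everyone in line.
void gotIpSketch(const std::string &host, long ip) {
    auto it = s_waiters.find(host);
    if (it == s_waiters.end()) return;
    for (auto &cb : it->second) cb(ip);
    s_waiters.erase(it);
}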

We also cache timeouts in the RdbCache for about a day. That way we do not bottleneck on them forever. Other errors were always cached, but not timeouts until now. We now also support a TTL in the cache. Every DNS reply has a TTL which specifies how long the IP is good for, and we put that into the cache. We actually limit it to MAX_DNS_CACHE_AGE which is currently set to 2 days (it is in seconds). We may want to cache some resource records so, for instance, we do not have to keep asking the root servers for the TLD servers if we know what they are. And along the lines of future performance enhancements, we may want to consider asking multiple nameservers, launching the requests within about a second of each other. And, if a reply comes back for a slot that was timed out, we should probably still entertain it as a late reply.

DNS debug messages can be turned on at any time from the log tab. It is often helpful to use dig to assist you: dig @1.2.3.4 xyz.com +norecurse will show you DNS server 1.2.3.4's response to a request for the IP of xyz.com, and dig xyz.com +trace will show you all the servers and replies dig communes with to resolve a hostname to its IP address. dig also seems to get IPv6 IP addresses, which the Gigablast resolver does not work with. This may be something to support later.




The Build Layer

This section describes how Gigablast downloads, parses, scores and indexes documents.

Associated Files

AdultBit.cpp (Build) - Used to detect if document content is naughty.
Bits.cpp (Build) - Sets descriptor bits for each word in a Words class.
Categories.cpp (Build) - Stores DMOZ categories in a hierarchy.
Lang.cpp (Build) - Unused.
Language.cpp (Build) - Enumerates the various languages supported by Gigablast's language detector.
LangList.cpp (Build) - Interface to the language-specific dictionaries used for language identification by XmlDoc::getLanguage().
Linkdb.cpp (Build) - Functions to perform link analysis on a docid/url. Computes a LinkInfo class for the docId. The LinkInfo class contains Inlink classes serialized into it for each non-spammy inlink detected. Also contains a Links class that parses out all the outlinks in a document.
PageAddUrl.cpp (Build) - HTML page to add a url or file of urls to spiderdb.
PageInject.cpp (Build) - HTML page to inject a page directly into the index.
Phrases.cpp (Build) - Generates phrases for every word in a Words class. Uses the Bits class.
Pops.cpp (Build) - Computes popularity for each word in a Words class. Uses the dictionary files in the dict subdirectory.
Pos.cpp (Build) - Computes the character position of each word in a Words class. HTML entities count as a single character. So do back-to-back spaces.
Spam (Build) - Computes the probability a word is spam for every word in a Words class.
Spider.cpp (Build) - Has most of the code used by the spidering process. SpiderLoop is a class in there that is the heart of the spider; it is the control loop that launches spiders.
StopWords.cpp (Build) - A table of stop words, used by Bits to see if a word is a stop word.
Words.cpp (Build) - Breaks a document up into "words", where each word is a sequence of alphanumeric characters, a sequence of non-alphanumeric characters, or a single HTML/XML tag. At the heart of the build process.
Xml.cpp (Build) - Breaks a document up into XmlNodes where each XmlNode is a tag or a sequence of characters which are not a tag.
XmlDoc.cpp (Build) - The main document parsing class. A huge file; pretty much does all the parsing.
XmlNode (Build) - The Xml class has an array of these. Each is either a tag or a sequence of characters between tags (or at the beginning/end of the document).
dmozparse (Build) - Creates the necessary dmoz files Gigablast needs from those files downloadable from DMOZ.

The Spider Loop

SpiderLoop.cpp finds what URLs in spiderdb to spider next. It is based on the URL filters table that you can specify in the admin interface. It calls XmlDoc::index() to download and index the URL.

Link Analysis

XmlDoc calls Msg25 to set the LinkInfo class for a document. The LinkInfo class is stored in the titledb record (the TitleRec) and can be recycled if recycle link info is enabled in the Spider Controls. LinkInfo contains an array of LinkText classes. Each LinkText class corresponds to a document that links to the url being spidered, and has an IP address, a link-adjusted quality and docid of the linker, optional hyperlink text, a log-spam bit, and the total number of outgoing links the linker has. Exactly how the incoming link text is indexed and how the linkers affect the link-adjusted quality of the url being indexed can be easily seen on the Page Parser tool available through the administrative section or by clicking on the [analyze] link next to a search result.

Msg25 uses Msg20 to get information about an inlinker. It sends a Msg20 request to the machine that has the linking document in its titledb. The handler loads the titledb record using Msg22, and then extracts the relevant link text using the Links.cpp class, and packages that up with the quality and docid of the linker in the reply.

The Sample Vector

After downloading the page, XmlDoc::getPageSampleVector() generates a sample vector for the document. This vector is formed by getting the top 32 or so terms from the new TermTable sorted by their termId, a hash of each term. This vector is stored in the TitleRec and used for deduping.

The Summary Vector

Deduping search results at query time is based partly on how similar the summaries are. We use XmlDoc::getSummaryVector() which takes two sample vectors as input. The similarity threshold can be controlled from the Search Controls' percent similar dedup default and it can also be explicitly specified using the psc cgi parameter.
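Here is a minimal sketch of a "percent similar" comparison between two sample vectors of termIds, in the spirit of the dedup check described above. The real XmlDoc::getSummaryVector() logic may weight terms differently; this version just measures termId overlap, which is an assumption.

#include <cstdint>
#include <algorithm>
#include <iterator>
#include <vector>

// Both vectors are assumed sorted by termId, as the sample vector is.
int percentSimilarSketch(const std::vector<uint64_t> &a,
                         const std::vector<uint64_t> &b) {
    if (a.empty() || b.empty()) return 0;
    std::vector<uint64_t> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    // Score the overlap against the smaller vector.
    size_t denom = std::min(a.size(), b.size());
    return (int)(100 * common.size() / denom);
}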

The Gigabit Vector

Like the Sample Vector, the Gigabit Vector is a sample of the terms in a document, but the terms are sorted by scores which are based on the popularity of the term and on the frequency of the term in the document. XmlDoc::getGigabitVector() computes that vector for a TitleRec and makes use of the Msg24 class which gets the gigabits for a document. The resulting vector is stored in the TitleRec and used to do topic clustering. cluster by topic can be enabled in the Search Controls, and percent similar topic default can be used to tune the clustering sensitivity so that documents in the search results are only clustered together if their Gigabit Vectors are X% similar or more. Msg38 calls Clusterdb::getGigabitSimilarity() to compute the similarity of two Gigabit Vectors.

Deduping at Spider Time

Last Updated 1/29/2014 by MDW

Many URLs contain essentially the same content as other URLs. Therefore we need to dedup URLs and prevent them from entering the index. Deduping in this fashion is disabled by default so you will need to enable it in the Spider Controls for your collection. You will note such duplicate urls in the logs with the "Doc is a dup" (EDOCDUP) message. Those urls are not indexed, and if they were indexed, they will be removed.

When a page is indexed we store a hash of its content into a single "posdb" key. We set the "sharded by termid" bit for that posdb key, so it is not sharded by docid which is the norm. Hostdb::getShard() checks for that bit and shards by termid when it is present. In that way all the pages with the same content hash will be together on disk in posdb.

We perform a lookup of all other docids that share the same content hash as the document we are indexing. If another docid has the same content hash and its site rank is the same as ours or higher, then we are the duplicate: we do not index our docid, getIndexCode() returns EDOCDUP, and our document will be deleted if it was already indexed. If another docid with the same content hash has a site rank lower than ours, then it is the dup and it will be removed with the EDOCDUP error the next time it is reindexed. So site rank trumps, and when the site ranks are equal we take a first-come, first-served approach and treat the newcomer as the dup.

We use the getContentHashExact64() function to generate the 64-bit dup checksum. It hashes the document pretty much verbatim for safety. It does treat sequences of white space as a single space character however.
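Here is a minimal sketch of a 64-bit "exact" content checksum in the spirit of getContentHashExact64() as described above: the document is hashed nearly verbatim, but any run of whitespace contributes a single space. The mixing constants are generic (FNV-1a), not Gigablast's actual hash tables.

#include <cstdint>
#include <cctype>

uint64_t contentHashExact64Sketch(const char *s, int len) {
    uint64_t h = 0xcbf29ce484222325ULL;        // FNV-1a offset basis
    bool lastWasSpace = false;
    for (int i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (isspace(c)) {
            if (lastWasSpace) continue;        // collapse whitespace runs
            c = ' ';
            lastWasSpace = true;
        } else {
            lastWasSpace = false;
        }
        h ^= c;
        h *= 0x100000001b3ULL;                 // FNV-1a prime
    }
    return h;
}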

Other hash functions exist that hash all the month names and day of week names and ALL DIGITS to a specific number in order to ignore the clocks one finds on web pages. Additionally tags can be ignored unless they are a frame or iframe tag. Pure text is hashed as lower case alnum. Punctuation and spaces can be skipped over. This algorithm is currently unused but is available in the getLooseContentHash() function.

If we see a <link href=xxx rel=canonical> tag in the page and our url is different than xxx, then we do not index the url and use the ENOTCANONICAL error, but we do insert a spider request for the referenced canonical url so it can get picked up. The error when downloading such urls shows up as "Url was dup of canonical page" in the logs. The canonical URL we insert should inherit the same SpiderRequest::m_isInjecting or SpiderRequest::m_isAddUrl bits of the "dup" url it came from, similar to how simplified meta redirects work.

Indexing Error Codes

When trying to index a document we call XmlDoc::getIndexCode() to see if there was a problem. 0 means everything is good. Typically an error will be indicative of not indexing the document, or sometimes even removing it if it was indexed. Sometimes an error code will do nothing except add a SpiderReply to Spiderdb indicating the error so that the document might be tried again in the future. Sometimes, like in the case of simplified meta redirects, we will add the new url's SpiderRequest to spiderdb as well. So we handle all errors on an individual basis.

These error codes are all in Errno.h.

EBADTITLEREC - Document's TitleRec is corrupted and can not be read.
EURLHASNOIP - Url has no ip.
EDOCCGI - Document has CGI parms in url and allowCgiUrls is specified as false in the ruleset.
EDOCURLIP - Document is an IP-based url and allowIpUrls is specified as false in the ruleset.
EDOCBANNED - Document's ruleset has banned set to true.
EDOCDISALLOWED - Robots.txt forbids this document to be indexed. But, if it has incoming link text, Gigablast will index it anyway, but just index the link text.
EDOCURLSPAM - The url itself contains naughty words and do url sporn checking is enabled in the Spider Controls.
EDOCQUOTABREECH - The quota for this site has been exceeded. The quota is based on the quality of the url. See the quota section in the Overview file.
EDOCBADCONTENTTYPE - Content type, as returned in the mime reply and parsed out by HttpMime.cpp, is not supported for indexing.
EDOCBADHTTPSTATUS - Http status was 404 or some other bad status.
EDOCNOTMODIFIED - Spider Controls have use IfModifiedSince enabled and the document was not modified since the last time we indexed it.
EDOCREDIRECTSTOSELF - The mime redirects to itself.
EDOCTOOMANYREDIRECTS - Url had more than 6 redirects.
EDOCBADREDIRECTURL - The redirect url was empty.
EDOCSIMPLIFIEDREDIR - The document redirected to a simpler url, which had fewer path components, did not have cgi, or for whatever reason was prettier to look at. The current url will be discarded and the redirect url will be added to spiderdb.
EDOCNONCANONICAL - Doc has a canonical link referencing a url that was not itself. Used for deduping at spider time.
EDOCNODOLLAR - Document did not contain a dollar sign followed by a price. Used for building shopping indexes.
EDOCHASBADRSS - This should not happen.
EDOCISANCHORRSS - This should not happen.
EDOCHASRSSFEED - only index documents from rss feeds is true in the Spider Controls, the document indicates it is part of an RSS feed, and it does not currently have an RSS feed linking to it in the index. Gigablast will discard the document and add the url of the RSS feed to spiderdb. When that is spidered the url should be picked up again.
EDOCNOTRSS - The Spider Controls specify only index articles from rss feeds as true and the document is not part of an RSS feed.
EDOCDUP - According to checksumdb, a document already exists from this hostname with the same checksumdb hash. See the Deduping section.
EDOCTOOOLD - Document's last modified date is before the maxLastModifiedDate specified in the ruleset.
EDOCLANG - The document does not match the language given in the Spider Controls.
EDOCADULT - Document was detected as adult and adult documents are forbidden in the Spider Controls.
EDOCNOINDEX - Document has a noindex meta tag.
EDOCNOINDEX2 - Document's ruleset (SiteRec) says not to index it using the <indexDoc> tag. Probably used to just harvest links then.
EDOCBINARY - Document is detected as a binary file.
EDOCTOONEW - Document's last modified date is after the minLastModifiedDate specified in the ruleset.
EDOCTOOBIG - Document size is bigger than maxDocSize specified in the ruleset.
EDOCTOOSMALL - Document size is smaller than minDocSize specified in the ruleset.


The following error codes just set g_errno, for example when the error is a network error or something on our side. These errors are also in Errno.h.

ETTRYAGAIN
ENOMEM - We ran out of memory.
ENOSLOTS - We ran out of UDP sockets.
ECANCELLED - An administrator disabled spidering in the Master Controls, thereby cancelling all outstanding spiders.
EBADIP - Unable to get the IP address of the url.
EBADENGINEER
EIPHAMMER - We would hit the IP address too hard, violating sameIpWait in the Spider Controls, if we were to download this document.
ETIMEDOUT - We timed out downloading the document.
EDNSTIMEDOUT - We timed out looking up the IP of the url.
EBADREPLY - The DNS server sent us a bad reply.
EDNSDEAD - The DNS server was dead.


XmlDoc: The Heart of the Parser

Spider.cpp ultimately calls XmlDoc::index() to download and index the specified URL.

Xml

The Xml class is Gigablast's XML parsing class. You can set it by passing it a pointer to HTML or XML content along with a content length. If you pass it some strange character set, it will convert it to UTF-16 and use that to set its nodes. Each node is a tag or a non-tag. It uses the XmlNode class to tokenize the content into these nodes.

Links

The Links class is set by XmlDoc and gets all the links in a document. Links::hash(), called by XmlDoc::hash(), will index all the link:, ilink: or links: terms. It is also used to get link text by the LinkText class.

Words

All text documents can be broken up into "words". Gigablast's definition of a word is slightly different than normal. A word is defined as a sequence of alphanumeric characters or a sequence of non-alphanumeric characters. An HTML or XML tag is also considered to be an individual word. Words for Japanese or other similar languages are determined by a tokenizer Partap integrated.

The Phrases, Bits, Spam and Scores classes all contain arrays which are 1-1 with the Words class they were set from.

The Words class is primarily used by TermTable::set() which takes a string, breaks it down into words and phrases, and hashes the word and phrase ids into the hash table.

Phrases

Gigablast implements phrase searching by stringing words together and hashing them as a single unit. It uses the Phrases and Bits class to generate phrases from a Words class. Basically, every pair of words is considered a phrase. So if it sees "cd rom" in the document it will index the bigram "cdrom". It can easily be modified to index trigrams and quadgrams as well, but that will bloat the index somewhat.
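Here is a sketch of that bigram generation: every adjacent pair of words is hashed as a single unit, so "cd rom" contributes a phrase term equivalent to "cdrom". The hash function and word representation are illustrative assumptions.

#include <cstdint>
#include <string>
#include <vector>

// Generic 64-bit FNV-1a hash, standing in for Gigablast's real hash functions.
uint64_t hash64Sketch(const std::string &s) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned char c : s) { h ^= c; h *= 0x100000001b3ULL; }
    return h;
}

// Hash each adjacent word pair as one unit, e.g. "cd"+"rom" -> hash("cdrom").
std::vector<uint64_t> phraseIdsSketch(const std::vector<std::string> &words) {
    std::vector<uint64_t> ids;
    for (size_t i = 0; i + 1 < words.size(); i++)
        ids.push_back(hash64Sketch(words[i] + words[i + 1]));
    return ids;
}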

Bits

Before phrases can be generated from a Words class, Gigablast must set descriptor bits. Each word has a corresponding character in the Bits class that sets flags depending on different properties of the word. These bits are used for generating phrases.

Content Filtering

Gigablast supports multiple document types by saving the downloaded mime and content to a file, then calling the gbfilter program (gbfilter.cpp) which based on the content type in the mime will call pdftohtml, antiword, pstotext, etc. and send the html/text content back to gb via stdout.

DocIds

Every document indexed by Gigablast has a unique document id, called a docid. The docid is usually just the hash of the url it represents. Titledb::getProbableDocId() takes the url as a parameter and returns the probable docid for that url. The reason the docid is probable is that it may collide with the docid of an existing url and therefore will have to be changed. When we index a document, Msg14, the Msg class which is primarily responsible for indexing a document, calls Msg15::getOldDoc() to retrieve the TitleRec (record in Titledb) for the supplied url (the url being indexed). Msg15 will then call Msg22::getTitleRec() to actually get the TitleRec for that url. If a TitleRec is not found for the given url then Msg22.cpp will set its m_availDocId so Msg14 will know which available docid it can use for that url.

Msg22 will actually load a list of TitleRecs from titledb corresponding to all the possible actual docids for the url. When the probable docid for a new url collides with the docid of an existing url, as evidenced by the TitleRec list, then Gigablast increments the new url's docid by one. Gigablast will only change the last six bits (bits 0-5) of a docid for purposes of collision resolution and it will wrap the docid if it tops out (see Msg22::gotUrlListWrapper). Titledb::getFirstProbableDocId() and Titledb::getLastProbableDocId() define the range of actual docids available given a probable docid by masking out or adding bits 0-5. Bits 7 and up of the docid are used to determine which host in the network is responsible for storing the TitleRec for that docid, as illustrated in Titledb::getGroupId(). That is why collision resolution is limited to the lower six bits: we don't want collision resolution to resolve a TitleRec to another host; we want to keep it local. We also do not want the high bits of the docid used for determining the group (shard) which stores that docId, because when reading the termlist of linkers to a particular url, they are sorted by docId, which means we end up hitting one particular host in the network much harder than the rest when we attempt to extract the link text. If we have to increase these 6 bits in the future, then we may consider using some higher bits, maybe bits 12 and up, to avoid messing with the bits that control which host stores the TitleRec.
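A tiny sketch of the bit arithmetic described above, using illustrative names: only bits 0-5 of a probable docid may change during collision resolution, so the range of candidate actual docids is obtained by clearing or setting those bits, which is effectively what Titledb::getFirstProbableDocId() and Titledb::getLastProbableDocId() are described as doing.

#include <cstdint>

const uint64_t DOCID_RESOLVE_MASK = 0x3fULL;   // bits 0-5

uint64_t firstProbableDocIdSketch(uint64_t probable) {
    return probable & ~DOCID_RESOLVE_MASK;     // clear bits 0-5
}

uint64_t lastProbableDocIdSketch(uint64_t probable) {
    return probable |  DOCID_RESOLVE_MASK;     // set bits 0-5
}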

When Msg15 asks Msg22 for the TitleRec of a url being indexed, Msg22 will either return the TitleRec or, if the TitleRec does not exist for that url, it will set Msg22::m_availDocId to an available docid for that url to use. When looking up a TitleRec from a url, the first thing Msg22 does is define a startKey and endKey for a fetch of records from Urldb. Urldb consists of a bunch of dataless keys. Each of these keys consist of a docid, a hash extension and a file number.

If the file number in a Urldb record is 255 then the corresponding docid is a probable docid (just a hash of the url as computed in Titledb::getProbableDocId()) and the TitleRec is either in Titledb's RdbTree or the TitleRec does not exist, but that url is slated for spidering and is contained in Spiderdb. This makes it easy to avoid adding duplicate urls to Spiderdb. Because different urls in Spiderdb may have the same probable docid, each Urldb record contains a 7-bit hash extension of the url used to reduce the chance of collision. The 38-bit docid hash plus this 7-bit extension gives an effective docid of 45 bits. Unfortunately, if a url has the same 45-bit hash as another different url it will not be permitted into Spiderdb and therefore will not be indexed. We have to rely on the extended hash because we cannot store the actual url in every Urldb record, since keeping Urldb mostly in memory is very important for performance reasons.

If the file number in a Urldb record is not 255, then it refers to the particular Titledb file that contains the TitleRec for the corresponding docid. In this case, the docid in the Urldb record is the actual docid of the url, not the probable docid. In most cases the actual docid is the same as the probable docid, but when two different urls hash to the same probable docid, Msg22 will increment the actual docid of the url being added until it finds an unused actual docid.

The startKey and endKey for this particular Urldb request are constructed using the lowest and highest actual docids possible for the url. Because we only know the probable docid for that url, we have to assume that its lower six bits were changed in order to form its actual docid. Once we get back the list of records from Urldb we compute the extended hash of the url we are looking up and we scan through the list of records to see if any have the same extended hash. If a Urldb record in the list has the same extended hash, then we can extract the file number and consult the corresponding Titledb file to get the TitleRec. If the file number is 255 then we check for the TitleRec in Titledb's tree. If it is not in the tree, then the url is probably slated for spidering but not yet in the index, so we conclude the TitleRec does not exist.

If we do not find a TitleRec for the url then we must return an available actual docid for the url, so the caller can index the url. In this case, in Msg22::gotUrlListWrapper() [line 292 in Msg22.cpp], we assume the probable docid is the actual docid. This assumption is a bug, because the probable docid may have been present in one of the Urldb records in the list, but the extended hash may not have matched the extended hash of the url. So this needs to be fixed.

SIDE NOTE: To avoid having to lookup urls associated with each docid, a Query Reindex request will populate Spiderdb with docid-based records, rather than the typical url-based records. In this case, Msg15 can pass the exact docid to Msg22, so Msg22 does not have to worry about collisions and having to scan through a list of TitleRecs.

Scaling to Many Docids

Currently, we are using 38 bit docids which allows for over 270 billion docids. Using the lower 6 bits of each docid for the chaining purposes described above means that there will be a problem if a probable docid is unable to be turned into an available actual docid by only changing the lower 6 bits of the probable docid, because all the possible actual docids are occupied. These 6 bits correspond to 32 different values, therefore if we think of the range of possible actual docids being divided up into buckets, where each bucket is 32 consecutive docids, we might ask: what number of full buckets can we expect? Once a bucket is full we will have to turn away any url that hashes into that bucket. It turns out that when we have 128 billion documents the expected number of full buckets is only 29.80.

If we have 128 billion pages out of a possible 256 billion, each docid has a 50% probability of being used. Therefore, the probability of 32 consecutive used docids is (.5)^32 and the expected number of full buckets is (.5)^32 * 128 billion which is 29.80. If we have 16 billion pages out of the 256 billion the number of expected full buckets is a very tiny fraction. This is all assuming we have a uniformly distributed hash function.
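That arithmetic can be reproduced directly; this just restates the text's own estimate under its stated assumptions of a uniform hash and 32-docid buckets.

#include <cmath>
#include <cstdio>

int main() {
    double p       = 0.5;    // 128 billion used out of a possible 256 billion
    double numDocs = 128e9;
    // Expected full buckets per the text: P(32 consecutive used docids) * 128 billion.
    printf("%.2f\n", pow(p, 32) * numDocs);   // prints ~29.80
    return 0;
}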

The next scaling problem is that of urldb. Urldb enhances the probable docid with a 7-bit extension. The probability that a url's probable docid collides with another url's actual docid is fairly good in a 16 billion page index, it is 1/16. But then we multiply by 128 to get 1 in 2048, which is lousy. Therefore we should grow the extended bits by 16 so it is at least 1 in 134 million for a 16 billion page index which is ok because that means only about 100 urls will not be able to be indexed because of a collision in urldb. And with this 16-bit growth, we should still get decent compression for large indexes. That is, if we have 130 million pages per server, we'd have 1<<24 bits of the docid in the upper 6 bytes of the urldb key. That means we'd expect about 7 keys on average to share the same 6 bytes, (130,000,000/(1<<24)) = 7.74. That would still constitute about a 12% growth over current urldb sizes, though.


The Search Results Layer

This layer is responsible for taking a query as input and returning search results. Gigablast is capable of handling a few queries per second for every server in the cluster, so if you double the number of servers you double the query throughput.

Gigablast uses the Query class to break a query up into termIds. Each termId is a hash of a word or phrase in the query. Then the IndexList for each termId is loaded and they are all intersected (by docId) in a hash table, and their scores are mapped and accumulated as well. Msg39 computes the top X docIds and then uses Msg38 to get the site cluster hash and possibly the sample vector of each top docId. Clustered and duplicate results (docIds) are removed and if the total remaining is less than what was requested, Msg39 redoes the intersection to try to get more docIds to compensate. The final docIds are returned to Msg40 which uses Msg20 to fetch a summary for each docId. Then PageResults displays the final search results.

Associated Classes (.cpp and .h files)

Ads (Search) - Interface to third party ad server.
Highlight (Search) - Highlights query terms in a document or summary.
IndexList (Search) - Derived from RdbList. Used specifically for processing Indexdb RdbLists.
IndexReadInfo (Search) - Tells Gigablast how much of what IndexLists to read from Indexdb to satisfy a query.
IndexTable (Search) - Intersects IndexLists to get the final docIds to satisfy a query.
Matches (Search) - Identifies words in a document or string that match supplied query terms. Used by Highlight.
Msg17 (Search) - Used by Msg40 for distributed caching of search result pages.
Msg1a (Search) - Gets the reference pages from a set of search results.
Msg1b (Search) - Gets the related pages from a set of search results and reference pages.
Msg2 (Search) - Given a list of termIds, downloads their respective IndexLists.
Msg20 (Search) - Given a docId and query, returns a summary or document excerpt. Used by Msg40.
Msg33 (Search) - Unused. Did raid stuff.
Msg36 (Search) - Gets the length of an IndexList for determining query term weights.
Msg37 (Search) - Calls a Msg36 for each term in the query.
Msg38 (Search) - Returns the Clusterdb record for a docId. May also get the Titledb record if its key is in the RdbMap.
Msg39 (Search) - Intersects IndexLists to get the list of docIds satisfying the query. Uses Msg38 to cluster away dups and same-site results. Re-intersects lists to get more docIds if too many were removed. Uses Msg2, Msg38, IndexReadInfo and IndexTable. This and IndexTable are the heart of the query resolution process.
Msg3a (Search) - Calls multiple Msg39s to distribute the query based on docId parity. One host computes the even docId search results, the other the odd. And so on for different parity levels. Merges the docIds into a final list.
Msg40 (Search) - Uses Msg20 to get the summaries for the final list of docIds returned from Msg3a.
Msg40Cache (Search) - Used by Msg17 to cache search results pages. Basically, caching serialized Msg40s.
Msg41 (Search) - Queries multiple clusters and merges the results.
PageDirectory (Search) - HTML page to display a DMOZ directory page.
PageGet (Search) - HTML page to display a cached web page from titledb with optional query term highlighting.
PageResults (Search) - HTML/XML page to display the search results.
PageRoot (Search) - HTML page to display the root page.
Query (Search) - Parses a query up into QueryWords which are then parsed into QueryTerms. Makes a boolean truth table for boolean queries.
SearchInput (Search) - Used to parse, contain and manage all parameters passed in for doing a query.
Speller (Search) - Performs spell checking on a query. Returns a single recommended spelling of the query.
Summary (Search) - Generates a summary given a document and a query.
Title (Search) - Generates a title for a document. Usually just the <title> tag.
TopTree (Search) - A balanced binary tree used for getting the top-scoring X search results from intersecting IndexLists in IndexTable, where X is a large number. Normally we just do a linear scan to find the minimum scoring docId and replace it with a higher scoring docId, but when X is large this linear scan process is too slow.


The Query Parser

The query syntax is described on gigablast.com. The Query class is responsible for converting a query into a list of termIds, which are used to look up termlists in the index.

Raiding TermList Intersections

An option for improving efficiency and parallelization is to "raid" term list disk reads and intersections. There are multiple layers of "raiding" to distribute both disk reads and CPU use. One level of disk raiding, called Indexdb Splitting, is implemented during indexing. Indexdb term lists are split between multiple hosts so that smaller parts of the full list may be read in parallel. Another level of raiding splits the term list intersection between multiple "mercenary" hosts. This can be done using Msg33, which replaces the functionality typically performed by Msg2 and some of that performed by Msg39. An additional level of raiding may be performed by Msg33 to further split term list reads between twins.

Indexdb Splitting is enabled in the code by defining the SPLIT_INDEXDB #define. The number of splits to make is set with the INDEXDB_SPLIT #define. For a single given term list, this will split which group (shard) each termId/docid entry is stored on based on the docid. The separate pieces of one term list will be retrieved and combined in one of two ways. For general purpose access, a single Msg0 call may be made the same way it would normally be done without splitting. Msg0 will then forward the request to all hosts storing a split piece of the desired list. When each piece is collected, Msg0 will merge them together into one list. Alternately, for queries, Msg33 may be used to take further advantage of the splitting (see below).

The primary function of Msg33 is to split term lists apart so that the intersection can be done in parallel. Msg33 actually consists of multiple stages. In the initial stage, "stage 0", the host processing Msg39 will create a Msg33 and allow it to retrieve the docids required. This Msg33 will choose a number of hosts to act as "mercenary" machines, hosts which will receive a piece of each term list and intersect those pieces together. The number of mercenaries is either set as a parm or forced to the number of Indexdb Splits when using Indexdb Splitting. Each mercenary is essentially responsible for a set of docids.

When not using Indexdb Splitting, the stage0 Msg33 will then call another Msg33 to each term list host, one for every term, executing "stage 1". When using Indexdb Splitting, a Msg33 will be sent to each host containing a split piece for every term list.

The stage1 Msg33 on each term list host will read its specific term list. When not using Indexdb Splitting, that term list will then be broken apart and sent with another Msg33 call to every mercenary machine which will perform intersection, executing "stage 2". When using Indexdb Splitting, the term lists are effectively already split up and the piece read by stage1 will be sent as a whole to the one mercenary machine responsible for that set of docids.

The stage2 Msg33 on each mercenary machine will wait for all the requests from each stage1 host to be received, storing each in a global queue. When all requests have been received, stage2 will intersect the term lists together, generating a partial set of final docids. These final docids will then be sent with another Msg33 back to the original host (the one Msg39 is being processed on), executing "stage 3".

Like stage2, the stage3 Msg33 (also the stage0 Msg33) waits for all sets of docids to be received from each mercenary. Once all have been received, stage3 will merge them together and create a true final set of docids to be used by Msg39. Msg39 is allowed to continue before replies are sent back down the chain of Msg33s to complete the entire process.

The entire Msg33 chain may be called multiple times (preserving the same stage0/3 and stage2 Msg33s) for each tier that is required by the Query. Replies are not sent and the entire process is not cleaned up until the last tier has finished.

An additional ability of Msg33 is to further split the term list reads up amongst twins during stage1. Each twin will read a separate half of the desired term list and send that half to the mercenaries, where all pieces will eventually be collected.

The Spell Checker

Spell checking is done by the Speller class. At startup the Speller class, the global g_speller, loads in a set of files, called dictionary files. These files are contained in the 'dict' subdirectory in the working directory of the Gigablast process. Each letter of the alphabet, and each digit, has its own dictionary file which exclusively contains the most popular words and phrases that begin with that particular letter (or digit). Each of these files contains one word or phrase per line, in addition to a score, which precedes each word or phrase. This score is representative of how popular that word or phrase is.

Using 'gb gendicts' on the command line, you can tell Gigablast to generate these dictionary files from the titledb files. The entry routine for this is Speller::generateDicts(numWordsToDump). It will scan through titledb getting each titleRec and storing the words (parsed out with the Xml.cpp and Words.cpp classes) and phrases it finds from each titleRec into several temporary dictionary files. A word that starts with the letter 'a', for instance, will be stored in a different temporary dictionary file than a word that starts with the letter 'b'. Gigablast will stop scanning titleRecs once it has dumped out numWordsToDump words.

After storing numWordsToDump words into the temporary dictionary files, Gigablast sorts each file by making a system call to 'sort'. Then it converts each temporary dictionary file into another temporary file by scoring each word. Since each temporary dictionary file is sorted, scoring each word is simple. The score is just the number of times a particular word or phrase occurred throughout the titledb scan. Lastly, that second temporary file is sorted by score to get the actual dictionary file. Gigablast will throw away words or phrases with a score less than MIN_DOCS, currently #define'd to 3.

At startup, Gigablast loads all of the words and phrases from all of the dictionary files. It finds the highest score and normalizes all words and phrases by that score so that the normalized scores range from 1 to 10000, where 10000 is the most popular word or phrase. Furthermore, Gigablast loads a file called 'words', also contained in the 'dict' subdirectory, which is more or less a copy of /usr/share/dict/words. A lot of times during a titledb scan Gigablast will not pick up all the words that a dictionary has, so Gigablast will augment each dict file with the words from this 'words' file. It is a convenient way of adding any words, that begin with any letter, to Gigablast's spell checker.

Next, for each letter x, Gigablast computes a linked list for every pair of letters, containing the words that start with the letter x and contain that adjacent pair. So, when x is c the linked list for the letters 'a' and 't', for example, would contain pointers to the words "cat", "carat" and "crate". So when it comes to spell checking you can fetch such a linked list for each pair of adjacent letters in the word in question, and intersect those lists to get spelling recommendations. But before doing that Gigablast also hashes each word from the dictionary files into a hashtable (Speller::makeHashTable()), so if the word in question matches a word in the hashtable then no recommendation is computed.
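As a rough illustration, here is a simplified sketch that indexes dictionary words by adjacent letter pair and then gathers candidate words for a possibly misspelled input. The real Speller keeps per-first-letter linked lists inside m_table and ranks candidates by edit distance; this sketch simply unions the pair lists rather than intersecting them, and all names are illustrative.

#include <map>
#include <set>
#include <string>
#include <utility>

typedef std::pair<char,char> LetterPairSketch;
static std::map<LetterPairSketch, std::set<std::string>> s_pairIndex;

// Remember this dictionary word under each of its adjacent letter pairs.
void addDictWordSketch(const std::string &w) {
    for (size_t i = 0; i + 1 < w.size(); i++)
        s_pairIndex[{w[i], w[i+1]}].insert(w);
}

// Gather candidate recommendations: dictionary words sharing at least one
// adjacent letter pair with the input. A real checker would then rank these
// by edit distance, as the text describes.
std::set<std::string> candidatesSketch(const std::string &w) {
    std::set<std::string> out;
    for (size_t i = 0; i + 1 < w.size(); i++) {
        auto it = s_pairIndex.find({w[i], w[i+1]});
        if (it != s_pairIndex.end())
            out.insert(it->second.begin(), it->second.end());
    }
    return out;
}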

As a memory saving procedure, which may now be somewhat unnecessary, Gigablast converts each letter in every word to a "dictionary character" using the to_dict_char() function. There are currently 40 (see the #define for NUM_CHARS) dictionary characters, essentially just a-z and 0-9 plus a few punctuation marks (see Speller.h). This normalization procedure means that Gigablast can deal with fewer possible letter pairs and save some memory. This will present problems when trying to spell check Unicode characters, of course, and really, the whole letter pair thing may not be that applicable to Unicode, but it should still work in utf-8 at least, although spelling recommendations might be a little strange sometimes.

Besides computing spelling recommendations, the dictionary files are used to quickly measure a word's popularity. Unfortunately, before you can get the popularity of a word or phrase, it is currently converted into dictionary characters before it is hashed. This process is used extensively by the Gigabits generation code. It is essential that we know the popularity/score of a word or phrase. When getting the popularity of a Unicode word or phrase, converting it to dictionary characters may not be the best thing to do. So, all-in-all, if we can get rid of the dictionary character conversion requirements, I think getting the score of a Unicode word will work much better.

Increasing NUM_CHARS to 256 may be a good idea, but there is a long Speller::m_table[NUM_CHARS][NUM_CHARS][NUM_CHARS] which will jump from 256k to 67MB if we raise NUM_CHARS from 40 to 256. The first component in m_table is the letter that the possibly misspelled word starts with, and the second and third indices are letter pairs in the word. (m_table is how we map a particular letter pair to a linked list of words that have that adjacent letter pair.) This means that all spelling recommendations for a word will start with the same letter as that word. It was done to speed things up and is definitely a bad short cut to take. But, since Gigablast computes the "edit distance" between the word in question and every word in the linked list, making the linked list about 1/40th the size really speeds up the computation.

I'm not too concerned with spell checking Unicode words right now, but I would like to have Unicode Gigabits. So if we could hash the words from the dictionary files without converting them to "dictionary characters" but just converting them to "lower case" I think we could handle Unicode words as far as getting their popularities is concerned. We could reserve dictionary characters for computing spelling recommendations only. Any ideas?


The Administrative Layer


Associated Classes (.cpp and .h files)

AutoBan (Admin) - Automatically bans IP addresses that exceed daily and minute query quotas.
CollectionRec (Admin) - Holds all of the parameters for a particular search collection.
Collectiondb (Admin) - Manages all the CollectionRecs.
Conf (Admin) - Holds all of the parameters that are not collection specific (Collectiondb does that), like the maximum memory for the gb process to use, for instance. Corresponds to the gb.conf file.
Hostdb (Admin) - Contains the array of Hosts in the network. Each Host has various stats, like ping time and IP addresses.
Msg1c (Admin) - Performs spam analysis on an IP.
Msg1d (Admin) - Bans documents identified as spam.
Msg30 (Admin) - Unused.
PageAddColl (Admin) - HTML page to add a new collection.
PageHosts (Admin) - HTML page to display all the hosts in the cluster. Shows ping times for each host.
PageLogin (Admin) - HTML page to log in as master admin or as a collection's admin.
PageOverview (Admin) - HTML page to present the help section.
PagePerf (Admin) - HTML page to show the performance graph.
PageSockets (Admin) - HTML page for showing existing network connections for both the TCP and UDP servers.
PageStats (Admin) - HTML page for showing various server statistics.
Pages (Admin) - Framework for displaying generic HTML pages as described by Parms.cpp.
Parms (Admin) - All of the control parameters for the gb process or for a particular collection are stored in this file. Some controls are assigned to a specific page id so Pages.cpp can generate the HTML page automatically for controlling those parameters.
PingServer (Admin) - Does round-robin pinging of every host in the cluster. Ping times are displayed on PageHosts.
Stats (Admin) - Holds various statistics that PagePerf displays.
Sync (Admin) - Unused. Syncs two twin Rdbs together.

Collections

Gigablast can support multiple collections. Each collection is essentially a sub-index. You can create a collection, index urls into it, and just search those urls. Each collection also has its own set of configurable parameters specified in the CollectionRec class. Collectiondb contains the array of CollectionRecs and is used to map a collection name to a CollectionRec.

Currently, Gigablast uses just one Indexdb (Titledb, etc.) to hold data for all of the searchable collections. When constructing the key of a record to be added to the Rdb of a particular collection, the collection name itself will often be used, by just hashing it into the key. In this manner, intercollection key collisions are made highly unlikely.

A defect of this model is that deleting a collection in its entirety is not easy since its records are intermixed with records from other collections in Indexdb, Titledb, Spiderdb and Checksumdb. Therefore, we are moving away from this model and creating a different directory for each collection and storing that collection's records just in its directory. This new model requires that the RdbTree class be modified to hold the collection number of each record. Also, RdbDump must be modified to dump each record into the file in the proper directory, as determined by the collection. We can still just use one instance of Indexdb, Titledb, etc., but we should add an extra dimension to the Rdb class's "BigFile *m_files[];" array for the extra collections we have. When performing any kind of operation, the Rdb class should select the appropriate files based on the supplied collection number. When a collection is deleted we must also remember to remove its records from the RdbTree, in addition to removing the directory for that collection. Once implemented, this new model will vastly increase the value of the Gigablast search product and make people much happier.

Parameters and Controls

The CollectionRec class houses all of the configuration parameters used to customize how a particular collection is managed, how its search results are displayed and how it indexes documents. There is one instance of the CollectionRec class per collection.

The Conf class houses all of the configuration parameters for things that are collection-independent, like the maximum amount of memory the process can use, for instance.

Each CollectionRec has an xml-based configuration file called "coll.conf" stored in the collection's subdirectory. The Conf class is instantiated globally as g_conf and is initialized from the xml-based file named "gb.conf", as described in the user-based overview.

All of the configuration parameters in these two classes are described in an array called m_parms which is contained in the Parms class. Each element in this array is a Parm class. Each element has the following members:
  • a cgi variable name for setting the parameter from a browser HTTP request,
  • a title and description, for displaying the parameter on a web-based admin page
  • an xml name, for loading/saving the parameter from/to an xml config file,
  • a "cast" setting, which is 1 if the parameter's value change should be broadcast to all hosts in the network,
  • the type of parameter, like boolean, float or string,
  • the size of the parameter, like 4 bytes for a float, X for a string,
  • the offset of the parameter in Conf or CollectionRec,
  • the maximum number of elements, if an array, of that parameter,
  • and many more. See Pages.h and Pages.cpp.
The Parms class automates the display and control of all configuration parameters through unified mechanisms. This allows the easy addition and removal of parameters, makes it easy to read and save parameters to an xml-based config file, and makes it easy to set parameters from a browser HTTP request.
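Here is a hedged sketch of what one entry in the m_parms array carries, based on the member list above. The field names and types are illustrative; see Parms.cpp and Pages.h for the real Parm class.

#include <cstddef>

enum ParmTypeSketch { PARM_BOOL, PARM_LONG, PARM_FLOAT, PARM_STRING };

struct ParmSketch {
    const char    *cgiName;     // cgi variable for setting it from a browser
    const char    *title;       // shown on the web-based admin page
    const char    *description; // shown on the web-based admin page
    const char    *xmlName;     // tag used in gb.conf / coll.conf
    bool           cast;        // broadcast value changes to all hosts?
    ParmTypeSketch type;        // boolean, float, string, ...
    int            size;        // bytes: 4 for a float, max length for a string
    size_t         offset;      // offset of the value in Conf or CollectionRec
    int            maxElements; // maximum number of elements, if an array
};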

The Log System

Logging is done with a command like:
log(LOG_DEBUG,"query: a query debug message #%li.",n);
The first parameter to the log() subroutine is the type of log message. These types are defined in Log.h.

The second parameter is a format string. The first word in the format string is the subtype of the log message. The subtype must be followed by a colon and be a particular word as already used in the source code. Please do not make up your own. The subtypes (taken from Log.h) are as follows:

Subtype - Description
addurls - related to adding urls
admin - related to administrative things, sync file, collections
build - related to indexing (high level)
conf - configuration issues
disk - disk reads and writes
dns - dns networking
http - http networking
loop
net - network layer: multicast, pingserver. Sits atop udpserver.
query - related to querying (high level)
rdb - generic rdb things
spcache - related to determining what urls to spider next
speller - query spell checking
thread - calling threads
topics - related topics
udp - udp networking
uni - unicode parsing

These various types and subtypes can be toggled on and off in the Log control page via the administrative web GUI. To force a message to be logged despite the control setting, use logf().

All log messages are currently logged to stderr but may use a rotating file scheme in the future.

Changing Live Clusters

This is a brief overview of how to switch the live www.gigablast.com site from one cluster to another, what changes are required, and what things to keep in mind. The cluster which is to become live will be referred to as the "new cluster" and the one which is to be replaced will be referred to as the "old cluster".

Diff the gb.conf and coll.conf(s) between the new cluster and the old cluster to check for undesired differences. If no real-time spidering will be done, the readOnlyMode option should be set to 1 in the new cluster's gb.conf. Additionally, run blaster against the new cluster to check for potential bugs in the new version of gb. Then leave the new cluster up and ready to serve queries.

On both the primary and secondary dns servers, edit /etc/bind/db.gigablast.com. Comment out the "www", "dir", "gov", "@", and any other hostnames currently pointing to the old cluster. Replicate these lines (or uncomment old existing lines) and point them to the IP of the new cluster. Kill and restart named using: "killall -9 named ; /etc/rc3.d/S15bind9 start". If everything is working correctly, traffic will begin to shift over to the new cluster. Both clusters should be left running until the old cluster no longer sees any traffic.

Also verify that port 80 on the new cluster is correctly mapped to port 8000. Check near the bottom of /etc/rcS.d/S99local for the line:
/sbin/iptables -t nat -A PREROUTING -p tcp -m tcp --dport 80 -j DNAT --to-destination 64.62.168.XX:8000
This command should be executed at startup to perform the mapping.

After waiting a couple of days, force any traffic that has hard-coded the IP of gigablast.com to start using the new cluster by running the following commands as root on host #0 of the old cluster. In this example, gf0 (64.62.168.3) is host #0 of the old cluster and gb1 (64.62.168.52) is host #0 of the new cluster:
gf0:/a# echo 1 > /proc/sys/net/ipv4/ip_forward 
gf0:/a# /sbin/iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j DNAT --to 64.62.168.52:8000
gf0:/a# /sbin/iptables -t nat -A POSTROUTING -d 64.62.168.52 -p tcp --dport 8000 -j SNAT --to 64.62.168.3


The Core Layer


Associated Classes (.cpp and .h files)

BigFile | Core | A virtual file class that allows virtual files bigger than 2GB by using smaller 512MB files.
Dir | Core | Used to read the files in a directory.
DiskPageCache | Core | Used by BigFile to read and write from/to a page cache.
Domains | Core | Used to extract the Top Level Domain (TLD) from a url.
Entities | Core | List of all the various HTML entities.
Errno | Core | List of all the error codes and their associated error messages. Used by mstrerror().
File | Core | A basic file class that recycles file descriptors to get around the 1024 limit.
HashTable | Core | A basic hashtable that grows automatically.
HashTableT | Core | A templatized version of HashTable.cpp.
Log | Core | Used to log messages.
Loop | Core | Used to control the flow of execution. Reacts to signals.
Mem | Core | A malloc and new wrapper that helps isolate memory leaks, prevent double frees and identify overflows and underflows as well as track and limit memory consumption.
Mime | Core | Unused.
Strings | Core | Unused.
Threads | Core | Gigablast's own threads class which uses the Linux clone() call to do its own LWP threads.
Unicode | Core | Unicode support.
UnicodeProperties | Core | Unicode support.
Url | Core | For breaking a url up into its various components.
Vector | Core | Unused.
blaster | Core | Downloads the urls listed in a file in parallel. Very useful for testing query performance.
create_ucd_tables | Core | Unicode support.
dnstest | Core | Tests a dns server by sending it a bunch of lookup requests.
fctypes | Core | Various little functions.
gbfilter | Core | Called via a system call by Gigabot to convert pdfs, Microsoft Word documents, PowerPoint documents, etc. to HTML.
hash | Core | Contains a bunch of fast and convenient hashing functions.
iana_charset | Core | Unicode support. Autogenerated.
ip | Core | Routines for manipulating IP addresses.
main | Core | The main.cpp file.
monitor | Core | Monitors an external gb process and sends an email alert on 3 failed queries in a row.
thunder | Core | Tests UDP throughput between two machines.
types | Core | Defines the widely used key_t type, a 12-byte key, and some functions for 16-byte keys.
uniq2 | Core | Like the 'uniq' command but also counts the occurrences and prints those out.
urlinfo | Core | Displays information for a url given through stdin.


File List


Here is a list and description of all the source code files.

File (.cpp or .h) | Layer | Description
Ads | Search | Interface to third party ad server.
AdultBit | Build | Used to detect if document content is naughty.
AutoBan | Admin | Automatically bans IP addresses that exceed daily and minute query quotas.
BigFile | Core | A virtual file class that allows virtual files bigger than 2GB by using smaller 512MB files.
Bits | Build | Sets descriptor bits for each word in a Words class.
Categories | Build | Stores DMOZ categories in a hierarchy.
Checksumdb | DB | Rdb that maps a docId to a checksum for an indexed document. Used to dedup same content from the same hostname at build time.
Clusterdb | DB | Rdb that maps a docId to the hash of a site and its family filter bit and, optionally, a sample vector used for deduping search results. Used for site clustering, family filtering and deduping at query time.
CollectionRec | Admin | Holds all of the parameters for a particular search collection.
Collectiondb | Admin | Manages all the CollectionRecs.
Conf | Admin | Holds all of the parameters that are not collection specific (Collectiondb handles those), like the maximum memory for the gb process to use. Corresponds to the gb.conf file.
DateParse | Build | Extracts the publish date from a document.
Datedb | DB | Like indexdb, but its scores are 4-byte dates.
Dir | Core | Used to read the files in a directory.
DiskPageCache | Core | Used by BigFile to read and write from/to a page cache.
Dns | Net | A DNS client built on top of the UdpServer class.
DnsProtocol | Net | Uses UdpServer to make a protocol for talking to DNS servers. Used by the Dns class.
Doc | Build | A container class to store the titleRec and siteRec and other classes associated with a document.
Domains | Core | Used to extract the Top Level Domain (TLD) from a url.
Entities | Core | List of all the various HTML entities.
Errno | Core | List of all the error codes and their associated error messages. Used by mstrerror().
File | Core | A basic file class that recycles file descriptors to get around the 1024 limit.
HashTable | Core | A basic hashtable that grows automatically.
HashTableT | Core | A templatized version of HashTable.cpp.
Highlight | Search | Highlights query terms in a document or summary.
Hostdb | Admin | Contains the array of Hosts in the network. Each Host has various stats, like ping time and IP addresses.
HttpMime | Net | Creates and parses an HTTP MIME header.
HttpRequest | Net | Creates and parses an HTTP request.
HttpServer | Net | Gigablast's highly efficient web server. Contains a TcpServer class.
IndexList | Search | Derived from RdbList. Used specifically for processing Indexdb RdbLists.
IndexReadInfo | Search | Tells Gigablast how much of which IndexLists to read from Indexdb to satisfy a query.
IndexTable | Search | Intersects IndexLists to get the final docIds to satisfy a query.
Indexdb | DB | Rdb that maps a termId to a score and docId pair. The search index is stored in Indexdb.
Lang | Build | Unused.
Language.h | Build | Enumerates the various languages supported by Gigablast's language detector.
LangList | Build | Interface to the language-specific dictionaries used for language identification by XmlDoc::getLanguage().
LinkInfo | Build | Used by the link analysis routine. Contains an array of LinkTexts.
LinkText | Build | Contains the link information for a specific document that links to a specific url, such as quality, IP address, number of total outgoing links, and the link text for that url.
Links | Build | Parses out all the outgoing links in a document.
Log | Core | Used to log messages.
Loop | Core | Used to control the flow of execution. Reacts to signals.
Matches | Search | Identifies words in a document or string that match supplied query terms. Used by Highlight.
Mem | Core | A malloc and new wrapper that helps isolate memory leaks, prevent double frees and identify overflows and underflows as well as track and limit memory consumption.
MemPool | DB | Used by RdbTree to add new records to the tree without having to do an individual malloc.
MemPoolTree | DB | Unused. Was our own malloc routine.
Mime | Core | Unused.
Msg0 | DB | Fetches an RdbList from across the network.
Msg1 | DB | Adds all the records in an RdbList to various hosts in the network.
Msg10 | Build | Adds a list of urls to spiderdb for spidering.
Msg13 | Build | Tells a server to download robots.txt (or get it from cache) and report whether Gigabot has permission to download it.
Msg14 | Build | The core class for indexing a document.
Msg15 | Build | Called by Msg14 to set the Doc class from the previously indexed TitleRec.
Msg16 | Build | Called by Msg14 to download the document and create a new titleRec to set the Doc class with.
Msg17 | Search | Used by Msg40 for distributed caching of search result pages.
Msg18 | Build | Unused. Was used for supporting soft banning.
Msg19 | Build | Determines if a document is a duplicate of a document already indexed from that same hostname.
Msg1a | Search | Gets the reference pages from a set of search results.
Msg1b | Search | Gets the related pages from a set of search results and reference pages.
Msg1c | Admin | Performs spam analysis on an IP.
Msg1d | Admin | Bans documents identified as spam.
Msg2 | Search | Given a list of termIds, downloads their respective IndexLists.
Msg20 | Search | Given a docId and query, returns a summary or document excerpt. Used by Msg40.
Msg22 | Search/Build | Returns the TitleRec for a docId or url.
Msg23 | Build | Gets the link text in a document that links to a specified url. Also returns other info besides that link text.
Msg24 | Search | Gets the gigabits (aka related topics) for a query.
Msg25 | Build | Sets the LinkInfo class for a document.
Msg28 | Admin | Sets a particular parm or set of parms on all hosts in the cluster.
Msg2a | Admin | Makes catdb, for assigning documents to a category in DMOZ.
Msg2b | Admin | Makes catdb, for assigning documents to a category in DMOZ.
Msg3 | DB | Reads an RdbList from several consecutive files in a particular Rdb.
Msg37 | Search | Calls a Msg36 for each term in the query.
Msg38 | Search | Returns the Clusterdb record for a docId. May also get it from the Titledb record if its key is in the RdbMap.
Msg39 | Search | Intersects IndexLists to get the list of docIds satisfying the query. Uses Msg38 to cluster away dups and same-site results. Re-intersects lists to get more docIds if too many were removed. Uses Msg2, Msg38, IndexReadInfo, IndexTable. This and IndexTable are the heart of the query resolution process.
Msg3a | Search | Calls multiple Msg39s to distribute the query based on docId parity. One host computes the even docId search results, another the odd, and so on for different parity levels. Merges the docIds into a final list.
Msg40 | Search | Uses Msg3a to get the final docIds in the search results. Uses Msg20 to get the summaries for these docIds.
Msg40Cache | Search | Used by Msg17 to cache search results pages. Basically, caching serialized Msg40s.
Msg41 | Search | Sends Msg40s to multiple clusters and merges the results.
Msg5 | DB | Uses Msg3 to read RdbLists from multiple files and then merges those lists into a single RdbList. Does corruption detection and repair. Integrates the list from the RdbTree into the single RdbList.
Msg7 | Build | Injects a url into the index using Msg14.
Msg8 | Build | Gets the Sitedb record given a url.
Msg9 | Build | Adds a Sitedb record to Sitedb for a given site/url.
MsgB | DB | Unused. A distributed cache for caching anything.
Multicast | Net | Used to reroute a request if it fails to be answered in time. Also used to send a request to multiple hosts in the cluster, usually to a group (shard) for data storage purposes.
PageAddColl | Admin | HTML page to add a new collection.
PageAddUrl | Build | HTML page to add a url or file of urls to spiderdb.
PageCatdb | Admin/Build | HTML page to look up the categories of a url in catdb.
PageDirectory | Search | HTML page to display a DMOZ directory page.
PageGet | Search | HTML page to display a cached web page from titledb with optional query term highlighting.
PageHosts | Admin | HTML page to display all the hosts in the cluster. Shows ping times for each host.
PageIndexdb | Admin/Search | HTML page to display an IndexList for a given query term or termId. Can also add or delete individual Indexdb records.
PageInject | Build | HTML page to inject a page directly into the index.
PageLogin | Admin | HTML page to login as master admin or as a collection's admin.
PageOverview | Admin | HTML page to present the help section.
PageParser | Admin/Build | HTML page to show how a document is analyzed, parsed and its terms are scored and indexed.
PagePerf | Admin | HTML page to show the performance graph.
PageReindex | Admin/Build | HTML page to reindex or delete the search results for a single term query.
PageResults | Search | HTML/XML page to display the search results.
PageRoot | Search | HTML page to display the root page.
PageSitedb | Admin/Build | HTML page to allow urls or sites to be entered into Sitedb, for assigning spidered urls to a ruleset.
PageSockets | Admin | HTML page for showing existing network connections for both TCP and UDP servers.
PageSpamr | Admin/Build | HTML page for removing spam from the index.
PageSpiderdb | Admin/Build | HTML page for showing the status of spiders and what is in spiderdb.
PageStats | Admin | HTML page for showing various server statistics.
PageTitledb | Admin/Build | HTML page for showing a Titledb record for a given docId.
Pages | Admin | Framework for displaying generic HTML pages as described by Parms.cpp.
Parms | Admin | All of the control parameters for the gb process or for a particular collection are stored in this file. Some controls are assigned to a specific page id so Pages.cpp can generate the HTML page automatically for controlling those parameters.
Phrases | Build | Generates phrases for every word in a Words class. Uses the Bits class.
PingServer | Admin | Does round-robin pinging of every host in the cluster. Ping times are displayed on PageHosts.
Pops | Build | Computes popularity for each word in a Words class. Uses the dictionary files in the dict subdirectory.
Pos | Build | Computes the character position of each word in a Words class. HTML entities count as a single character. So do back-to-back spaces.
Query | Search | Parses a query up into QueryWords which are then parsed into QueryTerms. Makes a boolean truth table for boolean queries.
Rdb | DB | The core database class from which all are derived.
RdbBase | DB | Each Rdb has an array of RdbBases, one for each collection. Each RdbBase has an array of BigFiles for that collection.
RdbCache | DB | Can cache RdbLists or individual Rdb records.
RdbDump | DB | Dumps the RdbTree to an Rdb file. Also used by RdbMerge to dump the merged RdbList to a file.
RdbList | DB | A list of Rdb records.
RdbMap | DB | Maps an Rdb key to an offset into an RdbFile.
RdbMem | DB | Memory manager for RdbTree so it does not have to allocate space for every record in the tree.
RdbMerge | DB | Merges multiple Rdb files into one Rdb file. Uses Msg5 and RdbDump to do the reading and writing respectively.
RdbScan | DB | Reads an RdbList from an RdbFile. Used by Msg3.
RdbTree | DB | A binary tree of Rdb records. All collections share a single RdbTree, so the collection number is specified for each node in the tree.
Robotdb | Build | Caches and parses robots.txt files. Used by Msg13.
SafeBuf | General | Used to print messages safely into a buffer without worrying about overflow. Will automatically reallocate the buffer if necessary.
Scores | Build | Computes the score of each word in a Words class. Used to weight the final score of a term being indexed in TermTable::hash().
SearchInput | Search | Used to parse, contain and manage all parameters passed in for doing a query.
SiteRec | DB | A record in Sitedb.
Sitedb | DB | An Rdb that maps a url to a Sitedb record which contains a ruleset to be used to parse and index that url.
Spam | Build | Computes the probability a word is spam for every word in a Words class.
SpamContainer | Build | Used to remove spam from the index using Msg1c and Msg1d.
Speller | Search | Performs spell checking on a query. Returns a single recommended spelling of the query.
SpiderCache | Build | Spiderdb records are preloaded in this cache so SpiderLoop::spiderUrl() can get urls to spider as fast as possible.
SpiderLoop | Build | The heart of the spider process. Continually gets urls from the SpiderCache and calls Msg14::spiderUrl() on them.
SpiderRec | DB | A record in spiderdb.
Spiderdb | DB | An Rdb whose records are urls sorted by the times they should be spidered. The key contains other information, like whether the url is old or new to the index, and the priority of the url, currently from 0 to 7.
Stats | Admin | Holds various statistics that PagePerf displays.
Stemmer | Build | Unused. Given a word, computes its stem.
StopWords | Build | A table of stop words, used by Bits to see if a word is a stop word.
Strings | Core | Unused.
Summary | Search | Generates a summary given a document and a query.
Sync | Admin | Unused. Syncs two twin Rdbs together.
TcpServer | Net | A TCP server which contains an array of TcpSockets.
TcpSockets | Net | A C++ wrapper for a TCP socket.
TermTable | Build | A hash table of terms from a document. Consists of termIds and scores. Used to accumulate scores. TermTable::hash() is arguably a heart of the build process.
Threads | Core | Gigablast's own threads class which uses the Linux clone() call to do its own LWP threads.
Title | Search | Generates a title for a document. Usually just the <title> tag.
TitleRec | DB | A record in Titledb.
Titledb | DB | An Rdb where the records are basically compressed web pages, along with other info like the quality of the page. Contains an instance of the LinkInfo class.
TopTree | Search | A balanced binary tree used for getting the top-scoring X search results from intersecting IndexLists in IndexTable, where X is a large number. Normally we just do a linear scan to find the minimum-scoring docId and replace it with a higher-scoring docId, but when X is large this linear scan is too slow.
UCNormalizer | General | Unicode support.
UCPropTable | General | Unicode support.
UCWordIterator | General | For iterating over Unicode characters.
UdpServer | Net | A reliable UDP server that uses non-blocking sockets and calls handlers upon receiving a message. The handler called depends on the message's type: UdpServer::m_handlers[msgType].
UdpSlot | Net | Basically a "socket" for the UdpServer. The UdpServer contains an array of a few thousand of these. When none are available to receive a request, the dgram is dropped and will later be resent by the requester in a back-off fashion.
Unicode | Core | Unicode support.
UnicodeProperties | Core | Unicode support.
Url | Core | For breaking a url up into its various components.
Url2 | Build | For hashing/indexing a url.
Vector | Core | Unused.
Words | Build | Breaks a document up into "words", where each word is a sequence of alphanumeric characters, a sequence of non-alphanumeric characters, or a single HTML/XML tag. A heart of the build process.
Xml | Build | Breaks a document up into XmlNodes where each XmlNode is a tag or a sequence of characters which are not a tag.
XmlDoc | Build | XmlDoc::hash() hashes a TitleRec (and SiteRec, which indicates the ruleset to use) into a TermTable. Another heart of the build process.
XmlNode | Build | The Xml class has an array of these. Each is either a tag or a sequence of characters between tags (or at the beginning/end of the document).
blaster | Core | Downloads the urls listed in a file in parallel. Very useful for testing query performance.
create_ucd_tables | Core | Auto-generated. Used for Unicode support.
dmozparse | Build | Creates the dmoz files Gigablast needs from the files downloadable from DMOZ.
dnstest | Core | Tests a dns server by sending it a bunch of lookup requests.
fctypes | Core | Various little functions.
gbfilter | Core | Called via a system call by Gigabot to convert pdfs, Microsoft Word documents, PowerPoint documents, etc. to HTML.
hash | Core | Contains a bunch of fast and convenient hashing functions.
iana_charset | Core | Unicode support.
ip | Core | Routines for manipulating IP addresses.
main | Core | The main.cpp file.
monitor | Core | Monitors an external gb process and sends an email alert on 3 failed queries in a row.
thunder | Core | Tests UDP throughput between two machines.
types | Core | Defines the widely used key_t type, a 12-byte key, and some functions for 16-byte keys.
uniq2 | Core | Like the 'uniq' command but also counts the occurrences and prints those out.
urlinfo | Core | Displays information for a url given through stdin.


Fighting Spam

Spam is perhaps the leading issue in quality control. How do we remove undesirable and useless pages from the search results? Fighting spammers is an endless cat-and-mouse game. Before going any further, we must ask ourselves: what is spam?

A spam page consists of a bunch of links and generated content. The generated content uses various sources as the basis for its generation, including other static web pages, search results pages, resources like dictionaries, dmoz, query logs, and databases from shopping sites like amazon, ebay, allposters or zappos. The exact sources of input used by the generating function are not always clear.

To fight spam using a different approach, we have devised a quality-determination algorithm which uses a point-based system. Our base measure of quality rests on offsite factors, such as the number of incoming links weighted by the quality of those incoming links. The spam detector then looks at onsite factors, such as the fingerprints of automatically generated content, links to affiliate moneymaking sites, or an irregular language model, and adds a negative quality proportional to the number of questionable practices the page engages in. The total quality of a page is the sum of the link-adjusted quality and the negative spam quality.

Our attributes for spam detection have been gleaned from examining thousands of spam and non-spam documents. The weights are adjustable, and have been manually tuned to provide good spam detection while minimizing false positives. The points for each attribute are multiplied by the number of times the attribute occurred in the document, and these per-attribute totals are summed into a raw total. The raw total is then multiplied by a force multiplier, which is 0.01 times the number of attributes that had a non-zero number of points. This weighs pages that exhibit many 'spammy' qualities much more heavily, while reducing false positives for pages that have lots of points in a single category. The result is a small integer which is capped at MaxQuality points and constitutes the negative quality weight. A sketch of this computation appears after the example table below.

Spam Scores | Points | Occurrences | Total
url has - or _ or a digit in the domain | 20 | 0 | 0
tld is info or biz | 20 | 0 | 0
tld is gov, edu, or mil | -20 | 0 | 0
title has spammy words | 20 | 0 | 0
page has img src to other domains | 5 | 1 | 5
page contains spammy words | 5 | 3 | 15
consecutive link text has the same word | 10 | 0 | 0
links to amazon, allposters, or zappos | 10 | 0 | 0
has 'affiliate' in the links | 40 | 0 | 0
has an iframe to amazon | 30 | 0 | 0
links to urls > 128 chars long | 5 | 0 | 0
links have ?q= or &q= | 5 | 0 | 0
page has google ads | 15 | 0 | 0
Raw Total | | | 20
Force Multiplier | | | 0.020000
Final Score | | | 0
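Below is that sketch. It mirrors the arithmetic described above using the two non-zero rows from the example table; the struct name, function name and the cap value of 100 are illustrative assumptions, not actual identifiers or constants from the source.

#include <cstdio>
#include <vector>

// Illustrative only: mirrors the points/occurrences/force-multiplier
// arithmetic described above. Names and the cap value are hypothetical.
struct SpamAttr {
    const char *desc;
    int         points;      // weight for this attribute
    int         occurrences; // times the attribute occurred in the document
};

static int computeNegativeQuality ( const std::vector<SpamAttr> &attrs ,
                                    int maxQualityPoints ) {
    int rawTotal   = 0;
    int numNonZero = 0;
    for ( const SpamAttr &a : attrs ) {
        int total = a.points * a.occurrences;
        rawTotal += total;
        if ( total != 0 ) numNonZero++;
    }
    // Pages with many distinct spammy attributes get weighted more heavily.
    double forceMultiplier = 0.01 * numNonZero;
    int score = (int)(rawTotal * forceMultiplier);
    if ( score > maxQualityPoints ) score = maxQualityPoints;
    return score; // the negative quality weight
}

int main () {
    // The two non-zero rows from the example table above.
    std::vector<SpamAttr> attrs = {
        { "page has img src to other domains" , 5 , 1 } ,
        { "page contains spammy words"        , 5 , 3 }
    };
    // raw total = 20, force multiplier = 0.02, final score = (int)0.4 = 0
    printf ( "final score = %d\n" , computeNegativeQuality ( attrs , 100 ) );
    return 0;
}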


Search engine results pages (SERPs)

Many pages indexed by search engines are themselves search engine results pages. This could be because a misconfigured robots.txt fails to tell search engines not to index the results pages, or because a search engine spammer includes them via a php script, since they contain keywords closely related in relevance to a query term. In either case, they are useless content. To detect these pages, we use a points system which looks for the title - summary - url structure repeated several times in a row, as well as things like next and prev links and queries in the url. These points are added together and normalized between 0 and 100 to give the likelihood that the page is a serp.
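A minimal sketch of such a points system follows; the individual weights and signal names here are made-up assumptions, but the shape (accumulate points for SERP-like features, then clamp to a 0-100 likelihood) follows the description above.

// Illustrative sketch only; the weights are hypothetical.
static int serpLikelihood ( int  numTitleSummaryUrlBlocks ,
                            bool hasNextPrevLinks ,
                            bool urlContainsQuery ) {
    int points = 0;
    points += 10 * numTitleSummaryUrlBlocks; // repeated title/summary/url runs
    if ( hasNextPrevLinks ) points += 20;    // "next"/"prev" pagination links
    if ( urlContainsQuery ) points += 20;    // a query string in the url
    if ( points > 100 ) points = 100;        // normalize into 0-100
    return points; // likelihood that the page is a serp
}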



