Gigablast API

Gigablast - api

WEB

DIRECTORY

ADMIN

NOTE: All APIs support both GET and POST method. If the size of your request is more than 2K you should use POST.

NOTE: All APIs support both http and https protocols.

API by pages

/search - search results page
/get - gets cached web page
/admin/status - basic status
/admin/collectionpasswords - passwords
/admin/hosts - hosts status
/admin/master - master controls
/admin/search - search controls
/admin/spider - spider controls
/admin/proxies - proxies
/admin/log - log controls
/admin/masterpasswords - master passwords
/admin/addcoll - add a new collection
/admin/delcoll - delete a collection
/admin/clonecoll - clone one collection's settings to another
/admin/rebuild - rebuild data
/admin/inject - inject url in the index here
/admin/addurl - add url page for admin
/admin/reindex - query delete/reindex
/admin/stats - general statistics

/search - search results page [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	q	STRING	query		The query to perform. See help. See the query operators below for more info. REQUIRED
4	c	STRING	collection		Search this collection. Use multiple collection names separated by a whitespace to search multiple collections at once. REQUIRED
5	n	INT32	number of results per query	10	The number of results returned. If you want more than 1000 results you must use &stream=1 so Gigablast does not run out of memory.
6	s	INT32	first result num	0	Start displaying at search result #X. Starts at 0. If you want more than 1000 results in total, you must use &stream=1 so Gigablast does not run out of memory.
7	showerrors	BOOL (0 or 1)	show errors	0	Show errors from generating search result summaries rather than just hide the docid. Useful for debugging.
8	showanomalies	BOOL (0 or 1)	show anomalies	0	Show search results that only contain the query terms in some anomalous link texts.
9	sc	BOOL (0 or 1)	site cluster	0	Should search results be site clustered? This limits each site to appearing at most twice in the search results. Sites are subdomains for the most part, like abc.xyz.com.
10	hacr	BOOL (0 or 1)	hide all clustered results	0	Only display at most one result per site.
11	dr	BOOL (0 or 1)	dedup results	0	Should duplicate search results be removed? This is based on a content hash of the entire document. So documents must be exactly the same for the most part.
12	pss	INT32	percent similar dedup summary		If document summary (and title) are this percent similar to a document summary above it, then remove it from the search results. 100 means only to remove if exactly the same. 0 means no summary deduping. You must also supply dr=1 for this to work.
13	ddu	BOOL (0 or 1)	dedup URLs	0	Should we dedup URLs with case insensitivity? This is mainly to correct duplicate wiki pages.
14	spell	BOOL (0 or 1)	do spell checking	1	If enabled while using the XML feed, when Gigablast finds a spelling recommendation it will be included in the XML tag. Default is 0 if using an XML feed, 1 otherwise. Will be availble again soon.
15	stream	CHAR	stream search results	0	Stream search results back on socket as they arrive. Useful when thousands/millions of search results are requested. Required when doing such things otherwise Gigablast could run out of memory. Only supported for JSON and XML formats, not HTML. You must use this if you want more than 1000 results.
16	secsback	INT32	seconds back	0	Limit results to pages spidered this many seconds ago. Use 0 to disable.
17	sortby	CHAR	sort by	0	Use 0 to sort results by relevance, 1 to sort by most recent spider date down, and 2 to sort by oldest spidered results first.
18	filetype	STRING	filetype		Restrict results to this filetype. Supported filetypes are pdf, doc, html xml, json, xls.
19	scores	BOOL (0 or 1)	get scoring info		Get scoring information for each result so you can see how each result is scored. You must explicitly request this using &scores=1 for the XML feed because it is not included by default.
20	fast	BOOL (0 or 1)	fast results	0	Sacrifice some quality and result filtering for the sake of speed.
21	qe	BOOL (0 or 1)	do query expansion		If enabled, query expansion will expand your query to include the various forms and synonyms of the query terms.
22	uip	STRING	user ip		The ip address of the searcher. We can pass back for use in the autoban technology which bans abusive IPs.
23	nf	INT32	max number of facets to return	50	Max number of facets to return
24	qlang	STRING	sort language preference		Default language to use for ranking results. Value should be any language abbreviation, for example "en" for English. Use xx to give ranking boosts to no language in particular. See the language abbreviations at the bottom of the url filters page.
25	langw	FLOAT32	language weight		Use this to override the default language weight for this collection. The default language weight can be set in the search controls and is usually something like 20.0. Which means that we multiply a result's score by 20 if from the same language as the query or the language is unknown.
26	tml	INT32	max title len		What is the maximum number of characters allowed in titles displayed in the search results?
27	ns	INT32	number of summary excerpts		How many summary excerpts to display per search result?
28	sw	INT32	max summary line width		<br> tags are inserted to keep the number of chars in the summary per line at or below this width. Also affects title. Strings without spaces that exceed this width are not split. Has no affect on xml or json feed, only works on html.
29	smxcpl	INT32	max summary excerpt length		What is the maximum number of characters allowed per summary excerpt?
30	dsrt	INT32	results to scan for gigabits generation		How many search results should we scan for gigabit (related topics) generation. Set this to zero to disable gigabits!
31	ipr	BOOL (0 or 1)	ip restriction for gigabits		Should Gigablast only get one document per IP domain and per domain for gigabits (related topics) generation?
32	nrt	INT32	number of gigabits to show	11	What is the number of gigabits (related topics) displayed per query? Set to 0 to save a little CPU time.
33	mts	INT32	min topics score		Gigabits (related topics) with scores below this will be excluded. Scores range from 0% to over 100%.
34	mdc	INT32	min gigabit doc count by default	2	How many documents must contain the gigabit (related topic) in order for it to be displayed.
35	dsp	INT32	dedup doc percent for gigabits (related topics)	80	If a document is this percent similar to another document with a higher score, then it will not contribute to the gigabit generation.
36	mwpt	INT32	max words per gigabit (related topic) by default	6	Maximum number of words a gigabit (related topic) can have. Affects xml feeds, too.
37	showimages	BOOL (0 or 1)	show images	1	Should we return or show the thumbnail images in the search results?
38	usecache	CHAR	use cache	1	Use 0 if Gigablast should not read or write from any caches at any level.
39	rcache	BOOL (0 or 1)	read from cache	1	Should we read search results from the cache?
40	wcache	CHAR	write to cache	1	Use 0 if Gigablast should not write to any caches at any level.
41	minserpdocid	INT64	max serp docid	0	Start displaying results after this score/docid pair. Used by widget to append results to end when index is volatile.
42	maxserpscore	FLOAT64	max serp score	0	Start displaying results after this score/docid pair. Used by widget to append results to end when index is volatile.
43	link	STRING	restrict search to pages that link to this url		The url which the pages must link to.
44	sites	STRING	restrict results to these sites		Returned results will have URLs from these space-separated list of sites. Can have up to 200 sites. A site can include sub folders. This is allows you to build a Custom Topic Search Engine.
45	ff	BOOL (0 or 1)	family filter	0	Remove objectionable results if this is enabled.
46	qh	BOOL (0 or 1)	highlight query terms in summaries	1	Use to disable or enable highlighting of the query terms in the summaries.
47	hq	STRING	cached page highlight query		Highlight the terms in this query instead.
48	bq	INT32	boolean status	2	Can be 0 or 1 or 2. 0 means the query is NOT boolean, 1 means the query is boolean and 2 means to auto-detect.
49	dt	STRING	meta tags to display		A space-separated string of meta tag names. Do not forget to url-encode the spaces to +'s or %%20's. Gigablast will extract the contents of these specified meta tags out of the pages listed in the search results and display that content after each summary. i.e. &dt=description will display the meta description of each search result. &dt=description:32+keywords:64 will display the meta description and meta keywords of each search result and limit the fields to 32 and 64 characters respectively. When used in an XML feed the <display name="meta_tag_name">meta_tag_content</> XML tag will be used to convey each requested meta tag's content.
50	niceness	INT32	niceness	0	Can be 0 or 1. 0 is usually a faster, high-priority query, 1 is a slower, lower-priority query.
51	debug	CHAR	debug flag	0	Is 1 to log debug information, 0 otherwise.
52	rdc	BOOL (0 or 1)	return number of docs per topic	1	Use 1 if you want Gigablast to return the number of documents in the search results that contained each topic (gigabit).
53	rd	BOOL (0 or 1)	return docids per topic	0	Use 1 if you want Gigablast to return the list of docIds from the search results that contained each topic (gigabit).
54	debuggigabits	BOOL (0 or 1)	debug gigabits flag	0	Is 1 to log gigabits debug information, 0 otherwise.
55	dio	BOOL (0 or 1)	return docids only	0	Is 1 to return only docids as query results.
56	admin	BOOL (0 or 1)	admin override	1	admin override
57	prepend	STRING	prepend		prepend this to the supplied query followed by a \|.
58	sb	BOOL (0 or 1)	show banned pages	0	show banned pages
59	icc	INT32	include cached copy of page	0	Will cause a cached copy of content to be returned instead of summary.

Example XML Output (&format=xml)


<response>
	<statusCode>0</statusCode>
	<statusMsg>Success</statusMsg>
	<currentTimeUTC>1404513734</currentTimeUTC>
	<responseTimeMS>284</responseTimeMS>
	<docsInCollection>226</docsInCollection>
	<hits>193</hits>
	<moreResultsFollow>1</moreResultsFollow>
	<result>
		<imageBase64>/9j/4AAQSkZJRgABAQAAAQABA...</imageBase64>
		<imageHeight>350</imageHeight>
		<imageWidth>223</imageWidth>
		<origImageHeight>470</origImageHeight>
		<origImageWidth>300</origImageWidth>
		<title><![CDATA[U.S....]]></title>
		<sum>Department of the Interior protects America's natural resources and</sum>
		<url><![CDATA[www.doi.gov]]></url>
		<size>  64k</size>
		<docId>34111603247</docId>
		<site>www.doi.gov</site>
		<spidered>1404512549</spidered>
		<firstIndexedDateUTC>1404512549</firstIndexedDateUTC>
		<contentHash32>2680492249</contentHash32>
		<language>English</language>
	</result>
</response>

Example JSON Output (&format=json)


{ "response":{

	# This is zero on a successful query. 
	# Otherwise it will be a non-zero number 
	# indicating the error code.
	"statusCode":0,

	# Similar to above, this is "Success" 
	# on a successful query. Otherwise it 
	# will indicate an error message 
	# corresponding to the statusCode above.
	"statusMsg":"Success",

	# This is the current time in UTC in 
	# unix timestamp format (seconds since 
	# the epoch) that the server has when 
	# generating this JSON response.
	"currentTimeUTC":1404588231,

	# This is how long it took in 
	# milliseconds to generate the JSON 
	# response from reception of the request.
	"responseTimeMS":312,

	# This is how many matches were 
	# excluded from the search results 
	# because they were considered 
	# duplicates, banned, had errors 
	# generating the summary, or were from an 
	# over-represented site. To show them use 
	# the &sc &dr &pss &sb and &showerrors 
	# input parameters described above.
	"numResultsOmitted":3,

	# This is how many shards failed to 
	# return results. Gigablast gets results 
	# from multiple shards (computers) and 
	# merges them to get the final result 
	# set. Some times a shard is down or 
	# malfunctioning so it will not 
	# contribute to the results. So If this 
	# number is non-zero then you had such a 
	# shard.
	"numShardsSkipped":0,

	# This is how many shards are ideally 
	# in use by Gigablast to generate search 
	# results.
	"totalShards":159,

	# This is how many total documents are 
	# in the collection being searched.
	"docsInCollection":226,

	# This is how many of those documents 
	# matched the query.
	"hits":193,

	# This is 1 if more search results are 
	# available, otherwise it is 0.
	"moreResultsFollow":1,

	# Start of query-based information.
	"queryInfo":{

		# The entire query that was received, 
		# represented as a single string.
		"fullQuery":"test",

		# The language of the query. This is 
		# the 'preferred' language of the search 
		# results. It is reflecting the &qlang 
		# input parameter described above. Search 
		# results in this language (or an unknown 
		# language) will receive a large boost. 
		# The boost is multiplicative. The 
		# default boost size can be overridden 
		# using the &langw input parameter 
		# described above. This language 
		# abbreviation here is usually 2 letter, 
		# but can be more, like in the case of 
		# zh-cn, for example.
		"queryLanguageAbbr":"en",

		# The language of the query. Just 
		# like above but the language is spelled 
		# out. It may be multiple words.
		"queryLanguage":"English",

		# List of space separated words in 
		# the query that were ignored for the 
		# most part. Because they were common 
		# words for the query language they are 
		# in.
		"ignoredWords":"to the",

		# There is a maximum limit placed on 
		# the number of query terms we search on 
		# to keep things fast. This can be 
		# changed in the search controls.
		"queryNumTermsTotal":52,
		"queryNumTermsUsed":20,
		"queryWasTruncated":1,

		# The start of the terms array. Each 
		# query is broken down into a list of 
		# terms. Each term is described here.
		"terms":[

			# The first query term in the JSON 
			# terms array.
			{

			# The term number, starting at 0.
			"termNum":0,

			# The term as a string.
			"termStr":"test",

			# The term frequency. An estimate of 
			# how many pages in the collection 
			# contain the term. Helps us weight terms 
			# by popularity when scoring the results.
			"termFreq":425239458,

			# A 48-bit hash of the term. Used to 
			# represent the term in the index.
			"termHash48":67259736306430,

			# A 64-bit hash of the term.
			"termHash64":9448336835959712000,

			# If the term has a field, like the 
			# term title:cat, then what is the hash 
			# of the field. In this example it would 
			# be the hash of 'title'. But for the 
			# query 'test' there is no field so it is 
			# 0.
			"prefixHash64":0

			},

			# The second query term in the JSON 
			# terms array.
			{

			"termNum":1,
			"termStr":"tested",

			# The language the term is from, in 
			# the case of query expansion on the 
			# original query term. Gigablast tries to 
			# find multiple forms of the word that 
			# have the same essential meaning. It 
			# uses the specified query language 
			# (&qlang), however, if a query term is 
			# from a different language, then that 
			# language will be implied for query 
			# expansion.
			"termLang":"en",

			# The query term that this term is a 
			# form of.
			"synonymOf":"test",

			"termFreq":73338909,
			"termHash48":66292713121321,
			"termHash64":9448336835959712000,
			"prefixHash64":0
			},

			...

		# End of the JSON terms array.
		]

	# End of the queryInfo JSON structure.
	},

	# The start of the gigabits array. 
	# Each gigabit is mined from the content 
	# of the search results. The top N 
	# results are mined, and you can control 
	# N with the &dsrt input parameter 
	# described above.
	"gigabits":[

		# The first gigabit in the array.
		{

		# The gigabit as a string in utf8.
		"term":"Membership",

		# The numeric score of the gigabit.
		"score":240,

		# The popularity ranking of the 
		# gigabit. Out of 10000 random documents, 
		# how many documents contain it?
		"minPop":480,

		# The gigabit in the context of a 
		# document.
		"instance":{

			# A sentence, if it exists, from one 
			# of the search results which also 
			# contains the gigabit and as many 
			# significant query terms as possible. In 
			# UTF-8.
			"sentence":"Get a free Tested Premium Membership here!",

			# The url that contained that 
			# sentence. Always starts with http.
			"url":"http://www.tested.com/",

			# The domain of that url.
			"domain":"tested.com"
		}

		# End of the first gigabit
		},

		...

	# End of the JSON gigabits array.
	],

	# Start of the facets array, if any.
	"facets":[

		# The first facet in the array.
		{
			# The field you are faceting over
			"field":"Company",

			# How many documents in the 
			# collection had this particular field? 
			# 64-bit integer.
			"totalDocsWithField":148553,

			# How many documents in the 
			# collection had this particular field 
			# with the same value as the value line 
			# directly below? This should always be 
			# less than or equal to the 
			# totalDocsWithField count. 64-bit 
			# integer.
			"totalDocsWithFieldAndValue":44184,

			# The value of the field in the case 
			# of this facet. Can be a string or an 
			# integer or a float, depending on the 
			# type described in the gbfacet query 
			# term. i.e. gbfacetstr, gbfacetint or 
			# gbfacetfloat.
			"value":"Widgets, Inc.",

			# Should be the same as 
			# totalDocsWithFieldAndValue, above. 
			# 64-bit integer.
			"docCount":44184

		# End of the first facet in the array.
		}

	# End of the facets array.
	],

	# Start of the JSON array of 
	# individual search results.
	"results":[

		# The first result in the array.
		{

		# The title of the result. In UTF-8.
		"title":"This is the title.",

		# A DMOZ entry. One result can have 
		# multiple DMOZ entries.
		"dmozEntry":{

			# The DMOZ category ID.
			"dmozCatId":374449,

			# The DMOZ direct category ID.
			"directCatId":1,

			# The DMOZ category as a UTF-8 
			# string.
			"dmozCatStr":"Top: Computers: Security: Malicious 
			 Software: Viruses: Detection and Removal Tools: 
			 Reviews",

			# What title some DMOZ editor gave 
			# to this url.
			"dmozTitle":"The DMOZ Title",

			# What summary some DMOZ editor gave 
			# to this url.
			"dmozSum":"A great web page.",

			# The DMOZ anchor text, if any.
			"dmozAnchor":"",

		# End DMOZ entry.
		},

		# The content type of the url. Can be 
		# html, pdf, text, xml, json, doc, xls or 
		# ps.
		"contentType":"html",

		# The summary excerpt of the result. 
		# In UTF-8.
		"sum":"Department of the Interior protects America's natural resources.",

		# The url of the result. If it starts 
		# with http:// then that is omitted. Also 
		# omits the trailing / if the urls is 
		# just a domain or subdomain on the root 
		# path.
		"url":"www.doi.gov",

		# The hopcount of the url. The 
		# minimum number of links we would have 
		# to click to get to it from a root url. 
		# If this is 0 that means the url is a 
		# root url, like http://www.root.com/.
		"hopCount":0,

		# The size of the result's content. 
		# Always in kilobytes. k stands for 
		# kilobytes. Could be a floating point 
		# number or and integer.
		"size":"  64k",

		# The exact size of the result's 
		# content in bytes.
		"sizeInBytes":64560,

		# The unique document identifier of 
		# the result. Used for getting the cached 
		# content of the url.
		"docId":34111603247,

		# The site the result comes from. 
		# Usually a subdomain, but can also 
		# include part of the URL path, like, 
		# abc.com/users/brad/. A site is a set of 
		# web pages controlled by the same 
		# entity.
		"site":"www.doi.gov",

		# The time the url was last INDEXED. 
		# If there was an error or the url's 
		# content was unchanged since last 
		# download, then this time will remain 
		# unchanged because the document is not 
		# reindexed in those cases. Time is in 
		# unix timestamp format and is in UTC.
		"spidered":1404512549,

		# The first time the url was 
		# successfully INDEXED. Time is in unix 
		# timestamp format and is in UTC.
		"firstIndexedDateUTC":1404512549,

		# A 32-bit hash of the url's content. 
		# It is used to determine if the content 
		# changes the next time we download it.
		"contentHash32":2680492249,

		# The dominant language that the 
		# url's content is in. The language name 
		# is spelled out in its entirety.
		"language":"English"

		# A convenient abbreviation of the 
		# above language. Most are two 
		# characters, but some, like zh-cn, are 
		# more.
		"langAbbr":"en"

		# If the result has an associated 
		# image then the image thumbnail is 
		# encoded in base64 format here. It is a 
		# jpg image.
		"imageBase64":"/9j/4AAQSkZJR...",

		# If the result has an associated 
		# image then what is its height and width 
		# of the above jpg thumbnail image in 
		# pixels?
		"imageHeight":223,
		"imageWidth":350,

		# If the result has an associated 
		# image then what are the dimensions of 
		# the original image in pixels?
		"origImageHeight":300,
		"origImageWidth":470

		# End of the first result.
		},

		...

	# End of the JSON results array.
	]

# End of the response.
}

}

/get - gets cached web page [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	d	INT64	docId	0	The docid of the cached page to view. REQUIRED
4	url	STRING	url		Instead of specifying a docid, you can get the cached webpage by url as well. REQUIRED
5	c	STRING	collection		Get the cached page from this collection. REQUIRED
6	strip	INT32	strip	0	Is 1 or 2 two strip various tags from the cached content.
7	ih	BOOL (0 or 1)	include header	1	Is 1 to include the Gigablast header at the top of the cached page, 0 to exclude the header.
8	q	STRING	query		Highlight this query in the page.

Example XML Output (&format=xml)

<response>
	<statusCode>0</statusCode>
	<statusMsg>Success</statusMsg>
	<url><![CDATA[http://www.doi.gov/]]></url>
	<docId>34111603247</docId>
	<cachedTimeUTC>1404512549</cachedTimeUTC>
	<cachedTimeStr>Jul 04, 2014 UTC</cachedTimeStr>
	<content><![CDATA[<html><title>Some web page title</title><head>My first web page</head></html>]]></content>
</response>

Example JSON Output (&format=json)

{ "response":{
	"statusCode":0,
	"statusMsg":"Success",
	"url":"http://www.doi.gov/",
	"docId":34111603247,
	"cachedTimeUTC":1404512549,
	"cachedTimeStr":"Jul 04, 2014 UTC",
	"content":"<html><title>Some web page title</title><head>My first web page</head></html>"
}
}

/admin/status - basic status [ show parms in xml or json ] [ show status in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	c	STRING	collection		Use this collection. REQUIRED

/admin/collectionpasswords - passwords [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.

/admin/hosts - hosts status [ show parms in xml or json ] [ show status in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.

/admin/master - master controls [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	qrts	STRING	query routes	default->127.0.0.1:7000	Match a subtring in the HTTP request to the first substring before the -> below and route it, round robin style, to one host in the following comma-separated list of IP:UDP_PORT. Put an * immediately after the -> to mark the host as unpingable; just a hack for older clusters that use a different ping packet format.
4	se	BOOL (0 or 1)	spidering enabled	1	Controls all spidering for all collections
5	dsdc	BOOL (0 or 1)	do spiderdb decimation	0	If enabled limits to MAXURLSPERFIRSTIP urls in spiderdb per firstip. Currently 100M.
6	injen	BOOL (0 or 1)	injections enabled	1	Controls injecting for all collections
7	qryen	BOOL (0 or 1)	querying enabled	1	Controls querying for all collections
8	dospellchecking	BOOL (0 or 1)	do spell checking	0	Spell check all queries?
9	usesortdb	BOOL (0 or 1)	use sortdb	0	A temp parm for debugging.
10	rra	BOOL (0 or 1)	return results even if a shard is down	1	If you turn this off then Gigablast will return an error message if a shard was down and did not return results for a query. The XML and JSON feed let's you know when a shard is down and will give you the results back any way, but if you would rather have just and error message and no results, then set then set this to 'NO'.
11	maxmem	INT64	max mem	8000000000	Mem available to this process. May be exceeded due to fragmentation.
12	mtsp	INT32	max total spiders	100	What is the maximum number of web pages the spider is allowed to download simultaneously for ALL collections PER HOST? Caution: raising this too high could result in some Out of Memory (OOM) errors. The hard limit is currently 300. Each collection has its own limit in the spider controls that you may have to increase as well.
13	mspt	INT32	max spiders per token	30	The maximum number of simultaneous spider requests outstanding per host.
14	ae	BOOL (0 or 1)	add url enabled	1	Can people use the add url interface to add urls to the index?
15	ucp	BOOL (0 or 1)	use collection passwords	0	Should collections have individual password settings so different users can administrer different collections? If not the only the master passwords and IPs will be able to administer any collection.
16	acu	BOOL (0 or 1)	allow cloud users	0	Can guest users create and administer a collection? Limit: 1 collection per IP address. This is mainly for doing demos on the gigablast.com domain.
17	asf	INT32	auto save frequency	5	Save data in memory to disk after this many minutes have passed without the data having been dumped or saved to disk. I've seen this block over 50 seconds on a busy server before that used SSDs, so be aware. We now force this to a minimum of half a day, 60*12 to avoid collection configuration data loss.
18	mhs	INT32	max http sockets	300	Maximum sockets available to serve incoming HTTP requests. Too many outstanding requests will increase query latency. Excess requests will simply have their sockets closed.
19	mss	INT32	max https sockets	100	Maximum sockets available to serve incoming HTTPS requests. Like max http sockets, but for secure sockets.
20	ttosecs	INT32	tcp time out	60	How many seconds to wait before timing out a TCP download?
21	ttosecsproxy	INT32	tcp time out (proxy)	80	How many seconds to wait before timing out a TCP download over a proxy?
22	sua	STRING	spider user agent	GigablastOpenSource/1.0	Identification seen by web servers when the Gigablast spider downloads their web pages. It is polite to insert a contact email address here so webmasters that experience problems from the Gigablast spider have somewhere to vent.
23	js	UNARY CMD (set to 1)	save		Saves in-memory data for ALL hosts. Does Not exit. If running from a proxy, then just saves the gb.conf data for that proxy.
24	save	UNARY CMD (set to 1)	save & exit		Saves the data and exits for ALL hosts. If running on a proxy, just saves and exits on that proxy.
25	rebalance	UNARY CMD (set to 1)	rebalance shards		Tell all hosts to scan all records in all databases, and move records to the shard they belong to. You only need to run this if Gigablast tells you to, when you are changing hosts.conf to add or remove more nodes/hosts.
26	dump	UNARY CMD (set to 1)	dump to disk		Flushes all records in memory to the disk on all hosts.
27	clrkrnerr	UNARY CMD (set to 1)	clear errors		Clears the kernel error messages, the out of memory and the core-dump signals on the hosts page. (The x and O letter signals.)
28	fingerprint	UNARY CMD (set to 1)	record user agent fingerprint		Record the user agent from your browser along with the standard headers that this user agent sends with its requests
29	meen	BOOL (0 or 1)	merges enabled	1	Disable at startup to make startup faster.
30	tma	INT32	tight merge after this many days	0	If oldest file is this many days old then do a tight merge. Used by diffbot to realize spider status doc deletes faster. Use 0 to not do this. Will only do the tight merge if a regular merge is triggered from the number of files being too large.
31	afgdwd	BOOL (0 or 1)	ask for gzipped docs when downloading	0	If this is true, gb will send Accept-Encoding: gzip to web servers when doing http downloads. It does have a tendency to cause out-of-memory errors when you enable this, so until that is fixed better, it's probably a good idea to leave this disabled.
32	ab	BOOL (0 or 1)	autoban IPs which violate the queries per day quotas	0	Keep track of ips which do queries, disallow non-customers from hitting us too hard.
33	nfqpd	INT32	free queries per day	1024	Non-customers get this many queries per day beforebeing autobanned
34	nfqpm	INT32	free queries per minute	30	Non-customers get this many queries per minute beforebeing autobanned
35	ubf	BOOL (0 or 1)	use bot fence	0	Use ajax to load the search results to keep bots from slamming gb with queries.
36	bfsw	STRING	secret word for bot fence	rxiwd	Use ajax to set the value of this word as a cgi parm in the url. TODO: make this automatic again.
37	mhdms	INT32	max heartbeat delay in milliseconds	0	If a heartbeat is delayed this many milliseconds dump a core so we can see where the CPU was. Logs 'db: missed heartbeat by %ld ms'. Use 0 or less to disable.
38	mdch	INT32	max delay before logging a callback or handler	-1	If a call to a message callback or message handler in the udp server takes more than this many milliseconds, then log it. Logs 'udp: Took %ld ms to call callback for msgType=0x%hhx niceness=%d'. Use -1 or less to disable the logging.
39	dht	INT32	dead host timeout	6000	Consider a host in the Gigablast network to be dead if it does not respond to successive pings for this number of seconds. Gigablast does not send requests to dead hosts. Outstanding requests may be re-routed to a twin.
40	psms	INT32	ping spacer	100	Wait this many milliseconds before pinging the next host. Each host pings all other hosts in the network.
41	errstrone	STRING	error string 1	I/O error	Look for this string in the kernel buffer for sending email alert. Useful for detecting some strange hard drive failures that really slow performance.
42	errstrtwo	STRING	error string 2		Look for this string in the kernel buffer for sending email alert. Useful for detecting some strange hard drive failures that really slow performance.
43	errstrthree	STRING	error string 3		Look for this string in the kernel buffer for sending email alert. Useful for detecting some strange hard drive failures that really slow performance.
44	dpcsp	INT64	posdb disk cache size for small termlists	30000000	How much file cache size to use in bytes for termlists smaller than 1MB? Posdb is the index.
45	balloon	INT64	posdb disk cache size for large termlists	100000000	How much file cache size to use in bytes for termlists greater than or equal to 1MB? Posdb is the index.
46	dpcst	INT64	tagdb disk cache size	30000000	How much file cache size to use in bytes? Tagdb is consulted at spider time and query time to determine if a url or outlink is banned or what its siterank is, etc.
47	dpcsc	INT64	clusterdb disk cache size	30000000	How much file cache size to use in bytes? Gigablast does a lookup in clusterdb for each search result at query time to get its site information for site clustering. If you disable site clustering in the search controls then clusterdb will not be consulted.
48	dpcsx	INT64	titledb disk cache size	30000000	How much file cache size to use in bytes? Titledb holds the cached web pages, compressed. Gigablast consults it to generate a summary for a search result, or to see if a url Gigablast is spidering is already in the index.
49	dpcsy	INT64	spiderdb disk cache size	30000000	How much file cache size to use in bytes? Titledb holds the cached web pages, compressed. Gigablast consults it to generate a summary for a search result, or to see if a url Gigablast is spidering is already in the index.
50	dnsmcm	INT32	dns cache size	30000000	Bytes to use for caching dns replies per host.
51	wlmcm	INT32	winner list cache size	30000000	Bytes to use for caching urls to spider for an IP. Saves CPU to getUrlFilterNum().
52	wlcma	INT32	winner list cache max age	7200	Age in seconds that entries will expire. Keep low if you want freshness as a url might not respider when it is scheduled.
53	rtmcm	INT32	robots.txt cache size	50000000	Bytes to use for caching robots.txt files for spidering per host.
54	hpmcm	INT32	html pages cache size	10000000	Bytes to use for caching html pages for spidering per host.
55	serpmcm	INT32	search results cache size	10000000	Bytes to use for caching search result pages per host.
56	srcma	INT32	search results cache max age	10800	How many seconds should we cache a search results page for?
57	srtcp	BOOL (0 or 1)	send requests to compression proxy	0	If this is true, gb will route download requests for web pages to proxies in hosts.conf. Proxies will download and compress docs before sending back.
58	pdns	IP	dns 0	8.8.8.8	IP address of the primary DNS server. Assumes UDP port 53. REQUIRED FOR SPIDERING! Use Google's public DNS 8.8.8.8 as default.
59	sdns	IP	dns 1	8.8.4.4	IP address of the secondary DNS server. Assumes UDP port 53. Will be accessed in conjunction with the primary dns, so make sure this is always up. An ip of 0 means disabled. Google's secondary public DNS is 8.8.4.4.
60	sdnsa	IP	dns 2	0.0.0.0	All hosts send to these DNSes based on hash of the subdomain to try to split DNS load evenly.
61	sdnsb	IP	dns 3	0.0.0.0
62	sdnsc	IP	dns 4	0.0.0.0
63	sdnsd	IP	dns 5	0.0.0.0
64	sdnse	IP	dns 6	0.0.0.0
65	sdnsf	IP	dns 7	0.0.0.0
66	sdnsg	IP	dns 8	0.0.0.0
67	sdnsh	IP	dns 9	0.0.0.0
68	sdnsi	IP	dns 10	0.0.0.0
69	sdnsj	IP	dns 11	0.0.0.0
70	sdnsk	IP	dns 12	0.0.0.0
71	sdnsl	IP	dns 13	0.0.0.0
72	sdnsm	IP	dns 14	0.0.0.0
73	sdnsn	IP	dns 15	0.0.0.0
74	ut	BOOL (0 or 1)	use threads	1	If enabled, Gigablast will use threads.
75	utfd	BOOL (0 or 1)	use threads for disk	1	If enabled, Gigablast will use threads for disk ops. Now that Gigablast uses pthreads more effectively, leave this enabled for optimal performance in all cases.
76	utfio	BOOL (0 or 1)	use threads for intersects and merges	1	If enabled, Gigablast will use threads for these ops. Default is now on in the event you have simultaneous queries so one query does not hold back the other. There seems to be a bug so leave this ON for now.
77	utfsc	BOOL (0 or 1)	use threads for system calls	1	Gigablast does not make too many system calls so leave this on in case the system call is slow.
78	mct	INT32	max cpu threads	12	Maximum number of threads to use per Gigablast process for intersecting docid lists.
79	fw	BOOL (0 or 1)	flush disk writes	1	If enabled then all writes will be flushed to disk. If not enabled, then gb uses the Linux disk write cache and any hardware write cache. We seem to get better query performance when you enable this, but your disk writing operations will take longer. If this is disabled then the Linux page cache and disk write scheduler can fuck up your performance quite severely. The system can also run out of memory very easily and kill gb processes because it uses all the memory to cache writes.
80	mbtf	INT32	min bytes to flush	2000000	Only call flush after we've written this many bytes on a particular file descriptor. A zero here means to flush after every disk write. Reducing flushing improves write performance. But flushing too little impacts disk read performance for queries as the kernel or hardware may only flush when its buffer is full, thereby causing spikey contention. Not flushing at all can cause the system to run out of memory and the OOM (out of memory) killer to kill gb processes.
81	vwl	BOOL (0 or 1)	verify written lists	0	Ensure lists being written to disk are not corrupt. That title recs appear valid, etc. Helps isolate sources of corruption. Used for debugging.
82	cmbd	BOOL (0 or 1)	check memory before dumping	0	Check the in-memory structures before dumping to disk when they are full. This will SLOW DOWN your query response time during dumps, maybe even freezing things up for 10 seconds while checking. But this will detect BAD RAM DIMMS quite well.
83	vdw	BOOL (0 or 1)	verify disk writes	0	Read what was written in a verification step. Decreases performance, but may help fight disk corruption mostly on Maxtors and Western Digitals.
84	smdt	INT32	max spider read threads	20	Maximum number of threads to use per Gigablast process for accessing the disk for index-building purposes. Keep low to reduce impact on query response time. Increase for fast disks or when preferring build speed over lower query latencies
85	sdt	BOOL (0 or 1)	separate disk reads	1	If enabled then we will not launch a low priority disk read or write while a high priority is outstanding. Help improve query response time at the expense of spider performance.
86	mbs	INT32	merge buf size	500000	Read and write this many bytes at a time when merging files. Smaller values are kinder to query performance, but the merge takes longer. Use at least 1000000 for fast merging.
87	usdb	BOOL (0 or 1)	use statsdb	1	Archive system statistics information in Statsdb.
88	auhs	BOOL (0 or 1)	always use https	1	Redirect all http connections to https
89	asi	BOOL (0 or 1)	allowing spidering of local IPs	0	For security we default to no here, assuming you want to index the web and not your intranet. Indexing documents with local IPs allows one to accidentally index private documents from an internal web server.

/admin/search - search controls [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	c	STRING	collection		Use this collection. REQUIRED
4	mtbtr	INT64	max bytes to read	90000000	Maximum number of bytes to read for a termlist per shard from a single file.
5	msrpq	INT32	max search results per query	100	What is the limit to the total number of returned search results per query?
6	msr	INT32	max search results in total	200	What is the maximum total number of returned search results.
7	msrfpc	INT32	max search results in total for paying clients	1000	What is the limit to the total number of returned search results for clients.
8	langweight	FLOAT32	language weight	240.000000	Defalt language weight if document matches query language, which can be provided using &qlang= or it the default can be set in the search controls. Use this to give results that match the specified the speicified &qlang higher ranking, or docs whose language is unknown. Can be override with &langw in the query url. Set to -1 to only return results in the query language, given by &qlang= or the default query language specified in the search controls. Setting to -1 is prefered over using the gblang: query operator because it is faster.
9	mqt	INT32	max query terms	999999	Do not allow more than this many query terms. Helps prevent big queries from resource hogging.
10	spell	BOOL (0 or 1)	do spell checking by default	1	If enabled while using the XML feed, when Gigablast finds a spelling recommendation it will be included in the XML tag. Default is 0 if using an XML feed, 1 otherwise.
11	scores	BOOL (0 or 1)	get scoring info by default	1	Get scoring information for each result so you can see how each result is scored. You must explicitly request this using &scores=1 for the XML feed because it is not included by default.
12	qe	BOOL (0 or 1)	do query expansion by default	1	If enabled, query expansion will expand your query to include the various forms and synonyms of the query terms.
13	qh	BOOL (0 or 1)	highlight query terms in summaries by default	1	Use to disable or enable highlighting of the query terms in the summaries.
14	tml	INT32	max title len	80	What is the maximum number of characters allowed in titles displayed in the search results?
15	scd	BOOL (0 or 1)	site cluster by default	0	Should search results be site clustered? This limits each site to appearing at most twice in the search results. Sites are subdomains for the most part, like abc.xyz.com.
16	hacr	BOOL (0 or 1)	hide all clustered results	0	Only display at most one result per site.
17	drd	BOOL (0 or 1)	dedup results by default	1	Should duplicate search results be removed? This is based on a content hash of the entire document. So documents must be exactly the same for the most part.
18	stgdbl	BOOL (0 or 1)	do tagdb lookups for queries	1	For each search result a tagdb lookup is made, usually across the network on distributed clusters, to see if the URL's site has been manually banned in tagdb. If you don't manually ban sites then turn this off for extra speed.
19	psds	INT32	percent similar dedup summary default value	90	If document summary (and title) are this percent similar to a document summary above it, then remove it from the search results. 100 means only to remove if exactly the same. 0 means no summary deduping.
20	msld	INT32	number of lines to use in summary to dedup	4	Sets the number of lines to generate for summary deduping. This is to help the deduping process not throw out valid summaries when normally displayed summaries are smaller values. Requires percent similar dedup summary to be non-zero.
21	ddu	BOOL (0 or 1)	dedup URLs by default	0	Should we dedup URLs with case insensitivity? This is mainly to correct duplicate wiki pages.
22	defqlang	STRING	sort language preference default	en	Default language to use for ranking results. Value should be any language abbreviation, for example "en" for English. Use xx to give ranking boosts to no language in particular. See the language abbreviations at the bottom of the url filters page.
23	qcountry	STRING	sort country preference default	us	Default country to use for ranking results. Value should be any country code abbreviation, for example "us" for United States. This is currently not working.
24	sml	INT32	max summary len	512	What is the maximum number of characters displayed in a summary for a search result?
25	smnl	INT32	max summary excerpts	4	What is the maximum number of excerpts displayed in the summary of a search result?
26	smxcpl	INT32	max summary excerpt length	90	What is the maximum number of characters allowed per summary excerpt?
27	smw	INT32	max summary line width by default	80	<br> tags are inserted to keep the number of chars in the summary per line at or below this width. Also affects title. Strings without spaces that exceed this width are not split. Has no affect on xml or json feed, only works on html.
28	clmfs	INT32	bytes of doc to scan for summary generation	70000	Truncating this will miss out on good summaries, but performance will increase.
29	sfht	STRING	front highlight tag		Front html tag used for highlightig query terms in the summaries displated in the search results.
30	sbht	STRING	back highlight tag		Front html tag used for highlightig query terms in the summaries displated in the search results.
31	dsrt	INT32	results to scan for gigabits generation by default	30	How many search results should we scan for gigabit (related topics) generation. Set this to zero to disable gigabits generation by default.
32	ipr	BOOL (0 or 1)	ip restriction for gigabits by default	0	Should Gigablast only get one document per IP domain and per domain for gigabits (related topics) generation?
33	rot	BOOL (0 or 1)	remove overlapping topics	1	Should Gigablast remove overlapping topics (gigabits)?
34	nrt	INT32	number of gigabits to show by default	11	What is the number of related topics (gigabits) displayed per query? Set to 0 to save CPU time.
35	mts	INT32	min gigabit score by default	5	Gigabits (related topics) with scores below this will be excluded. Scores range from 0% to over 100%.
36	mdc	INT32	min gigabit doc count by default	2	How many documents must contain the gigabit (related topic) in order for it to be displayed.
37	dsp	INT32	dedup doc percent for gigabits (related topics)	80	If a document is this percent similar to another document with a higher score, then it will not contribute to the gigabit generation.
38	mwpt	INT32	max words per gigabit (related topic) by default	6	Maximum number of words a gigabit (related topic) can have. Affects xml feeds, too.
39	tmss	INT32	gigabit max sample size	4096	Max chars to sample from each doc for gigabits (related topics).
40	ddc	BOOL (0 or 1)	display dmoz categories in results	1	If enabled, results in dmoz will display their categories on the results page.
41	didc	BOOL (0 or 1)	display indirect dmoz categories in results	0	If enabled, results in dmoz will display their indirect categories on the results page.
42	dscl	BOOL (0 or 1)	display Search Category link to query category of result	0	If enabled, a link will appear next to each category on each result allowing the user to perform their query on that entire category.
43	udfu	BOOL (0 or 1)	use dmoz for untitled	1	Yes to use DMOZ given title when a page is untitled but is in DMOZ.
44	udsm	BOOL (0 or 1)	show dmoz summaries	1	Yes to always show DMOZ summaries with search results that are in DMOZ.
45	sacot	BOOL (0 or 1)	show adult category on top	0	Yes to display the Adult category in the Top category
46	hp	STRING	home page		Html to display for the home page. Leave empty for default home page. Use %N for total number of pages indexed. Use %n for number of pages indexed for the current collection. Use %c to insert the current collection name. Use %q to display the query in a text box. Use %t to display the directory TOP. Example to paste into textbox: <html><title>My Gigablast Search Engine</title><script> function x(){document.f.q.focus();} </script><body onload="x()"><br><br><center><a href=/><img border=0 width=500 height=122 src=/logo-med.jpg></a><br><br><b>My Search Engine</b><br><br><form method=get action=/search name=f><input type=hidden name=c value="%c"><input name=q type=text size=60 value=""> <input type="submit" value="Search"></form><br><center>Searching the <b>%c</b> collection of %n documents.</center><br></body></html>
47	hh	STRING	html head		Html to display before the search results. Leave empty for default. Convenient for changing colors and displaying logos. Use the variable, %q, to represent the query to display in a text box. Use %e to print the url encoded query. Use %S to print sort by date or relevance link. Use %L to display the logo. Use %R to display radio buttons for site search. Use %F to begin the form. and use %H to insert hidden text boxes of parameters like the current search result page number. BOTH %F and %H are necessary for the html head, but do not duplicate them in the html tail. Use %f to display the family filter radio buttons. Example to paste into textbox: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title>My Gigablast Search Results</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body%l> %F<table cellpadding="2" cellspacing="0" border="0"> <tr> <td valign=top>%L</td> <td valign=top> <nobr> <input type="text" name="q" size="60" value="%q"> %D<input type="submit" value="Blast It!" border="0"> </nobr> <br>%f %R </tr> </table> %H
48	ht	STRING	html tail		Html to display after the search results. Leave empty for default. Convenient for changing colors and displaying logos. Use the variable, %q, to represent the query to display in a text box. Use %e to print the url encoded query. Use %S to print sort by date or relevance link. Use %L to display the logo. Use %R to display radio buttons for site search. Use %F to begin the form. and use %H to insert hidden text boxes of parameters like the current search result page number. BOTH %F and %H are necessary for the html head, but do not duplicate them in the html tail. Use %f to display the family filter radio buttons. Example to paste into textbox: <br> <table cellpadding=2 cellspacing=0 border=0> <tr><td></td> <td>%s</td> </tr> </table> Try your search on <a href=http://www.google.com/search?q=%e>google</a>   <a href=http://search.yahoo.com/bin/search?p=%e>yahoo</a>   <a href=http://search.dmoz.org/cgi-bin/search?search=%e>dmoz</a>   </font></body>
49	diffbotOutputFormat	BOOL (0 or 1)	imitate the output format of crawlbot	0
50	outputJsonResultsDirectly	BOOL (0 or 1)	output json results inline	0

/admin/spider - spider controls [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	c	STRING	collection		Use this collection. REQUIRED
4	cse	BOOL (0 or 1)	spidering enabled	1	Controls just the spiders for this collection.
5	sitelisttags	BOOL (0 or 1)	index site tags	0	Index the tags from the sitelist as metadata
6	sitelist	STRING	site list		List of sites to spider, one per line. See example site list below. Gigablast uses the insitelist directive on the url filters page to make sure that the spider only indexes urls that match the site patterns you specify here, other than urls you add individually via the add urls or inject url tools. Limit list to 300MB. If you have a lot of INDIVIDUAL urls to add then consider using the addurl interface.
7	restart	UNARY CMD (set to 1)	restart collection		Remove all documents from the collection and re-add seed urls from site list.
8	pmerge	UNARY CMD (set to 1)	tight merge posdb		Merges all outstanding posdb (index) files.
9	tmerge	UNARY CMD (set to 1)	tight merge titledb		Merges all outstanding titledb (web page cache) files.
10	spmerge	UNARY CMD (set to 1)	tight merge spiderdb		Merges all outstanding spiderdb files.
11	mns	INT32	max spiders	300	What is the maximum number of web pages the spider is allowed to download simultaneously PER HOST for THIS collection? The maximum number of spiders over all collections is controlled in the master controls.
12	ssf	INT32	spiderdb scan frequency	60	Rescan spiderdb every X minutes. This will ensure that urls are spidered somewhat promptly. We may not spider some high priority urls promptly because lower priority urls from the same IP address are occupying the schedule. This will help fix such situations by rebuilding the schedule from scratch every X minutes. The minimum value is 5 minutes, but the recommended minimum is probably 15 minutes.
13	sdms	INT32	spider delay in milliseconds	0	make each spider wait this many milliseconds before getting the ip and downloading the page.
14	obeyRobots	BOOL (0 or 1)	obey robots.txt	1	If this is true Gigablast will respect the robots.txt convention and rel no follow meta tags.
15	usecloaking	BOOL (0 or 1)	use cloaking	1	Pretend our user-agent is googlebot for the purposes of parsing robots.txt. Does not actually set our user agent to googlebot.
16	obeyRelNoFollow	BOOL (0 or 1)	obey rel no follow links	1	If this is true Gigablast will respect the rel no follow link attribute.
17	mrca	INT32	max robots.txt cache age	86400	How many seconds to cache a robots.txt file for. 86400 is 1 day. 0 means Gigablast will not read from the cache at all and will download the robots.txt before every page if robots.txt use is enabled above. However, if this is 0 then Gigablast will still store robots.txt files in the cache.
18	cua	STRING	user agent		If empty then gigablast uses the default user agent specified in the master controls, otherwise, it uses this user agent for this collection only.
19	useProxies	BOOL (0 or 1)	always use spider proxies	0	If this is true Gigablast will ALWAYS use the proxies listed on the proxies page for spidering for this collection.
20	automaticallyuseproxies	BOOL (0 or 1)	automatically use spider proxies	0	Use the spider proxies listed on the proxies page if gb detects that a webserver is throttling the spiders. This way we can learn the webserver's spidering policy so that our spiders can be more polite. If no proxies are listed on the proxies page then this parameter will have no effect. When this is enabled gb will also use proxies for the sites listed in the sites to always use proxies text box on the proxies controls page which applies to all collections. It's basically a list of the sites we know are hard to spider without proxies.
21	useeliteproxies	BOOL (0 or 1)	use elite spider proxies	0	Use the elite spider proxies instead of the regular proxies.
22	automaticallybackoff	BOOL (0 or 1)	automatically back off	0	Set the crawl delay to 5 seconds if gb detects that an IP is throttling or banning gigabot from crawling it. The crawl delay just applies to that IP. Such throttling will be logged.
23	downloadiframes	BOOL (0 or 1)	expand iframes in spidered pages	1	Include iframes in the content of spidered pages.
24	usetimeaxis	BOOL (0 or 1)	use time axis	0	If this is true Gigablast will index the same url multiple times if its content varies over time, rather than overwriting the older version in the index. Useful for archive web pages as they change over time.
25	indexwarcs	BOOL (0 or 1)	index warc or arc files	0	If this is true Gigablast will index .warc and .arc files by injecting the pages contained in them as if they were spidered with the content in the .warc or .arc file. The spidered time will be taken from the archive file as well.
26	dmt	INT32	daily merge time	-1	Do a tight merge on posdb and titledb at this time every day. This is expressed in MINUTES past midnight UTC. UTC is 5 hours ahead of EST and 7 hours ahead of MST. Leave this as -1 to NOT perform a daily merge. To merge at midnight EST use 605=300 and midnight MST use 607=420.
27	dmdl	STRING	daily merge days	0	Comma separated list of days to merge on. Use 0 for Sunday, 1 for Monday, ... 6 for Saturday. Leaving this parmaeter empty or without any numbers will make the daily merge happen every day
28	dtt	BOOL (0 or 1)	turing test enabled	0	If this is true, users will have to pass a simple Turing test to add a url. This prevents automated url submission.
29	mau	INT32	max add urls	0	Maximum number of urls that can be submitted via the addurl interface, per IP domain, per 24 hour period. A value less than or equal to zero implies no limit.
30	de	BOOL (0 or 1)	deduping enabled	0	When enabled, the spider will discard web pages which are identical to other web pages that are already in the index. It most likely has to hit disk to do these checks so it does cause some slow down. Only use it if you need it.
31	dew	BOOL (0 or 1)	deduping enabled for www	1	When enabled, the spider will discard web pages which, when a www is prepended to the page's url, result in a url already in the index. Also, if the url exists in the index with https:// instead of http:// or with https://www.
32	dcep	BOOL (0 or 1)	detect custom error pages	1	Detect and do not index pages which have a 200 status code, but are likely to be error pages.
33	usr	BOOL (0 or 1)	use simplified redirects	1	If this is true, the spider, when a url redirects to a "simpler" url, will add that simpler url into the spider queue and abandon the spidering of the current url.
34	useCanonical	BOOL (0 or 1)	use canonical redirects	1	If page has a on it then treat it as a redirect, add it to spiderdb for spidering and abandon the indexing of the current url.
35	uims	BOOL (0 or 1)	use ifModifiedSince	0	If this is true, the spider, when updating a web page that is already in the index, will not even download the whole page if it hasn't been updated since the last time Gigablast spidered it. This is primarily a bandwidth saving feature. It relies on the remote webserver's returned Last-Modified-Since field being accurate.
36	mtoti	INT32	max term occurences to index	300	If a word is repeated too many times in the same document it can slow queries down quite a bit, so truncate each term at this many occurences. If you are doing book search then you might want to make this value higher.
37	mlkftm	INT32	linkdb min files needed to trigger to merge	6	Merge is triggered when this many linkdb data files are on disk. Raise this when initially growing an index in order to keep merging down.
38	mtftgm	INT32	tagdb min files to merge	2	Merge is triggered when this many tagdb data files are on disk.
39	mtftm	INT32	titledb min files needed to trigger merge	6	Merge is triggered when this many titledb data files are on disk.
40	mpftm	INT32	posdb min files needed to trigger to merge	6	Merge is triggered when this many posdb data files are on disk. Raise this while doing massive injections and not doing much querying. Then when done injecting keep this low to make queries fast.
41	mpsss	INT32	sortdb min files needed to trigger to merge	3	Merge is triggered when this many sortdb data files are on disk.
42	mtftsp	INT32	spiderdb min files needed to trigger to merge	4	Merge is triggered when this many spiderdb data files are on disk.
43	eldb	BOOL (0 or 1)	enable linkdb	1	If this is true Gigablast will index hyper-link text and use hyper-link structures. Link voting as mentioned below requires this data in order to work. If you will never be doing link voting then you can disable this and never index the link information to save disk space.
44	glt	BOOL (0 or 1)	enable link voting	1	If this is true Gigablast will use the indexed hyper-link structures in linkdb to boost the quality of indexed documents. You can disable this when doing a ton of injections to keep things fast. Then do a posdb (index) rebuild after re-enabling this when you are done injecting. Or if you simply do not want link voting this will speed up your injections and spidering a bit.
45	csni	BOOL (0 or 1)	compute inlinks to sites	1	If this is true Gigablast will compute the number of site inlinks for the sites it indexes. This is a measure of the sites popularity and is used for ranking and some times spidering prioritzation. It will cache the site information in tagdb. The greater the number of inlinks, the longer the cached time, because the site is considered more stable. If this is NOT true then Gigablast will use the included file, sitelinks.txt, which stores the site inlinks of millions of the most popular sites. This is the fastest way. If you notice a lot of getting link info requests in the sockets table you may want to disable this parm.
46	dlsc	BOOL (0 or 1)	do link spam checking	1	If this is true, do not allow spammy inlinks to vote. This check is too aggressive for some collections, i.e. it does not allow pages with cgi in their urls to vote.
47	ovpid	BOOL (0 or 1)	restrict link voting by ip	1	If this is true Gigablast will only allow one vote per the top 2 significant bytes of the IP address. Otherwise, multiple pages from the same top IP can contribute to the link text and link-based quality ratings of a particular URL. Furthermore, no votes will be accepted from IPs that have the same top 2 significant bytes as the IP of the page being indexed.
48	uvf	FLOAT32	update link info frequency	60.000000	How often should Gigablast recompute the link info for a url. Also applies to getting the quality of a site or root url, which is based on the link info. In days. Can use decimals. 0 means to update the link info every time the url's content is re-indexed. If the content is not reindexed because it is unchanged then the link info will not be updated. When getting the link info or quality of the root url from an external cluster, Gigablast will tell the external cluster to recompute it if its age is this or higher.
49	dsd	BOOL (0 or 1)	do serp detection	1	If this is eabled the spider will not allow any docs which are determined to be serps.
50	mtdl	INT32	max text doc length	4194304	Gigablast will not download, index or store more than this many bytes of an HTML or text document. XML is NOT considered to be HTML or text, use the rule below to control the maximum length of an XML document. Use -1 for no max.
51	modl	INT32	max other doc length	4194304	Gigablast will not download, index or store more than this many bytes of a non-html, non-text document. XML documents will be restricted to this length. Use -1 for no max.
52	aft	BOOL (0 or 1)	apply filter to text pages	0	If this is false then the filter will not be used on html or text pages.
53	confilter	STRING	filter name		Program to spawn to filter all HTTP replies the spider receives, taking input as first arg (max length 64). Leave blank for none.
54	fto	INT32	filter timeout	40	Kill filter shell after this many seconds. Assume it stalled permanently.
55	upth	BOOL (0 or 1)	use pdftohtml and other filters	1	This cores a lot, so we have an option to turn it off. Also affects antiword, excel and other filters.
56	mit	BOOL (0 or 1)	make image thumbnails	0	Try to find the best image on each page and store it as a thumbnail for presenting in the search results.
57	mtwh	INT32	max thumbnail width or height	250	This is in pixels and limits the size of the thumbnail. Gigablast tries to make at least the width or the height equal to this maximum, but, unless the thumbnail is sqaure, one side will be longer than the other.
58	isr	BOOL (0 or 1)	index spider status documents	0	Index a spider status "document" for every url the spider attempts to spider. Search for them using special query operators like type:status or gberrorstr:success or stats:gberrornum to get a histogram. See syntax page for more examples. They will not otherwise show up in the search results.
59	thebs	BOOL (0 or 1)	translate html entities before storing	1	Replace < and > chars with \| in order to not confuse the parser.
60	ahtsubh	BOOL (0 or 1)	assign spider host by url hash	0	Delegate spidering by url hash rather than by first ip. Good for collections which are only spidering one domain.
61	ib	BOOL (0 or 1)	index body	1	Index the body of the documents so you can search it. Required for searching that. You wil pretty much always want to keep this enabled. Does not apply to JSON documents.
62	usub	STRING	url find and replace		Match all spidered urls to transform them. The format is space separated tuples: match_regex replace_val, one per line. If the replace val is omitted then it will be replaced with nothing. Back references are permitted using \1 - \9. For example: \?.* would remove all cgi parms. (https?://)www\. \1mx\. Would replace all www. sites with mx. sites.
63	apiUrl	STRING	diffbot api url		Send every spidered url to this url and index the reply in addition to the normal indexing process. Example: by specifying http://api.diffbot.com/v3/analyze?mode=high-precision&token= here you can index the structured JSON replies from diffbot for every url that is spidered. Gigablast will automatically append a &url= to this url before sending it to diffbot.
64	urlProcessPatternTwo	STRING	diffbot url process pattern		Only send urls that match this simple substring pattern to Diffbot. Separate substrings with two pipe operators, \|\|. Leave empty for no restrictions.
65	urlProcessRegExTwo	STRING	diffbot url process regex		Only send urls that match this regular expression to Diffbot. Leave empty for no restrictions.
66	pageProcessPatternTwo	STRING	diffbot page process pattern		Only send urls whose content matches this simple substring pattern to Diffbot. Separate substrings with two pipe operators, \|\|. Leave empty for no restrictions.

/admin/proxies - proxies [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	eliteproxyips	STRING	elite spider proxy ips		Like spider proxies, but these are more expensive, so we don't send test queries.
4	maxproxiestotryperipwhenconfident	INT32	max proxies to try per ip (when confident of ban)	10	If we detect a proxy is banned how many more do we try before giving up. This is only when we are confident the proxy IP is banned. Like when we detect certain captchas. However, a lot of pages have captchas to prevent automated access and it does not mean your IP is banned.
5	maxproxiestotryperppwhenunconfident	INT32	max proxies to try per ip (when UNconfident of ban)	2	If we detect a proxy is banned how many more do we try before giving up. This is only when we are not so confident the proxy IP is banned, but it could be. Like, for instance, a TCP timedout or connection refused, or 403 Permission Denied response. All of which may or may not indicate a ban.
6	minipcrawldelayperproxyMS	INT32	min ip crawl delay per proxy	288000	Each proxy waits at least this many milliseconds between downloads for urls from the same IP. Does not apply to super proxies (only 1 proxy of a given type specified).
7	proxytesturl	STRING	spider proxy test url	http://www.gigablast.com/	Download this url every minute through each proxy listed above to ensure they are up. Typically you should make this a URL you own so you do not aggravate another webmaster.
8	proxytesttime	INT32	spider proxy test ping time	60	Download the proxy test url once every this many seconds. Set to 0 to disable
9	resetproxytable	UNARY CMD (set to 1)	reset proxy table		Reset the proxy statistics in the table below. Makes all your proxies treated like new again.
10	userandagents	BOOL (0 or 1)	mix up user agents	1	Use random user-agents when downloading through a spider proxy listed above to protecting gb's anonymity. The User-Agent used is a function of the proxy IP/port and IP of the url being downloaded. That way it is consistent when downloading the same website through the same proxy.
11	proxyAuth	STRING	squid proxy authorized users		Gigablast can also simulate a squid proxy, complete with caching. It will forward your request to the proxies you list above, if any. This list consists of space-separated username:password items. Leave this list empty to disable squid caching behaviour. The default cache size for this is 10MB per shard. Use item : to allow anyone access.
12	alwaysuseproxysites	STRING	sites to always use proxies	facebook.com linkedin.com amazon.com yelp.com irs.gov quora.com stackexchange.com	always use proxies when hitting these sites. The collection must have always use proxies or automatically use proxies enabled before gb will use proxies for the listed sites.
13	alwaysuseeliteproxysites	STRING	sites to always use elite proxies		always use elite proxies when hitting these sites. The collection must have always use proxies or automatically use proxies enabled before gb will use the elite proxies for the listed sites.
14	userobotstxtexceptions	STRING	sites to not use robots.txt		Sites in this list do not use robots.txt, even if that setting is enabled at the collection level.
15	userelnofollowexceptions	STRING	sites to not respect rel nofollow on links		Sites in this list still follow links, even when rel=nofollow is set.
16	usecanonicalexceptions	STRING	sites to not use canonical redirects		Sites in this dont redirect on rel=canonical, even if it is enabled
17	posttodiffbotexceptions	STRING	sites to not post their content to diffbot		Sites in this dont POST their content to diffbot, even if they used a proxy.

/admin/log - log controls [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	hr	BOOL (0 or 1)	log http requests	1	Log GET and POST requests received from the http server?
4	laq	BOOL (0 or 1)	log autobanned queries	1	Should we log queries that are autobanned? They can really fill up the log.
5	lqtt	INT32	log query time threshold	5000	If query took this many millliseconds or longer, then log the query and the time it took to process.
6	lqr	BOOL (0 or 1)	log query reply	0	Log query reply in proxy, but only for those queries above the time threshold above.
7	lsu	BOOL (0 or 1)	log spidered urls	1	Log status of spidered or injected urls?
8	lnc	BOOL (0 or 1)	log network congestion	0	Log messages if Gigablast runs out of udp sockets?
9	li	BOOL (0 or 1)	log informational messages	1	Log messages not related to an error condition, but meant more to give an idea of the state of the gigablast process. These can be useful when diagnosing problems.
10	llb	BOOL (0 or 1)	log limit breeches	0	Log it when document not added due to quota breech. Log it when url is too long and it gets truncated.
11	lda	BOOL (0 or 1)	log debug admin messages	0	Log various debug messages.
12	ldb	BOOL (0 or 1)	log debug build messages	0
13	ldci	BOOL (0 or 1)	log debug crawl info messages	1
14	ldd	BOOL (0 or 1)	log debug database messages	0
15	lddm	BOOL (0 or 1)	log debug dirty messages	0
16	lddi	BOOL (0 or 1)	log debug disk messages	0
17	ldpc	BOOL (0 or 1)	log debug disk page cache	0
18	lddns	BOOL (0 or 1)	log debug dns messages	0
19	ldh	BOOL (0 or 1)	log debug http messages	0
20	ldi	BOOL (0 or 1)	log debug image messages	0
21	ldl	BOOL (0 or 1)	log debug loop messages	0
22	ldg	BOOL (0 or 1)	log debug language detection messages	0
23	ldli	BOOL (0 or 1)	log debug link info	0
24	ldm	BOOL (0 or 1)	log debug mem messages	0
25	ldmu	BOOL (0 or 1)	log debug mem usage messages	0
26	ldn	BOOL (0 or 1)	log debug net messages	0
27	ldq	BOOL (0 or 1)	log debug query messages	0
28	ldqta	BOOL (0 or 1)	log debug quota messages	0
29	ldr	BOOL (0 or 1)	log debug robots messages	0
30	lds	BOOL (0 or 1)	log debug spider cache messages	0
31	ldsp	BOOL (0 or 1)	log debug speller messages	0
32	ldscc	BOOL (0 or 1)	log debug sections messages	0
33	ldserp	BOOL (0 or 1)	log debug serp cache	0
34	ldsi	BOOL (0 or 1)	log debug seo insert messages	0
35	ldseo	BOOL (0 or 1)	log debug seo messages	0
36	ldst	BOOL (0 or 1)	log debug stats messages	0
37	ldsu	BOOL (0 or 1)	log debug summary messages	0
38	ldspid	BOOL (0 or 1)	log debug spider messages	0
39	ldspmth	BOOL (0 or 1)	log debug msg13 messages	0
40	dmth	BOOL (0 or 1)	disable host0 for msg13 reception hack	0
41	ldspr	BOOL (0 or 1)	log debug spider proxies	0
42	ldspua	BOOL (0 or 1)	log debug url attempts	0
43	ldsd	BOOL (0 or 1)	log debug spider downloads	0
44	ldfb	BOOL (0 or 1)	log debug facebook	0
45	ldtm	BOOL (0 or 1)	log debug tagdb messages	0
46	ldt	BOOL (0 or 1)	log debug tcp messages	0
47	ldtb	BOOL (0 or 1)	log debug tcp buffer messages	0
48	ldth	BOOL (0 or 1)	log debug thread messages	0
49	ldti	BOOL (0 or 1)	log debug title messages	0
50	ldtim	BOOL (0 or 1)	log debug timedb messages	0
51	ldto	BOOL (0 or 1)	log debug topic messages	0
52	ldtopd	BOOL (0 or 1)	log debug topDoc messages	0
53	ldu	BOOL (0 or 1)	log debug udp messages	0
54	ldun	BOOL (0 or 1)	log debug unicode messages	0
55	ldwlc	BOOL (0 or 1)	log debug winner list cache	1
56	ldre	BOOL (0 or 1)	log debug rebuild messages	0
57	ldpd	BOOL (0 or 1)	log debug pub date extraction messages	0
58	ltb	BOOL (0 or 1)	log timing messages for build	0	Log various timing related messages.
59	ltadm	BOOL (0 or 1)	log timing messages for admin	0	Log various timing related messages.
60	ltd	BOOL (0 or 1)	log timing messages for database	0
61	ltn	BOOL (0 or 1)	log timing messages for network layer	0
62	ltq	BOOL (0 or 1)	log timing messages for query	0
63	ltspc	BOOL (0 or 1)	log timing messages for spcache	0	Log various timing related messages.
64	ltt	BOOL (0 or 1)	log timing messages for related topics	0
65	lr	BOOL (0 or 1)	log reminder messages	0	Log reminders to the programmer. You do not need this.

/admin/masterpasswords - master passwords [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.

/admin/addcoll - add a new collection [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	addcoll	UNARY CMD (set to 1)	add collection		Add a new collection with this name. No spaces allowed or strange characters allowed. Max of 64 characters. REQUIRED

/admin/delcoll - delete a collection [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	delcoll	UNARY CMD (set to 1)	delete collection		Delete the specified collection. You can specify multiple &delcoll= parms in a single request to delete multiple collections at once. REQUIRED

/admin/clonecoll - clone one collection's settings to another [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	c	STRING	collection		Clone settings INTO this collection. REQUIRED
4	clonecoll	UNARY CMD (set to 1)	clone collection		Clone collection settings FROM this collection. REQUIRED

/admin/rebuild - rebuild data [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	rme	BOOL (0 or 1)	rebuild mode enabled	0	If enabled, gigablast will rebuild the rdbs as specified by the parameters below. When a particular collection is in rebuild mode, it can not spider or merge titledb files.
4	rctr	STRING	collection to rebuild		Name of collection to rebuild. REQUIRED
5	rac	BOOL (0 or 1)	rebuild ALL collections	0	If enabled, gigablast will rebuild all collections.
6	rmtu	INT32	memory to use for rebuild	200000000	In bytes.
7	mrps	INT32	max rebuild injections	1	Maximum number of outstanding injections for rebuild.
8	rfr	BOOL (0 or 1)	full rebuild	0	If enabled, gigablast will reinject the content of all title recs into a secondary rdb system. That will the primary rdb system when complete.
9	rfrknsx	BOOL (0 or 1)	add spiderdb recs of non indexed urls	0	If enabled, gigablast will add the spiderdb records of unindexed urls when doing the full rebuild or the spiderdb rebuild. Otherwise, only the indexed urls will get spiderdb records in spiderdb. This can be faster because Gigablast does not have to do an IP lookup on every url if its IP address is not in tagdb already.
10	rrli	BOOL (0 or 1)	recycle link text	1	If enabled, gigablast will recycle the link text when rebuilding titledb. The siterank, which is determined by the number of inlinks to a site, is stored/cached in tagdb so that is a separate item. If you want to pick up new link text you will want to set this to NO and make sure to rebuild titledb, since that stores the link text. If you recycle link text the rebuild might be around twice as fast.
11	rrt	BOOL (0 or 1)	rebuild titledb	0	If enabled, gigablast will rebuild this rdb
12	rri	BOOL (0 or 1)	rebuild posdb	1	If enabled, gigablast will rebuild this rdb, ths index.
13	rrcl	BOOL (0 or 1)	rebuild clusterdb	0	If enabled, gigablast will rebuild this rdb
14	rrsp	BOOL (0 or 1)	rebuild spiderdb	0	If enabled, gigablast will rebuild this rdb
15	rrld	BOOL (0 or 1)	rebuild linkdb	0	If enabled, gigablast will rebuild this rdb
16	ruru	BOOL (0 or 1)	rebuild root urls	1	If disabled, gigablast will skip root urls.
17	runru	BOOL (0 or 1)	rebuild non-root urls	1	If disabled, gigablast will skip non-root urls.

/admin/inject - inject url in the index here [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	c	STRING	collection		Inject into this collection. REQUIRED
4	url	STRING	url		Specify the URL that will be immediately crawled and indexed in real time while you wait. The browser will return the final index status code. Alternatively, use the add url page to add urls individually or in bulk without having to wait for the pages to be actually indexed in realtime. By default, injected urls take precedence over the "insitelist" expression in the url filters so injected urls need not match the patterns in your site list. You can change that behavior in the url filters if you want. Injected urls will have a hopcount of 0. The injection api is described on the api page. Make up a fake url if you are injecting content that does not have one. If the url ends in .warc or .arc or .warc.gz or .arc.gz Gigablast will index the contained documents as individual documents, using the appropriate dates and other meta information contained in the containing archive file. REQUIRED
5	qts	STRING	query to scrape		Scrape popular search engines for this query and inject their links. You are not required to supply the url parm if you supply this parm.
6	injectlinks	BOOL (0 or 1)	inject links	0	Should we inject the links found in the injected content as well?
7	spiderlinks	BOOL (0 or 1)	spider links	0	Add the outlinks of the injected content into spiderdb for spidering?
8	newonly	BOOL (0 or 1)	only inject content if new	0	If the specified url is already in the index then skip the injection.
9	deleteurl	BOOL (0 or 1)	delete from index	0	Delete the specified url from the index.
10	recycle	BOOL (0 or 1)	recycle content	0	If the url is already in the index, then do not re-download the content, just use the content that was stored in the cache from last time.
11	dedup	BOOL (0 or 1)	dedup url	0	Do not index the url if there is already another url in the index with the same content.
12	urlip	IP	url IP	0.0.0.0	Use this IP when injecting the document. Do not use or set to 0.0.0.0, if unknown. If provided, it will save an IP lookup.
13	hasmime	BOOL (0 or 1)	content has mime	0	If the content of the url is provided below, does it begin with an HTTP mime header?
14	delim	STRING	content delimeter		If the content of the url is provided below, then it consist of multiple documents separated by this delimeter. Each such item will be injected as an independent document. Some possible delimeters: ======== or <doc>. If you set hasmime above to true then Gigablast will check for a url after the delimeter and use that url as the injected url. Otherwise it will append numbers to the url you provide above.
15	contenttype	STRING	content type	text/html	If you supply content in the text box below without an HTTP mime header, then you need to enter the content type. Possible values: text/html text/plain text/xml application/json
16	charset	INT32	content charset	106	A number representing the charset of the content if provided below and no HTTP mime header is given. Defaults to utf8 which is 106. See iana_charset.h for the numeric values.
17	content	STRING	content		If you want to supply the URL's content rather than have Gigablast download it, then enter the content here. Enter MIME header first if "content has mime" is set to true above. Separate MIME from actual content with two returns. At least put a single space in here if you want to inject empty content, otherwise the content will be downloaded from the url. This is because the page injection form always submits the content text area even if it is empty, which should signify that the content should be downloaded.
18	metadata	STRING	metadata		Json encoded metadata to be indexed with the document.
19	sections	BOOL (0 or 1)	get sectiondb voting info	0	Return section information of injected content for the injected subdomain.
20	diffbotreply	STRING	diffbot reply		Used exclusively by diffbot. Do not use.

/admin/addurl - add url page for admin [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	c	STRING	collection		Add urls into this collection. REQUIRED
4	urls	STRING	urls to add		List of urls to index. One per line or space separated. If your url does not index as you expect you can check it's spider history by doing a url: search on it. Added urls will match the isaddurl directive on the url filters page. The add url api is described on the api page. REQUIRED
5	strip	BOOL (0 or 1)	strip sessionids	1	Strip added urls of their session ids.
6	spiderlinks	BOOL (0 or 1)	harvest links	1	Harvest links of added urls so we can spider them?.
7	hopcount	INT32	hopcount	255	Hopcount of the url. How far is it from the seeds, measured in hops.

/admin/reindex - query delete/reindex [ show parms in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.
3	c	STRING	collection		query reindex in this collection. REQUIRED
4	q	STRING	query to reindex or delete		We either reindex or delete the search results of this query. Reindexing them will redownload them and possible update the siterank, which is based on the number of links to the site. This will add the url requests to the spider queue so ensure your spiders are enabled. REQUIRED
5	srn	INT32	start result number	0	Starting with this result #. Starts at 0.
6	ern	INT32	end result number	99999999	Ending with this result #. 0 is the first result #.
7	qlang	STRING	query language	en	The language the query is in. Used to rank results. Just use xx to indicate no language in particular. But you should use the same qlang value you used for doing the query if you want consistency.
8	qrecycle	BOOL (0 or 1)	recycle content	0	If you check this box then Gigablast will not re-download the content, but use the content that was stored in the cache from last time. Useful for rebuilding the index to pick up new inlink text or fresher sitenuminlinks counts which influence ranking.
9	forcedel	BOOL (0 or 1)	FORCE DELETE	0	Check this checkbox to delete the results, not just reindex them.
10	quickdel	BOOL (0 or 1)	QUICK DELETE	0	Check this checkbox to INSTANTLY delete the results, not just reindex them. You'll have to do a tight merge on posdb and on titledb to realize the deletes. This is currently only available to crawlbot custom crawls.
11	forceupdate	BOOL (0 or 1)	Force Update	0	Check this to ignore processing errors or duplicate detection to update the results unconditionally.

/admin/stats - general statistics [ show parms in xml or json ] [ show status in xml or json ]

Input
#	Parm	Type	Title	Default Value	Description
1	format	STRING	output format	html	Display output in this format. Can be html, json or xml.
2	showinput	BOOL (0 or 1)	show input and settings	1	Display possible input and the values of all settings on this page.