Comparison of Gigablast vs SOLR Open Source Search Engine

Comparing Gigablast to SOLR

Feature | Gigablast | Solr
Package Installation | Download packages for Ubuntu or RedHat | Instructions
Source Language | C/C++ | Java
Runs on Linux | Yes. | Yes.
Runs on Windows | Yes, with VirtualBox. Soon natively. | Yes.
License | Apache License 2.0 | Apache License 2.0
Release Date | 2000 | 2007
Scalability | Has scaled to over 12 billion unique web pages. Can scale to over 100 billion pages in a single collection. | Good luck!
HTTP API | here | here
Search Results | here | here
Source Repository | GitHub | GitHub
GitHub Star Ratings | 326 (8/2/2014) | 767 (8/2/2014)
Source Installation | Just a few simple steps | Source download instructions
Complete Web GUI | Yes. | ???
Operating Layout | A single binary containing web server, database, admin tools, spider logic, etc. | Many different packages quilted together: Apache, MySQL, Lucene, Tika, Zookeeper, Solr, Nutch, ...
Indexing a Single File Containing Multiple Documents via cmdline | Use curl with the args (including delim) listed here | Unsupported.
Indexing an Individual File via cmdline | Use curl to post the content of the file with the args listed here | You can index individual local files as such: curl "http://127.0.0.1:8080/solr/update" --data-binary @myfile.html -H 'Content-type: text/html' but it does not seem to work unless your HTML meets stringent requirements.
Indexing an Individual URL via cmdline | Use curl to inject the url with the args listed here | ???
Indexing a File of URLs via cmdline | Use one curl command for each url, using the interface described here | ???
Deleting Documents via cmdline | Use a curl command to delete a url, using the interface described here | You can delete individual documents by specifying queries that match just those documents: java -Dcommit....
Getting Results via cmdline | Use a curl command to do a search, using the interface described here | ???
Facets | Yes. Basic support. See the gbfacet operators in the help file. | Yes.
Search Result Limitations Based on Facet Value Counts | Coming soon. | Yes.
Numeric Fields | You can forward/reverse sort by and constrain by numeric fields. | You can forward/reverse sort by and constrain by numeric fields.
Boolean Search | Fully nested boolean search with AND OR NOT. | Fully nested boolean search with AND OR NOT.
Searchable Fields | Yes. Any meta tag, or any field when indexing JSON or XML. | ???
Site Restricted Searches | Yes. Use the site: query operator, or use &sites=... to constrain your search to up to 500 sites. | ???
Spell Checker | Yes, but currently disabled until improved. | Yes.
Language Identification | Yes. On a per-word level for searching purposes. | Yes. Not on a per-word level for searching purposes.
Index Multiple Languages | Yes. Can expand words in many languages to all their different forms. More forms coming soon, too. | Yes, but stemming/expansion may be limited.
Show Images in Search Results | Yes. | No.
Related Concepts | Yes. Called Gigabits. | No.
Query Expansion (Synonyms) | Yes. Also uses the mysynonyms.txt file to add your own expansion terms. | ???
Cached Pages | Yes. | ???
RESTful/XML/JSON APIs | Yes, XML. JSON coming soon. | ???
Schemas | You do not need to define schemas to begin indexing files and urls. | You have to define annoying schemas.
Spidering | Gigablast has a complete distributed web spider with powerful controls. | SOLR has no spider. You can try to integrate Nutch.
Document Filters | antiword (for Microsoft Word), pdftohtml (for PDF), xlstohtml (for Excel), ppthtml (for PowerPoint), pstotext (for PostScript) | Uses Apache Tika for several formats.
Scalability | Highly scalable. Has scaled to over 12 billion pages while serving millions of queries per day. Can easily add new servers to the hosts.conf file and click rebalance shards to rebalance the data. | Has not scaled nearly as high to our knowledge. Not originally built for more than one server.
Cluster Administration | Built into the web GUI. | Requires separate Zookeeper package installation.
Performance | High performance. Written in C/C++. | Slower. Written in Java. Has garbage collection, etc.
Ranking Algorithm | Custom algorithm based on query term proximity. Superior to TF/IDF or Cosine methods. | Old school TF/IDF based on simple statistics.
Scoring Explanations | Complete scoring information provided. | Complete scoring information provided.
Inlink Text | Indexes incoming link text and compensates for link spam. | None. Not geared for web search.
Page Rank | Uses Site Rank based on the number of incoming links to a site from other sites. Detects link spam and compensates accordingly. | None. Not geared for web search.
On-Page Spam | Demotes terms deemed spammy on a page. | None.
Reliability | Pretty good. | Pretty good.
Developer Documentation | Yes. Here. | Yes. Lots of documentation.
Graphing | Graphs performance of various subroutines and query times. | Unknown.
Monitoring | Monitors drive temperature, disk space, query latency and shard uptime. Sends email alerts. | None known.
Geospatial | Can use the numeric gbminint: and gbmaxint: query operators on lat/lon fields. See the help file for examples using these operators. | Yes.
Dynamic Summaries | Yes. Contain query terms. | Yes. Contain query terms.
Site Clustering | Yes. | ???
More Like This | Coming soon. | Yes.
Sort by Date | gbsortbyint:gbspiderdate, gbsortbyint:gbindexdate, gbrevsortbyint:gbspiderdate, gbrevsortbyint:gbindexdate. See the help file for examples using these operators. | ???
Query Completion | Coming soon. | Available with additional module.
Document Collections | Supports tens of thousands of separate collections, and federated search across them. | ???
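
Two of the command-line entries in the table can be sketched in Python for readers who would rather script them than call curl by hand. This is a minimal sketch, not an official client for either engine. The first function mirrors the Solr curl command quoted in the table (same URL and Content-type header). The second only appends one of the Gigablast date-sort operators listed in the table to a query and URL-encodes it; the function names, the `q` parameter name, and the `reverse` flag are assumptions made for illustration, and the search endpoint itself is left out because the table only links to it.

```python
import urllib.parse
import urllib.request

# URL taken from the curl example in the table; adjust host and port as needed.
SOLR_UPDATE_URL = "http://127.0.0.1:8080/solr/update"


def build_solr_html_post(html_bytes, url=SOLR_UPDATE_URL):
    """Build (but do not send) the POST that the table's curl command performs.

    Equivalent in spirit to:
      curl "<url>" --data-binary @myfile.html -H 'Content-type: text/html'
    """
    return urllib.request.Request(
        url,
        data=html_bytes,
        headers={"Content-Type": "text/html"},
        method="POST",
    )


def date_sorted_query(terms, field="gbspiderdate", reverse=False):
    """Append a Gigablast date-sort operator to the query terms.

    The operators (gbsortbyint / gbrevsortbyint) and fields (gbspiderdate /
    gbindexdate) come from the table. The "q" parameter name is an assumption.
    """
    operator = "gbrevsortbyint" if reverse else "gbsortbyint"
    query = f"{terms} {operator}:{field}"
    return urllib.parse.urlencode({"q": query})
```

To actually send the Solr request, read the file as bytes, pass them to `build_solr_html_post`, and hand the result to `urllib.request.urlopen`. As the table notes, Solr may reject HTML that does not meet its requirements.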