Getting Started with Sphinx Search

As a web developer you may run into occasions where your database struggles to handle search queries. Have you ever been in a situation where you search for a query containing a few keywords and your website completely hangs? Not a very good feeling. This kind of issue happens usually on websites with at least thousands or even hundreds of thousands of rows and where search involves accessing several db tables. Today I’m going to discuss a solution to speed up search on a db. Have you ever searched on your website for something and get rather irrelevant results? The tool I’m going to discuss solves that problem as well. It’s called Sphinx. Sphinx is a powerful search engine written in C++ and allows blazing fast search ranked by relevance, much more than a traditional database can handle. Sphinx is open source and also available under commercial license. I’ll talk about how to set up Sphinx and get started with it on osX.

First, let’s prepare some data to work with. We assume our website is for car catalogs. Note: I got this sample data and PHP file courtesy of a very good Sphinx tutorial at http://www.ibm.com/developerworks/library/os-php-sphinxsearch/

Create a database testdb and add a few tables to it using the following SQL statements:

CREATE TABLE IF NOT EXISTS `Assembly` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `label` varchar(7) NOT NULL,
  `description` varchar(128) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=8 ;

--
-- Dumping data for table `Assembly`
--

INSERT INTO `Assembly` (`id`, `label`, `description`) VALUES
(1, '5-00', 'Seats'),
(2, '4-00', 'Electrical'),
(3, '3-00', 'Glasses'),
(4, '2-00', 'Frame'),
(5, '1-00', 'Engine'),
(7, '101-00', 'Accessories');

-- --------------------------------------------------------
--
-- Table structure for table `Inventory`
--

CREATE TABLE IF NOT EXISTS `Inventory` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `partno` varchar(32) NOT NULL,
  `description` varchar(256) NOT NULL,
  `price` float unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `partno` (`partno`) USING BTREE
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=10 ;

--
-- Dumping data for table `Inventory`
--

INSERT INTO `Inventory` (`id`, `partno`, `description`, `price`) VALUES
(1, 'WIN408', 'Portal window', 423),
(2, 'ACC711', 'Jack kit', 110),
(3, 'ACC43', 'Rear-view mirror', 55),
(4, 'ACC5409', 'Cigarette lighter', 20),
(5, 'WIN958', 'Windshield, front', 500),
(6, '765432', 'Bolt', 0.1),
(7, 'ENG001', 'Entire engine', 10000),
(8, 'ENG088', 'Cylinder head', 55),
(9, 'ENG976', 'Large cylinder head', 65);

-- --------------------------------------------------------

--
-- Table structure for table `Model`
--

CREATE TABLE IF NOT EXISTS `Model` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `label` varchar(100) NOT NULL,
  `description` varchar(256) NOT NULL,
  `begin_production` int(4) NOT NULL,
  `end_production` int(4) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=8 ;

--
-- Dumping data for table `Model`
--

INSERT INTO `Model` (`id`, `label`, `description`, `begin_production`, `end_production`) VALUES
(1, 'X Sedan', 'Four-door performance sedan', 1998, 1999),
(3, 'X Sedan', 'Four door performance sedan, 1st model year', 1995, 1997),
(4, 'J Convertible', 'Two-door roadster, metal retracting roof', 2002, 2005),
(5, 'J Convertible', 'Two-door roadster', 2000, 2001),
(7, 'W Wagon', 'Four-door, all-wheel drive sport station wagon', 2007, 0);

-- --------------------------------------------------------

--
-- Table structure for table `Schematic`
--

CREATE TABLE Schematic (
  id int(10) unsigned NOT NULL auto_increment,
  partno_id int(10) unsigned NOT NULL,
  assembly_id int(10) unsigned NOT NULL,
  model_id int(10) unsigned NOT NULL,
  PRIMARY KEY (id),
  KEY partno_index USING BTREE (partno_id),
  KEY assembly_index USING BTREE (assembly_id),
  KEY model_index USING BTREE (model_id),
  FOREIGN KEY (partno_id) REFERENCES Inventory(id),
  FOREIGN KEY (assembly_id) REFERENCES Assembly(id),
  FOREIGN KEY (model_id) REFERENCES Model(id)
) ENGINE=InnoDB;

--
-- Dumping data for table `Schematic`
--

INSERT INTO `Schematic` (`id`, `partno_id`, `assembly_id`, `model_id`) VALUES
(1, 6, 5, 1),
(2, 8, 5, 1),
(3, 1, 3, 1),
(4, 5, 3, 1),
(5, 8, 5, 7),
(6, 6, 5, 7),
(7, 4, 7, 3),
(8, 9, 5, 3);

-- --------------------------------------------------------

--
-- Structure for view `catalog`
--

CREATE OR REPLACE VIEW Catalog AS
SELECT
  Inventory.id,
  Inventory.partno,
  Inventory.description,
  Assembly.id AS assembly,
  Model.id AS model
FROM
  Assembly, Inventory, Model, Schematic
WHERE
  Schematic.partno_id=Inventory.id 
  AND Schematic.model_id=Model.id 
  AND Schematic.assembly_id=Assembly.id;




A bit of explanation of the tables. Assembly tables is for storing various modules of a car. Inventory stores all the parts we have in store. Model stores the car model. Schematic stores information required to put together a car of a given model with a specific part and for a specific assembly. Finally there’s a view called catalog which brings together the fields we need to serve as source to Sphinx, including the inventory id, part number, description, assembly id and model id.

If we inspect catalog the contents should look like:

+----+---------+---------------------+----------+-------+
| id | partno  | description         | assembly | model |
+----+---------+---------------------+----------+-------+
|  6 | 765432  | Bolt                |        5 |     1 |
|  8 | ENG088  | Cylinder head       |        5 |     1 |
|  1 | WIN408  | Portal window       |        3 |     1 |
|  5 | WIN958  | Windshield, front   |        3 |     1 |
|  4 | ACC5409 | Cigarette lighter   |        7 |     3 |
|  9 | ENG976  | Large cylinder head |        5 |     3 |
|  8 | ENG088  | Cylinder head       |        5 |     7 |
|  6 | 765432  | Bolt                |        5 |     7 |
+----+---------+---------------------+----------+-------+

Next, install Sphinx by downloading the latest release version from http://sphinxsearch.com/downloads/release/. At the moment the latest is 2.2.9. It takes the form of a tar.gz archive.

Extract the contents from the archive and This will give you a folder like sphinx-2.2.9-release-osx10.10-x86_64

Move the folder to /usr/local/sphinx

Next we need to create a Sphinx config file. Change to /usr/local/sphinx and copy the sample config fiel sphinx-min.conf.in to sphinx.conf.

Next, open sphinx.conf and edit it to contain the following contents:

source catalog 
{
	type			= mysql

	sql_host		= localhost
	sql_user		= testdb
	sql_pass		= testdb
	sql_db			= testdb
	sql_port		= 3306	# optional, default is 3306

        sql_query               = \
            SELECT \
                id, partno, description, \
                assembly, model \
            FROM \
                Catalog;
        sql_attr_uint        = assembly
        sql_attr_uint        = model
}
    
index catalog 
{
    source              = catalog
    path                = /usr/local/sphinx/data/catalog
    min_word_len        = 3
    min_prefix_len      = 0
    min_infix_len       = 3
}

indexer
{
	mem_limit		= 128M
}


searchd
{
	listen			= 9312
	listen			= 9306:mysql41
	log			= /usr/local/sphinx/log/searchd.log
	query_log		= /usr/local/sphinx/log/query.log
	read_timeout		= 5
	max_children		= 30
	pid_file		= /usr/local/sphinx/log/searchd.pid
        max_matches             = 1000
	seamless_rotate		= 1
	preopen_indexes		= 1
	unlink_old		= 1
	workers			= threads # for RT to work
	binlog_path		= /usr/local/sphinx/data
}

Save and close the file.

Some explanation. The source section gives Sphinx information about your database. I use a MySQL db in this tutorial. sql_query is the main document fetch query which Sphinx uses to grab data from our db. sql_attr_uint is used to specify unsigned integers. The index section says how to index value. Update its source parameter to your desired value. The searchd section is for the search daemon

Now configuration is complete. To make command line work easier, add /usr/local/sphinx/bin/ to your PATH.

Next, index the data into Sphinx. Run the following command:

indexer --config /usr/local/sphinx/sphinx.conf catalog

You should see some text appear that ends with something like

total 8 docs, 149 bytes
total 0.062 sec, 2397 bytes/sec, 128.71 docs/sec
total 4 reads, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
total 12 writes, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg

That means indexing is complete. We can now run queries against Sphinx. First, we need to start the search daemon using the command:

sudo searchd -c /usr/local/sphinx/sphinx.conf

Note: to stop the daemon run the command:

sudo searchd --config /usr/local/sphinx/sphinx.conf --stop

Now we are going to write a PHP file that gets data from Sphinx.

Create a file called query_cat.php and add the following content to it:



include('/usr/local/sphinx/api/sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetMatchMode(SPH_MATCH_ANY); // match any word of the query
$cl->SetFilter('model', array(3)); // filter results to those where model is 3

$result = $cl->Query('Windshield', 'catalog'); // query the word Windshield from the index called catalog.

if ($result === false){
    echo 'Query failed: ' . $cl->GetLastError()."\n";
} else {
    if ($cl->GetLastWarning()){
        echo "Warning: ". $cl->GetLastWarning()."\n";
    }
    if ( ! empty($result["matches"]) ) {
        foreach ( $result["matches"] as $doc => $docinfo ) {
            echo "$doc\n";
        }
      
        print_r( $result );
    }

}

You will get an output like

9
Array
(
    [error] => 
    [warning] => 
    [status] => 0
    [fields] => Array
        (
            [0] => partno
            [1] => description
        )

    [attrs] => Array
        (
            [assembly] => 1
            [model] => 1
        )

    [matches] => Array
        (
            [9] => Array
                (
                    [weight] => 1
                    [attrs] => Array
                        (
                            [assembly] => 5
                            [model] => 3
                        )

                )

        )

    [total] => 1
    [total_found] => 1
    [time] => 0.001
    [words] => Array
        (
            [cylind] => Array
                (
                    [docs] => 2
                    [hits] => 2
                )

        )

)

A warning may appear above the message saying that calling a method isn’t recommended, but you can ignore it.

As you can see from the result, 9 is the single result and is indeed the only inventory id for model id 3 containing the keyword ‘container’. Refer to the printout of catalog above.

That’s it for now. There are many other options that can be explored as you play around with Sphinx. Also, there’s a so-called SphinxQL which can be use to query Sphinx instead of using the API.