As a web developer you may run into occasions where your database struggles to handle search queries. Have you ever been in a situation where you search for a query containing a few keywords and your website completely hangs? Not a very good feeling. This kind of issue happens usually on websites with at least thousands or even hundreds of thousands of rows and where search involves accessing several db tables. Today I’m going to discuss a solution to speed up search on a db. Have you ever searched on your website for something and get rather irrelevant results? The tool I’m going to discuss solves that problem as well. It’s called Sphinx. Sphinx is a powerful search engine written in C++ and allows blazing fast search ranked by relevance, much more than a traditional database can handle. Sphinx is open source and also available under commercial license. I’ll talk about how to set up Sphinx and get started with it on osX.
First, let’s prepare some data to work with. We assume our website is for car catalogs. Note: I got this sample data and PHP file courtesy of a very good Sphinx tutorial at http://www.ibm.com/developerworks/library/os-php-sphinxsearch/
Create a database testdb and add a few tables to it using the following SQL statements:
CREATE TABLE IF NOT EXISTS `Assembly` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`label` varchar(7) NOT NULL,
`description` varchar(128) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=8 ;
--
-- Dumping data for table `Assembly`
--
INSERT INTO `Assembly` (`id`, `label`, `description`) VALUES
(1, '5-00', 'Seats'),
(2, '4-00', 'Electrical'),
(3, '3-00', 'Glasses'),
(4, '2-00', 'Frame'),
(5, '1-00', 'Engine'),
(7, '101-00', 'Accessories');
-- --------------------------------------------------------
--
-- Table structure for table `Inventory`
--
CREATE TABLE IF NOT EXISTS `Inventory` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`partno` varchar(32) NOT NULL,
`description` varchar(256) NOT NULL,
`price` float unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `partno` (`partno`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=10 ;
--
-- Dumping data for table `Inventory`
--
INSERT INTO `Inventory` (`id`, `partno`, `description`, `price`) VALUES
(1, 'WIN408', 'Portal window', 423),
(2, 'ACC711', 'Jack kit', 110),
(3, 'ACC43', 'Rear-view mirror', 55),
(4, 'ACC5409', 'Cigarette lighter', 20),
(5, 'WIN958', 'Windshield, front', 500),
(6, '765432', 'Bolt', 0.1),
(7, 'ENG001', 'Entire engine', 10000),
(8, 'ENG088', 'Cylinder head', 55),
(9, 'ENG976', 'Large cylinder head', 65);
-- --------------------------------------------------------
--
-- Table structure for table `Model`
--
CREATE TABLE IF NOT EXISTS `Model` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`label` varchar(100) NOT NULL,
`description` varchar(256) NOT NULL,
`begin_production` int(4) NOT NULL,
`end_production` int(4) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=8 ;
--
-- Dumping data for table `Model`
--
INSERT INTO `Model` (`id`, `label`, `description`, `begin_production`, `end_production`) VALUES
(1, 'X Sedan', 'Four-door performance sedan', 1998, 1999),
(3, 'X Sedan', 'Four door performance sedan, 1st model year', 1995, 1997),
(4, 'J Convertible', 'Two-door roadster, metal retracting roof', 2002, 2005),
(5, 'J Convertible', 'Two-door roadster', 2000, 2001),
(7, 'W Wagon', 'Four-door, all-wheel drive sport station wagon', 2007, 0);
-- --------------------------------------------------------
--
-- Table structure for table `Schematic`
--
CREATE TABLE Schematic (
id int(10) unsigned NOT NULL auto_increment,
partno_id int(10) unsigned NOT NULL,
assembly_id int(10) unsigned NOT NULL,
model_id int(10) unsigned NOT NULL,
PRIMARY KEY (id),
KEY partno_index USING BTREE (partno_id),
KEY assembly_index USING BTREE (assembly_id),
KEY model_index USING BTREE (model_id),
FOREIGN KEY (partno_id) REFERENCES Inventory(id),
FOREIGN KEY (assembly_id) REFERENCES Assembly(id),
FOREIGN KEY (model_id) REFERENCES Model(id)
) ENGINE=InnoDB;
--
-- Dumping data for table `Schematic`
--
INSERT INTO `Schematic` (`id`, `partno_id`, `assembly_id`, `model_id`) VALUES
(1, 6, 5, 1),
(2, 8, 5, 1),
(3, 1, 3, 1),
(4, 5, 3, 1),
(5, 8, 5, 7),
(6, 6, 5, 7),
(7, 4, 7, 3),
(8, 9, 5, 3);
-- --------------------------------------------------------
--
-- Structure for view `catalog`
--
CREATE OR REPLACE VIEW Catalog AS
SELECT
Inventory.id,
Inventory.partno,
Inventory.description,
Assembly.id AS assembly,
Model.id AS model
FROM
Assembly, Inventory, Model, Schematic
WHERE
Schematic.partno_id=Inventory.id
AND Schematic.model_id=Model.id
AND Schematic.assembly_id=Assembly.id;
A bit of explanation of the tables. Assembly tables is for storing various modules of a car. Inventory stores all the parts we have in store. Model stores the car model. Schematic stores information required to put together a car of a given model with a specific part and for a specific assembly. Finally there’s a view called catalog which brings together the fields we need to serve as source to Sphinx, including the inventory id, part number, description, assembly id and model id.
If we inspect catalog the contents should look like:
+----+---------+---------------------+----------+-------+
| id | partno | description | assembly | model |
+----+---------+---------------------+----------+-------+
| 6 | 765432 | Bolt | 5 | 1 |
| 8 | ENG088 | Cylinder head | 5 | 1 |
| 1 | WIN408 | Portal window | 3 | 1 |
| 5 | WIN958 | Windshield, front | 3 | 1 |
| 4 | ACC5409 | Cigarette lighter | 7 | 3 |
| 9 | ENG976 | Large cylinder head | 5 | 3 |
| 8 | ENG088 | Cylinder head | 5 | 7 |
| 6 | 765432 | Bolt | 5 | 7 |
+----+---------+---------------------+----------+-------+
Next, install Sphinx by downloading the latest release version from http://sphinxsearch.com/downloads/release/. At the moment the latest is 2.2.9. It takes the form of a tar.gz archive.
Extract the contents from the archive and This will give you a folder like sphinx-2.2.9-release-osx10.10-x86_64
Move the folder to /usr/local/sphinx
Next we need to create a Sphinx config file. Change to /usr/local/sphinx and copy the sample config fiel sphinx-min.conf.in to sphinx.conf.
Next, open sphinx.conf and edit it to contain the following contents:
source catalog
{
type = mysql
sql_host = localhost
sql_user = testdb
sql_pass = testdb
sql_db = testdb
sql_port = 3306 # optional, default is 3306
sql_query = \
SELECT \
id, partno, description, \
assembly, model \
FROM \
Catalog;
sql_attr_uint = assembly
sql_attr_uint = model
}
index catalog
{
source = catalog
path = /usr/local/sphinx/data/catalog
min_word_len = 3
min_prefix_len = 0
min_infix_len = 3
}
indexer
{
mem_limit = 128M
}
searchd
{
listen = 9312
listen = 9306:mysql41
log = /usr/local/sphinx/log/searchd.log
query_log = /usr/local/sphinx/log/query.log
read_timeout = 5
max_children = 30
pid_file = /usr/local/sphinx/log/searchd.pid
max_matches = 1000
seamless_rotate = 1
preopen_indexes = 1
unlink_old = 1
workers = threads # for RT to work
binlog_path = /usr/local/sphinx/data
}
Save and close the file.
Some explanation. The source section gives Sphinx information about your database. I use a MySQL db in this tutorial. sql_query is the main document fetch query which Sphinx uses to grab data from our db. sql_attr_uint is used to specify unsigned integers. The index section says how to index value. Update its source parameter to your desired value. The searchd section is for the search daemon
Now configuration is complete. To make command line work easier, add /usr/local/sphinx/bin/ to your PATH.
Next, index the data into Sphinx. Run the following command:
indexer --config /usr/local/sphinx/sphinx.conf catalog
You should see some text appear that ends with something like
total 8 docs, 149 bytes
total 0.062 sec, 2397 bytes/sec, 128.71 docs/sec
total 4 reads, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
total 12 writes, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
That means indexing is complete. We can now run queries against Sphinx. First, we need to start the search daemon using the command:
sudo searchd -c /usr/local/sphinx/sphinx.conf
Note: to stop the daemon run the command:
sudo searchd --config /usr/local/sphinx/sphinx.conf --stop
Now we are going to write a PHP file that gets data from Sphinx.
Create a file called query_cat.php and add the following content to it:
include('/usr/local/sphinx/api/sphinxapi.php');
$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetMatchMode(SPH_MATCH_ANY); // match any word of the query
$cl->SetFilter('model', array(3)); // filter results to those where model is 3
$result = $cl->Query('Windshield', 'catalog'); // query the word Windshield from the index called catalog.
if ($result === false){
echo 'Query failed: ' . $cl->GetLastError()."\n";
} else {
if ($cl->GetLastWarning()){
echo "Warning: ". $cl->GetLastWarning()."\n";
}
if ( ! empty($result["matches"]) ) {
foreach ( $result["matches"] as $doc => $docinfo ) {
echo "$doc\n";
}
print_r( $result );
}
}
You will get an output like
9
Array
(
[error] =>
[warning] =>
[status] => 0
[fields] => Array
(
[0] => partno
[1] => description
)
[attrs] => Array
(
[assembly] => 1
[model] => 1
)
[matches] => Array
(
[9] => Array
(
[weight] => 1
[attrs] => Array
(
[assembly] => 5
[model] => 3
)
)
)
[total] => 1
[total_found] => 1
[time] => 0.001
[words] => Array
(
[cylind] => Array
(
[docs] => 2
[hits] => 2
)
)
)
A warning may appear above the message saying that calling a method isn’t recommended, but you can ignore it.
As you can see from the result, 9 is the single result and is indeed the only inventory id for model id 3 containing the keyword ‘container’. Refer to the printout of catalog above.
That’s it for now. There are many other options that can be explored as you play around with Sphinx. Also, there’s a so-called SphinxQL which can be use to query Sphinx instead of using the API.