\Pluf_Search

Class implementing a small search engine.

Ideal for a small website with up to 100,000 documents.

Summary

Public methods: search(), stem(), searchDocuments(), getWordIds(), index()
Protected and private methods: none
Properties: none
Constants: none

Methods

search()

search(  $query,   $stemmer = 'Pluf_Text_Stemmer_Porter') : array

Search.

Returns an array of arrays, each with model_class, model_id and score. The list is already sorted by score in descending order.

You can then filter the list as you wish with another set of weights.
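As an illustration of that second pass, here is a minimal sketch that re-weights a result list by model class and sorts it again. The sample rows, the $weights table and the reweight() helper are assumptions made for this example; only the model_class, model_id and score keys come from the description above.

```php
<?php
// Sample data shaped like the documented return value of search();
// the model names and scores are made up for illustration.
$results = [
    ['model_class' => 'Blog_Post', 'model_id' => 12, 'score' => 4.0],
    ['model_class' => 'Wiki_Page', 'model_id' => 7,  'score' => 3.5],
    ['model_class' => 'Blog_Post', 'model_id' => 3,  'score' => 1.0],
];

// Hypothetical per-model weights chosen by the application.
$weights = ['Wiki_Page' => 2.0, 'Blog_Post' => 1.0];

// Hypothetical helper, not part of Pluf_Search.
function reweight(array $results, array $weights): array
{
    foreach ($results as &$row) {
        // Models without an explicit weight keep their original score.
        $row['score'] *= $weights[$row['model_class']] ?? 1.0;
    }
    unset($row);

    // Restore the descending order now that the scores have changed.
    usort($results, function ($a, $b) {
        return $b['score'] <=> $a['score'];
    });

    return $results;
}

$ranked = reweight($results, $weights);
```

With these sample weights the wiki page moves to the top of the list, because its score is doubled while the blog posts keep theirs.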

Parameters

$query
$stemmer

Returns

array —

Results.

stem()

stem(  $words,   $stemmer) 

Stem the words with the given stemmer.

Parameters

$words
$stemmer

searchDocuments()

searchDocuments(  $wids) : array

Search documents.

Only the total of the weighted occurrences is used to sort the results.

Parameters

$wids

Returns

array —

Sorted by score, returns model_class, model_id and score.

getWordIds()

getWordIds(  $words) : array

Get the id of each word.

Parameters

$words

Returns

array —

Ids, with null for a word that has no match.

index()

index(  $doc,   $stemmer = 'Pluf_Text_Stemmer_Porter') : array

Index a document.

The document must provide a method _toIndex() returning the document as a string for indexing. The string must be clean, as it will simply be tokenized by Pluf_Text::tokenize().

A recommended way to clean it is to remove all the HTML tags and then, as the last step, run the following on it:

return Pluf_Text::cleanString(html_entity_decode($string, ENT_QUOTES, 'UTF-8'));
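The tag-removal step that precedes this line could look like the sketch below. It uses only PHP built-ins so that it runs on its own; html_to_index_text() is a hypothetical helper name, and the whitespace collapsing stands in for whatever Pluf_Text::cleanString() does, which is not described here.

```php
<?php
// Hypothetical helper, not part of Pluf: prepares HTML for _toIndex().
function html_to_index_text(string $html): string
{
    // Drop the markup first so tag names never reach the tokenizer.
    $text = strip_tags($html);

    // Decode entities such as &amp; or &eacute; into UTF-8 characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse the whitespace runs left behind by the removed tags.
    // In a Pluf application, the cleanString() call shown above would be
    // the final step instead of this line.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

$text = html_to_index_text('<h1>Caf&eacute; &amp; bar</h1> <p>Open   daily</p>');
```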

Indexing is resource intensive, so it is recommended to run it asynchronously. When you save a resource that needs indexing, just write a log entry such as "need to index resource x", then index the logged resources every few minutes. Nobody cares if your index is not perfectly fresh, but your end users do care if it takes 0.6s to get the page back instead of 0.1s.
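One possible shape for such a log is a plain file with one entry per line, drained by a periodic job. The sketch below is an assumption, not something Pluf provides: queue_for_indexing(), drain_queue() and the file format are all hypothetical, and the callback is where a real worker would load the model and pass it to index().

```php
<?php
// Hypothetical helper: record that a resource needs (re)indexing.
function queue_for_indexing(string $queueFile, string $modelClass, int $id): void
{
    // LOCK_EX keeps concurrent appends from interleaving.
    file_put_contents($queueFile, $modelClass . ':' . $id . "\n", FILE_APPEND | LOCK_EX);
}

// Hypothetical helper: index everything queued so far, each resource once.
function drain_queue(string $queueFile, callable $indexOne): int
{
    // 'c+' creates the file if it is missing and never truncates on open.
    $handle = fopen($queueFile, 'c+');
    if ($handle === false) {
        return 0;
    }
    flock($handle, LOCK_EX);
    $entries = [];
    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if ($line !== '') {
            // Array keys de-duplicate resources that were saved several times.
            $entries[$line] = true;
        }
    }
    // Empty the queue while still holding the lock, then release it so
    // writers are not blocked during the slow indexing step.
    ftruncate($handle, 0);
    flock($handle, LOCK_UN);
    fclose($handle);

    foreach (array_keys($entries) as $entry) {
        list($modelClass, $id) = explode(':', $entry, 2);
        $indexOne($modelClass, (int) $id);
    }
    return count($entries);
}

$queue = sys_get_temp_dir() . '/index-queue-' . uniqid() . '.log';
queue_for_indexing($queue, 'Blog_Post', 12);
queue_for_indexing($queue, 'Blog_Post', 12); // same resource saved twice
queue_for_indexing($queue, 'Wiki_Page', 7);

$indexed = [];
$count = drain_queue($queue, function ($modelClass, $id) use (&$indexed) {
    // A real worker would load the model here and hand it to index().
    $indexed[] = $modelClass . '#' . $id;
});
unlink($queue);
```

Because the queue is emptied before indexing starts, an entry whose indexing fails is lost in this sketch; a production queue would need to re-queue or retry it.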

To decide, take 500 typical documents and measure the total time needed to index them. Divide by 500; if the result is more than 0.1s per document, use a log or queue.
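That measurement can be sketched as follows. average_index_time() is a hypothetical helper, and the word-counting callable is only a stand-in workload so the example runs without Pluf; in an application the callable would pass real documents to index().

```php
<?php
// Hypothetical helper: average number of seconds $indexOne takes per document.
function average_index_time(array $docs, callable $indexOne): float
{
    if (count($docs) === 0) {
        return 0.0;
    }
    $start = microtime(true);
    foreach ($docs as $doc) {
        $indexOne($doc);
    }
    return (microtime(true) - $start) / count($docs);
}

// Stand-in workload: 500 identical strings and a trivial callable.
$docs = array_fill(0, 500, 'lorem ipsum dolor sit amet');
$average = average_index_time($docs, function ($doc) {
    return str_word_count($doc);
});

// 0.1s is the threshold suggested in the text above.
$useQueue = $average > 0.1;
```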

FIXME: Concurrency problem if the same document is indexed more than once at the same time.

Parameters

$doc
$stemmer

Returns

array —

Statistics.