index()
index($doc, $stemmer = 'Pluf_Text_Stemmer_Porter') : array
Index a document.
The document must provide a _toIndex() method that returns the
document as a string for indexing. The string must already be
clean, because it is simply tokenized by Pluf_Text::tokenize().
A recommended way to clean it is to remove all the HTML tags
and then, as the last step, run the following on it:
return Pluf_Text::cleanString(html_entity_decode($string,
ENT_QUOTES, 'UTF-8'));
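As an illustration of the cleaning steps above, here is a minimal sketch of a document class. The class name Example_Article and its two properties are hypothetical; only _toIndex(), Pluf_Text::cleanString() and the html_entity_decode() call come from the description. The class_exists() fallback is an assumption added so that the sketch also runs without Pluf loaded.

```php
<?php
// Hypothetical document class, for illustration only.
class Example_Article
{
    public $title;
    public $body;

    public function __construct($title, $body)
    {
        $this->title = $title;
        $this->body = $body;
    }

    /** Return the document as one clean string for the indexer. */
    public function _toIndex()
    {
        // 1. Remove all the HTML tags.
        $string = strip_tags($this->title . ' ' . $this->body);
        // 2. Turn entities such as &eacute; back into real characters.
        $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
        // 3. Final cleanup by Pluf when the framework is loaded. The
        //    fallback exists only so this sketch runs on its own.
        if (class_exists('Pluf_Text')) {
            return Pluf_Text::cleanString($string);
        }
        return $string;
    }
}
```

Note that strip_tags() removes tags without inserting whitespace, so adjacent blocks such as "<p>a</p><p>b</p>" become "ab". If that matters for your content, put a space before each tag prior to stripping.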
Indexing is resource intensive, so it is recommended to run it
asynchronously. When you save a resource to be indexed, just
write a log entry "need to index resource x", then index the
logged resources every few minutes. Nobody cares if your index
is not perfectly fresh, but your end users care if it takes 0.6s
to get back the page instead of 0.1s.
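The log-then-index approach could be sketched as follows. This is only one possible shape, assuming a plain text file as the log; markForIndexing(), drainPending() and the $indexOne callback are hypothetical names, not part of Pluf. The callback is where the document would be loaded and passed to index().

```php
<?php
// Hypothetical helpers, for illustration only.

/** Record "need to index resource x" by appending the id to a log file. */
function markForIndexing(string $logFile, string $resourceId): void
{
    // LOCK_EX keeps concurrent saves from interleaving their lines.
    file_put_contents($logFile, $resourceId . "\n", FILE_APPEND | LOCK_EX);
}

/**
 * Index every pending resource once. Meant to be run every few minutes.
 * Returns the number of resources handed to $indexOne.
 */
function drainPending(string $logFile, callable $indexOne): int
{
    if (!file_exists($logFile)) {
        return 0;
    }
    // Move the log aside so saves arriving during the run start a new one.
    $batch = $logFile . '.' . getmypid() . '.batch';
    if (!rename($logFile, $batch)) {
        return 0;
    }
    $ids = file($batch, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
    unlink($batch);
    // A resource saved several times only needs to be indexed once.
    $ids = array_unique($ids);
    foreach ($ids as $id) {
        $indexOne($id);
    }
    return count($ids);
}
```

The save path then only pays for one short file append, and the expensive work happens in the periodic job.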
To decide, take 500 average documents and measure the total time
it takes to index them. Divide by 500; if the result is more
than 0.1s, use a log/queue.
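The measurement described above might look like this. The function names averageIndexTime() and needsQueue() are hypothetical, and $indexOne stands in for the real call to index() on each document.

```php
<?php
// Hypothetical helpers, for illustration only.

/** Average time in seconds that $indexOne takes per document. */
function averageIndexTime(array $docs, callable $indexOne): float
{
    if (count($docs) === 0) {
        throw new InvalidArgumentException('Need at least one document.');
    }
    $start = microtime(true);
    foreach ($docs as $doc) {
        $indexOne($doc);
    }
    // Total elapsed time divided by the number of documents.
    return (microtime(true) - $start) / count($docs);
}

/** True when the average is above the threshold (0.1s in the text above). */
function needsQueue(float $averageSeconds, float $threshold = 0.1): bool
{
    return $averageSeconds > $threshold;
}
```

With 500 representative documents in $docs, needsQueue(averageIndexTime($docs, $indexOne)) gives the answer to the question above.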
FIXME: Concurrency problem if the same document is indexed twice at the same time.
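Until this is fixed, one possible workaround on the caller's side is to serialize indexing per document with a lock file. The sketch below is an assumption, not something Pluf provides: indexExclusively(), the lock directory and the $indexOne callback are all hypothetical, and the lock only protects processes that share the same filesystem.

```php
<?php
// Hypothetical helper, for illustration only: run $indexOne($docId) while
// holding an exclusive lock file dedicated to that document.
function indexExclusively(string $lockDir, string $docId, callable $indexOne)
{
    // One lock file per document id; sha1() keeps the file name safe.
    $path = $lockDir . '/index-' . sha1($docId) . '.lock';
    $handle = fopen($path, 'c');
    if ($handle === false) {
        throw new RuntimeException("Cannot open lock file $path");
    }
    try {
        // Blocks until no other process holds the lock for this document.
        if (!flock($handle, LOCK_EX)) {
            throw new RuntimeException("Cannot lock $path");
        }
        return $indexOne($docId);
    } finally {
        flock($handle, LOCK_UN);
        fclose($handle);
    }
}
```

Running the indexing from a single periodic job that deduplicates its pending list avoids the problem in a simpler way, since no document is then indexed by two processes at once.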
Parameters
Returns
array
— Statistics.