Chapter 3: Tips, Tricks and Caveats

This chapter points out some practical issues when using LSE.

Re-initializing the index database:
If you want to re-initialize the indexer, simply remove all entries from the MySQL index:

system> mysql lse
mysql>  delete from words;
mysql>  delete from pages;
mysql>  quit;

Breaking up overly long index runs:
If your index runs take too long, consider breaking them into smaller parts. For example, say that your document root is /usr/local/apache/htdocs/ and that it contains two big directories, big1/ and big2/.

In that case, you can issue two indexing statements:

LSE-index dbname /usr/local/apache/htdocs/big1 /big1
LSE-index dbname /usr/local/apache/htdocs/big2 /big2

Using this approach you could schedule runs with cron for odd and even days, or whichever way you like.
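
As an illustration of the odd/even idea, here is a minimal wrapper sketch. It is not part of LSE: the script name lse-nightly.sh, the function pick_tree, the cron time and the use of the day of the month are all assumptions made for this example, and "dbname" is the same placeholder as above.

```shell
#!/bin/sh
# Sketch only. Hypothetical crontab entry (path and time are placeholders):
#   30 2 * * * /usr/local/bin/lse-nightly.sh run
DOCROOT=/usr/local/apache/htdocs

# Print "big1" for an odd day of the month and "big2" for an even one.
# $1 is the day as printed by `date +%d` (01..31).
pick_tree() {
    day=${1#0}          # drop one leading zero so 08 and 09 are not read as octal
    if [ $((day % 2)) -eq 1 ]; then
        echo big1
    else
        echo big2
    fi
}

# Only start the indexer when the script is called with the argument "run".
if [ "${1:-}" = "run" ]; then
    tree=$(pick_tree "$(date +%d)")
    LSE-index dbname "$DOCROOT/$tree" "/$tree"
fi
```

Note that with this scheme big1 is indexed on two consecutive nights whenever a month with an odd number of days rolls over (day 31, then day 1).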

Recovering from a broken LSE-index run:
If LSE-index crashes, the last indexed file may have been only partially processed. If you suspect that this is the case, follow these steps:

  1. In MySQL, determine the last indexed file. This is the entry in pages with the highest timestamp:

    select max(stamp) from pages;

  2. Using the result, obtain the ID of the page (replace MAXVALUE with the value from step 1):

    select id from pages where stamp = MAXVALUE;

  3. Using the ID, reset the stamp to zero so that the next run will re-index the file (replace PAGEID with the value from step 2):

    update pages set stamp = 0 where id = PAGEID;
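
The steps above can also be collapsed into a single statement. The following is a sketch and not part of LSE: it assumes a MySQL version that accepts ORDER BY and LIMIT in a single-table UPDATE, and the function name reset_last_page and the MYSQL variable are made up for this example.

```shell
#!/bin/sh
# Sketch only. MYSQL names the client program and can be overridden for a dry run.
MYSQL=${MYSQL:-mysql}

# Reset the stamp of the most recently indexed page in database $1.
# Only one row is touched; if several pages share the highest stamp,
# which of them is reset is up to MySQL.
reset_last_page() {
    "$MYSQL" "$1" -e 'UPDATE pages SET stamp = 0 ORDER BY stamp DESC LIMIT 1;'
}

# Usage (not run here): reset_last_page lse
```

Setting MYSQL=echo before calling the function prints the client arguments instead of executing them, which is a cheap way to check the statement first.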

Handling non-words and non-pages:
You might want the search engine to avoid finding certain non-words or non-pages. This is best done by removing the words or pages from the index database.

E.g., consider the following hypothetical situation. You have a directory /usr/local/apache/htdocs/secure/ which is protected by an authentication method of the webserver. The search engine should not return hits pointing into this directory tree, because the search results of any non-authenticated visitor would be polluted by unreachable document links. In this case, you would remove the pages in MySQL:

delete from pages where uri like '/secure%';

Or consider the following. You want to take out 'non-words' from the index dictionary, because these words are meaningless in searches:

mysql> # Step 1: determine the word ID for 'the'
mysql> select id from words where word = 'the';
+-----+
| id  |
+-----+
| 284 |
+-----+
1 row in set (0.34 sec)
mysql> # Step 2: kill the entry in the words list
mysql> delete from words where id = 284;
mysql> # Step 3: kill the entries in the pages hitlist
mysql> delete from hits where wordid = 284;
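
To repeat the steps above for several words at once, the statements can be generated by a small script. This is a sketch under assumptions: it uses the table and column names shown in this chapter (words.id, words.word, hits.wordid), it needs a MySQL version that supports subqueries, and the function name nonword_sql is hypothetical.

```shell
#!/bin/sh
# Sketch only: print SQL that removes each given non-word from hits and words.
nonword_sql() {
    for w in "$@"; do
        case "$w" in
            ''|*[!A-Za-z0-9]*)
                # Accept only letters and digits, so the word needs no SQL quoting.
                echo "skipping '$w'" >&2
                continue
                ;;
        esac
        # Remove the hits first, while the word ID can still be looked up.
        echo "DELETE FROM hits WHERE wordid IN (SELECT id FROM words WHERE word = '$w');"
        echo "DELETE FROM words WHERE word = '$w';"
    done
}

# Usage (not run here): nonword_sql the and of | mysql lse
```

Printing the statements instead of executing them makes it possible to review the output before piping it into the mysql client.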

Note that as of version 1.02, the script LSE-index supports a flag -n, with which non-words can be specified on the command line.

Shortcut search forms:
Imagine that you have a search form on your site, similar to the one shown in section 4.3. To create a 'shortcut' form, e.g. in the banner of your site, add a mini-form that supplies most search values as hidden fields. When users submit this form, they are taken to the larger form, which shows the first results and provides tunable searching:

<form method="post" action="search.php">
  <input type="text"   size="15"        name="words"><br>
  <input type="hidden" name="logical"   value="or">
  <input type="hidden" name="matchmode" value="matchexact">
  <input type="submit" value="search">
</form>

Please see section 4.3 to understand why the form variables have the shown names and what their meanings are.

Avoiding excessive system load during indexing:
To keep the system load down during indexing, try the flag -m of the indexer LSE-index. The flag works in a rudimentary way: LSE-index simply waits until the CPU consumption of all mysqld processes drops below a given percentage. Even so, with this flag you can start really long indexing jobs and let them run.