NAME

Penetrator - Text Indexing Engine


SYNOPSIS

 use Penetrator;
 # open a Penetrator object (using the DBM driver)
 $ph = new Penetrator("type" => "DBM",
                    "index_file" => "/var/index.db");
 # or
 # open a Penetrator object (using the DBI driver)
 $ph = new Penetrator("type" => "DBI",
                   "dbi_string" => $dbi_connection_string);
 # INDEXING
 # start indexing a document
 $ph->begin_doc_index($docname, $mtime);
 # feed the words
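 open(my $fh, '<', $docname) or die "Can't open $docname: $!";
 while(<$fh>)
 {
        # grep drops the empty field split() yields on leading separators
        $ph->doc_words(grep { length } split(/\W+/,$_));
 }
 close $fh;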
 # end indexing this document
 $ph->end_doc_index() or die "Doc couldn't be indexed";
 # SEARCHING
 # search all documents matching ALL the words
 $ph->search($word1, $word2);
 # fetch docnames
 while(defined(my $d = $ph->fetch()))
 {
        print "Words found in document '$d'\n";
 }
 # close connection
 $ph->close();


DESCRIPTION

Penetrator.pm is the engine of the Penetrator package. You can retrieve it or get more documentation from its home page at

 http://www.triptico.com/software/penetrator.html


FUNCTIONS AND METHODS

new

 $ph = new Penetrator("type"              => $driver,
                  [ "min_word_size"       => $size, ]
                  [ "zap"                 => $bool, ]
                  [ "normalize"           => \&function, ]
                  [ "disable_query_cache" => $bool, ]
                  [ %driver_specific_arguments ] );

Creates a new Penetrator object.

type
The object type references the driver to be used. It can be ``DBM'', to use a Berkeley DB plain index, ``GDBM'', for a GNU dbm plain index, ``SDBM'', for an SDBM plain index, ``DBI'', to use a relational database using the Perl DBI interface, or ``plain'', for a dummy driver that does not generate indexes, but searches every time. This argument is mandatory.

min_word_size
This is the minimum word size that will be taken into account when building the database. The default is 0, meaning all words will be indexed.

zap
When this argument is set to a nonzero value, the database is zapped, i.e., returned to an empty state. This can mean deleting files or creating tables in an RDBMS.

normalize
Sets the word normalization function for the database. See ADVANCED TOPICS.

disable_query_cache
By default, successful queries are stored in a query cache to speed up subsequent calls for the same information (useful, for example, when navigating the results of a query from a CGI). If this argument is set to a nonzero value, the query cache will not be used.

begin_doc_index

 $ph->begin_doc_index($docname, $mtime);

Starts indexing a document. $docname is the name of the document; though it is usually a file name, it can contain any unique textual reference. This is the key returned in search hits. $mtime is a numeric value containing the version of the document. This is usually the file modification time, but can be any other significant numeric value. If the document being indexed already exists, the $mtime parameter is tested, and the indexing is rejected if it's the same. Returns 1 if the document is accepted, 0 otherwise.

doc_words

 $ph->doc_words($word [, $word ...] );

Adds the word(s) to the document being indexed. The words cannot contain spaces or other separators (they are assumed to be the result of a split(/\W+/)).

end_doc_index

 $ph->end_doc_index()

Ends the indexing of the document. Depending on the driver, this can be the most time-consuming function. Its return value must be tested; if it's zero, the document was not indexed.
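
The three indexing calls above can be combined as in the following minimal sketch. The helper name index_text and its return strings are hypothetical (they are not part of Penetrator); the only assumption about $ph is that it is an open Penetrator object.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Penetrator): index a string of text
# under $docname, skipping the work when the stored version is current.
sub index_text {
    my ($ph, $docname, $mtime, $text) = @_;

    # 0 means the document was rejected (same $mtime already indexed).
    return 'skipped' unless $ph->begin_doc_index($docname, $mtime);

    # grep drops the empty leading field that split() yields when the
    # text starts with a non-word character.
    $ph->doc_words(grep { length } split /\W+/, $text);

    # A false return value means the document was not indexed.
    return $ph->end_doc_index() ? 'indexed' : 'failed';
}
```

A caller would typically retry or log the 'failed' case and silently ignore 'skipped'.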

search

 $ph->search($word [, $word... ]);

Searches the database for the documents matching ALL the words. Words can also be prefixed with a - or ~ to exclude documents containing them from the final result. It does not return a value; values must be retrieved using fetch().

fetch

 $docname=$ph->fetch();

Returns a matching document, one at a time. The returned $docname is the first parameter of begin_doc_index(). When no more hits are left, undef is returned instead and the search is finished.
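
As an illustration of search() and fetch() used together, here is a small sketch that collects every hit into a list. The helper find_docs and its calling convention are hypothetical; it only relies on the - prefix and the undef end marker described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Penetrator): return the names of all
# documents containing every word in @$include and none in @$exclude.
sub find_docs {
    my ($ph, $include, $exclude) = @_;

    # Excluded words are passed with a leading '-'.
    $ph->search(@$include, map { "-$_" } @$exclude);

    my @hits;
    # defined() keeps the loop going for a document literally named "0".
    while (defined(my $docname = $ph->fetch())) {
        push @hits, $docname;
    }
    return @hits;
}
```

For example, find_docs($ph, ['perl', 'index'], ['python']) would issue the query ('perl', 'index', '-python').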

unindex_doc

 $ph->unindex_doc($docname);

Deletes ('unindexes') a document from the database.

get_docs

 $ph->get_docs();

Starts the retrieval of all the documents stored in the database. Each entry will be fetched using the fetch_doc() method.

fetch_doc

 ($docname,$mtime)=$ph->fetch_doc();

Sequentially retrieves each $docname and its respective $mtime after a call to get_docs(). After the last one, undef is returned.

clear_cache

 $ph->clear_cache();

Clears the query cache, if one is available in the current driver. Each successful search is stored in the query cache to speed up repeated queries, but the cache loses its value when the database has changed, so this method must be called after a document (or group of documents) is (re)indexed.

close

 $ph->close();

Closes the Penetrator object. The object cannot be used afterwards.

index_file

 $ret = $ph->index_file($filename [,$docname]);

Indexes a complete file. $filename must be accessible and readable. Returns 0 if the file was correctly indexed, -1 if the file could not be read, -2 if this version of the file is already indexed (by using the file's mtime; see stat()) or -3 if the indexing could not be done (probably due to another process concurrently indexing the same file). The final document name will be $docname if set, or $filename otherwise.
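
Since index_file() reports four distinct outcomes, a lookup table can make the calling code easier to read. The sketch below is only an illustration: the table, the helper name index_file_status and the message wording are hypothetical, while the numeric codes are the ones listed above.

```perl
use strict;
use warnings;

# Return codes of index_file(), as documented above. The messages are
# illustrative wording, not text produced by Penetrator.
my %INDEX_FILE_STATUS = (
     0 => 'indexed',
    -1 => 'file could not be read',
    -2 => 'this version is already indexed',
    -3 => 'indexing failed (possibly a concurrent indexer)',
);

# Hypothetical helper: translate a return code into a message.
sub index_file_status {
    my ($ret) = @_;
    return $INDEX_FILE_STATUS{$ret} // "unknown return value $ret";
}
```

It could be used as: my $ret = $ph->index_file($filename); warn "$filename: " . index_file_status($ret) . "\n" if $ret;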

test_file

 $ret = $ph->test_file($filename, $mtime [,$docname]);

This function takes a $filename and $mtime (typically the values returned by fetch_doc()) and unindexes the document if: a) the file no longer exists or b) the file is newer than the indexed version. $docname can be specified if it's different from $filename. Returns 1 if the file has been unindexed, or 0 if the indexed version is still fresh.
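
A possible maintenance pass built from get_docs(), fetch_doc(), test_file() and clear_cache() is sketched below. The helper purge_stale is hypothetical, and the sketch assumes that documents were indexed under their file names. It collects the whole list before calling test_file(), on the unverified assumption that unindexing during an active get_docs() walk may not be safe with every driver.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Penetrator): unindex every document
# whose file has disappeared or changed; returns how many were removed.
sub purge_stale {
    my ($ph) = @_;

    my @docs;
    $ph->get_docs();
    while (my ($docname, $mtime) = $ph->fetch_doc()) {
        # Covers an end marker returned as a single undef as well as
        # an empty list.
        last unless defined $docname;
        push @docs, [ $docname, $mtime ];
    }

    my $removed = 0;
    for my $doc (@docs) {
        # test_file() returns 1 when it has unindexed the document.
        $removed++ if $ph->test_file(@$doc);
    }

    # Cached query results are stale once the database has changed.
    $ph->clear_cache() if $removed;
    return $removed;
}
```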


DRIVER SPECIFIC INFORMATION

DBM Driver

The DBM driver uses Berkeley_DB style index files. The value for the 'type' argument to new is 'DBM'.

The driver specific parameters for a new Penetrator object are:

index_file
This parameter is mandatory, and must point to a readable and writable file where the index will be stored.

cache_dir
Contains the name of the directory where the query results will be stored. If none is specified, the query cache won't be used (regardless of the value of the 'disable_query_cache' argument).

GDBM Driver

The GDBM driver is very similar to the DBM driver, but uses the GNU dbm library instead of the Berkeley_DB one. The value for the 'type' argument to new is 'GDBM'.

In every other respect it is exactly the same as the DBM driver.

SDBM Driver

The SDBM driver is very similar to the DBM driver, but uses the Perl internal SDBM library instead of the Berkeley_DB one. This library is reportedly slower and less reliable than the Berkeley_DB or the GDBM ones, so use it only if those are not available (SDBM is included with Perl, so it's always there). The value for the 'type' argument to new is 'SDBM'.

Instead of creating just the specified index_file, two files will be created, with .dir and .pag appended to its name.

In every other respect it is exactly the same as the DBM driver.

DBI Driver

The DBI driver uses the Perl DataBase Interface to relational databases. It has been tested with MySQL and PostgreSQL, but hopefully any other DBI-supported RDBMS could be used.

The driver specific parameters for a new Penetrator object are:

dbi_string
This argument is mandatory (unless you set dbh, see below) and must store the DBI connection string to your database. Consult your DBI driver for details.

dbi_user
Database user, if necessary.

dbi_passwd
Database user's password, if necessary.

dbh
If this parameter is set, it must point to an already open DBI handle. If used, dbi_string, dbi_user and dbi_passwd will be ignored. Use this if you already have an open database connection to your Penetrator database.

transaction
If set to a nonzero value, the whole session will be protected by a transaction.

insert_retries
Number of retries before giving up indexing a document (20 by default). When more than one process is indexing documents on the same database, the codes assigned to their words can collide and the insertion can fail; the insertion is retried up to insert_retries times before end_doc_index() fails and returns an error.

delay
Time to wait before retrying the storage of the words, in seconds. 1 second by default.

key_code_type
SQL type for the codes that are primary keys. This must be the native auto-incrementing numeric type of the database. On PostgreSQL it is 'serial' (the default) and on MySQL 'integer auto_increment'.

code_type
SQL type for the codes. This must be a numeric type compatible with key_code_type. On PostgreSQL and MySQL it is 'integer' (also the default value).

doc_type
SQL type for storing document names. Must be string and indexable. It's 'text' on PostgreSQL (by default) and 'varchar(100)' (or similar) on MySQL.

word_type
SQL type for storing document words. Must be string and indexable. It's 'text' on PostgreSQL (by default) and 'varchar(100)' (or similar) on MySQL.

query_type
SQL type for storing queries in the query cache. Must be string and indexable. It's 'text' on PostgreSQL (by default) and 'varchar(100)' (or similar) on MySQL.

drop_commands
Reference to a list of additional drop commands to 'zap' a database. If you use PostgreSQL and need to re-zap an already existing Penetrator database, you must set the following values to delete the automatically created sequences that are not deleted when the corresponding tables are dropped:
 "drop_commands" => [ "drop sequence docs_coddoc_seq" ,
                      "drop sequence words_codword_seq" ],

create_commands
Reference to a list of additional create commands to run when a database is 'zapped'. Use it if you want to create additional indexes, for example.
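
Because the type defaults above are the PostgreSQL ones, a MySQL setup has to override several of them. The following argument list is a sketch only: the DSN, user name and password are hypothetical placeholders, and the column types are the MySQL values suggested in the parameter descriptions.

```perl
use strict;
use warnings;

# Constructor arguments for a MySQL-backed index. The DSN, user and
# password are placeholders; adjust them to your own database.
my %mysql_args = (
    type          => 'DBI',
    dbi_string    => 'DBI:mysql:database=penetrator',  # hypothetical DSN
    dbi_user      => 'indexer',                        # hypothetical user
    dbi_passwd    => 'secret',                         # hypothetical password
    key_code_type => 'integer auto_increment',
    doc_type      => 'varchar(100)',
    word_type     => 'varchar(100)',
    query_type    => 'varchar(100)',
);

# With the module installed, the object would then be created with:
#   my $ph = new Penetrator(%mysql_args);
```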

plain Driver

The plain driver is a dummy driver that doesn't generate any index, but does a sequential search every time (like a recursive grep). As expected, it is slow, but no big files are created, so it can be used in environments where disk space is scarce, such as web hosting services.

The driver specific parameters for a new Penetrator object are:

index_file
This parameter is mandatory and must point to a filename where the 'indexed' files are stored. This file is populated by begin_doc_index() and used by search() to know which files must be scanned.


ADVANCED TOPICS

Word normalization

Penetrator normalizes each word by converting it to lowercase, removing non-ASCII characters, deleting HTML tags and replacing several Latin-language-specific characters (mainly accented vowels) with their ASCII equivalents. This normalization is done when storing the words and when performing the search. The default behaviour can be changed by providing your own normalization function. You can do this by supplying it to the object initialization function with the 'normalize' option, as in

 $ph = new Penetrator("type" => $driver,
        "normalize" => \&my_normalization_function,
        [ ... other arguments ]
        );

This function must accept one scalar parameter (the word), and return the same value after normalization. This is probably the simplest example:

 sub my_normalization_function
 # only convert to lowercase
 {
        my ($word) = @_;
        return(lc($word));
 }

But you probably want a more sophisticated one, as this one lets punctuation characters through; they become part of the word, which consequently will not be found when searched for without them.
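
A slightly stricter function might look like the sketch below. It is not Penetrator's default normalization: the name strict_normalize is hypothetical, and it simply discards accented letters instead of mapping them to ASCII equivalents.

```perl
use strict;
use warnings;

# Hypothetical normalization function: lowercase the word, delete HTML
# tags, then keep only ASCII letters and digits.
sub strict_normalize {
    my ($word) = @_;

    $word = lc $word;
    $word =~ s/<[^>]*>//g;      # delete HTML tags
    $word =~ s/[^a-z0-9]//g;    # delete punctuation and non-ASCII characters
    return $word;
}
```

It would be passed to the constructor as "normalize" => \&strict_normalize. Note that a word made only of punctuation normalizes to the empty string; how the engine treats empty words is not covered here.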


AUTHOR

Angel Ortega angel@triptico.com