Penetrator - Text Indexing Engine
use Penetrator;
# open a Penetrator object (using the DBM driver)
$ph = new Penetrator("type" => "DBM",
"index_file" => "/var/index.db");
# or
# open a Penetrator object (using the DBI driver)
$ph = new Penetrator("type" => "DBI",
"dbi_string" => $dbi_connection_string);
# INDEXING
# start indexing a document
$ph->begin_doc_index($docname, $mtime);
# feed the words
open(F, $docname) or die "Can't open $docname: $!";
while(<F>)
{
$ph->doc_words(split(/\W+/,$_));
}
close F;
# end indexing this document
$ph->end_doc_index() or die "Doc couldn't be indexed";
# SEARCHING
# search all documents matching ALL the words
$ph->search($word1, $word2);
# fetch docnames
while(my $d = $ph->fetch())
{
print "Words found in document '$d'\n";
}
# close connection
$ph->close();
Penetrator.pm is the engine of the Penetrator package. You can retrieve it or get more documentation from its home page at
http://www.triptico.com/software/penetrator.html
$ph = new Penetrator("type" => $driver,
[ "min_word_size" => $size, ]
[ "zap" => $bool, ]
[ "normalize" => \&function, ]
[ "disable_query_cache" => $bool, ]
[ %driver_specific_arguments ] );
Creates a new Penetrator object.
$ph->begin_doc_index($docname, $mtime);
Starts indexing a document. $docname is the name of the document; though it is usually a file name, it can contain any textual and unique reference. This is the key returned in search hits. $mtime is a numeric value containing the version of the document. It usually contains the file modification time, but can contain any other significant numeric value. If the document being indexed already exists, the $mtime parameter is tested, and the indexing is rejected if it's the same. Returns 1 if the document is accepted, 0 otherwise.
$ph->doc_words($word [, $word ...] );
Add the word(s) to the document being indexed. The words cannot
contain spaces nor other separators (they are assumed to be the
result of a split(/\W+/)).
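As a minimal sketch of preparing input for doc_words(), the helper below splits a line on non-word characters and drops the empty strings that split(/\W+/) yields when a line starts with a separator. The name line_to_words is hypothetical and not part of Penetrator.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Penetrator: turns a line of text
# into a list of words with no spaces or separators in them.
sub line_to_words
{
    my ($line) = @_;

    # split on runs of non-word characters; a leading separator
    # produces an empty first field, so filter those out
    return grep { length($_) } split(/\W+/, $line);
}

my @words = line_to_words("  Hello, world: text-indexing!\n");
print join(" ", @words), "\n";    # prints "Hello world text indexing"
```

The resulting list could then be passed as is to $ph->doc_words(@words).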
$ph->end_doc_index()
Ends the indexing of the document. Depending on the driver, this can be the most time-consuming function. Its return value must be tested; if it's zero, the document was not indexed.
$ph->search($word [, $word... ]);
Searches the database for the documents matching ALL the words. Words can also be prefixed with a - or ~ to exclude documents containing them from the final result. It does not return a value; the matching documents must be retrieved using fetch().
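The query semantics described above (all plain words required, words prefixed with - or ~ excluded) can be illustrated with a toy in-memory index. This is only a sketch under assumed data; the %index hash and the toy_search function are hypothetical and do not reflect how any Penetrator driver stores or searches its data.

```perl
use strict;
use warnings;

# Toy in-memory index (word => { docname => 1 }); hypothetical data,
# used only to illustrate the query semantics.
my %index = (
    perl   => { "a.txt" => 1, "b.txt" => 1 },
    index  => { "a.txt" => 1, "b.txt" => 1, "c.txt" => 1 },
    python => { "b.txt" => 1 },
);

# Hypothetical function: returns the sorted documents that contain
# every plain word and none of the words prefixed with - or ~.
sub toy_search
{
    my ($index, @terms) = @_;
    my (@must, @must_not);

    foreach my $t (@terms) {
        if ($t =~ /^[-~](.+)$/) { push(@must_not, $1); }
        else                    { push(@must, $t); }
    }

    return () unless @must;

    # start with the documents containing the first required word ...
    my $first = shift(@must);
    my %hits = %{ $index->{$first} || {} };

    # ... keep only those also containing the remaining ones ...
    foreach my $w (@must) {
        my $docs = $index->{$w} || {};
        delete @hits{ grep { !$docs->{$_} } keys(%hits) };
    }

    # ... and drop those containing an excluded word
    foreach my $w (@must_not) {
        delete @hits{ keys(%{ $index->{$w} || {} }) };
    }

    my @result = sort { $a cmp $b } keys(%hits);
    return @result;
}

print join(", ", toy_search(\%index, "perl", "index", "-python")), "\n";
# prints "a.txt"
```

With the real engine, the equivalent call would be $ph->search("perl", "index", "-python") followed by a fetch() loop.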
$docname=$ph->fetch();
Returns a matching document, one at a time. The returned $docname is the first parameter of begin_doc_index(). When no more hits are left, undef is returned instead and the search is finished.
$ph->unindex_doc($docname);
Deletes ('unindexes') a document from the database.
$ph->get_docs();
Starts the retrieval of all the documents stored in the database. Each entry will be fetched using the fetch_doc() method.
($docname,$mtime)=$ph->fetch_doc();
Sequentially retrieves a $docname and its respective $mtime, from the retrieval started by a call to get_docs(). After the last one, undef is returned.
$ph->clear_cache();
Clears the query cache, if one is available in the current driver. Each successful search is stored in the query cache to speed up repeated queries, but the cache loses its value when the database has changed, so this method must be called after a document (or group of documents) is (re)indexed.
$ph->close();
Closes the Penetrator object. It cannot be used any further after this call.
$ret = $ph->index_file($filename [,$docname]);
Indexes a complete file. $filename must be accessible and readable. Returns 0 if the file was correctly indexed, -1 if the file could not be read, -2 if this version of the file is already indexed (by using the file's mtime; see stat()) or -3 if the indexing could not be done (probably due to another process concurrently indexing the same file). The final document name will be $docname if set, or $filename otherwise.
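Since index_file() reports four different outcomes, a caller may want to turn the return code into a readable message. The following is a minimal sketch; the %index_file_status table and the index_file_message helper are hypothetical names, and only the codes and their meanings come from the description above.

```perl
use strict;
use warnings;

# Return codes of index_file(), as documented above
my %index_file_status = (
     0 => "file indexed",
    -1 => "file could not be read",
    -2 => "this version of the file is already indexed",
    -3 => "indexing could not be done",
);

# Hypothetical helper: maps a return code to a message
sub index_file_message
{
    my ($ret) = @_;

    return exists($index_file_status{$ret})
        ? $index_file_status{$ret}
        : "unknown return code $ret";
}

print index_file_message(-2), "\n";
# prints "this version of the file is already indexed"
```

In a real program the argument would be the value returned by $ph->index_file($filename).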
$ret = $ph->test_file($filename, $mtime [,$docname]);
This function takes the $docname and $mtime returned by fetch_doc() and unindexes the document if: a) the file no longer exists or b) the file is newer than the indexed version. $docname can be specified if it's different from $filename. Returns 1 if the file has been unindexed, or 0 if the indexed version is still fresh.
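The two conditions above can be sketched in plain Perl as follows. This is an assumption about how such a freshness check might look, not Penetrator's actual code; is_stale is a hypothetical name, and it only makes the decision, without unindexing anything.

```perl
use strict;
use warnings;

# Hypothetical helper, not Penetrator's code: returns 1 if the
# indexed copy of $filename should be discarded, 0 if it is fresh.
sub is_stale
{
    my ($filename, $indexed_mtime) = @_;

    # a) the file no longer exists
    return 1 unless -e $filename;

    # b) the file on disk is newer than the indexed version;
    # element 9 of the list returned by stat() is the modification time
    my $mtime = (stat($filename))[9];

    return $mtime > $indexed_mtime ? 1 : 0;
}
```

A maintenance script could call get_docs(), loop over fetch_doc() and pass each pair to test_file(), which performs this kind of check and the unindexing in one step.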
The DBM driver uses Berkeley_DB style index files. The value for the 'type' argument to new is 'DBM'.
The driver specific parameters for a new Penetrator object are:
The GDBM driver is very similar to the DBM driver, but uses the GNU dbm library instead of the Berkeley_DB one. The value for the 'type' argument to new is 'GDBM'.
In every other aspect it is exactly the same as the DBM driver.
The SDBM driver is very similar to the DBM driver, but uses the Perl internal SDBM library instead of the Berkeley_DB one. This library is reportedly slower and less reliable than the Berkeley_DB or the GDBM ones, so use it only if those are not available (SDBM is included with Perl, so it's always there). The value for the 'type' argument to new is 'SDBM'.
Instead of creating just the specified index_file, two files will be created, with .dir and .pag appended to its name.
In every other aspect it is exactly the same as the DBM driver.
The DBI driver uses the Perl DataBase Interface to relational databases. It has been tested with MySQL and PostgreSQL, but hopefully any other DBI-supported RDBMS could be used.
The driver specific parameters for a new Penetrator object are:
"drop_commands" => [ "drop sequence docs_coddoc_seq" ,
"drop sequence words_codword_seq" ],
The plain driver is a dummy driver that doesn't generate any index, but does sequential searches every time (like a recursive grep). As expected, it is slow, but no big files are created, so it can be used in environments where disk space is scarce, such as web hosting services.
The driver specific parameters for a new Penetrator object are:
Penetrator normalizes each word by converting it to lowercase, removing non-ASCII characters, deleting HTML tags and replacing several Latin-language-specific characters (mainly accented vowels) with their ASCII equivalents. This normalization is done when storing the words and when performing the search. The default behaviour can be changed by providing your own normalization function. You can do this by supplying it to the object initialization function with the 'normalize' option, as in
$ph = new Penetrator("type" => $driver,
"normalize" => \&my_normalization_function,
[ ... other arguments ]
);
This function must accept one scalar parameter (the word), and return the same value after normalization. This is probably the simplest example:
sub my_normalization_function
# only convert to lowercase
{
my ($word) = @_;
return(lc($word));
}
But you probably want a more sophisticated one, as this one lets punctuation characters through; they become part of the word, which will consequently not be found when searched without them.
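A slightly more complete function might also strip HTML tags and punctuation. The following is only a sketch of that idea, not the engine's default normalization; in particular, it simply drops accented characters instead of mapping them to their ASCII equivalents.

```perl
use strict;
use warnings;

sub my_normalization_function
# lowercase, then strip HTML tags and non-word characters
{
    my ($word) = @_;

    # convert to lowercase
    $word = lc($word);

    # delete anything that looks like an HTML tag
    $word =~ s/<[^>]*>//g;

    # delete everything except ASCII letters, digits and underscores
    $word =~ s/[^a-z0-9_]//g;

    return($word);
}

print my_normalization_function("<b>Hello,</b>"), "\n";    # prints "hello"
```

Because the same function is applied when indexing and when searching, "Hello," and "hello" end up as the same stored word.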
Angel Ortega angel@triptico.com