Text-Median version 0.01
========================
NAME
Text::Median - Perl extension for determining the set median of a set of
strings
SYNOPSIS
use Text::Median;
my $medianobj = Text::Median->new(module => "StringDistanceModule", method => "distancemethod");
$medianobj->add_data(\@data);
print $medianobj->find_median();
DESCRIPTION
The median of a set of strings is defined as the string that minimizes
the sum of the distances between that string and every string in the
set. The true median is not necessarily a member of the set. Finding the
median of a set of strings has been shown to be NP-complete; see
"Topology of Strings: Median String is NP-Complete", C. de la Higuera
and F. Casacuberta, Theoretical Computer Science, Vol. 230, Issue 1-2,
January 2000.
This module is concerned with calculating the set median, which is the
member of the set that minimizes the sum of distances to the other
members. There are many string distance algorithms, including the string
edit distance, the keyboard distance, and the algorithm used by
String::Similarity. This module lets the programmer choose which
distance algorithm to use. The method used to calculate the distance
must take two arguments, the two strings being compared. The module
assumes that you supply an appropriate method and does not check that
the method exists.
The algorithm used in this module is O(N**2) in the number of strings,
N.
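The definition of the set median and its quadratic cost can be sketched
in plain Perl. This is a minimal illustration, not the module's own
implementation: the names edit_distance and set_median are hypothetical,
and a simple Levenshtein distance stands in for whichever distance
method you would normally supply.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Plain Levenshtein edit distance (illustrative stand-in; any
# two-argument distance function could be substituted).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,          # deletion
                           $cur[$j - 1] + 1,       # insertion
                           $prev[$j - 1] + $cost); # substitution
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return the member of @$strings whose summed distance to all other
# members is smallest. N strings cost N*(N-1) distance calls here.
sub set_median {
    my ($strings, $dist) = @_;
    my ($best, $best_sum);
    for my $i (0 .. $#$strings) {
        my $sum = 0;
        for my $j (0 .. $#$strings) {
            next if $i == $j;
            $sum += $dist->($strings->[$i], $strings->[$j]);
        }
        ($best, $best_sum) = ($strings->[$i], $sum)
            if !defined($best_sum) || $sum < $best_sum;
    }
    return $best;
}

my @words = qw(cat cart car bat);
print set_median(\@words, \&edit_distance), "\n";
```

For this word list the summed distances are 3 for "cat", 4 for "cart",
4 for "car", and 5 for "bat", so the sketch prints "cat".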
Methods
new(module=>'module name', method=>'method', max=>0);
Creates a Text::Median object. The module and method arguments are
required. The module argument names the distance module that the
Text::Median object will use, and the method argument names the
particular method within that module that calculates the distance. The
module must be a valid module.
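Because the method name is not checked, it can be worth confirming on
the caller's side that the module really provides it before building
the object. The following is a minimal sketch of such a check using
Perl's built-in can method. My::Distance, positional, and
resolve_distance are hypothetical names invented for the illustration,
and the lookup shown is an assumption about one reasonable approach, not
a description of how Text::Median resolves the method internally.

```perl
use strict;
use warnings;

# Hypothetical distance module, defined inline so the sketch is
# self-contained. A real one would live in its own file and would need
# to be loaded before the check below.
{
    package My::Distance;

    # Counts differing positions, plus the difference in length.
    sub positional {
        my ($s, $t) = @_;
        ($s, $t) = ($t, $s) if length($s) > length($t);
        my $d = length($t) - length($s);
        for my $i (0 .. length($s) - 1) {
            $d++ if substr($s, $i, 1) ne substr($t, $i, 1);
        }
        return $d;
    }
}

# Look up a two-argument distance function by module and method name.
# Returns a code reference, or dies if the method is missing.
sub resolve_distance {
    my ($module, $method) = @_;
    my $code = $module->can($method)
        or die "$module has no method '$method'\n";
    return $code;
}

my $dist = resolve_distance('My::Distance', 'positional');
print $dist->('karolin', 'kathrin'), "\n";
```

The two sample strings differ in three positions, so the sketch prints
3. Asking for a method that does not exist dies with a clear message
instead of failing later inside the median calculation.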
The max argument changes how the median is selected. Most string
distance modules return a larger number for a larger distance. However,
the String::Similarity module (and potentially other modules) calculates
the similarity of two strings, and the higher the result, the more
similar the strings are. In that case the set median is the string with
the largest sum of similarities rather than the string with the smallest
sum of distances. If you are going to use String::Similarity (or a
similar module), you must use the max argument in order to derive the
set median.
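The effect of choosing between the smallest and the largest sum can be
seen in a small sketch. The names prefix_similarity and pick_median are
hypothetical, and the toy similarity function is only a stand-in for a
similarity-style method; this is not the module's own code.

```perl
use strict;
use warnings;

# Toy similarity in the range 0 to 1: number of shared leading
# characters divided by the longer length. Higher means more alike.
sub prefix_similarity {
    my ($s, $t) = @_;
    my $longest = length($s) > length($t) ? length($s) : length($t);
    return 1 if $longest == 0;
    my $n = 0;
    $n++ while $n < length($s) && $n < length($t)
            && substr($s, $n, 1) eq substr($t, $n, 1);
    return $n / $longest;
}

# Pick the member with the best total score against all other members.
# With $max false the smallest total wins (scores are distances); with
# $max true the largest total wins (scores are similarities).
sub pick_median {
    my ($strings, $score, $max) = @_;
    my ($best, $best_total);
    for my $i (0 .. $#$strings) {
        my $total = 0;
        for my $j (0 .. $#$strings) {
            next if $i == $j;
            $total += $score->($strings->[$i], $strings->[$j]);
        }
        my $better = !defined($best_total) ? 1
                   : $max                  ? $total > $best_total
                   :                         $total < $best_total;
        ($best, $best_total) = ($strings->[$i], $total) if $better;
    }
    return $best;
}

my @words = qw(test tester team xyz);
print pick_median(\@words, \&prefix_similarity, 1), "\n";   # largest total
print pick_median(\@words, \&prefix_similarity, 0), "\n";   # smallest total
```

With a similarity score, maximizing selects "test", the string most like
the others, while minimizing selects "xyz", the outlier. That is why a
similarity-based method needs the max behaviour.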
add_data(@data)
Adds a set of data to the object. If the object already holds a set of
data, the new data is appended to the existing data.
find_median()
Determines the set median of the given set of strings and returns it.
This is where the main calculation occurs, so it may take some time
depending on the size of the data set. Note that the distance matrix
required for the calculation is held in memory, so if additional data is
added to the set later, the next calculation is faster because the
distances already computed are still available. Because the matrix is
held in memory, it can also have a large memory footprint.
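The trade-off between speed and memory can be illustrated with a small
sketch of a cached distance matrix. It assumes that only the missing
entries are computed on each pass, which is one plausible reading of the
behaviour described above rather than a confirmed detail of the module.
All names here (length_distance, add_strings, median_with_cache) are
hypothetical.

```perl
use strict;
use warnings;

my $calls = 0;    # how many times the distance function actually runs

# Cheap illustrative distance: absolute difference in length.
sub length_distance {
    my ($s, $t) = @_;
    $calls++;
    return abs(length($s) - length($t));
}

my @data;      # strings added so far
my %matrix;    # $matrix{$i}{$j} = distance between items $i and $j

# Append strings to the existing data set.
sub add_strings { push @data, @_ }

# Fill in only the missing matrix entries, then return the member with
# the smallest row sum.
sub median_with_cache {
    my ($best, $best_sum);
    for my $i (0 .. $#data) {
        my $sum = 0;
        for my $j (0 .. $#data) {
            next if $i == $j;
            $matrix{$i}{$j} //= length_distance($data[$i], $data[$j]);
            $sum += $matrix{$i}{$j};
        }
        ($best, $best_sum) = ($data[$i], $sum)
            if !defined($best_sum) || $sum < $best_sum;
    }
    return $best;
}

add_strings(qw(a bbb cc dddd));
median_with_cache();
my $first_pass = $calls;                 # every ordered pair of 4 items

add_strings('ee');
median_with_cache();
my $second_pass = $calls - $first_pass;  # only pairs involving the new item

print "first: $first_pass, second: $second_pass\n";
```

The first pass makes 12 distance calls and the second only 8, compared
with 20 if everything were recomputed for five strings. The cost is that
the matrix keeps one entry per ordered pair, so memory grows with the
square of the number of strings.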
EXPORT
None by default.
SEE ALSO
Any Perl modules relating to string distance, including modules for the
Levenshtein distance, String::Similarity, and String::KeyboardDistance.
AUTHOR
Leigh Metcalf
COPYRIGHT AND LICENSE
Copyright (C) 2009 by Leigh Metcalf
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself, either Perl version 5.8.9 or, at
your option, any later version of Perl 5 you may have available.
INSTALLATION
To install this module type the following:
perl Makefile.PL
make
make test
make install
DEPENDENCIES
This module requires these other modules and libraries:
Module::Runtime
Test::Warn