stringsim {stringdist} | R Documentation |
stringsim computes pairwise string similarities between elements of character vectors a and b, where the vector with fewer elements is recycled.
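The recycling behaviour can be illustrated with a short sketch. It assumes the stringdist package is installed; the variable names candidates and sims are illustrative only.

```r
library(stringdist)

# Three candidate strings compared against one reference string.
# The length-1 vector is recycled, so one similarity per candidate is returned.
candidates <- c("foo", "bar", "boo")
sims <- stringsim(candidates, "foo")

print(length(sims))
print(sims)
```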
stringsim(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, q = 1, ...)
a |
R object (target); will be converted by |
b |
R object (source); will be converted by |
method |
Method for distance calculation. The default is "osa". |
useBytes |
Perform byte-wise comparison, see |
q |
Size of the q-gram; must be nonnegative. Only applies to the q-gram based methods. |
... |
Additional arguments are passed on to stringdist. |
The similarity is calculated by first calculating the distance using stringdist, dividing the distance by the maximum possible distance, and subtracting the result from 1.
This results in a score between 0 and 1, with 1
corresponding to complete similarity and 0 to complete dissimilarity.
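The calculation described above can be reproduced by hand. The following is a minimal sketch, assuming the stringdist package is installed and assuming that for method "lv" the maximum possible distance is the length of the longer string. The variable names are illustrative only.

```r
library(stringdist)

a <- "kitten"
b <- "sitting"

# Step 1: distance as computed by stringdist (Levenshtein here)
d <- stringdist(a, b, method = "lv")

# Step 2: maximum possible distance.
# Assumption: for "lv" this is the length of the longer string.
max_d <- max(nchar(a), nchar(b))

# Step 3: subtract the normalised distance from 1
sim_manual <- 1 - d / max_d

# Under the assumption above, this should agree with stringsim
sim_pkg <- stringsim(a, b, method = "lv")
print(c(manual = sim_manual, stringsim = sim_pkg))
```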
Note that complete similarity only means equality for distances satisfying
the identity property. This is not the case e.g. for q-gram based distances
(for example if q=1, anagrams are completely similar).
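The anagram case can be checked directly. This is a small sketch, assuming the stringdist package is installed; the example words are arbitrary.

```r
library(stringdist)

# "listen" and "silent" contain the same letters, so their 1-gram counts match
sim_anagram <- stringsim("listen", "silent", method = "qgram", q = 1)

# The default method ("osa") is sensitive to character order,
# so it distinguishes the two strings
sim_osa <- stringsim("listen", "silent")

print(c(qgram = sim_anagram, osa = sim_osa))
```

Here the q-gram similarity reports complete similarity even though the strings differ, while an edit-based method does not.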
For distances where weights can be specified, the maximum distance
is currently computed by assuming that all weights are equal to 1.
Returns a vector with similarities, which are values between 0 and 1 where 1 corresponds to perfect similarity (distance 0) and 0 to complete dissimilarity. NA is returned when stringdist returns NA. Distances equal to Inf are truncated to a similarity of 0.
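The two special cases can be sketched as follows. This assumes the stringdist package is installed and assumes that the Hamming distance between strings of unequal length is reported as Inf by stringdist.

```r
library(stringdist)

# A missing input: stringdist returns NA, so stringsim returns NA as well
sim_na <- stringsim(NA_character_, "abc")

# Assumption: Hamming distance is Inf for strings of unequal length,
# which stringsim truncates to a similarity of 0
sim_inf <- stringsim("abc", "abcd", method = "hamming")

print(c(missing = sim_na, unequal_length = sim_inf))
```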
# Calculate the similarity using the default method of optimal string alignment
stringsim("ca", "abc")

# Calculate the similarity using the Jaro-Winkler method
# The p argument is passed on to stringdist
stringsim('MARTHA', 'MATHRA', method = 'jw', p = 0.1)