Hierarchical Fuzzy matching strategy for address matching
I am building an address matching module in R. I would like to match a list of input addresses, inAddress, against a database of all addresses, dbAddress.
Let's say each address contains a street number, street name, postal code, and city to be matched. There are certain matching rules I would like to consider, for example:
postal code should be an exact match
street number should be an exact match, unless not found, then
consider fuzzy matching
Do you have any advice on the strategy and how to build it effectively?
Here are several of my thoughts so far:
I am concerned this will hurt performance badly. Also, is there a way to speed up matching multiple addresses at the same time? Perhaps join on postal code first to avoid a full search each time? Parallelism?
Any advice would be welcome. Thank you
@Jeffrey: what would be your search order? The street number does not seem to narrow the search results as a first step, whereas the zipcode does. A partial street name will miss Ave, Street, Boulevard, etc. How would you clean/match the street name?
– Kenny
Oct 30 '17 at 11:52
Can you contact me at support@smartystreets.com - I'd be happy to discuss this further but it would be helpful to get on a phone call, I think. I would recommend house number and zipcode first, followed by one, two, or three characters of the street name. Where are you obtaining a master address list to index?
– Jeffrey
Oct 30 '17 at 17:43
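The search order suggested in the comment above (exact house number and zipcode first, then the first few characters of the street name) could be sketched roughly as follows. The data frame `db`, its column names (`number`, `zip`, `street`), and the helper `find_candidates` are hypothetical and only serve as an illustration.

```r
# Toy reference table; the column names are assumptions for this sketch.
db <- data.frame(
  number = c("12", "12", "12", "7"),
  zip    = c("84604", "84604", "84606", "84604"),
  street = c("Main St", "Maple Ave", "Main St", "Main St"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact match on number and zip, then a street-name prefix.
find_candidates <- function(number, zip, street, db, n_chars = 3) {
  # Step 1: exact match on house number and postal code
  hits <- db[db$number == number & db$zip == zip, , drop = FALSE]
  # Step 2: compare only the first n_chars characters of the street name,
  # so differences further right ("Street" vs "St") do not matter
  prefix <- tolower(substr(street, 1, n_chars))
  hits[startsWith(tolower(hits$street), prefix), , drop = FALSE]
}

res3 <- find_candidates("12", "84604", "Main Street", db)               # prefix "mai"
res2 <- find_candidates("12", "84604", "Main Street", db, n_chars = 2)  # prefix "ma"
```

With three characters only "Main St" at number 12 in 84604 remains; with two characters "Maple Ave" also survives, so the prefix length trades recall against ambiguity.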
2 Answers
Levenshtein distance is a must for catching simple spelling mistakes. Finding the right tolerance is important: a similarity threshold below 0.8 would return too many false positives.
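As a minimal sketch of such a threshold, the edit distance from base R's `adist` can be scaled to a similarity between 0 and 1. The helper `lev_similarity` and the sample street names are made up for illustration, and dividing by the longer string's length is one normalisation among several.

```r
# Hypothetical helper: normalised Levenshtein similarity in [0, 1],
# where 1 means the strings are identical (ignoring case).
lev_similarity <- function(a, b) {
  a <- tolower(a)
  b <- tolower(b)
  d <- utils::adist(a, b)                     # edit distances, length(a) x length(b)
  longest <- outer(nchar(a), nchar(b), pmax)  # length of the longer string per pair
  sim <- 1 - d / longest
  sim[longest == 0] <- 1                      # two empty strings count as identical
  sim
}

candidates <- c("Main Street", "Maine Street", "Martin Street", "Station Road")
sim <- lev_similarity("Mian Street", candidates)

# Keep only candidates at or above the 0.8 threshold
accepted <- candidates[sim[1, ] >= 0.8]
```

Note that plain Levenshtein distance counts a swapped pair of letters ("Mian" for "Main") as two edits, which matters when picking the threshold for short street names.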
I’d recommend using a dictionary of short words that you can correct too, such as road/raod or street/stret.
You may also want to check for abbreviations. Ave and Avenue start with the same characters, whereas Rd drops characters from the middle of Road, so the matching rules differ. Once again, a dictionary could help.
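One way to apply such a dictionary is to normalise every street name token by token before any fuzzy comparison. The lookup table `street_dict` and the helper `normalise_street` below are hypothetical; a real table would need to be built from the data at hand.

```r
# Hypothetical lookup table: names are variants or misspellings,
# values are the canonical forms.
street_dict <- c(
  rd = "road", raod = "road",
  st = "street", stret = "street",
  ave = "avenue", av = "avenue",
  blvd = "boulevard"
)

normalise_street <- function(x, dict = street_dict) {
  x <- tolower(trimws(x))
  x <- gsub("[[:punct:]]", "", x)          # drop dots and commas: "rd." -> "rd"
  tokens <- strsplit(x, "[[:space:]]+")    # one character vector per address
  vapply(tokens, function(tok) {
    hit <- tok %in% names(dict)
    tok[hit] <- dict[tok[hit]]             # replace known variants only
    paste(tok, collapse = " ")
  }, character(1))
}

cleaned <- normalise_street(c("Queen Rd.", "Queen Raod", "5th Ave", "High Stret"))
```

Whole-token replacement keeps the rules for Ave/Avenue and Rd/Road identical, at the cost of ambiguous tokens such as "St", which can also stand for "Saint".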
This article contains 12 tests to find addresses using fuzzy matching that could be useful for improving your algorithm. Many of these examples Google can’t even match!
The examples include:
Incorrect Type (Street vs Road)
Bordering / Nearby Suburb
After looking at several commercial address autocomplete widgets, this one (https://www.addy.co.nz/address-finder-fuzzy-matching) is by far the smartest for New Zealand addresses. Perhaps you can get inspiration and come up with an even better algorithm!
I fully agree with what has been said previously in this post. Depending on your data quality, there is a lot of data cleaning to do before running your geocoder. Here is my code for running Levenshtein-based fuzzy matching (via agrep) in parallel. A list of addresses called "df_ge" is matched against the data "X", which contains addresses and geocodes. The hierarchical search goes from ZIP code area (1) to street (2) to street number (3). Geocoding 10'000 addresses takes about 1.71 min on 4 CPUs @ 3 GHz. Please note that the code is far from perfect and can be optimized.
library(foreach)
library(doParallel)

start <- Sys.time()

# Set up a parallel backend that uses 4 processors
cl <- makeCluster(4)
registerDoParallel(cl)

# Each iteration returns one row holding the IDs and their geocodes;
# the rows are combined with rbind. Unmatched addresses yield a row of NAs.
k <- foreach(i = 1:10000, .combine = "rbind") %dopar% {
  res <- data.frame(ID_US = NA, ID_Pet = NA, GKODE = NA, GKODN = NA)

  # (1) Match the data based on ZIP code (PLZ stands for ZIP code)
  temp1 <- X[which(X$PLZ4 == df_ge$PostalCodeNumber[i]), ]

  # (2) Match on street name: start with a perfect match (j = 0) and
  # increase the tolerance as long as temp2 is empty. The nrow(temp1) check
  # avoids an endless loop when no address shares the ZIP code.
  temp2 <- temp1[0, ]
  j <- 0
  while (nrow(temp2) == 0 && nrow(temp1) > 0) {
    temp2 <- temp1[agrep(df_ge$Street[i], temp1$STRNAME, max.distance = j, ignore.case = TRUE), ]
    j <- j + 0.01
  }

  # (3) Match on street number (DEINR stands for street number)
  temp3 <- temp2[which(tolower(temp2$DEINR) == tolower(df_ge$StreetNumber[i])), ]

  # Only accept the result if exactly one candidate remains
  if (nrow(temp3) == 1) {
    res[1, ] <- unlist(c(df_ge[i, c("ID_US", "ID_Pet")], temp3[, c("GKODE", "GKODN")]))
  }
  res
}
stopCluster(cl)

Sys.time() - start  # elapsed time
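As an alternative to filtering by ZIP code inside the loop, the question's idea of joining on postal code first can be sketched with a single `merge`, so that edit distances are only computed for same-ZIP pairs. This is not the answer author's method; the toy data and the column names (`id`, `zip`, `street`, `db_id`, `db_street`) are assumptions for illustration.

```r
# Toy input and reference tables; all column names are hypothetical.
inAddress <- data.frame(
  id     = 1:3,
  zip    = c("1000", "1000", "2000"),
  street = c("Mian Street", "Oak Avenue", "Pine Road"),
  stringsAsFactors = FALSE
)
dbAddress <- data.frame(
  db_id     = 101:104,
  zip       = c("1000", "1000", "2000", "3000"),
  db_street = c("Main Street", "Oak Avenue", "Pine Road", "Main Street"),
  stringsAsFactors = FALSE
)

# Blocking: the inner join on postal code yields only same-ZIP candidate pairs
# (5 pairs here instead of the 12 of a full cross comparison).
pairs <- merge(inAddress, dbAddress, by = "zip")

# Edit distance between the two street names of each pair
pairs$dist <- mapply(
  function(a, b) utils::adist(a, b, ignore.case = TRUE)[1, 1],
  pairs$street, pairs$db_street,
  USE.NAMES = FALSE
)

# Keep the closest candidate for every input address
best <- do.call(rbind, lapply(split(pairs, pairs$id), function(p) p[which.min(p$dist), ]))
best[, c("id", "db_id", "dist")]
```

The join replaces the per-address subsetting, and the distance step could still be parallelised per ZIP block if the candidate table grows large.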
Think about which values are more likely to have data entry errors and which are not. The primary number is on the far left, is the first thing entered, and almost never changes. Count on that one highly. Street names are frequently misspelled, so counting on the full street name is less reliable; the first couple of letters, sure. City + state and zipcode are effectively synonyms: a zipcode represents one or more city+state combinations (and the zipcode is subject to change by the USPS). I would recommend a simple search on primary number, partial street, and city+state.
– Jeffrey
Oct 27 '17 at 20:04