
program247365 : architecture (101)

single-spa
a javascript framework for front-end microservices
architecture  framework  microservices  programming 
4 weeks ago by program247365
Apex – Serverless Infrastructure
Apex serverless infrastructure built on AWS Lambda
serverless  cli  aws  architecture  deployment 
6 weeks ago by program247365
Eight Habits of Expert Software Designers: An Illustrated Guide | The MIT Press Reader
The best designers employ specific habits, learned practices, and observed principles when they work. Here are a few of them.
architecture  programming  software  technology 
october 2019 by program247365
Strangler pattern - Cloud Design Patterns | Microsoft Docs
Incrementally migrate a legacy system by gradually replacing specific pieces of functionality with new applications and services.
architecture  migration  development  legacy  designpatterns 
march 2019 by program247365
The Contract of the Model-View-Intent Architecture – ProAndroidDev
There are multiple articles, talks, and podcasts that address the topic of what the Model-View-Intent architecture is, but I rarely hear about what I think are the principles of this architecture…
architecture  frontend 
january 2019 by program247365
Azure Reference Architectures | Microsoft Docs
Reference architectures, blueprints, and prescriptive implementation guidance for common workloads on Azure.
azure  cloud  architecture  microsoft  reference 
october 2018 by program247365
.NET Application Architecture Guidance
Free e-books and practical advice for developing for web, desktop, mobile, and microservices with Docker. Learn how to migrate existing .NET apps to the cloud.
dotnet  azure  architecture  reference  Microsoft 
march 2018 by program247365
The hidden costs of serverless – A Cloud Guru
Serverless architecture is the next step in the evolution of computing power. The reasons for taking the leap are clear: Like the jump from on-premises to the cloud, the move to Serverless is more or…
serverless  cloudcomputing  aws  architecture  cost 
january 2018 by program247365
Scalable and Modular Architecture for CSS
Finished rewriting Changing State chapter and rearranged navigation to make things easier to follow. http://t.co/JpSTe7QW
css  architecture  webdev  programming  howto 
september 2011 by program247365
Dealing with Duplicate Person Data - Proud to Use Perl
I've recently been working on a fairly large project that has
contact information for almost 2 million people. These records contain
details for both online and offline actions. Since the data can come from
multiple sources, there are many duplicate records. Duplicates mean more
processing for our code, more storage space, and more hassle for our
clients, who have to deal with them. All in all, bad things to leave
lying around.

In this article we'll look at some of the strategies I used to identify
and remove these duplicates. All of the code in this article is sample
code; we'll leave the task of assembling it into a final working program
up to the reader.

CPAN is your Friend

Like all good Perl projects, we will make heavy use of the CPAN. It makes
our lives so much easier, and every day I'm more in awe of the quality
and breadth of the solutions I find there. For this project we'll be using
Text::LevenshteinXS,
Lingua::EN::Nickname, and
Parallel::ForkManager.

What is a Duplicate?

For our project we didn't want to merge duplicates automatically, but
rather flag them as potential duplicates and let our client choose to
do the actual merge. Not only does this force a human to make the final
decision, it also lets the client decide which value wins for each
conflicting field in the final merged record.

To mark a pair of records as potential duplicates, we give them a score
and then remember any pair that scored over a certain threshold. In our
case the scores go from 0 to 100 and our threshold is 75. Every
organization differs in which fields it has and what it considers a
duplicate. Since some fields are more important than others (having the
same state is not as important as having the same last name), we need
to give our fields some weight. For this project we'll use the following
fields and weights:

my %field_weight = (
    last_name  => 25,
    first_name => 25,
    address1   => 20,
    address2   => 10,
    city       => 20,
);

If a given field from 2 different records doesn't match, we subtract
that field's weight (we ignore case for all matches, because "peters"
is the same thing as "PETERS"). You will of course have to tune these
numbers for your own data.
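
To make that concrete, here's a minimal sketch of the scoring loop,
assuming each record is a hashref of field values (compare_records is
my own illustrative name, not code from the project):

sub compare_records {
    my ($rec_a, $rec_b) = @_;
    my $score = 100;
    for my $field (keys %field_weight) {
        # ignore case, and subtract the field's full weight on a mismatch
        my $val1 = lc($rec_a->{$field} || '');
        my $val2 = lc($rec_b->{$field} || '');
        $score -= $field_weight{$field} if $val1 ne $val2;
    }
    return $score;
}

Two records that are identical except for the city would score
100 - 20 = 80 and be flagged as potential duplicates; two that differ
in both names would score only 50 and be ignored.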

Catching Typos

What if 2 people who are living at the same address have first names
of "Michael" and "Michal"? They are most likely the same person and
someone somewhere made a typo. One of the easiest ways to catch typos
or strings that are almost identical is with an algorithm called a
"Levenshtein Distance". You can read more about it if you want the
specifics, but basically this algorithm calculates the number of changes
needed to transform one string into another. In the case of "Michael" vs
"Michal" the Levenshtein Distance is 1. The distance between "Michael"
and "Michelle" is 3 and the distance between "Michael" and "Peters"
is a whopping 7.

use Text::LevenshteinXS qw(distance);
my $distance = distance("Michael", "Michelle");   # 3
# distance("Michael", "Michal") is 1; distance("Michael", "Peters") is 7

We're using Text::LevenshteinXS because it's simple and written in C
(the "XS" in the name refers to XS, Perl's C API).

More Weights

Since typos are usually only 1 or 2 characters, we don't put much stock
in anything with a distance of more than 3, and a distance of 1 is more
likely to mean a duplicate than a distance of 3. So once again we'll use
weighted values. For this project we'll use the following weight
modifiers, one for each distance from 1 to 5:

my @distance_modifiers = (.10, .30, .45, .60, .90);

So if we're comparing our first_name fields and they have a Levenshtein
distance of 2, then we'll subtract 7.5 points (25 x .30). If that's the
only difference between our two records, we'll end up with a total score
of 92.5, which is pretty good.

my $distance = distance($val1, $val2);
$distance = @distance_modifiers if $distance > @distance_modifiers;   # clamp: anything past the table gets the harshest modifier
$score -= $field_weight{$field} * $distance_modifiers[$distance - 1] if $distance;

Nicknames

Nicknames are another problem we want to deal with. If we encounter
2 similar records where one has a first name of "Michael" and the other
"Mike", the pair won't score very well because the Levenshtein distance
between the names is 4. It gets even worse with examples like "Charles"
and "Chuck". Our particular problem is made easier by the fact that
almost all of our data is in English, and CPAN comes to the rescue
with Lingua::EN::Nickname. Give it 2 names and it will return a score
that tells you how likely it is that one is a nickname of the other.

nickname_eq("Robert", "Bob"); # gives a score of 98

Having played with our data set, we found that if the score is above 90
we can be almost certain it's a nickname, so we treat the names as if
they had matched exactly. If the score is between 80 and 90, we use it
as another modifier on the weight.

my $nick_score = nickname_eq($val1, $val2);
if( $nick_score > 80 && $nick_score < 90 ) {
    $score -= $field_weight{$field} * (1 - ($nick_score / 100));
}
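
Putting the first-name checks together, the comparison might look
something like this. This is a sketch of my own assembly, not the
project's verbatim code; score_first_name is a name I made up, and it
returns the number of points to subtract from the pair's score:

use Text::LevenshteinXS qw(distance);
use Lingua::EN::Nickname;

sub score_first_name {
    my ($name1, $name2) = @_;
    return 0 if lc($name1) eq lc($name2);     # exact match (ignoring case): no penalty

    # almost certainly a nickname: treat it like an exact match
    my $nick_score = nickname_eq($name1, $name2);
    return 0 if $nick_score >= 90;
    if( $nick_score > 80 ) {
        return $field_weight{first_name} * (1 - ($nick_score / 100));
    }

    # fall back to the Levenshtein distance, clamped to the modifier table
    my $distance = distance(lc($name1), lc($name2));
    $distance = @distance_modifiers if $distance > @distance_modifiers;
    return $field_weight{first_name} * $distance_modifiers[$distance - 1];
}

So $score -= score_first_name("Michael", "Mike") would subtract nothing,
while "Michael" vs "Peters" would cost nearly the full 25 points.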

Abbreviations

We love to use abbreviations to make things shorter and easier to type. I
can't remember the last time I entered my full street name in a web
form. But to better find duplicates in our data we need to be able to
consider "Drive" and "Dr." to be the same thing. You could use something
like Geo::PostalAddress
to normalize() the different addresses,
but that was a little more processing than I wanted for this project.

Instead I used a simple lookup hash based on data I pulled from the
US Postal Service (http://www.usps.com/ncsc/lookups/abbr_suffix.txt). I
took the most common abbreviations, and the resulting address
normalization looks something like this:

my %ADDR_ABBR = (
    avenue    => 'ave',
    avn       => 'ave',
    boulevard => 'blvd',
    circle    => 'cir',
    circ      => 'cir',
    court     => 'ct',
    crt       => 'ct',
    drive     => 'dr',
    driv      => 'dr',
    drv       => 'dr',
    ...
);

my @parts = split(/\s+/, $address);
for(my $i = 0; $i <= $#parts; $i++) {
    if( exists $ADDR_ABBR{$parts[$i]} ) {
        $parts[$i] = $ADDR_ABBR{$parts[$i]};
    }
}

$address = join(' ', @parts);

And now we can compare 2 addresses to get the Levenshtein distance
without having to worry about common abbreviations throwing it off.
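
Wrapped in a small helper (normalize_address is just my name for it,
and the lowercasing folds in the case-insensitive matching described
earlier), the effect is easy to see:

sub normalize_address {
    my @parts = split(/\s+/, lc shift);
    for my $part (@parts) {
        $part = $ADDR_ABBR{$part} if exists $ADDR_ABBR{$part};
    }
    return join(' ', @parts);
}

# both normalize to "123 sunset dr", so the distance is 0
my $d = distance(normalize_address('123 Sunset Drive'),
                 normalize_address('123 Sunset Dr'));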

Multiple Cores and Multiple CPUs

Since this process will be running on a machine with multiple CPUs, we
want to take advantage of that. We are comparing each record against
every other record, so it's basically an N-squared algorithm. With
millions of records, the number of comparisons will be in the trillions,
and it would take far too long if we didn't use the extra CPUs. To split
the work up, I decided to take all of the data for a single US state,
work on it as a group, and then move on to the next state.
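
The fork loop below pulls its work from a get_next_state() helper.
Here's a rough sketch of what such a helper could look like; the DBI
connection details, table, and column names are placeholders, not the
project's actual schema:

use DBI;

# hypothetical DSN and credentials -- substitute your own
my $dbh = DBI->connect('dbi:Pg:dbname=contacts', 'user', 'secret',
                       { RaiseError => 1 });

my @states = @{ $dbh->selectcol_arrayref(
    'SELECT DISTINCT state FROM person ORDER BY state'
) };

sub get_next_state {
    return unless @states;   # empty list ends the caller's while loop
    return shift @states;
}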

Splitting the data up this way is trivially parallelizable, so we can
simply fork off multiple processes and have each work on its own portion
of the data. Duplicate scores are recorded in the database, so we don't
even need any communication between the processes. (In reality I used
IPC::Shareable to communicate with the main process so we could show a
progress indicator of sorts when run from the command line, but this
isn't necessary.)

Using fork() under Unix isn't really that hard, especially if you
follow the canonical example in the Camel Book. But that's no reason it
can't be made even simpler. I like Parallel::ForkManager because it
makes it brain-dead simple to fork off a bunch of processes up to some
limit (for me this seemed to be the number of CPUs + 1). Then every time
a child process finishes, it takes care of forking another one to take
its place and do more work.

use Parallel::ForkManager;

my $fork_manager = Parallel::ForkManager->new($num_cpus + 1);
while( my ($state) = get_next_state() ) {
    # fork off and let a child handle this state
    my $pid = $fork_manager->start and next;

    process_state_records($state);
    $fork_manager->finish();
}
$fork_manager->wait_all_children;

Conclusions

While there are many ways to tackle this problem (and mine is far from
optimal), I hope this helps you see the power of the CPAN when tackling
big problems. It's hard to find a problem that someone else hasn't
already faced, and with the size and generosity of the Perl community,
at least part of the solution is probably already on the CPAN.
perl  algorithms  data  programming  architecture  contacts  cpan  duplicate  fork  parallel 
december 2010 by program247365