java - Bad Performance for Dedupe of 2 million records using mapreduce on Appengine - Stack Overflow

21 July, 2011 | GAE Cupboard Permalink

Questions  duplicates java mapreduce

http://stackoverflow.com/questions/6770781/bad-performance-for-dedupe-of-2-million-records-using-mapreduce-on-appengine

I have about 2 million records, each with about four string fields, that need to be checked for duplicates. More specifically, the fields are name, phone, address and fathername, and I must check each record for duplicates on all of these fields against the rest of the data. The resulting unique records need to be written to the database.
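
To make the requirement concrete, here is a minimal sketch of exact-match deduplication on the four fields. It is not the asker's code and does not use the App Engine mapreduce library; it only illustrates the idea of reducing each record to one composite key, which is the value a map step could emit so that duplicates land in the same group. The class name `DedupeSketch`, the `normalize` rules (trim, collapse whitespace, lower-case) and the field order in the `String[]` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class DedupeSketch {

    // Separator placed after each field so that ("ab", "c") and ("a", "bc")
    // produce different keys. Assumes this control character never occurs in the data.
    private static final char SEPARATOR = '\u0001';

    // Hypothetical normalization: trim, collapse whitespace runs, lower-case.
    static String normalize(String field) {
        if (field == null) {
            return "";
        }
        return field.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Builds one composite key from all fields of a record
    // (assumed order: name, phone, address, fathername).
    static String dedupeKey(String[] record) {
        StringBuilder key = new StringBuilder();
        for (String field : record) {
            key.append(normalize(field)).append(SEPARATOR);
        }
        return key.toString();
    }

    // Keeps the first record seen for each key, preserving input order.
    static List<String[]> dedupe(List<String[]> records) {
        Map<String, String[]> firstSeen = new LinkedHashMap<>();
        for (String[] record : records) {
            firstSeen.putIfAbsent(dedupeKey(record), record);
        }
        return new ArrayList<>(firstSeen.values());
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"Asha Rao", "555-0101", "12 Lake Road", "Mohan Rao"});
        records.add(new String[] {" asha  rao ", "555-0101", "12 lake road", "MOHAN RAO"});
        records.add(new String[] {"Ravi Kumar", "555-0102", "4 Hill Street", "Suresh Kumar"});

        List<String[]> unique = dedupe(records);
        System.out.println(unique.size() + " unique records"); // prints "2 unique records"
    }
}
```

Each record is hashed once, so the work grows roughly linearly with the number of records rather than comparing every record with every other. Whether the whole key set fits in memory on a single instance is a separate question; in a mapreduce setting the same `dedupeKey` value could serve as the shuffle key, with the reduce step keeping one record per key.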

