Saturday, April 16, 2011

Detecting Duplicate Content

I think about duplicate content because I know that if I could build a tool to detect it, I could make a fair bit of $.

In general, you define an operator that transforms content into a vector; the dot product of two vectors then gives you a clue about how similar the pieces are in raw content (ignoring the order of terms, up to a point). Now, this operator is going to be complex. The natural algorithm that anyone could build, comparing every pair of documents, has complexity O(N²). You can build clustering algorithms instead, but their performance may not be all that great and at worst they are useless.
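The vector idea above can be sketched minimally: turn each document into a term-frequency vector and compare with the (normalized) dot product, i.e. cosine similarity. The function names and the bag-of-words tokenization here are my illustrative choices, not anything from the post.

```python
# Sketch: content -> term-frequency vector, then cosine similarity.
from collections import Counter
import math

def to_vector(text):
    # Lowercase bag of words; deliberately ignores term order, as described above.
    return Counter(text.lower().split())

def cosine(a, b):
    # Dot product over shared terms, normalized by vector lengths.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the lazy dog the quick brown fox jumps over"  # same words, reordered
doc3 = "completely unrelated text about search engines"

print(cosine(to_vector(doc1), to_vector(doc2)))  # near 1.0: duplicates
print(cosine(to_vector(doc1), to_vector(doc3)))  # 0.0: no shared terms
```

Comparing all pairs of N documents this way is exactly the O(N²) cost mentioned above.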

Solving the problem I want to solve may not be feasible without massive resources.

Does Google dedicate massive resources to detecting duplicate content? Now, this is an interesting question, and I doubt they even need to. This line of thought gave me a clue about how to build a product that would be modestly useful. Rather than thinking about how to detect duplicate content, I think about how to punish duplicate content.

It is very easy to punish duplicate content as you go. For instance, if I were Google, then I would look at search results and prune out duplicate listings as I go. If I search for "mathgladiator", then duplicate content will rank similarly and be adjacent in the results (or close to it). This algorithm is O(M), where M is the number of search results. So, as Google returns results, it adjusts and punishes data that it believes is duplicate with some form of voting system. Over time, duplicate content is dead.

OK, how to provide this as a service? Well, take an open source search engine that provides full text search (Nutch?) and have it crawl your site. Take a list of keywords/terms that you care about, cron the searches, and compare adjacent search results. Alert on content that compares as a near-duplicate.
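The cron job's comparison step might look like the sketch below, here using word shingles and Jaccard overlap as one plausible way to compare adjacent results. `search_index` is a stand-in for whatever engine (e.g. Nutch) serves your ranked page texts; the names and the 0.5 threshold are assumptions, not anything specified in the post.

```python
# Sketch: for each watched keyword, fetch ranked results and flag
# adjacent pairs that look like near-duplicates.

def shingles(text, k=3):
    # Overlapping k-word windows; more order-sensitive than a bag of words.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def alert_on_duplicates(keywords, search_index, threshold=0.5):
    alerts = []
    for kw in keywords:
        results = search_index(kw)  # ranked page texts for this keyword
        for prev, cur in zip(results, results[1:]):  # adjacent pairs only
            if jaccard(prev, cur) >= threshold:
                alerts.append((kw, prev, cur))
    return alerts
```

Run from cron, `alert_on_duplicates` would feed whatever notification channel you like; only adjacent pairs are compared, so the cost per keyword stays linear in the number of results.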

Admittedly, this doesn't solve the deeper issue, but it gives an advantage to those who can build such a system.
