I'm killing WIN and making something else also called WIN because WIN is an awesome name.
Why? The biggest thing old WIN does right is URL management. Now, I'm going to build a new product that's so fucking awesome.
new WIN = node.js + jsdom
Basically, a proxy server that puts your server behind a strict HTML parser. Why? The stricter the parser, the more the debugging process will guarantee correct HTML. Good HTML can be analysed on the fly. So, now, software can be written with craptastic URLs like something.php?foo=bar and be transformed via jQuery on the fly.
What about big data? Well, dealing with file uploads in PHP is kind of annoying. Instead, I can use node.js to hyper-visor the upload and then stream it into S3. This will give the PHP script a URL, which is what PHP ultimately wants.
Saturday, April 16, 2011
Detecting Duplicate Content
I think about duplicate content because I know that if I were to build a tool to detect it, then I could make a fair bit of $.
In general, you define an operator on content to transform it into a vector, and then the dot product of the vectors will give you a clue into how similar they are in raw content (ignoring order of terms up to a point). Now, this operator is going to be complex. The natural algorithm that anyone could build compares every pair of documents, so it is going to have complexity O(N^2), where N is the number of documents. You can build clustering algorithms, but their performance may not be all that great, and at worst they are useless.
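As a rough sketch of the idea above, here is about the simplest operator I can think of: a bag-of-words term count, compared with a normalized dot product (cosine similarity). The function names, the tokenizer regex, and the 0.9 threshold are all assumptions for illustration; a real operator would be far more complex.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Map content to a bag-of-words vector (term -> count); word order is ignored."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Normalized dot product of two term-count vectors, in [0, 1]."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def all_pairs(documents, threshold=0.9):
    """Naive O(N^2) scan: compare every document against every other one."""
    vectors = [to_vector(d) for d in documents]
    hits = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine_similarity(vectors[i], vectors[j])
            if score >= threshold:
                hits.append((i, j, score))
    return hits
```

The nested loop in all_pairs is exactly the part that stops scaling: every new document has to be compared against everything already seen.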
Solving the problem I want to solve may not be feasible without massive resources.
Does Google dedicate massive resources to detect duplicate content? Now, this is an interesting question, and I doubt they even need to. This line of thought gave me a clue on how to even build a product that would be modestly useful. Rather than thinking about how to detect duplicate content, I think about how to punish duplicate content.
It is very easy to punish duplicate content as you go. For instance, if I were Google, then I would look at search results and prune out duplicate listings as I go. If I search for "mathgladiator", then duplicate content will rank similarly and be adjacent in a search (or close). This algorithm is O(M), where M is the number of search results. So, as Google returns results, it adjusts and punishes data that it believes is duplicate with some form of voting system. Over time, duplicate content is dead.
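A minimal sketch of that single pass, assuming the results arrive as (url, text) pairs already in rank order. The function name, the window parameter, and the 0.9 threshold are made up for illustration, and difflib's SequenceMatcher is just a stand-in for whatever comparison you would really use. The number of comparisons is O(M) for a fixed window, though each comparison still costs time in the length of the two texts.

```python
from difflib import SequenceMatcher


def flag_adjacent_duplicates(results, threshold=0.9, window=1):
    """Single pass over ranked results, comparing each to its next `window` neighbors.

    `results` is a list of (url, text) pairs in rank order. Returns
    (url_a, url_b, ratio) tuples for neighbors that look like duplicates.
    """
    flagged = []
    for i, (url_a, text_a) in enumerate(results):
        # Only look at the few results ranked right after this one.
        for url_b, text_b in results[i + 1 : i + 1 + window]:
            ratio = SequenceMatcher(None, text_a, text_b).ratio()
            if ratio >= threshold:
                flagged.append((url_a, url_b, ratio))
    return flagged
```

Widening the window catches duplicates that rank close to each other but not strictly adjacent, at the cost of more comparisons per result.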
Ok, how to provide this as a service? Well, take an open source search engine that provides full text search (Nutch?) and then have it crawl your site. Take a list of keywords/terms that you care about, cron the search, and then compare adjacent search results. Alert on content that compares as too similar.
Problematically, this doesn't solve the deeper issue, but it gives an advantage to those that can build this system.
Labels:
technology
Sunday, April 3, 2011
Entrepreneur to Amazonian : The Amazon Vortex
So, I used to be an entrepreneur/super consultant until I realized it wasn't making me happy. Happiness is a precarious thing. Ultimately, happiness boils down to aligning your calling with a profession that earns your keep (See Dan Pink's book Drive). The reality is that I'm not so much an entrepreneur as I am an inventor. Fundamentally, my callings are inventing, optimizing, educating, studying, building, proving things, using math, having fun with computers, and being right (a lot).
One day, I read Hamming's speech via pg, and it reminded me of why I was in academia: I want to work on great problems. I left academia to work on a super problem, but that failed. I thought about returning to academia, but I would have probably become a crank with gray hairs working on some crazy type system complaining about something.
Instead of writing papers that few will read, I want happy customers that benefit from my work. So, I decided to try a new job. I traced from my recent work and passion in Big Data, worked backwards, and got a job.
The first company on my list was Amazon Web Services, and now I work there. I'm working in the S3 team.
This means three things: (1) I have a Big Data problem (really big), (2) I'm under an NDA, and (3) I'm working with Protoss again.
So, like the Google Vortex, I'm now in the Amazon vortex. This fundamentally means my blog is going to change. I'll probably stop writing on technical things for a while. When I head to the mountains, I'll post pics.
Labels:
business,
personal,
technology