Wikisearch?

by dbasch

Jimmy Wales from Wikipedia just announced his intention to create a search engine combining Nutch and human editors. Color me a skeptic, but although Wikipedia is great this has a neon Vaporware sign painted all over it. Here are some reasons why I’m not sure it’s a good idea:

1) It’s been done before. Yahoo began as a human-powered directory of websites and had to start including results from other engines (first Altavista, then Inktomi, then Google and finally their own) to answer the queries that could not be satisfied by their directory. Their paid editors could only manage about one million pages.

2) It’s being done right now, wiki style. There are companies such as Wink that already combine user-powered search with machine-produced results (full disclosure: we work for them). Wink does not use an open source crawler and ranking algorithm like Nutch, as Wales stated he will. Why not? See the next point.

3) Although Nutch is a great effort led by a group of extremely smart guys, it has one problem that has not been addressed so far. This is how to deal with “adversarial information retrieval” (i.e. spam). Google and Yahoo have their own “secret sauce” algorithms that change all the time in order to prevent spammers from taking over the first page of results for coveted queries. How could you do this if your algorithms were public? Spammers could simply tune their pages and link farms in a way that would always beat relevant content that doesn’t resort to the same tricks. The amount of spam on the web is on the order of billions of pages, and spammers are competing for traffic which instantly turns into profit. It is simply impossible to weed out spam without closely guarded algorithms that take years to develop.

4) Scalability. It’s not hard to crawl the web once and create a search engine that receives very little traffic on an index containing a billion pages. However, it is incredibly expensive to keep this crawl updated AND deal with thousands of queries per second. Wikipedia is tiny compared to the web at large; in fact, it’s three to four orders of magnitude smaller. Making the transition from millions to billions of pages is not child’s play, and it would require hiring dozens of brilliant engineers to design a system that could compete with Google not just on relevance but also on freshness, reliability and maintainability. I’m not saying that it is necessarily too late for a new player to enter the game, but why make an announcement years before having anything workable? I assume Wales and his team are smart people and must have given these issues some thought. However, if they haven’t built a prototype yet, they cannot know all the intricacies and subtle points of building a real web search engine. One thing I can say for sure after years of working for major search engines is that running one is much harder than running Wikipedia from a purely technical standpoint.
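
To make point 3 concrete, here is a minimal sketch of a spammer tuning a page offline against a published spam filter. Everything in it is made up for illustration: the single keyword-density rule, the threshold of 80 per mille, and the names `is_flagged` and `tune_offline` are hypothetical and are not taken from Nutch or any real engine.

```python
# Hypothetical published rule: flag a page whose keyword density exceeds
# 80 per mille (8%). Real filters combine many signals; this is a toy.
PUBLIC_THRESHOLD_PER_MILLE = 80


def is_flagged(page):
    """Return True if the (public, hypothetical) filter calls the page spam."""
    return page["keyword_per_mille"] > PUBLIC_THRESHOLD_PER_MILLE


def tune_offline(page, step=5, max_trials=1000):
    """Raise keyword density until the next step would be flagged.

    Each trial is a local function call, so the spammer never has to
    wait for a crawl to learn the outcome. With the source in hand the
    spammer could also just read the threshold and skip the search.
    """
    best = dict(page)
    trials = 0
    while trials < max_trials:
        candidate = dict(best, keyword_per_mille=best["keyword_per_mille"] + step)
        trials += 1
        if is_flagged(candidate):
            break
        best = candidate
    return best, trials


tuned, trials = tune_offline({"keyword_per_mille": 10})
# Assumed 7-day recrawl cycle, purely for comparison.
days_if_each_trial_needed_a_weekly_recrawl = trials * 7
print(tuned, trials, days_if_each_trial_needed_a_weekly_recrawl)
```

In this toy setup the spammer finds the highest density that stays unflagged after 15 local checks. If each check required waiting a week for a recrawl (an assumed figure), the same search would take 105 days, and a closed algorithm could change in the meantime.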


  1. I’m not completely sure about the reasoning in (3). A spam detection algorithm just decides, given a page and a query, whether the page is spam. Using that algorithm to craft a spam page that goes undetected IS a problem, but it could be either easy or extremely difficult. If the algorithm is good enough, the only feasible approach is trial and error, which is exactly the same as with a secret algorithm. In a way, this could be compared to security by obscurity.

  2. The point is that if you have the algorithm in your possession, the trial-and-error cycle is extremely fast. On the other hand, if you have to wait for the search engine to crawl your pages and see how they rank, each cycle could take several days or even weeks.

    In some cases trial and error may not even be necessary. For example, the algorithm may explicitly state certain features of the page that are a telltale sign of spam. With Google or Yahoo, this is a moving target and you won’t know until you try. With an open-source algorithm, you know not to bother with tricks that are guaranteed to raise a red flag.

  3. That ‘extremely fast’ could mean weeks instead of years, in which case the algorithm is pretty good. If ‘extremely fast’ means minutes instead of weeks, well, in that case it’s a rotten egg.

    What if the algorithm takes all crawled web pages as extra input? In that case you can have the source but not the input, so you have no clue what the results will be. You are left with trial and error against the production search engine, the same thing you would do if you didn’t have the source.

    I’m not sure if such an algorithm exists, but it’s not easy to assume it doesn’t.

  4. […] - La Bitacora del Capitan started out selling costume jewelry in Mardel… so I knew you from La Perla :P - La Mágica Web and some photos of Cabildo in the middle of shopping season - ABrito is the Grinch :P - Atalaya and the real preparations for Santa Claus - Dilbert and one of those situations similar to what one has lived through in corporate life :S - SigT gives ten tips for preventing data loss - Fabio and how to be a digital voyeur - Online reports that PayPerPost had to change its policies… I still don’t think I’ll use it - tuexperto interviewed me - ALT1040 with the case of a fool who didn’t use common sense - Bajo mi pulgar shows a new “genre” of video games… could it be vaporware? - idealabs shows the typical journalistic ignorance when talking about the Internet - furilo writes an excellent review of online activist projects - Error500 is waiting for the ADSL price cut - Julio, meanwhile, is hoping to at least get ADSL… and I thought things would be different there :S - zonageek finds out who made money from the Window In The Skies video - Flaptor (who else) writes seriously about Wikisearch - Pensamientos Despeinados lists the 15 sports personalities of 2006 - Think Wasabi with the Vodafone modem that has been causing a sensation - defmay and a Christmas gift nobody could have imagined :S - Un Blog Mas speaks from the “past” :) - Peinate and the blogger things discovered since starting to blog - ALT1040 and the list of Amazon’s best-selling products of 2006 […]

  5. Good articles.

  6. Your reasoning in point 3) is not solid, since we have evidence of open systems that cannot be broken. Consider RSA, the public-key cryptography algorithm.

    In the security field it has been shown that protection by “obfuscation” does not work.

    Why isn’t this possible in Adversarial WebIR?
    Maybe it is… so don’t put that option aside just yet :)

  7. Hector, cryptography and adversarial IR are different problems. Taking your example, the RSA algorithm has two keys: one is public and one is private. If your private key fell into the wrong hands, then the algorithm couldn’t protect you. For obvious reasons, a “private key” could not be open-sourced.

    By the way, who says that security through obscurity doesn’t work? Of course you shouldn’t rely on obscurity as your *single* security mechanism, but obscurity in addition to a secure algorithm is better than just the algorithm.

    Relevance algorithms don’t have a secret key, so opening the algorithm makes it really easy for people to game it. Obscurity is the only thing Yahoo and Google have going for them. This is why they keep their anti-spam algorithms secret and change them pretty often.
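
The RSA comparison in the last two comments can be sketched with a toy example. The numbers below are tiny textbook values chosen for readability and are in no way secure; the function names `encrypt` and `decrypt` are illustrative only. The point is that the code can be fully public while the secret lives entirely in the private exponent `d`, and a relevance algorithm has no comparable secret input.

```python
# Toy RSA with tiny primes; real keys use primes hundreds of digits long.
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)    # used only while generating the key pair
e = 17                     # public exponent
d = pow(e, -1, phi)        # private exponent: the only secret (Python 3.8+)


def encrypt(m, public_key):
    """Anyone can run this: it needs only the public key (e, n)."""
    exponent, modulus = public_key
    return pow(m, exponent, modulus)


def decrypt(c, private_key):
    """Only the holder of the private key (d, n) can undo encrypt()."""
    exponent, modulus = private_key
    return pow(c, exponent, modulus)


message = 65
ciphertext = encrypt(message, (e, n))
recovered = decrypt(ciphertext, (d, n))
print(ciphertext != message, recovered == message)
```

Publishing `encrypt`, `decrypt`, `e` and `n` gives an attacker nothing useful at realistic key sizes, because recovering `d` requires factoring `n`. A published ranking function offers no such hard step between the source code and a page tuned to beat it.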
