Flaptor autotagger

by dbasch — 5 Comments »

We just put up a beta version of our autotagging software at tagger.flaptor.com. It is a program that uses machine learning techniques to guess what tags could apply to a blog post or a news article. Please give it a try and tell us what you think!

Some interesting facts about it:

  • The current algorithm learned from hundreds of thousands of blog posts from different RSS feeds, all of them in English. We plan to support other languages soon.
  • It is slightly biased towards current events, as most posts are from the past few weeks.
  • If you click on a suggested tag, you will see information regarding what words in the post contributed to that tag. This gives you a hint as to what search engines may think of your post based on the words it contains.
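
To give a rough idea of how word-level explanations like these can work, here is a minimal sketch in Python. It is not our production algorithm: it assumes a simple naive-Bayes-style scorer, and every name in it (TinyTagger, train, contributions, suggest) is made up for illustration. Each word in a post gets a score for a tag equal to the log of how much more likely the word is in posts carrying that tag than in posts overall, and a tag's total score is the sum over the post's known words.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


class TinyTagger:
    """Hypothetical toy tagger; a sketch, not Flaptor's actual method."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word frequencies
        self.total_counts = Counter()            # word frequencies over all posts

    def train(self, text, tags):
        words = tokenize(text)
        self.total_counts.update(words)
        for tag in tags:
            self.word_counts[tag].update(words)

    def contributions(self, text, tag):
        """Per-word score: log(P(word | tag) / P(word)), add-one smoothed.

        Words never seen in training are skipped.
        """
        vocab_size = len(self.total_counts)
        tag_total = sum(self.word_counts[tag].values())
        all_total = sum(self.total_counts.values())
        scores = {}
        for word in set(tokenize(text)):
            if word not in self.total_counts:
                continue
            p_tag = (self.word_counts[tag][word] + 1) / (tag_total + vocab_size)
            p_all = (self.total_counts[word] + 1) / (all_total + vocab_size)
            scores[word] = math.log(p_tag / p_all)
        return scores

    def suggest(self, text, top=3):
        """Rank tags by the sum of their per-word contributions."""
        totals = {
            tag: sum(self.contributions(text, tag).values())
            for tag in self.word_counts
        }
        return sorted(totals, key=totals.get, reverse=True)[:top]


# Made-up training data, just to exercise the sketch.
tagger = TinyTagger()
tagger.train("the election results surprised voters", ["politics"])
tagger.train("the senate vote on the election bill", ["politics"])
tagger.train("the new search engine indexes the web", ["search"])
tagger.train("crawler and index for web search", ["search"])

post = "voters await the election"
print(tagger.suggest(post, top=1))
print(tagger.contributions(post, "politics"))
```

Sorting the dictionary returned by contributions from highest to lowest gives the kind of "which words pushed this tag" view described above. A real system trained on hundreds of thousands of posts would need better tokenization, stop-word handling and a more careful model.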

We are very excited about this tool and we will continue to improve it in the near future so stay tuned!

The perfect search engine

by dbasch — 3 Comments »

A few months ago Steve Newcomb from Powerset asked a question on LinkedIn: if you could build the perfect search engine for you, what would it do?

I was reading all the answers again and started wondering about what a “perfect” search engine would do in the true sense of the word. To me, a perfect search engine would not be a web-based application into which I have to type a question in order to get an answer. Rather, it would be more of an extension of my brain.

The reason I’m using a search engine in the first place is because there is some information that I currently don’t know or can’t remember. Many times it happens that I’m trying to remember some address, how to do something, etc. and my brain just isn’t finding the information. I don’t even know if I ever knew it! A perfect search engine would be like a second brain: it would be a telepathic mechanism to which I can resort in these cases. I wouldn’t have to even phrase a question in my mind, it would work by association just like my brain does.

I wouldn’t want this mechanism to kick in automatically though. I very much want to keep a clear distinction between my own memories (or personal information cache, if you prefer) and the collective knowledge of the global network. Perhaps an interesting way would be to have an audible voice tell me something like: “it seems like you don’t have what you are looking for, let me get it for you”. On second thought, that sounds too much like Microsoft’s Clippy. Maybe it would be better if I could consciously activate the mechanism by thinking something like “Searcher, I’m at a loss here. Please find me some relevant information”.

In a perfect world, this information would be one hundred percent worthy of trust. I would have no reason to doubt what the search engine tells me any more than I distrust myself when I walk back to where I think I parked my car this morning. Furthermore, since we are going for perfection, sometimes the search engine would be able to create content to suit my needs. For example, if I want a picture of a monkey in a scuba diving suit talking on a cellphone while skateboarding on the moon then the search engine would synthesize it for me.

Maybe what I described is not the perfect search engine. Maybe it will never happen, for better or worse. The point I’m trying to make is that the paradigm of search engines is still very primitive compared to what our imagination allows. The above question makes me imagine a discussion about perfect transportation taking place in the sixteenth century. A group of European craftsmen would be exchanging ideas about giant carriages pulled by hundreds of horses on excellent cobblestone paths, or extremely efficient ships with all the amenities of a palace, impervious to the fiercest storms and powered by enormous sails made out of the finest silk (or something along those lines). Some adventurous minds such as Leonardo could think of flying machines but they would be ahead of their time.

A Threat to Web Search?

by jorge — No Comments »

In this article, John C. Dvorak talks about search neutrality. He points out that the task of indexing the web is so daunting that only a few very large companies (Google, Yahoo, MSN) can afford to do it. So far, search results are neutral. But what about tomorrow, he asks; when does corruption sneak into the equation? Elections could be influenced by little tweaks to the search results. And it wouldn’t even be illegal!

Mainstream news media immediately come to mind. They skew their news coverage. They are owned by powerful media moguls with political agendas. Most of the population watches only a few large networks, so they are effectively in control of public opinion. And it is all legal.

Yet even while these networks do have political agendas, they are spread throughout the spectrum. It is unlikely that they will merge into one all-controlling information monster. If you don’t trust Fox, you can watch CNN, or the BBC. Or all of them, and form your own opinion. Better yet, you can add to the mix smaller networks, local online newspapers, blogs and discussion forums. With online news aggregators, you no longer need to spend all morning reading all major newspapers.

I believe the same applies to the search world. It’s unlikely that Google, Yahoo, MSN, Altavista, AskJeeves and AOL Search will all merge into one giant entity that decides what parts of the web you are allowed to see. If you don’t trust one, just go search somewhere else, or use a meta-searcher like Dogpile, Vivisimo, Kartoo, Mamma or SurfWax. Better yet, add specialty and local search engines to the mix, along with usenet, del.icio.us, technorati, digg, reddit, and any number of information aggregators that are and most likely will remain out of reach for wannabe election riggers.

In both cases, the myriad of online information sources, be it news or search results (the distinction blurs) keeps the big players in check. If a blogger can topple a mainstream news anchor who didn’t bother to check his sources, small independent topical search engines can be trusted to keep the big ones honest.

Wikisearch?

by dbasch — 7 Comments »

Jimmy Wales from Wikipedia just announced his intention to create a search engine combining Nutch and human editors. Color me a skeptic, but although Wikipedia is great, this has a neon Vaporware sign painted all over it. Here are some reasons why I’m not sure it’s a good idea:

1) It’s been done before. Yahoo began as a human-powered directory of websites and had to start including results from other engines (first Altavista, then Inktomi, then Google and finally their own) to answer the queries that could not be satisfied by their directory. Their paid editors could only manage about one million pages.

2) It’s being done right now, wiki style. There are companies such as Wink that already combine user-powered search with machine-produced results (full disclosure: we work for them). Wink does not use an open source crawler and ranking algorithm like Nutch, as Wales stated he will. Why not? See the next point.

3) Although Nutch is a great effort led by a group of extremely smart guys, it has one problem that has not been addressed so far. This is how to deal with “adversarial information retrieval” (i.e. spam). Google and Yahoo have their own “secret sauce” algorithms that change all the time in order to prevent spammers from taking over the first page of results for coveted queries. How could you do this if your algorithms were public? Spammers could simply tune their pages and link farms in a way that would always beat relevant content that doesn’t resort to the same tricks. The amount of spam on the web is on the order of billions of pages, and spammers are competing for traffic which instantly turns into profit. It is simply impossible to weed out spam without closely guarded algorithms that take years to develop.

4) Scalability. It’s not hard to crawl the web once and create a search engine that receives very little traffic on an index containing a billion pages. However, it is incredibly expensive to keep this crawl updated AND deal with thousands of queries per second. Wikipedia is tiny compared to the web at large; in fact, it’s three to four orders of magnitude smaller. Making the transition from millions to billions of pages is not child’s play, and it would require hiring dozens of brilliant engineers to design a system that could compete with Google not just on relevance but also on freshness, reliability and maintainability. I’m not saying that it is necessarily too late for a new player to enter the game, but why make an announcement years before having anything workable? I assume Wales and his team are smart people and must have given these issues some thought. However, if they haven’t built a prototype yet, they cannot know all the intricacies and subtle points of building a real web search engine. One thing I can say for sure after years of working for major search engines is that running one is much harder than running Wikipedia from a purely technical standpoint.

Fon

by dbasch — 1 Comment »

This morning I was in a meeting with a number of entrepreneurs including Martin Varsavsky from Fon. He gave a nice presentation about the concept. I got a Fonera (FON’s wifi router) at the meeting, which we promptly set up at Flaptor’s office. It’s a neat concept; I won’t explain it here in detail, as you can read all about it on the site. Essentially, sharing some of your bandwidth allows you to use Fon’s worldwide network for free. For something like this to take off it needs critical mass: there has to be an abundance of FON access points in major cities. Martin showed some charts that make it look like it could happen in 2007, and it sure helps to have strategic partners like Google and Skype. I’ll keep my eye on it, and will look up FON access points when I travel.

Bonoki

by dbasch — No Comments »

Flaptor is mostly about search but sometimes people want to do other stuff. I read once (I believe it was in Peopleware) that when people feel passionately about a project, it’s usually a bad idea to stop them. This is how Bonoki was born. Mono, Rafa, Pancho and Pasto have been working on it for the past quarter. If nothing else, it was a good playground to learn some user interface technologies; perhaps it will become something more. For now, it’s a good place to post and comment on each other’s pictures. It’s still getting started, although we refuse to put a Beta logo on it (that’s so 2005!).

Check it out, all feedback (however harsh) is welcome!

The Future of search

by dbasch — No Comments »

“Everything that can be invented has been invented.”
(falsely attributed to) Charles H. Duell
Director of U.S. Patent Office, 1899

When we think of searching for information, we imagine a text box on a white page with a button next to it. Type a few words, get a list of links and snippets of web pages in a fraction of a second. This interface has remained pretty much unchanged for the past decade. Google perfected it by eliminating unnecessary clutter and leaving just the essential inputs and outputs. Now, time for some rhetorical questions. Is this it? What will the experience of searching for information look like ten, twenty, one hundred years from now? Is there room for a third question?