Positive comments about Twist on Delicious

by dbasch — 2 Comments »

Twist got a lot of traffic today, as it made it to the top of del.icio.us/popular. Here are very positive comments from del.icio.us users. It’s fun to look at the logs and see what people are comparing. Also, moving it to Amazon EC2 was a good idea as the traffic is not a problem at all.

Twist – See trends on twitter

by dbasch — 3 Comments »

We just released a fun tool called Twist. If you are familiar with Google Trends, it should be obvious what it does. It allows you to track trends on Twitter over the past week, with a granularity of a couple of hours. While testing the tool we saw spikes in the charts whenever an interesting story broke (especially about technology or celebrities). We also discovered patterns such as the fact that people on Twitter mention lunch more than breakfast on weekdays, while both are mentioned about equally on weekends:

http://twist.flaptor.com/freq?gram=lunch,%20breakfast

This tool is coupled with our Twitter search engine, so you can see what people are saying about the concepts shown in the charts. We really like Twitter and we are excited about this project, so stay tuned for more improvements.
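The kind of chart described above can be approximated by counting, for each term, how many messages mention it within fixed time buckets. The following is only a minimal sketch of that idea, not Twist's actual implementation: the function names, the two-hour bucket size, and the sample messages are all assumptions made up for illustration.

```python
from collections import Counter
from datetime import datetime

BUCKET_HOURS = 2  # assumed granularity ("a couple of hours")


def bucket_start(ts, bucket_hours=BUCKET_HOURS):
    """Round a timestamp down to the start of its time bucket."""
    hour = ts.hour - ts.hour % bucket_hours
    return ts.replace(hour=hour, minute=0, second=0, microsecond=0)


def term_frequencies(messages, terms, bucket_hours=BUCKET_HOURS):
    """Count, per term and per bucket, how many messages mention the term.

    `messages` is an iterable of (timestamp, text) pairs. Matching is a
    naive whole-word, case-insensitive comparison on whitespace tokens.
    """
    counts = {term: Counter() for term in terms}
    for ts, text in messages:
        words = set(text.lower().split())
        bucket = bucket_start(ts, bucket_hours)
        for term in terms:
            if term in words:
                counts[term][bucket] += 1
    return counts


# Hypothetical sample data, invented for this example.
messages = [
    (datetime(2008, 4, 14, 8, 15), "Having breakfast before work"),
    (datetime(2008, 4, 14, 12, 5), "lunch with the team"),
    (datetime(2008, 4, 14, 13, 40), "Late lunch today"),
    (datetime(2008, 4, 14, 14, 10), "back from lunch"),
]
freq = term_frequencies(messages, ["lunch", "breakfast"])
```

Each `Counter` in `freq` maps a bucket start time to a message count, which is the series one would plot per term.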

By the way, both applications run on Amazon EC2.

WordPress.com search running on Hounder

by dbasch — 7 Comments »

WordPress just launched a blog search service for the almost 3M blogs they host on the site. It’s powered by Hounder, the open-source search engine we have been developing at Flaptor over the past few years. We’ve been working on this for a few months, so it’s great to see it live! It’s been a pleasure working with Matt and Toni to make this happen, and it will be fun to see how many blogs get indexed in the future and how Hounder scales to handle the traffic.

TagMahal released

by dbasch — 2 Comments »

We just released a WordPress plugin for our automatic tagger. It’s called TagMahal and it’s available here: tagger.flaptor.com/tagmahal. If you have a blog powered by WordPress, it will suggest tags for a post as you write it. Try it, it’s fun!

Intro to the Semantic Web

by dbasch — No Comments »

Here is a six-minute video about the Semantic Web. The idea is explained in a very simple and concise manner.

I think it’s a neat concept that would become popular if it could be made easy for authors to annotate their content as they go. One problem is that most people can’t even be bothered to tag blog posts, let alone incorporate new tags to describe the types of things they talk about. I believe that a content creation tool that could automatically discover entities such as places, artists, events or book titles would be very helpful in this respect. At Flaptor we are doing research along these lines. It looks like a field that is still in its infancy (pretty much like search engines in the nineties).

Flaptor Open Source

by dbasch — 2 Comments »

At Flaptor we believe in the open source philosophy. This is why we have decided to release our most widely used projects as open source. We have created Flaptor Open Source, an initiative for projects related to information retrieval.

We think this decision will be beneficial to the open source community as well as to our clients. On one hand, the community will be able to take advantage of proven and stable projects such as our search engine Search4j. On the other, we hope that the feedback from the developer and user communities will help us improve the quality, robustness and features of our code. This will benefit our clients, who will be running a better product. As for us, we expect to increase our market share and reach users who otherwise would not have been able to run our software. Of course, we will continue to sell support and customization services.

Through Flaptor Open Source we have already launched a byproduct of our search engine, called Clusterfest. It’s a framework to monitor and control multi-server Java programs. Search4j will be available as open source soon, so stay tuned!

Flaptor autotagger

by dbasch — 5 Comments »

We just put up a beta version of our autotagging software at tagger.flaptor.com. It is a program that uses machine learning techniques to guess what tags could apply to a blog post or a news article. Please give it a try and tell us what you think!

Some interesting facts about it:

  • The current algorithm learned from hundreds of thousands of blog posts from different RSS feeds, all of them in English. We plan to support other languages soon.
  • It is slightly biased towards current events, as most posts are from the past few weeks.
  • If you click on a suggested tag, you will see information regarding what words in the post contributed to that tag. This gives you a hint as to what search engines may think of your post based on the words it contains.
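To illustrate how per-word contributions to a tag can be computed, here is a minimal sketch. It is not the algorithm the tagger uses: it assumes a simple smoothed word-frequency model in which a word supports a tag when it is more frequent in posts carrying that tag than in posts overall. The function names and the training posts are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(tagged_posts):
    """Collect word counts per tag and overall from (text, tags) pairs."""
    per_tag = defaultdict(Counter)
    overall = Counter()
    for text, tags in tagged_posts:
        words = text.lower().split()
        overall.update(words)
        for tag in tags:
            per_tag[tag].update(words)
    return per_tag, overall


def contributions(text, tag, per_tag, overall):
    """Return {word: score} for the known words of `text`.

    The score is log(P(word | tag) / P(word)) with add-one smoothing.
    Words never seen during training are skipped.
    """
    vocab_size = len(overall)
    tag_total = sum(per_tag[tag].values())
    all_total = sum(overall.values())
    scores = {}
    for word in set(text.lower().split()):
        if word not in overall:
            continue
        p_tag = (per_tag[tag][word] + 1) / (tag_total + vocab_size)
        p_all = (overall[word] + 1) / (all_total + vocab_size)
        scores[word] = math.log(p_tag / p_all)
    return scores


def suggest_tags(text, per_tag, overall, top_n=3):
    """Rank known tags by the sum of their word contributions."""
    totals = {
        tag: sum(contributions(text, tag, per_tag, overall).values())
        for tag in list(per_tag)
    }
    return sorted(totals, key=totals.get, reverse=True)[:top_n]


# Hypothetical training data, invented for this example.
posts = [
    ("the election results are in", ["politics"]),
    ("senate votes on election reform", ["politics"]),
    ("new phone released with better camera", ["technology"]),
    ("search engine adds new camera search", ["technology"]),
]
per_tag, overall = train(posts)
suggested = suggest_tags("election reform debate", per_tag, overall)
```

Calling `contributions("election reform debate", "politics", per_tag, overall)` returns a score per known word, which is the kind of breakdown one could show when a suggested tag is clicked.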

We are very excited about this tool and we will continue to improve it in the near future so stay tuned!

The perfect search engine

by dbasch — 3 Comments »

A few months ago Steve Newcomb from Powerset asked a question on LinkedIn: if you could build the perfect search engine for you, what would it do?

I was reading all the answers again and started wondering about what a “perfect” search engine would do in the true sense of the word. To me, a perfect search engine would not be a web-based application into which I have to type a question in order to get an answer. Rather, it would be more of an extension of my brain.

The reason I’m using a search engine in the first place is that there is some information that I currently don’t know or can’t remember. Many times I’m trying to remember some address, how to do something, etc. and my brain just isn’t finding the information. I don’t even know if I ever knew it! A perfect search engine would be like a second brain: it would be a telepathic mechanism to which I can resort in these cases. I wouldn’t even have to phrase a question in my mind; it would work by association just like my brain does.

I wouldn’t want this mechanism to kick in automatically though. I very much want to keep a clear distinction between my own memories (or personal information cache, if you prefer) and the collective knowledge of the global network. Perhaps an interesting way would be to have an audible voice tell me something like: “it seems like you don’t have what you are looking for, let me get it for you”. On second thought, that sounds too much like Microsoft’s Clippy. Maybe it would be better if I could consciously activate the mechanism by thinking something like “Searcher, I’m at a loss here. Please find me some relevant information”.

In a perfect world, this information would be one hundred percent worthy of trust. I would have no reason to doubt what the search engine tells me any more than I distrust myself when I walk back to where I think I parked my car this morning. Furthermore, since we are going for perfection, sometimes the search engine would be able to create content to suit my needs. For example, if I want a picture of a monkey in a scuba diving suit talking on a cellphone while skateboarding on the moon then the search engine would synthesize it for me.

Maybe what I described is not the perfect search engine. Maybe it will never happen, for better or worse. The point I’m trying to make is that the paradigm of search engines is still very primitive compared to what our imagination allows. The above question makes me imagine a discussion about perfect transportation taking place in the sixteenth century. A group of European craftsmen would be exchanging ideas about giant carriages pulled by hundreds of horses on excellent cobblestone paths, or extremely efficient ships with all the amenities of a palace, impervious to the fiercest storms and powered by enormous sails made out of the finest silk (or something along those lines). Some adventurous minds such as Leonardo could think of flying machines but they would be ahead of their time.

A Threat to Web Search?

by jorge — No Comments »

In this article, John C. Dvorak talks about search neutrality. He points out that the task of indexing the web is so daunting that only a few very large companies (Google, Yahoo, MSN) can afford to do it. So far, search results are neutral. But what about tomorrow, he asks; when does corruption sneak into the equation? Elections could be influenced by little tweaks to the search results. And it wouldn’t even be illegal!

Mainstream news media immediately comes to mind. They skew their news coverage. They are owned by powerful media moguls with political agendas. Most of the population watches only a few large networks, so they are effectively in control of public opinion. And it is all legal.

Yet even while these networks do have political agendas, they are spread throughout the spectrum. It is unlikely that they will merge into one all-controlling information monster. If you don’t trust Fox, you can watch CNN, or the BBC. Or all of them, and form your own opinion. Better yet, you can add to the mix smaller networks, local online newspapers, blogs and discussion forums. With online news aggregators, you no longer need to spend all morning reading all major newspapers.

I believe the same applies to the search world. It’s unlikely that Google, Yahoo, MSN, Altavista, AskJeeves and AOL Search will all merge into one giant entity that decides what parts of the web you are allowed to see. If you don’t trust one, just go search somewhere else, or use a meta-searcher like Dogpile, Vivisimo, Kartoo, Mamma or SurfWax. Better yet, add specialty and local search engines to the mix, along with usenet, del.icio.us, technorati, digg, reddit, and any number of information aggregators that are and most likely will remain out of reach for wannabe election riggers.

In both cases, the myriad of online information sources, be it news or search results (the distinction blurs), keeps the big players in check. If a blogger can topple a mainstream news anchor who didn’t bother to check his sources, small independent topical search engines can be trusted to keep the big ones honest.

Wikisearch?

by dbasch — 7 Comments »

Jimmy Wales from Wikipedia just announced his intention to create a search engine combining Nutch and human editors. Color me a skeptic, but although Wikipedia is great this has a neon Vaporware sign painted all over it. Here are some reasons why I’m not sure it’s a good idea:

1) It’s been done before. Yahoo began as a human-powered directory of websites and had to start including results from other engines (first Altavista, then Inktomi, then Google and finally their own) to answer the queries that could not be satisfied by their directory. Their paid editors could only manage about one million pages.

2) It’s being done right now, wiki style. There are companies such as Wink that already combine user-powered search with machine-produced results (full disclosure: we work for them). Wink does not use an open source crawler and ranking algorithm like Nutch, as Wales stated he will. Why not? See the next point.

3) Although Nutch is a great effort led by a group of extremely smart guys, it has one problem that has not been addressed so far. This is how to deal with “adversarial information retrieval” (i.e. spam). Google and Yahoo have their own “secret sauce” algorithms that change all the time in order to prevent spammers from taking over the first page of results for coveted queries. How could you do this if your algorithms were public? Spammers could simply tune their pages and link farms in a way that would always beat relevant content that doesn’t resort to the same tricks. The amount of spam on the web is on the order of billions of pages, and spammers are competing for traffic which instantly turns into profit. It is simply impossible to weed out spam without closely guarded algorithms that take years to develop.

4) Scalability. It’s not hard to crawl the web once and create a search engine that receives very little traffic on an index containing a billion pages. However, it is incredibly expensive to keep this crawl updated AND deal with thousands of queries per second. Wikipedia is tiny compared to the web at large; in fact, it’s three to four orders of magnitude smaller. Making the transition from millions to billions of pages is not child’s play, and it would require hiring dozens of brilliant engineers to design a system that could compete with Google not just on relevance but also on freshness, reliability and maintainability. I’m not saying that it is necessarily too late for a new player to enter the game, but why make an announcement years before having anything workable? I assume Wales and his team are smart people and must have given these issues some thought. However, if they haven’t built a prototype yet, they cannot know all the intricacies and subtle points of building a real web search engine. One thing I can say for sure after years of working for major search engines is that running one is much harder than running Wikipedia from a purely technical standpoint.