In pursuit of textual glory

Category: search

Search 3.0

“How could the world beat a path to your door when the path was uncharted, uncatalogued, and could be discovered only serendipitously?” — Paul Gilster, Digital Literacy

Nobody can deny the importance of search in improving the usability of the internet. Apart from actually hunting down information, search forms an integral part of our exploration of the internet; it is the gateway. Millions of us use search engines to “enter” the internet: we type in a few terms, sit back, and click away at the flat list of links served to us. In this column I rant about search and the internet in general.

We have all used Google, Yahoo or one of the other popular search sites. For many people, using search engines has become routine. Not bad for a technology that’s not even 20 years old. But how did search engines come into being? What are the origins of this entity that prowls the outer reaches of cyberspace? The history of internet search is very interesting.

In 1998 the last of the current search superpowers, and the most powerful to date, Google, was launched. It ranked pages using an important concept: value implied by inbound links. This makes the web somewhat democratic, as each outgoing link is a vote. Google has become so popular that major portals such as AOL and Yahoo have adopted its technology, allowing it to own the lion’s share of web searches. MSN Search was also launched in 1998, as were the Open Directory and Direct Hit.
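The “every link is a vote” idea can be sketched in a few lines of code. This is a toy illustration of link-based ranking, not Google’s actual algorithm; the three-page web and the damping value are invented for the example.

```python
# Toy sketch of link-based ranking: each page's score is shared among
# the pages it links to, iterated until the scores settle. This is an
# illustration of the "links as votes" idea, not Google's real system.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if outgoing:  # distribute this page's rank among its links
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

web = {
    "a": ["b", "c"],   # page "a" votes for "b" and "c"
    "b": ["c"],
    "c": ["a"],
}
ranks = pagerank(web)
# "c" collects the most votes, so it ends up ranked highest
```

Democracy in miniature: a page becomes important when important pages vote for it, which is why the portals were happy to outsource their searches.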

It is a study in the quest for greater granularity in finding information, more and more of which was becoming virtual as we “go digital” in all walks of life.


Though search is getting better day by day, nothing epitomises frustration like failing to find the information that you want. Remember, “search” and “find” are two different concepts. I quote from this register.co.uk article:

Ten years ago the internet – one computer network amongst many – was brought to our attention with the promise that it would give us unlimited access to “all the world’s information.” The phrase still pops up when people refer to the internet in the public prints. We’re awash with information, but more hasn’t proved to be better. What we have is a typical tragedy of the commons, a space that more closely resembles a toxic wasteland. The promise hasn’t been fulfilled. Canonical databases and archives cost money; copyright is a fact of life, and clever licensing workarounds don’t address the underlying economic issues. Information costs money and most rights holders like to be paid. Lazy governments have cynically taken advantage of this. Technologists only see more technology as the answer, and they’ve sold the idea to politicians. In the United Kingdom, the administration has presided over the slow strangulation of the public library service, and now simply points parents and schools to the internet. Buy a PC and broadband, and you’ll have everything you want: and if the garbage flies at you at 500 times the speed it did on dial-up, then you’re experiencing the thrill of truly living in the “information age”!

All the world’s info:

Search engines try to give you the exact information that you need. In the process they have to build indexes that are scalable and ever relevant, not to mention all-encompassing with respect to “information”. Hence the stated purpose of companies such as Google is “to organise all the world’s information”. A lofty goal indeed. In organising all the world’s information, it is human knowledge that has to be indexed first. Let us take a look at this concept and its inherent drawbacks.

IT is evident to any one who takes a survey of the objects of human knowledge, that they are either ideas actually imprinted on the senses; or else such as are perceived by attending to the passions and operations of the mind; or lastly, ideas formed by help of memory and imagination – either compounding, dividing, or barely representing those originally perceived in the aforesaid ways. – By sight I have the ideas of light and colours, with their several degrees and variations. By touch I perceive hard and soft, heat and cold, motion and resistance, and of all these more and less either as to quantity or degree. Smelling furnishes me with odours; the palate with tastes; and hearing conveys sounds to the mind in all their variety of tone and composition. – And as several of these are observed to accompany each other, they come to be marked by one name, and so to be reputed as one THING. Thus, for example, a certain colour, taste, smell, figure and consistence having been observed to go together, are accounted one distinct thing, signified by the name apple; other collections of ideas constitute a stone, a tree, a book, and the like sensible things – which as they are pleasing or disagreeable excite the passions of love, hatred, joy, grief, and so forth. – Of the Principles of Human Knowledge (1710), Bishop George Berkeley

It is plainly evident that a huge portion of the human knowledge existing today is of a metaphysical nature and not indexable. Even where it is, I wonder how much of it is actually written down so that it can be indexed. That is problem number one. Not all the world’s information is text based. It lies in sounds, sights, and experiences. Granted, today we have search engines that index sound as audio files and visual elements as video. But that is not exactly information, is it, unless all such audio and video were “how-tos”!

But despite these drawbacks the web is so much more usable these days as a result of companies like Google and Yahoo. Still, much is wanting when on the quest for a particular piece of information. Below are some of the specific problems that I came across while using search engines.

Too much info:

The web started off as a network of interconnected computers. Early search was based on indexing everything in sight, a slow, painstaking process, until Google changed all that. Unfortunately, though more relevant results are now returned for each query, the problem remains that the user has to come up with the perfect search query to find the right information! It is still up to the user to craft queries that will yield the best results.

The perfect page is out there somewhere. It’s the page that has exactly the information you’re looking for and to you it’s beautiful and unattainable like a faraway star. If only you had a super-sized net for capturing it!

Most people use a search engine by simply typing a few words into the query box and then scrolling through whatever comes up. Sometimes their choice of words ends up narrowing the search unduly, causing them to miss what they’re looking for. More often the end result is a haystack of off-target web pages that must be combed through. How often have you started a search on, say, “how a radio works” and had to wade through blogs of individuals who made a DIY radio, or sites that sold content or existed purely for advertising? The biggest problem people have with search engines (perhaps) is that they’re so good! You can type in a word and within a fraction of a second you’ll have 20,000 pages to look at. Most of those pages will not be exactly what you’re after, and you have to spend a load of time wading through the 19,993 that aren’t quite right.

I believe that the solution lies in categorisation of search results. Yes, categorisation! Rather than being faced with a flat list of links that Google thinks are relevant, what is needed is a set of categories into which the results of that particular query fit. For example, if I typed in the query “use radio”, the results might be presented as “buy radio”, “radio usage”, “radio DIY”, etc., so that at a glance I can weed out the results that are not to my taste and then delve deeper into the category with the more relevant content. That way I can discard a lot of sites that match the keyword but have uninteresting content, and it would be up to the system to categorise the results rather than up to me to slice, dice and refine my keywords to home in on the information that I want. A more visual approach that would relieve me of thinking.

Some intriguing technologies are getting better at bringing order to all that chaos, and could revolutionize how people mine the Internet for information. Software now exists that analyzes search results and automatically sorts them into categories that, at a glance, present far more information than the typical textual list. A similar process powers Grokker, a downloadable program that not only sorts search results into categories but also “maps” the results in a holistic way, showing each category as a colorful circle. Within each circle, subcategories appear as more circles that can be clicked on and zoomed in on.
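The bucketing idea can be sketched very simply. This is a toy keyword-matching categoriser for the hypothetical “use radio” query above; the category names, keyword rules and page titles are all invented for illustration, and real clustering engines like Grokker do something far more sophisticated.

```python
# Toy sketch: bucket search results into categories by keywords in
# their titles, instead of returning one flat list. Category rules
# and page titles are invented for illustration.

CATEGORY_RULES = {
    "buy radio": ["buy", "shop", "price", "sale"],
    "radio DIY": ["diy", "build", "kit", "homemade"],
    "radio usage": ["use", "how", "guide", "manual"],
}

def categorise(results):
    """Group result titles under the first category whose keywords match."""
    buckets = {name: [] for name in CATEGORY_RULES}
    buckets["other"] = []
    for title in results:
        words = title.lower().split()
        for name, keywords in CATEGORY_RULES.items():
            if any(k in words for k in keywords):
                buckets[name].append(title)
                break
        else:  # no rule matched this title
            buckets["other"].append(title)
    return buckets

results = [
    "Buy a shortwave radio online",
    "How to use your radio",
    "Build a DIY crystal radio kit",
]
grouped = categorise(results)
```

At a glance the DIY sites fall into one bucket that can be ignored wholesale, which is exactly the “weed out a whole category” experience argued for above.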

Informed search:

The way search engines are constructed now means that you have to have a certain amount of search savvy, if I may use the word, to get results from the system. What I mean is that anyone can type words into Google and start a search, but to refine the results further one has to have in-depth knowledge of constructing search queries. That is asking a lot of your core user group. Search engines have to be more accessible to people with less than stellar keyword-construction abilities. A more democratic search engine is the way forward, one that serves the uninformed tramp as well as the university degree holder. Search engines have a social responsibility, as more content is on the web these days and search results are equally valuable to us all, irrespective of our backgrounds.

Second generation of search: via this fantastic website

It includes a group of search services that make use of technology that organizes search results by peer ranking, or clusters results by concept, site or domain. This is in contrast to the more long-standing method of term relevancy ranking. This newer type of ranking usually works in addition to term ranking and looks at “off the page” information to determine the retrieval and order of your search results. Search engines that employ this alternative may be thought of as second generation search services. For example:

* Google ranks by the number of links from pages that are themselves ranked highly by the service
* Teoma ranks by the number of linking pages on the same subject as your search
* Vivisimo organizes results by keyword and/or concept

* The human element: concept processing. Second generation services such as Ask Jeeves and SurfWax apply different kinds of concept processing to a search statement to determine the probable intent of a search. This is often accomplished by the use of human generated indexes. With these services, the burden of coming up with precise or extensive terminology is shifted from the user to the engine. These services are therefore taking on the role of thesauri.
* The human element: “horizontal” presentation of results. Most search tools return results in one long, vertical list. In contrast to this, there is a growing group of search tools that use concept processing to return results in a horizontal organization. With these tools, you can first review concept categories retrieved by your search before examining the results within particular categories. This can make it easier to zero in on the aspects of your topic that interest you. Examples of these tools include All 4 One Metasearch, Clusty and Exalead.
* The human element: peer ranking. Search services such as Google and Teoma derive their results from the behavior and judgment of millions of Web developers.

The future:

It is interesting to note that there are certain forms of search that are still not widely adopted by search engines. I would like, for example, a way to scan the bar codes of all the books that I own so that Google can provide me with their book search results and I can easily search through my books. There is no copyright trouble, as I did pay for the books. Interesting thought, isn’t it? It would vastly increase my productivity.

The other craze on the internet now is “interesting links”. More people are turning to the web as a source of entertainment, and part of the trend is the hunt for “cool links”. Unfortunately, typing “cool links” into Google is not what I mean. A whole array of services has sprung up based on the idea. Look at Del.icio.us, Digg and the like. It would be so easy for Google, Yahoo and the like to come up with a page of interesting links gleaned from their daily, monthly and annual searches.
What is markedly absent from my favourite search engine, Google, is the human element. They are not big fans of human involvement in providing search results. But isn’t a little bias a good thing when it comes to certain specific searches, say, product recommendations? They may be able to add it to their Froogle results. Yahoo has the right mix, investing in up-and-coming web societies such as Flickr and Del.icio.us. I would love Google to incorporate tag-based search, or even to refine the concept in some way.

New tech:

“The aim of the Semantic Web efforts is to be able to find and access Web sites and Web resources not by keywords, as Google does today, but by descriptions of their contents and capabilities,” says Jerry Hobbs, a computer scientist at the University of Southern California.

Right now, this kind of search capability is impossible because Web search engines require that users guess the right keywords to find what they seek. However, several maturing technologies are considered the most likely keys to fulfilling the goals of the Semantic Web project. These technologies, already tried and tested in research labs, will help make the Semantic Web a reality:

The Web Ontology Language (OWL), built on top of RDF and XML, will help search engines discern whether two Web sites have the same content even if they are described using different terminology or metalanguage.
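The idea behind this can be sketched without any OWL at all. The snippet below is a toy stand-in: a synonym table that maps different vocabularies onto shared concepts, so two pages described with different terms are recognised as equivalent. The terms and sites are invented; real OWL ontologies express such equivalences in RDF, with far richer semantics than a lookup table.

```python
# Toy illustration of the idea behind ontologies: map differing
# vocabularies onto shared concepts, so pages described with different
# terms can be matched. Terms and sites are invented; real OWL does
# this in RDF with far richer semantics than a synonym table.

ONTOLOGY = {  # term -> canonical concept
    "automobile": "car",
    "motorcar": "car",
    "car": "car",
    "notebook": "laptop",
    "laptop": "laptop",
}

def concepts(description):
    """Translate a page's descriptive terms into canonical concepts."""
    return {ONTOLOGY.get(term, term) for term in description}

site_a = ["automobile", "dealer"]
site_b = ["motorcar", "dealer"]
same_topic = concepts(site_a) == concepts(site_b)  # True: both are about cars
```

A keyword engine sees “automobile” and “motorcar” as unrelated strings; a concept-aware one sees the same site.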

It is an exciting time to be in. The possibilities are endless.

Tag along on the Web 2.0 train

It is nearing the end of another year and I need to write one last post. I received considerable email about my last post on tagging, so in this one I shall dwell on the concept of tagging in a little more detail.
For the minions who use the internet and are unfamiliar with the concept, tags are words that are assigned to a webpage or an object of interest. They are supposed to be short, relevant and correctly spelled, and ideally a single word, for usability’s sake! The idea is that you add tags to content that interests you, so that you can search for it in future and discover content of a similar nature tagged by fellow taggers. That was easy, wasn’t it? If you are feeling all gung-ho then let me give you the bad news: it is not as easy as it sounds, for the concept is still in beta, though you won’t find that mentioned anywhere!

Other people are quite dreadful. The only possible society is oneself.

Oscar Wilde

He couldn’t be more wrong. Serendipity born of expanding tag communities is a product of this phenomenon. In days gone by, the way into the internet was by typing keywords into the search boxes of companies with colourful logos. Tagging has changed all that. If one stumbles on an interesting site then all one has to do is click the tags accompanying the post to come across a veritable sea of links with the same tag. Whether they are all really relevant to what you expected or wanted to see is a thought for another day! But you can wade through the internet and keep finding many interesting links while searching for whatever started the activity in the first place.

This was a big year for tags. Let’s take a look back at the major events in the world of tagged metadata.

Technorati introduced tags in January, the first implementation of tagging. Technorati’s tags are picked up when the blogs carrying them are crawled. This is radically different from how del.icio.us does tagging, where the tags are owned by the site.

Yahoo bought Flickr and del.icio.us, kicking off what is now called “Web 2.0” and giving rise to a veritable stew of posts on whether their search systems are going to be better than the databases maintained by the arachnids of Google. A definite case of mass arachnophobia. I personally believe that even though Google is having trouble with splogs, link farms and the lot, their concept of organising the world’s information is likely to yield better search results than Yahoo’s My Web 2.0, launched in June, and no, I don’t have a pet spider! Google has taken up the mantle too: it now allows tagging of pages, though the tags are private, and Google Base allows tagging as well. Amazon launched tags for books in November.

It seems like every man and his dog has a tag now, doesn’t it? One would be a fool to think that just because the major players have taken up tagging, the whole process would be simple now. There are more flavours of tagging than ever before. My previous argument about non-standardisation of the tagosphere wreaking havoc on the concept still holds good. 37signals has an excellent write-up on the matter. The fact that there are multiple interfaces is a bit confusing for the end user, because it introduces an unnecessary learning curve for a supposedly simple task waiting for mass-market appeal. For the sites that make sense of the input, it probably doesn’t matter in what format the tags are originally entered: once the system processes them, it makes absolutely no difference whether you entered them with a space, comma or colon. It’s not about incompatible formats, but simply different ways of entering information into systems. From the end user’s perspective it’s irrelevant once the system accepts the data and breaks the string down into individual tags.
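A normaliser that makes the delimiter question moot is a few lines of code. This is a sketch under the assumption that tags are split on whitespace, commas or colons, lower-cased and de-duplicated; any real service will have its own rules.

```python
import re

# Sketch of why delimiter differences need not matter: split a raw tag
# string on spaces, commas or colons, lower-case the pieces, and
# de-duplicate while preserving order.

def parse_tags(raw):
    """Break a raw tag string into a clean list of individual tags."""
    seen, tags = set(), []
    for tag in re.split(r"[,\s:]+", raw.strip()):
        tag = tag.lower()
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags

# Three entry styles, one result:
parse_tags("search web2.0 tagging")     # space-separated
parse_tags("search, web2.0, tagging")   # comma-separated
parse_tags("search:web2.0:tagging")     # colon-separated
# all yield ['search', 'web2.0', 'tagging']
```

Which is the point: the learning curve sits in the entry box, not in the data, so the interfaces could converge tomorrow without breaking anything.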

Relevance of the tags to the tagged content is a problem and will continue to be so as long as people have different tastes. But the problem waiting to happen is the “tag bomb”, which could be defined as spammers showering everything in sight with irrelevant tags that would show up in search results, hoping that somebody would click on them.

There is still the problem of searching all this data. On one side are the likes of Google, with dedicated search engines crawling the net and indexing content; on the other is an army of taggers tagging everything in sight. At last count del.icio.us had about 100K users. Can random users tagging data yield better results than dedicated bots? I am having visions of “The Matrix” now.

Hybernaut.com has an excellent write-up on this. I quote:

“Is the reliance on structured taxonomy an achilles heel of the user-fed Directory model? Perhaps the most likely outcome of all this will be a joint solution. If someone had the power to merge the tags collected by Technorati (or one of their peers) with the user-tagged content of Delicious, then they would be able to produce some powerful search results. And since search and syndication appear to be merging all over the place (Technorati ‘watchlists’, PubSub), someone with access to both crawled and user-fed tag databases would be able to produce superior syndication of serial microcontent like news and blog posts as well.”
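The merge the quote imagines is, mechanically at least, trivial. Here is a toy sketch of unioning a crawler-collected tag index with a user-fed one; the tags and URLs are invented, and the hard part in reality is scale and trust, not the union.

```python
# Toy sketch of the "joint solution": union a crawler-collected tag
# index with a user-fed one, so a tag search draws on both sources.
# All tags and URLs here are invented for illustration.

crawled = {
    "ajax": {"example.com/ajax-intro"},
    "rss": {"example.org/feeds"},
}
user_fed = {
    "ajax": {"example.net/saved/ajax-tips"},
    "tagging": {"example.net/tags"},
}

def merge(*indexes):
    """Union several tag -> set-of-URLs indexes into one."""
    joint = {}
    for index in indexes:
        for tag, urls in index.items():
            joint.setdefault(tag, set()).update(urls)
    return joint

joint = merge(crawled, user_fed)
# joint["ajax"] now contains pages from both sources
```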

In the meantime 43 people have tagged this site with the following tags,

“ crap, read, useless_dribble, *%@\\ ”

I remain.


Feeding frenzy

RSS is all the craze now, and we have the unified icon to boot. It is available in a variety of flavours, as choices are good: RSS and Atom are the ones on offer now.

Colin D Devroe called for a unified definition of RSS feeds while lambasting Wikipedia for providing a misleading description to novices. Having thought about it, RSS may be defined as:

“A method of informing the reader of changes to a webpage without actually having to visit the site. Weblogs and news websites are common sources for web feeds, but feeds are also used to deliver structured information ranging from weather data to “top ten” lists of hit tunes.”

If you would like to see Wikipedia’s definition then here it is:

“A web feed is a document (often XML-based) which contains content items, often summaries of stories or weblog posts with web links to longer versions. Weblogs and news websites are common sources for web feeds, but feeds are also used to deliver structured information ranging from weather data to “top ten” lists of hit tunes. While RSS feed is by far the most common term, the generic “web feed” terminology is sometimes used by writers hoping to make the concept clear to novice users, and by advocates of other feed formats.” — Wikipedia.org

What I don’t like about this definition is this bit: “A web feed is a document (often XML-based) which contains content items, often summaries of stories or weblog posts with web links to longer versions”. It looks like I am not the only one, and rightly so, for it is a matter of great consternation to me that a good few bloggers use this technology to their own selfish ends. If Really Simple Syndication was all about making it easy for faithful readers to access content without wasting time visiting the website, then why would you want to introduce a further step into the process by providing just a teaser? It’s non(ad)sense.

The argument, presumably, is that the blogger values the readers’ time so much that he doesn’t want them to waste it reading content that would not interest them; hence they needn’t follow the teaser to the site. Well, they would not have subscribed to the feed if it offered only the odd interesting link, duh! At least I don’t, anyway.

If you want a faithful following then full-text feeds are a must, period. We need more people doing this so that teaser feeds stop. Another thing I like about RSS is that it goads bloggers into coming up with relevant, good material, and if that’s not the case then all it takes is one click on the unsubscribe button and poof! Democracy at the touch of a button, via RSS.
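The difference between a full-text item and a teaser is plain when you look at what actually travels in the feed. Below is a hand-written, minimal RSS 2.0 fragment (the blog and its posts are invented) parsed with Python’s standard library, with a deliberately crude length heuristic standing in for a reader’s judgment.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written RSS 2.0 fragment used to sketch how a feed
# reader sees a full-text item versus a "teaser". The feed content is
# invented for illustration.

FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Full post</title>
      <description>The entire article text travels in the feed, so the
      reader never needs to visit the site to finish reading it.</description>
    </item>
    <item>
      <title>Teaser post</title>
      <description>Read more on our site...</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)
for item in root.iter("item"):
    title = item.findtext("title")
    body = item.findtext("description") or ""
    # Crude teaser heuristic: a very short body forces a click-through
    kind = "teaser" if len(body) < 80 else "full text"
    print(f"{title}: {kind}")
```

The first item is syndication as promised; the second is an advertisement for a page view.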


Gollum Browser

While randomly browsing Wikipedia entries I came across this rather interesting take on a browser, called Gollum!
gollum beta

It is being developed by Harald Hanek, initially for his daughter and now under the GPL for us all. In his own words he describes Gollum thus: “Gollum is a Wikipedia browser for fast and eye-friendly browsing through the free encyclopedia Wikipedia.
Gollum gives you access to nearly all Wikipedias in all languages. Furthermore, Gollum gives you some special features which allow you to easily customize your work with Wikipedia.

In my opinion the interface of Wikipedia is too overloaded and confusing. So let’s get an easy to use interface. Gollum, the intuitive way to the powerful knowledge of Wikipedia.”

gollum navigation

Gollum is based on PHP and JavaScript, using XMLHttpRequest for communication, better known as Ajax. That means there is no need for a database, and the code is ready for PHP5. The client only needs a browser such as Firefox, MS Internet Explorer, Netscape or Safari with JavaScript enabled. Safari has yet to be tested according to the website, but it works perfectly fine for me.

As you can see the navigation is nice and easy and the content is displayed in a very readable format. It loads pretty fast too, and has good localizations.

It is soon to be available as a beta download.
