tadimens.htm: How to search the web, by fravia+: tadimens

~ Searching problems ~

[A deep and uncharted web] [The web is a quicksand]
[Other searching techniques] [Psychological and visual aspects] [How to proceed]

A deep and uncharted web

The web is uncharted and deep. The volume of easily located information instantly accessible to a user is so massive as to be incomprehensible. At the time I started writing this snippet of mine (February 2000) there were supposedly well over 1,500 million indexable pages, a number expanding exponentially.

In December 1997 the web had roughly 320 million pages. In February 1999, a series of studies concluded that the web size was about 800 million web pages, with about 15 trillion bytes of textual information (one byte equates roughly to one text character), and 180 million images: about 3 trillion bytes of data. Now (September 2000) we have already "passed" the 2,000 million pages and the 600 million images marks.

NOTA BENE: I'm counting only the "publicly available" information. All information behind firewalls, on local intranets, and all password-protected information, which is available only (ahem, in theory :-) by filling out search forms, is NOT even included. Since one of the aims of this site of mine is to allow you to access this kind of information as well, should you feel you need to, the dimensions of what will be available to you are indeed more than staggering.
Believe me: the volume of information instantly accessible to you IS so massive as to be (almost) incomprehensible.

Of course the above-mentioned data are just "the beginning": the web continues to grow at an incredible pace. It doubles in size in less than one year, with whole countries getting more and more wired, some of them showing growth percentages well over 100%.
There are many scientific studies available on this development: to give you a rough idea, let's say that at the beginning of September 2000 the web consists of - give or take a few million here and there - roughly:

2,200,000,000 pages;
37,000,000,000,000 bytes of text;
600,000,000 images;
10,000,000,000,000 bytes of image data.

Every day more than 3 million new pages are added to the web, with more than 60,000,000,000 new bytes of text and well over 1,000,000 new images (with 16,000,000,000 new bytes of image data). These figures are GROWING.
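To get a feeling for the growth rate, here is a minimal Python sketch that derives a doubling time from the three page counts quoted above (320 million in December 1997, 800 million in February 1999, 2,200 million in September 2000). It assumes steady exponential growth between two snapshots and counts the intervals in whole months; both are simplifications, and the function names are made up for this illustration.

```python
import math

# (label, months since December 1997, pages in millions).
# The page counts are the figures quoted in the text; counting the
# intervals in whole months is an assumption of this sketch.
SNAPSHOTS = [
    ("December 1997", 0, 320),
    ("February 1999", 14, 800),
    ("September 2000", 33, 2200),
]


def doubling_time_months(months_elapsed, start_size, end_size):
    """Months needed to double in size, assuming steady exponential growth."""
    return months_elapsed * math.log(2) / math.log(end_size / start_size)


def doubling_times(snapshots):
    """Doubling time for each pair of consecutive snapshots."""
    results = []
    for (label0, month0, size0), (label1, month1, size1) in zip(snapshots, snapshots[1:]):
        months = doubling_time_months(month1 - month0, size0, size1)
        results.append((label0, label1, months))
    return results


if __name__ == "__main__":
    for start, end, months in doubling_times(SNAPSHOTS):
        print(f"{start} -> {end}: doubles roughly every {months:.1f} months")
```

With these rough figures the doubling time comes out somewhere between ten and thirteen months, i.e. in the neighbourhood of one year. Different source figures will of course give different results.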

There are many sources of information on the deep web, and each of them deserves to be searched using various techniques. Therefore, the real first task is figuring out where to look. You should also consider the fact that most information CANNOT be found using the 'classical' search engines. The "largest" search engines cover (at best) only a tiny part of the web. Moreover they DO NOT index the most interesting parts of the web: they index commercial over educational sites, US sites over European sites and 'popular' sites (read: sites loved by the zombies) over relatively unknown sites.
Remember also that each 'main' search engine has different strengths and weaknesses, and that it would be nonsense to always use the same search engine (say altavista) to search for any target.
The information mass is overwhelming. To avoid drowning in the vast sea, anyone navigating the web needs to know how to find (and use) all available tools and techniques.

You'll have immediate rewards:

Knowledge is power. This is particularly true - as you'll soon notice - when searching the web. You will even be able to turn the knowledge you have acquired into money, if you so wish, losing however the capacity to acquire further knowledge in this process.
The value of critical information is increasing, and such information is now - for the first time in the history of our race - available to everybody (even if the volume of commercial rubbish you'll have to wade through is increasing exponentially).
You will also soon realize that finding and using some online resources can be quite fun and entertaining. Keeping 'on track' during your searches will be at times quite difficult indeed. So be it: there is much more to life than just knowledge and work :-)

The web is a quicksand

The web is uncharted and deep. Moreover the web is a quicksand: web pages are changed, removed or shifted continuously. Changes may be minor, major, or total. According to various projects striving to create archive snapshots of major portions of the web, the average lifespan of a webpage is between one and two months, which means that in the last day alone about 40,000,000 pages and 10,000,000 images changed.
Now you have the dimension of the problem. No algorithm, no computer-processing-power, no "battery" of ultra powerful supercomputers is capable of coping with this tide of ever-shifting, exponentially rising data.
Effective searching requires new methods. This site will try to explain some of them to you.
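The daily churn quoted above is simple arithmetic: divide the number of items by their average lifespan in days. The following Python sketch does exactly that, using the September 2000 totals given earlier (2,200 million pages, 600 million images). The lifespans of 30 and 60 days are assumed stand-ins for "between one and two months", and the function name is invented for this example.

```python
def daily_changes(total_items, average_lifespan_days):
    """Items changing per day if each one survives average_lifespan_days on average."""
    return total_items / average_lifespan_days


# Totals quoted in the text for September 2000.
PAGES = 2_200_000_000
IMAGES = 600_000_000

if __name__ == "__main__":
    # 30 and 60 days are assumptions standing in for "one to two months".
    for lifespan_days in (30, 60):
        pages = daily_changes(PAGES, lifespan_days)
        images = daily_changes(IMAGES, lifespan_days)
        print(f"lifespan {lifespan_days} days: "
              f"about {pages:,.0f} pages and {images:,.0f} images change per day")
```

The figure of 40,000,000 pages a day corresponds to an average lifespan of about 55 days, towards the cautious end of that range. A shorter lifespan makes the quicksand even worse.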

The major search engines have a great deal of trouble indexing any significant portion of the web. The engine with the largest coverage (Fast: supposedly a billion pages) only indexes 25% of the entire web, and some of the other 'main' search engines only cover a meagre 5% of it. It is a tremendous problem to index such a vast number of pages. AltaVista, for instance, indexed 100 million pages in October 1997, only 140 million in August 1999, only 250 million in January 2000 and only 350 million now: they obviously cannot keep up with the growth of the web.

The search engines re-index slowly, and have a hard time even keeping their own databases "tidy", i.e. eliminating all those annoying 404s for the pages that went missing between two index-runs. Very seldom do they nowadays "follow the links" into uncharted parts of the web.
Moreover, the rate of updating can be extremely slow. A good exercise is to find an unlisted site and add it yourself to the main search engines. Go ahead and do it: Google and Northernlight will index a relatively significant part of it in a couple of weeks, Altavista will index (almost) only the pages you have manually entered (without following any links), while Fast/Alltheweb won't update its indexes for many MONTHS. So you have an additional problem with the big search engines: stale sites abound, new sites won't be listed.

Let's see:

Problem number ONE: More than 2,000 million web pages by the time you read this.
Please remember that all these numbers are underestimations: nobody really knows how many sites there are out there.
Problem number TWO: search engines that do not cope.
The "richest" search engine at the moment covers far less than a third of the existing total. Search engines will boast that they 'selected' a smaller amount of pages, and in reality have visited many more (a dual number approach now in vogue to cover the search engines' shortcomings :-). Fact is that their coverage is - even in the best and most optimistic hypothesis - meager.

Taking into account the fact that the Internet is a "quicksand", where sites disappear, move or are modified continuously, it is easy to understand that even the most powerful search engines are doomed: even keeping their already immense databases up to date is becoming a more and more daunting task (hence the many 404 errors when you make a search).
Conclusion:
Other techniques must be used to search the web.

Other techniques must be used to search the web

The web is uncharted and deep. Other techniques must be used to search it.
Of course you will first have to learn how to EFFECTIVELY use the existing search engines. You'll discover that there are quite a lot of (interesting) differences among the main ones. You'll also have to understand which specific algorithms have been implemented by their programmers. Some algorithms, as you will see, involve tracking how many sites link to a page and then increasing the 'weight' of that page when delivering result lists (google & infoseek); quite another kind of algorithm weights pages depending on the number of people clicking on a specific link inside the resulting lists (hotbot).
You'll also discover that the real purpose of all these "free" search engines is - quite simply - to gather data about you. Then you will have to master some other useful search techniques, like combing and klebing.

Fravia's "three steps" in seeking information: searching, combing, klebing.

you search for the info yourself
you search for people that have already searched for that info (and are willingly giving what they have gathered)
you lure people that have already searched for that info (and are hoarding their knowledge)
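Going back for a moment to the two kinds of ranking algorithms mentioned above (weighting by inbound links, weighting by clicks on result lists), here is a deliberately naive Python sketch of both ideas. It is NOT the actual algorithm of google, infoseek or hotbot, whose details are not given here: it only counts links and clicks and sorts accordingly. All page names and function names are hypothetical.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count how many distinct pages link to each target.

    links is an iterable of (source_page, target_page) pairs.
    Self-links and repeated links from the same source are ignored.
    """
    seen = set()
    counts = Counter()
    for source, target in links:
        if source != target and (source, target) not in seen:
            seen.add((source, target))
            counts[target] += 1
    return counts


def rank_by_links(matching_pages, links):
    """Order the matching pages by number of inbound links, most linked first."""
    counts = inbound_link_counts(links)
    return sorted(matching_pages, key=lambda page: (-counts[page], page))


def rank_by_clicks(matching_pages, click_log):
    """Order the matching pages by how often searchers clicked them, most clicked first."""
    clicks = Counter(click_log)
    return sorted(matching_pages, key=lambda page: (-clicks[page], page))


if __name__ == "__main__":
    # Hypothetical toy data: three pages that all match some query.
    pages = ["a.htm", "b.htm", "c.htm"]
    links = [("a.htm", "c.htm"), ("b.htm", "c.htm"), ("a.htm", "b.htm")]
    click_log = ["a.htm", "a.htm", "b.htm"]
    print("by links: ", rank_by_links(pages, links))
    print("by clicks:", rank_by_clicks(pages, click_log))
```

Note how the two approaches can order the very same pages differently: the heavily linked page is not necessarily the one people click on. Knowing which kind of weighting an engine applies helps you understand (and exploit) its result lists.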

There is a lot more, of course: you will learn that there are many "webs inside the web", and that some of them are incredibly valuable; you will realize that the 'form' of the web is a very curious moebius knot, and that some important loops are almost unlinked (yet at times very valuable), whereas other loops of the huge knot are heavily linked (yet mostly worthless).
Then you will have to learn how to build and program your own bots (and how to reverse the bots and the algorithms used by others' spiders :-). Then you will have to understand some half-forgotten old searching lore. Finally you'll have to develop what I would call a "zen" feeling when searching.
There is a lot to learn.
The web is uncharted and deep. You are embarking here on a very long voyage: at the end you'll be what I call, for lack of a better definition, a good seeker, and you will probably be able to find anything you are looking for on the web.

Psychological and visual aspects

The web is uncharted and deep. The number of parameters you should take into account is quite vast. As funny as it may seem to non-experts, there is a quite relevant 'psychological' aspect that should always be considered. In fact 'what you are' heavily influences your chances of success when you search. In simpler words: zombies won't find nutting, no matter how perfect the search engines they (could) use will be. In fact the 'mode' of a person's affective behaviour determines his own perseverance and attention to detail, and therefore heavily influences his search strategies and his satisfaction and stress levels, which as you'll see below are far from being irrelevant when searching.

Note also that, contrary to common wisdom, the information on the web is NOT really disorganized (if it really were, we wouldn't be able to find it, duh): it is organized along patterns that you will slowly discern (search engines being only the most commonly known of them). Some patterns are format-specific (pdf files, for instance), others are related to the paths you are following in order to find the information (usenet for instance).

Also note that "a visual presentation" of your search results can be utterly fascinating (even if scientifically debatable).
See: some people are 'field-dependent' and need a lot of context or they get lost. Others are very visual, and prefer spatial organization of information. If - in order to understand what I mean - you want to see a "visualised" presentation of the results of a query, you may like to visit Cartia (http://www.cartia.com/), and newsmaps (http://www.newsmaps.com/), which offer examples of clustered visual displays of retrieved information, like the [following snapshot of "world news"] (taken on 8 March 2000).

Note that the visualization of search results can be understood in two different ways. One is simply to use graphical elements carefully in displaying results of searches, the other is attempting to display the search results graphically, in two or three dimensions, grouped by topic or category. Both approaches are meant to take advantage of the human capacity to process visual information quickly and efficiently. Good visualization should integrate the natural and technical world, and use natural intuition, spatial cues and perception.

It is probably also useful, going back to the 'psychological' aspects of searching, to discern among different groups. A searcher may search better (or worse) depending on his experience in the particular topical area, on his age and cognitive style, on his technical aptitude and on his specific personality type. Don't underestimate the importance of the emotional (and motivational) involvement of the searcher: when seeking there is (luckily :-) a fairly DIRECT relationship between a greedy & provincial attitude and search failures, AND between an altruistic, non-petty attitude and success in the query.
Even anxiety plays a role when searching, believe it or not: stress levels diminish when searchers find results, duh. On the other hand the level of stress increases the longer people search, wading through the noise, and the more web sites they have visited without approaching the signal.
I hope you'll be able to diminish your stress levels after having read the info provided on my site :-)

Few people know how to search, and even fewer know how to find what they have searched for.

~S~ +Alistair

How to proceed

This site of mine is fairly didactic. Hence you'll find an "introductory" section, that could be of some use before entering the "basic" section, that I advise you to peruse "in depth". Then you'll find my "advanced" section, where various techniques (like combing and klebing) will be examined. The advanced section should also help those among you that wish to "specialise" in one of the many seekers' disciplines. Then there is a "classroom" section, with a series of "in fieri" labs and classes.
As a "side section" you'll be able to visit the "other stuff" section, where I have crammed all sorts of knowledge I find important for a seeker, even if not immediately seeking-related. Finally you'll have - of course - to proceed on your own, leaving my site through the "farewell" section.

I sincerely hope you'll learn enough to be able to contribute with your own essays, observations, searchstrings, tricks & hints. Building on each other's shoulders we'll reach new heights.

A word about images: you'll notice that while most pages load very quickly because there are almost NO pictures on them (the contents, not the frills, make this site the knowledge treasure it is intended to be), I have chosen some 'heavy loading' images for the main sections. You are well advised to ditch your iexplorer or netscape and switch to Opera, a fantastically quick browser (only a million bytes) that will give you the option to AVOID loading images - on the fly - whenever you decide. It is difficult to overstate how important this is on today's commercially spammed web. A whole section of my site is dedicated to the choice of the best browser, the sine qua non tool for any seeker.

There are many pages on my site... you are not compelled to follow any logical path. You may peruse everything at will, you are welcome. Do not be scared, nor paralysed, if you don't understand everything immediately: knowledge is like one of the chilled white wines bottled in the old lagoons I come from. You should sip it slowly and knowingly, else it won't do you no good.

And now, after having read all the above, if you decide to start, go back to the [entrance] and click on the main logo.

~S~ fravia+