Some basic must know - 1: Terminology
Hey! You are allowed to continue without reading the following stuff only
if you are really sure that you already know these basic matters...
If the document name is missing from the URL, the web server will
automatically serve any document named "default"
or "index" found in that directory.
A URL (Uniform Resource Locator) is the mechanism used for Web addresses.
URLs are used in Web browsers to find the location of a particular Web page.
A URL consists of three main parts - the protocol, the host name, and
the directory location. Take as an example http://www.searchlores.org/milan/milano.htm.
The protocol comes first, followed by a colon and two slashes. It is usually "http"
(Hypertext Transfer Protocol) or "ftp" (File Transfer Protocol); in this
case it is HTTP, which means a Web page is at that
location. The next portion is the Internet host name, www.searchlores.org.
Somewhere on the Internet (here) is a system with that name, with a corresponding
numeric IP address provided by the Internet DNS service (see below).
The last portion is a specific document inside a directory structure, in this case /milan/milano.htm. Since
a Web server will typically have many different Web pages in multiple directories,
the URL provides a way of specifying where to look.
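If you happen to have python at hand, you can dissect a URL into these three parts yourself; a minimal sketch using the standard urllib module:

```python
from urllib.parse import urlparse

# The example URL discussed in the text above.
parts = urlparse("http://www.searchlores.org/milan/milano.htm")

print(parts.scheme)  # the protocol: http
print(parts.netloc)  # the host name: www.searchlores.org
print(parts.path)    # the document location: /milan/milano.htm
```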
Internet Protocols ~ IP
The most fundamental protocol is called IP, for 'Internet Protocol' (duh).
On a network, the common term for a location is 'address', and each system
on the Internet has an address. Since you are on the web right now,
in order to read this, you are automatically a node of the web and you have
an Address as well. You can check your actual address using your
C:\Windows\Winipcfg.exe utility (if you run windoze).
This 'IP address' has many possible formats, an often overlooked fact which is
quite relevant for seekers, as you'll understand in due time.
Internally, each computer system
uses an IP address that is composed of four numbers, usually written for
humans with dots between each number. An example IP numeric address is '209.103.174.104'
(the IP address for my main site). However, since it's easier for humans
to remember names instead of numbers, most IP addresses have corresponding
names, also separated with dots. The previous address, written as a name is
'www.searchlores.org'.
Scattered throughout the Internet are dedicated systems with the responsibility of
translating Internet name addresses into the IP address numeric form. These
systems are called 'name servers'. Another term used in conjunction with Internet
name addresses is 'host name', because every Internet address must
correspond to its hosting computer system somewhere on the Internet.
The systems that provide IP name
to number translation are called 'Domain Name Servers', or DNS.
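If you want to see the DNS at work, any python prompt will do; a minimal sketch (note that www.searchlores.org may resolve differently today - or not at all - so the runnable example uses the loopback name, which maps to the same address almost everywhere):

```python
import socket

def resolve(host):
    """Ask the local resolver, and through it the DNS, for the numeric IP of a host name."""
    return socket.gethostbyname(host)

# resolve("www.searchlores.org") would perform a real DNS lookup;
# the loopback name below works even without a network connection.
print(resolve("localhost"))  # usually 127.0.0.1
```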
If you want to learn more about these things NOTHING can be better than RFCs.
The Internet Request For Comments (or RFC) documents are the written definitions
of the protocols and policies of the Internet. You can use the server specific excite form to
search for specific terms.
Some basic must know - 2: Search engines' limits and annoyances
One of the amazing characteristics (and probably advantages)
of the Web is that it is not indexed in any standard
manner. As a consequence, retrieving
information can seem difficult. Commercial search engines à la altavista,
the most popular tools for locating web pages, return a huge mass of
results: a general query risks retrieving thousands of pages, many
of them completely irrelevant, most of them quite off-topic. The problem is that
often enough you simply get too many results.
All search engines crawl the Web and log in their databases
all words from the
web pages they have gathered. Some search engines, like google, even mirror
the pages, thus allowing you to find even 'disappeared' or "lost" pages. These repositories,
incidentally, make it almost impossible to "retire" something, once put on the web.
I have conducted a personal experiment, trying to "pull" my old,
software reverse engineering oriented site (which I have 'deprecated' :-) off the web, and
found it next to impossible. In fact, once you publish something on the web, if
of some interest, "it goes forth
and multiplies". Back to the crawling search engines with terabytes of
compressed texts:
Once you start logging words from hundreds of millions of pages, the
results of an (unstructured or "simpleton") query can be overwhelming.
Without searching knowledge,
and without a clear strategy,
using a search engine is like wandering aimlessly in the dark,
without spectacles or light, in the stacks of
a poorly organized library, trying to find a particular book.
A good Seeker's supper should not smell too much.
See, the problem is not only how to find your info, but also, and foremost,
how to evaluate it. Imagine your search gives you "only" 200
- possibly valuable - results. That's way too many for effective human
evaluation purposes. I'll quickly demonstrate this:
let's say you have a good "zen" perception and it takes you just
half a minute to quickly "diagonally scan" a
page with your eyes~brain and "feel" whether it is worth keeping~reading or not... that
makes 100 minutes for that single query! More than one and a half hours smelling alien
pages! Clearly that's not a
very effective searching approach. For this very reason you should master some
techniques that will allow you to drastically reduce the
number of fishes you'll have to smell before choosing the ingredients for
your seeker's supper.
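The arithmetic above, for the sceptics, as a python sketch:

```python
# 200 results at half a minute of "diagonal scanning" each:
results = 200
minutes_per_page = 0.5
total_minutes = results * minutes_per_page
print(total_minutes)  # 100.0 minutes, i.e. more than an hour and a half
```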
Robot-driven search engines can be defined as search engines which
use
a bot (aka "spider", "worm", "wanderer" or "crawler") to
automatically collect sites for their index. They are different
from
subject directories/trees which are hierarchical and rely on
people
to add sites to their index. No search engine can actually cope
with the exponential growth of the web. The best engines
(Altavista, Northernlight,
Fast and Hotbot) actually cover less than 10% of the texts, images,
programs and sounds present on the web. As you will learn on my
site, you'll
have to use advanced searching techniques like your
own bot writing,
local digging, combing,
luring and klebing to (try to)
get at the remaining 90%. The main search engines are in fact no longer
able to follow
"uncharted" links: the task of keeping UPDATED the links
they have already gathered
is difficult enough (as the many 404s you'll encounter testify).
They now spider only
"on submission".
This said, you should by all means still use the main search engines
(which on average overlap for only 50% of their sites, therefore you may well
find on one something that another one does not show). Be aware of the algos
they use (highly variable from engine to engine, as explained in detail elsewhere on my site):
they are mostly based on common parameters. For keyword occurrence, for instance:
Order of appearance of keyword terms
Frequency of keyword
Keyword in title and metatags
Funny or rare keywords
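To fix the idea, here is a toy relevance scorer combining the parameters listed above. The weights are invented for illustration only - real engines keep theirs secret and, as noted, vary from engine to engine:

```python
def score(page_title, page_text, keyword):
    """Toy ranking sketch: order of appearance, frequency, and title presence."""
    kw = keyword.lower()
    words = page_text.lower().split()
    s = 0.0
    if kw in words:
        s += 1.0 / (words.index(kw) + 1)  # earlier appearance scores higher
    s += words.count(kw) * 0.1            # frequency of keyword
    if kw in page_title.lower():
        s += 2.0                          # keyword in title weighs heavily
    return s

print(score("Search hints", "hints on how to search the web", "hints"))
```

A page repeating the keyword in its title thus outranks one that merely mentions it in passing - which is exactly why the spamming tricks discussed next work.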
Take into account, however, the fact that
all main search engines are purposely spammed through the
many tricks used to "get on the top positions", which - once more - depend on the
specific algorithms used by each search engine (as I said, you'll learn how
to reverse them elsewhere). The most common spamming techniques are:
Overuse or repetition of keywords
Use of keywords that do not relate to the content of the site
Use of fast meta refresh
Use of coloured text on a same-colour background (a very old trick)
Duplication of pages with different URLs
Use of different pages that bridge to the same URL
This - for many "broad" queries - means
that people should actually simply JUMP the first twenty-thirty
results and start the results' evaluation directly from page 4 downwards :-)
Note also - moreover - that all main search engines are now beginning
to sell "slots" and to allow some jolly answers, based on whatever query you have made,
to figure in the list, thus pushing
some sites into the first positions. One reason more to jump to
page 3 or 4: screw commercial spammers and faked algos.
Finally, never forget that the very reason for someone SETTING UP a search
engine "for free" (sic) is
to log ALL your queries in order to sell
such data to third parties (see [seamara2.htm]). Therefore
you should by all means
be quite concerned with all the anonymity "matters" - and countermeasures - that you
will find explained [elsewhere] on my site.
Some basic must know - 3: When to use what
(directories and search engines)
You can and will learn to use effectively the various search tools on the web,
but don't forget that the most important part of searching for information
happens before you even get online.
It helps a lot to know which tool to use when,
and that all depends on what kind of question you're
asking or what type of information you're looking for.
I mean the "when" literally: the time of day can affect
results. It is often useful to avoid intensive searches when euroamericans
are [awake].
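A minimal sketch of that 'when' check. The window is a rough assumption of mine (12:00-22:00 UTC as the hours when both Europe and the Americas are largely awake), not a measured fact:

```python
import datetime

def euroamericans_awake(utc_hour):
    # Assumed busy window: 12:00-22:00 UTC. Adjust to taste.
    return 12 <= utc_hour < 22

now_utc = datetime.datetime.now(datetime.timezone.utc)
if euroamericans_awake(now_utc.hour):
    print("expect a crowded web: postpone intensive searches")
else:
    print("good time for intensive searching")
```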
Normal/Advanced inconsistencies & quirks
Remember also that most search engines have a "normal" and an "advanced" search mask,
and that the "advanced" search mask gives DIFFERENT RESULTS vis-à-vis the normal one.
If you try an altavista query for +how to search +hints on
the [normal]
form you'll get 1999 pages (September 00); if you try the same query
on the
[advanced] form
you'll get only two! Note that search engine quirks will give you eight pages on the advanced form
if you invert the querystring to
[+hints +"how to search"] -
a 400% difference for the 'advanced' form that has no counterpart on the normal form: [+hints +"how to search"] where
with this chiasmatic inversion you'll get just the same pages as before.
Directories and search engines
As almost everyone knows, there are two main (and quite different) search tools on the web: directories and search engines. Directories like
[yahoo],
[Open directory] and
[LookSmart] are CATEGORIZED lists, with brief descriptions of
the sites. Categories are based on submissions by web site owners, which are edited by
more or less capable, mostly volunteer, editors. Directories are good when you want prompts to
guide you towards your signal, or when you want
to go on a surf trip... when you need specific information quickly you'll be better off
with a search engine. See: modern search engines index ALL words on the pages they register,
and are therefore very useful if you know how to squeeze what you need out of the noise.
Let's seek this very page on fast/alltheweb... let's choose - purposely - a very RARE sequence of words, one that I wrote ten lines above... you'll immediately see
how this page jumps out of a billion sites :-) "where with this chiasmatic inversion you'll get" The same works with google as well... "where with this chiasmatic inversion you'll get"
The reason you should choose a RARE sequence of words is obvious. The web is the contrary of a library:
the rarer and more esoteric your quarry, the easier it is to find. Spelling mistakes are
an added bonus: "you will get" will give you millions of pages, "you woll get" will give you only one
page (one at the time of writing this... probably two by the time this page gets indexed anew).
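The rarity effect in miniature: count exact-phrase matches in a toy corpus standing in for a search engine's index:

```python
# Three tiny "pages"; the misspelled phrase appears in only one of them.
corpus = [
    "you will get many results",
    "you will get lost in the noise",
    "you woll get one single page",
]

def hits(phrase):
    """Exact-phrase search over the toy corpus."""
    return sum(1 for page in corpus if phrase in page)

print(hits("you will get"))  # common phrase: 2 matches
print(hits("you woll get"))  # rare misspelling: 1 match
```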
Of course directories and search engines are by no means the only options you have for
searching your quarries.
You should choose your search tools by "categorizing your question":
Broad Topic
"What's out there on combing?"
"What's out there on searching strategies?"
"Can I get info about the actual web dimensions?"
Directories + Search engines + newsgroups + personal pages
Uncommon Hunt
Specific and unusual or unique
"I need info on the algos used by a specific search engine"
"I'm looking for source code I could re-use for my searching bots"
"I heard about a brand new searching algorithm"
Search Engines + Meta-Search Engines + newsgroups + messageboards + maillists + personal pages + your own bots
"Societal searching"
Anything people "are talking about", when a quick answer is needed in that field
"New tips for searching warez"
"Has anyone made new searching bots in rebol?"
"I just downloaded a new version of Ferret and it does not seem to work"
Always remember: Before plunging into the interesting (and
often startling) world of web searching,
ask yourself "Could someone have already done this work for me?"
[Combing], i.e. searching - inter alia - people that
have already searched the very information you are seeking, can give
extremely interesting results. Messageboards, private pages and
the whole usenet world are the seas you'll troll around when combing. In
some cases the troll is meant in its REAL sense: read the
[trolling FAQ] if
you don't know that a troll is simply a posting on
Usenet or on any messageboard designed to attract predictable responses. And this can in some cases be used to gain some extra knowledge: see the ad hoc
[trolls page] if you are interested.
Focusing on specific needs
The best
way to look at Internet resources is through the focus of specific
needs. Otherwise, you can spend a lifetime drifting through archipelagos of fascinating,
but ultimately fruitless links.
A good rule of thumb is that if in less
than 15 minutes you don't at least approach what you are searching for,
you had better revise your search_strategy.
So "approaching" needs a definition.
A search session can be divided into 5 phases:
Preparation (layout of the search strategy)
Start (or "broad") phase (signal weak)
Refining (main corrections)
Approaching (small corrections: signal very strong)
Closing in (the last efforts, no more noise on the signal)
With the "Dah-daa! Bingo I found my target!" as ultimate aim, of course.
Each of these phases has specific characteristics and needs specific knowledge,
that I try to explain elsewhere on this site.
Yet there are some "global" parameters that are valuable for ALL phases, the
most important one being, possibly, the capacity to remain "on track" during
your session.
Once more, as Sielaff said long ago about footnotes and bibliography:
"Focus always on your specific
needs. Otherwise, you can spend a lifetime drifting through archipelagos of fascinating,
but ultimately fruitless links". I don't need to underline how true this is for Internet
searching as well...
Imagine yourself as a 'blade', cutting through millions of useless "pudding sites"
to find the few "rosinen" (raisins) you are looking for. DO NOT GO ASTRAY!
If you are looking for - say -
some scripts in rebol that you badly need to ameliorate your own web searching bots,
you SHOULD NOT stop in order to
read some fascinating info about IRC-bots,
even if you have never seen anything as good about this subject elsewhere,
even if these techniques could (eventually) be of some use,
even if you are genuinely interested in that stuff.
The above reasons do not matter! Do not listen to
Sirens: they are
luring you in order to crush your search vessel
onto the Scylla & Charybdis of all searchers! For crying out loud: wake up!
YOU ARE NOT searching for
IRC-bots, you are searching for rebol-scripts! Write your task in
big red letters on a yellow "post-it" and
stick it in the middle of your screen (next to the counter that is
counting down your 15 minutes maximal allowed search time :-)
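Should you prefer code to post-its, a minimal version of that countdown:

```python
import time

BUDGET = 15 * 60  # the 15-minute maximal allowed search time, in seconds
start = time.monotonic()

def time_left():
    """Seconds remaining before you must revise your search strategy."""
    return BUDGET - (time.monotonic() - start)

# ... search, refine, approach ...
if time_left() <= 0:
    print("15 minutes gone: revise your search strategy!")
```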
This is more important than most newbie seekers believe. The problem is that surfing and browsing the internet
is quite seductive: you continue to find tidbits of interest. In fact, no matter
how accurate your query was, a significant percentage of the search results you obtain
will not correspond to what you were looking for. Oh, you'll get results
fairly quickly; the real time-wasting problem is the document review phase.
This is the reason you should spend more
time on the query formulation: you'll save
MUCH MORE time during the results evaluation phase. We'll see together how you
can automate part of the evaluation task (thus sparing a huge amount of time).
Admittedly it can at times be quite interesting, from a seeker's point of view, to investigate WHY you did
get some queer results (one of the assignments inside [lab1]
requires you to search for a Latin sentence, a specific search that
usually gives some "funny" results), but in general you should imagine
yourself (and your browservessel)
as a sharp "blade" cutting through sites rather than as a lazy butterfly surfer drifting
from link to link.
This page's special AV-tip: searching a specific host
Let's imagine you are interested in some of the goodies that
the Massachusetts Institute of Technology
may have about searching... well you can
limit your search to pages at MIT only: +host:mit.edu +"search tips" for instance (or whatever other
combination you may fancy)
Try it! [+host:mit.edu +"search tips"]
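If you ever want to fire such a query from a script rather than a form, composing the query URL is trivial. The parameter name "q" and the path below are assumptions for illustration; altavista's actual form parameters may differ:

```python
from urllib.parse import urlencode

query = '+host:mit.edu +"search tips"'
# urlencode percent-escapes the characters that are not URL-safe.
print("http://www.altavista.com/cgi-bin/query?" + urlencode({"q": query}))
```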
(c) III Millennium by [fravia+], all rights
reserved and reversed