introd.htm: How to search the web, by fravia+ intro

~ Introduction ~

Version March 2001

[lore: collective knowledge or learning on a particular subject]
That's exactly what you will find on this site: various web-searching "lores"
(Note that the plural "lores" does not actually exist in the English language: I have made it up myself for reasons that are explained elsewhere)

[New Introduction] [Old introduction]
[Oddities and reversing spammers] [Good seekers are dangerous]
[How many URLs do the search engines cover?] [This page's AV-special]

[Recent conferences and workshops of mine]

(See the [hint & tips] page if you just want to start "working" quickly)

Caetera desiderantur
(actually: caetera desunt)

I opened searchlores.org in February 2000. The site seems popular: my main site alone (not counting mirrors like searchlore.org) has received a million hits per month on average. As of March 2001 some pages of this site - as you will notice - are still missing. They will be added sooner or later, though, and the site should be complete one future sunny day. Note that many of the existing pages are continuously updated, see [news.htm] for ad hoc listings.

An introduction to web-searching

"Websearching, the sublime art"
(by ~S~ fravia+ 2000)

The web is uncharted and deep. At the time of writing this snippet of mine there are supposed to exist well over 1,500 million indexable pages, expanding exponentially. This is an underestimation: nobody really knows how many there are. Even the "richest" search engine covers at the moment far less than a third of the existing total: in January 2000 Fast (alltheweb), with 300 million pages, overtook Altavista and its 250 million pages; in May 2000 Inktomi, with 500 million pages, overtook both Altavista (then at 350 million pages) and Fast. Note that search engines will boast that they 'selected' a smaller amount of pages, and in reality have visited many more (a dual number approach now in vogue to cover search engines' shortcomings :-) Fact is that their coverage is - even in the best and most optimistic hypothesis - meager.
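
To put the figures quoted above side by side, here is a rough back-of-the-envelope sketch in Python. It takes the 1,500 million estimate and the index sizes at face value; the names used are purely illustrative, and since the total is itself an underestimation, every percentage it prints should be read as an upper bound.

```python
# Rough coverage arithmetic using the figures quoted above.
# ESTIMATED_TOTAL is the (admittedly too low) 1,500 million estimate,
# so every percentage below is an upper bound on real coverage.
ESTIMATED_TOTAL = 1_500_000_000

index_sizes = {
    "Fast (January 2000)": 300_000_000,
    "Altavista (January 2000)": 250_000_000,
    "Inktomi (May 2000)": 500_000_000,
    "Altavista (May 2000)": 350_000_000,
}


def coverage(indexed, total=ESTIMATED_TOTAL):
    """Return the share of the estimated web that an index covers."""
    return indexed / total


for engine, size in index_sizes.items():
    print(f"{engine}: {coverage(size):.1%} of the estimated total")
```

Even the largest index reaches at most a third of the estimate, and the real share shrinks with every page the estimate has missed.
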
As you will see, there are different ways to search the web for nuggets among piles of commercial rubbish, "simple" methods but also other, less simple, paths. There are various possible 'strategic' approaches:

You search yourself - searching
- using the main search engines
- using newsgroups
- using messageboards
- using maillists
You search people that have already searched - luring, trolling, combing
You follow seekers to where they come from - luring, trolling, klebing
You discover, enter and use "hidden" information databases - seeking, hacking
You write and use YOUR OWN searchbots and let them search for you - programming, algo-reversing

Note also that the PREPARATION phase (topic), the EVALUATION phase and the CONSOLIDATION (grepping) phase of each query are quite important "lores" per se.

You are embarking here on a very long voyage: at the end you'll be what I would like to call, lacking a better definition, a good seeker. As a consequence you will probably be able to find anything you may ever be looking for on the web.

Be warned! This knowledge will inter alia make you quite a dangerous person. You'll realize this perusing my site, if you didn't know it already. This possibility is at the same time the very reason for the existence of this site of mine: indeed I'll try to teach you and explain to you some of the main necessary techniques and tricks used by able (and even some "master") seekers all over the web, but at the same time I'll do my best to (try to) keep you safely on what I believe should be a "knowledge path".
My hope is that once in possession of this knowledge, you will remain on our side, helping us spread knowledge for free in a quickly disappearing web of knowledge, which has unfortunately been almost stomped to death by a web infested by those commercial barbarians that you will now find everywhere, zombies and lackeys of the slave-masters who use their sharp (and dangerous) horns of pushed advertisement and money in order to loot, rape and ultimately destroy all minds seeking knowledge.

For these reasons some sections of my site are dedicated to matters that I consider quite relevant for any good searcher:

[Best browser] (your tool and "weapon") for effective searching purposes (certainly not Microsoft Explorer)
How to eliminate those awful [popup banners] and, more generally, how to nuke, fight and [annoy advertisers] (lore, tools, links, techniques)
Some simple [anonymity] lore
[Reality cracking] (and -once more- applied anti-advertisement lore)
Text exegesis (and some applied anti-rhetorical [semantic] techniques)
How to gather (some snippets of) [real information]
Some simple [site busting] lore, which will come handy when you are really annoyed by commercial spammers and/or porno advertisers and dealers.

Anyway you are a guest: you are not compelled to read or do anything I wish. It is up to you and you'll decide. I offer some knowledge for free: choose, pick, refuse whatever you will.
My hope is that some of you will help and contribute with their own work. I'm aware of the fact that there is no guarantee, though, never.

OLD INTRODUCTION

An introduction to web-searching
(Websearching, the sublime art, by fravia+ 1997)

I see it coming... in a few years (actually already now, even if most zombie establishments have not yet realized it) one of the most important jobs will be, of course, websearcher.
We'll have many specialized branches: web-searchers, web-stalkers, web-seekers and so on. Zen and 'feeling' as well as a very broad 'global' knowledge will be required.
It's a good antidote to the hyperspecialisation that has nearly brought the whole silly commercially oriented society we are compelled to live in into a well deserved dead end: only large-minded, capable searchers will be able to keep the 'larger' over-perspective, and will be able to find ANYTHING they need (for free, of course), from Vivaldi's Concerto n.7 in F for four violins and cello (it's on the Web) through the second edition of the Police Criminelle, Technique et Tactique (it's on the Web) to A Western Australian survival kit for writing English (it's on the Web).
For the first time in the history of humanity, as long as you have web access it DOES NOT MATTER ANYMORE (for knowledge purposes) if you are located in a big rich city with huge libraries, good universities and a smart cultural life or if you happen to live in the middle of nowhere in a very poor country! The dream of the lighthouse guardian is now reality!
EVERYTHING is on the Web for free! I mean: any book, any newspaper, any university paper and any image, moreover (soon) any sound, any music, any film!
This means that - amidst mountains of useless garbage - ALL ACCUMULATED KNOWLEDGE is on the Web, free for you to discover and enjoy! If you still don't believe it, just learn how to search, you are in for some surprises!

So, what is a good searcher?

A good searcher is the kind of guy that can gather in a couple of hours all the material you need to write that nasty University Paper it would have taken you at least three months to put together! (You still have to write that thing, though :-)
A good searcher is the kind of guy that -given half a dozen computers and stable internet access- can solve any librarian problem for any (and wherever located!) middle sized town! (it remains to be seen if middle sized towns are really interested in solving 'any librarian problem', though :-)
A good searcher is the kind of guy YOU will need very often and very badly during the next years - that is, unless you learn the sublime art of searching yourself! (and that's the purpose of my site: a small contribution to form the next generation of wizard searchers :-)

I sincerely hope you will be able to gain here some very handy knowledge, that I believe you will not easily find elsewhere. Anyway, I'm sure that the development of the Web (or at least of the still existing 'sound' part of it, neither commercialized nor brainwashed :-) will more and more underline the importance of these activities.
This whole endeavour is a 'living' workshop, of course, which will flourish gathering more and more additions from my readers (I sincerely hope that some "real" wizard searcher will join my efforts :-)
Hope to hear from you, and receive contributions from many searchers. Remember: we'll gain a lot only if we are able to build on the shoulders of others, letting them build on ours... if you just leech, you lose and we all lose at the same time!

Oddities and reversing spammers

As you probably already know (else what the hell are you doing here?) the various advanced techniques you may use in order to search the web amount to a difficult and ill-understood art.

Try for instance the following: as you will see the differences between the two queries seem inexplicable.
Can you tell me why ELIMINATING from our query the word "money" we actually (March 2000) get MORE hits for the same string "how to search the www"?
[+"how to search the www" -money] 120 hits
[+"how to search the www"] 118 hits
(Note that variations are possible even during a single day)
These - and other - quirks are due to the specific algorithms that the search engines use. Thus searching is still far from being a completely understood science. There is an 'art' aspect (a 'lore' aspect IMO) that plays a role, as you'll see more often than not.

The imperative of preparing a good advanced query notwithstanding, all searchers like to try a few "quick searches" to test a search engine or a query idea. Thus oddities are found. Typing in a few terms into a blank box and seeing what comes up can be great fun, since every now and then, sifting through a pile of less relevant material, you may even find some truly interesting results. More often, something appears that makes you wonder where it came from. These 'odd' results are at times worth investigating per se, since they can help you to reverse engineer the algos used by the main search engines.

Note that this kind of reverse engineering is actively performed by thousands and thousands of little commercial bastards, whose only aim is to spam each and every search engine with their pathetic little sites for profit purposes. Yet even this kind of vermin's activity can be useful for us: some of the tricks devised by commercial hooligans in order to spam the search engines can open for us, as you will see, whole horizons of new and useful techniques that we will use (and spread) in order to ELIMINATE those very spamming sites when searching for knowledge.

In fact we can - and will - use those same tricks REVERSED, in order to cut our queries deep through the spam sites and catch the little (and more and more rare) gems we are looking for. Hope you understand what I mean... I'll give an example: from the very moment you find in a page images with single pixel width/height -aka webbugs- that are pointing to the main index page of a given site (an old Architext trick) you know that you are dealing with evil spammers, and you just need to filter such crap out from your result lists with a simple specific filter... Perilli praemium adipiscunt! Eheh :-)
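
A minimal sketch of such a filter, in Python, could look like the following. It only illustrates the idea and is not a finished tool: it assumes the HTML of each result page has already been downloaded, it treats any image declared with width and height of at most one pixel as a webbug, and it does not check where the image points. The names WebBugFinder, has_webbugs and filter_results are made up for this example.

```python
from html.parser import HTMLParser


class WebBugFinder(HTMLParser):
    """Collect the src of every <img> declared as at most 1x1 pixel."""

    def __init__(self):
        super().__init__()
        self.bugs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        if self._tiny(attributes.get("width")) and self._tiny(attributes.get("height")):
            self.bugs.append(attributes.get("src", ""))

    @staticmethod
    def _tiny(value):
        # Missing or non-numeric sizes (e.g. "50%") are not treated as webbugs.
        return value is not None and value.strip().isdigit() and int(value) <= 1


def has_webbugs(html):
    """Return True if the page contains at least one single-pixel image."""
    finder = WebBugFinder()
    finder.feed(html)
    return bool(finder.bugs)


def filter_results(pages):
    """Keep only the URLs whose HTML shows no single-pixel images.

    `pages` maps a result URL to its already downloaded HTML text.
    """
    return [url for url, html in pages.items() if not has_webbugs(html)]
```

A single-pixel image is only a hint, of course: a stricter filter would also compare the link around the image with the index page of the site before throwing the result away.
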

The oddities you'll encounter are due to the fact that search engines have some defaults and basic features that are different, and thus their specific working is not always intuitive.
Often these different settings are the culprits that cause those unusual, funny or "false" results.

For example, the default for many Web engines is to OR terms together, then provide results based on relevancy. This combination produces a retrieval that has all terms present in the first few hits, and then fewer terms as you move through your hit list. This explains why the last hits do not contain all of your search terms: since the terms were ORed together, a single matching term is enough for a page to be listed. Unless you specifically ask to AND terms together, do not trust your search retrieval number to accurately portray the number of hits from your search strategy.
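
The effect of this default can be reproduced with a toy sketch in Python. It is only a model of the behaviour described above, not the algorithm of any real engine: "relevancy" is reduced to the number of query terms a document contains, and the function names are invented for the example.

```python
def or_search(query, documents):
    """OR the query terms together, then rank by how many terms each hit contains.

    `documents` maps a name to its text. Returns (name, matched_term_count)
    pairs, best match first.
    """
    terms = set(query.lower().split())
    hits = []
    for name, text in documents.items():
        words = set(text.lower().split())
        matched = len(terms & words)
        if matched:  # OR: a single matching term is enough
            hits.append((name, matched))
    # sorted() is stable, so ties keep their original order
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def and_search(query, documents):
    """Keep only the hits that contain every query term."""
    wanted = len(set(query.lower().split()))
    return [hit for hit in or_search(query, documents) if hit[1] == wanted]
```

Run on a handful of documents, or_search returns the pages with all terms first and the single-term pages last, while and_search returns only the former: the gap between the two counts is exactly why the retrieval number of an ORed query should not be trusted.
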

Another typical default is automatic truncation on each term. So if your search is for "web search" you will also retrieve documents with the terms "searching," "searcher," and even "web-spiders" in them.
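
Automatic truncation can be sketched in a few lines of Python as well. Again this is just a model under a simple assumption, namely that each query term is treated as a prefix of the words in the document; real engines may stem or truncate differently, and the function name is hypothetical.

```python
import string


def truncation_hits(query, text):
    """Return the words of `text` that a truncating engine would match.

    Each query term is treated as a prefix: "search" also matches
    "searching" and "searcher", and "web" also matches "web-spiders".
    """
    terms = [term.lower() for term in query.split()]
    hits = []
    for raw in text.split():
        # Strip surrounding punctuation but keep inner hyphens ("web-spiders")
        word = raw.strip(string.punctuation).lower()
        if word and any(word.startswith(term) for term in terms):
            hits.append(word)
    return hits
```

Note that under this model "research" is not matched by "search", since truncation only extends a term to the right.
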

Another way of explaining "false" results is by determining exactly what the search engine is searching. Usually, the default is the URL, but sometimes a search engine retrieves documents where your search terms appear anywhere in the document. An address might include your search term, but the actual document may not show your term when retrieved. Also never forget the quicksand nature of the web: you may retrieve a page that contained your term some time ago, but that has been updated in the meantime, whereby your term disappeared. Pay close attention to any Web engine documentation to clarify just how and what it searches. All "searching mysteries" can be solved if you have enough time and will to do so.

Good seekers are dangerous

My hope is that you will learn here how to find every IMAGE, every SOUND and -especially- every SCRIPT or BOOK or SOFTWARE program known to man. As you will learn and understand perusing this site, there is no way anybody can put something on the web and block you from seeing it, given the weaknesses in all security protections currently available. Of course you should respect copyrights, yet you will be quite surprised by the incredible amount of knowledge and sheer information you will be able to gather from those that do not respect them. Keep in mind that a good searcher can develop into a very dangerous fellow, if needs be, since no knowledge known to man can be hidden from him. This notwithstanding, I hope you will strive to remain on the correct path and choose to diffuse knowledge instead of hoarding it. Believe me: you'll gain more than anybody else from this approach.

To state things even more clearly: I hope you will learn here how to find ANYTHING you may fancy for free (apart from the considerable amount of your time and of your brain required to understand) as long as it is something that can be translated into the virtual world: images, books, ideas, source code, games, sounds, documents, applications, trends...
You are embarking here on a very long voyage. Good luck.

~S~ fravia+, February 2000

How many URLs do the search engines cover?

I have discovered an interesting trick to find it out, using Northernlight (one of the top search engines with Fast, Altavista, Google & Hotbot). Just perform the following query:
[http://www.northernlight.com/nlquery.fcg?cb=0&qr=search+or+not+search&orl=2%3A1]
As you can see, the query string "search or not search" gives at the moment (March 2001) for Northernlight well over 322 million URLs... and the first positions that have been found are quite interesting per se.
The same search on Altavista
[http://www.altavista.com/cgi-bin/query?sc=on&hl=on&q=search+or+not+search&kl=XX&pg=q&Translate=on&text=yes&search=Search]
gives "only" 224 million URLs, but this depends on the stop words used in Alta.
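
If you want to try the same trick with other query strings, the two URLs above can be rebuilt from their visible parameters. The following Python sketch does only that: it makes no assumption about what the engines do with parameters like cb, orl or kl, it merely copies them from the URLs quoted above and swaps in the query text. The helper name build_query_url is invented for the example.

```python
from urllib.parse import urlencode


def build_query_url(base, params):
    """Join a base address and its parameters into a query URL."""
    # urlencode turns spaces into "+" and ":" into "%3A"
    return f"{base}?{urlencode(params)}"


def northernlight_url(query):
    """Northernlight query URL, parameters copied from the example above."""
    return build_query_url(
        "http://www.northernlight.com/nlquery.fcg",
        {"cb": "0", "qr": query, "orl": "2:1"},
    )


def altavista_url(query):
    """Altavista query URL, parameters copied from the example above."""
    return build_query_url(
        "http://www.altavista.com/cgi-bin/query",
        {
            "sc": "on",
            "hl": "on",
            "q": query,
            "kl": "XX",
            "pg": "q",
            "Translate": "on",
            "text": "yes",
            "search": "Search",
        },
    )


print(northernlight_url("search or not search"))
print(altavista_url("search or not search"))
```

Called with "search or not search", the two functions print exactly the two addresses given above.
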

The real problem is WHICH URLs the search engines cover... alas the "largest" search engines cover (at best) only a tiny part of the web. Moreover they DO NOT index the most interesting parts of the web: they index commercial over educational sites, US sites over European sites and 'popular' sites (read sites loved by the zombies) over relatively unknown sites. Moreover in their commercial 'race' to the 'we have indexed one billion pages' tag, the biggest search engines have recently begun to bloat the indexes, including relatively 'useless' pages (say a collection of 2 million pages with the images of 2 million different galaxies that should have been more correctly considered a single database).

This said (even if you'll have to learn other searching techniques, as you will see) the main search engines are far from being useless! Note for instance how, inputting a long exact phrase, you'll immediately find a specific page, "cutting" it out of the "pudding" mass. For instance this very page!
Try Alta's search for
"The web is uncharted and deep. At the time of writing this snippet of mine there are supposed to exist"
Try it also on Google, where you can use the "I'm feeling lucky" option as well...
"The web is uncharted and deep. At the time of writing this snippet of mine there are supposed to exist"
and see the results for yourself.

Yet the Web is a quicksand, and the algos are continuously a-changing, so that our techniques must evolve as well. As you'll realize perusing my site, there is a lot to learn and re-learn in this field. Collective work and the contributions of my readers are the sine qua non to keep abreast. Our ultimate aim, as always, is "simply" to spread for free the light of our knowledge to anyone that cares.

This page's special AV-tip

related pages searching on AV

You can find related pages searching in this way:
like:http://classics.mit.edu/Search/index.html for instance (or whatever other page you may fancy as "model")
Try it! [like:http://classics.mit.edu/Search/index.html]

A couple of recent workshops of mine
may give you a more exact idea of what this site is about

[The art of information searching for the open culture era] (Milan, Italy, October 2000)

I held this conference/workshop in Milan, at the SMAU, celebrating the local "Linuxday", on 20 October 2000. My [friend] [Richard Stallman] spoke in the morning (about GNU/Linux), I spoke in the afternoon presenting six short lesson/slides about simple effective searching techniques and 'the structure of the web'.

[The art of information searching on today's Internet] (Paris, France, February 2001)

Conference held on 6 February 2001 at the Ecole polytechnique (Paris) where I presented five short lesson/slides about not well known effective searching lore, 'the structure of the web', some simple pseudo-anonymity techniques and the way you can modify and use some commercial tools for your own searching purposes.