Search Engines


Search Engines and Directories

If we want to find relevant sites on the WWW, directories can be useful. Directories contain lists of sites, organised by topic. Clicking on a topic gives you a list of subtopics, and you continue down the subtopics until you find what you want. One example is Yahoo, which is a directory but passes its request on to a search engine, such as AltaVista, if it cannot find the request in its own directory.

So while directories are helpful, they are not as good as search engines when your questions become more specific. Search engines are Web pages containing forms into which you type the text string you want to search for.

Behind every search engine stands a database, in which are collected the URLs (Uniform Resource Locators, or specially formatted Internet addresses) of Web pages and other Net resources. Most of these databases are created by crawlers (also known as robots or spiders). Spiders are software programs that roam the Web, looking for new sites by following links from page to page. When a spider finds a new page, it adds it to the database. These databases store anywhere from a few thousand to over a million Web pages; the leading engines add new pages daily. The size of an engine's database has a big impact on the success of your search. Generally, the smaller the database, the fewer hits you receive.
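The way a spider follows links from page to page can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the "Web" here is a hypothetical in-memory dictionary (TOY_WEB) that maps each URL to the links on that page, and crawl is a made-up helper name, not the code of any real spider.

```python
from collections import deque

# Hypothetical miniature "Web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/", "http://d.example/"],
    "http://d.example/": [],
}

def crawl(start_url, web):
    """Follow links breadth-first and return the URLs in discovery order."""
    database = []              # stands in for the engine's database of URLs
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        database.append(url)
        for link in web.get(url, []):
            if link not in seen:   # only pages not met before are queued
                seen.add(link)
                queue.append(link)
    return database
```

A real spider would fetch each page over the network and extract its links; the dictionary lookup stands in for that step.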


How Search Engines index the WWW

A search engine's database is simply an index of words and phrases associated with URLs. Spiders collect URLs as they roam the WWW, but that is not all they do: they also collect information about each page. The search engine's back-end software uses this information to create an index, which is what you're actually searching when you submit a query. Not surprisingly, indexing techniques vary from engine to engine. Every engine indexes a page's URL and title. Most also index the headers that start each section. Others record the most frequently mentioned words or the first few lines of text. Open Text Index and AltaVista actually index every word on the page, including words like "the" and "that" that other engines ignore. Excite's concept-based indexing can find relevant pages even if they don't contain your specific keywords.
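An index of words associated with URLs is often built as an "inverted index". The sketch below shows the idea, including the choice between indexing every word and ignoring common words. The page texts, the STOP_WORDS list and the build_index name are all assumptions made for illustration; no engine's actual indexing code is shown here.

```python
import re
from collections import defaultdict

# Hypothetical pages; a real spider would supply the text it fetched.
PAGES = {
    "http://a.example/": "The digital library and the Web",
    "http://b.example/": "A library of metal sculpture",
}

STOP_WORDS = {"the", "that", "and", "a", "of"}  # assumed list, for illustration

def build_index(pages, skip_stop_words=True):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if skip_stop_words and word in STOP_WORDS:
                continue
            index[word].add(url)
    return index
```

Calling build_index(PAGES, skip_stop_words=False) corresponds to the "index every word" approach, at the cost of a larger index.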


How to use a Search engine

While the size of the database determines the number of hits it delivers, the quality of the indexing is a major factor in determining how many of those hits are relevant to your search. No matter how big the database, or how sophisticated the indexing, a search engine is only as good as the query you give it.

Not all engines treat your search terms the same way. InfoSeek stems words, seeking matches with parts of the whole: ask for impressionism, for example, and you'll also get matches for impression. Lycos, on the other hand, treats your search term as a stem -- so the word metal matches metallic. Several engines let you search for whole phrases: instead of just searching for the individual words in your query string, they look for occurrences of them together. Some engines let you refine your queries with special operators. At the most basic level, this means you can, as with Lycos, search for sites that contain either any or all of your search terms. Others let you use more formal Boolean terms -- AND, OR, and sometimes NOT. InfoSeek and Open Text Index give you proximity operators, which let you search for terms that appear near or next to each other (AltaVista's advanced query mode also offers a simple near operator).
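The query features described above (treating a term as a stem, Boolean operators, and proximity operators) can be illustrated with a small sketch. Everything here is hypothetical: the three toy pages in DOCS, and the helper names pages_with and near, are invented for the example and do not reflect how any particular engine works internally.

```python
# Hypothetical pages, each stored as its list of words in order.
DOCS = {
    "page1": "heavy metal music".split(),
    "page2": "digital library of music".split(),
    "page3": "metallic paint for a library wall".split(),
}

def pages_with(term, docs, as_stem=False):
    """Pages containing the term; with as_stem, any word starting with it."""
    if as_stem:
        return {url for url, words in docs.items()
                if any(w.startswith(term) for w in words)}
    return {url for url, words in docs.items() if term in words}

def near(a, b, docs, distance=2):
    """Pages where a and b occur within `distance` words of each other."""
    hits = set()
    for url, words in docs.items():
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        if any(abs(i - j) <= distance for i in pos_a for j in pos_b):
            hits.add(url)
    return hits

# Boolean operators map naturally onto set operations.
music_and_library = pages_with("music", DOCS) & pages_with("library", DOCS)
music_or_library = pages_with("music", DOCS) | pages_with("library", DOCS)
library_not_music = pages_with("library", DOCS) - pages_with("music", DOCS)
```

With this data, searching for metal as an exact word finds only page1, while treating it as a stem also finds page3 through the word metallic.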

Many search engines can look for whole phrases instead of just keywords. Most engines return hits in order of relevancy, but different search engines use different methods to calculate it. InfoSeek ranks hits according to how frequently your search terms appear in the page relative to their frequency in the entire database. Lycos ranks them based on the number of terms found on the page, their proximity to one another, and their position on the page. Lycos also presents its results best, giving you a relevancy rating, a page description, and a brief abstract of the page's text.
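To make the idea of frequency-based relevancy concrete, here is a rough sketch that scores a page by how often a term appears in it relative to how often the term appears in the whole database. The formula is an assumption chosen for illustration, not InfoSeek's actual ranking method, and the names relevance, rank and RANK_PAGES are hypothetical.

```python
# Hypothetical database: each page is stored as its list of words.
RANK_PAGES = {
    "page1": "library library books".split(),
    "page2": "library music music music".split(),
    "page3": "paint wall".split(),
}

def relevance(query_terms, page_words, all_pages):
    """Sum, over the query terms, of page frequency / database frequency."""
    if not page_words:
        return 0.0
    total_words = sum(len(words) for words in all_pages.values())
    score = 0.0
    for term in query_terms:
        in_page = page_words.count(term) / len(page_words)
        in_db = sum(w.count(term) for w in all_pages.values()) / total_words
        if in_db > 0:              # skip terms the database never contains
            score += in_page / in_db
    return score

def rank(query_terms, all_pages):
    """Return pages from most to least relevant, dropping zero scores."""
    scored = [(relevance(query_terms, words, all_pages), url)
              for url, words in all_pages.items()]
    return [url for score, url in sorted(scored, reverse=True) if score > 0]
```

A term that is rare across the database but common in one page pushes that page up the list, which is the behaviour the frequency-based description suggests.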


Metasearching

Metasearch sites are pages from which you can launch queries on several search engines at once. The only problem with these parallel searchers is that you don't get full access to each engine's query tools, such as the Boolean and proximity operators, so your searches will be less accurate than if you used the engines directly.
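A parallel searcher can be pictured as a function that sends the same plain query to several engines and merges what comes back. In the sketch below the two "engines" are hypothetical stub functions returning fixed lists; a real metasearch page would send requests to the actual engines. Interleaving the results and dropping duplicates is one possible merging policy, assumed here for illustration.

```python
from itertools import zip_longest

# Hypothetical stand-ins for real engines: each takes a query string and
# returns a ranked list of URLs.
def engine_a(query):
    return ["http://a.example/1", "http://shared.example/"]

def engine_b(query):
    return ["http://shared.example/", "http://b.example/1", "http://b.example/2"]

def metasearch(query, engines):
    """Query every engine and interleave the results, dropping duplicates."""
    result_lists = [engine(query) for engine in engines]
    merged, seen = [], set()
    # zip_longest pads the shorter lists with None, which is skipped below.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

Because every engine receives the same simple string, engine-specific operators cannot be passed through, which mirrors the limitation noted above.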


Various Search Engines and Directories

YAHOO

Databases: A subject guide to Internet resources, Usenet news, and email addresses.
Contents: Links (URLs) to Internet resources including brief text.
Searching: Simple search box, but the Yahoo Search page offers options for searching Yahoo!, Usenet, or Email Addresses. Can be limited to listings added in the last day, week, month, or three years. Boolean operators (and, or) and string searching are also supported. Note: if no results are found in Yahoo, the search is automatically passed to AltaVista, which then searches its database and passes the results back via Yahoo. There are also local Yahoos (e.g. Yahoo!France), which give more local links, and My-Yahoo, which a user can log on to and customise to that user's interests.
SEARCH TIP: Yahoo! is a subject directory, which means it will not list many pages that search engines will typically retrieve (such as Joe Schmoe's page of hot links). Use a few words that describe your topic or that may be found on a high-level page (the first page you would see when visiting a site) for an organization or company.
Address: http://www.yahoo.com/
Update frequency: daily

ALTAVISTA

Databases: World Wide Web pages and Usenet News.
Contents: 30 million Web pages (September 1996) and over 14,000 newsgroups.
Searching: Offers a simple search or advanced query mode. Simple search can handle simple queries as well as much more advanced searching by using a particular search syntax. To find out all your searching options, see the Simple Search Help. Advanced query mode offers boolean (and, or, not) and simple adjacency (near) operators, as well as an option for additional words to use to rank the search results (occurrence of the word(s) within retrieved documents will make them sort higher in the list).
SEARCH TIP: Try title:"Your Phrase" as a first search. That is, a search like title:"Digital Library" will restrict the search to that mixed-case phrase found in the titles of Web documents.
Address: http://altavista.digital.com
Update Frequency: Constantly by Web robot ("Scooter").

EXCITE

Databases: WWW pages, Excite Web site reviews, Usenet news and classifieds.
Contents: 50 million Web pages and more than two weeks of Usenet news.
Searching: Offers only a Simple Search option, which supports some advanced search options (see Search Tips below). Excite searches by concept; that is, it assumes that words found near each other are often related (e.g. IBM and Lotus). This tends to return a lot more links.
SEARCH TIPS: Use a plus sign (+) to specify that ALL documents have that word, or use a minus sign (-) to specify that NONE of the documents have that word. Excite also supports full Boolean operators and syntax. You can use AND, OR and AND NOT operators and parentheses for grouping. For example: (digital or virtual or electronic) AND library.
Address: http://www.excite.com/
Update Frequency: Constantly by Web robot.
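The plus and minus signs in the Excite search tips can be read as "required" and "excluded" markers. The following sketch shows one way such a query string could be interpreted. The function names parse_query and matches are hypothetical, and the parsing rules are assumptions for illustration, not Excite's actual implementation.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-) and optional words."""
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional

def matches(page_words, query):
    """True if the page has every required word and no excluded word."""
    required, excluded, _ = parse_query(query)
    words = set(page_words)
    return (all(w in words for w in required)
            and not any(w in words for w in excluded))
```

For example, the query "+digital -music" accepts a page containing digital and library, and rejects one that contains both digital and music.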

HOTBOT

Database: World Wide Web pages.
Contents: 54 million Web pages (September 1996)
Searching: Offers a Simple Search and an Expert Search. Simple search supports the Boolean AND and OR operators, phrase searching, and the choices "the person" or "the URL". Expert search also supports date limits, media type (VRML, Audio, JavaScript, etc.) and location (country, etc.).
Address: http://www.hotbot.com/
Update Frequency: Constantly by Web robot ("Slurp").

INFOSEEK ULTRA

Database: WWW pages.
Contents: 50 million URLs (September 1996)
Searching: Only offers a simple query option, but search words may be limited to particular fields (such as within document titles), eliminated (precede the word with a minus sign "-") or required (precede the word with a plus sign "+"). Infoseek also offers a corporate search (not free) that searches a number of useful databases as well as the normal ones.
Address: http://ultra.infoseek.com/
Update Frequency: Constantly by Web robot.

LYCOS

Databases: World Wide Web pages, sounds, pictures, and sites by subject (in separate databases).
Contents: 51 million URLs (July 1996).
Searching: Offers a Simple Search and a Custom Search. The custom search supports Boolean AND and OR, as well as some other settings.
Address: http://www.lycos.com/
Update frequency: Constantly by Web robot.

OPEN TEXT INDEX

Database: World Wide Web pages.
Contents: No current statistics available (September 1996).
Searching: Offers Simple Search and Power Search options. Simple search allows you to specify either keyword or phrase searching. Power search allows you to specify where your search words should occur (anywhere, title, summary, first heading, Web location (URL)) and relationships between them (and, or, but not, near, followed by).
Address: http://index.opentext.net/
Update frequency: Constantly by Web robot.


The Future for the WWW

Searching the WWW for information using the search engines that index it is slow. Anyone surfing the WWW will notice that half their time is spent looking for the information they want. So could Objects be the answer in the future?

Objects are entities that come into being when we take a piece of information and wrap it up in some functionality. This could lead to the idea of distributed functionality and intelligent data. This idea means that pages, encapsulated in an object, will be able to present their own content and answer questions based on this content. So instead of searching through the pages for the information, a question could be sent out, and the object that can answer the question will say it can and then send the answer back. But what if there are two objects able to answer the question?

Then we need a broker to guide us to the answer we want and present us with all the options (i.e. an Object Request Broker, or ORB). These ORBs must speak a common language such as CORBA. Hence if two different ORBs are CORBA compliant, they can talk to each other.
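The idea of objects that announce whether they can answer a question, with a broker presenting all the options, can be pictured with a toy sketch. This is not CORBA and uses no real ORB interface; the PageObject and Broker classes and their methods are hypothetical names invented purely to illustrate the concept.

```python
class PageObject:
    """Hypothetical object wrapping a page's content with some functionality."""

    def __init__(self, name, facts):
        self.name = name
        self.facts = facts            # maps a question to its answer

    def can_answer(self, question):
        return question in self.facts

    def answer(self, question):
        return self.facts[question]


class Broker:
    """Toy broker: forwards a question to every object that can answer it."""

    def __init__(self):
        self.objects = []

    def register(self, obj):
        self.objects.append(obj)

    def ask(self, question):
        # Return every available option so the user can choose between them.
        return [(obj.name, obj.answer(question))
                for obj in self.objects if obj.can_answer(question)]
```

If two registered objects can both answer the same question, ask returns both answers labelled with the object's name, which is the "present us with all the options" role described above.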

OMG (the Object Management Group) has developed a CORBA-compliant protocol for ORBs to communicate over the Internet. This protocol is called the Internet Inter-ORB Protocol (IIOP). It is predicted that it will replace HTTP as the dominant Internet protocol. It will allow the WWW to become an ocean of functional objects as opposed to a jungle of content. IONA has a Java ORB called OrbixWeb, which allows Objects to be used by browsers, in the form of Java applets. There are currently 15,000 ORBs in place worldwide. Netscape's next offering will be Communicator. It will include an IIOP-based ORB; thus, when released, it will take the number of ORBs from 15,000 to about 2 million overnight. So it looks like this is the future of the WWW.