Overview

This system automatically extracts lists of resources from the World Wide Web. Currently, the extracted resources are nouns and noun phrases. They are sorted by their relevance to the user query.


The semantics of wildcards

The system supports two kinds of wildcards, namely % and *.

The % wildcard

The % wildcard represents a noun or noun phrase in a query. It lets you specify which nouns or noun phrases you want to extract from the Web. For example, the query "% is a Canadian city" will extract the nouns or noun phrases that appear immediately before "is a Canadian city".

The % wildcard should appear exactly once in the query.
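
The behaviour described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the function names check_query and query_to_pattern are hypothetical, and a run of capitalized words is used as a crude stand-in for real noun phrase detection, which the system presumably performs with proper linguistic analysis.

```python
import re


def check_query(query: str) -> None:
    """Raise ValueError unless the query contains exactly one % wildcard."""
    count = query.count("%")
    if count != 1:
        raise ValueError(f"expected exactly one % wildcard, found {count}")


def query_to_pattern(query: str) -> re.Pattern:
    """Build a regex that captures the text standing in the % position.

    Assumption: a run of capitalized words approximates a noun phrase.
    """
    check_query(query)
    before, after = query.split("%")
    # One or more capitalized words separated by single spaces.
    candidate = r"([A-Z][\w-]*(?: [A-Z][\w-]*)*)"
    return re.compile(re.escape(before) + candidate + re.escape(after))


pattern = query_to_pattern("% is a Canadian city")
text = (
    "Everyone knows that Toronto is a Canadian city. "
    "Quebec City is a Canadian city too."
)
matches = pattern.findall(text)
print(matches)  # ['Toronto', 'Quebec City']
```

A query with no % wildcard, or with more than one, is rejected by check_query before any matching takes place.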

The * wildcard

A word enclosed in a pair of * wildcards is augmented with its synonyms. Consider the following scenario: you want to extract a list of car manufacturers, so you enter the query "% is a car manufacturer". However, bona fide car manufacturers are often referred to as "vehicle manufacturers", "sedan manufacturers", and so on. To address this problem, you can reformulate the query as "% is a *car* manufacturer", and the query will be automatically expanded to include "car" and its synonyms.

The * wildcard is optional.
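
One way to picture the expansion is the following sketch. The synonym table SYNONYMS and the function expand_query are hypothetical and only illustrate the idea. How the system actually looks up synonyms is not described here.

```python
import re

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"car": ["vehicle", "sedan"]}


def expand_query(query: str, synonyms: dict = SYNONYMS) -> list:
    """Return the query with every *word* replaced by the word and its synonyms."""
    marked = re.findall(r"\*(\w+)\*", query)
    queries = [query]
    for word in marked:
        # The original word comes first, followed by its known synonyms.
        alternatives = [word] + synonyms.get(word, [])
        queries = [
            q.replace(f"*{word}*", alt, 1)
            for q in queries
            for alt in alternatives
        ]
    return queries


expanded = expand_query("% is a *car* manufacturer")
for q in expanded:
    print(q)
# % is a car manufacturer
# % is a vehicle manufacturer
# % is a sedan manufacturer
```

Because the * wildcard is optional, a query without any marked word comes back unchanged as a single-element list.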


How to write queries

You specify what information to extract by writing queries. A query in our system is similar to a phrase query in a typical search engine, except that you must use the % wildcard to indicate what to extract. The following is a list of sample queries: