by Kevin A. McGrail kevin@mcgrail.com
Using a global DNS record, an Apache web server, Perl, a few simple CGIs and AGREP, this document introduces our solution for creating an intuitive, easy-to-navigate organizational structure for related websites, with automatic directory, fuzzy-logic and search capabilities, for organizations that may involve 5, 50 or 50,000 entities. We’ll call it a WebRing on steroids.
A few years ago, I was trying to solve the problem of developing a very fast search engine that could search millions of lines of data, including the non-indexable text fields in MySQL. However, I needed the search to be accurate, all-inclusive, and use best-match logic. Furthermore, it needed to do this in a multi-user environment on machines that were very slow by today's standards (Pentium II 400 MHz).
Over a period of months, I spent countless hours investigating and testing every option I could think of or that was recommended to me. Luckily, I won’t bore you with the details.
Eventually, I developed a system using AGREP, short for approximate grep and developed at the University of Arizona, which is the core feature behind Glimpse[1]. As the years have passed, I've continued using AGREP in a variety of ways. Even as computers get faster and faster, and even as MySQL and MS SQL have added full-text search capabilities[2], I am still amazed by the power it has.
So, after promising to write a white paper about AGREP for quite some time, I finally had a topic that would show the simplicity of the solution and still show the extensibility and power it can provide.
The Knights of Columbus (KofC), the largest fraternal Catholic organization, has many councils and affiliates throughout the world. Many of these councils and affiliates have websites; however, finding them is not always the simple task it should be. Since I work in computers, I began working on a proof-of-concept to fix the problem of finding various KofC council and state websites.
Among the many design goals I wanted to meet, the most important to this paper were:
A) The ability to search and easily find a website in the ring
B) The search engine had to be easy to find and very flexible on spelling
C) Utilize one domain name to promote a unified identity and lower costs[3]
First, I created a database of the council websites including the Council Number, Location and Name. For ease of maintenance, I also extended this step and created a web-based interface to maintain this database.
Second, I used this database and a Perl program to create a flat-text file with one line of selected information for every unique ID in the database[4]. If a unique ID was not used in the database, I wrote a blank line as a placeholder so that each line number always equals its unique ID. It’s not ideal, but it simplified many aspects of the search; just don’t start your unique IDs at 4,000,000, or the file will begin with millions of blank lines.
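The flat-file step can be sketched roughly as follows. This is a minimal Python illustration, not the Perl program from footnote 4; the function names, the field order and the sample records are all assumptions made up for the example.

```python
def build_search_lines(records):
    """Return a list in which position i (0-based) holds the text for unique ID i + 1.

    `records` maps unique ID -> (council number, location, name).
    IDs that are missing from the mapping become blank placeholder lines.
    """
    if not records:
        return []
    lines = [""] * max(records)  # one slot for every ID from 1 to the highest ID
    for uid, (number, location, name) in records.items():
        lines[uid - 1] = f"{number} {location} {name}"
    return lines


def write_search_file(records, path):
    """Write the search lines to `path`, one line per unique ID."""
    with open(path, "w", encoding="utf-8") as fh:
        for line in build_search_lines(records):
            fh.write(line + "\n")


# Hypothetical sample data (not real councils); ID 2 is deliberately unused.
records = {
    1: ("1001", "Springfield", "Example Council"),
    3: ("1003", "Riverton", "Sample Assembly"),
}
lines = build_search_lines(records)
```

Because ID 2 is missing, the second entry in `lines` is an empty string, so the third line of the file still belongs to unique ID 3.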
Third, I created global DNS entries for all the related domains (*.kofcva.org, *.kofc.com, *.kofc-va.org[5]) that point to a virtual-host web server, which we call the redirector, that answers any request for any name.
Fourth, for the councils we hosted, we began using sub-domains equal to their council numbers, such as www.6292.kofcva.org, so that members could find and re-visit them more easily.
Fifth, on the redirector, I used a CGI that parses HTTP_HOST to determine the website a user is looking for and define the search string[6].
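As a rough illustration of this step, the Python sketch below turns an HTTP_HOST value into a search string. It is not the CGI from footnote 6. The function name, the list of base domains, and the choice to treat underscores and hyphens as spaces are assumptions made for the example.

```python
import os

# Assumed base domains, taken from the wildcard DNS entries described above.
BASE_DOMAINS = ("kofcva.org", "kofc-va.org", "kofc.com")


def search_string_from_host(http_host):
    """Reduce a host name such as 'www.father_diamond.kofcva.org' to 'father diamond'.

    Returns an empty string when the host is just the organization's own site,
    which a caller could treat as "show the directory".
    """
    host = http_host.strip().lower()
    host = host.split(":", 1)[0].rstrip(".")  # drop any port and trailing dot
    for base in BASE_DOMAINS:
        if host == base:
            return ""
        if host.endswith("." + base):
            host = host[: -len(base) - 1]  # strip '.<base domain>'
            break
    labels = [label for label in host.split(".") if label and label != "www"]
    # Assumption: underscores and hyphens stand in for spaces in a search URL.
    return " ".join(labels).replace("_", " ").replace("-", " ")


# In a CGI, the web server supplies the requested host name in the environment.
search = search_string_from_host(os.environ.get("HTTP_HOST", ""))
```

With this sketch, www.6292.kofcva.org yields the search string "6292" and www.kofcva.org yields an empty string.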
Sixth, I used AGREP with this search string and the flat-text file created above to obtain the line number or numbers that match the search string the best[7]. Since these line numbers also match our unique IDs in the database, we can now perform an SQL query to obtain any information in the database linked to that unique ID.
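The idea of "best-matching line number equals unique ID" can be sketched without AGREP itself. The Python example below uses the standard library's difflib as a stand-in scorer. That is an assumption for illustration only: it is not AGREP's algorithm and will not rank results exactly as AGREP does. The function name and sample lines are also made up.

```python
import difflib


def best_match_ids(search, lines):
    """Return the 1-based line numbers (unique IDs) that best match `search`.

    Each non-blank line is scored by the highest similarity between the
    search string and either one of its words or the whole line.
    """
    query = search.strip().lower()
    if not query:
        return []
    scores = {}
    for line_no, text in enumerate(lines, start=1):
        if not text:
            continue  # blank placeholder for an unused ID
        lowered = text.lower()
        candidates = lowered.split() + [lowered]
        scores[line_no] = max(
            difflib.SequenceMatcher(None, query, c).ratio() for c in candidates
        )
    if not scores:
        return []
    best = max(scores.values())
    return [line_no for line_no, score in scores.items() if score == best]


# Hypothetical flat file contents; line 2 is a blank placeholder.
lines = ["1001 Springfield Example Council", "", "1003 Riverton Sample Assembly"]
ids = best_match_ids("asssembblee", lines)
```

The returned numbers can then be used directly as the unique IDs in the follow-up SQL query, since the blank placeholder lines keep line numbers and IDs aligned.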
Seventh, I used the database to provide a directory for all the members of the webring on the default web page for the organization.
Go to www.6292.kofcva.org and you will immediately be taken to this council’s website.
Go to www.father_diamond.kofcva.org and you will be presented with the best entries in the database for this search URL, which matches a Council Name.
Go to www.fairfax.kofcva.org and you will be presented with the best entries in the database for this search URL, which matches a location.
Go to www.kofcva.org. You will be presented with the directory for the webring including all of the sites you have found above. Use the search feature and note that it uses the same redirector.
Try misspellings and see the power of AGREP's approximate matching. Go to
www.asssembblee 1834.kofcva.org
for a good example of this.
[1] Agrep was written by Sun Wu, Udi Manber and Burra Gopal.
[2] Why MySQL's Full-Text Search Didn't Solve the Problem
The biggest problem is the weighting engine that MySQL uses. Please don’t take that as a slam against MySQL, which is a fabulous product at a very fair price; the full-text engine just wasn't going to work for me.
To illustrate the problem, suppose you have a database with only 10 entries that hold the Title and Author of your entire music album collection. Because you are an exacting customer with extremely good taste, all 10 entries are by the band XTC. Now you want to perform a search of Authors for XTC using the MySQL full-text search. The search for XTC may not produce the results you want because the word XTC appears in every entry, and the weighting engine treats a word that common as too common to be a useful match.
This approach is perfect if you are building a relevancy search engine that will have gazillions of items to search but you want to provide only the best answer subset. However, for an e-commerce site or something similar, you REALLY want every CD with XTC to be returned as a positive hit.
None of this, however, addresses one of the core features of AGREP. Very simply, the MySQL full-text search doesn't include the positively amazing fuzzy-logic / Best Match algorithm that AGREP provides.
[3] This applies more to a single-organization webring such as the Knights, but not all of our members are hosted on the same domain, so it’s just a design goal, not a requirement.
[4] Pseudo code available at http://www.peregrinehw.com/downloads/agrep/create_search.txt.
[5] An example zone file is http://www.peregrinehw.com/downloads/agrep/example.named.global.dns.
[6] The code to parse the URL and use Agrep to do a best-match search for the appropriate unique IDs is available at http://www.peregrinehw.com/downloads/agrep/redirector.txt.
[7] See footnote #6.