Understanding Website Search Engines
by Cliff Lamere, Albany, NY
17 Nov 2003, rev. 18 Jan 2006
TUTORIAL ON WEBSITE SEARCH ENGINES (sections often depend on prior sections)
Getting the Most Out of Website Search Engines
Where is a Website's Search Engine?
Search Engines Miss More Than They Find
Getting the Odds in Your Favor
Searching for a String of Words
Search for a String of Phrases (Solution to the John Smith Search Problem)
What is Searched by PicoSearch on THIS Website?
Look at Important Webpages after using the Search Engine
Aids for doing searches come in various types. There are those that search the internet (like Google), those that search all of the webpages on a single website (like PicoSearch), and those that search only the webpage you are looking at (choose Edit/Find). The three types all work differently. The first two cannot search for a part of a word, but Edit/Find can.
This webpage is only about the operation of website search engines. Many private genealogical sites do not have these installed, but the larger sites often do. I did not add one to my own genealogy site for years because of how much data they miss. A search that doesn't find a word doesn't mean the word was not there. It may only mean that you don't understand how that kind of search works.
Getting the Most Out of Website Search Engines
Most genealogists do not understand how website search engines work, nor how to use them effectively. These instructions will help you get positive results on a higher percentage of your searches on a website, and for many people they will significantly cut down the amount of time needed to do searches. This tutorial is not just a rehash of what search engine websites give as advice. It includes useful advice that I believe can greatly improve the online research capability of 80% or more of the genealogists who read this webpage.
Where is a Website's Search Engine?
Website search engines do not actually reside on the website that you are viewing (large commercial sites such as Ancestry and Amazon may be exceptions). Only the search window is on the website. With most genealogical sites, the first time you type something into the search window and click on Search or Go, you are taken to another website where the search program is housed.
For the search engine to work, the website must first be indexed. That means that the webmaster must give instructions to the site of the search engine to make a copy of every webpage on the website. In so doing, no links will be followed to a different website. The copied webpages are the ones that are looked at by the website search engine. The hits that are shown are displayed on the site of the search engine.
Any new webpages that are added to the website will not be available to the search engine unless the webmaster reindexes the website. Many inexperienced webmasters don't know that.
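The effect of indexing can be sketched in a few lines of Python. This is only an illustration of the idea that the engine searches a stored copy of the site, not the live site; it is not how PicoSearch or any other named engine is actually built, and the class name SnapshotIndex and the page names are made up for the example.

```python
import re


class SnapshotIndex:
    """Hypothetical index: it searches a copy of the site, not the live site."""

    def __init__(self):
        # page name -> set of lowercase words, as of the last indexing
        self.copy = {}

    def reindex(self, site):
        # Copy every page that exists right now; later additions are not seen.
        self.copy = {
            name: set(re.findall(r"[a-z0-9']+", text.lower()))
            for name, text in site.items()
        }

    def search(self, word):
        # Look only at the stored copy.
        return sorted(
            name for name, words in self.copy.items() if word.lower() in words
        )


site = {"cemetery.html": "Smith family burials in Albany"}
index = SnapshotIndex()
index.reindex(site)

site["census.html"] = "John Smith, farmer"  # page added after indexing
before = index.search("smith")              # the new page is missed

index.reindex(site)                         # the webmaster reindexes
after = index.search("smith")               # now both pages are found
```

Here the first search reports only cemetery.html even though census.html also contains Smith, because the copy was made before that page existed.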
Search engine windows on websites have a default setting for your first search. Sometimes you are provided with a drop down menu to choose the type of search you prefer. At other times, you do not get any choice, and you cannot tell what the default setting is for your first search. In this case, your first search may be wasted, but it will take you to the search engine site where the search actually occurs. While there, you will probably find a drop down menu or other instructions that will allow you to choose the kind of search you prefer to make. Once there, you may want to repeat your first search after choosing the proper option.
Example of a search engine (this one won't do a search):
A PicoSearch engine with a drop down window has three choices:
Find ANY word
Find ALL words
Find EXACT phrase
"Find ANY word" allows you to search for more than one word at a time. If you search for John Smith, you will be told about every webpage that contains the word John and every one that contains the word Smith. "Find ALL words" will inform you only of webpages that have both words on them, but it might only be John Jones and Myra Smith. "Find EXACT phrase" will only tell you about webpages that have the word John followed immediately by the word Smith.
Some search boxes don't show any choices until you get to the search engine's site. After doing the first search you can see what that initial setting is. It is important to know, and it is not the same for all search engines. Search engines that lack a drop down menu may have written operating instructions. They usually allow the following search options:
AND (must be capitalized) (the same as PicoSearch's "Find ALL words")
OR (must be capitalized) (the same as PicoSearch's "Find ANY word")
" " (words enclosed in quotation marks are treated the same as PicoSearch's "Find EXACT phrase")
NOT (Example: NOT June means that you won't be shown any webpages containing the word June.)
+ (If attached to the beginning of a word, it means the same as AND)
- (If attached to the beginning of a word (example: -cemetery), the minus acts the same as NOT. Webpages containing the word cemetery will be excluded from the search results. If you get too many hits from a search, perhaps because some modern person with that same name is a well-known soccer star, you can exclude the webpages that contain the word soccer.)
AND and OR would be used when you are searching for more than one word at a time. A single space must separate AND or OR from surrounding words.
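The difference between these options can be sketched in Python. This is a simplified model under stated assumptions, not the code of any real search engine: it matches whole words only, ignores case, and ignores punctuation (some engines do pay attention to commas). The function names and the four sample pages are invented for the illustration.

```python
import re


def words(text):
    """Split text into lowercase whole words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_any(query, page):
    """OR / Find ANY word: at least one query word is on the page."""
    page_words = set(words(page))
    return any(w in page_words for w in words(query))


def find_all(query, page):
    """AND / Find ALL words: every query word is somewhere on the page."""
    page_words = set(words(page))
    return all(w in page_words for w in words(query))


def find_exact(phrase, page):
    """Quotation marks / Find EXACT phrase: the words are adjacent and in order."""
    q, p = words(phrase), words(page)
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


def without(excluded_word, page):
    """NOT / leading minus: the page must not contain the word."""
    return excluded_word.lower() not in words(page)


pages = {
    "a.html": "John Smith married in 1850",
    "b.html": "John Jones and Myra Smith",
    "c.html": "Myra Smith, widow",
    "d.html": "John Smith, soccer star",
}

any_hits = [n for n, t in pages.items() if find_any("John Smith", t)]
all_hits = [n for n, t in pages.items() if find_all("John Smith", t)]
exact_hits = [n for n, t in pages.items() if find_exact("John Smith", t)]
# Equivalent in spirit to: "John Smith" -soccer
filtered = [n for n in exact_hits if without("soccer", pages[n])]
```

In this sketch ANY returns all four pages, ALL drops the page that has only Myra Smith, EXACT keeps only the two pages where John is immediately followed by Smith, and excluding soccer leaves just a.html.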
Search Engines Miss More Than They Find
1) Newly added webpages cannot be seen by the search engine until they are added to the search engine's site. The webmaster must reindex the website every time a new page is added if you are to find any of its data with the search engine.
2) Some webmasters do not always reindex when something new is added, and some webmasters are unaware that they have to do it more than just that first time. Some NYGenWeb county websites on RootsWeb pass from administrator to administrator. Whoever is running it now may not even know that they are supposed to reindex. If a new administrator does not know the search engine account number and password, reindexing cannot be done. I think that there must be county websites that have a large number of their webpages unavailable to the search engine. Some sites may not have been reindexed in years. Some webmasters add a second or even a third search engine for valid reasons and to give you some additional options in your search. But, I have seen notes attached to search engines saying to try both because they don't give the same results. My guess is that one has been indexed recently and the other a long time ago. Much of this paragraph is speculation, but it should make you wonder about the accuracy of the search engines available to you.
3) Text on a graphic or map cannot be read by the search engine. In New York, members of the Beers family surveyed many of the counties after 1850. They created atlases of the counties and population centers which showed each house and business. The names of the property owners were included. I found some crucial information on one of these maps on a USGenWeb county site. The search engine did not find the names, so I was lucky that I looked around the site after my main search was completed.
4) Search engines give a false sense of security. You cannot believe negative results. One of the problems is that search engines find exactly what you ask for. Most surnames have spelling variations, some caused when a transcriber of a handwritten document had to guess at the spelling. On censuses or in tax, military, or church records, the person recording the name did not know how to spell it, and your ancestor often didn't know how either. A search engine will only find the spelling variation that you type into the search window. Changing one letter in a name might give positive results.
5) Some search engines pay attention to commas (perhaps hyphens also). They only find text that is exactly the same as what you type. I have not experimented to see if a period after a middle initial is also needed. I believe some website search engines ignore commas when doing a search (Google, a different type of search engine, does ignore them).
Getting the Odds in Your Favor
1) I recommend searching only for a surname unless that creates too many hits.
2) If you are searching for John Smith, you would have to search for several variations of the name.
John Smith
Smith John (it would appear this way in an alphabetized table)
Smith, John
In the last example, if by mistake the webpage has a space between Smith and the comma, you will not find him. If the name in a webpage is listed as John A. Smith or John Albert Smith, you will not find him by searching for John Smith. Some webpages would show John with his given name first, but others would show the surname first. If you do a surname-first search, searching for Smith John and Smith, John would both avoid problems with the middle name or initial. If a John A. Smith is listed on a website, he would be difficult to find, especially if you don't know the middle initial or name. You would have to search for just Smith or John AND Smith. Then you would have to wade through all of the hits.
Searching for a String of Words
This is a GREAT time saver. I search for about 10 spelling variations of a surname at one time. If the search engine has a drop down menu, I choose the ANY option. Many search engines don't have such a menu. In that case, if you enter several words into a search window, it will either search for ANY or ALL words in the string, but you often won't know which search is being performed until after the search has been completed. With some engines, you may only find out by experimenting. However, if you use OR to separate all surname spellings from each other, this will normally work with a search engine that doesn't provide a drop down menu of choices.
When I don't get a hit, am I to assume that there is not one of the 10 spellings on the website? No. I look at any webpage on the website, pick an uncommon word that is on it, and then I insert the word somewhere in the string of spellings (also adding an OR). If the search now shows the webpage, then I know that the technique works with that search engine and that the previous negative search was accurate.
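Building the search string, with and without a control word, can be sketched as follows. The helper name or_query, the four sample spellings, and the control word are all made up for the illustration; substitute your own spellings and a word you have actually seen on the site.

```python
def or_query(spellings, control_word=None):
    """Join spelling variations with OR, optionally adding a control word
    that is known to appear on some page of the site."""
    terms = list(spellings)
    if control_word:
        terms.append(control_word)
    return " OR ".join(terms)


# Illustrative spellings only.
spellings = ["Smith", "Smyth", "Smythe", "Schmidt"]

# The normal search string.
query = or_query(spellings)

# The test string: if this shows the page containing the control word,
# the engine accepts OR strings and an earlier zero-hit result was real.
check = or_query(spellings, control_word="Rensselaerswyck")
```

The resulting strings can be pasted into the search window as they are.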
Using the ANY option with a search engine such as PicoSearch means that if the website contains just 3 of 10 spellings, each on a different webpage, all 3 webpages will be shown as hits. Using ALL means that any webpage that does not have all 10 spellings on it will not be included in the search results. ALL will always get me zero hits when searching a long string.
Hint: I search for 70 spellings of a single surname. The beginning of the name comes in five different spellings. I have the 70 spellings typed on five lines in a word processor. Sometimes I use the smallest font available so I can get the necessary spellings all on a single line. They may be so small that I can't read them, but they become full-sized when I copy and paste them into the search window. A space must separate each spelling. Once you go to the webpage, the problem is to be able to find the one or more occurrences on that page. On a single search, I use only the spellings that all have the same beginning four letters. Then, when I go to the webpage, I search for those beginning letters using Edit/Find (same as Ctrl-f). This technique will not work well with a surname that begins with a common word if that common word is also common on that webpage.
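Grouping the spellings by their first four letters can be done by hand, but a small Python sketch shows the idea. The function name and the six sample spellings are invented for the example; the real list of 70 spellings is not reproduced here.

```python
from collections import defaultdict


def group_by_prefix(spellings, length=4):
    """Group spellings that share the same beginning letters."""
    groups = defaultdict(list)
    for spelling in spellings:
        groups[spelling[:length].lower()].append(spelling)
    return dict(groups)


# Illustrative spellings only.
spellings = ["Smith", "Smithe", "Smyth", "Smythe", "Schmidt", "Schmitt"]
groups = group_by_prefix(spellings)

# One space-separated search string per group. After opening a hit,
# search the page for the group's four letters with Edit/Find.
queries = {prefix: " ".join(names) for prefix, names in groups.items()}
```

Each entry in queries is one search, and its key is the fragment to type into Edit/Find on the pages that come back.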
Problem: When I copy and paste the string of surnames into some search engines from my WordPerfect word processor, a square box sometimes gets attached to the first word, and it is tricky getting rid of it. If you have that problem, precede the string with something like XZXZ that would hardly ever appear in a webpage. The box attaches to that and will not interfere with the search for the first real spelling variation. The square is an attempt on the part of the search engine window to show a hidden WP code that comes along when you copy and paste from that program. Netscape's email program shows it as a ?
Even worse is when I don't know the hidden WP code is causing a problem. In Heritage Quest, I learned that pasting in a single surname failed to find people I knew were there. They keep the code but don't give any visual clue about the problem. I always have to type the name instead of pasting it into their search engine window.
I think that some search windows remove all codes automatically and others don't. You should be aware of this possible problem with your own program if you paste rather than type names. You may be able to solve a pasting problem by using a different word processor, or by storing the names in the Draft folder of your email program and then pasting them from there. To be certain of removing codes, you can paste text into Notepad which comes with the Windows operating system.
Notepad is a simple program that I use frequently. When I am copying text from an online table (like Social Security records), I paste it into Notepad, then immediately copy and paste it to its final destination. Notepad removes all codes, including those of WordPerfect 8. It doesn't allow tables, underlining, bold text, italicized text, graphics, etc.
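For readers who keep their spellings in a file, the same cleanup can be approximated in Python. This is a rough sketch, not a description of what Notepad does internally. Which hidden character WordPerfect actually leaves behind is not known here, so the two invisible characters in the sample string are stand-ins chosen for the demonstration.

```python
import unicodedata


def strip_hidden_codes(text):
    """Drop control and formatting characters and tidy the spacing."""
    kept = []
    for ch in text:
        if ch in "\t\n\r":
            kept.append(" ")  # treat tabs and line breaks as spaces
        elif unicodedata.category(ch) in ("Cc", "Cf", "Co", "Cn"):
            continue          # invisible or unassigned character: drop it
        else:
            kept.append(ch)
    # Collapse runs of spaces and trim the ends.
    return " ".join("".join(kept).split())


# "\x1f" and "\u200b" are stand-ins for a hidden word processor code.
pasted = "\x1fSmith Smyth\u200b Schmidt"
clean = strip_hidden_codes(pasted)
```

After this step the string contains only the visible spellings separated by single spaces.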
Search for a String of Phrases (Solution to the John Smith Search Problem)
If a website has too many hits on the name Smith, you have to try something else. Many search engines will allow an exact phrase to be enclosed in quotation marks. Therefore, using the ANY setting (or using the OR option), I would do a single search for a string like this:
"John Smith" "John A. Smith" "John Albert Smith" "Smith John" "Smith, John"
If you know that John sometimes went by his middle name, or the name was originally Schmidt, more variations of the name can be added to the search string.
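A string of phrases like the one above can be generated rather than typed. The following Python sketch is one way to do it; the function name name_phrases is made up, and it covers only the five variations listed above.

```python
def name_phrases(given, surname, middle=None):
    """Return quoted exact-phrase variants of a name, given name first
    and surname first, separated by spaces."""
    phrases = [f"{given} {surname}"]
    if middle:
        phrases.append(f"{given} {middle[0]}. {surname}")  # middle initial
        phrases.append(f"{given} {middle} {surname}")      # full middle name
    phrases.append(f"{surname} {given}")
    phrases.append(f"{surname}, {given}")
    return " ".join(f'"{p}"' for p in phrases)


query = name_phrases("John", "Smith", middle="Albert")
```

Calling the function again with the surname Schmidt, or with Albert as the given name, produces the extra variations, which can be added to the same search string.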
What is Searched by PicoSearch on THIS Website?
PicoSearch will ONLY search the webpages which have a smiley face in front of them, or subdivisions of those (when there was too much data to be put in one webpage). Those 200 webpages are the only webpages made by me and actually on this website. The 1000+ other links are to webpages on the websites of other people, and cannot be examined by the search engine on this site.
Are website search engines case sensitive? Fortunately, no. I usually use lowercase letters in search engines just because I can type them more quickly. Only the AND and OR options are case sensitive.
Some website search engines accept a wildcard. FreeFind's website search engine will allow you to use Abra* in a search. That would find both Abram and Abraham, two forms of the same name. Of course, it may find pages with Abramson that you don't want. The * can only be used at the end of a word. PicoSearch does not allow the use of a wildcard. There are many kinds of unnamed search engines. If this feature is important to you, try it. If you get a positive result, then you know it works with that engine. If you get no result, you can't be certain of what it means.
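What a trailing wildcard does can be imitated with a regular expression in Python. This sketch only mimics the behavior described above; it is not FreeFind's implementation, and the function name and sample text are invented.

```python
import re


def wildcard_hits(pattern, text):
    """Return the whole words that match a pattern whose only wildcard
    is a * at the end of the word."""
    if pattern.endswith("*"):
        regex = r"\b" + re.escape(pattern[:-1]) + r"\w*\b"
    else:
        regex = r"\b" + re.escape(pattern) + r"\b"
    return re.findall(regex, text, flags=re.IGNORECASE)


text = "Abram Smith, Abraham Smith and Abramson Jones"
hits = wildcard_hits("Abra*", text)
```

The hits include Abram and Abraham, and also the unwanted Abramson.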
Look at Important Webpages after using the Search Engine
If you think there are some important webpages on a site, after using the search engine it is advisable to look at them to find surname spellings you couldn't imagine. If the surnames are arranged alphabetically, slightly misspelled names are often easily seen. Also, the Edit/Find feature can be used to quickly search for a surname or part of a surname. I often use just part of the surname, the part that is least likely to be misspelled. [Website search engines find only whole words.]
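The difference between a whole-word search and Edit/Find can be shown with a short Python sketch. Both functions are simplified models written for this illustration, and the sample line of names is made up.

```python
import re


def engine_finds(word, text):
    """Whole-word match, as website search engines are described above."""
    return word.lower() in re.findall(r"[a-z0-9']+", text.lower())


def edit_find(fragment, text):
    """Match any run of letters, like Edit/Find in a browser."""
    return fragment.lower() in text.lower()


page = "Vanderbogart, Jacob; Van Der Bogert, Maria"

whole_word = engine_finds("bogart", page)  # False: Bogart is not a whole word here
fragment = edit_find("bog", page)          # True: finds both spellings
```

Searching for the fragment bog catches Vanderbogart and Bogert alike, which is why the part of the surname least likely to be misspelled is the best thing to type into Edit/Find.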
On the home page of my own website, I provide the date the website was last indexed, plus a list of any pages that cannot be searched by the search engine yet. That allows a researcher to decide whether they need to look at any of those pages.
PicoSearch and FreeFind are both excellent website search engines. Many search engines don't have names on them, and they seem to be of several different sorts. Some of them act strangely. One NY county site has a search engine that resets itself after every search. The website seems proud of this fact, but I think it is a terrible feature. If I want ANY, I have to choose it on every search, because it always goes back to ALL and gives me zero hits if I'm not aware of what is happening. Some people will not realize that that option has changed for their next search. You have to be alert to oddities like this so that you know when a negative result is real.