People sometimes ask why Google doesn't simply include every page on the Web in its search engine rather than leaving some out. Well, the World Wide Web is very large, and Google isn't even sure how large; they can only index a fraction of it. Google has plenty of capital to buy more computers, but there just isn't enough bandwidth and electricity available in the world to index the entire Internet. Google's crawling and indexing programs are believed to be among the largest computations ever run.
Googlebot fetches pages, and then an indexing program analyzes them and stores a representation of each page in Google's index. The index is an incomplete model of the Web. From there, PageRank is calculated and secret algorithms generate the search results. The only pages that can show up in Google's search results are pages included in the index; if your page isn't indexed, it will never rank for any keywords.
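To make that pipeline concrete, here is a minimal sketch in Python. Everything in it (the fetch callback, the dict-based index, the function names) is an illustrative assumption rather than anything Google actually runs; the point is simply that a page that never makes it into the index can never show up in a result.

```python
# A hypothetical sketch of the crawl -> index -> search pipeline described above.
# The simple dict-based index is an illustrative assumption, not Google's design.

def crawl(urls, fetch):
    """Fetch each URL and return a mapping of url -> page text."""
    return {url: fetch(url) for url in urls}

def build_index(pages):
    """Store a simplified representation of each page: its set of terms."""
    return {url: set(text.lower().split()) for url, text in pages.items()}

def search(index, query):
    """Only pages present in the index can ever appear in results."""
    terms = set(query.lower().split())
    return [url for url, page_terms in index.items() if terms <= page_terms]

# Example: a page that was never crawled and indexed cannot rank, period.
pages = crawl(["http://example.com/a"], fetch=lambda url: "seo indexing tips")
index = build_index(pages)
print(search(index, "indexing tips"))       # ['http://example.com/a']
print(search(index, "unindexed keyword"))   # [] -- not in the index, never ranks
```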
Because the Web is so much larger than the index, Google has to make decisions about what to spider and what to index. Dan told me that Google doesn't spider every page they know about, nor do they add every spidered page to the index. Two thoughts flashed through my mind at that moment: (1) I need to buy Dan a drink, (2) What can I do to make sure my pages get indexed?
Bandwidth and electricity are the constraining resources at Google. On some level they have to allocate those resources among all the different Web sites: Google isn't going to index Web sites A–G and then ignore H–Z. Dan suggested that each day Google has a large but limited number of URLs it can spider, so for large sites it's in the site owners' interests to help the indexing process run more efficiently, because that may lead to more pages being indexed.
Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.
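To picture that data structure, here is a toy inverted index in Python. The function and the sample documents are made up for illustration; a real search index is far more elaborate, but the lookup idea is the same: each term points directly at the documents, and the positions within them, where it occurs.

```python
from collections import defaultdict

# Illustrative sketch of an inverted index: term -> list of (doc_id, position).
# A toy model of the idea described above, not Google's actual index format.

def build_inverted_index(docs):
    """docs: mapping of doc_id -> text. Returns term -> list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index

docs = {
    "page1": "google stores the full text of pages",
    "page2": "the index maps each term to its pages",
}
index = build_inverted_index(docs)
print(index["pages"])    # [('page1', 6), ('page2', 7)] -- direct lookup per query term
print(sorted(index)[:3]) # terms can be kept in alphabetical order for fast access
```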
To improve search performance, Google ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces and converts all letters to lowercase, again to improve Google’s performance.
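Here is a small sketch of those normalization steps, again purely as an illustration: the stop-word set below is a tiny sample I made up, not Google's actual list.

```python
import string

# Illustrative normalization before indexing: lowercase the text, strip
# punctuation and extra whitespace, and drop stop words.
# STOP_WORDS is a small hypothetical sample, not Google's real list.

STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "a", "an", "and"}

def normalize(text):
    """Return the index-worthy terms of a page, with stop words removed."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [term for term in no_punct.split() if term not in STOP_WORDS]

print(normalize("How is the   Web indexed, and why?"))
# ['web', 'indexed'] -- stop words and punctuation never reach the index
```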
How much effort Google decides to put into spidering a site is a secret, but it's influenced by PageRank. If your site has relatively few pages with high PageRank, they'll all get into the index no problem, but if you have a large number of pages with low PageRank, you may find that some of them don't make it into Google's index.
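As a rough illustration of that idea (an assumption about the general mechanism, not Google's actual policy), you can think of a limited daily crawl budget being spent on the highest-scoring URLs first, so the low-PageRank pages on a large site are the ones that risk being left out.

```python
# Hypothetical sketch: a limited daily crawl budget spent on the
# highest-scoring URLs first. The scores below are made-up examples.

def pick_urls_to_crawl(known_urls, daily_budget):
    """known_urls: mapping of url -> assumed PageRank-like score."""
    ranked = sorted(known_urls, key=known_urls.get, reverse=True)
    return ranked[:daily_budget]

known_urls = {"/home": 0.9, "/popular-post": 0.6, "/tag/obscure": 0.01, "/old-page": 0.02}
print(pick_urls_to_crawl(known_urls, daily_budget=2))
# ['/home', '/popular-post'] -- the low-scoring pages wait, or never get crawled
```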