For most of us, Google is the default search engine. We use it to find information, news, product prices, and companies.
Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.
Have you ever wondered how Google's search engine works?
Search engines perform several activities in order to deliver search results.
At the simplest level, search engines do two things.
Index information. Discover and store information about 30 trillion individual pages on the World Wide Web.
Return results. Through a sophisticated series of algorithms and machine learning, identify and display to the searcher the pages most relevant to her search query.
How did Google find 30 trillion web pages? Over the past 18 years, Google has been crawling the web, page by page. A software program called a crawler (also known as a robot, bot, or spider) starts with an initial set of web pages. To get the crawler started, a human enters a seed set of pages, giving the crawler content to index and links to follow. Google’s crawling software is called Googlebot, Bing’s is called Bingbot, and Yahoo uses Slurp.
When a bot encounters a page, it captures the information on that page, including the textual content, the HTML code that renders the page, information about how the page is linked to, and the pages to which it links.
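To make the capture step concrete, here is a minimal sketch of pulling text and outgoing links from one page with Python's standard `html.parser` module. The `PageCapture` class, the `capture_page` helper, and the sample HTML are all made up for illustration; real crawlers such as Googlebot are far more elaborate, and this is not how any of them is actually implemented.

```python
from html.parser import HTMLParser


class PageCapture(HTMLParser):
    """Hypothetical helper: collects text and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every <a href="..."> on the page.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep non-empty runs of text found between tags.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def capture_page(html):
    """Return the raw HTML, the extracted text, and the links the page points to."""
    parser = PageCapture()
    parser.feed(html)
    parser.close()
    return {
        "html": html,
        "text": " ".join(parser.text_parts),
        "links": parser.links,
    }


# Made-up sample page, for illustration only.
sample_html = (
    "<html><body><h1>Paint</h1>"
    '<p>Buy <a href="/brushes">brushes</a> here.</p>'
    "</body></html>"
)
record = capture_page(sample_html)
print(record["text"])
print(record["links"])
```

The record keeps the raw HTML alongside the extracted text and links, mirroring the kinds of information described above. Information about which pages link to this one would only emerge after many pages have been captured.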
In the process of crawling a site, bots will encounter the same links repeatedly. For example, the links in the header and footer navigation typically appear on every page. Instead of recrawling the same content in the same visit, Googlebot may just note the relationship between the two pages based on that link and move on to the next unique page.
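The "note the link, skip the recrawl" idea can be sketched as a breadth-first walk that remembers which pages it has already queued. Everything here is an assumption for illustration: the `crawl` function, the `toy_site` dictionary standing in for a real website, and the choice of breadth-first order. No network requests are made.

```python
from collections import deque


def crawl(site, seeds):
    """Visit each page once, but record every link as a (source, target) pair.

    `site` is a hypothetical in-memory stand-in for the web: it maps a URL
    to the list of URLs that page links to.
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    visit_order = []
    link_graph = []
    while queue:
        url = queue.popleft()
        visit_order.append(url)
        for target in site.get(url, []):
            # Always note the relationship between the two pages...
            link_graph.append((url, target))
            # ...but only schedule the target if it has not been seen yet.
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visit_order, link_graph


# Made-up three-page site where the home page is linked from every page.
toy_site = {
    "/": ["/", "/paint", "/brushes"],
    "/paint": ["/", "/brushes"],
    "/brushes": ["/", "/paint"],
}
visit_order, link_graph = crawl(toy_site, ["/"])
print(visit_order)
print(len(link_graph))
```

Each page is fetched once even though the home page link shows up on all three pages, while all seven links are still recorded for later use.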
As bots crawl to discover information, that information is stored in an index in the search engine’s data centers. The index organizes information and tells a search engine’s algorithms where to find the relevant information when returning search results.
But an index isn’t like a dark closet that everything gets stuffed into randomly as it’s crawled. Indexation is tidy, with discovered web page information stored along with other relevant information, such as whether the content is new or an updated version, the context of the content, the linking structure within that particular website and the rest of the web, synonyms for words within the text, when the page was published, and whether it contains pictures or video.
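One common way to keep an index tidy is an inverted index: a mapping from each word to the pages that contain it, stored next to per-page metadata. The sketch below assumes that structure. The field names (`published`, `has_media`), the sample pages, and their dates are invented for illustration and are not a description of Google's actual index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build a word -> set-of-URLs mapping plus a per-page metadata table."""
    postings = defaultdict(set)
    metadata = {}
    for url, page in pages.items():
        for word in tokenize(page["text"]):
            postings[word].add(url)
        # Hypothetical metadata fields, loosely following the text above.
        metadata[url] = {
            "published": page["published"],
            "has_media": page["has_media"],
        }
    return postings, metadata


# Made-up pages and dates, for illustration only.
pages = {
    "/paint": {
        "text": "Acrylic paint sets",
        "published": "2016-03-01",
        "has_media": True,
    },
    "/brushes": {
        "text": "Paint brushes and brush sets",
        "published": "2016-04-15",
        "has_media": False,
    },
}
postings, metadata = build_index(pages)
print(sorted(postings["paint"]))
print(metadata["/paint"])
```

Looking up a word is then a single dictionary access instead of a scan over every stored page, which is the point of organizing the index this way.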
Returning Search Results
Results are displayed after you search for something in a search engine. Every web page displayed is called a search result, and the order in which the search results are displayed is known as ranking.
But once information is crawled and indexed, how does Google decide what to show in search results? The answer, of course, is a closely guarded secret.
How a search engine decides what to display is loosely referred to as its algorithm. Every search engine uses proprietary algorithms that it has designed to pull the most relevant information from its indices as quickly as possible in order to display it in a manner that its human searchers will find most useful.
For instance, Google Search Quality Senior Strategist Andrey Lipattsev recently confirmed that Google’s top three search ranking factors are content, links, and RankBrain, a machine learning artificial intelligence system. Regardless of what each search engine calls its algorithm, the basic functions of modern search engine algorithms are similar.
Content determines contextual relevance. The words on a page, combined with the context in which they are used and the pages they link to, determine how the content is stored in the index and which search queries it might answer.
Links determine authority and relevance. In addition to providing a pathway for crawling and discovering new content, links also act as authority signals. Authority is determined by measuring signals related to the relevance and quality of the pages linking into each individual page, as well as the relevance and quality of the pages to which that page links.
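As a rough illustration of links acting as authority signals, here is a simplified PageRank-style iteration, in which a page passes a share of its own score to each page it links to. It is a stand-in chosen for this sketch, not a description of how any search engine measures authority today, and it ignores link relevance and quality entirely. The `link_authority` function, the damping value of 0.85, and the three-page graph are assumptions.

```python
def link_authority(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scores for a small link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Made-up graph: pages "a" and "b" both link to "c", and "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
authority = link_authority(toy_links)
for page in sorted(authority, key=authority.get, reverse=True):
    print(page, round(authority[page], 3))
```

In this toy graph "c" ends up with the highest score because two pages link to it, and "b" ends up lowest because nothing links to it.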
Search engine algorithms combine hundreds of signals with machine learning to determine how well each page’s context and authority match the searcher’s query, and then serve up a page of search results. A page needs to be among the top seven to ten best-matching pages algorithmically, in both contextual relevance and authority, to be displayed on the first page of search results.
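Since the real algorithms are secret, the following is only a toy model of the idea that a page must score well on both relevance and authority to reach the first page. It blends two made-up scores with a weighted sum and keeps the top entries. The `rank_results` function, the weights, the scores, and the URLs are all hypothetical; real engines combine hundreds of signals, not two.

```python
def rank_results(candidates, page_size=10, relevance_weight=0.6, authority_weight=0.4):
    """Order candidates by a weighted blend of relevance and authority.

    `candidates` is a list of (url, relevance, authority) tuples with scores
    between 0 and 1. The weights are arbitrary assumptions for this sketch.
    """
    scored = [
        (relevance_weight * relevance + authority_weight * authority, url)
        for url, relevance, authority in candidates
    ]
    # Highest combined score first; ties are broken alphabetically by URL.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored[:page_size]]


# Made-up candidates, for illustration only.
candidates = [
    ("/relevant-only", 0.9, 0.1),
    ("/balanced", 0.7, 0.8),
    ("/authoritative-only", 0.1, 0.9),
    ("/weak", 0.2, 0.2),
]
first_page = rank_results(candidates, page_size=2)
print(first_page)
```

With a first page of only two slots, the balanced page outranks pages that are strong on just one of the two signals.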
As Googlebot crawls, it discovers more and more links. The image below shows a very simplistic diagram of a single, three-page crawl path on Jerry’s Artarama, a discount art supplies ecommerce site.
To summarize, a search engine performs the following activities to deliver results:
- Crawling – Process of fetching all the web pages linked to a website. This task is performed by software called a crawler or spider (Googlebot, in Google’s case).
- Indexing – Process of creating an index for all the fetched web pages and storing them in a giant database from which they can later be retrieved. Essentially, indexing identifies the words and expressions that best describe a page and assigns the page to particular keywords.
- Processing – When a search request arrives, the search engine processes it; that is, it compares the search string in the request with the indexed pages in the database.
- Calculating Relevancy – It is likely that more than one page contains the search string, so the search engine calculates the relevancy of each page in its index to the search string.
- Retrieving Results – The last step is retrieving the best-matched results. Basically, this is nothing more than displaying them in the browser.
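The last four steps in the list can be strung together in a few lines. The sketch below skips crawling and starts from page text that is already in hand. It uses a plain count of matching words as the relevancy score, which is a deliberately naive assumption. The function names and sample pages are invented for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Indexing: map each word to a count of its occurrences per page."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] += 1
    return index


def search(index, query, limit=10):
    """Processing, relevancy calculation, and retrieval for one query."""
    # Processing: break the search string into terms.
    terms = tokenize(query)
    # Calculating relevancy: add up how often each page mentions the terms.
    scores = Counter()
    for term in terms:
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Retrieving results: best-matched pages first, ties broken by URL.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:limit]]


# Made-up pages, for illustration only.
pages = {
    "/paint": "acrylic paint and oil paint",
    "/brushes": "paint brushes for acrylic work",
    "/canvas": "stretched canvas rolls",
}
index = build_index(pages)
print(search(index, "acrylic paint"))
print(search(index, "easel"))
```

A query with no matching pages simply returns an empty list, and the canvas page never appears for "acrylic paint" because it shares no words with the query.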