Aleena Byrne
WRIT 340
Professor Harly Ramsey
December 4, 2013
Bio: Aleena Byrne is a junior undergraduate student at the University of Southern California, majoring in Computer Science & Business Administration.
Keywords: Google, search engine, search algorithms, computer science, internet history
How did Google go from a name to a verb?
The most popular website in the world did not attain its popularity by a fluke. Google was in hot competition with many other search engines, but managed to get the formula right. By pairing web crawlers with its own PageRank algorithm to return more relevant results than its competitors, Google rose to the top and has grown far past a simple search engine, becoming the tech giant used daily by millions today.
Introduction
With over one billion searches daily, chances are that you, using the internet now to read this article, made a Google search today [1]. Whether you accessed the most popular website [2] from your computer, phone, or tablet and entered your search using text, voice, or image, Google always returns an array of results, and usually you do not need to look far down the page to find what you are looking for. With a simple search page and a variety of products, Google makes it effortless and free for you to search for information, upload videos, map out directions, and more. It was not through sheer luck or good timing that Google became the most popular website and search engine. Google's rise, and its hold on the distribution of and access to digital information, is largely owed to the engineers who crafted its superior search algorithms in the company's infancy.
Before Google: the history of search
Before looking at Google's path to success, it is vital to look at the technologies Google innovated upon. In 1990, Archie ("Archive" without the "v") launched as the first "search engine". Realistically, it was only an accessible, downloadable index of directory listings, where only site titles were listed and not the contents of each site. Other search engines such as Veronica and Jughead, VLib, Excite, World Wide Web Wanderer, Aliweb, and Primitive Web Search made small iterations on this model in the next few years. Infoseek, the default search engine of the then-popular browser Netscape, was in 1994 the first engine to allow webmasters to submit their sites for indexing. AltaVista, later bought by Overture and eventually by Yahoo in 2003, was the first engine to allow natural language queries [3]. While these are considered the first search engines, they were more akin to a table of contents for the internet, and failed to organize the results in a meaningful way.
Released later in 1994, WebCrawler was the first engine to have 'web crawlers' that indexed entire pages, returning more relevant sites in searches. The crawlers followed links to travel from page to page and indexed their findings (in a process discussed in detail later). This engine was so popular that it would crash during the daytime, and it was later bought by Excite, which AOL would eventually license. Following WebCrawler's success, most search engines would use a similar crawler method for collecting site information [3]. With web crawlers, the indexed information, still organized like a table of contents, reflected the contents of the pages more accurately, but the engines still failed to sort the results.
All of the engines mentioned so far were built upon a similar architecture: an index of sites related to the search keyword. The Yahoo Directory, released in 1994, differed in that its index was compiled by people. While informational sites could be added for free, and commercial sites for a fee of $300 per year, there was a long wait time. Yahoo would not work on its own search engine until 2002, and outsourced search until then. Lycos, released in July 1994, returned search results by relevancy rankings, with improved language processing and similarity detection. A year later, Ask Jeeves attempted to rank links by popularity, but was easy to manipulate [3]. Even five years after the release of Archie, most search engines were merely tables of contents, with the exception of Ask Jeeves, which failed in its attempt to sort results meaningfully.
While all these search engines were launching, Google co-founders Larry Page and Sergey Brin had been working for two years on their search engine [4].
What Google did differently
Page and Brin's search engine, originally called BackRub and renamed Google before its release in 1998, differed from all its predecessors. The engine utilized a sophisticated ranking algorithm, returning results based on popularity and reputation. Rather than look solely at the frequency of keywords and the age of a site, their PageRank algorithm also ranked a site by its reputation. This reputation was generated by counting the incoming links to the site and weighing how trustworthy the linking sites were. The engineers hypothesized that the more reliable sources reference a website, the more credible that website is likely to be. See Figure 1 for a simplified example of site rankings [4].
Figure 1: Simple example of site ranking (Source: original)
In this example, Google finds all of the web pages listed under the search input and related keywords: pages from the New York Times, the Wall Street Journal, Wikipedia, UCLA, and Joe Smith's Blog. Wikipedia, with the highest number of incoming links, has the highest rank. The Wall Street Journal ranks second with two incoming links. With no incoming links, Joe Smith's Blog ranks last. The New York Times and UCLA each have one incoming link, but because the New York Times's incoming link comes from a more reputable source (rank 2) than UCLA's (rank 5), the New York Times ranks higher.
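The ranking idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list is a hypothetical stand-in for Figure 1, chosen to match the incoming-link counts described in the text, and the formula is the simplified textbook form of PageRank with an assumed damping factor of 0.85, not Google's production system.

```python
# Hypothetical link graph standing in for Figure 1: page -> pages it links to.
# The edge list is an assumption chosen to match the in-link counts in the text.
LINKS = {
    "Wikipedia": ["Wall Street Journal"],
    "Wall Street Journal": ["New York Times"],
    "New York Times": ["Wikipedia"],
    "UCLA": ["Wikipedia"],
    "Joe Smith's Blog": ["Wikipedia", "Wall Street Journal", "UCLA"],
}


def pagerank(links, damping=0.85, iterations=100):
    """Simplified textbook PageRank, computed by repeated redistribution.

    Assumes every page has at least one outbound link.
    """
    pages = list(links)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page splits its current score evenly among the pages it links to,
            # so a link from a high-scoring page is worth more.
            share = damping * scores[page] / len(outlinks)
            for target in outlinks:
                new_scores[target] += share
        scores = new_scores
    return scores


scores = pagerank(LINKS)
ranking = sorted(scores, key=scores.get, reverse=True)
for position, page in enumerate(ranking, start=1):
    print(position, page)
```

With these assumed links, the New York Times outranks UCLA even though each has a single incoming link, because its one link comes from a higher-scoring page.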
Crawling the web
Figure 1 is also a depiction of a graph, a mathematical structure that models pairwise relations between pieces of data, used often in software engineering and studied in discrete mathematics. The web pages are the "nodes" (the data points or collections), and the lines are the "edges" that connect them. Not all graphs have arrows (those without are "undirected" graphs); the arrows in the example show which page links to which, which means the graph is "directed". Web crawlers traverse the internet using algorithms based on graph theory.
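In code, a directed graph like this is commonly stored as an adjacency list. The sketch below is illustrative only: the edges are a hypothetical reconstruction of Figure 1 (the figure itself is not reproduced here), and the helper names are invented for this example.

```python
# Hypothetical edges standing in for Figure 1 (an assumption, not the figure itself).
# Each key is a node; each list holds the nodes its outgoing edges point to.
GRAPH = {
    "Wikipedia": ["Wall Street Journal"],
    "Wall Street Journal": ["New York Times"],
    "New York Times": ["Wikipedia"],
    "UCLA": ["Wikipedia"],
    "Joe Smith's Blog": ["Wikipedia", "Wall Street Journal", "UCLA"],
}


def incoming_link_counts(graph):
    """Count how many edges point at each node."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts


def has_one_way_edge(graph):
    """True if some edge a -> b has no matching edge b -> a."""
    return any(a not in graph[b] for a, targets in graph.items() for b in targets)


counts = incoming_link_counts(GRAPH)
print(counts)
print(has_one_way_edge(GRAPH))
```

Because at least one link runs in only one direction, direction matters here, which is what makes the graph "directed".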
While Google's web crawler software, GoogleBot, uses more sophisticated algorithms, the most basic web crawler uses breadth first search, an algorithm that records one node, then visits the nodes directly connected to it, and continues outward in this way until all nodes have been reached [5]. For example, in Figure 1, a crawler might arbitrarily start at Wikipedia. It stores the content information for Wikipedia, then moves to the Wall Street Journal (following Wikipedia's only outbound link) and stores its data, then moves to the New York Times (following the Wall Street Journal's only outbound link) and stores its data. At this point the only outbound link from the New York Times leads to Wikipedia, which has already been visited, so the crawler instead jumps to either Joe Smith's Blog or UCLA.
The breadth first search algorithm is better suited for the web than the other basic graph search algorithm (depth first search) because if the crawler starts with a "good" page, it will spend most of its time on other good pages, without wandering away and becoming lost in cyberspace. See Figure 2 for a visualization of the two searches [6].
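The walk-through above can be sketched as a minimal breadth first crawler. This is a toy under stated assumptions: the link table is a hypothetical stand-in for Figure 1, "storing" a page just means recording its name, and the function name is invented for this example.

```python
from collections import deque

# Hypothetical link table standing in for Figure 1: page -> pages it links to.
LINKS = {
    "Wikipedia": ["Wall Street Journal"],
    "Wall Street Journal": ["New York Times"],
    "New York Times": ["Wikipedia"],
    "UCLA": ["Wikipedia"],
    "Joe Smith's Blog": ["Wikipedia", "Wall Street Journal", "UCLA"],
}


def crawl_breadth_first(links, start):
    """Return the order in which pages are visited, beginning at `start`."""
    visited = []
    seen = set()
    # Try the chosen start page first, then jump to any page not yet reached.
    for seed in [start] + [page for page in links if page != start]:
        if seed in seen:
            continue
        seen.add(seed)
        queue = deque([seed])
        while queue:
            page = queue.popleft()
            visited.append(page)  # stands in for storing the page's content
            for target in links[page]:
                if target not in seen:  # skip pages already visited or queued
                    seen.add(target)
                    queue.append(target)
    return visited


order = crawl_breadth_first(LINKS, "Wikipedia")
print(order)
```

The queue is what makes the search breadth first: pages are visited in the order they were discovered, so pages nearer the starting point are handled before pages farther away.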
Figure 2: The differences between graph searches (Source: adapted from [2])
Using the two algorithms on the same data set, with the red Xs denoting 'bad' (untrustworthy or spam) websites, yields very different results. The breadth first algorithm allows the crawler to choose to spend less time pursuing the left branch of 'bad' results and spend more time curating 'good' content. The depth first search algorithm, however, must traverse the entirety of the 'bad' left branch before returning to 'good' results. Neither algorithm is inherently superior to the other (depth first search is better at solving mazes and puzzles), but breadth first search is utilized more in web page indexing [6].
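A small sketch makes the difference in visiting order concrete. The tree below is a hypothetical example, not the data from Figure 2: the left branch, with names beginning "bad", plays the role of the spam pages, and the right branch holds the trustworthy ones.

```python
from collections import deque

# Hypothetical tree: page -> child pages. It has no cycles, so neither
# traversal needs to track already-visited pages.
TREE = {
    "root": ["bad", "good"],
    "bad": ["bad-1", "bad-2"],
    "bad-1": ["bad-3"],
    "bad-2": [],
    "bad-3": [],
    "good": ["good-1", "good-2"],
    "good-1": [],
    "good-2": [],
}


def breadth_first(tree, start):
    """Visit pages level by level using a first-in, first-out queue."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order


def depth_first(tree, start):
    """Follow each branch to its end using a last-in, first-out stack."""
    order, stack = [], [start]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the leftmost child is explored first.
        stack.extend(reversed(tree[node]))
    return order


bfs_order = breadth_first(TREE, "root")
dfs_order = depth_first(TREE, "root")
print(bfs_order.index("good"), dfs_order.index("good"))
```

In this toy tree, breadth first search reaches the first good page after visiting only one bad page, giving a crawler the chance to abandon the bad branch early, while depth first search reaches it only after exhausting the entire bad branch.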
What Google does today: from search to results
While the homepage has barely changed, and has even become more minimal, over the past fifteen years, the process of returning search results has grown in complexity. See Figure 3 for a visualization of the search query to result process.
Figure 3: The process of returning search results (Source: original)
Google first navigates the web using crawlers, gathering site information as they travel from page to page through links. After the visit data is stored in a repository, the sites are analyzed, sorted by their content under keywords, and run through the PageRank algorithm previously discussed. When a user makes a search, Google tries to understand the purpose of the search, correcting its spelling and including relevant synonyms. More recently, the engine has made moves to predict user input using auto-complete and to load the search page instantly as the input is typed. After understanding the search, Google pulls all the relevant sites by searching the sorted index and ranks the results. Over 200 ranking algorithms sort the most relevant, trustworthy, and popular results [7]. In the past few years, Google has begun to personalize results based on previous searches, web history, and predicted user demographics. This multi-step process, drawing on an index of about 1 trillion web items [8], happens in less than half a second and brings up thousands of results.
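The index-then-rank part of this process can be sketched with a toy inverted index. Everything here is an assumption made for illustration: the pages, their text, and their reputation scores are invented, query understanding is reduced to lowercasing, and a single score stands in for the many ranking signals a real engine combines.

```python
from collections import defaultdict

# Hypothetical crawled pages (URL -> text) and reputation scores; all values are made up.
PAGES = {
    "example.org/news": "Search engines rank pages by reputation",
    "example.org/blog": "My blog about search engines",
    "example.org/maps": "Driving directions and maps",
}
REPUTATION = {
    "example.org/news": 0.6,
    "example.org/blog": 0.1,
    "example.org/maps": 0.3,
}


def build_index(pages):
    """Map each lowercased word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(query, index, reputation):
    """Return pages containing every query word, most reputable first."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches, key=lambda url: reputation[url], reverse=True)


INDEX = build_index(PAGES)
results = search("Search Engines", INDEX, REPUTATION)
print(results)
```

Building the index ahead of time is what lets the lookup stay fast: answering a query only touches the entries for the words in that query, not the text of every page.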
More than just a ranking algorithm
From a simple search engine to a tech giant placing 73rd on the 2012 Fortune 500 ranking [9], Google has grown tremendously since introducing its PageRank algorithm in 1998. Though it played a large part in setting Google apart from its competitors, the algorithm was not the sole contributor to Google's success: the search page's simple design is noted as a branding advantage during the 90s, when searchers were often overwhelmed by cluttered web portals, and its superior AdSense ad-picking algorithm is considered its business advantage. Even after becoming the most-used search engine globally, Google continues to expand into new markets, distributing products such as YouTube, Google+, Gmail, Earth, and Chrome [10]. Additionally, the company invests millions in R&D in fields including computer processing, solar energy, and self-driving cars. While many of its products seem unrelated to the powerful search engine, they are designed to funnel business into search: its free internet browser, open-source mobile operating system, and digital media store all utilize Google's search. Warren Buffett described the best businesses as "economic castles protected by unbreachable 'moats'" [11]. Google has the most powerful search engine as its castle, and all its ancillary products as the moat protecting it.
Conclusion
Through its simple interface and quick speeds, Google embedded itself into the social consciousness and business models of the modern world. The company name is regularly used as a verb: "Google it" is synonymous with "find more information on it", and saying the phrase out loud can often take longer than making the search itself. Google heralded a new age of storing, processing, and circulating digital media and sustained itself by creating free tools and personalizing paid content. In 2012, Google handled 1,873,910,000,000 searches, an average of 5,134,000,000 searches per day [7]. While these numbers in themselves are a staggering reinforcement of Google's monopoly over human curiosity, the company's success was not a fluke. The engineering masterminds behind Google's humble origins in algorithm development earned the success they have found.
Works Cited
[1] Google, "Facts about Google and Competition." [Online]. Web. 15 Nov. 2013.
[2] Alexa, "Top Sites." Alexa Top 500 Global Sites. [Online]. Web. 15 Nov. 2013.
[3] Sullivan, Danny. "Where Are They Now? Search Engines We've Known & Loved." Search Engine Watch, 3 Mar. 2003. Web. 15 Nov. 2013.
[4] Google, "Our History in Depth." [Online]. Web. 15 Nov. 2013.
[5] Princeton, "Breadth-first Search." [Online]. Web. 15 Nov. 2013.
[6] Bing Liu. "Bing Liu's Home Page." Web Crawling. Indiana University School of Informatics in Web, n.d. Web. 15 Nov. 2013.
[7] Google. "How Search Works - The Story - Inside Search." [Online]. Web. 15 Nov. 2013.
[8] Sullivan, Danny. "Google “Knows” About 1 Trillion Web Items." Search Engine Land, n.d. Web. 15 Nov. 2013.
[9] CNNMoney. "Google - Fortune 500." [Online]. Web. 15 Nov. 2013.
[10] Google, "Google - Products." [Online]. Web. 15 Nov. 2013.
[11] Schonfeld, Erick. "Search Is Google's Castle, Everything Else Is A Moat." TechCrunch. [Online] Web. 15 Nov. 2013.