OAKLAND, California – In 2000, just two years after it was founded, Google reached a milestone that would lay the foundation for its dominance for the next 20 years: it became the world's largest search engine, with an index of more than a billion web pages.
The rest of the internet never caught up, and Google's index just kept growing. Today it is estimated to contain between 500 and 600 billion web pages.
Now that regulators around the world are looking for ways to contain Google's power – including a search monopoly case expected from state attorneys general as early as this week and the antitrust lawsuit the Justice Department filed in October – they are grappling with a company whose sheer size is almost enough to crush competitors. And those competitors are pointing investigators to that enormous index, the company's center of gravity.
"When people use a search engine with a smaller index, they don't always get the results they want. And then they go to Google and stay with Google," said Matt Wells, who started Gigablast, a search engine with an index of around five billion web pages, about 20 years ago. "A little guy like me can't keep up."
Understanding how Google's search works is key to figuring out why so many companies find it nearly impossible to compete with it and, in fact, go out of their way to cater to its needs.
Every search query gives Google more data to make its search algorithm smarter. Google has performed so many more searches than any other search engine that it has a huge advantage over competitors in understanding what consumers are looking for. And that lead only grows, because Google's market share is around 90 percent.
Google directs billions of users to locations around the Internet, and websites hungry for that traffic follow a different set of rules for the company. Websites often give Google's so-called web crawlers – computers that automatically scan the Internet and catalog web pages – better and more frequent access, enabling the company to offer a more complete and up-to-date index of what is available online.
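The crawl-and-index loop described above can be sketched in a few lines. This is a toy illustration, not how Googlebot actually works: the in-memory "web", its page contents, and the function names are all hypothetical, and a real crawler would fetch pages over HTTP, respect robots.txt, and tokenize the text it indexes.

```python
from collections import deque
from html.parser import HTMLParser

# A hypothetical three-page "web": path -> HTML body.
# In a real crawler these would be HTTP fetches.
PAGES = {
    "/": '<a href="/a">a</a> <a href="/b">b</a>',
    "/a": '<a href="/b">b</a>',
    "/b": '<a href="/">home</a>',
}

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(start):
    """Breadth-first crawl: fetch a page, index it, queue unseen links."""
    index, frontier, seen = {}, deque([start]), {start}
    while frontier:
        path = frontier.popleft()
        html = PAGES.get(path, "")
        index[path] = html          # a real engine would tokenize here
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

print(sorted(crawl("/")))  # ['/', '/a', '/b']
```

The expense the article keeps returning to lives in that loop: every page fetched costs the crawler bandwidth and the website server time, which is why scale favors whoever already has the traffic.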
Zack Maril, a software developer, grew worried about the implications of Google's dominance while working at the music site Bandcamp.
When Google said in 2018 that its crawler, Googlebot, was having trouble with one of Bandcamp's pages, Mr. Maril made fixing the problem a priority, because Google was vital to the site's traffic. When other crawlers caused problems, Bandcamp usually just blocked them.
Mr. Maril continued to explore the various ways websites opened doors to Google and closed doors to others. Last year, he sent a 20-page report entitled "Understanding Google" to a House Antitrust Subcommittee and then met with investigators to explain why other companies were unable to rebuild the Google index.
"It's largely an unchecked source of power for its monopoly," said Mr. Maril, 29, who works for another technology company that does not compete directly with Google. He asked The New York Times not to identify his employer because he was not speaking for it.
A House subcommittee report this year cited Mr. Maril's research into Google's efforts to create a real-time map of the Internet and how this had "held on to its dominance." As the Justice Department takes aim at Google's deals that make its search engine the default on billions of smartphones and computers, Mr. Maril is urging the government to also intervene over the Google index. A Google spokeswoman declined to comment.
Websites and search engines are symbiotic. Websites rely on search engines for traffic, while search engines need access to crawl the sites to provide relevant results for users. But each crawler puts a strain on a website's resources, in server and bandwidth costs, and some aggressive crawlers are so demanding that they resemble security attacks that can bring a site to a standstill.
Since having their pages crawled is costly, websites have an incentive to allow it only for search engines that send enough traffic their way. In today's world of search, that leaves Google and, in some cases, Microsoft's Bing.
Google and Microsoft are the only search engines that spend hundreds of millions of dollars a year to maintain a real-time map of the English-language Internet, on top of the billions they have spent over the years to build their indexes, according to a report this summer by Britain's Competition and Markets Authority.
Google's lead over Microsoft is about more than market share. According to the British competition authority, Google's index contains around 500 to 600 billion web pages, compared with 100 to 200 billion for Microsoft.
Other large technology companies deploy crawlers for other purposes. Facebook has a crawler for links that appear on its site or services. Amazon says its crawler helps improve Alexa, its voice-based assistant. Apple has its own crawler, Applebot, which has fueled speculation that it might build its own search engine.
But indexing has always been a challenge for companies without deep pockets.
The privacy-focused search engine DuckDuckGo decided more than a decade ago to stop crawling the entire web; it now syndicates results from Microsoft. It still crawls sites like Wikipedia to provide results for the answer boxes that appear in its results, but maintaining its own full index of the web usually does not make financial sense for the company.
"It costs more money than we can afford," said Gabriel Weinberg, chief executive of DuckDuckGo. In a written statement last year to the House Antitrust Subcommittee, the company said that "an emerging search engine launch today (and for the foreseeable future) cannot avoid the need" to turn to Microsoft or Google for search results.
When FindX started developing an alternative to Google in 2015, the Danish company set out to create its own index and offered its own algorithm to deliver individualized results.
FindX quickly ran into problems. Big website owners like Yelp and LinkedIn would not let the young search engine crawl their sites. Because of a bug in its code, FindX's computers scouring the Internet were flagged as a security risk and blocked by a group of the Internet's largest infrastructure providers. The pages it did collect were often spam or malicious websites.
"When you need to do the indexing, this is the hardest part," said Brian Schildt Laursen, a founder of FindX, which closed in 2018.
Last year, Mr. Schildt Laursen launched a new search engine, Givero, which lets users direct a portion of the company's earnings to charity. When he started Givero, he syndicated search results from Microsoft.
Most major websites are careful about who may crawl their pages. In general, Google and Microsoft get more access because they have more users, while smaller search engines have to ask for permission.
"You need the traffic to convince the sites to let you copy and crawl, but you also need the content to grow your index and increase your traffic," said Marc Al-Hames, co-managing director of Cliqz, a German search engine that shut down this year after seven years of operation. "It's a chicken-and-egg problem."
In Europe, a group called the Open Search Foundation has proposed a plan to create a common Internet index that could underpin many European search engines. Stefan Voigt, the group's chairman and founder, said it was important to have a variety of options for search results, so that not just a handful of companies decide which links people see and which they don't.
"We just can't leave this to one or two companies," said Voigt.
When Mr. Maril began investigating how websites treat Google's crawler, he downloaded 17 million robots.txt files – essentially rules of the road, published by almost every website, that specify where crawlers may go – and found many examples where Google enjoyed better access than its competitors.
ScienceDirect, a site for peer-reviewed papers, permits only Google's crawler to access links containing PDF documents. Only Google's computers get access to listings on PBS Kids. On Alibaba.com, the U.S. website of the Chinese e-commerce giant Alibaba, only Google's crawler has access to pages that list products.
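Python's standard library can illustrate the kind of differential rules Mr. Maril found in those robots.txt files. The file below is a hypothetical example – not ScienceDirect's or Alibaba's actual rules – in which Googlebot is allowed into a section that every other crawler is shut out of:

```python
from urllib import robotparser

# A hypothetical robots.txt: Googlebot may fetch product pages,
# while the catch-all "*" rule shuts out everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /products/

User-agent: *
Disallow: /products/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group and is allowed in.
print(parser.can_fetch("Googlebot", "/products/widget"))  # True
# Any other crawler falls through to the "*" group and is blocked.
print(parser.can_fetch("SmallBot", "/products/widget"))   # False
```

A well-behaved crawler checks exactly this before requesting a page, which is how a few lines of text in robots.txt translate into a structural advantage for whichever crawler the site chooses to favor.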
This year, Mr. Maril started an organization, the Knuckleheads' Club ("because only a knucklehead would take on Google"), and a website to raise awareness of Google's monopoly on web crawling.
"Google has all of this power in society," Mr. Maril said. "But I think there should be some democratic – small 'd' – control of that power."