Web Crawler

The task is to implement a web crawler that starts at a given URL and visits every URL reachable from it that shares the same hostname as the start URL. We will follow these steps:

  1. Extract the Hostname: Since we’re only interested in URLs with the same hostname, we first need to extract the hostname from the start URL.

  2. Initialize Data Structures: We’ll use a set to keep track of URLs we’ve already visited, and a list to store the URLs we’ve crawled. We’ll also enqueue the start URL as the first URL to crawl.

  3. Crawl the URLs: We’ll perform a breadth-first search (BFS) to crawl the URLs. For each URL, we’ll use the htmlParser.getUrls(url) method to obtain all linked URLs. We’ll then filter these URLs to only include those with the same hostname as the start URL, and only those we haven’t already visited.

  4. Repeat Until Done: We’ll repeat this process, visiting URLs in the order they were discovered, until we have visited all reachable URLs with the same hostname.
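
Step 1 can be sketched on its own. Since every URL in this problem begins with `http://` and contains no port, the hostname is simply the third `/`-separated segment (the example URL below is illustrative):

```python
def get_hostname(url: str) -> str:
    # "http://news.example.org/news/topic".split('/') yields
    # ['http:', '', 'news.example.org', 'news', 'topic'], so index 2
    # is the hostname.
    return url.split('/')[2]

print(get_hostname("http://news.example.org/news/topic"))  # news.example.org
print(get_hostname("http://example.org"))                  # example.org
```

Note this shortcut relies on the problem's guarantee about URL shape; for general URLs, `urllib.parse.urlsplit(url).hostname` would be the robust choice.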

Here’s the code:

from collections import deque
from typing import List

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        # Extract the hostname from a URL. All URLs in this problem use
        # "http://", so the hostname is the third '/'-separated segment.
        def get_hostname(url: str) -> str:
            return url.split('/')[2]

        # Extract the hostname from the start URL
        start_hostname = get_hostname(startUrl)

        # BFS bookkeeping: mark the start URL visited before enqueueing it
        # so no URL is ever processed twice.
        visited = set([startUrl])
        queue = deque([startUrl])
        result = []

        # Perform BFS
        while queue:
            url = queue.popleft()
            result.append(url)

            # Enqueue every linked URL that shares the start hostname
            # and has not been seen before.
            for linked_url in htmlParser.getUrls(url):
                if get_hostname(linked_url) == start_hostname and linked_url not in visited:
                    visited.add(linked_url)
                    queue.append(linked_url)

        return result

This code follows the constraints and requirements of the problem, exploring only the URLs that share the same hostname as the start URL and avoiding visiting the same URL more than once. It returns the URLs crawled in any order, as required.
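
To try the crawler outside the judge, one can stub the `HtmlParser` interface with a hypothetical mock backed by an adjacency map. The sketch below mirrors the BFS logic above as a standalone function; the link graph is made up for illustration:

```python
from collections import deque
from typing import Dict, List

class MockHtmlParser:
    """Hypothetical stand-in for the judge's HtmlParser interface."""
    def __init__(self, graph: Dict[str, List[str]]):
        self.graph = graph

    def getUrls(self, url: str) -> List[str]:
        # Return the URLs linked from `url`, or none if it has no links.
        return self.graph.get(url, [])

def crawl(startUrl: str, htmlParser: MockHtmlParser) -> List[str]:
    # Same BFS as the solution above, written as a plain function.
    get_hostname = lambda url: url.split('/')[2]
    start_hostname = get_hostname(startUrl)
    visited = {startUrl}
    queue = deque([startUrl])
    result = []
    while queue:
        url = queue.popleft()
        result.append(url)
        for linked_url in htmlParser.getUrls(url):
            if get_hostname(linked_url) == start_hostname and linked_url not in visited:
                visited.add(linked_url)
                queue.append(linked_url)
    return result

graph = {
    "http://example.org":   ["http://example.org/a", "http://other.com/x"],
    "http://example.org/a": ["http://example.org", "http://example.org/b"],
}
print(sorted(crawl("http://example.org", MockHtmlParser(graph))))
# ['http://example.org', 'http://example.org/a', 'http://example.org/b']
```

Note that `http://other.com/x` is discovered but never crawled, since its hostname differs from the start URL's.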