Web Crawler
The task is to implement a web crawler that starts at a given URL and visits all the links that share the same hostname as the start URL. We will follow these steps:
Extract the Hostname: Since we’re only interested in URLs with the same hostname, we first need to extract the hostname from the start URL.
Initialize Data Structures: We’ll use a set to keep track of URLs we’ve already visited, a queue of URLs waiting to be crawled, and a list to store the URLs we’ve crawled. The start URL is marked as visited and enqueued as the first URL to crawl.
Crawl the URLs: We’ll perform a breadth-first search (BFS) to crawl the URLs. For each URL, we’ll use the htmlParser.getUrls(url) method to obtain all linked URLs. We’ll then filter these URLs to only include those with the same hostname as the start URL, and only those we haven’t already visited.
Repeat Until Done: We’ll repeat this process, visiting URLs in the order they were discovered, until we have visited all reachable URLs with the same hostname.
Here’s the code:
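A minimal sketch of the steps above in Python. It assumes the parser object exposes getUrls(url) returning a list of URL strings, as described in the text. The function name crawl, the FakeHtmlParser class, and the sample URLs are hypothetical and exist only to make the example runnable; hostname extraction uses the standard library's urllib.parse.urlsplit, which is one reasonable choice rather than the only one.

```python
from collections import deque
from typing import Dict, List
from urllib.parse import urlsplit


class FakeHtmlParser:
    """Hypothetical stand-in for htmlParser, backed by a dict of page -> links."""

    def __init__(self, links: Dict[str, List[str]]) -> None:
        self._links = links

    def getUrls(self, url: str) -> List[str]:
        return self._links.get(url, [])


def crawl(start_url: str, html_parser) -> List[str]:
    # Step 1: extract the hostname of the start URL.
    hostname = urlsplit(start_url).hostname

    # Step 2: visited set, result list, and a queue seeded with the start URL.
    visited = {start_url}
    crawled: List[str] = []
    queue = deque([start_url])

    # Steps 3 and 4: BFS until no unvisited same-host URL remains.
    while queue:
        url = queue.popleft()
        crawled.append(url)
        for link in html_parser.getUrls(url):
            # Keep only unseen links on the same hostname.
            if link not in visited and urlsplit(link).hostname == hostname:
                visited.add(link)
                queue.append(link)
    return crawled


# Example with made-up pages: the google.com link is skipped.
sample_links = {
    "http://news.yahoo.com/news/topics/": [
        "http://news.yahoo.com/news",
        "http://news.google.com",
    ],
    "http://news.yahoo.com/news": ["http://news.yahoo.com/us"],
}
print(crawl("http://news.yahoo.com/news/topics/", FakeHtmlParser(sample_links)))
```

Marking a URL as visited when it is enqueued, rather than when it is dequeued, keeps the same URL from entering the queue twice when several pages link to it.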
This code follows the constraints and requirements of the problem, exploring only the URLs that share the same hostname as the start URL and avoiding visiting the same URL more than once. It returns the URLs crawled in any order, as required.