Site Map Generator

A web page parser that generates a Site Map for a given domain.

The Site Map Generator is a lightweight PHP application that crawls the web pages of a given domain and collects every valid URL within that domain.

The application then writes an external XML file containing the elements of a standard site map, with the collected URLs placed inside them.

The PHP script fetches each page with the file_get_contents() function and loads the result into a DOMDocument object. All 'href' attributes are analyzed, and every valid URL is added to a "found_urls" array. Each URL in this array is parsed in turn and then recorded in a "crawled_urls" associative array, which keeps track of the URLs that have already been parsed.
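The link-extraction step might look roughly like the sketch below. It is not the application's actual code: the function name extract_domain_links() is hypothetical, the sketch only inspects <a> elements, and it assumes that "valid" means "an http(s) URL on the same host as the base URL". Root-relative and protocol-relative links are resolved; other relative paths are simply skipped to keep the example short.

```php
<?php
// Hypothetical helper (not from the original project): returns the unique
// same-host URLs referenced by <a href="..."> elements in $html.
function extract_domain_links(string $html, string $baseUrl): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }

    $host   = parse_url($baseUrl, PHP_URL_HOST);
    $scheme = parse_url($baseUrl, PHP_URL_SCHEME) ?: 'http';

    // Real-world HTML is often malformed; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // missing href or same-page fragment
        }
        if (substr($href, 0, 2) === '//') {
            $href = $scheme . ':' . $href;           // protocol-relative link
        } elseif ($href[0] === '/') {
            $href = $scheme . '://' . $host . $href; // root-relative link
        }
        // Keep only URLs whose host matches the base URL's host.
        if (parse_url($href, PHP_URL_HOST) === $host) {
            $links[$href] = true; // array keys de-duplicate
        }
    }
    return array_keys($links);
}
```

In the application itself, the $html argument would be the string returned by file_get_contents() for the page being parsed.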

As URLs are found, they are pushed onto the end of the "found_urls" array. Because the array elements are processed one by one from the front, the algorithm effectively treats the array as a queue, so the URLs in a domain are searched in breadth-first order. The algorithm continues until the last index of "found_urls" is reached and no new URLs have been found.
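The queue behavior described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's real code: crawl_breadth_first() and the $fetchLinks callable are hypothetical names, and $fetchLinks stands in for "fetch the page and return its in-domain links" so the loop can be shown without network access.

```php
<?php
// Hypothetical sketch: breadth-first traversal using an array as a queue.
// $fetchLinks is an assumed callable that maps a URL to the list of
// in-domain URLs found on that page.
function crawl_breadth_first(string $startUrl, callable $fetchLinks): array
{
    $found_urls   = [$startUrl]; // queue: new URLs are appended at the end
    $crawled_urls = [];          // associative array: url => true once parsed

    // count() is re-evaluated each pass, so URLs appended inside the loop
    // are visited later, in the order they were discovered.
    for ($i = 0; $i < count($found_urls); $i++) {
        $url = $found_urls[$i];
        if (isset($crawled_urls[$url])) {
            continue; // queued more than once; already parsed
        }
        $crawled_urls[$url] = true;

        foreach ($fetchLinks($url) as $link) {
            if (!isset($crawled_urls[$link])) {
                $found_urls[] = $link;
            }
        }
    }
    return array_keys($crawled_urls);
}
```

Because "crawled_urls" is keyed by URL, the already-parsed check is a constant-time isset() lookup instead of a scan of the whole array.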

After all URLs in "found_urls" have been parsed, an external XML file is generated and the URLs in the "crawled_urls" array are written inside the standard XML site map tags. This file can then be uploaded for search engines to use.
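A minimal sketch of the XML generation step is shown below. The function name build_sitemap_xml() is hypothetical, and the sketch emits only the required <urlset>, <url>, and <loc> elements of the sitemaps.org protocol; whether the application also writes optional tags such as <lastmod> is not stated here.

```php
<?php
// Hypothetical sketch: builds a minimal site map document from a list of URLs.
function build_sitemap_xml(array $urls): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true; // indent the output for readability

    $urlset = $doc->createElement('urlset');
    $urlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    $doc->appendChild($urlset);

    foreach ($urls as $url) {
        $entry = $doc->createElement('url');
        $loc   = $doc->createElement('loc');
        // A text node escapes characters such as "&" when serialized.
        $loc->appendChild($doc->createTextNode($url));
        $entry->appendChild($loc);
        $urlset->appendChild($entry);
    }
    return $doc->saveXML();
}
```

The returned string could then be written to disk, for example with file_put_contents('sitemap.xml', build_sitemap_xml($urls)), where the file name and the $urls variable are assumptions for illustration.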