Hi Dollars

One of the most viral pieces of news circulating on the internet is the Python web crawler: point it at a website and it follows every link on the site, downloading all the data for you, i.e. the whole website.

How to do it?

Just follow the steps.

1. Install Python on your PC.

2. Type in the code given below, or just copy and paste it.

import sys, threading, queue, re, time, urllib.request, urllib.parse

dupcheck = set()
q = queue.Queue(100)
# urllib needs a scheme, so add http:// if the argument is a bare hostname
q.put(sys.argv[1] if "://" in sys.argv[1] else "http://" + sys.argv[1])

def queueURLs(html, origLink):
    # Pull every href out of the page and resolve it against the page's URL
    for url in re.findall(r'''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        link = urllib.parse.urljoin(origLink, url.split("#", 1)[0])
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()  # keep the seen-URL set from growing without bound
        q.put(link)

def getHTML(link):
    try:
        html = urllib.request.urlopen(link).read().decode("utf-8", "replace")
        with open(str(time.time()) + ".html", "w") as f:
            f.write("<!-- %s -->\n%s" % (link, html))  # save the page, tagged with its URL
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass  # skip links that fail to download

while True:
    threading.Thread(target=getHTML, args=(q.get(),), daemon=True).start()
    time.sleep(0.5)
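The heart of the crawler is the step that scans each page for links. Here is a quick standalone check of that logic, using the same href regex on a hand-written snippet of HTML (no network needed); `urllib.parse.urljoin` from the standard library is used here to turn relative links into absolute ones:

```python
import re
import urllib.parse

# Sample page with one relative and one absolute link (made-up URLs)
html = '<a href="/about">About</a> <a href="http://example.com/page#top">Page</a>'

links = []
for url in re.findall(r'''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
    # Drop any #fragment and resolve against the page the link was found on
    links.append(urllib.parse.urljoin("http://example.com/", url.split("#", 1)[0]))

print(links)  # ['http://example.com/about', 'http://example.com/page']
```

Both links come out as absolute URLs ready to be queued, and the fragment (#top) is stripped so the same page is not fetched twice under different anchors.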

3. Save the file with the '.py' extension, say, webcrawler.py

4. Execute the following command:

$ python webcrawler.py www.websitename.com

where 'websitename' is replaced by the website you want crawled/downloaded.
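One caveat: urllib can only open URLs that carry a scheme, so a bare hostname like the one in the command above needs an http:// prefix before it can be fetched. A tiny helper (the name with_scheme is just for illustration) shows the normalization:

```python
def with_scheme(target):
    # Prepend http:// when the argument lacks a scheme (e.g. "www.websitename.com")
    return target if "://" in target else "http://" + target

print(with_scheme("www.websitename.com"))  # http://www.websitename.com
print(with_scheme("https://example.com"))  # already has a scheme, unchanged
```

Passing an already-complete URL like https://example.com goes through untouched, so you can hand the script either form.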

