Python 2 provides the urllib module and its more capable sibling urllib2 for downloading files from a given URL. The following script shows three different ways to fetch data from the internet.
#!/usr/bin/env python
url = "https://steakovercooked.com/robots.txt"
import urllib2
robots = urllib2.urlopen(url)
output = open("c:\\robots1.txt","wb")
output.write(robots.read())
output.close()
import urllib
urllib.urlretrieve(url, "c:\\robots2.txt")
# A more elaborate way (adapted from Stack Overflow):
# read the response in blocks and print the progress
file_name = url.split("/")[-1]
u = urllib2.urlopen(url)
f = open("c:\\robots3.txt", "wb")
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
file_size_dl = 0
block_sz = 16384
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8) * (len(status) + 1)
    print status,
f.close()
The script above downloads the file robots.txt from my website and saves three identical copies on the C:\ drive. However, it contains no exception handling: if you try to download a file that does not exist, or from a non-existent domain, an exception is raised. For example:
Traceback (most recent call last):
  File "C:\Python27\test.py", line 6, in <module>
    robots = urllib2.urlopen(url)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 400, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 418, in _open
    "_open", req)
  File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python27\lib\urllib2.py", line 1177, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno ...] getaddrinfo failed>
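The missing exception handling mentioned above could be added roughly as follows. This is only a minimal sketch: safe_download is a hypothetical helper name (it is not part of urllib or urllib2), the 10-second timeout is an arbitrary assumption, and the import fallback is there so that the same code also runs on Python 3, where these names live in urllib.request and urllib.error. It only covers errors raised while opening the URL, not failures in the middle of a transfer.

```python
from __future__ import print_function

try:  # Python 2, as used in this post
    from urllib2 import urlopen, URLError, HTTPError
except ImportError:  # Python 3 moved these names
    from urllib.request import urlopen
    from urllib.error import URLError, HTTPError


def safe_download(url, path, timeout=10):
    """Save url to path; return True on success, False on a handled error.

    safe_download is a hypothetical helper, not a library function.
    """
    try:
        response = urlopen(url, timeout=timeout)
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error %d for %s" % (e.code, url))
        return False
    except URLError as e:
        # No usable answer at all, e.g. the domain does not resolve.
        print("Failed to reach %s: %s" % (url, e.reason))
        return False
    try:
        with open(path, "wb") as output:
            output.write(response.read())
    finally:
        response.close()
    return True
```

With this in place, a call such as safe_download(url, "c:\\robots1.txt") prints a short message and returns False for a bad URL instead of ending the script with a traceback.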
Downloading files is an essential technique for processing internet data, for example when writing spiders (web crawlers).
–EOF (The Ultimate Computing & Technology Blog) —