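Using urllib2 from the Python Standard Library, with a Website Status Check Example

Published: 2015-11-26 13:18 | Category: Python | Views: 3,752

The urllib2 module in the Python 2.7 standard library offers a very simple interface in the form of the urlopen function. We can use this function to fetch the content of a website, for example when writing a web crawler. urllib2 also provides a more elaborate interface for more complex situations such as basic authentication, cookies, and proxies.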

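Basic usage

The urlopen function accepts either a URL string or a Request object.

The object returned on success has these main methods.

read(): returns the whole response body, which for a web page is its HTML source.

info(): returns the meta-information of the response, such as the headers sent by the server.

geturl(): returns the URL that was actually opened. urllib2 follows redirects for you, so comparing this value with the URL you requested usually tells you whether the site redirects.

getcode(): returns the HTTP status code.

1. Opening a URL directly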
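
As a rough illustration of these four methods, the sketch below opens a local temporary file through a file: URL so that it runs without network access. The helper name describe_response and the temporary file are made up for this example, and the import falls back to urllib.request (the Python 3 successor of urllib2) so the snippet can run on either version; neither is part of the original post.

```python
import os
import tempfile

try:
    import urllib2                        # Python 2.7, as used in this post
    from urllib import pathname2url
except ImportError:
    import urllib.request as urllib2      # assumption: Python 3 fallback
    from urllib.request import pathname2url


def describe_response(url):
    """Open url and collect what the four response methods return."""
    response = urllib2.urlopen(url)
    try:
        return {
            'body': response.read(),         # the full content
            'headers': response.info(),      # header-like meta-information
            'final_url': response.geturl(),  # URL that was actually opened
            'status': response.getcode(),    # only meaningful for HTTP responses
        }
    finally:
        response.close()


# Hypothetical test page written to a temporary file.
fd, path = tempfile.mkstemp(suffix='.html')
os.write(fd, b'<html>hello</html>')
os.close(fd)

summary = describe_response('file:' + pathname2url(path))
os.remove(path)
```

With an http:// or https:// address the same helper would also report the status code and the final address after any redirects.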

import urllib2
response = urllib2.urlopen('https://zhangnq.com/')
html= response.read()
print html

2. Using a Request object

import urllib2
url='https://zhangnq.com/'
req=urllib2.Request(url)
response=urllib2.urlopen(req,timeout=30)
html= response.read()
print html

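Here urlopen is given a timeout of 30 seconds.

3. Passing the data parameter

If you need to send data to a URL, for example to log a user in, HTTP usually does this with a POST request. A browser does this for you when you submit an HTML form. To POST arbitrary data from a Python program, first encode the data in the standard form format, then pass it as the data argument to the Request object, and finally submit the request. The encoding is done with urlencode from the urllib module.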
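
A Request object can also be prepared and inspected before anything is sent. The following is a minimal sketch, not the only way to do it: build_request and the User-Agent string are hypothetical, and the import falls back to urllib.request so that the snippet also runs on Python 3. The network call itself is left commented out.

```python
try:
    import urllib2                        # Python 2.7, as used in this post
except ImportError:
    import urllib.request as urllib2      # assumption: Python 3 fallback


def build_request(url, user_agent):
    """Create a Request and attach a User-Agent header to it."""
    req = urllib2.Request(url)
    req.add_header('User-Agent', user_agent)
    return req


req = build_request('https://zhangnq.com/', 'example-client/0.1')

method = req.get_method()              # 'GET', because no data is attached
full_url = req.get_full_url()          # the URL the request will be sent to
agent = req.get_header('User-agent')   # header names are stored capitalized

# Sending it would look like this (needs network access):
# response = urllib2.urlopen(req, timeout=30)
```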

import urllib 
import urllib2 
url = 'https://zhangnq.com/'
values = {'username' : 'sijitao',
        'password' : 'passw0rd'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()
print html
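
What switches the request from GET to POST is simply the presence of the data argument. The sketch below checks this without sending anything. The field values are the placeholders from the listing above, a list of pairs is used so the encoded order is predictable, and the fallback import for Python 3 is an addition for portability.

```python
try:
    import urllib2                        # Python 2.7, as used in this post
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2      # assumption: Python 3 fallback
    from urllib.parse import urlencode

fields = [('username', 'sijitao'), ('password', 'passw0rd')]

# urlencode produces "key=value&key=value"; encoding to bytes keeps it
# usable on Python 3 and is harmless on Python 2.
data = urlencode(fields).encode('ascii')

get_req = urllib2.Request('https://zhangnq.com/')         # no data: GET
post_req = urllib2.Request('https://zhangnq.com/', data)  # with data: POST

print(get_req.get_method())
print(post_req.get_method())
```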

To send the data with a GET request instead, append the encoded data to the URL as a query string and submit that.

import urllib
import urllib2
url="http://www.baidu.com/"
data={}
data['wd']='site:blog.nbhao.org'
url_values=urllib.urlencode(data)
furl=url+'s?'+url_values
req=urllib2.Request(furl)
response = urllib2.urlopen(req,timeout=5)
html = response.read()
print html
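
The URL building step can be wrapped in a small helper. This is only a sketch: build_get_url is a hypothetical name, and the second query value is made up to show how spaces are encoded.

```python
try:
    from urllib import urlencode          # Python 2.7, as used in this post
except ImportError:
    from urllib.parse import urlencode    # assumption: Python 3 fallback


def build_get_url(base_url, params):
    """Append params to base_url as an encoded query string."""
    separator = '&' if '?' in base_url else '?'
    return base_url + separator + urlencode(params)


site_search = build_get_url('http://www.baidu.com/s',
                            [('wd', 'site:blog.nbhao.org')])
phrase_search = build_get_url('http://www.baidu.com/s',
                              [('wd', 'python urllib2')])

print(site_search)    # the colon is percent-encoded as %3A
print(phrase_search)  # the space is encoded as +
```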

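4. Exception handling

The exception to catch is usually URLError. When there is no network connection or the server does not exist, urlopen raises URLError, which carries a "reason" attribute. When the server answers with an HTTP error status, for example because the page does not exist, urlopen raises HTTPError, a subclass of URLError that also carries a "code" attribute holding the HTTP status code. Catching URLError therefore handles both cases, and checking for "code" first tells them apart.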

import urllib2
url='https://zhangnq.com/'
req=urllib2.Request(url)
response = None
try:
    response = urllib2.urlopen(req,timeout=5)
    print response.getcode()
    print response.geturl()
    print response.info()
    #print response.read()
except urllib2.URLError as e:
    print e
    if hasattr(e, 'code'):
        print 'Error code:',e.code
        #print e.read()
        print e.geturl()
        print e.info()
    elif hasattr(e, 'reason'):
        print 'Reason:',e.reason
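except Exception as e:
    # Report anything else (for example a socket timeout) instead of silently ignoring it.
    print 'Unexpected error:',e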
finally:
    if response:
        response.close()
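
The way the two error types are told apart can be tried without a network connection by constructing the exceptions by hand. This is a sketch under that assumption: describe_error, the example.invalid URL, and the error messages are made up, and the import falls back to urllib.request on Python 3.

```python
try:
    import urllib2                        # Python 2.7, as used in this post
except ImportError:
    import urllib.request as urllib2      # assumption: Python 3 fallback


def describe_error(error):
    """Summarize a URLError, checking for 'code' first as in the listing above."""
    if hasattr(error, 'code'):
        # HTTPError: the server answered, but with an error status.
        return 'HTTP error %d' % error.code
    # Plain URLError: the server could not be reached at all.
    return 'Connection failed: %s' % (error.reason,)


# Hand-built exceptions stand in for what urlopen would raise.
http_error = urllib2.HTTPError('http://example.invalid/', 404,
                               'Not Found', None, None)
url_error = urllib2.URLError('no route to host')

print(describe_error(http_error))
print(describe_error(url_error))
```

An isinstance(error, urllib2.HTTPError) test would work equally well in place of hasattr.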

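That covers the basic usage of urllib2.

Example: checking website status

The background: a program needs to decide whether each site listed on a website directory (http://www.hostunion.net/) is still working. The basic usage and exception handling of urllib2 described above are generally enough to solve this. The example uses the pickle module to keep a failure count for each site between runs; once a site has failed 5 times in a row, it is removed from the directory (the code does this by setting its status to 1). The code follows.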

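import pickle
import urllib2

# sqlExecute is the author's own database helper; its definition is not shown in this post.
def webCheck(timeout=60):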
    result=sqlExecute("select id,url from websites where status = 3")
    webCheck_pkl='data/webCheck.pkl'
    try:
        f=file(webCheck_pkl,'rb')
        web_dict=pickle.load(f)
        f.close()
    except:
        web_dict={}
    l=[]
    if result:
        for row in result:
            url='http://'+row['url']
            req=urllib2.Request(url)
            req.add_header('User-Agent',"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36")
            response=None
            try:
                response=urllib2.urlopen(req,timeout=timeout)
                #print "Url: %s\t%s" % (url,response.getcode())
                try:
                    web_dict[row['id']]['result_code']
                    web_dict[row['id']]['fail_cnt']
                except:
                    web_dict[row['id']]={}
                    web_dict[row['id']]['fail_cnt']=0
                web_dict[row['id']]['result_code']=response.getcode()
                web_dict[row['id']]['fail_cnt']=0
            except urllib2.URLError as e:
                if hasattr(e, 'code'):
                    print "Url: %s\t%s" % (url,e.code)
                    try:
                        web_dict[row['id']]['result_code']
                        web_dict[row['id']]['fail_cnt']
                    except:
                        web_dict[row['id']]={}
                        web_dict[row['id']]['fail_cnt']=0
                    web_dict[row['id']]['result_code']=e.code
                    web_dict[row['id']]['fail_cnt']=web_dict[row['id']]['fail_cnt']+1
                    if web_dict[row['id']]['fail_cnt']>=5:
                        l.append(row['id'])
                elif hasattr(e, 'reason'):
                    print "Url: %s\t%s" % (url,'error')
                    try:
                        web_dict[row['id']]['result_code']
                        web_dict[row['id']]['fail_cnt']
                    except:
                        web_dict[row['id']]={}
                        web_dict[row['id']]['fail_cnt']=0
                    web_dict[row['id']]['result_code']=e.reason
                    web_dict[row['id']]['fail_cnt']=web_dict[row['id']]['fail_cnt']+1
                    if web_dict[row['id']]['fail_cnt']>=5:
                        l.append(row['id'])
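            except Exception as e:
                # Report unexpected errors (for example a socket timeout) instead of silently ignoring them.
                print "Url: %s\t%s" % (url,e)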
            finally:
                if response:
                    response.close()
    for key in l:
        sql='update websites set status=1 where id=%s' % key
        sqlExecute(sql)
    #dump
    f=file(webCheck_pkl,'wb')
    pickle.dump(web_dict,f)
    f.close()
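
The bookkeeping in webCheck repeats the same few lines in three branches. As a sketch of that logic only, with the database and network parts left out, it could be factored like this. The names load_state, save_state, and record_result, and the site id 42, are hypothetical.

```python
import os
import pickle
import tempfile


def load_state(path):
    """Return the saved failure counters, or an empty dict on the first run."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (IOError, OSError, EOFError, pickle.UnpicklingError):
        return {}


def save_state(path, state):
    """Persist the failure counters for the next run."""
    with open(path, 'wb') as f:
        pickle.dump(state, f)


def record_result(state, site_id, result_code, failed, max_failures=5):
    """Update one site's counters; return True once it should be removed."""
    entry = state.setdefault(site_id, {'fail_cnt': 0})
    entry['result_code'] = result_code
    # A success resets the counter, so only consecutive failures add up.
    entry['fail_cnt'] = entry['fail_cnt'] + 1 if failed else 0
    return entry['fail_cnt'] >= max_failures


state_path = os.path.join(tempfile.mkdtemp(), 'webCheck.pkl')
state = load_state(state_path)

# Five failed checks in a row for a made-up site id.
flags = [record_result(state, 42, 500, failed=True) for _ in range(5)]
save_state(state_path, state)

reloaded = load_state(state_path)
recovered = record_result(reloaded, 42, 200, failed=False)
```

Only the fifth consecutive failure flags the site, and a single successful check afterwards resets its counter.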

Reference: http://www.pythontab.com/html/2014/pythonhexinbiancheng_1128/928.html

Permalink: https://www.sijitao.net/2249.html