HTTP Proxy HTTP代理
http代理的原理: 客户端用代理的ip和port替换服务器的ip和地址建立socket连接; 将HTTP请求发送至代理。 代理收到http请求,根据http请求的地址与服务器建立连接, 将客户端请求转至服务器,将服务器应答转至客户端, 客户端除了负责用代理的ip和port,代替服务器的ip和port建立socket, 还必须补全Request line(请求行)中的 Path-to-resoure(资源路)为全路径, 根据Request header中的Host字段补全。 删除请求头中的Proxy-Connection字段 代理负责客户端和服务器之间的请求和应答转发。
urllib2源码
OpenerDirector 为处理http协议的类,包含三类对象
- 第一类是负责处理请求数据的对象存储在process_request
- 第二类是负责处理应答数据的对象储存在process_response
- 第三类是通讯相关对象类存储在handle_open
buildopener 函数创建OpenerDirector对象并通过addhandler()将筛选的handler加入 对象至OpenDirector
default_classes = [ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor] # 所有基础模块 if hasattr(httplib, 'HTTPS'): default_classes.append(HTTPSHandler) skip = set() for klass in default_classes: for check in handlers: if isclass(check): if issubclass(check, klass): skip.add(klass) #有用户自定义的基础模块的继承类则去除基础模块 elif isinstance(check, klass): skip.add(klass) for klass in skip: default_classes.remove(klass) for klass in default_classes: opener.add_handler(klass()) for h in handlers: if isclass(h): h = h() opener.add_handler(h)
OpenerDirector的三类对象通过其函数名区分。
protocol_request 如 http_request属于process_request; protocol_response属于process_response; protocol_open属于handle_open
def open(self, fullurl, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT): # pre-process request meth_name = protocol+"_request" for processor in self.process_request.get(protocol, []): meth = getattr(processor, meth_name) req = meth(req) response = self._open(req, data) # post-process response meth_name = protocol+"_response" for processor in self.process_response.get(protocol, []): meth = getattr(processor, meth_name) response = meth(req, response) return response
def _open(self, req, data=None): protocol = req.get_type() result = self._call_chain(self.handle_open, protocol, protocol + '_open', req)
def _call_chain(self, chain, kind, meth_name, *args): # Handlers raise an exception if no one else should try to handle # the request, or return None if they can't but another handler # could. Otherwise, they return the response. handlers = chain.get(kind, ()) for handler in handlers: func = getattr(handler, meth_name) result = func(*args) if result is not None: return result
urllib2 代理相关
Request
- self.host为设置为代理地址
- self. __ original 为真正地址
ProxyHandler
初始化时,会增加protocolopen方法,该方法调用proxyopen进行设置代理
for type, url in proxies.items(): setattr(self, '%s_open' % type, lambda r, proxy=url, type=type, meth=self.proxy_open: \ meth(r, proxy, type))
注意设置代理时,代理模块会读取系统的代理设置
def proxy_open(self, req, proxy, type): if req.host and proxy_bypass(req.host): #注意:proxy_bypass return None
proxy_bypass 位于urllib模块中,会读取保存在操作系统中的代理设置,windows在注册表 中
internetSettings = _winreg.OpenKey(_winreg.HKEY_CURRENT_USER, r'Software\Microsoft\Windows\CurrentVersion\Internet Settings') proxyEnable = _winreg.QueryValueEx(internetSettings, 'ProxyEnable')[0] if proxyEnable: # Returned as Unicode but problems if not converted to ASCII proxyServer = str(_winreg.QueryValueEx(internetSettings, 'ProxyServer')[0])
标签:
python
日期: 2014-11-21 17:30:06, 10 years and 56 days ago