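I picked up a bit of web scraping over the summer, and putting it to use is the fastest way to learn. I could not find any material online about scraping the Lanzhou University academic affairs site, so I have written up my own notes here as a starting point for others.
Python is a convenient choice, and it has frameworks for this kind of work.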
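## Software installation
- Anaconda (avoids package installation errors)
- Chrome DevTools, opened with F12 (or any other packet capture tool)

Install both by following their usual setup steps.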
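## Steps
### Open the academic affairs system
Open the Lanzhou University Academic Affairs Office site and inspect the page source.
(Screenshot: captcha)
The address highlighted in the screenshot is the captcha URL.
### Capture the login request
Use the Network tab of Chrome's built-in DevTools (F12). Enter the account, password, and captcha, then click log in.
After a successful login, look at the URL the login form was submitted to: http://jwk.lzu.edu.cn/academic/j_acegi_security_check. The account, password, and captcha are all sent to this address.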
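The captcha image and the login request presumably have to share one session, which is why the full script at the end routes everything through a cookie-aware opener. A minimal sketch of that setup follows. The helper names `make_opener` and `save_captcha` are made up for illustration; the captcha URL is the one used in the full script.

```python
import http.cookiejar
import urllib.request

CAPTCHA_URL = 'http://jwk.lzu.edu.cn/academic/getCaptcha.do'


def make_opener():
    """Return an opener that stores cookies, together with its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def save_captcha(opener, path='code.jpg', url=CAPTCHA_URL):
    """Download the captcha through `opener` and save it to `path`."""
    with opener.open(url) as response, open(path, 'wb') as out:
        out.write(response.read())
    return path
```

Calling `save_captcha(opener)` and later posting the login form through the same `opener` keeps both requests in one session.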
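### Build the login form
```python
import urllib.parse
import urllib.request

# username, password, img_code and post_url are defined in the full script below.
LoginData = {
    "j_username": username,
    "j_password": password,
    "j_captcha": img_code,
    "button1": "(unable to decode value)"
}
# Passing a data argument turns this into a POST request.
login_req = urllib.request.Request(post_url, urllib.parse.urlencode(LoginData).encode('utf-8'))
```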
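To see what the form looks like on the wire, the request can be built without sending it. This is only a sketch: `build_login_request` is a hypothetical helper, the credentials are placeholders, and the `button1` field is left out to keep the output short.

```python
import urllib.parse
import urllib.request

POST_URL = 'http://jwk.lzu.edu.cn/academic/j_acegi_security_check'


def build_login_request(username, password, captcha):
    """Build (but do not send) the login POST request."""
    form = {
        'j_username': username,
        'j_password': password,
        'j_captcha': captcha,
    }
    body = urllib.parse.urlencode(form).encode('utf-8')
    return urllib.request.Request(POST_URL, data=body)


req = build_login_request('student_id', 'secret', 'ab12')  # placeholder values
print(req.get_method())          # POST, because the request carries a body
print(req.data.decode('utf-8'))  # j_username=student_id&j_password=secret&j_captcha=ab12
```

The printed body should match the form data that the Network tab shows for the captured login request.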
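### Find the URL that lists all grades
(Screenshot: grades for all courses)
### Inspect the source of the grades page
Locate where the key information sits in the page so that it can be extracted. The fields we want are '年' (year), '学期' (term), '课程名' (course name), '成绩' (grade), '学分' (credits), and '选课属性' (course selection type).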
Find these positions in the page source and write the values out with the csv package.
```python
import csv
from bs4 import BeautifulSoup

# opener is the cookie-aware opener used to log in; gpa_url is the grades page.
gpa_response = opener.open(gpa_url)
gpa_item = BeautifulSoup(gpa_response.read(), 'lxml').find('table', class_='datalist')

with open('table.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['年', '学期', '课程名', '成绩', '学分', '选课属性'])
    for row in gpa_item.find_all('tr'):
        cells = row.find_all('td')
        # Only rows with exactly 18 cells are treated as grade rows.
        if len(cells) == 18:
            year = cells[0].get_text()
            terms = cells[1].get_text()
            classname = cells[3].get_text()
            score = cells[8].get_text()
            credit = cells[9].get_text()
            attributes = cells[12].get_text()
            writer.writerow([year, terms, classname, score, credit, attributes])
```
This produces a CSV file, table.csv, containing the fields extracted above.
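The column picking can be tried without logging in by feeding it plain lists of cell text. The sketch below uses only the standard library. `FIELDS` and `rows_to_csv` are names invented here, the sample row is made up, and the column indexes are the ones from the snippet above. The `strip()` call is an addition to trim stray whitespace.

```python
import csv
import io

# (header, index of the cell inside an 18-cell grade row)
FIELDS = [('年', 0), ('学期', 1), ('课程名', 3), ('成绩', 8), ('学分', 9), ('选课属性', 12)]


def rows_to_csv(rows, out):
    """Write the selected columns of every 18-cell row to the file-like `out`."""
    writer = csv.writer(out)
    writer.writerow([name for name, _ in FIELDS])
    for cells in rows:
        if len(cells) != 18:  # not a grade row, skip it
            continue
        writer.writerow([cells[i].strip() for _, i in FIELDS])


sample = [str(i) for i in range(18)]  # made-up row: each cell holds its own index
buf = io.StringIO()
rows_to_csv([['not', 'a', 'grade', 'row'], sample], buf)
print(buf.getvalue())
```

With the made-up row, the second line of the output is `0,1,3,8,9,12`, which makes it easy to confirm that the right cells are being picked.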
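### Skipping captcha recognition
The script still has gaps: it does not recognize the captcha automatically, and it does not report a failed login. As it stands, a failed login simply raises a lookup error.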
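One possible way to close the second gap is to check the result of the table lookup before using it. BeautifulSoup's `find` returns `None` when nothing matches, which is presumably what happens when the login has failed. The `LoginError` class and the `require_table` helper below are hypothetical names, not part of the original script.

```python
class LoginError(Exception):
    """Raised when the grades table is missing, which likely means the login failed."""


def require_table(table):
    """Return `table` unchanged, or raise LoginError if the lookup found nothing."""
    if table is None:
        raise LoginError('grades table not found; check the account, password and captcha')
    return table


# Possible use inside login() in the full script below:
# gpa_item = require_table(
#     BeautifulSoup(gpa_response.read(), 'lxml').find('table', class_='datalist'))
```

This turns the lookup error into a message that points at the likely cause.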
## Source code
```python
import csv
import http.cookiejar
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

Img_URL = 'http://jwk.lzu.edu.cn/academic/getCaptcha.do'
Login_URL = 'http://jwk.lzu.edu.cn/academic/index.jsp'
post_url = 'http://jwk.lzu.edu.cn/academic/j_acegi_security_check'
person_url = 'http://jwk.lzu.edu.cn/academic/showPersonalInfo.do'
gpa_url = 'http://jwk.lzu.edu.cn/academic/manager/score/studentOwnScore.do?groupId=&moduleId=2021&year=&term=&para=0&sortColumn=&Submit=%E6%9F%A5%E8%AF%A2'
username = '320130908381'
password = 'XXXXXXXX'


def login():
    # Keep cookies so the captcha, login and grades requests share one session.
    cj = http.cookiejar.LWPCookieJar()
    cookie_support = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(cookie_support, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)

    # Download the captcha image so it can be read and typed in by hand.
    img_req = urllib.request.Request(Img_URL)
    img_response = opener.open(img_req)
    try:
        with open('code.jpg', 'wb') as out:
            out.write(img_response.read())
        print('get code success')
    except IOError:
        print('file wrong')

    img_code = input("please input code: ")
    print('your code is %s' % img_code)

    LoginData = {
        "j_username": username,
        "j_password": password,
        "j_captcha": img_code,
        "button1": "(unable to decode value)"
    }
    login_req = urllib.request.Request(post_url, urllib.parse.urlencode(LoginData).encode('utf-8'))
    login_req.add_header('User-Agent', "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36")
    login_response = opener.open(login_req)
    # Login success is not verified; a failed login shows up as an error
    # in the table lookup below.
    print('login request sent')

    gpa_response = opener.open(gpa_url)
    gpa_item = BeautifulSoup(gpa_response.read(), 'lxml').find('table', class_='datalist')

    with open('table.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['年', '学期', '课程名', '成绩', '学分', '选课属性'])
        for row in gpa_item.find_all('tr'):
            cells = row.find_all('td')
            # Only rows with exactly 18 cells are treated as grade rows.
            if len(cells) == 18:
                year = cells[0].get_text()
                terms = cells[1].get_text()
                classname = cells[3].get_text()
                score = cells[8].get_text()
                credit = cells[9].get_text()
                attributes = cells[12].get_text()
                writer.writerow([year, terms, classname, score, credit, attributes])


if __name__ == '__main__':
    login()
```