Soft Mohican's Study and Development Log

Aiming to build a web service, I'm studying programming and writing code. This blog is the log of that work!

【Python】Web Scraping + Sending the Extracted Data by Email

I want to run a web scrape and email the extracted text data to myself!

Development environment

macOS High Sierra (version 10.13.5)
Python 3.6.4
Sublime Text

Prerequisites and what I want to do

Scrape arXiv, a website for archiving and publishing academic papers, and retrieve a list of the titles, authors, and abstracts of the papers posted that day.

Then go one step further and write code that puts the retrieved content into the body of an email and sends it.

What is arXiv?
arXiv - Wikipedia
arXiv: https://arxiv.org/

The program

#scrapeArxiv.py3
import urllib.request          # access web pages
import urllib.error
from bs4 import BeautifulSoup  # work with the scraped HTML
from datetime import datetime  # get the current date and time
from pytz import timezone      # get the time zone
import re
import ssl
# for composing and sending the Gmail message
import smtplib
from email.mime.text import MIMEText
from email.utils import formatdate

# avoid getting stuck on SSL certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

# fetch the HTML source and parse it with BeautifulSoup
def htmlAccess(url):
    # access the URL and retrieve the HTML
    html = urllib.request.urlopen(url)
    # hand the HTML to BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    return soup

# build the email message
def create_message(from_addr, to_addr, bcc_addrs, subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = from_addr
    msg['To'] = to_addr
    msg['Bcc'] = bcc_addrs
    msg['Date'] = formatdate()
    return msg

# send the message from a Gmail account
def send(from_addr, to_addrs, my_password, msg):
    smtpobj = smtplib.SMTP('smtp.gmail.com', 587)
    smtpobj.ehlo()
    smtpobj.starttls()
    smtpobj.ehlo()
    smtpobj.login(from_addr, my_password)
    smtpobj.sendmail(from_addr, to_addrs, msg.as_string())
    smtpobj.close()


today = datetime.now(timezone('Asia/Tokyo'))  # current date and time in JST
year = str(today.year)
month = today.month
day = str(today.day)
weekday = today.weekday()  # day of the week as a number (Monday = 0)

# convert the weekday number to its English abbreviation
weekdays = ('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')
weekday = weekdays[weekday]
# convert the month number to its English abbreviation
months = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
month = months[month - 1]

# access the cs.AI "recent" listing page
url_recent = "https://arxiv.org/list/cs.AI/recent"
soup = htmlAccess(url_recent)


# filter the h3 heading by today's date
filter_today = weekday + ", " + day + " " + month + " " + year
# e.g. filter_today = "Fri, 1 Jun 2018"
h3_tag = soup.h3
if filter_today in h3_tag.get_text():
    url_articles = h3_tag.find_next("dl").find_all(href=re.compile("/abs/"))  # <a> tags for the papers posted that day
    url_articles = str(url_articles)  # convert to a string

    pattern = r'<a href="(/abs/[0-9]+\.[0-9]+)"\s'
    pattern_comp = re.compile(pattern)
    resultList = pattern_comp.findall(url_articles)

    urlList = []  # list of URLs for the individual papers
    for value in resultList:
        url_article = "https://arxiv.org" + value
        urlList.append(url_article)

    articleDict = {}  # holds the title, authors, and abstract of each paper
    articles = ""
    title_l = "[<h1 class=\"title mathjax\"><span class=\"descriptor\">Title:</span>"
    title_r = "</h1>]"
    pattern_author = r'<a href=".*">(.*)</a>'
    abstract_l = "[<blockquote class=\"abstract mathjax\">\n<span class=\"descriptor\">Abstract:</span>"
    abstract_r = "</blockquote>]"

    for index, value in enumerate(urlList):
        soup_article = htmlAccess(value)
        articleDict['Title'] = soup_article.select('.title.mathjax')       # title
        articleDict['Authors'] = soup_article.select('.authors')           # authors
        articleDict['Abstract'] = soup_article.select('.abstract.mathjax') # abstract
        articleDict['Title'] = str(articleDict['Title'])
        articleDict['Authors'] = str(articleDict['Authors'])
        articleDict['Abstract'] = str(articleDict['Abstract'])
        # note: lstrip/rstrip treat their argument as a set of characters rather than
        # an exact prefix/suffix, so this trimming is rough around the edges
        articleDict['Title'] = articleDict['Title'].lstrip(title_l).rstrip(title_r)
        articleDict['Authors'] = re.findall(pattern_author, articleDict['Authors'])
        articleDict['Authors'] = str(articleDict['Authors']).lstrip("[").rstrip("]")
        articleDict['Abstract'] = articleDict['Abstract'].lstrip(abstract_l).rstrip(abstract_r)
        tmp = "【" + str(index + 1) + "】" + "Title:" + articleDict['Title'] + "\n\n" \
              + "Author:" + articleDict['Authors'] + "\n\n" \
              + "Abstract:" + articleDict['Abstract'] + "\n\n\n"
        articles = articles + tmp

    # email the result
    fromAddress = 'sender email address'
    myPassword = 'app password for the sender account'  # use the app password generated after enabling two-step verification
    toAddress = 'recipient email address'
    bccAddress = ""
    Subject = "Today's archive" + '(' + year + '/' + month + '/' + day + '/' + weekday + ')'
    Body = articles

    if __name__ == '__main__':
        message = create_message(fromAddress, toAddress, bccAddress, Subject, Body)
        send(fromAddress, toAddress, myPassword, message)

Execution result

The email subject contains the date of the scrape (year/month/day/weekday), and the body contains the scraped results.

  • Email subject:
Today's archive(2018/Jun/12/Tue)
  • Body:
【1】Title:
An Efficient, Generalized Bellman Update For Cooperative Inverse  Reinforcement Learning

Author:'Dhruv Malik', 'Malayandi Palaniappan', 'Jaime F. Fisac', 'Dylan Hadfield-Menell', 'Stuart Russell', 'Anca D. Dragan'

Abstract:Our goal is for AI systems to correctly identify and act according to their
human user's objectives. Cooperative Inverse Reinforcement Learning (CIRL)
formalizes this value alignment problem as a two-player game between a human
and robot, in which only the human knows the parameters of the reward function:
the robot needs to learn them as the interaction unfolds. Previous work showed
that CIRL can be solved as a POMDP, but with an action space size exponential
in the size of the reward parameter space. In this work, we exploit a specific
property of CIRL---the human is a full information agent---to derive an
optimality-preserving modification to the standard Bellman update; this reduces
the complexity of the problem by an exponential factor and allows us to relax
CIRL's assumption of human rationality. We apply this update to a variety of
POMDP solvers and find that it enables us to scale CIRL to non-trivial
problems, with larger reward parameter spaces, and larger action spaces for
both robot and human. In solutions to these larger problems, the human exhibits
pedagogic (teaching) behavior, while the robot interprets it as such and
attains higher value for the human.



【2】Title:
Greybox fuzzing as a contextual bandits problem

Author:'Ketan Patil', 'Aditya Kanade'

Abstract:Greybox fuzzing is one of the most useful and effective techniques for the
bug detection in large scale application programs. It uses minimal amount of
instrumentation. American Fuzzy Lop (AFL) is a popular coverage based
evolutionary greybox fuzzing tool. AFL performs extremely well in fuzz testing
large applications and finding critical vulnerabilities, but AFL involves a lot
of heuristics while deciding the favored test case(s), skipping test cases
during fuzzing, assigning fuzzing iterations to test case(s). In this work, we
aim at replacing the heuristics the AFL uses while assigning the fuzzing
iterations to a test case during the random fuzzing. We formalize this problem
as a `contextual bandit problem' and we propose an algorithm to solve this
problem. We have implemented our approach on top of the AFL. We modify the
AFL's heuristics with our learned model through the policy gradient method. Our
learning algorithm selects the multiplier of the number of fuzzing iterations
to be assigned to a test case during random fuzzing, given a fixed length
substring of the test case to be fuzzed. We fuzz the substring with this new
energy value and continuously updates the policy based upon the interesting
test cases it produces on fuzzing.



【3】Title:
Context-Aware Policy Reuse

Author:'Siyuan Li', 'Fangda Gu', 'Guangxiang Zhu', 'Chongjie Zhang'

Abstract:Transfer learning can greatly speed up reinforcement learning for a new task
by leveraging policies of relevant tasks.
<br/>Existing works of policy reuse either focus on only selecting a single best
source policy for transfer without considering contexts, or cannot guarantee to
learn an optimal policy for a target task.
<br/>To improve transfer efficiency and guarantee optimality, we develop a novel
policy reuse method, called {\em Context-Aware Policy reuSe} (CAPS), that
enables multi-policy transfer. Our method learns when and which source policy
is best for reuse, as well as when to terminate its reuse. CAPS provides
theoretical guarantees in convergence and optimality for both source policy
selection and target task learning. Empirical results on a grid-based
navigation domain and the Pygame Learning Environment demonstrate that CAPS
significantly outperforms other state-of-the-art policy reuse methods.

(Many more follow, so the rest is omitted.)

There are two remaining issues.


TeX commands and HTML tags end up in the output as-is.
In some cases they already appear that way on the website itself, but even when the site renders them correctly, my code seems to print them as raw source.
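A rough sketch of one way to tackle this first point: BeautifulSoup's get_text() returns only the text nodes, so HTML tags such as the <br/> above would not reach the email body (TeX commands inside abstracts would still remain, since they are plain text on arXiv's side). This reuses the htmlAccess() helper and the CSS classes from the script above; extract_plain() is a hypothetical helper name, and I have not tested it against every listing page.

#extractPlain.py3 (sketch, untested)
def extract_plain(soup_article):
    # select_one returns the first tag matching the CSS selector
    title_tag = soup_article.select_one('.title.mathjax')
    authors_tag = soup_article.select_one('.authors')
    abstract_tag = soup_article.select_one('.abstract.mathjax')

    article = {}
    # get_text() drops all tags; splitting on the first colon removes the
    # leading "Title:" / "Authors:" / "Abstract:" descriptor
    article['Title'] = title_tag.get_text(" ", strip=True).split(':', 1)[-1].strip()
    article['Authors'] = authors_tag.get_text(" ", strip=True).split(':', 1)[-1].strip()
    article['Abstract'] = abstract_tag.get_text(" ", strip=True).split(':', 1)[-1].strip()
    return article

In the main loop this would take the place of the str()/lstrip()/rstrip() block, e.g. articleDict = extract_plain(soup_article).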


The other point is pagination: when there are many new submissions, the website splits the listing across pages, but my code does not handle that yet.
(I'm not even sure how to do that, haha.)
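Here is a rough sketch of how paging might be handled. It assumes the listing can be paged with skip and show query parameters, which is how the page's navigation links appeared to work; that assumption, and the layout of continued pages, should be verified against the live site. collect_day_links() is a hypothetical helper that reuses htmlAccess() and the date filter from the script above.

#pagedListing.py3 (sketch, untested)
def collect_day_links(base_url, filter_today, page_size=25, max_pages=20):
    links = []
    for page in range(max_pages):
        # assumption: ?skip=N&show=M pages through the listing
        url = base_url + "?skip=" + str(page * page_size) + "&show=" + str(page_size)
        soup = htmlAccess(url)
        h3_tag = soup.h3
        # stop once the first heading on the page is no longer today's date
        if h3_tag is None or filter_today not in h3_tag.get_text():
            break
        for a in h3_tag.find_next("dl").find_all(href=re.compile("/abs/")):
            links.append("https://arxiv.org" + a.get("href"))
    return links

Called as collect_day_links("https://arxiv.org/list/cs.AI/recent", filter_today), it would replace the single-page construction of urlList.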

Going forward, I would like to fix these two points.


And, in the end, I will automate it!
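The simplest route would probably be a cron or launchd job that runs scrapeArxiv.py3 every weekday morning. As a pure-Python alternative, here is a minimal sketch using the third-party schedule package (an assumption on my part: pip install schedule); job() is a placeholder where the scrape-and-send flow from the script above would go. Nothing here is set up yet.

#autoRun.py3 (sketch)
import time
import schedule

def job():
    # run the scrape-and-send flow here (the body of scrapeArxiv.py3)
    pass

# run every day at 09:00; keep this process alive (in a terminal or under launchd)
schedule.every().day.at("09:00").do(job)

while True:
    schedule.run_pending()
    time.sleep(60)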