Novel::Robot
download novel /bbs thread 小说/贴子下载器
support novel/forum website 支持小说/贴子站点
Novel::Robot::Parser
support robot ouput file type, 支持小说输出形式
Novel::Robot::Packer
for example, on debian, 以debian环境为例
$ apt-get install parallel ansible calibre cpanminus $ cpanm Novel::Robot
download and convert ebook, 简单下载或处理电子书
download the whole novel and convert to ebook 下载全本小说并转换为电子书
run_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t epub run_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -o mytest.epub
download 1-3 chapter 下载小说的1-3章,最终输出文件名为abc.epub
run_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -o abc.epub -i 1-3
convert txt to ebook, 把txt文件转成电子书
run_novel.pl -f 飘灯-风尘叹.txt -t epub run_novel.pl -f fct.txt -w 飘灯 -b 风尘叹 -t epub
run_novel.pl -s lofter -t epub -w chuweizhiyu -b 时之足 run_novel.pl -s lofter -t epub -w chuweizhiyu -b 时之足 -i 3- run_novel.pl -s lofter -t epub -w chuweizhiyu -b 时之足 -i 3-5
download/convert novel, use calibre-smtp to send ebook to email address : xxx@kindle.cn
下载小说并使用calibre-smtp推送到指定邮箱
local smtp service 本地已安装smtp服务
run_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t epub -T "xxx@kindle.cn" -F "yyy@somesite.cn" run_novel.pl -f fct.txt -w 飘灯 -b 风尘叹 -t epub -T "xxx@kindle.cn" -F "yyy@somesite.cn"
remote smtp service 使用远程smtp服务
run_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t epub -T "xxx@kindle.cn" -i 1-3 -M smtp.src.com -p 587 -U xxx -P somepwd -F yyy@somesite.cn run_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t epub -T "xxx@kindle.cn" -M smtp.qq.com -p 587 -U yyy -P 'aaaaaaaaaaaaagga' -F yyy@qq.com run_novel.pl -f fct.txt -w 飘灯 -b 风尘叹 -t epub -T "xxx@kindle.cn" -M smtp.qq.com -p 587 -U yyy -P 'aaaaaaaaaaaaagga' -F yyy@qq.com
use ansible,push ebook to remote host, and then send email
使用ansible把电子书上传到远程服务器,再在远程服务器调用calibre-smtp在服务器直接smtp推送
//todo
-s : site, 小说站点 -u : book url,小说url -w : writer name, 作者 -b : book name,书名 -f : txt file, txt文件或目录 -o : output ebook filename,输出电子书文件名 -t : ebook type,电子书类型
download novel from url
下载小说
get_novel.pl -u [url] -t [type] -i [min_item_num-max_item_num] --cookie [cookie] -o [dst_file/dst_dir] get_novel.pl -u [小说目录页url] -t [目标文件类型] -i [起止章节号] --cookie [cookie登录信息] -o [目标文件名/目标文件夹] get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t txt get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t html get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t html -i 3 get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t html -i 3-4 get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t html -i 3- get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -t html -i -3
use chrome to parse dom:
飘灯-青春期妖怪.html
get_novel.pl -u "https://www.aliwx.com.cn/chapter?bid=7964189" --use_chrome --writer_path "//div[@class='chapterbox']//span" --book_regex "<title>(.+?)-" --content_path "//div[@class='chapter-content']"
公众号
get_novel.pl -u "https://mp.weixin.qq.com/s?xxxx" --use_chrome --content_path "//div[@id='js_content']" --item_list_path "//div[@id='js_content']//a" -w somewriter -b somebook
jjwxc vip: 可以使用firefox浏览器,先登录m.jjwxc.net,然后用 cookies.txt 扩展导出cookies.txt,则可以下载当前登录账号所购买的小说;也可直接指定firefox配置目录下的cookies.sqlite文件;或者直接使用m.jjwxc.net的cookie字符串
get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=217747" -i 33-34 --cookie cookies.txt get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=217747" -i 33-34 --cookie ~/.mozilla/firefox/*/cookies.sqlite get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=217747" -i 33-34 --cookie "name1=value1; name2=value2"
convert txt, parse chapter name with regex
转换txt,可指定章节标题的正则式
get_novel.pl -w [writer] -b [book] -f [txt_file/directory] -t [type] -r [chapter_regex] get_novel.pl -w [作者] -b [书名] -f [txt文件或目录] -t [目标文件类型] -r [章节标题匹配的正则式] get_novel.pl -w 牵机 -b 断情逐妖记 -f dq1.txt -t html get_novel.pl -w 牵机 -b 断情逐妖记 -f dq1.txt,dq2.txt,dir1 -r "第[ \\t\\d]+章" -t html get_novel.pl -f 飘灯-像妖怪一样自由.txt -t html
only print info, but not download, 输出小说信息(不下载)
get_novel.pl -u [url] -D 1 get_novel.pl -u "http://www.jjwxc.net/onebook.php?novelid=14838" -D 1
-s : site, 指定站点 -u : book url,小说url -w : writer name 作者名 -b : book name,书名 -f : txt file / txt file dir, 指定文本文件来源(可以是单个目录或文件) -c : firefox cookies.sqlite / netscape HTTP cookie file / cookie string, details in Novel::Robot::Browser -t : packer type (html/txt), 小说保存类型,例如txt/html -o : output packer filename, 保存的小说文件名 -i : min_item_num-max_item_num, 只取 x-y 章/楼 -j : min_page_num-max_page_num, 只取 x-y 页 -D : only print info, not download, 只输出信息,不下载 -v : verbose --term_progress_bar: 显示进度条(默认不显示) --use_chrome : use chrome dump_dom, only support GET method --with_toc : 小说保存时是否生成目录(默认是) --grep_content : 提取关键字 --filter_content : 过滤关键字 --only_poster : 贴子只看楼主 --min_content_word_num : 贴子每楼层最小字数 --max_process_num : 进程个数 --chapter_regex: 指定分割章节的正则表达式(例如:"第[ \\t\\d]+章") --item_list_path : xpath to extract item_list, 提取item_list的路径 --content_path : xpath to extract content, 提取content的路径 --writer_path : xpath to extract writer, 提取writer的路径 --book_path : xpath to extract book, 提取book的路径 --content_regex : regex to extract content, 提取content的正则 --writer_regex : regex to extract writer, 提取writer的正则 --book_regex : regex to extract book, 提取book的正则
use calibre to convert novel file into epub/epub/..., default filename format is [writer]-[bookname].[type]
使用calibre将下载的 html格式 的小说转换成 其他格式的电子书,例如epub、epub等等。如果未指定writer及book选项,则需要将html源文件名称设置为 [作者-书名]
conv_novel.pl -f [input_file] -o [output_file] -t [type] -w [writer] -b [book] conv_novel.pl -f [源文件] -o [目标文件] -t [目标文件类型(小写)] -w [作者] -b [书名] conv_novel.pl -f 天平-风起阿房.html -t epub conv_novel.pl -f mxj.html -w 施定柔 -b 迷侠记 -t epub
get bulk novel info: <writer,book,url> , default option is not download
get writer's novels info, not download
批量获取版块/作者专栏的小说信息,默认-D=1不下载
bulk_novel.pl -b "http://www.jjwxc.net/oneauthor.php?authorid=14644" bulk_novel.pl -b "http://www.jjwxc.net/oneauthor.php?authorid=14644" -i 1-3
get board's threads info, not download
查询版块贴子信息,默认-D=1不下载
bulk_novel.pl -b "http://bbs.jjwxc.net/board.php?board=153&page=0" -P 1-2 bulk_novel.pl -b "http://bbs.jjwxc.net/board.php?board=153&page=0" -i 1-20
query novels info, not download
批量查询,不下载
bulk_novel.pl -s jjwxc -q 作者 -k 牵机 -i 1-10 bulk_novel.pl -s jjwxc -q 作品 -k 网王 -P 1-3 bulk_novel.pl -s hjj -b 153 -q 贴子主题 -k 迷侠记
bulk download novels
下载专栏内的所有小说
bulk_novel.pl -b "http://www.jjwxc.net/oneauthor.php?authorid=14644" -D 0
manually select some novels, use parallel for multiple novels download
手动选择下载部分小说,用parallel批量调用run_novel.pl获取epub
bulk_novel.pl -s jjwxc -q 作者 -k 牵机 -P 1 > raw_booklist.txt awk -F, '$1=="牵机"' raw_booklist.txt > refine_booklist.txt parallel --colsep , run_novel.pl -u "{3}" -T epub :::: refine_booklist.txt
-s : site, 指定站点 -b : board_url / board_id 版块url,或版块编号 -q : query type, 查询类型 -k : query keyword, 查询关键字 -i : min_item_num-max_item_num, 只取 x-y 项 -j : min_page_num-max_page_num, 只取 x-y 页 -D : only print info, not download, 只输出信息,不下载
init to set src site and dst type
初始化设置解析引擎,目标文件类型
my $xs = Novel::Robot->new( site => 'jjwxc', type => 'html', );
set src site, 设置解析引擎
$xs->set_parser('jjwxc');
set dst type, 设置打包引擎
$xs->set_packer('html');
download one novel/thread 下载整本小说
$xs->set_parser('jjwxc'); my $index_url = 'http://www.jjwxc.net/onebook.php?novelid=2456'; $xs->get_novel($index_url); $xs->set_parser('txt'); $xs->get_novel([ '/somepath/somefile.txt' ] writer => '牵机', book => '断情逐妖记', );
bulk download writer's novel 下载作者专栏
#!/usr/bin/perl use strict; use warnings; use utf8; use Novel::Robot; use Data::Dumper; use Encode::Locale; use Encode; my $xs = Novel::Robot->new(); $xs->set_parser( 'jjwxc' ); $xs->set_packer( 'html' ); my $writer_url = 'http://www.jjwxc.net/oneauthor.php?authorid=14644'; my $writer_ref = $xs->{parser}->get_iterate_ref( $writer_url, info_sub => sub { $xs->{parser}->extract_elements( @_, sub => $xs->{parser}->can( "parse_board" ), ); }, item_list_sub => sub { $xs->{parser}->can( "parse_board_item" )->( $xs->{parser}, @_ ) }, ); print encode( locale => "$writer_ref->{writer},$_->{book},$_->{url}\n" ) for @{ $writer_ref->{item_list} }; print "\nstart download\n"; $xs->get_novel( $_->{url} ) for @{ $writer_ref->{item_list} };
bulk download query novel 下载查询结果
#!/usr/bin/perl use strict; use warnings; use utf8; use Novel::Robot; use Data::Dumper; use Encode::Locale; use Encode; my $xs = Novel::Robot->new(); $xs->set_parser( 'jjwxc' ); $xs->set_packer( 'html' ); my $keyword = '断情逐妖记'; my $query_url = $xs->{parser}->make_query_request($keyword, query_type => '作品'); my $query_ref = $xs->{parser}->get_iterate_ref( $query_url, info_sub => sub { return {}; }, item_list_sub => sub { $xs->{parser}->can( "parse_query_item" )->( $xs->{parser}, @_ ) }, ); print encode( locale => "$_->{writer},$_->{book},$_->{url}\n" ) for @{ $query_ref->{item_list} }; print "\nonly download first\n"; $xs->get_novel( $_->{url} ) for @{ $query_ref->{item_list} }[0];
To install Novel::Robot, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Novel::Robot
CPAN shell
perl -MCPAN -e shell install Novel::Robot
For more information on module installation, please visit the detailed CPAN module installation guide.