Converting HTML to Markdown with Python
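Introduction: Although I mostly generate Markdown with PHP, Python's library ecosystem really is remarkably rich. PHP's Composer also learned from package managers in other languages, which is what gives PHP a similar building-block feel.
Converting Markdown to HTML with Python is the more common case; today we will use another library to go the other way and convert HTML to Markdown.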
html2text
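Installation
1. Using pip
pip install html2text  # on Python 3, use pip3
2. Installing from source
If you are using Python 3, replace python with python3 in the commands below.
git clone --depth 1 https://github.com/Alir3z4/html2text.git
cd html2text
python setup.py build
python setup.py install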
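After either install route, it can help to confirm that the interpreter you plan to use actually sees the package. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of html2text.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found by this interpreter."""
    # find_spec returns None when no importable module of that name exists.
    return importlib.util.find_spec(module_name) is not None


# Reports whether html2text is importable, without actually importing it.
print("html2text installed:", is_installed("html2text"))
```

If this prints False right after a successful install, pip and python most likely belong to different interpreters (for example pip for Python 2 and python3 for running the script).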
Usage
import html2text
html = "<p><strong>hello </strong> https://xxx.com </p>"
md = html2text.html2text(html)
print(md)
Output
**hello** https://xxx.com
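The same one-call conversion extends naturally to files. Below is a rough sketch, assuming UTF-8 input; `convert_file` and its `convert` parameter are hypothetical names introduced here for illustration, and by default the function falls back to `html2text.html2text` as used above.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_file(src: str, dst: str,
                 convert: Optional[Callable[[str], str]] = None) -> str:
    """Read HTML from src, write the converted text to dst, and return it."""
    if convert is None:
        # Imported lazily: html2text is third-party (pip install html2text).
        import html2text
        convert = html2text.html2text
    html = Path(src).read_text(encoding="utf-8")  # assumes UTF-8 input
    markdown = convert(html)
    Path(dst).write_text(markdown, encoding="utf-8")
    return markdown
```

A call such as `convert_file("page.html", "page.md")` (file names are placeholders) would then write the Markdown next to the source. Passing a different `convert` callable makes it easy to swap in a configured `HTML2Text().handle` later.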
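Advanced usage
Ignoring links (the a tag)
import html2text
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True    # drop link targets, keep only the link text
text_maker.bypass_tables = False  # render tables as Markdown rather than raw HTML
# Assumed input: the earlier sample plus one link, inferred from the output below
html = '<p><strong>hello </strong> https://xxx.com </p><p><a href="https://xxx.com">link</a></p>'
text = text_maker.handle(html)
print(text)
Output
**hello** https://xxx.com

link
With ignore_links = False, the output is
**hello** https://xxx.com

[link](https://xxx.com)
As we can see, with ignore_links enabled only the link text is extracted, while with it disabled the link is rendered in Markdown link syntax.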
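To see both behaviours side by side, the switch can be wrapped in a small function. This is only a sketch: `render` and `SAMPLE_HTML` are hypothetical names, and the sample markup is an assumed input containing a single link.

```python
# Assumed sample input: one paragraph of text followed by one link.
SAMPLE_HTML = (
    '<p><strong>hello </strong> https://xxx.com </p>'
    '<p><a href="https://xxx.com">link</a></p>'
)


def render(html: str, ignore_links: bool) -> str:
    """Convert html to Markdown with ignore_links switched on or off."""
    import html2text  # third-party: pip install html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = ignore_links
    return converter.handle(html)
```

Calling `render(SAMPLE_HTML, True)` and `render(SAMPLE_HTML, False)` and printing both results makes the difference easy to compare. A fresh `HTML2Text` instance is created per call so that settings from one conversion cannot leak into the next.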
Other options
- UNICODE_SNOB for using unicode
- ESCAPE_SNOB for escaping every special character
- LINKS_EACH_PARAGRAPH for putting links after every paragraph
- BODY_WIDTH for wrapping long lines
- SKIP_INTERNAL_LINKS to skip #local-anchor things
- INLINE_LINKS for formatting images and links
- PROTECT_LINKS protect from line breaks
- GOOGLE_LIST_INDENT no of pixels to indent nested lists
- IGNORE_ANCHORS
- IGNORE_IMAGES
- IMAGES_AS_HTML always generate HTML tags for images; preserves height, width, alt if possible.
- IMAGES_TO_ALT
- IMAGES_WITH_SIZE
- IGNORE_EMPHASIS
- BYPASS_TABLES format tables in HTML rather than Markdown
- IGNORE_TABLES ignore table-related tags (table, th, td, tr) while keeping rows
- SINGLE_LINE_BREAK to use a single line break rather than two
- UNIFIABLE is a dictionary which maps unicode abbreviations to ASCII values
- RE_SPACE for finding space-only lines
- RE_ORDERED_LIST_MATCHER for matching ordered lists in MD
- RE_UNORDERED_LIST_MATCHER for matching unordered list matcher in MD
- RE_MD_CHARS_MATCHER for matching Md \,[,],( and )
- RE_MD_CHARS_MATCHER_ALL for matching `,*,_,{,},[,],(,),#,!
- RE_MD_DOT_MATCHER for matching lines starting with 1.
- RE_MD_PLUS_MATCHER for matching lines starting with +
- RE_MD_DASH_MATCHER for matching lines starting with (-)
- RE_SLASH_CHARS a string of slash escapeable characters
- RE_MD_BACKSLASH_MATCHER to match \char
- USE_AUTOMATIC_LINKS to convert <a href='http://xyz'>http://xyz</a> to <http://xyz>
- MARK_CODE to wrap ‘pre’ blocks with [code]…[/code] tags
- WRAP_LINKS to decide if links have to be wrapped during text wrapping (implies INLINE_LINKS = False)
- WRAP_LIST_ITEMS to decide if list items have to be wrapped during text wrapping
- DECODE_ERRORS to handle decoding errors. ‘strict’, ‘ignore’, ‘replace’ are the acceptable values.
- DEFAULT_IMAGE_ALT takes a string as value and is used whenever an image tag is missing an alt value. The default for this is an empty string '' to avoid backward breakage
- OPEN_QUOTE is the character used to open a quote when replacing the <q> tag. It defaults to ".
- CLOSE_QUOTE is the character used to close a quote when replacing the <q> tag. It defaults to ".
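The upper-case names in this list read like module-level defaults, whereas the earlier example sets lower-case attributes (ignore_links, bypass_tables) on an HTML2Text instance. The sketch below assumes that body_width and ignore_images are likewise available as instance attributes mirroring BODY_WIDTH and IGNORE_IMAGES; `make_converter` is a hypothetical helper, not part of the library.

```python
def make_converter(**options):
    """Build an HTML2Text instance and set the given lower-case options on it."""
    import html2text  # third-party: pip install html2text

    converter = html2text.HTML2Text()
    for name, value in options.items():
        # Refuse unknown names so that a typo does not silently do nothing.
        if not hasattr(converter, name):
            raise AttributeError(f"HTML2Text has no option named {name!r}")
        setattr(converter, name, value)
    return converter
```

For example, `make_converter(body_width=0, ignore_images=True).handle(html)` should turn off line wrapping and skip images. Not every option maps one-to-one by name (ignore_links, for instance, appears in the list above as IGNORE_ANCHORS), so it is worth checking the attribute name before relying on it.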
markdown @tsingchan