itextpdf html styles: removing HTML and CSS styles from PDFs created with iText

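Our application is similar: we have a rich text editor (TinyMCE), and our output is a PDF generated with iText. We want the HTML to be as clean as possible, ideally using only the HTML tags supported by iText's HTMLWorker. TinyMCE can do some of this, but an end user can still submit badly malformed HTML, which can break iText's ability to generate the PDF.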

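We use a combination of jSoup, jTidy, and CSSParser to filter out unwanted CSS styles entered in HTML "style" attributes. HTML entered into TinyMCE is scrubbed by this service, which cleans up any markup pasted from Word (if the user did not use the "Paste from Word" button in TinyMCE) and gives us HTML that iText's HTMLWorker converts well.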
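As a rough illustration of the style filtering idea, here is a minimal Java sketch that keeps only a fixed list of CSS properties in a style string. It is not the service code: it splits the declaration text by hand instead of using CSSParser, the names `StyleFilter` and `filter` are made up for the example, and the property list is copied from the service code further down. Hand splitting is an assumption that holds for simple editor output; a value that itself contains `;` (a data URI, for instance) would need a real CSS parser.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class StyleFilter {
    // Properties the service code keeps; every other declaration is dropped.
    static final List<String> KEEP = Arrays.asList(
        "padding-left", "text-decoration", "text-align", "font-weight", "font-style");

    /** Returns only the whitelisted declarations, or "" when none survive. */
    static String filter(String style) {
        if (style == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (String decl : style.split(";")) {
            int colon = decl.indexOf(':');
            if (colon < 0) {
                continue; // not a "name: value" pair, skip it
            }
            String name = decl.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = decl.substring(colon + 1).trim();
            if (KEEP.contains(name) && !value.isEmpty()) {
                out.append(name).append(": ").append(value).append(";");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Word-style clutter mixed with two properties worth keeping.
        System.out.println(
            filter("mso-list: l0; text-align: center; color: red; font-weight: bold"));
    }
}
```

An empty result signals that the whole style attribute can be removed from the element rather than rewritten.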

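I also found a table width problem in iText's HTMLWorker parser (5.0.6): if the table width is given in the style attribute, HTMLWorker ignores it and sets the table width to 0, so some of the logic below works around that. We use the following libraries: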

com.itextpdf:itextpdf:5.0.6 // used to generate PDFs

org.jsoup:jsoup:1.5.2 // used for cleaning HTML, primary cleaner

net.sf.jtidy:jtidy:r938 // used for cleaning HTML, secondary cleaner

net.sourceforge.cssparser:cssparser:0.9.5 // used to parse out unwanted HTML "style" attribute values

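Below is some code from a Groovy service we built to scrub the HTML, keeping only the tags and style attributes that iText supports and fixing the table width problem. The code makes a few assumptions that are specific to our application. It is working quite well for us.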

import com.steadystate.css.parser.CSSOMParser

import org.htmlcleaner.CleanerProperties

import org.htmlcleaner.HtmlCleaner;

import org.htmlcleaner.PrettyHtmlSerializer

import org.htmlcleaner.SimpleHtmlSerializer

import org.htmlcleaner.TagNode

import org.jsoup.Jsoup

import org.jsoup.nodes.Document

import org.jsoup.safety.Cleaner

import org.jsoup.safety.Whitelist

import org.jsoup.select.Elements

import org.w3c.css.sac.InputSource

import org.w3c.dom.css.CSSRule

import org.w3c.dom.css.CSSRuleList

import org.w3c.dom.css.CSSStyleDeclaration

import org.w3c.dom.css.CSSStyleSheet

import org.w3c.tidy.Tidy

class HtmlCleanerService {

static transactional = true

def cleanHTML(def html) {

// clean with JSoup which should filter out most unwanted things and

// ensure good html syntax

html = soupClean(html);

// run through JTidy to remove repeated nested tags, clean anything JSoup left out

html = tidyClean(html);

return html;

}

def tidyClean(def html) {

Tidy tidy = new Tidy()

tidy.setAsciiChars(true)

tidy.setDropEmptyParas(true)

tidy.setDropProprietaryAttributes(true)

tidy.setPrintBodyOnly(true)

tidy.setEncloseText(true)

tidy.setJoinStyles(true)

tidy.setLogicalEmphasis(true)

tidy.setQuoteMarks(true)

tidy.setHideComments(true)

tidy.setWraplen(120)

// (makeClean || dropFontTags) = replaces presentational markup by style rules

tidy.setMakeClean(true) // remove presentational clutter.

tidy.setDropFontTags(true)

// word2000 = drop style & class attributes and empty p, span elements

// draconian cleaning for Word2000

tidy.setWord2000(true)

tidy.setMakeBare(true) // remove Microsoft cruft.

tidy.setRepeatedAttributes(org.w3c.tidy.Configuration.KEEP_FIRST) // keep first or last duplicate attribute

// TODO ? tidy.setForceOutput(true)

def reader = new StringReader(html);

def writer = new StringWriter();

// hide output from stderr

tidy.setShowWarnings(false)

tidy.setErrout(new PrintWriter(new StringWriter()))

tidy.parse(reader, writer); // run tidy, providing an input and output stream

return writer.toString()

}

def soupClean(def html) {

// clean the html

Document dirty = Jsoup.parseBodyFragment(html);

Cleaner cleaner = new Cleaner(createWhitelist());

Document clean = cleaner.clean(dirty);

// now hunt down all style attributes and ensure we only have those that render with iTextPDF

Elements styledNodes = clean.select("[style]"); // every element that has a style attribute

styledNodes.each { element ->

def style = element.attr("style");

def tag = element.tagName().toLowerCase()

def newstyle = ""

CSSOMParser parser = new CSSOMParser();

InputSource is = new InputSource(new StringReader(style))

CSSStyleDeclaration styledeclaration = parser.parseStyleDeclaration(is)

boolean hasProps = false

for (int i=0; i < styledeclaration.getLength(); i++) {

def propname = styledeclaration.item(i)

def propval = styledeclaration.getPropertyValue(propname)

propval = propval ? propval.trim() : ""

if (["padding-left", "text-decoration", "text-align", "font-weight", "font-style"].contains(propname)) {

newstyle = newstyle + propname + ": " + propval + ";"

hasProps = true

}

// standardize table widths, itextPDF won't render tables if there is only width in the

// style attribute. Here we ensure the width is in its own attribute, and change the value so

// it is in percentage and no larger than 100% to avoid end users from creating really goofy

// tables that they can't edit properly because they have made the width too large.

//

// width of the display area in the editor is about 740px, so let's ensure everything

// is relative to that

//

// TODO could get into trouble with nested tables and widths within as we assume

// one table (e.g. could have nested tables both with widths of 500)

if (tag.equals("table") && propname.equals("width")) {

if (propval.endsWith("%")) {

// ensure it is <= 100%

propval = propval.replaceAll(~"[^0-9]", "")

propval = Math.min(100, propval.toInteger())

}

else {

// else we have measurement in px or assumed px, clean up and

// get integer value, then calculate a percentage

propval = propval.replaceAll(~"[^0-9]", "")

// multiply before dividing so the cast does not truncate the ratio to 0

propval = Math.min(100, (int) (propval.toInteger() * 100 / 740))

}

element.attr("width", propval + "%")

}

}

if (hasProps) {

element.attr("style", newstyle)

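} else {

// no supported property survived the filter, so drop the attribute entirely

element.removeAttr("style")

}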

}

return clean.body().html();

}

/**

* Returns a JSoup whitelist suitable for sane HTML output and iTextPDF

*/

def createWhitelist() {

Whitelist wl = new Whitelist();

// iText supported tags

wl.addTags(

"br", "div", "p", "pre", "span", "blockquote", "q", "hr",

"h1", "h2", "h3", "h4", "h5", "h6",

"u", "strike", "s", "strong", "sub", "sup", "em", "i", "b",

"ul", "ol", "li",

"table", "tbody", "td", "tfoot", "th", "thead", "tr"

);

// iText attributes recognized which we care about

// padding-left (div/p/span indentation)

// text-align (for table right/left align)

// text-decoration (for span/div/p underline, strikethrough)

// font-weight (for span/div/p bolder etc)

// font-style (for span/div/p italic etc)

// width (for tables)

// colspan/rowspan (for tables)

["span", "div", "p", "table", "ul", "ol", "pre", "td", "th"].each { tag ->

["style", "padding-left", "text-decoration", "text-align", "font-weight", "font-style"].each { attr ->

wl.addAttributes(tag, attr)

}

}

["td", "th"].each { tag ->

["colspan", "rowspan", "width"].each { attr ->

wl.addAttributes(tag, attr)

}

}

wl.addAttributes("table", "width", "style", "cellpadding")

// img support

// wl.addAttributes("img", "align", "alt", "height", "src", "title", "width")

return wl

}

}
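
The table width conversion in `soupClean` can be checked in isolation. The following is a minimal Java sketch of that arithmetic, not the service code itself: the names `TableWidth` and `toPercent` are hypothetical, the 740px editor width comes from the comment in the service, and returning -1 for a value with no leading number (such as `auto`) is an assumption added for the example. It multiplies before dividing, because integer division of a width below 740 by 740 would otherwise give 0.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableWidth {
    // Approximate width of the editor's display area, per the service comments.
    static final int EDITOR_WIDTH_PX = 740;

    private static final Pattern LEADING_INT = Pattern.compile("^\\s*(\\d+)");

    /**
     * Converts a CSS width such as "370px", "150%" or "900" to a whole
     * percentage capped at 100. Returns -1 when no leading number is found.
     */
    static int toPercent(String cssWidth) {
        if (cssWidth == null) {
            return -1;
        }
        Matcher m = LEADING_INT.matcher(cssWidth);
        if (!m.find()) {
            return -1; // e.g. "auto"
        }
        String digits = m.group(1);
        if (digits.length() > 6) {
            return 100; // far beyond the cap, and keeps the int math from overflowing
        }
        int n = Integer.parseInt(digits);
        boolean isPercent = cssWidth.trim().endsWith("%");
        // Values without a "%" suffix are treated as pixels relative to the editor width.
        int percent = isPercent ? n : n * 100 / EDITOR_WIDTH_PX;
        return Math.min(100, percent);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"370px", "150%", "900", "auto"}) {
            System.out.println(w + " -> " + toPercent(w));
        }
    }
}
```

The caller would then write the result back as `width="<n>%"` on the table element, as the service does, and skip the attribute when the sketch returns -1.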

Published: 2024-02-04 13:57:53

Article link: https://www.4u4v.net/it/170708789556173.html
