Ticket #1760 (closed enhancement: fixed)

Opened 2 years ago

Last modified 2 years ago

PortalTransforms: faster scrubHTML

Reported by: ybastide Assigned to: madarche
Priority: P2 Milestone: CPS 3.4.4
Component: PortalTransforms Version: TRUNK
Severity: normal Keywords: PortalTransforms lxml html
Cc:

Description

Hi,

PortalTransforms.libtransforms.utils.scrubHTML is a function for cleaning HTML, removing unknown tags and raising an exception if scripts, objects and such are present. This function uses sgmllib's SGMLParser and is slow as a dog.

Here's a much faster version using lxml.

Comments?

yves
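
For readers without the attachments at hand, the behaviour described above (drop unknown tags, refuse scripts, objects and the like) can be sketched roughly as follows. This is not the attached patch. The names scrub_html, IllegalHTML, VALID_TAGS and NASTY_TAGS and the tag lists themselves are assumptions made for illustration, and the sketch uses the standard library's html.parser rather than sgmllib or lxml, purely to show what "scrubbing" means here.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical tag lists, for illustration only; the real lists live in
# PortalTransforms and are not reproduced in this ticket.
VALID_TAGS = {"a", "b", "br", "em", "i", "li", "ol", "p", "strong", "ul"}
NASTY_TAGS = {"script", "object", "embed", "applet"}
VOID_TAGS = {"br"}  # tags that never get a closing tag


class IllegalHTML(ValueError):
    """Raised when the input contains a forbidden tag (name is an assumption)."""


class _Scrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in NASTY_TAGS:
            raise IllegalHTML("forbidden tag: <%s>" % tag)
        if tag in VALID_TAGS:
            # Simplification: all attributes are dropped.
            self.out.append("<%s>" % tag)
        # Unknown tags are dropped silently; their text content is kept.

    def handle_endtag(self, tag):
        if tag in VALID_TAGS and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape them.
        self.out.append(escape(data, quote=False))


def scrub_html(text):
    """Return text with unknown tags removed; raise IllegalHTML on nasty ones."""
    parser = _Scrubber()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

With these assumed lists, scrub_html('<p>Hi <blink>there</blink></p>') keeps the paragraph and the text but loses the blink tag, while any input containing a script tag raises IllegalHTML instead of being cleaned.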

Attachments

PortalTransforms-lxml.diff (6.0 kB) - added by ybastide on 10/17/06 17:33:22.
scrubHTML-v2.diff (5.0 kB) - added by ybastide on 10/23/06 23:29:50.
word_to_text.py (0.9 kB) - added by ybastide on 10/23/06 23:31:09.
PortalTransforms.diff (7.6 kB) - added by ybastide on 11/17/06 14:43:41.
New version of PortalTransforms.libtransforms.utils (mainly)
word_to_text-v2.py (1.0 kB) - added by ybastide on 11/17/06 14:45:12.
New version of PortalTransforms.transforms.word_to_text

Change History

10/17/06 17:33:22 changed by ybastide

  • attachment PortalTransforms-lxml.diff added.

10/23/06 23:29:50 changed by ybastide

  • attachment scrubHTML-v2.diff added.

10/23/06 23:31:09 changed by ybastide

  • attachment word_to_text.py added.

11/17/06 14:43:41 changed by ybastide

  • attachment PortalTransforms.diff added.

New version of PortalTransforms.libtransforms.utils (mainly)

11/17/06 14:45:12 changed by ybastide

  • attachment word_to_text-v2.py added.

New version of PortalTransforms.transforms.word_to_text

11/17/06 15:22:07 changed by madarche

  • owner changed from trac to madarche.

12/05/06 15:05:48 changed by madarche

  • status changed from new to closed.
  • resolution set to fixed.

Fixed by changeset [50502].

Thanks for those useful patches.

12/05/06 15:06:15 changed by madarche

  • milestone set to CPS 3.4.4.