Regular expression (regex) to remove double encoding of HTML entities

30th March, 2011 - Posted by david

When users copy and paste data into forms on your website, and that data then gets stored in your database, you invariably end up with all sorts of ways of encoding and storing special characters. Ideally, these end up in your database as the correct characters (such as € for the euro symbol), which then get encoded as HTML entities when you display the data on your website (so € becomes &euro; in the HTML).

However, with older systems, especially those built in-house, you end up with the HTML entity version of certain characters in your database. It’s pretty much a fact of web development. Take the example of a string that says “Price: €100” but gets stored in the database as “Price: &euro;100”. When you go to display this text on your encoded web page, you end up seeing things such as “Price: &euro;100” in your browser. This is a result of double encoding: the & in &euro; first gets encoded as &amp;, so the HTML source contains &amp;euro;.
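To make the problem concrete, here is a minimal sketch of how the double encoding arises. It assumes the page is escaped with PHP's htmlspecialchars(); the variable names and the stored value are made up for illustration.

```php
<?php
// Assumed example value: the entity is already sitting in the database.
$stored = 'Price: &euro;100';

// Escaping for output encodes the leading & a second time.
$escaped = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');

echo $escaped, PHP_EOL; // Price: &amp;euro;100
```

The browser decodes &amp; back to a plain &, so the visitor sees the literal text &euro; rather than the € symbol.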

In order to remove these, I came up with the following function, which uses a simple regular expression to tidy such instances up.

function remove_double_encoding($in)
{
    return preg_replace('/&amp;([a-zA-Z0-9]{2,7});/', '&$1;', $in);
}

This looks for any string of 2 to 7 letters or digits with &amp; immediately before it and ; immediately after. When it finds a match, it simply replaces the &amp; with &. It does this for every instance in your input string.
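As a quick check of that behaviour, here is a small usage sketch. The function is repeated, with the entity written out as &amp;, so that the snippet runs on its own; the input string is invented for the example.

```php
<?php
// Repeated here so the snippet is self-contained.
function remove_double_encoding($in)
{
    return preg_replace('/&amp;([a-zA-Z0-9]{2,7});/', '&$1;', $in);
}

// Hypothetical input containing two double-encoded entities and one plain &amp;.
$fixed = remove_double_encoding('Price: &amp;euro;100 &amp;amp; up, R&amp;D');

echo $fixed, PHP_EOL; // Price: &euro;100 &amp; up, R&amp;D
```

Note that the trailing R&amp;D is left alone, because the D is not followed by a semicolon. The flip side is that text which was genuinely meant to show an entity name literally (for example in a tutorial about HTML) would also be rewritten, so this is best applied only to data you know was double encoded.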

Update: Forgot that you can also have purely numeric codes here, so added ‘0-9’ to the regex.
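One caveat: numeric character references are written with a # after the ampersand (&#8364; or &#x20AC; for the euro), and # is not in the character class above, so the pattern as written only catches names that happen to contain digits, such as &frac12;. A possible variant, not part of the original function, allows an optional # at the front. The function name here is hypothetical.

```php
<?php
// Variant sketch: the optional # lets numeric references such as
// &amp;#8364; and &amp;#x20AC; match as well as named entities.
function remove_double_encoding_numeric($in)
{
    return preg_replace('/&amp;(#?[a-zA-Z0-9]{2,7});/', '&$1;', $in);
}

echo remove_double_encoding_numeric('Price: &amp;#8364;100'), PHP_EOL;  // Price: &#8364;100
echo remove_double_encoding_numeric('Price: &amp;#x20AC;100'), PHP_EOL; // Price: &#x20AC;100
echo remove_double_encoding_numeric('Price: &amp;euro;100'), PHP_EOL;   // Price: &euro;100
```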

Tags: html php regex regular expression | david | 30th Mar, 2011 at 18:43pm
