Generally, for security purposes, it’s a bad idea to allow users to enter HTML into any data that will be displayed to other users, since this can open the door to a cross-site scripting (XSS) attack. This is why, when you declare a text-based page item (Text Field, Text Area, and so on), the Security section includes the ability to restrict the characters that the users can enter. Pick the one that fits your expected data; there are four levels other than the “Wild West” default that give you a great deal of control over what the users may enter, helping guard against XSS and SQL injection attacks.
But, every now and then, there are legitimate reasons to allow users to enter HTML. You might have administrator-only pages that allow trusted users to change parts of your application, etc. Or you may have HTML data in your database that you want to display. As long as you trust the source, this isn’t an issue.
Today, however, I ran into an issue where we had HTML data that we wanted to display, but we wanted to strip out the HTML tags first. Another developer used a function like this:
create or replace function remove_html ( p_str in varchar2 ) return varchar2 as x varchar2(32767) := null; in_html boolean := false; s varchar2(1); begin if p_str is null then return p_str; end if; for i in 1 .. length( p_str ) loop s := substr( p_str, i, 1 ); if in_html then if s = '>' then in_html := false; end if; else if s = '<' then in_html := true; end if; end if; if not in_html and s != '>' then x := x || s; end if; end loop; return x; end remove_html;
This marches through the entire string one character at a time, looking for the start/end of HTML tags. For the most part, this worked. But for some of our data, we were getting an “ORA-06502: PL/SQL: numeric or value error: character string buffer too small” error, which took a while to diagnose.
It turns out that some of the data had characters from the “extended” character set, such as the em dash or smart quotes. And those characters needed to have a varchar2(3) for storage, so the “s” variable in the above code was too small. While we could have (probably) solved the issue by changing the size of s, I felt that it was better to avoid the issue entirely, and use the Oracle database to do the work for us. So, instead, I edited our code to use regexp_replace:
select regexp_replace(old_str, '<[^>]+>', '') new_str from dual;
This works for all of the data, and we don’t have to worry about it. The only potential downside is that it’s slightly harder to read, but honestly, I think everyone should be comfortable with regular expressions.