MS Word special characters Regex

Microsoft Word

That pesky Microsoft! They have to be different and mess us developers around, don’t they? Have you ever noticed that symbols pasted from Microsoft Word look a bit different or act a little odd? That is because Word swaps plain ASCII punctuation for typographic characters such as curly quotes and long dashes. This can be a pain for Regex matching, validation and anything else that expects plain characters. So how do you catch them?

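The reason they are so difficult is that they come from the Windows-1252 character set, which puts them in the byte range 0x80 to 0x9F. They are not part of ASCII, and ISO-8859-1 reserves that same range for control codes, so anything that assumes either of those encodings will garble them. In Unicode they sit well outside the ASCII range, at the code points listed further down. These characters include: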

  • … ellipsis
  • ‘smart’ “quotes”
  • en – dash and em — dash
  • dagger † and double dagger ‡

There are a few more of course, but these are the ones that come up most often. You can find more of the Microsoft Word Windows-1252 characters here.
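To make the byte-level difference concrete, here is a minimal sketch of how the same bytes read under each encoding. The table is deliberately partial (only the punctuation discussed in this post), and the names `CP1252_PUNCTUATION` and `decodeCp1252Byte` are made up for illustration, not taken from any library.

```javascript
// Windows-1252 assigns printable characters to bytes in 0x80-0x9F,
// a range that ISO-8859-1 leaves to control codes.
// Partial table: only the punctuation discussed in this post.
const CP1252_PUNCTUATION = {
  0x82: '\u201A', // single low-9 quote
  0x84: '\u201E', // double low-9 quote
  0x85: '\u2026', // ellipsis
  0x86: '\u2020', // dagger
  0x87: '\u2021', // double dagger
  0x88: '\u02C6', // circumflex
  0x8B: '\u2039', // single left angle quote
  0x91: '\u2018', // left single quote
  0x92: '\u2019', // right single quote / apostrophe
  0x93: '\u201C', // left double quote
  0x94: '\u201D', // right double quote
  0x96: '\u2013', // en dash
  0x97: '\u2014', // em dash
  0x98: '\u02DC', // small tilde
  0x9B: '\u203A', // single right angle quote
};

// Hypothetical helper: decode one byte the way Windows-1252 would for the
// entries above, falling back to the ISO-8859-1 reading for everything else.
function decodeCp1252Byte(byte) {
  return CP1252_PUNCTUATION[byte] || String.fromCharCode(byte);
}

// The bytes 0x93 0x48 0x69 0x94 are "Hi" wrapped in smart double quotes.
const bytes = [0x93, 0x48, 0x69, 0x94];
const asCp1252 = bytes.map(decodeCp1252Byte).join('');
const asLatin1 = bytes.map((b) => String.fromCharCode(b)).join('');

console.log(asCp1252); // Hi with curly double quotes around it
console.log(asLatin1); // Hi with an invisible control code on each side
```

Read as ISO-8859-1, the quotes turn into control codes, which is why text saved in one encoding and read in the other looks broken.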

 

Symbol                                        Encoding
single quotes and apostrophe                  \u2018\u2019\u201A
double quotes                                 \u201C\u201D\u201E
ellipsis                                      \u2026
dashes (en and em)                            \u2013\u2014
circumflex                                    \u02C6
open angle bracket (single angle quote)       \u2039
close angle bracket (single angle quote)      \u203A
small tilde                                   \u02DC
non-breaking space                            \u00A0
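If you are not sure which of these characters has sneaked into a string, it helps to print the escapes of everything outside the ASCII range. The helper below is only a sketch and `listNonAscii` is a name invented for this example.

```javascript
// Hypothetical helper: list each non-ASCII character in a string together
// with its escape sequence, so you can see exactly what was pasted in.
function listNonAscii(text) {
  const found = [];
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (code > 0x7F) {
      const hex = code.toString(16).toUpperCase().padStart(4, '0');
      // Code points above U+FFFF need the \u{...} form of the escape.
      const escape = code > 0xFFFF ? '\\u{' + hex + '}' : '\\u' + hex;
      found.push({ char: ch, escape: escape });
    }
  }
  return found;
}

// A curly apostrophe, an en dash and an ellipsis hidden in ordinary text.
const report = listNonAscii('It\u2019s 5\u201310 pages\u2026');
console.log(report.map((item) => item.escape).join(' '));
// \u2019 \u2013 \u2026
```

Anything it reports that is not in the table can be added to your own patterns in the same way.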

 

Here are pre-built methods for JavaScript and C# that replace these characters with plain equivalents.

 

JavaScript Clean String

var wordClean = function (text) {
  var cleanStr = text;

  // smart single quotes and apostrophe
  cleanStr = cleanStr.replace(/[\u2018\u2019\u201A]/g, "'");

  // smart double quotes
  cleanStr = cleanStr.replace(/[\u201C\u201D\u201E]/g, "\"");

  // ellipsis: swap the single character for three plain full stops
  cleanStr = cleanStr.replace(/\u2026/g, "...");

  // dashes
  cleanStr = cleanStr.replace(/[\u2013\u2014]/g, "-");

  // circumflex
  cleanStr = cleanStr.replace(/\u02C6/g, "^");

  // open angle bracket
  cleanStr = cleanStr.replace(/\u2039/g, "<");

  // close angle bracket
  cleanStr = cleanStr.replace(/\u203A/g, ">");

  // small tilde
  cleanStr = cleanStr.replace(/\u02DC/g, "~");

  // non-breaking space
  cleanStr = cleanStr.replace(/\u00A0/g, " ");

  return cleanStr;
};
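If you would rather keep the mapping in one place, the same clean-up can be sketched as a single pass driven by a lookup object. This is an alternative take rather than the method above: the names `WORD_CHAR_MAP`, `WORD_CHAR_PATTERN` and `wordCleanSinglePass` are invented for the example, and mapping the small tilde to "~" is my own assumption.

```javascript
// Assumed mapping, based on the table in this post; adjust to taste.
const WORD_CHAR_MAP = {
  '\u2018': "'", '\u2019': "'", '\u201A': "'",
  '\u201C': '"', '\u201D': '"', '\u201E': '"',
  '\u2026': '...',
  '\u2013': '-', '\u2014': '-',
  '\u02C6': '^',
  '\u2039': '<', '\u203A': '>',
  '\u02DC': '~',
  '\u00A0': ' ',
};

// Build one character class from the map keys so the pattern and the
// replacements cannot drift apart. None of the keys is special inside [...].
const WORD_CHAR_PATTERN = new RegExp('[' + Object.keys(WORD_CHAR_MAP).join('') + ']', 'g');

function wordCleanSinglePass(text) {
  // The callback receives each matched character and looks up its replacement.
  return text.replace(WORD_CHAR_PATTERN, (ch) => WORD_CHAR_MAP[ch]);
}

console.log(wordCleanSinglePass('\u201CIt\u2019s done\u2026\u201D'));
// "It's done..."
```

Adding a new character then means adding one line to the map, with no extra `replace` call.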


C# Clean String

// requires: using System.Text.RegularExpressions;
public string wordClean(string text)
{
    var cleanStr = text;

    // smart single quotes and apostrophe
    cleanStr = Regex.Replace(cleanStr, "[\u2018\u2019\u201A]", "'");

    // smart double quotes
    cleanStr = Regex.Replace(cleanStr, "[\u201C\u201D\u201E]", "\"");

    // ellipsis: swap the single character for three plain full stops
    cleanStr = Regex.Replace(cleanStr, "\u2026", "...");

    // dashes
    cleanStr = Regex.Replace(cleanStr, "[\u2013\u2014]", "-");

    // circumflex
    cleanStr = Regex.Replace(cleanStr, "\u02C6", "^");

    // open angle bracket
    cleanStr = Regex.Replace(cleanStr, "\u2039", "<");

    // close angle bracket
    cleanStr = Regex.Replace(cleanStr, "\u203A", ">");

    // small tilde
    cleanStr = Regex.Replace(cleanStr, "\u02DC", "~");

    // non-breaking space
    cleanStr = Regex.Replace(cleanStr, "\u00A0", " ");

    return cleanStr;
}


If you are doing some validation using Regex, here is how you can check for these characters as well.

JavaScript Regex

function containsWordChar(text) {
  // Each entry pairs a label with the characters it covers.
  // No ^ or $ anchors, so a match anywhere in the text counts.
  var checks = [
    ["single quotes and apostrophe", /[\u2018\u2019\u201A]/],
    ["double quotes", /[\u201C\u201D\u201E]/],
    ["ellipsis", /\u2026/],
    ["dashes", /[\u2013\u2014]/],
    ["circumflex", /\u02C6/],
    ["open angle bracket", /\u2039/],
    ["close angle bracket", /\u203A/],
    ["small tilde", /\u02DC/],
    ["non-breaking space", /\u00A0/]
  ];
  var contains = [];

  for (var i = 0; i < checks.length; i++) {
    if (checks[i][1].test(text)) {
      contains.push(checks[i][0]);
    }
  }

  // e.g. "double quotes, ellipsis"; an empty string when nothing was found
  return contains.join(", ");
}
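For form validation it is often more useful to tell the user where the offending characters are, not just which kinds were found. The following is a rough sketch along those lines. `WORD_CHARS` and `findWordChars` are names made up for this example, and `String.prototype.matchAll` needs a reasonably modern browser or Node version.

```javascript
// One character class covering every character from the table in this post.
const WORD_CHARS = /[\u2018\u2019\u201A\u201C\u201D\u201E\u2026\u2013\u2014\u02C6\u2039\u203A\u02DC\u00A0]/g;

// Hypothetical helper: return the position and value of every Word character.
function findWordChars(text) {
  return Array.from(text.matchAll(WORD_CHARS), (m) => ({
    index: m.index, // zero-based position in the string
    char: m[0],     // the character that matched
  }));
}

// An en dash at index 6 and an ellipsis at index 13.
const hits = findWordChars('Hello \u2013 world\u2026');
console.log(hits.map((h) => h.index).join(', ')); // 6, 13
```

An empty array means the text is clean, so `findWordChars(value).length === 0` works as a simple pass/fail check.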


C# Regex (MVC)

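// The attribute passes only when the whole value matches, so the character class is negated:
// the value is valid when it contains none of the Word characters.
[RegularExpression("^[^\u2018\u2019\u201A\u201C\u201D\u201E\u2026\u2013\u2014\u02C6\u2039\u203A\u02DC\u00A0]*$", ErrorMessage = "Your content contains some Microsoft Word Windows-1252 characters.")]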


C# Regex

// requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public string containsWordChar(string text)
{
    // Each entry pairs a label with the characters it covers.
    // No ^ or $ anchors, so a match anywhere in the text counts.
    var checks = new (string Label, string Pattern)[]
    {
        ("single quotes and apostrophe", "[\u2018\u2019\u201A]"),
        ("double quotes", "[\u201C\u201D\u201E]"),
        ("ellipsis", "\u2026"),
        ("dashes", "[\u2013\u2014]"),
        ("circumflex", "\u02C6"),
        ("open angle bracket", "\u2039"),
        ("close angle bracket", "\u203A"),
        ("small tilde", "\u02DC"),
        ("non-breaking space", "\u00A0")
    };
    var contains = new List<string>();

    foreach (var check in checks)
    {
        if (Regex.IsMatch(text, check.Pattern))
        {
            contains.Add(check.Label);
        }
    }

    // e.g. "double quotes, ellipsis"; an empty string when nothing was found
    return string.Join(", ", contains);
}

 

Published by Chris Pateman - PR Coder

A Digital Technical Lead, constantly learning and sharing the knowledge journey.
