Strip non-ASCII characters from a string using RegEx (regular expressions)



If you have string data that contains non-ASCII characters and want to strip all of them out, the following regular expression will help.

[^\u0000-\u007F]+

Explanation

  • [^\u0000-\u007F]+ matches a single character not present in the list below
    Quantifier: + between one and unlimited times, as many times as possible
  • \u0000-\u007F a single character in the range between the following two characters
    • \u0000 the character with code point U+0000 (NUL)
    • \u007F the character with code point U+007F (DEL)

^ is the negation operator. As the first character inside the square brackets, it tells the regex to match every character that is not listed, instead of every character that is.

The \u####-\u#### part says which characters are in the range. \u0000-\u007F covers the first 128 code points of Unicode (0 to 127), which are exactly the ASCII characters. Because of the negation, the expression matches every non-ASCII character.
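To make the boundaries of the range concrete, here is a minimal Java sketch. The class name `AsciiRangeDemo` and the helper `stripNonAscii` are hypothetical names chosen for illustration. Note that inside a Java string literal the backslashes have to be doubled so that the regex engine, not the compiler, sees the `\u` escapes.

```java
public class AsciiRangeDemo {
    // Hypothetical helper: removes every run of characters outside U+0000..U+007F.
    public static String stripNonAscii(String input) {
        return input.replaceAll("[^\\u0000-\\u007F]+", "");
    }

    public static void main(String[] args) {
        // U+007F (DEL) is the last ASCII character, so it is kept: length stays 3.
        System.out.println(stripNonAscii("a\u007Fb").length());
        // U+0080 is the first code point outside the range, so it is removed.
        System.out.println(stripNonAscii("a\u0080b"));
        // Accented letters such as U+00E9 are non-ASCII as well.
        System.out.println(stripNonAscii("caf\u00E9"));
    }
}
```

Everything up to and including code point 127 survives; everything from 128 upwards is stripped.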


I had a string like the one below, which contains many non-standard characters:

name 1= Chanel 51������������������������������������������������������������

Applying the replaceAll method in Java as below

String s = "name 1= Chanel 51������������������������������������������������������������";
// The backslashes are doubled so the regex engine receives \u0000 and \u007F
s = s.replaceAll("[^\\u0000-\\u007F]+", "");
System.out.println(s);

would output the following to the console

name 1= Chanel 51
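If the same cleanup runs on many strings, one option is to compile the pattern once and reuse it. The sketch below is only an illustration under that assumption: `AsciiCleaner` and `clean` are hypothetical names, and it uses `\p{ASCII}`, the predefined class that `java.util.regex.Pattern` documents as equivalent to `[\x00-\x7F]`, in place of the explicit range.

```java
import java.util.regex.Pattern;

public class AsciiCleaner {
    // Compiled once and reused for every call; matches runs of non-ASCII characters.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\p{ASCII}]+");

    // Hypothetical helper: returns the input with all non-ASCII characters removed.
    public static String clean(String input) {
        return NON_ASCII.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+FFFD is the replacement character that shows up as a garbled glyph.
        System.out.println(clean("name 1= Chanel 51\uFFFD\uFFFD\uFFFD"));
    }
}
```

Both forms remove the same characters; the precompiled version just avoids recompiling the expression on each call.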

Test it here: https://regex101.com

