Removing Pesky Unicode Zero Width Spaces

I recently ran into a problem while downloading a dataset: the character strings contained unwanted zero-width spaces (<U+200B>).

A zero-width space is an invisible Unicode character (U+200B) that marks a possible word boundary or line break without displaying any visible space (see this Wikipedia article for more information).

Below is a brief exploration of this particular Unicode character, followed by two ways to remove it:

Targeted approach: str_replace_all("string","\\u200b","")
Generalized approach: iconv("string", "utf-8", "ascii", sub="") (this drops every non-ASCII character, not just zero-width spaces)

Exploring the Unicode zero-width space character: `<U+200B>`

library(tidyverse)  # tibble, stringr, dplyr, readr
library(stringi)    # stri_escape_unicode()

a <- "\u200b"

# Print unicode
print(a)
## [1] "<U+200B>"

# Print escaped unicode
stri_escape_unicode(a)
## [1] "\\u200b"

# Replace the unicode with an empty string
a_no_unicode <- str_replace(a,"\\u200b","")

# Print the revised string
print(a_no_unicode)
## [1] ""

# Print the escaped unicode version of the revised string (just to double-check)
stri_escape_unicode(a_no_unicode)
## [1] ""

# Compare the results in a table
tibble("print(a)" = a,
       "stri_escape_unicode(a)" = stri_escape_unicode(a),
       "print(a_no_unicode)" = a_no_unicode,
       "stri_escape_unicode(a_no_unicode)" = stri_escape_unicode(a_no_unicode))
## # A tibble: 1 x 4
##   `print(a)` `stri_escape_unicode(a)` `print(a_no_unic~ `stri_escape_unic~
##   <chr>      <chr>                    <chr>             <chr>             
## 1 <U+200B>   "\\u200b"                ""                ""
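
As a complementary check, base R can confirm that the character really is a single invisible code point. The sketch below assumes a UTF-8 capable R session; the example strings are made up for illustration and are not taken from the dataset.

```r
# Illustrative strings (hypothetical, not from the dataset)
clean <- "King County"
dirty <- "King\u200b County"

# The two strings look alike when rendered, but they are not equal
identical(clean, dirty)
## [1] FALSE

# The zero-width space counts as one character
nchar(clean)
## [1] 11
nchar(dirty)
## [1] 12

# U+200B as a decimal code point
utf8ToInt("\u200b")
## [1] 8203

# Flag strings that contain the character (fixed = TRUE matches it literally)
has_zwsp <- grepl("\u200b", c(clean, dirty), fixed = TRUE)
has_zwsp
## [1] FALSE  TRUE
```

A check like `grepl()` above can be run on a column before and after cleaning to confirm that nothing was missed.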

Example with data from the wild

csv_url <- "https://data.kingcounty.gov/resource/es38-6nrz.csv"

agencies <- read_csv(csv_url)
## Parsed with column specification:
## cols(
##   department = col_character(),
##   division = col_character()
## )

# Print data with unicode characters
print(as.data.frame(agencies[1:5,]))
##                                                                                                                      department
## 1 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 2 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 3 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 4 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 5 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
##                                     division
## 1                                       <NA>
## 2                    Office of the Executive
## 3 Office of Performance, Strategy and Budget
## 4          Office of Labor Relations<U+200B>
## 5  Office of Economic and Financial Analysis

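# Targeting the printed text doesn't work: "<U+200B>" is only how R displays
# the character, so the strings never contain that literal sequence
zwsp_pattern_first <- "<U+200B>"

agencies %>% 
  mutate(across(everything(), ~ str_replace_all(.x, zwsp_pattern_first, ""))) %>% 
  slice(1:5) %>% 
  as.data.frame()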
##                                                                                                                      department
## 1 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 2 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 3 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 4 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 5 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
##                                     division
## 1                                       <NA>
## 2                    Office of the Executive
## 3 Office of Performance, Strategy and Budget
## 4          Office of Labor Relations<U+200B>
## 5  Office of Economic and Financial Analysis
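
One way to see why the first pattern fails: `<U+200B>` is only how R prints the character on some platforms, so the string never contains those eight literal characters. A second, smaller reason is that `+` is a regex quantifier, so the unescaped pattern would not match the literal text even if it were there. A minimal base R sketch, using a made-up string modeled on the output above:

```r
x <- "\u200bKing County Executive"   # hypothetical value, not read from the dataset

# The string holds one extra code point, not the eight characters "<U+200B>"
nchar(x) - nchar("King County Executive")
## [1] 1

# Searching for the printed form as literal text finds nothing
grepl("<U+200B>", x, fixed = TRUE)
## [1] FALSE

# As a regex, "U+" means "one or more U", so the pattern misses even the literal text
grepl("<U+200B>", "<U+200B>")
## [1] FALSE
grepl("<U+200B>", "<U200B>")
## [1] TRUE
```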

# Escaping the code point does work: the regex engine reads \u200b as the zero-width space
zwsp_pattern_second <- "\\u200b"

agencies %>% 
  mutate(across(everything(), ~ str_replace_all(.x, zwsp_pattern_second, ""))) %>% 
  slice(1:5) %>% 
  as.data.frame()
##              department                                   division
## 1 King County Executive                                       <NA>
## 2 King County Executive                    Office of the Executive
## 3 King County Executive Office of Performance, Strategy and Budget
## 4 King County Executive                  Office of Labor Relations
## 5 King County Executive  Office of Economic and Financial Analysis
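
The targeted approach can also be wrapped in a small helper that only touches character columns. The function name `strip_zwsp()` and the toy data frame below are hypothetical; the sketch uses base R's `gsub()` with `fixed = TRUE` rather than the stringr call shown above, so it runs without extra packages.

```r
# Hypothetical helper: remove zero-width spaces from every character column
strip_zwsp <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    gsub("\u200b", "", col, fixed = TRUE)  # NA values stay NA
  })
  df
}

# Toy data modeled on the agencies table (values are illustrative)
toy <- data.frame(
  department = c("\u200b\u200bKing County Executive\u200b", "King County Executive"),
  division   = c(NA, "Office of Labor Relations\u200b"),
  n          = c(1L, 2L),
  stringsAsFactors = FALSE
)

cleaned <- strip_zwsp(toy)
cleaned$department
## [1] "King County Executive" "King County Executive"
cleaned$division
## [1] NA                          "Office of Labor Relations"
```

Restricting the replacement to character columns leaves numeric columns such as `n` untouched, which matters once a dataset has mixed column types.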

# Or convert the whole thing from UTF-8 to ASCII using base::iconv()
# (this drops every non-ASCII character, not just zero-width spaces)
agencies %>% 
  mutate(across(everything(), ~ iconv(.x, "utf-8", "ascii", sub = ""))) %>% 
  slice(1:5) %>% 
  as.data.frame()
##              department                                   division
## 1 King County Executive                                       <NA>
## 2 King County Executive                    Office of the Executive
## 3 King County Executive Office of Performance, Strategy and Budget
## 4 King County Executive                  Office of Labor Relations
## 5 King County Executive  Office of Economic and Financial Analysis
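
A caveat worth checking before using the generalized approach: converting to ASCII discards every character outside ASCII, not only zero-width spaces. The sketch below uses a made-up string with an accented letter. The exact result of `iconv()` can vary with the platform's iconv implementation, so no output is shown for that call.

```r
x <- "Caf\u00e9\u200b Services"   # hypothetical value: accented letter plus a zero-width space

# Targeted removal keeps the accented letter
targeted <- gsub("\u200b", "", x, fixed = TRUE)
targeted == "Caf\u00e9 Services"
## [1] TRUE

# ASCII conversion is expected to drop the accented letter as well
generalized <- iconv(x, "UTF-8", "ASCII", sub = "")
generalized

# List the non-ASCII code points in a string before deciding which approach to use
codes <- utf8ToInt(x)
codes[codes > 127]
## [1]  233 8203
```

If the only non-ASCII code point that shows up is 8203 (U+200B), the two approaches should give the same result; otherwise the targeted replacement is the safer choice.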

Resources

stringr vignette
Stack Overflow question about removing Unicode characters: How to remove strange characters using gsub in R?
FileFormat.Info: A to Z Index of Unicode Characters: starting with ‘A’
