I recently ran into a problem while downloading a dataset: the character strings contained unwanted zero-width spaces (<U+200B>).
Zero-width spaces are invisible Unicode characters that mark word boundaries and possible line-break points (see this Wikipedia article for more information).
Below is a brief exploration of this particular Unicode character, followed by two ways to remove it:
- Targeted approach:
str_replace_all("string","\\u200b","")
- Generalized approach:
iconv("string", "utf-8", "ascii", sub="")
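The two approaches differ in scope, which matters when the data contains legitimate non-ASCII text. Here is a minimal sketch in base R, using a made-up string (`x` is hypothetical, not taken from the dataset). The targeted replacement removes only the zero-width space, while the ASCII conversion also drops accented letters.

```r
# Hypothetical input: an accented letter plus a trailing zero-width space
x <- "caf\u00e9\u200b"

# Targeted: remove only U+200B (literal match, no regex involved)
targeted <- gsub("\u200b", "", x, fixed = TRUE)

# Generalized: drop every character that cannot be represented in ASCII
generalized <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

nchar(targeted)   # 4: the accented letter is kept
generalized       # "caf": the accented letter is lost along with the zero-width space
```

If the data may contain names or words with diacritics, the targeted approach is probably the safer default.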
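Exploring the Unicode zero-width space character: <U+200B>
# Packages used throughout this post
library(tidyverse) # loads stringr, tibble, readr and dplyr
library(stringi)   # provides stri_escape_unicode()
a <- "\u200b"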
# Print unicode
print(a)
## [1] "<U+200B>"
# Print escaped unicode
stri_escape_unicode(a)
## [1] "\\u200b"
# Replace the unicode with an empty string
a_no_unicode <- str_replace(a,"\\u200b","")
# Print the revised string
print(a_no_unicode)
## [1] ""
# Print the escaped unicode version of the revised string (just to double-check)
stri_escape_unicode(a_no_unicode)
## [1] ""
# Compare the results in a table
tibble("print(a)" = a,
       "stri_escape_unicode(a)" = stri_escape_unicode(a),
       "print(a_no_unicode)" = a_no_unicode,
       "stri_escape_unicode(a_no_unicode)" = stri_escape_unicode(a_no_unicode))
## # A tibble: 1 x 4
## `print(a)` `stri_escape_unicode(a)` `print(a_no_unic~ `stri_escape_unic~
## <chr> <chr> <chr> <chr>
## 1 <U+200B> "\\u200b" "" ""
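Because the character is invisible, it can help to confirm its presence numerically. The sketch below uses only base R functions to show the code point, plus the character and byte counts, of the same one-character string.

```r
a <- "\u200b"

# Code point as an integer: 0x200B is 8203 in decimal
utf8ToInt(a)                # 8203

# The string looks empty but holds one character stored as three UTF-8 bytes
nchar(a, type = "chars")    # 1
nchar(a, type = "bytes")    # 3

# After removal the string is truly empty
a_clean <- gsub("\u200b", "", a, fixed = TRUE)
nchar(a_clean)              # 0
```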
Example with data from the wild
csv_url <- "https://data.kingcounty.gov/resource/es38-6nrz.csv"
agencies <- read_csv(csv_url)
## Parsed with column specification:
## cols(
## department = col_character(),
## division = col_character()
## )
# Print data with unicode characters
print(as.data.frame(agencies[1:5,]))
## department
## 1 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 2 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 3 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 4 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 5 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## division
## 1 <NA>
## 2 Office of the Executive
## 3 Office of Performance, Strategy and Budget
## 4 Office of Labor Relations<U+200B>
## 5 Office of Economic and Financial Analysis
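# Matching the printed form "<U+200B>" doesn't work: it is only how R
# displays the character, not text that is stored in the string
zwsp_pattern_first <- "<U+200B>"
agencies %>%
  mutate(across(everything(), ~ str_replace_all(.x, zwsp_pattern_first, ""))) %>%
  slice(1:5) %>%
  as.data.frame()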
## department
## 1 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 2 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 3 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 4 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## 5 <U+200B><U+200B>King County Executive<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
## division
## 1 <NA>
## 2 Office of the Executive
## 3 Office of Performance, Strategy and Budget
## 4 Office of Labor Relations<U+200B>
## 5 Office of Economic and Financial Analysis
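The first pattern fails for two reasons that can be checked in isolation. `<U+200B>` is only the way the console displays the character, and when the pattern is read as a regular expression the `+` is a quantifier rather than a literal plus sign. A small base R sketch, with `s` as a made-up test string:

```r
# Hypothetical test string: a word followed by a real zero-width space
s <- "King\u200b"

# The printed form is not stored in the string, so a literal search finds nothing
printed_form_found <- grepl("<U+200B>", s, fixed = TRUE)   # FALSE

# Read as a regex, "U+" means "one or more U", so the pattern
# does not even match the literal text "<U+200B>"
regex_matches_itself <- grepl("<U+200B>", "<U+200B>")      # FALSE

# Searching for the character itself succeeds
character_found <- grepl("\u200b", s, fixed = TRUE)        # TRUE
```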
# Escaping the Unicode code point does work
zwsp_pattern_second <- "\\u200b"
agencies %>%
  mutate(across(everything(), ~ str_replace_all(.x, zwsp_pattern_second, ""))) %>%
  slice(1:5) %>%
  as.data.frame()
## department division
## 1 King County Executive <NA>
## 2 King County Executive Office of the Executive
## 3 King County Executive Office of Performance, Strategy and Budget
## 4 King County Executive Office of Labor Relations
## 5 King County Executive Office of Economic and Financial Analysis
# Or convert everything from UTF-8 to ASCII using base::iconv();
# note that this drops every non-ASCII character, not only U+200B
agencies %>%
  mutate(across(everything(), ~ iconv(.x, "utf-8", "ascii", sub = ""))) %>%
  slice(1:5) %>%
  as.data.frame()
## department division
## 1 King County Executive <NA>
## 2 King County Executive Office of the Executive
## 3 King County Executive Office of Performance, Strategy and Budget
## 4 King County Executive Office of Labor Relations
## 5 King County Executive Office of Economic and Financial Analysis
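For readers who prefer to avoid package dependencies, the same cleanup can be sketched in base R. The data frame `agencies_demo` and the helper `remove_zwsp()` below are hypothetical stand-ins invented for illustration. They are not part of the downloaded data or of any package.

```r
# Hypothetical stand-in for a few rows of the downloaded data
agencies_demo <- data.frame(
  department = c("\u200b\u200bKing County Executive\u200b",
                 "\u200bKing County Executive"),
  division   = c(NA, "Office of Labor Relations\u200b"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip U+200B with a literal (non-regex) match;
# gsub() leaves NA values as NA
remove_zwsp <- function(x) gsub("\u200b", "", x, fixed = TRUE)

# Apply the helper to character columns only
is_chr <- vapply(agencies_demo, is.character, logical(1))
agencies_demo[is_chr] <- lapply(agencies_demo[is_chr], remove_zwsp)
```

Restricting the replacement to character columns leaves numeric or date columns untouched, which may matter in wider tables than this two-column example.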
Resources
- stringr vignette
- Stack Overflow question about removing Unicode characters: How to remove strange characters using gsub in R?
- FileFormat.Info: A to Z Index of Unicode Characters: starting with ‘A’