Sunday, April 15, 2018

Java basics - working with text

Sources:

About Unicode in Java

Java represents text internally as UTF-16, a 16-bit encoding of Unicode.
Unicode distinguishes between two things: the association of characters as abstract concepts (e.g., "Greek capital letter omega Ω") with a subset of the natural numbers, called code points, and the representation of code points by values stored in units of a computer's memory. The Unicode standard defines seven of these character encoding schemes.
In Java, the 65,536 possible values of a char are UTF-16 code units, the values used in the UTF-16 encoding of Unicode text. Any representation of Unicode must be capable of representing the full range of code points, its upper bound being 0x10FFFF. Thus, code points beyond 0xFFFF need to be represented by pairs of UTF-16 code units, and the values used in these so-called surrogate pairs are exempt from being used as code points themselves.
The full range of Unicode code points can only be stored in a variable of type int. A char holds a single UTF-16 code unit, so it cannot represent every Unicode character.
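As a rough illustration of why a code point needs an int, the sketch below converts a code point into its UTF-16 code units. The class name CodePointStorage and its helper methods are made up for this example; Character.toChars and Character.isSupplementaryCodePoint are standard library calls.

```java
final class CodePointStorage {
    // U+1F600 lies beyond 0xFFFF, so it does not fit into a single char.
    static final int GRINNING_FACE = 0x1F600;

    /** Returns the UTF-16 code units that encode the given code point. */
    static char[] toCodeUnits(int codePoint) {
        return Character.toChars(codePoint);
    }

    /** Returns true if the code point must be stored as a surrogate pair. */
    static boolean needsSurrogatePair(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    public static void main(String[] args) {
        char[] units = toCodeUnits(GRINNING_FACE);
        // One code point, two chars: a high surrogate followed by a low surrogate.
        System.out.println(units.length);
        System.out.println(Character.isHighSurrogate(units[0]));
        System.out.println(Character.isLowSurrogate(units[1]));
    }
}
```

Omega (U+03A9) fits into one char, so toCodeUnits(0x03A9) returns an array of length one.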

When to be cautious with Characters?

It is possible that a String value contains surrogate pairs intermingled with individual code units.
In such cases one character can take up two indices in the string.
To check whether a string consists only of individual code units, one can use:
s.length() == s.codePointCount(0, s.length())
...because the String.length() method returns the number of code units, or 16-bit char values, in the string, while the String.codePointCount() method returns the number of code points, counting each supplementary character once.
If you have to process strings containing surrogate pairs, there's an implementation of a Unicode charAt method in this article (using the offsetByCodePoints and codePointAt methods of String).
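A minimal sketch of such a code-point-aware lookup could look like the following. It is not the implementation from the linked article; the class and method names are hypothetical, and only offsetByCodePoints, codePointAt and codePointCount come from String itself.

```java
final class CodePointAccess {
    /** Returns the code point at the given code point index (not char index). */
    static int codePointAtIndex(String s, int codePointIndex) {
        // Translate the code point index into a char index first.
        int charIndex = s.offsetByCodePoints(0, codePointIndex);
        return s.codePointAt(charIndex);
    }

    /** Returns true if every character of s occupies exactly one char. */
    static boolean hasNoSurrogatePairs(String s) {
        return s.length() == s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 as a surrogate pair, 'b'
        System.out.println(s.length());                          // char count
        System.out.println(s.codePointCount(0, s.length()));     // code point count
        System.out.println((char) codePointAtIndex(s, 2));       // third character
    }
}
```

For longer loops, s.codePoints() (Java 8 and later) iterates over all code points without repeated index translation.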

Regarding conversion to uppercase and lowercase, prefer the String.toUpperCase() and String.toLowerCase() methods over their Character counterparts: the String methods can perform locale-sensitive, context-sensitive, and one-to-many mappings, which a method returning a single char cannot.
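The German sharp s is a handy way to see the difference, because its uppercase form is two letters. The sketch below is only an illustration: the class and method names are invented, and passing Locale.ROOT is my own choice to keep the result independent of the default locale.

```java
import java.util.Locale;

final class CaseConversion {
    /** Uppercases a whole string; one-to-many mappings are possible. */
    static String upperString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    /** Uppercases a single char; the result is always exactly one char. */
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    public static void main(String[] args) {
        // \u00DF is the sharp s; the string version expands it to "SS".
        System.out.println(upperString("stra\u00DFe"));
        // The char version has no single-char uppercase form to return,
        // so the character comes back unchanged.
        System.out.println(upperChar('\u00DF') == '\u00DF');
    }
}
```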

Working with text in Java

If your editor and file system allow it, you can type Unicode characters directly in your source code.
Always use 'single quotes' for char literals and "double quotes" for String literals.
Escape sequences for char and String literals: \b (backspace), \t (tab), \n (line feed), \f (form feed), \r (carriage return), \" (double quote), \' (single quote), and \\ (backslash).

Primitive type: char

char is a single 16-bit Unicode character (strictly speaking, a UTF-16 code unit). It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
Default value: '\u0000'
Check for default value: ch == Character.MIN_VALUE or ch == 0

Non-primitive types: Character and String

How to decide if something is a letter, a digit, or whitespace?

Use Java's built-in Character class for that:
char ch = '\u0041';
assertEquals('A', ch);
assertFalse(Character.isDigit(ch));
assertTrue(Character.isLetter(ch));
assertTrue(Character.isLetterOrDigit(ch));
assertFalse(Character.isLowerCase(ch));
assertTrue(Character.isUpperCase(ch));
assertFalse(Character.isSpaceChar(ch));
assertTrue(Character.isDefined(ch));

How to decide if some letter is in the English Alphabet?

char ch = 'A';
assertTrue(((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')));
ch = 'á';
assertFalse(((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')));
Or using regex:
Pattern p = Pattern.compile("[A-Za-z]");
assertTrue(p.matcher("a").find());
assertFalse(p.matcher("Á").find()); 

Sorting text

The default String comparison (String.compareTo) orders strings by the Unicode values of their characters (putting uppercase before lowercase, etc.).
To sort in (localized) natural language order, use a Collator. An example usage is shown in this article, here's a lengthier demonstration, and here's some information on how to customize sorting rules.
List<String> list = Arrays.asList("alma", "Anna", "Ági", "ágy");
Collections.sort(list);
assertEquals(Arrays.asList("Anna", "alma", "Ági", "ágy"), list);

Collections.sort(list, String.CASE_INSENSITIVE_ORDER);
assertEquals(Arrays.asList("alma", "Anna", "Ági", "ágy"), list);

Collator huCollator = Collator.getInstance(Locale.forLanguageTag("hu-HU"));
Collections.sort(list, huCollator);
assertEquals(Arrays.asList("Ági", "ágy", "alma", "Anna"), list);

Splitting and joining text

Conversion between String and char

String str = "My fancy text";
char[] chars = str.toCharArray();
String joined = new String(chars);
assertEquals(str, joined);

Splitting and joining Strings

String str = "My fancy text";
String[] parts = str.split(" ");
String joined = String.join(" ", parts);
assertEquals(str, joined);

Parsing text

Scanner

A Scanner breaks its input into tokens using a delimiter pattern.
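Default delimiter: whitespace. Set it with Scanner.useDelimiter(), which accepts a regular expression (as a String or a Pattern).
Localization for reading numbers: via the Scanner.useLocale(locale) method.
Reset the delimiter, locale, and radix to their defaults with the Scanner.reset() method.
Navigate with Scanner.next(), which returns the next token as a String, i.e. the text between the current and the next delimiter. Typed variants such as nextInt() and nextDouble() parse the token for you.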
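The points above can be sketched as follows. The class ScannerDemo, its method names, and the sample inputs are assumptions made for illustration; the Scanner calls themselves (useDelimiter, useLocale, hasNext, next, hasNextDouble, nextDouble) are the standard API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

final class ScannerDemo {
    /** Returns the String tokens found between matches of the delimiter regex. */
    static List<String> readTokens(String input, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(input).useDelimiter(delimiterRegex)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    /** Parses consecutive numbers, using the locale's number format. */
    static List<Double> readDoubles(String input, String delimiterRegex, Locale locale) {
        List<Double> numbers = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(delimiterRegex).useLocale(locale);
            // Stops at the first token that is not a number in this locale.
            while (scanner.hasNextDouble()) {
                numbers.add(scanner.nextDouble());
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(readTokens("a, b,c", "\\s*,\\s*"));
        System.out.println(readDoubles("1.5;2;3.25", ";", Locale.US));
    }
}
```

Setting the locale explicitly is a deliberate choice here: otherwise the decimal separator the Scanner expects depends on the machine's default locale.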

BreakIterator

The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text returning the index of characters where boundaries occur.
Boundaries:
  • Character
  • Word
  • Sentence
  • Line
Navigate with BreakIterator.next() and BreakIterator.previous(); each returns the int index of the boundary it moved to, or BreakIterator.DONE when there is none.
For info on usage see the Java Tutorial.
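As a small sketch of the boundary-walking loop, the example below extracts the words of a text with a word iterator. The class name, the method name, and the rule of keeping only segments that start with a letter or digit are my own assumptions, not something prescribed by BreakIterator.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorDemo {
    /** Returns the segments between word boundaries that start with a letter or digit. */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        // Each call to next() moves to the following boundary, or returns DONE.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Punctuation and whitespace are segments too, so filter them out.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));
    }
}
```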

StringTokenizer

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

String.split()

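Splits a string around matches of the given regular expression and returns an array.
Like Scanner, it uses a regex as the delimiter, but it processes the whole text at once.
There is no default delimiter: the regex argument is required. To split on whitespace, pass "\\s+". Trailing empty strings are dropped from the result unless a negative limit is passed.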
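A short sketch of the two most common variations follows. The class and method names are hypothetical; trimming before splitting is my own addition so that leading whitespace does not produce an empty first element.

```java
final class SplitDemo {
    /** Splits on runs of whitespace, ignoring leading and trailing whitespace. */
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    /** Splits on the regex and keeps trailing empty strings (negative limit). */
    static String[] splitKeepingEmpty(String text, String regex) {
        return text.split(regex, -1);
    }

    public static void main(String[] args) {
        System.out.println(splitOnWhitespace("  My   fancy\ttext ").length);
        // Without a limit the two trailing empty strings are dropped.
        System.out.println("a,b,,".split(",").length);
        System.out.println(splitKeepingEmpty("a,b,,", ",").length);
    }
}
```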

Pattern matching

java.util.regex contains classes for matching character sequences against patterns specified by regular expressions.
An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.
Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.

The different matching methods of Matcher

Pattern pattern = Pattern.compile("foo");

// find: all occurrences one by one
assertTrue(pattern.matcher("afooo").find());
assertFalse(pattern.matcher("aooo").find());
// find starting at given index
assertTrue(pattern.matcher("afooo").find(0));
assertTrue(pattern.matcher("afooo").find(1));
assertFalse(pattern.matcher("afooo").find(2));

// lookingAt: like String.startsWith() but with regex
assertTrue(pattern.matcher("fooo").lookingAt());
assertFalse(pattern.matcher("afooo").lookingAt());

// matches: like String.equals() but with regex
assertTrue(pattern.matcher("foo").matches());
assertFalse(pattern.matcher("fooo").matches());

Retrieving matched subsequences 

The explicit state of a matcher includes the start and end indices of the most recent successful match. It also includes the start and end indices of the input subsequence captured by each capturing group in the pattern as well as a total count of such subsequences. This can be used to retrieve what is matched:
Pattern pattern = Pattern.compile("f.o");
Matcher matcher = pattern.matcher("afaoofeoofo");
assertTrue(matcher.find()); // finds the first match
assertEquals("fao", matcher.group()); 
assertTrue(matcher.find()); // finds the second match
assertEquals("feo", matcher.group());
assertFalse(matcher.find()); // no more to find
matcher.reset(); // resets the matcher
assertTrue(matcher.find()); // finds the first match again
assertEquals("fao", matcher.group());

Iterating over the matches

while(matcher.find()) {
 String group = matcher.group();
}
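For a self-contained version of the loop, the sketch below collects every match together with its start index. The class name, the method name, and the "index:match" output format are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchIteration {
    /** Returns every match of the regex in the input as "startIndex:matchedText". */
    static List<String> findAll(String regex, CharSequence input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        // find() advances past the previous match on every call.
        while (matcher.find()) {
            matches.add(matcher.start() + ":" + matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("f.o", "afaoofeoofo"));
    }
}
```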

Using capturing groups

Pattern pattern = Pattern.compile("(f(.)o)");
Matcher matcher = pattern.matcher("afaoofeoofuo");
assertEquals(2, matcher.groupCount()); // groups specified in pattern
assertTrue(matcher.find()); // finds the first match
assertEquals("fao", matcher.group(1)); //referencing the capturing group
assertEquals("a", matcher.group(2)); //referencing the capturing group

Making replacements

Replace the first substring of a string that matches the given regular expression with the given replacement:
  • str.replaceFirst(regex, repl) yields exactly the same result as 
  • Pattern.compile(regex).matcher(str).replaceFirst(repl)
Replace each substring of this string that matches the given regular expression with the given replacement:
  • str.replaceAll(regex, repl) yields exactly the same result as 
  • Pattern.compile(regex).matcher(str).replaceAll(repl)
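The equivalence can be checked with a small sketch like the one below; the class and method names are hypothetical. One practical reason to prefer the Pattern form is that a compiled Pattern can be kept and reused, whereas the String methods compile the regex on every call.

```java
import java.util.regex.Pattern;

final class ReplaceDemo {
    /** Replaces all matches using the String convenience method. */
    static String viaString(String str, String regex, String repl) {
        return str.replaceAll(regex, repl);
    }

    /** Replaces all matches using an explicitly compiled Pattern. */
    static String viaPattern(String str, String regex, String repl) {
        return Pattern.compile(regex).matcher(str).replaceAll(repl);
    }

    public static void main(String[] args) {
        String input = "afaoofeoofuo";
        System.out.println(viaString(input, "f.o", "-"));
        System.out.println(viaPattern(input, "f.o", "-"));
        System.out.println(input.replaceFirst("f.o", "-"));
    }
}
```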

Making complex replacements

For more control over the replacement, use Matcher.appendReplacement() together with Matcher.appendTail().
The most basic case: replace with fixed string
Pattern p = Pattern.compile("f.o");
Matcher m = p.matcher("afaoofeoofuo");
StringBuffer sb = new StringBuffer(); // the buffer to write the result to
while (m.find()) {
 m.appendReplacement(sb, "-"); // replace the whole match with the given string
}
m.appendTail(sb); // write the rest of the string after the last match to the buffer.
assertEquals("a-o-o-", sb.toString());
A more complex case: replace with multiple capturing groups
Pattern p = Pattern.compile("(f)(.)(o)");
Matcher m = p.matcher("afaoofeoofuo");
StringBuffer sb = new StringBuffer();
while (m.find()) {
 m.appendReplacement(sb, "$1-$3"); // replace only the second group
}
m.appendTail(sb);
assertEquals("af-oof-oof-o", sb.toString());
A more complex case: replace with value from map
Map<String, String> map = new HashMap<>();
map.put("a", "1"); map.put("e", "2"); map.put("u", "3");
Pattern p = Pattern.compile("(f)(.)(o)");
Matcher m = p.matcher("afaoofeoofuo");
StringBuffer sb = new StringBuffer();
while (m.find()) {
 m.appendReplacement(sb, "$1" + map.get(m.group(2)) + "$3"); // replace only the second group
}
m.appendTail(sb);
assertEquals("af1oof2oof3o", sb.toString());
Note: If you want the replacement to contain $ or \ literals, wrap it in Matcher.quoteReplacement().
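A minimal sketch of the quoting step, assuming a hypothetical helper named replaceWithLiteral: without Matcher.quoteReplacement(), a replacement such as "$1" would be read as a reference to capturing group 1 rather than as literal text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralReplacement {
    /** Replaces every match with the given text, taken literally. */
    static String replaceWithLiteral(String input, String regex, String literal) {
        // quoteReplacement escapes $ and \ so they lose their special meaning.
        String safe = Matcher.quoteReplacement(literal);
        return Pattern.compile(regex).matcher(input).replaceAll(safe);
    }

    public static void main(String[] args) {
        System.out.println(replaceWithLiteral("price: X", "X", "$1"));
        System.out.println(replaceWithLiteral("a/b", "/", "\\"));
    }
}
```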

Escape special characters with double backslash

Regex patterns are specified within String literals. Java string literals have their own escapes, such as \n for a line feed, so a backslash that belongs to the regex must itself be escaped with an extra \: the regex \s, which matches a whitespace character, is written as "\\s" in Java source.
The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary.
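The difference is easy to try out. In the sketch below, the class and method names are made up, and wrapping the word in Pattern.quote() is my own addition so that the word itself is never interpreted as a regex.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    /** Returns true if word occurs in text delimited by word boundaries. */
    static boolean containsWord(String text, String word) {
        // "\\b" in Java source is the regex \b, a word boundary.
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    /** Returns true if the input is exactly one backspace character. */
    static boolean isBackspace(String input) {
        // "\b" in Java source is a backspace character, matched literally.
        return Pattern.compile("\b").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("a foo b", "foo"));
        System.out.println(containsWord("afooo", "foo"));
        System.out.println(isBackspace("\b"));
    }
}
```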

