<?xml version="1.0" encoding="utf-8"?>

<overlay xmlns="http://hoa-project.net/xyl/xylophone">
<yield id="chapter">

  <p>Strings can sometimes be <strong>complex</strong>, especially when they use
  the <code>Unicode</code> encoding format. The <code>Hoa\Ustring</code> library
  provides several operations on UTF-8 strings.</p>

  <h2 id="Table_of_contents">Table of contents</h2>

  <tableofcontents id="main-toc" />

  <h2 id="Introduction" for="main-toc">Introduction</h2>

  <p>When we manipulate strings, the <a href="http://unicode.org/">Unicode</a>
  format is the natural choice because of its <strong>compatibility</strong> with
  historical formats (like ASCII) and its capacity to represent a
  <strong>large</strong> range of characters and symbols for all cultures and
  all regions in the world. PHP provides several tools to manipulate such
  strings, like the following extensions:
  <a href="http://php.net/mbstring"><code>mbstring</code></a>,
  <a href="http://php.net/iconv"><code>iconv</code></a> or the excellent
  <a href="http://php.net/intl"><code>intl</code></a>, which is based on
  <a href="http://icu-project.org/">ICU</a>, the reference implementation of
  Unicode. Unfortunately, we sometimes have to mix these extensions to achieve
  our aims, at the cost of a certain <strong>complexity</strong> along with
  a regrettable <strong>verbosity</strong>.</p>
  <p>The <code>Hoa\Ustring</code> library addresses these issues by providing a
  <strong>simple</strong> way to manipulate strings with
  <strong>performance</strong> and <strong>efficiency</strong> in mind. It
  also provides some advanced algorithms to perform <strong>search</strong>
  operations on strings.</p>

  <h2 id="Unicode_strings" for="main-toc">Unicode strings</h2>

  <p>The <code>Hoa\Ustring\Ustring</code> class represents a
  <strong>UTF-8</strong> Unicode string and allows us to manipulate it easily.
  This class implements the
  <a href="http://php.net/arrayaccess"><code>ArrayAccess</code></a>,
  <a href="http://php.net/countable"><code>Countable</code></a> and
  <a href="http://php.net/iteratoraggregate"><code>IteratorAggregate</code></a>
  interfaces. We are going to use three examples in three different languages:
  French, Arabic and Japanese. Thus:</p>
  <pre><code class="language-php">$french = new Hoa\Ustring\Ustring('Je t\'aime');
$arabic = new Hoa\Ustring\Ustring('أحبك');
$japanese = new Hoa\Ustring\Ustring('私はあなたを愛して');</code></pre>
  <p>Now, let's see what we can do with these three strings.</p>

  <h3 id="String_manipulation" for="main-toc">String manipulation</h3>

  <p>Let's start with <strong>elementary</strong> operations. If we would like
  to <strong>count</strong> the number of characters (not bytes), we will use
  the <a href="http://php.net/count"><code>count</code> function</a>. Thus:</p>
  <pre><code class="language-php">var_dump(
    count($french),
    count($arabic),
    count($japanese)
);

/**
 * Will output:
 *     int(9)
 *     int(4)
 *     int(9)
 */</code></pre>
  <p>When we speak about text position, it is not suitable to speak about the
  right or the left, but rather about a <strong>beginning</strong> or an
  <strong>end</strong>, based on the <strong>direction</strong> of writing.
  We can know this direction thanks to the
  <code>Hoa\Ustring\Ustring::getDirection</code> method. It returns the value of
  one of the following constants:</p>
  <ul>
    <li><code>Hoa\Ustring\Ustring::LTR</code>, for left-to-right, if the text is
    written from the left to the right,</li>
    <li><code>Hoa\Ustring\Ustring::RTL</code>, for right-to-left, if the text is
    written from the right to the left.</li>
  </ul>
  <p>Let's observe the result with our examples:</p>
  <pre><code class="language-php">var_dump(
    $french->getDirection() === Hoa\Ustring\Ustring::LTR, // is left-to-right?
    $arabic->getDirection() === Hoa\Ustring\Ustring::RTL, // is right-to-left?
    $japanese->getDirection() === Hoa\Ustring\Ustring::LTR // is left-to-right?
);

/**
 * Will output:
 *     bool(true)
 *     bool(true)
 *     bool(true)
 */</code></pre>
  <p>The result of this method is computed thanks to the
  <code>Hoa\Ustring\Ustring::getCharDirection</code> static method, which
  computes the direction of a single character.</p>
  <p>If we would like to <strong>concatenate</strong> another string to the end
  or to the beginning, we will respectively use the
  <code>Hoa\Ustring\Ustring::append</code> and
  <code>Hoa\Ustring\Ustring::prepend</code> methods. These methods, like most of
  those which modify the string, return the object itself in order to
  chain the calls. For instance:</p>
  <pre><code class="language-php">echo $french->append('… et toi, m\'aimes-tu ?')->prepend('Mam\'zelle ! ');

/**
 * Will output:
 *     Mam'zelle ! Je t'aime… et toi, m'aimes-tu ?
 */</code></pre>
  <p>We also have the <code>Hoa\Ustring\Ustring::toLowerCase</code> and
  <code>Hoa\Ustring\Ustring::toUpperCase</code> methods to, respectively, set
  the case of the string to lower or upper. For instance:</p>
  <pre><code class="language-php">echo $french->toUpperCase();

/**
 * Will output:
 *     MAM'ZELLE ! JE T'AIME… ET TOI, M'AIMES-TU ?
 */</code></pre>
  <p>We can also add characters to the beginning or to the end of the string to
  reach a <strong>minimum</strong> length. This operation is frequently called
  <em>padding</em> (for historical reasons dating back to typewriters).
  That's why we have the <code>Hoa\Ustring\Ustring::pad</code> method, which
  takes three arguments: the minimum length, the characters to add and a
  constant indicating whether we have to add them at the end or at the
  beginning of the string
  (respectively <code>Hoa\Ustring\Ustring::END</code>, by default, and
  <code>Hoa\Ustring\Ustring::BEGINNING</code>).</p>
  <pre><code class="language-php">echo $arabic->pad(20, ' ');

/**
 * Will output:
 *     أحبك
 */</code></pre>
  <p>A similar operation allows us to remove, by default,
  <strong>spaces</strong> at the beginning and at the end of the string thanks
  to the <code>Hoa\Ustring\Ustring::trim</code> method. For example, to retrieve
  our original Arabic string:</p>
  <pre><code class="language-php">echo $arabic->trim();

/**
 * Will output:
 *     أحبك
 */</code></pre>
  <p>If we would like to remove other characters, we can use its first argument,
  which must be a regular expression. Finally, its second argument allows us to
  specify the side from which to remove characters: at the beginning, at the end
  or both, still by using the <code>Hoa\Ustring\Ustring::BEGINNING</code> and
  <code>Hoa\Ustring\Ustring::END</code> constants. We can combine these
  constants to express “both sides”, which is the default value:
  <code class="language-php">Hoa\Ustring\Ustring::BEGINNING |
  Hoa\Ustring\Ustring::END</code>.
  For example, to remove all the digits and
  the spaces only at the end, we will write:</p>
  <pre><code class="language-php">$arabic->trim('\s|\d', Hoa\Ustring\Ustring::END);</code></pre>
  <p>We can also <strong>reduce</strong> the string to a
  <strong>sub-string</strong> by specifying the position of the first character
  followed by the length of the sub-string to the
  <code>Hoa\Ustring\Ustring::reduce</code> method:</p>
  <pre><code class="language-php">echo $french->reduce(3, 6)->reduce(2, 4);

/**
 * Will output:
 *     aime
 */</code></pre>
  <p>If we would like to get a specific character, we can rely on the
  <code>ArrayAccess</code> interface. For instance, to get the first character
  of each of our examples (from their original definitions):</p>
  <pre><code class="language-php">var_dump(
    $french[0],
    $arabic[0],
    $japanese[0]
);

/**
 * Will output:
 *     string(1) "J"
 *     string(2) "أ"
 *     string(3) "私"
 */</code></pre>
  <p>If we would like the last character, we will use the -1 index. The index is
  not bounded by the length of the string. If the index exceeds this length,
  then a <em>modulo</em> will be applied.</p>
  <p>We can also modify or remove a specific character with this method. For
  example:</p>
  <pre><code class="language-php">$french->append(' ?');
$french[-1] = '!';
echo $french;

/**
 * Will output:
 *     Je t'aime !
 */</code></pre>
  <p>Another very useful method is the <strong>ASCII</strong> transformation.
  Be careful, this is not always possible, depending on your settings. For
  example:</p>
  <pre><code class="language-php">$title = new Hoa\Ustring\Ustring('Un été brûlant sur la côte');
echo $title->toAscii();

/**
 * Will output:
 *     Un ete brulant sur la cote
 */</code></pre>
  <p>We can also transform from Arabic or Japanese to ASCII.
  Symbols, like
  mathematical symbols or emojis, are also transformed:</p>
  <pre><code class="language-php">$emoji = new Hoa\Ustring\Ustring('I ❤ Unicode');
$maths = new Hoa\Ustring\Ustring('∀ i ∈ ℕ');

echo
    $arabic->toAscii(), "\n",
    $japanese->toAscii(), "\n",
    $emoji->toAscii(), "\n",
    $maths->toAscii(), "\n";

/**
 * Will output:
 *     ahbk
 *     sihaanatawo aishite
 *     I (heavy black heart)️ Unicode
 *     (for all) i (element of) N
 */</code></pre>
  <p>For this method to work correctly, the
  <a href="http://php.net/intl"><code>intl</code></a> extension needs to be
  present, so that the
  <a href="http://php.net/transliterator"><code>Transliterator</code></a> class
  is available. If it does not exist, the
  <a href="http://php.net/normalizer"><code>Normalizer</code></a> class must
  exist. If this class does not exist either, the
  <code>Hoa\Ustring\Ustring::toAscii</code> method can still try a
  transformation, but it is less efficient. To activate this last solution,
  <code>true</code> must be passed as a single argument. This <em lang="fr">tour
  de force</em> is not recommended in most cases.</p>
  <p>We also find the <code>getTransliterator</code> method, which returns a
  <code>Transliterator</code> object, or <code>null</code> if this class does
  not exist. This method takes a transliteration identifier as argument. We
  suggest <a href="http://userguide.icu-project.org/transforms/general">reading
  the documentation about the transliterator of ICU</a> to understand this
  identifier. The <code>transliterate</code> method allows us to transliterate
  the current string based on an identifier, a beginning index and an end
  one.
  This method works the same way as the
  <a href="http://php.net/transliterator.transliterate"><code>Transliterator::transliterate</code></a>
  method.</p>
  <p>More generally, to change the <strong>encoding</strong> format, we can use
  the <code>Hoa\Ustring\Ustring::transcode</code> static method, with a string
  as first argument, the original encoding format as second argument and the
  expected encoding format as third argument (UTF-8 by default). To get the
  list of encoding formats, we have to refer to the
  <a href="http://php.net/iconv"><code>iconv</code></a> extension or to use the
  following command line in a terminal:</p>
  <pre><code class="language-php">$ iconv --list</code></pre>
  <p>To know if a string is encoded in UTF-8, we can use the
  <code>Hoa\Ustring\Ustring::isUtf8</code> static method; for instance:</p>
  <pre><code class="language-php">var_dump(
    Hoa\Ustring\Ustring::isUtf8('a'),
    Hoa\Ustring\Ustring::isUtf8(Hoa\Ustring\Ustring::transcode('a', 'UTF-8', 'UTF-16'))
);

/**
 * Will output:
 *     bool(true)
 *     bool(false)
 */</code></pre>
  <p>We can <strong>split</strong> the string into several sub-strings by using
  the <code>Hoa\Ustring\Ustring::split</code> method. As first argument, we have
  a regular expression (in the <a href="http://pcre.org/">PCRE</a> syntax), then
  an integer representing the maximum number of elements to return and finally a
  combination of constants. These constants are the same as the ones of
  <a href="http://php.net/preg_split"><code>preg_split</code></a>.</p>
  <p>By default, the second argument is set to -1, which means no limit, and the
  last argument is set to <code>PREG_SPLIT_NO_EMPTY</code>.
  Thus, if we would
  like to get all the words of a string, we will write:</p>
  <pre><code class="language-php">print_r($title->split('#\b|\s#'));

/**
 * Will output:
 *     Array
 *     (
 *         [0] => Un
 *         [1] => ete
 *         [2] => brulant
 *         [3] => sur
 *         [4] => la
 *         [5] => cote
 *     )
 */</code></pre>
  <p>If we would like to <strong>iterate</strong> over all the
  <strong>characters</strong>, it is recommended to use the
  <code>IteratorAggregate</code> interface, that is, the
  <code>Hoa\Ustring\Ustring::getIterator</code> method. Let's see with the Arabic
  example:</p>
  <pre><code class="language-php">foreach ($arabic as $letter) {
    echo $letter, "\n";
}

/**
 * Will output:
 *     أ
 *     ح
 *     ب
 *     ك
 */</code></pre>
  <p>We notice that the iteration is based on the text direction, which means
  that the first element of the iteration is the first letter of the string
  starting from the beginning.</p>
  <p>Of course, if we would like to get an array of characters, we can use the
  <a href="http://php.net/iterator_to_array"><code>iterator_to_array</code></a>
  PHP function:</p>
  <pre><code class="language-php">print_r(iterator_to_array($arabic));

/**
 * Will output:
 *     Array
 *     (
 *         [0] => أ
 *         [1] => ح
 *         [2] => ب
 *         [3] => ك
 *     )
 */</code></pre>

  <h3 id="Comparison_and_search" for="main-toc">Comparison and search</h3>

  <p>Strings can also be <strong>compared</strong> thanks to the
  <code>Hoa\Ustring\Ustring::compare</code> method:</p>
  <pre><code class="language-php">$string = new Hoa\Ustring\Ustring('abc');
var_dump(
    $string->compare('wxyz')
);

/**
 * Will output:
 *     int(-1)
 */</code></pre>
  <p>This method returns -1 if the initial string comes before (in
  alphabetical order), 0 if it is identical and 1 if it comes after.
  If we
  would like to use all the power of the underlying mechanism, we can call the
  <code>Hoa\Ustring\Ustring::getCollator</code> static method (if the
  <a href="http://php.net/Collator"><code>Collator</code></a> class exists, else
  <code>Hoa\Ustring\Ustring::compare</code> will use a simple byte-to-byte
  comparison without taking care of the other parameters). Thus, if we would
  like to sort an array of strings, we will write:</p>
  <pre><code class="language-php">$strings = array('c', 'Σ', 'd', 'x', 'α', 'a');
Hoa\Ustring\Ustring::getCollator()->sort($strings);
print_r($strings);

/**
 * Could output:
 *     Array
 *     (
 *         [0] => a
 *         [1] => c
 *         [2] => d
 *         [3] => x
 *         [4] => α
 *         [5] => Σ
 *     )
 */</code></pre>
  <p>Comparison between two strings depends on the <strong>locale</strong>,
  that is, on the localization of the system, like the language, the country,
  the region etc. We can use the
  <a href="@hack:chapter=Locale"><code>Hoa\Locale</code> library</a> to modify
  these data, but it is not a dependency of <code>Hoa\Ustring</code>.</p>
  <p>We can also know if a string <strong>matches</strong> a certain pattern,
  still expressed with a regular expression. To achieve that, we will use the
  <code>Hoa\Ustring\Ustring::match</code> method. This method relies on the
  <a href="http://php.net/preg_match"><code>preg_match</code></a> and
  <a href="http://php.net/preg_match_all"><code>preg_match_all</code></a> PHP
  functions, but modifies the pattern's options to ensure Unicode
  support. We have the following parameters: the pattern, a variable passed by
  reference to collect the matches, flags, an offset and finally a boolean
  indicating whether the search is global or not (respectively if we have to use
  <code>preg_match_all</code> or <code>preg_match</code>).
  By default, the
  search is not global.</p>
  <p>Thus, we will check that our French example contains <code>aime</code> with
  a direct object complement:</p>
  <pre><code class="language-php">$french->match('#(?:(?&lt;direct_object>\w)[\'\b])aime#', $matches);
var_dump($matches['direct_object']);

/**
 * Will output:
 *     string(1) "t"
 */</code></pre>
  <p>This method returns <code>false</code> if an error is raised (for example
  if the pattern is not correct), 0 if no match has been found, and the number
  of matches otherwise.</p>
  <p>Similarly, we can <strong>search</strong> and <strong>replace</strong>
  sub-strings by other sub-strings based on a pattern, still expressed with a
  regular expression. To achieve that, we will use the
  <code>Hoa\Ustring\Ustring::replace</code> method. This method uses the
  <a href="http://php.net/preg_replace"><code>preg_replace</code></a> and
  <a href="http://php.net/preg_replace_callback"><code>preg_replace_callback</code></a>
  PHP functions, but still modifies the pattern's options to ensure
  Unicode support. As first argument, we find one or more patterns, as second
  argument, one or more replacements and as last argument the limit of
  replacements to apply. If the replacement is a callable, then the
  <code>preg_replace_callback</code> function will be used.</p>
  <p>Thus, we will modify our French example to be more polite:</p>
  <pre><code class="language-php">$french->replace('#(?:\w[\'\b])(?&lt;verb>aime)#', function ($matches) {
    return 'vous ' .
        $matches['verb'];
});

echo $french;

/**
 * Will output:
 *     Je vous aime
 */</code></pre>
  <p>The <code>Hoa\Ustring\Ustring</code> class provides constants which are
  aliases of existing PHP constants and ensure a better readability of the
  code:</p>
  <ul>
    <li><code>Hoa\Ustring\Ustring::WITHOUT_EMPTY</code>, alias of
    <code>PREG_SPLIT_NO_EMPTY</code>,</li>
    <li><code>Hoa\Ustring\Ustring::WITH_DELIMITERS</code>, alias of
    <code>PREG_SPLIT_DELIM_CAPTURE</code>,</li>
    <li><code>Hoa\Ustring\Ustring::WITH_OFFSET</code>, alias of
    <code>PREG_OFFSET_CAPTURE</code> and
    <code>PREG_SPLIT_OFFSET_CAPTURE</code>,</li>
    <li><code>Hoa\Ustring\Ustring::GROUP_BY_PATTERN</code>, alias of
    <code>PREG_PATTERN_ORDER</code>,</li>
    <li><code>Hoa\Ustring\Ustring::GROUP_BY_TUPLE</code>, alias of
    <code>PREG_SET_ORDER</code>.</li>
  </ul>
  <p>Because they are strict aliases, we can write:</p>
  <pre><code class="language-php">$string = new Hoa\Ustring\Ustring('abc1 defg2 hikl3 xyz4');
$string->match(
    '#(\w+)(\d)#',
    $matches,
    Hoa\Ustring\Ustring::WITH_OFFSET
  | Hoa\Ustring\Ustring::GROUP_BY_TUPLE,
    0,
    true
);</code></pre>

  <h3 id="Characters" for="main-toc">Characters</h3>

  <p>The <code>Hoa\Ustring\Ustring</code> class offers static methods working on
  a single Unicode character. We have already mentioned the
  <code>getCharDirection</code> method, which allows us to know the
  <strong>direction</strong> of a character. We also have the
  <code>getCharWidth</code> method, which counts the
  <strong>number of columns</strong>
  necessary to print a single character.
  Thus:</p>
  <pre><code class="language-php">var_dump(
    Hoa\Ustring\Ustring::getCharWidth(Hoa\Ustring\Ustring::fromCode(0x7f)),
    Hoa\Ustring\Ustring::getCharWidth('a'),
    Hoa\Ustring\Ustring::getCharWidth('㽠')
);

/**
 * Will output:
 *     int(-1)
 *     int(1)
 *     int(2)
 */</code></pre>
  <p>This method returns -1 or 0 if the character is not
  <strong>printable</strong> (for instance, if this is a control character, like
  <code>0x7f</code> which corresponds to <code>DELETE</code>), 1 or more if this
  is a character that can be printed. In our example, <code>㽠</code> requires
  2 columns to be printed.</p>
  <p>To get more semantics, we have the
  <code>Hoa\Ustring\Ustring::isCharPrintable</code> method, which allows us to
  know whether a character is printable or not.</p>
  <p>If we would like to count the number of columns necessary for a whole
  string, we have to use the <code>Hoa\Ustring\Ustring::getWidth</code> method.
  Thus:</p>
  <pre><code class="language-php">var_dump(
    $french->getWidth(),
    $arabic->getWidth(),
    $japanese->getWidth()
);

/**
 * Will output:
 *     int(9)
 *     int(4)
 *     int(18)
 */</code></pre>
  <p>Try this in your terminal with a <strong>monospaced</strong> font. You will
  observe that Japanese requires 18 columns to be printed. This measure is very
  useful if we would like to know the length of a string to position it
  efficiently.</p>
  <p>The <code>getCharWidth</code> method differs from <code>getWidth</code>
  because it includes control characters. This method is intended to be used,
  for example, with terminals (please, see the
  <a href="@hack:chapter=Console"><code>Hoa\Console</code> library</a>).</p>
  <p>Finally, if this time we are not interested in Unicode characters but
  rather in <strong>machine</strong> characters <code>char</code> (being
  1 byte), we have an extra operation.
  The
  <code>Hoa\Ustring\Ustring::getBytesLength</code> method will count the
  <strong>length</strong> of the string in bytes:</p>
  <pre><code class="language-php">var_dump(
    $arabic->getBytesLength(),
    $japanese->getBytesLength()
);

/**
 * Will output:
 *     int(8)
 *     int(27)
 */</code></pre>
  <p>If we compare these results with the ones of the
  <code>Hoa\Ustring\Ustring::count</code> method, we understand that the Arabic
  characters are encoded with 2 bytes whereas Japanese characters are encoded
  with 3 bytes. We can also get a specific byte thanks to the
  <code>Hoa\Ustring\Ustring::getByteAt</code> method. Once again, the index is
  not bounded.</p>

  <h3 id="Code-point" for="main-toc">Code-point</h3>

  <p>Each character is represented by an integer, called a
  <strong>code-point</strong>. To get the code-point of a character, we can
  use the <code>Hoa\Ustring\Ustring::toCode</code> static method, and to get a
  character based on its code-point, we can use the
  <code>Hoa\Ustring\Ustring::fromCode</code> static method. We also have the
  <code>Hoa\Ustring\Ustring::toBinaryCode</code> method, which returns the binary
  representation of a character.
  Let's take an example:</p>
  <pre><code class="language-php">var_dump(
    Hoa\Ustring\Ustring::toCode('Σ'),
    Hoa\Ustring\Ustring::toBinaryCode('Σ'),
    Hoa\Ustring\Ustring::fromCode(0x3a3) // 0x3a3 is 931 in decimal
);

/**
 * Will output:
 *     int(931)
 *     string(16) "1100111010100011"
 *     string(2) "Σ"
 */</code></pre>

  <h2 id="Search_algorithms" for="main-toc">Search algorithms</h2>

  <p>The <code>Hoa\Ustring</code> library provides sophisticated
  <strong>search</strong> algorithms on strings through the
  <code>Hoa\Ustring\Search</code> class.</p>
  <p>We will study the <code>Hoa\Ustring\Search::approximated</code> algorithm,
  which searches a sub-string in a string up to <strong><em>k</em>
  differences</strong> (a difference is an addition, a deletion or a
  modification). Let's take the classical example of a DNA representation: we
  will search all the sub-strings approximating <code>GATAA</code> with
  1 difference (maximum) in <code>CAGATAAGAGAA</code>. So, we will write:</p>
  <pre><code class="language-php">$x = 'GATAA';
$y = 'CAGATAAGAGAA';
$k = 1;
$search = Hoa\Ustring\Search::approximated($y, $x, $k);
$n = count($search);

echo 'Try to match ', $x, ' in ', $y, ' with at most ', $k, ' difference(s):', "\n";
echo $n, ' match(es) found:', "\n";

foreach ($search as $position) {
    echo ' • ', substr($y, $position['i'], $position['l']), "\n";
}

/**
 * Will output:
 *     Try to match GATAA in CAGATAAGAGAA with at most 1 difference(s):
 *     4 match(es) found:
 *      • AGATA
 *      • GATAA
 *      • ATAAG
 *      • GAGAA
 */</code></pre>
  <p>This method returns an array of arrays. Each sub-array represents a result
  and contains three indexes: <code>i</code> for the position of the first
  character (byte) of the result, <code>j</code> for the position of the last
  character and <code>l</code> for the length of the result (simply
  <code>j</code> - <code>i</code>).
  Thus, we can compute the results by using
  our initial string (here <code class="language-php">$y</code>) and its
  indexes.</p>
  <p>With our example, we have four results. The first is <code>AGATA</code>,
  being <code>GATA<em>A</em></code> with one moved character, and
  <code>AGATA</code> exists in <code>C<em>AGATA</em>AGAGAA</code>. The second
  result is <code>GATAA</code>, our sub-string, which does exist in
  <code>CA<em>GATAA</em>GAGAA</code>. The third result is <code>ATAAG</code>,
  being <code><em>G</em>ATAA</code> with one moved character, and
  <code>ATAAG</code> exists in <code>CAG<em>ATAAG</em>AGAA</code>. Finally, the
  last result is <code>GAGAA</code>, being <code>GA<em>T</em>AA</code> with one
  modified character, and <code>GAGAA</code> exists in
  <code>CAGATAA<em>GAGAA</em></code>.</p>
  <p>Here is another, more concrete example. We will consider the
  <code>--testIt --foobar --testThat --testAt</code> string (which represents
  the possible options of a command line), and we will search for
  <code>--testot</code>, an option that could have been given by the user. This
  option does not exist as it is. We will then use our search algorithm with at
  most 1 difference. Let's see:</p>
  <pre><code class="language-php">$x = 'testot';
$y = '--testIt --foobar --testThat --testAt';
$k = 1;
$search = Hoa\Ustring\Search::approximated($y, $x, $k);
$n = count($search);

// …

/**
 * Will output:
 *     Try to match testot in --testIt --foobar --testThat --testAt with at most 1 difference(s)
 *     2 match(es) found:
 *      • testIt
 *      • testAt
 */</code></pre>
  <p>The <code>testIt</code> and <code>testAt</code> results are true options,
  so we can suggest them to the user.
  This is a mechanism used by
  <code>Hoa\Console</code> to suggest corrections to the user in case of a
  typo.</p>

  <h2 id="Conclusion" for="main-toc">Conclusion</h2>

  <p>The <code>Hoa\Ustring</code> library provides facilities to manipulate
  strings encoded with the Unicode format, but also to perform sophisticated
  searches on strings.</p>

</yield>
</overlay>