php-src/ext
Gustavo André dos Santos Lopes f5b421621d BreakIterator and RuleBasedBreakiterator added
This commit adds wrappers for the classes BreakIterator and
RuleBasedBreakIterator. The C++ ICU classes are described here:
<http://icu-project.org/apiref/icu4c/classBreakIterator.html>
<http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html>

Additionally, a tutorial is available at:
<http://userguide.icu-project.org/boundaryanalysis>

This implementation wraps UTF-8 text in a UText. The text is
iterated without any copying or conversion to UTF-16. There is
also no validation that the input is actually UTF-8; where there
are malformed sequences, the UText will simply return U+FFFD.
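
As a rough illustration of the substitution described above, the
sketch below uses Python's own UTF-8 decoder, not ICU's UText, so
it only models the idea: nothing is validated up front, and a
malformed byte shows up as U+FFFD when it is read. The sample
bytes are made up for the example.

```python
# Illustration only: Python's UTF-8 decoder, not ICU's UText.
raw = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8

# No validation step beforehand: the bytes are decoded as they are
# read, and the malformed byte surfaces as U+FFFD.
decoded = raw.decode("utf-8", errors="replace")

# The original bytes are not modified by decoding.
unchanged = raw == b"ab\xffcd"
```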

The class BreakIterator cannot be instantiated directly (it has a
private constructor). It provides the interface exposed by the ICU
abstract class with the same name. The PHP class is not abstract
because we may use it to wrap native subclasses of BreakIterator
that we don't know how to wrap. This class includes methods to
move the iterator position to the beginning (first()), to the
end (last()), forward (next()), backwards (previous()), to the
boundary preceding a certain position (preceding()) and following
a certain position (following()) and to obtain the current position
(current()). next() can also be used to advance or recede an
arbitrary number of positions.
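
The navigation methods listed above can be pictured with a toy
model. The sketch below is a hypothetical Python stand-in
(ToyBreakCursor is not part of this commit or of ICU) that walks
a precomputed list of boundary offsets. It assumes a DONE
sentinel of -1, as in ICU, and that following()/preceding() look
strictly after/before the given offset; how the cursor is clamped
at the ends is a choice of this sketch.

```python
from bisect import bisect_left, bisect_right

DONE = -1  # sentinel: no boundary in the requested direction


class ToyBreakCursor:
    """Hypothetical stand-in for the navigation methods, not the real class."""

    def __init__(self, boundaries):
        # Sorted offsets, expected to include 0 and the text length.
        self._b = sorted(boundaries)
        self._i = 0

    def current(self):
        return self._b[self._i]

    def first(self):
        self._i = 0
        return self.current()

    def last(self):
        self._i = len(self._b) - 1
        return self.current()

    def next(self, n=1):
        # n may be negative to recede instead of advance.
        target = self._i + n
        if target < 0 or target >= len(self._b):
            # Sketch choice: stop at the nearest end and report DONE.
            self._i = 0 if target < 0 else len(self._b) - 1
            return DONE
        self._i = target
        return self.current()

    def previous(self):
        return self.next(-1)

    def following(self, offset):
        # First boundary strictly after offset.
        i = bisect_right(self._b, offset)
        if i == len(self._b):
            self._i = len(self._b) - 1
            return DONE
        self._i = i
        return self.current()

    def preceding(self, offset):
        # Last boundary strictly before offset.
        i = bisect_left(self._b, offset) - 1
        if i < 0:
            self._i = 0
            return DONE
        self._i = i
        return self.current()
```

With the boundaries [0, 5, 6, 11] (word-like breaks in
"hello world"), first() gives 0, next() gives 5, and next(2)
jumps straight to 11.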

BreakIterator also exposes other native methods:
getAvailableLocales(), getLocale() and factory methods to build
several predefined types of BreakIterators: createWordInstance()
for word boundaries, createCharacterInstance() for locale
dependent notions of "characters", createSentenceInstance() for
sentences, createLineInstance() for line breaks and
createTitleInstance() for title casing breaks. These factories
currently return RuleBasedBreakIterators where the names of the
rule sets are found in the ICU data, observing the passed locale
(although the locale is taken into consideration, there are very
few exceptions to the root rules).

The clone and compare_object PHP object handlers are also
implemented, though the comparison does not yield meaningful results
when used with >, <, >= and <=.

Note that BreakIterator is an iterator only in the sense of the
first 'Iterator' in 'IteratorIterator', i.e., it does not
implement the Iterator interface. The reason is that there is
no sensible implementation for Iterator::key(). Using it for
an ordinal of the current boundary is not feasible because
we are allowed to move to any boundary at any time. If we were
to determine the current ordinal when last() is called we'd
have to traverse the whole input text to find out how many
breaks there were before. Therefore, BreakIterator implements
only Traversable. It can be wrapped in an IteratorIterator,
but the usual warnings apply.
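
The cost argument above can be sketched as follows. The boundary
rule here (a break after every run of whitespace or
non-whitespace) and the names toy_boundaries and ordinal_of are
made up for the illustration. The point is only that the offset
of the last boundary is known immediately, while its ordinal
requires visiting every boundary before it.

```python
import re


def toy_boundaries(text):
    """Lazily yield toy boundary offsets: 0, then the end of each run."""
    yield 0
    for match in re.finditer(r"\s+|\S+", text):
        yield match.end()


def ordinal_of(text, position):
    """Return the ordinal of a boundary by scanning from the start."""
    for ordinal, boundary in enumerate(toy_boundaries(text)):
        if boundary == position:
            return ordinal
    raise ValueError("position is not a boundary")


text = "a bb  ccc"
last_offset = len(text)                        # known without scanning
last_ordinal = ordinal_of(text, last_offset)   # needs the full scan
```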

Finally, I added a convenience method to BreakIterator:
getPartsIterator(). This provides an IntlIterator, backed
by the BreakIterator PHP object (i.e. moving the pointer or
changing the text in BreakIterator affects the iterator
and also moving the iterator affects the backing BreakIterator),
which allows traversing the text between each boundary.
This iterator uses the original text to retrieve the text
between two positions, not the code points returned by the
wrapping UText. Therefore, if the text includes invalid code
unit sequences, these invalid sequences will be in the output
of this iterator, not U+FFFD code points.
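
A minimal sketch of that last point, in Python rather than the
actual implementation: the parts are cut from the original bytes
at the boundary offsets, so malformed bytes pass through
unchanged. The function name parts, the sample bytes and the
boundary offsets are all assumptions made for the example.

```python
def parts(data, boundaries):
    """Yield the raw byte slices between consecutive boundary offsets."""
    for start, end in zip(boundaries, boundaries[1:]):
        yield data[start:end]


raw = b"ab\xff cd"              # contains one malformed byte (0xFF)
offsets = [0, 3, 4, 6]          # hypothetical boundaries for this input
chunks = list(parts(raw, offsets))

# Slicing the original bytes keeps 0xFF as it was; no U+FFFD appears.
```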

The class RuleBasedBreakIterator exposes a constructor that allows
building an iterator from arbitrary compiled or non-compiled
rules. The form of these rules is described in the tutorial linked
above. The rest of the methods allow retrieving the rules
(getRules() and getCompiledRules()), a hash code of the rule set
(hashCode()) and the rule statuses (getRuleStatus() and
getRuleStatusVec()).

Because the RuleBasedBreakIterator constructor may return parse
errors, I reused the UParseError to text function that was in the
transliterator files. To do so, I moved that function to
intl_error.c.

common_enum.cpp was also changed, mainly to expose previously
static functions. This avoided code duplication when implementing
the BreakIterator iterator and the IntlIterator returned by
BreakIterator::getPartsIterator().
2012-06-04 22:25:07 +02:00
..
bcmath - Year++ 2012-01-01 13:15:04 +00:00
bz2 - Year++ 2012-01-01 13:15:04 +00:00
calendar Merge branch 'PHP-5.3' into PHP-5.4 2012-04-28 11:44:54 +02:00
com_dotnet Merge branch 'PHP-5.3' into PHP-5.4 2012-05-25 00:23:51 +02:00
ctype - Year++ 2012-01-01 13:15:04 +00:00
curl VIM uses spaces as tabs and that doesn't comply with the coding 2012-05-27 15:39:45 -07:00
date Merge branch 'PHP-5.4' 2012-04-24 13:45:07 +02:00
dba Merge branch 'PHP-5.3' into PHP-5.4 2012-03-20 17:58:58 +01:00
dom Merge branch 'PHP-5.3' into PHP-5.4 2012-05-15 11:43:28 +01:00
enchant Merge branch 'PHP-5.3' into PHP-5.4 2012-03-20 17:58:58 +01:00
ereg - Year++ 2012-01-01 13:15:04 +00:00
exif Merge commit 'e59b6dc0ae' 2012-06-03 19:02:00 -03:00
fileinfo Merge branch 'PHP-5.4' 2012-05-29 17:42:35 +02:00
filter Merge branch 'PHP-5.3' into PHP-5.4 2012-04-30 10:28:00 +02:00
ftp MFH r322485 2012-01-26 05:15:57 +00:00
gd Merge branch 'PHP-5.3' into PHP-5.4 2012-04-04 18:54:03 +02:00
gettext - Year++ 2012-01-01 13:15:04 +00:00
gmp Merge branch 'PHP-5.3' into PHP-5.4 2012-05-21 12:37:59 +02:00
hash fix tests failing due to corrected hash tiger 2012-03-19 21:49:47 +01:00
iconv fix bug #55042 - erealloc without updating pointer 2012-05-30 22:26:26 -07:00
imap - Year++ 2012-01-01 13:15:04 +00:00
interbase Merge branch 'PHP-5.4' 2012-03-29 18:28:38 +03:00
intl BreakIterator and RuleBasedBreakiterator added 2012-06-04 22:25:07 +02:00
json Revert "Update test to fix breakage caused by the previous commit." 2012-05-15 23:25:06 -07:00
ldap Merge branch 'PHP-5.3' into PHP-5.4 2012-04-16 15:26:50 +02:00
libxml Fix: 62067 Moved comments to FILE section 2012-05-19 16:34:16 +01:00
mbstring Fixed bug #61631 mbstring mail related tests fail 2012-04-10 12:23:07 +02:00
mcrypt - Year++ 2012-01-01 13:15:04 +00:00
mssql - Year++ 2012-01-01 13:15:04 +00:00
mysql - Year++ 2012-01-01 13:15:04 +00:00
mysqli Merge branch 'PHP-5.4' 2012-05-16 16:00:17 +02:00
mysqlnd close the underlying stream as early as possible and so notify the 2012-06-01 22:12:08 +03:00
oci8 Merge branch 'PHP-5.3' into PHP-5.4 2012-03-30 16:17:37 -07:00
odbc - Year++ 2012-01-01 13:15:04 +00:00
openssl Fix bug #61413 ext\openssl\tests\openssl_encrypt_crash.phpt fails 5.3 only 2012-04-24 14:05:35 +02:00
pcntl Merge branch '5.4' 2012-03-29 08:48:13 +01:00
pcre Deprecate /e modifier 2012-03-04 13:39:12 +00:00
pdo Merge branch 'PHP-5.4' 2012-04-19 12:49:47 +02:00
pdo_dblib - Year++ 2012-01-01 13:15:04 +00:00
pdo_firebird fix gcov Warning: ibase_drop_db(): lock time-out on wait transaction object http://gcov.php.net/viewer.php?version=PHP_5_4&func=tests&file=ext%2Fpdo_firebird%2Ftests%2Fbug_53280.phpt 2012-02-05 09:58:50 +00:00
pdo_mysql Merge branch 'PHP-5.4' 2012-05-02 16:15:35 +02:00
pdo_oci - Year++ 2012-01-01 13:15:04 +00:00
pdo_odbc Fixed bug #61212 (PDO ODBC Segfaults on SQL_SUCESS_WITH_INFO). 2012-03-14 20:20:33 +00:00
pdo_pgsql - Fixed bug #61267: pdo_pgsql's PDO::exec() returns the number of SELECTed 2012-03-08 08:52:28 +00:00
pdo_sqlite - fix #55226, WS 2012-01-31 07:17:05 +00:00
pgsql add pg_escape_identifier/pg_escape_literal 2012-04-19 13:40:24 -07:00
phar fix unchecked emalloc 2012-05-30 21:37:28 +02:00
posix Merge branch 'PHP-5.3' into PHP-5.4 2012-05-15 11:43:28 +01:00
pspell - Year++ 2012-01-01 13:15:04 +00:00
readline Fix bug #62186 readline fails to compile 2012-05-31 01:15:22 +02:00
recode Replace $Revision$ with $Id$ in keyword expansion enable files 2012-03-20 17:53:47 +01:00
reflection Merge branch 'PHP-5.3' into PHP-5.4 2012-06-01 15:00:02 +08:00
session Merge branch 'PHP-5.3' into PHP-5.4 2012-04-30 12:10:43 +02:00
shmop - Year++ 2012-01-01 13:15:04 +00:00
simplexml Merge branch 'PHP-5.3' into PHP-5.4 2012-03-20 17:58:58 +01:00
skeleton Replace $Revision$ with $Id$ in keyword expansion enable files 2012-03-20 17:53:47 +01:00
snmp merge from trunk: 2012-01-13 18:46:56 +00:00
soap remove remaining traces of unicode.* ini settings 2012-05-27 19:57:34 -04:00
sockets Merge branch 'PHP-5.4' 2012-05-21 08:55:05 -03:00
spl Merge remote-tracking branch 'origin/PHP-5.4' 2012-05-24 23:38:53 +08:00
sqlite3 Merge branch 'PHP-5.4' 2012-04-26 15:18:17 +02:00
standard Merge branch 'PHP-5.4' 2012-05-30 14:44:35 +08:00
sybase_ct - Year++ 2012-01-01 13:15:04 +00:00
sysvmsg Merge branch 'PHP-5.3' into PHP-5.4 2012-03-20 17:58:58 +01:00
sysvsem - Year++ 2012-01-01 13:15:04 +00:00
sysvshm - Year++ 2012-01-01 13:15:04 +00:00
tidy Merge branch 'PHP-5.3' into PHP-5.4 2012-05-21 12:52:10 +02:00
tokenizer - Year++ 2012-01-01 13:15:04 +00:00
wddx - Year++ 2012-01-01 13:15:04 +00:00
xml - Year++ 2012-01-01 13:15:04 +00:00
xmlreader more verbose skip reason in test files with not so obvious extension requirements 2012-02-25 12:10:41 +00:00
xmlrpc Fix bug #61264: xmlrpc_parse_method_descriptions leaks temporary variable 2012-03-03 12:46:17 +00:00
xmlwriter - Fixed bug #62064 (memory leak in the XML Writer module) 2012-05-18 19:34:39 -03:00
xsl - Year++ 2012-01-01 13:15:04 +00:00
zip - Year++ 2012-01-01 13:15:04 +00:00
zlib cleanup merge 2012-05-15 09:44:01 +02:00
ext_skel
ext_skel_win32.php