hoihoi · #1 02.12.2007, 20:10:07

Hello,

in the following code, str_word_count() also splits words at positions where umlauts occur, even though the umlauts are explicitly listed in the charlist, the locale is set to de, and the umlauts appear directly in the source text rather than as HTML entities.
The string $text itself, however, still contains the umlauts.

What am I doing wrong?

Here is the code:

PHP-Code:


			
$loc_de = setlocale(LC_ALL, 'de_DE@euro', 'de_DE', 'de', 'ge');

$text = file_get_contents('http://de.wikipedia.org/wiki/PHP');

// Use only the paragraphs of the HTML file.
$paragraphs = array();
while (($pos1 = stripos($text, '<p>')) !== false) {
    $pos2 = stripos($text, '</p>', $pos1);
    if ($pos2 === false) break; // unclosed paragraph: stop here
    $paragraphs[] = substr($text, $pos1, $pos2 - $pos1);
    $text = substr($text, $pos2 + 4);
}
$text = join(' ', $paragraphs);

// Delete script parts, including the closing tag (9 characters).
while (($pos1 = stripos($text, '<script')) !== false) {
    $pos2 = stripos($text, '</script>', $pos1);
    if ($pos2 === false) break;
    $text = substr_replace($text, '', $pos1, $pos2 + 9 - $pos1);
}

// Specify language-specific special characters here?
$textarray = str_word_count(strip_tags($text), 1, 'äöüÄÖÜß');

print_r($text);
print_r($textarray);

All the best

Ingo

#2 03.12.2007, 01:20:08

<crystalball>
Since the str_* functions so far only cope properly with 8-bit character sets, while you are feeding the poor little things UTF-8, you really shouldn't be too surprised by malfunctions. Until PHP6 is installed on your server as well, you should fall back on the mb_* functions.
</crystalball>
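
To make the point about 8-bit functions concrete, here is a minimal sketch. It assumes (this is not stated in the thread) that the script file in #1 was saved as ISO-8859-1 while the Wikipedia page is delivered as UTF-8, so the bytes in the charlist never match the bytes in the text. The locale is pinned to C only to keep the demo predictable.

```php
<?php
// Pin the character classification so the demo does not depend on the server locale.
setlocale(LC_CTYPE, 'C');

$utf8   = "K\xC3\xB6ln"; // "Köln" as UTF-8: the ö takes two bytes
$latin1 = "K\xF6ln";     // "Köln" as ISO-8859-1: the ö takes one byte

// strlen() counts bytes, not characters.
$bytesUtf8   = strlen($utf8);   // 5
$bytesLatin1 = strlen($latin1); // 4

// Assumed scenario: charlist in ISO-8859-1, text in UTF-8.
// The bytes differ, so the word is cut at the umlaut.
$split = str_word_count($utf8, 1, "\xF6");

// Same encoding on both sides: the word stays in one piece.
$whole = str_word_count($latin1, 1, "\xF6");

print_r($split);
echo count($whole), "\n";
```

In the mismatched case the two bytes of the UTF-8 "ö" are treated as separators, which matches the symptom described in #1.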

hoihoi · #3 03.12.2007, 02:36:03

Hi rambi,

thanks for your answer. It put me on the right track.
I have looked at the multibyte functions, but I shied away from them a little for now, especially since they do not include a str_word_count(), which does exactly what I want.

Instead I found the function utf8_decode(), and it does the trick. ;-)
The umlauts are shown as question marks in the output, but after utf8_encode() everything is just fine.

Here: http://www.php.net/utf8-decode there is also a pointer to iconv(), which is even more flexible.
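
The workaround described here could look roughly like the sketch below. It uses iconv() rather than utf8_decode()/utf8_encode() (both deprecated as of PHP 8.2), and it assumes that the iconv extension is available, that the script file itself is saved as UTF-8, and that the text contains only characters that exist in ISO-8859-1. The sample sentence is made up.

```php
<?php
setlocale(LC_CTYPE, 'C'); // keep str_word_count() independent of the server locale

$text     = 'Grüße aus Köln'; // hypothetical sample text, UTF-8
$charlist = 'äöüÄÖÜß';        // UTF-8 as well, because the file is UTF-8

// Step 1: bring text and charlist into the same single-byte encoding.
$textLatin1     = iconv('UTF-8', 'ISO-8859-1', $text);
$charlistLatin1 = iconv('UTF-8', 'ISO-8859-1', $charlist);

// Step 2: let the byte-oriented function do its work.
$wordsLatin1 = str_word_count($textLatin1, 1, $charlistLatin1);

// Step 3: convert every word back to UTF-8 for output.
$words = array();
foreach ($wordsLatin1 as $word) {
    $words[] = iconv('ISO-8859-1', 'UTF-8', $word);
}

print_r($words);
```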

Many thanks and all the best

hoihoi

#4 03.12.2007, 02:59:42

Well, look at that, someone who thinks along!! Good!

However, I am not a big fan of converting...
All right, going from iso-something towards UTF is usually no problem. But in the other direction you have to expect losses.
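
A small sketch of the loss mentioned here, assuming the mbstring extension is available and the script file is saved as UTF-8. The sample word is made up. Characters without an ISO-8859-1 equivalent are replaced by a substitute character on the way down and cannot be recovered on the way back.

```php
<?php
$original = 'Łódź'; // hypothetical sample; Ł and ź do not exist in ISO-8859-1, ó does

// UTF-8 -> ISO-8859-1: unmappable characters are replaced by mbstring's substitute character.
$latin1 = mb_convert_encoding($original, 'ISO-8859-1', 'UTF-8');

// ISO-8859-1 -> UTF-8: every byte maps to a character, so this direction loses nothing.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($roundTrip === $original); // bool(false)
```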

hoihoi · #5 08.12.2007, 21:24:02

Hello,

your objection that decoding UTF-8 can lose information did give me pause.
So I tried to rebuild str_word_count() in such a way that it can also process UTF-8 strings.

utf8_str_word_count() accepts the same parameters as str_word_count() and its output is analogous; the input string must be UTF-8 encoded, and the returned arrays contain UTF-8 values as well.

Locale information is not processed, and the peculiarity of str_word_count() that "'" and "-" at the beginning of a word are not included in the word is not reproduced either.

The function also uses other rebuilt UTF-8 string functions, which are listed below as well.

I would be very glad about a critical look at the code or about improvements.

utf8_str_word_count():
Careful, the charlist handling only works to a limited extent. An improved proposal follows further down.

PHP-Code:


			
/**
 * Works like str_word_count() for UTF-8 strings
 * does not support locale information
 * does not exclude "'" or "-" at the beginning of a word
 * may behave strangely if $format is not 0, 1 or 2
 * public domain
 */
function utf8_str_word_count($string, $format = 0, $charlist = '') {
    $oldstring = $string;
    $charlist = '/[^a-zA-Z' . $charlist . '\-\']/';
    // Replace everything that is not a word character with a space.
    $string = preg_replace(utf8_encode($charlist), ' ', $string);
    $string = preg_replace('=[\s]+=', ' ', trim($string));
    // An empty string contains no words.
    $array = ('' !== $string) ? explode(' ', $string) : array();
    $count = count($array);
    switch ($format) {
    case 0:
        return($count);
    case 1:
        return($array);
    case 2:
        // Map the character offset of each word to the word itself.
        $pos = 0;
        $posarray = array();
        foreach ($array as $value) {
            $pos += utf8_strpos($oldstring, $value);
            $posarray[$pos] = $value;
            $oldstring = utf8_substr($oldstring, utf8_strlen($value) + utf8_strpos($oldstring, $value));
            $pos += utf8_strlen($value);
        }
        return($posarray);
    }
}

Here are the other UTF-8 string functions that utf8_str_word_count() needs:

PHP-Code:


			
/**
 * UTF8 helper functions
 *
 * @license    LGPL (http://www.gnu.org/copyleft/lesser.html)
 * @author     Andreas Gohr <andi@splitbrain.org>
 */
 
/**
 * check for mb_string support
 */
if(!defined('UTF8_MBSTRING')){
  if(function_exists('mb_substr') && !defined('UTF8_NOMBSTRING')){
    define('UTF8_MBSTRING',1);
  }else{
    define('UTF8_MBSTRING',0);
  }
}
 
if(UTF8_MBSTRING){ mb_internal_encoding('UTF-8'); }

/**
 * Unicode aware replacement for strlen()
 *
 * utf8_decode() converts characters that are not in ISO-8859-1
 * to '?', which, for the purpose of counting, is alright - It's
 * even faster than mb_strlen.
 *
 * @author <chernyshevsky at hotmail dot com>
 * @see    strlen()
 * @see    utf8_decode()
 */
function utf8_strlen($string){
  return strlen(utf8_decode($string));
}
 
/**
 * UTF-8 aware alternative to substr
 *
 * Return part of a string given character offset (and optionally length)
 *
 * @author Harry Fuecks <hfuecks@gmail.com>
 * @author Chris Smith <chris@jalakai.co.uk>
 * @param string
 * @param integer number of UTF-8 characters offset (from left)
 * @param integer (optional) length in UTF-8 characters from offset
 * @return mixed string or false if failure
 */
function utf8_substr($str, $offset, $length = null) {
    if(UTF8_MBSTRING){
        if( $length === null ){
            return mb_substr($str, $offset);
        }else{
            return mb_substr($str, $offset, $length);
        }
    }
 
    /*
     * Notes:
     *
     * no mb string support, so we'll use pcre regex's with 'u' flag
     * pcre only supports repetitions of less than 65536, in order to accept up to MAXINT values for
     * offset and length, we'll repeat a group of 65535 characters when needed (ok, up to MAXINT-65536)
     *
     * substr documentation states false can be returned in some cases (e.g. offset > string length)
     * mb_substr never returns false, it will return an empty string instead.
     *
     * calculating the number of characters in the string is a relatively expensive operation, so
     * we only carry it out when necessary. It isn't necessary for +ve offsets and no specified length
     */
 
    // cast parameters to appropriate types to avoid multiple notices/warnings
    $str = (string)$str;                          // generates E_NOTICE for PHP4 objects, but not PHP5 objects
    $offset = (int)$offset;
    if (!is_null($length)) $length = (int)$length;
 
    // handle trivial cases
    if ($length === 0) return '';
    if ($offset < 0 && $length < 0 && $length < $offset) return '';
 
    $offset_pattern = '';
    $length_pattern = '';
 
    // normalise -ve offsets (we could use a tail anchored pattern, but they are horribly slow!)
    if ($offset < 0) {
      $strlen = strlen(utf8_decode($str));        // see notes
      $offset = $strlen + $offset;
      if ($offset < 0) $offset = 0;
    }
 
    // establish a pattern for offset, a non-captured group equal in length to offset
    if ($offset > 0) {
      $Ox = (int)($offset/65535);
      $Oy = $offset%65535;
 
      if ($Ox) $offset_pattern = '(?:.{65535}){'.$Ox.'}';
      $offset_pattern = '^(?:'.$offset_pattern.'.{'.$Oy.'})';
    } else {
      $offset_pattern = '^';                      // offset == 0; just anchor the pattern
    }
 
    // establish a pattern for length
    if (is_null($length)) {
      $length_pattern = '(.*)$';                  // the rest of the string
    } else {
 
      if (!isset($strlen)) $strlen = strlen(utf8_decode($str));    // see notes
      if ($offset > $strlen) return '';           // another trivial case
 
      if ($length > 0) {
 
        $length = min($strlen-$offset, $length);  // reduce any length that would go past the end of the string
 
        $Lx = (int)($length/65535);
        $Ly = $length%65535;
 
        // +ve length requires ... a captured group of length characters
        if ($Lx) $length_pattern = '(?:.{65535}){'.$Lx.'}';
        $length_pattern = '('.$length_pattern.'.{'.$Ly.'})';
 
      } else if ($length < 0) {
 
        if ($length < ($offset - $strlen)) return '';
 
        $Lx = (int)((-$length)/65535);
        $Ly = (-$length)%65535;
 
        // -ve length requires ... capture everything except a group of -length characters
        //                         anchored at the tail-end of the string
        if ($Lx) $length_pattern = '(?:.{65535}){'.$Lx.'}';
        $length_pattern = '(.*)(?:'.$length_pattern.'.{'.$Ly.'})$';
      }
    }
 
    if (!preg_match('#'.$offset_pattern.$length_pattern.'#us',$str,$match)) return '';
    return $match[1];
}

/**
 * This is a Unicode-aware replacement for strpos
 *
 * @author Leo Feyer <leo@typolight.org>
 * @see    strpos()
 * @param  string
 * @param  string
 * @param  integer
 * @return integer
 */
function utf8_strpos($haystack, $needle, $offset=0){
    $comp = 0;
    $length = null;
 
    while (is_null($length) || $length < $offset) {
        $pos = strpos($haystack, $needle, $offset + $comp);
 
        if ($pos === false)
            return false;
 
        $length = utf8_strlen(substr($haystack, 0, $pos));
 
        if ($length < $offset)
            $comp = $pos - $length;
    }
 
    return $length;
}
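
Since utf8_decode() is deprecated as of PHP 8.2, here is a sketch of how the character count could be obtained without it. The function name is made up for this example and is not part of the helper library quoted above. The approach relies on the PCRE /u modifier, under which a dot matches one whole UTF-8 character; the script file is assumed to be saved as UTF-8.

```php
<?php
/**
 * Hypothetical alternative to utf8_strlen(): counts characters with PCRE.
 * Returns 0 if the input is not valid UTF-8.
 */
function sketch_utf8_strlen($string) {
    // /u: treat pattern and subject as UTF-8; /s: let the dot match newlines as well.
    $count = preg_match_all('/./us', (string)$string);
    return ($count === false) ? 0 : $count;
}

echo strlen('Grüße'), "\n";             // counts bytes
echo sketch_utf8_strlen('Grüße'), "\n"; // counts characters
```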

Sources and further links:
http://lists.phpbar.de/pipermail/php...14/016838.html
http://dev.splitbrain.org/view/darcs...i/inc/utf8.php
http://www.sitepoint.com/blogs/2006/...hp-utf-8-tips/

hoihoi · #6 10.12.2007, 18:40:00

OK, here are two more interesting articles on character encoding (that is, how characters are stored, e.g. as ASCII or UTF-8):

http://www.joelonsoftware.com/articles/Unicode.html
http://www.sitepoint.com/blogs/2006/...ter-encodings/

Have fun.

hoihoi · #7 04.01.2008, 16:01:28

Hello,

in my proposal for rebuilding str_word_count() for UTF-8, the charlist is not always processed properly. The construction with utf8_encode() alone is already nonsense.

In the meantime I have looked into regular expressions a bit and put together a new version. It departs somewhat from the original function: every character that \w or \pL treats as a word character is kept, and everything matched by /[^\w\pL]/u is treated as a separator.

PHP-Code:


			
/**
 * Works like str_word_count() for UTF-8 strings
 * no charlist, uses every word character
 * does not use locale information
 * does not exclude "'" or "-" at the beginning of a word
 * may behave strangely if $format is not 0, 1 or 2
 * public domain
 */
function utf8_str_word_count($string, $format = 0) {
    $oldstring = $string;
    // Non-word characters become spaces ...
    $string = preg_replace('/[^\w\pL]/u', ' ', $string);
    // ... and so do ASCII digits, which \w would otherwise keep.
    $string = preg_replace('/[\d]/', ' ', $string);
    $string = preg_replace('/[\s]+/', ' ', trim($string));
    // An empty string contains no words.
    $array = ('' !== $string) ? explode(' ', $string) : array();
    $count = count($array);
    switch ($format) {
    case 0:
        return($count);
    case 1:
        return($array);
    case 2:
        // Map the character offset of each word to the word itself.
        $pos = 0;
        $posarray = array();
        foreach ($array as $value) {
            $pos += utf8_strpos($oldstring, $value);
            $posarray[$pos] = $value;
            $oldstring = utf8_substr($oldstring, utf8_strlen($value) + utf8_strpos($oldstring, $value));
            $pos += utf8_strlen($value);
        }
        return($posarray);
    }
}

I am still a beginner and may not have tested every eventuality; there is surely still room for improvement.
It would be nice if someone had another idea about this.

hoihoi · #8 08.01.2008, 03:17:01

Is the topic really of so little interest?

I have completely reworked the function once more:

PHP-Code:


			
/**
 * This is a Unicode-aware replacement for str_word_count()
 * $charlist has to be ASCII or UTF-8
 * does not use locale information
 * public domain
 * see also http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php
 */
function utf8_str_word_count($string,$format=0,$charlist='') {
    // Turn every run of non-word characters into one space, then split on spaces.
    $clean = trim(preg_replace("/[^'\-A-Za-z".$charlist."]+/u",' ',$string));
    $array = ('' !== $clean) ? explode(' ',$clean) : array();
    switch ($format) {
    case 0:
        return(count($array));
    case 1:
        return($array);
    case 2:
        // Search each word from the end of the previous one onwards.
        $pos = 0;
        $posarray = array();
        foreach ($array as $value) {
            $pos = utf8_strpos($string,$value,$pos);
            $posarray[$pos] = $value;
            $pos += utf8_strlen($value);
        }
        return($posarray);
    }
}

The charlist is back, but it may only contain ASCII or UTF-8 characters (because of the /u modifier).
All Unicode letters (word characters) can be included by passing the following charlist:

PHP-Code:

$charlist = '\\pL';
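
As a quick illustration of what the charlist changes, here is a sketch that applies the same kind of pattern with and without \pL. The helper name and the sample sentence are made up, and the script file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: the word-splitting step only, without the $format handling.
function sketch_words($string, $charlist = '') {
    $clean = trim(preg_replace("/[^'\\-A-Za-z" . $charlist . "]+/u", ' ', $string));
    return ($clean === '') ? array() : explode(' ', $clean);
}

$text = 'Grüße aus Köln'; // hypothetical sample text, UTF-8

$asciiOnly  = sketch_words($text);          // umlauts and ß act as separators
$allLetters = sketch_words($text, '\\pL');  // every Unicode letter belongs to a word

print_r($asciiOnly);
print_r($allLetters);
```

Without a charlist the sample falls apart into five fragments; with '\\pL' the three words stay intact.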

Following a tip from Katachi in this thread, I have simplified "case 2" considerably.
As a result, of the utf8 functions listed above, only utf8_strlen() and utf8_strpos() are still needed.

Oh, and according to the documentation, str_word_count() does not include "'" and "-" at the beginning of a word, but in my tests it did. Can anyone confirm or refute that?
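
One way to check this on a given PHP version is a small probe like the following. The sample strings are made up and no particular result is claimed here; the output shows whether a leading apostrophe or hyphen is kept, both at the start of the string and in the middle of it.

```php
<?php
// Hypothetical probe strings: a leading "'" or "-" at the very start and mid-string.
$samples = array(
    "'quoted word",
    "-dashed word",
    "say 'quoted -dashed",
);

$results = array();
foreach ($samples as $sample) {
    // Format 1 returns the list of words that were found.
    $results[$sample] = str_word_count($sample, 1);
}

print_r($results);
```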

hoihoi · #9 18.01.2008, 16:19:54

Update:

PHP-Code:


			
function utf8_str_word_count($string,$format=0,$charlist='') {
    // Split directly on runs of non-word characters; empty pieces are dropped.
    $array = preg_split("/[^'\-A-Za-z".$charlist."]+/u",$string,-1,PREG_SPLIT_NO_EMPTY);
    switch ($format) {
    case 0:
        return(count($array));
    case 1:
        return($array);
    case 2:
        // Search each word from the end of the previous one onwards.
        $pos = 0;
        $posarray = array();
        foreach ($array as $value) {
            $pos = utf8_strpos($string,$value,$pos);
            $posarray[$pos] = $value;
            $pos += utf8_strlen($value);
        }
        return($posarray);
    }
}

This is considerably faster and of course also nicer.
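
For format 2, a different route is sketched below: preg_split() can report where each piece starts via PREG_SPLIT_OFFSET_CAPTURE, so no repeated searching is needed. This is not the function from the thread. The function names are made up, the offsets reported by PCRE are byte offsets and are converted to character offsets here, and the sample sentence is hypothetical (script file assumed to be UTF-8).

```php
<?php
// Hypothetical helper: number of UTF-8 characters in $string (0 for invalid UTF-8).
function sketch_char_count($string) {
    $count = preg_match_all('/./us', $string);
    return ($count === false) ? 0 : $count;
}

// Hypothetical variant of format 2: array of character offset => word.
function sketch_word_positions($string, $charlist = '') {
    $pattern = "/[^'\\-A-Za-z" . $charlist . "]+/u";
    $flags   = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE;
    $parts   = preg_split($pattern, $string, -1, $flags);

    $positions = array();
    if ($parts === false) {
        return $positions; // e.g. invalid UTF-8 input
    }
    foreach ($parts as $part) {
        list($word, $byteOffset) = $part;
        // PCRE reports byte offsets; count the characters in front of that byte.
        $charOffset = sketch_char_count(substr($string, 0, $byteOffset));
        $positions[$charOffset] = $word;
    }
    return $positions;
}

print_r(sketch_word_positions('Grüße aus Köln', '\\pL'));
```

Counting the characters from the start of the string for every word keeps the sketch short but gets slow on long texts; a running counter over the gaps between words would avoid that.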
