libunibreak 6.1
linebreak.c File Reference

Implementation of the line breaking algorithm as described in Unicode Standard Annex 14.

#include <assert.h>
#include <stddef.h>
#include <string.h>
#include "eastasianwidthdef.h"
#include "linebreak.h"
#include "linebreakdef.h"

Macros

#define LINEBREAK_UNDEFINED   -1
 Special value used internally to indicate an undefined break result.
 
#define LINEBREAK_INDEX_SIZE   40
 Size of the second-level index to the line breaking properties.
 
#define ENDS_WITH(str, suffix)   ends_with((str), (suffix), sizeof(suffix) - 1)
 

Enumerations

enum  BreakAction {
  DIR_BRK , IND_BRK , CMI_BRK , CMP_BRK ,
  PRH_BRK
}
 Enumeration of break actions.
 

Functions

void init_linebreak (void)
 Does nothing.
 
void lb_init_break_context (struct LineBreakContext *lbpCtx, utf32_t ch, const char *lang)
 Initializes line breaking context for a given language.
 
int lb_process_next_char (struct LineBreakContext *lbpCtx, utf32_t ch)
 Updates the LineBreakContext for the next code point and returns the detected break.
 
enum LineBreakClass lb_get_char_class (const struct LineBreakContext *lbpCtx, utf32_t ch)
 Gets the line breaking class of a character for a line breaking context.
 
size_t set_linebreaks (const void *s, size_t len, const char *lang, enum BreakOutputType outputType, char *brks, get_next_char_t get_next_char)
 Sets the line breaking information for a generic input string.
 
void set_linebreaks_utf8 (const utf8_t *s, size_t len, const char *lang, char *brks)
 Sets the line breaking information for a UTF-8 input string.
 
size_t set_linebreaks_utf8_per_code_point (const utf8_t *s, size_t len, const char *lang, char *brks)
 Sets the line breaking information for a UTF-8 input string, with one output entry per code point.
 
void set_linebreaks_utf16 (const utf16_t *s, size_t len, const char *lang, char *brks)
 Sets the line breaking information for a UTF-16 input string.
 
size_t set_linebreaks_utf16_per_code_point (const utf16_t *s, size_t len, const char *lang, char *brks)
 Sets the line breaking information for a UTF-16 input string, with one output entry per code point.
 
void set_linebreaks_utf32 (const utf32_t *s, size_t len, const char *lang, char *brks)
 Sets the line breaking information for a UTF-32 input string.
 
int is_line_breakable (utf32_t char1, utf32_t char2, const char *lang)
 Tells whether a line break can occur between two Unicode characters.
 

Detailed Description

Implementation of the line breaking algorithm as described in Unicode Standard Annex 14.

Author
Wu Yongwei
Petr Filipsky

Macro Definition Documentation

◆ ENDS_WITH

#define ENDS_WITH (   str,
  suffix 
)    ends_with((str), (suffix), sizeof(suffix) - 1)
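
The macro depends on `sizeof` applied to a string literal: the result counts the terminating NUL, so subtracting one gives the suffix length at compile time. The following is a minimal sketch of that idiom, not the code from linebreak.c. The helper `ends_with_sketch` and the macro `ENDS_WITH_SKETCH` are hypothetical stand-ins, since the real `ends_with` is not shown on this page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the file-local ends_with() helper:
 * returns 1 if str ends with the first suffix_len bytes of suffix,
 * 0 otherwise (including when str is NULL). */
static int ends_with_sketch(const char *str, const char *suffix,
                            size_t suffix_len)
{
    size_t len;

    assert(suffix != NULL);
    if (str == NULL)
        return 0;
    len = strlen(str);
    if (len < suffix_len)
        return 0;
    return memcmp(str + len - suffix_len, suffix, suffix_len) == 0;
}

/* Same shape as ENDS_WITH: sizeof counts the literal's trailing NUL,
 * so sizeof(suffix) - 1 is its length. This only works when suffix is
 * a string literal or a char array, never a char pointer. */
#define ENDS_WITH_SKETCH(str, suffix) \
    ends_with_sketch((str), (suffix), sizeof(suffix) - 1)
```

With this sketch, `ENDS_WITH_SKETCH(lang, "-strict")` tests a language tag for the "-strict" suffix without calling `strlen` on the suffix at run time.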

◆ LINEBREAK_INDEX_SIZE

#define LINEBREAK_INDEX_SIZE   40

Size of the second-level index to the line breaking properties.

◆ LINEBREAK_UNDEFINED

#define LINEBREAK_UNDEFINED   -1

Special value used internally to indicate an undefined break result.

Enumeration Type Documentation

◆ BreakAction

Enumeration of break actions.

They are used in the break action pair table baTable.

Enumerator
DIR_BRK 

Direct break opportunity.

IND_BRK 

Indirect break opportunity.

CMI_BRK 

Indirect break opportunity for combining marks.

CMP_BRK 

Prohibited break for combining marks.

PRH_BRK 

Prohibited break.
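
The difference between direct, indirect, and prohibited actions is easiest to see when an action is resolved into a yes/no answer. The sketch below follows the pair-table convention used in older revisions of UAX #14, where an indirect opportunity only allows a break if a space precedes the break position. It is an illustration under that assumption: the enumerator names are local copies prefixed with `SK_`, the function `break_allowed_sketch` is hypothetical, and the real logic in linebreak.c may treat combining marks differently.

```c
#include <assert.h>

/* Local mirror of the documented enumerators (illustrative only). */
enum BreakActionSketch {
    SK_DIR_BRK, /* direct break opportunity */
    SK_IND_BRK, /* indirect break opportunity */
    SK_CMI_BRK, /* indirect break opportunity for combining marks */
    SK_CMP_BRK, /* prohibited break for combining marks */
    SK_PRH_BRK  /* prohibited break */
};

/* Returns 1 if a break is allowed at the current position, 0 otherwise.
 * prev_was_space tells whether the character just before the break
 * position is a space. */
static int break_allowed_sketch(enum BreakActionSketch action,
                                int prev_was_space)
{
    switch (action) {
    case SK_DIR_BRK:
        return 1;                   /* always an opportunity */
    case SK_IND_BRK:
    case SK_CMI_BRK:
        return prev_was_space != 0; /* only across intervening spaces */
    case SK_CMP_BRK:
    case SK_PRH_BRK:
        return 0;                   /* never, even across spaces */
    }
    assert(!"unknown break action");
    return 0;
}
```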

Function Documentation

◆ init_linebreak()

void init_linebreak ( void  )

Does nothing.

This is kept for binary compatibility.

◆ is_line_breakable()

int is_line_breakable ( utf32_t  char1,
utf32_t  char2,
const char *  lang 
)

Tells whether a line break can occur between two Unicode characters.

This is a wrapper function that exposes a simple interface. It cannot correctly process complicated cases involving combining marks, spaces, etc., so it is generally better to use set_linebreaks_utf32 instead.

Parameters
char1  the first Unicode character
char2  the second Unicode character
lang   language of the input
Returns
one of LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR

◆ lb_get_char_class()

enum LineBreakClass lb_get_char_class ( const struct LineBreakContext *  lbpCtx,
utf32_t  ch 
)

Gets the line breaking class of a character for a line breaking context.

This function will check the language-specific data first, and then the default data if there is no language-specific property available for the character.

Parameters
lbpCtx  pointer to the line breaking context
ch      character to check
Returns
the line breaking class if found; LBP_XX otherwise

◆ lb_init_break_context()

void lb_init_break_context ( struct LineBreakContext *  lbpCtx,
utf32_t  ch,
const char *  lang 
)

Initializes line breaking context for a given language.

Parameters
[in,out]  lbpCtx  pointer to the line breaking context
[in]      ch      the first character to process
[in]      lang    language of the input
Postcondition
the line breaking context is initialized

◆ lb_process_next_char()

int lb_process_next_char ( struct LineBreakContext *  lbpCtx,
utf32_t  ch 
)

Updates the LineBreakContext for the next code point and returns the detected break.

Parameters
[in,out]  lbpCtx  pointer to the line breaking context
[in]      ch      Unicode code point
Returns
break result, one of LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, or LINEBREAK_NOBREAK
Postcondition
the line breaking context is updated

◆ set_linebreaks()

size_t set_linebreaks ( const void *  s,
size_t  len,
const char *  lang,
enum BreakOutputType  outputType,
char *  brks,
get_next_char_t  get_next_char 
)

Sets the line breaking information for a generic input string.

Currently, this implementation has customization for the following ISO 639-1 language codes (for lang):

  • de (German)
  • en (English)
  • es (Spanish)
  • fr (French)
  • ja (Japanese)
  • ko (Korean)
  • ru (Russian)
  • zh (Chinese)

In addition, the suffix "-strict" may be appended to the language code (for example, "ja-strict") to request strict rather than normal line-breaking behaviour. See the Conditional Japanese Starter section of UAX #14 for more details.

Parameters
[in]   s              input string
[in]   len            length of the input
[in]   lang           language of the input
[in]   outputType     output per code unit or per code point
[out]  brks           pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR
[in]   get_next_char  function to get the next UTF-32 character
Returns
The number of entries in brks filled. This is equal to the number of code points or code units in the source string, depending on the outputType parameter.
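
Because the decoding step is passed in as get_next_char, the generic function can in principle be driven by encodings other than UTF-8, UTF-16, and UTF-32. The sketch below shows what a reader for ISO-8859-1 bytes might look like. It assumes that the callback receives the string, its length, and a pointer to the current index, and that it signals the end of input with an all-ones sentinel; both assumptions should be checked against the library headers. The names `utf32_sketch_t`, `EOS_SKETCH`, and `next_char_latin1` are hypothetical and are defined locally only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions: in real code these come from the library headers. */
typedef unsigned int utf32_sketch_t; /* stands in for utf32_t */
#define EOS_SKETCH 0xFFFFFFFFu       /* assumed end-of-string marker */

/* Hypothetical reader for ISO-8859-1 input: every byte maps directly
 * to the code point with the same value. Advances *ip by one unit, or
 * returns the end-of-string marker once *ip reaches len. */
static utf32_sketch_t next_char_latin1(const void *s, size_t len,
                                       size_t *ip)
{
    const unsigned char *bytes = (const unsigned char *)s;

    assert(s != NULL && ip != NULL);
    if (*ip >= len)
        return EOS_SKETCH;
    return bytes[(*ip)++];
}
```

With such a reader, one code unit always equals one code point, so both output types would fill the same number of entries in brks.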

◆ set_linebreaks_utf16()

void set_linebreaks_utf16 ( const utf16_t *  s,
size_t  len,
const char *  lang,
char *  brks 
)

Sets the line breaking information for a UTF-16 input string.

Parameters
[in]   s     input UTF-16 string
[in]   len   length of the input
[in]   lang  language of the input
[out]  brks  pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR
See also
set_linebreaks for a note about lang.

◆ set_linebreaks_utf16_per_code_point()

size_t set_linebreaks_utf16_per_code_point ( const utf16_t *  s,
size_t  len,
const char *  lang,
char *  brks 
)

Sets the line breaking information for a UTF-16 input string, with one output entry per code point.

Parameters
[in]   s     input UTF-16 string
[in]   len   length of the input
[in]   lang  language of the input
[out]  brks  pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, or LINEBREAK_NOBREAK
Returns
The number of entries in brks filled. This is equal to the number of code points in the source string.
See also
set_linebreaks for a note about lang.

◆ set_linebreaks_utf32()

void set_linebreaks_utf32 ( const utf32_t *  s,
size_t  len,
const char *  lang,
char *  brks 
)

Sets the line breaking information for a UTF-32 input string.

Parameters
[in]   s     input UTF-32 string
[in]   len   length of the input
[in]   lang  language of the input
[out]  brks  pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR
See also
set_linebreaks for a note about lang.

◆ set_linebreaks_utf8()

void set_linebreaks_utf8 ( const utf8_t *  s,
size_t  len,
const char *  lang,
char *  brks 
)

Sets the line breaking information for a UTF-8 input string.

Parameters
[in]   s     input UTF-8 string
[in]   len   length of the input
[in]   lang  language of the input
[out]  brks  pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, LINEBREAK_NOBREAK, or LINEBREAK_INSIDEACHAR
See also
set_linebreaks for a note about lang.
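
Once brks has been filled, a caller typically scans it to decide where each output line ends. The sketch below shows one naive way to do that. It assumes that brks[i] describes the position after code unit i, measures line width in code units rather than display columns, and uses locally defined constants whose numeric values are assumed to match linebreak.h (real code should include that header instead). The function `find_line_end` is hypothetical and not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed numeric values; include linebreak.h in real code instead. */
enum {
    SK_MUSTBREAK = 0,
    SK_ALLOWBREAK = 1,
    SK_NOBREAK = 2,
    SK_INSIDEACHAR = 3
};

/* Returns the index of the last code unit of the line that begins at
 * start. The line ends at the first mandatory break, or at the latest
 * allowed break seen once max_units code units have been consumed.
 * If no opportunity exists within max_units, the line overflows to
 * the next opportunity rather than splitting a word. */
static size_t find_line_end(const char *brks, size_t len,
                            size_t start, size_t max_units)
{
    size_t i;
    size_t best = len; /* len means "no opportunity seen yet" */

    assert(brks != NULL && start < len && max_units > 0);
    for (i = start; i < len; i++) {
        if (brks[i] == SK_MUSTBREAK)
            return i;  /* mandatory break after unit i */
        if (brks[i] == SK_ALLOWBREAK)
            best = i;  /* remember the latest opportunity */
        if (i - start + 1 >= max_units && best != len)
            return best;
    }
    return len - 1;    /* no break found: take the rest of the text */
}
```

Calling the function repeatedly with start set to the previous result plus one walks through the text line by line. Entries marked as inside a character are never chosen, so multi-byte sequences stay intact.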

◆ set_linebreaks_utf8_per_code_point()

size_t set_linebreaks_utf8_per_code_point ( const utf8_t *  s,
size_t  len,
const char *  lang,
char *  brks 
)

Sets the line breaking information for a UTF-8 input string, with one output entry per code point.

Parameters
[in]   s     input UTF-8 string
[in]   len   length of the input
[in]   lang  language of the input
[out]  brks  pointer to the output breaking data, containing LINEBREAK_MUSTBREAK, LINEBREAK_ALLOWBREAK, or LINEBREAK_NOBREAK
Returns
The number of entries in brks filled. This is equal to the number of code points in the source string.
See also
set_linebreaks for a note about lang.
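
Since the per-code-point variant fills one entry per code point, the number of entries can be smaller than len whenever the input contains multi-byte sequences. The sketch below counts code points in well-formed UTF-8, which is one way to predict the return value or to size brks exactly. It assumes that len is given in bytes and that the input is valid; how the library counts malformed sequences is not covered here. The function `utf8_count_code_points` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in well-formed UTF-8 by skipping continuation
 * bytes, which always have the bit pattern 10xxxxxx. Malformed input
 * is not detected. */
static size_t utf8_count_code_points(const unsigned char *s, size_t len)
{
    size_t i;
    size_t count = 0;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++; /* lead byte or ASCII byte starts a code point */
    }
    return count;
}
```

A code point never occupies less than one code unit, so a brks buffer with len entries is always large enough when an exact count is not needed.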