Tcl_GetEncoding, Tcl_FreeEncoding, Tcl_GetEncodingFromObj,
Tcl_ExternalToUtfDString, Tcl_ExternalToUtf, Tcl_UtfToExternalDString,
Tcl_UtfToExternal, Tcl_WinTCharToUtf, Tcl_WinUtfToTChar, Tcl_GetEncodingName,
Tcl_SetSystemEncoding, Tcl_GetEncodingNameFromEnvironment,
Tcl_GetEncodingNames, Tcl_CreateEncoding, Tcl_GetEncodingSearchPath,
Tcl_SetEncodingSearchPath, Tcl_GetDefaultEncodingDir,
Tcl_SetDefaultEncodingDir - procedures for creating and using encodings
#include <tcl.h>
Tcl_Encoding
Tcl_GetEncoding(interp, name)
void
Tcl_FreeEncoding(encoding)
int
Tcl_GetEncodingFromObj(interp, objPtr, encodingPtr)
char *
Tcl_ExternalToUtfDString(encoding, src, srcLen, dstPtr)
char *
Tcl_UtfToExternalDString(encoding, src, srcLen, dstPtr)
int
Tcl_ExternalToUtf(interp, encoding, src, srcLen, flags, statePtr,
dst, dstLen, srcReadPtr, dstWrotePtr, dstCharsPtr)
int
Tcl_UtfToExternal(interp, encoding, src, srcLen, flags, statePtr,
dst, dstLen, srcReadPtr, dstWrotePtr, dstCharsPtr)
char *
Tcl_WinTCharToUtf(tsrc, srcLen, dstPtr)
TCHAR *
Tcl_WinUtfToTChar(src, srcLen, dstPtr)
const char *
Tcl_GetEncodingName(encoding)
int
Tcl_SetSystemEncoding(interp, name)
const char *
Tcl_GetEncodingNameFromEnvironment(bufPtr)
void
Tcl_GetEncodingNames(interp)
Tcl_Encoding
Tcl_CreateEncoding(typePtr)
Tcl_Obj *
Tcl_GetEncodingSearchPath()
int
Tcl_SetEncodingSearchPath(searchPath)
const char *
Tcl_GetDefaultEncodingDir(void)
void
Tcl_SetDefaultEncodingDir(path)
- Tcl_Interp *interp (in)
- Interpreter to use for error reporting, or NULL if no error
reporting is desired.
- const char *name (in)
- Name of encoding to load.
- Tcl_Encoding encoding (in)
- The encoding to query, free, or use for converting text. If
encoding is NULL, the current system encoding is used.
- Tcl_Obj *objPtr (in)
- Name of encoding to get token for.
- Tcl_Encoding *encodingPtr (out)
- Points to storage where encoding token is to be
written.
- const char *src (in)
- For the Tcl_ExternalToUtf functions, an array of
bytes in the specified encoding that are to be converted to UTF-8. For the
Tcl_UtfToExternal and Tcl_WinUtfToTChar functions, an array
of UTF-8 characters to be converted to the specified encoding.
- const TCHAR *tsrc (in)
- An array of Windows TCHAR characters to convert to
UTF-8.
- int srcLen (in)
- Length of src or tsrc in bytes. If the length
is negative, the encoding-specific length of the string is used.
- Tcl_DString *dstPtr (out)
- Pointer to an uninitialized or free Tcl_DString in
which the converted result will be stored.
- int flags (in)
- Various flag bits OR-ed together. TCL_ENCODING_START
signifies that the source buffer is the first block in a (potentially
multi-block) input stream, telling the conversion routine to reset to an
initial state and perform any initialization that needs to occur before
the first byte is converted. TCL_ENCODING_END signifies that the
source buffer is the last block in a (potentially multi-block) input
stream, telling the conversion routine to perform any finalization that
needs to occur after the last byte is converted and then to reset to an
initial state. TCL_ENCODING_STOPONERROR signifies that the
conversion routine should return immediately upon reading a source
character that does not exist in the target encoding; otherwise a default
fallback character will automatically be substituted.
- Tcl_EncodingState *statePtr (in/out)
- Used when converting a (generally long or indefinite
length) byte stream in a piece-by-piece fashion. The conversion routine
stores its current state in *statePtr after src (the buffer
containing the current piece) has been converted; that state information
must be passed back when converting the next piece of the stream so the
conversion routine knows what state it was in when it left off at the end
of the last piece. May be NULL, in which case the value specified for
flags is ignored and the source buffer is assumed to contain the
complete string to convert.
- char *dst (out)
- Buffer in which the converted result will be stored. No
more than dstLen bytes will be stored in dst.
- int dstLen (in)
- The maximum length of the output buffer dst in
bytes.
- int *srcReadPtr (out)
- Filled with the number of bytes from src that were
actually converted. This may be less than the original source length if
there was a problem converting some source characters. May be NULL.
- int *dstWrotePtr (out)
- Filled with the number of bytes that were actually stored
in the output buffer as a result of the conversion. May be NULL.
- int *dstCharsPtr (out)
- Filled with the number of characters that correspond to the
number of bytes stored in the output buffer. May be NULL.
- Tcl_DString *bufPtr (out)
- Storage for the prescribed system encoding name.
- const Tcl_EncodingType *typePtr (in)
- Structure that defines a new type of encoding.
- Tcl_Obj *searchPath (in)
- List of filesystem directories in which to search for
encoding data files.
- const char *path (in)
- A path to the location of the encoding file.
These routines convert between Tcl's internal character representation, UTF-8,
and character representations used by various operating systems or file
systems, such as Unicode, ASCII, or Shift-JIS. When operating on strings, such
as when obtaining the names of files or displaying characters using
international fonts, the strings must be translated into one or possibly
multiple formats that the various system calls can use. For instance, on a
Japanese Unix workstation, a user might obtain a filename represented in the
EUC-JP file encoding and then translate the characters to the jisx0208 font
encoding in order to display the filename in a Tk widget. The purpose of the
encoding package is to help bridge the translation gap. UTF-8 provides an
intermediate staging ground for all the various encodings. In the example
above, text would be translated into UTF-8 from whatever file encoding the
operating system is using. Then it would be translated from UTF-8 into
whatever font encoding the display routines require.
Some basic encodings are compiled into Tcl. Others can be defined by the user or
dynamically loaded from encoding files in a platform-independent manner.
Tcl_GetEncoding finds an encoding given its
name. The name may
refer to a built-in Tcl encoding, a user-defined encoding registered by
calling
Tcl_CreateEncoding, or a dynamically-loadable encoding file.
The return value is a token that represents the encoding and can be used in
subsequent calls to procedures such as
Tcl_GetEncodingName,
Tcl_FreeEncoding, and
Tcl_UtfToExternal. If the name did not
refer to any known or loadable encoding, NULL is returned and an error message
is left in
interp.
The encoding package maintains a database of all encodings currently in use. The
first time
name is seen,
Tcl_GetEncoding returns an encoding
with a reference count of 1. If the same
name is requested further
times, then the reference count for that encoding is incremented without the
overhead of allocating a new encoding and all its associated data structures.
When an
encoding is no longer needed,
Tcl_FreeEncoding should be
called to release it. When an
encoding is no longer in use anywhere
(i.e., it has been freed as many times as it has been obtained),
Tcl_FreeEncoding will release all storage the encoding was using and
delete it from the database.
Tcl_GetEncodingFromObj treats the string representation of
objPtr
as an encoding name, and finds an encoding with that name, just as
Tcl_GetEncoding does. When an encoding is found, it is cached within
the
objPtr value for future reference, the
Tcl_Encoding token is
written to the storage pointed to by
encodingPtr, and the value
TCL_OK is returned. If no such encoding is found, the value
TCL_ERROR is returned, and no writing to
*encodingPtr
takes place. Just as with
Tcl_GetEncoding, the caller should call
Tcl_FreeEncoding on the resulting encoding token when that token will
no longer be used.
Tcl_ExternalToUtfDString converts a source buffer
src from the
specified
encoding into UTF-8. The converted bytes are stored in
dstPtr, which is then null-terminated. The caller should eventually
call
Tcl_DStringFree to free any information stored in
dstPtr.
When converting, if any of the characters in the source buffer cannot be
represented in the target encoding, a default fallback character will be used.
The return value is a pointer to the value stored in the DString.
Tcl_ExternalToUtf converts a source buffer
src from the specified
encoding into UTF-8. Up to
srcLen bytes are converted from the
source buffer and up to
dstLen converted bytes are stored in
dst. In all cases,
*srcReadPtr is filled with the number of
bytes that were successfully converted from
src and
*dstWrotePtr
is filled with the corresponding number of bytes that were stored in
dst. The return value is one of the following:
- TCL_OK
- All bytes of src were converted.
- TCL_CONVERT_NOSPACE
- The destination buffer was not large enough for all of the
converted data; only as many characters as could fit were converted.
- TCL_CONVERT_MULTIBYTE
- The last few bytes in the source buffer were the beginning
of a multibyte sequence, but more bytes were needed to complete this
sequence. A subsequent call to the conversion routine should pass a buffer
containing the unconverted bytes that remained in src plus some
further bytes from the source stream to properly convert the formerly
split-up multibyte sequence.
- TCL_CONVERT_SYNTAX
- The source buffer contained an invalid character sequence.
This may occur if the input stream has been damaged or if the input
encoding method was misidentified.
- TCL_CONVERT_UNKNOWN
- The source buffer contained a character that could not be
represented in the target encoding and TCL_ENCODING_STOPONERROR was
specified.
Tcl_UtfToExternalDString converts a source buffer
src from UTF-8
into the specified
encoding. The converted bytes are stored in
dstPtr, which is then terminated with the appropriate encoding-specific
null. The caller should eventually call
Tcl_DStringFree to free any
information stored in
dstPtr. When converting, if any of the characters
in the source buffer cannot be represented in the target encoding, a default
fallback character will be used. The return value is a pointer to the value
stored in the DString.
Tcl_UtfToExternal converts a source buffer
src from UTF-8 into the
specified
encoding. Up to
srcLen bytes are converted from the
source buffer and up to
dstLen converted bytes are stored in
dst. In all cases,
*srcReadPtr is filled with the number of
bytes that were successfully converted from
src and
*dstWrotePtr
is filled with the corresponding number of bytes that were stored in
dst. The return values are the same as the return values for
Tcl_ExternalToUtf.
Tcl_WinUtfToTChar and
Tcl_WinTCharToUtf are Windows-only
convenience functions for converting between UTF-8 and Windows strings based
on the TCHAR type, which is by convention a Unicode character on Windows NT.
Tcl_GetEncodingName is roughly the inverse of
Tcl_GetEncoding.
Given an
encoding, the return value is the
name argument that
was used to create the encoding. The string returned by
Tcl_GetEncodingName is only guaranteed to persist until the
encoding is deleted. The caller must not modify this string.
Tcl_SetSystemEncoding sets the default encoding that should be used
whenever the user passes a NULL value for the
encoding argument to any
of the other encoding functions. If
name is NULL, the system encoding
is reset to the default system encoding,
binary. If the name did not
refer to any known or loadable encoding,
TCL_ERROR is returned and an
error message is left in
interp. Otherwise, this procedure increments
the reference count of the new system encoding, decrements the reference count
of the old system encoding, and returns
TCL_OK.
Tcl_GetEncodingNameFromEnvironment provides a means for the Tcl library
to report the encoding name it believes to be the correct one to use as the
system encoding, based on system calls and examination of the environment
suitable for the platform. It accepts
bufPtr, a pointer to an
uninitialized or freed
Tcl_DString and writes the encoding name to it.
The return value is the
Tcl_DStringValue of bufPtr.
Tcl_GetEncodingNames sets the
interp result to a list consisting
of the names of all the encodings that are currently defined or can be
dynamically loaded, searching the encoding path specified by
Tcl_SetDefaultEncodingDir. This procedure does not ensure that the
dynamically-loadable encoding files contain valid data, but merely that they
exist.
Tcl_CreateEncoding defines a new encoding and registers the C procedures
that are called back to convert between the encoding and UTF-8. Encodings
created by
Tcl_CreateEncoding are thereafter visible in the database
used by
Tcl_GetEncoding. Just as with the
Tcl_GetEncoding
procedure, the return value is a token that represents the encoding and can be
used in subsequent calls to other encoding functions.
Tcl_CreateEncoding returns an encoding with a reference count of 1. If
an encoding with the same
encodingName already exists, then its entry in
the database is replaced with the new encoding; the token for the old encoding
will remain valid and continue to behave as before, but users of the new token
will now call the new encoding procedures.
The
typePtr argument to
Tcl_CreateEncoding contains information
about the name of the encoding and the procedures that will be called to
convert between this encoding and UTF-8. It is defined as follows:
typedef struct Tcl_EncodingType {
const char * encodingName;
Tcl_EncodingConvertProc * toUtfProc;
Tcl_EncodingConvertProc * fromUtfProc;
Tcl_EncodingFreeProc * freeProc;
ClientData clientData;
int nullSize;
} Tcl_EncodingType;
The
encodingName provides a string name for the encoding, by which it can
be referred to in other procedures such as
Tcl_GetEncoding. The
toUtfProc refers to a callback procedure to invoke to convert text from
this encoding into UTF-8. The
fromUtfProc refers to a callback
procedure to invoke to convert text from UTF-8 into this encoding. The
freeProc refers to a callback procedure to invoke when this encoding is
deleted. The
freeProc field may be NULL. The
clientData contains
an arbitrary one-word value passed to
toUtfProc,
fromUtfProc,
and
freeProc whenever they are called. Typically, this is a pointer to
a data structure containing encoding-specific information that can be used by
the callback procedures. For instance, two very similar encodings such as
ascii and
macRoman may use the same callback procedure, but use
different values of
clientData to control its behavior. The
nullSize specifies the number of zero bytes that signify end-of-string
in this encoding. It must be
1 (for single-byte or multi-byte encodings
like ASCII or Shift-JIS) or
2 (for double-byte encodings like Unicode).
Constant-sized encodings with 3 or more bytes per character (such as CNS11643)
are not accepted.
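To see why nullSize matters, consider how the end of a string is found when a negative srcLen asks for the encoding-specific length. The following plain C sketch is not Tcl's internal code; the function name EncodedLength is hypothetical. It shows that a double-byte string ends only at two consecutive zero bytes on a 2-byte boundary, so a single zero byte inside a character does not terminate it.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns the length in bytes of a terminated string whose terminator
 * is nullSize zero bytes (1 or 2), not counting the terminator itself.
 */
static size_t
EncodedLength(const char *src, int nullSize)
{
    size_t n = 0;

    if (nullSize == 1) {
        return strlen(src);
    }
    /* nullSize == 2: step in 2-byte units until both bytes are zero. */
    while (src[n] != '\0' || src[n + 1] != '\0') {
        n += 2;
    }
    return n;
}
```

With nullSize set to 1, the 2-byte text 'a', 0, 'b', 0 would appear to end after the first byte; with nullSize set to 2 its length is 4.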
The callback procedures
toUtfProc and
fromUtfProc should match the
type
Tcl_EncodingConvertProc:
typedef int Tcl_EncodingConvertProc(
ClientData clientData,
const char * src,
int srcLen,
int flags,
Tcl_EncodingState * statePtr,
char * dst,
int dstLen,
int * srcReadPtr,
int * dstWrotePtr,
int * dstCharsPtr);
The
toUtfProc and
fromUtfProc procedures are called by the
Tcl_ExternalToUtf or
Tcl_UtfToExternal family of functions to
perform the actual conversion. The
clientData parameter to these
procedures is the same as the
clientData field specified to
Tcl_CreateEncoding when the encoding was created. The remaining
arguments to the callback procedures are the same as the arguments, documented
at the top, to
Tcl_ExternalToUtf or
Tcl_UtfToExternal, with the
following exceptions. If the
srcLen argument to one of those high-level
functions is negative, the value passed to the callback procedure will be the
appropriate encoding-specific string length of
src. If any of the
srcReadPtr,
dstWrotePtr, or
dstCharsPtr arguments to one
of the high-level functions is NULL, the corresponding value passed to the
callback procedure will be a non-NULL location.
The callback procedure
freeProc, if non-NULL, should match the type
Tcl_EncodingFreeProc:
typedef void Tcl_EncodingFreeProc(
ClientData clientData);
This
freeProc function is called when the encoding is deleted. The
clientData parameter is the same as the
clientData field
specified to
Tcl_CreateEncoding when the encoding was created.
Tcl_GetEncodingSearchPath and
Tcl_SetEncodingSearchPath are called
to access and set the list of filesystem directories searched for encoding
data files.
The value returned by
Tcl_GetEncodingSearchPath is the value stored by
the last successful call to
Tcl_SetEncodingSearchPath. If no calls to
Tcl_SetEncodingSearchPath have occurred, Tcl will compute an initial
value based on the environment. There is one encoding search path for the
entire process, shared by all threads in the process.
Tcl_SetEncodingSearchPath stores
searchPath and returns
TCL_OK, unless
searchPath is not a valid Tcl list, which causes
TCL_ERROR to be returned. The elements of
searchPath are not
verified as existing readable filesystem directories. When searching for
encoding data files takes place, any non-existent or non-readable filesystem
directories on the
searchPath are silently ignored.
Tcl_GetDefaultEncodingDir and
Tcl_SetDefaultEncodingDir are
obsolete interfaces best replaced with calls to
Tcl_GetEncodingSearchPath and
Tcl_SetEncodingSearchPath. They
are called to access and set the first element of the
searchPath list.
Since Tcl searches
searchPath for encoding data files in list order,
these routines establish the “default” directory in which to
find encoding data files.
Space would prohibit precompiling into Tcl every possible encoding algorithm, so
many encodings are stored on disk as dynamically-loadable encoding files. This
behavior also allows the user to create additional encoding files that can be
loaded using the same mechanism. These encoding files contain information
about the tables and/or escape sequences used to map between an external
encoding and Unicode. The external encoding may consist of single-byte,
multi-byte, or double-byte characters.
Each dynamically-loadable encoding is represented as a text file. The initial
line of the file, beginning with a “#” symbol, is a comment that
provides a human-readable description of the file. The next line identifies
the type of encoding file. It can be one of the following letters:
- [1] S
- A single-byte encoding, where one character is always one
byte long in the encoding. An example is iso8859-1, used by many
European languages.
- [2] D
- A double-byte encoding, where one character is always two
bytes long in the encoding. An example is big5, used for Chinese
text.
- [3] M
- A multi-byte encoding, where one character may be either
one or two bytes long. Certain bytes are lead bytes, indicating that
another byte must follow and that together the two bytes represent one
character. Other bytes are not lead bytes and represent themselves. An
example is shiftjis, used by many Japanese computers.
- [4] E
- An escape-sequence encoding, specifying that certain
sequences of bytes do not represent characters, but commands that describe
how following bytes should be interpreted.
The rest of the lines in the file depend on the type.
Cases [1], [2], and [3] are collectively referred to as table-based encoding
files. The lines in a table-based encoding file are in the same format as this
example taken from the
shiftjis encoding (this is not the complete
file):
# Encoding file: shiftjis, multi-byte
M
003F 0 40
00
0000000100020003000400050006000700080009000A000B000C000D000E000F
0010001100120013001400150016001700180019001A001B001C001D001E001F
0020002100220023002400250026002700280029002A002B002C002D002E002F
0030003100320033003400350036003700380039003A003B003C003D003E003F
0040004100420043004400450046004700480049004A004B004C004D004E004F
0050005100520053005400550056005700580059005A005B005C005D005E005F
0060006100620063006400650066006700680069006A006B006C006D006E006F
0070007100720073007400750076007700780079007A007B007C007D203E007F
0080000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000FF61FF62FF63FF64FF65FF66FF67FF68FF69FF6AFF6BFF6CFF6DFF6EFF6F
FF70FF71FF72FF73FF74FF75FF76FF77FF78FF79FF7AFF7BFF7CFF7DFF7EFF7F
FF80FF81FF82FF83FF84FF85FF86FF87FF88FF89FF8AFF8BFF8CFF8DFF8EFF8F
FF90FF91FF92FF93FF94FF95FF96FF97FF98FF99FF9AFF9BFF9CFF9DFF9EFF9F
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
81
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
300030013002FF0CFF0E30FBFF1AFF1BFF1FFF01309B309C00B4FF4000A8FF3E
FFE3FF3F30FD30FE309D309E30034EDD30053006300730FC20152010FF0F005C
301C2016FF5C2026202520182019201C201DFF08FF0930143015FF3BFF3DFF5B
FF5D30083009300A300B300C300D300E300F30103011FF0B221200B100D70000
00F7FF1D2260FF1CFF1E22662267221E22342642264000B0203220332103FFE5
FF0400A200A3FF05FF03FF06FF0AFF2000A72606260525CB25CF25CE25C725C6
25A125A025B325B225BD25BC203B301221922190219121933013000000000000
000000000000000000000000000000002208220B2286228722822283222A2229
000000000000000000000000000000002227222800AC21D221D4220022030000
0000000000000000000000000000000000000000222022A52312220222072261
2252226A226B221A223D221D2235222B222C0000000000000000000000000000
212B2030266F266D266A2020202100B6000000000000000025EF000000000000
The third line of the file contains three numbers. The first number is the fallback
character (in base 16) to use when converting from UTF-8 to this encoding. The
second number is a
1 if this file represents the encoding for a symbol
font, or
0 otherwise. The last number (in base 10) is how many pages of
data follow.
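As an illustration of this header line, the following plain C sketch parses the three numbers with the bases described above. It is not Tcl's own loader; the TableHeader type and the ParseTableHeader function are hypothetical names.

```c
#include <stdio.h>

typedef struct {
    unsigned int fallback;  /* fallback character, read as base 16 */
    int isSymbol;           /* 1 for a symbol font encoding, else 0 */
    int numPages;           /* number of pages that follow, base 10 */
} TableHeader;              /* hypothetical type, for illustration only */

/* Parses a line such as "003F 0 40". Returns 0 on success, -1 on error. */
static int
ParseTableHeader(const char *line, TableHeader *hdr)
{
    if (sscanf(line, "%x %d %d", &hdr->fallback, &hdr->isSymbol,
            &hdr->numPages) != 3) {
        return -1;
    }
    if ((hdr->isSymbol != 0 && hdr->isSymbol != 1) || hdr->numPages < 0) {
        return -1;
    }
    return 0;
}
```

For the shiftjis example, the fallback 003F is the question mark character, the file is not a symbol font encoding, and 40 pages follow.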
Subsequent lines in the example above are pages that describe how to map from
the encoding into 2-byte Unicode. The first line in a page identifies the page
number. Following it are 256 double-byte numbers, arranged as 16 rows of 16
numbers. Given a character in the encoding, the high byte of that character is
used to select which page, and the low byte of that character is used as an
index to select one of the double-byte numbers in that page - the value
obtained being the corresponding Unicode character. By examination of the
example above, one can see that the characters 0x7E and 0x8163 in
shiftjis map to 203E and 2026 in Unicode, respectively.
Following the first page will be all the other pages, each in the same format as
the first: one number identifying the page followed by 256 double-byte Unicode
characters. If a character in the encoding maps to the Unicode character 0000,
it means that the character does not actually exist. If all characters on a
page would map to 0000, that page can be omitted.
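The page lookup described above can be sketched in plain C as follows. This is an illustration of the file format, not Tcl's internal table code; the names HexValue, DecodeRow, and MapToUnicode are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int
HexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Decodes one row of a page (64 hex digits, i.e. 16 double-byte numbers)
 * into out[0..15]. Returns 0 on success, -1 on a malformed row.
 */
static int
DecodeRow(const char *row, uint16_t *out)
{
    int i, j;

    if (strlen(row) != 64) return -1;
    for (i = 0; i < 16; i++) {
        unsigned int value = 0;
        for (j = 0; j < 4; j++) {
            int digit = HexValue(row[4 * i + j]);
            if (digit < 0) return -1;
            value = (value << 4) | (unsigned int) digit;
        }
        out[i] = (uint16_t) value;
    }
    return 0;
}

/*
 * Maps a character in the external encoding to Unicode. pages[n] points
 * at the 256 entries of page n, or is NULL if that page was omitted.
 * A result of 0 means the character does not exist.
 */
static uint16_t
MapToUnicode(const uint16_t *const *pages, unsigned int ch)
{
    const uint16_t *page = pages[(ch >> 8) & 0xFF];  /* high byte: page */
    return (page == NULL) ? 0 : page[ch & 0xFF];     /* low byte: index */
}
```

A loader would call DecodeRow sixteen times per page, storing row r at offset 16*r in that page's 256-entry array, and leave pages[n] as NULL for every page the file omits.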
Case [4] is the escape-sequence encoding file. The lines in this type of file
are in the same format as this example taken from the
iso2022-jp
encoding:
# Encoding file: iso2022-jp, escape-driven
E
init {}
final {}
iso8859-1 \x1b(B
jis0201 \x1b(J
jis0208 \x1b$@
jis0208 \x1b$B
jis0212 \x1b$(D
gb2312 \x1b$A
ksc5601 \x1b$(C
In the file, the first column represents an option and the second column is the
associated value.
init is a string to emit or expect before the first
character is converted, while
final is a string to emit or expect after
the last character. All other options are names of table-based encodings; the
associated value is the escape-sequence that marks that encoding. Tcl syntax
is used for the values; in the above example, for instance, “{}” represents
the empty string and “\x1b” represents character 27.
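To make the role of the escape sequences concrete, the following plain C sketch holds the iso2022-jp entries from the example in a table and checks whether a buffer starts with one of them. It is an illustration only, not Tcl's escape-driven converter; the EscapeEntry type and MatchEscape function are hypothetical names. The octal escape \033 is character 27.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;       /* table-based encoding to switch to */
    const char *sequence;   /* escape sequence that selects it */
} EscapeEntry;              /* hypothetical type, for illustration only */

/* The entries of the iso2022-jp example, written as C string literals. */
static const EscapeEntry escapes[] = {
    { "iso8859-1", "\033(B"  },
    { "jis0201",   "\033(J"  },
    { "jis0208",   "\033$@"  },
    { "jis0208",   "\033$B"  },
    { "jis0212",   "\033$(D" },
    { "gb2312",    "\033$A"  },
    { "ksc5601",   "\033$(C" },
};

/*
 * If buf starts with a known escape sequence, stores its length in
 * *seqLen and returns the name of the encoding it selects; otherwise
 * returns NULL and leaves *seqLen untouched.
 */
static const char *
MatchEscape(const char *buf, size_t len, size_t *seqLen)
{
    size_t i;

    for (i = 0; i < sizeof(escapes) / sizeof(escapes[0]); i++) {
        size_t n = strlen(escapes[i].sequence);
        if (len >= n && memcmp(buf, escapes[i].sequence, n) == 0) {
            *seqLen = n;
            return escapes[i].name;
        }
    }
    return NULL;
}
```

A decoder would skip the matched sequence and interpret the bytes that follow using the named table-based encoding until the next escape sequence appears.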
When
Tcl_GetEncoding encounters an encoding
name that has not been
loaded, it attempts to load an encoding file called
name.enc
from the
encoding subdirectory of each directory that Tcl searches for
its script library. If the encoding file exists, but is malformed, an error
message will be left in
interp.
utf, encoding, convert