I am having an inordinate amount of difficulty locating an example of how to display Unicode using Lazarus and FPC. It's supposed to be fully UTF-8 capable now, yet all I get are question marks for text strings, or else little skinny boxes in place of the characters.

Can someone provide a very simple example of how to make this work?

I'm using:
Lazarus
Version #: 0.9.26.2 beta
Date: 2009-03-13
FPC Version 2.2.2
SVN Revision: 18980
i386-win32-win32/win64

I'm running this test on Windows XP Media Center Edition.

Is there an environment variable that must be set (LANG?) to make this work? If so, what must it be set to?

None of the examples I have found online has worked; each one either failed or left out some key piece of information required to make it work. In fact, I have not found any complete working example, just little snippets of code that don't seem relevant to my situation.

I've tried various fonts including Lucida Sans Unicode and Arial Unicode MS, and just about every other one on my system. Tahoma was mentioned in one example but it didn't work any better.

I've tried using a Label control, an Edit control, and a SynEdit control. I use TRichEdit in the Delphi version of my program, and SynEdit is the closest equivalent control in Lazarus.

I notice that Lazarus/FPC provides the conversion routines AnsiToUTF8, UTF8ToAnsi, SysToUTF8, and UTF8ToSys.

But I am puzzled by the fact there are no routines provided to convert between the WideString form of a Unicode string and the UTF8 representation. That's very odd to me. It would seem that such routines would be provided for convenience.

Is there some piece of the puzzle that I am missing?

Can someone give me a clue or a code snippet to demonstrate EXACTLY how to display something like Cyrillic or Greek? I'd be very happy to know the secret of how it's supposed to work.

Thanks in advance for any help you can provide.

I think I have found the answers to my own questions, but I'm still not sure if the LANG variable had anything to do with the solution.

The SynEdit1 control seems to want to use the Courier font by default, so I set the Font.Name explicitly inside the program. That seems to help a lot.

I did some more reading and figured out that I needed to convert my 16-bit strings to UTF-8 using my own method, since it didn't seem to work otherwise. I'm also still not sure why LoadFromFile now works fine for loading a UTF-8 file into SynEdit. It did not work until I set the Font.Name to 'Arial Unicode MS' within the program.

unit UnicodeExample;
 
{$mode objfpc}{$H+}
 
interface
 
uses
  Classes, SysUtils, FileUtil, LResources, Forms, Controls, Graphics, Dialogs,
  SynEdit, StdCtrls;
 
const
  CR               = 13;
  LF               = 10;
 
type
 
  { TForm1 }
 
  TForm1 = class(TForm)
    ClearButton: TButton;
    Edit1: TEdit;
    Label1: TLabel;
    Label2: TLabel;
    Label3: TLabel;
    SynEdit1: TSynEdit;
    procedure ClearButtonClicked(Sender: TObject);
    procedure EditingDone(Sender: TObject);
    procedure FormCreate(Sender: TObject);
  private
    { private declarations }
  public
    { public declarations }
  end; 
 
var
  Form1: TForm1;
 
implementation
 
{ TForm1 }
 
procedure TForm1.ClearButtonClicked(Sender: TObject);
begin
  SynEdit1.Lines.Clear;
end;

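// This routine is adapted from someone else's code.  I can dig up a
// reference if anyone wants to know where the original came from.
// It converts each 16-bit code unit on its own, so surrogate pairs
// (characters above U+FFFF) are not combined into 4-byte sequences.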
function UTF16ToUTF8(const InpStr : WideString) : AnsiString;
var
   Len,I,N     : Integer;
   TempAnsiStr : AnsiString;
   U           : Word;
begin
  N := Length(InpStr);
  SetLength(TempAnsiStr, N * 3);   // Worst case
  Len := 0;
  for I := 1 to N do begin
    U  := Ord(InpStr[I]);
    case U of
      $0000..$007F :
        begin
          Inc(Len);
          TempAnsiStr[Len] := Chr(U);
        end;
      $0080..$07FF :
        begin
          Inc(Len);
          TempAnsiStr[Len] := Chr($C0 or (U shr 6));
          Inc(Len);
          TempAnsiStr[Len] := Chr($80 or (U and $3F));
        end;
      $0800..$FFFF :
        begin
           Inc(Len);
           TempAnsiStr[Len] := Chr($E0 or (U shr 12));
           Inc(Len);
           TempAnsiStr[Len] := Chr($80 or ((U shr 6) and $3F));
           Inc(Len);
           TempAnsiStr[Len] := Chr($80 or (U and $3F));
        end;
    end;
  end;
 
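  // Trim to the bytes actually written, plus room for a trailing CR/LF.
  // Note that the line break is appended to every result, including
  // the strings assigned to the Label captions in FormCreate.
  SetLength(TempAnsiStr, Len+2);
  Inc(Len);
  TempAnsiStr[Len] := Chr(CR);
  Inc(Len);
  TempAnsiStr[Len] := Chr(LF);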
 
  Result := TempAnsiStr;
end;
 
procedure TForm1.FormCreate(Sender: TObject);
var
  RussianLine,
  GreekLine,
  HebrewLine,
  KoreanLine  : WideString;
  I           : Integer;
begin
   SynEdit1.Font.Name := 'Arial Unicode MS';
   Label1.Font.Name   := 'Arial Unicode MS';
   Label2.Font.Name   := 'Arial Unicode MS';
   Label3.Font.Name   := 'Arial Unicode MS';
   // Load a UTF-8 File
   SynEdit1.Lines.LoadFromFile('c:\ixxx\text\1000CommonRussianWords.rus.txt');
   RussianLine := '';
   for I := $0410 to $042F do begin
     RussianLine := RussianLine + WideChar(I);
   end;
   GreekLine   := '';
   for I := $0391 to $03C9 do begin
     GreekLine := GreekLine + WideChar(I);
   end;
   HebrewLine   := '';
   for I := $05D0 to $05F4 do begin
     HebrewLine := HebrewLine + WideChar(I);
   end;
   KoreanLine   := '';
   for I := $1100 to $1117 do begin
     KoreanLine := KoreanLine + WideChar(I);
   end;
   SynEdit1.Lines.Add(UTF16ToUTF8(RussianLine));
   SynEdit1.Lines.Add(UTF16ToUTF8(GreekLine));
   SynEdit1.Lines.Add(UTF16ToUTF8(HebrewLine));
   SynEdit1.Lines.Add(UTF16ToUTF8(KoreanLine));
//   SynEdit1.Text := OutLine;
//   Label1.Caption := 'This is a test  äëïöü àèìòù áéíóú çÇ ñÑ ð ';  // SysToUTF8(OutLine);
   Label1.Caption := UTF16ToUTF8(RussianLine);
   Label2.Caption := UTF16ToUTF8(GreekLine);
   Label3.Caption := UTF16ToUTF8(KoreanLine);
end;

initialization
  {$I unicodeexample.lrs}  // Assumed name of the form resource file

end.
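
As a cross-check on the hand-written UTF16ToUTF8 routine above, here is a minimal sketch that uses UTF8Encode and UTF8Decode from the FPC System unit instead. The program name and variable names are my own, and I have only reasoned through it rather than run it on the 0.9.26.2 / FPC 2.2.2 setup described in this thread, so treat it as a starting point and not as the official way to do it.

```pascal
program Utf8EncodeSketch;

{$mode objfpc}{$H+}

var
  Greek   : WideString;
  Encoded : UTF8String;   // UTF8String avoids an implicit code page conversion
  I       : Integer;
begin
  // Build a short UTF-16 string: Greek capital Alpha, Beta, Gamma
  Greek := '';
  for I := $0391 to $0393 do
    Greek := Greek + WideChar(I);

  // UTF8Encode is declared in the System unit, so no uses clause is needed
  Encoded := UTF8Encode(Greek);

  WriteLn('UTF-16 code units: ', Length(Greek));
  WriteLn('UTF-8 bytes:       ', Length(Encoded));

  // UTF8Decode goes the other way, back to a 16-bit string
  if UTF8Decode(Encoded) = Greek then
    WriteLn('Round trip OK');
end.
```

If this behaves as expected, a line such as Label1.Caption := UTF8Encode(RussianLine); should be able to stand in for the UTF16ToUTF8 call in FormCreate, without the trailing CR/LF.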

I found a reference for my conversion routine, or one very similar to it. I'm not sure who the original author was, but I presume it is Stefan Heymann, credited below.
I've seen other versions of the same code floating around in cyberspace, and I did not want to take credit for someone else's work.

I still don't understand why SynEdit would not provide conversions for WideString to AnsiString UTF8 format. That almost seems like a no-brainer. Perhaps I'm still missing something. I often tend to leap before I look, despite the old adage.

Taken from Stefan Heymann's XML Parser at http://www.destructor.de/

FUNCTION  AnsiToUtf8 (Source : ANSISTRING) : STRING;
          (* Converts the given Windows ANSI (Win1252) String to UTF-8. *)
VAR
  I   : INTEGER;  // Loop counter
  U   : WORD;     // Current Unicode value
  Len : INTEGER;  // Current real length of "Result" string
BEGIN
  SetLength (Result, Length (Source) * 3);   // Worst case
  Len := 0;
  FOR I := 1 TO Length (Source) DO BEGIN
    U := WIN1252_UNICODE [ORD (Source [I])];
    CASE U OF
      $0000..$007F : BEGIN
                       INC (Len);
                       Result [Len] := CHR (U);
                     END;
      $0080..$07FF : BEGIN
                       INC (Len);
                       Result [Len] := CHR ($C0 OR (U SHR 6));
                       INC (Len);
                       Result [Len] := CHR ($80 OR (U AND $3F));
                     END;
      $0800..$FFFF : BEGIN
                       INC (Len);
                       Result [Len] := CHR ($E0 OR (U SHR 12));
                       INC (Len);
                       Result [Len] := CHR ($80 OR ((U SHR 6) AND $3F));
                       INC (Len);
                       Result [Len] := CHR ($80 OR (U AND $3F));
                     END;
      END;
    END;
  SetLength (Result, Len);
END;
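
To make the CASE branches above a little less mysterious, here is a small sketch that applies the same shifts and masks to two of the code points used in my FormCreate test: U+0410 (first Cyrillic capital) for the two-byte branch and U+1100 (first Hangul Jamo) for the three-byte branch. The program and variable names are made up for illustration.

```pascal
program Utf8BytesSketch;

{$mode objfpc}{$H+}

uses
  SysUtils;   // IntToHex

var
  U          : Word;
  B1, B2, B3 : Byte;
begin
  // Two-byte branch ($0080..$07FF): 110xxxxx 10xxxxxx
  U  := $0410;
  B1 := Byte($C0 or (U shr 6));            // top 5 bits go into the lead byte
  B2 := Byte($80 or (U and $3F));          // low 6 bits go into the trail byte
  WriteLn(IntToHex(B1, 2), ' ', IntToHex(B2, 2));   // prints D0 90

  // Three-byte branch ($0800..$FFFF): 1110xxxx 10xxxxxx 10xxxxxx
  U  := $1100;
  B1 := Byte($E0 or (U shr 12));           // top 4 bits
  B2 := Byte($80 or ((U shr 6) and $3F));  // middle 6 bits
  B3 := Byte($80 or (U and $3F));          // low 6 bits
  WriteLn(IntToHex(B1, 2), ' ', IntToHex(B2, 2), ' ', IntToHex(B3, 2));   // prints E1 84 80
end.
```

Those byte sequences are what a UTF-8 file should contain for these two characters, which is a handy way to check a converter against a hex dump.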
