Given these demands, computers provide a good way to ease this repetitiveness by automating lookup and editing; by converting resources like dictionaries, other translations, and collections of example sentences to a computer-readable format, lookups can be performed much faster. Likewise, sophisticated editing languages can reduce the time and complexity of editing translation results.
The current state of the art in translation aid software, represented by programs like SDL Trados [2] and Wordfast [3], is an environment similar to a word processor that is centered on auto-completion of translation candidates based on a user's past translations. The typical workflow in environments like SDL Trados is as follows:
The interface of programs like Trados may suffice when a user adheres strictly to this workflow, translating segments sequentially with little pause for lookup or editing. However, if users deviate from this usage pattern, it becomes difficult to perform other tasks, such as consulting background material, looking up unknown vocabulary, or performing complex edits. The commands for these tasks are hard to figure out and remember.
In order to make it easier for the user to figure out and remember how to do something, the interfaces of the tools provided need to closely correspond with what the user wants to do. To make the UI correspond, we need to make it easier to identify commands in a consistent manner.
An alternative to keyboard shortcuts is the use of menus. While clicking a menu entry may be easier than using a keyboard shortcut, hierarchical menus suffer from the same obfuscation problems that plague keyboard shortcuts; the only difference is that users end up classifying their commands in a hierarchy. Another problem with menus is that they are static in nature: updating the menu layout can confuse users and, in some implementations, requires altering the underlying program. This makes menus too difficult to adapt to changing tasks or workflows.
Where is the consistency in these approaches? Operating systems often provide human interface guidelines that are intended to keep keyboard shortcuts and menu layouts consistent across applications. Examples of this are the Command+N and Command+W shortcuts in Mac OS X for creating and destroying windows. Ultimately, though, this consistency is difficult to enforce: third-party applications often break the rules, and operating systems like Linux, which lack an enforcing authority, are unable to maintain a high level of consistency.
One option for identifying commands that we have not yet considered is the plain text interface. With the advent of GUIs, plain text interfaces are often dismissed as not being user friendly enough. This bad reputation of the command line is due to the perceived obscurity of the names of some commands and the inflexible keyboard-only nature of older terminals.
This design has some interesting implications. Because any text can be executed as a command, the user is not limited in terms of what tools he or she can access, keeping the workflow dynamic and allowing it to adapt easily to the needs of a given task. At the same time, the clear division of roles (the keyboard for text production and the mouse for text interpretation), together with the clean semantics of the mouse buttons for select/execute/get, provides consistency to the UI. By providing an easy way for text to be interpreted as commands, Acme combines the command and its interface, removing the disconnect between what the user wants to do and how it is done. This makes the UI easier to figure out and remember, and produces an interface that is more suitable for text-based tasks.
Acme makes applying tools to data as easy as clicking on text with a mouse. Furthermore, the small-tools design philosophy that inspired Acme makes it easy to apply existing NLP tools in Acme. This means that tasks like using reference materials or editing text are easy. Users who frequently deal with language, such as linguists, translators, language learners, or poets, need to be able to explore and manipulate language efficiently. Acme can provide a powerful working environment for users who spend a lot of time interacting with text.
Acme is clearly attractive for text-heavy tasks like translation; however, some work needs to be done to produce a usable environment. Tools and intuitive interfaces need to be constructed for common text manipulation tasks such as splitting a sentence into words or phrases, looking up words in dictionaries, or translating phrases using translation memories or machine translation systems. Support for multilingual input needs to be improved: we need redistributable fonts with good Unicode coverage and good input method editors for handling input and display of text in languages that do not use Latin-based scripts. Finally, we need a way of getting Acme to the end user in a simple, easy-to-install package that works out of the box.
Consider the task of tokenizing a sentence into words. Many NLP tools, such as the dictionary search function lookup in Figure 1, operate on word-level information, so this task is an essential form of preprocessing for applying them. We will build an interface for English and Japanese, since the methods of tokenization for the two languages are very different and therefore illustrate the need for a simple, consistent interface.
The shell function tok_en tokenizes an English sentence by splitting on any punctuation and removing extraneous whitespace. This approach is sometimes overaggressive: hyphenated words are split, and words with apostrophes will occasionally be segmented unnecessarily. It is nevertheless a simple, easy-to-implement heuristic.
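fn tok_en {
	# Pad each punctuation mark with spaces, squeeze runs of spaces,
	# then trim leading and trailing spaces.
	sed 's/([!"#$%&''()*+,\-.\/:;<=>?@[\]^_`{|}~])/ \1 /g
	s/ +/ /g
	s/^ +//g
	s/ +$//g'
}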
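For readers who want to try the heuristic outside Inferno, the following is a minimal Python sketch of the same idea; it is a port for illustration, not the shell function itself. It assumes the punctuation set is ASCII punctuation as defined by Python's string.punctuation, which may differ slightly from the set used in tok_en.

```python
import re
import string

# Assumption: "punctuation" means ASCII punctuation (string.punctuation);
# the set used by the shell function tok_en may differ slightly.
_PUNCT = re.compile("([" + re.escape(string.punctuation) + "])")


def tok_en(sentence: str) -> str:
    """Pad each punctuation mark with spaces, then normalize whitespace."""
    padded = _PUNCT.sub(r" \1 ", sentence)
    # split() with no arguments drops leading/trailing whitespace and
    # collapses runs, which mirrors the three whitespace rules in the sed script.
    return " ".join(padded.split())


if __name__ == "__main__":
    # Shows the overaggressive splitting of apostrophes and hyphens.
    print(tok_en("Don't over-split, please!"))  # Don ' t over - split , please !
```

The example input shows both weaknesses mentioned above: the apostrophe in "Don't" and the hyphen in "over-split" are each treated as token boundaries.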
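The shell function tok_ja tokenizes a Japanese sentence with the MeCab morphological analyzer, using tcs to convert the text to EUC-JP on the way in and back to UTF-8 on the way out.
fn tok_ja {
	tcs -f utf-8 -t euc-jp |
	mecab -Owakati |	# suppress POS output, tokenizing into words
	tcs -f euc-jp -t utf-8
}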
We combine tok_en and tok_ja together with the following shell function:
fn tok_any {
	args = $*
	if {~ $1 -e} {		# English
		f = tok_en
		(lang args) = $args	# strip the language flag
	} {~ $1 -j} {		# Japanese
		f = tok_ja
		(lang args) = $args	# strip the language flag
	} {			# Use English as fallback
		f = tok_en
	}
	$f $args
}
tok_any acts as a multiplexer, calling the proper language-specific implementation of a given task based on the language flag it is given and falling back to English when no flag is present. Consistent naming and pipe-based I/O make this possible. It does not matter how different the implementations of the English and Japanese tokenizers are; they are encapsulated in separate functions, but combined together they represent a language-independent task.
Whichlang is a Limbo function, similar to Plan 9's freq(1), that uses a simple heuristic based on character frequency to identify the language of an input stream of text. Occurrences of alphabetical characters or punctuation commonly used in English writing are counted as evidence for English, whereas occurrences of characters typically used in Japanese, such as hiragana, katakana, or Han ideographs, are taken as evidence for Japanese. This is a simple approach that could be improved on, but it can easily be expanded and will suffice for our current purposes.
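The multiplexer pattern can be sketched in Python as a table from language flags to implementations. This is an illustrative port, not the shell function: make_multiplexer is a hypothetical helper, and the two tokenizers below are stand-ins (whitespace normalization and a naive per-character split) rather than the real tok_en and tok_ja.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[..., str]


def make_multiplexer(table: Dict[str, Tokenizer],
                     fallback: Tokenizer) -> Callable[[List[str], str], str]:
    """Return a function that picks an implementation from a leading flag."""
    def run(args: List[str], text: str) -> str:
        if args and args[0] in table:
            # Strip the language flag and forward the remaining arguments.
            return table[args[0]](text, *args[1:])
        # No recognized flag: forward everything to the fallback.
        return fallback(text, *args)
    return run


# Hypothetical stand-ins for the real tokenizers, for illustration only.
def tok_en(text: str, *_args: str) -> str:
    return " ".join(text.split())


def tok_ja(text: str, *_args: str) -> str:
    # Naive per-character split; the real tok_ja relies on MeCab.
    return " ".join(ch for ch in text if not ch.isspace())


tok_any = make_multiplexer({"-e": tok_en, "-j": tok_ja}, fallback=tok_en)

if __name__ == "__main__":
    print(tok_any(["-e"], "  two   words "))
    print(tok_any(["-j"], "日本 語"))
```

Adding a language in this sketch means adding one table entry; callers of tok_any do not change, which is the property the shell version obtains from consistent naming.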
whichlang(fd: ref Sys->FD): string
{
	EN: con 0; JA: con 1; FLOOR: con 0.5;
	lang := array[2] of { 0, 0 };
	buf := array[256] of byte;
	c := 0;	# number of non-space characters seen
	for(;;) {
		n := sys->read(fd, buf, len buf);
		if(n <= 0)
			break;
		# Convert only after checking n, since a failed read returns a negative count.
		# A multibyte character split across two reads may be miscounted.
		s := string buf[0:n];
		for(i := 0; i < len s; i++) {
			if(s[i] != ' ') {
				c++;
				case s[i] {
				'a' to 'z' or 'A' to 'Z' or
				'!' to '/' or ':' to '?' =>
					lang[EN]++;
				'、' to '〾' or			# Asian punctuation
				'ぁ' to 'ゟ' or			# hiragana
				'゠' to 'ヿ' or 'ㇰ' to 'ㇿ' or	# katakana
				'㆐' to '㆟' or			# kanbun
				'㐀' to '䶵' or			# CJK unified ideographs ext.
				'一' to '龥' or			# CJK unified ideographs
				'！' to 'ﾟ' =>			# half- and full-width forms
					lang[JA]++;
				}
			}
		}
	}
	# Pick the language with the most evidence, provided it accounts for
	# more than FLOOR of the non-space characters.
	l := "";
	max := 0;
	if((lang[EN] > max) && ((real lang[EN] / real c) > FLOOR)) {
		l = "en";
		max = lang[EN];
	}
	if((lang[JA] > max) && ((real lang[JA] / real c) > FLOOR)) {
		l = "ja";
		max = lang[JA];
	}
	return l;
}
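The counting heuristic can also be sketched in Python for experimentation outside Inferno. This is an illustrative port rather than the Limbo function: it takes an already decoded string instead of a file descriptor, so it has no buffer to manage, and the code point ranges are transcribed from the case statement above on the assumption that its last range is the half- and full-width forms U+FF01 to U+FF9F.

```python
from typing import List, Tuple

FLOOR = 0.5  # minimum share of non-space characters needed to name a language

Ranges = List[Tuple[int, int]]

EN_RANGES: Ranges = [
    (ord("a"), ord("z")), (ord("A"), ord("Z")),
    (0x21, 0x2F), (0x3A, 0x3F),          # '!' to '/' and ':' to '?'
]
JA_RANGES: Ranges = [
    (0x3001, 0x303E),                    # Asian punctuation
    (0x3041, 0x309F),                    # hiragana
    (0x30A0, 0x30FF), (0x31F0, 0x31FF),  # katakana
    (0x3190, 0x319F),                    # kanbun
    (0x3400, 0x4DB5),                    # CJK unified ideographs ext.
    (0x4E00, 0x9FA5),                    # CJK unified ideographs
    (0xFF01, 0xFF9F),                    # half- and full-width forms (assumed)
]


def _in_ranges(ranges: Ranges, cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in ranges)


def whichlang(text: str) -> str:
    """Return "en", "ja", or "" when neither language passes FLOOR."""
    counts = {"en": 0, "ja": 0}
    total = 0
    for ch in text:
        if ch == " ":
            continue
        total += 1
        cp = ord(ch)
        if _in_ranges(EN_RANGES, cp):
            counts["en"] += 1
        elif _in_ranges(JA_RANGES, cp):
            counts["ja"] += 1
    best, best_count = "", 0
    for name in ("en", "ja"):
        n = counts[name]
        # n > best_count is false when total is 0, so the division is safe.
        if n > best_count and n / total > FLOOR:
            best, best_count = name, n
    return best


if __name__ == "__main__":
    print(whichlang("Hello, world!"))
    print(whichlang("これは日本語の文です。"))
```

Extending the sketch to another language would mean adding one more range table and one more entry in counts, which mirrors how the case statement would be expanded.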
Turning whichlang into a Limbo module allows our language detection facilities to be used in any other Limbo program. Writing a wrapper to use whichlang as a stand-alone program is trivial. Assuming such a program, we can write a shell script that will automatically set the language flag for any program to the language of its input.
Setlang takes as its first argument the command whose language is to be set. It determines if it has been called with any manually set language flags, and, in their absence, it uses tee to cache a copy of the program's input and pipe a portion to whichlang for identification. If a language is successfully identified, the language flag is set accordingly. Finally the command is called with the language flag either set or omitted entirely.
fn setlang {
	args = $*
	or {~ $#args 1 2} {
		echo 'usage: setlang <command> [-e | -j]' >[1=2]
		raise usage
	}
	(com args) = $*
	# Honor a language flag given on the command line.
	(and {~ $#args 1} {~ $args -e -j} {
		(lang nil) = $args
	})
	tmp = ${pid}^.tmp
	if {~ $#lang 0} {
		# Cache the input and identify its language from the first 200 lines.
		l = `{tee $tmp | sed 200q | whichlang}
		if {~ $l en} {
			lang = -e
		} {~ $l ja} {
			lang = -j
		}
	}
	if {ftest -f $tmp} {
		$com $lang < $tmp
		rm -f $tmp
	} {
		$com $lang
	}
}
fn tokenize {
	setlang tok_any $*
}
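The control flow of setlang can be summarized in a short Python sketch. It is an illustrative port under several assumptions: the command and the detector are passed in as ordinary functions (the names command and detect are hypothetical), the whole input is held in memory instead of being cached in a temporary file with tee, and the sample size of 200 lines is taken from the sed 200q step.

```python
from typing import Callable, List

# Mapping from detector output to language flag, as in setlang.
FLAGS = {"en": "-e", "ja": "-j"}


def setlang(command: Callable[[List[str], str], str],
            args: List[str],
            text: str,
            detect: Callable[[str], str],
            sample_lines: int = 200) -> str:
    """Call command with a language flag taken from args or from the input."""
    if args and args[0] in FLAGS.values():
        flags = [args[0]]                      # a manually set flag wins
    else:
        # Identify the language from the first sample_lines lines only.
        sample = "\n".join(text.splitlines()[:sample_lines])
        code = detect(sample)
        # Omit the flag entirely when no language is identified.
        flags = [FLAGS[code]] if code in FLAGS else []
    return command(flags, text)


if __name__ == "__main__":
    # Hypothetical stand-ins for whichlang and tok_any, for illustration only.
    def demo_detect(sample: str) -> str:
        return "ja" if any(ord(ch) >= 0x3000 for ch in sample) else "en"

    def demo_command(flags: List[str], text: str) -> str:
        return " ".join(flags + [text])

    print(setlang(demo_command, [], "日本語", demo_detect))
    print(setlang(demo_command, ["-e"], "日本語", demo_detect))
```

As in the shell version, the wrapped command never needs to know whether its flag was typed by the user or inferred from the input.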
The examples of whichlang and setlang show how easy it is to use the Plan 9 design philosophy and the tools provided by Inferno to create simple, consistent text-based interfaces for multilingual NLP applications. Inferno is clearly an attractive platform for NLP development, and the addition of Acme also makes it ideal for the distribution of NLP services. What is needed is an easy way to get Acme and useful NLP tools into the hands of the end user.
Acme SAC simplifies the user experience further by eliminating any installation or setup process. Users simply download a tarball and run a single application to start Acme. The first time Acme is started, a new user account with reasonable default settings is automatically created. Inferno is capable of accessing the local file system, and Acme SAC mounts and makes it available by default. The os command provides access to host OS commands, allowing users to continue using any of their existing tools that support pipe-based I/O, and Acme SAC clients can communicate with Inferno installations, making it easy to run commands on remote machines.
To help users get accustomed to the Acme environment, Acme SAC is distributed with a README that acts as an interactive tutorial thanks to Acme's dynamic text interpretation capabilities. Users learn how to use the mouse to select and execute text, how to browse man pages, and how to use Acme SAC's IRC client. Every text example is immediately available for exploration to the user.
Our goal is to provide a set of tools that will help our target user group of linguiphiles work with text. The classic text "Unix(TM) for Poets" [8] by Ken Church can act as an interactive README for NLP; however, its examples need to be updated for the Inferno shell and, perhaps, extended to include text in languages other than English. We are in the process of updating it (the new text will be called "Acme for Poets"), but the lack of awk and paste commands makes some of the examples challenging.
Mac OS X applications are typically distributed in a single directory that contains all of the application's dependencies. This makes them easy for users to manage: installation and upgrades consist of simply dragging the application to a desired folder. However, this caused problems for some of Acme SAC's settings. For example, by default, Acme SAC creates its user directories inside the Acme SAC tree. This is problematic on Mac OS X because a subsequent install would overwrite all of the user's data, and it causes security issues when Acme SAC is installed by an administrator. Moving the Acme home directory into the Mac OS X user's home directory seemed most appropriate; however, binding /Users/me directly to Acme SAC's user directory would be inappropriate, since that could cause security problems. The solution was to create a special directory in the Mac OS X user's home directory called acme-home/ and copy all of the Acme SAC user's files there during new user creation. This protects the user's other files in the event that something goes wrong with Acme SAC. In addition, we bind Acme SAC's /tmp onto acme-home/ as well, making Acme SAC's root directory truly read-only.
While Plan 9 and Inferno both support limited Unicode character input via modifier keys, languages that do not use Latin-based scripts often require more sophisticated forms of managing input. There are native IMEs, such as ktrans [9], that make input conversion possible in Plan 9 and Inferno, but they are often limited in the scope of languages covered. In contrast, Mac OS X and Windows both provide IMEs capable of handling a large number of languages, and their vendors can dedicate more resources to improving the quality of input conversion than many smaller projects can.
For our purposes, the IMEs of Acme SAC's host OSes also provide the user with consistency; he or she is not required to learn a new method of entering text to do translation in Acme SAC. Acme SAC's graphical component is implemented using its host operating system's native window system API, making it easy to provide IME support in a minimal amount of code and in a manner that does not disturb users who do not need the IMEs. In fact, when we questioned Acme SAC's creator on how the Windows version's IME support was implemented, he was not aware of its implementation at all (personal correspondence). We were able to implement support for the IME in Mac OS X with only a few dozen lines of code, the majority of which update the Mac encoding of special keys to their Unicode equivalents.
One problem that arises with increased support of languages is that fonts are required to display them. This problem is well-illustrated in the case statement in whichlang from Section 5*.
One of the reasons for the widespread adoption of SDL Trados is its ability to let the user translate popular document formats like Microsoft Word and Adobe PageMaker, stripping away the formatting before translation and replacing it afterward. In order to attract users, it may be necessary to handle translation of such documents as well. This raises the more general issue of how (if at all) markup should be handled in Acme. Programs like strings(1) make it possible to extract plain text from Word documents, but would it be possible to read and write in the Word format without losing formatting information?
Another asset of state-of-the-art translation aid environments is their support of translation memories and auto-completion. This starts with dividing the input text into smaller translation units. Such "text chunking," as it is known in the field of natural language processing, is easy to replicate with existing tools. Indeed, it was one of the first features we implemented in Acme. With the text to be translated divided into chunks, we need a fast way of navigating these chunks. Acme SAC's client wrapper for Charon may provide some guidance: adding a special client with Prev and Next buttons to move from chunk to chunk could ease navigation of the document being translated. We will also need to provide text alignment facilities in order to construct translation memories from document pairs. Investigating an appropriate way of implementing this in Acme and Inferno remains an area of future work.
Finally, perhaps the most important question concerns Acme itself. While it is our position that Acme's text-based interface simplifies interaction with text by eliminating the need for nested menus and keyboard shortcuts, we have not empirically investigated Acme's learning curve. Thus, answering the question of whether or not one has to become an Acme power user in order to translate using it is of utmost importance.