Non-uniform file encodings in the Eclipse Platform
Last modified: February 23, 2004
Plan item description: Eclipse 2.1 uses a single
global file encoding setting for reading and writing files in the
workspace. This is problematic, for example, when Java source files in
the workspace use OS default file encoding while XML files in the
workspace use UTF-8 file encoding. The Platform should support
non-uniform file encodings. [Platform Core, Platform UI, Text, Search,
Compare, JDT UI, JDT Core] [Theme: User experience] (bug 37933, 5399)
The pre-M7 situation is as follows:
- ResourcesPlugin.getEncoding returns the default encoding for the
workspace (the org.eclipse.core.resources.encoding preference value if
available, otherwise the value of the file.encoding Java system
property).
- IFile.getContents/setContents work with byte streams - no encoding
can be applied.
- IFile.getEncoding tries to guess the file encoding (looking for the
Byte Order Mark), which is not enough. Also, this API has no known
client so far. This API method would be deprecated.
- the Java compiler supports non-uniform encodings for Java source
files, but in Eclipse it relies on ResourcesPlugin.getEncoding (same
value for all sources).
- the text editor framework supports setting the encoding for files
being edited (setting a persistent property on the file resource), but
there is no support for setting the encoding of multiple files
simultaneously, and other components are not aware of the encoding
settings.
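To illustrate why looking for a Byte Order Mark is "not enough", here is a minimal sketch of BOM sniffing. The names BomSniffer and charsetFromBom are hypothetical and are not the IFile.getEncoding implementation; UTF-32 marks are ignored for brevity.

```java
/** Hypothetical helper, not part of any Eclipse API. */
class BomSniffer {
    /** Returns the charset name implied by a leading byte order mark, or null if there is none. */
    static String charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: the leading bytes alone do not identify the encoding
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Files written in single-byte or legacy multi-byte encodings carry no such mark, so this check returns null for them and some other source of encoding information is needed.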
Requirements
- the encoding of a file should be automatically determined by
considering the file's content and/or its name or extension.
- encoding information should be available not only for workspace
resources (e.g. IFiles) but for external files too. This has become
more important in light of the recent RCP work since use of the
resource plugin has become optional.
- encoding information should be available for local history
contents (IFileState) and archives (e.g. *.zip and *.jar).
- in the future it should be possible to use the content based
encoding interpreter for determining more information about the file
(e.g. content type) without duplicating the mechanism but rather by
augmenting it.
- users should be able to set the default encoding for a project.
- users should be able to share the default encoding settings in a
team
repository (metadata should reside in the project content area).
- an encoding determined from the file contents takes precedence over
the inherited encoding setting.
- users should be able to easily store a file in a different
encoding (in order to change its encoding).
Proposed solution
In addition to the existing approach of having a single global encoding for a
workbench, we propose
- an extensible mechanism to determine the encoding of a stream by
analyzing its contents or, if available, its file name,
- to add a default encoding property to projects. This default
encoding is used if no encoding could be determined in the first step.
We do not (yet) propose a settable encoding attribute per file because
- we do not see an immediate need for this fine level of
granularity,
- we have no sharable file attributes in Eclipse, so per-file encoding
settings would be difficult to share.
The encoding for a stream or an IStorage (as returned by the two getCharset
methods - see API changes) will be the first of the following that applies:
1. the encoding discovered by a content interpreter associated with the
file extension (or file type), if one exists and can determine the
encoding, or
2. the default encoding defined for the enclosing project, if any, or
3. the global workspace encoding (equivalent to
ResourcesPlugin.getEncoding()).
Regarding #1, an extension point would allow file format-aware
encoding interpreters to register with the encoding discovery mechanism
for specific file types (extensions) or to associate existing encoding
interpreters with their own file extensions. Users would be able to
associate more file extensions with the known interpreters (via a
preference).
All clients that create character-based streams when reading or writing
the contents of a file resource should pass along the charset string
obtained from one of the getCharset methods instead of the one provided
by ResourcesPlugin.getEncoding. Examples are: text editors, compiler,
search, compare.
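As a rough sketch of what "passing along the charset string" means for a client, the helper below decodes a byte stream with a charset name supplied by the caller. CharsetAwareReading and readAll are hypothetical names; in a real client the charsetName argument would come from one of the proposed getCharset methods.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical client-side helper, not part of any Eclipse API. */
class CharsetAwareReading {
    /** Decodes the whole stream using the named charset and closes the stream afterwards. */
    static String readAll(InputStream in, String charsetName) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charsetName)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException for unknown charset names.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

The same bytes yield different text depending on the charset name, which is why a single workspace-wide value is not sufficient once files use different encodings.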
API changes
Added:
To make the encoding support available for non-workspace based
resources we propose to add the following method to
org.eclipse.core.runtime.IPlatform:
public interface IPlatform {
    // ...
    public String getCharset(InputStream stream, String fileExtension) throws CoreException;
    // ...
}
The InputStream seems to be the most widely used and scalable mechanism
to get access to any kind of byte content. InputStreams can be easily
created for a java.io.File, an IStorage (which subsumes IFile and
IFileState; see below), as well as for bytes in memory
(ByteArrayInputStream).
The optional file extension argument can be used to quickly rule out the
more expensive ways of inferring the encoding from the contents.
A corresponding implementation (based on IContentInterpreters; see
below) lives in org.eclipse.core.runtime.Platform.
For the resource plugin we propose to add a new interface
IEncodedStorage that adds the single method getCharset to the existing
IStorage interface:
interface IEncodedStorage extends IStorage {
    public String getCharset() throws CoreException;
}
Its method getCharset returns the name of the encoding for an
IStorage. It would make sense to add this method directly to the
IStorage interface, since any InputStream can only be interpreted
correctly if the used encoding is known. But because clients are
allowed to implement IStorage this would be a breaking API change, so
we decided to introduce a separate extension to IStorage.
Two existing interfaces will extend IEncodedStorage: IFile and
IFileState. Two concrete classes will provide an implementation: File and
FileState.
For both files and file states, the implementation of getCharset
first uses IPlatform.getCharset(...) from above to find an encoding
based on any registered IContentInterpreters. If no encoding can be
determined, File.getCharset() locates the enclosing project of the file
and queries its IProjectDescription for a default encoding. For this we
need the following two new methods on IProjectDescription:
interface IProjectDescription {
    // ...
    public String getDefaultCharset();
    public void setDefaultCharset(String charset);
    // ...
}
If no default encoding has been defined for the project, the workspace's
default encoding preference is returned (via the existing API).
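The lookup order described above can be sketched as a simple fallback chain. CharsetResolution and resolve are hypothetical names; the three arguments stand for the results of the content interpreters, IProjectDescription.getDefaultCharset() and the workspace preference, with null meaning "not determined".

```java
/** Hypothetical illustration of the proposed lookup order, not the File.getCharset implementation. */
class CharsetResolution {
    /** Returns the first candidate that is not null: detected, then project default, then workspace default. */
    static String resolve(String detected, String projectDefault, String workspaceDefault) {
        if (detected != null) return detected;             // content-based detection wins
        if (projectDefault != null) return projectDefault; // then the enclosing project's default
        return workspaceDefault;                           // finally the global workspace encoding
    }
}
```

For example, resolve(null, "UTF-16", "Cp1252") yields the project default, while a non-null detected value is returned regardless of the other two.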
Other implementers of IStorage will have to decide whether they should
base their implementation on IEncodedStorage.
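Because not every IStorage will implement IEncodedStorage, clients would presumably need a type check before asking for a charset. The sketch below uses simplified local stand-ins for the two interfaces (the real ones declare CoreException and more methods); StorageCharsets and charsetOf are hypothetical names.

```java
import java.io.InputStream;

// Simplified stand-ins for the interfaces discussed above, for illustration only.
interface IStorage {
    InputStream getContents();
}

interface IEncodedStorage extends IStorage {
    String getCharset();
}

/** Hypothetical client-side helper. */
class StorageCharsets {
    /** Uses the storage's own charset when it offers one, otherwise the supplied fallback. */
    static String charsetOf(IStorage storage, String fallback) {
        if (storage instanceof IEncodedStorage) {
            String charset = ((IEncodedStorage) storage).getCharset();
            if (charset != null) return charset;
        }
        return fallback; // plain IStorage implementations carry no encoding information
    }
}
```

A caller would typically pass the workspace default as the fallback, so that storages that do not implement IEncodedStorage behave as they do today.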
The implementation of Platform.getCharset will make use of content interpreters
that implement the IContentInterpreter interface and can be associated with file
types through a new Core Runtime extension point "org.eclipse.core.runtime.contentInterpreter".
Users can associate additional file extensions via preferences.
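One way to picture the association between file extensions and interpreters is a lookup table keyed by extension. This is only a sketch under assumptions: InterpreterRegistry and associate are hypothetical names, a plain Function stands in for a content interpreter, and a missing extension is simply treated as "no result", which the proposal does not specify.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical registry; the proposal itself would populate this from the extension point and preferences. */
class InterpreterRegistry {
    // Maps a lower-case file extension to a detector that returns a charset name or null.
    private final Map<String, Function<InputStream, String>> byExtension = new HashMap<>();

    /** Associates a file extension with a detector, as an extension or a user preference might. */
    void associate(String fileExtension, Function<InputStream, String> detector) {
        byExtension.put(fileExtension.toLowerCase(Locale.ROOT), detector);
    }

    /** Returns the detected charset, or null when no detector is registered or detection fails. */
    String getCharset(InputStream stream, String fileExtension) {
        if (fileExtension == null) return null; // assumption: no extension means no detection
        Function<InputStream, String> detector = byExtension.get(fileExtension.toLowerCase(Locale.ROOT));
        // Quick rejection: without a registered detector the stream is never read.
        return detector == null ? null : detector.apply(stream);
    }
}
```

The early return for unregistered extensions is what makes the file extension argument a cheap filter in front of content inspection.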
The method interpretContent does not return the detected encoding but stores
it into a result object of type IContentInfo that is passed in as an argument.
This approach makes it possible to allow for collecting additional information
(like 'type'/'subtype') instead of just the encoding.
interface IContentInterpreter {
    public void interpretContent(IContentInfo result, InputStream contents);
}
The IContentInfo is:
public interface IContentInfo {
    public void setCharset(String charset);
    public String getCharset();
}
Since we would not allow clients to implement (or extend) IContentInfo,
we will be able to extend the API with new setters and getters in the
future without breaking API.
The platform itself would provide IContentInterpreter implementations
for XML and other popular file formats.
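As an illustration of what an XML interpreter might do, the sketch below looks for the encoding pseudo-attribute in the XML declaration. The two interfaces are local stand-ins mirroring the ones proposed above; XmlDeclarationInterpreter and SimpleContentInfo are hypothetical names. It assumes an ASCII-compatible encoding without a leading BOM, so UTF-16 documents are not handled, and it uses InputStream.readNBytes, which requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-ins mirroring the interfaces proposed above.
interface IContentInfo {
    void setCharset(String charset);
    String getCharset();
}

interface IContentInterpreter {
    void interpretContent(IContentInfo result, InputStream contents);
}

/** Trivial result holder for the sketch. */
class SimpleContentInfo implements IContentInfo {
    private String charset;
    public void setCharset(String charset) { this.charset = charset; }
    public String getCharset() { return charset; }
}

/** Hypothetical interpreter that reads the encoding named in an XML declaration. */
class XmlDeclarationInterpreter implements IContentInterpreter {
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    @Override
    public void interpretContent(IContentInfo result, InputStream contents) {
        try {
            byte[] head = new byte[256];
            int n = contents.readNBytes(head, 0, head.length);
            // The declaration itself is ASCII, so a single-byte decoding is enough to find it.
            String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
            Matcher m = ENCODING.matcher(text);
            if (m.find()) {
                result.setCharset(m.group(1));
            }
        } catch (IOException e) {
            // Detection is best effort: leave the result untouched on read errors.
        }
    }
}
```

When no declaration is found the result object is left unchanged, so the caller can fall back to the project or workspace default.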
Deprecated:
- public int IFile.getEncoding()
- public int IFile.ENCODING_* constants
- public String ResourcesPlugin.getEncoding(): Since all clients of this
method will most likely have to adapt their code, I suggest deprecating
getEncoding() and introducing a new method getDefaultCharset() that
better reflects its real purpose (and brings it more in line with
IProjectDescription.getDefaultCharset()).
UI Changes
We need to add new UI for changing the default encoding for a project.
A good place for this would be the Property dialog, since encoding can
be considered a property of the project, similar to the read-only
property etc. The property dialog for files would only show the current
value of the encoding but would not allow changing it.
We should provide a "Convert Encoding"
action that converts the contents of a file (or all files in a hierarchy) to a
different encoding. This action would ask the user for two encodings: the first
is used when reading all selected files and the second when writing these files
back to the workspace.
The action would not
change the encoding value returned by getCharset() but it would provide
a means to make the encoding of multiple files consistent with the
default encoding of the enclosing project.
(An alternative to this UI would be to provide something like a "Save with
encoding" action for editors. But this UI seems to be less convenient if
the encoding of multiple files needs to be changed).
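At its core, the "Convert Encoding" action described above amounts to decoding with one charset and re-encoding with another. A minimal sketch for a single file's bytes follows; EncodingConverter and convert are hypothetical names, and the surrounding workspace plumbing (selection, progress, writing back via IFile.setContents) is left out.

```java
import java.nio.charset.Charset;

/** Hypothetical core of a "Convert Encoding" action, operating on one file's bytes. */
class EncodingConverter {
    /** Decodes the bytes with the source charset and re-encodes the text with the target charset. */
    static byte[] convert(byte[] contents, String sourceCharset, String targetCharset) {
        String text = new String(contents, Charset.forName(sourceCharset));
        // Characters the target charset cannot represent are silently replaced here.
        return text.getBytes(Charset.forName(targetCharset));
    }
}
```

Because unmappable characters are replaced without warning in this sketch, a real action would probably want to detect and report lossy conversions before overwriting files.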
In order to make sharing of files with heterogeneous encodings easier,
we'll have to enhance the compare/merge tools to be able to work with
heterogeneous encodings.
To facilitate that, we try to automatically determine the encoding for
the remote resource
- by knowing the default encoding of the remote project if the
.project file (containing the encoding attribute) has been synchronized
first, or
- by using the local IContentInterpreter mechanism for the remote
resources (which are available as streams), or
- by allowing the user to change the encoding for the remote
resource on the fly until it displays correctly.
With these means it becomes possible to compare and merge files
regardless of whether the same encoding is used on both sides.
However, if we want to use the same encoding (that is, if we catch up with the remote
.project file), we will have to convert our local files to the new
encoding. For this we will provide the "Convert
Encoding" action in the Compare/Merge tools where required.
Scenarios
- The user opens text files whose contents were created using encoding "MS932"
in a workspace whose default encoding is "US-ASCII". It was not
possible to guess the file encoding automatically, so what the user sees is
gibberish. The user figures out the cause of the problem and explicitly sets
the encoding for the project containing the files to "MS932". He
will have to reload all editors to see the contents correctly and will have
to trigger a full build in the affected project.
- The user has a Java project with a few Java files, no explicitly specified
project encoding and a CP1252 workspace encoding. Now he wants to start using
all kinds of Unicode characters in his Java files. He sets the default encoding
of his project to UTF-16 and he converts all existing Java files to the UTF-16
encoding. All newly created Java files will automatically have the correct
encoding. Project metadata files like ".project" or "plugin.xml"
files will still be read in their correct encoding since IContentInterpreters
still apply to them.
- Determining the encoding to use for newly created files: Normally the encoding
becomes relevant on saving the file for the first time. Since the IFile already
exists and knows its project, the encoding to use can be determined by the
proposed API. A potential problem might arise from the fact that a newly created
file should use an encoding that is consistent with the encoding it would
get from an IContentInterpreter. Examples are *.properties and *.xml files.
Their format fixes the encoding (ISO-8859-1 for *.properties, UTF-8 by
default for *.xml) even if the enclosing project uses a different
encoding. The code that writes these files to disk (and defines the initial
encoding) must understand this. It can neither use the encoding for the project,
nor can it use the encoding for the file (because the file is still empty
when the IContentInterpreter tries to determine its encoding).
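The first scenario can be reproduced in miniature: text written with one encoding and read back with another does not survive unless the text happens to be representable identically in both. In the sketch below EncodingMismatchDemo and survives are hypothetical names; when trying it out, UTF-8 can stand in for "MS932", since MS932 is not guaranteed to be available on every Java runtime.

```java
import java.nio.charset.Charset;

/** Hypothetical demonstration of the encoding mismatch from the first scenario. */
class EncodingMismatchDemo {
    /** Encodes text with one charset, decodes it with another, and reports whether it came back unchanged. */
    static boolean survives(String text, String writeCharset, String readCharset) {
        byte[] stored = text.getBytes(Charset.forName(writeCharset));
        String shown = new String(stored, Charset.forName(readCharset));
        return shown.equals(text);
    }
}
```

Plain ASCII text survives a UTF-8 to US-ASCII round trip while Japanese text does not, which matches the scenario: the mismatch only becomes visible for characters outside the common subset of the two encodings.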