Proposal

Removing restrictions on valid characters in paths

Summary
The data structure org.eclipse.core.runtime.IPath and its canonical implementation, org.eclipse.core.runtime.Path, impose restrictions on segment names that can be more restrictive than those for file names in the underlying file system. When Eclipse paths are used to represent file system paths, these restrictions prevent valid files from being added to the Eclipse workspace. This document describes the current set of restrictions in Eclipse 3.0, and proposes changes to lift these restrictions.

Last modified: October 18, 2004

Background

IPath is an abstract data structure supplied by the org.eclipse.core.runtime plug-in, and consists of the following parts:

To facilitate conversion between IPath and String instances, IPath reserves the colon (':') character as the device delimiter, and the forward and back slash ('/' and '\') characters as segment delimiters. The API javadoc for IPath.isValidSegment outlines the complete set of restrictions:

Contrast this with the restrictions on file names in Unix (as defined in the "Base Definitions" volume of IEEE Std. 1003.1-2001, section 3.169 Filename):

Thus, when Eclipse IPath objects are used to represent Unix file names, they are unable to represent file names containing '\' or ':', or file names with leading or trailing white space.
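The gap between the two rule sets can be illustrated with a small sketch. The class and method names below are hypothetical and are not Eclipse API. The Eclipse-style check encodes only the restrictions named in this section, plus the assumption that empty segments are rejected. The Unix-style check assumes the POSIX rule that a file name may contain any character except slash and the null character, and ignores length limits.

```java
// Sketch only: contrasts the Eclipse 3.0 segment restrictions described in this
// document with the POSIX filename rule. SegmentRules is a hypothetical helper.
final class SegmentRules {
    /** Eclipse 3.0 style check (assumption: the empty string is rejected as well). */
    static boolean isValidEclipse30Segment(String segment) {
        if (segment.isEmpty()) return false;
        // ':' is reserved as the device delimiter, '/' and '\' as segment delimiters.
        for (char reserved : new char[] {'/', '\\', ':'}) {
            if (segment.indexOf(reserved) >= 0) return false;
        }
        // Leading or trailing white space is not allowed.
        return !Character.isWhitespace(segment.charAt(0))
            && !Character.isWhitespace(segment.charAt(segment.length() - 1));
    }

    /** POSIX style check: anything except slash and the null character. */
    static boolean isValidUnixFilename(String name) {
        return !name.isEmpty() && name.indexOf('/') < 0 && name.indexOf('\0') < 0;
    }
}
```

Names such as "a:b", "a\b" and " a " pass the Unix-style check but fail the Eclipse-style check, which is exactly the set of files this proposal aims to make representable.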

Proposed Solution

Lifting the restriction on paths with leading or trailing whitespace and paths containing the '\' character is easily achieved by specifying a new constructor for creation of paths. Lifting the restriction on the ':' character is more difficult, since it is needed as the device delimiter on operating systems that support a device.

The solution must accommodate the two interesting categories of IPath users:

IPath already acknowledges these two uses in its toString methods. The standard toString method creates a platform-neutral encoding of the path as a String. The toOSString method creates a platform-specific encoding suitable for passing to java.io.File or other API that deals directly with the file system.

The proposed solution is to introduce two factory methods for creating IPath instances that perform the inverse of the two existing toString methods:

Since changing the behaviour of the existing toString method would cause too much breakage, a new method, toPortableString, will be introduced for creating a platform-neutral string representation of paths. The existing toString method will remain unchanged.

Most clients will use the platform-specific form of paths. The path can be converted to/from a platform-neutral representation when a path needs to be serialized in a portable fashion.

The platform-neutral encoding of paths (IPath.toPortableString) will allow all characters except slash ('/') in segment names, and include an optional device separated from the segments by a single colon character. Literal colon characters in path segments are escaped through doubling (one colon becomes two colons). The following are some examples of Windows file system paths and the corresponding platform-neutral encoding:
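The encoding rules just described can be sketched in a few lines. This is an illustration of the rules as stated in this proposal, not the Eclipse implementation. The class PortableEncodingSketch and its toPortable method are hypothetical names, and the device is assumed to be passed including its trailing colon (for example "C:"), or null when absent. UNC paths and trailing separators are not modelled.

```java
// Sketch only: builds the platform-neutral string from a device, an
// absolute/relative flag and literal segment names.
final class PortableEncodingSketch {
    static String toPortable(String device, boolean absolute, String... segments) {
        StringBuilder out = new StringBuilder();
        if (device != null) out.append(device);   // device already ends with ':'
        if (absolute) out.append('/');            // leading separator
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) out.append('/');
            // Escape literal colons by doubling; every other character passes through.
            out.append(segments[i].replace(":", "::"));
        }
        return out.toString();
    }
}
```

Under these assumptions, a device "C:" with the single segment "foo" encodes as "C:/foo", while a relative path whose first segment is literally "C:" encodes as "C::/foo", so the two cases remain distinguishable.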

Canonical Unix paths look identical to their platform-neutral encoding, except in the presence of segments containing the colon character. The following are some Unix paths and the corresponding encoding by IPath.toPortableString:

UNC paths, which typically have no device but have a double leading separator, will generally be the same:

If for some reason a UNC path had a device, it will precede the slashes:

This platform-neutral encoding unambiguously encodes all possible paths on all supported platforms. Most importantly, this toPortableString implementation is fully backward compatible with the Eclipse 3.0 implementation of IPath.toString for all paths that can be created in Eclipse 3.0. This means that clients who previously used toString for serializing paths can move to the new toPortableString/fromPortableString methods without migrating file formats.

The platform-specific Path factory method will impose the minimum platform-specific requirements needed to unambiguously parse all possible paths on that platform. The Windows implementation, for example, will interpret everything up to the first ':' as the device, and treat both '/' and '\' as path segment separators. No other rules will be imposed. Thus the existing restriction on paths that prevents path segments from having leading or trailing whitespace will no longer be enforced on any platform.
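The minimal Windows parsing rule described above can be sketched as follows. This is an illustration under stated assumptions, not the Eclipse implementation: WindowsParseSketch and parse are hypothetical names, the result is a plain list whose first element is the device (or null) followed by the segments, and leading separators, trailing separators and UNC prefixes are not modelled.

```java
// Sketch only: everything up to and including the first ':' is the device,
// and both '/' and '\' separate segments. No other validation is performed.
final class WindowsParseSketch {
    /** Returns a list of the form [device-or-null, segment1, segment2, ...]. */
    static java.util.List<String> parse(String osString) {
        java.util.List<String> parts = new java.util.ArrayList<String>();
        int colon = osString.indexOf(':');
        String device = colon >= 0 ? osString.substring(0, colon + 1) : null;
        String rest = colon >= 0 ? osString.substring(colon + 1) : osString;
        parts.add(device);
        // The regular expression [/\\] matches either separator character.
        for (String segment : rest.split("[/\\\\]")) {
            if (!segment.isEmpty()) parts.add(segment); // white space is kept as-is
        }
        return parts;
    }
}
```

For instance, parse("C:\\foo/ bar ") yields the device "C:" and the segments "foo" and " bar ", with the surrounding white space preserved.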

As before, detailed validation of all legal characters and names on that platform will not be enforced. Some clients use technology such as Cygwin or Samba to mount foreign file systems on a platform. In these situations, path name rules for the local file system do not apply. While it is difficult to fully support these users, any additional platform-specific verification performed on paths causes further problems for these users. Imposing the absolute minimum requirements for unambiguously parsing paths allows the majority of users to function without further impacting the corner cases.

API Details

The following existing methods on IPath and Path are affected:

New methods for Eclipse 3.1:

What do we do with the Path(String) constructor?

This proposal introduces two factory methods that clearly distinguish platform-neutral and platform-specific encodings of paths. The difficult question is what to do with the old single-argument Path constructor. The two options are:

  1. Leave the implementation of this constructor unchanged, but deprecate it. The advantage of this solution is that it does not break the API contract spelled out in the current Path constructor, which explicitly states how it handles ':' and '\' characters. The disadvantage is that this will require all callers of the existing Path constructors to migrate to one of the two path factory methods, depending on the origin of the path string being used. Clients that do not migrate to the new factory methods risk errors when trying to construct IPath instances corresponding to file system paths that were previously treated as invalid. For example, the resources plug-in would allow introduction of resources with the ':' and '\' characters. Other plug-ins trying to create a path corresponding to those resources using the old constructors will fail. Experiments with this solution showed that plug-ins that failed to migrate to the new factory methods were broken by the unexpected introduction of previously invalid paths. This presents a bleak picture for backwards compatibility, regardless of the fact that no API contracts are broken.
  2. The second option is to change the existing single-argument Path constructor to be platform-specific. In other words, the Windows implementation of this constructor would remain unchanged, but implementations on other platforms would stop treating ':' as the device separator, and no longer treat '\' as a path segment separator. This clearly violates the existing API specification of the Path constructor. On the positive side, this introduces very little breakage in practice. The net effect is to remove old restrictions on some operating systems. The only breakage will be caused to clients who use a device for some reason on all operating systems, and clients that need to construct IPath objects representing file system paths from platforms other than the one that the current Eclipse instance is running on. For example, a plug-in running on Linux would not be able to use the old constructors to create IPath objects representing files from a remote Windows system.

After investigating the implementation of both of the above approaches, the second option introduces the smallest breakage by far. For example, the first option requires changing almost all of the 600 references to the Path constructors found in the current edition of the Eclipse platform. The second option requires only a small set of localized changes in code that deals with serializing and deserializing paths in a platform-neutral manner. Based on testing the implementation of these two options, this proposal recommends option two.

Examples

The following examples illustrate the behaviour of the various Path constructors and to*String methods.

Given the absolute path with device "C:" and single segment "foo", the following IPath methods will produce:

Given the relative path with null device, and two segments "C:" and "foo":

Given the string "C:\\foo" (single backslash escaped in Java literal format), the following constructors will produce:

Given the string "C:/foo":

Given the string "C::/foo":

Other migration issues

All clients who store absolute IPath objects as platform-neutral strings in a serialized form (as produced by IPath.toString in Eclipse 3.0) should switch to the new fromPortableString/toPortableString methods rather than the Path constructor and the toString method. Backward compatibility with files written by Eclipse 3.0 is automatic (no changes to file formats or file format version numbers are required). Examples of files that contain string representations of paths that will need to migrate include the workspace .project and .classpath files.

Other Observations

Under this proposal IPath.toPortableString and Path.fromPortableString are perfect inverses of each other. In other words, the expression

   path.equals(Path.fromPortableString(path.toPortableString()))

will be true for all paths, and

   string.equals(Path.fromPortableString(string).toPortableString())

will be true for all strings that represent canonical paths (strings with duplicate slashes or "." and ".." references will turn out differently). Furthermore, the Eclipse 3.1 implementation of Path.fromPortableString will be the perfect inverse of the Eclipse 3.0 implementation of IPath.toString.
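The inverse relationship can be illustrated with a self-contained model. The sketch below is not the Eclipse implementation: SimplePath, toPortable and fromPortable are hypothetical stand-ins for Path, toPortableString and fromPortableString. It assumes the device is stored with its trailing colon, treats the first colon as the device delimiter only when it is not doubled, and does not model UNC paths, trailing separators, or corner cases such as a device followed directly by a segment that starts with a colon.

```java
// Sketch only: a minimal path model with a portable encoder and decoder.
final class SimplePath {
    final String device;                    // for example "C:", or null when absent
    final boolean absolute;                 // true when the path has a leading separator
    final java.util.List<String> segments;  // literal, unescaped segment names

    SimplePath(String device, boolean absolute, java.util.List<String> segments) {
        this.device = device;
        this.absolute = absolute;
        this.segments = new java.util.ArrayList<String>(segments);
    }

    String toPortable() {
        StringBuilder out = new StringBuilder();
        if (device != null) out.append(device);
        if (absolute) out.append('/');
        for (int i = 0; i < segments.size(); i++) {
            if (i > 0) out.append('/');
            out.append(segments.get(i).replace(":", "::")); // escape literal colons
        }
        return out.toString();
    }

    static SimplePath fromPortable(String text) {
        String device = null;
        String rest = text;
        int colon = text.indexOf(':');
        boolean doubled = colon >= 0 && colon + 1 < text.length()
            && text.charAt(colon + 1) == ':';
        if (colon >= 0 && !doubled) {       // a single first colon ends the device
            device = text.substring(0, colon + 1);
            rest = text.substring(colon + 1);
        }
        java.util.List<String> segments = new java.util.ArrayList<String>();
        for (String raw : rest.split("/")) {
            // Empty pieces come from leading or duplicate slashes and are dropped.
            if (!raw.isEmpty()) segments.add(raw.replace("::", ":"));
        }
        return new SimplePath(device, rest.startsWith("/"), segments);
    }

    @Override public boolean equals(Object other) {
        if (!(other instanceof SimplePath)) return false;
        SimplePath that = (SimplePath) other;
        return java.util.Objects.equals(device, that.device)
            && absolute == that.absolute && segments.equals(that.segments);
    }

    @Override public int hashCode() {
        return java.util.Objects.hash(device, absolute, segments);
    }
}
```

In this model, a canonical string such as "C:/foo" or "C::/foo" survives a decode followed by an encode unchanged, whereas a non-canonical string such as "/a//b" comes back as "/a/b".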

On Unix, the toOSString and fromOSString methods will be inverses of each other. On Windows, the same can only be said for paths that do not contain colon or backslash characters within segment names (such paths are invalid on Windows anyway). Consider the following example:

   String input = "foo::bar";
   IPath pathOne = Path.fromPortableString(input);
   IPath pathTwo = Path.fromOSString(pathOne.toOSString());
   boolean same = pathOne.equals(pathTwo); // false on Windows

The input string represents a path with no device and a single segment whose name is "foo:bar" (invalid on Windows). When this is output using toOSString, it is encoded as "foo:bar". The fromOSString method then interprets this as a path with device "foo:" and first segment "bar". Similar mangling occurs if you create a path with a segment containing the backslash character:

   String input = "foo\\bar";
   IPath pathOne = Path.fromPortableString(input);
   IPath pathTwo = Path.fromOSString(pathOne.toOSString());
   boolean same = pathOne.equals(pathTwo); // false on Windows

In this case, the input is a path with one segment whose name is "foo\bar". This is interpreted by fromOSString as a path with two segments, "foo" and "bar". In other words, under this proposal you cannot reliably manipulate paths containing backslash or colon using to/fromOSString on Windows. This seems to be an acceptable limitation.

References