
Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

On the Collabora blog, Gabriel Krisman Bertazi writes about a feature he developed: case-insensitive ext4. He describes how to enable the feature in the kernel (>= 5.2), how to create an ext4 filesystem that will support case-insensitive lookups, as well as some gotchas; he starts with some justification for the idea:

A file name is a text string used to uniquely identify a file (in this context, 'directory' is the same as a file) at a specific level of the directory hierarchy. While, from the operating system point of view, it doesn't matter what the file name is, as long as it is unique, meaningful file names are essential for the end user, since it is the main key to locate and retrieve data. In other words, a meaningful file name is what people rely upon to find their valuable documents, pictures and spreadsheets.

Traditionally, Linux (and Unix) filesystems have always considered file names as an opaque byte sequence without any special meaning, requiring users to submit the exact match of the file to find it in the filesystem. But that is not how humans operate. When people write titles, 'important report.ods' and 'IMPORTANT REPORT.ods' usually mean the same piece of data, and you don't care how it was written when creating it. We care about the content and the semantics of the words IMPORTANT and REPORT.



Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 27, 2020 23:05 UTC (Thu) by donbarry (guest, #10485) [Link] (10 responses)

Why are people doomed to recreate mistakes of the past? Surely they are aware of them.

The place for this is not in a filesystem, it's in higher-level interfaces to it. A filesystem needs to be rigorous in minimizing "gotchas", because it has many layers depending on it. Monkeying around with semantics is better done in these higher layers, which have a far smaller list of software that can be broken by their changes and can evolve with it.

The long shadow of Windows and its choices continues to haunt us.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 6:46 UTC (Fri) by warrax (subscriber, #103205) [Link] (6 responses)

We've tried that for the past 30-40 years or so... It doesn't work well enough.

The only way it could possibly work is if it were as 'embedded in the fabric of everything *NIX' as e.g. libc is (or via POSIX mandate, perhaps?), but that's not going to happen. Plus, you still need to track different encodings/casing rules for different file systems, e.g. USB sticks, so it needs to exist in the data *somehow*... and a file system seems about right for that, practically. Ideally, you'd do it per filename, but that's probably impractical considering everything on *NIX treats filenames as bytestrings. So here we are.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 7:05 UTC (Sat) by gfernandes (subscriber, #119910) [Link] (5 responses)

Why is the file system the "right" place for it?

It would seem to me this is a totally spurious development. Gnome 3 already searches for files ignoring case. You can simply press the Windows key and start typing the file name - et voilà! Your file is one of the results.

So why does anyone need this?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 19:52 UTC (Sat) by t-v (guest, #112111) [Link] (4 responses)

>Why is the file system the "right" place for it?

Because the filesystem defines the mapping of names to files.

Suggesting that it should be handled elsewhere implies that some tools then will interpret filenames differently to others. It also means that you can have two files with filenames that look distinct to some tools and not distinct to others.

I can see how it is a complex feature and not everyone wants it for everything (and there is good reason it's optional, right?), but the kernel (filesystem or vfs or whatever) certainly seems like the natural place to put this abstraction.
In the end, as long as realpath canonicalizes paths, it would seem to be not even breaking that many guarantees you rely on.

I'm a bit surprised people can get all worked up about this feature.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 20:23 UTC (Sat) by gfernandes (subscriber, #119910) [Link] (3 responses)

>>Because the filesystem defines the mapping of names to files.

I'd think that's some pretty good reason to **not** confuse the matter by making names case insensitive?

If **you** are confused with how you name files, use Gnome3, set up a bash alias for find - several ways to deal with that.

Making the filesystem case insensitive seems a bit unnecessary.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 19:45 UTC (Sun) by Wol (subscriber, #4433) [Link] (2 responses)

So you are now recreating the directory structure (mapping names to files) in gnome?

And you have NO GUARANTEES WHATSOEVER that other apps won't tamper with filesystem directory structure behind your back, invalidating your map in the process ...

Isn't mapping filenames to files one of the main jobs for a filesystem? ALL filesystems enforce a "set of valid characters" rule - even *nix! Why *shouldn't* a filesystem declare a canonical list? Just say that the canonical name can't contain eg upper case, and then allow aliases that are stored in the same directory entry eg "what the user entered" as opposed to "what the filesystem transmogrified it to".

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 4, 2020 5:30 UTC (Fri) by gfernandes (subscriber, #119910) [Link] (1 responses)

>>So you are now recreating the directory structure (mapping names to files) in gnome?

I don't think I said that.

What I _did_ say is that Gnome _indexes_ your files and _allows_ searching in a car insensitive manner.

So why the song and dance when it's a non feature?

Not that it affects me in the least - I've been on btrfs for quite some time now.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 4, 2020 5:31 UTC (Fri) by gfernandes (subscriber, #119910) [Link]

... **case** insensitive...

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 8:50 UTC (Fri) by oldtomas (guest, #72579) [Link]

Couldn't agree more. Or, quoting the text: "...a meaningful file name is what people rely upon ...".

Now, what's the fraction of Unicode which has a notion of "case"? What's the fraction of humanity whose native language has? (the second might be one or two orders higher, still it's probably less than 0.5).

Still: does that justify Rube-Goldberging that mess into a kernel? IMHO: no.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 6, 2020 13:34 UTC (Sun) by jond (subscriber, #37669) [Link] (1 responses)

Macs have case insensitive file systems. The world hasn’t ended, and huge amounts of FLOSS works fine on Macs without needing adjustment. There are even some advantages for the lazy or inaccurate: “git show HEAd”, “git show HEad”, “git show head” all work as if you’d correctly typed HEAD.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 6, 2020 20:28 UTC (Sun) by zlynx (guest, #2285) [Link]

The world hasn't ended I suppose. But I know of at least two Java programmers who developed on OS X and created endless amounts of extra work for the rest of their team because of their spelling errors.

From the amount of capitalization mistakes and outright spelling errors in things like Bash shell scripts I'm convinced they semi-randomly hammered the keyboard then relied on IDE support and auto complete for everything when writing code.

If they hadn't been forced to deploy on Linux servers who knows how bad it would have become.

color me sceptical

Posted Aug 27, 2020 23:31 UTC (Thu) by gus3 (guest, #61103) [Link] (3 responses)

He gives the example of a file named "floß" being looked up using the name "FLOSS", successfully. But could a file originally named "Floss" be looked up using the name "Floß"? I'm not so sure.

It's just another question to be answered as the semantics are clarified.

color me sceptical

Posted Aug 27, 2020 23:38 UTC (Thu) by krisman (subscriber, #102057) [Link] (1 responses)

> He gives the example of a file named "floß" being looked up using the name "FLOSS", successfully. But could a file
> originally named "Floss" be looked up using the name "Floß"? I'm not so sure.

The article is more of a high-level overview, and the floß example serves to illustrate what we mean by the complexity of non-English languages. I didn't mean to show the strict semantics with that one :)

If you check the documentation, it shows that we use Unicode's canonical decomposition for normalization (NFD) with small modifications, documented in ./admin-guide/ext4.rst
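To make the floß question above concrete, here is a minimal sketch of name comparison by normalization plus case folding. It uses Python's `unicodedata.normalize` and `str.casefold`, which apply full Unicode case folding; that is an assumption for illustration only and is not necessarily what the kernel's own tables do. The helper names `fold_key` and `same_name` are hypothetical.

```python
import unicodedata


def fold_key(name: str) -> str:
    """Build a comparison key: NFD-normalize, case-fold, then NFD-normalize again."""
    decomposed = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", decomposed.casefold())


def same_name(a: str, b: str) -> bool:
    """Two names are equivalent when their comparison keys are equal."""
    return fold_key(a) == fold_key(b)


print(same_name("floß", "FLOSS"))        # True: full case folding maps ß to ss
print(same_name("Floss", "Floß"))        # True: the comparison is symmetric
print(same_name("café", "cafe\u0301"))   # True: NFD makes both spellings identical
```

Under Python's rules the lookup works in both directions, because both names are reduced to the same key before comparing. Whether ext4 agrees for any given pair is something to check against the kernel documentation rather than this sketch.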

color me sceptical

Posted Aug 28, 2020 0:25 UTC (Fri) by gus3 (guest, #61103) [Link]

Your comment led me to http://www.unicode.org/reports/tr15/ and in particular http://www.unicode.org/reports/tr15/#Description_Norm showing that this matter is already being dealt with.

Thank you for your quick reply!

color me sceptical

Posted Sep 1, 2020 17:15 UTC (Tue) by nilsmeyer (guest, #122604) [Link]

From the perspective of language this seems odd. FLOSS, floss and Floß refer to three entirely different things in German (the acronym being borrowed). ß would be more easily translated to sz in languages that don't use it, though Swiss German often uses ss instead (Fussball instead of Fußball). For search / sorting this may be fine, but it may cause collisions when used as a unique identifier.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 27, 2020 23:55 UTC (Thu) by dullfire (guest, #111432) [Link] (63 responses)

Honestly I can only see the argument being valid if one is operating under the premise that average users are expected to use a cli.

If it's not a cli I can't see how it matters. GUI software will display the correct things, and if there's a search, it can default to case-insensitive (if that's a sane default for the expected user base).

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 5:03 UTC (Fri) by xanni (subscriber, #361) [Link] (4 responses)

It still matters for Wine and Proton - and occasionally also for bugs in native Linux ports of Windows games, where the capitalisation of a file or directory name varies across the source code and it didn't matter on Windows (or MacOS for that matter) and thus wasn't caught, even in automated testing.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 18:34 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (3 responses)

Here's a particular example:

A number of people on (mostly) Windows have figured out that the most effective way to mod Skyrim is to build a UnionFS-like-thing in userspace. This allows you to install lots of mods over the same basic directory structure, and when the game goes looking for an asset, it transparently finds the mod that wants to edit that asset, without having to know anything about the mods themselves. Unfortunately, Windows is case-insensitive, so most mods use a random mixture of capitalization in their directory structures (which need to match up 1:1 with the game's native directory structures, or else asset lookups will fail). If you wanted to recreate this setup on Linux, you'd need to put case folding in the mod manager's UnionFS implementation (which was designed to run on Windows, in userspace, and has no idea that it has to fold case).

(Disclaimer: I have never tried to do this on Linux, so I have no idea how, or if, they actually managed to solve this problem in Wine et al.)

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 14:11 UTC (Tue) by niner (subscriber, #26151) [Link] (2 responses)

Sounds like a rather simple script could solve this by comparing 2 directory structures, case-folding the comparison candidates, and adjusting the case of one to match the original's if they differ only by case.
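A rough sketch of such a script, under the assumption that it runs on a case-sensitive filesystem and that `str.casefold` is an acceptable stand-in for whatever folding the game expects. The function name `plan_renames` and both directory arguments are hypothetical. It only plans the renames (a dry run), listing children before their parents so the plan can be applied in order.

```python
import os


def plan_renames(reference_dir: str, mod_dir: str) -> list[tuple[str, str]]:
    """Return (old, new) paths under mod_dir whose names differ from
    reference_dir only by case, deepest entries first."""
    renames = []
    # Map each case-folded reference name to its canonical spelling.
    canonical = {name.casefold(): name for name in os.listdir(reference_dir)}
    for name in sorted(os.listdir(mod_dir)):
        wanted = canonical.get(name.casefold(), name)
        old_path = os.path.join(mod_dir, name)
        ref_child = os.path.join(reference_dir, wanted)
        # Descend first, so child renames still refer to the parent's old path.
        if os.path.isdir(old_path) and os.path.isdir(ref_child):
            renames.extend(plan_renames(ref_child, old_path))
        if wanted != name:
            renames.append((old_path, os.path.join(mod_dir, wanted)))
    return renames
```

Applying the plan would be a loop of `os.rename(old, new)` over the result. The sketch does not handle a mod directory that already contains both `Foo` and `foo`; both would be planned onto the same target.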

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 21:59 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

Well, yes, but you'd need to run that script on every update of every mod that you ever install. People routinely install hundreds of mods, which get updated on a sporadic and irregular basis - anywhere from "once every couple of days" to "last update was 5+ years ago" depending on the mod.

More to the point, however, this is a layering violation of its own, and arguably a much worse one than a case-insensitive ext4 would be. Wine wants to recreate a Windows-compatible environment, not individually hack a million separate apps to work right in an incompatible environment. If the choice is between "reach inside the guts of every single app that presumes a case-insensitive filesystem, and fiddle around with it until it works," and "maintain a relatively straightforward out-of-tree ext4 patch," the latter is probably a lot less work than the former. Bear in mind, of course, that many of those apps are closed-source, but ext4 is not.

So, in this hypothetical where ext4 never grew a case-insensitive mode, you eventually reach the point where they have a stable out-of-tree patch that people are actually using to solve a real problem. Then the logical next question is, who exactly benefits from the patch being out-of-tree?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 2, 2020 0:09 UTC (Wed) by floppus (guest, #137245) [Link]

Maybe you already know this, but Wine already implements case-insensitive filename lookups on case-sensitive filesystems. It does this in userspace, and has done so for years. No hacking in the guts of individual programs, and no special kernel support necessary. So the mods you're talking about would presumably work in Wine today.

There are certainly advantages, in performance and consistency, to doing the work in the kernel instead, but it would be much less convenient if Wine *required* ~/.wine/ to be stored on a case-insensitive filesystem.

After all, the purpose of Wine is not just to create a Windows-compatible environment, but to create a Windows-compatible environment inside a Unix-like OS.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 5:46 UTC (Fri) by ibukanov (subscriber, #3942) [Link] (57 responses)

At work we cross-compile for Windows from Linux. Windows C or C++ sources often include files using different cases. Windows.h versus windows.h or even ALL-CAPS.H for shorter abbreviated header names. Using a case-insensitive file system is a very straightforward way to deal with it.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 8:30 UTC (Fri) by ledow (guest, #11753) [Link] (10 responses)

That's not a particularly good example, however, as you could have far less impact by making the compiler's include lookup procedure do that. Simpler, less invasive, works everywhere.

Pretty much every compiler has had a patch for this at some point but just wants to push it down to the filesystem, and in that case (a cross-platform compiler dealing with cross-platform code) I'm not at all sure that the local filesystem is the place to handle it.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 9:29 UTC (Fri) by Sesse (subscriber, #53779) [Link] (9 responses)

Not particularly good performance, though. For every path component, you need to list all files in that directory, match against them one by one and find the one that matches the best. And what if there's both foo and Foo?
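For illustration, a minimal sketch of what that userspace lookup looks like, assuming `str.casefold` as the matching rule. The function name `resolve_nocase` is hypothetical. Each path component that does not match exactly costs a full directory listing, and the foo/Foo case has no good answer, so the sketch simply reports it.

```python
import os


def resolve_nocase(root: str, relative_path: str) -> str:
    """Resolve relative_path under root, ignoring case, one component at a time."""
    current = root
    for component in relative_path.split("/"):
        exact = os.path.join(current, component)
        if os.path.lexists(exact):
            # Fast path: the name exists exactly as written.
            current = exact
            continue
        # Slow path: linear scan of the directory for case-folded matches.
        wanted = component.casefold()
        matches = [entry for entry in os.listdir(current)
                   if entry.casefold() == wanted]
        if not matches:
            raise FileNotFoundError(exact)
        if len(matches) > 1:
            raise LookupError(f"ambiguous name {component!r}: {sorted(matches)}")
        current = os.path.join(current, matches[0])
    return current
```

With an include path of many directories, most of which do not contain the header at all, the slow path runs once per directory searched.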

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 12:20 UTC (Fri) by gutschke (subscriber, #27910) [Link] (1 responses)

I'd argue that mismatched capitalization is at best a style violation and at worst an outright error. As such, it should eventually be fixed. Just the same as any other style issues.

This means, the compiler will only ever need to execute the slow path in a small number of cases. That's fine. 99% of the time, it can just open the file that the user requested. And in the rare exception, it scans the directories and prints a warning message. Subsequently, the results can be cached.

Much saner than making system-wide configuration changes for the benefit of a single defective source file.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 13:29 UTC (Fri) by jreiser (subscriber, #11027) [Link]

> 99% of the time, it [the compiler] can just open the file that the user requested.

The search for a #include file often fails through many directories (much of the -I list, both explicit and implicit) before finding the right one. So the speed penalty of case-insensitivity "everywhere" will be noticeable.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 17:02 UTC (Fri) by kreijack (guest, #43513) [Link] (6 responses)

> Not particularly good performance, though. For every path component, you need to list all files in that directory,
> match against them one by one and find the one that matches the best.

This is true both for the kernel implementation and for the user space implementation.

> And what if there's both foo and Foo?

It can't happen. To mark a directory "case insensitive", it has to be empty; and after a directory is marked "case insensitive" foo and Foo can't exist at the same time.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 15:04 UTC (Sat) by Sesse (subscriber, #53779) [Link] (3 responses)

> This is true both for the kernel implementation and for the user space implementation.

The kernel implementation can case-fold and then do a better-than-linear lookup after that (e.g. in the dentry cache), the user-space implementation cannot.

> It can't happen. To mark a directory "case insensitive", it has to be empty; and after a directory is marked "case insensitive" foo and Foo can't exist at the same time.

This was in response to the comment that suggested _not_ to use the kernel case insensitivity support, but instead build the logic into the compiler. So the compiler would have to handle this case.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 19:23 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (2 responses)

> The kernel implementation can case-fold and then do a better-than-linear lookup after that (e.g. in the dentry cache), the user-space implementation cannot.

Strictly speaking, it is possible for userspace to maintain a slightly-out-of-date index of each directory. This still requires a linear scan of each directory, but you can do it asynchronously, and then keep it up to date with inotify events.
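A minimal sketch of such a userspace index, assuming `str.casefold` as the folding rule. The class name `FoldedIndex` is hypothetical, and the inotify wiring is left out: here `refresh()` is called by hand, which is exactly why the index can be slightly out of date.

```python
import os
from collections import defaultdict


class FoldedIndex:
    """Case-folded index of one directory: one linear scan, then dict lookups."""

    def __init__(self, directory: str):
        self.directory = directory
        self.refresh()

    def refresh(self) -> None:
        """Rescan the directory; a real tool might instead apply inotify events."""
        index = defaultdict(list)
        for name in os.listdir(self.directory):
            index[name.casefold()].append(name)
        self._index = dict(index)

    def lookup(self, name: str) -> list[str]:
        """Return every stored spelling that matches name, ignoring case."""
        return sorted(self._index.get(name.casefold(), []))
```

A lookup between a file's creation and the next refresh misses the new file, and a result with more than one spelling still has to be disambiguated by the caller.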

However, a compiler should not be in the business of maintaining such an index, and it would be redundant to the kernel's internal data structures in any event. Moreover, I for one do not want to have yet another layer I need to consult whenever the compiler fails to find my shiny new foo.h file. So pushing this off on userspace is definitely a Bad Idea.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 7:04 UTC (Sun) by gfernandes (subscriber, #119910) [Link] (1 responses)

Afaict, such an index already exists, and is used by our friendly Desktop, Gnome 3. Why can't user-space tap into this?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 16:06 UTC (Sun) by Wol (subscriber, #4433) [Link]

gentoo make.conf
blah blah blah -GNOME

There's no guarantee that gnome will be on the system. And if by user-space you mean samba, then there's a good chance there'll be no screen hence no requirement for gnome.

And if you don't want two files to have the same case-insensitive file name, you have to enforce it at the directory level - any attempt to enforce it higher up risks something bypassing your checks!

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 3:55 UTC (Tue) by rsidd (subscriber, #2582) [Link] (1 responses)

> It can't happen. To mark a directory "case insensitive", it has to be empty; and after a directory is marked "case insensitive" foo and Foo can't exist at the same time.

What happens when you do "cp bar/* baz/" and baz is marked "case insensitive" and bar contains both foo and Foo? Does the first to be copied get overwritten by the second in baz? Or is there an error? What if foo and Foo are directories? Does everything in the two directories end up in one directory in the destination?

I suppose the answer is "don't do that" and "don't enable case insensitive unless you really really need it..."

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 10:38 UTC (Tue) by Wol (subscriber, #4433) [Link]

As pointed out elsewhere, the result is /baz will contain either Foo or foo depending on which one gets copied first (even worse, chances are if it contains the name Foo it will contain the contents foo and vice versa ... :-)

Do a "cp -v --no-clobber" if you don't want that to happen. Remember, *nix does what you tell it, not what you meant to tell it ...
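A toy model of the behaviour described above, not of ext4 itself: the class `CaseInsensitiveDir` is hypothetical and simply assumes that the first name written is the one stored, while a later write under a differently-cased name replaces the contents.

```python
class CaseInsensitiveDir:
    """Toy directory: entries are keyed by case-folded name."""

    def __init__(self):
        self._entries = {}  # folded name -> (stored name, contents)

    def write(self, name, contents, no_clobber=False):
        """Create or overwrite an entry; return False if the write was skipped."""
        key = name.casefold()
        if key in self._entries:
            if no_clobber:
                return False
            # Keep the spelling that was stored first, replace the contents.
            stored_name, _ = self._entries[key]
            self._entries[key] = (stored_name, contents)
            return True
        self._entries[key] = (name, contents)
        return True

    def listing(self):
        """Return (stored name, contents) pairs in sorted order."""
        return sorted(self._entries.values())


baz = CaseInsensitiveDir()
baz.write("Foo", "contents of Foo")
baz.write("foo", "contents of foo")
print(baz.listing())  # [('Foo', 'contents of foo')]
```

Passing `no_clobber=True` to both writes leaves `('Foo', 'contents of Foo')` in place and reports the second write as skipped, which is the effect the `--no-clobber` suggestion is after.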

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 9:17 UTC (Fri) by tchernobog (guest, #73595) [Link] (38 responses)

Is avoiding a rename to all lowercase or fixing your includes really worth introducing this level of complexity in the kernel, which will need to be maintained for ages immemorial? At one point, it becomes a question of cost/benefit ratio.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 11:29 UTC (Fri) by thumperward (guest, #34368) [Link] (34 responses)

It fascinates me that so many LWN commenters who obviously wish themselves to be thought of as learned greybeards are seemingly completely oblivious to the pains that not having this feature has caused Linux (and its predecessors) for the last thirty years. Indeed this has also been true of literally every other case where "we should simply reimplement support for this in every individual application which needs it" has been the previous state of affairs as well.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 13:58 UTC (Fri) by warrax (subscriber, #103205) [Link] (30 responses)

Absolutely agreed. There's also this vague appeal to "complexity". It's a reasonably simple feature that e.g. DBs (because, you know, they deal with humans) have been doing for decades. It's bizarre that there seems to be this blind spot.

(If I were being uncharitable, I would just chalk it up to being averse to any change.)

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 14:14 UTC (Fri) by thumperward (guest, #34368) [Link]

I would never stoop to being uncharitable to such well-travelled scholars, who surely have a good reason to defend to the death the half-sentence "line noise is a valid filename" but then seem to curiously trail off when it continues "so long as it is terminated by NUL".

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 19:37 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (28 responses)

> There's also this vague appeal to "complexity".

Whenever I hear a software engineer gripe about "complexity," it almost always means "complexity in the layer I'm responsible for." Nobody ever talks about the overall system's complexity, with the result that the system tends towards a maximum of complexity as engineers push responsibilities off on one another. All that complexity-shifting means a lot more data has to flow between the system's various layers, which increases the overall complexity over time.

This is not the fault of the engineers. The process has been going on for so long that almost nobody can hold the entire system in their head at a time (particularly when you start talking about systems larger than one Unix box, such as a distributed system). Everyone only "sees" the complexity nearest them, and they just tacitly assume that the other layers will be fine. So of course this newly-discovered complexity belongs in another layer. My layer takes X and turns it into Y. This complexity is of type Z, which *obviously* needs to be transformed into X before I can handle it. So clearly, the complexity belongs in the layer above me.

To make matters worse, it's often difficult to see the difference between unhelpful complexity-shifting and helpful refactoring. They are, after all, basically the same process. But since (as discussed above) nobody has a global view of the system, it's really hard to see whether moving complexity from A to B is going to reduce or increase overall system complexity. So it ends up being a game of office politics (whether the work is being carried out in an office or on a mailing list), because that is something humans are capable of comprehending. And office politics has the well-known problem of being completely toxic.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 19:14 UTC (Sun) by Wol (subscriber, #4433) [Link] (26 responses)

> This is not the fault of the engineers. The process has been going on for so long that almost nobody can hold the entire system in their head at a time (particularly when you start talking about systems larger than one Unix box, such as a distributed system). Everyone only "sees" the complexity nearest them, and they just tacitly assume that the other layers will be fine. So of course this newly-discovered complexity belongs in another layer. My layer takes X and turns it into Y. This complexity is of type Z, which *obviously* needs to be transformed into X before I can handle it. So clearly, the complexity belongs in the layer above me.

Actually, if you ANALYZE the problem, you can usually work out where the complexity belongs. All too often engineers have an itch or a problem, and want an immediate solution. So either the wrong layer claims the problem, or the right layer has no desire to solve it.

This is my problem with RDBMSs. They've defined "data" to make the problem easy for computers. With the result that lists - data - have been pushed into the data management layer when they belong in the data storage layer. And now we have to mix meaningful and meaningless data together so we can recreate lists. If we have a list-capable DBMS (Pick, anyone :-) we can convert a list to a set by throwing away information. But with an RDBMS we can't recreate a list from a set, without storing loads of metadata in the data layer :-(

But actually, that problem is glaringly obvious from the act of normalisation ... analyse the problem and you see it ...

I always say it's fine just solving the bit of the problem that you want/need. But if you don't analyse the *whole* problem-space, fixing part of it now is likely to cause anybody stumbling across a different part of it grief in the future.

I agree about office politics, though ...

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 21:19 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (21 responses)

> This is my problem with RDBMSs. They've defined "data" to make the problem easy for computers. With the result that lists - data - have been pushed into the data management layer when they belong in the data storage layer. And now we have to mix meaningful and meaningless data together so we can recreate lists. If we have a list-capable DBMS (Pick, anyone :-) we can convert a list to a set by throwing away information. But with an RDBMS we can't recreate a list from a set, without storing loads of metadata in the data layer :-(
>
> But actually, that problem is glaringly obvious from the act of normalisation ... analyse the problem and you see it ...

Arguably, this is a matter of opinion, and your belief about whether a list should be a valid data type will depend on how you feel about pragmatism vs. formalism, type theory vs. set theory, whether nontrivial VIEWs are useful, and whether ORMs were a Good Idea. So, basically the same office politics I was just decrying.

My point (which I could have made clearer) was that there is often no "right" answer to these questions, and even if an answer may be "right" in a particular context, it probably fails to generalize to other cases (whose priorities and business requirements may differ). For example, a project might want to be very sure that every piece of data is represented by one and only one entity in the database (a "single source of truth"), so that you cannot have different parts of the database fall out of sync with one another due to faulty application logic. Normalization is specifically designed to solve that problem, and rejecting it makes it harder, or even impossible, to provide that sort of guarantee. Another project might, just as validly, not care as much about guarding against application bugs, because their data is less sensitive to integrity problems, or because they are already taking greater care at the application layer and do not need the DB to double-check their homework (see also CHECK constraints, triggers, stored procedures, etc.). These are both equally valid opinions which may be appropriate to different situations, but it's very hard to have an evidence-based discussion around which of those two scenarios you're actually living in.

In practice, my understanding is that SQL defines an ARRAY type which is at least minimally functional, but it may not be as performant as you might like, depending on what you are trying to do with it. If my understanding is correct, then this has more to do with the standards of your particular codebase (i.e. "are we allowed to use ARRAY?") than with the RDBMS itself. In the worst case scenario, you can always serialize your weird data to BLOB, with the caveat that, obviously, the database doesn't know what's in a BLOB and can't do anything useful with them other than spitting the same bytes back out again when you ask for them.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 23:32 UTC (Mon) by Wol (subscriber, #4433) [Link] (20 responses)

> > But actually, that problem is glaringly obvious from the act of normalisation ... analyse the problem and you see it ...

> Arguably, this is a matter of opinion, and your belief about whether a list should be a valid data type will depend on how you feel about pragmatism vs. formalism, type theory vs. set theory, whether nontrivial VIEWs are useful, and whether ORMs were a Good Idea. So, basically the same office politics I was just decrying.

Well, how do you store "order" in a set? Answer: you can NOT. Yes you can create an "order" field, and stick something in it, but unless you can stick "first", "second" etc then it's not data, it's META data. And as soon as you start MIXing data and metadata you have a massive problem. And as soon as you want to insert or delete an item from the list, you also have a massive problem.

imho, if you can't store a list in a set, then list must be a datatype. And as I said, seeing as you can create a set from a list by throwing away information, but you can't go the other way and create a list from a set (at least, not without having extraneous information about knowing how to sort the field called "order"), that means a list is a superset of a set (and a bag), and can therefore replace both of them.

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 2:15 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (19 responses)

> At least, not without having extraneous information about knowing how to sort the field called "order"

You can't have it both ways. If the order is "extraneous," that means you don't care about the order, so then you don't store it in the first place. If it's not extraneous, then it's not extraneous, so you store it like any other data. It can't be simultaneously extraneous at the data storage layer but important at the business logic layer, because that's not how data storage works.

I recognize, of course, that changing the order is a more difficult problem, and in many cases, you may need to resort to the ARRAY type in practice. This is particularly likely to be a reasonable choice if the information will not be reused anywhere else in the system, so that you don't lose very much safety by failing to properly normalize it. But to claim that you can't build lists out of sets is absurd; sets are the foundation of mathematics, and you can absolutely build lists out of them (which mathematicians tend to refer to as "tuples" or "n-tuples" for integer n).
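As a rough illustration of the "lists out of sets" point, here is a minimal Python sketch. It assumes one particular encoding (a set of (position, value) pairs); the function names and the sample line items are made up for this example and are not taken from any database.

```python
def list_to_pairs(items):
    """Encode a list as a set of (position, value) pairs."""
    return {(position, value) for position, value in enumerate(items)}


def pairs_to_list(pairs):
    """Recover the list by sorting the pairs on their position."""
    return [value for _, value in sorted(pairs, key=lambda pair: pair[0])]


# Hypothetical invoice lines; note the duplicate entry.
line_items = ["widget", "gadget", "widget", "flange"]

pairs = list_to_pairs(line_items)
round_trip = pairs_to_list(pairs)   # order and duplicates survive

plain_set = set(line_items)         # order and duplicates are thrown away
```

The position stored in each pair is exactly the "order" field the thread is arguing about: the encoding works, but the ordering has to live in the data somewhere.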

And, again, I reiterate that your choice not to use ARRAY is your own self-constructed problem; it's in the SQL standard, which every real RDBMS supports, so use it if you want it.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 10:35 UTC (Tue) by Wol (subscriber, #4433) [Link] (18 responses)

> > At least, not without having extraneous information about knowing how to sort the field called "order"

> You can't have it both ways. If the order is "extraneous," that means you don't care about the order, so then you don't store it in the first place. If it's not extraneous, then it's not extraneous, so you store it like any other data. It can't be simultaneously extraneous at the data storage layer but important at the business logic layer, because that's not how data storage works.

If I have a field called "colour", then the contents of the field have meaning. If I have a field called "order" then the contents of that field are pseudo-random garbage. THAT is the problem.

> And, again, I reiterate that your choice not to use ARRAY is your own self-constructed problem; it's in the SQL standard, which every real RDBMS supports, so use it if you want it.

Please (a) read up on what an RDBMS is. *NO* "real" RDBMS supports arrays - they are forbidden by C&D (yes I know what is marketed as an rdbms supports arrays). And (b) please read what I wrote - it should be pretty obvious I do not (from choice) use rdbms's - I use databases whose natural format is not columns but arrays.

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 16:14 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

> Please (a) read up on what an RDBMS is. *NO* "real" RDBMS supports arrays - they are forbidden by C&D
That's incorrect. The relational algebra doesn't actually care about the data types in rows.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 16:43 UTC (Tue) by Wol (subscriber, #4433) [Link] (10 responses)

Sorry Cyberax ...

copied from wikipedia ...

> Rule 0: The foundation rule:

> For any system that is advertised as, or claimed to be, a relational data base management system, that system must be able to manage data bases entirely through its relational capabilities.

> Rule 2: The guaranteed access rule:

> Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

In other words, you can't store an array in a cell and call it an RDBMS. (And note I did say "according to Codd & Date" ...)

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 16:47 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

> In other words, you can't store an array in a cell and call it an RDBMS. (And note I did say "according to Codd & Date" ...)
You absolutely can. An array is just treated as an atomic value. Relational algebra actually doesn't care about data types, as long as it's possible to construct a selection operation (basically, a predicate) on top of them.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 18:00 UTC (Tue) by Wol (subscriber, #4433) [Link] (8 responses)

Which is where Pick scores, because "an array in a cell" is not an atomic value, and hence can be manipulated and understood by the DB.

What do you do if that array is a list of foreign keys? Because Pick is quite happy doing its equivalent of a join on that array, while if the RDBMS treats it as an atomic value, it can't do a join ...

Again, this is another case of inefficiency caused by RDBMS design, because in Pick it doesn't care whether an attribute is a single foreign key or a list of them, while in an RDBMS you have to split a list out into a separate table - or use some sort of hoisting logic. And if you're using hoisting logic you're breaking up a single atomic object into multiple atomic objects ... wtf ...

Just use a DB that is natively list friendly ... :-) Once again, this is bashing the real world into your favourite mathematical model. As Dick Feynman pointed out, "nature cannot be fooled", and the result is rarely nice. SQL is the Pascal of databases - it forces you to follow its rules. And like Pascal, it's a lot harder to program in than languages that actually try to match the real world. I've said this before - a Pick programmer can hold in his head a database schema that will cover a wall in SQL and have SQL programmers running for cover ... BECAUSE the Pick schema actually tries to be a close approximation to the real world.

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 19:46 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

> Which is where Pick scores, because "an array in a cell" is not an atomic value, and hence can be manipulated and understood by the DB.
Nothing whatsoever stops regular DBs from doing this. PostgreSQL, MSSQL, Oracle all support arrays and other complex data types.

> Because Pick is quite happy doing its equivalent of a join on that array, while if the RDBMS treats it as an atomic value, it can't do a join ...
Formally, any "join on array" can be rewritten without it (it would just make predicates a bit more complicated). However, in practice all the RDBMS support making joins on custom data.

In short, Pick is nothing special whatsoever. It's an obsolete DB for those who prefer to live in the past.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 21:35 UTC (Tue) by Wol (subscriber, #4433) [Link] (6 responses)

> Formally, any "join on array" can be rewritten without it (it would just make predicates a bit more complicated). However, in practice all the RDBMS support making joins on custom data.

And that complexity is IN THE WRONG PLACE. (Which is where this whole discussion started.)

How do you do it while complying with C&D? Because either the list is an atom, in which case it complies but you can't do the join, or you can split the list into its constituent atoms in which case it can't comply because it's not an atom.

It's like I compared the columns "colour" and "order". One contains meaningful values, the other contains pseudo-random garbage. That's added complexity for the programmer - why should he be able to mangle some columns and not others?

Imho the whole problem starts with C&D's statement that says "data comes in rows and columns". In other words, it declares what is acceptable, and anything that doesn't comply needs a data analyst with a sledge hammer to bash round pegs into square holes.

On the other hand, *I* define data as "what the user gives me" and a LOT of that comes as lists. I also define metadata as "anything I can deduce from the data". And Pick makes it *easy* to keep those two different things *separate*. As soon as I have a list, C&D *forces* me to mix and muddle the two. And seriously, how much data/information comes from the real world as a set? Collections of real world objects! Everything *about* an object comes as a list, comes with order, even if said order is random and doesn't really matter. Take the invoice I go on about - the order of line items on an invoice or in a ledger may be random, but that order is an *extremely important* attribute of an invoice!

You may think Pick is obsolete, but why are Pick databases so easy to understand, while relational databases turn your brain to mush? It's because Pick maps pretty closely to the real world. In Pick, the FILE maps to an object definition, the RECORD maps an instance of said object, and the ATTRIBUTE maps to, well, the attribute(s) of said object. Whereas what does a relational table map to? It depends ... What does the row map to? It depends ... What does the column map to? That's easy, an attribute.

And if I wanted to, I could easily map my FILE to your table, my RECORD to your row, and my ATTRIBUTE to your column. Bingo, I've just implemented a relational database in Pick. You can't implement a pick database in C&D, it won't let you! What's the rule? "The generic always trumps the specific". My N-dimensional database trumps your 2-dimensional one! I can look to the world like a 2-dimensional relational database in every respect other than the fact I can outperform it for speed EVERY TIME.

You may be right in that the marketing for relational has trumped everything else and seized the market share mindshare. But you know what? Anything you can do, I can do it with half the resources (or less) because Pick does so much that relational can't. I will admit, though :-) , that anybody who implements a Pick database without using relational maths to design the system is setting themselves up for failure! :-) The maths is good, but implementing it as a 2-dimensional DBMS is just PLAIN STUPID! The world isn't 2-dimensional ..

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 2, 2020 7:42 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> How do you do it while complying with C&D? Because either the list is an atom, in which case it complies but you can't do the join, or you can split the list into its constituent atoms in which case it can't comply because it's not an atom.
You seem to not understand what you're arguing against.

Relational algebra doesn't care about the complexity of individual column types. They can be JSONs, arrays, whatever. The data type makes no difference, as long as it's possible to use it for https://en.wikipedia.org/wiki/Selection_(relational_algebra)

So nobody stops you from writing: "select t.* from sometable t where t.array_field[123] = 456". From a theoretical perspective it's enough to express any condition involving finite arrays. In practice all databases support other extensions. For example in Postgres: "select t.* from unnest(array[1,2,3,2,3,5]) item_id left join items t on t.id=item_id".
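To make the shape of that last query concrete, here is a hedged sketch using Python's built-in sqlite3 module. SQLite has no `unnest`, so a `VALUES` list stands in for the unnested array; the `items` table and its contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, "widget"), (2, "gadget"), (3, "flange")],  # made-up rows
)

# The VALUES list plays the role of unnest(array[1,2,3,2,3,5]):
# one row per array element, duplicates included.
rows = conn.execute(
    """
    WITH ids(item_id) AS (VALUES (1), (2), (3), (2), (3), (5))
    SELECT ids.item_id, t.name
    FROM ids
    LEFT JOIN items t ON t.id = ids.item_id
    ORDER BY ids.item_id
    """
).fetchall()

for item_id, name in rows:
    print(item_id, name)
```

Each array element yields one output row, and the element with no matching item (5 here) comes back with a NULL name because of the left join.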

Perhaps you should look around at ACTUAL modern databases, not your imaginary version of them?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 2, 2020 7:49 UTC (Wed) by Wol (subscriber, #4433) [Link] (4 responses)

> Perhaps you should look around at ACTUAL modern databases, not your imaginary version of them?

Because I'm being a sod and arguing from C&D? C&D forbids it - yes I know modern "relational" databases do it, but by doing it they break the definition of what a relational database is (as per the people who created the relational database).

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 2, 2020 7:50 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> C&D forbids it - yes
It does not. I gave you an example that uses an array and is fully compatible with C&D's formulation of relational algebra.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 2, 2020 8:56 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

But rule 2 says that any atomic value can be accessed by just table/row/column. Didn't you just add array-index to that list to get the key you wanted?

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 6, 2020 2:24 UTC (Sun) by flussence (guest, #85566) [Link] (1 responses)

>Didn't you just add array-index to that list to get the key you wanted?

Array indexing was added to SQL-92 28 years ago with SUBSTRING(), which operates on the character array data type formerly defined in SQL-86/FIPS-127.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 6, 2020 10:09 UTC (Sun) by Wol (subscriber, #4433) [Link]

But you'd still need a TARDIS to address my point :-)

Surely the fact that it took 22 years to fix what is, imnsho, a serious design flaw, just confirms me in my belief that an RDBMS is a theoretical exercise like Pascal, simplified to make it easy for computers. Unlike Pascal, however, it took off and has seriously hindered data management ever since :-(

To me, it's just second nature to store foreign keys in an array. Let me ask a couple of questions - (1) what percentage of RDBMS programmers today even realise that arrays exist (not the experts, your run-of-the-mill including power users ...). (2) Of them, how many (like me) would use them for foreign keys as a matter of course? and (3) Can a modern RDBMS index the individual atoms in an array? I *hope* the answer is "yes they all can".

(And a fourth - can you put an array in an array? In *most* circumstances yes this is a stupid idea, but sometimes it does make sense ...)

Relational doesn't even eat its own dog food - I'm pretty sure most RDBMSs enforce the rule that every row has a primary key internally - even if it is just an index into a list :-) and I know that on at least one occasion a table I've been dealing with has ended up with a bag in it. I really don't know how we fixed that because all the internal tools assumed that they would be dealing with a single row, not a two-row set, and crashed accordingly.

Or take a table's list of columns - there, I said it, LIST. The order may be unimportant mathematically, but it's extremely important for human comprehension. The RDBMS must have some hidden mechanism to ensure order preservation (and, per a previous post of mine, if it's a hidden field for sort-order, that just goes back to my point about mixing data and meta-data in the same table, a BAD BAD BAD idea!).

Anyway, I'm probably coming over as a fanatic (which I am :-), but I seriously think Relational is badly broken as a design document for a DBMS. (It is, however, a brilliant tool for data analysis - I wouldn't be without it for that!)

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 22:04 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (5 responses)

> *NO* "real" RDBMS supports arrays - they are forbidden by C&D

By "real RDBMS," I refer to software that is actually used in the real world, not the irrelevant opinion of some random bit of academia. Can you identify any "real RDBMS" by this definition that lacks support for ARRAY?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 23:30 UTC (Tue) by Wol (subscriber, #4433) [Link] (4 responses)

> not the irrelevant opinion of some random bit of academia.

You mean the people who actually defined what a relational database was in the very beginning?

So your definition of a "real RDBMS" is actually a bodge-up to get round a balls-up in the original design.

I'd rather use a DBMS that actually has a solid, coherent, logical design to it, thank you very much.

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 2, 2020 0:19 UTC (Wed) by rahulsundaram (subscriber, #21946) [Link] (2 responses)

>So your definition of a "real RDBMS" is actually a bodge-up to get round a balls-up in the original design.

You have to be aware at this point that this quixotic insistence that anything that doesn't strictly fit into an original academic definition is not a "real RDBMS", and that only some obscure pet database qualifies, isn't going to be something you find consensus around. The commonly accepted definition of what an RDBMS is today would definitely include what you apparently don't want to include, and I doubt you are changing any minds here. Tough luck on that one.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 2, 2020 1:38 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

Ironically, this entire discussion is an example of precisely the sort of "office politics" that I said these discussions always devolve into, even though Wol was responding to that very assertion in the first place! I think my point is quite well demonstrated.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 2, 2020 7:53 UTC (Wed) by Wol (subscriber, #4433) [Link]

Well, as far as I'm concerned it all started with complexity being "in the wrong place". And bodging an array into a set-based rdbms imho is exactly said "complexity in the wrong place" :-)

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 2, 2020 1:31 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

>> Can you identify any "real RDBMS" by this definition that lacks support for ARRAY?

It would seem your answer is "no," then. So your complaints about RDBMS's have nothing to do with real-world software and are purely conceptual. I therefore see no point in continuing this discussion, because it has nothing to do with what I was originally talking about (complexity *in real software*).

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 21:55 UTC (Mon) by nix (subscriber, #2304) [Link] (3 responses)

> This is my problem with RDBMSs. They've defined "data" to make the problem easy for computers. With the result that lists - data - have been pushed into the data management layer when they belong in the data storage layer. And now we have to mix meaningful and meaningless data together so we can recreate lists. If we have a list-capable DBMS (Pick, anyone :-) we can convert a list to a set by throwing away information. But with an RDBMS we can't recreate a list from a set, without storing loads of metadata in the data layer :-(

Of course this problem is not intrinsic to the relational calculus, which of course doesn't even have a concept of 'tables', only relations, and does not define anything at all about data storage versus data management. It is perfectly possible for an RDBMS to spot the frequent use of relations with incrementing values as a key and represent it in storage as a list of some kind. It's just that (almost?) none do any such optimization.

(But then, most modern RDBMSs have almost no relationship to the actual relational calculus, which is really quite elegant. SQL and table-based databases... aren't.)

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 1, 2020 0:06 UTC (Tue) by Wol (subscriber, #4433) [Link] (2 responses)

> (But then, most modern RDBMSs have almost no relationship to the actual relational calculus, which is really quite elegant. SQL and table-based databases... aren't.)

Yup. Most modern databases aren't RDBMSs. They claim to be but break all the rules.

> It is perfectly possible for an RDBMS to spot the frequent use of relations with incrementing values as a key and represent it in storage as a list of some kind. It's just that (almost?) none do any such optimization.

I won't say Pick does this by design, it does it more by accident, but it does exactly that! Let's take my invoice example - are line items lines on an invoice or lines in a ledger ... ?

I'd probably store them as objects in their own right, lines in a ledger, with an array in the invoice pointing to them. But I could store them as sub-rows in an invoice, and simply define the ledger as all these subrows in the INVOICE file. Either way, simply accessing the invoice record will optimise access to *all* the associated ledger lines.

Oh - and if I understand you right, those incrementing values - are you incrementing them across the ledger, or incrementing them as part of a compound key in the invoice. Either way involves pain (major pain if you're doing it on a ledger basis) if you want to insert a line, and while it's not so painful it could give you grief deleting lines too. And those values - are they numeric, alphabetic, whatever. The VALUE is irrelevant, it's not data, it's the SORT ORDER that matters, which means we are muddling data and metadata, and pushing stuff into the data management layer that doesn't belong there.
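The insertion pain described above can be sketched with Python's sqlite3 module. This is only an illustration under assumed names: the `invoice_line` table, its `seq` column, and the sample rows are hypothetical, and a dense integer sequence is just one of several ways to store order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice_line (invoice_id INTEGER, seq INTEGER, item TEXT)"
)
conn.executemany(
    "INSERT INTO invoice_line VALUES (1, ?, ?)",
    [(1, "widget"), (2, "gadget"), (3, "sprocket"), (4, "flange")],
)


def insert_line(conn, invoice_id, seq, item):
    """Insert a line at position seq; return how many rows were renumbered."""
    cur = conn.execute(
        "UPDATE invoice_line SET seq = seq + 1 "
        "WHERE invoice_id = ? AND seq >= ?",
        (invoice_id, seq),
    )
    shifted = cur.rowcount
    conn.execute(
        "INSERT INTO invoice_line VALUES (?, ?, ?)", (invoice_id, seq, item)
    )
    return shifted


# Putting one new line in second place touches every later line as well.
shifted = insert_line(conn, 1, 2, "bracket")
lines = [
    row[0]
    for row in conn.execute(
        "SELECT item FROM invoice_line WHERE invoice_id = 1 ORDER BY seq"
    )
]
print(shifted, lines)
```

Numbering per invoice keeps the renumbering local to one invoice; numbering across the whole ledger would make the same update touch far more rows, which is the "major pain" case mentioned above.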

Relational explicitly allows for future optimisation, but also actively hinders said optimisation by saying you MUST use two-dimensional calculus. I'd argue that that itself is massively inefficient. And every time I try to analyse optimisation in Pick, I can't see any way of improving it. That invoice example again - if I access a line via the invoice record, the mere act of accessing it optimises access to all other lines on that invoice. And the probability of me wanting to access one of those lines is much higher than any other line in the ledger. It's much harder to optimise access to any other random ledger line based on selecting one ledger line unless you index the field you've selected on.

I guess, in a way, Pick just indexes all interesting foreign keys by default. Do a decent object/relational analysis and this all just falls out naturally. I actually see the difference between Pick and an RDBMS as relational stores one-dimensional rows in a three-dimensional world. Pick takes real world objects and stores them, and if it's done properly they have a relational analysis done on them and each "atom" in the database is a relational view of a real world object. Hence it's much easier to comprehend, and it's also much more efficient because an object in the world is stored as an atom in the database.

And given that pretty much all "rdbms"s now include lists (arrays) which are most definitely not compatible with a true RDBMS, why not use a list-based database right from the start? You'll get better results than trying to bash square pegs into round holes.

Relational PRESCRIBES data as coming in rows and columns. It can't handle any other sort of data and pretends it doesn't exist. Like geometry used to prescribe parallel lines as never meeting. We threw that out and stepped outside Euclidean geometry. Let's throw out this stupid 2-dimensionality of relational, and step into the n-dimensional world of list-based databases like Pick.

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 13, 2020 16:24 UTC (Sun) by nix (subscriber, #2304) [Link] (1 responses)

> I guess, in a way, Pick just indexes all interesting foreign keys by default.

This means it cannot be used for actual real-world databases of any scale. The not terribly large financial database systems I used to work on had considerable thought put into table design so as to minimize the number of unnecessary indexes, because when you're talking terabyte-scale tables, indexes are both expensive to compute and take ages to build -- but queries that do not exploit them will effectively never terminate. Indexing all interesting foreign keys automatically (without human input into what defines "interesting") would be an enormous waste of disk space and sacrifice performance to no end. (And we were talking spinning rust here: losing hours to weeks on one indexing operation was not unknown.)

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 13, 2020 19:24 UTC (Sun) by Wol (subscriber, #4433) [Link]

> This means it cannot be used for actual real-world databases of any scale.

I presume by that, you mean size? Bear in mind, for the same amount of user-supplied data, a Pick database probably occupies half the disk space.

And that example of the astronomical database where Oracle had to disable indexing to meet the target, while Cache (not Pick, but similar) sailed right past the target plus 150% ...

I said Pick "in a way, just indexes all interesting foreign keys by default". If I want to know the keys of all the ledger lines in my invoice, I just read that invoice from the client ledger and I've got a list of all the lines - the data record IS the index ... think of a hierarchical database ...

That's fast because once you've got your top-level record you just drill down the links. That's exactly what Pick does, except that with Pick any record can be the top level. If I want a list of all foreign keys associated with an object, I just read the record for that object. Forget about minimising index accesses, Pick minimizes spinning rust accesses. Given an invoice number, how many table and index references do you need to get all the information about the invoice? I'll assume there are ten line items, so you presumably need to select the invoice table - an index access to find the record followed by a table access to get the item itself. Now select the ledger index to find all the line item keys, hopefully it's optimised and gives you the internal key rather than the primary key so that you don't need yet another index lookup to find out where the line item is. How many spinning rust accesses is that? AT LEAST thirteen, maybe more. With Pick it's eleven, one for the invoice, one for each item. That's for eleven different objects. Is it *possible* to improve on that, even theoretically? And your extra two accesses, that's reading the table index. Depending on the size of your table, that could be a LOT of spinning rust that I don't even go near ... (Pick uses dynamically hashed files, so enforces primary keys ...)
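The arithmetic in that paragraph can be written out as a small Python sketch. It simply encodes the assumptions stated in the comment (one access per index lookup, one per row or record read) and ignores caching, multi-level index trees, and clustered storage, so the numbers are illustrative rather than measurements. The function names are invented here.

```python
def relational_accesses(line_items):
    """Disk accesses under the comment's assumptions for a relational layout."""
    invoice_index_lookup = 1   # find the invoice row via its index
    invoice_row_read = 1       # read the invoice row itself
    ledger_index_lookup = 1    # find the line item locations via the ledger index
    return (invoice_index_lookup + invoice_row_read
            + ledger_index_lookup + line_items)


def pick_accesses(line_items):
    """Disk accesses under the comment's assumptions for a Pick-style layout."""
    invoice_record_read = 1    # the hashed key leads straight to the record
    return invoice_record_read + line_items


n = 10
print(relational_accesses(n), pick_accesses(n))
```

With ten line items this gives thirteen accesses against eleven, matching the figures in the comment; the difference is the two index reads.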

That's probably behind another favourite of mine where some experts spent six months trying to get Oracle on a twin Xeon 800 to run faster than Pick on a Pentium 90 ...

I guess Pick can be used for far bigger databases than relational because, for any given hardware, the Pick database will store/process twice as much user data ... :-)

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 4, 2020 6:11 UTC (Fri) by mgedmin (subscriber, #34497) [Link]

> Whenever I hear a software engineer gripe about "complexity," it almost always means "complexity in the layer I'm responsible for." Nobody ever talks about the overall system's complexity, with the result that the system tends towards a maximum of complexity as engineers push responsibilities off on one another.

Quote of the Week material.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 16:38 UTC (Fri) by k8to (guest, #15413) [Link] (1 responses)

I can't agree. Instead I've seen the pain of the existence of this "feature" on windows and macs with the insane behaviors that have been introduced.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 16:04 UTC (Sat) by marcH (subscriber, #57642) [Link]

I've used Windows, Macs and Unix systems continuously for more than 20 years. I never had any issue with case-sensitive filesystems; the subtle bugs and the pain were only on case-insensitive systems.

Of course higher level user interfaces should be case-insensitive, most are already. This just belongs to neither the filesystem nor the command line.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 19:57 UTC (Fri) by vadim (subscriber, #35271) [Link]

The last decade or so has shown well that many greybeards are very set in their ways, and have gotten stuck in a comfortable rut.

I've encountered a fair amount of people who don't seem to have the faintest clue why anybody felt the need to use systemd for instance -- for them SysV scripts and inittab is all that's needed.

I think this can happen quite easily -- all you need to do is to either do the same thing at the same company for decades, settle on a very narrow specialization where you have little clue of what other people are doing, or refuse to do anything new and keep finding jobs where things are done the old fashioned way.

And really, if you're a seriously hardcore Linux guy who hasn't touched anything but Linux in the last decade or two, this whole concern might as well be alien to you. This is more of an issue for people dealing with issues of portability, and not everybody does.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 11:31 UTC (Fri) by khim (subscriber, #9252) [Link]

Rename to all lower-case is not an option.

This would only make sense in an imaginary world where Linux (in conjunction with other case-sensitive FS OSes) took 90% of the desktop.

The Linux community tried to achieve that for a quarter-century - and got nowhere.

That means that at this point the choice is between literally millions of lines of code in various libraries and programs - or a much smaller number of lines in the kernel.

And yes, the cost/benefit ratio clearly shows that having one implementation in the kernel is more maintainable medium-term.

What would happen in year 2500, when Linux would, finally, achieve desktop dominance - is a question not for us but for our descendants.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 16:00 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> really worth introducing this level of complexity in the kernel, which will need to be maintained for ages immemorial?

Well, as I understand it, it's an OPTIONAL addition to ext4. So it will only exist as long as ext4. And presumably it can be left out if you don't want it.

Yes it's a pain dealing with it. But dealing with humans is a pain :-)

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 18:46 UTC (Fri) by tchernobog (guest, #73595) [Link]

Even optional additions have a maintenance cost, and interaction with other parts of the code. It means more code to review, understand, check for security exploits, non-obvious interdependencies from a behavioral standpoint...

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 13:57 UTC (Sun) by remi.chateauneu (subscriber, #51826) [Link] (1 responses)

"Windows C or C++ sources often include files using different cases"

But it implies that this C++ project may not build if the sources are moved to another file-system. And of course it might not build on BSD too. And not run depending on the file system, if it opens files like "abc.tmp", "Abc.Tmp" etc... This, just to avoid properly capitalizing header filenames.

To generalize the example "'important report.ods' and 'IMPORTANT REPORT.ods'" to a language with accents, these file names would all "mean the same data":

"œuvrer à un système général.bêta"
"oeuvrer a un systeme general.beta"
"ŒUVRER À UN SYSTÈME GÉNÉRAL,BÊTA"
"œuvrer à un système général.ß"
"oeuvrer-a-un-systeme-general.beta"

... plus combinations of words delimiters like spaces, quotes, tabs, non-breaking spaces, underscores or hyphens (possibly duplicate or missing) etc... because sentences without these, "mean the same piece of data".

And these file extensions would point to the same applications, or not ?
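A short Python sketch shows how far case folding and Unicode normalization actually go with names like these. It uses only the standard `unicodedata` module; `fold_key` and `strip_accents` are names invented for this example, and this is not the kernel's implementation, only an approximation of the general idea of comparing case-folded, normalized names.

```python
import unicodedata


def fold_key(name):
    """Comparison key: NFD-normalize, case-fold, then normalize again."""
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFD", folded)


def strip_accents(name):
    """Drop combining marks after decomposition (not part of case folding)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


accented = "œuvrer à un système général.bêta"
plain = "oeuvrer a un systeme general.beta"

# Case differences and composed/decomposed forms fold together.
same_case = fold_key(accented) == fold_key(accented.upper())
same_form = fold_key("général") == fold_key("ge\u0301ne\u0301ral")

# Accents are not folded away by case folding.
accents_match = fold_key(accented) == fold_key(plain)

# Even stripping accents leaves the ligature: "œ" has no decomposition to "oe".
stripped_match = strip_accents(fold_key(accented)) == fold_key(plain)

print(same_case, same_form, accents_match, stripped_match)
```

Under these rules only the upper/lower case variants and the differently encoded accents collapse into one name; the unaccented, hyphenated, and "ß" spellings stay distinct files ("ß" case-folds to "ss", not to "β").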

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 3, 2020 1:43 UTC (Thu) by draco (subscriber, #1792) [Link]

s/ß/β/, otherwise #4 is definitely different :-D

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 6, 2020 18:47 UTC (Sun) by scientes (guest, #83068) [Link]

The distcc way is to do a native compile, and then offload it to a larger machine (potentially a transparent cross-compile).

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 11, 2020 19:59 UTC (Fri) by bartoc (guest, #124262) [Link] (3 responses)

have you met my good friends aux.c and com.h, their siblings aux.h and com.c, and their wonderful parents AUX.c and COM.H? Really a wonderful family, shame what happened to them :D

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 13, 2020 16:28 UTC (Sun) by nix (subscriber, #2304) [Link] (2 responses)

I buy and sell mechanical clocks now and then (because a good clock is its own reward), and keep track of the prices of clocks I've bought, and that I mean to buy in a file called clock$.org. (Admittedly, I mostly do this to annoy Windows users who don't realise that there is a device named clock$, which is more or less all of them.)

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 13, 2020 19:59 UTC (Sun) by felix.s (guest, #104710) [Link] (1 responses)

A device named clock$ hasn’t been in Windows for quite some time now.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 14, 2020 17:27 UTC (Mon) by nix (subscriber, #2304) [Link]

Oh damn they took it out? I assumed that this sort of ancient historical monster would never, ever go away, even though it was more or less useless in anything newer than DOS 2...

Normalization vs. Case-sensitivity

Posted Aug 28, 2020 11:38 UTC (Fri) by V02460 (subscriber, #123493) [Link]

Is it agreed upon that files as viewed by the user are named in international script? If so, there needs to be an agreed-upon encoding. If the choice of encoding is Unicode, we need to work with all its quirks, which would especially be normalization. Having this supported well sounds very desirable to me then (not opening the discussion on where it should be implemented).

What doesn't make sense to me is that the article conflates Unicode normalization and case-insensitivity.
I as a user can't keep the different versions of café apart, so normalization helps me there. For letter-casing instead I don't have a problem keeping e.g. B and b apart from each other.

Citing semantics is a little bit misleading, I think. We wouldn't want filenames with different synonyms to be mapped to the same data as that would be quite arbitrary and a little restrictive. Introducing special rules on usable characters for latin scripts feels quite arbitrary and a little restrictive to me as well. Adding this change to make a system more compatible, on the other hand, and making it optional as well, sounds like a good idea to me, though.

About search: Why can't search be case-insensitive, even if the files are stored with case?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 12:13 UTC (Fri) by nettings (subscriber, #429) [Link] (8 responses)

For the sake of the patch author, I hope it's just a joke.
For the sake of everyone else, I really hope this thing dies a horrible death on LKML...

Next thing someone comes up with is localizing folder names. Anyone?

$~ echo $LANG
de_DE
$~ ls -al /
/bfe
/benutzer
/bntzr
/bib
/einhngn
/grt
/prg
/proz
/sprg
/stiefel
/vä
/vrbg
*wakes up sweaty and disoriented*

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 12:25 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link] (2 responses)

> For the sake of everyone else, I really hope this thing dies a horrible death on LKML...

From the very first sentence of the linked blog

"Linux 5.2 was released over one year ago and with it, a new feature was added to support optimized case-insensitive file name lookups in the Ext4 filesystem - the first of native Linux filesystems to do it."

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 14:19 UTC (Fri) by thumperward (guest, #34368) [Link] (1 responses)

You're making the beginners' mistake of assuming that just because AmigaDOS supported something in 1985 that little old Linux would be capable of handling it only 35 years later.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 6:25 UTC (Sat) by zdzichu (subscriber, #17118) [Link]

Rahul made it obvious that “nettings” has not read the article. Thus nettings' comments are pure noise here.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 12:59 UTC (Fri) by cesarb (subscriber, #6266) [Link] (4 responses)

> Next thing someone comes up with is localizing folder names. Anyone?

Doesn't XDG already do this? The folders automatically created on my home directory are named "Área de trabalho", "Documentos", "Downloads", "Imagens", "Modelos", "Música", "Público", "Vídeos". Had my locale been anything other than pt-BR, these folders would have other names.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 21:12 UTC (Fri) by NAR (subscriber, #1313) [Link] (3 responses)

I think Windows does it differently. If I list the directory names of a Windows partition from Linux, I see e.g. /users. If I list the same directory in Windows Explorer (or probably from the system file open dialog) I see Felhasználók (at least in Hungarian Windows). So I think the translation happens in the UI - the dialogs, etc. do not show the physical filename (that's stored on the disk), but translate it. I'm not even sure if it's consistent, some applications have their own dialogs and those do not translate the names (I don't have Windows in front of me right now, so can't check it, but I think e.g. GIMP does not translate the filenames). As far as I noticed, XDG does not do it, I get to see the same filenames regardless of the current locale.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 21:40 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

Windows does (or did) something weird in the registry.

Even directly in Windows, if I went in and looked at it using Windows Explorer, I would see c:\Users\Wol\Documents. But if I went in as me, I would see "My Documents".

What I hope linux does, and I suspect it is the case, is that it canonicalises the name and uses that as the directory name, but whatever name the user gave it it saves in a "display name" field so that is what the user sees. So while the user might type "Foo", "fOO", "foo", whatever, the actual directory entry will always be "foo" or "FOO" depending on which case it chooses to use. So long as the display name is used then the user will see whatever they typed, giving you a "case insensitive but case preserving" system.
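The "case insensitive but case preserving" idea described above can be illustrated with a toy in-memory directory. This is only a sketch of the concept: the class name `CasePreservingDir` and the helper `fold_key` are hypothetical, and nothing here reflects how ext4 actually lays out directory entries.

```python
import unicodedata


def fold_key(name: str) -> str:
    """Canonical lookup key: normalize, case-fold, normalize again."""
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFD", folded)


class CasePreservingDir:
    """Toy directory: lookups ignore case, listings keep the typed name."""

    def __init__(self):
        # Maps folded key -> (name as the user typed it, payload).
        self._entries = {}

    def create(self, name, payload=None):
        key = fold_key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = (name, payload)

    def lookup(self, name):
        # Raises KeyError if no entry matches under case folding.
        return self._entries[fold_key(name)]

    def listing(self):
        return sorted(display for display, _ in self._entries.values())
```

Creating "Foo" and then looking up "fOO" or "foo" finds the same entry, while a listing still shows "Foo" as typed.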

Windows has pulled that sort of stunt ever since W95 ...

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 12:29 UTC (Mon) by milesrout (subscriber, #126894) [Link]

What's done with XDG is that there are some environment variables somewhere in your profile that will set something like XDG_DOCUMENTS_DIR=~/docs or ~/Documents or ~/[Documents in Hungarian]
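For what it's worth, with the xdg-user-dirs tool these settings live in a small shell-style file, ~/.config/user-dirs.dirs, with lines such as XDG_DOCUMENTS_DIR="$HOME/Documentos". A minimal sketch of reading that format follows; the function name `parse_user_dirs` is made up for illustration, and only the simple `$HOME/...` form is handled.

```python
import re

# Matches lines like: XDG_DOCUMENTS_DIR="$HOME/Documentos"
_LINE = re.compile(r'^\s*(XDG_[A-Z]+_DIR)="(.*)"\s*$')


def parse_user_dirs(text, home):
    """Return {variable: absolute path} from user-dirs.dirs content."""
    result = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue  # comments and blank lines are skipped
        key, value = match.groups()
        if value == "$HOME":
            value = home
        elif value.startswith("$HOME/"):
            value = home.rstrip("/") + "/" + value[len("$HOME/"):]
        result[key] = value
    return result
```

The directory names on disk are the localized ones; the variable names are what applications are expected to look up.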

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 23:42 UTC (Sat) by cesarb (subscriber, #6266) [Link]

I don't have a computer with Windows nearby to check, but IIRC Windows does (or at least did) the same: for instance, the directory which is called "C:\Program Files" in English Windows is instead "C:\Arquivos de Programas" in Brazilian Portuguese Windows, and these are the physical directory names on the disk.

But yeah, Windows Explorer does things differently. What you see in Windows Explorer is a virtual hierarchy defined as COM objects (the Shell Namespace), not the real filesystem hierarchy, so you can have for instance virtual folders (like the Control Panel) which are visible in that virtual hierarchy but are not in the filesystem. The inconsistency you see is probably between applications using the Shell Namespace and applications using the filesystem directly.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 14:24 UTC (Fri) by dskoll (subscriber, #1630) [Link]

While I understand the need for this feature, I hate the very idea of it with every fiber of my being. Luckily, I don't have to deal with case-insensitivity since I have no day-to-day interaction with anything that needs it. So I just don't enable the feature.

As long as having the feature doesn't impose any penalty for those who choose not to enable it, I think (reluctantly) that this is the appropriate place to put it.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 16:00 UTC (Fri) by magfr (subscriber, #16052) [Link] (7 responses)

Read about dotted and dotless i and then please explain how this works without locale information for every file name.
https://en.m.wikipedia.org/wiki/Dotted_and_dotless_I

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 18:09 UTC (Fri) by Jonno (subscriber, #49613) [Link] (6 responses)

Unicode defines a set of "default casing operations", which is a set of pure functions with no dependency other than the input string(s). The kernel uses "canonical caseless match" which is defined as:

> R4: toCasefold( X ): Map each character C in X to Case_Folding(C).
> Case_Folding(C) uses the mappings with the status field value “C” or “F”
> in the data file CaseFolding.txt in the Unicode Character Database.
>
> [...]
>
> D145: A string X is a canonical caseless match for a string Y if and only if:
> NFD(toCasefold(NFD( X ))) = NFD(toCasefold(NFD( Y )))

The Unicode standard also provides guidance for the implementation of "tailored casing operations", including suggested rules for locale-dependent casing operations for Lithuanian, Turkish and Azeri, which is what you are talking about. (Note that the Lithuanian rules do not affect toCasefold, only toUppercase, toLowercase and toTitlecase; and that the rules for Turkish and Azeri are identical.)

Optional support for using Turkic case folding instead of default case folding would be great, and would fit right in as another flag argument. I'm sure patches would be welcome...
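The D145 definition quoted above maps almost directly onto Python's standard library, where `str.casefold()` applies the default (locale-independent) full case folding. The following is a user-space sketch of the definition only, not the kernel's implementation; the function names are chosen for illustration.

```python
import unicodedata


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def canonical_caseless_match(x: str, y: str) -> bool:
    """D145: NFD(toCasefold(NFD(X))) == NFD(toCasefold(NFD(Y)))."""
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())
```

Because no locale is consulted, precomposed and decomposed spellings of "café" match, "Floß" matches "FLOSS", and the Turkish dotless "ı" does not match "I".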

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 18:56 UTC (Fri) by nave (subscriber, #105585) [Link] (4 responses)

> Optional support for using Turkic case folding instead of default case folding would be great, and would fit right in as another flag argument.

Files are copied between computers.
Their filenames may have been created with different locale settings.

Support for using Turkic case folding (or any other option) is not enough.

The locale settings must follow the file.

We would need something like the RDF langString: "name@lang", where 'lang' is an IETF BCP 47 language tag.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 0:04 UTC (Sat) by Jonno (subscriber, #49613) [Link] (3 responses)

> The locale settings must follow the file.

That is impossible, as the locale is not a property of the file name, but of the comparison operation. At best you could set the locale on a per-directory basis, but I hardly see how that would be any better than per filesystem.

(For reference, NTFS uses Turkic case folding for file systems formatted on a Turkish language Windows install, and non-Turkic case folding for file systems formatted on any other language Windows install; and Turkish Windows users seem to deal with it just fine.)
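To make the "property of the comparison, not of the name" point concrete, here is a simplified sketch in which the same pair of names matches or not depending on a flag passed to the comparison. The `turkic` flag and the function names are hypothetical. The two Turkic mappings (I to ı, İ to i) are the status "T" entries in CaseFolding.txt; composing to NFC first is a shortcut taken here so that "I" followed by a combining dot is treated as "İ".

```python
import unicodedata


def fold(name: str, turkic: bool = False) -> str:
    """Case-fold a name, optionally with the Turkic I/ı and İ/i mappings."""
    if turkic:
        name = unicodedata.normalize("NFC", name)
        # Status "T" mappings from CaseFolding.txt.
        name = name.replace("\u0130", "i").replace("I", "\u0131")
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFD", folded)


def caseless_match(a: str, b: str, turkic: bool = False) -> bool:
    return fold(a, turkic) == fold(b, turkic)
```

With these rules "FILE" and "file" match only under default folding, while "DIŞ" and "dış" match only under Turkic folding, even though the stored bytes of the names never change.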

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 18:30 UTC (Sat) by k8to (guest, #15413) [Link] (1 responses)

Using the historical locale mechanism is a non-starter. Locale is per-process. Processes share files.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 22:37 UTC (Sat) by nave (subscriber, #105585) [Link]

> Using the historical locale mechanism is a non-starter. Locale is per-process. Processes share files.

Exactly!

Files are (will be eventually) shared between processes, users, computers, countries.

*Any* *scope* we can choose for locale settings (process, cgroup, user, directory, file system, computer, OS localization, LAN, organization, country) is simultaneously:

- too big to handle all possible filenames correctly (when doing case folding or other human language operation);

- too small to be sure that files will always be moved inside it (never shared outside that scope).

Having filenames with an IETF BCP 47 language tag attached (based on the locale, for example)
may help with the human language operations when a file is shared/copied/moved.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 10:00 UTC (Sun) by nave (subscriber, #105585) [Link]

The goal of Windows compatibility (especially helping Samba) is important.
The design is OK *if* described as "support case-insensitive lookups for Windows compatibility".
This is important and very useful work, and I'm grateful for it.

The following justification has the heart in the right place but I think it's giving us *false hope*:

"[...] that is not how humans operate. When people write titles, 'important report.ods' and 'IMPORTANT REPORT.ods' usually mean the same piece of data, and you don't care how it was written when creating it.
We care about the content and the semantics of the words IMPORTANT and REPORT"

We cannot achieve this goal so easily. Many commenters explained why.

To me it seems that the core of the issue is **closed vs. open world** bias.

Locales support human language- / culture-specific processing of *curated collections* = closed worlds.
Example: sort order or case-folding in a dictionary, book index, document archive.

I agree that a single directory, and maybe a whole tree (a file system instance), can be managed like a curated collection, a closed world.

But can we expect *careful curation* from users who do *not* care about case? *These* users were mentioned to justify the need. I don't think it will work.

An *open world* is the more general case that I care about, and the only realistic expectation:
file systems should store and find later *all* files that we acquire, which may come from any other computer, locale, OS localization, organization, country.

> For reference, NTFS uses Turkic case folding for file systems formatted on a Turkish language Windows install,
> and non-Turkic case folding for file systems formatted on any other language Windows install;
> Turkish Windows users seem to deal with it just fine.

Let's assume Turkish Windows users exchanging files are satisfied with the handling of dotted and dotless I in filenames.
Does everything work as expected when their files, or whole trees, go through other computers, or just flash drives formatted by non-Turkish users?
I'd expect that they do what most of us do: for successful exchange avoid fancy names, the US-ASCII subset is safest, etc.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 4, 2020 6:27 UTC (Fri) by mgedmin (subscriber, #34497) [Link]

What are the Unicode rules for locale dependent casing operations for Lithuanian? I'm Lithuanian, and I've never heard of these.

*googles, looks it up in SpecialCases.txt*

Ah, it's about accented text which is basically only used in dictionaries and textbooks to indicate which syllable is stressed. Lowercase i retains its dot when an additional accent indicating stress is placed on it, which requires an extra Unicode combining character that needs to be explicitly dropped when uppercasing.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 16:03 UTC (Fri) by Chousuke (subscriber, #54562) [Link] (12 responses)

I'm not sure what to think of this. Suddenly instead of having to be careful dealing with filesystems I *know* do weird things with filenames, I have to be careful of ext4 variants as well.

What happens with case-insensitive ext4 if you copy over files with the same name but different case from another filesystem? Do you just destroy data silently, or does it actually complain?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 16:13 UTC (Fri) by Chousuke (subscriber, #54562) [Link] (1 responses)

Just realized I could test this myself. Any data in files with conflicting case gets silently clobbered.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 16:54 UTC (Fri) by dvdeug (guest, #10998) [Link]

That is the Unix way; if you want files not to get silently clobbered, you have to set an option on cp or mv or pretty much any other POSIX utility that creates or moves files.
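One way to avoid the clobbering described above is to check a list of source names for case-fold collisions before copying into a case-insensitive directory. A small sketch, assuming default Unicode case folding as the comparison rule; `find_collisions` is a made-up helper name.

```python
import unicodedata
from collections import defaultdict


def fold_key(name: str) -> str:
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFD", folded)


def find_collisions(names):
    """Group names that would become the same entry once case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[fold_key(name)].append(name)
    # Only groups with at least two distinct spellings are collisions.
    return [sorted(set(g)) for g in groups.values() if len(set(g)) > 1]
```

Running it on the entries of a source directory and refusing to copy when the result is non-empty turns silent data loss into an explicit error.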

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 18:21 UTC (Sat) by marcH (subscriber, #57642) [Link] (8 responses)

Also: what happens when you copy files across case-insensitive filesystems with different locales and capitalization rules?

For even more fun, imagine the source and/or destination have per-directory sensitivity.

Sheer insanity.

PS: also realized this can probably be tested with Windows today. Too bad life is too short, already wasted enough time with Windows and issues like these.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 11:18 UTC (Mon) by Wol (subscriber, #4433) [Link] (4 responses)

You have the "file system name" and the "user supplied name". The file system name is enforced by the local name - canonicalised from the user supplied name, and unique.

The user supplied name is used when passing it somewhere else that is not (immediately) accessing the file.

So a copy uses the user-supplied name in user-space, the file systems at either end canonicalise that name to ensure uniqueness. The only way we can then get grief of inaccessible files (yes Apple had that problem) is if we change the canonicalisation rule on an active file system. BAD IDEA!

Analyse the problem then the solution is obvious - use user names in user space, and system names in system space!

Cheers,
Wol

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 17:02 UTC (Mon) by marcH (subscriber, #57642) [Link] (3 responses)

> canonicalised from the user supplied name, and unique.

Not my NTFS experience, so while researching it I found that case-insensitivity is NOT implemented at the NTFS filesystem level?!?

https://www.betaarchive.com/wiki/index.php/Microsoft_KB_A...

Unless it is now?!? http://drewthaler.blogspot.com/2007/12/case-against-insen...
> NTFS: Case-insensitive in different ways depending on the version of Windows that created the volume.

What a total mess...

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 17:08 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> Not my NTFS experience, so while researching it I found that case-insensitivity is NOT implemented at the NTFS filesystem level?!?
It is implemented in the FS. You have to do case folding in the FS driver. Moreover, NT used to actually store the case conversion table in a special hidden file on NTFS, so you can implement it by doing a simple lookup in a 16-bit table.
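The table-lookup approach mentioned above can be sketched as follows: build a 65536-entry table mapping each 16-bit code unit to its uppercase form, then compare names unit by unit. The table here is derived from Python's own Unicode data and only keeps one-to-one mappings, so it is an illustration of the technique rather than a copy of what any Windows version stores on disk.

```python
import struct


def build_upcase_table():
    """One entry per 16-bit code unit; identity unless a 1:1 uppercase exists."""
    table = list(range(0x10000))
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate halves have no case
        upper = chr(cp).upper()
        if len(upper) == 1 and ord(upper) < 0x10000:
            table[cp] = ord(upper)
    return table


UPCASE = build_upcase_table()


def upcase_units(name: str):
    """Encode to UTF-16 code units and map each through the table."""
    data = name.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    return tuple(UPCASE[u] for u in units)


def names_equal(a: str, b: str) -> bool:
    return upcase_units(a) == upcase_units(b)
```

A per-unit table cannot express one-to-many mappings, so unlike full case folding it treats "Floß" and "FLOSS" as different names.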

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 17:20 UTC (Mon) by marcH (subscriber, #57642) [Link] (1 responses)

Ha, so the same filesystem can do both simultaneously depending on who uses it? Fascinating... obsession for complexity. Can ext4 do both simultaneously too? Is it case-preserving too? Sorry for being lazy but I'm still wondering whether it's just the documentation on these topics that is messy or the implementation too. Very afraid the latter. Speaking of which:

Since you seem knowledgeable about this, would you know why Microsoft apparently tried to delete this KB from the Internet? And also comment on the "depending on the Windows version" quote?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 17:27 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> Ha, so the same filesystem can do both simultaneously depending on who uses it?
It used to be a per-FS flag, actually.

> Since you seem knowledgeable about this, would you know why Microsoft apparently tried to delete this KB from the Internet?
MS doesn't really delete KB articles, they just constantly change the way they're organized. And they recently started expiring the old articles. They are still available through KB archive if you need them, though.

The archive states that the article applied to:
> Microsoft Windows NT Advanced Server 3.1
> Microsoft Windows NT Workstation 3.1

Which are long dead and gone.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 4, 2020 6:30 UTC (Fri) by mgedmin (subscriber, #34497) [Link] (2 responses)

> For even more fun, imagine the source and/or destination have per-directory sensitivity.

Have you ever mounted a VFAT-formatted USB drive on a Linux system and copied files between it and elsewhere? This is not a new, never-heard-before, oh the calamity! situation.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 4, 2020 8:55 UTC (Fri) by marcH (subscriber, #57642) [Link] (1 responses)

> Have you ever mounted a VFAT-formatted USB drive on a Linux system and copied files between it and elsewhere?

Yes and that's exactly why the idea of the same quirks but on a much larger scale is scary.

On Linux people script cp -R and rsync without even thinking about it. robocopy always sounds like an adventure.

Anyway the comment you're answering was about per-directory case sensitivity, I miss how this VFAT example is related.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 4, 2020 16:35 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

One could check the filesystem in use for any given directory and find out it is vfat/ntfs/cifs and apply some case-insensitive and/or Windows naming rule logic to their operations in that directory. But no one does that (and won't do whatever is needed to ask about case-insensitive ext4 directories either). So the case exists today, but has traditionally been limited to "am I working on a USB key" kind of things.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 2, 2020 1:20 UTC (Wed) by riking (subscriber, #95706) [Link]

Note that even if the option is active on a filesystem, it doesn't do anything until it's also enabled on specific folders.

Your program can check for +F on the folder if you need to deal with this.
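A program can read the same attribute that lsattr displays as F through the FS_IOC_GETFLAGS ioctl and test the FS_CASEFOLD_FL bit. The sketch below is Linux-only and makes some assumptions worth stating: the ioctl number is computed with the common `_IOR` encoding (a few architectures encode it differently), and any failure, including filesystems that do not support the ioctl, is reported as None.

```python
import fcntl
import os
import struct

FS_CASEFOLD_FL = 0x40000000  # from <linux/fs.h>: folder is case insensitive


def _ior(type_char, nr, size):
    # Common Linux _IOR() encoding; assumed, differs on a few architectures.
    return (2 << 30) | (size << 16) | (ord(type_char) << 8) | nr


FS_IOC_GETFLAGS = _ior("f", 1, struct.calcsize("l"))


def flags_have_casefold(flags: int) -> bool:
    return bool(flags & FS_CASEFOLD_FL)


def dir_is_casefolded(path):
    """True/False if the flag could be read, None if it could not."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    except OSError:
        return None
    try:
        buf = bytearray(struct.calcsize("l"))
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)
        (flags,) = struct.unpack_from("I", buf)  # kernel writes an int
        return flags_have_casefold(flags)
    except OSError:
        return None
    finally:
        os.close(fd)
```

This only reports whether the directory itself has the attribute; it says nothing about subdirectories, which carry their own flag.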

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 17:08 UTC (Fri) by jch (guest, #51929) [Link]

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 19:15 UTC (Fri) by cpitrat (subscriber, #116459) [Link]

I also mean the same thing when I type "important report" and "improtant reptort", or even "rapport important". Will v2 support this?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 28, 2020 22:31 UTC (Fri) by Matlib (guest, #134276) [Link] (9 responses)

I've made a number of Debian and Ubuntu installations in the past to all sorts of people, including those who didn't really feel any difference whether caps lock was on or off. I don't recall anyone complaining about case-sensitive names.

The save dialog could ask for confirmation if the name is too similar to an existing one. Even better, the drop-down list may show similarly named files when typing. This falls more into UX enhancement category.

Anyway, what did they complain about then?

  • #0 – that there was no confirmation on delete
  • (linked problem) – that trash folders were created on memory sticks
  • incompatibilities between OO/LO and MS Office

SpaceFM fortunately solved the first two though.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 0:47 UTC (Sat) by rgmoore (✭ supporter ✭, #75) [Link] (8 responses)

The big problems with Linux being case sensitive come when it's interacting with other operating systems. For example, at my work we run our scientific instruments on Windows because that's what the instrument control software requires, but we archive our data to a Linux box using rsync. The archive box then shares the data using Samba so we can access years worth of older data on our Windows machines.

We recently ran into a big hassle when we updated one of our machines to Windows 10 and I accidentally named the data directory "Data" instead of "data". The Linux archive box treated this as a different directory and added the new data to the new directory. When the Samba server served it, it showed there being two directories, "Data" and "data", but Windows showed their contents as being the same, so we couldn't access our older data. We were eventually able to sort things out by renaming the "data" directory to "old_data", but it was an unnecessary difficulty. Having a case-insensitive filesystem would have avoided the whole problem. Sure, you can blame Windows for the problem rather than Linux, but if you want to use Linux boxes to serve files for Windows computers, they need to be able to do things in a Windows-friendly fashion.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 2:35 UTC (Sat) by gb (subscriber, #58328) [Link] (6 responses)

Why should Linux be Windows-friendly? Can't Windows be Linux-friendly? What about making Windows case-sensitive?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 7:06 UTC (Sat) by zorro (subscriber, #45643) [Link] (4 responses)

They already have. See https://petri.com/turn-windows-10-ntfs-case-sensitivity

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 2:08 UTC (Sun) by zlynx (guest, #2285) [Link] (2 responses)

Unfortunately with Windows if you try case sensitivity or removing 8.3 names on a boot drive or any drive with programs installed everything breaks.

I tried stripping a virtual machine Windows boot drive of all its 8.3 compatibility names once. I was shocked at how many "modern" Windows programs use C:\PROGRA~1 as some kind of shortcut to Program Files. Those are probably the same programs that would miserably fail if the boot drive was F not C.

I believe this kind of thing is why REFS is limited to data drives only.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 18:21 UTC (Sun) by khim (subscriber, #9252) [Link] (1 responses)

> Those are probably the same programs that would miserably fail if the boot drive was F not C.

I actually had Windows 98 (when that was a thing) installed on drive D. Surprisingly few programs failed. But the amount of pain needed to make installers work... it's just not worth it.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 16:00 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

Yeah, I had an XP install that refused to see the disk I wanted as "C:", so I got a D: install drive. Most things actually worked, but it was indeed the installers that were most confused.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 13:56 UTC (Mon) by eru (subscriber, #2753) [Link]

> They already have. See https://petri.com/turn-windows-10-ntfs-case-sensitivity

I suspect even with that setting, con, aux, prn etc are still reserved names..

Just for fun, if you have access to SharePoint or OneDrive, try uploading a file named aux.txt from Linux, using the web browser interface. You will get a complaint that your file name contains invalid characters!

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 18:18 UTC (Sun) by khim (subscriber, #9252) [Link]

> Why should Linux be Windows-friendly?

Because most app developers use Windows.

> Can't Windows be Linux friendly?

It can but not by default. And even then - it wouldn't magically fix programs.

> What about making windows case-sensitive?

Out of the question because this would break these same programs.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 18:34 UTC (Sat) by k8to (guest, #15413) [Link]

There are other file mirror tools that can handle this case. rsync is not really built to be a heterogeneous os/fs file replicator.

I was sceptical above, but not now

Posted Aug 29, 2020 2:24 UTC (Sat) by gus3 (guest, #61103) [Link]

The big selling point for me is the first requirement that the directory must be empty in order to set it to be case-insensitive. This puts case-insensitivity on par with file encryption in ext4, ubifs and f2fs. Directories must be empty, in order to enable major dataset behaviors. Any pre-existing datasets preclude modifying such behavior(s).

TL;DR: I get it. I'm on board.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 6:47 UTC (Sat) by kunitz (subscriber, #3965) [Link] (8 responses)

There are multiple reasons why the addition of the feature doesn't make a lot of sense.

  1. Unicode capitalization rules are changing with the consequence that the exact behavior of the feature will depend on the Unicode version supported by the kernel.
  2. User-space software can still not rely on the feature being available, so software would need to check whether the feature is supported and implement a fallback if it is not.
  3. It will not be used widely and therefore not sufficiently tested; so the feature will break silently at one point in the future.

This is a typical might-be-useful feature that adds complexity and needs to be maintained forever.

The Floß/FLOSS example made me laugh because since 2017 the official rules of the German language allow the use of a capital ß in addition to the replacement by SS. Typographers have been discussing the capital letter ß for over a century. More about it in the Wikipedia entry.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 11:47 UTC (Sat) by james (subscriber, #1325) [Link] (4 responses)

I don't know about "won't be used widely."

If Samba were to support it, I can imagine that Synology and other manufacturers of Linux-based NAS devices would want to use it, if only because it might help performance in reviews.

And if they use it, they have commercial reasons to test it.

(A quick search doesn't turn up any Samba patches that take advantage of this: just complaints that checking that none of the files in a very large directory matches a particular filename is very expensive.)

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 13:33 UTC (Sat) by barryascott (subscriber, #80640) [Link] (2 responses)

Don’t servers default to XFS not EXT4?
And Fedora is moving to btrfs.

In both cases, unless this feature is added to those filesystems, isn't it less interesting?

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 18:46 UTC (Sat) by zdzichu (subscriber, #17118) [Link]

But Android uses ext4. Or f2fs, for which case-insensitivity support is being worked on.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 12:32 UTC (Sun) by james (subscriber, #1325) [Link]

Commercial NAS devices come with pre-installed firmware from the manufacturer, running the way they want.

So the choice of filesystem is up to them: they also get to specify which filesystem features are enabled and which software is provided.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 2:52 UTC (Mon) by jra (subscriber, #55261) [Link]

Samba already supports this and has done so for many years :-).

We've run on systems that are case-insensitive forever.

All you need to do is tell Samba via the smb.conf that the system is case insensitive, and so doing a [l]stat given a name should always succeed if the file exists. On ENOENT we then don't do the expensive search. It's really that simple.

set:

case sensitive = yes

leave the rest as default and you're done.
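The lookup strategy described above can be sketched roughly as follows. This is not Samba's code: resolve_name and its fs_case_insensitive parameter are hypothetical names chosen for illustration, and the sketch only shows why trusting the filesystem lets a server skip the directory scan on ENOENT.

```python
import os
import tempfile


def resolve_name(directory, name, fs_case_insensitive):
    """Return a path that matches `name` in `directory`, or None."""
    candidate = os.path.join(directory, name)
    try:
        os.lstat(candidate)  # cheap: a single lookup by the name as given
        return candidate
    except FileNotFoundError:
        if fs_case_insensitive:
            # The filesystem already ignored case, so ENOENT is final.
            return None
    # Expensive fallback: read the whole directory and compare folded names.
    wanted = name.casefold()
    for entry in os.listdir(directory):
        if entry.casefold() == wanted:
            return os.path.join(directory, entry)
    return None


with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "Report.ods"), "w").close()
    print(resolve_name(d, "REPORT.ODS", fs_case_insensitive=False) is not None)
    print(resolve_name(d, "missing.txt", fs_case_insensitive=True))
```

The cost difference is in the fallback loop: for a very large directory it touches every entry, while the first branch is one lstat call.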

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 18:06 UTC (Sat) by mariofutire (guest, #141044) [Link] (1 responses)

This is very important.

No application can rely on it, as it depends on the way the user has created / mounted the filesystem.

So now it is even more confusing because apps need to run in an environment which is a moving target, implementing fallback or workarounds.

It is a feature only useful to individual users, not to the community as a whole in my opinion.

If this is solving a wine problem, it could have been done with a system call or via a new parameter, which they (i.e. wine) can control.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 12:45 UTC (Mon) by kevincox (guest, #93938) [Link]

> No application can rely on it, as it depends on the way the user has created / mounted the filesystem.

Applications can check for the feature, and refuse to run or switch to a fallback. I agree that it will be many years until this can be assumed to be available (counting from when the default is switched to enabled), but programs where the performance matters can already start checking for this.

> If this is solving a wine problem, it could have been done with a system call or via a new parameter, which they (i.e. wine) can control.

It can't be done this way with good performance. I think the only way that it could be done is if wine stored files normalized, but kept the original name stashed somewhere. However this is a lot of complexity and doesn't work for directories not completely managed by wine.
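One way an application could check for the feature, as suggested above, is to probe the directory it cares about. The sketch below is an assumption about how such a check might look, not an established API: directory_folds_case is a made-up name, it needs write access to the directory, and it takes the filesystem-agnostic route rather than inspecting the per-directory attribute described in the article.

```python
import os
import tempfile


def directory_folds_case(directory):
    """Probe whether name lookups in `directory` ignore case."""
    # mkstemp gives a unique name; the mixed-case prefix guarantees that
    # swapping the case produces a different byte sequence.
    fd, path = tempfile.mkstemp(prefix="CaseProbe_", dir=directory)
    os.close(fd)
    try:
        head, tail = os.path.split(path)
        swapped = os.path.join(head, tail.swapcase())
        try:
            # Same inode under both spellings means lookups fold case.
            return os.path.samestat(os.stat(path), os.stat(swapped))
        except FileNotFoundError:
            return False
    finally:
        os.unlink(path)


print(directory_folds_case(tempfile.gettempdir()))
```

A program could run this once at start-up for its data directory and choose between the fast path and a slower userspace search accordingly.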

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 18:50 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

> Unicode capitalization rules are changing with the consequence that the exact behavior of the feature will depend on the Unicode version supported by the kernel.

Not true, see https://www.unicode.org/policies/stability_policy.html, which specifically notes that case-folding is stable from Unicode 5.0 onwards. It may change, but it will not change in a way that would be "noticeable" to strings only containing characters from a previous version of Unicode.

> User-space software can still not rely on the feature being available, so software would need to check whether the feature is supported and implement a fallback if it is not.

This has been used as an argument against every new feature of every piece of software since the dawn of time.

> It will not be used widely and therefore not sufficiently tested; so the feature will break silently at one point in the future.

There are a good half-dozen comments just on this article explaining why people want this feature, in a variety of contexts (mostly relating to Windows interoperability). Maybe you don't use Windows, but it's ridiculous to claim that Windows is "not used widely."

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 10:50 UTC (Sat) by tilt12345678 (subscriber, #126336) [Link]

As I see it, the feature of filename case-insensitivity is implemented for the sake of compatibility with MS Windows, specifically for interchange of files stored on Linux filesystems with MS Windows clients (Samba comes to mind, but there are other scenarios, too).

And for that purpose, to enable it as an option on a Linux-hosted fileshare, I welcome case-insensitivity; it provides a very useful feature in hybrid environments.

Thanks for the hard work.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 29, 2020 16:09 UTC (Sat) by marcH (subscriber, #57642) [Link] (1 responses)

What next, per-directory case-sensitivity?

Oh, wait... https://github.com/microsoft/vscode-cmake-tools/issues/531

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 4, 2020 12:05 UTC (Fri) by riking (subscriber, #95706) [Link]

That's... literally in this feature already. It's mentioned in the article.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 14:05 UTC (Sun) by gray_-_wolf (subscriber, #131074) [Link] (6 responses)

I wonder where we should stop. I mean, most users would expect file names that differ only in 0x20 versus 0xa0 (space versus no-break space) to compare equal. Should we also treat those as the same?

What about same looking but technically different characters?

I just wonder what the line here is.
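The question of where the line sits can be illustrated with Unicode normalization forms. The following sketch uses Python's unicodedata module and a hypothetical helper, equal_under; it makes no claim about which form any particular filesystem uses, only about what each form does to these specific characters.

```python
import unicodedata


def equal_under(form, a, b):
    """Compare two names after normalizing with `form` and case folding."""
    fold = lambda s: unicodedata.normalize(form, s).casefold()
    return fold(a) == fold(b)


space_name = "my file.txt"      # U+0020 SPACE
nbsp_name = "my\u00a0file.txt"  # U+00A0 NO-BREAK SPACE

# Canonical normalization keeps the two spaces distinct.
print(equal_under("NFD", space_name, nbsp_name))   # False
# Compatibility normalization maps the no-break space to a plain space.
print(equal_under("NFKD", space_name, nbsp_name))  # True
# Composed and decomposed forms of é are canonically equivalent.
print(equal_under("NFD", "caf\u00e9", "cafe\u0301"))  # True
# Look-alikes from different scripts stay different under either form.
print(equal_under("NFKD", "A.txt", "\u0410.txt"))  # False (Latin vs Cyrillic)
```

So even within Unicode's own definitions there are at least two candidate lines, and neither of them merges characters that merely look alike.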

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 30, 2020 18:32 UTC (Sun) by khim (subscriber, #9252) [Link] (5 responses)

> I just wonder what the line here is.

The line is where Windows draws the line, ultimately.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 4:16 UTC (Mon) by marcH (subscriber, #57642) [Link] (4 responses)

> The line is where Windows draws the line, ultimately.

Thanks to this and other comments I finally understand what this feature truly is: _Windows compatibility_. It should really be called that instead of "case-insensitive filesystem", which never made much technical sense because of all the possible variations, configurations, evolutions, incompatibilities, bugs, complexity and other corner cases. Case is a natural-language and informal concept after all; for instance, most French people believe the capital letter for "é" is "E" (no accent) while all professional newspapers and books use "É" (the latter is very difficult to enter on Windows, which partly explains the former).

"The Windows implementation is the specification" finally does make some sense. I mean it still doesn't make sense but at least it provides a "technical" and "formal" specification for it. Don't forget to include the Windows and Unicode version numbers in the feature name too and maybe the Windows locale too.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 8:12 UTC (Mon) by abo (subscriber, #77288) [Link]

I wouldn't want to encourage anyone to resort to drinking to deal with this, but it seems fitting to name it Wine Mode, even though it may be useful in other cases (ha) too, like Samba and WSL2.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 15:44 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (2 responses)

> Thanks to this and other comments I finally understand what this feature truly is: _Windows compatibility_.

If that's the case, are all of the other Windows filenaming rules also being enforced? No trailing spaces, periods, special names, etc.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 16:26 UTC (Mon) by paultaysom (guest, #141070) [Link]

Don't forget the reserved file names. For strict Windows compatibility, you need an 8.3 version of the name. (My knowledge may be old.)
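For reference, the other Windows naming rules mentioned in this sub-thread can be sketched as a small checker. The rule set below reflects Microsoft's documented naming conventions as I understand them; windows_name_problems is a hypothetical helper, nothing in this thread suggests the ext4 feature enforces any of these rules, and 8.3 short-name generation is deliberately left out.

```python
# Device names that Windows reserves, with or without an extension.
RESERVED_DEVICE_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
FORBIDDEN_CHARS = set('<>:"/\\|?*')


def windows_name_problems(name):
    """Return a list of reasons why `name` is awkward as a Windows file name."""
    problems = []
    if name.endswith((" ", ".")):
        problems.append("trailing space or period")
    # The part before the first dot decides whether a device name is hit.
    stem = name.split(".", 1)[0].rstrip(" ").upper()
    if stem in RESERVED_DEVICE_NAMES:
        problems.append("reserved device name")
    if any(ch in FORBIDDEN_CHARS or ord(ch) < 32 for ch in name):
        problems.append("forbidden character")
    return problems


for candidate in ["report.ods", "aux.h", "notes.", "a:b"]:
    print(candidate, windows_name_problems(candidate))
```

With these inputs, only report.ods comes back with an empty list; aux.h trips the reserved-name rule even though it has an extension.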

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 3, 2020 14:52 UTC (Thu) by khim (subscriber, #9252) [Link]

It's Linux we are talking about. So obviously the old rule "if nobody notices, it's not broken" is in effect.

The goal is not to faithfully reproduce Windows behavior, the goal is to make all these billions of lines of code written for Windows useful.

And while there is an enormous corpus of code which creates "SomeDataFile.DAT" and then tries to read "somedatafile.dat", all these special names are rarely a problem in practice.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Aug 31, 2020 4:20 UTC (Mon) by marcH (subscriber, #57642) [Link]

Ironically, it's still not possible to clone and browse the Linux kernel code on Windows or macOS by default:

git clone linux
cd linux
git status

Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: include/uapi/linux/netfilter/xt_CONNMARK.h
modified: include/uapi/linux/netfilter/xt_DSCP.h
modified: include/uapi/linux/netfilter/xt_MARK.h
....
modified: net/netfilter/xt_DSCP.c
modified: net/netfilter/xt_HL.c
modified: net/netfilter/xt_RATEEST.c
modified: net/netfilter/xt_TCPMSS.c

Back to a case-sensitive system:

git ls-tree --name-only v5.7 -- net/netfilter/xt_* | sort -f -k4

net/netfilter/xt_dscp.c
net/netfilter/xt_DSCP.c !!!
net/netfilter/xt_hl.c
net/netfilter/xt_HL.c
etc.

Who thought this was a good idea?
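The collisions shown above can be found mechanically. Here is a small sketch that groups paths by their case-folded spelling; case_collisions is a made-up helper, and the sample list reuses file names from the listing above plus one non-colliding entry.

```python
from collections import defaultdict


def case_collisions(paths):
    """Return groups of paths that differ only by case."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


sample = [
    "net/netfilter/xt_dscp.c",
    "net/netfilter/xt_DSCP.c",
    "net/netfilter/Makefile",
]
print(case_collisions(sample))
# [['net/netfilter/xt_DSCP.c', 'net/netfilter/xt_dscp.c']]
```

Feeding it the output of `git ls-tree -r --name-only <rev>` split into lines would list every pair that cannot coexist in a case-insensitive checkout.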

I don't know if this serves a purpose

Posted Aug 31, 2020 10:55 UTC (Mon) by xophos (subscriber, #75267) [Link]

But the rationale given in the article is clearly bogus.
Applying the stated logic to its conclusion, "Report, important" should also be the same file. So we need a dictionary and an AI to determine which file names are the same.
If that is too far-fetched for you, just consider Unicode code points that have the same or similar-looking characters attached.
Those should clearly be the same too!
The way I see it, the only use case is easier Windows emulation. Whether that is worth the effort is debatable, but at least be honest about it.

Krisman: Using the Linux kernel's Case-insensitive feature in Ext4

Posted Sep 4, 2020 2:42 UTC (Fri) by RogerOdle (subscriber, #60791) [Link]

I hope nobody uses this misfeature in software development. File names need to be case sensitive or your software is just not portable. When I was still consulting, I got paid lots of money to make software written on Windows build on something else. Go ahead and use this; you're just creating work for many other someone elses every time they try to build your software on a more traditional system.

For software development, the file system should assure that there is only one way to spell something and it should be an error if you use the wrong case.


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds