
A discussion between database and kernel developers

From:  Mel Gorman <mgorman-AT-suse.de>
To:  pgsql-hackers-AT-postgresql.org, mysql-AT-lists.mysql.com, maria-developers-AT-lists.launchpad.net, linux-kernel-AT-vger.kernel.org
Subject:  Database/kernel community topic at Collaboration Summit 2014
Date:  Mon, 10 Mar 2014 10:15:37 +0000
Message-ID:  <20140310101537.GC10663@suse.de>

Hi,

Arrangements have been made to hold a meeting between database and kernel
developers at Collaboration Summit 2014 http://sched.co/1hEBRuq on March
27th 2014. This was organised after discussions on pain points encountered
by the PostgreSQL community. Originally the plan had been to just have a
topic for LSF/MM, but there was much more interest in the topic than
anticipated, so the Collaboration Summit meeting will be much more open.

If there are developers attending Collaboration Summit that work in
the database or kernel communities, it would be great if you could come
along. Previous discussions were on the PostgreSQL list and that should be
expanded in case we accidentally build postgres-only features. The intent
is to identify the problems encountered by databases and where relevant,
test cases that can be used to demonstrate them if they exist.  While the
kernel community may be aware of some of the problems, they are not always
widely known or understood. There is a belief that some interfaces are fine
when in reality applications cannot use them properly. The ideal outcome
of the meeting would be concrete proposals on kernel features that could
be developed over the course of time to address any identified problem.

For reference, this is a summary of the discussion that took place when
the topic was proposed for LSF/MM.

Thanks.

---8<---

On testing of modern kernels
----------------------------

Josh Berkus claims that most people are using Postgres with 2.6.19 and
consequently there may be poor awareness of recent kernel developments.
This is a disturbingly large window of opportunity for problems to have
been introduced.

Minimally, Postgres has concerns about IO-related stalls which may or may
not exist in current kernels. There were indications that large writes
starve reads. There have been variants of this style of bug in the past but
it's unclear what the exact shape of this problem is and if IO-less dirty
throttling affected it. It is possible that Postgres was burned in the past
by data being written back from reclaim context in low memory situations.
That would have looked like massive stalls with drops in IO throughput
but it was fixed in relatively recent kernels. Any data on historical
tests would be helpful. Alternatively, a pgbench-based reproduction test
could potentially be used by people in the kernel community that track
performance over time and have access to a suitable testing rig.

It was mentioned that Postgres has a tool called pg_test_fsync that can be
used for testing different wal_sync_method settings. Potentially it could
also be used for evaluating some kernel patches.

Gregory Smith highlighted the existence of a benchmark wrapper for pgbench
called pgbench-tools: https://github.com/gregs1104/pgbench-tools . It can
track statistics of interest to Postgres as well as report interesting
metrics such as transaction latency. He had a lot of information on testing
requirements and some very interesting tuning information; it's worth
reading the whole mail

	http://www.postgresql.org/message-id/52D99161.60305@gmail...

Postgres bug reports and LKML
-----------------------------

It is claimed that LKML does not welcome bug reports but it's less clear
what the basis of this claim is.  Is it because the reports are ignored? A
possible explanation is that they are simply getting lost in the LKML noise
and there would be better luck if the bug report was cc'd to a specific
subsystem list. A second possibility is the bug report is against an old
kernel and unless it is reproduced on a recent kernel the bug report will
be ignored. Finally it is possible that there is not enough data available
to debug the problem. The worst explanation is that to date the problem
has not been fixable but the details of this have been lost and are now
unknown. Is it possible that some of these bug reports can be refreshed
so at least there is a chance they get addressed?

Apparently there were changes to the reclaim algorithms that crippled
performance without providing any sysctls to tune the behaviour. The
problem may be compounded by the
introduction of adaptive replacement cache in the shape of the thrash
detection patches currently being reviewed.  Postgres investigated the
use of ARC in the past and ultimately abandoned it. Details are in the
archives (http://www.Postgres.org/search/?m=1&q=arc&l=1&...). I
have not read them, just noting they exist for future reference.

Sysctls to control VM behaviour are not popular as such tuning parameters
are often used as an excuse to not properly fix the problem. Would it be
possible to describe a test case that shows 2.6.19 performing well and a
modern kernel failing? That would give the VM people a concrete basis to
work from to either fix the problem or identify exactly what sysctls are
required to make this work.

I am confident that any bug related to VM reclaim in this area has been lost.
At least, I recall no instances of it being discussed on linux-mm and it
has not featured at LSF/MM in recent years.

IO Scheduling
-------------

Kevin Grittner has stated that it is known that the DEADLINE and NOOP
schedulers perform better than any alternatives for most database loads.
It would be desirable to quantify this for some test case and see whether
the default scheduler can cope in some way.

The deadline scheduler makes sense to a large extent though. Postgres
is sensitive to large latencies due to IO write spikes. It is at least
plausible that deadline would give more deterministic behaviour for
parallel reads in the presence of large writes assuming there were not
ordering problems between the reads/writes and the underlying filesystem.

For reference, these IO spikes can be massive. If the shared buffer is
completely dirtied in a short space of time then it could be 20-25% of
RAM being dirtied and writeback required in typical configurations. There
have been cases where it was worked around by limiting the size of the
shared buffer to a small enough size so that it can be written back
quickly. There are other tuning options available such as altering when
dirty background writing starts within the kernel but that will not help if
the dirtying happens in a very short space of time. Dave Chinner described
the considerations as follows

	There's no absolute rule here, but the threshold for background
	writeback needs to consider the amount of dirty data being generated,
	the rate at which it can be retired and the checkpoint period the
	application is configured with. i.e. it needs to be slow enough to
	not cause serious read IO perturbations, but still fast enough that
	it avoids peaks at synchronisation points. And most importantly, it
	needs to be fast enough that it can complete writeback of all the
	dirty data in a checkpoint before the next checkpoint is triggered.

	In general, I find that threshold to be somewhere around 2-5s
	worth of data writeback - enough to keep a good amount of write
	combining and the IO pipeline full as work is done, but no more.

	e.g. if your workload results in writeback rates of 500MB/s,
	then I'd be setting the dirty limit somewhere around 1-2GB as
	an initial guess. It's basically a simple trade off buffering
	space for writeback latency. Some applications perform well with
	increased buffering space (e.g. 10-20s of writeback) while others
	perform better with extremely low writeback latency (e.g. 0.5-1s).

Some of this may have been addressed in recent changes with IO-less dirty
throttling. When considering stalls related to excessive IO it will be
important to check if the kernel was later than 3.2 and what the underlying
filesystem was.
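
As a purely illustrative example of applying Dave's rule of thumb, the
thresholds can be set in bytes through the vm.dirty_background_bytes and
vm.dirty_bytes sysctls (in kernels recent enough to provide them). The
helper sketched below needs root, picks the hard limit as an arbitrary
doubling of the background threshold, and its output would still have to be
validated against the real workload:

	/* Illustrative only: apply the "a few seconds' worth of writeback"
	 * rule of thumb by setting vm.dirty_background_bytes and
	 * vm.dirty_bytes through /proc/sys. Needs root. */
	#include <stdio.h>
	#include <stdlib.h>

	static int write_knob(const char *path, long long value)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fprintf(f, "%lld\n", value);
		return fclose(f);
	}

	int main(int argc, char **argv)
	{
		if (argc != 3) {
			fprintf(stderr, "usage: %s <writeback MB/s> <seconds of buffering>\n",
				argv[0]);
			return 1;
		}

		long long rate_mb = atoll(argv[1]);	/* e.g. 500 */
		long long seconds = atoll(argv[2]);	/* e.g. 2-5 as suggested above */

		/* Start background writeback once this much data is dirty... */
		long long background = rate_mb * seconds * 1024 * 1024;
		/* ...and throttle writers outright at (arbitrarily) twice that. */
		long long hard_limit = background * 2;

		if (write_knob("/proc/sys/vm/dirty_background_bytes", background) ||
		    write_knob("/proc/sys/vm/dirty_bytes", hard_limit)) {
			perror("writing /proc/sys/vm knobs");
			return 1;
		}
		printf("dirty_background_bytes=%lld dirty_bytes=%lld\n",
		       background, hard_limit);
		return 0;
	}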

Again, it really should be possible to demonstrate this with a test case,
one driven by pgbench maybe? The workload would generate a bunch of test
data, dirty a large percentage of it and try to sync. Metrics would include
average read-only query latency when reading in parallel to the write,
average latencies from the underlying storage, IO queue lengths etc.,
comparing the default IO scheduler with deadline or noop.

NUMA Optimisations
------------------

The primary one that showed up was zone_reclaim_mode. Enabling that parameter
is a disaster for many workloads and apparently Postgres is one of them.
It might be time to revisit leaving that parameter disabled by default,
explicitly
requiring that NUMA-aware workloads that are correctly partitioned enable it.
Otherwise NUMA considerations are not that much of a concern right now.
Bruce Momjian highlighted this blog entry that covered zone_reclaim

http://frosty-postgres.blogspot.com/2012/08/postgresql-nu...

Direct IO, buffered IO, double buffering and wishlists
------------------------------------------------------

The general position of Postgres is that the kernel knows more about
storage geometries and IO scheduling than an application can or should
know. It would be preferred to have interfaces that allow Postgres to
give hints to the kernel about how and when data should be written back.
The alternative is exposing details of the underlying storage to userspace
so Postgres can implement a full IO scheduler using direct IO. It has
been asserted on the kernel side that the optimal IO size and alignment
should be all the detail that is required in the majority of cases. While
some database vendors have taken this option, the Postgres community does
not have the resources to implement something
of this magnitude. They also have tried direct IO in the past in the areas
where it should have mattered and had mixed results.

I can understand Postgres's preference for using the kernel to handle these
details for them. Postgres is a cross-platform application and the kernel
should not be washing its hands of the problem and hiding behind direct
IO as a solution. Ted Ts'o summarises the issues as

	The high order bit is what's the right thing to do when database
	programmers come to kernel engineers saying, we want to do <FOO>
	and the performance sucks.  Do we say, "Use O_DIRECT, dummy", not
	withstanding Linus's past comments on the issue?  Or do we have
	some general design principles that we tell database engineers that
	they should do for better performance, and then all developers for
	all of the file systems can then try to optimize for a set of new
	API's, or recommended ways of using the existing API's?

In an effort to avoid depending on direct IO there were some proposals
and/or wishlist items. These are listed in order of likelihood to be
implemented and usefulness to Postgres.

   1. Hint to asynchronously queue writeback now in preparation for a
      fsync in the near future. Postgres dirties a large amount of data and
      asks the kernel to push it to disk over the next few minutes. Postgres
      still is required to fsync later but the fsync time should be
      minimised. vm.dirty_writeback_centisecs is unreliable for this.

      One possibility would be an fadvise call that queues the data for
      writeback by a flusher thread now and returns immediately.

   2. Hint that a page is a prime candidate for reclaim but only if there
      is reclaim pressure. This avoids a problem where fadvise(DONTNEED)
      discards a page only to have a read/write or WILLNEED hint immediately
      read it back in again. The requirements are similar to the volatile
      range hinting but they do not use mmap() currently and would need a
      file-descriptor based interface. Robert Haas had some concerns with
      the general concept and described them thusly

	This is an interesting idea but it stinks of impracticality.
	Essentially when the last buffer pin on a page is dropped we'd
	have to mark it as discardable, and then the next person wanting
	to pin it would have to check whether it's still there.  But the
	system call overhead of calling vrange() every time the last pin
	on a page was dropped would probably hose us.

	Well, I guess it could be done lazily: make periodic sweeps through
	shared_buffers, looking for pages that haven't been touched in a
	while, and vrange() them.  That's quite a bit of new mechanism,
	but in theory it could work out to a win.  vrange() would have
	to scale well to millions of separate ranges, though.  Will it?
	And a lot depends on whether the kernel makes the right decision
	about whether to chunk data from our vrange() vs. any other page
	it could have reclaimed.

   3. Hint that a page should be dropped immediately when IO completes.
      There is already something like this buried in the kernel internals
      and sometimes called "immediate reclaim" which comes into play when
      pages are being invalidated. It should just be a case of investigating
      whether that is visible to userspace, if not why not, and doing it in
      a semi-sensible fashion.

   4. 8kB atomic write with OS support to avoid writing full page images
      in the WAL. This is a feature that is likely to be delivered anyway
      and one that Postgres is interested in.

   5. Only writeback some pages if explicitly synced or dirty limits
      are violated. Jeff Janes states that he has problems with large
      temporary files that generate IO spikes when the data starts hitting
      the platter even though the data does not need to be preserved. Jim
      Nasby agreed and commented that he "also frequently see this, and it
      has an even larger impact if pgsql_tmp is on the same filesystem as
      WAL. Which *theoretically* shouldn't matter with a BBU controller,
      except that when the kernel suddenly decides your *temporary*
      data needs to hit the media you're screwed."

      One proposal that may address this is

	Allow a process with an open fd to hint that pages managed by this
	inode will have dirty-sticky pages. Pages will be ignored by dirty
	background writing unless there is an fsync call or dirty page limits
	are hit. The hint is cleared when no process has the file open.

   6. Only writeback pages if explicitly synced. Postgres has strict write
      ordering requirements. In the words of Tom Lane -- "As things currently
      stand, we dirty the page in our internal buffers, and we don't write
      it to the kernel until we've written and fsync'd the WAL data that
      needs to get to disk first". mmap() would avoid double buffering but
      it has no control over the write ordering, which is a show-stopper;
      a minimal sketch of the required ordering appears after this list.
      As Andres Freund described;

	Postgres' durability works by guaranteeing that our journal
	entries (called WAL := Write Ahead Log) are written & synced to
	disk before the corresponding entries of tables and indexes reach
	the disk. That also allows to group together many random-writes
	into a few contiguous writes fdatasync()ed at once. Only during
	a checkpointing phase the big bulk of the data is then (slowly,
	in the background) synced to disk. I don't see how that's doable
	with holding all pages in mmap()ed buffers.

      There are also concerns there would be an absurd number of mappings.

      The problem with this sort of dirty pinning interface is that it
      can deadlock the kernel if all dirty pages in the system cannot be
      written back by the kernel. James Bottomley stated

	No, I'm sorry, that's never going to be possible.  No user space
	application has all the facts.	If we give you an interface to
	force unconditional holding of dirty pages in core you'll livelock
	the system eventually because you made a wrong decision to hold
	too many dirty pages.

      However, it was very clearly stated that the write ordering is
      critical. If the kernel breaks the requirement then the database
      can get trashed in the event of a power failure.

      This led to a discussion on write barriers which the kernel uses
      internally but there are scaling concerns both with the number of
      constraints that would exist and the requirement that Postgres use
      mapped buffers.

      There were few solid conclusions on this. It would need major
      reworking on all sides and it would hand control of system safety
      over to userspace, which is going to cause layering violations. This
      whole idea may be a bust but it is still worth recording. Greg Stark
      outlined the motivation best as follows;

	Ted Ts'o was concerned this would all be a massive layering violation
	and I have to admit that's a huge risk. It would take some clever
	API engineering to come with a clean set of primitives to express
	the kind of ordering guarantees we need without being too tied to
	Postgres's specific implementation. The reason I think it's more
	interesting though is that Postgres's journalling and checkpointing
	architecture is pretty bog-standard CS stuff and there are hundreds
	or thousands of pieces of software out there that do pretty much
	the same work and trying to do it efficiently with fsync or O_DIRECT
	is like working with both hands tied to your feet.

   7. Allow userspace process to insert data into the kernel page cache
      without marking the page dirty. This would allow the application
      to request that the OS use the application copy of data as page
      cache if it does not have a copy already. The difficulty here
      is that the application has no way of knowing if something else
      has altered the underlying file in the meantime via something like
      direct IO. Granted, such activity has probably corrupted the database
      already but initial reactions are that this is not a safe interface
      and there are coherency concerns.

      Dave Chinner asked "why, exactly, do you even need the kernel page
      cache here?"  when Postgres already knows how and when data should
      be written back to disk. The answer boiled down to "To let kernel do
      the job that it is good at, namely managing the write-back of dirty
      buffers to disk and to manage (possible) read-ahead pages". Postgres
      has some ordering requirements but it does not want to be responsible
      for all cache replacement and IO scheduling. Hannu Krosing summarised
      it best as

	Again, as said above the linux file system is doing fine. What we
	want is a few ways to interact with it to let it do even better
	when working with Postgres by telling it some stuff it otherwise
	would have to second guess and by sometimes giving it back some
	cache pages which were copied away for potential modifying but
	ended up clean in the end.

	And let the linux kernel decide if and how long to keep these pages
	in its cache using its superior knowledge of disk subsystem and
	about what else is going on in the system in general.

   8. Allow copy-on-write of page-cache pages to anonymous. This would limit
      the double ram usage to some extent. It's not as simple as having a
      MAP_PRIVATE mapping of a file-backed page because presumably they want
      this data in a shared buffer shared between Postgres processes. The
      implementation details of something like this are hairy because it's
      mmap()-like but not mmap() as it does not have the same writeback
      semantics due to the write ordering requirements Postgres has for
      database integrity.

      Completely nuts and this was not mentioned on the list, but arguably
      you could try implementing something like this as a character device
      that allows MAP_SHARED, with ioctls controlling which file and
      offset back the pages within the mapping.  A new mapping would be
      forced resident and read-only. A write would COW the page. It's a
      crazy way of doing something like this but avoids a lot of overhead.
      Even considering the stupid solution might make the general solution
      a bit more obvious.

      For reference, Tom Lane comprehensively
      described the problems with mmap at
      http://www.Postgres.org/message-id/17515.1389715715@sss.p...

      There were some variants of how something like this could be achieved
      but no finalised proposal at the time of writing.

   9. Hint that a page in an anonymous buffer is a copy of a page cache
      page and invalidate the page cache page on COW. This limits the
      amount of double buffering. It is listed as a low priority item as
      it is unclear if it is really necessary and also I suspect the
      implementation would be very heavy because of the amount of
      information we'd have to track in the kernel.
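
The sketch referred to in item 6 follows: a minimal illustration of the
ordering Tom Lane and Andres Freund describe, with the WAL record written
and fdatasync()'d before the modified data page is handed to the kernel,
and the data file only fsync()'d at checkpoint time. File names and sizes
are invented and the real Postgres WAL machinery is considerably more
involved.

	/* Minimal sketch of the WAL-before-data ordering from item 6. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	#define DATA_PAGE_SIZE 8192

	int main(void)
	{
		char wal_record[128] = "redo description of the page change";
		char data_page[DATA_PAGE_SIZE];

		memset(data_page, 'x', sizeof(data_page));

		int wal_fd  = open("wal.segment", O_WRONLY | O_CREAT | O_APPEND, 0600);
		int data_fd = open("table.data",  O_WRONLY | O_CREAT, 0600);

		if (wal_fd < 0 || data_fd < 0) {
			perror("open");
			return 1;
		}

		/* 1. The WAL record describing the change must reach stable
		 *    storage first. */
		if (write(wal_fd, wal_record, sizeof(wal_record)) != (ssize_t)sizeof(wal_record) ||
		    fdatasync(wal_fd) != 0) {
			perror("wal write/fdatasync");
			return 1;
		}

		/* 2. Only now may the dirty data page be handed to the kernel.
		 *    Until this point nothing may write it back, which is what
		 *    makes plain mmap() of the data files unusable. */
		if (pwrite(data_fd, data_page, sizeof(data_page), 0) != (ssize_t)sizeof(data_page)) {
			perror("data pwrite");
			return 1;
		}

		/* 3. The data file itself is only fsync()'d later, at checkpoint
		 *    time, after which the WAL segment can be recycled. */
		if (fsync(data_fd) != 0) {
			perror("data fsync");
			return 1;
		}

		close(wal_fd);
		close(data_fd);
		return 0;
	}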

It is important to note in general that Postgres has a problem with some
files being written back too aggressively and other files not written back
aggressively enough. Temp files for purposes such as sorting should have
writeback deferred as long as possible. Data file writes that must complete
before portions of the WAL can be discarded should begin writeback early
so the final fsync does not stall for too long.  As Dave Chinner says

	IOWs, there are two very different IO and caching requirements
	in play here and tuning the kernel for one actively degrades the
	performance of the other.

Robert Haas categorised the IO patterns as follows

	- WAL files are written (and sometimes read) sequentially and
	  fsync'd very frequently and it's always good to write the data
	  out to disk as soon as possible

	- Temp files are written and read sequentially and never fsync'd.
	  They should only be written to disk when memory pressure demands
	  it (but are a good candidate when that situation comes up)

	- Data files are read and written randomly.  They are fsync'd at
	  checkpoint time; between checkpoints, it's best not to write
	  them sooner than necessary, but when the checkpoint arrives,
	  they all need to get out to the disk without bringing the system
	  to a standstill
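
That last requirement is essentially wishlist item 1 above. Arguably the
closest interface Linux has today is sync_file_range(), which can be used
to trickle out writeback of dirty data-file ranges ahead of the checkpoint
fsync(). A minimal, purely illustrative sketch (the file name is invented
and sync_file_range() gives no durability guarantee on its own):

	/* Illustrative only: start writeback of dirty ranges of a data file
	 * early so the fsync() at checkpoint time has less left to do.
	 * Linux-specific. */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("table.data", O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* Ask the kernel to queue writeback of the first 64MB of dirty
		 * pages; the call returns without waiting for the IO. */
		if (sync_file_range(fd, 0, 64 << 20, SYNC_FILE_RANGE_WRITE) != 0)
			perror("sync_file_range");

		/* ... later, at checkpoint time, the usual fsync() still runs,
		 * but much of the data should already be on its way to disk. */
		if (fsync(fd) != 0)
			perror("fsync");

		close(fd);
		return 0;
	}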

No matter what, it was pointed out that fsync should never be able to screw
the system. Robert Haas again summarised it as follows

	IMHO, the problem is simpler than that: no single process should
	be allowed to completely screw over every other process on the
	system.  When the checkpointer process starts calling fsync(), the
	system begins writing out the data that needs to be fsync()'d so
	aggressively that service times for I/O requests from other process
	go through the roof.  It's difficult for me to imagine that any
	application on any I/O scheduler is ever happy with that behavior.
	We shouldn't need to sprinkle our fsync() calls with special magic
	juju sauce that says "hey, when you do this, could you try to avoid
	causing the rest of the system to COMPLETELY GRIND TO A HALT?".
	That should be the *default* behavior, if not the *only* behavior.

It is important to keep this in mind although sometimes the ordering
requirements of the filesystem may make it impossible to achieve.


At LSF/MM last year there was a discussion on whether userspace should
hint that files are "hot" or "cold" so the underlying layers could decide
to relocate some data to faster storage. I tuned out a bit during the
discussion and did not track what happened with it since but I guess that
any developments of that sort would be of interest to the Postgres community.

Some of these wish lists still need polish but could potentially be
discussed further at LSF/MM with a wider audience as well as on the
lists. Then, in a world of unicorns and ponies, it's a case of picking some
of these hinting wishlist items, seeing what it takes to implement them in
the kernel and testing them with a suitably patched version of postgres
running a test case driven by something (pgbench presumably).

-- 
Mel Gorman
SUSE Labs



A discussion between database and kernel developers

Posted Mar 11, 2014 3:01 UTC (Tue) by fest3er (guest, #60379) [Link] (12 responses)

"Temp files for purposes such as sorting should have writeback deferred as long as possible."

Would it work to write such temp files to a tmpfs that has no backing store? Or are such files really pseudo-temporary?

Can a balance be found between RAM size and DB size (and activity) that would minimize memory pressure while allowing Linux to manage dirty page handling efficiently in the background? Knowing 'normative' relationships between DB size and RAM size, and where/when performance begins to degrade would help to define the real problem and would help to focus attention on problem areas of the kernel and the RDBMS.

I'll part with a truism. You can only squeeze so much performance out of a Yugo; at some point, you have to upgrade to a Neon or, heaven forfend, a nitrous-boosted TDI.

A discussion between database and kernel developers

Posted Mar 11, 2014 9:02 UTC (Tue) by mel (guest, #5484) [Link] (10 responses)

> Would it work to write such temp files to a tmpfs that has no backing
> store? Or are such files really pseudo-temporary?

If the size of a temporary file is a large percentage of physical RAM or exceeds physical RAM then it gets pushed to swap so no, it doesn't really work.

> Can a balance be found between RAM size and DB size (and activity) that
> would minimize memory pressure while allowing Linux to manage dirty page
> handling efficiently in the background?

This balance is already maintained but it's not the full story. If data writes to disk have to be strictly ordered for data consistency reasons then it may still be necessary to complete a large amount of writeback before the system can make forward progress. The kernel currently tracks the amount of dirty data that is in the system but not how long it takes to write it.

Would it work to write such temp files to a tmpfs

Posted Mar 11, 2014 15:01 UTC (Tue) by Wol (subscriber, #4433) [Link] (9 responses)

But this is a fairly easy thing to solve today, if you can direct your temporary files.

For example, I run gentoo. With 16Gb ram, I have (iirc) two 32Gb swap partitions (overkill, I know :-). I also have /var/tmp/portage mounted as a 20Gb tmpfs. I doubt I ever spill into swap while "emerge"ing an update, but I can leave it all to the OS to handle without worrying about it.

With PostgreSQL, all a user would need to do would be to add more ram, and the same thing would apply - if it fits in ram it stays in ram, if it doesn't then swap gets involved - but it would have to get involved anyway under any relevant scenario.

Cheers,
Wol

Would it work to write such temp files to a tmpfs

Posted Mar 11, 2014 15:11 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

It's not an unrealistic thing to have several hundred gigabytes of temporary files existing in some workloads.

Would it work to write such temp files to a tmpfs

Posted Mar 11, 2014 17:32 UTC (Tue) by dlang (guest, #313) [Link] (7 responses)

don't forget that paging in and out of swap tends to be significantly slower than to a simple disk file because the swap tends to be badly fragmented.

Would it work to write such temp files to a tmpfs

Posted Mar 11, 2014 17:46 UTC (Tue) by Wol (subscriber, #4433) [Link] (6 responses)

If you're going to use a lot of ram like this, just have several (hopefully very fast) disks and spread swap across them at equal priority.

Or if necessary (expensive but maybe worth it) just have swap on an ssd.

And tell the kernel developers you don't want the swap partition to fragment! :-)

Cheers,
Wol

Would it work to write such temp files to a tmpfs

Posted Mar 11, 2014 18:23 UTC (Tue) by dlang (guest, #313) [Link]

avoiding swap file fragmentation requires that the kernel either know ahead of time what's going to be swapped out in the future, or that swap space be larger than virtual memory size so that there is a reserved spot for any page and it can be read in efficiently.

requiring a bunch of really fast disks to use swap when a medium speed disk doing a sequential write/read of the data will be just as fast is not a smart way to spend your money.

Would it work to write such temp files to a tmpfs

Posted Mar 12, 2014 16:00 UTC (Wed) by jeremiah (subscriber, #1221) [Link] (4 responses)

FWIW, I've found using an SSD for swap a great way to massively speed things up until the wear leveling really has to kick in. At which point in time everything goes downhill real fast. They seem to be okay when you have a little memory pressure, but when you hit a hard swap storm over a long period of time the drives have a hard time dealing with all of the overwriting of the same data repeatedly. This of course is all anecdotal and may have been caused by a particular version of Crucial/Micron's firmware in their m4 series, but I had to ditch my swaps on those drives for fear of losing the whole drive.

Would it work to write such temp files to a tmpfs

Posted Mar 12, 2014 16:10 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Just do some TRIM-ming of empty space on these drives from time to time. Works wonders.

Would it work to write such temp files to a tmpfs

Posted Mar 12, 2014 20:44 UTC (Wed) by jeremiah (subscriber, #1221) [Link]

I'll give it a shot, thanks

Would it work to write such temp files to a tmpfs

Posted Mar 12, 2014 19:05 UTC (Wed) by parcs (guest, #71985) [Link] (1 responses)

Have you tried enabling the "discard" mount option on the swap partition (or using swapon -d)?

Would it work to write such temp files to a tmpfs

Posted Mar 12, 2014 20:44 UTC (Wed) by jeremiah (subscriber, #1221) [Link]

I have not, thanks.

A discussion between database and kernel developers

Posted Mar 11, 2014 9:21 UTC (Tue) by iq-0 (subscriber, #36655) [Link]

I still think that files that are no longer linked to the filesystem, but that still have dirty pages, should be put on a "we don't want any writeback for this unless there is memory pressure" list. This should probably be the default behaviour.

This makes unlinked temporary files very efficient (no unnecessary writes for stuff that we don't want to make persistent, like tmpfs), explicitly backed by real disk space (reserved from the filesystem, unlike tmpfs), and easily reclaimable under memory pressure (sort of like tmpfs with dedicated swap space).
Effectively, this would make temporary file use on any filesystem with optimized support perform like tmpfs when there is no memory pressure: not a single byte would have to be written in that case, the inode could live entirely in memory, the space reservation would be an in-memory reservation of blocks from the free space, and on a crash nothing would need to be recovered because on disk the blocks would already be on the free space list.
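
For reference, the "unlinked temporary file" part can already be expressed
with O_TMPFILE (Linux 3.11+, filesystem support required); it is the
deferred-writeback policy described above that does not exist yet. A
minimal sketch:

	/* Illustrative only: an unlinked temporary file created with
	 * O_TMPFILE. It has disk space reserved behind it but no name;
	 * nothing here changes how its dirty pages are written back. */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/var/tmp", O_TMPFILE | O_RDWR, 0600);

		if (fd < 0) {
			perror("O_TMPFILE open");
			return 1;
		}

		/* Spill scratch data here; today the kernel may still decide
		 * to write it back even though it will never be needed again. */
		if (write(fd, "scratch data\n", 13) < 0)
			perror("write");

		close(fd);	/* the space is released automatically */
		return 0;
	}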

Should the new "link filedescriptor syscall" be used, then that should probably trigger an fdatasync on the file descriptor (similar to the sync on rename behaviour).

And there could be an option to open a normally linked file in this mode, having that file appear truncated in case of a crash (because any blocks allocated would only be allocated in memory and on disk would always remain on the free space list).

A discussion between database and kernel developers

Posted Mar 11, 2014 9:28 UTC (Tue) by iq-0 (subscriber, #36655) [Link]

> 6. Only writeback pages if explicitly synced.

> No, I'm sorry, that's never going to be possible. No user space
> application has all the facts. If we give you an interface to
> force unconditional holding of dirty pages in core you'll livelock
> the system eventually because you made a wrong decision to hold
> too many dirty pages.

Wouldn't a "deferred dirty page writeback mmap area" be like marking the mmap'ed area 'MAP_PRIVATE' and, when writeback is allowed to happen, turning it back into 'MAP_SHARED'?

Because in that case I don't see how this wouldn't be exactly the same problem as Postgresql holding it into some large other private memory area. The kernel wouldn't be allowed to do anything with it in that case either (short of swapping it).

Is this SQL databases or No-SQL?

Posted Mar 11, 2014 12:41 UTC (Tue) by Wol (subscriber, #4433) [Link] (40 responses)

Bearing in mind that relational is a mathematical theory, it sounds to me that a lot of those requirements are optimisations for problems that other databases just don't have ...

I'm trying (for the umpteenth time) to write my own Free MV implementation, so here's my take on things ...

From the Multi-Value/Pick POV, temporary ram isn't important. The original design was very ram-starved ($1000s/4Kb) and it didn't have the concept of disk - the hard disk was used as permanent virtual ram. That said, there's probably a lot of "nice to have" features there.

The ability to hint to the kernel that certain files (or parts thereof) are best kept in memory if possible - lookup tables for example. The ability to memory-map files for example - if I design it so an account is a single file at the OS level, I can then stick to the original design and treat disk space as if it were ram - a persistent backing store. Someone described MV as a "directed graph" which is close to the mark, so I don't need ram to store large amounts of temporary data - a typical query will retrieve a list of primary keys off the principal table, then simply retrieve each row from disk as required.

Looking at 8kb write support - yes this would be great for MV. Tables are "blocky" and you can guarantee that they will be written in fixed-size chunks. It was initially 512b or 2Kb (to match then disk-sizes). Most modern implementations allow you to specify the size in multiples of those two. From my point of view, it would be nice to be able to query the fs and ask what the optimal size is, but note that more than a few K is NOT optimal for the data loads! So large block sizes (and 8K is tending that way) are not a good idea. Basically, if I can store more than 4 or 5 rows per block then the block is too large, and a row is typically measured in bytes not Kb.

Disk writeback stuff ... fsync and friends are well-known performance killers. I don't know much about the other things, but one thing that would be great is the ability to insert a write barrier. I'm planning to make all my files COW so I want to be able to guarantee write order. I want to be able to call the OS at the end of a transaction and say "flush all this stuff in this order, but I don't want to wait until it's done". The problem is I can't guarantee over what extent this will be required! If an account is a single file, then *typically* the barrier will extend only to this file. But my preferred approach is one file per table, so I'll need a barrier on a process basis not a file basis. And of course, once I do that, I can't guaranteed that all the files are on the same fs, or even computer! There though, I would simply have to warn the user and say that guarantees don't apply.

So, speaking from an engineering pov, and from an MV pov too, I'd say that any enhancements to disk i/o would probably help MV as much as SQL, but because MV just doesn't use anywhere near as much ram, improvements there are probably not going to make much difference to us.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 11, 2014 15:00 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (39 responses)

I keep seeing your posts waxing philosophic about this topic, but my question is: where are the other users of pick-like data stores? Why is "everyone" still using relational if the grass is so much greener elsewhere? I'm not going to argue a technical comparison with you, but I'd like to know what the social reasons behind its apparent (to me) non-use might be.

Is this SQL databases or No-SQL?

Posted Mar 11, 2014 15:17 UTC (Tue) by Wol (subscriber, #4433) [Link] (38 responses)

"Apparent" I think is the appropriate word ... I'll just give a little recent history for you.

Do you remember IBM buying Informix? When they bought it, U2 (a couple of Pick database implementations) came along for the ride. The whole lot was merged into the DB2 division. Obviously, I wasn't involved but I gather a whole lot of envious rivalries fired up inside the DB2 division as a result - the U2 bit was generating *double* *figure* year-on-year organic growth. So IBM sold it on to a company called Rocket Software. And when they sold it, it amounted to HALF of the entire DB2 division! So, in this one Pick company we're talking about it being the *same* size as IBM's database division.

Then there are a fair few other companies in that space too. And to give you a couple of big customers, the UK National Police people are a major user. I think a lot of your 911 services are users. A large chunk of the travel industry ... etc etc.

Something I'm well aware of, however, is that the big players seem woeful at supporting end users. I started out using Pr1me Information, and they supported end users very well. But I don't think the rest of the industry ever did, and when Pr1me went under about 1990 we started having real problems getting support - not because the companies didn't want to help us, but because they were geared up to selling to vars and didn't know HOW to help.

So the reason you probably don't see much Pick out there, is because it's sold as a package and you don't know what the underlying database actually is.

(As an aside, talking about speed, I said that ram isn't important ... I have personally experienced a 32-user 32MB system, SWAPPING LIKE MAD, and still providing adequate response times to users!!! It really is something to watch a Pick database provide decent service while struggling with inadequate capacity :-)

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 11, 2014 16:07 UTC (Tue) by Wol (subscriber, #4433) [Link] (37 responses)

A couple of other points I should add ...

Oracle took relational and marketed it heavily. It's got into most Computer departments in universities to such an extent that any other approach is considered heresy. Even getting people to LOOK at alternative approaches is hard work nowadays - it seems "relational" is a PHB tick-box that must be checked even if it's totally inappropriate for the job in hand :-(

On the other hand, Dick Pick (one of the original authors, and after whom the commercial version is named) was a control freak who had a habit of suing anyone who tried to compete with him. Despite the fact that the original system (called GIRLS - General Information Retrieval Language System iirc) was PD written for the DoD. So far too much effort was wasted in-fighting between competing vendors.

So we had relational - where the dominant vendor clearly did NOT own the basic design - and innovation could go unimpeded. Plus plenty of marketing dollars.

Versus Pick - where the dominant vendor was trying to claim the entire space for itself - and was spending far too many dollars on legal rather than marketing.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 11, 2014 16:51 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

My view is more from the FOSS side, so the *company* politics behind a lot of it isn't that interesting to me. I'm guessing the legal stuff is why FOSS Pick-like databases don't exist today[1] (are there any packaged in Debian?) and why I have yet to see an application support one as a backend? This is a FOSS-oriented website, so if I can't use a Pick-like data store without paying gobs of money…why would I care?

[1]I see that OpenQM is dual-licensed like MySQL (and has a fork ScarletDME, but that last saw activity 9 months ago), but it isn't packaged for Debian. Do you have any example FOSS applications which can use it as a backend?

Is this SQL databases or No-SQL?

Posted Mar 11, 2014 17:06 UTC (Tue) by Wol (subscriber, #4433) [Link]

MaVerick was a previous attempt to write an MV database that died - I started it in C, somebody else morphed it into Java, and it just faded away I think.

Be very wary of OpenQM - I don't think the copyright owners *get* the GPL. ScarletDME is still supported, I think, but yes it is very quiet.

If you don't want to use a "free as in beer" version, then yes I don't know what to suggest as a "free as in freedom" version. The problem is getting people (a) to even look at it, then (b) to realise that it is in fact a very good solution.

I think also Cache is beginning to steal Pick's thunder. While not a member of the Pick family it's actually very similar in concept, and one of the well known Pick guys has written a Pick compatibility layer that sits above the Cache data store. Apparently it's that close that you can port between Pick and Cache as easily as between two different Pick implementations (which is pretty close to the ease you can port between two SQL implementations - in theory a pushover, in practice plenty of sharp corners to bite you!)

(as an aside, how's this for a VALID statement in Pick BASIC:

REM: REM = REM( 6, 2); REM calculate the remainder of 6/2

That's four - yes *four* different syntactic meanings for the lexical symbol "REM" in one statement!!! That bit us in a port as different BASICs support different versions...)

Legally, there wouldn't be any hassle with a FOSS version today, I don't think, and I doubt it's why there aren't any FOSS versions. The problem is that most Pick applications were written by end users, so most Computer Science guys don't know much about it, and as I've said they mostly don't want to know, either :-(

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 11, 2014 22:58 UTC (Tue) by intgr (subscriber, #39733) [Link] (34 responses)

> Even getting people to LOOK at alternative approaches is hard work nowadays

Isn't NoSQL all the rage these days? I see ample opportunity right now for alternative database technologies to gain some mindshare. MongoDB, CouchDB, Riak and Redis all rode the NoSQL wave. Yet the only place where I've heard about Pick is the comments section on LWN. Sorry, I don't buy that explanation.

Is this SQL databases or No-SQL?

Posted Mar 12, 2014 12:57 UTC (Wed) by Wol (subscriber, #4433) [Link] (33 responses)

Bear in mind my comment elsewhere. Nearly all Pick apps were written by users - accounting systems were written by accountants, engineering systems were written by engineers, ...

Pick has never had good visibility in the Computer Science arena.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 20, 2014 11:32 UTC (Thu) by farnz (subscriber, #17727) [Link] (32 responses)

So, I'm curious, but not willing to spend money. Where would I go to learn about Pick, with a view to designing myself a toy application for a problem of the day? My usual Google and Wikipedia combo doesn't help.

Is this SQL databases or No-SQL?

Posted Mar 20, 2014 16:13 UTC (Thu) by Wol (subscriber, #4433) [Link] (31 responses)

Free as in beer?

http://www.rocketsoftware.com/products/rocket-universe/tr...

This is UniVerse, the version I know best. It's available as a freebie on both linux and windows.

Go to the U2 User Group - http://www.u2ug.net/ - for support. You'll find loads of friendly helpful people on the mailing list, including me. Just don't come over as some Indian Outsourcer looking for freebie training :-) but if you say you're playing with it there'll be plenty of advice.

There are other versions - Rocket own Unidata and D3, and I believe OpenQM is a freebie from their site, etc etc.

I look forward to seeing you on u2-users !!!

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 24, 2014 11:38 UTC (Mon) by farnz (subscriber, #17727) [Link] (30 responses)

Those links are rather off-putting - it seems that I'm expected to interact with the user community in order to find out whether I want to interact with the user community.

The UniVerse link doesn't point me to documentation, so I've got no clue whether it's something I already understand under a different name, or whether I'm facing a big learning exercise here. I also don't want to waste time installing a complex piece of software, only to find out that I'm not interested after all.

The user group link scares me, and I don't dare investigate too far given how unwelcome you've said I'll be - I will come over as an "Indian Outsourcer looking for freebie training", because I'm coming from precisely the same position; I know nothing about Pick and I don't want to invest a huge amount in learning about it. Further, just like the Indian Outsourcer, I know I'm not going to use Pick much in the near future; the only difference between me and them is that I'm not making money out of this, I'm acting on curiosity alone.

I'm after equivalents to YAHT, LYAH or RWH; something where I can get a feel for Pick without serious investment of time, and yet can go deeper if I'm hooked on it.

Is this SQL databases or No-SQL?

Posted Mar 24, 2014 21:48 UTC (Mon) by Wol (subscriber, #4433) [Link] (1 responses)

On that basis, the best place to point you is http://www.pickwiki.com/cgi-bin/wiki.pl

There's various interesting write-ups there, a mess of source code, etc. Unfortunately, it looks to me to have bitrotted a bit - there have been a couple of ports to new systems and also a bunch of external links have rotted - not surprising since the wiki has been around for ages.

A couple of interesting links from the wiki ...

http://www.pickwiki.com/cgi-bin/wiki.pl?KenSimms
http://www.pickwiki.com/cgi-bin/wiki.pl?MultiValuedDatabases
http://www.pickwiki.com/cgi-bin/wiki.pl?PickDataStructure
http://www.pickwiki.com/cgi-bin/wiki.pl?PhilosophyOfPick
http://www.pickwiki.com/cgi-bin/wiki.pl?Getting_Started
http://www.pickwiki.com/cgi-bin/wiki.pl?MVDefinition

Note that one of the reasons I am a Pick fan is that I think that relational theory is broken. Not that the theory is not mathematically correct, but that empirically it doesn't fit the real world. In particular, C&D's first rule *defines* data as coming in rows and columns. Ime, in the real world, this is the exception not the norm, and as a result most databases that use RDBMSs need to use a sledgehammer to force square pegs into round holes :-) Pick just says "what shape is your data? Here, let me store it for you!"

I hope these pages give you a flavour of things. Oh - and why would you come over "like an indian oursourcer"? Come over as yourself, someone who wants to have a play with Pick. The outsourcers usually come over very obviously as individuals who have been plonked in front of a system, have a job to do, and are clueless as to how to do it. If you come over as someone who's playing on their own dime with the system, that's a very different kettle of fish ... (and we're pretty helpful, even to those indian outsourcers - it's not their fault they're sat in front of an unfamiliar system. We do tend to suggest, however, that if they're being paid, they should hire in some paid expertise :-)

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 25, 2014 9:16 UTC (Tue) by farnz (subscriber, #17727) [Link]

I'd come over like an Indian outsourcer because I'd be asking the same questions they would - "pls, show me the codes, I need to see how to use Pick". If the community is unwelcoming to that, then I'm going to be unwelcome.

Plus, having just spent a couple of minutes trying to find the "post question" form, I can't actually work out how to post on the u2ug forum - there's probably a super sekrit sign up page I have to go through, but finding that's more effort than I'm willing to put in. FWIW, I clicked "Forums" (in the Contribute box), then "Welcome Forum". I could see a "latest threads" thing, but not how to create a new thread of my own.

I'm fast coming to the conclusion that Pick is not for me to learn about - it's too much of a closed shop, and unwelcoming to newcomers.

Is this SQL databases or No-SQL?

Posted Mar 24, 2014 22:11 UTC (Mon) by Wol (subscriber, #4433) [Link] (27 responses)

Just looked on the Rocket website for the documentation.

http://www.rocketsoftware.com/brand/rocket-u2/technical-d...

Be warned - there is an AWFUL LOT of it - it's basically the full docs for a pretty powerful database!

I'd start by reading the RETRIEVE guide. It's basically the equivalent of SQL - the query language, but it has a fair few examples. Note that it only gets data OUT of the database. To get it in there are various tools, from a basic EDitor, to fancy screen builders that read the dictionary, to programs that you write yourself. Note also that there is a SQL processor that can use sql for example update statements ...

Remember also that Pick is a complete system. You can use it as a database back-end and access it with your favourite front end tools, but you can also do pretty much everything from inside Pick if you wish - there is a full-blown programming language (DataBasic), data extraction tools, fancy green-screen terminal handling, etc etc.

Once you've read the RETRIEVE guide, look at the USER REFERENCE. It'll give you an overview of the basic system.

But imho, where Pick scores (did I say that RDBMSs don't match the real world?) is that a lot of data integrity comes as a side-effect of the way Pick views the world (a well-designed Pick database will store all the data relating to an object in a single "row", obviating the need for cascading deletes, for example ...) such that data retrieval is fast and efficient. Pick doesn't have query optimisers - it's a simple proof that any attempt at automated optimisation is almost certain to cost more than it saves. There's just no headroom for optimisation to make any improvement. If an entire object is stored as a "row", then accessing any one attribute will pull any related attributes into ram, and its "directed graph" nature means that any reference from one object to another will permit efficient retrieval of that other object.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 25, 2014 9:17 UTC (Tue) by farnz (subscriber, #17727) [Link] (26 responses)

I've now read the docs you've recommended, and I'm still not getting it. Pick looks like just another key-value store, slightly obfuscated by using a BASIC-like language as the primary language. The wiki pages you mentioned in the other comment are also not helping - they tell me that there are magic values 253 and 254, and that somehow, by using these magic numbers in combination with a Pick system, I'll get amazing results, because Ken Simms was an amazing man for his time.

Is there a decent tutorial out there, with a motivating problem, that shows me why I would be interested in Pick, rather than rolling my own thing with Python dictionaries or using the SQL systems I'm familiar with today?

Basically, the position I'm in right now is that you're doing a good job of convincing me that I've missed something important, but every time you provide links to back your position, I feel that I'm either being asked to invest significant time interacting with a community that I may have no interest in at all, or being told that no, I'm not missing something, Pick really is that uninteresting.

A good tutorial, showing me how to use Pick to construct an application that would be harder with SQL or using an equivalent of pickled Python dictionaries as my persistence format would resolve that - it would show me what it is about Pick that makes it special, as compared to SQL.

Is this SQL databases or No-SQL?

Posted Mar 25, 2014 20:20 UTC (Tue) by Wol (subscriber, #4433) [Link] (25 responses)

Showing you how to construct an application that is easier with Pick than SQL? Well, pretty much any application with complicated data - that is, any application where data comes as lists and bags rather than sets.

Actually, alter that slightly. Change "SQL" to "Relational/First Normal Form". You can use SQL to access Pick databases, and if you're familiar with that I'd say that's not a bad idea.

Relational forces you to put everything into flat one-dimensional arrays. That has a bunch of nasty side-effects which I'll come to later, but basically it seriously flouts Einstein's dictum of "make everything as simple as possible BUT NO SIMPLER"! If I use relational maths to define my Pick database, I take each object of interest, analyse its attributes, and put them in a "table". That magic 254 character you saw simply tells Pick "this is a column separator", so I can do one-dimensional arrays like relational. But the magic 253 character means I can do a two-dimensional array! Where a single object has multiple instances of an attribute, I can store a list of them! For example, the table "person" can store a list of children!

This has several *massive* advantages. Let's say I'm writing a genealogy database. In both relational and Pick, I obviously need a "person" table". But in Pick I do NOT need a "is child of" table. I do not need a "is spouse of" table. (Obviously, I can have a "my father is" and "my mother is" column in both.) Oops - I'm now realising this might not be the best example, in that parent->child is a one->many that can be done pretty well in relational, but Pick can do a many->many this way, which requires an extra table in First Normal Form.

The next big advantage is, should you for some reason want to store the parent->child relationship in the parent record, you get order information "for free". To do this in relational, you would need an extra table, with three columns, ie parent, child, sort-order. AND SORT-ORDER IS NOT DATA - IT'S METADATA! The *content* of that field is meaningless, what matters is its collation sequence. So effectively, First Normal Form forces us to store garbage data.

The third big advantage is that we can store all strongly-related data in a single "row" of a "table". There are actually more magic characters - 252 and 251 (even 250 and 249) that let us create an array of 3 or more dimensions. Coupled with the fact we need a primary key, that we can store an array of foreign keys in a row, and that disk access has been optimised so much that the database generally retrieves a row with - on average - just over *one* seek request, that makes data access blindingly fast.
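
As a rough illustration (field names invented, and this is only a sketch of
the byte-level layout), a record holding a multi-valued attribute really is
just one delimited string:

	/* Rough sketch of a Pick-style record: char 254 separates attributes
	 * (columns) and char 253 separates multiple values within one
	 * attribute. Field names are invented. */
	#include <stdio.h>

	#define AM "\xFE"	/* attribute mark, char 254 */
	#define VM "\xFD"	/* value mark, char 253 */

	int main(void)
	{
		/* One "row" for a person: a name, a multi-valued list of
		 * children, and a phone number, all in a single record.
		 * (Adjacent string literals are concatenated, so the escape
		 * sequences stay separate from the letters that follow.) */
		const char *record =
			"SMITH, JOHN"			AM
			"ANNE" VM "BRIAN" VM "CLAIRE"	AM
			"555-0100";

		printf("%s\n", record);	/* the whole record is one string */
		return 0;
	}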

So basically, the main point of Pick is that the API between the datastore and the application passes *objects* back and forth. This then makes life conceptually simple for both the database guy and the application guy. If the database guy has any sense (Pick doesn't enforce it) he will do a relational analysis on the object so each row in the table exactly corresponds to a relational *view* of the object. So the application guy can use SQL to query the database if he wants :-)

Basically, from the engineering viewpoint, Pick mimics the real world - an object in the datastore matches an object in the real world - it's easy to understand. Also, because it's engineering-based, I can run engineering proofs! I can PROVE that Pick is fast - blinding fast, in fact.

Because Relational explicitly forbids the application layer to know anything about the database internals, such engineering proofs are impossible. In fact, I would go so far as to say that it is easy to prove that relational is crippled in its very design! Because an RDBMS is *not* a *data* store. It is a MODEL store. It contains oodles of crap "data" that has been created as a side effect of converting to first normal form - that sort column for instance! Even worse, it then passes all that crap back to the application layer so that the application has to recreate the object of interest.

A challenge I've repeatedly thrown out is "show me how to *STORE* a list in a first normal form database". So far, *NOBODY* has even responded to the challenge (a couple of people did say they'd do it, but they never did). I did say that you're not allowed to model the list, and you're not allowed to mix data and metadata in the same table. That pretty much defines the challenge as being mathematically impossible :-)

The other thing to note is that Pick seems to be faster and easier to develop in. Forgive me for giving you yet more reading matter - this is a manager's blog in which she wonders why her Pick staff always seemed to be on time and under budget, while the relational guys always overspent and ran out of time. She became a big Pick fan as a result.

http://www.tincat-group.com/mewsings/2006/01/is-codd-dead...

And one more small bit of reading ... Cache not Pick but never mind ...

http://www.linkedin.com/groups/PR-InterSystems-Cache-capt...

Note how they said Oracle could only achieve the required 100K rows/sec by cheating and inserting blobs, while Cache did it without problems and even achieved 250K rows/sec in production. Did I say Pick was blindingly fast? (And yes, I know Cache isn't Pick, but it's similar from an engineering standpoint - you've probably seen me compare Oracle on a twin Xeon 800 versus Pick on a P90, where the Oracle system had difficulty keeping up ... !!!)

The only place, imho, where relational scores over Pick is that it doesn't have to answer the question "what is an object". This is, I freely admit, a very hard question. But, in dodging it, imho relational makes matters much worse. The real world comes as nouns, adjectives, and relations (we'll forget about verbs ...). Relational tries to model the real world solely in terms of adjectives and relations - and inevitably makes a pig's ear of it. You just can't describe the world in words of one syllable - it's too complicated for that.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 25, 2014 22:14 UTC (Tue) by peter-b (subscriber, #66996) [Link] (22 responses)

Hi Wol,

This is all quite interesting stuff -- I've been following the thread with interest. Have you considered writing an article series for LWN about Pick? Getting proper articles out to a wider audience, with some concrete examples, might be a great way of increasing awareness of the existence of and applications for Pick.

What do you think?

Is this SQL databases or No-SQL?

Posted Mar 26, 2014 14:56 UTC (Wed) by Wol (subscriber, #4433) [Link] (21 responses)

I'd love to. "Why pick Pick?", maybe :-)

As you've seen from the thread, however, there is no real "Libre Software" implementation to point people at :-( I've got all the plans in my head how to do it, it's pretty simple. All I need is a linear-hash file store, a DATABASIC compiler, and a databasic run-time engine. However, as a Chemist by training, my computer-science-fu has trouble with that compiler! (Copying an existing system is always easier than designing a new one :-)

Everything else seems pretty trivial to me and my experience as a programmer. Couple that with a simple "call out to C" mechanism, and a shout out to all my friendly Pick fanatics, and we could probably write the rest of the system in DATABASIC in fairly short order.

So yes, I'll try and write the articles, and see what Jon says. And I'll try and get a real Libre implementation going. MPL'd in all probability, partly for LibreOffice compatibility, but also because the *lack* of linking in the system rather defeats the (L)GPL.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 26, 2014 22:53 UTC (Wed) by nix (subscriber, #2304) [Link] (20 responses)

I have to say, Pick sounds interesting, but what it mostly sounds is deeply archaic. BASIC as its interface language? Magic characters to encode column separators? Ye gods.

Is this SQL databases or No-SQL?

Posted Mar 27, 2014 12:35 UTC (Thu) by Wol (subscriber, #4433) [Link] (19 responses)

There's nothing stopping you from using C++. Or Perl6. Or Python3. I think a fair few people do.

If you're using an RDBMS, most people use SQL to access it, do they not? And is not SQL just as archaic? Just because something's old, doesn't mean it's not very good at its job! (Actually, I'm inclined to agree with you here, I'd love to drag DATABASIC kicking and screaming into the 21st Century, but the fact is it's damn good at its job.)

As for magic characters, what does PostgreSQL use to separate its columns? Or MS SQL-Server? If we knew, I'm sure we'd be just as "Ye Gods" about it. And don't modern systems do the same, except they use even worse examples like tab and comma? I'm sure some people do know what RDBMSs use, it's just that Pick exposes its internals - and guarantees the API - which enables programmers to take advantage of an understanding of how the engine actually works. The alternative is things like Oracle where, after spending months empirically tuning your queries, they change the query optimisation engine and the query suddenly suffers an order of magnitude slowdown.

The other thing to remember is that original Pick was an OS - "the database is the computer". So it had to provide everything. Nowadays, if you treat it as just a database, you can use all the same tools to access it as you do to access PostgreSQL or SQL-Server, you just have all the power of the native tools AS WELL.

As an aside, I've said it often enough before, you do know Pick was the first commercial database available on linux? :-) When they moved away from being an OS, they ported it to Red Hat and sold it as D3 - you couldn't see linux underneath. A year later, Oracle started selling their database.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 27, 2014 13:41 UTC (Thu) by peter-b (subscriber, #66996) [Link] (1 responses)

Given that Pick uses magic characters to separate fields, how do you store Unicode text in a Pick database?

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 0:13 UTC (Fri) by Wol (subscriber, #4433) [Link]

Easy :-)

I've only ever used Ascii, but as far as I'm aware all modern implementations can support Unicode.

And all these magic characters have their own Unicode code points.

Bearing in mind all programs are supposed to (but often don't - it's not enforced :-) use system-defined equates, switching on Unicode support should not break any properly-written code.

Simples...

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 27, 2014 13:58 UTC (Thu) by intgr (subscriber, #39733) [Link] (16 responses)

> As for magic characters, what does PostgreSQL use to separate its columns?

It doesn't. Using magic characters often necessitates escaping/de-escaping, which has a cost in performance. And finding column N in a tuple requires you to walk through all preceding values in whole. Fixed-length columns always have a defined size, so they're known by position; variable-length values are prefixed with their length, which doesn't have these problems.

> it's just that Pick exposes its internals - and guarantees the API - which enables programmers to take advantage of an understanding of how the engine actually works

If I understand correctly, it seems that these delimiters are exposed to the coder, which invites misuse: are you sanitizing these values correctly in the input data? A failure may easily expose security bugs. How do you escape UTF-8 data, which can contain byte values 250-255? And it also paints the database into a corner such that it cannot change the storage format easily.

SQL has widely-known escaping problems too, but these days they're passed as parameters instead. I don't know how rigorous the use of abstractions is in Pick communities.

> The alternative is things like Oracle where, after spending months empirically tuning your queries, they change the query optimisation engine and the query suddenly suffers an order of magnitude slowdown.

You make lots of sweeping claims about performance. It's hard to argue without understanding the details of Pick, but from what I have heard I'm not impressed.

I haven't used Oracle much in particular, but I think query optimizers are a godsend based on my experience with PostgreSQL. Instead of needing to code the exact approach of how to retrieve the data, I just describe what I want and the database will figure out the best way to execute it. If the data distributions in the database change, queries will automatically adapt, rather than requiring coders to go back and revisit old queries.

While regressions and outliers do occur, they're a small minority and debugging them isn't difficult with some experience. If you have to spend "months" tuning queries then it seems you're trying to outsmart the optimizer to force it to do what you want, rather than helping it do its job and make good decisions.

And for every regression, there are tons of queries that get faster with each release, because new optimization techniques are implemented all the time.

Is this SQL databases or No-SQL?

Posted Mar 27, 2014 14:39 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

> How do you escape UTF-8 data, which can contain byte values 250-255?

Byte streams representing text encoded using ISO 8859-[n], KOI8-R, or JIS X 0208 can contain those byte values, but unless-and-until the decision of the relevant standardization bodies to formally restrict the Universal Character Set to the range [0, 0x10ffff] is retracted, the presence of byte values in the range 245-255 in a bytestream means that whatever you have, it most assuredly is not UTF-8-encoded text.

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 1:03 UTC (Fri) by Wol (subscriber, #4433) [Link] (9 responses)

> If I understand correctly, it seems that these delimiters are exposed to the coder, which invites misuse: are you sanitizing these values correctly in the input data?

Bang on - there are various safeguards in place that help prevent misuse, but at the end of the day it's down to the programmer to get it right.

> I haven't used Oracle much in particular, but I think query optimizers are a godsend based on my experience with PostgreSQL.

Let's look at what a relational optimiser has to do. Assuming crude unoptimised data access, EVERY row you look for means searching half the table on average!

Okay, if I'm only searching for a single row in Pick, it's quite likely I'll have to do the same. BUT!!!

Let's take the example of an invoice. Have you keyed it on the invoice number? I quite likely have. If I have, my mvDBMS can tell the OS where on the disk the data I am looking for is, with a 97% chance of being right. "Open file X, seek to position Y, read that block". If your invoice table is big (let's say 1Mb) you've got to read 1/2Mb to find it!

I've now got ALL the invoice data. Chances are (in fact it's pretty certain) your invoice data is plastered over several tables, so you are going to get further clobbered. Then what happens if you want to access the related customer details? Yet again, you probably have the data plastered over several tables and have to read half of each table to retrieve the data. My invoice record will contain the company key, and once again, I will be able to tell the OS exactly where to find it with a 97% chance of being right.

In other words, a raw unoptimised RDBMS data access is incredibly inefficient, and it gets worse as the database gets bigger. An efficiency of 1% is probably very good! So an optimiser doesn't have to be very good in order to earn its keep.

But for Pick? Let's take my worst case - assume I haven't keyed on invoice no. If I don't have an index I'll have to scan the entire table - so my worst case is equal to yours (if I'm being stupid!) But all I've got to do is declare an index on the invoice no ... and it takes one disk read (97% probability) to convert from invoice no to invoice key, and one more to retrieve the invoice.

(Okay, you can index your primary invoice table on invoice no too, but then you have to index all your subtables too, with all the overhead that involves.)

So. Your SQL optimiser can achieve a doubling of performance by increasing efficiency by a fraction of a percent.

My Pick optimiser (if I had one) can achieve a maximum of 1/3 of one percent increase in performance before it hits the unbreachable barrier of perfect efficiency. It's just not worth spending any effort searching for those sort of gains.

Oh - and there's another thing to take home from this. When making my claims about RDBMSs, I'm mostly speculating. You see, I'm *not* *allowed* to know what's going on under the covers, so I'm forced to guess. Because I know pretty much exactly what's going on under the covers of Pick, I can use logic to reason about it, and PROVE what's going on. I've got figures to work with. (Okay, the figures I have are the inverse of 97%. In a properly balanced Pick file, it takes on average 1.05 "get this block from disk" commands to score a direct hit on the target data. Oh, and rebalancing the file after a write (if that's necessary) incurs about 3 more writes. That's not that much overhead.)

And because every Pick record MUST have a key, most of the time I'm not searching for data, but targeting known records for which I have the key. (Plus I have far fewer tables to search to retrieve the same data.)

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 8:31 UTC (Fri) by cladisch (✭ supporter ✭, #50193) [Link]

> If I don't have an index I'll have to scan the entire table - so my worst case is equal to yours (if I'm being stupid!) But all I've got to do is declare an index on the invoice no ... and it takes one disk read (97% probability) to convert from invoice no to invoice key, and one more to retrieve the invoice.

Most relational databases (even SQLite) can store table rows ordered by their primary key (typically called "index-organized" or "clustered" tables), so no separate index lookup would be required.

> When making my claims about RDBMSs, I'm mostly speculating. You see, I'm *not* *allowed* to know what's going on under the covers, so I'm forced to guess.

Nobody forbids you from finding out what's going on under the covers. Most relational databases document low-level details (or even their file format) to allow manual optimizations.

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 10:28 UTC (Fri) by intgr (subscriber, #39733) [Link] (7 responses)

> Have you keyed it on the invoice number? I quite likely have. If I have, my mvDBMS can tell the OS where on the disk the data I am looking for is, with a 97% chance of being right. "Open file X, seek to position Y, read that block". If your invoice table is big (let's say 1Mb) you've got to read 1/2Mb to find it!

What the? How does Pick know the position in the file, where to look for that row? I assume that by "keying" you mean what RDBMSes call indices. Isn't it dishonest to compare indexed access in Pick against non-indexed access in an RDBMS? No reasonable person creates a table without indices when they expect row lookups to be fast.

> I've now got ALL the invoice data. Chances are (in fact it's pretty certain) your invoice data is plastered over several tables, so you are going to get further clobbered. Then what happens if you want to access the related customer details? Yet again, you probably have the data plastered over several tables and have to read half of each table to retrieve the data.

This is true, but you're only talking about OLTP performance, with the assumption that every access to an invoice needs to read all the invoice lines and vice versa. It's true that a system like Pick will be far more efficient than RDBMSes in this scenario, but in all the systems I work on, OLTP queries are not the bottleneck (unless you do crazy things like don't index your data properly). It's the more complex queries that kill you: find all customers that have item X on their invoice; give me a breakdown of the costs of all invoice lines by item type. That's where query optimizers really shine. All evidence I've seen suggests that Pick deals terribly with these.
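
To make the two query shapes concrete, here is a minimal, self-contained sketch (hypothetical table and column names; SQLite in memory via Python purely so it runs). On a real system the optimizer picks the join order, scan types and aggregation strategy on its own:

# Illustrative sketch of the two analytical queries described above.
# Table/column names are made up; any SQL database would work similarly.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE invoices (id INTEGER PRIMARY KEY, cust_id INTEGER REFERENCES customers(id));
CREATE TABLE invoice_lines (invoice_id INTEGER, item_type TEXT, price REAL);
INSERT INTO customers VALUES (1, 'ACME'), (2, 'Globex');
INSERT INTO invoices VALUES (10, 1), (11, 2);
INSERT INTO invoice_lines VALUES (10, 'X', 5.0), (10, 'Y', 2.5), (11, 'Y', 7.0);
""")

# "Find all customers that have item X on their invoice."
print(db.execute("""
    SELECT DISTINCT c.name
    FROM customers c
    JOIN invoices i ON i.cust_id = c.id
    JOIN invoice_lines l ON l.invoice_id = i.id
    WHERE l.item_type = 'X'
""").fetchall())

# "Give me a breakdown of the costs of all invoice lines by item type."
print(db.execute("SELECT item_type, SUM(price) FROM invoice_lines GROUP BY item_type").fetchall())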

Even in development, I frequently whip up multi-line queries to look for anomalies in the data. It often just blows me away how these queries complete in seconds, despite churning through gigabytes of data and without any advance planning to make them fast.

In other words: you're worrying about optimizing something that's already more than fast enough. I worry about optimizing the things that are hard to optimize.

----

Also, PostgreSQL has a powerful type system, including composite types, arrays, mapping types, JSON, etc. You *can* organize data like you do in Pick if you want to. But there are real advantages to normalizing (with exceptions of course).

> When making my claims about RDBMSs, I'm mostly speculating. You see, I'm *not* *allowed* to know what's going on under the covers, so I'm forced to guess

That may be true for Oracle, but the page layout of PostgreSQL is documented. http://www.postgresql.org/docs/current/static/storage-pag... The difference from Pick is that the storage layout is not part of the API, but just an implementation detail. If the developers need to change it in a future version, for performance or other reasons, they can.

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 19:33 UTC (Fri) by Wol (subscriber, #4433) [Link] (6 responses)

> What the? How does Pick know the position in the file, where to look for that row? I assume that by "keying" you mean what RDBMSes call indices. Isn't it dishonest to compare indexed access in Pick against non-indexed access in an RDBMS? No reasonable person creates a table without indices when they expect row lookups to be fast.

Urmmm. No.

What I mean by "keying" is sort-of what an RDBMS would call a primary key. In Pick, a primary key is mandatory. And that key contains all the information required for Pick to go straight to the place on disk where that row is stored. Pick files use linear hashing, so basically it's a modern hashed file. And unlike in a traditional hashed file, it only requires rewriting about three disk blocks to change the modulo. So it's FAST.

(So basically, the difference between a key and an index may be minor but it's crucial. An index tells you the key, while a key directly tells you where the data is. And again, Pick gives you rope. Screw up your primary keys and you screw up the hashing, which cripples performance. But in modern implementations that is HARD.)
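
A heavily simplified sketch of the idea (plain Python; not real linear hashing, which also splits buckets incrementally as the file grows): the primary key alone determines which block to read, so a lookup is normally a single seek.

# Toy illustration of hashed-file lookup: the key maps directly to a block.
# Real Pick files use linear hashing and rebalance as the modulo changes.
BLOCK_SIZE = 4096

def block_for_key(key: str, modulo: int) -> int:
    return sum(key.encode()) % modulo        # hash the primary key into a bucket

def read_block(path: str, block_no: int) -> bytes:
    with open(path, "rb") as f:
        f.seek(block_no * BLOCK_SIZE)        # "open file X, seek to position Y"
        return f.read(BLOCK_SIZE)            # "... read that block"

# Lookup: no index traversal, just hash the key and read the bucket it lands in.
# data = read_block("INVOICES.dat", block_for_key("INV-10042", modulo=997))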

> This is true, but you're only talking about OLTP performance, with the assumption that every access to an invoice needs to read all the invoice lines and vice versa. It's true that a system like Pick will be far more efficient RDBMSes in this scenario, but in the all the systems I work on, OLTP queries are not the bottleneck (unless you do crazy things like don't index your data properly). It's the more complex queries that kill you: find all customers that have item X on their invoice; give me a breakdown of the costs of all invoice lines by item type. That's where query optimizers really shine. All evidence I've seen suggests that Pick deals terribly with these.

Erm - no. Assuming I've understood what you want, it's a sequence of commands in Pick (and no, a Pick SELECT and a SQL SELECT are not equivalent).

SELECT INVOICES WHERE ITEM-TYPE EQ "X". If I've got an index on X, that's a single request to disk. That gives me a list of all invoices.
SELECT INVOICES SAVING CUST-ID. Okay, I have to go down my list of invoices but, assuming they're scattered randomly on disk, that's 105 disk seeks per 100 invoices.
Okay, I can't remember the syntax for the next command, but it's something like
SELECT INVOICES WITH CUST-ID IN CURRENT SELECTLIST. If we have an index on CUST-ID, once again that's 105 seeks per 100 customers.

I've now got a list of invoices to report on ...

LIST INVOICES BY.EXP ITEM-TYPE TOTAL PRICE

Note that absolutely *nowhere* have I had to search for any information whatsoever. The user provided the initial "X", and every single data access from then on has been keyed - with a near perfect disk hit rate. I've had to throw away about 5% of my disk seeks because they failed to hit their target.

There are two places you might score on me. Sorting and collating the final report is one. But if you can't fit all the invoice details in ram I think I'm going to fly past you. I don't know the internals of the sort, but I suspect the systems I'm used to scan through each invoice in turn, creating an (invoice, line, price) tuple, and then sort that. They will then scan through that sorted list building up the report as they go. So they have to read each invoice at least twice, and once more for each extra relevant line.

But if it all fits in ram, the OS disk caching will help me, just as much as being able to store the entire report in ram will help you.

The other place you will score is indices. Your optimiser will spot it needs to scan INVOICES multiple times, and it will build indices on the fly if they don't exist. Pick assumes the user knows what s/he is doing, and gives you the rope to hang yourself.

Plus, you say "that's only relevant for OLTP" - erm no there too. Okay, it's only Monte Carlo probability, but in pretty much ALL scenarios the cost of retrieving that extra data is minimal, while the chance of it being required is relatively high. So much so that it's well worth it. If the system has been designed by a "practitioner in the art" (in this case an accountant), they will have subconsciously skewed the definition of an object such that this is pretty much inevitable.

As an aside, it was mentioned way back in the thread that PostgreSQL often requires gigabytes of temporary storage. I just cannot CONCEIVE of Pick needing that much, even if handling those sort of quantities of data!

> But there are real advantages to normalizing (with exceptions of course).

I couldn't agree more !!! But not in storing the data in First Normal Form. In fact, if a Pick developer did not normalise the data ON PAPER before designing his Pick schema, I would consider him negligent!

> Even in development, I frequently whip up multi-line queries to look for anomalies in the data. It often just blows me away how these queries complete in seconds, despite churning through gigabytes of data and without any advance planning to make them fast.

> In other words: you're worrying about optimizing something that's already more than fast enough. I worry about optimizing the things that are hard to optimize.

To quote from the Pick FAQ - "SQL optimises the easy task of finding things in memory, Pick optimises the hard task of getting them from disk into memory".

And to repeat my Xeon 800 war story - this Oracle system was designed to replace the old Pick system. It took Oracle Consultants some SIX MONTHS of hard work before it could run that query faster than the Pentium 90 system it was replacing ...

> That may be true for Oracle, but the page layout of PostgreSQL is documented. http://www.postgresql.org/docs/current/static/storage-pag... The difference from Pick is that the storage layout is not part of the API, but just an implementation detail. If the developers need to change it in a future version, for performance or other reasons, they can.

My point is that you're not supposed to know that information, not that you can't have insider knowledge. And if developers do need to change it, you have no recourse when all your finely tuned racing queries suddenly become bogged down in treacle. It's not their fault if your system is suddenly unusably slow and you have six months work (or more) retuning the system to get response times back to acceptable. Okay, that's unlikely to happen, but Oracle customers complain about it often enough. It just can't happen with Pick.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 29, 2014 0:11 UTC (Sat) by Wol (subscriber, #4433) [Link]

Replying to myself, let's expand a little on the following ...

> but in pretty much ALL scenarios the cost of retrieving that extra data is minimal, while the chance of it being required is relatively high.

Let's assume we've got 1000 invoices, and we're only reporting on 100 of them - 1 in ten. Let's also assume that we've got 10 invoices per block.

Using schoolboy howler statistics, we want 1 invoice in 10, there are 10 invoices per block, so we are going to have to read every block and we will only use 1 of the invoices in it. So we will get absolutely no benefit in the form of "accidentally" reading in another invoice row. The only possible benefit we will get is if we want multiple columns in the same row. (My howler may have simplified the maths, but doesn't affect the truth of the supposition :-)

The smaller the proportion of rows you want to access becomes, relative to the table size, the starker this effect becomes. So, by storing all the invoice data in a single row, a Pick database will benefit from this effect far more than an RDBMS, and the larger the database becomes, the more significant this benefit becomes. Because in everyday use it is rare to want to scan most of the database - you will only be interested in a few rows.
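
For what it's worth, the exact expectation behind the "schoolboy howler" is easy to compute (a back-of-the-envelope sketch in Python, using the same 1-in-10 selectivity and 10 rows per block): most blocks still get read, and almost everything read from each block goes unused.

# Expected cost of picking 10% of rows at random from blocks of 10 rows.
p, k = 0.10, 10

blocks_touched = 1 - (1 - p) ** k            # P(a block holds at least one wanted row)
rows_used_per_touched_block = p * k / blocks_touched

print(f"{blocks_touched:.0%} of blocks read")                            # ~65%
print(f"{rows_used_per_touched_block:.2f} wanted rows per block read")   # ~1.5 out of 10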

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Apr 10, 2014 10:02 UTC (Thu) by nix (subscriber, #2304) [Link] (4 responses)

> The other place you will score is indices. Your optimiser will spot it needs to scan INVOICES multiple times, and it will build indices on the fly if they don't exist.

You need to use some actual databases, rather than just reading Codd and assuming everyone implemented what he suggested. While doing that (incrementally, no less) would be really cool, nobody ever does that and I don't know of any database system that implements it. Indices are explicitly-constructed entities.

Yes, this does mean that in a lot of databases you end up with a rule that on particular large tables all queries must use an index, and often the database can even enforce that rule. But... you can have more than one index, allowing for more forms of lookup than just primary-keyed, and you can still have totally ad-hoc queries on smaller tables.

As far as I can see, Pick throws all that flexibility away. I can easily see myself having to rethink half an application because I want to add one query, since oops now I need to change what the primary key is! This seems very far from ideal.

(Disclaimer: I work for Oracle now, but in the period of my life when I did things with databases, I didn't.)

Is this SQL databases or No-SQL?

Posted Apr 10, 2014 20:14 UTC (Thu) by cladisch (✭ supporter ✭, #50193) [Link] (3 responses)

> > […] it will build indices on the fly if they don't exist.

> I don't know of any database system that implements it.

As shown in this example, SQLite is happy to create a temporary index if it is estimated to be faster overall.
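
For anyone who wants to see this for themselves, a minimal sketch (hypothetical tables; SQLite in memory via Python): depending on the planner's cost estimates, the unindexed join below will typically show up in the query plan as using an automatic index built just for this statement.

# Ask SQLite how it would execute an ad-hoc join on unindexed columns.
# With enough rows the plan usually reports an automatic (covering) index;
# with tiny tables the planner may prefer a plain scan instead.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE a (x INTEGER)")
db.execute("CREATE TABLE b (x INTEGER)")
db.executemany("INSERT INTO a VALUES (?)", ((i,) for i in range(10000)))
db.executemany("INSERT INTO b VALUES (?)", ((i,) for i in range(10000)))

for row in db.execute("EXPLAIN QUERY PLAN SELECT * FROM a JOIN b ON a.x = b.x"):
    print(row)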

Is this SQL databases or No-SQL?

Posted Apr 15, 2014 14:45 UTC (Tue) by nix (subscriber, #2304) [Link] (2 responses)

Neat!

I did some digging in response: other, bigger RDBMSes (including PostgreSQL) will also create temporary indexes in temporary tablespace if needed. They're not visible to the database user, though, and are thrown away after the query terminates.

Is this SQL databases or No-SQL?

Posted Apr 15, 2014 15:27 UTC (Tue) by intgr (subscriber, #39733) [Link] (1 responses)

> bigger RDBMSes (including PostgreSQL) will also create temporary indexes

I think you are misunderstanding. The "temporary indexes" you speak of probably refer to explicitly created indexes on CREATE TEMPORARY TABLE tables issued by the client. I guess you were reading this: http://www.postgresql.org/docs/current/static/runtime-con...

PostgreSQL can create temporary in-memory hash tables, do temporary sorts (such as for merge joins or ORDER BY) and mid-query materialization of intermediate results. Sorts and materialized results can spill out to disk when there isn't enough memory to keep them, but they aren't called "temporary indexes" since they don't resemble actual index data structures.

Is this SQL databases or No-SQL?

Posted Apr 20, 2014 18:27 UTC (Sun) by nix (subscriber, #2304) [Link]

Yes, but those appear to be what wol is referring to as 'indexes' (they index the data, after all). They're not SQL-level entities, or user-visible, of course.

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 1:21 UTC (Fri) by Wol (subscriber, #4433) [Link] (3 responses)

>> As for magic characters, what does PostgreSQL use to separate its columns?

> It doesn't. Using magic characters often necessiates escaping/de-escaping, which has a cost in performance. And finding column N in a tuple requires you to walk through all preceding values in whole. Fixed-length columns always have defined size so they're known by position, variable-length values are prefixed with their length, which doesn't have these problems.

Then how does PostgreSQL know where the data is stored? How does it know how to find column N of row R? You haven't told me what PostgreSQL actually does. Reading between the lines, it seems that every column is stored separately, so reading a row is horribly inefficient ... but I really don't have a clue ...

And yes, Pick does have to search all preceding columns to find column N (the first time. It can be optimised fairly easily by caching). It seems to work very efficiently :-)

> And it also paints the database into a corner such that it cannot change the storage format easily.

So how does PostgreSQL do it? As an application guy, I'm not supposed to know. Certainly for older RDBMSs, adding a new column was a painful process. It's never been a problem for Pick. Or is that not what you meant?

Because internally to the database, a row is just one big string. And it's worked great for 45 years. And it is the same technique as used by SQLite, too, so I understand.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 11:55 UTC (Fri) by intgr (subscriber, #39733) [Link] (2 responses)

You make all these sweeping claims about relational databases. But how much experience do you actually have with them?

> You haven't told me what PostgreSQL actually does. Reading between the lines, it seems that every column is stored separately

I tried, but failed. No, all the fields in a single row are stored together. The difference from Pick is that there are no field terminators. Instead, variable-length fields are prefixed with their length. If it's not the column you're looking for, the database reads the length value of the field and skips that number of bytes to find the next column. Fixed-length fields like integer/float/date/etc. do not need to be prefixed with their length since their offset is known just by their data type.
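
As an illustration of the difference (a toy encoding in Python - not PostgreSQL's actual on-disk format), with a length prefix you hop over a field in one step instead of scanning every byte for a terminator:

# Toy length-prefixed row: each field is a 2-byte length followed by its bytes.
# Purely illustrative; real formats (PostgreSQL, SQLite) differ in the details.
import struct

def encode_row(fields):
    return b"".join(struct.pack("<H", len(f)) + f for f in fields)

def get_column(row: bytes, n: int) -> bytes:
    pos = 0
    for _ in range(n):                       # skip n fields without reading their contents
        (length,) = struct.unpack_from("<H", row, pos)
        pos += 2 + length
    (length,) = struct.unpack_from("<H", row, pos)
    return row[pos + 2 : pos + 2 + length]

row = encode_row([b"INV-10042", b"ACME", b"2014-03-28"])
print(get_column(row, 2))                    # b'2014-03-28'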

> > And it also paints the database into a corner such that it cannot change the storage format easily.
> Or is that not what you meant?

What I meant is, if the database software exposes its on-disk format directly to the clients, then it cannot easily change the format in future versions. They may want to change the format for efficiency or other considerations. Of course, as always, it has its advantages, such as fewer transformations needed before sending data to the client.

> Certainly for older RDBMSs, adding a new column was a painful process

Adding a column that defaults to NULL (i.e. no value) is quick in PostgreSQL and many other databases too.

> Because internally to the database, a row is just one big string.

You can call any piece of data "just a string". That doesn't give you any insight about how it's accessed.

> And it is the same technique as used by SQLite, too, so I understand

SQLite uses length-prefixed fields too, not terminator bytes.

Is this SQL databases or No-SQL?

Posted Mar 28, 2014 23:52 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> You make all these sweeping claims about relational databases. But how much experience do you actually have with them?

I've used them. I wasn't impressed. Where I worked, management decided to migrate from Pick to SQL-Server. Okay, my knowledge of SQL might not be good, but when it takes me maybe seconds to run a dump from my NF2 table, and it then takes SQL-Server hours to run the script to create the matching SQL tables ... when I do a "dir" (it was on Windows) to find out how much space my Pick tables were taking up, and finding out that the SQL data store was nigh on an order of magnitude larger ... and we had consultants who could easily have helped me speed things up but I got the impression they were happy with that sort of speed - WTF!

And I'm interested in the theory of relational systems. You know I've repeatedly said "you're not supposed to know how the RDBMS stores data"? And it's been repeatedly pointed out that we know how PostgreSQL does it? But that requirement is spec'd in the definition of a generic RDBMS. That's what really annoys me - the spec says "users aren't allowed to know how the RDBMS does it, to permit the RDBMS developers to find a faster way". As an *engineer*, I can prove that Pick does it the fastest way possible so why on earth would I want to allow space for the discovery of a faster way!?!?!?

Likewise, First Normal Form requires that everything is *forced* into sets. The poor business analyst is forced to use a sledgehammer to smash square lists and rhomboid bags into round sets! As an *engineer* I can prove this overcomplicates things, and means that an RDBMS is irreparably inefficient.

> I tried, but failed. No, all the fields in a single row are stored together. The difference from Pick is that there are no field terminators. Instead, variable-length fields are prefixed with their length. If it's not the column you're looking for, the database reads the length value of the field and skips that number of bytes to find the next column. Fixed-length fields like integer/float/date/etc, do not need to be prefixed with their length since their offset is known just by their data type.

Ah. But in practice, Pick is just as fast. When looking for column N

int searchcol = N;
const char *p = row;                      /* start of the stored record */
while (searchcol != 0 && *p != '\0') {    /* stop at column N or end of string */
    if (*p == COLUMN_MARKER) searchcol--;
    p++;                                  /* get next char */
}

Okay, that's pseudocode but in practice it's just as fast as

size_t currentpos = 0;
for (int column = 0; column < N; column++) {
    switch (columntype[column]) {         /* per-column type from the catalogue */
    case INTEGER:
        currentpos += 4;
        break;
    case DATE:
        currentpos += ...;                /* size of a date field */
        break;
    /* ... and so on for the other fixed-width and length-prefixed types ... */
    }
}

(or maybe even faster! How long does that switch take to run for each column?)

Again, Pick gives you plenty of rope to hang yourself with - if you're daft enough to declare column 1000 (and I've known people do it) you're going to shoot yourself in the foot :-)

> Adding a column that defaults to NULL (i.e. no value) is quick PostgreSQL and many other databases too.

:-) Which is why I restricted my comment to older RDBMSs. I know newer ones have cracked it. Chances are, however, that adding a field with a default value that has to be written into every single row is pretty quick in Pick. If you're sensible about it (and it's easy to force in Pick), it would probably be a case of "read block 1, write block 1, read block 2, write block 2, ..., read block n, write block n". How can you do it faster than that?

> You can call any piece of data "just a string". That doesn't give you any insight about how it's accessed.

But as far as the Pick data store (the DBMS itself) is concerned, why should the DBMS access the data at all? As far as the DBMS itself is concerned, the row is just a blob.

Part of the trouble, I think, is that an RDBMS internally contains several layers. There's the data store. Then there's the indexing and metadata management layer. Then there's the table management layer. Then there's the view management layer. And only at this point do we have the First Normal Form API through which all access to the database is supposed to occur. With Pick, so much more of this is exposed to the application (and, imho, the RDBMS FNF API is like a lady's corset that forces data into an unnatural shape to get through it :-)

The only reason a Pick mvDBMS has to concern itself with the layout of the data is in the index and metadata management layer.

> SQLite uses length-prefixed fields too, not terminator bytes.

Ah. But apparently (so I've heard) they're all strings. Not that I've ever seen it, but I have heard that one of its failure modes is putting text data into numeric fields ... That was what I meant - that apparently all data in SQLite is stored as text.

Cheers,
Wol

Is this SQL databases or No-SQL?

Posted Apr 9, 2014 22:05 UTC (Wed) by kleptog (subscriber, #1183) [Link]

Your comments about Pick are interesting and I'm trying to follow you, but comments like this:

> I know newer ones have cracked it. Chances are, however, adding a field that has a default value that needs updating every single row is pretty quick in Pick. If you're sensible about it (and it's easy to force in Pick), it would probably be a case of "read block 1, write block 1, read block 2, write block 2, ..., read block n, write block n". How can you do it faster than that?

Make me just go "wow!". Obviously adding a column doesn't require rewriting the table; the database simply records that a column has been added, and whenever a row is read that is missing that column it fills in the default value.
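
A minimal sketch of what that looks like (plain Python, purely illustrative): old rows keep their old shape on disk, and the reader pads any missing trailing columns from the catalogue's defaults.

# Illustrative only: adding a column never rewrites existing rows;
# the reader fills in the default for trailing columns a row lacks.
schema = [("id", None), ("name", None)]        # (column, default) pairs
stored_rows = [(1, "ACME"), (2, "Globex")]     # written before the new column existed

schema.append(("country", "unknown"))          # e.g. ADD COLUMN country DEFAULT 'unknown'

def read_row(row):
    missing = [default for _, default in schema[len(row):]]
    return tuple(row) + tuple(missing)

print([read_row(r) for r in stored_rows])      # country appears, filled in on the fly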

You seem to be quick to tar all RDBMSes with a single brush. No modern database system implements strict SQL. Indexes are not part of SQL, think about that. Yet every RDBMS has them. You keep talking about C&D's 1st rule, but no database actually requires that. PostgreSQL has supported arrays since the beginning and key/value fields for a long time. Recent releases allow you to store and index JSON documents. Which seems to me to give you all the performance of Pick with the flexibility of an RDBMS.

RDBMS builders are engineers too.

But finally, the bit that worries me most about Pick is that the entire pickwiki doesn't mention transactions at all. One of the fundamental features of databases and it's not even mentioned. You mention Pick uses hashes, but transaction-safe hashes have historically been very difficult, so the combination makes me think Pick doesn't do it at all.

And don't say "but when you store complete documents you don't need transactions". In the real world you need to be able to read a document, make changes to it and write it back knowing no-one else changed it in the meantime. You need to be able to perform complex queries without worrying you're getting an inconsistent state. Systems that don't need transactions tend to be trivial systems.
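
To make that concrete, a minimal sketch of the read-modify-write pattern (hypothetical table; SQLite via Python purely so it runs): the transaction plus version check is what guarantees nobody else changed the document between the read and the write-back.

# Illustrative read-modify-write with a version check (optimistic concurrency).
# The same pattern works on any SQL database that supports transactions.
import sqlite3

db = sqlite3.connect(":memory:", isolation_level=None)   # manage transactions explicitly
db.execute("CREATE TABLE doc (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
db.execute("INSERT INTO doc VALUES (1, 'draft', 1)")

body, version = db.execute("SELECT body, version FROM doc WHERE id = 1").fetchone()

db.execute("BEGIN IMMEDIATE")                             # take the write lock
changed = db.execute(
    "UPDATE doc SET body = ?, version = version + 1 WHERE id = 1 AND version = ?",
    (body + " (edited)", version),
).rowcount
db.execute("COMMIT")

if changed == 0:
    print("someone else changed the document since we read it; retry")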

Hopefully I'm wrong though...

Is this SQL databases or No-SQL?

Posted Apr 10, 2014 9:56 UTC (Thu) by nix (subscriber, #2304) [Link]

Well, sometimes what he says is true. I did spend months -- actually, most of a year -- tuning a large database-backed Oracle system for better performance. About 20% of that was DML and even some DDL rewriting: perhaps twenty out of a million-plus queries in this system had bad performance. This was mostly down to places where the programmer had tried to be 'clever', Pick-style, and was denying the optimizer a chance to work (e.g. one place was repeatedly reissuing similar queries twelve times over rather than using an intermediate table and a suitable join to do all the work at once). Rewriting such queries to let the optimizer chew at them fixed about 80% of the speed problems we were seeing, speeding up the system by a factor of perhaps a hundred to a thousand under the relevant problematic workloads.

(The rest of the sloth? Maybe I could have fixed it with query rewriting, but with no low-hanging fruit it would have meant rewriting hundreds or thousands of queries. I chose to add application-side caches instead. All the speedup, none of the pain.)

Is this SQL databases or No-SQL?

Posted Mar 26, 2014 10:25 UTC (Wed) by farnz (subscriber, #17727) [Link] (1 responses)

Thank you! This is what I wanted - and the first of your two links is actually helpful as well (it describes a problem domain that's challenging in RDBMS tools, but easy in LDAP and Pick).

So, summarizing for me; the benefit of Pick is that it provides an LDAP-like database (different normalization to Codd's normal forms) but with a flexible schema and a better query language that's based around objects.

Is this SQL databases or No-SQL?

Posted Mar 26, 2014 14:39 UTC (Wed) by Wol (subscriber, #4433) [Link]

If you say so :-)

I don't think that was the intention, but yes, it works.

As I see it, Pick was designed by engineers, Relational by mathematicians. And you may well have noticed that I praise relational theory, but damn relational databases. Because the database imposes requirements (C&D's 1st rule, primarily) that were done simply to make the problems solvable. And cripple the DBMS in the process.

Pick being far more flexible, can take advantage of all the advances in theory, without being tied down by the maths.

If you do play with Pick, though, BE CAREFUL! It's the C of the database world - it gives you plenty of rope and will happily let you hang yourself with it. Do an EAR analysis, and then as I said declare each object type as a FILE (or "table") and store the object view in it. But Pick will happily let you split it across several files should there be sound engineering reasons ... :-)

Cheers,
Wol

A discussion between database and kernel developers

Posted Mar 13, 2014 20:06 UTC (Thu) by philipsbd (subscriber, #33789) [Link]

What I would like to see is a simple library that can be used for write ahead logs in databases. The current default is to sync on every log append. I know it doesn't solve all of the use cases for PG here but it might be a good start to create a library for these basic patterns to define the APIs and then work to optimize them with the latest kernel features. I would certainly use it on etcd. See you at the Collab Summit.
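
For illustration, a minimal sketch of the pattern such a library would wrap (Python; hypothetical file layout): append a record to the log, then fsync before acknowledging it as durable - exactly the per-append sync cost that a shared library could batch (group commit) or optimize with newer kernel interfaces.

# Minimal write-ahead-log append: length + checksum header, payload, fsync.
import os, struct, zlib

class Wal:
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def append(self, record):
        header = struct.pack("<II", len(record), zlib.crc32(record))
        os.write(self.fd, header + record)   # append the framed record
        os.fsync(self.fd)                    # the per-append sync being discussed

    def close(self):
        os.fsync(self.fd)
        os.close(self.fd)

wal = Wal("example.wal")
wal.append(b"INSERT something")
wal.close()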

