Message ID | 20240914-mgtime-v8-1-5bd872330bed@kernel.org |
---|---|
State | Not Applicable |
Headers | show |
Series | fs: multigrain timestamp redux | expand |
On Sat, Sep 14, 2024 at 10:07 AM Jeff Layton <jlayton@kernel.org> wrote: > > For multigrain timestamps, we must keep track of the latest timestamp > that has ever been handed out, and never hand out a coarse time below > that value. > > Add a static singleton atomic64_t into timekeeper.c that we can use to > keep track of the latest fine-grained time ever handed out. This is > tracked as a monotonic ktime_t value to ensure that it isn't affected by > clock jumps. > > Add two new public interfaces: > > - ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the > coarse-grained clock and the floor time > > - ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries > to swap it into the floor. A timespec64 is filled with the result. > > Since the floor is global, we take great pains to avoid updating it > unless it's absolutely necessary. If we do the cmpxchg and find that the > value has been updated since we fetched it, then we discard the > fine-grained time that was fetched in favor of the recent update. > > To maximize the window of this occurring when multiple tasks are racing > to update the floor, ktime_get_coarse_real_ts64_mg returns a cookie > value that represents the state of the floor tracking word, and > ktime_get_real_ts64_mg accepts a cookie value that it uses as the "old" > value when calling cmpxchg(). This last bit seems out of date. > --- > include/linux/timekeeping.h | 4 +++ > kernel/time/timekeeping.c | 82 +++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 86 insertions(+) > > diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h > index fc12a9ba2c88..7aa85246c183 100644 > --- a/include/linux/timekeeping.h > +++ b/include/linux/timekeeping.h > @@ -45,6 +45,10 @@ extern void ktime_get_real_ts64(struct timespec64 *tv); > extern void ktime_get_coarse_ts64(struct timespec64 *ts); > extern void ktime_get_coarse_real_ts64(struct timespec64 *ts); > > +/* Multigrain timestamp interfaces */ > +extern void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts); > +extern void ktime_get_real_ts64_mg(struct timespec64 *ts); > + > void getboottime64(struct timespec64 *ts); > > /* > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c > index 5391e4167d60..16937242b904 100644 > --- a/kernel/time/timekeeping.c > +++ b/kernel/time/timekeeping.c > @@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw ____cacheline_aligned = { > .base[1] = FAST_TK_INIT, > }; > > +/* > + * This represents the latest fine-grained time that we have handed out as a > + * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the > + * realtime clock on an as-needed basis. > + */ > +static __cacheline_aligned_in_smp atomic64_t mg_floor; > + > static inline void tk_normalize_xtime(struct timekeeper *tk) > { > while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) { > @@ -2394,6 +2401,81 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts) > } > EXPORT_SYMBOL(ktime_get_coarse_real_ts64); > > +/** > + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor > + * @ts: timespec64 to be filled > + * > + * Adjust floor to realtime and compare it to the coarse time. Fill > + * @ts with the latest one. Note that this is a filesystem-specific > + * interface and should be avoided outside of that context. > + */ > +void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts) > +{ > + struct timekeeper *tk = &tk_core.timekeeper; > + u64 floor = atomic64_read(&mg_floor); > + ktime_t f_real, offset, coarse; > + unsigned int seq; > + > + WARN_ON(timekeeping_suspended); > + > + do { > + seq = read_seqcount_begin(&tk_core.seq); > + *ts = tk_xtime(tk); > + offset = *offsets[TK_OFFS_REAL]; > + } while (read_seqcount_retry(&tk_core.seq, seq)); > + > + coarse = timespec64_to_ktime(*ts); > + f_real = ktime_add(floor, offset); > + if (ktime_after(f_real, coarse)) > + *ts = ktime_to_timespec64(f_real); > +} > +EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg); > + > +/** > + * ktime_get_real_ts64_mg - attempt to update floor value and return result > + * @ts: pointer to the timespec to be set > + * > + * Get a current monotonic fine-grained time value and attempt to swap > + * it into the floor. @ts will be filled with the resulting floor value, > + * regardless of the outcome of the swap. Note that this is a filesystem > + * specific interface and should be avoided outside of that context. > + */ > +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) Still passing a cookie. It doesn't match the header definition, so I'm surprised this builds. > +{ > + struct timekeeper *tk = &tk_core.timekeeper; > + ktime_t old = atomic64_read(&mg_floor); > + ktime_t offset, mono; > + unsigned int seq; > + u64 nsecs; > + > + WARN_ON(timekeeping_suspended); > + > + do { > + seq = read_seqcount_begin(&tk_core.seq); > + > + ts->tv_sec = tk->xtime_sec; > + mono = tk->tkr_mono.base; > + nsecs = timekeeping_get_ns(&tk->tkr_mono); > + offset = *offsets[TK_OFFS_REAL]; > + } while (read_seqcount_retry(&tk_core.seq, seq)); > + > + mono = ktime_add_ns(mono, nsecs); > + > + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { > + ts->tv_nsec = 0; > + timespec64_add_ns(ts, nsecs); > + } else { > + /* > + * Something has changed mg_floor since "old" was > + * fetched. "old" has now been updated with the > + * current value of mg_floor, so use that to return > + * the current coarse floor value. > + */ > + *ts = ktime_to_timespec64(ktime_add(old, offset)); > + } > +} > +EXPORT_SYMBOL_GPL(ktime_get_real_ts64_mg); Other than those issues, I'm ok with it. Thanks again for working through my concerns! Since I'm traveling for LPC soon, to save the next cycle, once the fixes above are sorted: Acked-by: John Stultz <jstultz@google.com> thanks -john
On Sat, 2024-09-14 at 13:10 -0700, John Stultz wrote: > On Sat, Sep 14, 2024 at 10:07 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > For multigrain timestamps, we must keep track of the latest timestamp > > that has ever been handed out, and never hand out a coarse time below > > that value. > > > > Add a static singleton atomic64_t into timekeeper.c that we can use to > > keep track of the latest fine-grained time ever handed out. This is > > tracked as a monotonic ktime_t value to ensure that it isn't affected by > > clock jumps. > > > > Add two new public interfaces: > > > > - ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the > > coarse-grained clock and the floor time > > > > - ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries > > to swap it into the floor. A timespec64 is filled with the result. > > > > Since the floor is global, we take great pains to avoid updating it > > unless it's absolutely necessary. If we do the cmpxchg and find that the > > value has been updated since we fetched it, then we discard the > > fine-grained time that was fetched in favor of the recent update. > > > > To maximize the window of this occurring when multiple tasks are racing > > to update the floor, ktime_get_coarse_real_ts64_mg returns a cookie > > value that represents the state of the floor tracking word, and > > ktime_get_real_ts64_mg accepts a cookie value that it uses as the "old" > > value when calling cmpxchg(). > > This last bit seems out of date. > Thanks. Dropped the last paragraph. > > --- > > include/linux/timekeeping.h | 4 +++ > > kernel/time/timekeeping.c | 82 +++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 86 insertions(+) > > > > diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h > > index fc12a9ba2c88..7aa85246c183 100644 > > --- a/include/linux/timekeeping.h > > +++ b/include/linux/timekeeping.h > > @@ -45,6 +45,10 @@ extern void ktime_get_real_ts64(struct timespec64 *tv); > > extern void ktime_get_coarse_ts64(struct timespec64 *ts); > > extern void ktime_get_coarse_real_ts64(struct timespec64 *ts); > > > > +/* Multigrain timestamp interfaces */ > > +extern void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts); > > +extern void ktime_get_real_ts64_mg(struct timespec64 *ts); > > + > > void getboottime64(struct timespec64 *ts); > > > > /* > > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c > > index 5391e4167d60..16937242b904 100644 > > --- a/kernel/time/timekeeping.c > > +++ b/kernel/time/timekeeping.c > > @@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw ____cacheline_aligned = { > > .base[1] = FAST_TK_INIT, > > }; > > > > +/* > > + * This represents the latest fine-grained time that we have handed out as a > > + * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the > > + * realtime clock on an as-needed basis. > > + */ > > +static __cacheline_aligned_in_smp atomic64_t mg_floor; > > + > > static inline void tk_normalize_xtime(struct timekeeper *tk) > > { > > while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) { > > @@ -2394,6 +2401,81 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts) > > } > > EXPORT_SYMBOL(ktime_get_coarse_real_ts64); > > > > +/** > > + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor > > + * @ts: timespec64 to be filled > > + * > > + * Adjust floor to realtime and compare it to the coarse time. Fill > > + * @ts with the latest one. Note that this is a filesystem-specific > > + * interface and should be avoided outside of that context. > > + */ > > +void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts) > > +{ > > + struct timekeeper *tk = &tk_core.timekeeper; > > + u64 floor = atomic64_read(&mg_floor); > > + ktime_t f_real, offset, coarse; > > + unsigned int seq; > > + > > + WARN_ON(timekeeping_suspended); > > + > > + do { > > + seq = read_seqcount_begin(&tk_core.seq); > > + *ts = tk_xtime(tk); > > + offset = *offsets[TK_OFFS_REAL]; > > + } while (read_seqcount_retry(&tk_core.seq, seq)); > > + > > + coarse = timespec64_to_ktime(*ts); > > + f_real = ktime_add(floor, offset); > > + if (ktime_after(f_real, coarse)) > > + *ts = ktime_to_timespec64(f_real); > > +} > > +EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg); > > + > > +/** > > + * ktime_get_real_ts64_mg - attempt to update floor value and return result > > + * @ts: pointer to the timespec to be set > > + * > > + * Get a current monotonic fine-grained time value and attempt to swap > > + * it into the floor. @ts will be filled with the resulting floor value, > > + * regardless of the outcome of the swap. Note that this is a filesystem > > + * specific interface and should be avoided outside of that context. > > + */ > > +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) > > Still passing a cookie. It doesn't match the header definition, so I'm > surprised this builds. > Yeah, I didn't see a warning when I built it. Luckily the extra parameter is ignored anyway, so it no harm. I've fixed that up in my tree. > > +{ > > + struct timekeeper *tk = &tk_core.timekeeper; > > + ktime_t old = atomic64_read(&mg_floor); > > + ktime_t offset, mono; > > + unsigned int seq; > > + u64 nsecs; > > + > > + WARN_ON(timekeeping_suspended); > > + > > + do { > > + seq = read_seqcount_begin(&tk_core.seq); > > + > > + ts->tv_sec = tk->xtime_sec; > > + mono = tk->tkr_mono.base; > > + nsecs = timekeeping_get_ns(&tk->tkr_mono); > > + offset = *offsets[TK_OFFS_REAL]; > > + } while (read_seqcount_retry(&tk_core.seq, seq)); > > + > > + mono = ktime_add_ns(mono, nsecs); > > + > > + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { > > + ts->tv_nsec = 0; > > + timespec64_add_ns(ts, nsecs); > > + } else { > > + /* > > + * Something has changed mg_floor since "old" was > > + * fetched. "old" has now been updated with the > > + * current value of mg_floor, so use that to return > > + * the current coarse floor value. > > + */ > > + *ts = ktime_to_timespec64(ktime_add(old, offset)); > > + } > > +} > > +EXPORT_SYMBOL_GPL(ktime_get_real_ts64_mg); > > Other than those issues, I'm ok with it. Thanks again for working > through my concerns! > > Since I'm traveling for LPC soon, to save the next cycle, once the > fixes above are sorted: > Acked-by: John Stultz <jstultz@google.com> > Thanks for the review!
On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote: > For multigrain timestamps, we must keep track of the latest timestamp What is a multgrain timestamp? Can you please describe the concept behind it? I'm not going to chase random documentation or whatever because change logs have to self contained. And again 'we' do nothing. Describe the problem in technical terms and do not impersonate code. > To maximize the window of this occurring when multiple tasks are racing > to update the floor, ktime_get_coarse_real_ts64_mg returns a cookie > value that represents the state of the floor tracking word, and > ktime_get_real_ts64_mg accepts a cookie value that it uses as the "old" > value when calling cmpxchg(). Clearly: > +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) Can you please get your act together? > +/** > + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor > + * @ts: timespec64 to be filled > + * > + * Adjust floor to realtime and compare it to the coarse time. Fill > + * @ts with the latest one. This explains nothing. > Note that this is a filesystem-specific > + * interface and should be avoided outside of that context. > + */ > +void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts) > +{ > + struct timekeeper *tk = &tk_core.timekeeper; > + u64 floor = atomic64_read(&mg_floor); > + ktime_t f_real, offset, coarse; > + unsigned int seq; > + > + WARN_ON(timekeeping_suspended); > + > + do { > + seq = read_seqcount_begin(&tk_core.seq); > + *ts = tk_xtime(tk); > + offset = *offsets[TK_OFFS_REAL]; Why this indirection? What's wrong with using tk_core.timekeeper.offs_real directly? > + } while (read_seqcount_retry(&tk_core.seq, seq)); > + > + coarse = timespec64_to_ktime(*ts); > + f_real = ktime_add(floor, offset); How is any of this synchronized against concurrent updates of the floor value or the offset? I'm failing to see anything which keeps this consistent. If this is magically consistent then it wants a big fat comment in the code which explains it. > +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) What is this cookie argument for and how does that match the declaration? > +extern void ktime_get_real_ts64_mg(struct timespec64 *ts); This does not even build. > +{ > + struct timekeeper *tk = &tk_core.timekeeper; > + ktime_t old = atomic64_read(&mg_floor); > + ktime_t offset, mono; > + unsigned int seq; > + u64 nsecs; > + > + WARN_ON(timekeeping_suspended); WARN_ON_ONCE() if at all. > + do { > + seq = read_seqcount_begin(&tk_core.seq); > + > + ts->tv_sec = tk->xtime_sec; > + mono = tk->tkr_mono.base; > + nsecs = timekeeping_get_ns(&tk->tkr_mono); > + offset = *offsets[TK_OFFS_REAL]; > + } while (read_seqcount_retry(&tk_core.seq, seq)); > + > + mono = ktime_add_ns(mono, nsecs); > + > + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { > + ts->tv_nsec = 0; > + timespec64_add_ns(ts, nsecs); > + } else { > + /* > + * Something has changed mg_floor since "old" was > + * fetched. "old" has now been updated with the > + * current value of mg_floor, so use that to return > + * the current coarse floor value. 'Something has changed' is a truly understandable technical explanation. I'm not going to accept this voodoo which makes everyone scratch his head who wasn't involved in this. Thanks, tglx
On Mon, Sep 16 2024 at 12:12, Thomas Gleixner wrote: > On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote: >> + do { >> + seq = read_seqcount_begin(&tk_core.seq); >> + >> + ts->tv_sec = tk->xtime_sec; >> + mono = tk->tkr_mono.base; >> + nsecs = timekeeping_get_ns(&tk->tkr_mono); >> + offset = *offsets[TK_OFFS_REAL]; >> + } while (read_seqcount_retry(&tk_core.seq, seq)); >> + >> + mono = ktime_add_ns(mono, nsecs); >> + >> + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { >> + ts->tv_nsec = 0; >> + timespec64_add_ns(ts, nsecs); >> + } else { >> + /* >> + * Something has changed mg_floor since "old" was >> + * fetched. "old" has now been updated with the >> + * current value of mg_floor, so use that to return >> + * the current coarse floor value. > > 'Something has changed' is a truly understandable technical > explanation. old = mg_floor mono = T1; mg_floor = mono preemption do { mono = T2; } cmpxchg fails and the function returns a value based on T1 No? Thanks, tglx
On Mon, 2024-09-16 at 12:32 +0200, Thomas Gleixner wrote: > On Mon, Sep 16 2024 at 12:12, Thomas Gleixner wrote: > > On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote: > > > + do { > > > + seq = read_seqcount_begin(&tk_core.seq); > > > + > > > + ts->tv_sec = tk->xtime_sec; > > > + mono = tk->tkr_mono.base; > > > + nsecs = timekeeping_get_ns(&tk->tkr_mono); > > > + offset = *offsets[TK_OFFS_REAL]; > > > + } while (read_seqcount_retry(&tk_core.seq, seq)); > > > + > > > + mono = ktime_add_ns(mono, nsecs); > > > + > > > + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { > > > + ts->tv_nsec = 0; > > > + timespec64_add_ns(ts, nsecs); > > > + } else { > > > + /* > > > + * Something has changed mg_floor since "old" was > > > + * fetched. "old" has now been updated with the > > > + * current value of mg_floor, so use that to return > > > + * the current coarse floor value. > > > > 'Something has changed' is a truly understandable technical > > explanation. > > old = mg_floor > mono = T1; > mg_floor = mono > preemption > > do { > mono = T2; > } > > cmpxchg fails and the function returns a value based on T1 > > No? > > Packing for LPC, so I can't respond to all of these just now, but I will later. You're correct, but either outcome is OK. The requirement is that we don't hand out any values that were below the floor at the time that the task entered the kernel. Since the time changed while the task was already inside the kernel, either T1 or T2 would be valid timestamps.
On Mon, 2024-09-16 at 12:12 +0200, Thomas Gleixner wrote: > On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote: > > For multigrain timestamps, we must keep track of the latest timestamp > > What is a multgrain timestamp? Can you please describe the concept > behind it? I'm not going to chase random documentation or whatever > because change logs have to self contained. > > And again 'we' do nothing. Describe the problem in technical terms and > do not impersonate code. > Hi Thomas! Sorry for the delay in responding. I'll try to summarize below, but I'll also note that patch #7 in the v8 series adds a file to Documentation/ that explains this in a bit more depth: Currently the kernel always stamps files (mtime, ctime, etc.) using the coarse-grained clock. This is usually a good thing, since it reduces the number of metadata updates, but means that you can't reliably use file timestamps to detect whether there have been changes to the file since it was last checked. This is particularly a problem for NFSv3 clients, which use timestamps to know when to invalidate their pagecache for an inode [1]. The idea is to allow the kernel to use fine-grained timestamps (mtime and ctime) on files when they are under direct observation. When a task does a ->getattr against an inode for STATX_MTIME or STATX_CTIME, a flag is set in the inode that tells the kernel to use the fine-grained clock for the timestamp update iff the current coarse-grained clock value would not cause a change to the mtime/ctime. This works, but there is a problem: It's possible for one inode to get a fine-grained timestamp, and then another to get a coarse-grained timestamp. If this happens within a single coarse-grained timer tick, then the files may appear to have been modified in reverse order, which breaks POSIX guarantees (and obscure programs like "make"). The fix for this is to establish a floor value for the coarse-grained clock. When stamping a file with a fine-grained timestamp, we update the floor value with the current monotonic time (using cmpxchg). Then later, when a coarse-grained timestamp is requested, check whether the floor is later than the current coarse-grained time. If it is, then the kernel will return the floor value (converted to realtime) instead of the current coarse-grained clock. That allows us to maintain the ordering guarantees. My original implementation of this tracked the floor value in fs/inode.c (also using cmpxchg), but that caused a performance regression, mostly due to multiple calls into the timekeeper functions with seqcount loops. By adding the floor to the timekeeper we can get that back down to 1 seqcount loop. Let me know if you have more questions about this, or suggestions about how to do this better. The timekeeping code is not my area of expertise (obviously) so I'm open to doing this a better way if there is one. Thanks for the review so far! [1]: NFSv4 mandates an opaque change attribute (usually using inode->i_version), but only some filesystems have a proper implementation of it (XFS being the notable exception). For the others, we end up using the ctime to generate a change attribute, which means that NFSv4 has the same problem on those filesystems. i_version also doesn't help NFSv3 clients and servers.
On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote: > For multigrain timestamps, we must keep track of the latest timestamp > that has ever been handed out, and never hand out a coarse time below > that value. How is that correct when the clock is off by an hour and then set back to the correct value? Then you'd get the same stale timestamp for an hour unless something invokes ktime_get_real_ts64_mg() which will set the "latest" timestamp back to a time before the previous one. > Add a static singleton atomic64_t into timekeeper.c that we can use to > keep track of the latest fine-grained time ever handed out. This is > tracked as a monotonic ktime_t value to ensure that it isn't affected by > clock jumps. That's just wishful thinking. ktime_get_real_ts64_mg(ts) ts = Tmono_1 + offset_1; // TReal_1 floor = Tmono_1; // newtime < TReal_1 clock_settime(REALTIME, newtime); xtime = newtime; // TReal_2 offset_2 = offset_1 + Treal_2 - TReal(now); --> offset_2 < offset_1 ktime_get_coarse_real_ts64_mg(ts) ts = tk_xtime(); // TReal_2 offs = offset_2; if (Tmono_1 + offset_2 > ts) ts = Tmono_1 + offset_2; // Not taken So this returns T_Real_2 because offset_2 < offset_1 and therefore Tmono_1 + offset_2 < TReal_2 so the returned time will jump backwards vs. TReal_1 as it should because that's the actual time, no? So if that's the intended behaviour then the changelog is misleading at best. If the intention is to never return a value < TReal_1 then this does not work. You can make it work by using the Realtime timestamp as floor, but that'd be more than questionable vs. clock_settime() making the clock go backwards. Thanks, tglx
On Mon, Sep 16 2024 at 06:57, Jeff Layton wrote: > On Mon, 2024-09-16 at 12:32 +0200, Thomas Gleixner wrote: >> > 'Something has changed' is a truly understandable technical >> > explanation. >> >> old = mg_floor >> mono = T1; >> mg_floor = mono >> preemption >> >> do { >> mono = T2; >> } >> >> cmpxchg fails and the function returns a value based on T1 >> >> No? >> >> > > Packing for LPC, so I can't respond to all of these just now, but I > will later. You're correct, but either outcome is OK. > > The requirement is that we don't hand out any values that were below > the floor at the time that the task entered the kernel. Since the time > changed while the task was already inside the kernel, either T1 or T2 > would be valid timestamps. That really needs to be documented. A similar scenario exists vs. ktime_get_coarse_real_ts64_mg(). Thanks, tglx
On Mon, 2024-09-30 at 21:13 +0200, Thomas Gleixner wrote: > On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote: > > > For multigrain timestamps, we must keep track of the latest timestamp > > that has ever been handed out, and never hand out a coarse time below > > that value. > > How is that correct when the clock is off by an hour and then set back > to the correct value? Then you'd get the same stale timestamp for an > hour unless something invokes ktime_get_real_ts64_mg() which will set > the "latest" timestamp back to a time before the previous one. > > > Add a static singleton atomic64_t into timekeeper.c that we can use to > > keep track of the latest fine-grained time ever handed out. This is > > tracked as a monotonic ktime_t value to ensure that it isn't affected by > > clock jumps. > > That's just wishful thinking. > > ktime_get_real_ts64_mg(ts) > ts = Tmono_1 + offset_1; // TReal_1 > floor = Tmono_1; > > // newtime < TReal_1 > clock_settime(REALTIME, newtime); > xtime = newtime; // TReal_2 > offset_2 = offset_1 + Treal_2 - TReal(now); > --> offset_2 < offset_1 > > ktime_get_coarse_real_ts64_mg(ts) > ts = tk_xtime(); // TReal_2 > offs = offset_2; > > if (Tmono_1 + offset_2 > ts) > ts = Tmono_1 + offset_2; // Not taken > > So this returns T_Real_2 because > > offset_2 < offset_1 > > and therefore > > Tmono_1 + offset_2 < TReal_2 > > so the returned time will jump backwards vs. TReal_1 as it should > because that's the actual time, no? > > So if that's the intended behaviour then the changelog is misleading at > best. > > If the intention is to never return a value < TReal_1 then this does not > work. You can make it work by using the Realtime timestamp as floor, but > that'd be more than questionable vs. clock_settime() making the clock go > backwards. > That is the intended behavior and I'll plan to fix the changelog to clarify this point: If someone jumps the realtime clock backward by a large value, then the realtime timestamp _can_ appear to go backward. This is a problem today even without this patchset. If two files get stamped and a realtime clock jump backward happens in between them, all bets are off as to which one will appear to have been modified first. I don't think that is something we can reasonably prevent, since we must stamp files according to the realtime clock. The main thing I'm trying to prevent is the timestamps being misordered in the absence of such a clock jump. Without tracking the floor as I am here, that's a possibility.
On Mon, 2024-09-30 at 21:16 +0200, Thomas Gleixner wrote: > On Mon, Sep 16 2024 at 06:57, Jeff Layton wrote: > > On Mon, 2024-09-16 at 12:32 +0200, Thomas Gleixner wrote: > > > > 'Something has changed' is a truly understandable technical > > > > explanation. > > > > > > old = mg_floor > > > mono = T1; > > > mg_floor = mono > > > preemption > > > > > > do { > > > mono = T2; > > > } > > > > > > cmpxchg fails and the function returns a value based on T1 > > > > > > No? > > > > > > > > > > Packing for LPC, so I can't respond to all of these just now, but I > > will later. You're correct, but either outcome is OK. > > > > The requirement is that we don't hand out any values that were below > > the floor at the time that the task entered the kernel. Since the time > > changed while the task was already inside the kernel, either T1 or T2 > > would be valid timestamps. > > That really needs to be documented. A similar scenario exists > vs. ktime_get_coarse_real_ts64_mg(). > Yes. I have the following section in the multigrain-ts.rst file that gets added in patch 7 of this series. I'll also plan to add some extra wording about how backward realtime clock jumps can affect ordering: Inode Timestamp Ordering ======================== In addition to providing info about changes to individual files, file timestamps also serve an important purpose in applications like "make". These programs measure timestamps in order to determine whether source files might be newer than cached objects. Userland applications like make can only determine ordering based on operational boundaries. For a syscall those are the syscall entry and exit points. For io_uring or nfsd operations, that's the request submission and response. In the case of concurrent operations, userland can make no determination about the order in which things will occur. For instance, if a single thread modifies one file, and then another file in sequence, the second file must show an equal or later mtime than the first. The same is true if two threads are issuing similar operations that do not overlap in time. If however, two threads have racing syscalls that overlap in time, then there is no such guarantee, and the second file may appear to have been modified before, after or at the same time as the first, regardless of which one was submitted first.
On Thu, Sep 19 2024 at 18:50, Jeff Layton wrote: > The fix for this is to establish a floor value for the coarse-grained > clock. When stamping a file with a fine-grained timestamp, we update > the floor value with the current monotonic time (using cmpxchg). Then > later, when a coarse-grained timestamp is requested, check whether the > floor is later than the current coarse-grained time. If it is, then the > kernel will return the floor value (converted to realtime) instead of > the current coarse-grained clock. That allows us to maintain the > ordering guarantees. > > My original implementation of this tracked the floor value in > fs/inode.c (also using cmpxchg), but that caused a performance > regression, mostly due to multiple calls into the timekeeper functions > with seqcount loops. By adding the floor to the timekeeper we can get > that back down to 1 seqcount loop. > > Let me know if you have more questions about this, or suggestions about > how to do this better. The timekeeping code is not my area of expertise > (obviously) so I'm open to doing this a better way if there is one. The comments I made about races and the clock_settime() inconsistency vs. the change log aside, I don't see room for improvement there. What worries me is the atomic_cmpxchg() under contention on large machines, but as it is not a cmpxchg() loop it might be not completely horrible. Thanks, tglx
On Mon, 2024-09-30 at 21:43 +0200, Thomas Gleixner wrote: > On Thu, Sep 19 2024 at 18:50, Jeff Layton wrote: > > The fix for this is to establish a floor value for the coarse-grained > > clock. When stamping a file with a fine-grained timestamp, we update > > the floor value with the current monotonic time (using cmpxchg). Then > > later, when a coarse-grained timestamp is requested, check whether the > > floor is later than the current coarse-grained time. If it is, then the > > kernel will return the floor value (converted to realtime) instead of > > the current coarse-grained clock. That allows us to maintain the > > ordering guarantees. > > > > My original implementation of this tracked the floor value in > > fs/inode.c (also using cmpxchg), but that caused a performance > > regression, mostly due to multiple calls into the timekeeper functions > > with seqcount loops. By adding the floor to the timekeeper we can get > > that back down to 1 seqcount loop. > > > > Let me know if you have more questions about this, or suggestions about > > how to do this better. The timekeeping code is not my area of expertise > > (obviously) so I'm open to doing this a better way if there is one. > > The comments I made about races and the clock_settime() inconsistency > vs. the change log aside, I don't see room for improvement there. > > What worries me is the atomic_cmpxchg() under contention on large > machines, but as it is not a cmpxchg() loop it might be not completely > horrible. > Thanks for taking a look. The code does try to avoid requesting a fine-grained timestamp unless there is no other option. In practice, that means that a write()+stat()+write() has to occur within the same coarse-grained timer tick. That obviously does happen, but I'm hoping it's not very frequent on real-world workloads. The main place where we see a performance hit from this set is the will-it-scale pipe test, which does 1-byte writes and reads on a pipe as fast as possible. The previous patchset tracked the floor inside fs/inode.c, and each timestamp update on the pipe inode could make multiple trips into the timekeeper seqcount loops. That added extra barriers and slowed things down. This patchset moves the code into the timekeeper, which means only a single barrier. That gets the cost down, but it's not entirely free. There is some small cost to dealing with the floor, regardless.
On Mon, Sep 30 2024 at 15:27, Jeff Layton wrote: > On Mon, 2024-09-30 at 21:13 +0200, Thomas Gleixner wrote: >> So if that's the intended behaviour then the changelog is misleading at >> best. > > That is the intended behavior and I'll plan to fix the changelog to > clarify this point: > > If someone jumps the realtime clock backward by a large value, then the > realtime timestamp _can_ appear to go backward. This is a problem today > even without this patchset. Correct. > If two files get stamped and a realtime clock jump backward happens in > between them, all bets are off as to which one will appear to have been > modified first. I don't think that is something we can reasonably > prevent, since we must stamp files according to the realtime clock. True. I just was utterly confused about the changelog. > The main thing I'm trying to prevent is the timestamps being misordered > in the absence of such a clock jump. Without tracking the floor as I am > here, that's a possibility. Correct.
On Mon, Sep 30 2024 at 15:37, Jeff Layton wrote: > On Mon, 2024-09-30 at 21:16 +0200, Thomas Gleixner wrote: > I have the following section in the multigrain-ts.rst file that gets > added in patch 7 of this series. I'll also plan to add some extra > wording about how backward realtime clock jumps can affect ordering: Please also add comments into the code / interface. > Inode Timestamp Ordering > ======================== > > In addition to providing info about changes to individual files, file > timestamps also serve an important purpose in applications like "make". These > programs measure timestamps in order to determine whether source files might be > newer than cached objects. > > Userland applications like make can only determine ordering based on > operational boundaries. For a syscall those are the syscall entry and exit > points. For io_uring or nfsd operations, that's the request submission and > response. In the case of concurrent operations, userland can make no > determination about the order in which things will occur. > > For instance, if a single thread modifies one file, and then another file in > sequence, the second file must show an equal or later mtime than the first. The > same is true if two threads are issuing similar operations that do not overlap > in time. > > If however, two threads have racing syscalls that overlap in time, then there > is no such guarantee, and the second file may appear to have been modified > before, after or at the same time as the first, regardless of which one was > submitted first. That makes me ask a question. Are the timestamps always taken in thread (syscall) context or can they be taken in other contexts (worker, [soft]interrupt, etc.) too? Thanks, tglx
On Mon, 2024-09-30 at 22:19 +0200, Thomas Gleixner wrote: > On Mon, Sep 30 2024 at 15:37, Jeff Layton wrote: > > On Mon, 2024-09-30 at 21:16 +0200, Thomas Gleixner wrote: > > I have the following section in the multigrain-ts.rst file that gets > > added in patch 7 of this series. I'll also plan to add some extra > > wording about how backward realtime clock jumps can affect ordering: > > Please also add comments into the code / interface. > Will do. > > Inode Timestamp Ordering > > ======================== > > > > In addition to providing info about changes to individual files, file > > timestamps also serve an important purpose in applications like "make". These > > programs measure timestamps in order to determine whether source files might be > > newer than cached objects. > > > > Userland applications like make can only determine ordering based on > > operational boundaries. For a syscall those are the syscall entry and exit > > points. For io_uring or nfsd operations, that's the request submission and > > response. In the case of concurrent operations, userland can make no > > determination about the order in which things will occur. > > > > For instance, if a single thread modifies one file, and then another file in > > sequence, the second file must show an equal or later mtime than the first. The > > same is true if two threads are issuing similar operations that do not overlap > > in time. > > > > If however, two threads have racing syscalls that overlap in time, then there > > is no such guarantee, and the second file may appear to have been modified > > before, after or at the same time as the first, regardless of which one was > > submitted first. > > That makes me ask a question. Are the timestamps always taken in thread > (syscall) context or can they be taken in other contexts (worker, > [soft]interrupt, etc.) too? > That's a good question. The main place we do this is inode_set_ctime_current(). That is mostly called in the context of a syscall or similar sort of operation (io_uring, nfsd RPC request, etc.). I certainly wouldn't rule out a workqueue job calling that function, but this is something we do while dirtying an inode, and that's not typically done in interrupt context.
On Mon, Sep 30 2024 at 16:53, Jeff Layton wrote: > On Mon, 2024-09-30 at 22:19 +0200, Thomas Gleixner wrote: >> On Mon, Sep 30 2024 at 15:37, Jeff Layton wrote: >> > If however, two threads have racing syscalls that overlap in time, then there >> > is no such guarantee, and the second file may appear to have been modified >> > before, after or at the same time as the first, regardless of which one was >> > submitted first. >> >> That makes me ask a question. Are the timestamps always taken in thread >> (syscall) context or can they be taken in other contexts (worker, >> [soft]interrupt, etc.) too? >> > > That's a good question. > > The main place we do this is inode_set_ctime_current(). That is mostly > called in the context of a syscall or similar sort of operation > (io_uring, nfsd RPC request, etc.). > > I certainly wouldn't rule out a workqueue job calling that function, > but this is something we do while dirtying an inode, and that's not > typically done in interrupt context. The reason I'm asking is that if it's always syscall context, i.e. write() or io_uring()/RPC request etc., then you can avoid the whole global floor value dance and make it strictly per thread, which simplifies the exercise significantly. But even if it's not syscall/thread context then the worker or io_uring state machine might just require to serialize against itself and not coordinate with something else. But what do I know. Thanks, tglx
On Mon, 2024-09-30 at 23:35 +0200, Thomas Gleixner wrote: > On Mon, Sep 30 2024 at 16:53, Jeff Layton wrote: > > On Mon, 2024-09-30 at 22:19 +0200, Thomas Gleixner wrote: > > > On Mon, Sep 30 2024 at 15:37, Jeff Layton wrote: > > > > If however, two threads have racing syscalls that overlap in time, then there > > > > is no such guarantee, and the second file may appear to have been modified > > > > before, after or at the same time as the first, regardless of which one was > > > > submitted first. > > > > > > That makes me ask a question. Are the timestamps always taken in thread > > > (syscall) context or can they be taken in other contexts (worker, > > > [soft]interrupt, etc.) too? > > > > > > > That's a good question. > > > > The main place we do this is inode_set_ctime_current(). That is mostly > > called in the context of a syscall or similar sort of operation > > (io_uring, nfsd RPC request, etc.). > > > > I certainly wouldn't rule out a workqueue job calling that function, > > but this is something we do while dirtying an inode, and that's not > > typically done in interrupt context. > > The reason I'm asking is that if it's always syscall context, > i.e. write() or io_uring()/RPC request etc., then you can avoid the > whole global floor value dance and make it strictly per thread, which > simplifies the exercise significantly. > I'm not sure I follow what you're proposing here. Consider two threads doing writes serially to different files. IOW, the second thread enters the write() syscall after the first thread returns from its write(). In that situation, the second timestamp mustn't appear to be earlier than the first (assuming no backward clock jump, of course). How would we ensure that with only per-thread structures? > But even if it's not syscall/thread context then the worker or io_uring > state machine might just require to serialize against itself and not > coordinate with something else. But what do I know.
On Tue, Oct 01 2024 at 05:45, Jeff Layton wrote: > On Mon, 2024-09-30 at 23:35 +0200, Thomas Gleixner wrote: >> > I certainly wouldn't rule out a workqueue job calling that function, >> > but this is something we do while dirtying an inode, and that's not >> > typically done in interrupt context. >> >> The reason I'm asking is that if it's always syscall context, >> i.e. write() or io_uring()/RPC request etc., then you can avoid the >> whole global floor value dance and make it strictly per thread, which >> simplifies the exercise significantly. >> > > I'm not sure I follow what you're proposing here. > > Consider two threads doing writes serially to different files. IOW, the > second thread enters the write() syscall after the first thread returns > from its write(). In that situation, the second timestamp mustn't > appear to be earlier than the first (assuming no backward clock jump, > of course). > > How would we ensure that with only per-thread structures? Bah. Right. Ignore my sleep deprived rambling.
On Tue, 2024-10-01 at 14:45 +0200, Thomas Gleixner wrote: > On Tue, Oct 01 2024 at 05:45, Jeff Layton wrote: > > On Mon, 2024-09-30 at 23:35 +0200, Thomas Gleixner wrote: > > > > I certainly wouldn't rule out a workqueue job calling that function, > > > > but this is something we do while dirtying an inode, and that's not > > > > typically done in interrupt context. > > > > > > The reason I'm asking is that if it's always syscall context, > > > i.e. write() or io_uring()/RPC request etc., then you can avoid the > > > whole global floor value dance and make it strictly per thread, which > > > simplifies the exercise significantly. > > > > > > > I'm not sure I follow what you're proposing here. > > > > Consider two threads doing writes serially to different files. IOW, the > > second thread enters the write() syscall after the first thread returns > > from its write(). In that situation, the second timestamp mustn't > > appear to be earlier than the first (assuming no backward clock jump, > > of course). > > > > How would we ensure that with only per-thread structures? > > Bah. Right. Ignore my sleep deprived rambling. No worries. My one big takeaway from working on all of this is that anything dealing with clocks and time is just difficult to conceptualize. Could I trouble you for an Acked-by on this patch? I think this series should probably go in via Christian's tree, but I don't think he wants to merge it without an explicit ack from the timekeeping maintainers. Thanks again for the review!
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h index fc12a9ba2c88..7aa85246c183 100644 --- a/include/linux/timekeeping.h +++ b/include/linux/timekeeping.h @@ -45,6 +45,10 @@ extern void ktime_get_real_ts64(struct timespec64 *tv); extern void ktime_get_coarse_ts64(struct timespec64 *ts); extern void ktime_get_coarse_real_ts64(struct timespec64 *ts); +/* Multigrain timestamp interfaces */ +extern void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts); +extern void ktime_get_real_ts64_mg(struct timespec64 *ts); + void getboottime64(struct timespec64 *ts); /* diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 5391e4167d60..16937242b904 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw ____cacheline_aligned = { .base[1] = FAST_TK_INIT, }; +/* + * This represents the latest fine-grained time that we have handed out as a + * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the + * realtime clock on an as-needed basis. + */ +static __cacheline_aligned_in_smp atomic64_t mg_floor; + static inline void tk_normalize_xtime(struct timekeeper *tk) { while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) { @@ -2394,6 +2401,81 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts) } EXPORT_SYMBOL(ktime_get_coarse_real_ts64); +/** + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor + * @ts: timespec64 to be filled + * + * Adjust floor to realtime and compare it to the coarse time. Fill + * @ts with the latest one. Note that this is a filesystem-specific + * interface and should be avoided outside of that context. + */ +void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts) +{ + struct timekeeper *tk = &tk_core.timekeeper; + u64 floor = atomic64_read(&mg_floor); + ktime_t f_real, offset, coarse; + unsigned int seq; + + WARN_ON(timekeeping_suspended); + + do { + seq = read_seqcount_begin(&tk_core.seq); + *ts = tk_xtime(tk); + offset = *offsets[TK_OFFS_REAL]; + } while (read_seqcount_retry(&tk_core.seq, seq)); + + coarse = timespec64_to_ktime(*ts); + f_real = ktime_add(floor, offset); + if (ktime_after(f_real, coarse)) + *ts = ktime_to_timespec64(f_real); +} +EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg); + +/** + * ktime_get_real_ts64_mg - attempt to update floor value and return result + * @ts: pointer to the timespec to be set + * + * Get a current monotonic fine-grained time value and attempt to swap + * it into the floor. @ts will be filled with the resulting floor value, + * regardless of the outcome of the swap. Note that this is a filesystem + * specific interface and should be avoided outside of that context. + */ +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) +{ + struct timekeeper *tk = &tk_core.timekeeper; + ktime_t old = atomic64_read(&mg_floor); + ktime_t offset, mono; + unsigned int seq; + u64 nsecs; + + WARN_ON(timekeeping_suspended); + + do { + seq = read_seqcount_begin(&tk_core.seq); + + ts->tv_sec = tk->xtime_sec; + mono = tk->tkr_mono.base; + nsecs = timekeeping_get_ns(&tk->tkr_mono); + offset = *offsets[TK_OFFS_REAL]; + } while (read_seqcount_retry(&tk_core.seq, seq)); + + mono = ktime_add_ns(mono, nsecs); + + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { + ts->tv_nsec = 0; + timespec64_add_ns(ts, nsecs); + } else { + /* + * Something has changed mg_floor since "old" was + * fetched. "old" has now been updated with the + * current value of mg_floor, so use that to return + * the current coarse floor value. + */ + *ts = ktime_to_timespec64(ktime_add(old, offset)); + } +} +EXPORT_SYMBOL_GPL(ktime_get_real_ts64_mg); + void ktime_get_coarse_ts64(struct timespec64 *ts) { struct timekeeper *tk = &tk_core.timekeeper;
For multigrain timestamps, we must keep track of the latest timestamp that has ever been handed out, and never hand out a coarse time below that value. Add a static singleton atomic64_t into timekeeper.c that we can use to keep track of the latest fine-grained time ever handed out. This is tracked as a monotonic ktime_t value to ensure that it isn't affected by clock jumps. Add two new public interfaces: - ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the coarse-grained clock and the floor time - ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries to swap it into the floor. A timespec64 is filled with the result. Since the floor is global, we take great pains to avoid updating it unless it's absolutely necessary. If we do the cmpxchg and find that the value has been updated since we fetched it, then we discard the fine-grained time that was fetched in favor of the recent update. To maximize the window of this occurring when multiple tasks are racing to update the floor, ktime_get_coarse_real_ts64_mg returns a cookie value that represents the state of the floor tracking word, and ktime_get_real_ts64_mg accepts a cookie value that it uses as the "old" value when calling cmpxchg(). Signed-off-by: Jeff Layton <jlayton@kernel.org> --- include/linux/timekeeping.h | 4 +++ kernel/time/timekeeping.c | 82 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 86 insertions(+)