For some time after I learned CL, I’d always go for a struct. After all, defstruct is so easy to work with – you get so many useful accessors for free!
(defstruct foo
  a
  b)
This gives us the type FOO, the functions MAKE-FOO, COPY-FOO and FOO-P, and the slot accessors FOO-A and FOO-B out of the box.
Compare that to the equivalent defclass, and the only thing you get is the type. There’s no copying function, the type needs to be passed around to MAKE-INSTANCE and TYPEP, and don’t even get me started on the verbosity of SLOT-VALUE.
That said, at least this problem can be solved by using macros like WITH-SLOTS or defclass* (readily available on Quicklisp).
However, there’s still the issue of performance. Because there’s no dynamic dispatch, structs are usually faster than classes - plus their functions can be inlined, and structs themselves can also be stack allocated.
What’s not to like?
The big problem with structs, especially when you are still exploring things, is modifications. Change the above struct to the following:
(defstruct foo
  x
  a
  b)
And SBCL will immediately complain with this:
WARNING: change in instance length of class FOO:
current length: 2
new length: 3
debugger invoked on a SIMPLE-ERROR in thread
#<THREAD "main thread" RUNNING {1004AC0203}>:
attempt to redefine the STRUCTURE-OBJECT class FOO incompatibly with the
current definition
Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [CONTINUE ] Use the new definition of FOO, invalidating
already-loaded code and instances.
1: [RECKLESSLY-CONTINUE] Use the new definition of FOO as if it were
compatible, allowing old accessors to use new
instances and allowing new accessors to use old
instances.
2: [ABORT ] Exit debugger, returning to top level.
For your own sake, just abort (restart 2) or continue (restart 0). In no case shall ye recklessly continue, because then you are just asking for trouble – ok maybe try it just for fun, but don’t do this in production!
Classes, on the other hand, are born to be redefined. Add a new slot, or remove an existing one, your instances will keep working just fine.
And while classes may not be as performant as structs, their performance is good enough most of the time, even more so when you are exploring things. Here’s a good collection of articles on CLOS efficiency.
In conclusion, my opinion on this matter has done a 180-degree turn, and today I default to defclass when exploring new compound types.
A post I came across recently compares fork() vs vfork() and asserts that fork() is evil and vfork() is good. The essence of that post is that fork() is slow and expensive, whereas vfork() is fast and cheap. Therefore vfork() is good, and fork() is bad.
That’s wrong.
vfork() is a premature optimization, and a highly dangerous one at that. Premature optimization is the root of all evil. Therefore, vfork() is still evil.
vfork() has a significant problem, and the post in question alludes to it:
vfork() does have one downside: that the parent (specifically: the thread in the parent that calls vfork()) and child share a stack, necessitating that the parent (thread) be stopped until the child exec()s or _exit()s.
Unfortunately, it completely glosses over the real problem because the focus here is on the parent process being blocked. The blocking behaviour is just a symptom, the real problem here is that the stack is shared between the parent and the child process.
More generally, the entire memory of the parent is shared with the child until an exec() call is made or the child exits. Here’s what the Linux manual says about vfork():
(From POSIX.1) The vfork() function has the same effect as fork(2), except that the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions.
And from the macOS/BSD system calls manual:
Many problems can occur when replacing fork(2) with vfork(). For example, it does not work to return while running in the child’s context from the procedure that called vfork() since the eventual return from vfork() would then return to a no longer existent stack frame. Also, changing process state which is partially implemented in user space such as signal handlers with libthr(3) will corrupt the parent’s state.
Be careful, also, to call _exit(2) rather than exit(3) if you cannot execve(2), since exit(3) will flush and close standard I/O channels, and thereby mess up the parent processes standard I/O data structures. (Even with fork(2) it is wrong to call exit(3) since buffered data would then be flushed twice.)
You cannot blindly replace calls to fork() with vfork(). fork() has multiple use cases, but vfork() has only one: calling the exec() family of functions after vfork(). That is, when you want to launch another program.
And be careful what you do in the child process before calling exec(). As we’ve seen above, anything that modifies memory is unsafe. So is calling any function that is not async-signal-safe.
An interesting consequence of all this is that while calling dup2() (to redirect stdin/stdout) between vfork() and exec() is safe, if the call to dup2() itself fails, there is no easy way to signal to the user what went wrong. That is because all of stdio is NOT async-signal-safe.
All said and done – just stick to fork(). Sure, fork() has its problems and caveats, especially when you throw threads into the mix, but it is almost always the better choice when compared to vfork(). Use vfork() only when you truly need its performance benefits, and understand its problems well.
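Python, incidentally, exposes fork() but not vfork() in the os module. The fork()-then-exec() pattern recommended here can be sketched in Python as follows (spawn is an illustrative name, not a standard API):

```python
import os

def spawn(program, *args):
    # Fork, then exec `program` in the child - the pattern recommended above.
    pid = os.fork()
    if pid == 0:
        try:
            os.execlp(program, program, *args)
        finally:
            os._exit(127)  # reached only if exec fails
    return pid

child = spawn('true')            # 'true' simply exits with status 0
_, status = os.waitpid(child, 0)
exit_code = status >> 8          # exit status lives in the high byte
```

Note that the child calls os._exit() rather than sys.exit() on exec failure, for the same reason the BSD manual quoted above warns against exit(3): the child must not flush the parent's buffered I/O.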
The thing with the JavaScript Date object is that what it prints is misleading.
new Date()
// => Date Mon Jan 31 2022 02:32:37 GMT+0530 (India Standard Time)
If you think a Date contains all the things it prints, you are wrong.
- Date does not contain any year, month or day
- Date does not contain hours, minutes, seconds or milliseconds
- Date certainly does not contain any time zone

As the title says, the Date is just a timestamp. It’s a number that represents milliseconds since January 1, 1970 UTC. That is, it’s a single moment in time – that moment (and the corresponding number) remains the same regardless of what time zone you are living in.
You can create a Date using this timestamp directly – just pass it as the only value to the constructor. For example,
new Date(1640995200000)
// => Date Sat Jan 01 2022 05:30:00 GMT+0530 (India Standard Time)
You can also get or set this timestamp on a Date object using the getTime() and setTime() methods respectively.
Everything else that the Date object exposes is either computed from this timestamp, cached, or taken from the environment. This applies to the getter methods like getFullYear(), getMonth(), getDate(), getHours(), etc.
That also applies to getTimezoneOffset() – this method just returns the offset in minutes for the given timestamp in the local time zone. No matter what you pass to the Date object, getTimezoneOffset() will always work with the local time zone.
This misconception around what a Date is and what it contains leads to a lot of confusion, especially when it comes to time zone conversions.
Given a Date (or a timestamp), can you tell what time the clock would say for it in a time zone that is not your local time zone? Or, say you need to schedule a meeting across time zones: is 10 AM in India too late in New York – what would the local time be in another time zone at a particular moment?
For the longest time, browsers did not expose time zone data to JavaScript APIs, so if you wanted to do time zone conversions on the client, you had to use a library like Moment Timezone.
These days, the Intl API ships in most modern browsers. That has meant modern date/time libraries like Luxon can be much smaller since they don’t need to ship locales or tz files.
However, Luxon has its own API for dealing with date/times that is different from Date, and you might not want to bring in an external dependency. Can you, in this case, store the result of a time zone conversion in a Date object? You shouldn’t. While technically you can do it, the fact that a Date is a timestamp will end up creating problems for you down the line. Unfortunately, that is how some libraries (like date-fns-tz) do it.
x = new Date()
// => Wed Feb 02 2022 04:06:14 GMT+0530 (India Standard Time)
y = utcToZonedTime(x, 'America/New_York')
// => Tue Feb 01 2022 17:36:14 GMT+0530 (India Standard Time)
utcToZonedTime() takes an input date and a target time zone, and returns a new Date that’s set up in such a way that the local time components, i.e. getHours(), getMinutes(), etc., return what they would have for the target time zone. However, since Date is just a timestamp, what it’s actually doing is modifying the underlying timestamp. This can be confirmed by printing the timestamp for both dates.
x.getTime()
// => 1643754974808
y.getTime()
// => 1643717174808
Not only is this semantically incorrect (the timestamp should have remained the same), it will also create problems down the line if one is not careful. For example, if this date is used in arithmetic, it should only ever be used with dates that have similarly been converted to the same time zone using utcToZonedTime(). If that’s not followed, your date arithmetic will go wrong.
Given these issues, is it possible to do time zone conversions without moving all date/time handling to a new library like Luxon? The answer is yes, and that is what naive-date does.
Use a NaiveDate as opposed to a Date when you want a Date-like object, but one that’s not a timestamp. NaiveDate’s API is very similar to that of Date and includes all of its warts, like month indexes starting from 0.
By the way, the term naive is inspired by its usage in the Python datetime module, which categorizes date and time objects as “aware” or “naive” depending on whether they include time zone information or not.
To create a NaiveDate, you pass a YMD date, or the full date/time components:
// date only
// since we use 0 based indexes, the month below is Feb, not Jan
x = new NaiveDate(2022, 1, 1)
// date and time
y = new NaiveDate(2022, 1, 1, 10, 0, 0)
Since a NaiveDate is not linked to any time zone (and it’s not a timestamp), when you print it you won’t see any zone info:
x.toString()
// => '2022-02-01T00:00:00.000'
y.toString()
// => '2022-02-01T10:00:00.000'
The getters getFullYear(), getHours(), etc. do what you expect. However, there’s no equivalent for the getUTC... and setUTC... methods since they don’t make sense (a NaiveDate is not a timestamp). There’s no equivalent for getTimezoneOffset() either, since a NaiveDate, by definition, is not linked to any time zone.
And, most importantly, time zone conversions do the right thing. They return a NaiveDate when you want the local time, and a Date when you want a timestamp. Let’s say I want a timestamp equivalent to 12 PM on 1st of Feb, 2022 in New York, which is not my local time zone. This is how you would get it using NaiveDate.
// First I create a NaiveDate to capture the local date/time components
const nyDate = new NaiveDate(2022, 1, 1, 12, 0, 0)
// Then I convert it into a timestamp using the toDate() instance method
nyDate.toDate('America/New_York')
// => Date Tue Feb 01 2022 22:30:00 GMT+0530 (India Standard Time)
Again, remember that the Date is a timestamp. The fact that it’s printed in my local time zone is irrelevant.
Similarly, if I want to find the local time in another time zone for a given timestamp, this is how it can be done:
const timestamp = new Date(2022, 1, 2, 5, 0, 0)
timestamp
// => Date Wed Feb 02 2022 05:00:00 GMT+0530 (India Standard Time)
const nyDate = NaiveDate.from(timestamp, 'America/New_York')
nyDate.toString()
// => "2022-02-01T18:30:00.000"
To know more, see the naive-date README.
Just keep in mind two things:
- Date is a timestamp
- Don’t use Date for time zone conversions – use a library like Luxon or NaiveDate instead

fork() can magically make your program do things twice. Don’t believe me? Let’s
run this small program and see for ourselves. Create a file called fork.py and save the following code in it.
import sys
import os
import time
sys.stdout.write('Ready to fork? (Press enter to continue) ')
sys.stdout.flush()
sys.stdin.readline()
os.fork()
print('I will print twice')
time.sleep(10)
print('I will also print twice')
Now run this program (make sure you use Python 3), press enter at the “Ready to fork?” prompt and observe the output. Curiously, the print statements following the fork() call do indeed print twice!
What’s happening? To understand this, run the program again, but do not press enter at the “Ready to fork?” prompt. Now open another terminal window, and observe the output of the following ps -af command.
The output would look something like this:
UID PID PPID C STIME TTY TIME CMD
ubuntu 80568 80012 0 23:31 pts/1 00:00:00 python3 fork.py
ubuntu 80571 80547 0 23:31 pts/0 00:00:00 ps -af
Now, press enter, then quickly switch to the other terminal window and run ps -af again (before the 10 second sleep runs out). This time the output will look like this:
UID PID PPID C STIME TTY TIME CMD
ubuntu 80568 80012 0 23:31 pts/1 00:00:00 python3 fork.py
ubuntu 80579 80568 0 23:33 pts/1 00:00:00 python3 fork.py
ubuntu 80580 80547 0 23:33 pts/0 00:00:00 ps -af
What’s happening here? Are we really running our program twice?
Well yes we are!
To understand this better, let’s first understand what ps does. ps just lists the actively running processes on a system.
And what’s a process? A process is what an operating system creates when you ask it to run a program. A process usually consists of the following things:
- The program it is running (e.g. python3).
- Metadata such as its process id, or pid (the PID you saw in the output of the ps command).

When you call fork(), the OS makes an almost identical copy of the current process, which is called the child process. And the process in which the fork() call is made of course becomes the parent of this newly created child process. In the output of the ps command, observe the values of the PID and PPID (i.e. parent PID) columns for both the Python processes.
The child process, after creation, continues execution from the point at which fork() returns. This is why you see duplicate output for both the print statements in our program.
It is important to note that both the parent and the child process run in parallel after the fork() call is made, even on systems with only a single-core, single-processor CPU. This is possible due to multitasking.
You might also have noticed that even though the print calls are made twice, the sleep lasts only for 10 seconds and not 20 seconds. This is a direct consequence of the processes running in parallel.
Exercise 1: Inside a running Python process, you can get its pid using the os.getpid() function. Modify the print statements above to also include the pid and observe the output.
Exercise 2: Remove the call to time.sleep() in the program above and observe the output.
One problem with the code we’ve written till now is this: after fork() returns, inside the respective processes, how do you identify which one is the parent and which one is the child?
One possible solution is to do something like this:
import sys
import os
import time
PARENT_PID = os.getpid()
sys.stdout.write('Ready to fork? (Press enter to continue) ')
sys.stdout.flush()
sys.stdin.readline()
os.fork()
PID_AFTER_FORK = os.getpid()
if PID_AFTER_FORK == PARENT_PID:
    print('Inside parent')
else:
    print('Inside child')
This should work, but fork provides an easier way: the return value of the fork() call is 0 in the child process, and the pid of the child in the parent process. That is, this should also work:
import sys
import os
import time
sys.stdout.write('Ready to fork? (Press enter to continue) ')
sys.stdout.flush()
sys.stdin.readline()
PID_AFTER_FORK = os.fork()
if PID_AFTER_FORK > 0:
    print('Inside parent')
else:
    print('Inside child')
Exercise 3: After the fork, let the parent run to completion but put the child to sleep. Observe the output of ps -af. What happens to the child’s parent PID after the parent exits?
Exercise 4: If the child process prints something after the parent process exits, what happens to its output?
Exercise 5: Write a function, launch_child, that takes a function fn and any number of positional and keyword arguments as params. This function should create a child process, call fn inside the child process, and pass it all the positional and keyword arguments that were passed in. After fn finishes running, the child process should exit.
To test your launch_child function, use the following program:
import sys
import os
import time
def launch_child(fn, *args, **kwargs):
    # Your implementation of launch_child here
    pass

def print_with_pid(*args, **kwargs):
    print(os.getpid(), *args, **kwargs)

sys.stdout.write('Ready to fork? (Press enter to continue) ')
sys.stdout.flush()
sys.stdin.readline()
PID_OF_CHILD = launch_child(print_with_pid, 'This prints inside child')
print_with_pid('child pid is', PID_OF_CHILD)
It is important that the child process exits immediately after fn returns. This means that the “child pid is …” line MUST NOT print inside the child process.
Two things might be important to a parent process - it might want to wait till a child process completes, and it might want to know whether a child process ran successfully or not.
Success or failure of a process is usually indicated by a number which is called its exit status. You can set the exit status of a Python process by calling sys.exit(). Calling this function gracefully terminates your Python process (by ensuring that the finally clauses of the try statement are run), and sets the exit status to the value passed to it.
An exit status can be between 0 and 255 (only the low 8 bits are reported to the parent). 0 means success, everything else indicates failure.
A parent process can wait for a child process by using the os.waitpid() call. waitpid() takes a child pid as argument along with an integer specifying options (usually set to 0). It returns a tuple containing the child pid and an exit status indication. The exit status indication is a 16-bit number whose low byte is the signal number that killed the process, and whose high byte is the exit status (if the signal number is zero). For now we will only worry about the exit status.
import sys
import os
import time
sys.stdout.write('Ready to fork? (Press enter to continue) ')
sys.stdout.flush()
sys.stdin.readline()
PID_AFTER_FORK = os.fork()
if PID_AFTER_FORK > 0:
    print('Inside parent')
    status_encoded = os.waitpid(PID_AFTER_FORK, 0)[1]
    print('Inside parent, child exited with code', status_encoded >> 8)
else:
    print('Inside child')
    time.sleep(2)
    sys.exit(127)
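As an aside, instead of shifting the status word manually, the os module provides helpers for decoding it; a small sketch (not from the original post):

```python
import os

pid = os.fork()
if pid == 0:
    os._exit(5)                    # child exits immediately with status 5
_, status = os.waitpid(pid, 0)
exited = os.WIFEXITED(status)      # True if the child exited normally
code = os.WEXITSTATUS(status)      # same value as status >> 8 here
```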
Exercise 6: Write a program that creates multiple children, and then waits for them. If any child exits, your program should print the pid of the child that exited. (Hint: check the different ways to specify the pid in the waitpid() call).
Exercise 7: Which process is the parent of the parent Python process? You can figure this out by using the ps command.
Exercise 8: How can you check the exit status of the last program that was run by a Unix shell (e.g. bash)?
Exercise 9: In bash, how do you run a series of commands one after another? The only constraint is that a command should run only if the previous one succeeded. That is, the pipeline should stop on first failure.
Exercise 10: Conversely, how do you run a pipeline of commands which should stop on first success?
When a child is forked, it gets an almost identical copy of all the memory segments of the parent. However, it is a copy - once forked, any further modifications made to memory by the parent process do not reflect in the child, and vice versa. This can be tested with the simple program below.
import sys
import os
import time
X = 100
Y = dict(foo=123)
if os.fork() > 0:
    print('Inside parent')
    X = 200
    Y['foo'] = 456
    print('Inside parent, X:', X)
    print('Inside parent, Y:', Y)
    # wait for child to complete
    time.sleep(3)
else:
    print('Inside child')
    time.sleep(2)
    print('Inside child, X:', X)
    print('Inside child, Y:', Y)
Memory isolation between processes is fairly easy to grasp. What may not be so easy to understand is how external resources like files work when a fork happens.
Exercise 11: Consider the following program that writes to a file from two processes:
import sys
import os
import time
with open(sys.argv[1], 'w') as f:
    if os.fork() > 0:
        for i in range(10):
            print('writing from parent, chunk:', i)
            f.write('aaa\n')
            time.sleep(1)
    else:
        time.sleep(0.5)
        for i in range(10):
            print('writing from child, chunk:', i)
            f.write('bbb\n')
            time.sleep(1)
Notice that the same file handle, f, is open and available inside both the child and the parent.
Without running it, can you say what this program will do? Keep in mind the fact that you are dealing with buffered I/O.
Exercise 12: Consider the following program that reads a file linewise from two processes:
import sys
import os
import time
with open(sys.argv[1], 'r') as f:
    if os.fork() > 0:
        for line in f:
            print('reading from parent:', line, end='')
            time.sleep(1)
    else:
        time.sleep(0.5)
        for line in f:
            print('reading from child:', line, end='')
            time.sleep(1)
Again, keeping in mind that you are dealing with buffered I/O, what do you think will happen when this program is run?
Exercise 13: If, instead of reading a file, we instead tried to read the standard input linewise in both the parent and the child, what would happen? Modify the program in the previous exercise to read from stdin instead and explain the behaviour.
There are many ways for two processes on the same system to communicate with one another. One way to do it is to use pipes. Pipes are most commonly used in the shell to send the output of one command to another. For example,
ps -eaf | grep python | less
The following program uses a pipe to send a message from the child process to the parent:
import sys
import os
import time
read_fd, write_fd = os.pipe()
if os.fork() > 0:
    # Close the write fd in parent, since we don't need it here
    os.close(write_fd)
    print('In parent, waiting for child to write something')
    bytes_read = os.read(read_fd, 10)
    print('In parent, child wrote:', bytes_read)
    os.close(read_fd)
else:
    # Close the read fd in child, since we don't need it here
    os.close(read_fd)
    time.sleep(1)
    print('In child, writing something')
    os.write(write_fd, b'hello')
    os.close(write_fd)
Here’s how this works: the function pipe() returns two file descriptors - read_fd and write_fd. Any data written to write_fd can be read from read_fd.
File descriptors, or “fds” in short, are non-negative integers that actually power many operations on Unix - including files, sockets and pipes, among others. In fact, the high level file API in Python is built on top of file descriptors and system calls such as open(), read(), write() and close().
The high level buffered API provided by Python is built on top of the raw unbuffered API provided by os.read() and os.write().
When a fork happens, any file descriptors open in the parent process remain open in the child process. This is actually why files opened in a parent process remain open in a child, as we covered in the previous section.
Now back to pipes - in our case, the child process wants to send a message to the parent process. So the child writes to write_fd and the parent reads from read_fd.
Also, we want to close the fds we don’t need. As the parent process has no use for write_fd, it closes this fd immediately after the fork. And as the child process has no use for read_fd, it closes that fd as soon as it is created. After everything is done, the remaining fd is also closed by both processes.
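Closing unused fds also matters for correctness: a reader sees end-of-file only after every copy of the write fd is closed. A quick single-process check:

```python
import os

r, w = os.pipe()
os.write(w, b'bye')
os.close(w)             # all write ends are now closed
data = os.read(r, 100)  # buffered data is still readable
eof = os.read(r, 100)   # after that, read() returns b'' - end of file
os.close(r)
```

If the parent had kept write_fd open in the program above, a loop reading from read_fd would never see EOF, because one write end would still exist.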
Pipes are not the only way for two processes to communicate with each other. The Wikipedia page on IPC lists the different approaches available.
Exercise 14: Write a program that launches multiple child processes. Provide a unique writable fd to each child. Whenever any child writes to its writable fd, the parent should print the byte string to console. You may need to use select() for this.
Exercise 15: The function map takes at least two arguments - another function and an iterable. It applies the given function to each element in the iterable, and returns a new iterator with the results.
>>> map(round, [1.4, 3.5, 7.8])
<map object at 0x10df15470>
>>> list(_)
[1, 4, 8]
Write a new function, pmap (parallelized map), that works similarly to map. It should take a function and an iterable as arguments. The difference is that it should apply the function to each element in a separate child process. The parent should then assemble the results into a new list and return it. You will need to use the pickle module to serialize object values between the parent and child processes - pickle.dumps() and pickle.loads() should be sufficient.
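As a quick illustration of the pickle round trip (this is just the serialization hint, not a solution to the exercise):

```python
import pickle

# Serialize a Python object to bytes, then reconstruct it. In pmap, the
# bytes would travel over a pipe from the child to the parent.
blob = pickle.dumps({'result': [1, 2, 3]})
restored = pickle.loads(blob)
```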
exec is another magical piece of functionality in Unix systems. exec is how you run an executable: it causes the program currently being run by the calling process to be replaced with a new program, with newly initialized stack, heap, and (initialized and uninitialized) data segments.
In other words, the new executable is loaded into the current process, and will have the same process id as the caller.
Let’s see it in action:
import sys
import os
sys.stdout.write('''Provide program name and args to run like you would in a shell.
Examples:
ls
ls -al
ls -l file1 file2 file3
$ ''')
sys.stdout.flush()
program_and_arguments = sys.stdin.readline().rstrip().split()
program = program_and_arguments[0]
arguments = program_and_arguments[1:]
os.execlp(program, program, *arguments)
sys.stdout.write('I executed a program\n')
sys.stdout.flush()
The exec functionality here is provided by os.execlp(). Run the program above and provide a program name and args to run - what happens? Did you see the string “I executed a program” in the output? If not, why not?
The Python interface to exec is provided by the os module, and is documented here. You will notice that exec is not a single function but a family of functions. All these variants provide the same functionality, differing only in one or more of the following:
- How the arguments are passed to the new program (individually, or as a single list/tuple).
- Whether the current environment is inherited, or a new environment is passed explicitly.
- Whether the program is looked up in the directories listed in PATH or not.

The modifiers e, l, p and v appended to the name “exec” tell us what combination of the above functionality is provided by a given variant. The documentation explains this in greater detail.
One thing you might have noticed is that in the invocation of execlp() above the program name was given twice. The first one tells execlp which program to run. The second one actually becomes the first argument (arg0) to the program. It is recommended that the first argument always be the name of the program, but this is not enforced.
You can test this by compiling the following C program, running it from our program above, and passing a different arg0 rather than the program name (Python does some funky stuff with sys.argv[0], which is why we are using a C program as our target here):
#include <stdio.h>

int main(int argc, char **argv) {
    printf("No of arguments: %d\n", argc);
    for (int i = 0; i < argc; ++i) {
        printf("argv[%d]: %s\n", i, argv[i]);
    }
    return 0;
}
Since exec replaces the current process with a different program, how do we launch another program yet retain our current process? Simple, fork and then exec. This is the classic Unix-y way of launching a new process, and is in fact what your shell probably does. We will attempt to do the same in the exercise that follows.
Exercise 16: Can you verify that the process running before and after exec is the same, i.e. that the pid remains the same before and after the call to exec?
Exercise 17: Create a function, launch_program(program_name, *args), that takes a program name and its arguments, if any. It should run the program in a separate process, wait for the program to exit, and after it does exit, return its exit status in the parent process.
Exercise 18: (Optional) Create a function, pipeline(commands). commands should be a list of commands. Each command is of the form [program_name, arg0, arg1, ...], i.e. it names a program and its arguments. pipeline() should launch each of these commands in parallel, and pipe the output of the first command to the second, the second command to the third, and so on. That is, the following,
pipeline([["ls", "-al"], ["grep", "-F", ".py"], ["wc", "-l"]])
should work the same as
ls -al | grep -F .py | wc -l
The function should wait for all the commands to exit, and return their exit status codes in an array.
Besides using fork, exec, pipe and wait, you will need one more function to make this work: dup2. dup2 is also pretty special - it allows you to duplicate a given fd to a target fd of your choice. This means you can duplicate one of the pipe fds to stdin or stdout as required. This setting up of pipes will probably need to be done between the calls to fork and exec.
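Putting the pieces together, here is a small sketch (not the exercise solution itself) of redirecting a child's stdout into a pipe with dup2() before exec, so the parent can read what the child program printed:

```python
import os

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    os.close(r)
    os.dup2(w, 1)      # fd 1 (stdout) of the child now points at the pipe
    os.close(w)
    os.execlp('echo', 'echo', 'hello')
    os._exit(127)      # reached only if exec fails
os.close(w)                 # parent closes its copy of the write end
output = os.read(r, 100)    # read what echo wrote to its stdout
os.waitpid(pid, 0)
os.close(r)
```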
While writing qbase64, I wanted it to be fast when specialized array types (SIMPLE-ARRAY, SIMPLE-BASE-STRING, etc.) were passed to the encoding/decoding routines, but I also wanted to support the more general types.
For example, the core encoding routine in qbase64, %ENCODE, which looks something like this (simplified):
(defun %encode (bytes string)
  (loop ;; over bytes and write to string
    ...))
goes through the BYTES array, taking groups of 3 octets each, and writes the encoded group of 4 characters into STRING.
If I declared its types like this:
(defun %encode (bytes string)
  (declare (type (simple-array (unsigned-byte 8)) bytes))
  (declare (type simple-base-string string))
  (declare (optimize speed))
  (loop ...))
SBCL would produce very fast code, but the function would no longer work for the more general ARRAY or STRING types.
And if I was to redefine the routine with more general types:
(defun %encode (bytes string)
  (declare (type array bytes))
  (declare (type string string))
  (declare (optimize speed))
  (loop ...))
the code produced would be significantly slower.
My experience with generics is limited, but it seemed that generics could solve this problem elegantly. Common Lisp doesn’t have generics in this sense (compile-time type dispatch), but it does support macros, so I came up with an ugly-but-gets-the-job-done hack.
I created a macro, DEFUN/TD, that would take all the different type combinations I wanted to optimize and support upfront:
(defun/td %encode (bytes string)
    (((bytes (simple-array (unsigned-byte 8))) (string simple-base-string))
     ((bytes (simple-array (unsigned-byte 8))) (string simple-string))
     ((bytes array) (string string)))
  (declare (optimize speed))
  (loop ...))
and generate code which would dispatch over the type combinations, then use LOCALLY to declare the types and splice the body in:
(defun %encode (bytes string)
  (cond
    ((and (typep bytes '(simple-array (unsigned-byte 8)))
          (typep string 'simple-base-string))
     (locally
         (declare (type (simple-array (unsigned-byte 8)) bytes))
         (declare (type simple-base-string string))
         (declare (optimize speed))
       (loop ...)))
    ((and (typep bytes '(simple-array (unsigned-byte 8)))
          (typep string 'simple-string))
     (locally
         (declare (type (simple-array (unsigned-byte 8)) bytes))
         (declare (type simple-string string))
         (declare (optimize speed))
       (loop ...)))
    ((and (typep bytes 'array)
          (typep string 'string))
     (locally
         (declare (type array bytes))
         (declare (type string string))
         (declare (optimize speed))
       (loop ...)))
    (t (error "Unsupported type combination"))))
The result is more generated code and an increase in the size of the Lisp image, but now the loop is well optimized for each type combination given to DEFUN/TD. The run-time dispatch might incur a slight penalty, but it is more than offset by the gains made.
This was a fairly interesting problem that I hadn’t dealt with before; nevertheless it looked like a fairly common one, so I asked on the cl-pro list a couple of years ago how others solved it. Mark Cox pointed me to a few libraries, all of which are quite interesting and attack more or less the same problem in different ways.
Is there a trick or two that I’ve missed? Feel free to tell me.
In this post I present an approach, using digital fingerprints, that will render tampering of EVMs and election results after close of polling futile. While there might be holes in this approach, I still believe there’s merit in discussing it.
The solution involves generating and disclosing to the public a digital fingerprint of the result from each EVM as soon as polls close. Each fingerprint is a seemingly random string of characters; however, they have a couple of highly desirable properties:
The fingerprints are generated using a cryptographic hash function. In subsequent sections, we will see how they work. But first, it’s very important to understand what this solution does not solve.
The solution proposed here cannot prevent tampering of EVMs before, or during, the election. Nor can it solve the problem of booth capture.
It only focuses on securing one aspect of the polling process, and that is manipulation of election results after polling closes. In fact, it only works if EVMs have not been tampered with, and booth capture has not occurred.
After disclosure of the digital fingerprint, which should be done as soon as polling closes, tampering of EVMs becomes irrelevant as an altered result will not be able to match the disclosed fingerprint.
The following sections get into the details of how this scheme works.
(Skip this section if you already know how they work)
A cryptographic hash function is a mathematical construct that takes an input text of any length and mixes its bytes to produce a fixed size string. This string, also known as a digest or a hash, is a digital fingerprint of the input message.
Examples of such hash functions include MD5, SHA-1, SHA-3, etc.
Some example (SHA-1) hashes are shown below:
| Text | Hash (SHA-1) |
|---|---|
| abracadabra | 0b8c31dd3a4c1e74b0764d5b510fd5eaac00426c |
| the quick brown fox | ced71fa7235231bed383facfdc41c4ddcc22ecf1 |
| the quick brown fix | e3a75de65fea42239e26476f6efe110f69932b8f |
| the quick brown fox jumped over the lazy dog | 3e4991b48bcb1bd9d3c4c14a1f24c415deaba466 |
A few important properties of cryptographic hashes are:
Also, as the second and third examples show, even a slight change in text input usually leads to large changes in the output hash.
So, while it’s very easy to calculate the hash of the string “the quick brown fox jumped over the lazy dog”, it is impossible to do the reverse – if all you had was the hash 3e4991b48bcb1bd9d3c4c14a1f24c415deaba466, you won’t be able to find the string that produced it.
Moreover, it is impossible to find another string that has the same hash.
Hash functions are also deterministic, i.e. they will always produce the same output for the same input, no matter when or how many times they are called.
It is important to understand that hash functions DO NOT encrypt the input string. There is no secret key involved, so there’s no chance of losing a key that will break the whole scheme. Hash functions only take one input – the text for which the digest needs to be produced.
(Note that while the examples here use SHA-1, it is quite old and no longer considered secure. It is recommended to use SHA-3 instead. The only reason we use SHA-1 here is readability – hashes generated by SHA-3 are a bit longer.)
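Computing such a fingerprint needs nothing beyond a standard library. Here is a minimal sketch in Python (the `fingerprint` function name is mine; `hashlib` also supports SHA-3 via `hashlib.sha3_256` if you want a stronger hash):

```python
import hashlib

def fingerprint(text):
    # SHA-1 digest of the text, rendered as a 40-character hex string
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# The digest length is fixed no matter how long the input is
print(fingerprint("abracadabra"))
print(fingerprint("the quick brown fox"))
# A one-letter change ("fox" -> "fix") yields a completely different digest
print(fingerprint("the quick brown fix"))
```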
What we are trying to achieve is this: once polling closes, we want a guarantee that the result in an EVM at that moment will not be different from the result that is revealed on counting day.
The result recorded in an EVM is simply a sequence of numbers, where each number indicates the votes received by a candidate (the order of these numbers is the same as the order of candidates on the ballot unit, which is fixed a few weeks prior to voting).
Assume that at a polling station there are five candidates, and the result stored inside the EVM at close of polling is this: 400,300,500,200,100 (i.e. the first candidate received 400 votes, the second candidate received 300 votes, and so on). The SHA-1 hash of this string is 91699a41d11cbe2e18319949151fd03ef529a833.
The EVM will only reveal the generated hash string and nothing else. This can safely be disclosed to the public at large.
On the day of counting, the EVM reveals the actual result. Anyone can look at the result and compute its hash. If the computed hash matches the hash revealed earlier, one can be fairly confident that the EVM has not been tampered with or replaced after polling closed.
How do we know that this works? Remember that even if you know the original string and the hash, you cannot find another string that has the same hash. So even if someone were to break into an EVM, view the result and change it, they can’t find another sequence of numbers that would have the same hash. Replacing an EVM won’t help either since the hash is already public.
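The commit-then-verify flow described above can be sketched in a few lines (a toy illustration, not actual EVM firmware; the result string follows the five-candidate example):

```python
import hashlib

def commit(result):
    # At close of polling: publish only the hash of the result string
    return hashlib.sha1(result.encode("utf-8")).hexdigest()

def verify(result, published_hash):
    # On counting day: anyone can recompute the hash and compare
    return hashlib.sha1(result.encode("utf-8")).hexdigest() == published_hash

published = commit("400,300,500,200,100")
print(verify("400,300,500,200,100", published))  # the genuine result matches
print(verify("400,300,500,300,100", published))  # a tampered result does not
```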
Another important aspect to consider is that one shouldn’t be able to figure out the result from the hash. Remember that it is impossible to figure out the original string just from the hash, so this should in theory work. However, since we already know the number of candidates and voters, it may not be that difficult to calculate the result by brute force, especially if either number is low. We’ll discuss this in more detail next.
Consider a polling station with 50 voters and only 2 candidates. There are only 51 ways in which the vote share can be divided between the two candidates:
0,50
1,49
2,48
...
50,0
So if someone wants to know the poll result beforehand, they can simply compute the hash for all 51 possible results (i.e. create a rainbow table):
| Result | Hash |
|---|---|
| 0,50 | c87b42a20015ca36b3ee027a8e125c7a71e3d4f8 |
| 1,49 | 151eaff1df5bbc8f0259d679047560b45740544e |
| 2,48 | 1f5916b0dbfa228a07b7d6293aca31e0e1dd53d6 |
| … | … |
| 50,0 | 406840d6e2e9517378d13240b158c2cf843e8d67 |
Now compare the hash provided by the EVM with the hashes in this table. The result is the one whose hash matches with the one provided by the EVM.
In essence, you are not breaking the hash function, but since the number of possible inputs is small, you don’t need to. You can simply compute the hash of every possible input.
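This table-lookup attack is easy to demonstrate (a sketch; the “published” hash below is fabricated for the demonstration):

```python
import hashlib

def sha1(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# Precompute the hash of every possible 2-candidate split of 50 votes
table = {sha1("%d,%d" % (a, 50 - a)): "%d,%d" % (a, 50 - a) for a in range(51)}

# Given only a published hash, the result is a single dictionary lookup
published = sha1("18,32")  # pretend this is what the EVM revealed
print(table[published])    # recovers "18,32" without breaking SHA-1
```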
As the number of candidates and voters increases, the feasibility of a brute force attack decreases:
At 100 voters and 5 candidates, commodity hardware can crack the result in seconds.
At 600 voters and 10 candidates, the fastest bitcoin mining hardware around (which specializes in computing hashes at a high speed) will take a few days to crack the result.
At 1000 voters and 15 candidates, one can be fairly confident that even a nation-state cannot brute force their way to the result anytime soon.
(See the addendum for a more detailed analysis behind these numbers)
All said and done, cryptographic hash functions alone are not sufficient to protect the secrecy of election results. How do we fix this?
The answer lies in randomization. Generate a long enough random number, append it to the result text, then compute the hash of this combined text. On counting day, when the results are revealed, the random number that was used should be revealed too, so that hash computation can still be verified independently.
Going back to our hypothetical result string 400,300,500,200,100: let’s say the EVM generates this random number: 249825579. We simply append this number to the result – 400,300,500,200,100,249825579 – and compute the hash of the combined text. The resultant hash is revealed immediately. And on counting day, the randomly generated number 249825579 is also revealed along with each candidate’s vote count.
What’s a long enough random number? A 128-bit random number (i.e. a number picked at random from 2^128 possibilities) should be good enough. If a true 128-bit random number is appended to every result text, no matter how few voters or candidates there are, the number of possible inputs is no less than 2^128. This is big enough that even if you had all the bitcoin mining hardware in the world at your disposal, Earth itself will be incinerated by the Sun before you can compute the result.
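The salted commitment can be sketched like this (function names are mine; `secrets.randbits` provides a cryptographically strong random number in Python):

```python
import hashlib
import secrets

def commit(result):
    # Append a 128-bit random salt before hashing; publish only the digest now
    salt = str(secrets.randbits(128))
    digest = hashlib.sha1((result + "," + salt).encode("utf-8")).hexdigest()
    return digest, salt  # the salt is revealed only on counting day

def verify(result, salt, published_digest):
    recomputed = hashlib.sha1((result + "," + salt).encode("utf-8")).hexdigest()
    return recomputed == published_digest

digest, salt = commit("400,300,500,200,100")
print(verify("400,300,500,200,100", salt, digest))  # True once salt is revealed
```

Without the salt, an attacker can no longer enumerate all possible result strings; with it, anyone can still verify the published digest.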
The problem with random numbers, though, is that generating truly random numbers is hard. And it is impossible to generate them from software without an external source of randomness. Do EVMs ship with a component that generates high quality random numbers? I think not.
Can this scheme work? Probably yes.
Is it feasible to do this today? Probably no.
As discussed under randomization, EVMs most likely don’t ship with a hardware based random number generator. So adopting this approach will likely require a hardware upgrade to EVMs, besides firmware upgrades. This alone makes this scheme quite infeasible in the short term.
One of the problems that has come up with EVMs (that didn’t exist with paper ballots) is that candidates will know how many votes they received from each polling station in their constituency. Some of them have threatened voters with post-poll reprisals if a particular area did not vote for them. This led to the introduction of a Totalizer that allows votes cast in about 14 polling stations to be counted together.
Our approach, which requires the hash and the random number to be generated in the EVM, is not compatible with this.
For the scheme to work, it is the totalizer, rather than the EVMs, that would need to change.
In recent years, the election commission introduced VVPAT based EVMs – besides registering the vote electronically, VVPAT machines also print the vote on paper, and store the paper votes in a sealed ballot box.
Unfortunately, only a small subset of paper votes are counted and tallied with the EVM result. If all the paper based votes were counted, that, combined with a verifiable digital fingerprint of the result, would in my opinion go a long way towards assuring the public about the sanctity of the polling process.
A more technical analysis of the efficacy of brute force attacks
For a polling station with n voters and k candidates, the number of different possible results is C(n+k-1, k-1). The “stars and bars” counting method proves this.
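These counts can be checked directly with Python’s math.comb (assuming the stars-and-bars formula above):

```python
from math import comb

def num_results(n_voters, k_candidates):
    # Stars and bars: ways to distribute n votes among k candidates
    return comb(n_voters + k_candidates - 1, k_candidates - 1)

print(num_results(50, 2))     # 51, matching the enumeration earlier
print(num_results(100, 5))    # a few million: trivially brute-forceable
print(num_results(1000, 15))  # astronomically large
```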
How powerful a computer would you need to crack these results?
Finally, if you add a 128-bit random number in the mix, even with all the computation power of the bitcoin network, you would still need hundreds of billions of years to crack the result, well outside the realm of possibility.
Postmodern provides a macro, PREPARE, that creates prepared statements for a PostgreSQL connection. It takes a SQL query with placeholders ($1, $2, etc.) as input and returns a function which takes one argument for every placeholder and executes the query.
The first time I used it, I did something like this:
(defun run-query (id)
(funcall (prepare "SELECT * FROM foo WHERE id = $1") id))
Soon after, I realized that running this function every time would generate a new prepared statement instead of re-using the old one. Let’s look at the macro expansion:
(macroexpand-1 '(prepare "SELECT * FROM foo WHERE id = $1"))
==>
(LET ((POSTMODERN::STATEMENT-ID (POSTMODERN::NEXT-STATEMENT-ID))
(QUERY "SELECT * FROM foo WHERE id = $1"))
(LAMBDA (&REST POSTMODERN::PARAMS)
(POSTMODERN::ENSURE-PREPARED *DATABASE* POSTMODERN::STATEMENT-ID QUERY)
(POSTMODERN::ALL-ROWS
(CL-POSTGRES:EXEC-PREPARED *DATABASE* POSTMODERN::STATEMENT-ID
POSTMODERN::PARAMS
'CL-POSTGRES:LIST-ROW-READER))))
T
ENSURE-PREPARED checks if a statement with the given statement-id exists for the current connection. If yes, it is re-used, else a new one is created with the given query.
The problem is that the macro generates a new statement id every time it is run. This was a bit surprising, but the fix was simple: capture the function returned by PREPARE once, and use that instead.
(defparameter *prepared* (prepare "SELECT * FROM foo WHERE id = $1"))
(defun run-query (id)
(funcall *prepared* id))
You can also use Postmodern’s DEFPREPARED instead, which similarly defines a new function at the top-level.
This works well, but now we are using top-level forms instead of the nicely encapsulated single form we used earlier.
To fix this, we can use LOAD-TIME-VALUE.
(defun run-query (id)
(funcall (load-time-value (prepare "SELECT * FROM foo WHERE id = $1")) id))
LOAD-TIME-VALUE is a special operator that arranges for its form to be evaluated once, at load time, in compiled code, and treats the resulting object as a literal value from then on.
By wrapping PREPARE inside LOAD-TIME-VALUE, we get back our encapsulation while ensuring that a new prepared statement is generated only once (per connection), until the next time RUN-QUERY is recompiled.
To avoid the need to wrap PREPARE every time, we can create a convenience macro and use that instead:
(defmacro prepared-query (query &optional (format :rows))
`(load-time-value (prepare ,query ,format)))
(defun run-query (id)
(funcall (prepared-query "SELECT * FROM foo WHERE id = $1") id))
This only works for compiled code. As mentioned earlier, the form wrapped inside LOAD-TIME-VALUE is evaluated only once if you compile it. In uncompiled code it is evaluated every time, so this solution will not work there.
Another thing to remember about LOAD-TIME-VALUE is that the form is evaluated in the null lexical environment, so it cannot use any lexically scoped variables, as in the example below:
(defun run-query (table id)
(funcall (load-time-value
(prepare (format nil "SELECT * FROM ~A WHERE id = $1" table)))
id))
Evaluating this will signal an error saying that the variable TABLE is unbound.
In a nutshell, we need to ensure that the same code is being run everywhere, including the project source, its libraries and the version of Python on which it is run.
Below are some quick notes on one way to achieve this.
We will use pipenv and pyenv to get this done.
pipenv is a package manager that uses pip and virtualenv under the hood. The project’s direct dependencies are added to a Pipfile, and the dependency graph is locked down in Pipfile.lock, which is generated automatically and never touched by hand. The lock file is crucial for reproducible builds; we will see how under project syncing.
pyenv makes it a breeze to install and manage multiple versions of Python. You specify the desired Python version in your Pipfile and pipenv will use pyenv to fetch and install the relevant Python version.
Add eval "$(pyenv init -)" towards the end of your shell’s init file (e.g. ~/.bash_profile, ~/.profile or ~/.bashrc).
If you use Homebrew or Linuxbrew you can simply run
brew install pipenv
Otherwise you will need to make use of the Python and Pip that already ship with your OS, or get it via pyenv. And then run something like:
pip install --user pipenv
Yeah, installing pipenv itself requires Python and pip. But this only needs to be done once.
See Installing Pipenv for more details.
Make sure that pyenv and pipenv are installed as indicated in the previous section.
mkdir test
cd test
Set up Python for your project: pipenv install --python 3. This will create a Pipfile and Pipfile.lock in the project directory.
If you use this command, by default pipenv will try to pick the Python 3 available on your system. If it doesn’t find one, it will ask if you want it to fetch a Python from pyenv.
If you want a more specific version of Python, use: pipenv install --python 3.7.
Install the libraries that your project depends on using pipenv install:
pipenv install django~=2.1.5
pipenv install djangorestframework~=3.9.1
You can skip specifying the version, but I wouldn’t recommend doing that. Note the use of the ~= operator. It is the compatible release operator and essentially means that a breaking version of the library won’t be installed when you try to update it. More on this under updating dependencies.
Commit Pipfile and Pipfile.lock to version control. Now you can share your project with the team.

Fetch the project from version control. Make sure that it contains both Pipfile and Pipfile.lock.
pipenv sync
That’s it. pipenv will install all your project’s dependencies (including Python, via pyenv) and allow you to start using them.
pipenv sync only looks at Pipfile.lock, installs the given dependencies locally and ensures that the hashes match. This is exactly what we need to ensure that the build is reproducible.
You should run pipenv sync every time the project’s dependencies are updated.
There are two ways to run our project using the newly installed Python and libraries:
The first is to invoke pipenv shell. This will drop you into a new shell with PATH and sys.path set up so that you get the correct version of everything. You can exit this shell at any time via Ctrl-D or exit.
The other way is to use pipenv run <cmd>. E.g. if you are running django, all you need to do is pipenv run python manage.py runserver and everything should work as expected.
How do you upgrade a library to a newer version?
One way is to simply run pipenv update name-of-library. If you used the compatible release operator, which you should, this will update the library to the newest version allowed by this operator.
For example, if you specified django~=2.0.0 in your Pipfile, then pipenv update django will update django to the highest version available under 2.0.x but not to a newer version in the 2.1.x series.
And if you specified django~=2.0, then it will update django to the highest version available under 2.x but will not go up to 3.x.
If you want to update django to a higher version than the one allowed by the compatible release operator, you need to use the install subcommand, i.e. do something like pipenv install django~=2.1.0.
The other way to do this is to simply update the Pipfile by hand, and subsequently run pipenv install. This will install the specified library version and also update Pipfile.lock.
Once you get past the installation hurdle, it seems easy and simple enough to use pipenv (with help from pyenv) to manage a project’s dependencies and get reproducible builds.
For more on pipenv, you can go through:
But how exactly does Chronicity work? In this post, we’ll dig into its innards and get a sense of the steps involved in writing it.
If you want to hack into Chronicity, or write your own NLP date parser, this might help.
Note: credit for Chronicity’s architecture goes to the Ruby library Chronic. It served both as an inspiration and as the implementation reference.
Broadly, Chronicity follows these steps to parse date and time strings:
We normalize the text before tokenizing it by doing the following:
All of this is accomplished by the PRE-NORMALIZE function. To convert numeric words to numbers, the NUMERIZE function is used. One caveat: we do not immediately normalize the term “second” – it can mean either the ordinal number or the unit of time. So we wait until after tokenization (see pre-process tokens) to resolve this ambiguity.
CHRONICITY> (pre-normalize "tomorrow at seven")
"next day at 7"
CHRONICITY> (pre-normalize "20 days ago")
"20 days past"
Next we assign a token to each word in the normalized text.
(defclass token ()
((word :initarg :word
:reader token-word)
(tags :initarg :tags
:initform nil
:accessor token-tags)))
(defun create-token (word &rest tags)
(make-instance 'token
:word word
:tags tags))
As you can see, besides the word, a token also contains a list of tags. Each tag indicates a possible way to interpret the given word or number. Take the phrase “20 days ago”. The number 20 can be interpreted in many ways: as a year, as the day of a month, as a plain number, or as a time (20:00).
Remember, we are still in the tokenization phase so we don’t know which interpretation is correct. So we will assign all four tags to the token for this number.
Each tag is a subclass of the TAG class, which is defined as follows.
(defclass tag ()
((type :initarg :type
:reader tag-type)
(now :initarg :now
:accessor tag-now
:initform nil)))
(defun create-tag (class type &key now)
(make-instance class :type type :now now))
The slot TYPE is a misnomer – it actually indicates the designated value of the token for this tag. For example, the TYPE for the year 2020 will be the integer 2020, while for the time 8 PM it will be an object denoting the time.
The slot NOW holds the current timestamp. It is used by some tag classes like REPEATER for date-time computations (discussed later).
The various subclasses of TAG are:

- SEPARATOR – things like slash “/”, dash “-”, “in”, “at”, “on”, etc.
- ORDINAL – numbers like 1st, 2nd, 3rd, etc.
- SCALAR – simple numbers like 1, 5, 10, etc. It is further subclassed by SCALAR-DAY (1-31), SCALAR-MONTH (1-12) and SCALAR-YEAR. A token for any number will usually contain the SCALAR tag plus one or more of the subclassed tags as applicable.
- POINTER – indicates whether we are looking forwards (“hence”, “after”, “from”) or backwards (“ago”, “before”). These words are normalized to “future” and “past” before they are tagged.
- GRABBER – the terms “this”, “last” and “next” (as in this month or last month).
- REPEATER – most of the date and time terms are tagged using this class. This is described in more detail below.

There are a number of subclasses of REPEATER to indicate the numerous date and time terms. For example:

- REPEATER-YEAR, REPEATER-MONTH, REPEATER-WEEK, REPEATER-DAY.
- REPEATER-MONTH-NAME is used to indicate month names like “jan” or “january”.
- REPEATER-DAY-NAME indicates day names like “monday”.
- REPEATER-TIME is used to indicate time strings like 20:00.
- REPEATER-DAY-PORTION.

In addition, all the REPEATER subclasses need to implement a few methods that are needed for date-time computations.
- R-NEXT – given a repeater and a pointer, i.e. :PAST or :FUTURE, returns a time span in the immediate past or future relative to the NOW slot. For example, assume the date in NOW is 31st December 2018.
  - (r-next repeater :past) for a REPEATER-MONTH will return a time span starting 1st November 2018 and ending at 30th November.
  - (r-next repeater :future) will return a span for all of January 2019.
  - For a REPEATER-DAY this would have returned 30th December for :PAST and 1st January for the :FUTURE pointer.
- R-THIS is similar to R-NEXT except it works in the current context. The width of the span also depends on the direction of the pointer.
  - (r-this repeater :past) for a REPEATER-DAY will return a span from the start of day until now.
  - (r-this repeater :future) will return a span from now until the end of day.
  - (r-this repeater :none) will return the whole day today.
- R-OFFSET – given a span, a pointer and an amount, returns a new span offset from the given span. The offset is roughly the amount multiplied by the width of the repeater.

Now we can put the whole tokenization and tagging piece together:
(defun tokenize (text)
(mapcar #'create-token
(cl-ppcre:split #?r"\s+" text)))
(defun tokenize-and-tag (text)
(let ((tokens (tokenize text)))
(loop
for type in (list 'repeater 'grabber 'pointer 'scalar 'ordinal 'separator)
do (scan-tokens type tokens))
tokens))
As you can see, computing the tags for each token is accomplished by SCAN-TOKENS, a generic function specialized on the class name of the tag.
One of the methods implementing SCAN-TOKENS is shown below.
(defmethod scan-tokens ((tag (eql 'grabber)) tokens)
(let ((scan-map '(("last" :last)
("this" :this)
("next" :next))))
(dolist (token tokens tokens)
(loop
for (regex value) in scan-map
when (cl-ppcre:scan regex (token-word token))
do (tag (create-tag 'grabber value) token)))))
(defmethod tag (tag token)
(push tag (token-tags token)))
Going back to our original example, for the text “20 days ago”, these are the tags set for each token (after normalization).
Token Tags
----- ----
20 [SCALAR-YEAR, SCALAR-DAY, SCALAR, REPEATER-TIME]
days [REPEATER-DAY]
past [POINTER]
We are almost ready to run pattern matching to figure out the input date, but first, we need to resolve the ambiguity related to the term second that we faced during normalization. At that time, we did not convert it to the number 2 since it could refer to either the unit of time or the number.
Now with tokenization done, we resolve this ambiguity with a simple hack: if the term second is followed by a repeater (i.e. month, day, year, january, etc.), we assume that it is the ordinal number 2nd and not the unit of time. See PRE-PROCESS-TOKENS for more details.
The last piece of the puzzle is pattern matching. Armed with tokens and their corresponding tags, we define several date and time patterns that we know of and try to match them against the input tokens.
First we name a few pattern classes – each pattern we define belongs to one of these classes.
- DATE – patterns that match an absolute date and time e.g. “1st January”, “January 1 at 2 PM”, etc.
- ANCHOR – patterns that typically involve a grabber e.g. “yesterday”, “tuesday”, “last week”, etc.
- ARROW – patterns like “2 days from now”, “3 weeks ago”, etc.
- NARROW – patterns like “1st day this month”, “3rd wednesday in 2007”, etc.
- TIME – simple time patterns like “2 PM”, “14:30”, etc.

A pattern, at its simplest, is just a list of tag classes. A list of input tokens successfully matches a pattern if, for every token, at least one of its tags is an instance of the tag class mentioned at the corresponding position in the pattern. For example, the text “20 days ago” had these tags:
Token Tags
----- ----
20 [SCALAR-YEAR, SCALAR-DAY, SCALAR, REPEATER-TIME]
days [REPEATER-DAY]
past [POINTER]
It will match any of these patterns:
(scalar repeater pointer)
(scalar repeater-day pointer)
((? scalar) repeater pointer)
The last example shows a pattern with an optional tag – (? scalar). It will match tokens with or without the scalar, e.g. both “20 days ago” and “week ago” will match.
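The matching rule just described can be illustrated with a short Python sketch (this is not Chronicity’s implementation – tags are plain strings here and tag-class inheritance is ignored):

```python
def matches(pattern, tokens):
    # pattern: list of tag names; ("?", tag) marks an optional element.
    # tokens: list of tag-name lists, one list per token.
    if not pattern:
        return not tokens  # success only if all tokens were consumed
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, tuple) and head[0] == "?":
        # optional element: try skipping it, then try consuming a token
        if matches(rest, tokens):
            return True
        return bool(tokens) and head[1] in tokens[0] and matches(rest, tokens[1:])
    return bool(tokens) and head in tokens[0] and matches(rest, tokens[1:])

tokens = [["scalar-year", "scalar-day", "scalar", "repeater-time"],  # "20"
          ["repeater-day"],                                          # "days"
          ["pointer"]]                                               # "past"
print(matches(["scalar", "repeater-day", "pointer"], tokens))             # True
print(matches([("?", "scalar"), "repeater-day", "pointer"], tokens))      # True
print(matches([("?", "scalar"), "repeater-day", "pointer"], tokens[1:]))  # True: "week ago"
```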
Our pattern matching engine also allows us to match an entire pattern class. For example,
(repeater-month-name scalar-day (? separator-at) (? p time))
(? p time) here means that any pattern belonging to the TIME pattern class can match. So all of “January 1 at 12:30”, “January 1 at 2 PM” and “January 1 at 6 in the evening” will match without us needing to duplicate all the time patterns.
Note: There’s one limitation – a pattern class can only be specified at the end of a pattern in Chronicity. So a pattern like (repeater (p time) pointer) won’t work. This will be fixed in the future.
Each pattern has a handler function that decides how to convert the matching tokens to a date span.
A pattern and its handler function are defined using the DEFINE-HANDLER macro. It assigns one or more patterns to a pattern class, and if any of these patterns match, the function body is run. Its general form is:
(define-handler (pattern-class)
(tokens-var)
(pattern1 pattern2 ...)
... body ...
)
An example handler is shown below.
(define-handler (date)
(tokens)
((repeater-month-name scalar-year))
(let* ((month-name (token-tag-type 'repeater-month-name (first tokens)))
(month (month-index month-name))
(year (token-tag-type 'scalar-year (second tokens)))
(start (make-date year month)))
(make-span start (datetime-incr start :month))))
Most handler functions will make use of the repeater methods R-NEXT, R-THIS and R-OFFSET that we described above.
Chronicity implements this pattern matching logic in the TOKENS-TO-SPAN function. All the patterns and their handler functions are defined inside handler-defs.lisp. Patterns defined earlier in the file get precedence over those defined later. If you add, remove or modify a handler, you should reload the whole file rather than just evaluating that handler’s definition.
Finally, we put everything together.
(defun parse (text &key (guess t))
(let ((tokens (tokenize-and-tag (pre-normalize text))))
(pre-process-tokens tokens)
(values (guess-span (tokens-to-span tokens) guess) tokens)))
By default PARSE will return a timestamp instead of a time span. This depends on the value passed to the :GUESS keyword – see the GUESS-SPAN function for how it is interpreted. If you want a time span returned, pass NIL instead.
The second value that this function returns is the list of tokens along with all their tags. This is useful for debugging Chronicity results in the REPL.
CHRONICITY> (parse "20 days ago")
@2018-12-12T12:01:53.758578+05:30
(#<TOKEN 20 [SCALAR-YEAR, SCALAR-DAY, SCALAR, REPEATER-TIME] {1007639243}>
#<TOKEN days [REPEATER-DAY] {10076AF5D3}> #<TOKEN past [POINTER] {1007553443}>)
CHRONICITY> (parse "20 days ago" :guess nil)
#<SPAN 2018-12-12T00:00:00.000000+05:30..2018-12-13T00:00:00.000000+05:30>
(#<TOKEN 20 [SCALAR-YEAR, SCALAR-DAY, SCALAR, REPEATER-TIME] {1001B78BC3}>
#<TOKEN days [REPEATER-DAY] {1001B78C03}> #<TOKEN past [POINTER] {1001B78C43}>)
The actual PARSE function has a few more bells and whistles than the one defined here:

- :ENDIAN-PREFERENCE to parse ambiguous dates as dd/mm (:LITTLE) or mm/dd (:MIDDLE)
- :AMBIGUOUS-TIME-RANGE to specify whether a time like 5:00 is in the morning (AM) or evening (PM).
- :CONTEXT can be :PAST, :FUTURE or :NONE. This determines the time span returned for strings like “this day”. See the definition of R-THIS above.

Note: The ideas explored in this document have more or less been implemented for deftask. Go to api.deftask.com to see it in action.
Let’s say you’ve set up a brand new webapp at example.com and want to expose a REST API. How do you design the URLs for API requests and documentation? How do you handle versioning?
One popular option is to use api.example.com for API requests, another endpoint for documentation, and possibly a third endpoint for an API explorer (if it exists).
For authentication, the preferred option, it seems, is to generate an API key or obtain an OAuth access token, then send it using bearer authorization in the request: Authorization: Bearer <access_token>
Versioning is usually handled in one of two ways:
- a version prefix in the URL, e.g. api.example.com/v1/
- a version in the request headers, e.g. Accept: application/vnd.api.v1+json

All of this works, however it takes a bit of time to figure out. You have to find the API docs, then figure out the endpoint, authentication, versioning, etc. Moreover, unless you have an API explorer, trying out an actual response takes even longer (figure out the right curl incantation or something similar). Testing even GET requests in the browser is really hard with many APIs.
This document proposes a small set of conventions to make working with REST APIs (discovery, testing and exploration) a little bit easier.
Given a webapp on example.com, let’s use api.example.com for exposing the API. We will use this endpoint not just for API requests but also for documentation.
Here’s how the URLs will look for documentation:
- api.example.com – API documentation home page (introduction, authentication, versioning, etc.)
- api.example.com/resource – documentation for the resource example.com/resource
- api.example.com/collection/:id – documentation for example.com/collection/:id

As you can see, for any given resource on example.com, to check its documentation just change the domain to api.example.com.
The same URLs are used for API requests. However, one needs to append the API version as a query parameter in the URL to make the API request. So,
- api.example.com/resource?v=1 will be used to send an API request for example.com/resource at API version 1.
, :id
should obviously be a real id in the database when making an API request; when viewing documentation it can be anything.
This scheme, combined with basic authentication as explained below, means that a user can easily explore your API using GET requests in the browser itself.
Specify the version value along with show=doc to show documentation for an older version. For example: api.example.com/resource?v=1&show=doc.
We already allow users to send GET requests in the browser, but we can do much better.
# Request
GET /posts/1?v=1
Host: api.example.com
# Response
200 OK
{
id: 1,
title: "foo bar",
body: "...",
links: {
"comments": "https://api.example.com/posts/1/comments?v=1"
}
}
Pass show=pretty along with the version to get the same response, except it returns HTML which renders the JSON in a pretty way – indented, syntax highlighted and with clickable links.
alongwith the version to get the same response, except it returns HTML which renders the JSON in a pretty way – indented, syntax highlighted and with clickable links.This is fairly simple to implement but allows users to explore related API resources with just a click. Plus, since we use basic authentication, the credentials are cached automatically by the browser so the user doesn’t need to provide them every time they follow a link.
That said, this exploration is limited only to GET requests. If you want a full fledged API explorer, you can instead provide something like api.example.com/resource?show=explorer which lists all the supported methods for the given resource and allows the user to test any of them.
Besides bearer authentication, I also recommend supporting basic authentication because of its support in the browser. Go with username and password, or if you only want to support access tokens, use bearer as the username and the access token as the password. This, again, ensures that GET requests can be easily tested in the browser.
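The bearer-as-username variant maps onto a standard basic-auth Authorization header; a quick sketch (the token value is made up):

```python
import base64

def basic_auth_header(access_token):
    # Basic auth encodes "username:password"; here the username is literally "bearer"
    credentials = ("bearer:" + access_token).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

print(basic_auth_header("my-access-token"))
```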
Some services allow sending the access token as a query parameter. I DON’T recommend doing this. That’s because an access token is sensitive data, but unfortunately query parameters are included by default in almost all HTTP logs. You might also inadvertently share an API URL with your access token in it. Use basic authentication instead, it’s much safer.
By following these two conventions – serving documentation and API requests from the same URLs, and supporting browser-friendly basic authentication – we make discovery of our REST API and its documentation much easier.
Also, for any resource under example.com, users can reach the API documentation, or a pretty or raw JSON response, with just one or two clicks. This should allow them to get started with your API much quicker.
H/T @rakesh314 ↩