Christoffer Lernö

When talking about packages / modules, I think it's useful to start with Java. As a language similar to C/C++ but with an import / module system from the beginning, it ended up being very influential.

Importing a namespace or a graph

Interestingly, the import statement in Java doesn't actually import anything. It's a simple namespace folding mechanism, allowing you to use something like java.util.Random as just Random. The fact that you can use a fully qualified name later in the source code to implicitly use another package means that the imports do not fully define the dependencies of a Java source file.

In Java, given a collection of source files, all must be compiled to determine the actual dependencies. However, we can imagine a different model where the import statements create a dependency graph, starting from the source file that is the main entry point. In this model we may have N source files, but not all of them are compiled, since only a subset M can be reached from the import graph.

This latter model allows some extra features. For example, we can build a feature where importing a source file also implicitly causes a dynamic or static library to be linked. Because only the source code in the graph is compiled, we only get the extra link parameter if the imports reach the source file that carries it.

The disadvantage is that the imports need to have a clear way of finding the additional dependencies. This is typically done with a file hierarchy or strict naming scheme, so that importing foo.bar allows the compiler to easily find the file or files that define that particular module.

Folding the import

For module systems that allow sub modules, so that there's both foo.bar and foo.baz, the problem with verbosity appears: do we really want to type std.io.net.Socket everywhere? I think the general consensus is that this is annoying.

The two common ways to solve this are namespace folding and namespace renaming, but I'm going to present one more which I term namespace shortening.

Namespace folding is the easiest: you import std.io.net and now you can use Socket unqualified. This is how it works in Java. However, we should note that in Java any global or function is actually prefixed with its class name, which means that even when folding the namespace, your globals and "functions" (static methods) end up having a prefix.

To overcome collisions and the shortcomings of namespace folding, there's namespace renaming, where the import explicitly renames the module name in the file scope, so std.io.net might become n and you now use n.Socket rather than the fully folded or fully qualified name. The downside is naming this namespace alias. Naming things well is known to be one of the harder problems in programming, and it can also add to the confusion if the alias is chosen differently in different parts of the program, e.g. n.Socket in one file and netio.Socket in another.

A way to address the renaming problem is to recognize that usually only the last namespace element is needed to distinguish one function from another, so we can allow an abbreviated namespace to be used in place of the full one. With this scheme std.io.net.open_socket(), io.net.open_socket() and net.open_socket() are all valid as long as there is no ambiguity (for example, if an import made foo.net.open_socket() available in the current scope, then net.open_socket() would be ambiguous and a longer path, like io.net.open_socket(), would be required). C3 uses this scheme for all globals, functions and macros, and it seems successful so far.
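
As a minimal sketch of how this looks in C3 (the module and function names are made up, and imports are left out since the focus is on the call paths):

module mylib::net;
fn void open_socket()
{
  /* ... */
}

module my_app;
fn void test()
{
  net::open_socket();          // shortened path
  mylib::net::open_socket();   // the fully qualified path also works
}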

Lots of imports

In Java, imports quickly became fairly onerous to write, since using a class foo.bar.Baz would often mean also using another class like foo.bar.Bar, and now both needed to be imported. While wildcard imports helped a bit, those would pull in more classes than necessary, and so inspecting the import statements would obfuscate the actual dependencies.

As a workaround, languages like D added the concept of re-exported imports (D calls this feature "public imports"). In our foo.bar.Baz case, Baz could import foo.bar.Bar and re-export it, so that an import of foo.bar.Baz implicitly imports foo.bar.Bar as well. The downside here, again, is that it's not possible to see the actual dependencies just by looking at the imports.

A related feature is implicit imports determined by the namespace hierarchy. So for example in Java, any source file in the package foo.bar.baz has all the classes of foo.bar implicitly folded into its namespace. This folding goes bottom up, but not the other way around. So while foo.bar.baz.AbcClass sees foo.bar.Baz, Baz can't access foo.bar.baz.AbcClass without an explicit import.

An experiment: no imports

For C3 I wanted to try going completely without imports. This was feasible mainly due to two observations: (1) type names tend to be fairly universally unique, and (2) methods and globals are usually unique with a shortened namespace. So given Foo and foo::some_function(), these should mostly be unique without the need for imports. In other words, this is a completely implicit import scheme.

This is complemented by the compiler requiring the programmer to explicitly say which libraries should be used for compilation. So imports could be said to be done globally for the whole program in the build settings.

This certainly works, but it has a drawback: let's say a program relies on a library like Raylib. Raylib in itself will add a lot of types and functions, and while it's no problem to resolve them, it can be confusing for a casual reader: "Oh, a Vector2, is this part of the C3 standard library?" Having an import raylib; at the top would immediately hint to the reader where Vector2 might be found.
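
For example, with no imports in the file there is nothing that signals where the type comes from (the function here is made up for illustration, and Vector2 stands in for Raylib's vector type):

module my_game;

fn float length_squared(Vector2 v)
{
  return v.x * v.x + v.y * v.y;  // Vector2 is Raylib's, but nothing in this file says so
}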

Wildcard imports for all?

The problem with zero imports suggests an alternative: wildcard imports as the default. So import raylib; would be the standard form of import and would recursively import everything in raylib, and similarly import std; would get the whole standard library. This would be more for the reader of the code to find the dependencies than a necessity for the compiler.
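
With that design the same file might instead start like this (a sketch of the proposed scheme, not a description of current C3 behaviour):

module my_game;
import raylib;  // recursively imports everything in raylib
import std;     // the whole standard library

// ... same code as before, but a reader can now guess that Vector2 comes from raylib.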

One problem with this design is the sub module visibility rules: "what do foo::bar::baz and foo::bar see?"

Java would allow foo::bar::baz to see the foo::bar parent module, but not vice versa. However, looking at the actual usage patterns, it seems to make sense to make this bidirectional, so that all are visible to each other.

But if parent and child modules are visible to each other, what about sibling modules? E.g. does foo::bar::baz see foo::bar::abc? In actual use cases there are arguments both for and against. But if we have sibling visibility, what about foo::def and foo::bar::abc? Could they be visible to each other? And if not, would such rules get complicated?
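
Concretely, the sibling question looks like this (module names are made up):

module foo::bar::abc;
fn void helper()
{
  /* ... */
}

module foo::bar::baz;
fn void test()
{
  abc::helper();  // a sibling call: should this resolve without anything extra?
}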

To create a more practical scenario, imagine that we have the following:

  1. std::io::file::open_filename_for_read(), a function to open a file for reading
  2. std::io::Path, a type representing a general path
  3. std::io::OpenMode, a distinct type for a mask value for file or resource opening
  4. std::io::readoptions::READ_ONLY, a constant of type OpenMode

Let's say this is the implementation of (1):

fn File* open_filename_for_read(char[] filename)
{
  Path* p = io::path_from_string(filename);
  defer io::path_free(p);
  return file::open_file(p, readoptions::READ_ONLY);
}

Here we see that std::io::file must be able to use std::io and std::io::readoptions. The readoptions sub module needs std::io but not the file sub module. Note how C3 uses functions in sub modules where other languages would typically use static methods. If we want to avoid excessive imports in this case, then file needs sibling and parent visibility, whereas readoptions only requires parent visibility.

Excessive rules around visibility are hard to implement well, hard to test and hard to remember, so it might be preferable to simply say that a module has visibility into any other module under the same top module. The downside would of course be that visibility becomes much wider than what's probably desired (e.g. std::math having visibility into std::io).

Conclusions and further research for C3

Like everything in language design, imports and modules have a lot of trade-offs. Import statements may be used to narrow down the dependency graph, but at the same time a language with a lot of imports doesn't necessarily use them in that manner. For namespace folding it matters a lot whether functions are usually grouped as static methods or free functions. Imports can be used to implicitly determine things like linking arguments, in which case the actual import graph matters.

For C3, the scheme with implicit imports works thanks to library imports also being restricted by build scripts, but high level imports could still improve readability. However, such a scheme would probably need recursive imports, which raises the question of implicit imports between sub modules. For C3 in particular this is an important usability concern, as sub modules are used to organize functions and constants more than is common in many other languages. This is the area I'm currently researching, but I hope that within a few weeks I can have a design candidate.

Christoffer Lernö

Looking at old presentations of programming languages that never managed to catch on, I am often very interested in figuring out just why they failed.

Why is this useful?

I think it's useful for language designers to consider why some things fail and why some things succeed. In the end a language is serving some intended group of users [1], so ask the question "why didn't it succeed in doing that?"

I believe it's an important thing to ask, because the answer often isn't that "the language was bad". It often wasn't a bad language, but there was still something it failed to do which prevented people from using it.

It also implies that in order to actually serve a group of users (the presumed goal of a language) we do not only need to create a good language, but also a language which succeeds in reaching the users.

In order to succeed at language design we must not only make sure that the language is good, but also ensure that there is a way for the intended users to make use of it.

Why do languages fail?

The obvious and most common way a language can fail is by never being completed. It doesn't matter how good the features are if the language can't be implemented.

The second big thing I see is the "build it and they will come" thinking. That is, the idea that all you need to do is to write a sufficiently good language and then somehow that should be enough to make the language universally adopted by everyone. Unfortunately reality does not work like that.

Looking at successful languages, there is no real common pattern. Corporate backing helps, but isn't a guarantee. Lots of initial interest is good, but doesn't guarantee success, and so on.

But while it's difficult to clearly make out the road to success, we can still study other languages for clues on how to avoid the paths leading to failure.

In the end the failure to understand why languages fail is the biggest reason why languages fail.


[1] Zig has "together we serve the users" in its mission statement – addressing exactly this.

Christoffer Lernö

Can you really do a module system without import statements? And should you? If you’re like me you’d probably initially dismiss the idea: “surely that can only work for very simple examples!”

But someone filed an issue to add it to C3, so I had to motivate why it would be difficult or impossible to do well (this actually ended with me redesigning the module system quite a bit). But the question of whether it's possible stuck with me.

Why it shouldn't work

Let’s quickly review the problems with no imports (where modules are loaded automatically).

1. Ambiguities

The classic example is the function “open”, which would clash with open in all other modules, making it necessary to use the full module names:

module foo;
fn File* open(char* filename) {  }

module bar;
fn Connection* open(char* url) {  }

module baz;
fn void test()
{
   open("foo.txt"); // Which one is intended?
}

2. Bad search & code completion

When all files basically import everything, then every public function has to be listed for code completion or search.

If you'd just match your own code it wouldn’t be so bad, but add to that the whole standard library + any library you’re importing… you'll get a lot of matches.

3. Compiling more than necessary

Some languages use imports to figure out exactly what files to compile. Implicitly having everything imported means everything needs to be analyzed during compilation.

4. Dependencies are not obvious

Explicit imports help both readers of the source code and tools like IDEs, by limiting in a simple way the set of files the current file depends on.

Summing it up

All in all the situation looks pretty grim so there’s a reason why we don’t see this.

There are outliers: pre-namespace PHP, and from what I’ve heard there’s a Prolog variant which has a form of auto import as well. Unfortunately these examples offer very little in terms of encouragement.

Making a try

Despite this I found that I personally couldn’t really dismiss the idea entirely; for my own peace of mind I had to make sure it wasn’t possible. Let’s revisit the problems:

1. Ambiguities

In this case I actually had the problem halfway solved: in C3 all functions are expected to be called with at least a partial path qualifier.

To call the function foo() from module std::bar in another module you write bar::foo() (std::bar::foo() works as well).
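
In code, mirroring the rule above (a minimal sketch):

module std::bar;
fn void foo()
{
  /* ... */
}

module app;
fn void test()
{
  bar::foo();       // partial path qualifier
  std::bar::foo();  // the full path works as well
}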

I haven't seen the idea of using abbreviated module paths elsewhere, and so it seems to be a novel invention. It should be possible to implement in any namespace scheme where namespaces are separate from types.

However, in C3 structs and other user defined types do not require any qualifiers. The reasoning is that type names in general tend to be fairly unique, except where two libraries try to abstract the same thing (for example, two file IO libraries will probably both use File as a type name somewhere).

Name collisions are rare with explicit imports, but for implicit imports this might become a real issue.

module foo::io;
struct File { ... }

module std::io;
struct File { ... }

module bar;
File *f; // Which File is this?

Fortunately we can introduce a simple feature to help us out: we reintroduce import but change its meaning, so that it simply makes the imported module’s types and functions preferred over those of modules that aren’t imported when doing name resolution.

So returning to the example with File: rather than having to type foo::io::File to disambiguate it from std::io::File, we simply add import foo::io to the start of the file:

module bar;
import foo::io;

File *f; // This is foo::io::File

If we squint at it, this is actually a little like how Java’s imports work: they only add the possibility to use the imported classes without qualifiers.

So it seems (1) can be considered solvable for any language that is fine with path qualifiers in front of functions and globals, like C3.

3. Compiling more than necessary

For reasons that will become apparent later, let's jump to this point first.

Trying to solve this requires us to look at our compilation model in general. For the more extreme version of this, let’s assume that all our libraries are in source form rather than precompiled. We can say we roughly have 3 types of source code: the application code, external libraries and the standard library.

In C3 you already specify the libraries you want to use in the project settings. The problem here is projects that bring in their own dependencies.

There’s a simple model we could use here:

  • the application code only sees what is public in the actual libraries imported.
  • the external libraries are resolved seeing only their own dependencies, not the application code.

Let’s say you have a library which allows you to set up an HTTPS service, which in turn uses a crypto library: your application code will not see the crypto library and the HTTPS service will not see other libraries that the application code uses.

To summarize:

  1. Application code: sees library and standard library public types, variables and functions.
  2. Library: sees only public declarations of its own dependencies and the standard library.
  3. Standard library: only sees itself.

Here we're moving dependencies and imports from the source files into the build configuration.

Unfortunately, in practice we will likely still parse most of the code and only decide what needs to be lowered into actual code after analysis. In other words, this is not necessarily a win. Parsing and semantic analysis are a small part of the compile time, so avoiding them for some code doesn't necessarily help much.

Java "modules"

Taking a detour now: Java has a large standard library, and frameworks typically add a fair number of additional dependencies. To address this, Java introduced “modules” in Project Jigsaw (not to be confused with the Java packages that are used with import). Jigsaw modules essentially group Java packages, described in a special file that also specifies dependencies on other “modules”. The idea is to drastically reduce the number of packages that need to be bundled for an application.

This is very similar to the compilation issue above. By providing a file which describes in detail what parts of the libraries the application uses, the compiler can actually begin with those library definitions before lexing and parsing starts. So in your app you could perhaps not just define the libraries you want to use, but also specify the subset of their modules you actually depend on. In practical terms we define in a single place what our imports are, and the compiler just needs to work with this subset. This is somewhat analogous to keeping a precompiled header in C with all the external library headers you want to use in the project. Although we're not necessarily reducing compile time further, we're making the job a lot simpler for the compiler.

2. Bad search & code completion

Armed with this we can go back to the question of search: if we now use these package dependency files we've suddenly reduced the lookup for code completion to the subset of packages we actually use in our project, which effectively resolves this issue.

4. Dependencies are not obvious

We’re also ready to tackle the dependencies, because we're now in a much better situation than with per-file imports: we can see all the dependencies our project has, and also what dependencies the libraries we depend on have, by inspecting just a few files.

If libraries split their dependencies into multiple groups we can also get a reduction in the number of libraries we need for compilation.

As an example, let us envision an HTTP server library which supports both HTTP and HTTPS. The HTTPS part depends on a cryptography library which contains multiple types of algorithms. If the library is split into multiple modules, then we can perhaps let the HTTP part depend only on a TCP library, whereas the HTTPS part also depends on the cryptography algorithms it needs.

Depending on how much granularity there is, something not using HTTPS might avoid downloading the cryptography library altogether, and even when HTTPS is included, packages with deprecated hash and crypto algorithms need not be included to compile the HTTPS library.

Does this mean it works?

It seems like for most module systems it could work – given that the caveats listed are satisfied.

But should one do it? I would hedge my bets and say "possibly". Regular imports require less of the language and are the proven approach, but I believe I've shown that "modules without imports" could still be up for consideration when designing a language.

Christoffer Lernö

Last time we were looking at this example:

macro int testmacro(int x)
{
  int z = 2;
  for (int i = 0; i < x; i++)
  {
    z *= 2;
  }
  return z;
}     

fn int test(int y)
{
  int z = getValue(y);
  int x = @testmacro(z);
  return z + x;
}  

In our previous solution, we had the variables in an array, where each scope would keep track of its first and last local. Before entering testmacro, this list is [int y, int z, int x], but entering the macro we would get [int y, int z, int x, int x, int z], which would mean shadowing.

A naive solution would be name mangling. Let's say locals in a macro are prefixed with the macro name: [int y, int z, int x, int _testmacro$x, int _testmacro$z]. Our lookup must then:

  1. Lookup with the macro prefix.
  2. If not found, lookup without the macro prefix, but in this case only accept globals.

Aside from not actually solving the problems that come later, it's complex for no real benefit, because we can instead simply insert a sentinel in the list: [int y, int z, int x, SENTINEL, int x, int z].

Now when we scan back we always stop at the sentinel value. This means that when entering the macro scope we simply push the sentinel value onto the stack of locals (this is not the only way to achieve the same effect, but it's the simplest version to explain). When looking up locals in the array we can now stop as soon as we reach either the first element OR the sentinel value.
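
As a minimal sketch of the idea in C3-like code (reusing the array-of-locals model described above; a null entry plays the role of the sentinel, and all names are illustrative):

struct Decl
{
  char* name;
}

Decl*[256] locals;
usz next_local = 0;

fn void enter_macro_scope()
{
  locals[next_local] = null;  // push the sentinel
  next_local++;
}

fn Decl* find_local(char* interned_name)
{
  usz i = next_local;
  while (i > 0)
  {
    i--;
    Decl* decl = locals[i];
    if (decl == null) break;  // hit the sentinel: stop and fall back to global lookup
    if (decl.name == interned_name) return decl;
  }
  return null;
}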

Problem solved?

Resolution without hierarchies

If your macros only take values, then this solution is sufficient. However, we often want to use macros to provide an expression that is only conditionally evaluated. In C3 we use # in front of the variable identifier to indicate an unresolved expression.

macro foo(#expr)
{
  return #expr * #expr;
}
fn int test(int z)
{
  return @foo(call(z));
  // => return call(z) * call(z);
}

Now we're running into problems. Both z and call should be resolved in the test function scope. Ooops.

What happens if we tag the #expr with the current scope? This seems like it could work, but in C3, like with GCC statement expressions, we can introduce new variables.

macro int two_times(#expr)
{
  int w = 1;
  #expr;
  #expr;
  return w;
}
  
fn void test2(int z)
{
  @two_times({| int y = z; call(y); |});
}  

So we go into two_times with [int z], push the sentinel and then w, giving [int z, SENTINEL, int w]. Now when we evaluate #expr inside two_times we would like something like this: [int z, int y, SENTINEL, int w]. That is, we slip a new scope in on the function side, not in the macro scope we pushed.

Trying a hack

What we might realize here is that if we check #expr before entering the macro, just enough to resolve all the declarations in it, we might get the behaviour we want. So something like this:

  1. Enter test2 scope
  2. Push z
  3. Start evaluating the macro call.
  4. Take the macro call argument and only check the declarations.
  5. Enter expr scope
  6. Push y
  7. Resolve z
  8. Resolve y
  9. Pop expr scope
  10. Pass in this pre-checked expression into the macro.
  11. Enter the two_times scope
  12. Push w
  13. Copy #expr and insert it.
  14. Evaluate #expr - which will not need a lookup
  15. Copy #expr and insert it.
  16. Evaluate #expr - which will not need a lookup
  17. Lookup w
  18. Pop the macro scope
  19. Pop the test2 scope

This scheme looks like it would work, but there are questions: what if the declarations inside should not be resolved the same way twice? What if the expr instead looks like:

@two_times({|
  $if (@some_compile_time_macro(...)):
    int y = 0;
  $else:
    int z = 0;
  $endif;
  $if ($defined(y)):
    y = 1;
  $endif;
|});  

Here it's not clear that two invocations of the same expr will even lower to the same declarations! So we can't do the lookup ahead of time.

The alternative is to completely evaluate expr, not just the declarations. It's a possible solution, but the corner cases with this approach are hard to foresee.

Summary

If our macros only take values then we can retain a simple model for symbol lookup using a single stack. However, if we can provide expressions or even statements, then these need to not only resolve symbols in the original scope but also possibly introduce them. Pre-checking expressions does not work well with compile time evaluation, since the result may change with every evaluation.

But maybe there is some way to salvage the model? We'll look at that next.

Christoffer Lernö

The problem: implementing lexical scope in a programming language.

Here is an example.

fn void test(int y)
{
  int x = 123;
  for (int i = 1; i < y; i++)
  {
    int z = foo(i + x); // The first z
    if (z > 100) x++;    
  }
  // The second z
  int z = x + 1; 
  // z += i; <- Error since i is not in scope 
}    

Formulated like this, the problem is fairly simple. One solution is like this:

// A Scope records where its locals start and end in the locals array.
Decl *locals[MAX_LOCALS];
Scope scopes[MAX_SCOPES];
Scope *next_scope = &scopes[0];
Decl **next_local = &locals[0];

void push_scope()
{
  *next_scope = (Scope) { .local_start = next_local, .local_end = next_local };
  // TODO Check we didn't reach max
  next_scope++;
}

void pop_scope()
{
  // Restore the locals to where this scope started.
  next_scope--;
  next_local = next_scope->local_start;
}

void push_local(Decl *decl)
{
  // TODO check we didn't reach max
  *next_local = decl;
  next_local++;
  next_scope[-1].local_end = next_local;
}

This solution isn't the simplest we could do, but knowing both the beginning and the end of the locals has some advantages. To do a lookup we do a linear search back from the last local:

Decl *find_decl(const char *interned_name)
{
  Decl **current = next_local;
  while (current != &locals[0])
  {
    current--;
    if ((*current)->name == interned_name) return *current;
  }
  return NULL;
}

This might not seem super fast, but variables are often used near their definition, and the number of locals is usually fairly limited anyway.

Roughly this is what happens in this algorithm as it goes through the code above:

1. Push function scope

scope: [{ 0, 0 }]
decl: []

2. The param is added

scope: [{ 0, 1 }]
decl: [int y]

3. Push int x

scope: [{ 0, 2 }]
decl: [int y, int x]

4. Push "for" scope

scope: [{ 0, 2 }, { 2, 2 }]
decl: [int y, int x]

5. Push int i

scope: [{ 0, 2 }, { 2, 3 }]
decl: [int y, int x, int i]

6. Push "for body" scope

scope: [{ 0, 2 }, { 2, 3 }, { 3, 3 }]
decl: [int y, int x, int i]

7. Push int z

scope: [{ 0, 2 }, { 2, 3 }, { 3, 4 }]
decl: [int y, int x, int i, int z]

8. Pop "for body" scope

scope: [{ 0, 2 }, { 2, 3 }]
decl: [int y, int x, int i]

9. Pop "for" scope

scope: [{ 0, 2 }]
decl: [int y, int x]

10. Push int z

scope: [{ 0, 3 }]
decl: [int y, int x, int z]

11. Pop function scope

scope: []
decl: []

It's fairly straightforward to see how, after defining z, we can look up x and i in step 7.

This model is simple and easy to understand – but we skipped something, didn't we? Yes, the lookup of the function foo. In this scheme, anything not found in the locals will first be looked up in the current file, then in the module and finally in any imports. (This is not how C works, but that is because in C everything ends up in a single flat file, so we just "look through" the file.)

Macros

So far so good, and if this was all we had, the story would end here. But there's an added complexity. What if we have macros?

Here is a simple example:

macro int testmacro(int x)
{
  int z = 2;
  for (int i = 0; i < x; i++)
  {
    z *= 2;
  }
  return z;
}     

fn int test(int y)
{
  int z = getValue(y);
  int x = @testmacro(z);
  return z + x;
}  

Assuming we want hygienic macros we cannot simply copy the macro into the function, names and all:

// Broken version
fn int test(int y)
{
  int z = getValue(y);
  int x = z;
  int z = 2;
  for (int i = 0; i < x; i++)
  {
    z *= 2;
  }
  int _macro_result = z;
  int x = _macro_result;
  return z + x;
}  

In C3 there is an expression block, working similarly to a statement expression in GCC C. If this were used together with allowing shadowing, we'd get something that halfway worked:

// Works but...
fn int test(int y)
{
  int z = getValue(y);
  int x = {|
    int x = z;
    int z = 2;
    for (int i = 0; i < x; i++)
    {
      z *= 2;
    }
    return z;
  |};
  return z + x;
}  

This seems to work, but what if we wrote this incorrect macro:

macro testmacro(int x)
{
  return y * x;
}  

If we fold that:

fn int test(int y)
{
  int z = getValue(y);
  int x = {|
    int x = z;
    return y * x;
  |};
  return z + x;
}

So now y is leaking into the macro, which is implicitly capturing y. This is not what we want (and as we'll see later this is just the tip of the iceberg).

Next I'll show a way to extend our simple scope system to make sure that the macros are indeed hygienic, and show other things that break.