Thursday, February 20, 2014

Multithreaded Java - Screencast on the synchronized keyword

synchronized is quite well known in the Java community. Due to its early implementation, which had a significant runtime overhead, it has quite a bad image. In modern JVMs this isn't the case anymore - still, there's something to look out for.
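For a quick refresher before watching, here is a minimal sketch (not taken from the screencast) of what the keyword does: only one thread at a time may execute the guarded code on the same instance, and the lock also establishes the required memory visibility.

public class Counter {
    private int value = 0;

    // Only one thread at a time may execute this method on the same instance;
    // entering and leaving the monitor also makes the change visible to other threads.
    public synchronized void increment() {
        value++;
    }

    public synchronized int get() {
        return value;
    }
}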

Watch the screencast to learn more:


Multithreaded Java - Screencast on the volatile keyword

volatile is probably one of the least known keywords in Java. Still, it serves an important purpose - and not knowing about it might ruin your day...
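For a quick refresher before watching, here is a minimal sketch (not taken from the screencast) of the classic use case: a stop flag which must be declared volatile so that the writing thread's update becomes visible to the reading thread.

public class Worker implements Runnable {

    // Without volatile, the worker thread might never see the update made by
    // another thread and could loop forever.
    private volatile boolean running = true;

    @Override
    public void run() {
        while (running) {
            // do some work...
        }
    }

    public void stop() {
        running = false; // guaranteed to become visible to the worker thread
    }
}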

Watch this screencast to learn more:


Monday, February 3, 2014

Version Numbering Scheme - Yet another approach


Version numbering schemes are probably one of the few things of which we software engineers have more than sorting algorithms. However, there's always room for one more.

While the classic approach of MAJOR.MINOR.PATCH (e.g. 1.8.2) works quite well for libraries or products which are distributed broadly, it is still not as easy as it seems. What is a major change? What is a minor one? What comes after 1.9 - 2.0 or 1.10? There are tons of examples where this classic approach fails, Java being one of the most prominent.

On the other hand, this approach is almost perfectly suited for libraries, as the rules are quite obvious here:
  • increment minor version for every release (2.4 -> 2.5)
  • increment major version when a backward incompatible change was made (2.4 -> 3.0)
  • increment the patch level for each update which only fixes bugs but doesn't add functionality (2.4 -> 2.4.1)
However, for software which runs in the cloud or is only delivered to a limited number of customers, the distinction is not always this clear. As we do not distinguish between minor and major updates (ask our sales guys - each release is a major step forward), we ended up using the build numbers of our Jenkins build server as version numbers.

Although this approach works quite well, there are two problems with it:
  1. You need a build server which issues consecutive build numbers
  2. Without looking at the build server, you cannot tell the age of a release (How much older is BUILD-51 compared to BUILD-52?)
Therefore we have now started to switch to another approach for our SIRIUS based products: inspired by the date codes placed on ICs, we started to use the same codes for our releases. A date code consists of four digits, the first two being the year and the second two being the week number. So this blog post would have 1406 as its version.

As we don't perform more than one release per week, a version number is always unique. Furthermore, these numbers are quite short and easy to remember (compared to full dates like foo-20130527). Still, they provide rough information about the release date.
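As a minimal sketch of how such a code can be computed (assuming the java.time API and ISO week numbering - adjust the week definition to whatever your calendar actually uses):

import java.time.LocalDate;
import java.time.temporal.WeekFields;

public class DateCodeVersion {

    // Builds a four digit date code: two digits for the year, two for the week number
    public static String dateCode(LocalDate date) {
        int year = date.getYear() % 100;
        int week = date.get(WeekFields.ISO.weekOfWeekBasedYear());
        // Note: around New Year the ISO week may belong to the previous or next year,
        // so releases in that window deserve a quick manual sanity check
        return String.format("%02d%02d", year, week);
    }

    public static void main(String[] args) {
        System.out.println(dateCode(LocalDate.of(2014, 2, 3))); // prints "1406"
    }
}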

Now, as I said, this scheme is not superior to others. It's just a good solution for our problem. Use it if you like it, ignore it otherwise ;-)

Tuesday, January 7, 2014

Making HTTP content compression work in netty 4

Netty is really a great framework providing all the things needed to build a high performance HTTP server. The nice thing is that nearly everything comes out of the box and just has to be put together in the right way. And content compression (gzip or deflate) is no exception. However, when it comes to compressing static content, I stumbled quite a few times before everything worked as expected:

Update: First of all, widely used tools like wget use HTTP 1.0 and not HTTP 1.1 - therefore we cannot always deliver a chunked response (we have to live with disabling compression then). Also note that the netty guys have meanwhile had pretty much the same idea: HttpChunkedInput. The problem with HTTP 1.0 or non-compressible responses (see SmartContentCompressor below) however remains...

Based on the http/file example provided by netty, I used the following approach to serve static files (same as used in netty 3.6.6):

RandomAccessFile raf = new RandomAccessFile(file, "r");

// Write the response headers first...
HttpResponse response = new DefaultHttpResponse(HTTP_1_1, OK);
ctx.write(response);

// ...then stream the file content, either via zero-copy or as chunks
if (useSendFile) {
    ctx.write(new DefaultFileRegion(raf.getChannel(), 0, fileLength));
} else {
    ctx.write(new ChunkedFile(raf, 0, fileLength, 8192));
}
However, as soon as I added an HttpContentCompressor to the pipeline, Firefox failed with a message like "invalid content encoding".
As it turns out, the HttpContentCompressor expects HttpContent objects as input chunks to be compressed. However, the ChunkedWriteHandler sent ByteBufs directly downstream. Sending a FileRegion (useSendFile=true) also left the content compressor unimpressed.
To overcome this problem I created a class named ChunkedInputAdapter which takes a ChunkedInput&lt;ByteBuf&gt; and exposes it as a ChunkedInput&lt;HttpContent&gt;. However, two things still weren't satisfying: first, FileRegions and the zero-copy capability still couldn't be used, and second, already compressed files like JPEGs would be compressed again. Therefore I subclassed HttpContentCompressor with a class called SmartContentCompressor. This class checks whether a "Content-Encoding: Identity" header, a specific content type, or a content length of less than 1 kB is present. In these cases content compression is bypassed.
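To give an idea of what such a check can look like, here is a minimal sketch (not the actual SIRIUS source) built on Netty 4.0's protected HttpContentEncoder.beginEncode hook, where - at least in the versions I looked at - returning null skips encoding; verify this against the Netty version you use:

import io.netty.handler.codec.http.HttpContentCompressor;
import io.netty.handler.codec.http.HttpHeaders;
import io.netty.handler.codec.http.HttpResponse;

public class SmartContentCompressorSketch extends HttpContentCompressor {

    @Override
    protected Result beginEncode(HttpResponse response, String acceptEncoding) throws Exception {
        String contentEncoding = response.headers().get(HttpHeaders.Names.CONTENT_ENCODING);
        String contentType = response.headers().get(HttpHeaders.Names.CONTENT_TYPE);
        long contentLength = HttpHeaders.getContentLength(response, -1);

        // An explicit "Content-Encoding: identity" set by the handler disables compression
        if (HttpHeaders.Values.IDENTITY.equalsIgnoreCase(contentEncoding)) {
            return null;
        }
        // Already compressed formats gain nothing from a second pass
        if (contentType != null
                && (contentType.startsWith("image/jpeg") || contentType.startsWith("application/zip"))) {
            return null;
        }
        // Tiny responses are not worth the compression overhead
        if (contentLength >= 0 && contentLength < 1024) {
            return null;
        }
        return super.beginEncode(response, acceptEncoding);
    }
}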

Using this combination permits using both: content compression when it is useful, and the zero-copy capability when the file is already compressed.
All the sources mentioned above are open sourced under the MIT license and part of the SIRIUS framework.

Wednesday, December 25, 2013

Getting a simple CDC demo (serial port via USB) working with a PIC32 (PIC32MX575)

This is just a quick write-up to save others some debugging time. The adventure is called: get the "cdc_serial_emulator" example supplied with Microchip Harmony (/harmony/v0_70b/apps/usb/device/cdc_serial_emulator) up and running on my custom PIC32 board.

Once one understands the inner workings of the firmware, the first steps are quite simple. First of all I modified the code which assumed we would run on an Explorer board, by commenting out everything in bsp_sys_init.c that wouldn't compile or wasn't needed.

Now that I was able to compile the project, I was ready to enter USB descriptor hell. And believe it or not, my Windows 7 machine was the most helpful tool for my first steps. Every time I plugged my board in, those annoying "USB device connected" sounds let me know whether the communication worked or not. Of course, it didn't work by default. The reason is the clock frequency of the PIC. The USB module has an internal PLL which generates the required 48 MHz. This PLL is fed by the main oscillator and therefore everything needs to be set up correctly. I used (for no special reason) a 20 MHz crystal instead of the 8 MHz crystal assumed by the example.

Therefore I needed to tweak the config settings found in system_init.c: 
OSCO Pin (OSCIOFNC)                           = Enabled
Primary Oscillator Configuration (POSCMOD)    = External (High Speed)
Secondary Oscillator Enable (FSOSCEN)         = Disabled
Oscillator Selection Bits (FNOSC)             = Primary oscillator with PLL

#pragma config OSCIOFNC = ON, POSCMOD = HS, FSOSCEN = OFF, FNOSC = PRIPLL

PLL Input Divider (FPLLIDIV)                  = Divide by 5    (20 MHz / 5 = 4 MHz)
PLL Multiplier (FPLLMUL)                      = Multiply by 20 (4 MHz * 20 = 80 MHz)
System PLL Output Clock Divider (FPLLODIV)    = Divide by 1    (80 MHz / 1 = 80 MHz system clock)
Peripheral Clock Divisor (FPBDIV)             = Divide by 1    (80 MHz / 1 = 80 MHz peripheral clock)
Watchdog Timer Enable (FWDTEN)                = Disabled
Clock Switching and Monitor Selection (FCKSM) = Clock Switching Enabled, Fail-Safe Clock Monitoring Enabled

#pragma config FPLLIDIV = DIV_5, FPLLMUL = MUL_20, FPLLODIV = DIV_1
#pragma config FWDTEN = OFF, FCKSM = CSECME, FPBDIV = DIV_1

Enable PLL for USB clock generation 

#pragma config UPLLEN   = ON

Divide the external input clock by 5 before it is fed into the USB PLL: 20 MHz / 5 = 4 MHz, which is then multiplied by 24 and divided by 2 (see Reference Manual page 27-3), resulting in the desired 48 MHz USB reference clock.

#pragma config UPLLIDIV = DIV_5


As this only took three evenings to figure out, it can be considered the easy part... The problem was that neither Linux nor OSX would recognize the device as a serial port (the device itself was found). Therefore it was clear that the timing was OK, but the USB descriptor wasn't.

After days of debugging (mostly on OSX using USBProbe) I switched to Linux and tried "lsusb -v". And there you go: the CDC descriptor provided by Microchip was/is wrong! lsusb said something like "INVALID CDC(UNION) 0x04 0x24 0x06 0x00". Looking at the descriptor, this was really the input in system_config.c:


    // Size of the descriptor
    sizeof(USB_CDC_UNION_FUNCTIONAL_DESCRIPTOR_HEADER),
    // CS_INTERFACE
    USB_CDC_DESC_CS_INTERFACE,
    // Type of functional descriptor
    USB_CDC_FUNCTIONAL_UNION,
    //com interface number
    0,


Compared to any other CDC device I tried, this was missing one byte: the record (descriptor) has to look like LENGTH, TYPE, SUBTYPE, MASTER_INTERFACE, SLAVE_INTERFACE. Therefore I changed the code to:

0x05, // Size (5 bytes)
0x24, // DescriptorType: CS_INTERFACE
0x06, // DescriptorSubtype: Union Functional Descriptor
0x00, // MasterInterface
0x01, // SlaveInterface0


Of course, that didn't work out of the box - since I added one byte to the descriptor, I also had to fix the total length at the beginning of the configuration descriptor:

/* Configuration 1 Descriptor */
const uint8_t configDescriptor1[]={
   
    /* Configuration Descriptor */
    //sizeof(USB_CFG_DSC),    // Size of this descriptor in bytes
    0x09,
    // CONFIGURATION descriptor type
    USB_DESCRIPTOR_CONFIGURATION,
    // Total length of data for this cfg
    67,0, // This was 66 originally


Using this, both my Mac and my Linux machine now recognize the device and supply a tty for it :-)

Saturday, December 21, 2013

How to write one of the fastest expression evaluators in Java


Granted, the title is a bit of an attention grabber, but nevertheless true (of course you never trust a benchmark you didn't fake yourself - but that's another story).

So last week I was looking for a small and usable library to evaluate mathematical expressions. I almost directly stumbled upon this stackoverflow post. The recommended library (Expr) is really quite fast and had almost everything I needed. However, what it didn't provide was the ability to limit the scope of variables (everything is in one global namespace within the VM).

Therefore I did what one normally shouldn't do: I reinvented the wheel and wrote my own parser / evaluator. It was a rainy Saturday anyway, so I thought a small recursive descent parser, an AST which simplifies and eventually computes expressions, along with a little helper for managing variables, wouldn't be a big deal. And it wasn't. I had an initial implementation up and running quite fast. Once I had some tests giving me confidence that it computed everything the right way, I wanted to know how fast the evaluator was compared to the other libraries mentioned in the original post. Having not hand-optimized every inner loop, I didn't have high expectations - some of the libraries are commercial ones after all. So I was quite surprised when I looked at the results. The list below shows a micro benchmark which evaluates the same expression using the respective library. The measurements for parsii, which is my library, were done using the final version, which performs some simplifications, like pre-evaluating constant expressions. However, no "black magic" like bytecode generation or anything in that league is done.

For a performance measurement the expression "2 + (7 - 5) * 3.14159 * x^(12-10) + sin(-3.141)" was evaluated with x running from 0 to 1000000. This was done 10 times to warm the JIT and then 15 times again of which the average execution time was taken:
  • PARSII:      28.3 ms
  • EXPR:        37.2 ms
  • MathEval:  7748.5 ms
  • JEP:        647.0 ms
  • MESP:       220.8 ms
  • JFEP:       274.3 ms
Now I'm sure each of these libraries has its own strengths, so they can't be compared directly. Still, it's amazing to see that a simple implementation can compete quite well.

For those of you who are not too deep into compiler construction, here's a small outline of how it works:

Like any parser or compiler, parsii uses the classic approach of having a tokenizer which converts a stream of characters into a stream of tokens. Therefore "4 + 3 *8", which is '4', ' ', '+', ' ', '3', ' ', '*', '8' as a character array, will be converted into:
  •  4 (INTEGER)
  • + (SYMBOL)
  • 3 (INTEGER)
  • * (SYMBOL)
  • 8 (INTEGER)
The tokenizer takes a look at the current character, then decides what kind of token it is looking at, and then reads all characters which belong to that token. Each token has a type, textual content and knows the position (line and character) where it started. A lot of in-depth tutorials are available on the net, so I won't go into any details here. You can take a look at the source code, but as I said, it is just a basic naive implementation.
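Here is a minimal, self-contained sketch of that loop (illustrative only, not the parsii tokenizer, which also handles decimals, identifiers and line/column positions):

import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizer {

    public enum TokenType { INTEGER, SYMBOL }

    public static class Token {
        public final TokenType type;
        public final String text;
        public final int position;

        Token(TokenType type, String text, int position) {
            this.type = type;
            this.text = text;
            this.position = position;
        }

        @Override
        public String toString() {
            return text + " (" + type + ")";
        }
    }

    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<Token>();
        int pos = 0;
        while (pos < input.length()) {
            char current = input.charAt(pos);
            if (Character.isWhitespace(current)) {
                pos++; // skip whitespace
            } else if (Character.isDigit(current)) {
                int start = pos;
                // read all digits belonging to this number token
                while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
                    pos++;
                }
                tokens.add(new Token(TokenType.INTEGER, input.substring(start, pos), start));
            } else {
                // everything else is treated as a one-character symbol here
                tokens.add(new Token(TokenType.SYMBOL, String.valueOf(current), pos));
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: [4 (INTEGER), + (SYMBOL), 3 (INTEGER), * (SYMBOL), 8 (INTEGER)]
        System.out.println(tokenize("4 + 3 *8"));
    }
}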

The parser, which translates the given stream of tokens into an AST (Abstract Syntax Tree) which can then be evaluated, is a classic recursive descent parser. This is one of the simplest ways to build a parser, as it is completely written by hand and not generated by a tool. A parser like this basically contains a method for every syntax rule.
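As an illustration of the "one method per rule" idea, here is a tiny hand-written recursive descent parser (again just a sketch, not parsii: it skips tokenization and evaluates directly instead of building an AST):

public class SimpleParser {
    private final String input;
    private int pos;

    public SimpleParser(String input) {
        this.input = input.replace(" ", "");
    }

    // expression ::= term (('+' | '-') term)*
    public double expression() {
        double result = term();
        while (pos < input.length() && (input.charAt(pos) == '+' || input.charAt(pos) == '-')) {
            char op = input.charAt(pos++);
            double right = term();
            result = (op == '+') ? result + right : result - right;
        }
        return result;
    }

    // term ::= factor (('*' | '/') factor)*
    private double term() {
        double result = factor();
        while (pos < input.length() && (input.charAt(pos) == '*' || input.charAt(pos) == '/')) {
            char op = input.charAt(pos++);
            double right = factor();
            result = (op == '*') ? result * right : result / right;
        }
        return result;
    }

    // factor ::= number | '(' expression ')'
    private double factor() {
        if (input.charAt(pos) == '(') {
            pos++; // consume '('
            double result = expression();
            pos++; // consume ')'
            return result;
        }
        int start = pos;
        while (pos < input.length() && (Character.isDigit(input.charAt(pos)) || input.charAt(pos) == '.')) {
            pos++;
        }
        return Double.parseDouble(input.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(new SimpleParser("4 + 3 * 8").expression()); // 28.0
    }
}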

Again, a lot of tutorials for this kind of parser are available. However, what most examples leave out is proper error handling. Next to parsing an expression correctly and fast, good error handling is one of the central aspects of a good parser. And it's not that hard: as you can see in the source code, the parser never throws an exception while parsing the expression. All errors are collected and the parser continues as long as possible. Even though the resulting AST cannot be evaluated correctly after the first error, it is important to go on, as we can and should report as many errors as possible in one run. The same approach is used for the tokenizer, which reports malformed tokens, like decimal numbers with two decimal separators, to the same list of errors.

Evaluating an AST which is the result of a parsed expression is quite easy. Each node of the syntax tree has an evaluate method which will be called by its parent node, starting from the root node. The result of eval at the root node is the result of evaluating the whole expression. A basic example of this approach can be found in BinaryOperation, which represents operations like +, -, * and so on.
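The idea in a nutshell (a simplified sketch, not the actual parsii classes):

public class AstDemo {

    interface Expression {
        double eval();
    }

    static class Constant implements Expression {
        private final double value;
        Constant(double value) { this.value = value; }
        public double eval() { return value; }
    }

    static class BinaryOperation implements Expression {
        private final char op;
        private final Expression left;
        private final Expression right;

        BinaryOperation(char op, Expression left, Expression right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }

        public double eval() {
            double a = left.eval();  // evaluation starts at the root and
            double b = right.eval(); // propagates down to the leaves
            switch (op) {
                case '+': return a + b;
                case '-': return a - b;
                case '*': return a * b;
                default:  return a / b;
            }
        }
    }

    public static void main(String[] args) {
        // 4 + 3 * 8 as a tree: +(4, *(3, 8))
        Expression expr = new BinaryOperation('+',
                new Constant(4),
                new BinaryOperation('*', new Constant(3), new Constant(8)));
        System.out.println(expr.eval()); // 28.0
    }
}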

In order to improve evaluation time a bit, three optimizations are performed:

First, after parsing, the AST is reduced by calling a method named simplify on the root node, which propagates to each child node. Each node then decides whether a simpler representation of its own sub-expression can be found. As an example: for binary operations, we check if both operands are constants (numbers). In that case, we evaluate the expression and return a new constant containing the result of the operation. The same is done for functions where all parameters are constant.
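A stripped-down sketch of this constant folding (again not the parsii source, which handles all operators and functions):

public class SimplifyDemo {

    interface Expression {
        double eval();
        Expression simplify();
    }

    static class Constant implements Expression {
        final double value;
        Constant(double value) { this.value = value; }
        public double eval() { return value; }
        public Expression simplify() { return this; }
    }

    static class Add implements Expression {
        final Expression left;
        final Expression right;
        Add(Expression left, Expression right) { this.left = left; this.right = right; }

        public double eval() { return left.eval() + right.eval(); }

        public Expression simplify() {
            Expression l = left.simplify();  // simplification propagates to the children first
            Expression r = right.simplify();
            if (l instanceof Constant && r instanceof Constant) {
                return new Constant(l.eval() + r.eval()); // fold "constant + constant" into one constant
            }
            return new Add(l, r);
        }
    }

    public static void main(String[] args) {
        Expression expr = new Add(new Constant(7), new Add(new Constant(2), new Constant(3)));
        System.out.println(expr.simplify().eval()); // 12.0, computed on a pre-folded tree
    }
}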

The second optimization concerns the use of variables in expressions. The naive approach here is to use a map and read or write the values of variables when needed. While this certainly works, a lot of lookups will be performed. Therefore we have a special class called Variable which contains the name and the numeric value of the variable. When an expression is parsed, the variable is looked up once in the scope (which is basically just a map) and then reused from then on. As each lookup returns the same instance, variable access when evaluating expressions is as cheap as a field read or write, as we just access the value field of Variable.
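Conceptually it looks roughly like this (a sketch with names close to, but not identical with, the parsii classes):

import java.util.HashMap;
import java.util.Map;

public class VariableDemo {

    static class Variable {
        final String name;
        double value; // reading or writing this is just a field access

        Variable(String name) { this.name = name; }
    }

    static class Scope {
        private final Map<String, Variable> variables = new HashMap<String, Variable>();

        Variable getVariable(String name) {
            Variable result = variables.get(name);
            if (result == null) {
                result = new Variable(name);
                variables.put(name, result); // one map lookup at parse time, none at evaluation time
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Scope scope = new Scope();
        Variable x = scope.getVariable("x"); // resolved once while parsing
        for (int i = 0; i <= 3; i++) {
            x.value = i;                     // no map access in the evaluation loop
            System.out.println("x^2 = " + x.value * x.value);
        }
    }
}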

The third and last optimization probably won't come into play that often, but as it was simple to realize, it was implemented anyway. It goes by the name "lazy evaluation" and is used when calling functions. A function does not automatically evaluate all its arguments and then perform the call itself. Rather, it looks at the arguments and can decide by itself which arguments to evaluate and which not. An example where this is used can be found in the if function.
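A minimal sketch of the idea (not the actual parsii function interface): the "if" function receives the argument expressions themselves and only evaluates the branch it actually needs.

import java.util.Arrays;
import java.util.List;

public class LazyIfDemo {

    interface Expression {
        double eval();
    }

    static double lazyIf(List<Expression> args) {
        // args.get(0) is the condition, args.get(1) the then-branch, args.get(2) the else-branch
        if (args.get(0).eval() != 0) {
            return args.get(1).eval(); // the else-branch is never evaluated
        }
        return args.get(2).eval();
    }

    public static void main(String[] args) {
        Expression condition = new Expression() { public double eval() { return 1; } };
        Expression thenBranch = new Expression() { public double eval() { return 42; } };
        Expression elseBranch = new Expression() {
            public double eval() { throw new ArithmeticException("expensive branch, never reached"); }
        };
        System.out.println(lazyIf(Arrays.asList(condition, thenBranch, elseBranch))); // 42.0
    }
}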

parsii is licensed under the MIT license. All sources can be found on GitHub along with a pre-compiled jar.

(Speedy Gonzales Image-(c) by http://benztownbranding.wordpress.com/2011/11/29/become-speedy-gonzales-or-the-short-cuts-imperium/)

Thursday, September 19, 2013

How to kill Java with a Regular Expression

We recently stumbled upon a phenomenon we absolutely weren't aware of: you can kill any Java IDE and also any Java process with a simple regular expression...

Back in university, I was taught that regular expressions, which correspond to regular grammars or type-3 grammars, always end up in a finite state automaton and can therefore be processed in linear time (input length doubles, processing time doubles). However, that's only true for "sane" expressions. A regular expression can also result in a non-deterministic finite state automaton, and then things can get messed up quite badly.

Consider the expression (0*)*A. This will match any number of zeros, followed by an upper case A. Now if you use Matcher.find() for this expression, everything is fine as long as there is a match in the input. However, if you call it with "00000000000000000000" as input, your program will hang (and so will the regex console in Eclipse or IntelliJ and every (Java-based) online regex service).

What at first glance looks like an infinite loop turns out to be catastrophic backtracking. What this basically means is that the matcher detects that no A was found at the end of the input. Now the outer quantifier goes one step back, the inner one forward - and again, no result. Therefore the matcher goes back step by step, retrying all combinations to find a match. It will eventually return (without a match), but the complexity (and therefore the runtime) of this is exponential: adding one character to the input doubles the runtime. A detailed description can be found here: catastrophic backtracking

Here are some runtimes I measured (which almost exactly double for each character added):

0000000000: 0.1ms
00000000000: 0.2ms
000000000000: 0.7ms
0000000000000: 1.3ms
00000000000000: 1.7ms
000000000000000: 3.5ms
0000000000000000: 7.2ms
00000000000000000: 13.9ms
000000000000000000: 27.5ms
0000000000000000000: 55.5ms
00000000000000000000: 113.0ms
000000000000000000000: 226.4ms
0000000000000000000000: 439.1ms
00000000000000000000000: 886.0ms


As a little side note: for micro benchmarks like this, you always need to "warm up" the JVM, as the HotSpot JIT will jump in at some point and optimize the code. Therefore the first run looks like this:

0000000000: 6.8ms
00000000000: 11.8ms
000000000000: 25.5ms
0000000000000: 39.5ms
00000000000000: 6.3ms   <- JIT jumped in and started to translate
000000000000000: 5.4ms     to native code.
0000000000000000: 7.1ms

00000000000000000: 14.2ms
000000000000000000: 26.8ms
0000000000000000000: 54.4ms
00000000000000000000: 109.6ms
000000000000000000000: 222.1ms
0000000000000000000000: 439.2ms
00000000000000000000000: 885.6ms


So what's the take-away here? If you're running a server application or anything critical used by many users, don't let them enter regular expressions unless you really trust them. There are regex implementations out there which detect this problem and abort, but Java (up to JDK 8) doesn't.
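If you have to accept user-supplied patterns anyway, one common workaround (not built into the JDK) is to wrap the input in a CharSequence that enforces a deadline, since the matcher reads the input via charAt() while backtracking:

import java.util.regex.Pattern;

public class TimeBoxedRegex {

    // CharSequence wrapper which aborts a runaway match once the deadline has passed
    static class TimeLimitedCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadline;

        TimeLimitedCharSequence(CharSequence inner, long timeoutMillis) {
            this.inner = inner;
            this.deadline = System.currentTimeMillis() + timeoutMillis;
        }

        public char charAt(int index) {
            if (System.currentTimeMillis() > deadline) {
                throw new RuntimeException("Regex evaluation timed out");
            }
            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeLimitedCharSequence(inner.subSequence(start, end),
                    deadline - System.currentTimeMillis());
        }
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(0*)*A");
        try {
            pattern.matcher(new TimeLimitedCharSequence("00000000000000000000000000000", 100)).find();
        } catch (RuntimeException e) {
            System.out.println("Aborted: " + e.getMessage()); // instead of hanging for hours
        }
    }
}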

Note: You can test this in your local IDE or with a small Java program to your heart's content - but please don't start to knock out all the regex tester websites out there. Those guys provide a nice tool free of charge, so it would be quite unfair...

Here is the tiny benchmark I used:

import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        // Two identical passes: the first warms up the JIT, the second yields stable numbers
        for (int runs = 0; runs < 2; runs++) {
            Pattern pattern = Pattern.compile("(0*)*A");
            // Run with inputs of 5 to 24 characters
            for (int length = 5; length < 25; length++) {
                // Build an input of the specified length
                String input = "";
                for (int i = 0; i < length; i++) {
                    input += "0";
                }

                // Measure the average duration of two calls...
                long start = System.nanoTime();
                for (int i = 0; i < 2; i++) {
                    pattern.matcher(input).find();
                }
                System.out.println(input + ": "
                        + ((System.nanoTime() - start) / 2000000d)
                        + "ms");
            }
        }
    }
}