Re: Raw MIME API proposal

Re: Raw MIME API proposal

Joshua Cranmer-2
On 7/31/2011 2:08 PM, Andrew Sutherland wrote:

> On 07/31/2011 12:14 PM, Joshua Cranmer wrote:
>> Well, this is my knowledge of the current state of multithreading in JS:
>> 1. JS XPCOM components cannot be accessed off the main thread
>> 2. If you want to use multiple threads, you have to use workers
>> 3. In worker threads, you can only access interfaces that are explicitly
>> marked as threadsafe via nsIClassInfo
>> 4. Necko interfaces claim to only work on the main thread
>>
>> In other words, if it were implemented in JS, to my knowledge, you could
>> only access the MIME tree via the main thread. I don't particularly care
>> about parsing and consuming on different threads, but I do want to make
>> sure that the MIME is not inaccessible to other threads.
>
> Yes, if exposed exclusively via XPConnect to C++ code, that could be
> troublesome.  But it's not that hard to spin up a JS runtime on
> another thread and expose a lightweight C++ wrapper class that is
> based on using the JS API to traverse the JS MIME object hierarchy.

I coded up a basic prototype that parses an RFC 822 envelope into
headers and body and ran it on a MIME torture test I have lying around
(I don't recall the original place I got it from, but I'm pretty sure it
was related to the IMAP unofficial mailing list). My first-cut
implementation was a standard, simple recursive descent parser which
requires the entire message to be in the buffer (i.e., no progressive
parsing).

I then tried an approach that is a partial hack around the proposed
yield* operator while still trying to keep the code looking clean... The
result was a regression of "slow enough that my computer died", or at
least a 20x regression.

I have a few ideas for alternative implementations:
1. Turn it into a more classical LR-ish generator. Non-recursive
solutions are more amenable to a generator implementation but they are
less readable, less maintainable, and much more likely to drive the
trace engine bonkers (giant switch statements kill our tracers, the last
I heard).
2. Try some sort of JS hack that allows us to "pause" the execution and
then "resume" it when we get more data. This would imply that we need a
C++ driver (unless you can guarantee the complete buffer of data or are
willing to synchronously block), is possibly impossible, and would in
any case require some hacks when it comes to JS.
2a. An alternative approach is to cause the native bufferer to
synchronously download the data by spinning the event loop until we get
the notification we need. This implies, probably, that we need to do
some more runnable proxies to forestall any unexpected reentrancy issues.
3. Give up and write it in C++.
4. Give up and require you to hold a buffer of the entire contents in
memory.

I think 2a is the best option at this point, although I'm nervous about
potential catastrophic failures.
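For what it's worth, option 2's pause/resume idea can be sketched directly with the function*/yield generator syntax that was later standardized in ES6 (not the JS 1.7 generators we have today): the parser yields whenever it needs another line, and a driver resumes it as data arrives. All names here are invented for illustration; only flat header parsing with folded continuations is shown.

```javascript
// Sketch of option 2: a parser that "pauses" when it runs out of data and
// "resumes" when the driver feeds it more. The generator yields to wait for
// the next line; next(line) resumes it, and next(null) signals EOF.
function* headerParser(onHeader) {
  let pending = "";
  for (;;) {
    const line = yield;            // pause until the driver supplies a line
    if (line === null) {           // EOF: flush whatever we buffered
      if (pending) onHeader(pending);
      return;
    }
    if (pending && !/^[ \t]/.test(line)) {
      onHeader(pending);           // previous header is complete
      pending = line;
    } else {
      pending += line;             // folded continuation (or first line)
    }
  }
}

// Hypothetical driver: feeds lines as they arrive, e.g. from a stream.
const headers = [];
const parser = headerParser(h => headers.push(h));
parser.next();                     // prime the generator to its first yield
["Subject: MIME torture", " test", "From: example"].forEach(l => parser.next(l));
parser.next(null);
```

The driver here is synchronous, but nothing changes if the forEach is replaced by an asynchronous data callback; that is the whole point of the pause/resume shape.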
_______________________________________________
dev-apps-thunderbird mailing list
[hidden email]
https://lists.mozilla.org/listinfo/dev-apps-thunderbird

Re: Raw MIME API proposal

Andrew Sutherland-3
On 08/05/2011 09:40 PM, Joshua Cranmer wrote:
> I have a few ideas for alternative implementations:

What about something more along the lines of the existing libmime
processing structure, but avoiding integrating display logic into the
parsing logic?

MIME parsing is reasonably straightforward until we get to the nesting.
You have a simple state machine that needs to deal with line-oriented
data, handling headers and then switching to body parts.  The complexity
comes from nesting and the fact that the parent containers are
responsible for detecting the end of the child and potentially need to
deal with encoding/decoding/decryption, etc.
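The header/body switch described above can be sketched as a tiny state machine; names are invented for illustration, and the nesting/boundary handling that makes real MIME hard is deliberately omitted:

```javascript
// Minimal line-oriented state machine: consume header lines until the blank
// separator line, then treat everything that follows as body.
function parseFlatMessage(lines) {
  var HEADERS = 0, BODY = 1;
  var state = HEADERS;
  var headers = [], body = [];
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    if (state === HEADERS) {
      if (line === "")          // blank line ends the header block
        state = BODY;
      else
        headers.push(line);
    } else {
      body.push(line);
    }
  }
  return { headers: headers, body: body };
}
```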

libmime's structure of stream processing handles this quite well, it's
just obfuscated somewhat by the custom class system, a lot of verbose
low-level C book-keeping operations, and all that display
logic/streaming output logic.

The one improvement on libmime that jumps out at me: rather than having
parent containers either pass data through directly to children via
calls or play games with the sorta-virtual-function-table, consider
having an explicit driver class that has a stack of active parts, the
active delimiters that should indicate the end of one of the current
parts, and the active parse logic. My hope is that this would simplify
the mental model of the implementation, as well as make error handling
and debugging easier.
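A minimal sketch of that driver idea, with invented names: the driver owns a stack of open parts and the boundary delimiters that close them, so parent parts never forward data themselves.

```javascript
// Hypothetical explicit-driver sketch. Each part object supplies onLine()
// and onEnd(); the driver routes every line either to the innermost open
// part or to boundary handling.
function MimeDriver(rootPart) {
  this.partStack = [rootPart];   // innermost part is last
  this.delimStack = [];          // boundary closing each nesting level
}

MimeDriver.prototype.pushPart = function (part, boundary) {
  this.partStack.push(part);
  this.delimStack.push("--" + boundary);
};

MimeDriver.prototype.deliverLine = function (line) {
  // Check boundaries from innermost outward; a boundary line may close
  // several nested parts at once.
  for (var i = this.delimStack.length - 1; i >= 0; i--) {
    if (line.lastIndexOf(this.delimStack[i], 0) === 0) {
      while (this.delimStack.length > i) {
        this.delimStack.pop();
        this.partStack.pop().onEnd();
      }
      return;
    }
  }
  // Not a boundary: the line belongs to the innermost active part.
  this.partStack[this.partStack.length - 1].onLine(line);
};
```

Because the stack is explicit, "where are we in the tree and what would close it" is one inspectable object rather than state smeared across a chain of container callbacks, which is exactly what should help debugging.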


Is there anything in particular you were trying to address with using
generators/yield?  Since they are not yet a standard and it would be
great to be able to use this code in other engines like V8 on node.js,
it would be nice to avoid using them.  (It is also my expectation that
they are unlikely to be extensively optimized until they become a
standard and become one of the benchmarks to optimize for.)

Andrew

Re: Raw MIME API proposal

Joshua Cranmer-2
On 8/5/2011 11:10 PM, Andrew Sutherland wrote:

> On 08/05/2011 09:40 PM, Joshua Cranmer wrote:
>> I have a few ideas for alternative implementations:
>
> What about something more along the lines of the existing libmime
> processing structure, but avoiding integrating display logic into the
> parsing logic?
>
> MIME parsing is reasonably straightforward until we get to the
> nesting.  You have a simple state machine that needs to deal with
> line-oriented data, handling headers and then switching to body
> parts.  The complexity comes from nesting and the fact that the parent
> containers are responsible for detecting the end of the child and
> potentially need to deal with encoding/decoding/decryption, etc.

There are one or two cases we have right now where being unable to look
ahead across a newline causes issues (IIRC, image < > links in plain
text that cross multiple lines); gracefully handling malformed data
broken across multiple lines is another issue that concerns me. Instead,
I wanted to write the parser in a more synchronous-expecting manner,
which allows cases where newlines fall in the middle of productions
(particularly reinterpreted body parts or folded whitespace in headers)
to look ahead and decide whether they need to eat the newline.
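One way to picture the synchronous-expecting style is a reader over a complete buffer whose peekLine() lets a production look past a newline before committing to consume it, folded header whitespace being the simple case. All names here are hypothetical:

```javascript
// A line reader over a fully buffered message: readLine() consumes,
// peekLine() looks ahead without consuming.
function LineReader(text) {
  this.lines = text.split("\r\n");
  this.pos = 0;
}
LineReader.prototype.peekLine = function () {
  return this.pos < this.lines.length ? this.lines[this.pos] : null;
};
LineReader.prototype.readLine = function () {
  return this.pos < this.lines.length ? this.lines[this.pos++] : null;
};

// Read one logical header, eating folded continuation lines only after
// peeking to confirm the next line starts with whitespace.
function readHeader(reader) {
  var value = reader.readLine();
  var next;
  while ((next = reader.peekLine()) !== null && /^[ \t]/.test(next)) {
    value += reader.readLine();
  }
  return value;
}
```

The hard part, of course, is that this shape assumes the whole buffer is present; it is precisely the peekLine() call that has no obvious answer when the next chunk has not arrived yet.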

> Is there anything in particular you were trying to address with using
> generators/yield?  Since they are not yet a standard and it would be
> great to be able to use this code in other engines like V8 on node.js,
> it would be nice to avoid using them.  (It is also my expectation that
> they are unlikely to be extensively optimized until they become a
> standard and become one of the benchmarks to optimize for.)

Mainly, I was trying to use generators to turn a parser that expects a
blocking input stream into one fed by an asynchronous input stream... to
say that it blew up in my face is a totally true statement.

Re: Raw MIME API proposal

Andrew Sutherland-3
On 08/05/2011 11:59 PM, Joshua Cranmer wrote:
> There are one or two cases we have right now where being unable to look
> ahead across a newline causes issues (IIRC, image < > links in plain
> text that cross multiple lines); gracefully handling malformed data
> broken across multiple lines is another issue that concerns me. Instead,
> I wanted to write the parser in a more synchronous-expecting manner,
> which allows cases where newlines fall in the middle of productions
> (particularly reinterpreted body parts or folded whitespace in headers)
> to look ahead and decide whether they need to eat the newline.

Assuming our stream model is "your handler is invoked with an array of
lines, and EOF is conveyed by a null", it seems reasonably
straightforward to let the processing code handle both the "have another
line" case and the "need to save our state and wait for more lines to
show up" case.  Specifically, a variable such as lineSoFar could be
closed over or accessed via 'this'.

To clarify, the expectation is that disk/network I/O would come to us in
blocks of data much larger than a single line, allowing us to process
the data in fairly large batches of many lines at a time.  This should
method-JIT and trace-JIT fairly well.  Idiomatically this could look
like:

function makeParserUsingClosures() {
   var re_ws = /[ \t]/;
   var closedOver_lineSoFar = null;

   return {
     processLines: function (lines) {
       for (var iChunkLine = 0; iChunkLine < lines.length; iChunkLine++) {
         var haveNextLine = (iChunkLine + 1 < lines.length),
             line = lines[iChunkLine];

         // (assume we are in header parsing)
         if ((line.length === 0) ||
             (closedOver_lineSoFar && !re_ws.test(line[0]))) {
           // flush and parse closedOver_lineSoFar...
         }
         else {
           // concatenate the line onto closedOver_lineSoFar
         }
       }
     }
   };
}

I believe that definitely works for the multi-line headers.

I'm not sure exactly what you are referring to with reinterpreted body
parts.  Do you mean something like multipart/related with a text/html
part that gets buffered and later has URLs transformed?  Since that's
something we potentially need to entirely buffer anyways, there's no
harm in performing a synchronous transformation step later on.

Andrew