In last month's column I wrote an output filter to add a header and footer to every web page. This month I want to investigate writing an input filter. This will be the last column devoted to I/O filtering in Apache 2.0.
Input filtering and output filtering are basically the same thing, with some very minor differences. Both input and output filtering rely on buckets and bucket brigades to pass data from one filter to the next. Both have filters that are associated with the connection and filters that are associated with the request.
Output filters are relatively straightforward: the filter is handed data, which it either adds to or modifies, and that data is passed to the next filter. Input filtering cannot work this way because Apache isn't generating the data; it has to get the data from the network. Because of this difference, input filters get called with an empty brigade and they pass this brigade to the next filter. The lowest filter in the chain inserts data into the brigade and returns to the previous filter. That filter can then modify the data and return the brigade to the filter before it, and so on until the brigade is returned to the Apache core.
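The pull model described above can be sketched in plain C without any Apache headers. This is only an illustration under stated assumptions: toy_brigade, toy_filter, and the three functions are hypothetical stand-ins, not the Apache 2.0 types or API. The point is the call order: the upper filter calls down with an empty brigade first, and only touches the data on the way back up.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a bucket brigade: a fixed text buffer. */
typedef struct {
    char   data[128];
    size_t len;
} toy_brigade;

/* Hypothetical stand-in for ap_filter_t: a function plus a link to the
 * next (lower) filter in the chain. */
typedef struct toy_filter toy_filter;
struct toy_filter {
    int (*func)(toy_filter *f, toy_brigade *b);
    toy_filter *next;
};

/* Lowest filter: plays the role of the network read and fills the brigade. */
static int toy_network_filter(toy_filter *f, toy_brigade *b)
{
    static const char wire[] = "get /index.html";

    (void)f;                               /* nothing below this filter */
    memcpy(b->data, wire, sizeof(wire));   /* copies the trailing NUL too */
    b->len = sizeof(wire) - 1;
    return 0;
}

/* Upper filter: asks the next filter for data, then edits what came back. */
static int toy_upcase_filter(toy_filter *f, toy_brigade *b)
{
    int rv;
    size_t i;

    assert(f->next != NULL);
    rv = f->next->func(f->next, b);        /* brigade is still empty here */
    if (rv != 0) {
        return rv;
    }
    for (i = 0; i < b->len; i++) {
        b->data[i] = (char)toupper((unsigned char)b->data[i]);
    }
    return 0;
}

/* What the "core" would do: hand an empty brigade to the top filter.
 * Returns the number of bytes that came back, or 0 on error. */
static size_t toy_core_read(toy_brigade *b)
{
    toy_filter network = { toy_network_filter, NULL };
    toy_filter upcase  = { toy_upcase_filter, &network };

    b->len = 0;
    b->data[0] = '\0';
    if (upcase.func(&upcase, b) != 0) {
        return 0;
    }
    return b->len;
}
```

Calling toy_core_read() with an uninitialized toy_brigade leaves the upper-cased request text in the brigade, which mirrors how the Apache core receives data that every input filter has already had a chance to change.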
Input filters differ from output filters in one other significant manner. Most output filters only deal with actual data. Headers are stored in a table in the request_rec, and there is a core filter that converts that table to a stream of data that is sent to the client. The output headers filter sits low enough in the filter stack that only filters that deal with formatting the data for transmission to the client (e.g. chunking) come after it. Input filtering and headers have a very different relationship. All data coming from a client must pass through the input filters to get to the Apache core. This means that input filters have an opportunity to change the headers of a request before the core ever sees it.
The module that I am presenting this month will modify the headers for a request while Apache reads it. This module came about at ApacheCon Europe 2000 because of the CD that was distributed with the conference proceedings. This CD was created on a Windows machine, and the proceedings were organized as a web site. The problem is that the HTML used spaces and backslashes (\) in the URLs for each page. Unfortunately, the URL "http://localhost/foo\Test Page.html" is not the same as "http://localhost/foo/Test%20Page.html". The first is not a valid URL, while the second is. This CD was tested with Internet Explorer, which automatically converts these invalid URLs into valid ones.
While working at Covalent's booth, I had a discussion with two of the conference attendees, Save Buchanan and Karl Royer. They had attended my session about writing Apache 2.0 modules, and suggested that a filter could be written to solve this problem on the server's side. Out of such humble beginnings mod_apachecon was born. This module walks the first line in a request and ensures that, by the time the request is given to Apache, all spaces have been converted to "%20" and all backslashes have been converted to forward slashes. This allows Apache 2.0 to successfully serve the ApacheCon CD to any web browser.
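The rewrite itself is independent of Apache and can be tried in isolation. The following is a minimal sketch, not the module's code: rewrite_uri is a hypothetical helper name, and unlike the quick module it refuses to write past the end of the destination buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of mod_apachecon: copies src into dst,
 * turning each space into "%20" and each backslash into a forward slash.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * dst is too small to hold the whole result. */
static long rewrite_uri(const char *src, char *dst, size_t dst_cap)
{
    size_t j = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        size_t need = (*src == ' ') ? 3 : 1;

        if (j + need + 1 > dst_cap) {      /* +1 keeps room for the NUL */
            return -1;
        }
        if (*src == ' ') {
            memcpy(dst + j, "%20", 3);
        } else if (*src == '\\') {
            dst[j] = '/';
        } else {
            dst[j] = *src;
        }
        j += need;
    }
    if (j + 1 > dst_cap) {                 /* covers the empty-input case */
        return -1;
    }
    dst[j] = '\0';
    return (long)j;
}
```

Given the path from the CD example, "/foo\Test Page.html", this produces "/foo/Test%20Page.html". Returning -1 on a short buffer is a design choice of this sketch; the module presented in this column uses a fixed 256-byte buffer.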
    static int apcon_pre(conn_rec *c)
    {
        ap_add_input_filter("APACHECON_IN", NULL, NULL, c);
        return OK;
    }

    static void hf_register_hook(void)
    {
        ap_hook_pre_connection(apcon_pre, NULL, NULL, AP_HOOK_MIDDLE);
        ap_register_input_filter("APACHECON_IN", apcon_filter_in,
                                 AP_FTYPE_CONNECTION);
    }

    module MODULE_VAR_EXPORT apachecon_module = {
        STANDARD20_MODULE_STUFF,
        NULL,               /* create per-directory config structure */
        NULL,               /* merge per-directory config structures */
        NULL,               /* create per-server config structure */
        NULL,               /* merge per-server config structures */
        NULL,               /* command apr_table_t */
        NULL,               /* handlers */
        hf_register_hook    /* register hooks */
    };
We will take this in the reverse order of how it appears in the module. The last thing is the module structure. There is no configuration for this module because it will modify every request it receives, so this module structure is basically empty. The only field that is filled out is the register hooks field. The function named there serves two purposes.
The first thing this function must do is register a function for the pre_connection phase. The pre_connection phase is called after Apache accepts the connection from the client, but before Apache begins to do anything with this connection. The point of this phase is to allow modules to set up connection-based information. In this case mod_apachecon uses this phase to add an input filter to the connection. In reality this module should ensure that the connection is received on a server that is handling HTTP requests, but this is a quick module that should never be enabled in a production server, so cutting a few corners is okay.
The second purpose of the register hooks function is to register the input filter that the pre_connection phase adds to the input filter stream. I have named this filter "APACHECON_IN", which is the name that the pre_connection phase uses to insert the filter. The function that actually implements the filter is apcon_filter_in, so that is specified as the second argument. The final argument is the type of the filter. There are two basic types of input filters: connection-based and request-based. Connection-based filters are inserted before a connection is started and get to act upon all of the data that is sent to the server. Request-based filters are added after the request has been started and they only get to access the request body. In this case the filter is going to be acting on headers, so this has to be a connection-based filter.
Now we get to the meat of the module. This is the filter that will replace all spaces with "%20" and all backslashes with forward slashes:
    static apr_status_t apcon_filter_in(ap_filter_t *f, ap_bucket_brigade *b,
                                        ap_input_mode_t mode)
    {
        const char *str, *begin;
        int length, i, j;
        ap_bucket *e, *d;
        char data[256];
This portion of the code declares the filter function and sets up the local variables. The first argument is a reference to the current filter. The second is the brigade to be filled out, and the final argument is the mode in which this filter was called. The mode is unique to input filters. There are three possible modes: AP_MODE_BLOCKING, AP_MODE_NONBLOCKING, and AP_MODE_PEEK. AP_MODE_BLOCKING and AP_MODE_NONBLOCKING are relatively straightforward: data is read from the client in either blocking or non-blocking mode. AP_MODE_PEEK requires a bit of thought. The problem is that Apache needs to determine if there is a second request coming over the same connection. AP_MODE_PEEK is a way for Apache to ask the input filters if there is more information on the connection without having any of the data returned to the caller. The body of the filter begins as follows:
        ap_get_brigade(f->next, b, mode);
        e = AP_BRIGADE_FIRST(b);
        if (e->type == NULL) {
            return APR_SUCCESS;
        }
The first line calls the next filter in the chain to get the data from the client. Once the next filter returns, we need to ensure that we actually received data from it. The AP_BRIGADE_FIRST macro gets the first bucket in the brigade. This gives us a starting point. If that bucket's type is NULL, then we didn't actually get any data and we should just return to the previous filter. The next section examines the request line:
        ap_bucket_read(e, &str, &length, 1);
        if (strncmp("GET ", str, strlen("GET "))) {
            return APR_SUCCESS;
        }
        ap_bucket_split(e, strlen("GET ") + 1);
        e = AP_BUCKET_NEXT(e);
        ap_bucket_read(e, &str, &length, 1);
        /* this should work, because we are just searching for
         * HTTP/1.0 or HTTP/1.1
         */
        begin = str + (strlen(str) - 3);
        do {
            begin--;
        } while (strncmp("HTTP", begin, 4) && (begin > str));
        ap_bucket_split(e, begin - str - 1);
Once we are sure we have data, we have to start looking at it to determine if we have to modify anything. The first line of an HTTP GET request is "GET URI HTTP/1.x". This section of the code ensures that the data in this bucket matches that syntax, and while doing that, it splits the bucket into three buckets. The first bucket contains "GET", the last one contains "HTTP/1.x", and the middle bucket contains the URI.
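The boundary search can be illustrated on a plain string. The sketch below is an assumption-laden stand-in, not the module's code: uri_span and find_get_uri are hypothetical names, and the function reports offsets instead of splitting buckets. Like the module, it searches backwards for "HTTP" so that spaces inside the URI are not mistaken for the separator before the protocol version.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: offsets into the original line, no copying. */
typedef struct {
    size_t uri_start;   /* index of the first byte of the URI */
    size_t uri_len;     /* length of the URI, without the trailing space */
} uri_span;

/* Locates the URI in a line of the form "GET <uri> HTTP/1.x".
 * Returns 1 and fills *out on success, 0 if the line does not match. */
static int find_get_uri(const char *line, size_t len, uri_span *out)
{
    static const char method[] = "GET ";
    const size_t mlen = sizeof(method) - 1;
    size_t p;

    assert(line != NULL && out != NULL);
    if (len < mlen + 4 || strncmp(line, method, mlen) != 0) {
        return 0;
    }
    /* Walk backwards from the last position where "HTTP" could start.
     * Requiring p > mlen + 1 guarantees at least one byte of URI. */
    for (p = len - 4; p > mlen + 1; p--) {
        if (line[p - 1] == ' ' && strncmp(line + p, "HTTP", 4) == 0) {
            out->uri_start = mlen;
            out->uri_len   = (p - 1) - mlen;
            return 1;
        }
    }
    return 0;
}
```

For the line "GET /foo\Test Page.html HTTP/1.0" the sketch reports a URI that starts at offset 4 and is 19 bytes long, which corresponds to the middle bucket in the description above.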
        ap_bucket_read(e, &str, &length, 1);
        i = 0;
        j = 0;
        /* stop early rather than write past the end of data[];
         * a space can expand to three bytes
         */
        while (i < length && j < (int)sizeof(data) - 3) {
            if (str[i] == ' ') {
                data[j++] = '%';
                data[j++] = '2';
                data[j++] = '0';
                i++;
            }
            else if (str[i] == '\\') {
                data[j++] = '/';
                i++;
            }
            else {
                data[j++] = str[i++];
            }
        }
        d = ap_bucket_create_transient(data, j);
        ap_bucket_setaside(d);
        AP_BUCKET_INSERT_AFTER(e, d);
        AP_BUCKET_REMOVE(e);
        ap_bucket_destroy(e);
        return APR_SUCCESS;
    }
This final section gets the data out of the middle bucket and traverses it, copying it to another location. As it copies the data, it converts the illegal characters to the characters discussed earlier. After the data has been copied, we create another bucket out of it. This bucket is a transient bucket, which is done just to cut a corner. We know that using a transient bucket is OK in input filters, although this really should be a heap bucket. Once the new bucket is created, we need to insert it into the brigade in place of the original bucket. This is done by inserting the new bucket after the original bucket and then removing the original. To ensure that we do not leak any memory, any bucket that is removed from a brigade must be destroyed by the function that removed it. Finally, we return to the calling filter, so that it can interpret the request.
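The insert, remove, and destroy sequence is ordinary linked-list bookkeeping, and it can be sketched without the bucket API. Everything here is a hypothetical stand-in: toy_bucket and the toy_* functions only model the ownership rule described above, in which whoever unlinks a bucket is also responsible for freeing it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a bucket: a heap-owned string in a
 * doubly linked list. */
typedef struct toy_bucket {
    struct toy_bucket *prev, *next;
    char *data;
} toy_bucket;

static toy_bucket *toy_bucket_create(const char *text)
{
    size_t n = strlen(text) + 1;
    toy_bucket *b = malloc(sizeof(*b));

    if (b == NULL) {
        return NULL;
    }
    b->data = malloc(n);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    memcpy(b->data, text, n);
    b->prev = b->next = NULL;
    return b;
}

static void toy_bucket_destroy(toy_bucket *b)
{
    free(b->data);
    free(b);
}

/* Link d into the list directly after e. */
static void toy_insert_after(toy_bucket *e, toy_bucket *d)
{
    d->prev = e;
    d->next = e->next;
    if (e->next != NULL) {
        e->next->prev = d;
    }
    e->next = d;
}

/* Unlink e from its neighbours; e itself is not freed here. */
static void toy_remove(toy_bucket *e)
{
    if (e->prev != NULL) {
        e->prev->next = e->next;
    }
    if (e->next != NULL) {
        e->next->prev = e->prev;
    }
    e->prev = e->next = NULL;
}

/* Same three steps as the filter: insert after, remove, destroy. */
static void toy_replace(toy_bucket *old, toy_bucket *fresh)
{
    assert(old != NULL && fresh != NULL);
    toy_insert_after(old, fresh);
    toy_remove(old);
    toy_bucket_destroy(old);
}
```

After toy_replace() the new node sits exactly where the old one was, and the old node's memory has been released by the same code that unlinked it.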
When this module is inserted into an Apache 2.0 server, that server will accept requests for URIs that contain both spaces and backslashes. While this is not a good idea to add to a production server, this module does solve the problem that people were having with the ApacheCon CDs, and hopefully it shows some of the power of input filters.
Related Stories:
Apache 2.0 alpha 8 released! (Nov 20, 2000)
Filtering I/O in Apache 2.0: Part 2 (Oct 23, 2000)
Filtering I/O in Apache 2.0 (Sep 20, 2000)
Apache 2.0 Server Up and Running (Aug 19, 2000)
Looking at Apache 2.0 Alpha 4 (Jun 30, 2000)
An Introduction to Apache 2.0 (May 28, 2000)