Microsoft Search Server and RemapperHttpModule broken

Topics: Troubleshooting
Aug 1, 2011 at 1:48 PM
Edited Aug 1, 2011 at 2:34 PM


Using the RemapperHttpModule with the 'Microsoft Search Server' package (to be able to index PDFs) prevents Search Server Express (2010 at least) from crawling and/or indexing your content.
The remapper does a RewritePath and (re)directs to the Microsoft Search Server page.aspx. In the OnLoad of that page, a Server.Execute to "~/Renderers/Page.aspx" is fired so the page is rendered as it normally would be.

However.....drum rolls......

In "~/Renderers/Page.aspx", the OnPreInit method parses the URL from 'Context.Request.Url.OriginalString' (line 63). This results in a null PageUrl object, because Url.OriginalString is still the Microsoft Search Server page.aspx.
This will probably cause some infinite recursion in Composite.Data.PageUrl.Parse (it does not surface as an exception or the like, so this is just my guess).

Can it be solved? Yep. Prevent the recursion when the search bot tries to render the page: parse Context.Request.Url.OriginalString like the RemapperHttpModule did, but map the path the other way around.

Replace line 63: _url = PageUrl.Parse(Context.Request.Url.OriginalString, out _foreignQueryStringParameters);


// If we are not called from the Microsoft Search bot (i.e. the default case), parse the URL as-is.
string userAgent = (Context.Request.UserAgent ?? "").ToLower(); // UserAgent can be null
if (userAgent.IndexOf("ms search 6.0 robot") == -1)
    _url = PageUrl.Parse(Context.Request.Url.OriginalString, out _foreignQueryStringParameters);
else
{
    // The search bot arrives via the Search Server page; map the path back to the normal renderer.
    string modifiedUrl = Context.Request.Url.OriginalString.Replace("/Frontend/Composite/Search/MicrosoftSearchServer/Page.aspx", "/Renderers/Page.aspx");
    _url = PageUrl.Parse(modifiedUrl, out _foreignQueryStringParameters);
}

NOTE: As I use Search Server Express 2010 I explicitly look for that string. If you want to be thorough you could duplicate the "IsBot" method from RemapperHttpModule and test for that.
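
If you would rather not hard-code the check inline, the idea from the note above could be pulled into a small helper. This is only a sketch under my own assumptions: SearchBotUrlHelper, IsSearchBot, GetUrlToParse and the BotMarkers list are hypothetical names, not part of Composite C1, and this is not a copy of the real "IsBot" method from RemapperHttpModule (which I have not reproduced here). The only user-agent marker included is the one I actually tested against.

```csharp
using System;

// Hypothetical helper; not part of Composite C1 or the RemapperHttpModule.
public static class SearchBotUrlHelper
{
    // Assumption: user-agent fragments that identify the crawler. Extend as needed.
    private static readonly string[] BotMarkers = { "ms search 6.0 robot" };

    private const string SearchServerPagePath =
        "/Frontend/Composite/Search/MicrosoftSearchServer/Page.aspx";
    private const string RendererPagePath = "/Renderers/Page.aspx";

    // True when the user agent contains a known marker (case-insensitive).
    // A null or empty user agent is treated as "not a bot".
    public static bool IsSearchBot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent)) return false;
        foreach (string marker in BotMarkers)
        {
            if (userAgent.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
                return true;
        }
        return false;
    }

    // For bot requests, maps the Search Server page path back to the renderer path.
    // For all other requests the URL is returned unchanged.
    public static string GetUrlToParse(string originalUrl, string userAgent)
    {
        if (!IsSearchBot(userAgent)) return originalUrl;
        return originalUrl.Replace(SearchServerPagePath, RendererPagePath);
    }
}
```

With something like this in place, line 63 could pass SearchBotUrlHelper.GetUrlToParse(Context.Request.Url.OriginalString, Context.Request.UserAgent) to PageUrl.Parse instead of branching inline.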

UPDATE NOTE: You might also want to add an extra crawl rule in the Search Server (exclude, in first position) for Captcha.ashx:
add rule > path: http://*/Captcha.ashx* (match case checked)
         Crawl config : Exclude all items in this path (check exclude complex urls...)
On the Manage Crawl Rules page set the order (last column) for the just created rule to 1.

If you followed the documentation, you also created a crawl rule for 'http://*/ShowMedia.ashx*'. You can remove this rule if you tick the 'Crawl complex urls' checkbox in the 'http://*' rule. The fewer rules there are, the faster Search Server performs.

Hope this helps.


Aug 3, 2011 at 2:37 PM

I added the bug to the issue tracker.