Request for comments: extra path info handling in a new URL regime

Topics: Feature requests, General
Coordinator
May 11, 2011 at 10:12 PM
Edited May 11, 2011 at 10:23 PM

We are currently addressing the feature requests and critique we have received about how our URLs look: they are far too verbose, developers lack control over casing and over how characters get replaced, and the .aspx extension is not wanted. URLs like http://contoso.fr/fr/Home/About.aspx should become http://contoso.fr/about.

Extra path info

One area where we haven't quite been able to find a silver bullet, and where we would like some input and ideas, is 'extra path info': a suffix that extends an existing page URL and can be consumed by functionality embedded on the page to control what it shows.

Today a URL with extra path info would look like http://contoso.com/Home/Products.aspx/books, and the .aspx part ensures that we can easily identify which is the 'page part' and which is the 'extra part' of the URL. We can easily send a 404 etc. when something is wrong. Under a new URL regime this URL could become http://contoso.com/products/books, and we can no longer easily identify the page URL part.

A quick example: http://server/products is a page that contains a C1 Function that by default will list products, but if the C1 Function is called with a product id as "extra path info", that particular product is shown by the function instead of the product list. So the URL http://server/products/baconnaise would show the same Composite C1 page instance as http://server/products, but the C1 Function embedded on that page would behave differently, showing either a list or a single product.

When intercepting the request http://server/products/baconnaise Composite C1 would find that “/products” is the page path (deepest match found in the page sitemap) and “/baconnaise” is extra path info. The “/products” page would be rendered, leaving it up to functionality on that page to use (or not use) the extra path info.
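
To make the matching concrete, here is a minimal C# sketch of the "deepest match" lookup described above. The page paths are made up and this is not the actual Composite C1 implementation; it only illustrates how the page part and the extra part could be split:

    using System;
    using System.Linq;

    // Illustration only: split a request path into (page path, extra path info)
    // by taking the deepest known page path that matches on a segment boundary.
    class PathInfoResolver
    {
        // Hypothetical sitemap: the page URLs known to the CMS.
        static readonly string[] PagePaths = { "/", "/products", "/products/campaigns", "/about" };

        public static (string PagePath, string ExtraPathInfo) Resolve(string requestPath)
        {
            var page = PagePaths
                .Where(p => requestPath.Equals(p, StringComparison.OrdinalIgnoreCase)
                         || requestPath.StartsWith(p.TrimEnd('/') + "/", StringComparison.OrdinalIgnoreCase))
                .OrderByDescending(p => p.Length)
                .FirstOrDefault();

            if (page == null) return (null, null);                 // nothing matched: 404

            string extra = requestPath.Substring(page.TrimEnd('/').Length);
            return (page, extra.Length == 0 ? null : extra);
        }

        static void Main()
        {
            Console.WriteLine(Resolve("/products/baconnaise"));    // (/products, /baconnaise)
            Console.WriteLine(Resolve("/products"));               // (/products, )
            Console.WriteLine(Resolve("/about/crappy-company"));   // (/about, /crappy-company)
        }
    }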

The good part

Being able to just add a C1 Function to any page and then use extra path info to 'navigate inside the C1 Function' is very easy and intuitive. You simply create a C1 Function that feasts on extra path info and add it to a page where desired, and you get sweet URLs and can move things around if needed without digging into code or config. Yay!

The bad part

A very loose URL regime like this is something that SEO people have warned me never to allow. For instance, a page like http://server/about would serve the same content as http://server/about/crappy-company and http://server/about/add-anything-here, since the CMS would find "/about" to be the best matching page. No 301 redirect or 404 error would happen; a search engine would just find a potentially limitless number of URLs with duplicate content (just add /something to the page and you have a new URL with the same content).

“Bad links” could happen in situations like this:

  • A page is moved or its URL title is renamed. Requests using the page's old URL would yield the parent page's content, not a 301 redirect or a 404 error. Content duplication could become an issue.
  • Someone with evil intent creates links like http://server/about/crappy-company, which with some effort could actually show up in Google results higher than the original /about URL. Not cool.
  • A link to a page is misspelled, causing the parent page to be rendered. Duplicate content.
  • (other situations worth noting?)

Should we handle this?

Yes. Duplicate content issues are almost guaranteed to happen over time and slowly grow worse, and I have spent enough time with SEO people to know that duplicate content is something a CMS should combat, not promote.

How should we handle this?

I would like to quickly recap the good part again: we have this nice and easy-to-use feature where you as a dev can hook onto the URL of the page that hosts your C1 Function and use the extra path to route into your data. Our MVC Player works like this and it's fairly sweet. An end user can add your C1 Function where they want it and this "just works"; no config or code changes are required, and URLs and routing adapt naturally.

Here are a few ways to prevent the duplicate content issue, listed in no particular order. Most of them break the "just works" goodness described above; the question is which approach is desirable, or whether better ways exist:

  1. Pages that accept extra path info must be 'whitelisted' first. Adding a C1 Function that uses extra path info will not work as expected until a user explicitly allows it for the particular page. This could be a checkbox when editing the page or a config setting. A request to "/products/baconnaise" will yield a 404 until "/products" is explicitly allowed to be requested with extra path info.
    (Good: pretty much fixes the issue. Bad: user actions are required, and it will whitelist anything, including /products/crappy-company.)
  2. Composite C1 will allow the extra path info request to execute, but unless code executed as part of the request explicitly notifies Composite C1 that it "used the extra path info", the rendered page is thrown away and a 404 is returned. This is kind of like the whitelist idea, except that code does the "whitelisting" for the current request while executing, and bad extra path info can still yield a 404.
    (Good: allows devs to fix the problem with a high level of detail, letting bad URLs fail. Bad: devs need to care about calling this. How would XSLT Function devs do it?)
  3. <link rel="canonical" … /> is used to combat duplicate content. By default the currently rendering page's URL (i.e. /products) would be specified as the canonical URL. Code that actively uses the path info is responsible for delivering a more exact canonical URL. By default pages will render just fine with /extra/stuff appended; the canonical URL will contain the page's current short-form URL. (A small sketch of this follows after the list.)
    (Good: things just work. Bad: requires a canonical link element regime by default, and devs must expand on it or lose Google indexing of the URLs inside their C1 Function.)
  4. Introduce a "URL validation provider" feature: devs can write a plug-in that gets called with seemingly invalid URLs and can okay them at request time (perhaps passing the request to a specific page). If no provider okays the URL, the request yields a 404.
    (Good: you can write C# code that gets to okay a URL. Bad: we didn't solve the problem, just moved the headache to a provider.)
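
For option 3, here is a rough sketch of what emitting the canonical link could look like from a control that consumes extra path info. The HtmlLink wiring is standard ASP.NET Web Forms; the "/products" URL and the way the extra path info reaches the control are assumptions, not an existing Composite C1 API:

    using System;
    using System.Web.UI;
    using System.Web.UI.HtmlControls;

    // Sketch of option 3: a control hosting a product view overrides the default
    // canonical URL with a more exact one once it has consumed the extra path info.
    public class ProductViewControl : UserControl
    {
        protected override void OnPreRender(EventArgs e)
        {
            base.OnPreRender(e);

            // Assumption: the CMS exposes the extra path info, e.g. "/baconnaise",
            // to the control somehow; here it arrives via HttpContext.Items.
            string extraPathInfo = Context.Items["ExtraPathInfo"] as string;

            var canonical = new HtmlLink();
            canonical.Attributes["rel"] = "canonical";
            canonical.Href = string.IsNullOrEmpty(extraPathInfo)
                ? "/products"                      // default: the page's own short URL
                : "/products" + extraPathInfo;     // exact: page URL plus the consumed path info

            Page.Header.Controls.Add(canonical);
        }
    }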

Here is a relevant video if you are not familiar with duplicate content or the canonical link element: http://www.youtube.com/watch?v=Cm9onOGTgeM

Your input!

Sorry if this post became long and murky. I hope it makes sense and that you either have ideas to share or can identify a model you would prefer.

May 11, 2011 at 10:57 PM

It's all in all very simple, and briefly mentioned here: http://compositec1.codeplex.com/discussions/252587

  • By default you don't allow these extra URLs to be appended, so a URL that doesn't exist in the CMS will return a 404
  • If one needs to add a parameter to his/her page, the person would create a *-page in the CMS that will be the placeholder for this unknown number of new unique URLs
    • Manually having to create a *-page forces the system to have a handler that knows about these new URLs and handles 404s correctly
    • Nesting of parameters is allowed with a hierarchy of * pages
    • A catch-all is handled with a ** page that will catch ALL levels of extra parameters (a small matching sketch follows below)
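
A small illustrative sketch of this convention, with hypothetical page patterns and no claim of being an existing implementation: a "*" page matches exactly one extra segment, a "**" page catches any remaining depth.

    using System;
    using System.Linq;

    // Illustration of the proposed convention: "*" matches one extra URL segment,
    // "**" matches any number of remaining segments.
    class WildcardPageMatcher
    {
        // Hypothetical CMS page tree, expressed as URL patterns.
        static readonly string[] PagePatterns = { "/blog", "/blog/*", "/docs/**" };

        public static string Match(string requestPath)
        {
            string[] request = requestPath.Trim('/').Split('/');

            return PagePatterns.FirstOrDefault(pattern =>
            {
                string[] parts = pattern.Trim('/').Split('/');

                for (int i = 0; i < parts.Length; i++)
                {
                    if (parts[i] == "**") return true;               // catch-all swallows the rest
                    if (i >= request.Length) return false;
                    if (parts[i] != "*" &&
                        !parts[i].Equals(request[i], StringComparison.OrdinalIgnoreCase))
                        return false;
                }
                return parts.Length == request.Length;               // "*" matches exactly one segment
            });
        }

        static void Main()
        {
            Console.WriteLine(Match("/blog"));                 // "/blog"
            Console.WriteLine(Match("/blog/a-specific-post")); // "/blog/*"
            Console.WriteLine(Match("/blog/a/b"));             // null -> 404
            Console.WriteLine(Match("/docs/api/v1/pages"));    // "/docs/**"
        }
    }
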
Coordinator
May 11, 2011 at 11:48 PM

@burningice on the face of it that approach sounds like a very verbose "whitelist" approach (create a new sub-page with a ** URL name to allow extra path info; same effect ... almost), and it would force routing related to the 'extra path info' to be handled at the CMS level, since 'extra path info' == 'a new * or ** page'. Code would need to be broken into pieces fitting each page, and URL naming/logic could end up spanning both domains.

I can see that having "*" or "**" as a URL folder name can have some merit if there is no other way to control request routing, but I am aiming for a situation where the functionality embedded on a page owns the 'extra path info' and the routing that comes with it. I'm looking for a solution that does not require multiple pages to be created or CMS-forced routing, and that avoids code/page/URL change coordination of any kind.

In a nutshell, adding a feature like our blog should require me to: "1) Create a page. 2) Add the blog function to that page. 3) Save and publish." Nothing more, and especially not creating an extra ** page below it with yet another blog function embedded on it. It seems weird and clumsy to involve the CMS and its users in routing if devs can fix it from day 1.

May 12, 2011 at 12:03 AM
Edited May 12, 2011 at 12:05 AM

It addresses all the issues you're listing with regard to arbitrary extra URLs floating around in places where they should otherwise not be allowed. Having a * page that catches e.g. blog/something IMHO also makes more sense, since blog and blog/something are seen as two physically different pages that handle different concerns. blog/* is the page that actually renders a specific blog entry, or returns a 404 if the entry id is invalid, while plain blog maybe lists the last 10 entries.

In terms of routing and MVC it's also exactly what you're doing when specifying which parts of the URL belong to the controller and which are parameters. Creating * pages in the CMS tree is a point-and-click way to tell the routing engine how it should treat parts of a URL that would otherwise go unrecognized.

You could of course also put a property directly on a page telling the CMS that this page should be allowed to have parameters n levels deep, and in that way get rid of physical * pages in the tree, but relying on just dumping a function into the content, I can't see how that would ever work out. Remember that people should have 100% their own choice of how to list their blog posts, be it a UserControl or whatnot, and never be forced to use some C1 Function.

Coordinator
May 12, 2011 at 12:28 AM

> but relying on just dumping a function into the content, i can't see how would ever work out

Approaches #2 and #3 would allow this, I guess? Approach #2 is probably the cleanest one: let the developer's code "okay" the request by calling Composite C1. If it isn't okayed and the request URL has extra path info, it's a 404 (which could, by the way, contain some helpful info about how to "okay" a request, if the requesting user is authenticated in the C1 Console). A downside to approach #2 is that we would have to actually render the best matching page before we can determine whether this is a 404; the question is what badness that could yield.
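
Purely as a sketch of how approach #2 could feel to a function developer (none of these types or method names exist in Composite C1; they are hypothetical stand-ins for "okay the extra path info or let the request end as a 404"):

    using System;
    using System.Collections.Generic;
    using System.Web;

    // Hypothetical sketch only -- ExtraPathInfo and its methods are invented here
    // to illustrate approach #2, not an actual Composite C1 API.
    public static class ExtraPathInfo
    {
        const string ValueKey = "ExtraPathInfo.Value";
        const string UsedKey  = "ExtraPathInfo.Used";

        // Would be set by the CMS when it splits "/products/baconnaise" into
        // page path "/products" and extra path info "/baconnaise".
        public static string Current(HttpContext ctx) => ctx.Items[ValueKey] as string;

        // Called by function code that actually consumed the extra path info;
        // if nobody calls this, the pipeline discards the page and returns a 404.
        public static void MarkAsUsed(HttpContext ctx) => ctx.Items[UsedKey] = true;

        public static bool WasUsed(HttpContext ctx) => ctx.Items[UsedKey] is bool used && used;
    }

    // What a product-listing function might do while rendering:
    public class ProductFunction
    {
        static readonly HashSet<string> KnownProducts = new HashSet<string> { "baconnaise", "books" };

        public string Execute(HttpContext ctx)
        {
            string extra = ExtraPathInfo.Current(ctx)?.Trim('/');

            if (string.IsNullOrEmpty(extra))
                return "<ul>...product list...</ul>";

            if (!KnownProducts.Contains(extra))
                return "<ul>...product list...</ul>";   // unknown id: don't okay it, the request ends as a 404

            ExtraPathInfo.MarkAsUsed(ctx);               // okay the request; the rendered page is kept
            return "<div>...details for " + extra + "...</div>";
        }
    }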


May 12, 2011 at 2:46 PM
Edited May 12, 2011 at 3:04 PM

As I was trying to point out, blog and blog/something are two different things, and you could easily imagine wanting different page templates for each of those two pages. You're talking about having a page that "owns" the extra parameters, and that's exactly what the * page is for. IMO it starts to become clumsy when the page blog has to handle both rendering without any parameters and rendering with a parameter. To take your logic to the extreme, there would be no need to have ANY pages in the CMS; we could just have a root page that takes all other paths as a parameter. But hey, how do you then specify different templates or properties for specific pages?

The * page mimics those pages that you would otherwise have to create manually by hand. It's a virtual page that represents any number of URLs which will share the same properties and template. blog and blog/a-specific-post don't necessarily share the same properties, while blog/a-specific-post and blog/another-specific-post do.

Content- and logic-wise it also makes sense to look for the page blog/another-specific-post as a subpage of blog in the CMS, and if you want to change, for instance, how the title of the post is shown (maybe insert an extra space), you would also look for this subpage under blog to do it. So it's not just a marker for the CMS; it contains the actual content for showing a specific post, without having to kludge the code with if-statements checking whether a parameter was specified, since we know this page will only be hit if and only if a parameter was specified.

With option #2 you always have to implement some method that can tell the CMS whether parameters to a specific page are allowed. That effectively means that as an editor you can't insert a page that shows data items filtered by a parameter without having to write some "real" code. Using * pages you can manage it all via point-and-click:

  1. Create a datatype named photoalbum
  2. Add 5 different items based on this datatype
  3. Create a page named photoalbums
  4. Write some content explaining how fond you are of taking pictures and that your albums can be seen by clicking here
  5. Also insert into the content a function that lists all photoalbum data items
  6. Create a subpage named * under photoalbums
  7. Insert into the content how you want the page for a specific photoalbum to look: title, description and a list of pictures
    1. Since this page represents any of those 5 photoalbums, the title, description and list of pictures can't be hardcoded
    2. Instead we insert functions that extract the title and description from a photoalbum data item. Which instance it is, we find out from the parameter passed in the URL.

      <f:function name="Composite.Utils.GetPropertyValue" xmlns:f="http://www.composite.net/ns/function/1.0">      
         <f:param name="PropertyName" value="Title" />        
         <f:param name="InputObject">
            <f:function name="my.namespace.PhotoAlbum.GetDataReference">
               <f:param name="KeyValue">
                  <f:function name="Composite.Web.Request.ExtraUrlParameter" />
               </f:param>
            </f:function>      
         </f:param>    
      </f:function>
      
    3. The list of pictures we can show with some list-pictures function, where we again pass in our photoalbum data item as in the previous step.

      <f:function name="MyFancyPhotoListViewer" xmlns:f="http://www.composite.net/ns/function/1.0">      
         <f:param name="PhotoAlbum">
            <f:function name="my.namespace.PhotoAlbum.GetDataReference">
               <f:param name="KeyValue">
                  <f:function name="Composite.Web.Request.ExtraUrlParameter" />
               </f:param>
            </f:function>      
         </f:param>    
      </f:function>

You could argue that this is what you can already do using "Visual Functions" that generate some layout taking a datatype as a parameter, but here we elevate the same functionality onto the page itself, giving editors/authors the chance to manipulate the look of a dynamic page even if they only have access to the Content tab in the administration.

Yes, you would run into a problem where GetDataReference returns null because of an invalid URL parameter, and the whole page should in that case abort and return a 404 code. But hey, it doesn't look like you yourselves see that as the biggest issue: http://www.composite.net/C1/About_us/Updates.aspx?story=201114:sdf+asd+reaches+for+the+cloud+with

May 12, 2011 at 3:01 PM

Another option, which is in line with option #2, is to use the SiteMap.SiteMapResolve event: http://msdn.microsoft.com/en-us/library/system.web.sitemap.sitemapresolve.aspx. If you really want developers to write code to tell C1 whether a URL is valid or not, at least do it the ASP.NET way. With this event, you would either return null if the URL doesn't point to any valid page, or a SiteMapNode object.
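
A minimal sketch of wiring that event up. The registration and the SiteMapResolveEventArgs members are standard ASP.NET; the deepest-match walk is only illustrative glue, not something Composite C1 provides:

    using System.Web;

    public class Global : HttpApplication
    {
        protected void Application_Start()
        {
            SiteMap.SiteMapResolve += OnSiteMapResolve;
        }

        static SiteMapNode OnSiteMapResolve(object sender, SiteMapResolveEventArgs e)
        {
            string path = e.Context.Request.Path;

            // Walk up the path until a registered page is found; the remainder
            // would be the extra path info belonging to that page.
            while (!string.IsNullOrEmpty(path))
            {
                SiteMapNode node = e.Provider.FindSiteMapNode(path);
                if (node != null)
                    return node;                              // deepest matching page wins

                int cut = path.LastIndexOf('/');
                path = cut > 0 ? path.Substring(0, cut) : null;
            }

            return null;                                      // no valid page: the caller can respond with a 404
        }
    }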

Coordinator
May 13, 2011 at 10:14 AM

Here's Maw's approach in a nutshell

1. All URLs are friendly URLs. (E.g. if there is a URL "/a/b/c/d" and "/a" is a page URL, that page will be rendered.)

2. The C1 root page "/" does not handle path info.

3. There is a C1 function that has to be called upon successful handling of the PathInfo; otherwise a 404 will be shown.


May 13, 2011 at 11:13 AM

In a nutshell that is what my Contrib project does today, but without a good method of rejecting these extra parameters it can end in a mess.

This is a request for comments, and I won't vote for automatically accepting all URLs and then needing to write extra code to tell C1 whether you want to accept the parameters or not. I prefer the MVC routing pattern, where you declaratively tell the system which patterns you want to accept in advance. In MVC you for instance define a pattern like "{controller}/{action}/{id}", meaning that /some/thing/good is valid while /some/thing/very/good is automatically a 404.
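
For reference, this is the standard ASP.NET MVC way of declaring the acceptable URL shapes up front; the products route is only an example, not something either CMS approach prescribes:

    using System.Web.Mvc;
    using System.Web.Routing;

    public static class RouteConfig
    {
        public static void RegisterRoutes(RouteCollection routes)
        {
            // "/products/baconnaise" matches the first route; "/products/very/deep/path"
            // matches nothing and therefore ends as a 404.
            routes.MapRoute(
                name: "ProductDetails",
                url: "products/{id}",
                defaults: new { controller = "Products", action = "Details" });

            routes.MapRoute(
                name: "Default",
                url: "{controller}/{action}/{id}",
                defaults: new { controller = "Home", action = "Index", id = UrlParameter.Optional });
        }
    }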

It might seem like double standards since I've implemented your suggested approach already, but it has also given me many headaches and I've realized that it's not a good way to go.

Coordinator
May 13, 2011 at 2:13 PM
Edited May 13, 2011 at 2:17 PM

> as a editor you can't insert a page that shows data-items filtered by a parameter, without having to write some "real" code.

I guess I have described option #2 badly if you have this impression. If a function uses 'extra path info', the function will validate that extra path info. When the user inserts the function, he is indirectly inserting the correct validation as well. As simple as this: "Insert function. Done." This simplicity (in UX) is the goal.

> I prefer the MVC routing pattern, where you declaratively tell the system which patterns you want to accept in advance.

I wish things were that simple. Take your MVC sample and then add this axiom: "an MVC handler may (or may not) dynamically execute another handler embedded within it, which appends to its path". That is what we have here, making "/some/thing/very/good" potentially valid (two valid routes: "/some/thing/very" and "/good"). Your solution to this is "let someone declare this by creating * pages in the CMS"; mine is "let the code that reads "/good" okay it or not". Your approach can okay it up front (routing); mine has to render the page first (compensating).

"Routing" is inherently better than "compensating", and in that light your approach is superior - but 'self contained, one click UX' is inherently better than 'create X pages, add special content all coordinated with code behavior', and is that light my approach is superior. Let us agree on these points (routing vs. UX), and when we agree on that, then lets start discussing costs/benefits.

I can see that "* pages" have some merits, but I don't find them to be the best way to validate URLs. I could give you tons of examples where the * page strategy breaks down completely (like "add the MVC Player to a teaser that can end up on any page"). Add to it that you require users to create these * and ** pages just to make things work. "Lots of things you cannot do" and "Horrible UX" are not options in my book.

May 13, 2011 at 5:17 PM
Edited May 13, 2011 at 5:44 PM

I wonder just how strongly search engines would, if they do at all, punish a website for displaying a "User Not Found" page when someone looks for "~/user/noexist" or for showing "No blog entry exists for that date" when asked for a non-existent post. Sometimes "Item not found" is a more valid response than 404.

I note StackOverflow is not worried about people creating links like http://skeptics.stackexchange.com/questions/3132/Some-Idiot-Asks-Some-Stupid-Question for their site; however, the extra path info in their URLs is entirely superfluous and for browser bars only.

I would like this functionality

  • Multiple websites with exclusive hostnames which optionally include the root page UrlTitle in their URLs
  • Ability to indicate a non-default delimiter for replacing spaces when encoding URLs
  • URL handling options that include casing: title, lower, unaltered. I like the C1 case-insensitivity in URLs!

It seems to me that "extra path information" is the part of the path after a node with an extension; that is how it works. I would expect C1 to have an option to hide extensions at the end of URLs where no document (like /Composite/Top.aspx) actually exists, but I'm not sure that I, as a site developer or as a package developer whose packages are consumed at other sites, would expect C1 to remove the extension from the interior of URLs. If a developer expected that feature, they should also expect to have to go through some workflow (create the whitelist) to indicate when a URL has extra path info, because you can't tell by looking at it anymore once someone has removed the extension from the middle! Developers also have the option of using rewrites to create friendly-looking URLs to and from URLs with query-string parameters.
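
For context, this is what the extension-based split gives you for free in plain ASP.NET today; the page class and the product handling below are made up, but FilePath and PathInfo are standard HttpRequest members:

    using System;
    using System.Web.UI;

    // For a request to /Home/Products.aspx/books, ASP.NET itself separates the
    // file part from the extra path, because the .aspx extension marks the boundary.
    public partial class Products : Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            string filePath = Request.FilePath;   // "/Home/Products.aspx"
            string extra    = Request.PathInfo;   // "/books"

            // Once the extension disappears from the middle of the URL, this split
            // is no longer free; some whitelist or convention has to provide it.
            if (extra.Length > 0)
            {
                // ... render the single item identified by "extra" ...
            }
        }
    }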

As a site developer, I think it would be most helpful if C1 had a robust, powerful tool for creating URL rewrite rules. I would expect extension-free URLs to be a standard option of the CMS, but I wouldn't expect the CMS to anticipate which Blog or Events package I might choose, how that blog package consumes query strings or extra path info, or what I name the pages on which I have functions that consume extra path info or query strings.

If Composite C1 is no longer expected to remove the extension from the interior-node steps, and at the same time Composite C1 provides developers with easy tools to obfuscate (different string) or hide (title whitelist) the extension in the interior of URLs, then some of these scenarios can be avoided.

I appreciate wanting to do as much as possible for the developer; in fact I am still getting a kick out of discovering things C1 does for me. But I would be mightily impressed by functionality that could use extra path info from URLs without indicating strings, never require the developer to craft a rewrite rule, and also return a 404 when the extra path info is bad!

May 15, 2011 at 5:18 PM

Interesting that you're asking about 404s. Google just had a blog post about it, and the conclusion must be that 404s don't hurt. And remember that you can still send content to the user with a 404 status code, meaning that friendly "your page was not found" pages should still be served with a 404 code.

http://googlewebmastercentral.blogspot.com/2011/05/do-404s-hurt-my-site.html

May 15, 2011 at 5:32 PM
Edited May 15, 2011 at 5:32 PM

Add to it that you require users to create these * and ** pages just to make things work. "Lots of things you cannot do" and "Horrible UX" are not options in my book.

How is that different from having people insert a function on the page, just to make things work?

My example with MVC was about how conventions in the routing path are used to map to the correct controller and action. To stay on the same track, a page could even be named {id} instead of * to be able to name the parameters. Without these dedicated pages you would not be able to have separate templates for e.g. the list page and the detail page.

... how is that for things you cannot do?

May 16, 2011 at 5:04 PM
burningice wrote:

Interesting that you're asking about 404s. Google just had a blog post about it, and the conclusion must be that 404s don't hurt. And remember that you can still send content to the user with a 404 status code, meaning that friendly "your page was not found" pages should still be served with a 404 code.

http://googlewebmastercentral.blogspot.com/2011/05/do-404s-hurt-my-site.html

Good point!

Jul 13, 2011 at 9:03 PM
xanderlih wrote:

I wonder just how strongly search engines would, if they do at all, punish a website for displaying a "User Not Found" page when someone looks for "~/user/noexist" or for showing "No blog entry exists for that date" when asked for a non-existent post. Sometimes "Item not found" is a more valid response than 404.

I note StackOverflow is not worried about people creating links like http://skeptics.stackexchange.com/questions/3132/Some-Idiot-Asks-Some-Stupid-Question for their site; however, the extra path info in their URLs is entirely superfluous and for browser bars only.

I know this is an older thread, but I just wanted to say that while it's true that SO doesn't care about the "slug" that comes after the question ID, if you check out the page source, they also use "rel='canonical'" to ensure Google gets the preferred link:

<link rel="canonical" href="http://skeptics.stackexchange.com/questions/3132/will-a-mother-bird-abandon-her-young-if-touched-by-a-human">