X-Team Blog - The Most-Loved Company for Engineers

Make robots happy: Structure your HTML properly

Written by Bernardo Dias | Mar 16, 2015 4:00:00 AM

Content is the center of the digital experience. The ease of access, long-term maintainability, and search engine ranking of your content will largely be affected by how it is structured.

We’ll leave the topic of structuring data for a future conversation and focus on the most common way content is presented on most devices today: as HTML. For almost all developers, HTML needs no introduction, yet some concepts are worth reinforcing.

Imagine a book whose table of contents looks like this:

1.
2.
  1. Untitled Section
  2. Untitled Section
  3. Untitled Section
  4. Untitled Section
  5. 
  6.
  7. Top News
    1. News Analysis
    2. ...
  8.
    1. The Unknown Notebooks of Jean-Michel Basquiat NYT Now
    2. ...
  9. The Opinion Pages
    1. Women at Work
    2. ...
  10. Untitled Section
  11. User Subscriptions
    1. Reading The Times With David Axelrod
    2. ...
  12. Watching
  13. timesvideo
    1. Untitled Section
  14. Inside Nytimes.com
    1. Sunday Book Review
      1. Review: Erik Larson’s ‘Dead Wake’NYT Now
    2. Opinion
      1. Room for Debate: Friends, but Only OnlineNYT Now
    3. Opinion
      1. Private Lives: A Manic Depressive’s Best FriendNYT Now
    4. ...
  15. Sections
    1. World »
      1. Attacker of Mark Lippert, U.S. Ambassador to South Korea, Said to Be ‘a Fringe Element’NYT Now
      2. ...
    2. Business Day »
      1. Chinese Premier Sketches a Lofty Vision for Private Enterprise but Warns of ObstaclesNYT Now
      2. ...
    3. ...
  16. Real Estate »
    1. The Hunt
    2. A Texan Makes His Way to Uptown ManhattanNYT Now
      1. More Articles in the Series
    3. Search for Homes for Sale or Rent
    4. Sell Your Home
    5. 
    6. Living In
    7. Sutton Place, Cozy Enclave by the East RiverNYT Now
      1. More Articles in the Series
  17. Untitled Section
  18. Site Index window.magnum.writeLogo('small', 'http://a1.nyt.com/assets/homepage/20150304-152909/images/foundation/logos/', '', '');
    1. News
    2. Opinion
    3. Arts
    4. Living
    5. Listings & More
    6. Subscribe
  19. Untitled Section
  20. Untitled Section

Makes no sense, right? Yet this is the content structure of the New York Times homepage at the time of writing, and it reveals a lot of mistakes in how the HTML is used.

Now imagine how unhelpful that is for crawler robots and SEO. Here’s what you need to know about sectioning and headings to do it right.

Semantic representation

Our brain can easily discern a difference in the meaning of text from a simple change in visual appearance, and we interpret it as a change in emphasis. Subtleties like these, however, are meaningless to machines.

Thus style should never be used to convey the semantics of the content. The true strength of HTML (especially HTML5) is found when we choose elements by their semantic meaning.

Apart from the <div> and <span> elements, all HTML elements have semantic meaning. These two are used where no other element is appropriate, usually to group elements for styling or to contain elements that inherit some attribute, such as class or lang.
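
As a small illustration (the class names and text are invented for this sketch), compare markup whose meaning exists only in the stylesheet with markup whose elements state the meaning themselves:

```html
<!-- Meaning carried only by styling: a machine sees two generic containers -->
<div class="title">Opening hours</div>
<span class="bold">Closed on Sundays.</span>

<!-- Meaning carried by the elements: a heading, a paragraph, strong importance -->
<h2>Opening hours</h2>
<p><strong>Closed on Sundays.</strong></p>
```

CSS can make both versions look identical on screen, but only the second tells a crawler or a screen reader which text is a heading and which text is important.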

Sectioning content

Sectioning content elements organize the document into logical parts. Each sectioning element defines the scope of headers and footers, and can contain a heading of its own, which contributes to the document outline.

Sectioning content elements

The main sectioning content elements are:

  • <article> – an article, blog or forum post, comment, or any piece of content that is complete in itself;
  • <aside> – a sidebar, comment section, advertisement, or footnote that is related to the page or its content;
  • <nav> – a section of the site with navigation links to other pages or to sections of the same page;
  • <section> – a section of the page or a chapter of an article, usually with a title.
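
The four elements above might be combined as in the following sketch. The page layout, link targets, and heading texts are hypothetical and only serve to show where each element fits:

```html
<body>
  <h1>X-Team Blog</h1>

  <!-- nav: links to other pages of the site -->
  <nav>
    <h2>Site navigation</h2>
    <ul>
      <li><a href="/">Home</a></li>
      <li><a href="/articles">Articles</a></li>
    </ul>
  </nav>

  <!-- article: content that is complete in itself -->
  <article>
    <h2>Structure your HTML properly</h2>

    <!-- section: one chapter of the article, with its own title -->
    <section>
      <h3>Semantic representation</h3>
      <p>...</p>
    </section>

    <!-- aside: related to the article, but not part of its main flow -->
    <aside>
      <h3>About the author</h3>
      <p>...</p>
    </aside>
  </article>
</body>
```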

Heading content elements

Headings, called heading content, are represented by the elements <h1>, <h2>, <h3>, <h4>, <h5> and <h6>. Headings have a rank: <h1> has the highest rank and <h6> the lowest. These elements should not be used for subtitles, taglines or alternative titles unless the text is meant to be the heading of a new section or subsection.
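
For instance, a title followed by a subtitle could be marked up as below. This is an illustrative sketch that splits this article's own title in two; the point is only that the subtitle does not start a new section, so it is a paragraph rather than a heading:

```html
<!-- Avoid: the subtitle becomes the heading of an implied subsection -->
<h1>Make robots happy</h1>
<h2>Structure your HTML properly</h2>

<!-- Prefer: one heading, with the subtitle as ordinary text grouped with it -->
<header>
  <h1>Make robots happy</h1>
  <p>Structure your HTML properly</p>
</header>
```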

Document Outline

The document outline represents the content structure. It is formed by sections and headings, which together produce a content map that can be used, for instance, to generate a table of contents. Assistive technology can use such a table of contents to help the user navigate the page. Search engine robots also parse it to identify the important parts of the page’s content more efficiently.

When all browsers and assistive technologies implement the document outline algorithm, accessibility will gain a great deal. In the meantime, we can still benefit from understanding how it works in HTML4 and HTML5, with a view towards future compatibility.

Outline in HTML4

Creating an outline in HTML4 is quite simple. A heading of lower rank starts a subsection that is part of the previous section. A subsequent heading of equal or higher rank starts a new section. In both cases, the heading content element is the heading of an implied section.

See an example:

<h1>X-Team awesome site</h1>
<h2>About</h2>
<p>We're dedicated to building a future where every extraordinary developer in the world has access to incredible opportunities.</p>
<h3>Our Team</h3>
<p>A company led by people born to solve challenges.</p>
<h2>Contact</h2>
<p>Developers you can trust, when you need them most.</p>

This results in the following outline:

1. X-Team awesome site
  1. About
    1. Our Team
  2. Contact

In this method the heading levels need to be properly organized and the document can contain no more than 6 levels. This is usually no problem, but can restrict and hinder the maintenance of more complex structures.

Outline in HTML5

The HTML5 document outline inherits the HTML4 rules and adds the new sectioning content elements. The first heading content element in a sectioning content element represents the heading of that section, no matter what its rank. Sectioning content elements are always considered subsections of their nearest ancestor section, implied or otherwise.

In other words, sectioning content elements define the sections in the outline. The previous example could use only <h1> elements and produce an equivalent outline with markup like this:

<h1>X-Team awesome site</h1>
<section>
  <h1>About</h1>
  <p>We're dedicated to building a future where every extraordinary developer in the world has access to incredible opportunities.</p>
  <section>
    <h1>Our Team</h1>
    <p>A company led by people born to solve challenges.</p>
  </section>
</section>
<section>
  <h1>Contact</h1>
  <p>Developers you can trust, when you need them most.</p>
</section>

And the result is the same:

1. X-Team awesome site
  1. About
    1. Our Team
  2. Contact

The advantage here is that the sections are structured regardless of the rank of headings within the subsections. An example that clearly demonstrates this benefit is:

<h2>X-Team articles</h2>
<article>
  <h1>Name of one article</h1>
  <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit...</p>
</article>
<article>
  <h1>This is another article</h1>
  <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit...</p>
</article>

And the result is:

1. X-Team articles
 1. Name of one article
 2. This is another article

The following examples, while not good use cases, produce the same result:

<h1>X-Team articles</h1>
<article>
  <h3>Name of one article</h3>
</article>
<article>
  <h3>This is another article</h3>
</article>

<h2>X-Team articles</h2>
<article>
  <h2>Name of one article</h2>
</article>
<article>
  <h2>This is another article</h2>
</article>

<h6>X-Team articles</h6>
<article>
  <h1>Name of one article</h1>
</article>
<article>
  <h1>This is another article</h1>
</article>

Untitled sections

Sections that do not have any heading content element appear as “Untitled” in the outline, which marks the missing heading while preserving the section.
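
A minimal sketch of how this happens, using hypothetical anchors and link texts: the <nav> below is a sectioning content element, but it contains no heading.

```html
<h1>X-Team awesome site</h1>
<nav>
  <!-- No heading content element inside this section -->
  <ul>
    <li><a href="#about">About</a></li>
    <li><a href="#contact">Contact</a></li>
  </ul>
</nav>
<section id="about">
  <h2>About</h2>
  <p>...</p>
</section>
```

Under the HTML5 outline rules described above, this markup would be expected to produce:

1. X-Team awesome site
  1. Untitled Section
  2. About

Adding a heading inside the <nav>, for example <h2>Navigation</h2>, gives the section a name in the outline.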

Sectioning roots

In HTML5 some elements are considered sectioning roots. These elements can have their own outline, but the headings and sections within these elements do not contribute to the outline of their ancestors.

These elements are <blockquote>, <body>, <details>, <dialog>, <fieldset>, <figure> and <td>.
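
As an illustration (the heading texts are invented for this sketch), a heading placed inside a <blockquote> stays within the outline of the quote and does not appear in the outline of the page:

```html
<h1>X-Team awesome site</h1>
<section>
  <h2>What clients say</h2>
  <blockquote>
    <!-- This heading belongs to the blockquote's own outline only -->
    <h3>A heading inside the quoted text</h3>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit...</p>
  </blockquote>
</section>
```

The outline of the page would then be expected to contain only:

1. X-Team awesome site
  1. What clients say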

The best practice today

If you want to provide a meaningful document structure, use the <h1> to <h6> elements to express it, even in conjunction with sectioning content elements. This combines the strengths of both the HTML4 and HTML5 specifications, for example:

<h1>X-Team awesome site</h1>
<section>
  <h2>About</h2>
  <p>We're dedicated to building a future where every extraordinary developer in the world has access to incredible opportunities.</p>
  <section>
    <h3>Our Team</h3>
    <p>A company led by people born to solve challenges.</p>
  </section>
</section>
<section>
  <h2>Contact</h2>
  <p>Developers you can trust, when you need them most.</p>
</section>

And the resulting outline will be correct, today and tomorrow:

1. X-Team awesome site
  1. About
    1. Our Team
  2. Contact

Wrap-up

Textual content with a well-organized structure is easy for both humans and machines to read. If headings are used wisely together with sectioning content elements, the entire document outline of the page will make sense.

Here is how the outline of this article looks:

1. Structuring your content with HTML
  1. Semantic representation
  2. Sectioning content
    1. Sectioning content elements
    2. Heading content elements
  3. Document Outline
    1. Outline in HTML4
    2. Outline in HTML5
      1. Untitled sections
      2. Sectioning roots
    3. The best practice today
  4. Conclusion
    1. External references

External references