Some Sphinx internals

Directive lifecycle in Sphinx

Sphinx directives are stored in Sphinx domains, as opposed to the docutils registries (which docutils uses but Sphinx domains do not).

During parsing, the state machine recognizes directives as explicit markup, as in the loop in the snippet below. The match in turn leads to the invocation of docutils.parsers.rst.states.Body.explicit_markup(self, match, context, next_state):

# Snippet from docutils.statemachine.py
class StateMachine:

         ...

   def check_line(self, context, state, transitions=None):

         ...

     for name in transitions:

         # For directives, match will happen when 'name' is 'explicit markup'
         #
         # In that case,
         #
         #      1) 'pattern' = '\\.\\.( +|$)', matching directives like '.. eaa:dataset:: normalized_jira_data'
         #
         #      2) 'method' is docutils.parsers.rst.states.Body.explicit_markup(self, match, context, next_state)
         #
         pattern, method, next_state = state.transitions[name]
         match = pattern.match(self.line)
         if match:
             if self.debug:
                 print((
                       '\nStateMachine.check_line: Matched transition '
                       '"%s" in state "%s".'
                       % (name, state.__class__.__name__)), file=self._stderr)
             return method(match, context, next_state)

Using regular expressions, the line of markup is recognized as a directive, and the directive name is then looked up in the registry that maps directive names to directive classes. Note that, as explained in the section Sphinx’s “hack” to dispatch directives & roles, the storage for this lookup differs between Sphinx and docutils.
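
The recognition and lookup steps can be imitated with the standard library alone. The sketch below is a simplified model, not docutils code: EXPLICIT_MARKUP reuses the pattern quoted in the snippet above, while DIRECTIVE_NAME, ToyDataSetDirective, TOY_REGISTRY, and find_directive are hypothetical names introduced only for illustration.

```python
import re

# Pattern quoted in the snippet above for the 'explicit_markup' transition.
EXPLICIT_MARKUP = re.compile(r'\.\.( +|$)')

# Hypothetical, simplified pattern for the "name::" part that follows.
DIRECTIVE_NAME = re.compile(r'([\w:.+-]+)::( +|$)')


class ToyDataSetDirective:
    """Placeholder for a real directive class."""


# Hypothetical registry mapping directive names to directive classes.
TOY_REGISTRY = {'eaa:dataset': ToyDataSetDirective}


def find_directive(line):
    """Return (name, directive_class, argument_text), or None if not a directive."""
    start = EXPLICIT_MARKUP.match(line)
    if not start:
        return None  # the line is not explicit markup at all
    rest = line[start.end():]
    named = DIRECTIVE_NAME.match(rest)
    if not named:
        return None  # explicit markup, but not a directive (e.g. a comment)
    name = named.group(1)
    return name, TOY_REGISTRY.get(name), rest[named.end():].strip()


found = find_directive('.. eaa:dataset:: normalized_jira_data')
```

Here found holds the directive name, the class registered under that name, and the remaining argument text, which is roughly the information the real parser needs before it can instantiate the directive.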

Once the directive is found, it is instantiated and called, as in the snippet below:

# Snippet of code from docutils.parsers.rst.states.py
class Body(RSTState):

            ...

   # The 'directive' parameter is the concrete class for the directive in question. For example,
   # the class DataSetNode in $EA_ANALYST_HOME/documentation/sphinx_extensions/eaa_domain.py.
   def run_directive(self, directive, match, type_name, option_presets):

            ...

      # 'type_name' is the name of the directive (e.g., a string like 'eaa:dataset').
      # 'block_text' is a string with the full directive to parse, including the options
      directive_instance = directive(
         type_name, arguments, options, content, lineno,
         content_offset, block_text, self, self.state_machine)
      try:
         result = directive_instance.run()
      except docutils.parsers.rst.DirectiveError as error:

               ...

When the directive runs, it receives the arguments, options, and content shown in the constructor call above, together with the block_text string, which is the full directive including the options appearing on other lines.

For example, if the RST file contains

.. eaa:dataset:: normalized_jira_data
   :tags: jira derived common
   :type: DataFrame

then the block_text will be this string (note that indentation is preserved):

".. eaa:dataset:: normalized_jira_data\n   :tags: jira derived common\n   :type: DataFrame\n"
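
To make the structure of that string concrete, here is a minimal standard-library sketch that splits it into a directive name, an argument, and an options dictionary. It is only an illustration under simplifying assumptions (one argument, one option per line, no content block); docutils has its own, far more complete parsing, and split_block_text is a hypothetical helper, not a docutils function.

```python
def split_block_text(block_text):
    """Split a simple directive block into (name, argument, options)."""
    lines = block_text.splitlines()
    # First line has the form ".. name:: argument"; drop the leading "..".
    head = lines[0][2:].strip()
    name, _, argument = head.partition('::')
    options = {}
    for line in lines[1:]:
        stripped = line.strip()
        # Option lines have the form ":key: value".
        if stripped.startswith(':'):
            key, _, value = stripped[1:].partition(':')
            options[key] = value.strip()
    return name, argument.strip(), options


block_text = (".. eaa:dataset:: normalized_jira_data\n"
              "   :tags: jira derived common\n"
              "   :type: DataFrame\n")
name, argument, options = split_block_text(block_text)
```

For the example above, name is 'eaa:dataset', argument is 'normalized_jira_data', and options maps 'tags' and 'type' to their values.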

While it runs, the directive “can do anything”. For example:

  • Add nodes: some directives generate nodes and append them to the document tree. For example, directives derived from sphinx.directives.ObjectDescription produce a sub-tree of nodes using the handle_signature method, aiming to represent an object such as a method, function, or class.

  • Generate more RST: other directives act recursively. They create another block of RST text, which may include directives, and recursively call the parser to process it. For example, the .. automodule:: directive generates new RST text containing directives like .. py:function::.

Role lifecycle in Docutils and Sphinx

The registration of roles in docutils happens in docutils.parsers.rst.roles.py. For example:

def raw_role(role, rawtext, text, lineno, inliner, options={}, content=[]):

            ...

   return [node], []

raw_role.options = {'format': directives.unchanged}

register_canonical_role('raw', raw_role)

Here raw_role is the role processing function, i.e., the function called when the role is encountered while parsing an RST file.

In Sphinx the roles are not registered with docutils. Instead, they are stored in domains, but they are still mapped to role processing functions. The only difference is that the dictionary with the registrations is held in the Sphinx domain object instead of in a docutils.parsers.rst.roles dictionary:

class EA_Analyst_Domain(Domain):

               ...

   roles = {
      'reref': XRefRole(),
      # ... other roles mapped to role processing functions
   }

The role processing function contributes nodes to the docutils document tree.

Those nodes then lead to HTML output written by classes such as those in sphinx.builders.html.py.

Directives recursive parsing

Directives have a run(self) method, which the parser invokes when it encounters a directive block.

This method can do anything. Some directives interpret their block by generating another block of text lines, which might include other directives, and then recursively invoke the parser so that another directive’s run(self) method is called.

In effect, directives allow for multiple rounds/levels of preprocessing of an RST file.
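
The idea of multiple rounds of expansion can be shown with a toy preprocessor. This is a standard-library sketch, not how docutils or Sphinx implement nested parsing; expand, the two toy directive functions, and TOY_DIRECTIVES are hypothetical names, and the directive syntax is reduced to single lines.

```python
def expand(lines, directives, depth=0, max_depth=10):
    """Expand toy directives recursively until only plain lines remain."""
    if depth > max_depth:
        raise RecursionError('directive expansion is nested too deeply')
    output = []
    for line in lines:
        if line.startswith('.. ') and '::' in line:
            name, _, argument = line[3:].partition('::')
            generated = directives[name](argument.strip())
            # The generated lines may contain directives too, so parse them again.
            output.extend(expand(generated, directives, depth + 1, max_depth))
        else:
            output.append(line)
    return output


def toy_automodule(module):
    """Emit more directive lines, loosely in the spirit of .. automodule::."""
    return ['Module ' + module,
            '.. function:: ' + module + '.load',
            '.. function:: ' + module + '.save']


def toy_function(name):
    """Emit plain text only, ending the recursion."""
    return ['def ' + name + '()']


TOY_DIRECTIVES = {'automodule': toy_automodule, 'function': toy_function}

result = expand(['Title', '.. automodule:: jira'], TOY_DIRECTIVES)
```

The first round turns the automodule line into two function lines, and the second round turns those into plain text, which is the two-level preprocessing described above.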

Sphinx’s “hack” to dispatch directives & roles

Domain is a Sphinx notion that the underlying Docutils libraries don’t know about, and for that reason Sphinx needs to alter how Docutils dispatches to roles and directives.

Sphinx accomplishes this with a “hack” built on Python’s context manager construct: a context manager temporarily rebinds the names of the Docutils dispatching functions, so that they point to a Sphinx-provided implementation instead of the Docutils one.

Specifically, calls to docutils.parsers.rst.roles.role and docutils.parsers.rst.directives.directive are re-routed by the mechanism in this Sphinx snippet:

# Snippet from sphinx.util.docutils.py
class sphinx_domains:

         ...

   def __enter__(self) -> None:
      self.enable()

   def __exit__(self, exc_type: "Type[Exception]", exc_value: Exception, traceback: Any) -> None:  # NOQA
      self.disable()

   def enable(self) -> None:
      # Remember original values, to restore later
      self.directive_func = directives.directive
      self.role_func = roles.role

      # This is the hack: rebind these Docutils function names to the Sphinx implementations
      directives.directive = self.lookup_directive
      roles.role = self.lookup_role

   def disable(self) -> None:
      directives.directive = self.directive_func
      roles.role = self.role_func
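
The same save, rebind, and restore pattern can be tried in isolation. The sketch below uses only the standard library; toy_roles stands in for a module such as docutils.parsers.rst.roles, and patched_lookup is a hypothetical helper written with contextlib rather than the __enter__/__exit__ pair used by Sphinx.

```python
import types
from contextlib import contextmanager

# Stand-in for a library module that exposes a dispatch function.
toy_roles = types.SimpleNamespace(role=lambda name: 'docutils:' + name)


@contextmanager
def patched_lookup(module, replacement):
    """Temporarily rebind module.role, restoring the original on exit."""
    original = module.role        # remember the original value, to restore later
    module.role = replacement     # callers of module.role(...) now reach the replacement
    try:
        yield original
    finally:
        module.role = original    # restored even if the body raises


before = toy_roles.role('font')
with patched_lookup(toy_roles, lambda name: 'sphinx:' + name):
    during = toy_roles.role('font')
after = toy_roles.role('font')
```

One design consequence is worth keeping in mind: the rebinding only affects callers that look the function up through the module at call time (module.role(...)). Code that imported the function object directly before the patch keeps calling the original.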

This is then used as a context manager when parsing, as in this snippet:

# Snippet from sphinx.builders.__init__.py
class Builder:

        ...

  def read_doc(self, docname: str) -> None:

        ...

     # This context manager enables the Sphinx hack on entry
     with sphinx_domains(self.env), rst.default_role(docname, self.config.default_role):
           doctree = read_doc(self.app, self.env, self.env.doc2path(docname))

        ...

The difference in behavior lies in the lookup for directives and roles. For example, to find a role with the Docutils implementation, Docutils uses its registries:

# Snippet from docutils.parsers.rst.roles
def role(role_name, language_module, lineno, reporter):

         ...
      # The role processing function is taken from static dictionaries ("registries") in the Docutils module
      role_fn = _role_registry[canonicalname]

         ...
      return role_fn, messages

         ...

By contrast, when Sphinx’s sphinx_domains context manager is enabled, any call to docutils.parsers.rst.roles.role(...) gets dispatched instead to sphinx_domains.lookup_role:

# Snippet from sphinx.util.docutils.py
class sphinx_domains:

   # First look in Sphinx's storage for role processing functions (self.lookup_domain_element)
   # and if that doesn't work then look in Docutils' (self.role_func)
   def lookup_role(self, role_name: str, language_module: ModuleType, lineno: int, reporter: Reporter) -> Tuple[RoleFunction, List[system_message]]:  # NOQA
      try:
            return self.lookup_domain_element('role', role_name)
      except ElementLookupError:
            return self.role_func(role_name, language_module, lineno, reporter)

   # Example: for role ':eaa:font:' of domain class EA_Analyst_Domain, parameters are as follows:
   #      type   = 'role'
   #      name   = 'eaa:font'
   def lookup_domain_element(self, type: str, name: str) -> Any:

            ...

      if ':' in name:
            domain_name, name = name.split(':', 1)
            if domain_name in self.env.domains:
               domain = self.env.get_domain(domain_name)

               # In the example, this calls EA_Analyst_Domain.role('font'), which does the dictionary
               # lookup EA_Analyst_Domain.roles['font'] and returns the value: a role processing function.
               element = getattr(domain, type)(name)
               if element is not None:
                  return element, []

            ...

Either way, a role processing function is returned, which can then proceed to parse the RST text containing the role.
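
The "domain first, Docutils second" lookup can be condensed into a standalone model. The classes and functions below are hypothetical toys that mirror the shape of the snippet above; they do not import Sphinx, and strings stand in for role processing functions.

```python
class ToyLookupError(Exception):
    """Raised when no domain provides the requested element."""


class ToyDomain:
    """Minimal domain: a dictionary of roles plus a role(name) accessor."""

    def __init__(self, roles):
        self.roles = roles

    def role(self, name):
        return self.roles.get(name)


def lookup_domain_element(domains, kind, name):
    """Resolve 'domain:name' through the matching domain object."""
    if ':' in name:
        domain_name, short_name = name.split(':', 1)
        if domain_name in domains:
            # getattr(domain, 'role') fetches the accessor method by name.
            element = getattr(domains[domain_name], kind)(short_name)
            if element is not None:
                return element, []
    raise ToyLookupError(name)


def lookup_role(domains, fallback_registry, role_name):
    """Try the domains first, then fall back to the base registry."""
    try:
        return lookup_domain_element(domains, 'role', role_name)
    except ToyLookupError:
        return fallback_registry[role_name], []


domains = {'eaa': ToyDomain({'font': 'font_role_fn'})}
fallback_registry = {'raw': 'raw_role_fn'}

from_domain = lookup_role(domains, fallback_registry, 'eaa:font')
from_fallback = lookup_role(domains, fallback_registry, 'raw')
```

The role ':eaa:font:' is resolved by the domain, while 'raw', which no domain claims, falls through to the base registry.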

Adding listeners to Sphinx events

Sphinx’s engine emits events, and listeners can be registered with calls such as the one below, illustrated for the 'source-read' event, which is emitted after an RST document is loaded but before it is parsed. Here app is the Sphinx application object.

# This custom callback will modify the RST source code by interpreting EAA custom macros

app.connect('source-read', eaa_parsers.get_lib_macro())

Adding custom roles

Consider an example: a custom role :my_font: that would be an HTML pass-through.

This can be done via manual definition or programmatically.

First the manual approach, inspired by the raw-html role:

.. role:: raw-html(raw)
   :format: html

However, sometimes we prefer to create it programmatically so that the technical writer can just take it “off the shelf”.

To do that, we looked at the code that gets called when the directive .. role:: raw-html(raw) is processed, in particular the run(self) method of class Role in docutils.parsers.rst.directives.misc.py. From the last few lines of that run(self) method we took the call to the CustomRole constructor and the subsequent registration of the role.

This results in the following programmatic way to create such a role:

from docutils.parsers.rst import roles

new_role_name = 'my_font'
base_role     = roles.raw_role
options       = {'class': ['my_font'], 'format': 'html'}
content       = '.. role:: raw-html(raw)\n   :format: html\n'

role = roles.CustomRole(new_role_name, base_role, options, content)
roles.register_local_role(new_role_name, role)

An advantage of the programmatic approach is that one can replace roles.raw_role with a different role processing function. This is useful, for example, to wrap roles.raw_role so that the technical writer passes less text (perhaps not all the HTML details) and a custom role processing function adds the missing HTML boilerplate before calling roles.raw_role.

In the programmatic registration of a role, one question is what to pass for the parameters. In this example, that was determined by running the debugger and stopping at those lines while Sphinx processed a pilot project (with a dedicated conf.py, a custom_role_pilot.rst, and a pilot_domain.py to pilot an implementation of this function).

The way the debugger helps to determine the parameters for the CustomRole constructor is that custom_role_pilot.rst contains these lines:

.. role:: raw-html(raw)
   :format: html

When Sphinx processes this RST file, the .. role:: directive causes a call to Role.run in misc.py, where the debugger can stop and the input values can be read.

When registering a role programmatically, another way to do it is to add the roles to a custom domain. In that case Sphinx maintains the information about that mapping in the domain’s dictionary, so there is no need to register it with the docutils.parsers.rst.roles.register_local_role call above.

However, the role processing function must still be provided, with a signature as follows:

def font_processor_function(role, rawtext, text, lineno, inliner, options={}, content=[]):

Delegates to roles.raw_role to display text using HTML font properties.

However, for usability the user expresses those HTML font properties without the full HTML syntax of a <font style="..."> prefix and a </font> suffix, so these are added by modifying the rawtext and text before delegating to roles.raw_role.

Example 1: Suppose that a line in an RST file has a command like

:eaa:font:`<color:green>This is green text`

Then the parameters would be as follows:

Parameters
  • role – Name of the role, such as eaa:font

  • rawtext – Full role command, such as

':eaa:font:`<color:green>This is green text`'
  • text – Body of the role command, such as <color:green>This is green text

  • lineno – integer, corresponding to the line number containing the text being parsed.

  • inliner – a parser, usually of type docutils.parsers.rst.states.Inliner. Its parse method processes an entire line in the RST input, and when it finds a backquote it calls the interpreted_or_phrase_ref method, which determines that the backquote corresponds to a role and calls the interpreted method to retrieve the role processing function (such as font_role) from a registry. For domain roles, that registry is the roles dictionary of the domain instance. The parser is passed as the inliner parameter to the role processing function.

  • options – ignored. It is required for polymorphic reasons, but for this font_role function to work the call to roles.raw_role will hardcode options to be a dictionary with two entries:

{'class': [<name of the role, like ':eaa:font:'>], 'format': 'html'}

  • content – required for polymorphic reasons, but for this font_role function to work it should be an empty list.

Returns

A pair of lists [node], [], where node is an instance of docutils.nodes.raw.
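
The text rewriting that happens before the delegation can be sketched on its own. The code below covers only that step for the text parameter, under the assumption that the shorthand is always a single leading <property:value> prefix as in Example 1; the names _SHORTHAND and expand_font_shorthand are hypothetical, and the actual call to roles.raw_role is deliberately left out.

```python
import re

# Assumed shorthand: one leading "<property:value>" prefix, then the body text.
_SHORTHAND = re.compile(r'<([^<>]+)>(.*)', re.DOTALL)


def expand_font_shorthand(text):
    """Turn '<color:green>Hello' into full HTML font markup."""
    match = _SHORTHAND.fullmatch(text)
    if match is None:
        return text  # no shorthand prefix: pass the text through unchanged
    style, body = match.groups()
    return '<font style="%s">%s</font>' % (style, body)


html = expand_font_shorthand('<color:green>This is green text')
```

A role processing function like the one documented above would apply this kind of expansion to text (and rebuild rawtext in the same way) and then pass the results to roles.raw_role along with the hardcoded options.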