Parsing¶
Under the hood, Msort parses source code into Syntax Trees.
Msort can work with two types of Syntax Trees:
Concrete Syntax Trees (CST)
Abstract Syntax Trees (AST)
CSTs capture all syntactic detail, including whitespace, comments and formatting, so no information from the source is lost.
ASTs capture the hierarchical structure of operations in the code but lose syntactic detail such as comments and formatting. Because of this, using the AST parser can result in unexpected changes to the original code.
Simple Example¶
Consider the expression a + b * c.
A CST would represent this expression as:
Expression
├── Term
│   └── Factor (a)
├── +
└── Term
    ├── Factor (b)
    ├── *
    └── Factor (c)
Every character in the expression and the relationships between characters are represented.
An AST might look something like:
+
├── a
└── *
    ├── b
    └── c
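The same shape can be reproduced with Python's built-in ast module. The following is a minimal sketch for illustration only; the helper name `outline` is hypothetical and is not part of Msort.

```python
import ast


def outline(node):
    """Reduce a simple arithmetic AST to nested tuples of (operator, left, right)."""
    if isinstance(node, ast.BinOp):
        return (type(node.op).__name__, outline(node.left), outline(node.right))
    if isinstance(node, ast.Name):
        return node.id
    raise TypeError(f"unsupported node type: {type(node).__name__}")


# mode="eval" parses a single expression; .body is the expression's root node.
expression = ast.parse("a + b * c", mode="eval").body

if __name__ == "__main__":
    print(outline(expression))
```

The multiplication ends up nested under the addition, matching the tree above, while the spaces around the operators are not recorded anywhere in the result.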
Python Example¶
Let's consider the following function:
def add(x, y):
    # summing function
    return x + y
The AST created using the Python ast library would look like:
ast.Module(
    body=[
        ast.FunctionDef(
            decorator_list=[],
            returns=None,
            name="add",
            args=ast.arguments(
                args=[
                    ast.arg(
                        arg="x"
                    ),
                    ast.arg(
                        arg="y"
                    )
                ]
            ),
            body=[
                ast.Return(
                    value=ast.BinOp(
                        left=ast.Name(
                            id="x"
                        ),
                        right=ast.Name(
                            id="y"
                        ),
                        op=ast.Add()
                    )
                )
            ]
        )
    ]
)
This AST is enough to robustly capture the fact that the function takes two values and adds them. However, the comment is lost, and whitespace and line breaks might not be preserved.
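The loss of the comment can be observed directly with the standard library. The sketch below is illustrative only and does not show how Msort works internally; the helper name `round_trip_with_ast` is hypothetical, and `ast.unparse` and the `indent` argument of `ast.dump` assume Python 3.9 or newer.

```python
import ast

SOURCE = "def add(x, y):\n    # summing function\n    return x + y\n"


def round_trip_with_ast(source: str) -> str:
    """Parse source into an AST and regenerate code from the tree."""
    tree = ast.parse(source)
    return ast.unparse(tree)


if __name__ == "__main__":
    # Print the tree structure, then the code regenerated from it.
    print(ast.dump(ast.parse(SOURCE), indent=4))
    print(round_trip_with_ast(SOURCE))
```

The regenerated code still defines `add` and returns `x + y`, but the `# summing function` comment is absent because the AST never stored it.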
Here is the CST for the same simple function:
Module(
    body=[
        FunctionDef(
            name=Name(
                value='add',
                lpar=[],
                rpar=[],
            ),
            params=Parameters(
                params=[
                    Param(
                        name=Name(
                            value='x',
                            lpar=[],
                            rpar=[],
                        ),
                        annotation=None,
                        equal=MaybeSentinel.DEFAULT,
                        default=None,
                        comma=Comma(
                            whitespace_before=SimpleWhitespace(
                                value='',
                            ),
                            whitespace_after=SimpleWhitespace(
                                value=' ',
                            ),
                        ),
                        star='',
                        whitespace_after_star=SimpleWhitespace(
                            value='',
                        ),
                        whitespace_after_param=SimpleWhitespace(
                            value='',
                        ),
                    ),
                    Param(
                        name=Name(
                            value='y',
                            lpar=[],
                            rpar=[],
                        ),
                        annotation=None,
                        equal=MaybeSentinel.DEFAULT,
                        default=None,
                        comma=MaybeSentinel.DEFAULT,
                        star='',
                        whitespace_after_star=SimpleWhitespace(
                            value='',
                        ),
                        whitespace_after_param=SimpleWhitespace(
                            value='',
                        ),
                    ),
                ],
                star_arg=MaybeSentinel.DEFAULT,
                kwonly_params=[],
                star_kwarg=None,
                posonly_params=[],
                posonly_ind=MaybeSentinel.DEFAULT,
            ),
            body=IndentedBlock(
                body=[
                    SimpleStatementLine(
                        body=[
                            Return(
                                value=BinaryOperation(
                                    left=Name(
                                        value='x',
                                        lpar=[],
                                        rpar=[],
                                    ),
                                    operator=Add(
                                        whitespace_before=SimpleWhitespace(
                                            value=' ',
                                        ),
                                        whitespace_after=SimpleWhitespace(
                                            value=' ',
                                        ),
                                    ),
                                    right=Name(
                                        value='y',
                                        lpar=[],
                                        rpar=[],
                                    ),
                                    lpar=[],
                                    rpar=[],
                                ),
                                whitespace_after_return=SimpleWhitespace(
                                    value=' ',
                                ),
                                semicolon=MaybeSentinel.DEFAULT,
                            ),
                        ],
                        leading_lines=[
                            EmptyLine(
                                indent=True,
                                whitespace=SimpleWhitespace(
                                    value='',
                                ),
                                comment=Comment(
                                    value='# summing function',
                                ),
                                newline=Newline(
                                    value=None,
                                ),
                            ),
                        ],
                        trailing_whitespace=TrailingWhitespace(
                            whitespace=SimpleWhitespace(
                                value='',
                            ),
                            comment=None,
                            newline=Newline(
                                value=None,
                            ),
                        ),
                    ),
                ],
                header=TrailingWhitespace(
                    whitespace=SimpleWhitespace(
                        value='',
                    ),
                    comment=None,
                    newline=Newline(
                        value=None,
                    ),
                ),
                indent=None,
                footer=[],
            ),
            decorators=[],
            returns=None,
            asynchronous=None,
            leading_lines=[],
            lines_after_decorators=[],
            whitespace_after_def=SimpleWhitespace(
                value=' ',
            ),
            whitespace_after_name=SimpleWhitespace(
                value='',
            ),
            whitespace_before_params=SimpleWhitespace(
                value='',
            ),
            whitespace_before_colon=SimpleWhitespace(
                value='',
            ),
            type_parameters=None,
            whitespace_after_type_parameters=SimpleWhitespace(
                value='',
            ),
        ),
    ],
    header=[
        EmptyLine(
            indent=True,
            whitespace=SimpleWhitespace(
                value='',
            ),
            comment=None,
            newline=Newline(
                value=None,
            ),
        ),
    ],
    footer=[],
    encoding='utf-8',
    default_indent=' ',
    default_newline='\n',
    has_trailing_newline=True,
)
The CST is considerably longer and more complex but holds information about syntax, formatting and comments.
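To see what that extra information buys, the function can be round-tripped through libcst. This is a minimal sketch that assumes libcst is installed (`pip install libcst`); the helper name `round_trip_with_cst` is hypothetical and the snippet is not taken from Msort's own code.

```python
SOURCE = "def add(x, y):\n    # summing function\n    return x + y\n"


def round_trip_with_cst(source: str) -> str:
    """Parse source into a libcst tree and regenerate the code from it."""
    # libcst is a third-party package; it is imported here so that the
    # requirement is explicit at the point of use.
    import libcst as cst

    module = cst.parse_module(source)
    # Module.code renders the tree back to source text.
    return module.code


if __name__ == "__main__":
    regenerated = round_trip_with_cst(SOURCE)
    print(regenerated == SOURCE)
```

Because the tree stores comments and whitespace as nodes, the regenerated text keeps the `# summing function` comment and the original spacing.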
By default, Msort uses the libcst Python library to parse source code into a Python-friendly CST.
The parser can be changed to AST by using the --parser=ast option on the command line.
It is strongly recommended to use the default CST parser.