Sequencing
This is a rough and brief introduction to sequencing text with the high-level programming language Python.
We will have a look at:
variables
strings
data types
working with strings
lists
loops
If you want to go into more detail, we recommend having a look at the notebooks Expressions and Strings and Understanding lists and manipulating lines.
Programming = rule-based manipulation of symbols
Which symbols do we have (in a high-level programming language)? Numbers and characters (strings).
Programming means performing actions on and with these symbols.
on = symbols are operands
with = symbols are operators
That’s what’s called executable text: the code contains instructions, which are performed when we execute it. A lot of the instructions (operations) are performed on the code itself and define or alter the program flow.
In the following examples we have two operands (5 and 7) and perform operations with the symbols +, -, / and *, which are the operators.
5 + 7
12
5 - 7
-2
5 / 7
0.7142857142857143
5 * 7
35
# Multiple arithmetic expressions are processed in the same order as we know it from school.
5 + 7 - 5 / 2 * 7
-5.5
(5 + 7 - 5) / 2 * 7
24.5
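The order of evaluation becomes visible when the intermediate results are computed one at a time. The following is a minimal sketch; the variable names step1 to step3 are made up for this illustration.

```python
# Multiplication and division are evaluated first, from left to right.
step1 = 5 / 2          # 2.5
step2 = step1 * 7      # 17.5
# Addition and subtraction follow, also from left to right.
step3 = 5 + 7 - step2  # -5.5

# The step-by-step result equals the one-line expression.
print(step3 == 5 + 7 - 5 / 2 * 7)

# Parentheses are evaluated before everything else.
print((5 + 7 - 5) / 2 * 7)
```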
Variables
Instead of using these operands directly, we more often store them in variables. We use characters (strings) to give names to these variables:
a = 5
b = 7
# In Python we can assign multiple variables in one line like:
a, b = 5, 7
These names (here a and b) are then used as identifiers (addresses) through which we can access these variables:
a + b
12
Of course we can store the result in another variable:
c = a + b
c
12
We can overwrite variables:
a = b
a
7
Or we can use variables to perform actions on themselves:
a = a*2
# A shortcut would be:
# a *= 2
a
14
Strings
So far we have worked with numbers. Let’s continue with words. Words (or single characters) are called strings. We define strings with quotation marks. You can use ', " or '''. We put our string between a pair of matching quotation marks:
a = '5'
b = "7"
a + b
'57'
As you can see, operations on strings are different from operations on numbers.
We can imagine the string a as a Scrabble tile with a 5 on it and b as another tile with a 7 on it. The + means that we put the latter right after the former. They are kind of glued together. Another example:
'five' + 'seven'
'fiveseven'
This is called concatenation of two strings. We can create a sequence of text through concatenating several strings.
seq1 = 'Rose' # Variable names may contain digits, but they must start with a letter or an underscore
seq2 = ' ' + 'is' + " a" + ''' rose'''
seq1 + seq2
'Rose is a rose'
We can’t subtract or divide a string, but we can multiply it by an integer:
sequence = seq1 + 2 * seq2
sequence
'Rose is a rose is a rose'
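Concatenation and repetition can be combined freely. The following is a small sketch; the variable names start, part, three_times and nothing are chosen only for this illustration.

```python
start = 'Rose'
part = ' is a rose'

# The order of the factors does not matter: part * 3 equals 3 * part.
three_times = start + part * 3
print(three_times)

# Multiplying a string by 0 yields an empty string.
nothing = part * 0
print(start + nothing)
```

The result of a multiplication is again a string, so it can be concatenated like any other string.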
Data types
We have worked with numbers and strings so far. These are different types of data.
If we are unsure what kind of data is stored in our variable, we can ask for it with a function called type().
We will not go into detail on what functions are at the moment. For now it’s just important to know that you can recognize a function by its () (unless the “()” is written inside a string) and that a function processes data.
Many common functions require the data they should process as input. This means you have to pass the data as an argument to the function. This is done inside the ().
We have a variable called a and pass it to the function type(). This way we receive the data type of a.
type(a)
str
The output str means string.
In Python the data type of a variable depends on the data that is stored in it.
If we change the data, the data type changes as well if necessary. We will look at this with another function, called print(). This function prints (outputs) everything that you put inside the parentheses.
print(sequence)
Rose is a rose is a rose
print(a)
print('data type of a: ', type(a))
5
data type of a: <class 'str'>
Although the output (5) looks like a number, it is in fact a string. Using it to multiply another string therefore raises an error:
print(seq1 + a * seq2)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-19-df6efae4bbec> in <module>
----> 1 print(seq1 + a * seq2)
TypeError: can't multiply sequence by non-int of type 'str'
Next we will convert it into a number (into an integer) with the function int():
# Converting the variable a from string to int
a = int(a)
print(a)
print('data type of a: ', type(a))
print(seq1 + a * seq2)
5
data type of a: <class 'int'>
Rose is a rose is a rose is a rose is a rose is a rose
Another data type for numbers is float. A float is a number with a decimal point, for example the result of a division. A string cannot be multiplied by a float:
a, b = 5, 7
a /= b
print(a)
print(type(a))
print(seq1 + a * seq2)
0.7142857142857143
<class 'float'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-21-a7dea0aec895> in <module>
3 print(a)
4 print(type(a))
----> 5 print(seq1 + a * seq2)
TypeError: can't multiply sequence by non-int of type 'float'
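If a float should nevertheless control the number of repetitions, it can be converted first. The following is a minimal sketch; the variable names ratio, count and label are assumptions made for this illustration.

```python
seq1 = 'Rose'
seq2 = ' is a rose'

ratio = 5 / 2        # a division always returns a float: 2.5
count = int(ratio)   # int() cuts off the decimal part: 2
print(type(ratio), type(count))

# An integer can be used to multiply a string.
print(seq1 + count * seq2)

# str() converts a number into a string, so that it can be concatenated.
label = 'repetitions: ' + str(count)
print(label)
```

Note that int() does not round: it simply drops everything after the decimal point.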
Working with strings
import random
dir(random)
['BPF',
'LOG4',
'NV_MAGICCONST',
'RECIP_BPF',
'Random',
'SG_MAGICCONST',
'SystemRandom',
'TWOPI',
'_Sequence',
'_Set',
'__all__',
'__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__spec__',
'_accumulate',
'_acos',
'_bisect',
'_ceil',
'_cos',
'_e',
'_exp',
'_inst',
'_log',
'_os',
'_pi',
'_random',
'_repeat',
'_sha512',
'_sin',
'_sqrt',
'_test',
'_test_generator',
'_urandom',
'_warn',
'betavariate',
'choice',
'choices',
'expovariate',
'gammavariate',
'gauss',
'getrandbits',
'getstate',
'lognormvariate',
'normalvariate',
'paretovariate',
'randint',
'random',
'randrange',
'sample',
'seed',
'setstate',
'shuffle',
'triangular',
'uniform',
'vonmisesvariate',
'weibullvariate']
for i in range(10):
    print(random.randint(0,10))
6
8
5
9
5
5
7
2
6
7
sequence
'Rose is a rose is a rose'
An object of type string can perform operations on itself through built-in functions called methods. We’ll have a look at several of these methods.
sequence.replace('is', 'was')
'Rose was a rose was a rose'
sequence.lower()
'rose is a rose is a rose'
sequence.upper()
'ROSE IS A ROSE IS A ROSE'
sequence
'Rose is a rose is a rose'
As you can see, these functions do not change the value of the string. If you want to change it, you have to overwrite the variable:
sequence = sequence.replace('is', 'was')
print(sequence)
# overwrite it again
sequence = sequence.replace('was', 'is')
print(sequence)
Rose was a rose was a rose
Rose is a rose is a rose
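The following sketch shows the same behaviour with separate variables. The names shouted and changed are made up for this illustration.

```python
sequence = 'Rose is a rose is a rose'

# upper() returns a new string and leaves the original untouched.
shouted = sequence.upper()
print(shouted)
print(sequence)

# replace() is case-sensitive: the capitalized 'Rose' is not replaced.
# Calls can be chained; each one works on the result of the previous one.
changed = sequence.replace('rose', 'tulip').upper()
print(changed)
```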
Lists
We said that all the symbols of a string are kind of glued together. We can split them (think of tokens of scrabble), but then this object is not a string anymore. It is a list.
sequence_list = sequence.split(' ')
print(sequence_list)
['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose']
Now we can deal with the individual tokens. For example, we can print just the first 4 elements.
print(sequence_list[:4])
['Rose', 'is', 'a', 'rose']
Again, this does not change the values of our list. We still have all elements in it:
print(sequence_list)
['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose']
Accessing elements of a list
As you can see, we can access single elements of our list. We do this by writing the index of an element inside square brackets []. Caution: the first index (of “Rose” in this example) is 0. In Python, as in most programming languages, counting starts at 0. This means that the index of the last element is n-1, if n is the length of the list.
sequence_list[0]
'Rose'
sequence_list[1]
'is'
We can write expressions or functions inside the brackets.
sequence_list[4-2]
'a'
With negative numbers we can access the value counting from the end:
sequence_list[-4]
'rose'
If we try to access a value that is outside of the boundaries of our list, we will receive an error.
sequence_list[24]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-36-d9b151bb6527> in <module>
----> 1 sequence_list[24]
IndexError: list index out of range
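The relation between the length of a list and its last index can be checked directly. The following is a small sketch; the variable name last_index is an assumption made for this illustration.

```python
sequence_list = ['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose']

# The list has 7 elements, so the last valid index is 7 - 1 = 6.
last_index = len(sequence_list) - 1
print(last_index)

# Both expressions access the last element.
print(sequence_list[last_index])
print(sequence_list[-1])

# Unlike a single index, a slice does not raise an IndexError
# when it reaches beyond the end of the list.
print(sequence_list[4:24])
```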
But there are functions which alter the objects on which they are performed. For example:
import random # import the random library
random.shuffle(sequence_list)
print(sequence_list)
['Rose', 'is', 'rose', 'a', 'is', 'rose', 'a']
# If we want the original order back, we have to overwrite our list with the original values,
# which are still stored in the string.
sequence_list = sequence.split(' ')
print(sequence_list)
['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose']
We can use different symbols as separators to split our sequence. We will split it at ' is a ', so that only the roses remain. First we will convert the whole string to lower case, so that the first Rose looks like the others.
We can chain multiple operations (like .lower() and .split()) together:
sequence_list = sequence.lower().split(' is a ')
print(sequence_list)
['rose', 'rose', 'rose']
If we need our list to be a string again, we can transform it. One option to do that is the built-in string method .join().
sequence_list = sequence.split(' ')
print(sequence_list)
sequence_str = (' ').join(sequence_list)
print(sequence_str)
['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose']
Rose is a rose is a rose
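The string in front of .join() is the separator that is placed between the elements, and it does not have to be a space. The following is a minimal sketch; the variable names words, dashed and restored are made up for this illustration.

```python
sequence = 'Rose is a rose is a rose'
words = sequence.split(' ')

# Any string can serve as the separator.
dashed = '-'.join(words)
print(dashed)

# A newline as separator prints one word per line.
print('\n'.join(words[:3]))

# Splitting and joining with the same separator restores the original.
restored = ' '.join(words)
print(restored == sequence)
```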
Shuffle and sort
For the next examples we will use a longer text and perform some more operations on lists.
text = '''
He Hazardous of we strong
follow bacteria walks
by town guy place
'''
print(text)
He Hazardous of we strong
follow bacteria walks
by town guy place
Shuffle:
import random
# Split string into list.
text_list = text.split(' ')
# Shuffle list.
random.shuffle(text_list)
# Join list to string.
text_str = ' '.join(text_list)
print(text_str)
strong
follow town Hazardous we walks
by place
bacteria
He of guy
Sort:
text_list = text.split(' ')
# Sort list.
text_list.sort()
# Join list to string.
text_str = ' '.join(text_list)
print(text_str)
He Hazardous bacteria guy of place
strong
follow town walks
by we
Maybe you expected a different order. We can have a look at our sorted list to see why we got this order: the newline characters are part of the elements, and uppercase letters are sorted before lowercase letters.
text_list
['\nHe',
'Hazardous',
'bacteria',
'guy',
'of',
'place\n',
'strong\nfollow',
'town',
'walks\nby',
'we']
We can clean up the text before splitting it in order to get a different order:
# As you know we can chain multiple expressions.
text_list = text.replace('\n', ' ').split(' ')
text_list.sort()
print(' '.join(text_list))
Hazardous He bacteria by follow guy of place strong town walks we
text_list = text.replace('\n', ' ').lower().split(' ')
text_list.sort()
print(' '.join(text_list))
bacteria by follow guy hazardous he of place strong town walks we
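A similar cleanup is possible with two built-ins that this notebook has not used so far. Called without an argument, split() splits at any whitespace (spaces and newlines) and produces no empty strings. The function sorted() returns a new sorted list, and key=str.lower makes it compare the lowercase version of each word while keeping the original spelling. The following is a sketch; the variable names words and words_sorted are made up for this illustration.

```python
text = '''
He Hazardous of we strong
follow bacteria walks
by town guy place
'''

# Split at any whitespace; newlines no longer stick to the words.
words = text.split()
print(len(words))

# Sort without regard to upper and lower case; the list words stays unchanged.
words_sorted = sorted(words, key=str.lower)
print(' '.join(words_sorted))
```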
Length
Next we will sort our text according to the length of its words. We can retrieve the length of a value with the function len().
print('Length of text:', len(text), 'characters.')
print('Length of text_list:', len(text_list), 'elements.')
Length of text: 67 characters.
Length of text_list: 14 elements.
We can pass this function as the key argument to the sort() method.
text_list = text.split(' ')
text_list.sort(key = len)
print(' '.join(text_list))
of we
He guy town place
bacteria walks
by Hazardous strong
follow
This looks strange again. The explanation is below: the newline characters count towards the length of an element.
text_list
['of',
'we',
'\nHe',
'guy',
'town',
'place\n',
'bacteria',
'walks\nby',
'Hazardous',
'strong\nfollow']
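Sorting by length is easier to read when the newlines are not part of the elements. The following sketch assumes that the text is split with split() without an argument, which splits at any whitespace; the variable names words and longest_first are made up for this illustration.

```python
text = '''
He Hazardous of we strong
follow bacteria walks
by town guy place
'''

words = text.split()

# Sort from the shortest to the longest word.
# Words of equal length keep their original order.
words.sort(key=len)
print(' '.join(words))

# reverse=True sorts from the longest to the shortest word.
longest_first = sorted(words, key=len, reverse=True)
print(longest_first[0], len(longest_first[0]))
```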
Another example. Now we tokenize on the level of single characters.
text_list = [token for token in text]
text_list.sort()
print(' '.join(text_list))
H H a a a a a a b b c c d e e e e f f g g i k l l l l n n o o o o o o p r r r s s s t t t u u w w w w y y z
# Let's have a look at the inside of our list:
text_list
['\n',
'\n',
'\n',
'\n',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
'H',
'H',
'a',
'a',
'a',
'a',
'a',
'a',
'b',
'b',
'c',
'c',
'd',
'e',
'e',
'e',
'e',
'f',
'f',
'g',
'g',
'i',
'k',
'l',
'l',
'l',
'l',
'n',
'n',
'o',
'o',
'o',
'o',
'o',
'o',
'p',
'r',
'r',
'r',
's',
's',
's',
't',
't',
't',
'u',
'u',
'w',
'w',
'w',
'w',
'y',
'y',
'z']
We can create a vocabulary by reducing our list to one occurrence per element. The easiest way is to use the function set().
text_list = set(text_list)
print(' '.join(text_list))
o t d b w s p y c z g
u l r a H e f i n k
A sorted version:
# Convert set (another data type) to list:
text_list = list(text_list)
# Sort list
text_list.sort()
print(' '.join(text_list))
H a b c d e f g i k l n o p r s t u w y z
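The same technique works on the level of words. The following is a minimal sketch; the variable names tokens and vocabulary are made up for this illustration, and sorted() is used here to convert the set into a sorted list in one step.

```python
sequence = 'Rose is a rose is a rose'

# Lowercase first, so that 'Rose' and 'rose' count as the same word.
tokens = sequence.lower().split(' ')

# set() removes the duplicates, sorted() returns a sorted list.
vocabulary = sorted(set(tokens))
print(vocabulary)
print(len(tokens), 'tokens,', len(vocabulary), 'different words')
```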
Loops
We can use the length as a parameter to control a loop.
For-loop
The for-loop below counts (iterates) from 0 up to, but not including, the length of text_list (remember n-1). For each iteration it executes the indented code (in this example a print command).
for number in range(len(text_list)):
    print(number, ' - ', text_list[number])
0 -
1 -
2 - H
3 - a
4 - b
5 - c
6 - d
7 - e
8 - f
9 - g
10 - i
11 - k
12 - l
13 - n
14 - o
15 - p
16 - r
17 - s
18 - t
19 - u
20 - w
21 - y
22 - z
There are more elegant ways to perform the above. If you don’t need the index (called number above, you can choose any valid variable name), you can just iterate through the list:
for token in text_list:
    print(token, ' ', end='')
    # If you want to print each character on a separate line,
    # remove the "end=''" argument.
H a b c d e f g i k l n o p r s t u w y z
print(7, end=' ')
print(8)
7 8
If you want to loop through the elements and need an index, you can do it with the function enumerate():
for index, token in enumerate(text_list):
    print(index, '-', token)
0 -
1 -
2 - H
3 - a
4 - b
5 - c
6 - d
7 - e
8 - f
9 - g
10 - i
11 - k
12 - l
13 - n
14 - o
15 - p
16 - r
17 - s
18 - t
19 - u
20 - w
21 - y
22 - z
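enumerate() starts counting at 0 by default, but it accepts a start argument if a different first number is wanted. The following is a small sketch; the variable names position and numbered are made up for this illustration.

```python
sequence_list = ['Rose', 'is', 'a', 'rose']

# Collect the results in a list instead of only printing them.
numbered = []
for position, token in enumerate(sequence_list, start=1):
    # position is an integer, so it is converted with str() before concatenating.
    numbered.append(str(position) + '-' + token)

print(numbered)
```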
While-loop
Another control structure next to the for-loop is the while-loop. The while-loop evaluates a conditional expression that is written after the keyword while (the parentheses around it are optional). The loop continues as long as the expression evaluates to True and stops when it evaluates to False. A common practice is to use a while-loop together with a counter. It looks like this:
# Initialize a variable with value 0
index = 0
# Check in each iteration of the loop whether the value of index
# is smaller than the length of our list
while (index < len(text_list)):
    # Perform some action
    print(text_list[index])
    # Increase the index, otherwise the loop would run infinitely.
    index += 1
H
a
b
c
d
e
f
g
i
k
l
n
o
p
r
s
t
u
w
y
z
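The condition of a while-loop does not have to depend on a counter. The following sketch extends a sequence until it reaches a minimum length; the limit of 30 characters and the variable name repetitions are assumptions made for this illustration.

```python
sequence = 'Rose'
repetitions = 0

# Repeat as long as the sequence is shorter than 30 characters.
while len(sequence) < 30:
    sequence += ' is a rose'
    repetitions += 1

print(sequence)
print(repetitions, 'repetitions,', len(sequence), 'characters')
```

The loop ends because every iteration makes the sequence longer, so the condition eventually becomes False.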
With a while-loop it is very easy to write programs that never stop. Below are 3 examples. You have to remove the # to run them. Or just interpret them for yourself without running them.
In the 1st example we compare two strings. The keyword is is a special comparison, because it evaluates whether the two objects are the same object. This means that they occupy the same memory location on your machine. We’ll come back to that later.
The 2nd and 3rd example use the Boolean values True and False. Boolean is another data type.
# while ('a rose' is 'a rose'):
#     print('a rose is ', end='')
# while (True is True):
#     print('true is ', end='')
# while (False is False):
#     print('false is', end='')
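As a small preview of the difference between is and the more common comparison ==, here is a sketch that uses lists instead of strings. The variable names first, second and same are made up for this illustration.

```python
first = ['a', 'rose']
second = ['a', 'rose']
same = first  # a second name for the very same list object

# == compares the values of the two objects.
print(first == second)

# is checks whether both names refer to one and the same object.
print(first is second)
print(first is same)
```

Two lists that were created separately are equal in value, but they are still two different objects.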