Installing a Xeon Phi coprocessor on Debian Squeeze

Written by erik - 29 july 2013 18:18

Today I installed a Xeon Phi coprocessor in a Debian box. The software is only supported for Red Hat systems so I was a little bit afraid. Fortunately it was easy. I took most of my information from the readme files of mpssd and from that guy. Thanks to him because he did most of the job.

Ok, let's get started. Hardware wise, you need a 16x PCI-express 3.0 slot and you need the BIOS to be configured to be compatible with devices with more than 4GB of ressource. There was a NVIDIA K20 with 6GB of memory plugged on the machine I used; but still the BIOS complained. A simple BIOS configuration allowed the machine boot. (Check the BIOS configuration under PCI-express.)

From here I did everything as root. Get the MPSS archive from Intel's website. (Mine was called mpss_gold_update_3-2.1.6720-15-rhel-6.3.tar). Untar it with tar xvf mpss_gold_update_*.tar. You'll get a bunch of RPM files. Install alien if you do not already have it with aptitude install alien. Convert all the RPM files except the -kmod- one into DEB files with for file in `ls *.rpm | grep -v kmod` ; do alien --scripts $file; done. Install them all with for file in `ls *.deb | grep -v kmod`;do dpkg --install $file; done.

Up to now, we installed all the files you need beside the kernel module. The kernel module that is provided is only for Red Hat system (with a kernel 2.6.32). My machine had a kernel 2.6.39, so I needed to do a little bit more work. Fortunately, the source for the kernel are available in the src/ directory. Convert that RPM file into a tgz with alien: alien -t intel-mic-kmod-*.src.rpm. When you uncompress it with tar zxvf intel-mic-kmod-*tgz you get a .spec file you do not care about and a tar.bz2 file that contain the kernel module source code. Uncompress it with tar jxvf intel-mic-kmod-*.tar.bz2. Compile it with make and install it with make install. You should now have a kernel module in /lib/modules/`uname -a`/extra/mic.ko. Somehow, the installation script does not run depmod to declare the module so you need to run it manually with depmod.

To use a Xeon Phi you also need a service running on the host. The files are already installed, but there is no startup script for the service. Go to /etc/init.d. We will use the skeleton script as a base. cp skeleton mpssd. You need to edit that file and perform the following modifications. Change the 3rd line to Provides: mpssd. Change line 22 to DESC="MIC service deamon". Change line 23 to NAME=mpssd. Comment line 25, you don't need arguments. At the beginning of function do_start, add the following line modprobe mic; [ -d "/var/lock/subsys" ] || mkdir /var/lock/subsys. The first part makes sure the mic kernel module is loaded. That should not be necessary because a udev rule was added earlier, but somehow I needed it. The second part ensures that the directory used by mpssd to store its lock file is properly created. Save and quit. Add the proper permission to the startup script with chmod +x mpssd. And at that point the service can be started with service mpssd start. Starting and stopping the service takes time (about a minute).

Now you need to be able to use the card. Typically one logs on the card using ssh. The card automatically takes the private IP address 172.31.1.1 but it is not routable automatically from linux. So you need to add a network interface in /etc/network/interfaces:

iface mic0 inet static
  address 172.31.1.254
  netmask 255.255.255.0

Load the interface with ifup mic0. To log on the card, you will need to ssh to the card using key-based autentification. If you have local users to the machine (I do not think that NIS based ones are supported with that technique), you should make sure that you have an ssh key on the ones that want to use the MIC card. If not, just get them to run ssh-keygen. After that, you need to load the ssh keys on the MIC card but mpssd need to be off to do that. Just running service stop mpssd; micctrl --resetconfig; service start mpssd should do the trick. I am not sure at which point, but an entry to /etc/hosts was made. So ssh mic0 should just work; otherwise ssh 172.31.1.1.

From there you are pretty much all set as you have a working installation of your Xeon Phi. You will certainly need ICC to do anything useful. Some people mount the host filesystem on the device, which can be pretty useful. I won't cover these because they are not really Debian specific and other people will certainly explain these better than me. NFS mounting is explained in MPSS readme file but I haven't tested it.

Happy hacking

EDIT on 2014/01/28: it seems that to use the offload mode, one need to have proper permission on /dev/mic/*. You can edit the udev rule in /etc/udev/rules.d/50-udev-scif.rules.

Classified in : Homepage, geek, programming, en - Tags : xeonphi, linux - 7 comments

Demenagement en Caroline du Nord

Written by erik - 12 april 2013 16:21

Bonjour a vous,

J'ai ete bien occupe ces derniers mois par mon travail et mes candidatures. Apres avoir envoye des dossiers a plus de 40 universites americaines et passe de nombreux entretiens, je suis arrive a un arrangement final. Je rejoindrai University of North Carolina at Charlotte dans le departement de Computer Science en tant qu'Assistant Professor. Je prendrais mon poste autour du 15 Aout et demenagerai de Columbus un peu avant.

Si vous passez dans le coin, faites moi signe!

PS: peut etre que je devrais renommer ce blog "Prof Saule in the US" :)

Classified in : Homepage, 3617mylife, fr - Tags : none - 2 comments

[C] Que faire en cas de "segmentation fault"?

Written by erik - 22 november 2012 12:03

Bonjour a vous,

Je recontre souvent des etudiants (locaux ou a travers internet) ou des debutants en programmation C. Ils posent souvent la question "J'obtiens segmentation fault. Qu'est ce que c'est? Que faire?".

Qu'est ce que "Segmentation fault"?

Il s'agit de l'erreur de l'on obtient en C lorsque le programme lis ou ecrit sur une partie de la memoire qui est protege en lecture ou en ecriture. Par protege, je veux dire que le systeme d'exploitation conserve une liste des adresses memoire sur lesquelles ont peut lire et sur lesquelles on peut ecrire. Ces zones sont appellees des segments. Il y a plusieurs raison pour laquelle un segment peut etre protege; le plus courant est que la zone memoire n'est pas allouee.

Par exemple, les deux codes suivants atteignent certainement segmentation fault:

int * p = NULL;
*p = 12;

ou encore

int* p = (int*) malloc (10*sizeof(int));
for (int i=0; i<1000; ++i)
  p[i] = 42;

Le premier code atteint segmentation fault parce que l'adresse 0 est toujours protege contre la lecture et l'ecriture. C'est une adresse qui est utilise par convention par les programmeurs pour signifier "nulle part, ca n'existe pas." Le second code atteint segmentation fault parcequ'il y a des ecriture bien apres la fin de l'espace memoire alloue: seulement 10 entiers ont ete alloues, mais 1000 sont ecrit. Certaiement bien apres la fin du tableau, on fini par rencontrer une adresse qui n'est pas allouee. C'est un depassement de tableau (en anglais, buffer overflow)

"segmentation fault" est utilise par Linux. D'autre systeme d'exploitation ont d'autre message d'erreur. Certain BSD rapportent aussi des "bus error". Windows typiquement rapporte une erreur du type "le programme a ete arrete a cause d'un access illegal en memoire" ou un truc du genre (ca fait trop longtemps que je n'ai pas utilise windows pour me rappeller du message exacte.)

Que faire contre une "segmentation fault"?

La plupart des cours en ligne ou des programmeurs disent quelquechose du genre "Utilise un debugger!" Un debugger est un outils comme gdb qui sert a suivre l'execution d'un programme et a afficher les valeurs des variables d'un programme pendant que le programme fonctionne ou a posteriori. L'utilisation classique des debugger est de laisser le programmer planter et de regarder l'etat de la memoire apres l'erreur. Je ne conseille gdb qu'en dernier recours. Parcequ'un debugger va vous avertir d'une "segmentation fault" apres que l'erreur soit survenu. Mais souvent l'erreur est une consequence d'un probleme moins important qui est survenu plus tot dans l'execution du programme. C'est ce probleme la qu'il est important d'attrapper.

Il y a quatre outils qui sont magique en C pour detecter les problemes tot.

Le compilateur

Les compilateurs sont votre premiere ligne de defense contre les erreurs. Le C est un langage assez permissif. Beaucoup de comportement incorrecte d'un programme peuvent etre detecter facilmenet pendant la compilation. Cependant comme le C est permissif beaucoup de ces comportement ne sont pas juger critique par votre compilateur et il en fait des warning, ou les passe sous silence. Dans mon experience, tous les warning d'un compilateur sont important. Activer tous les warning possible et imaginable et forcer le compilateur a les considerer comme des erreurs. Avec GCC, compilez votre code avec "-Wall -Wextra -Werror --std=c99". Et corrigez tous les problemes. Ne les passez pas sous silence. Ne retirez pas ce -Werror pour faire taire le compilateur. Il est la pour vous aider.

Quel type d'erreur le compilateur va relever qu'il ne n'indiquait pas avant:

  • Une variable est declare mais pas utilise, en particulier dans les parametres de fonctions. Si une variable n'est pas utilisee, retirer la. Cela augmentera la lisibilite de votre programme. Si un parametre n'est pas utilise, probablement la fonctin ne fait pas ce qu'elle devrait faire. (Si c'est un des cas ou un parametre est ignore, il n'est pas obligatoire en C de nommer un parametre de fonction.)
  • Une fonction n'est pas declare mais est utiliser. En C, il n'est pas obligatoire de declarer une fonction avant de l'utiliser. Si une fonction n'est pas declare, le C lui donne automatiquement la signature "int f (int)". Si vous passez autre chose qu'un int en parametre, le compilateur va convertir et certainement, vous obtiendrez un comportement non desire. Si la fonction ne retourne pas int, le probleme est encore pire, parcequ'il decale la pile d'appel.
  • Une fonction qui retourne un parametre n'a pas de "return". En C, ce n'est pas une erreur. Mais certainement si la signature indique que quelquechose est retourne, il y a une raison. La fonction ne fait pas ce qu'elle doit faire.
  • Une variable est utilisee sans etre initialisee. Le C ne donne pas de valeur par defaut au variable, une variable non initialise peut contenir n'importe quelle valeur. Certainement il s'agit d'une erreur qui n'apparaitera pas tout le temps.

Il y a plein d'autre comportement dans le code qui ne sont pas rapporte par gcc comme une erreur, mais qui dans 99% des cas va generer une erreur a un moment ou un autre.

L'analyseur statique splint

splint est un programme d'analyse de code statique. Vous pouvez le trouver ici ou dans le gestionnaire de paquet de votre distribution. splint analyse les fonction de votre code pour trouver des comportement problematique. Il peut detecter certaine fuite de memoire, des buffer overflows, ...

Certaines annotations peuvent etre donne au code pour exprimer le sens des operations et permettre a splint de faire la difference entre un probleme potentiel et un comportement voulu. Utiliser splint demande d'ajouter pas mal de chose dans le code comme commentaire ou type cast. Mais le jeu en vaut la chandelle, cela force a ecrire du code plus propre, plus robuste et mieux documente.

Un verificateur de memoire a l'execution: valgrind

valgrind est un outil d'analyse dynamique de programme (certainement disponible dans le gestionnaire de paquet de votre distribution). Basiquement c'est un emulateur x86 et x86-64 qui traque l'etat de la memoire. Il permet de detecter de nombreuse erreur a l'execution qui ne sont pas detectable statiquement. Par exemple, un tableau peut etre partiellement initialise: les indices pairs sont initialise, mais pas les indices impair. Un analyseur statique ne pourra pas faire la difference dnas la majorite des cas. valgrind conserver l'information pour chaque octet de la memoire: est ce que cette zone memoire est initialise ou non?. Ainsi, valrind permet de detecter des erreurs qui viennent de l'utilisation de memoire non initialise.

valgrind conserve egalement l'information "quelles sont les zones de memoire alloues?". Cela permet de detecter les fuites de memoires. A la fin de l'execution du programme valgrind rapporte combien de memoire est toujours alloue. Il rapporte egalement si certaineszone de memoire ne sont plus accessible: c'est a dire des zones de memoire vers lesquels le programme n'a plus de pointeur. De facon generale, une fuite de memoire doit etre considerer comme une erreur. Meme si le systeme va recuperer la memoire allouee a la fin de son execution, la presence de fuite de memoire denote un probleme de conception dans le logiciel.

Parcequ'il conserve toutes les zones de memoire alloue, valgrind peut pour chaque acces a la memoire verifier si cette acces est dans une zone alloue ou non. Il reportera tous les acces memoire qui ne sont pas normaux. Ces access sont souvent effectue avant une segfault. Dans certains cas, bien avant et ils sont le probleme qui cause une segfault incomprehensible. En effet, cela vient principalement du fonctionnement de malloc. Deux appels consecutifs a malloc retournent souvent des zone memoire proche. Donc si il y a un depassement de tableau, le programme va ecrire sur la prochaine zone memoire alloue. Si la zone memoire ecrite contient un pointeur, lorsque ce pointeur sera dereference plus tard, une segfault apparaitra. Un debugger ne vera pas le depassement de buffer, il ne vera que le dereferencement de pointeur. Et l'erreur paraitra incomprehensible. valgrind typiquement detecte ce genre d'erreur. En effet, malloc typiquement utilise de la memoire entre deux zones allouees pour sa gestion interne. Cet espace memoire est considere par valgrind comme etant innaccessible et il previendra si cette espace est utilise.

Compilez le code avec -g. Executez "valgrind ./programme parametre1 parametre2".

Une barriere electrique: Electric Fence

efence est une bibliotheque ecrit par Bruce Perens pour debugger du code C. Basiquement, efence utilise des fonctions systemes pour entourer les allocations de memoire d'un large segment de memoire protege en lecture et ecriture. Ainsi tous les acces memoire qui depasse une zone alloue (d'un cote ou de l'autre) va entrainer une segmentation fault. Ici le but est d'obtenir une erreur des le premier access "douteux" a la memoire.

efence fourni des protection similaire a valgrind. Cependant il se complete bien. valgrind est un logiciel qui est lent par nature, efence fonctionne a vitesse reel. valgrind detecte plus d'erreur (fuite memoire et memoire non initialise.). Bref, les deux se completent.

En resume

Lorsque vous avez une segmentation fault, assurez vous toujours d'avoir au moins verifier ces choses la:

  • Est ce que le code est compile avec avec toutes les verifications possibles? Le compilateur previent il de quoi que ce soit? Si oui, corrigez ceci d'abords.
  • Est ce que je peux utiliser un analyseur de code statique, comme splint, pour assurer la qualite de mon programme?
  • Quel est la premiere erreur reporte par valgrind?
  • Y a t'il une fuite de memoire?
  • Quel est le premier acces memoire invalide reporte par efence?

Ces choses sont faciles a verifier et la vraie source d'erreur dans la majorite des cas. Utilisez les!

Classified in : Homepage, fr, geek, programming - Tags : none - 4 comments

Security questions

Written by erik - 30 august 2012 15:39

Today I got a call because I applied for a credit card a few days ago. A part of the discussion went like this:

  • In order to verify your identity, I am going to ask you questions ONLY YOU knows the answer of. Are you ready?
  • Well, how are you going to know whether I lie or not?
  • I will know! I have the answers in your file.
  • So you know the answers too, I am not the only one knowing them.
  • long silence

These so called security questions makes no sense at all. Anybody that knows me would be able to answer them (my brother, my boss). Anybody that has access to my file will be able to answer them (my banker, my insurer). Anybody that asks me security questions will be able to answer them (phone companies, thousands of websites). Why are we still relying on such obvious questions?

Classified in : Homepage, 3617mylife, en - Tags : none - 2 comments

The best AI for cassis

Written by erik - 16 august 2012 23:27

Hi folks,

The AI I had written for Cassis were pretty bad. They were basically either playing randomly or just trying not to die. I was debatting on how to build the best AI for that game. I gave myself the goal of putting that AI running on a phone. There are only 6 points and 15 possible move possible at the first round. Here I present the simple techniques I used to reach this goal. I wrote everything with on my laptop first and then ported it to my android phone. I published version 0.91 of cassis for android.

The basics.

The AI is working on a brute force basis. Basically, it considers all the possible moves up to the end of the game and play the best. To see if this is feasible, I started by trying to decide whether player 1 wins or loses. So I wrote the simplest code which recursively expand all moves:

bool player1wins(Game g)
{
  if (g.gameover()) return g.winner() == PLAYER1;
  
  if (g.whoseturn() == PLAYER1) {
    for (edge e = 0; e != 15; ++e) {
      if (g.edge(e) != NOT_PLAYED_YET) continue;
      Game g2 = g;
      g2.play(e);
      if (player1wins(g2)) return true;
    }
    return false; //could not find a winning move. That means player 1 lost.
  }
  else
    //pretty much the same thing with player2
}

The code is pretty straight forward, if it is player 1's turn, it tries all the possible move and stops if it finds a winning one. The code for player 2 is the same as for player 1 except it stops if it find a losing move for player 1. I let it run for about 15 minutes on my laptop and it did not complete. Clearly I was doing something wrong. Let's do a little bit of maths. At the first iteration there are 15 possible moves, at the second one 14 possible move, then 13, ... Overall, there are 15! case to examine. That's about 1.3 x 10^12 combination. If the code can examine 1 combination per clock cycle (which it does not) and the processor runs at 1GHz, it will still take 1300 seconds (20 minutes). Provided it definitely need more that 1 clock cycle per combination, that is definitely not a working approach. It will never run fine on a phone.

Hashing based.

The main problem with that approach is that it processes the same combination multiple times. If the first moves are Player1/Edge1, Player2/Edge2, Player1/Edge3, the code will compute these combination even if it processed Player1/Edge3, Player2/Edge2, Player1/Edge1 before, despite they are actually the very same board. If the former is winning then the latter is winning too and vice versa (same goes with losing). If we can remember every board we processed so far, we will process less of them. Let's do some math first here, for each edge of the graph, it can be either played by 1, by 2 or not played yet. So each edge can have 3 different states, and there are 15 edges in the game. So there are at most 3^15 possible boards, that about 14 million boards. (Actually there are less possible board since some board are not reachable because a player lost first and the game stopped. but let's use that number as a worst case scenario.) That is not too many board so that seems like a reasonnable thing to try.

To be able to leverage that number, I need a simple way of naming a board. Otherwise, I would need to go through the list of all the board seen so far to decide whether I need to process it of not. That would give a quadratic algorithm and 14 million squared does not seem reasonnable anymore. Instead I can just store the boards in a table and annotate them winning or losing. If I have a simple way of building an offset from a board, then, I can just use that. Fortunately, the state of a board can be encoded with a number of 15 digit in base 3. Each digit represents an edge and is worth 0 for unplayed, 1 for played by player 1, and 2 for played by player 2. Actually, I would rather not allocate a table of size 14 millions (that won't work well on a phone), so I choose to use an associative data structure instead such as a hash map (actually, the code use std::map to be precise, so that's a red-black tree. but a hashmap would work the same (probably better actually)). And since I am going with hashing, I can directly make one hash map for each round since if two boards are in a different round they are clearly different. The code now looks like that:

bool player1wins(Game g)
{
  if (g.gameover()) return g.winner() == PLAYER1;

  if (HASH[g.round()].exist(g.hash())) return HASH[g.round()][g.hash()];
  
  if (g.whoseturn() == PLAYER1) {
    for (edge e = 0; e != 15; ++e) {
      if (g.edge(e) != NOT_PLAYED_YET) continue;
      Game g2 = g;
      g2.play(e);
      if (player1wins(g2)) {
        HASH[g.round()][g.hash()] = true;
        return true;
      }
    }
    HASH[g.round()][g.hash()] = false;
    return false; //could not find a winning move. That means player 1 lost.
  }
  else
    //pretty much the same thing with player2
}

With this approach, it takes about 1.65 seconds on my laptop to decide whether player 1 wins or not. And the size of the hashes are distributed like this:

winning[0].size() == 1
winning[1].size() == 15
winning[2].size() == 15
winning[3].size() == 104
winning[4].size() == 217
winning[5].size() == 969
winning[6].size() == 2693
winning[7].size() == 6896
winning[8].size() == 14560
winning[9].size() == 21724
winning[10].size() == 27754
winning[11].size() == 21757
winning[12].size() == 12826
winning[13].size() == 2433
winning[14].size() == 132

That represents 112096 state to store. Note that it is much lower than the 3^15 that were predicted, because some of the exploration stops early once player find a winning move. Yet 112K states are a lot to store, and generating them still takes 1.65 seconds on a laptop. A phone will be slower.

Isomorphism.

So what is the problem with that version? Well, we are not exploiting any kind of symetry. Look at the number of state evaluated at round 1: "winning[1].size() == 15". There are 15 of them. It means that the algorithm is trying all the first move possible. But all these moves are actually the same move up to a permutation of the vertices. In that case the board are said to be isomorph. The idea is then to store only one permutation for each board. Reaching one permutation per board might be a little complicated to do, so I will settle for storing less permutation. One thing that is constant for all the permutation of a board is the "degree" of the vertices. What I mean by degree is the number of edges connected to a given vertex. I introduce the notion of normal form of a board, which is a permutation of the board where the vertices of high degree are first. For the purpose of computing the hash value of a board, I first normalize the board and then compute its hash value. That way all the move at level 1 (and many more) are detected as equivalent. When running that variant of the code now takes 0.084939 seconds, with a total of 3173 different boards traversed and the distribution of values is now:

winning[0].size() == 1
winning[1].size() == 1
winning[2].size() == 1
winning[3].size() == 5
winning[4].size() == 7
winning[5].size() == 39
winning[6].size() == 70
winning[7].size() == 194
winning[8].size() == 347
winning[9].size() == 557
winning[10].size() == 703
winning[11].size() == 660
winning[12].size() == 442
winning[13].size() == 136
winning[14].size() == 10

Actually, we can prune a little more by differenciating for each vertex the number of incident edges from player 1 and from player 2. It reduces the time to 0.052072 seconds, 1568 different boards and the distribution is:

winning[0].size() == 1
winning[1].size() == 1
winning[2].size() == 1
winning[3].size() == 5
winning[4].size() == 7
winning[5].size() == 28
winning[6].size() == 48
winning[7].size() == 119
winning[8].size() == 207
winning[9].size() == 313
winning[10].size() == 353
winning[11].size() == 278
winning[12].size() == 168
winning[13].size() == 37
winning[14].size() == 2

Reconstructing the move.

Alright, that is enough for me, less than 2K states and about 50 milliseconds in total on my laptop. Though, it only gives us which board is winning and which one is losing. We need to know what to play. So when a move of player 1 leads to a win (for player 1), or when a move of player 2 leads to a lose (for player 1), we need to record the move. One of the proble is that we no longer evaluate every single reachable state. Only the ones that were not considered similar. So we need to compute the permutation to the normal form, and permute the move properly. However, this is not enough because two state that have the same normal form might not have derived state that have the same normal form. (It is not easy to find an example, but it has something to do with the sorting operation and the fact that two isomorph board can have the same representative.)

What else?

I guess there are more things to do, but it is fast on my phone right now. The code on the phone is actually using a stupid data structure in place of a std::map or an hash map, just a simple array. (There is no STL that works on android and supports exception.) Most of the performance to gain probably come from language details now and that's not really fun to do. So I'll stop my optimization there. There are more things to do on cassis though, probably adding a simpler difficulty mode, adding a proper icon for the game and adding a 2 player mode.

In the mean time, can you beat my new IA? Here is a link to the apk file for android. The code has been updated on github.

Classified in : Homepage, geek, programming, en - Tags : none - no comments

page 1 of 3 next »

Categories

Archives

Tags

Last articles

Last comments