
Wednesday, May 29, 2013

Last Homework: Image Compression

For this last homework we were asked to implement image compression. One of the recommendations was to use wavelets, since they had already been used in the computer vision class. Since I am not taking that class, my classmate Avendaño helped me understand how to use them.

Here is the code:

.....
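The embedded code is not visible here, so as a rough illustration only, below is a minimal sketch of the wavelet-thresholding idea. It is not the original homework code, and it assumes the PyWavelets (pywt) and NumPy libraries are installed.

# Minimal sketch of wavelet-based image compression, NOT the original homework code.
# Zeroing out small wavelet coefficients is what allows the image to be stored compactly.
import numpy as np
import pywt

def compress(image, wavelet="haar", level=2, threshold=20.0):
    # Decompose the image, zero out small detail coefficients, and reconstruct.
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    kept = [coeffs[0]]  # keep the approximation coefficients untouched
    for details in coeffs[1:]:
        kept.append(tuple(pywt.threshold(d, threshold, mode="hard") for d in details))
    return pywt.waverec2(kept, wavelet)

# Toy example with a synthetic 8-bit grayscale image.
image = np.random.randint(0, 256, (256, 256)).astype(float)
reconstructed = compress(image)
print(np.abs(image - reconstructed).mean())  # average reconstruction error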

and the result:


(original)

(compressed)

And the difference in sizes:



References
Basic JPEG: http://www.whydomath.org/node/wavlets/basicjpg.html

Monday, May 27, 2013

Extra Points: Bioinformatics

Article:  
Functional immunomics: Microarray analysis of IgG autoantibody repertoires predicts the future response of mice to induced diabetes 

Authors:
Francisco J. Quintana, Peter H. Hagedorn, Gad Elizur, Yifat Merbl, Eytan Domany, Irun R. Cohen

Can be found at: www.pnas.org/content/101/suppl.2/14615.full

INTRODUCTION
One's repertoire of antibodies encodes the history of one's past immunological experience. Can the present autoantibody repertoire be consulted to predict susceptibility to a future disease? In this article, an antigen microarray chip is developed and bioinformatic analysis is used to study a model of type 1 diabetes developing in nonobese diabetic male mice, in which the disease was accelerated and synchronized by injecting cyclophosphamide at 4 weeks of age. Sera were obtained and the mice were treated to induce cyclophosphamide-accelerated diabetes (CAD); 9 mice became severely diabetic, whereas 10 mice permanently resisted diabetes.

Male mice of the nonobese diabetic (NOD) strain develop type 1 diabetes at a low incidence and at a late age. However, the onset of diabetes can be significantly accelerated and synchronized by exposing NOD mice to cyclophosphamide. The CAD model of type 1 diabetes therefore offers an opportunity to test whether the global autoantibody repertoire might reflect susceptibility to CAD in still-healthy mice, before the cyclophosphamide insult is administered.

Recently, microarray antigen chips have been used to detect high-titer autoantibodies to known antigens in autoimmune diseases. However, rather than focusing only on known antigens, here individual immune systems are profiled by their global patterns of autoantibodies, free of bias toward high-titer reactivities to particular antigens.

MATERIALS AND METHODS
Mice. Male NOD (nonobese diabetic) mice were raised and maintained under pathogen-free conditions. The mice were 4 weeks old at the start of the experiments. Nineteen mice were studied individually.

CAD. Diabetes was accelerated and synchronized by two injections of cyclophosphamide: one at 4 weeks of age and another 1 week later. In the colony, this treatment of NOD males leads to an incidence of diabetes of approximately 50%. The mice that develop diabetes go on to die unless they are treated with insulin; those males that do not develop diabetes within 1 month after the two injections of cyclophosphamide do not become diabetic later in life.

The next image is a schematic representation of the protocol.




The experimental protocol is as follows:
- Numbers refer to age of the mice (in weeks)
- Black vertical lines are serum sample collection (weeks 4 and 9)
- Grey vertical lines are Cyclophosphamide injection (weeks 4 and 5)
- Grey box (at week 6) shows when mice developed diabetes.
- Grey box (between week 11 and 13) shows time of death of untreated diabetic mice.

Diabetes
Blood glucose was measured weekly. A mouse was considered diabetic when its blood glucose concentration was >13 mM on two consecutive examinations. Of the 19 mice treated, 9 developed diabetes, and 10 remained healthy throughout a 2-month period.

Serum
Serum samples were collected 1 day before the first injection of cyclophosphamide and 1 month after the second injection.

Antigens
The 266 antigens spotted on the microarray chips include proteins, synthetic peptides, nucleotides, and phospholipids.

Image and Data Processing. 
The pixels that comprised each spot in the TIFF files and the local background were identified by using histogram segmentation. The intensity of each spot and its local background were calculated as the mean of the corresponding pixel intensities. None of the spots containing antigens showed saturation. Technically faulty spots were identified by visual inspection and were removed from the data set. For each spot, the local background intensity was subtracted from the spot intensity. Spots with negative intensities were removed from the data set. A log-base-2 transformation of the intensities resulted in reasonably constant variability at all intensity levels. The log intensity of each antigen was calculated as the mean of the log intensities of the replicates on each slide. The coefficient of variability between replicates on each array was ≈30%.
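As a purely illustrative aside (the numbers below are hypothetical, not data from the study), the background subtraction, log-base-2 transform, and replicate averaging described above can be expressed in a few lines of NumPy:

# Illustrative NumPy sketch of the spot-intensity processing steps described above;
# the values are made up, not measurements from the study.
import numpy as np

spot = np.array([1200.0, 1150.0, 980.0])      # replicate spot intensities for one antigen
background = np.array([200.0, 210.0, 190.0])  # corresponding local background intensities

net = spot - background          # subtract the local background from each spot
net = net[net > 0]               # spots with negative intensity are removed
log_intensity = np.log2(net)     # log-base-2 transform stabilizes the variability
print(log_intensity.mean())      # the antigen's value is the mean over replicates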

RESULTS

Selection of Informative Antigens for Pre-CAD Mice
It has been reported previously that coupled two-way clustering (CTWC) can successfully separate human subjects who are already diabetic from healthy persons. Here, however, a different approach was taken. Based on the sera taken before cyclophosphamide treatment, the 27 antigens were listed that best separated the sera of the 10 mice that later resisted the induction of CAD from those of the 9 mice that later developed CAD.

Figure A

The left panel of the figure above shows the two-way SPC of these antigens. The filled rectangles at the top of the box are the mice susceptible to CAD induction; the empty rectangles are the mice resistant to CAD induction. The 27 antigens are clustered along the rows and are identified by number. It can be seen that all 9 mice later found to be susceptible to CAD could be separated from 8 of the 10 mice later found to resist CAD.

Figure B

Figure C

Selection of Informative Antigens for Post-CAD Mice. 
Then, the 27 antigens that were effective in the pre-CAD clustering were used to analyze the patterns of IgG antibodies developing in the diabetic and healthy mice post-CAD. These 27 antigens failed to discriminate between the two groups of mice. A third set of 27 antigens was therefore generated by performing the Wilcoxon rank-sum test on the ratios by which each antigen changed post-CAD relative to pre-CAD. The ratios provide information on changes in reactivity toward each antigen.
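For illustration only (the data below are synthetic, not the study's measurements), this kind of antigen ranking could be computed with SciPy's ranksums function, which implements the Wilcoxon rank-sum test:

# Illustrative sketch of ranking one antigen by the Wilcoxon rank-sum test on
# post-CAD / pre-CAD reactivity ratios; the data are synthetic.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
ratios_diabetic = rng.normal(1.4, 0.2, size=9)   # toy ratios for the 9 susceptible mice
ratios_healthy = rng.normal(1.0, 0.2, size=10)   # toy ratios for the 10 resistant mice

statistic, p_value = ranksums(ratios_diabetic, ratios_healthy)
print(p_value)   # antigens with the smallest p-values discriminate best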

Figures B and C show that the previously mentioned antigens could indeed separate the healthy and diabetic mice post-CAD. Thus, the IgG repertoires of the pre- and post-CAD groups of healthy and sick mice could be clustered, but the informative patterns of reactivity required modified sets of antigens to develop discriminating patterns.

It can be seen that some of the antigens from the set of pre-CAD antigens were also present in the post-CAD set, or in the set of antigens determined from the pre-CAD/post-CAD ratios. For example, three of the pre-CAD antigen reactivities were also prominent post-CAD. The shared and distinct antigens are shown as a Venn diagram (figure above) for the overlap between the three lists of antigens mentioned previously.

DISCUSSION

The antigen microarray chip described in this paper required much preliminary work to obtain consistent results. Patterns of IgM antibodies were analyzed before and after CAD. 

The article shows that the patterns of IgG antibodies expressed pre-CAD in male NOD mice can mark susceptibility or resistance to CAD induced later. Patterns of IgG antibodies characteristic of healthy or diabetic mice post-CAD were also found, but these patterns required sets of antigens that differed from the informative pre-CAD set. Thus, IgG reactivities to some antigens may mark future susceptibility to CAD but not CAD itself once the disease emerges, and conversely, some IgG reactivities may mark the disease but not the susceptibility. Hence, prediction of future disease and diagnosis of present disease can depend on different data sets, at least in the CAD model.
 
Another notable finding was that health, both pre- and post-CAD, was associated with relatively high IgG autoreactivity to self-antigens, to which the susceptible mice were low IgG responders. Thus, some types of active autoimmunity may actually protect against autoimmune disease. 
 
However, environmental factors can also prevent the development of type 1 diabetes. Stimulation of the NOD mouse immune system by infection, by vaccination with microbial antigens, or by treatment with ligands that activate innate immune receptors can prevent diabetes. Type 1 diabetes appears in very young people, so critical aspects of autoimmune organization must occur fairly early in one's lifetime. The results of this bioinformatic study suggest that some organized patterns of IgG autoantibodies are shared by groups of individuals, at least among NOD mice.
The bioinformatic analysis described here relates to two separate, but linked issues: predictive medicine by means of functional immunomics and the biological meaning of the autoimmune repertoire.

The present study investigated patterns of antibodies, but not their function in the disease process. Nevertheless, the list of informative antigens may be connected to other observations regarding the pathophysiology of type 1 diabetes.

The mammalian immune system, in addition to its well studied role in defending the body against foreign invaders, is now understood to be heavily involved in maintaining the integrity of the body from within; immune system cells and molecules, which comprise the inflammatory response, are key factors in wound healing, neuroprotection, connective tissue formation, angiogenesis, tissue morphology and regeneration, and waste disposal. To dispense reparative inflammation at the right sites and occasions, the immune system has to assess the state of the body on the fly. In this respect, the immune system acts as if it were the body's onboard bioinformatic computer. If so, predictive medicine would do well to mine this immune information, as this article suggests it might.

References
Quintana F. J., Hagedorn P. H., Elizur G., Merbl Y., Domany E., Cohen I. R. Functional immunomics: Microarray analysis of IgG autoantibody repertoires predicts the future response of mice to induced diabetes. "This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, 'Therapeutic Vaccines: Realities of Today and Hopes for Tomorrow,' held April 1–3, 2004, at the National Academy of Sciences in Washington, DC."
Link to the article: http://www.pnas.org/content/101/suppl.2/14615.full
[Accessed May 27, 2013]

Extra points: Card Game

1. Microarray (Found in this bioinformatics article)

An antigen microarray chip was developed and bioinformatic analysis was used to study a model of type 1 diabetes...

2. Block
If the words wi appearing in an encoding scheme are all of the same length, the code is said to be a fixed-length or block code, and the common length of the wi is said to be the length of the code. Otherwise, the code is said to be a variable-length code.

3. Relative
We have a memoryless channel with input alphabet A = {a1,...,an}, output alphabet B = {b1,...,bk}, and transition probabilities qij. For i ∈ {1,...,n}, let pi denote the relative frequency of transmission, or input frequency, of the input character ai.

4. Audio
Imagine a signal, that is, a function h of time t. If it's helpful, you can think of h as an audio signal, i.e., a voltage level, fluctuating with time.

5. Stream
Nevertheless, we shall hold to the simplifying assumption that pi, the proportion of ai’s in the input text, is also the probability that the next letter is ai, at any point in the input stream.

6. Knowledge
There is a body of knowledge related to the Implicit Function Theorem in the calculus of functions of several variables that provides an answer of sorts.

7. Missing
The smoothing procedure sometimes makes good guesses about the missing data, but it cannot recover the original information.

8. Recover
To see why we do this, observe that if the decoder is supplied the source word length N and a number in A, then the decoder can recover the sequence i1,...,iN, and thus the source word si1···siN.

9. Alignment
The code string is shifted by the same amount in order to maintain alignment.

10. Transformation
The transformation x → 2x −1/2 doubles the directed distance from x to 1/2; call it the “doubling expansion around 1/2” if you like.

11. Approximate
Using approximate probabilities can permit replacement of multiplications by simple shift operations.

12. Uniform
It has been generally assumed that the relative source frequencies are equal, and the dazzling algebraic methods used to produce great coding and decoding under these assumptions automatically produce a sort of uniformity that makes p0 and p1 equal or trivially close to 1/2.

13. Optimal
Furthermore, if p1,..., pn are optimal input frequencies satisfying these equations, for some value of C, then C is the channel capacity.

14. Simultaneously
Thoughtful quantizing can help suppress both non-meaningful and weak mathematical frequencies simultaneously.

15. Fixed
If our attention is fixed to only the part of the graphs over the integers 0, 1, . . . , 7, then we might be led to believe that x is as smooth, if not smoother, than y.

16. Rate
The code words would have to be quite long, so that the rate of processing of source text would be quite slow...

17. Arbitrary
If, for some reason, we require the M in the WNC algorithm to be small, we may allow rough and arbitrary approximation of the relative source frequencies.

18. Mapping
Given E and F, we can think of the mapping (i, j) → I(Ei, Fj) as a random variable on the system E ∧ F.

19. Hamming weight
The Hamming weight of a word w ∈ {0, 1}^n is wt(w) = number of ones appearing in w.

20. Noisy
Shannon’s Noisy Channel theorem applies to a more general sort of source, one which emits source letters, but not necessarily randomly and independently

21. Ambiguous
The code determined by φ is said to be unambiguous if and only if φ is one-to-one (injective). Otherwise, the code is ambiguous.

22. Code
The code determined by φ is uniquely decodable if and only if it is unambiguous and there exists a VDR for it.

AND FROM THE PREVIOUS GAME:
23. Bitwise
In addition, a number of the steps in the scheme can be managed as simple bitwise operations.

24. Finite
Therefore, we will allow ourselves the convenience of sometimes attributing the SPP to finite sets of binary words.

Thursday, May 9, 2013

EXTRA POINTS: REED-SOLOMON CODES

Reed-Solomon codes are non-binary cyclic codes. Cyclic codes are a subclass of the standard block codes for error detection and correction. These codes have a wide range of applications in digital communications and storage, and are used to correct errors in many systems such as:
  • Storage devices (CDs, DVDs, barcodes, etc.)
  • Wireless or mobile communications
  • Satellite communications
  • Digital television
  • High-speed modems such as ADSL, xDSL, etc.
The following figure shows a typical use case:


It works as follows:
- The Reed-Solomon encoder takes a block of digital information and adds extra "redundant" bits.
- Errors occur during transmission or storage for various reasons (for example, noise or interference, or scratches on a CD).
- The Reed-Solomon decoder processes each block, tries to correct the errors, and recovers the original information.

Properties of Reed-Solomon Codes
Reed-Solomon codes are linear block codes. A Reed-Solomon code is written RS(n,k) with s-bit symbols.

This means that the encoder takes k information symbols of s bits each and adds parity symbols to form an n-symbol codeword. There are n − k parity symbols of s bits each. A Reed-Solomon decoder can correct up to t symbols containing errors in a codeword, where 2t = n − k.

The following is a typical Reed-Solomon codeword:


Example:
A popular Reed-Solomon code is RS(255,223) with 8-bit symbols. Each codeword contains 255 bytes, of which 223 bytes are information and 32 bytes are parity. For this code:

n = 255, k = 223, s = 8
2t = 32, t = 16

The decoder can correct any 16 symbol errors in the codeword.
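As a quick, hedged illustration of the same 2t = n − k idea (not taken from the references below), the third-party Python package reedsolo exposes an RSCodec class. The sketch assumes a recent version of that package, in which decode() returns a tuple whose first element is the recovered message.

# Hedged sketch using the third-party "reedsolo" package (assumed installed via pip).
# RSCodec(nsym) adds nsym parity bytes per block, so up to nsym // 2 byte errors
# can be corrected -- the same 2t = n - k relationship described above.
from reedsolo import RSCodec

rsc = RSCodec(32)                      # 32 parity symbols -> corrects up to 16 errors
codeword = rsc.encode(b"hello world")  # message plus parity bytes

corrupted = bytearray(codeword)
corrupted[0] ^= 0xFF                   # flip one byte to simulate a transmission error

# In recent reedsolo versions decode() returns (message, message+parity, error positions).
decoded = rsc.decode(corrupted)[0]
print(decoded)                         # bytearray(b'hello world')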

Applications
Reed-Solomon coding is widely used in mass-storage systems to correct the burst errors associated with media defects.

It is also a key component of the compact disc. This was the first use of strong error correction in a mass-produced consumer product. The DVD uses a scheme similar to the one used in the CD.

Reed-Solomon error correction is also used in "parchive" files (parity archive volume sets), which are commonly posted alongside multimedia files on USENET. The online distributed storage service Wuala also uses Reed-Solomon when it splits files.

Barcodes
Almost all two-dimensional barcodes use Reed-Solomon error correction. When the barcode scanner cannot recognize a symbol, it treats it as an erasure.



References
http://es.wikipedia.org/wiki/Reed-Solomon
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/pscico-guyb/realworld/www/reedsolomon/reed_solomon_codes.html

Thursday, April 25, 2013

Adaptive Code

For this week's homework, we had to implement a compression algorithm that doesn't compress all the input at once, but instead compresses it in parts. That is the adaptive part of the coding.

This is how I implemented it: for each part of the input to be processed, take the input letters and build a dictionary with each letter as key and its frequency as value. Next, a new dictionary is created, but this time the value is the letter's probability of appearing in the Spanish alphabet. Also, for each part we had to check whether some letter had not appeared before.

With this, we build a tree where each node is inserted by asking: is it greater than the current node? (go right); is it smaller than the current node? (go left).
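As a rough, illustrative sketch of the steps just described (the names and structure are mine, not the original code), here is how the per-part frequency counting, the check for unseen letters, and the greater-goes-right / smaller-goes-left insertion could look in Python:

# Illustrative sketch of per-chunk frequency counting and binary-tree insertion;
# not the original implementation.
from collections import Counter

class Node:
    def __init__(self, letter, freq):
        self.letter, self.freq = letter, freq
        self.left = self.right = None

def insert(root, letter, freq):
    # Greater frequencies go to the right subtree, smaller or equal go to the left.
    if root is None:
        return Node(letter, freq)
    if freq > root.freq:
        root.right = insert(root.right, letter, freq)
    else:
        root.left = insert(root.left, letter, freq)
    return root

def process_chunk(chunk, seen):
    # Count letter frequencies in one part of the input and report letters not seen before.
    counts = Counter(chunk)
    new_letters = set(counts) - seen
    seen |= set(counts)
    root = None
    for letter, freq in counts.items():
        root = insert(root, letter, freq)
    return root, new_letters

seen = set()
for chunk in ["entrada de", " ejemplo en", " espanol"]:
    tree, new_letters = process_chunk(chunk, seen)
    print(sorted(new_letters))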

This is what I did:
Main class:
Algorithm:
Files

The result gets stuck in a loop :(

Wednesday, April 17, 2013

Extra Points

1. Actual

"There is nothing in the definition that tells you how to provide yourself with S and P, given some actual experiment."


2. Alignment
"The code string is shifted by the same amount in order to maintain alignment."

3. Arithmetic
"Binary coder (such as the Q-coder) are an important special case in arithmetic coding."

4. Binomial Distribution
""

5. Channel
"A channel is a communication device with two ends, an input end, or transmitter, and an output end, or receiver."

6. Condition
"This theorme may seem, at first glance, to be saying that all you have to do to find the capacity of a channel and the optimal input frequencies is to solve the capacity equations of the channel, the equations arising from the Lagrange Multiplier Theorem, and the conditionΣ)^n =1  pi = 1, for p1 ,..., p> 0".

7. Content
"Since the average information content per character is H(S), it follows that the source is emitting information at the rate r H (S) information units per unit time, on average."

8. Denote
"Let C denote the channel capacity."

9. Differ
"By the Law of Large Numbers, if N is large, the source words of lenght N in which the proportions of the source letters within the word differ markedly from the fj have very small probability, collectively."

10. Entire
"Furthermore, there is a waste of time involved: you have to wait until the entire source string is scanned and processed before you have to code for it."

11. Exact
"There are two inconveniences to be dealt with when there is a long run of no rescaling. The first is that we have to carry on doing exact computations of the endpoints α and α + ℓ with the smaller and smaller values of ℓ."

12. Fingerprint
""

13. Frequency
"We now return to the general case, with A ={a1,...,an} and ai haveing input frequency pi."

14. Implicit
"The usual theory of binary block codes starts from the implicit assumption that the relative source frequencies are equal."

15. Independent
"The hidden cost suffered here is fixed, independent of the lenght of the source text, and is, therefore, usually essentially negligible."

16. Inexact
""

17. Input

"Input frequency: Is the relative proportion of ocurrence of an input symbol ai. Let's say we have a memoryless channe with input alphabet A = {a1,..., an}, output alphabet B={b1,...,bn}, and transition probabilities qij, for i ∈ {1,...,n}. The input frequency of a character ai is denoted by pi.


18. Joint
"Similarly, the joint and conditional entropies are average values of certain kind of

19. Jump
"The sequence x, as seen by the Fourier transform, makes a considerable jump each time k goes from -9 to -9, -1 to 0, 7 to 8, 15 to 16, etc."

20. Matching
"This can provide very fast matching, but it significantly reduces the size of the dictionary."

21. Prime
"The constant 40543 which appears in the definition is prime, but is there any other reason for its choice?"

22. Recursive
"This feature of the utility is known as revursive or iterative compression, and will enable you to compress your data files to a tiny fraction of the original size."

23. Relative
"When p=0, the capacity is zero, and any relative input frequencies are 'optimal'. "

24. Success
"When speaking of some unspecified Bernoulli trial, we will call one possible outcome Success, or S, and the ohter Failure, or F."

25. Symmetric
"A binary symmetric channel (BSC, for short) is a memoryless channel with A = B = {0,1}; whichever digit, 0 or 1, is being transmitted, the probability p that it will get through correctly is called the reliability of the channel."

26. Typical
"Higher-order models may do better, where the necessary statistics are gathered from a sample of 'typical' text."

27. Sequence
The word v is called the leave of the parsing of W into a sequence of the si.

28. Binary
The most common sort of choice for source alphabet is S = {0, 1}^L, the set of binary words of some fixed length L.

Thursday, April 11, 2013

Huffman's algorithm

For this week's homework, we had to implement huffman's algorithm for the compression of a given string.

HUFFMAN'S ALGORITHM
This is a method for the construction of minimum-redundancy codes. It builds a binary tree with minimum weighted path length from a set of given weights. This algorithm is applicable to many forms of data transmission.

It consists of the following steps (a small sketch in Python follows the list):

1. Scan the text that is going to be compressed, and count the occurrence of each character.

2. Sort characters based on number of occurrences in the text.

3. Build Huffman tree based on the list of sorted characters.

4. Traverse the tree to determine the code words.

5. Scan text again and create a new text with the code words.
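
Here is a small, self-contained Python sketch of those steps (mine, not the homework code shown further below), using a heap for the repeated merging of the two least frequent nodes:

# Small Huffman-coding sketch; builds the tree with a heap and then walks it
# to produce the code words.
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries are (frequency, tie-breaker, tree); a tree is a character
    # or a [left, right] pair of subtrees.
    heap = [(freq, i, char) for i, (char, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # take the two least frequent nodes...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, [left, right]))  # ...and merge them
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, list):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"      # degenerate single-symbol case
    walk(heap[0][2], "")
    return codes

text = "this is an example of a huffman tree"
codes = huffman_codes(text)
print(codes)
print("".join(codes[c] for c in text))       # the compressed bit string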

For a simple explanation of how this algorithm works, you can watch this video:

 


Next is the code where I implemented this algorithm (based on a Wikipedia example).

As you can see, the code is incomplete. I only reached the point where you get the nodes of the tree (the next step would have been to assign a code word to each letter).

The tree of the wikipedia example is this:

And the result of my code is this; it shows each node with its children.


References
Huffman's Algorithm
A simple video explanation
http://en.wikipedia.org/wiki/Huffman_coding
http://www.cprogramming.com/tutorial/computersciencetheory/huffman.html

Thursday, February 21, 2013

STRING MATCHING

For this homework we had to implement two algorithms for matching strings: the Boyer-Moore algorithm and the Knuth-Morris-Pratt algorithm. You can find a good, brief explanation along with C pseudocode of Boyer-Moore here, and of Knuth-Morris-Pratt here.



Here's my implementation of the Boyer-Moore algorithm. It is unfinished.


////
You can run it by passing the TEXT and the PATTERN as command-line arguments; it tells you whether there is a match:
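
The original script isn't shown here, so as a stand-in only, here is a hedged Python sketch of a command-line matcher in the same spirit, using Knuth-Morris-Pratt rather than the unfinished Boyer-Moore code:

# Stand-in sketch (not the original script): a Knuth-Morris-Pratt matcher that takes
# TEXT and PATTERN as command-line arguments and reports whether there is a match.
import sys

def kmp_search(text, pattern):
    # Return the index of the first occurrence of pattern in text, or -1.
    if not pattern:
        return 0
    # Failure table: length of the longest proper prefix of pattern that is also a suffix.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, falling back through the failure table on mismatches.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

if __name__ == "__main__":
    text, pattern = sys.argv[1], sys.argv[2]
    pos = kmp_search(text, pattern)
    if pos >= 0:
        print("match at index", pos)
    else:
        print("no match")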