We have addressed the novel task of automatically generating image captions, which fuses insights from computer vision and natural language processing and holds promise for a range of multimedia applications, such as image retrieval and the development of tools supporting media management. A caption generation model can be learned from weakly labeled data without costly human involvement: instead of manually creating annotations, the captions accompanying an image are treated as its labels. Although these captions are noisy compared to human-created keywords, we show that they can be used to learn the correspondence between the visual and textual modalities, and also serve as a gold standard for the caption generation task. We have presented caption generation models based on content selection and surface generation. A key aspect of our approach is that both the visual and textual modalities are allowed to influence the caption generation task.
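To make the two-stage pipeline concrete, the following is a minimal sketch, not the model presented in this work, of how weakly labeled captions can drive content selection, followed by a simple template-based surface generation step. The function names, toy data, and template are illustrative assumptions only.

```python
# Minimal sketch (illustrative, not the authors' implementation) of a
# two-stage caption generator: content selection scores candidate
# keywords by co-occurrence with the image's weakly labeled captions,
# and surface generation realizes the selected content as text.

from collections import Counter

def content_selection(candidates, captions, k=3):
    """Rank candidate keywords by how many of the image's (noisy)
    captions mention them, keeping the top-k as content to describe."""
    counts = Counter()
    for caption in captions:
        tokens = set(caption.lower().split())
        for keyword in candidates:
            if keyword in tokens:
                counts[keyword] += 1
    return [kw for kw, _ in counts.most_common(k)]

def surface_generation(keywords):
    """Realize the selected keywords with a fixed template; a real
    system would use a statistical or learned surface realizer."""
    if not keywords:
        return "An image."
    if len(keywords) == 1:
        return f"An image showing {keywords[0]}."
    return f"An image showing {', '.join(keywords[:-1])} and {keywords[-1]}."

# Toy usage: captions act as weak labels for the image.
captions = [
    "A dog runs on the beach",
    "Dog playing with a ball on the sand",
    "A sunny beach day",
]
candidates = ["dog", "beach", "ball", "car"]
selected = content_selection(candidates, captions, k=3)
print(surface_generation(selected))
# -> "An image showing dog, beach and ball."
```

In this toy setting the captions themselves supply the supervision: no keyword was hand-annotated, yet the co-occurrence counts recover which candidates are actually depicted, mirroring the weakly supervised setup described above.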